
Disk Drive Failures 15 Times What Vendors Say

Zonk posted more than 7 years ago | from the cough-sputter-wheeze-choke dept.

Data Storage 284

jcatcw writes "A Carnegie Mellon University study indicates that customers are replacing disk drives more frequently than vendor estimates of mean time to failure (MTTF) would require. The study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for the drives indicated MTTF between 1 and 1.5 million hours. That should mean annual failure rates of 0.88% at most, but annual replacement rates were between 2% and 4%. The study also shows no evidence that Fibre Channel drives are any more reliable than SATA drives."

Repeat? (2, Insightful)

Corith (19511) | more than 7 years ago | (#18211738)

Didn't we already see this evidence with Google's report?

Re:Repeat? (3, Informative)

georgewilliamherbert (211790) | more than 7 years ago | (#18211790)

We did both this study and the Google study in the first couple of days after FAST was over. Completely redundant....

Redundancy (3, Funny)

pizza_milkshake (580452) | more than 7 years ago | (#18212160)

I thought storage-related redundancy was supposed to be a good thing ;)

Re:Redundancy (5, Funny)

georgewilliamherbert (211790) | more than 7 years ago | (#18212226)

Redundant Array of Irritating Discussions?

Re:Repeat? (2, Interesting)

LiquidCoooled (634315) | more than 7 years ago | (#18211818)

Yes, and it's mentioned in the report.
The best part about the entire thing is the very last quote:

"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me if was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."

Just common sense.

Re:Repeat? (5, Informative)

ajs (35943) | more than 7 years ago | (#18211992)

The best part about the entire thing is the very last quote:

"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me if was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."

Just common sense.
It's "common sense," but not as useful as one might hope. What MTTF tells you is, within some expected margin of error, how much failure you should plan on in a statistically significant farm. So, for example, I know of an installation that has thousands of disks used for everything from root disks on relatively drop-in-replaceable compute servers to storage arrays. On the budgetary side, that installation wants to know how much replacement cost to expect per annum. On the admin side, that installation wants to be prepared with an appropriate number of redundant systems, and wants to be able to assert a failure probability for key systems. That is, if you have a raid array with 5 disks and one spare, then you want to know the probability that three disks will fail on it in the, let's say, 6 hour worst-case window before you can replace any of them. That probability is non-zero, and must be accounted for in your computation of anticipated downtime, along with every other unlikely, but possible event that you can account for.

When a vendor tells you to expect a 0.2% failure rate, but it's really 2-4%, that's a HUGE shift in the impact to your organization.

When you just have one or a handful of disks in your server at home, that's a very different situation from a datacenter full of systems with all kinds of disk needs.
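A minimal sketch of the kind of back-of-the-envelope calculation the parent describes: the chance that 3 drives of a 5-disk array die inside a 6-hour replacement window, assuming independent failures with a constant hazard rate (the paper itself argues both assumptions are optimistic). The window length and failure rates below are illustrative, not figures from the study.

from math import comb

def p_multi_failure(n_disks=5, k=3, afr=0.04, window_hours=6.0):
    # Probability that at least k of n_disks fail inside the window,
    # treating drives as independent with a constant hazard rate.
    p = afr * window_hours / 8760.0   # per-drive failure probability in the window
    return sum(comb(n_disks, i) * p**i * (1 - p)**(n_disks - i)
               for i in range(k, n_disks + 1))

print(p_multi_failure(afr=0.0088))   # datasheet-style 0.88% AFR -> ~2e-15
print(p_multi_failure(afr=0.04))     # observed 4% AFR           -> ~2e-13

The absolute numbers are tiny either way, but the observed rate makes the triple-failure scenario roughly a hundred times more likely, which is exactly the kind of shift a downtime budget has to absorb.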

just assume 3 years (4, Informative)

crabpeople (720852) | more than 7 years ago | (#18212416)

A good rule of thumb is 3 years. Most hard drives fail in 3 years. I don't know why, but I'm currently seeing a lot of bad 2004-branded drives and consider that right on schedule. Last year the '02-'03 drives were the ones failing left and right. I just pulled one this morning that's stamped March '04. It just started acting up a few days ago. Like clockwork.

Re:Repeat? (1)

PitaBred (632671) | more than 7 years ago | (#18212394)

But what kind of money would you budget for replacing/fixing drive failure in each case? That's the rub.

Re:Repeat? (1)

countSudoku() (1047544) | more than 7 years ago | (#18211836)

Yes, it was posted last week... It's still very interesting though.

http://hardware.slashdot.org/article.pl?sid=07/02/21/004233 [slashdot.org]

Re:Repeat? (1, Funny)

Anonymous Coward | more than 7 years ago | (#18211986)

I read so much from the firehose these days, I can't tell a dupe from the scoop anymore. I guess I need a new tag - dejavu.

Re:Repeat? (0)

Anonymous Coward | more than 7 years ago | (#18211886)

Didn't we already see this?

http://hardware.slashdot.org/article.pl?sid=07/02/21/004233 [slashdot.org]

Re:Repeat? (0)

Anonymous Coward | more than 7 years ago | (#18212436)

Yes,

Your post IS redundant

Re:Repeat? (0)

Anonymous Coward | more than 7 years ago | (#18212276)

No, this is an indirect duplicate of the *other* disk drive endurance paper that was posted back in mid-February. What is the sound of Slashdot clapping with one hand? A posting of some other online crud that's merely a poor synopsis of the original, already posted item. So typical...

Re:Repeat? (1)

Ramble (940291) | more than 7 years ago | (#18212578)

Yes, but as the description says, these drives have only been tested on high-performance rigs and servers. Usage and environmental conditions affect MTBF. I'm pretty sure that a desktop drive that Granny keeps for playing solitaire in a cool environment will last longer than a hard disk in a hot server room being hammered.

it's relative. (4, Funny)

User 956 (568564) | more than 7 years ago | (#18211744)

The data sheets for the drives indicated MTTF between 1 and 1.5 million hours.

Yeah, but I bet they didn't say what planet those hours are on.

Re:it's relative. (2, Funny)

bigtangringo (800328) | more than 7 years ago | (#18211804)

Or what percentage of the speed of light they were traveling.

It's not relative. (1)

tomhudson (43916) | more than 7 years ago | (#18212052)

... it's because they were on "Internet Time."

Re:it's relative. (1)

astrashe (7452) | more than 7 years ago | (#18211812)

If an observer on a rail platform measures the MTTF of a hard disk on a rail car moving at speeds close to the speed of light...

Personally I am SHOCKED (2, Insightful)

dingbatdr (702519) | more than 7 years ago | (#18211814)

Yes, I am SHOCKED that companies have implemented a systematic program of distorting the truth in order to increase profits.

I propose a new term for the heinous practice---"marketing".

Re:Personally I am SHOCKED (4, Informative)

Beardo the Bearded (321478) | more than 7 years ago | (#18212126)

What, really?

The same companies that lie about the capacity on EVERY SINGLE DRIVE they make? You don't think that they're a bunch of lying fucking weasels? (We're both using sarcasm here.)

I don't care how you spin it. 1024 is the multiple. NOT 1000!

Failure doesn't get fixed because making a drive more reliable means it costs more. If it costs more, it's not going to get purchased.

Re:Personally I am SHOCKED (3, Informative)

Lord Ender (156273) | more than 7 years ago | (#18212506)

Before computers were used in real engineering, we could get away with "k" sometimes meaning 1024 (like in memory addresses) and sometimes meaning 1000 (like in network speeds). Those days are past. Now that computers are part of real engineering work, even the slightest amount of ambiguity is not acceptable.

Differentiating between "k" (=1000) and "ki" (=1024) is a sign that the computer industry is finally maturing. It's called progress.

Re:Personally I am SHOCKED (1)

Hamilton Lovecraft (993413) | more than 7 years ago | (#18212712)

That's pretty hilarious. Computers used to be part of real engineering work. Now they're toys.

Re:Personally I am SHOCKED (1)

DogDude (805747) | more than 7 years ago | (#18212524)

Failure doesn't get fixed because making a drive more reliable means it costs more. If it costs more, it's not going to get purchased.

I couldn't disagree more. I know that I would pay more for even somewhat more reliable drives. The problem is that I can't find any sold that guarantee any kind of reliability other than the rock-bottom standard one year.

Re:Personally I am SHOCKED (1)

JackMeyhoff (1070484) | more than 7 years ago | (#18212522)

You mean like the AMD CPU ratings, for example? Shocking, isn't it?

And that's a really wide range (2, Funny)

VampireByte (447578) | more than 7 years ago | (#18211984)

I feel sorry for anyone buying drives on the low end of that range. A MTTF of 1 hour really sucks.

Re:And that's a really wide range (1)

User 956 (568564) | more than 7 years ago | (#18212004)

I feel sorry for anyone buying drives on the low end of that range. A MTTF of 1 hour really sucks.

Well, they don't call it "Best Borrow" for no reason.

Re:it's relative. (1)

goombah99 (560566) | more than 7 years ago | (#18211996)

How does it compare to flash MTBF? Or between manufacturers? If the ratio of actual to stated MTBF is the same for all hard disks, that's fine I guess, since I know how to divide by 15. But if it varies between manufacturers or between alternative technologies (DVD, hard drive, flash drive, metal film drive, tape), then this matters a great deal, as one will make the wrong choices or pay way too much for reliability not gained.

Unless they warranty this, which none do, the spec is meaningless, and they might as well lie.

Masters of estimates (0)

Anonymous Coward | more than 7 years ago | (#18211774)

First the thing about the drive sizes (1000 or 1024?), now this guesstimate...

Re:Masters of estimates (1)

dangitman (862676) | more than 7 years ago | (#18212374)

Well, the hard-drive makers are correct on the size thing - a gigabyte is 1000 megabytes - and the OS and software makers are wrong. I wish the software side would fix this problem. Does anybody know of any way to change preferences in Mac OS or Windows so that file sizes are read out correctly? I.e., that gigabytes are actually displayed as gigabytes, or that the listing is changed to correctly display gibibytes as the value? (Or kibibytes, mebibytes, whatever.)

In other news... (4, Informative)

Mr. Underbridge (666784) | more than 7 years ago | (#18211808)

...Carnegie Mellon researchers can't tell a mean from a median. This is inherently a long-tailed distribution in which the mean will be much higher than the median. Imagine a simple situation in which failure rates are 50%/yr, but those that last beyond a year last a long time. Mean time to failure might be 1000 years. You simply can't compare the statistics the way they have without knowing a lot more about the distribution than I saw in the article. Perhaps I missed it while skimming.
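A toy numeric version of the hypothetical above, using a made-up mixture (half the drives die in their first year, the survivors average ~2000 years) purely to show how far apart mean and median can sit in a long-tailed distribution; nothing here comes from the study's data.

import random

random.seed(0)
# Half the population fails uniformly within the first year; the rest
# is drawn from a long exponential tail with a 2000-year mean.
lifetimes = sorted(random.uniform(0, 1) if random.random() < 0.5
                   else random.expovariate(1 / 2000.0)
                   for _ in range(100_000))

mean = sum(lifetimes) / len(lifetimes)
median = lifetimes[len(lifetimes) // 2]
print(f"mean ~ {mean:.0f} years, median ~ {median:.2f} years")   # ~1000 years vs ~1 year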

Even better ... (3, Interesting)

khasim (1285) | more than 7 years ago | (#18211994)

Give me 6 month failure rates.

Start with 100 drives. Continuous usage.

How many fail in the first 6 months? 12 months? 18 months? ... 60 months? That would be the info that I'd need. Where's the big failure spike? I'm going to be replacing them right before that.

Re:Even better ... (1)

ivan256 (17499) | more than 7 years ago | (#18212230)

The big spike is at the beginning.

Re:Even better ... (1)

Grail (18233) | more than 7 years ago | (#18212466)

TFA tells you that there is no "bathtub curve" and no "failure spike". The drives just fail more frequently as they get older - it's an exponentially rising curve.

Re:In other news... (0)

Anonymous Coward | more than 7 years ago | (#18211998)

Except this isn't the first of these studies. Google ran one based on their own drive usage, and another group did one over an even larger set of drives than Google used (!)

Their findings: 1) Drives suck. 2) Expensive drives don't suck less. and finally 3)

those that last beyond a year last a long time

is false. There is no "bathtub" distribution of drive failures with a spike at the beginning and the end. The "burn-in" myth is just that.

Both of these reports were on /. just a few weeks ago.

Re:In other news... (0)

Anonymous Coward | more than 7 years ago | (#18212312)

You've been busted, Mr. Underbridge!

Re:In other news... (3, Informative)

Falkkin (97268) | more than 7 years ago | (#18212300)

In other news, Carnegie Mellon researchers know more about statistics than you give them credit for; blame ComputerWorld for crappy coverage of what the paper says. If you read the paper or the abstract, the researchers actually claim the opposite of what you are suggesting, namely, that the "infant mortality effect" (bathtub curve) often claimed for hard drives isn't actually the case. See Figure 4 in the paper and Section 5 ("Statistical properties of disk failures"). The paper is online here:

http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html [usenix.org]

I believe it... (2, Informative)

madhatter256 (443326) | more than 7 years ago | (#18211832)

Yeh. Don't rely on the HDD after it surpasses its manufacturer warranty.

Re:I believe it... (2, Insightful)

SighKoPath (956085) | more than 7 years ago | (#18211900)

Also, don't rely on the HDD before it surpasses its manufacturer warranty. All the warranty means is you get a replacement if it breaks - it doesn't provide any extra guarantees of the disk not failing.

Re:I believe it... (0)

Anonymous Coward | more than 7 years ago | (#18212168)

I sort of rely on the fact that drives won't. I haven't been able to work the system as well as a friend of mine, but as long as they fail within the warranty window, that's basically a free upgrade.

Re:I believe it... (2, Insightful)

The Clockwork Troll (655321) | more than 7 years ago | (#18212532)

In my experiences with several major drive vendors, I have never gotten an "upgrade". What you get is a replacement drive, but generally it's the same drive (perhaps refurbished or firmware-revised) and the original warranty period is still in effect (with perhaps a 30 day extension to account for your downtime). I've RMA'd a lot of drives and never have I gotten one of different spec/size. I'm not even sure this would be desirable, e.g. in the case of replacing a drive in a RAID array with something of different specification (yes, even "better" specification). Symmetry and everything.

Re:I believe it... (1)

Short Circuit (52384) | more than 7 years ago | (#18212610)

Manufacturers don't want to spend more than a certain percentage of their sales on warranty replacements, so they limit their warranty periods to a value that would yield a comfortably low number of RMAs.

By comparing manufacturer warranty rates, one can get a rough idea of how confident different manufacturers are about the lifetime of their products.

However, the only justification I can think of for not relying on a drive beyond the warranty would be that one doesn't get a free drive as replacement if it fails. But buying a new drive every three-to-five years, just because one can't get a free drive, seems silly to me.

Re:I believe it... (1)

drinkypoo (153816) | more than 7 years ago | (#18212418)

Don't rely on a HDD ever. This is why we have backups and RAID. Even RAID's not enough by itself.

or BEFORE... (1)

toby (759) | more than 7 years ago | (#18212670)

Sigh.

As Schwartz [sun.com] put it recently, there are two kinds of disk: Those that have failed, and those that are going to.

Before that (1)

phorm (591458) | more than 7 years ago | (#18212728)

Hell, nowadays I wouldn't rely on one single drive before it reaches warranty. Usually by the time of the shorter warranties (1 yr) you've accumulated enough important stuff to make the data loss much more painful than the cost of the replacement drive.

Now in some cases manufacturers with longer warranties are stating that they have more faith in their product, and certainly the sudden drop in warranty length (from 2-3 years down to one for many) indicates a lack of faith in their products.

Basically, a warranty isn't so much a guarantee on a product as a statement that this warranty length gives the manufacturer the maximum profit on drive sales vs. returns. In other words, any longer than that and the returns are going to eat into the company's profits, but there will be drive deaths both before and after that term. Nowadays a three-year warranty isn't any sort of guarantee of such longevity, but rather the point at which the manufacturer is no longer willing to eat the cost of returns.

Statistics (0)

Anonymous Coward | more than 7 years ago | (#18211864)

>The data sheets for the drives indicated MTTF between 1 and 1.5 million hours

In statistics the average alone doesn't tell you much; you also need to give the variance.

http://en.wikipedia.org/wiki/Variance [wikipedia.org]

You can give an average value of expected life, but you also need to know how spread out your distribution is to understand whether your product lasts longer than the competition.

Re:Statistics (0)

Anonymous Coward | more than 7 years ago | (#18212440)

Sorry but MTTF is actually the only statistic that directly translates into expected cost of replacement in a large HDD farm. Unless the downtime incurred by an unexpected HDD failure is so high that you're considering preemptively replacing some HDDs before they fail, variance and higher moments only have entertainment value.

Fuzzy math (1, Insightful)

Spazmania (174582) | more than 7 years ago | (#18211876)

Disk Drive Failures 15 Times What Vendors Say [...] That should mean annual failure rates of 0.88% [but] annual replacement rates were between 2% and 4%.

0.88 * 15 = 4?

Re:Fuzzy math (1)

mistahkurtz (1047838) | more than 7 years ago | (#18212040)

uh, yeah.... and 2+2=5

Re:Not So Fuzzy math (4, Informative)

Annoying (245064) | more than 7 years ago | (#18212636)

0.88% != 0.88
0.0088 * 15 = 0.132 (13%)
13% you say? The excerpt says 2%-4%. RTA and you'll see that they report up to 13% on some systems.
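For reference, the arithmetic behind the headline number, written out (8,760 is simply hours per year; the MTTF and replacement figures are the ones quoted in the summary and the article):

HOURS_PER_YEAR = 24 * 365   # 8760

def afr_from_mttf(mttf_hours):
    # Nominal annual failure rate implied by a datasheet MTTF.
    return HOURS_PER_YEAR / mttf_hours

print(afr_from_mttf(1_000_000))   # 0.00876 -> ~0.88% per year
print(afr_from_mttf(1_500_000))   # 0.00584 -> ~0.58% per year

# Typical observed replacement rates were 2-4%, up to ~13% on some systems;
# 0.13 / 0.0088 is where the "15 times" headline comes from.
print(0.13 / afr_from_mttf(1_000_000))   # ~14.8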

This study is useless. (2, Interesting)

Lendrick (314723) | more than 7 years ago | (#18211880)

In the article, they mention that the study didn't track actual failures, just how often customers *thought* there was a failure and replaced their drive. There are all sorts of reasons someone might think a drive has failed. They're not all correct. I can't begin to guess what percentage of those perceived failures were real.

This study is not news. All it says is that people *think* their hard drives fail more often than the mean time to failure.

Re:This study is useless. (1)

mandelbr0t (1015855) | more than 7 years ago | (#18211932)

And I think they fail less often than the MTTF. There, the statistics are satisfied as well, and it's still not news.

Re:This study is useless. (3, Interesting)

crabpeople (720852) | more than 7 years ago | (#18212520)

That's fair, but if you pull a bad drive, ghost it (assuming it's not THAT bad), plop the new drive in, and the system works flawlessly, what are you to assume?

I don't really care to know exactly what is wrong with the drive. If I replace it and the problem goes away, I would consider that a bad drive, even if you could still read and write to it. I just did one this morning that showed no symptoms other than Windows taking what I considered a long time to boot. All the user complained about was sluggish performance, and there were no errors or drive noises to speak of. Problem fixed, user happy, drive bad.

As I already posted, a good rule of thumb is that most drives go bad about 3 years from the date of manufacture.

Re:This study is useless. (1)

Lendrick (314723) | more than 7 years ago | (#18212752)

You obviously know what you're doing. Not all users do... in fact, the bitter techie in me is screaming that most don't. :)

Over my career, I've replaced a TON of SCSI drives (1)

mmell (832646) | more than 7 years ago | (#18211904)

I've still got quite a few of them, sizes ranging from 20MB to 2GB. Still operational (I presume). I wonder if those'd count towards the average?

Re:Over my career, I've replaced a TON of SCSI dri (1)

Pojut (1027544) | more than 7 years ago | (#18212008)

I've got a big ol' 5-inch 20MB hard drive whirring at home still...

In fact, my TRS-80 is still functional too...the tape drive is a little wonky, but what are ya gonna do?

MTTF to annual failure rates (0)

Anonymous Coward | more than 7 years ago | (#18211914)

The data sheets for the drives indicated MTTF between 1 and 1.5 million hours. That should mean annual failure rates of 0.88%
Shouldn't that mean annual failure rates between 0.58% and 880000% ?

Interface matters why? (3, Interesting)

neiko (846668) | more than 7 years ago | (#18211936)

TFA seems surprised by SATA drives lasting as long as Fibre... why on earth would your data interface have any consequences on the drive internals? Or are we assuming Interface = Data Throughput?

Re:Interface matters why? (0)

Anonymous Coward | more than 7 years ago | (#18212002)

I think the general assumption is that more expensive "enterprise" level drives are significantly more reliable than much cheaper consumer level equipment. Recent studies show this not to be true.

Re:Interface matters why? (3, Insightful)

ender- (42944) | more than 7 years ago | (#18212028)

TFA seems surprised by SATA drives lasting as long as Fibre... why on earth would your data interface have any consequences on the drive internals? Or are we assuming Interface = Data Throughput?

That statement is based on the long-held assumption that hard drive manufacturers put better materials and engineering into enterprise-targeted drives [Fibre] than they put into consumer-level drives [SATA].

Guess not...

Re:Interface matters why? (2, Informative)

Spazmania (174582) | more than 7 years ago | (#18212324)

They certainly charge enough more. SATA drives run about $0.50 per gig. Comparable Fibre Channel drives run about $3 per gig. A sensible person would expect the Fibre Channel drive to be as much as 6 times as reliable, but per the article there is no difference.

Re:Interface matters why? (1)

Danga (307709) | more than 7 years ago | (#18212076)

I thought the exact same thing. They are just dumbasses. The interface has probably zero effect on failure rate compared to the mechanical parts which are just about the same in all the drives.

FTA:

"the things that can go wrong with a drive are mechanical -- moving parts, motors, spindles, read-write heads," and these components are usually the same"

The only effect I can see it having would be if really shitty parts were used for one interface compared to the other.

Re:Interface matters why? (5, Informative)

mollymoo (202721) | more than 7 years ago | (#18212110)

TFA seems surprised by SATA drives lasting as long as Fibre... why on earth would your data interface have any consequences on the drive internals?

Fibre Channel drives, like SCSI drives, are assumed to be "enterprise" drives and therefore better built than "consumer" SATA and PATA drives. It's nothing inherent to the interface, but a consequence of the environment in which that interface is expected to be used. At least, that's the idea.

Re:Interface matters why? (1)

Penguin's Advocate (126803) | more than 7 years ago | (#18212166)

It is probably assumed that FC drives are more reliable because they are expensive and only really used in relatively expensive servers. It's the whole "professional vs. consumer grade" issue. It is generally assumed that "professional grade" drives should be more reliable than "consumer grade" drives. In my experience this is true: the 10,000 RPM SCSI drives in my 10-year-old Sun Ultra2s (which see continuous round-the-clock use) still work great, while I've never had a regular desktop drive from any manufacturer last more than 5 years. Not that my experience counts for much; I've only dealt with several hundred hard drives vs. the several hundred thousand in these studies. (Just for reference, the ratio I've seen between FC/SCSI and SATA/ATA drives failing is about 15:1 in favor of FC/SCSI, and I've never had a SCSI drive last less than 3 years, while I've had plenty of SATA and ATA drives last less than a couple of months.)

Re:Interface matters why? (1)

Spazmania (174582) | more than 7 years ago | (#18212360)

My 10-year-old drives are still working great too. It's my 1-to-4-year-old drives that are failing with alacrity.

Re:Interface matters why? (1)

Bill Dimm (463823) | more than 7 years ago | (#18212192)

TFA seems surprised by SATA drives lasting as long as Fibre... why on earth would your data interface have any consequences on the drive internals?

Because drive manufacturers claim [usenix.org] they use different hardware for the drive based on the interface. For example, a SCSI drive supposedly contains a disk designed for heavier use than an ATA drive, they aren't just the same disk with different interfaces.

Re:Interface matters why? (1)

Mr.Ziggy (536666) | more than 7 years ago | (#18212200)

In theory, just changing the interface board on a drive would not change the reliability of the drive. BUT manufacturers are charging much more for Fibre Channel drives than SATA or IDE, because they are of supposedly 'enterprise' quality, with suggestions of batch sorting or higher tolerances. It turns out those who are paying more for drive reliability are wrong. You can get more speed by spending more $/GB, but not more reliability.

I have thought the MTTF is bullshit for a while (5, Interesting)

Danga (307709) | more than 7 years ago | (#18211940)

I have had 3 personal-use hard drives go bad in the last 5 years; they were either Maxtor or Western Digital. I am not hard on the drives other than leaving them on 24/7. The drives that failed were all just for data backup and I put them in big, well-ventilated boxes. With this use I would think the drives would last for years (at least 5 years), but nope! The drives did not arrive broken either; they all functioned great for 1-2 years before dying. The quality of consumer hard drives nowadays is way, WAY low, and the manufacturers should do something about it.

I don't consider myself a fluke because I know quite a few other people who have had similar problems. What's the deal?

Also, does anyone else find this quote interesting?:

"and may have failed for any reason, such as a harsh environment at the customer site and intensive, random read/write operations that cause premature wear to the mechanical components in the drive."

It's a f$#*ing hard drive! Jesus H. Tapdancing Christ, how can they call that premature wear? Do they calculate the MTTF by just letting the drive sit idle and never reading and writing to it? That actually wouldn't surprise me.

Re:I have thought the MTTF is bullshit for a while (1)

dextromulous (627459) | more than 7 years ago | (#18212086)

I have had 3 personal-use hard drives go bad in the last 5 years; they were either Maxtor or Western Digital. I am not hard on the drives other than leaving them on 24/7.
Ever read the manufacturer's fine print on how they determine MTBF? Last time I did (yeah, it was over a year ago), it read: "8 hour a day usage." Drives that are on 24/7 get HOT, and heat leads to mechanical failure.

Re:I have thought the MTTF is bullshit for a while (1)

XenoPhage (242134) | more than 7 years ago | (#18212186)

Ever read the manufacturer's fine print on how they determine MTBF? Last time I did (yeah, it was over a year ago), it read: "8 hour a day usage." Drives that are on 24/7 get HOT, and heat leads to mechanical failure.

MTTF, no? MTBF would indicate a fixable system.

Yeah, but there has to be a plateau to the heat curve at some point. It's not as if the heat just keeps going up and up. I would think that the constant on/off each day, causing expansion and contraction of the parts as they heat and cool, would cause much more wear over time. Leaving it on 24/7 in a well-ventilated and cooled system should, I would think, keep the drives running better.

Where are the majority of the failures anyway? In the mechanical components or on the disk platters themselves? I.e., is it mechanical wear causing failures, or a breakdown of the chemicals used to coat the drive platters?

Re:I have thought the MTTF is bullshit for a while (1)

dextromulous (627459) | more than 7 years ago | (#18212262)

Acronyms schmackronyms... anyway, I found at least one paper that I read in the past that states the 8 hours/day thing I was referring to: http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf [seagate.com]

The 8 hours/day is referring to personal storage (as opposed to enterprise storage systems,) and this discussion is supposed to be about enterprise storage, so I'm off topic anyway. (BTW, the whitepaper I linked to does specify it as MTBF, for what it's worth)

Re:I have thought the MTTF is bullshit for a while (1)

dextromulous (627459) | more than 7 years ago | (#18212194)

I was able to quickly find at least one reference [seagate.com] to this measure (8 hours/day, 300 days a year for personal storage [PS] drives; 24 hours/day, 365 days a year for enterprise storage [ES] drives.)

The most significant difference in the reliability specification of PS and ES drives is the expected power-on hours (POH) for each drive type. The MTBF calculation for PS assumes a POH of 8 hours/day for 300 days/year while the ES specification assumes 24 hours per day, 365 days per year.

Re:I have thought the MTTF is bullshit for a while (1)

mollymoo (202721) | more than 7 years ago | (#18212244)

Do you seriously think a drive won't have reached thermal equilibrium after an hour, let alone after several hours? Mine seem to get up to their 'normal' temperatures in 30 minutes or less. And according to the Google study, heat doesn't lead to a significantly increased risk of failure till you get above 45 C or so.

Re:I have thought the MTTF is bullshit for a while (1)

dextromulous (627459) | more than 7 years ago | (#18212442)

Do you seriously think a drive won't have reached thermal equilibrium after an hour, let alone after several hours? Mine seem to get up to their 'normal' temperatures in 30 minutes or less.
Sure, they will have reached "thermal equilibrium" after a short period of time. See Figure 9 in this paper [seagate.com], "Reliability reduction with increased power on hours, ranging from a few hours per day to 24 x 7 operation," to see why I'm not sure that merely being hot is the problem.

And according to the Google study, heat doesn't lead to a significantly increased risk of failure till you get above 45 C
I'll have to take your word for it, I haven't read their study yet.

I am shocked! (2, Insightful)

Anonymous Coward | more than 7 years ago | (#18211958)

I just can't believe that the same vendors that would misrepresent the capacity of their disks by redefining a gigabyte as 1,000,000,000 bytes instead of 1,073,741,824 bytes would misrepresent their MTBF too! And by the way, nobody actually runs a statistically significant sample set of their equipment for 10,000 hours to arrive at an MTBF of 10,000 hours, so isn't their methodology a little suspect in the first place?
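For what it's worth, the usual way vendors reach million-hour figures is by pooling device-hours across a large test population rather than running any single drive that long; a rough sketch with invented numbers:

def estimated_mtbf(n_drives, test_hours, failures):
    # Classic device-hours / failures estimate; assumes a constant
    # failure rate, which is exactly the assumption being questioned here.
    return n_drives * test_hours / failures

# e.g. 1,000 drives tested for 3,000 hours with 3 failures observed
print(estimated_mtbf(1000, 3000, 3))   # 1,000,000 hours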

Off-Topic: SI Units (5, Informative)

ewhac (5844) | more than 7 years ago | (#18212634)

I just can't believe that the same vendors that would misrepresent the capacity of their disk by redefining a Gigabyte as 1,000,000,000 bytes instead of 1,073,741,824 bytes would misrepresent their MTBF too!

Not that this is actually relevant or anything, but there's been a long-standing schism between the computing community and the scientific community concerning the meaning of the SI prefixes Kilo, Mega, and Giga. Until computers showed up, Kilo, Mega, and Giga referred exclusively to multipliers of exactly 1,000, 1,000,000, and 1,000,000,000, respectively. Then, when computers showed up and people had to start speaking of large storage sizes, the computing guys overloaded the prefixes to mean powers of two which were "close enough." Thus, when one speaks of computer storage, Kilo, Mega, and Giga refer to 2**10, 2**20, and 2**30 bytes, respectively. Kilo, Mega, and Giga, when used in this way, are properly slang, but they've gained traction in the mainstream, causing confusion among members of differing disciplines.

As such, there has been a decree [nist.gov] to give the powers of two their own SI prefix names. The following have been established:

  • 2**10: Kibi (abbreviated Ki)
  • 2**20: Mebi (Mi)
  • 2**30: Gibi (Gi)

These new prefixes are gaining traction in some circles. If you have a recent release of Linux handy, type /sbin/ifconfig and look at the RX and TX byte counts. It uses the new prefixes.

Schwab
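As a quick illustration of how large the gap gets at disk sizes (using a hypothetical "500 GB" drive, not any particular product):

GB = 1000**3    # decimal gigabyte (SI), the unit drive vendors use
GiB = 1024**3   # binary gibibyte, the unit most operating systems report

capacity_bytes = 500 * GB
print(capacity_bytes / GiB)   # ~465.66 -- the "missing" ~34 GB is purely a units mismatch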

Mod parent up! (1)

Jaqenn (996058) | more than 7 years ago | (#18212734)

I burned all my mod points this morning, and this one definitely deserves +X informative.

You takes your chances (1)

davidwr (791652) | more than 7 years ago | (#18212024)

At any given time, the drive has a finite probability of failing in the next 30 days of normal use.

When this probability is high enough, you should replace it or take actions (like more frequent backups) that raise your tolerance for failure.

Imagine drives had a failure rate similar to radioactive decay:

2% of drives failed in the 1st year,
2% of the remaining drives failed in the 2nd year,
2% of the remaining drives failed in the 3rd year,
and so on.

Why should I replace my 5 year old drive with an identical new one? I shouldn't.

However, that's not the real world. In the real world, drives are more like cars - a drive with the equivalent of 100,000 miles and 10 years on it is a lot more likely to have a mechanical breakdown than one with 6 months and 5,000 miles.
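A toy comparison of the two worlds described above: a memoryless 2%-per-year hazard versus one that climbs with age. The 2% base rate and the growth factor are illustrative assumptions only, not fitted to the study.

def survival(years, hazard_fn):
    # Fraction of drives still alive after `years`, given a per-year hazard.
    alive = 1.0
    for year in range(1, years + 1):
        alive *= 1.0 - hazard_fn(year)
    return alive

constant = lambda year: 0.02                    # same 2% risk every year
wear_out = lambda year: 0.02 * 1.5**(year - 1)  # risk grows ~50% per year

for y in (1, 3, 5):
    print(y, round(survival(y, constant), 3), round(survival(y, wear_out), 3))
# Under the constant model a 5-year-old drive is no riskier than a new one;
# under the wear-out model its yearly failure chance has climbed from 2% to ~10%,
# which matches the "drives are like cars" intuition.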

Re:You takes your chances (1)

ivan256 (17499) | more than 7 years ago | (#18212402)

This isn't about data loss, it's about cost. Nobody smart is taking their chances and playing the odds; they are protecting their data with redundancy and backups. You're going to run the drive until it dies, has performance-impacting error rates, or needs to be upgraded for some other reason. This isn't about knowing when you need to buy a new drive to save your stuff. It's about knowing how much budget to allocate to drive replacements in your organization that has 50,000 drives. Tolerance for failure is not measured in data lost. It is measured in dollars.

Having read the paper and seen the talk... (2, Informative)

reset_button (903303) | more than 7 years ago | (#18212036)

Here are the main conclusions:
  • the MTTF is always much lower than the observed time to disk replacement
  • SATA is not necessarily less reliable than FC and SCSI disks
  • contrary to popular belief, hard drive replacement rates do not enter steady state after the first year of operation, and in fact steadily increase over time.
  • early onset of wear-out has a stronger impact on replacement than infant mortality.
  • they show that the common assumptions that the time between failure follows an exponential distribution, and that failures are independent, are not correct.
It was an interesting paper (won the best paper award) at this year's FAST (File and Storage Technologies) conference. Here is a link [cmu.edu] to the paper, and the summary [usenix.org] from the conference.

Perhaps your data is safe if you DUPElicate it (0)

Anonymous Coward | more than 7 years ago | (#18212042)

At least we know slashdot won't be in danger of losing their data if that's the case ;-)
http://hardware.slashdot.org/article.pl?sid=07/02/21/004233 [slashdot.org]

Corporations misrepresent products, news at 11:00! (1)

NerveGas (168686) | more than 7 years ago | (#18212058)

Is there anyone out there that actually believed the published MTBF figures, even BEFORE these articles came out?

It's hard to take someone seriously when they claim that their drives have a 100+ year MTBF, especially since precious few are still functional after 1/10th of that much use. To make it better, many drives are NOT rated for continuous use, but only a certain number of hours per day. I didn't know that anyone EVER believed the MTBF B.S..

Re:Corporations misrepresent products, news at 11: (1)

mollymoo (202721) | more than 7 years ago | (#18212504)

It's hard to take someone seriously when they claim that their drives have a 100+ year MTBF, especially since precious few are still functional after 1/10th of that much use.

You're misinterpreting MTBF. A 100 year MTBF does not mean the drive will last 100 years, it means that 1/100 drives will fail each year. There will be another spec somewhere which specifies the design lifetime. For the Fujitsu MHT2060AT [fujitsu.com] drive which was in my laptop, the MTBF is 300,000 hours, but the component life is a crappy 20,000 hours, or 3 years - 93% of drives should make it that far given the MTBF. After the end of the design lifetime, all bets are off.
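The 93% figure follows directly from the exponential (constant-hazard) model that MTBF numbers imply; a quick check using the datasheet values quoted above:

from math import exp

mtbf_hours = 300_000        # Fujitsu MHT2060AT datasheet MTBF
design_life_hours = 20_000  # quoted component life (~3 years)

print(round(exp(-design_life_hours / mtbf_hours), 3))   # ~0.936 -> about 93% survive to end of design life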

I replace Drives for reasons other than failure. (1)

zibix (654122) | more than 7 years ago | (#18212074)

I didn't notice anything in the article that would indicate that they only took into account drives being replaced due to failure. It seems like this would be common sense, but I'd like some verification that only drive failures were being included in this "replacement" study.

Check SMART Info (3, Interesting)

Bill Dimm (463823) | more than 7 years ago | (#18212094)

Slightly off-topic, but if you haven't checked the Self-Monitoring, Analysis and Reporting Technology (SMART) info provided by your drive to see if it is having errors, you probably should. You can download smartmontools [sourceforge.net] , which works on Linux/Unix and Windows. Your Linux distro may have it included, but may not have the daemon running to automatically monitor the drive (smartd).

To view the SMART info for drive /dev/sda do:
smartctl -a /dev/sda
To do a full disk read check (can take hours) do:
smartctl -t long /dev/sda

Sadly, I just found read errors on a 375-hour-old drive (manufacturer's software claimed that repair succeeded). Fortunately, they were on the Windows partition :-)

RAID = Redundant Articles of Identical Discourse (2, Informative)

MasterC (70492) | more than 7 years ago | (#18212102)

New meaning for RAID: Redundant Articles of Identical Discourse.
Slashdot has a high rate of RAID, which is a bad thing. Which is a bad thing. It has been a whole 9 days. Slashdot needs a story moderation system so dupe articles can get modded out of existence. Ditto for Slashdot editors who do the duping! :) (I have long since disabled tagging since 99% of the tags were completely worthless: "yes", "no", "maybe", "fud", etc. If tagging is actually useful now, please let me know!)

Can we get redundant posting on the story about google's paper [slashdot.org] ?

Re:RAID = Redundant Articles of Identical Discours (1)

Nimey (114278) | more than 7 years ago | (#18212538)

They aren't useful yet. Given the crowd, won't be until they're rethought.

Unfortunately (1)

nmos (25822) | more than 7 years ago | (#18212176)

Unfortunately the data was skewed by one large web site that reported its results multiple times.

Re:Unfortunately (1)

winkydink (650484) | more than 7 years ago | (#18212400)

Well put! That took me a second. :)

Thanks for the tip! (1)

PatPending (953482) | more than 7 years ago | (#18212206)

He echoed storage vendors and analysts in pointing out that as many as half of the drives returned to vendors actually work fine and may have failed for any reason, such as a harsh environment at the customer site and intensive, random read/write operations that cause premature wear to the mechanical components in the drive.
Random read/write operations? Oh, okay, I'll start using *sequential* read/write operations instead! Thanks for the tip!

Odd numbers for memory failure? (1)

nmos (25822) | more than 7 years ago | (#18212290)

One of the things that bugged me last time this report was on /. was that 2 of the three sources reported that memory was replaced in 20% or more of their system failures. That seems pretty odd, because in my experience memory hardly ever just goes bad. Sure, sometimes it's bad right out of the box, which is why I test every module that I buy, but once it's installed and tested, memory tends to keep working just about forever. If that number is off, then I wonder how seriously I should take their other numbers.

Re:Odd numbers for memory failure? (2, Interesting)

Akaihiryuu (786040) | more than 7 years ago | (#18212654)

I had a 4MB 72-pin parity SIMM go bad one time... this was about 12 years ago in a 486 I used to have. It just didn't work one day (it worked for the first two months). Turn the computer on, get past BIOS start, bam... parity error before the bootloader could even start. Reboot, try again, parity error. Turn off parity checking, and it actually started to boot and then crashed. The RAM was obviously very defective... when I took that one stick out, the computer booted normally even with parity on; if I tried to boot with just that stick it would never even POST. That's the only time I have ever seen memory fail... but then it came from a really shady local dealer who regularly scammed people. This same guy had a rack of "shareware" DOS games with neatly printed labels (all labels he printed) for like $5/disk, all of the disks completely blank (not even formatted). I had happened to get one of those when I got the RAM, and my friend did too (from another part of the rack; we didn't give it much thought at the time, it was just an "oh, this looks like it might be neat" thing). Neither disk was even formatted. The CD-ROM drives he sold me and my friend died within a month also (about a month after the RAM). Amazingly, the store was still in business when I went back with the stick of RAM... he looked at it with a magnifying glass, claimed it was "scratched" and therefore abused. I burned rubber out of his parking lot, tossing a lot of gravel against the windows, then I found a reputable place to get RAM (though this was back in the days when 4MB cost $200). Two days later I drove by, and the place was boarded up and closed. Both CD-ROM drives died within 2 days of each other a month later. Nothing that came out of that place worked.

Re:Odd numbers for memory failure? (1, Informative)

Anonymous Coward | more than 7 years ago | (#18212736)

Where I work we have some large compute clusters where the nodes report memory errors. It's actually very common for a memory module to start throwing errors that eventually exceed a threshold for replacement.

We see everything eventually die - power supplies, fans, motherboards, RAM, CPUs, drives. Nothing is immune from "wearing out" except maybe the boxes themselves.

Which is why I use Samsung (1)

WindBourne (631190) | more than 7 years ago | (#18212350)

Samsung seems to have pretty decent QC at this time. I have no issues with them. OTOH, I have seen Maxtors die with less than 2 years on them.

No way (2, Funny)

Tablizer (95088) | more than 7 years ago | (#18212388)

High rate of failure? That's a bunch of

Seagate (3, Insightful)

mabu (178417) | more than 7 years ago | (#18212390)

After 12 years of running Internet servers, I won't put anything but Seagate SCSI drives in any mission critical servers. My experience indicates Seagate drives are superior. Who's the worst? Quantum. The only thing Quantum drives are good for is starting a fire IMO.

Re:Seagate (2, Funny)

CelticWhisper (601755) | more than 7 years ago | (#18212482)

Well, duh. Why do you think they used to call them Fireballs?

Faster, cheaper, more reliable (2, Informative)

dangitman (862676) | more than 7 years ago | (#18212478)

Pick any two.

I've noticed this personally. Now, anecdotal evidence doesn't count for a lot, and it may be the case that we are pushing our drives harder. But back in the day of 40MB hard drives that cost a fortune, they used to last forever. The only drives I ever had fail on me in the old days were the SyQuest removable HD cartridges, for obvious reasons. But even they didn't fail that often, considering the extra wear and tear of having a removable platter with separate heads in the drive.

But these days, with our high-capacity ATA drives, I see hard drives failing every month. Sure, the drives are cheap and huge, but they don't seem to make them like they used to. I guess it's just a consequence of pushing the storage and speed to such high levels, and cheap mass-production. Although the drives are cheap, if somebody doesn't back up their data, the costs are incalculable if the data is valuable.

A Story (1)

alan_dershowitz (586542) | more than 7 years ago | (#18212518)

When I was in high school in 1995, I was a network intern. We had a 486 Novell Netware server for the high school building. The actual admin was a LOTR fan, and named it GANDALF, others were SAMWISE, etc. One day about four years ago, a friend of mine who worked for the school district calls me and says, "hey, I saw Gandalf in the dumpster today. I thought you might want him, so I grabbed him."

Besides nostalgia, there wasn't a lot I could do with a giant, noisy 486 anymore, so I ended up just pulling the SCSI interface and drive for use in another machine I had and dumping the rest. I was living in a trailer at the time, and was using a closet as my "server room." After about six months of service, the machine died on me. Everything inside the case had a crust on it. It turned out that I had a roof leak in the closet, and it eventually soaked and killed the machine.

Anyway, it's 2007 and I'm still using that drive in a Samba print server. It's still alive despite a decade having passed and it being soaked with rainwater.