Yottabytes: Storage and Disaster Recovery

Nov 18 2014   2:11PM GMT

Here’s the One Thing to Look At to See If Your Hard Drive Will Fail

Sharon Fisher

Tags:
Backup
Storage

Anyone who’s had a hard drive fail just as they were about to back it up (honest!) will understand how much we’d all like to know when our hard disks are about to fail.

Some time ago (between 1995 and 2004, depending on how you count), a standard was developed called Self-Monitoring, Analysis and Reporting Technology (SMART, get it?) that was intended to help with this problem.

Unfortunately, like many other technologies, its user experience was not the best. SMART defines — and measures, for those vendors that support it — more than 70 characteristics of a particular disk drive. But while it’s great to know how many High Fly Writes or Free Fall Events a disk has undergone, these figures aren’t necessarily useful in any real sense of being able to predict a hard drive failure.

Part of this is because of the typical problem with standards: Just because two vendors implement a standard, it doesn’t mean they’ve implemented it in the same way. So the way Seagate counts something might not be the way Hitachi counts it. In addition, vendors might not implement all of the standard. Finally, in some cases, even the standard itself is…unclear, as with Disk Shift, or the distance the disk has shifted relative to the spindle (usually due to shock or temperature), where Wikipedia notes, “Unit of measure is unknown.”

That’s not going to be helpful if, for example, one vendor is measuring it in microns and one in centimeters.

There have been various attempts at figuring out which of these statistics are actually useful. One in particular was a paper presented at a USENIX conference in 2007 by three Google engineers, “Failure Trends in a Large Disk Drive Population.” What was interesting about Google is that it used enough hard drives to be able to develop some useful correlations between these 70-odd (and some of them are very odd) measurements and actual failure.

Now there’s sort of an update to that paper, but it uses littler words and is generally more accessible to people. It’s put out by Brian Beach, an engineer at BackBlaze; we’ve written about them before. Like Google, their insights into commodity hard disk drives are useful, simply because they use so darn many of them.

What BackBlaze has done this time is look at all the drives they have that have failed, and then go back and look at all their SMART statistics, and then correlate them. The company also looked at how different vendors measure these different statistics, so they have a good idea about which statistics are relatively common across vendors. This gives us a better idea of which statistics we should actually be paying attention to.
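To make the "go back and correlate" idea concrete, here is a minimal sketch, not BackBlaze's actual method. It assumes each drive is a record with a `failed` flag and a dictionary of raw SMART values keyed by attribute ID; the field names, the function name, and the toy records are all made up for illustration.

```python
from collections import defaultdict


def failure_rate_by_attribute(drives, attribute_id):
    """Compare failure rates for drives whose raw value for
    `attribute_id` is zero versus above zero.

    `drives` is an iterable of dicts shaped like
    {"failed": bool, "smart_raw": {attribute_id: int}} (hypothetical layout).
    Drives that do not report the attribute are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # nonzero? -> [failed, total]
    for drive in drives:
        raw = drive["smart_raw"].get(attribute_id)
        if raw is None:
            continue  # this vendor or model does not report the attribute
        nonzero = raw > 0
        counts[nonzero][0] += int(drive["failed"])
        counts[nonzero][1] += 1
    return {nonzero: failed / total for nonzero, (failed, total) in counts.items()}


# Toy records, invented purely to exercise the function.
TOY_DRIVES = [
    {"failed": False, "smart_raw": {5: 0}},
    {"failed": False, "smart_raw": {5: 0}},
    {"failed": False, "smart_raw": {5: 0}},
    {"failed": True, "smart_raw": {5: 0}},
    {"failed": True, "smart_raw": {5: 8}},
    {"failed": False, "smart_raw": {5: 1}},
    {"failed": True, "smart_raw": {}},  # attribute not reported, skipped
]

if __name__ == "__main__":
    print(failure_rate_by_attribute(TOY_DRIVES, 5))
```

A statistic is only interesting if the two rates differ a lot and if enough vendors report it, which is why drives that do not report the attribute are left out rather than counted as zero.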

As it turns out, there’s really just one: SMART 187 – Reported_Uncorrectable_Errors.

“Number 187 reports the number of reads that could not be corrected using hardware [Error Correcting Code] ECC,” BackBlaze explains. “Drives with 0 uncorrectable errors hardly ever fail. Once SMART 187 goes above 0, we schedule the drive for replacement.”
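For the curious, that replacement rule is simple enough to sketch in a few lines. This is an illustration only, assuming the attribute table has the ten-column layout that `smartctl -A` (from smartmontools) typically prints, with the raw value in the last column. The sample text and function names are made up, and real output varies by drive and tool version.

```python
def raw_value(attribute_table, attribute_id):
    """Return the raw value for a SMART attribute ID, or None if absent.

    Assumes rows shaped like:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in attribute_table.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == str(attribute_id):
            return int(fields[9])
    return None


def should_replace(attribute_table):
    """Apply the quoted rule: flag the drive once SMART 187 goes above 0."""
    value = raw_value(attribute_table, 187)
    return value is not None and value > 0


# Illustrative sample, not captured from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
"""

if __name__ == "__main__":
    print(should_replace(SAMPLE))
```

Note that a drive that doesn't report attribute 187 at all comes back as "don't replace" here, which really means "don't know"; a real tool would want to treat that case separately.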

Interestingly, this particular statistic isn’t even mentioned in the Google paper, nor is it called out in the Wikipedia entry for SMART as being a potential indicator of imminent electromechanical failure.

BackBlaze also discusses its results with several other statistics, and explains why it doesn’t find them useful. Finally, for the statistics wonks among you, the company also published a complete list of SMART results among its 40,000 disk drives. (And for some, that’s still not enough; in the comments section, people are asking BackBlaze to release the raw data in spreadsheet form.)

In addition to giving us one useful stat to look at rather than 70 un-useful ones, this research will hopefully encourage hardware vendors to work together to report their statistics more meaningfully, and software vendors to develop better, more useful tools to interpret them.

Disclaimer: I am a BackBlaze customer.

2  Comments on this Post

 
  • Eric Parizo
    Some irony in 187 being the police code for death. 
  • Sharon Fisher
    ha! good one!
