Yottabytes: Storage and Disaster Recovery

Nov 18 2016   6:21PM GMT

These 5 SMART Statistics Predict Disk Failure

Sharon Fisher

Tags:
Backblaze
Failure
SMART

When will a disk drive fail? The people at BackBlaze, who use 67,814 disk drives, have been looking at Self-Monitoring, Analysis and Reporting Technology (SMART) disk drive statistics to try to predict this.

They’ve been looking at this for a while, at least since 2014. As I wrote then:

“SMART defines — and measures, for those vendors that support it — more than 70 characteristics of a particular disk drive. But while it’s great to know how many High Fly Writes or Free Fall Events a disk has undergone, these figures aren’t necessarily useful in any real sense of being able to predict a hard drive failure.

“Part of this is because of the typical problem with standards: Just because two vendors implement a standard, it doesn’t mean they’ve implemented it in the same way. So the way Seagate counts something might not be the same way as Hitachi counts something. In addition, vendors might not implement all of the standard. Finally, in some cases, even the standard itself is…unclear, as with Disk Shift, or the distance the disk has shifted relative to the spindle (usually due to shock or temperature), where Wikipedia notes, ‘Unit of measure is unknown.’”

At that point, BackBlaze had determined that out of the 70 statistics SMART tracked, there was really only one that mattered: SMART 187, or Reported_Uncorrectable_Errors. At the time, BackBlaze wrote: “Drives with 0 uncorrectable errors hardly ever fail. Once SMART 187 goes above 0, we schedule the drive for replacement.”

Since then, the company has been looking at the SMART statistics some more, and it’s now added four other statistics that it’s determined have a correlation to failed drives, writes senior marketing manager Andy Klein:

  • SMART 5 Reallocated Sectors Count
  • SMART 188 Command Timeout
  • SMART 197 Current Pending Sector Count
  • SMART 198 Uncorrectable Sector Count
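To make the rule concrete, here is a minimal sketch of checking a drive against the five statistics (SMART 187 plus the four listed above). It is an illustration, not BackBlaze's own tooling: the function name and the dictionary layout are assumptions, and because not every vendor reports every attribute, missing values are simply skipped.

```python
# The five SMART attribute IDs named in the article.
WATCHED_SMART_IDS = (5, 187, 188, 197, 198)


def flagged_attributes(raw_values):
    """Return the watched SMART IDs whose raw value is greater than zero.

    raw_values is a hypothetical dict mapping SMART attribute ID -> raw value.
    Attributes a vendor does not report may be missing or None; both are skipped.
    """
    return [
        attr_id
        for attr_id in WATCHED_SMART_IDS
        if (raw_values.get(attr_id) or 0) > 0
    ]


# Made-up readings for two hypothetical drives.
healthy = {5: 0, 187: 0, 188: 0, 197: 0, 198: 0}
suspect = {5: 8, 187: 0, 197: 2}  # 188 and 198 not reported by this drive

print(flagged_attributes(healthy))  # []
print(flagged_attributes(suspect))  # [5, 197]
```

A non-empty result only means the drive deserves a closer look; what to do next (watch it or replace it) is a policy decision this sketch does not make.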

The company didn’t say why it started looking at other SMART statistics if it had already determined that one statistic was correlated with failure. (It also makes its raw statistics available in case you want to play with correlations yourself.) Klein also points out that not all vendors report all the statistics, or report them in the same way, which was an issue two years ago as well.

Another factor is the period of time in which the errors occur, Klein writes. “For example, let’s start with a hard drive that jumps from zero to 20 Reported Uncorrectable Errors (SMART 187) in one day,” he writes. “Compare that to a second drive which has a count of 60 SMART 187 errors, with one error occurring on average once a month over a five year period. Which drive is a better candidate for failure?” He doesn’t actually say, though he implies that it’s the first one.

Incidentally, BackBlaze has even started looking at High Fly Writes as a possible indicator of future disk failure. “This stat is the cumulative count of the number of times the recording head ‘flies’ outside its normal operating range,” Klein explains, noting that while 47 percent of failed drives have a SMART 189 value of greater than zero, so do 16.4 percent of drives that work. “The false positive percentage of operational drives having a greater than zero value may at first glance seem to render this stat meaningless. But what if I told you that for most of the operational drives with SMART 189 errors, that those errors were distributed fairly evenly over a long period of time?” he asks. “For example, there was one error a week on average for 52 weeks. In addition, what if I told you that many of the failed drives with this error had a similar number of errors, but they were distributed over a much shorter period of time, for example 52 errors over a one-week period. Suddenly SMART 189 looks very interesting in predicting failure by looking for clusters of High Fly Writes over a small period of time.”
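One way to look for that kind of clustering is to track how much a cumulative counter rises within a short window. The sketch below assumes one raw reading per day, oldest first, and a counter that never decreases; the function name and the sample data are made up for illustration and are not how BackBlaze necessarily does it.

```python
def max_increase_in_window(cumulative_counts, window_days):
    """Largest rise in a cumulative counter across any span of window_days days.

    cumulative_counts: one raw reading per day, oldest first (assumed
    non-decreasing, as a lifetime SMART counter normally is).
    """
    best = 0
    for end in range(len(cumulative_counts)):
        start = max(0, end - window_days)
        best = max(best, cumulative_counts[end] - cumulative_counts[start])
    return best


# Klein's two scenarios, rebuilt as hypothetical daily SMART 189 readings.
# Steady drive: one High Fly Write a week for 52 weeks.
steady = [day // 7 for day in range(365)]
# Bursty drive: nothing for most of the year, then 52 errors in the final week.
bursty = [0] * 358 + [8, 16, 24, 32, 40, 48, 52]

print(max_increase_in_window(steady, 7))  # 1
print(max_increase_in_window(bursty, 7))  # 52
```

Both drives end the year with the same total of 52, so the raw value alone cannot tell them apart; the per-window increase can. The same check applies to the SMART 187 example, where 20 errors in a single day stand out against 60 spread over five years.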

That’s not to say that any of the statistics, or even a combination of them, is a perfect predictor of when a disk drive is going to fail – or when it isn’t. “Operational drives with one or more of our five SMART stats greater than zero – 4.2 percent. Failed drives with one or more of our five SMART stats greater than zero – 76.7 percent,” writes Klein. “That means that 23.3 percent of failed drives showed no warning from the SMART stats we record.” But if nothing else, it sounds like starting to see these particular SMART errors is the way to bet.
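For a rough sense of how strong that bet is, the two percentages can be treated as conditional probabilities and combined with Bayes' rule. This is a back-of-the-envelope sketch, not anything Klein computes: the 2 percent share of failing drives is a made-up figure used only to show the arithmetic, and the function names are hypothetical.

```python
# Figures quoted in the article.
SENSITIVITY = 0.767          # failed drives with at least one of the five stats > 0
FALSE_POSITIVE_RATE = 0.042  # operational drives with at least one of the five stats > 0

# ASSUMPTION: share of drives that go on to fail. Not from the article.
ASSUMED_FAILURE_SHARE = 0.02


def positive_likelihood_ratio(sensitivity, false_positive_rate):
    """How much more often a warning shows up on failed drives than on working ones."""
    return sensitivity / false_positive_rate


def probability_failing_given_warning(prior, sensitivity, false_positive_rate):
    """Bayes' rule: P(drive fails | at least one of the five stats > 0)."""
    flagged_failing = sensitivity * prior
    flagged_healthy = false_positive_rate * (1.0 - prior)
    return flagged_failing / (flagged_failing + flagged_healthy)


ratio = positive_likelihood_ratio(SENSITIVITY, FALSE_POSITIVE_RATE)
posterior = probability_failing_given_warning(
    ASSUMED_FAILURE_SHARE, SENSITIVITY, FALSE_POSITIVE_RATE
)

print(f"Warning is about {ratio:.1f}x as common on failed drives")
print(f"Share of failures with no warning: {1 - SENSITIVITY:.1%}")
print(f"P(fails | warning), under the assumed 2% share: {posterior:.1%}")
```

Under that assumed share, most flagged drives would still be working ones, even though a flagged drive is far more likely to fail than an unflagged one. Change the assumed share and the last figure moves with it.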

Disclaimer: I am a BackBlaze customer.
