Yottabytes: Storage and Disaster Recovery

Nov 27 2019   10:05PM GMT

32,768-Hour Hard Disk Drive Failure Strikes HPE

Sharon Fisher


People creating a new system sometimes underestimate how long it’ll be around. That was the core of the “Y2K Problem,” when people were concerned that computer programs around the world would fail because designers had stored years as two digits and never considered the idea of a date after 1999.

Boy, that feels like a long time ago.

Most of the Y2K bugs got worked out before everything went poof at midnight on December 31, 1999, but it’s not unusual for there to be similar bugs related to data fields that get filled up. In addition, hackers have learned to find and exploit related bugs in how programs handle memory buffers, such as the “Heartbleed” buffer over-read bug from about five years ago.

But more recently, there’s a doozy.

“Bulletin: HPE SAS Solid State Drives – Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation,” reported the Hewlett Packard Enterprise Support Center earlier this month.

If that seems like an odd number, it’s not – literally, that is. It’s 2 to the 15th power.

So let’s take a guess – some field associated with the solid state drive is 15 bits long, and when the hour count gets beyond that (which is about 1,365 days, or 3 ¾ years), the field fills up and the system is froached.

“The power-on counter in the affected drives uses a 16-bit Two’s Complement value (which can range from −32,768 to 32,767). Once the counter exceeds the maximum value, it fails hard,” writes Marco Chiappetta in Forbes.
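The failure mode Chiappetta describes can be sketched in a few lines. This is a minimal illustration, not the actual firmware logic: the function name and counter handling here are hypothetical, assuming only that the power-on-hours count is stored as a 16-bit two’s-complement integer.

```python
import ctypes

def tick_power_on_hours(hours):
    """Increment an hour counter stored as a 16-bit signed integer.

    ctypes.c_int16 truncates the result to 16 bits, so incrementing
    past 32,767 wraps around to -32,768 -- the point at which the
    affected firmware reportedly fails hard.
    """
    return ctypes.c_int16(hours + 1).value

hours = 32766
for _ in range(3):
    hours = tick_power_on_hours(hours)
    print(hours)  # 32767, then -32768, then -32767
```

At hour 32,768 the counter no longer holds a sane value, which is consistent with the bulletin’s report of drives failing at exactly that point.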

And it gets really froached.

“After the SSD failure occurs, neither the SSD nor the data can be recovered,” HPE notes. “In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.”

Chiappetta goes into more detail about that aspect. “This issue can be particularly catastrophic because the affected enterprise-class drives were likely installed as part of a many-drive JBOD (Just A Bunch Of Disks) or RAID (Redundant Array of Independent Disks), so the potential for ALL of the drives to fail nearly simultaneously (assuming they were all powered on for the first time together) is very likely.”

Oh goody.

HPE said that one of its vendors had discovered the problem. “HPE was notified by a Solid State Drive (SSD) manufacturer of a firmware defect affecting certain SAS SSD models (reference the table below) used in a number of HPE server and storage products (i.e., HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 are affected).”

One wonders how this bug presented itself. Did someone happen to run across it just in time? How long had HPE drives been crashing and burning before this bug was tracked down and repaired?

And which vendor was this? HPE doesn’t say, but one would guess that HPE might not be using that vendor again in the future.

“This HPD8 firmware is considered a critical fix and is required to address the issue detailed below. HPE strongly recommends immediate application of this critical fix.”

You don’t say.
