Companies tend to focus on the positive aspects of using SATA disk drives for a growing portion of their enterprise storage needs, but as some are finding out, managing thousands or tens of thousands of SATA disk drives can take on a life of its own.
Recently, I spoke with Lawrence Livermore National Laboratory (LLNL), which is a huge DataDirect Networks user. By huge, I mean they use multiple DataDirect Networks storage systems, with the total number of SATA disk drives in production numbering in the tens of thousands, possibly even up to a hundred thousand. More impressive, LLNL uses these storage systems in conjunction with some of the world’s fastest supercomputers, including the BlueGene/L, currently rated #1 among the world’s fastest computers.
The issue that crops up when companies own tens of thousands of disk drives, whether SATA or FC, is the growing task of managing failed disk drives. Companies such as Nexsan Technologies report failure rates of less than half of 1% of all SATA disk drives that they have deployed in the field. Those numbers sound impressive until one encounters environments like LLNL that may have up to a hundred thousand SATA disk drives. Using a 0.5% annual failure rate, an environment with tens of thousands of drives can statistically expect a SATA disk drive to fail about every other day, which is in line with LLNL’s experience, and at a full hundred thousand drives that rises to more than one failure a day.
This is in no way intended to reflect negatively on DataDirect Networks. If users were to deploy a similar number of disk drives from any other SATA storage system provider, be it Excel Meridian, Nexsan Technologies or Winchester Systems, they could expect similar SATA disk drive failure rates.
The cautionary note for users here is twofold. First, be sure your disk management practices keep up with your growth in disk drives. Replacing a disk drive may not sound like a big deal, but consider what is involved with a disk drive replacement:
- Discovering the disk drive failure
- Contacting and scheduling time for the vendor to replace the disk drive
- Monitoring the rebuild of the spare disk drive
- Determining if there is application impact during the disk drive rebuild
- Physically changing out the disk drive
Assuming a 0.5% annual failure rate, companies with hundreds of disk drives will repeat this process about once a year, those with thousands of disk drives about once a quarter and those with tens of thousands about once a week. Once a company crosses the 10,000-drive threshold, it needs to seriously contemplate dedicating a person at least part-time just to monitor and manage disk drive replacements, regardless of which vendor’s storage system it selects.
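The arithmetic behind those replacement intervals is easy to check. The short sketch below is only an illustration: it reads the "less than half of 1%" figure as a 0.5% failure rate per year and assumes failures are spread evenly across the year, neither of which is guaranteed in a real environment. The function names and the fleet sizes are made up for the example.

```python
# Assumption: 0.5% is an annual failure rate, written here as a fraction.
ANNUAL_FAILURE_RATE = 0.005


def expected_failures_per_year(drive_count, afr=ANNUAL_FAILURE_RATE):
    """Average number of drive failures per year for a fleet of this size."""
    return drive_count * afr


def days_between_failures(drive_count, afr=ANNUAL_FAILURE_RATE):
    """Average number of days between failures, assuming they are evenly spread."""
    return 365.0 / expected_failures_per_year(drive_count, afr)


if __name__ == "__main__":
    # Illustrative fleet sizes: hundreds, thousands, tens of thousands, LLNL scale.
    for drives in (200, 1_000, 10_000, 100_000):
        per_year = expected_failures_per_year(drives)
        interval = days_between_failures(drives)
        print(f"{drives:>7} drives: {per_year:6.1f} failures/yr, "
              f"one every {interval:6.1f} days")
```

Plugging in your own drive count and your vendor's quoted failure rate gives a rough idea of how often the replacement process above will need to run.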
The other cautionary note is that the more disk drives one deploys, the more likely it becomes that two or even three disk drives in the same RAID group will fail before a recovery of an existing failed disk drive is complete. Companies, now more than ever, need to ensure they are using RAID-6 for their SATA disk drive array groups and, when crossing the 10,000 disk drive threshold, should consider the new generation of SATA storage systems from companies such as DataDirect Networks and NEC. These systems give companies more data protection and recovery options for their SATA disk drives.
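To get a feel for why overlapping failures become more likely as drive counts grow, here is a rough back-of-the-envelope sketch. It assumes a 0.5% annual failure rate, independent failures, a 10-drive RAID group and a 24-hour rebuild window. The group size and rebuild time are hypothetical placeholders, not figures from LLNL or any vendor, and real failures tend to be correlated, so treat the result as a floor rather than a prediction.

```python
HOURS_PER_YEAR = 365 * 24


def p_fail_within(hours, afr):
    """Chance that a single drive fails within `hours`, given an annual failure rate."""
    return 1.0 - (1.0 - afr) ** (hours / HOURS_PER_YEAR)


def p_second_failure(group_size, rebuild_hours, afr):
    """Chance that at least one surviving drive in the group fails during a rebuild."""
    survivors = group_size - 1
    return 1.0 - (1.0 - p_fail_within(rebuild_hours, afr)) ** survivors


def double_failures_per_year(drive_count, group_size, rebuild_hours, afr):
    """Expected number of rebuilds per year that see a second failure in the same group."""
    rebuilds_per_year = drive_count * afr
    return rebuilds_per_year * p_second_failure(group_size, rebuild_hours, afr)


if __name__ == "__main__":
    # Hypothetical parameters: 10-drive groups, 24-hour rebuilds, 0.5% annual rate.
    for drives in (1_000, 10_000, 100_000):
        events = double_failures_per_year(drives, group_size=10,
                                          rebuild_hours=24, afr=0.005)
        print(f"{drives:>7} drives: {events:.5f} expected double failures per year")
```

The chance of a second failure during any single rebuild stays small, but the number of rebuilds per year grows with the fleet, so the expected number of double failures grows in step with it. Longer rebuild windows push the figure up further.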