In case you’re like me and can’t get enough of the technical nitty-gritty on the new self-healing storage systems from Atrato and Xiotech, here are some tidbits from the cutting room floor, so to speak, that didn’t make it into the article I wrote this week comparing the two systems.
This in particular was a paragraph that could have been fleshed out into a whole separate piece: “Both vendors use various error correction codes to identify potential drive failures, and both said they can work around a bad drive head by storing data on the remaining good sectors of the drive.”
This is where I’m running into each vendor’s unwillingness to expose their IP, which is understandable, and so trying to get to the bottom of this may be a fruitless endeavor. But that’s never stopped me before, so here are a few more steps down the rabbit hole for those who are interested.
Xiotech’s whitepapers and literature talk a lot about the ANSI T10 DIF (Data Integrity Field), which is part of how its system checks that virtual blocks are written to the right physical disk, and that physical blocks match up with virtual blocks. The standard, which is also used by Seagate, Oracle, LSI and Emulex in their data integrity initiative, adds 8 bytes of data integrity information to each 512-byte block. I asked Xiotech CTO and ISE mastermind Steve Sicola about what kind of overhead that adds to the system, but the only answer I got was that it’s spread out over so many different disk drives working in parallel that it’s not noticeable.
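For the curious, here is a rough Python sketch of what a T10 DIF tuple looks like in its simplest form, as I understand the standard: a 2-byte guard tag (a CRC-16 of the block's data), a 2-byte application tag, and a 4-byte reference tag that usually carries the low 32 bits of the block address. The function names (`crc16_t10dif`, `make_dif`, `verify_dif`) are my own, and none of this reflects how Xiotech's ISE actually implements it.

```python
import struct

# CRC-16 generator polynomial used for the DIF guard tag
T10_DIF_POLY = 0x8BB7
BLOCK_SIZE = 512  # bytes of user data per block
DIF_SIZE = 8      # bytes of integrity data appended per block


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 (poly 0x8BB7, initial value 0, no bit reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_dif(block: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte tuple: guard tag, application tag, reference tag."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("expected a 512-byte block")
    # Assumption: the reference tag holds the low 32 bits of the block address.
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, lba & 0xFFFFFFFF)


def verify_dif(block: bytes, dif: bytes, expected_lba: int) -> bool:
    """Check that the data is intact and that it landed at the expected address."""
    guard, _app_tag, ref = struct.unpack(">HHI", dif)
    data_ok = guard == crc16_t10dif(block)
    address_ok = ref == (expected_lba & 0xFFFFFFFF)
    return data_ok and address_ok
```

The guard tag catches corrupted data, while the reference tag catches a block that was written to (or read from) the wrong place, which is the "right physical disk" check described above. In capacity terms, 8 bytes on top of 512 works out to roughly 1.6 percent extra per block; that says nothing about the performance overhead I was asking Sicola about.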
Then along comes Atrato, claiming to base its self-healing technology on a concept from satellite engineering called FDIR, for Fault Detection, Isolation and Recovery. According to Wikipedia, the term was coined in relation to the Extended Duration Orbiter in the 1990s.
An Atrato whitepaper reveals three standard codes used for the first step in that process, fault or failure detection. Among them are S.M.A.R.T., which, again according to Wikipedia, “tests all data and all sectors of a drive by using off-line data collection to confirm the drive’s health during periods of inactivity”; SCSI Enclosure Services (SES), which tests non-data characteristics including power and temperature; and the SCSI Request Sense Command, which retrieves error and status (sense) data from a drive.
The thing about all of these methods is that they existed long before either the ISE or Atrato’s Velocity array. There are, of course, key differences between the way the systems are packaged, including the fact that Xiotech puts the controller right next to groups of between 20 and 40 disk drives, and Atrato manages 160 drives at once, but when it comes down to the actual self-healing aspects, the vendors are not disclosing anything about what new codes are being used to supplement those standards.
As Sicola put it to me, “What we’re doing is like S.M.A.R.T., but it goes way beyond that.” How far ‘way beyond that’ actually is, is proprietary. Which is kind of too bad, because it’s hard to tell how much of a hurdle there would be to more entrants in this market.
An analyst I was talking to about these new systems said some are talking about them as a desperation move for Xiotech, which has not exactly been burning down the market in recent years (it reinvented itself once already as an e-Discovery and compliance company after the acquisition of Daticon, which I haven’t heard much about lately).
Then again, others point out, Xiotech has Seagate’s backing (and can start from scratch with clear code on each disk drive, as well as use Seagate’s own drive testing software within the machine). Meanwhile, the ability to adequately market this technology has also been called into question with regard to Atrato.
But while it’s obviously going to take quite some time to assess the real viability of these particular products, it’s exciting for me as an industry observer to see vendors at least trying to do something fundamentally different with the way storage is managed. I think both of them share the same idea, that the individual disk drive is too small a unit to manage at the capacities today’s storage admins are dealing with.
Even if the products don’t perfectly live up to the claims of zero service events in a full three or five years, as an ISE beta tester I was speaking with put it, “anything that will make the SAN more reliable has benefits.” It’s pretty easy to get caught up in all the marketechture noise and miss the forest for the trees.
Even further reading: IBM’s Tony Pearson is less than enthused (but has links to lots of other blogs and writeups on this subject).
The inimitable Robin Harris summarizes his thoughts on ISE, and gets an interesting comment from John Spiers of LeftHand Networks (another storage competitor heard from!).