Another busy week, and I make my triumphant (if slightly raspy) return to the podcast.
The Sidekick data-loss debacle may be drawing to a close.
According to a post on Microsoft’s website by corporate vice president Roz Ho,
We are pleased to report that we have recovered most, if not all, customer data for those Sidekick customers whose data was affected by the recent outage. We plan to begin restoring users’ personal data as soon as possible, starting with personal contacts, after we have validated the data and our restoration plan. We will then continue to work around the clock to restore data to all affected users, including calendar, notes, tasks, photographs and high scores, as quickly as possible.
Ho also went on to provide some further details as to what caused the outage and how it was handled:
We have determined that the outage was caused by a system failure that created data loss in the core database and the back-up. We rebuilt the system component by component, recovering data along the way. This careful process has taken a significant amount of time, but was necessary to preserve the integrity of the data…we have made changes to improve the overall stability of the Sidekick service and initiated a more resilient backup process to ensure that the integrity of our database backups is maintained.
All’s well that ends well, but I do wonder if this will make people more conscientious about making local copies of important data sent to a public cloud.
In a week chock full of product news from Storage Networking World (SNW) and elsewhere, some new standards have slipped in under the radar that may become important once the dust settles.
The first of these is the announcement of a new Storage Performance Council (SPC) benchmark for testing the power consumption of storage devices in the data center. The new SPC-1/E spec follows the SPC-1C/E spec announced in June. Where SPC-1C/E covered storage components and small subsystems (limited to a maximum of 48 storage devices in an enclosure no larger than 4U), SPC-1/E expands that coverage to larger, more complex storage configurations.
According to an SPC presentation on the new benchmark, “SPC-1/E is applicable to any SPC-1 storage configuration that can be measured with a single SPC approved power meter/analyzer.”
For more on how the SPC-1C/E and SPC-1E benchmarks work, see our story on the SPC-1C/E announcement. Users should especially be aware of the parts of the benchmark calculation that can only be specified by vendors.
Still, even an approximate or idealized lab result for storage system power consumption would be an improvement over the tools currently available for reliably estimating power draw, which is increasingly a key cost factor that data centers in economically strapped times are looking to cut.
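As a rough illustration of the kind of number such a benchmark produces, here is a hypothetical sketch of an IOPS-per-watt figure computed from power samples taken at several load levels. The function and the simple averaging are my own assumptions for illustration, not the actual SPC-1/E calculation, which is defined in the SPC specification.

```python
# Hypothetical sketch of a power-efficiency figure in the spirit of an
# energy benchmark: pair measured throughput with measured power draw.
# This is NOT the actual SPC-1/E formula, just an illustration.

def iops_per_watt(samples):
    """samples: list of (iops, avg_watts) pairs, one per measured load level.
    Returns overall I/Os per second delivered per watt consumed."""
    total_iops = sum(iops for iops, _ in samples)
    total_watts = sum(watts for _, watts in samples)
    return total_iops / total_watts

# Example: three load levels (full load, half load, idle) on a hypothetical array.
# Note the idle level still burns power while delivering zero I/O, which is
# exactly what an energy benchmark is designed to expose.
samples = [(20000, 800), (10000, 650), (0, 500)]
print(round(iops_per_watt(samples), 2))  # ~15.38 IOPS/watt
```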
Speaking of cutting costs, Serial Attached SCSI (SAS) devices are widely regarded as the cheaper choice of the future to replace Fibre Channel systems. With 6 Gbps SAS products now beginning to ship, the SCSI Trade Association laid out its roadmap for the future of connectivity between Serial Attached SCSI drives and other elements of the infrastructure.
Today, 3 Gbps SAS devices connect via InfiniBand-style connectors, while the Mini-SAS HD connector will be used with most 6 Gbps devices. The new roadmap laid out this week specifies that Mini-SAS HD will be the connector of choice going forward for all types of connectivity into SAS devices.
Why do you care? Because the development plans for the Mini-SAS HD connector will allow it to serve optical, active and passive copper cables with one connector device, and to automatically detect the type of cable it’s attached to — meaning that by the time 12 Gbps SAS rolls around, less hardware will need to be ripped and replaced to support it. The connector will also eventually support managed connections: a tiny bit of memory in the connector itself that allows devices to be queried for reporting and monitoring.
The ability to connect SAS devices over optical and active copper cables is a pretty big deal. Cable length and expandability limitations have improved significantly with SAS-2, but native cable lengths currently remain limited to 10 meters; optical cables can stretch as far as 100 meters, and active copper (so called because it contains transceivers that boost signals) to 20 meters. While SAS-2 is already making data center SAS subsystems a reality, SAS will need more robust connectivity attributes like these to compete directly with Fibre Channel.
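The reach tradeoffs above boil down to a simple lookup. A minimal sketch, using the nominal lengths from this article; the names and the assumption that shorter-reach cables are cheaper are mine:

```python
# Nominal single-run reach per SAS cable type, in meters, as described
# in the article. Treat these as rough figures, not spec guarantees.
SAS_CABLE_REACH_M = {
    "passive_copper": 10,   # native SAS-2 copper limit
    "active_copper": 20,    # transceivers in the cable boost the signal
    "optical": 100,
}

def cheapest_cable_for(distance_m):
    """Pick the shortest-reach (and, by assumption, cheapest) cable type
    that still covers the required run length; None if nothing reaches."""
    for cable, reach in sorted(SAS_CABLE_REACH_M.items(), key=lambda kv: kv[1]):
        if distance_m <= reach:
            return cable
    return None  # beyond optical reach in a single run

print(cheapest_cable_for(15))   # active_copper
print(cheapest_cable_for(80))   # optical
```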
Quantum’s chief marketing officer said it was news to her that EMC customers are swapping out Quantum’s deduplication software installed on EMC Disk Libraries, as EMC division president Frank Slootman claims. According to Quantum CMO Janae Lee, EMC customers have continued to buy Quantum software with DLs even after EMC spent $2.1 billion on Data Domain.
“We don’t have visibility to the swapouts he’s talking about,” Lee said, “but we do see their sales reports and customers are continuing to install what we’re offering. It shows a difference in our approach to Data Domain’s approach. We don’t feel deduplication should be a disrupting standalone product. We’re leveraging installed hardware. There’s a basic difference of opinion about how deduplication fits.”
EMC has sold Quantum software with its Disk Libraries as part of an OEM deal signed last year.
According to the blog post, which appeared at RoughlyDrafted Magazine:
To the engineers familiar with Microsoft’s internal operations who spoke with us, that suggests two possible scenarios. First, that Microsoft decided to suddenly replace Danger’s existing infrastructure with its own, and simply failed to carry this out. Danger’s existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure.
Danger’s Sidekick data center had “been running on autopilot for some time, so I don’t understand why they would be spending any time upgrading stuff unless there was a hardware failure of some kind,” wrote the insider. Given Microsoft’s penchant “for running the latest and greatest,” however, “I wouldn’t be surprised if they found out that [storage vendor] EMC had some new SAN firmware and they just had to put it on the main production servers right away.”
Reached for comment today, an EMC spokesperson said no EMC products were involved.
Another blog yesterday also cited an anonymous source in saying that a SAN upgrade project allegedly involved in the outage was outsourced to Hitachi, but did not identify the brand of SAN involved. Multiple HDS spokespeople have not returned phone calls and emails seeking comment since yesterday.
A Microsoft spokesperson made the following comment for Storage Soup:
I can clarify that the Sidekick runs on Danger’s proprietary service that Microsoft inherited when it acquired Danger in 2008. The Danger service is built on a mix of Danger created technologies and 3rd party technologies. However, other than that we do not have anything else to share right now.
It actually may not matter at the end of the day whose SAN it was — it seems it was human error (or, as the RoughlyDrafted blog goes on to speculate, possible sabotage) responsible for the outage. The RoughlyDrafted blog goes on to claim:
A variety of “dogfooding” or aggressive upgrades could have resulted in data failure, the source explained, “especially when the right precautions haven’t been taken and the people you hired to do the work are contractors who might not know what they’re doing.” The Oracle database Danger was using was “definitely one of the more confusing and troublesome to administer, from my limited experience. It’s entirely possible that they weren’t backing up the ‘single copy’ of the database properly, despite the redundant SAN and redundant servers.”
“Just because there may have been an error during a SAN upgrade doesn’t mean the guy’s an idiot or that the storage vendor’s stuff doesn’t work. The fundamental question here is where are the backups?” said backup expert W. Curtis Preston.
This remains an open question as of this hour, as a new statement issued by T-Mobile suggests there may be some data that’s recoverable: “We…remain hopeful that for the majority of our customers, personal content can be recovered.”
A New York Times report released this week cited a T-Mobile official as saying data on the Sidekick server and its backup server were corrupted.
But it also can’t be assumed that cloud services make thorough secondary copies of data. Even slightly higher-end online PC backup services like Carbonite and SpiderOak, previously questioned about whether their services offer geographic redundancy should their primary data centers fail (this following a high-profile outage and lawsuit involving Carbonite, in which users experienced data loss), have cited costs and pricing pressures as reasons for not offering that level of redundancy to consumer customers.
Another important point in all this is that users might not be losing data if they synced data to their PCs as well as the cloud. T-Mobile offers an IntelliSync service for a fee to sync data between the Sidekick and the PC; there are also free synchronization clients available online. Users would’ve had to have those services in place prior to the outage, however.
“The bottom line is that a free cloud service shouldn’t be your only copy of data,” Preston said.
News broke this morning of an outage for users of the Sidekick mobile smartphone, in which T-Mobile warned users of the device not to power down their phones, or personal data would be irretrievably lost thanks to a server outage at Danger, a Microsoft subsidiary that supports the Sidekick.
Meanwhile, Engadget has blogged that the storage and backup infrastructure at Danger was to blame for the outage:
Alleged details on the events leading up to Danger’s doomsday scenario are starting to come out of the woodwork, and it all paints a truly embarrassing picture: Microsoft, possibly trying to compensate for lost and / or laid-off Danger employees, outsources an upgrade of its Sidekick SAN to Hitachi, which — for reasons unknown — fails to make a backup before starting. Long story short, the upgrade runs into complications, data is lost, and without a backup to revert to, untold thousands of Sidekick users get shafted in an epic way rarely seen in an age of well-defined, well-understood IT strategies.
If confirmed, it would be the second high-profile outage Hitachi has been associated with in the last six months. An HDS SAN was also implicated when Barclays ATMs in the UK stopped working in June.
Regardless of the source of the failure, outages like this usually draw attention to the fundamental risk of cloud computing — the things that can happen when all of users’ data “eggs” are put in one service provider’s “basket.”
Requests for comment are in to Microsoft and HDS and have not yet been returned. Stay tuned.
Oracle OpenWorld kicked off yesterday in San Francisco (at the Moscone Center, the same venue where VMworld was held). Sun Microsystems Chairman and co-founder Scott McNealy and Oracle founder and CEO Larry Ellison took the stage for keynotes Sunday night, highlights of which were available on Oracle’s website this morning.
For perhaps the first time at an official public event, the word “storage” was uttered by an exec from the merging companies, who have already assured the world that server hardware development will continue.
According to McNealy,
If you think about the Sun technology that we’re bringing to the party, here, it’s the data center. It’s the servers, the storage, the networking, the infrastructure software, all the pieces, all of the executable environment within the cloud, the data center, the distributed computing environment, whatever else you want to say, and then you bring in the database, and the applications and ERP and middleware capabilities and developer tool capabilities of Oracle, and you have a very nice data center. A very robust, very scalable…enterprise data center.
This end to end “stack” vision would be in keeping with the other big players in the market, which are beginning to offer prepackaged product bundles and looking to be soup-to-nuts suppliers to the enterprise data center. Oracle’s competitive landscape for end-to-end stacks includes Cisco Systems Inc., IBM Corp., Hewlett-Packard Co. (HP) and Dell Inc.
There are advantages, Ellison said, in a company being able to control the engineering of both hardware and software. “We are not selling the hardware business; no part of the hardware business are we selling,” Ellison said in his keynote, though he went on to specifically discuss mostly server technologies like Sun’s SPARC chips. (Here’s where Sun might point out that it recently merged servers and storage in terms of both its engineering departments and its strategic thinking with Amber Road…)
So the biggest question for the storage hardware market with this merger still comes down to tape. Some of the competitive “stack” offerings, like IBM’s, include tape — in fact, with its latest Information Archive appliance, IBM is offering tape as an option managed by the GPFS global namespace, a setup highly reminiscent of the way Sun’s SAM-FS can manage data in disk repositories as well as StorageTek tape libraries.
Judging by the speeches from McNealy and Ellison, it seems no hardware product is being taken completely off the table yet, but what the newly merged entity will do with tape storage hardware specifically remains uncertain at this point.
I am sick this week, with a croaky voice, so my colleague Chris Griffin kindly filled in for me on this podcast. It’s a long’un this week — plenty of news comes out this time of year.
(14:00) i365 launches EVault Offsite Replication cloud data backup and disaster recovery service
Remember the research paper Google made a splash with two years ago on disk drive failure rates? The one that showed most failed drives didn’t raise significant SMART flags, found no correlation between temperature or utilization and failure rates, and instead established that failure rates correlate more with drive manufacturer, model and age?
Well, there’s now a DRAM equivalent — and it doesn’t paint a much prettier picture than the one on hard drive failures.
In a new paper, “DRAM Errors in the Wild: A Large-Scale Field Study,” engineers from Google and the University of Toronto found that, once again, failure rates and patterns did not match the industry’s received wisdom about how Dual Inline Memory Modules (DIMMs) behave. According to the paper:
We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, unlike commonly feared, we don’t observe any indication that newer generations of DIMMs have worse error behavior.
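To put the quoted rate in perspective, here is a back-of-the-envelope conversion of errors per billion device-hours per Mbit into expected errors per DIMM per year. The 1 GB DIMM size is an assumed example and the arithmetic is mine, not the paper’s:

```python
# Back-of-the-envelope: convert the paper's quoted rate (errors per
# billion device-hours per Mbit) into expected errors per DIMM per year.
# Note this is a fleet-wide average; the paper reports errors are highly
# concentrated, with only ~8% of DIMMs affected in a given year.

def errors_per_dimm_year(rate_per_1e9_hours_per_mbit, capacity_gb):
    mbits = capacity_gb * 1024 * 8    # DIMM capacity in megabits
    hours_per_year = 24 * 365         # 8,760 device-hours per year
    return rate_per_1e9_hours_per_mbit * mbits * hours_per_year / 1e9

# Low end of the quoted range (25,000) for an assumed 1 GB DIMM:
print(round(errors_per_dimm_year(25_000, 1)))   # ~1794 errors/year
# High end of the quoted range (70,000):
print(round(errors_per_dimm_year(70_000, 1)))   # ~5023 errors/year
```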
As with the disk drive study, temperature doesn’t play a huge role in DRAM failures. Here, though, vendor and model made less of a difference than in the disk drive study.
However, the study showed errors depend more heavily on motherboard design than previously thought. And contrary to conventional wisdom about DRAM, most errors were hard (permanent) rather than soft (transient). According to an article analyzing the paper by Data Mobility Group’s Robin Harris,
This means that some popular [motherboards] have poor EMI hygiene. Route a memory trace too close to [a] noisy component or shirk on grounding layers and instant error problems…For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform. There be lemons out there!
These two reports raise one common question, according to Harris — why didn’t we know about these things before? As he put it, “Big system vendors have scads of data on disk drives, DRAM, network adapters, OS and filesystem based on mortality and tech support calls, but do they share this with the consuming public? Nothing to see here folks, just move along.”
IBM held a meeting Tuesday at Boston’s Four Seasons Hotel for press and analysts to discuss its new strategy for offering users integrated “stacks” of servers, software, storage and services. The two main products introduced were the new IBM Smart Business Storage Cloud bundle, and Information Archive appliance.
But there were some other tidbits to be gleaned from the announcements and meeting as well:
- VP Barry Rudolph said IBM is “about ready to announce and deliver” solid-state drives (SSD) in its SAN Volume Controller, which Rudolph said will double the performance of the storage virtualization device. IBM previewed the product as Project Quicksilver with Fusion-io last year. Execs wouldn’t give a more specific time frame than “imminent.”
- Scale-out File Services (SOFS) in its first iteration required an ongoing services engagement (as it did when reference customer Kantana Animation first installed it last year). The new Storage Cloud based on SOFS has the option of deployment services only, as well as an ongoing managed service, but IBM also added some consulting services to go along with the new product package, including Strategy and Change Services for Cloud Adoption for end users, Strategy and Change Services for Cloud Providers, and Testing Services for Cloud (helping build a business case for cloud-based test environments).
- Smart Business Storage Cloud is being offered for private cloud deployments right now, but IBM also plans to offer a public cloud based on the package and the CloudBurst product it announced in June, which also features automated provisioning and file-set-level chargeback through Tivoli Services Automation Manager (TSAM).
Analyst reviews of the event were mixed. Wikibon analyst Dave Vellante said he thinks IBM has some work to do on the Information Archive. “I loved the line about ‘The keep everything forever model has failed’ – it’s true,” he wrote to Storage Soup in an email. “Unfortunately, what IBM announced yesterday (IBM Information Archive) is more of the same old same old. New hardware, some decent integration but NO INDEXING AND NO SEARCH. In my mind that is not very useful to customers. Supposedly search and indexing ‘is coming soon’ but I think IBM was rushing to replace the DR550 line.”
He added, “good news for IBM is all the archiving vendors are missing the mark. Systems still don’t scale, nobody does classification right and there’s no good way to defensibly delete un-needed data.”
Evaluator Group analyst John Webster said that after following IBM storage for years, he sees them rationalizing different product lines more effectively these days. “Last year at this time things were more disjointed,” he said. “Now they’re able to rationalize XIV with the DS8000, for example.”
When it comes to the single vertically-integrated stack concept, analysts say they’ve seen this movie before. “I wonder, to what degree is server virtualization and VMware driving the desire to integrate everything into a box?” Webster said. “It reminds me of a concept people used to talk about years ago called a ‘God box,’ basically a big switch that did everything. But nobody wanted to go there — it was enough to talk about an intelligent switch. I’m not sure it’s progressed much farther, but I don’t know that it matters — Cisco has thrown down the gauntlet and other large players have to cover their bets.”
Everything’s cyclical, pointed out Analytico’s Tom Trainer. “Consolidation and innovation patterns in the market are like a sine wave,” Trainer said. “We were probably at the height of new companies and innovation in the dotcom era of 1999 to 2000, and as politics and economics come into play, the pendulum looks like it’s swinging back toward consolidation.”
However, consolidation can open up space in the market for new companies to emerge. “I’m talking to startups receiving good funding recently,” Trainer said. New storage startups have begun coming out of stealth in the last week, such as Avere.
When asked about industry consolidation, IBM’s Rudolph saw a similar picture. “I think you’re starting to see major shifts in our competitive framework, but I don’t think there’ll be a lack of new innovation and three or four huge corporations and that’s it,” he said.