Posted by: Beth Pariseau
According to the blog post, which appeared at RoughlyDrafted Magazine:
To the engineers familiar with Microsoft’s internal operations who spoke with us, that suggests two possible scenarios. First, that Microsoft decided to suddenly replace Danger’s existing infrastructure with its own, and simply failed to carry this out. Danger’s existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure.
Danger’s Sidekick data center had “been running on autopilot for some time, so I don’t understand why they would be spending any time upgrading stuff unless there was a hardware failure of some kind,” wrote the insider. Given Microsoft’s penchant “for running the latest and greatest,” however, “I wouldn’t be surprised if they found out that [storage vendor] EMC had some new SAN firmware and they just had to put it on the main production servers right away.”
Reached for comment today, an EMC spokesperson said no EMC products were involved.
Another blog yesterday also cited an anonymous source in saying that a SAN upgrade project allegedly involved in the outage was outsourced to Hitachi, but did not identify the brand of SAN involved. Multiple Hitachi Data Systems (HDS) spokespeople have not returned phone calls and emails seeking comment since yesterday.
A Microsoft spokesperson made the following comment for Storage Soup:
I can clarify that the Sidekick runs on Danger’s proprietary service that Microsoft inherited when it acquired Danger in 2008. The Danger service is built on a mix of Danger created technologies and 3rd party technologies. However, other than that we do not have anything else to share right now.
In the end, it may not matter whose SAN it was: human error (or, as the RoughlyDrafted blog speculates, possible sabotage) appears to have been responsible for the outage. The blog goes on to claim:
A variety of “dogfooding” or aggressive upgrades could have resulted in data failure, the source explained, “especially when the right precautions haven’t been taken and the people you hired to do the work are contractors who might not know what they’re doing.” The Oracle database Danger was using was “definitely one of the more confusing and troublesome to administer, from my limited experience. It’s entirely possible that they weren’t backing up the ‘single copy’ of the database properly, despite the redundant SAN and redundant servers.”
“Just because there may have been an error during a SAN upgrade doesn’t mean the guy’s an idiot or that the storage vendor’s stuff doesn’t work. The fundamental question here is where are the backups?” said backup expert W. Curtis Preston.
This remains an open question as of this hour. A new statement issued by T-Mobile suggests that some data may be recoverable: “We…remain hopeful that for the majority of our customers, personal content can be recovered.”
A New York Times report released this week cited a T-Mobile official as saying data on the Sidekick server and its backup server were corrupted.
But it also can’t be assumed that the cloud service made thorough secondary copies of the data. Slightly higher-end online PC backup services like Carbonite and SpiderOak have previously been questioned about the geographic redundancy available for their services should their primary data centers fail, following a high-profile outage and lawsuit involving Carbonite in which users experienced data loss. They have cited costs and pricing pressures as reasons for not offering that level of redundancy to consumer customers.
Another important point is that users might not lose data if they synced it to their PCs as well as to the cloud. T-Mobile offers an IntelliSync service for a fee to sync data between the Sidekick and the PC; there are also free synchronization clients available online. Users would have needed those services in place before the outage, however.
“The bottom line is that a free cloud service shouldn’t be your only copy of data,” Preston said.