According to the blog post, which appeared at RoughlyDrafted Magazine:
To the engineers familiar with Microsoft’s internal operations who spoke with us, that suggests two possible scenarios. First, that Microsoft decided to suddenly replace Danger’s existing infrastructure with its own, and simply failed to carry this out. Danger’s existing system to support Sidekick users was built using an Oracle Real Application Cluster, storing its data in a SAN (storage area network) so that the information would be available to a cluster of high availability servers. This approach is expressly designed to be resilient to hardware failure.
Danger’s Sidekick data center had “been running on autopilot for some time, so I don’t understand why they would be spending any time upgrading stuff unless there was a hardware failure of some kind,” wrote the insider. Given Microsoft’s penchant “for running the latest and greatest,” however, “I wouldn’t be surprised if they found out that [storage vendor] EMC had some new SAN firmware and they just had to put it on the main production servers right away.”
Reached for comment today, an EMC spokesperson said no EMC products were involved.
Another blog yesterday also cited an anonymous source in saying that a SAN upgrade project allegedly involved in the outage was outsourced to Hitachi, but did not identify the brand of SAN involved. Multiple HDS spokespeople have not returned phone calls and emails seeking comment since yesterday.
A Microsoft spokesperson made the following comment for Storage Soup:
I can clarify that the Sidekick runs on Danger’s proprietary service that Microsoft inherited when it acquired Danger in 2008. The Danger service is built on a mix of Danger created technologies and 3rd party technologies. However, other than that we do not have anything else to share right now.
At the end of the day, it may not actually matter whose SAN it was; human error (or, as the RoughlyDrafted blog goes on to speculate, possible sabotage) appears to have been responsible for the outage. The RoughlyDrafted blog goes on to claim:
A variety of “dogfooding” or aggressive upgrades could have resulted in data failure, the source explained, “especially when the right precautions haven’t been taken and the people you hired to do the work are contractors who might not know what they’re doing.” The Oracle database Danger was using was “definitely one of the more confusing and troublesome to administer, from my limited experience. It’s entirely possible that they weren’t backing up the ‘single copy’ of the database properly, despite the redundant SAN and redundant servers.”
“Just because there may have been an error during a SAN upgrade doesn’t mean the guy’s an idiot or that the storage vendor’s stuff doesn’t work. The fundamental question here is where are the backups?” said backup expert W. Curtis Preston.
This remains an open question as of this hour, as a new statement issued by T-Mobile suggests some data may be recoverable: “We…remain hopeful that for the majority of our customers, personal content can be recovered.”
A New York Times report released this week cited a T-Mobile official as saying data on the Sidekick server and its backup server were corrupted.
But it also can’t be assumed that the cloud service made thorough secondary copies of data. Slightly higher-end online PC backup services such as Carbonite and SpiderOak have previously been asked about the geographic redundancy available should their primary data centers fail (questions that followed a high-profile outage and lawsuit involving Carbonite, in which users lost data). They cited costs and pricing pressures as reasons for not offering that level of redundancy to consumer customers.
Another important point in all this is that users might not be losing data if they synced data to their PCs as well as the cloud. T-Mobile offers an IntelliSync service for a fee to sync data between the Sidekick and the PC; there are also free synchronization clients available online. Users would’ve had to have those services in place prior to the outage, however.
“The bottom line is that a free cloud service shouldn’t be your only copy of data,” Preston said.
News broke this morning of an outage for users of the Sidekick mobile smartphone, in which T-Mobile warned users of the device not to power down their phones, or personal data would be irretrievably lost thanks to a server outage at Danger, a Microsoft subsidiary that supports the Sidekick.
Meanwhile, Engadget has blogged that the storage and backup infrastructure at Danger was to blame for the outage:
Alleged details on the events leading up to Danger’s doomsday scenario are starting to come out of the woodwork, and it all paints a truly embarrassing picture: Microsoft, possibly trying to compensate for lost and / or laid-off Danger employees, outsources an upgrade of its Sidekick SAN to Hitachi, which — for reasons unknown — fails to make a backup before starting. Long story short, the upgrade runs into complications, data is lost, and without a backup to revert to, untold thousands of Sidekick users get shafted in an epic way rarely seen in an age of well-defined, well-understood IT strategies.
If confirmed, it would be the second high-profile outage Hitachi has been associated with in the last six months. An HDS SAN was also implicated when Barclays’ ATMs in the UK stopped working in June.
Regardless of the source of the failure, outages like this usually draw attention to the fundamental risk of cloud computing — the things that can happen when all of users’ data “eggs” are put in one service provider’s “basket.”
Requests for comment are in to Microsoft and HDS and have not yet been returned. Stay tuned.
Oracle OpenWorld kicked off yesterday in San Francisco (at the Moscone Center, the same place VMworld was held). Sun Microsystems Chairman and co-founder Scott McNealy and Oracle founder and CEO Larry Ellison took the stage for keynotes Sunday night, highlights of which were available on Oracle’s website this morning.
For perhaps the first time at an official public event, the word “storage” was uttered by an exec from the merging companies, who have already assured the world that server hardware development will continue.
According to McNealy,
If you think about the Sun technology that we’re bringing to the party, here, it’s the data center. It’s the servers, the storage, the networking, the infrastructure software, all the pieces, all of the executable environment within the cloud, the data center, the distributed computing environment, whatever else you want to say, and then you bring in the database, and the applications and ERP and middleware capabilities and developer tool capabilities of Oracle, and you have a very nice data center. A very robust, very scalable…enterprise data center.
This end to end “stack” vision would be in keeping with the other big players in the market, which are beginning to offer prepackaged product bundles and looking to be soup-to-nuts suppliers to the enterprise data center. Oracle’s competitive landscape for end-to-end stacks includes Cisco Systems Inc., IBM Corp., Hewlett-Packard Co. (HP) and Dell Inc.
There are advantages, Ellison said, in a company being able to control the engineering of both hardware and software. “We are not selling the hardware business-no part of the hardware business are we selling,” Ellison said in his keynote, though he went on to specifically discuss mostly server technologies like Sun’s SPARC chips. (Here’s where Sun might point out that it recently merged servers and storage together in terms of its engineering departments and in terms of its strategic thinking with Amber Road…)
So the biggest question for the storage hardware market with this merger still comes down to tape. Some of the competitive “stack” offerings, like those from IBM, include tape. In fact, with its latest Information Archive appliance, IBM is offering tape as an option managed by the GPFS global namespace, a setup highly reminiscent of the way Sun’s SAM-FS can manage data in disk repositories as well as StorageTek tape libraries.
Judging by the speeches from McNealy and Ellison, it seems no hardware product is being taken completely off the table yet, but what the newly merged entity will do with tape storage hardware specifically remains uncertain at this point.
I am sick this week, with a croaky voice, so my colleague Chris Griffin kindly filled in for me on this podcast. It’s a long’un this week — plenty of news going out this time of year.
(14:00) i365 launches EVault Offsite Replication cloud data backup and disaster recovery service
Remember the research paper Google made a splash with two years ago on disk drive failure rates? The one that showed that most failed drives didn’t raise significant SMART flags, failed to find a correlation between temperature or utilization and failure rates, and instead established that failure rates are more closely correlated with drive manufacturer, model and age?
Well, there’s now a DRAM equivalent — and it doesn’t paint a much prettier picture than the one on hard drive failures.
According to a new paper, “DRAM Errors in the Wild: A Large-Scale Field Study,” engineers from Google and the University of Toronto found that once again, failure rates and patterns did not match the received wisdom in the industry about how Dual Inline Memory Modules (DIMMs) behave. According to the paper:
We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, unlike commonly feared, we don’t observe any indication that newer generations of DIMMs have worse error behavior.
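To put the quoted rate in perspective, here is a rough back-of-the-envelope sketch in Python that converts “errors per billion device hours per Mbit” into an average number of errors per DIMM per year. The 1 GiB DIMM capacity and the function name are assumptions chosen for illustration; they do not come from the paper.

```python
# Back-of-the-envelope conversion of the quoted error rates.
# Assumption (not from the paper): a hypothetical 1 GiB DIMM.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def errors_per_year(rate_per_billion_hours_per_mbit, capacity_mbit):
    """Average errors per year implied by a rate given in
    errors per billion device hours per Mbit."""
    return rate_per_billion_hours_per_mbit * capacity_mbit * HOURS_PER_YEAR / 1e9


CAPACITY_MBIT = 1024 * 8  # 1 GiB = 8,192 Mbit (illustrative capacity)

low = errors_per_year(25_000, CAPACITY_MBIT)   # lower bound quoted in the paper
high = errors_per_year(70_000, CAPACITY_MBIT)  # upper bound quoted in the paper

print(f"Implied average: {low:,.0f} to {high:,.0f} errors per DIMM per year")
```

These figures are fleet-wide averages. The paper also reports that only about 8% of DIMMs see any errors in a given year, so the errors are concentrated in a minority of modules, and a typical DIMM would see far fewer than the average suggests.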
As with the disk drive study, temperature also doesn’t play a huge role in DRAM failures. Here, vendor and model didn’t make as much difference as in the disk drive study.
However, the study showed errors were more highly dependent on motherboard design than previously thought. And contrary to conventional wisdom about DRAM, more errors were hard errors (repeatable faults in the hardware itself) than soft errors (transient bit flips). According to an article analyzing the paper by Data Mobility Group’s Robin Harris,
This means that some popular [motherboards] have poor EMI hygiene. Route a memory trace too close to noisy component or shirk on grounding layers and instant error problems…For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform. There be lemons out there!
These two reports raise one common question, according to Harris — why didn’t we know about these things before? As he put it, “Big system vendors have scads of data on disk drives, DRAM, network adapters, OS and filesystem based on mortality and tech support calls, but do they share this with the consuming public? Nothing to see here folks, just move along.”
IBM held a meeting Tuesday at Boston’s Four Seasons Hotel for press and analysts to discuss its new strategy for offering users integrated “stacks” of servers, software, storage and services. The two main products introduced were the new IBM Smart Business Storage Cloud bundle, and Information Archive appliance.
But there were some other tidbits to be gleaned from the announcements and meeting as well:
- VP Barry Rudolph said IBM is “about ready to announce and deliver” solid-state drives (SSD) in its SAN Volume Controller, which Rudolph said will double the performance of the storage virtualization device. IBM previewed the product as Project Quicksilver with Fusion-io last year. Execs wouldn’t give a more specific time frame than “imminent.”
- Scale-out File Services (SOFS) in its first iteration required an ongoing services engagement (as it did when reference customer Kantana Animation first installed it last year). The new Storage Cloud based on SOFS has the option of deployment services only, as well as an ongoing managed service, but IBM also added some consulting services to go along with the new product package, including Strategy and Change Services for Cloud Adoption for end users, Strategy and Change Services for Cloud Providers, and Testing Services for Cloud (helping build a business case for cloud-based test environments).
- Smart Business Storage Cloud is being offered for private cloud deployments right now, but IBM also plans to offer a public cloud based on the package and the CloudBurst product it announced in June, which also features automated provisioning and file-set-level chargeback through Tivoli Services Automation Manager (TSAM).
Analyst reviews of the event were mixed. Wikibon analyst Dave Vellante said he thinks IBM has some work to do on the Information Archive. “I loved the line about ‘The keep everything forever model has failed’ – it’s true,” he wrote to Storage Soup in an email. “Unfortunately, what IBM announced yesterday (IBM Information Archive) is more of the same old same old. New hardware, some decent integration but NO INDEXING AND NO SEARCH. In my mind that is not very useful to customers. Supposedly search and indexing ‘is coming soon’ but I think IBM was rushing to replace the DR550 line.”
He added, “good news for IBM is all the archiving vendors are missing the mark. Systems still don’t scale, nobody does classification right and there’s no good way to defensibly delete un-needed data.”
Evaluator Group analyst John Webster said that after following IBM storage for years, he sees them rationalizing different product lines more effectively these days. “Last year at this time things were more disjointed,” he said. “Now they’re able to rationalize XIV with the DS8000, for example.”
When it comes to the single vertically-integrated stack concept, analysts say they’ve seen this movie before. “I wonder, to what degree is server virtualization and VMware driving the desire to integrate everything into a box?” Webster said. “It reminds me of a concept people used to talk about years ago called a ‘God box,’ basically a big switch that did everything. But nobody wanted to go there–it was enough to talk about an intelligent switch. I’m not sure it’s progressed much farther, but I don’t know that it matters–Cisco has thrown down the gauntlet and other large players have to cover their bets.”
Everything’s cyclical, pointed out Analytico’s Tom Trainer. “Consolidation and innovation patterns in the market are like a sine wave,” Trainer said. “We were probably at the height of new companies and innovation in the dotcom era of 1999 to 2000, and as politics and economics come into play, the pendulum looks like it’s swinging back toward consolidation.”
However, consolidation can open up space in the market for new companies to emerge. “I’m talking to startups receiving good funding recently,” Trainer said. New storage startups have begun coming out of stealth in the last week, such as Avere.
When asked about industry consolidation, IBM’s Rudolph saw a similar picture. “I think you’re starting to see major shifts in our competitive framework, but I don’t think there’ll be a lack of new innovation and three or four huge corporations and that’s it,” he said.
There have been rumors for years that Hewlett-Packard might buy Brocade, and they intensified today after a Wall Street Journal report that Brocade has put itself up for sale.
The WSJ cited unidentified sources and obviously none of the companies named would comment, but the article mentioned HP and Oracle as potential bidders. Wall Street and storage industry analysts who follow Brocade say HP is the likely buyer if Brocade gets acquired. HP has a long-term relationship with Brocade, and Oracle is currently trying to complete its Sun deal and integrate that company.
“It is possible that HP is looking to buy Brocade,” Wedbush Securities analyst Kaushik Roy said today. Roy said he “would guess” Brocade would go for about $11 per share or between $4 billion and $5 billion.
However, there is likely a good reason why HP hasn’t already acquired Brocade. If it did, Brocade would probably lose a good piece of its business because its large OEM customers EMC and IBM wouldn’t be so enthusiastic about selling switches owned by their competitor HP.
“If HP buys Brocade, they would in reality pay a much higher premium because the future revenue forecasts would be revised downwards,” Roy said. “Brocade is an OEM business. EMC is likely to move from Brocade more to Cisco [for Fibre Channel switches] and IBM is likely to move towards Juniper on Ethernet.”
In a note to clients today, Stifel Nicolaus Equity Research analyst Aaron Rakers wrote that HP makes most sense as a Brocade suitor but threw a few others into the mix.
“We find it a bit interesting that the [WSJ] article is not including names such as IBM and Juniper,” Rakers wrote.
Enterprise Strategy Group analyst Bob Laliberte said when he heard about the WSJ story, “My first thought was that HP would be a potential suitor. When you look at a company the size of Brocade and what they offer, you’re down to IBM, HP, Oracle, maybe Dell. I don’t think you’ll see EMC or Cisco buy them.”
A Cisco-Brocade deal probably wouldn’t clear anti-trust regulation, and EMC is too close to Cisco to buy Cisco’s chief switch competitor.
A Brocade acquisition by anybody is still a big if at this point. The WSJ story said no deal is imminent, and it sounds like Brocade could just be shopping to see how much interest is there.
One thing for sure is that Brocade’s stock price is soaring. It opened at $8.60 today, more than 12% above its Friday closing price of $7.65.
Open-source data backup software company Zmanda Inc. is releasing version 2.0 of its Zmanda Cloud Backup (ZCB) for Windows today.
New features include:
- Geography control – customers can tag data so that it’s backed up to a cloud data center in a certain region. For example, users in Europe can specify data that has to stay in Europe per European Union regulations. Customers can also choose to send data to data centers closest to their location for better performance of data migrations and retrieval over the network.
- Selective restore – the ability to restore one file from a data set; not new for Zmanda’s main backup product, but new for ZCB.
- Windows Security Certificate Encryption – Previously data sent to the cloud through ZCB was encrypted using standard AES encryption; support for the Windows certificate “is the highest level of encryption for Windows systems,” said Zmanda CEO Chander Kant. “It means they can use the same certificate they’re used to if they encrypt files on their Windows server and can make bare-metal restores for DR easier.”
Zmanda Cloud Backup 1.0 was first released last December. Kant said there are currently about 100 customers using it to back up systems to the cloud.
Despite its aggressive push of Fibre Channel over Ethernet (FCoE), Cisco executives say Fibre Channel will remain its main storage protocol for another five to 10 years and the vendor remains committed to extending its MDS FC switching platform.
During a webcast on storage networking innovation today, Cisco reps called it a myth that the vendor is abandoning FC for FCoE.
“Cisco is not going out and saying ‘Get rid of the Fibre Channel infrastructure,’” said Ed Chapman, VP of product management for Cisco’s server access and virtualization group.
Added VP of Cisco’s data center switching technology group Rajiv Ramaswami: “Fibre Channel is here, it’s healthy, it’s going to be here for a long time.” When asked how long before FCoE becomes the primary storage protocol, he said at least five to 10 years.
Ramaswami said Cisco’s plans call for an 8Gbps Fibre Channel module for the Nexus 5000 switch this year and a 16Gbps FC card for its MDS 9000 director switches by early 2011. He said Cisco will also add new intelligent storage services for the MDS platform, as well as an FCoE module.
He said FC will play a major part alongside FCoE in Cisco’s unified platform. “Unified computing is not just another name for FCoE,” he said. “FCoE is a building block in a unified fabric. FCoE is about consolidation of I/O on the server. A unified platform is about building an end to end network along with unified storage.”
Cisco added to its FCoE platform today with the Nexus 4000, the first blade switch for its Nexus unified fabric platform. Cisco expects OEM deals with blade server vendors to ship the Nexus 4000 inside their blades.
Cisco, which has deeper roots in Ethernet than FC, has pushed FCoE more than its chief switching rival Brocade, which began as an FC vendor and added Ethernet when it acquired Foundry Networks last year. Brocade beat Cisco to the punch with 4Gbps and 8Gbps FC and gained FC market share during the refresh cycles. So it will be interesting to see if Cisco makes good on its FC roadmap pledges, especially for 16-gig.