Question: Amazon’s recent AWS outage affected a surprisingly large number of sites. What can we learn about cloud resiliency and how can we minimize these outages in the future?
AWS, Amazon’s hosted web services offering, suffered a major outage, with some data loss, at one of its data centers on April 21, 2011. It was not the first such outage, and I rather doubt it will be the last, but it was the one that exposed what I call the dirty secret of cloud computing: the illusion of low-cost high availability, systems backup and protection, and how quickly so many cloud services have become interdependent.
Ultimately, data protection and high availability boil down to having multiple copies of your data and IT systems in multiple locations, with good, reliable bandwidth connecting them. Traditionally, high availability (that is, five nines and up) has been expensive due to the cost of the bandwidth and hardware needed to deliver the level of service required.
On the surface, moving your IT infrastructure to the cloud looks and sounds very attractive. In theory, the cloud offers a great solution. By purchasing cloud services, anyone can leverage the investments of Amazon, Google, Rackspace and the other major cloud vendors in state-of-the-art data centers with full redundancy and big network pipes, for a tiny fraction of the cost of doing it in-house. By moving IT infrastructure to the cloud, you can take advantage of the redundancy and resiliency of using multiple vendors and multiple data centers, and get enterprise-class data protection at rock-bottom prices. Reading between the lines of the standard service level agreements for the low-cost cloud services paints a very different picture. Amazon guarantees 98% uptime, hardly an earth-shatteringly difficult target to achieve. Once you add in all those pesky asterisks and interdependencies, it is unlikely that anyone is going to be able to collect on this incident, or on any downtime at all.
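To see why a 98% guarantee is not a demanding target, it helps to translate availability percentages into allowed downtime per year. A minimal sketch of that arithmetic, using the "five nines" and 98% figures mentioned above:

```python
# Convert an availability percentage into the downtime it
# permits over one (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(availability_pct):
    """Minutes of downtime allowed per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

# "Five nines" permits only about 5.3 minutes of downtime a year...
print(f"99.999%: {downtime_minutes(99.999):.1f} minutes/year")

# ...while 98% permits over 10,000 minutes -- more than seven full days.
print(f"98%: {downtime_minutes(98):,.0f} minutes/year "
      f"({downtime_minutes(98) / (24 * 60):.1f} days)")
```

In other words, a provider could be down for a week straight each year and still meet a 98% commitment, which is why the SLA number alone tells you very little about real-world resiliency.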
Setting aside the issue of Amazon’s service level agreements, all of this assumes that you have control over most, if not all, of the systems and services in your IT stack. What this outage highlighted for many companies is that even if they had built the best failover and high availability into their systems, they were still dependent on vendors and services that might not have been quite so diligent. As more companies take advantage of the increasingly specialized cloud services built on top of the cloud utility vendors’ infrastructure, ensuring uptime is going to become increasingly difficult through the maze of interdependent services.
The bottom line for a business that wants to gain the advantage of high availability at low cost is that you need to make sure you have not only architected your own service to have a full failover solution, but you will also need to spend time doing due diligence on all of your vendors’ policies and architectures as well. No matter how good the SLA is, if one of your upstream service providers does not have a good policy in place, your site will still be affected by their lack of planning.
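The failover principle described above can be sketched in a few lines: probe a list of redundant endpoints in priority order and route traffic to the first healthy one. The endpoint names and the health-check callback here are hypothetical stand-ins, not a real AWS API; a production system would probe actual health-check URLs in each region.

```python
# Minimal sketch of client-side failover across redundant,
# independently hosted endpoints (hypothetical URLs).
PRIMARY = "https://us-east.example.com"
SECONDARY = "https://eu-west.example.com"

def first_healthy(endpoints, check_health):
    """Return the first endpoint that passes its health check,
    or None if every endpoint is down."""
    for endpoint in endpoints:
        if check_health(endpoint):
            return endpoint
    return None

# Simulate an outage like April 21: the primary region is down,
# so traffic should shift to the secondary.
status = {PRIMARY: False, SECONDARY: True}
active = first_healthy([PRIMARY, SECONDARY], lambda url: status[url])
print(active)  # the secondary endpoint takes over
```

The sketch also illustrates the article’s larger point: this logic only protects you if the two endpoints do not share a hidden upstream dependency, which is exactly what the due-diligence work is meant to uncover.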
About the Author
Beth Cohen, Cloud Technology Partners, Inc. Moving companies’ IT services into the cloud the right way, the first time!