Posted by: Sharon Fisher
amazon, disaster recovery, ec2
This week featured millions of people glued to computer screens, waiting for all to be revealed, sharing their predictions, and crying when they finally saw the reality.
Oh, yeah, and there was a Royal Wedding.
But ten minutes before that (not that they were trying to hide anything, of course), Amazon also released the post-mortem of its extended Elastic Compute Cloud (EC2) outage of the previous week.
In case you were under a rock, a number of major web sites — including Foursquare, Reddit, and Quora — were down for a day, sometimes more, starting April 21, due to a problem with Amazon’s cloud hosting business. It wasn’t until Monday or Tuesday of this week that all the sites really recovered.
If you’re familiar with the concept of “thrashing,” where a too-full hard disk or computer memory is so busy trying to find places to work that it doesn’t get anything done, that’s basically what happened to Amazon, on a mammoth scale. Due to a configuration error during a network change, a chunk of the cloud’s storage lost its connectivity, and the first thing the affected storage nodes did when they came back was try to re-mirror their data — which they couldn’t do, because all the other nodes that were up were trying to do the same thing at the same time. The actual summary goes into a lot more detail, if you really want to know, but that’s basically it.
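To see why “everyone re-mirroring at once” grinds to a halt, here’s a toy sketch — this is an illustration of the general deadlock pattern, not Amazon’s actual system; the node structure and rules are invented for the example:

```python
# Toy model of a "re-mirroring storm": each node that comes back up needs
# spare capacity on a healthy peer to copy its data to. If every node is
# seeking at once, no node is eligible to serve, and nothing progresses.

def remirror_round(nodes):
    """One pass: each node needing a mirror tries to claim free space on a
    peer that is not itself busy looking for a mirror. Returns how many
    nodes actually made progress this round."""
    progressed = 0
    for node in nodes:
        if not node["needs_mirror"]:
            continue
        # An eligible peer has free space and is NOT itself stuck seeking.
        peer = next((p for p in nodes
                     if p is not node
                     and p["free_space"] > 0
                     and not p["needs_mirror"]), None)
        if peer:
            peer["free_space"] -= 1
            node["needs_mirror"] = False
            progressed += 1
    return progressed

# Normal day: one node fails, nine healthy peers -> recovery is immediate.
nodes = [{"needs_mirror": i == 0, "free_space": 1} for i in range(10)]
print(remirror_round(nodes))  # 1 -- the lone failed node re-mirrors

# The outage: all ten come up needing a mirror simultaneously, so no peer
# is eligible and zero progress is made -- the storm, in miniature.
nodes = [{"needs_mirror": True, "free_space": 1} for i in range(10)]
print(remirror_round(nodes))  # 0 -- all seeking, none serving
```

The point of the toy: the cluster has plenty of raw capacity in both cases; what kills it in the second case is that every node is simultaneously a requester, so none is available as a provider.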
So now the Internet is seeing a storm of a different kind: a pundit storm where people talk about 1) What It All Means and 2) Where We Go From Here and 3) Could It Happen Again?
1) S*** happens. 2) Don’t have a single point of failure, duh. 3) Of course.
Oh, you wanted more detail?
What it all means is that people are human and machines are stupid. This does not change, and will not change. Count on it. Problems happen. Then we institute new systems that help us protect against the most recent problem, and wait for a new problem to happen.
You know, like the TSA.
Where We Go From Here is that Amazon is instituting a number of changes in processes and procedures, both human and machine, that are intended to keep this from happening again.
Organizations that use the cloud — anybody’s cloud, not just Amazon’s — should take this as a wake-up call. Even if you weren’t affected by this outage, you could be by the next one. Don’t just have a backup. Have a backup for the backup. Yes, it costs money. How much money does it cost for your business to be out for a day? (Even if Amazon did give all its affected customers a freebie.) Forrester analyst Rachel Dines wrote a blog post listing a number of questions organizations should ask their cloud provider about backups and failover strategies.
Finally, accept that it’s going to happen — whether it’s from a natural disaster like the earthquake in Japan or the tornadoes in the American South, government action to shut down the Internet like in Egypt, widespread electrical failures, or simply a flu pandemic. As Dines says, “Assume nothing” — check every step in the disaster recovery plan, and figure out what the alternative is for every component that could fail.
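That “assume nothing” audit can even be made mechanical: inventory every component and verify each one names a fallback that isn’t itself and has actually been tested. A minimal sketch — the component names and fields here are invented for illustration, not taken from any real DR plan:

```python
# "Assume nothing" audit sketch: every component must have a fallback,
# the fallback must not be the component itself (no single points of
# failure), and the fallback must have been tested at least once.

plan = {
    "web tier":      {"fallback": "standby region",  "last_tested": "2011-03"},
    "database":      {"fallback": "replica site",    "last_tested": None},
    "dns":           {"fallback": None,              "last_tested": None},
    "image storage": {"fallback": "image storage",   "last_tested": "2011-04"},
}

def audit(plan):
    """Return human-readable gaps found in the plan."""
    gaps = []
    for name, entry in plan.items():
        if not entry["fallback"]:
            gaps.append(f"{name}: no fallback at all")
        elif entry["fallback"] == name:
            gaps.append(f"{name}: fallback is itself (single point of failure)")
        elif not entry["last_tested"]:
            gaps.append(f"{name}: fallback exists but has never been tested")
    return gaps

for gap in audit(plan):
    print("GAP:", gap)
```

Running this flags the untested database replica, the missing DNS fallback, and the image store that lists itself as its own backup — exactly the kind of gaps a walkthrough of the plan is supposed to surface before the disaster does.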