Yottabytes: Storage and Disaster Recovery

Feb 28 2017   11:32PM GMT

Amazon AWS Outage Casts Doubt on Cloud

Sharon Fisher

Tags:
Amazon
AWS
Cloud storage

“Don’t have your own storage! Just put everything on the cloud! It’ll always be available!”

Yeah. About that.

The Amazon Web Services (AWS) Simple Storage Service (S3) was down for some four hours earlier today, causing chaos, consternation, and first-world problems as the various services that depend on it were unable to gain access to data and images stored in it.

As you may recall, AWS was also down in 2012 due to weather – not even a hurricane, just a thunderstorm – and in 2011 due to a configuration problem. There’s something to be said for the fact that it happens so seldom that people completely lose their minds when it does, but at the same time it forces us to realize how fragile this connected world really is. The whole point of using the cloud, after all, is so the information will continue to be there even if something goes down.

It’s not that Amazon’s cloud service is so much worse than that of its competitors. If anything, it’s that it’s so widely used that any sort of glitch tends to have really big ramifications. In this particular case, 148,213 sites use the S3 system, according to Elizabeth Weise in USA Today.

“Sites like Imgur, Medium, Expedia, Mailchimp, Buffer and even the U.S. Securities and Exchange Commission were all impacted, as were communication services like Slack,” writes WCMH. “Also ironically impacted, DownDetector.com, which is a website that tracks when other websites are down.”

“Some of the services affected included Amazon Prime Video and Amazon Music,” writes Janko Roettgers in Variety. “But the outage also affected numerous third-party websites, apps and services. A number of web publishers, including The Verge and Axios, were unable to load images for their articles. Other media sites, including Business Insider, weren’t able to publish any new stories at all for hours. The outage also affected phone support systems at a number of companies and public agencies, including Boston’s public transit agency and the app-based investment service Acorns. The secure messaging app Signal reported on Twitter that users weren’t able to attach images to their messages. And an outage of the cloud-based scripting and control service IFTTT even led to internet-connected light bulbs ceasing to function, according to user reports.”

In fact, so dependent is the world on this particular piece of AWS, based on the East Coast, that Amazon itself got burned by the outage. “Amazon wasn’t able to update its own service health dashboard for the first two hours of the outage because the dashboard itself was hosted on AWS,” Weise writes.

Maybe, for disaster recovery purposes, Amazon might want to rethink that decision? Just a thought.

Why it was down, Amazon hasn’t yet said, other than blaming “high error rates,” though that is likely a symptom rather than a cause. Weise quoted Gartner analyst Lydia Leong  as saying that the most common causes of this type of outage are software-related, either a bug in the code or human error. And the primary way to protect against it is to have multiple backups of the data, she adds. “Only the most paranoid, and very large companies, distribute their files across not just AWS but also Microsoft and Google, and replicate them geographically across regions  —  but that’s very, very expensive,” she tells Weise.

Variety notes, for example, that Netflix stayed up during the outage, likely because it had redundant storage on other services.
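For the curious, here is roughly what "redundant storage" can look like from the application's side. This is only a minimal Python sketch, not how Netflix or anyone else actually does it: the `InMemoryStore` class, the `StoreUnavailable` exception, and the `read_with_fallback` function are all hypothetical names made up for illustration, and the in-memory dictionaries stand in for copies of the same data held with different providers or regions.

```python
from typing import Dict, List, Sequence, Tuple


class StoreUnavailable(Exception):
    """Raised when a storage location cannot serve requests."""


class InMemoryStore:
    """Hypothetical stand-in for one storage location (a provider or region)."""

    def __init__(self, name: str, data: Dict[str, bytes], available: bool = True):
        self.name = name
        self.data = data
        self.available = available

    def get(self, key: str) -> bytes:
        # Simulate an outage by refusing every request.
        if not self.available:
            raise StoreUnavailable(self.name)
        return self.data[key]


def read_with_fallback(key: str, stores: Sequence[InMemoryStore]) -> Tuple[str, bytes]:
    """Try each store in order; return (store name, value) from the first that answers."""
    failed: List[str] = []
    for store in stores:
        try:
            return store.name, store.get(key)
        except StoreUnavailable:
            # This location is down; remember it and move on to the next copy.
            failed.append(store.name)
    raise StoreUnavailable("all replicas failed: " + ", ".join(failed))


if __name__ == "__main__":
    photo = {"logo.png": b"image-bytes"}
    primary = InMemoryStore("primary", photo, available=False)  # simulated outage
    backup = InMemoryStore("backup", dict(photo))
    print(read_with_fallback("logo.png", [primary, backup]))
```

The catch, of course, is that the second copy has to exist and be kept current before the outage, and that is where the expense comes in.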

As for what to do in the future, there’s really not much to add to what I wrote in 2011 when this happened:

“Organizations that use the cloud — anybody’s cloud, not just Amazon’s — should take this as a wake-up call. Even if you weren’t affected by this outage, you could be on the next one. Don’t just have a backup. Have a backup for the backup. Yes, it costs money. How much money does it cost for your business to be out for a day?”

10  Comments on this Post

 
  • cjwarren54
    "Variety notes, for example, that Netflix stayed up during the outage, likely because it had redundant storage on other services"

    Netflix was affected by the AWS outage in Oct 2016 which caused them to configure the redundancy they have
  • dring1
    Interesting post, Sharon. Have Azure and Google had similar problems? I wonder if AMZN will come clean on the cause other than to cite increased error rates.
  • billpalmer
    I agree we need an explanation from Amazon, asap.  We are preparing a marketing blitz soon offering a business software suite hosted on AWS and one of our pitches is high availability due to the use of multiple "Availability Zones" within a "Region".  How did multiple availability zones go down?  Do we need to have replicas in other regions?  Does or will AWS offer this as a low cost feature?  How nearly real time can the replication take place?  Inquiring minds need to know....pronto.
  • Sharon Fisher
    Good questions, Bill.
  • Sharon Fisher
    Dring, I'm not sure. I suspect so. The thing is, Amazon is so much more heavily used that people notice their outages a lot more.
  • Sharon Fisher
    CJWarren, exactly. Netflix wised up.
  • dring1
    If the AMZN outage was caused by a software bug, will consumers shrug their shoulders and move on? Software is often treated differently than other types of products. Users accept bugs as coming with the territory.
  • dring1
    AMZN is biggest vendor in cloud infrastructure. Oracle bought Dyn after Dyn had an outage that caused more than 1,000 sites to fall off line.
  • Kevin Beaver
    Good read, Sharon. I think we're going to continue down this path of trusting in the cloud, the cloud fails, people get mad...time passes and we forget about the impact and then it happens again, over and over and over.

    One other thing to think about in terms of cloud security is that these companies are in the business of uptime. They clearly can't even get that nailed yet so many people assume that their stuff is also going to be secure simply because Amazon - or whomever - is going to take care of things. I've even seen some IT and security managers claim that they're no longer responsible for any incidents or breaches when someone else is hosting...especially if it's "just" a marketing website.

    LOTS of heads in sand over this...
  • dring1
    I don't know how the Washington Post got the scoop on this but the newspaper reported the AMZN problem was caused by an employee mistake.

    The Post reported that a team member was doing a bit of maintenance on Amazon Web Services, trying to speed up the billing system, when he or she tapped in the wrong codes — and inadvertently took a few more servers offline than the procedure was supposed to, Amazon said Thursday. With a few mistaken keystrokes, the employee wound up knocking out systems that supported other systems that help AWS work properly.

    Woodward and Bernstein must have been on this one.

