Data Center Apparatus

A SearchDataCenter.com blog covering the latest data center news and trends.

» VIEW ALL POSTS Jul 3 2012   10:39AM GMT

Amazon outage due to data center generator problems



Posted by: Beth Pariseau
Tags:
data center
data center design
data center infrastructure management

An outage at Amazon’s Virginia data center last Friday which affected Web services including Pinterest, Netflix and Instagram was due to a multi-generator failure, the company reported Monday.

It was the second failure involving generators to hit the same region in the month of June.

While related to generators generally, the problems stem from different issues in different data centers, according to Julius Neudorfer, CTO of North American Access Technologies, Inc. But the compound failures in each case could mean that the backup systems weren’t tested in failure mode, he said.

“Clearly they’re trying to learn from every mistake,” he said of Amazon. “The common element here seems like they only tested when everything was operating rather than inducing a failure during the test.”

Amazon’s Summary of the AWS Service Event in the US East Regionreport states that during an electrical storm in the northern Virginia area June 29, two of ten data centers in Amazon’s East Region availability zone were forced by a large electrical spike to fail over to generator power.

One of these data centers did not successfully fail over to the generators because “each generator independently failed to provide stable voltage as they were brought into service. As a result, the generators did not pick up the load,” according to Amazon’s summary of the incident. Thus, servers began to run on Uninterruptible Power Supply (UPS) power instead.

As Amazon worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored.The full facility had power to all racks by 8:24pm PDT, according to the Amazon statement.

The outage didn’t end there, though. A bottleneck in the EC2 recovery process and a bug in the Elastic Load Balancer control plane meant that some of the affected customers didn’t come back online until between 11:15 and 12 a.m. PDT, according to the report.

An earlier failure, on June 14, was initiated by a cable fault inside one of the East Region data centers, but then a fan inside a backup generator failed to kick on; in this instance, secondary backup power also failed, according to widespread reports.

1  Comment on this Post

 
There was an error processing your information. Please try again later.
Thanks. We'll let you know when a new response is added.
Send me notifications when other members comment.

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy
  • Beth Pariseau
    [...] Web company, WhatsYourPrice.com, left AWS following two outages in June, and CEO Brandon Wade said the Christmas Eve failure was [...]
    0 pointsBadges:
    report

Forgot Password

No problem! Submit your e-mail address below. We'll send you an e-mail containing your password.

Your password has been sent to: