It was the second failure involving generators to hit the same region in the month of June.
While both failures involved generators, the problems stemmed from different issues in different data centers, according to Julius Neudorfer, CTO of North American Access Technologies, Inc. But the compound failures in each case could mean that the backup systems weren’t tested in failure mode, he said.
“Clearly they’re trying to learn from every mistake,” he said of Amazon. “The common element here seems like they only tested when everything was operating rather than inducing a failure during the test.”
Amazon’s Summary of the AWS Service Event in the US East Region report states that during an electrical storm in the northern Virginia area on June 29, two of ten data centers in Amazon’s East Region availability zone were forced by a large electrical spike to fail over to generator power.
One of these data centers did not successfully fail over to the generators because “each generator independently failed to provide stable voltage as they were brought into service. As a result, the generators did not pick up the load,” according to Amazon’s summary of the incident. Thus, servers began to run on Uninterruptible Power Supply (UPS) power instead.
As Amazon worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04 p.m. PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored. The full facility had power to all racks by 8:24 p.m. PDT, according to the Amazon statement.
The outage didn’t end there, though. A bottleneck in the EC2 recovery process and a bug in the Elastic Load Balancer control plane meant that some of the affected customers didn’t come back online until between 11:15 p.m. and midnight PDT, according to the report.
An earlier failure, on June 14, was initiated by a cable fault inside one of the East Region data centers, but then a fan inside a backup generator failed to kick on; in this instance, secondary backup power also failed, according to widespread reports.
IO offers “data center OS” as stand-alone software
IO has released DCIM software it uses in its proprietary modular data centers as stand-alone software. The IO OS “data center operating system” gathers mechanical, power, cooling and electrical usage data in real time, maintaining that data and integrating it with ticketing systems and audit trail processes. IO OS can display data center assets according to a number of perspectives – physical, logical and infrastructure – and includes views of supporting systems such as generators, switchgear, paralleling systems and chillers. With this information, IO OS provides a single pane of glass from which data center operators can establish and maintain quality of service, while optimizing data center utilization and operating costs, the company said.
Sentilla adds business planning and analytics functions
At Sentilla Corp., the holy grail of DCIM is not so much to collect information about the data center, but to do something with it. As such, the latest Sentilla 4.0 includes financial and infrastructure planning modules, plus new asset analysis capabilities. The new version also supports multiple data centers from one interface, and its asset database features improved importing and discovery capabilities. Sentilla continues to add modeling information for systems from Dell, HP, IBM, NetApp and Sun/Oracle to its database, and offers improved support for facility infrastructure from Eaton, Emerson, APC and Schneider Electric, plus management software from BMC and HP.
iTRACS ties in with Intel Data Center Manager
iTRACS, another DCIM player, is working with Intel to integrate Intel Data Center Manager software with its Converged Physical Infrastructure Management (CPIM) suite, improving its collection, management, and analysis of CPU power, temperature, and environmental information. As a result, iTRACS will be better able to perform capacity planning, improve rack densities, identify inefficient IT assets, pinpoint cooling issues, optimize IT equipment lifecycle and prevent outages.
Let us know what you think about the story; email Alex Barrett, Executive Editor, at email@example.com, or follow @aebarrett on Twitter.