An outage at Amazon’s Virginia data center last Friday, which affected Web services including Pinterest, Netflix and Instagram, was due to a multi-generator failure, the company reported Monday.
It was the second failure involving generators to hit the same region in the month of June.
While related to generators generally, the problems stem from different issues in different data centers, according to Julius Neudorfer, CTO of North American Access Technologies, Inc. But the compound failures in each case could mean that the backup systems weren’t tested in failure mode, he said.
“Clearly they’re trying to learn from every mistake,” he said of Amazon. “The common element here seems like they only tested when everything was operating rather than inducing a failure during the test.”
Amazon’s Summary of the AWS Service Event in the US East Region report states that during an electrical storm in the northern Virginia area June 29, two of ten data centers in Amazon’s East Region availability zone were forced by a large electrical spike to fail over to generator power.
One of these data centers did not successfully fail over to the generators because “each generator independently failed to provide stable voltage as they were brought into service. As a result, the generators did not pick up the load,” according to Amazon’s summary of the incident. Thus, servers began to run on Uninterruptible Power Supply (UPS) power instead.
As Amazon worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored. The full facility had power to all racks by 8:24pm PDT, according to the Amazon statement.
The outage didn’t end there, though. A bottleneck in the EC2 recovery process and a bug in the Elastic Load Balancer control plane meant that some of the affected customers didn’t come back online until between 11:15pm and midnight PDT, according to the report.
An earlier failure, on June 14, was initiated by a cable fault inside one of the East Region data centers, but then a fan inside a backup generator failed to kick on; in this instance, secondary backup power also failed, according to widespread reports.
Data center managers interested in highly dense, low-power system configurations will have another option to choose from later this year, according to an announcement made today by HP and Intel.
HP’s Project Moonshot shifted focus from the Redstone Server Development Platform based on Calxeda’s ARM processors to a new generation of the Intel Atom System-on-a-Chip (SoC) platform dubbed Centerton.
“It’s the best Atom infrastructure so far, but more significant is the server architecture, with an internal fabric for management of server nodes,” said Forrester Research analyst Richard Fichera. “These very dense x86-based servers put pressure on proposed ARM designs.”
HP emphasized that the new product, called Project Gemini, is not intended to replace any other product in its line. Where Redstone hardware was based on HP’s ProLiant Scalable System SL chassis, Project Gemini introduces a new chassis that connects individual server cartridges to an internal fabric, and those cartridges are to be “processor-neutral,” according to HP.
But in its first iteration, Gemini’s Atom-based processor cartridges will offer several features that appeal to enterprise data centers and that its Redstone ARM counterpart lacks, including 64-bit support, error correction code (ECC), enterprise software compatibility, and Intel’s Virtualization Technology (VT) – all in a six-watt power envelope.
In a press conference announcing Gemini on Tuesday, HP also referred to Redstone as a “market development vehicle,” whereas Gemini is projected to be a generally available product later this year.
The confluence and competition between ARM and Atom are also being explored by HP rival Dell, which recently floated an ARM-based trial balloon with its Copper servers, available only to a select audience.
Meanwhile, microservers, the general category to which all of these products belong, remain suited to a niche market. Microservers pack large numbers of low-power chips into dense chassis and are suited to highly parallelized but lightweight workloads like Hadoop, Web hosting, content delivery, or distributed memory caching.
Intel estimates microservers could capture 10% of the overall server market by 2015 and estimates their current penetration at 1-2%. HP predicts 10 to 15% market share for the extreme low-energy servers by 2015.
Dell has shipped its “Copper” ARM-based server to a limited list of customers and partners, with the goal of sussing out uses for the low-power chip in enterprise environments.
The 32-bit Advanced RISC Machine (ARM) chips are used widely in cell phones and tablets. They have also begun to appear in microservers such as HP’s Moonshot, which has been based on a partnership with Calxeda, Inc. since November.
At the server level, low-power chips are suited to environments where many relatively lightweight operations such as Web serving must be performed in parallel at massive scale.
Meanwhile, other low-power chips such as Intel Corp.’s Atom have been sold for similar purposes, including in microserver startup SeaMicro Inc.’s products prior to the company’s acquisition by AMD. SeaMicro’s products were also resold by Dell.
“That’s a great question,” Dell executive director of marketing Steve Cumings said when asked why an IT pro seeking low-power scale-out hardware would use ARM over Atom or vice versa.
The answer is what Dell is after with the limited shipments of Copper, as well as two test clusters being set up in Dell’s Texas headquarters and at the Texas Advanced Computing Center for remote access by interested parties.
Currently, there’s a lot of code written for consumer devices on ARM, but very little in the way of enterprise applications, which is one thing holding ARM back. Dell also announced it will offer a version of its Crowbar automated server provisioning software on ARM by the end of the year.
The fact that 64-bit ARM designs have yet to hit production is a limiting factor to server-level adoption of the chip. Dell expects ARM servers to be used in production over the next 18 months to two years, when 64-bit chips become commonplace.
The Dell Copper server offers 48 ARM microservers based on the Marvell Armada CPU in a 3U shared environment. Each server node consumes 15 watts and includes Serial Advanced Technology Attachment or Flash storage; up to 8 GB RAM; and a 1 GbE input. Four server nodes are packed into a sled, each of which contains a non-blocking Layer 2 switch, and each chassis contains 12 sleds. The entire chassis draws 750 watts of power.
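The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in this article; the variable names are illustrative, and it assumes the 15-watt node figure and the 750-watt chassis figure are measured in a comparable way, which the article does not confirm.

```python
# Figures quoted in the article; names are illustrative only.
NODES_PER_SLED = 4
SLEDS_PER_CHASSIS = 12
WATTS_PER_NODE = 15
CHASSIS_WATTS = 750
CHASSIS_RACK_UNITS = 3

# Total server nodes in one chassis.
nodes_per_chassis = NODES_PER_SLED * SLEDS_PER_CHASSIS

# Power drawn by the server nodes alone.
node_watts = nodes_per_chassis * WATTS_PER_NODE

# Whatever the chassis draws beyond the nodes (assumed here to be
# shared components; the article does not break this down).
remaining_watts = CHASSIS_WATTS - node_watts

# Density: server nodes per rack unit.
nodes_per_rack_unit = nodes_per_chassis / CHASSIS_RACK_UNITS

print(nodes_per_chassis, node_watts, remaining_watts, nodes_per_rack_unit)
```

Under those assumptions, the nodes account for 720 of the chassis’s 750 watts, and the design packs 16 server nodes into each rack unit.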
UPDATE: June 12th’s server maintenance brought in patch 1.02c with a “fix” for everybody’s favorite Error 37.
How did they fix it? That depends on your definition of “fix.” Quoth the patch notes, “If the authentication service is busy, the login checkbox will now wait at ‘Authenticating Credentials’ until a player’s login attempt can be processed. As a result, players should no longer encounter Error 37 when logging in.”
Gee thanks, Blizzard. Take away the fun part of server problems…
If you haven’t heard yet, Blizzard released its much-anticipated Diablo III mere hours ago. And it didn’t take long for a host of complaints about online gaming issues to follow.
Throughout its history, World of Warcraft had login and game server issues, as must be expected for a game supporting so many people. For the most part the problems were small and short. I sometimes wondered if Blizzard wasn’t building in server downtime to force its millions to go look at nature for a few minutes. There is an incredible amount of computer power needed to run a massively popular multiplayer online game. Blizzard eventually opened 10 data centers around the world to support its runaway hit.
Fast forward to 2012: Blizzard has three World of Warcraft expansions, a fourth on the way and tons of data to farm for server load information. You would think they’d have learned from all their stress tests that launch day will always, always, always result in more server stress. That’s why, for so many Blizzard fans, Diablo III’s launch day server problems are simply unacceptable. Twitter has been abuzz with the error37 hashtag, which refers to one of the login errors players have encountered. The comments have ranged from snarky – “The world’s first coordinated DDOS attack on Blizzard not organized by Anonymous.” – to humorous – “I was going to play #Diablo3 tonight, but then I took an #error37 to the knee.”
Joking aside, some Diablo fans, like Zynga’s Alex Levinson, wonder why this launch went off so poorly. In a blog post, Levinson talks about the sources of revenue for a game like Diablo III and how making customers happy at launch will generate interest later on and keep the game going. “This is why capacity planning for a launch like Diablo 3 should have had much more importance put on it,” he said. Levinson’s three tips for Blizzard: automatic scaling when demand is high, scaling written into application code and leveraging the cloud, like Zynga does, to help meet demand. Not bad advice there, Blizzard. Now you’ll have to excuse me while I power-level my Demon Hunter.
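For the curious, the first of those tips, automatic scaling when demand is high, boils down to a simple rule. The sketch below is a hypothetical illustration, not anything Blizzard or Zynga has published: the function name, the per-server capacity and the server limits are all assumptions.

```python
import math

def desired_servers(pending_logins, logins_per_server,
                    min_servers=1, max_servers=100):
    """Return how many login servers to run for the current demand.

    All parameters are hypothetical: pending_logins is the number of
    login attempts waiting, logins_per_server is what one server can
    handle, and the min/max bounds cap the fleet size.
    """
    needed = math.ceil(pending_logins / logins_per_server)
    # Never drop below the floor or grow past the ceiling.
    return max(min_servers, min(max_servers, needed))

# Quiet evening: stay at the minimum.
print(desired_servers(0, 1000))
# Launch-day rush: scale out to match demand.
print(desired_servers(2500, 1000))
```

A real system would also need to account for how long new servers take to start, which is one reason capacity planning before launch still matters.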
We’ve got May Day, Cinco de Mayo, Mother’s Day, Memorial Day and more this month, but CA Technologies thinks a day is not enough to celebrate the mainframe.
All month long, mainframe software vendor CA Technologies is hosting what it calls May Mainframe Madness 2012 (MMM2012). From its own description, the online event offers “more than 100+ valuable sessions, demos, papers and other valuable tools available over every business day in May.” Most of those tools are CA-specific but the idea of a month-long series of hour-long keynote addresses is highly appealing.
Registration lasts all month as well. You’ve already missed out on several days of mini-lectures, but coming up are talks on security management, the DB2 database, storage management, Linux on System z, MICS resource management and more. Though most of the speakers are CA employees, the varied topics will surely overlap somebody’s interests. The CA Mainframe Twitter account has been listing the sessions, but you can find a full list on their website as well.
And in case you’re in the market for a trip across the states, here’s a fun little list of other trade shows coming up in the near future. I’m not sure how the SpaceCraft Technology Expo made it into the IT list, but hey, you know I’m game for space tech.
Behind every U.S. soldier in modern warfare, there is an arsenal of technological support. Does that extend to not-so-real-but-wish-they-were superheroes? My suspicion is that Superman’s Fortress of Solitude is powered by a super-secret data center. I mean, it was originally built in the Arctic and later in space for ultimate free cooling! And Batman? We know alter ego Bruce Wayne is the billionaire head of a huge defense corporation, so it’s not a stretch to assume at least some of the Batcave’s processing power comes from Wayne Industries. The X-Men’s danger room is probably the most powerful hologram in existence, so you know there’s a data center behind that.
Following this train of comic book logic, Oracle’s S.H.I.E.L.D. data center is not so farfetched. Though clearly a marketing gesture, the Avengers’ data center, packed with fancy hardware from Oracle, is impressive. S.H.I.E.L.D., the intelligence agency behind the Avengers, needed a new data center after its old one was conveniently destroyed, creating an opportunity to publicize Oracle’s gear and the new Avengers movie.
But it does make my nerd-brain wonder: what would really be required for superhero-scale data centers? I don’t think Superman’s current incarnation requires much computing power, though in days past he apparently did lots of scientific research, but the large-scale Justice League and Avengers organizations (we won’t even mention the intergalactic Green Lantern Corps) probably would. Would they build green and get LEED certified? Who’ll foot the bill? More importantly, if these superheroes are still trying to remain secret, where would they find IT staff to support those data centers? Where’s our Super IT Admin comic book to answer these questions, huh?
…you’ll have to break a few eggs. The road to an environmentally friendly data center can be a bumpy one, which is not surprising when the path less traveled has innovative technologies and practices that haven’t been, ahem, road-tested.
Some, like the Emerson data center, have had trouble with data center management and asset availability in the face of strict, energy-efficient power utilization guidelines. On the free cooling side, an incident at Facebook’s Prineville, Ore., facility shut down the data center when a control system program error brought condensation onto the servers and shorted out power supplies. Then there are the problems you might not see coming, like IBM’s Poughkeepsie Green Data Center, which discovered some capacity issues well into its existence. For every failure there is success, and both provide lessons for a more energy-efficient data center.
Events like the Green Technologies Conference are tackling nagging problems — like overheating — for green data centers. There are other tips and tricks compiled by certification groups like the U.S. Green Building Council (USGBC) and the Green Grid to help reduce carbon footprints without sacrificing cooling and energy efficiency. Heck, even some of the big boys like Google and Facebook have put their designs and best practices out there for everyone to see. Though there can be a price to pay for innovation, it’s good to know you don’t have to go it alone when going green.
Remember CERN’s data conundrum? In a nutshell, they’ve got 15 petabytes to deal with and have run out of storage. As much of a problem as this is, apparently a new worldwide scientific enterprise called the Square Kilometer Array (SKA) is trying to one-up them in terms of data output. Well, 100-up them, to be accurate.
The new astronomical project is a radio telescope designed to see into the universe’s distant past to help answer questions about the Big Bang. To do this, it will pull in an exabyte – that’s 1 billion gigabytes – of data every day. As you can imagine, this presents a bit of a computing challenge.
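To put that figure in perspective, here is a back-of-the-envelope conversion. It takes the article’s one exabyte per day at face value and assumes decimal (SI) units and a perfectly even flow of data around the clock, neither of which the project’s own material confirms.

```python
# Decimal (SI) units, as the "1 billion gigabytes" figure implies.
GIGABYTE = 10**9
TERABYTE = 10**12
EXABYTE = 10**18

SECONDS_PER_DAY = 24 * 60 * 60

# One exabyte expressed in gigabytes.
gigabytes_per_day = EXABYTE // GIGABYTE

# Average sustained rate if the data arrived evenly all day.
bytes_per_second = EXABYTE / SECONDS_PER_DAY
terabytes_per_second = bytes_per_second / TERABYTE

print(f"{gigabytes_per_day:,} GB per day")
print(f"{terabytes_per_second:.2f} TB per second")
```

Spread evenly, that works out to roughly 11.6 terabytes arriving every second.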
Exploring the mysteries of the universe is no doubt going to take an epic data center project, and, as it turns out, a potentially new way of stacking chips in a server. According to its website, the SKA will require “100 petaflops per second processing power,” which is beyond current computing technology. Luckily, IBM and ASTRON, a Netherlands-based astronomy organization, have created the DOME project to bring about this new world order of high performance research computing. In other words, they’re in the future!
SKA is still in its infancy and won’t be fully operational until 2024, but it sure will be interesting to watch as it grows.
The Uptime Institute has started a fun way to encourage companies to cut down on wasted server resources. Known as the Server Roundup, the contest features lots of cowboy imagery, a silly video and a swanky belt buckle for the winners. This year’s winner, AOL, dumped 9,484 servers for a savings of more than $5 million in upkeep costs.
The Uptime blog states, “Decommissioning a single 1U rack server can result in $500 per year in energy savings, an additional $500 in operating system licenses, and $1,500 in hardware maintenance costs.” When you remove several thousand servers, you can see how it all adds up.
Coming in second was NBC with 284 decommissioned servers.
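Here is how the Uptime rule of thumb adds up. This is a rough sketch that assumes the three quoted figures can simply be summed per server and applied to every decommissioned machine; the function name is made up for illustration. Uptime says a server “can” yield these savings, so treat the result as an upper-end estimate rather than a forecast.

```python
# Per-server figures quoted from the Uptime blog, in dollars.
ENERGY = 500
OS_LICENSES = 500
HARDWARE_MAINTENANCE = 1500

def estimated_savings(servers_removed):
    """Upper-end savings estimate using the quoted per-server figures."""
    per_server = ENERGY + OS_LICENSES + HARDWARE_MAINTENANCE
    return servers_removed * per_server

for company, servers in (("AOL", 9484), ("NBC", 284)):
    print(f"{company}: ${estimated_savings(servers):,}")
```

By this measure AOL’s haul would be worth about $23.7 million, well above the $5 million-plus reported, which suggests not every retired server saves the full $2,500.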
The question of when to replace or add more servers is an important one every enterprise needs to address.
New technology is making it possible to do more with less, so maybe we’ll see a closer competition in next year’s roundup.
Whenever I look up and see a beautiful sky with puffy clouds I think, “That’s where my iTunes are.”
I read this tweet from filmmaker Albert Brooks the other day, and it made me laugh, because, really, what could be further from the truth? Anyone who’s been where the iTunes really are — the data center — knows that there’s nothing puffy or ethereal about it. Data centers are grounded here on Earth, made of earth and the grimy, heavy, dirty stuff inside it.
In reality, that iTunes song is sitting in some data center in the Pacific Northwest, parsed into ones and zeroes, sitting on a spinning platter made of aluminum alloy and coated with ferromagnetic material. Actually, make that a bunch of spinning disks. And several solid-state memory caches composed of capacitors and transistors made from dust and sand, i.e., silicon.
The road between that data center and the iPod is long and paved with copper and glass (network cables), and it passes through countless relays (servers) made of steel, aluminum, silicon and petroleum, fueled by electricity created by burning coal buried deep under the Earth’s crust.
Eventually, that “cloud-based” iTunes song will make it to the iPod, but there’s nothing lofty or airy about its voyage (well, unless you count traveling over radio waves during a Wi-Fi sync). But like a dancer who floats through the air like a feather, data center managers are tasked with making everything look easy and light (i.e., “cloud-like”). The dancer’s graceful arabesque betrays no sign of the countless hours of practice, sore muscles and bruised and bandaged feet.
Likewise, that cloud-based iTunes song plays at the click of a button or tap of a screen but betrays nothing of the countless hours data center professionals spent stringing cable, racking servers, monitoring packets and optimizing HVAC systems.
They talk about the cloud, but we know better.