The IT world has had a decades-long love triangle with air- and water-cooling. Air-cooling takes IT to the prom, but now water-cooling is holding up a boom box outside IT’s window to win it back.
IBM has made plenty of headlines with the “world’s fastest” supercomputer, Sequoia. But it also made waves by introducing a new commercial supercomputer, the SuperMUC. It boasts direct hot-water cooling and superb energy efficiency – using 40% less energy than air-cooling, says IBM.
The PR video from the Leibniz Supercomputing Centre says the SuperMUC’s cooling system is based on the human circulatory system – a fun medicine/technology crossover. Cold water goes in directly to the processors, absorbs their heat and flows out as hot water to a heat exchanger, which then heats the facility.
Apparently, the facility housing SuperMUC has successfully eliminated CRACs from the equation and is saving Leibniz a million euros a year. IBM used to cool mainframes with water, but increased processor density and cheaper air conditioning drove data centers to adopt air-cooling. Now that energy costs are on the rise and there’s an emphasis on going green, companies are once again looking to liquids to cool their machines. Plus, according to Robert McFarlane, Principal at Shen Milsom and Wilke, it’s hard to argue with the fact that “water is approximately 3,500 times more efficient than air.”
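One way to sanity-check a figure of that size is to compare how much heat a given volume of each fluid can absorb per degree of temperature rise. The Python sketch below does that arithmetic. The property values are approximate textbook numbers for water and air near room temperature, and reading McFarlane’s “efficient” as volumetric heat capacity is an assumption, not something the quote spells out.

```python
# Approximate textbook values near room temperature.
# These are assumptions for illustration, not figures from the article.
WATER_DENSITY_KG_M3 = 998.0
WATER_SPECIFIC_HEAT_KJ_KG_K = 4.18
AIR_DENSITY_KG_M3 = 1.2
AIR_SPECIFIC_HEAT_KJ_KG_K = 1.005


def volumetric_heat_capacity(density_kg_m3, specific_heat_kj_kg_k):
    """Heat in kJ absorbed by one cubic metre of fluid per kelvin of warming."""
    return density_kg_m3 * specific_heat_kj_kg_k


water = volumetric_heat_capacity(WATER_DENSITY_KG_M3, WATER_SPECIFIC_HEAT_KJ_KG_K)
air = volumetric_heat_capacity(AIR_DENSITY_KG_M3, AIR_SPECIFIC_HEAT_KJ_KG_K)
ratio = water / air

print(f"Water: {water:,.0f} kJ per cubic metre per kelvin")
print(f"Air:   {air:.2f} kJ per cubic metre per kelvin")
print(f"Ratio: about {ratio:,.0f} to 1")
```

With these inputs the ratio lands a little under 3,500, which is at least in the same neighborhood as the quoted figure.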
The hurdle for many facilities is infrastructure. Liquids require pipes. Even SuperMUC wouldn’t be able to use that capillary-inspired cooling system without the supporting infrastructure.
Internap, a data center hosting facility in various U.S. cities, has built the newest expansions of their facility with underfloor piping infrastructure to get glycol directly to servers. Older parts of the facility use hot/cold aisle air-cooling with the underfloor space used only for air.
Then there’s Google, which built a waste water processing facility to provide water for cooling, thus eliminating some of the strain on the community.
But both of those examples are new builds. It will be interesting to see how invasive and disruptive adding water-cooling infrastructure would be to an existing data center.
Do you think more facilities are going to pony up the infrastructure cost and switch (back?) to water-cooling, or is the relative comfort of air-cooling enough to keep data centers happy?
Because speculation is fun, let’s talk a little about artificial intelligence and its potential in data centers. Automation and DCIM tools have come a long way, but as those technologies evolve, they might benefit greatly from an infusion of cutting-edge AI engineering.
The general public has seen artificial intelligence (AI) in movies and videogames where the typical scenario involves crazed robots or homicidal computers running amok. More tech-savvy consumers have Apple’s Siri in their iPhone to help fulfill a request or an adorable Roomba to vacuum a room. Once we push past the fears of a robot uprising, we realize AI can be an incredible tool to ease our workload.
The idea of AI as a functional part of a technologically advanced society isn’t remotely new. Alan Turing, the groundbreaking mathematician considered the father of AI, wrote about it back in the 1950s.
Modern uses for AI have been pioneered in many circles such as social media, board and video games, healthcare, Internet research and cat videos. And this recent story details the use of image recognition software to learn board game moves and defeat humans.
Why would any of this be useful for a data center environment? Well, let’s take automation and data center infrastructure management (DCIM) as our starting point. How great would it be if we could replace some of IT’s on-call overtime hours with AI hours? Instead of simply setting temperature or power parameters for automation software, we could have a DCIM program mimic the behavior of the human technicians to make decisions when an issue arises during the wee hours of the morning.
Automation without learning suffers from an inability to react to things outside its programming. This is one of the arguments for sending humans as well as robots to Mars. It’s not much of a leap to understand why human hours are still incredibly important for data center facilities management.
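To make that limitation concrete, here is a minimal Python sketch of automation driven purely by fixed parameters. Everything in it is hypothetical: the metric names, the limits and the action strings are invented for illustration and do not come from any real DCIM product.

```python
# Hypothetical fixed limits; real values depend on the facility and its equipment.
LIMITS = {
    "inlet_temp_c": 27.0,
    "rack_power_kw": 8.0,
}


def evaluate(reading):
    """Return an action for one sensor reading using fixed thresholds only."""
    metric = reading["metric"]
    value = reading["value"]
    if metric not in LIMITS:
        # Nothing was programmed for this situation, so a person has to decide.
        return "escalate_to_human"
    return "raise_alarm" if value > LIMITS[metric] else "ok"


print(evaluate({"metric": "inlet_temp_c", "value": 30.0}))
print(evaluate({"metric": "rack_power_kw", "value": 5.5}))
print(evaluate({"metric": "water_on_floor", "value": 1}))
```

The third reading is the interesting one: the rule set has no entry for it, so the only safe move is to wake somebody up. That gap is where a system that learns from its technicians would have to earn its keep.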
Let’s have an AI with cameras for eyes watch us work, then set it to work as a kind of “second shift” for monitoring and managing our facilities during off hours. If we want to expand into the realm of science fiction, then we can also develop a human chassis for the AI – think Stepford Wives with more coffee-stained shirts – but we might be getting ahead of ourselves.
Building a new data center is a costly endeavor, as evidenced by Apple’s data center bid in Reno, Nev. But here’s a novel idea: If you don’t have the cash, build your data center out of Lego® bricks and Raspberry Pi clusters. Then install Minecraft and start computing.
That’s right, the increasingly inventive Minecraft community has come up with several working computers built with blocks in the game. Stick enough of these puppies in your plastic data center and you might just have enough computing power to run Minecraft within Minecraft.
Silliness aside, inventing new, cheap and interesting ways to build data centers is important, especially with companies ever more budget conscious. If experimenting with games and toys is how to spur innovation, then bring on afternoon playtime.
An outage at Amazon’s Virginia data center last Friday, which affected Web services including Pinterest, Netflix and Instagram, was due to a multi-generator failure, the company reported Monday.
It was the second failure involving generators to hit the same region in the month of June.
While related to generators generally, the problems stem from different issues in different data centers, according to Julius Neudorfer, CTO of North American Access Technologies, Inc. But the compound failures in each case could mean that the backup systems weren’t tested in failure mode, he said.
“Clearly they’re trying to learn from every mistake,” he said of Amazon. “The common element here seems like they only tested when everything was operating rather than inducing a failure during the test.”
Amazon’s “Summary of the AWS Service Event in the US East Region” report states that during an electrical storm in the northern Virginia area June 29, two of ten data centers in Amazon’s East Region availability zone were forced by a large electrical spike to fail over to generator power.
One of these data centers did not successfully fail over to the generators because “each generator independently failed to provide stable voltage as they were brought into service. As a result, the generators did not pick up the load,” according to Amazon’s summary of the incident. Thus, servers began to run on Uninterruptible Power Supply (UPS) power instead.
As Amazon worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored. The full facility had power to all racks by 8:24pm PDT, according to the Amazon statement.
The outage didn’t end there, though. A bottleneck in the EC2 recovery process and a bug in the Elastic Load Balancer control plane meant that some of the affected customers didn’t come back online until between 11:15 p.m. and midnight PDT, according to the report.
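The times quoted above can be turned into durations with a few lines of Python. This is only a sketch of the arithmetic: the year is an assumption (the text gives only June 29), the late end of the recovery window is taken as midnight, and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# The text gives June 29; the year is assumed. All times are PDT.
DAY = datetime(2012, 6, 29)


def at(hour, minute):
    """Build a timestamp on the day of the outage."""
    return DAY.replace(hour=hour, minute=minute)


def minutes_between(start, end):
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


power_loss_start = at(20, 4)                                   # servers begin losing power
generators_stable = power_loss_start + timedelta(minutes=10)   # "ten minutes later"
all_racks_powered = at(20, 24)                                 # full facility has power
recovery_early = at(23, 15)                                    # early end of recovery window
recovery_late = DAY + timedelta(days=1)                        # midnight, late end of window

rack_outage_min = minutes_between(power_loss_start, all_racks_powered)
service_outage_min = (
    minutes_between(power_loss_start, recovery_early),
    minutes_between(power_loss_start, recovery_late),
)

print(f"Generators stable at {generators_stable:%H:%M}")
print(f"Racks without power for up to {rack_outage_min} minutes")
print(f"Some customers offline for {service_outage_min[0]} to {service_outage_min[1]} minutes")
```

Seen this way, the power problem itself accounts for about 20 minutes. The software recovery issues stretched the customer-facing outage to more than three hours.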
An earlier failure, on June 14, was initiated by a cable fault inside one of the East Region data centers, but then a fan inside a backup generator failed to kick on; in this instance, secondary backup power also failed, according to widespread reports.
Data center managers interested in highly dense, low-power system configurations will have another option to choose from later this year, according to an announcement made today by HP and Intel.
HP’s Project Moonshot shifted focus from the Redstone Server Development Platform based on Calxeda’s ARM processors to a new generation of the Intel Atom System-on-a-Chip (SoC) platform dubbed Centerton.
“It’s the best Atom infrastructure so far, but more significant is the server architecture, with an internal fabric for management of server nodes,” said Forrester Research analyst Richard Fichera. “These very dense x86-based servers put pressure on proposed ARM designs.”
HP emphasized that the new product, called Project Gemini, is not intended to replace any other product in its line. Where Redstone hardware was based on HP’s ProLiant Scalable System SL chassis, Project Gemini introduces a new chassis that connects individual server cartridges to an internal fabric, and those cartridges are to be “processor-neutral,” according to HP.
But in its first iteration, Gemini’s Atom-based processor cartridges will boast several features that appeal to enterprise data centers and that its Redstone ARM counterpart doesn’t have, including 64-bit support, error correction code (ECC), enterprise software compatibility, and Intel’s Virtualization Technology (VT) – all in a six-watt power envelope.
At the Tuesday press conference announcing Gemini, HP also referred to Redstone as a “market development vehicle,” whereas Gemini is projected to be a generally available product later this year.
The competition between ARM and Atom is also being explored by HP rival Dell, which recently floated an ARM-based trial balloon with its Copper servers, available only to a select audience.
Meanwhile, microservers, the general category to which all of these products belong, remain suited to a niche market. Microservers pack large numbers of low-power chips into dense chassis and are suited to highly parallelized but lightweight workloads like Hadoop, Web hosting, content delivery, or distributed memory caching.
Intel estimates microservers could capture 10% of the overall server market by 2015 and estimates their current penetration at 1-2%. HP predicts 10 to 15% market share for the extreme low-energy servers by 2015.
Dell has shipped its “Copper” ARM-based server to a limited list of customers and partners, with the goal of sussing out uses for the low-power chip in enterprise environments.
The 32-bit Advanced RISC Machine (ARM) chips are used widely in cell phones and tablets. They have also begun to appear in microservers such as HP’s Moonshot, based on a partnership with Calxeda, Inc. since November.
At the server level, low-power chips are suited to environments where many relatively lightweight operations such as Web serving must be performed in parallel at massive scale.
Meanwhile, other low-power chips such as Intel Corp.’s Atom have been sold for similar purposes, including in microserver startup SeaMicro Inc.’s products before the company was acquired by AMD. SeaMicro’s products were also resold by Dell.
“That’s a great question,” Dell executive director of marketing Steve Cumings said when asked why an IT pro seeking low-power scale-out hardware would use ARM over Atom or vice versa.
The answer is what Dell is after with the limited shipments of Copper, as well as two test clusters being set up in Dell’s Texas headquarters and at the Texas Advanced Computing Center for remote access by interested parties.
Currently, there’s a lot of code written for consumer devices on ARM, but very little in the way of enterprise applications, which is one thing holding ARM back. Dell also announced it will offer a version of its Crowbar automated server provisioning software on ARM by the end of the year.
The fact that 64-bit ARM designs have yet to hit production is a limiting factor to server-level adoption of the chip. Dell expects ARM servers to be used in production over the next 18 months to two years, when 64-bit chips become commonplace.
The Dell Copper server offers 48 ARM microservers based on the Marvell Armada CPU in a 3U shared environment. Each server node consumes 15 watts and includes Serial Advanced Technology Attachment or Flash storage; up to 8 GB RAM; and a 1 GbE input. Four server nodes are packed into a sled, each of which contains a non-blocking Layer 2 switch, and each chassis contains 12 sleds. The entire chassis draws 750 watts of power.
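Those numbers hang together, as a quick Python check shows. The sketch below uses only the figures in the paragraph above. Attributing the leftover wattage to shared components such as the switches and fans is an assumption, since the text does not say where it goes.

```python
# Figures taken from the paragraph above.
NODES_PER_SLED = 4
SLEDS_PER_CHASSIS = 12
WATTS_PER_NODE = 15
CHASSIS_WATTS = 750
CHASSIS_HEIGHT_U = 3


def chassis_summary():
    """Derive node count, density and power split for one chassis."""
    nodes = NODES_PER_SLED * SLEDS_PER_CHASSIS
    node_watts = nodes * WATTS_PER_NODE
    return {
        "nodes": nodes,
        "node_watts": node_watts,
        # Assumed to cover shared parts (switches, fans and so on).
        "overhead_watts": CHASSIS_WATTS - node_watts,
        "nodes_per_u": nodes / CHASSIS_HEIGHT_U,
        "all_in_watts_per_node": CHASSIS_WATTS / nodes,
    }


summary = chassis_summary()
for name, value in summary.items():
    print(f"{name}: {value}")
```

Four nodes times 12 sleds gives the advertised 48 servers, or 16 per rack unit. The nodes themselves account for 720 of the 750 watts, which works out to a little under 16 watts per server once the chassis overhead is spread across them.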
UPDATE: June 12th’s server maintenance brought in patch 1.02c with a “fix” for everybody’s favorite Error 37.
How did they fix it? That depends on your definition of “fix.” Quoth the patch notes, “If the authentication service is busy, the login checkbox will now wait at ‘Authenticating Credentials’ until a player’s login attempt can be processed. As a result, players should no longer encounter Error 37 when logging in.”
Gee thanks, Blizzard. Take away the fun part of server problems…
If you haven’t heard yet, Blizzard released its much-anticipated Diablo III mere hours ago, and it didn’t take long for a host of complaints about online gaming issues to follow.
Throughout its history, World of Warcraft had login and game server issues, as must be expected for a game supporting so many people. For the most part the problems were small and short. I sometimes wondered if Blizzard wasn’t building in server downtime to force its millions to go look at nature for a few minutes. There is an incredible amount of computer power needed to run a massively popular multiplayer online game. Blizzard eventually opened 10 data centers around the world to support its runaway hit.
Fast forward to 2012: Blizzard has three World of Warcraft expansions, a fourth on the way and tons of data to farm for server load information. You would think they’d have learned from all their stress tests that launch day will always, always, always result in more server stress. That’s why, for so many Blizzard fans, Diablo III’s launch day server problems are simply unacceptable. Twitter has been abuzz with the error37 hashtag, named for one of the login errors players have encountered. The comments have ranged from snarky – “The world’s first coordinated DDOS attack on Blizzard not organized by Anonymous.” – to humorous – “I was going to play #Diablo3 tonight, but then I took an #error37 to the knee.”
Joking aside, some Diablo fans, like Zynga’s Alex Levinson, wonder why this launch went off so poorly. In a blog post, Levinson talks about the sources of revenue for a game like Diablo III and how making customers happy at launch will generate interest later on and keep the game going. “This is why capacity planning for a launch like Diablo 3 should have had much more importance put on it,” he said. Levinson’s three tips for Blizzard: automatic scaling when demand is high, scaling written into application code and leveraging the cloud, like Zynga does, to help meet demand. Not bad advice there, Blizzard. Now you’ll have to excuse me while I power-level my Demon Hunter.
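For the curious, the first of those tips can be sketched in a few lines of Python. This is a generic capacity rule, not Blizzard’s or Zynga’s actual method. The function name, the 25% headroom and the server limits are all hypothetical values chosen for illustration.

```python
import math


def desired_servers(logins_per_sec, capacity_per_server,
                    headroom=0.25, min_servers=2, max_servers=500):
    """Return how many login servers to run for the current demand.

    Adds a safety margin (headroom) on top of measured demand, then clamps
    the result between a floor and a ceiling. All defaults are hypothetical.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    needed = math.ceil(logins_per_sec * (1 + headroom) / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


# A quiet evening, a busy one, and a launch-day stampede.
for demand in (0, 1000, 1_000_000):
    print(demand, "logins/sec ->", desired_servers(demand, capacity_per_server=100), "servers")
```

The ceiling matters as much as the scaling rule. Once demand blows past whatever maximum was planned for, players end up staring at a login error no matter how clever the automation is, which is the capacity planning point Levinson was making.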
We’ve got May Day, Cinco de Mayo, Mother’s Day, Memorial Day and more this month, but CA Technologies thinks a day is not enough to celebrate the mainframe.
All month long, mainframe software vendor CA Technologies is hosting what it calls May Mainframe Madness 2012 (MMM2012). From its own description, the online event offers “more than 100+ valuable sessions, demos, papers and other valuable tools available over every business day in May.” Most of those tools are CA-specific but the idea of a month-long series of hour-long keynote addresses is highly appealing.
Registration lasts all month as well. You’ve already missed out on several days of mini-lectures, but coming up are talks on security management, DB2 database management, storage management, Linux on System z, MICS resource management and more. Though most of the speakers are CA employees, the varied topics will surely overlap somebody’s interests. The CA Mainframe Twitter account has been listing the sessions, but you can find a full list on their website as well.
And in case you’re in the market for a trip across the states, here’s a fun little list of other trade shows coming up in the near future. I’m not sure how the SpaceCraft Technology Expo made it into the IT list, but hey, you know I’m game for space tech.
Behind every U.S. soldier in modern warfare, there is an arsenal of technological support. Does that extend to not-so-real-but-wish-they-were superheroes? My suspicion is that Superman’s Fortress of Solitude is powered by a super-secret data center. I mean, it was originally built in the Arctic and later in space for ultimate free cooling! And Batman? We know alter ego Bruce Wayne is the billionaire head of a huge defense corporation, so it’s not a stretch to assume at least some of the Batcave’s processing power comes from Wayne Industries. The X-Men’s danger room is probably the most powerful hologram in existence, so you know there’s a data center behind that.
Following this train of comic book logic, Oracle’s S.H.I.E.L.D. data center is not so farfetched. Though clearly a marketing gesture, the Avengers’ data center, packed with fancy hardware from Oracle, is impressive. S.H.I.E.L.D., the intelligence agency behind the Avengers, needed a new data center after its old one was conveniently destroyed, creating an opportunity to publicize Oracle’s gear and the new Avengers movie.
But it does make my nerd-brain wonder: what would really be required for superhero-scale data centers? I don’t think Superman’s current incarnation requires much computing power, though in days past he apparently did lots of scientific research. The large-scale Justice League and Avengers organizations (we won’t even mention the intergalactic Green Lantern Corps) probably would. Would they build green and get LEED certified? Who’ll foot the bill? More importantly, if these superheroes are still trying to remain secret, where would they find IT staff to support those data centers? Where’s our Super IT Admin comic book to answer these questions, huh?
…you’ll have to break a few eggs. The road to an environmentally friendly data center can be a bumpy one, which is not surprising when the path less traveled has innovative technologies and practices that haven’t been, ahem, road-tested.
Some, like the Emerson data center, have had trouble with data center management and asset availability in the face of strict, energy efficient power utilization guidelines. On the free cooling side, an incident at Facebook’s Prineville, Ore., facility shut down the data center when a control system program error brought condensation onto the servers and shorted out power supplies. Then there are the problems you might not see coming, like those at IBM’s Poughkeepsie Green Data Center, which discovered some capacity issues well into its existence. For every failure there is success, and both provide lessons for a more energy efficient data center.
Events like the Green Technologies Conference are tackling nagging problems — like overheating — for green data centers. There are other tips and tricks compiled by certification groups like the U.S. Green Building Council (USGBC) and the Green Grid to help reduce carbon footprints without sacrificing cooling and energy efficiency. Heck, even some of the big boys like Google and Facebook have put their designs and best practices out there for everyone to see. Though there can be a price to pay for innovation, it’s good to know you don’t have to go it alone when going green.