Despite blowback from refusing to supply services to whistleblower website WikiLeaks, controversy over congestion, uptime and customer service (or lack thereof), it seems that cloud computing giant Amazon Web Services (AWS) has never seen better days. The cloud provider is making new services and news announcements at a record clip. Here’s a roundup of the most significant recent announcements.
Cluster Compute GPU instances: Amazon made high-performance computing (HPC) headlines when it launched a special type of high-powered compute instance based on hardware normally found only in supercomputers. It’s impossible to fake or emulate the kinds of uses scientific and functional computing demands, so Amazon built a mini-supercomputer in its own Virginia data center and opened it up to the world. Now they’ve done it again, but this times it’s graphical processing unit (GPU) instances that are available.
Originally driven by the video game market, the chips that run video adapters have gotten so powerful that they’ve paved the way to new areas of HPC. Again, impossible to emulate, so Amazon has clearly laid out some serious capital to build a GPU-powered supercomputer to go along with the one that fires Cluster Compute instances.
This may speak to AWS’ operational maturity; their systems have evolved to the point where they can accommodate a variety of hardware in their billing, provisioning and management systems.
DNS in the cloud: Domain name servers, the street maps of the Internet, have long been the province of Web hosters and Internet service providers. They’re vital to delivering what customers want — Internet traffic — to the right place, and providers must be able to keep control of how traffic is distributed and account for that. Without access to DNS servers, you can’t properly control your email server, for instance.
Amazon, despite hosting one of the world’s signature collections of Internet traffic, hasn’t made DNS available to its customers; now it is. This might be due to the theory that as a pure infrastructure host, users should run their own. That’s led to angst when Amazon, the provider of record, gets blocked or banned by the Internet community for users’ misbehavior. It also effectively crippled the Elastic Load Balancing (ELB) service for many users, since they were unable to point their root domain at the ELB service. “Too complicated,” they were told.
The Route 53 DNS service lets users create zone files for their own domains, a little bit like handing the steering wheel back to the driver for a network admin. It’s based on popular free software (of course) djbdns and goes for $1 per month. Most website hosters provide DNS service for free. Regardless, AWS users are largely ecstatic over this, although it does not fix the zoning problems with Elastic Load Balancers quite yet.
UPDATE: Not all are ecstatic; Cantabrigian Unix developer Tony Finch writes a worthy critique of Route 53.
PCI DSS compliance in the cloud (sort of): This is a big deal to many. That’s because Visa and other credit card companies won’t let you take credit card payments, online or anywhere else, unless you can pass a PCI DSS audit. To date, Amazon hasn’t been able to do that, shutting it out of the market for e-commerce applications (in part). PCI DSS 2.0 was announced, which added new and vague but apparently satisfactory guidelines for virtualization; weeks later, Amazon is now PCI DSS Level 1 compliant.
Does this mean your new online business is automatically PCI compliant if you use Amazon to host it? Absolutely not. All it means is that it is now possible for merchants to consider using EC2 and S3 to process card payments. The responsibility of passing one’s own PCI audit hasn’t gone away. In the event of a data breach, you’re still on the hook if you store customer’s credit card info on AWS.
New SDKs for mobile developers: Amazon is committing to the support of mobile development, releasing new software development kits (SDKs) for the iPhone and Android devices. The SDKs facilitate writing apps that can connect directly to AWS’ infrastructure and do fun stuff. No one in their right mind who is developing for mobile isn’t already using AWS, parts of AWS or some other cloud service, so this isn’t exactly a brainteaser. It’s a significant show of support, however, and again demonstrates the increasing maturity of AWS as an environment.
CloudWatch updates: A raft of updates to AWS’ CloudWatch service, a rudimentary notification system which is less rudimentary now. A shining example of starting off with something that is basically kind of broken (the original CloudWatch was considerably more limited than the monitoring features built into your average microwave oven) and gradually becoming a truly useful part of the tool kit.
That includes features like threshold notifications, health checks and policy-based actions (like not auto-scaling up your application in response to unexpected traffic and leaving you with an eye-watering bill).
5 TB files in S3: Users now have the ability to upload files up to 5 TB in size, several orders of magnitude greater than the previous 5 GB limit. One has to assume they’ve made a major stride in their storage architecture, Dynamo. Jeff Barr, AWS evangelist, posits direct connections between a genome sequencer, S3 and the HPC cluster on EC2 for near-real time data processing.
Of course, keeping it there will cost you a solid $125 per month per TB, and getting it back out of S3 for archiving or other purposes will cost you as well. But if you don’t have a supercomputer handy to go with your genomics research institute, this may look pretty handy. Make sure none of your giant files are classified, too, or you might get “WikiLeaked”…
Whew. Exciting six months, right, kids? Nope, that was just the last three weeks. Crazy.
Web company Mixpanel delivered an informative tirade on why they are leaving Rackspace Cloud for Amazon Web Services (AWS) today. The story basically boils down to “AWS is better potting soil for Web apps,” although there are choice words for Rackspace support and operations failures as well.
Mixpanel makes an app that tracks your website’s use in some detail; it’s a tool for site operators and e-commerce types. It left Rackspace for a few significant reasons, one of which was the Elastic Block Store (EBS) feature of AWS, the ephemeral storage system linked to your virtual machines; another was the lack of a fully developed API for Rackspace. Big deal, Rackspace makes hay over customer wins, too.
What this highlights is the difference in the two offerings — Rackspace Cloud is much closer to traditional hosting, both in concept and design, than AWS. Go to the site, click on a button, get a server/website/whatever. You also have to deal with humans after a certain size, submitting a request to increase resources here and there.
AWS is a completely hands-off, completely blinded set of resources and rules that have much less to do with the way standard hosting operates; it’s fundamentally different even if the end result (you get a server) is the same.
Mixpanel wants (apparently) a generally new but now well-established concept; they want Web stuff and they want it all the time and everywhere. They mention Amazon’s superlative CDN, the range of instance sizes and so on, but it’s really the fact that you’re not actually dealing with infrastructure, except in the loosest concept, that’s pulling them over.
Storage and CPU and bandwidth are logically connected, but so loosely that you can’t really say it’s mimicking the operation of a physical facility. It’s just buckets of ability you buy, like power-ups in a video game or something. This is ideal for a Web application, since that’s how users are looking at the application, too. Maybe not so much for someone running a different kind of application. Encoding.com, for instance, chose Rackspace because their video encoding service needed Rackspace’s superior internal connectivity and CPU, not application flexibility.
Anyway, the fun part starts in the comment section of the blog, where users come on to gripe about AWS in almost the same way Mixpanel is griping about Rackspace; one developer said he was mysteriously slapped with charges over bandwidth that could possibly have occurred and is not unwilling to turn his test instance back on, since AWS simply refuses to address the issue. Sounds like some place where they put a premium on customer support might be a better fit — you know, where they have “fanatical support”…
On route to the Cloud Computing Expo this week, I ducked into Abiquo’s offices in Redwood City to catch up with CEO, Pete Malcolm.
By the end of the year he said the cloud management startup will pull in a second round of venture capital funding to add to the $5.1 million raised in March, 2010. His lips were sealed on the amount, but it will be enough to see the company through 2011/12.
Abiquo has 35 employees and somewhere between 10 and 50 customers using its cloud provisioning and automation software. Most of these are hosting companies, like BlueFire in Australia, which use the software as an enabling technology to sell more advanced cloud infrastructure services to its customers.
Enterprises have tested the software and Malcolm expects real deployments next year, once the budget for it kicks in. He said most companies did not have cloud in their budget in 2010 but will in 2011.
Abiquo just released the fourth version of its cloud management software, Abiquo 1.7, available in 45 days. The biggest new feature is a policy engine that allows organizations to allocate virtual resources based on different business and IT considerations including governance, security, compliance and cost — as well as a variety of utilization models. The business rules can be applied at multiple levels, and customized for individual physical data centers, racks, servers and storage, as well as virtual enterprises and virtual data centers.
CA, VMware, Cloud.com and Eucalyptus among many others are all vying for the same market as Abiquo, and it looks like 2011 is shaping up to be a crucial year for gaining market share.
Mark Russinovich — Microsoft technical fellow, a lead on the Azure platform and a renowned Windows expert — took pains at PDC ’10 (Watch the “Inside Windows Azure” session here) to lay out a detailed, high-level overview of the Azure platform and what actually happens when users interact with it.
The Azure cloud(s) is (are) built on Microsoft’s definition of commodity infrastructure. It’s “Microsoft Blades,” that is, bespoke OEM blade servers from several manufacturers. It’s probably Dell or HP, just saying, in dense racks. Microsoft containerizes its data centers now and pictures abound; this is only interesting to data center nerds anyway.
For systems managements nerds, here’s a 2006 presentation from Microsoft on the rudiments of shared I/O and blade design.
Azure considers each rack a ‘node’ of compute power and puts a switch on top of it. Each node — servers+top rack switch — is considered a ‘fault domain’ (see glossary, below), i.e., a possible point of failure. An aggregator and load balancers manage groups of nodes, and all feed back to the Fabric Controller (FC), the operational heart of Azure.
The FC gets it’s marching orders from the “Red Dog Front End” (RDFE). RDFE takes its name from nomenclature left over from Dave Cutler’s original Red Dog project that became Azure. The RDFE acts as kind of router for request and traffic to and from the load balancers and Fabric Controller.
Russinovich said that the development team passed an establishment called the “Pink Poodle” while driving one day. Red Dog was deemed more suitable, and Russinovich claims not to know what sort of establishment the Pink Poodle is.
How Azure works
Azure works like this:
- |___Aggregators and Load Balancers
- |___Fabric Controller
The Fabric Controller
The Fabric Controller does all the heavy lifting for Azure. It provisions, stores, delivers, monitors and commands the virtual machines (VMs) that make up Azure. It is a “distributed stateful application distributed across data center nodes and fault domains.”
In English, this means there are a number of Fabric Controller instances running in various racks. One is elected to act as the primary controller. If it fails, another picks up the slack. If the entire FC fails, all of the operations it started, including the nodes, keep running, albeit without much governance until it comes back online. If you start a service on Azure, the FC can fall over entirely and your service is not shut down.
The Fabric Controller automates pretty much everything, including new hardware installs. New blades are configured for PXE and the FC has a PXE boot server in it. It boots a ‘maintenance image,’ which downloads a host operating system (OS) that includes all the parts necessary to make it an Azure host machine. Sysprep is run, the system is rebooted as a unique machine and the FC sucks it into the fold.
The Fabric Controller is a modified Windows Server 2008 OS, as are the host OS and the standard pre-configured Web and Worker Role instances.
What happens when you ask for a Role
The FC has two primary objectives: to satisfy user requests and policies and to optimize and simplify deployment. It does all of this automatically, “learning as it goes” about the state of the data center, Russinovich said.
Log into Azure and ask for a new “Web Role” instance and what happens? The portal takes your request to the RDFE. The RDFE asks the Fabric Controller for the same, based on the parameters you set and your location, proximity, etc. The Fabric Controller scans the available nodes and looks for (in the standard case) two nodes that do not share a Fault Domain, and are thus fault-tolerant.
This could be two racks right next to each other. Russinovich said that FC considers network proximity and available connectivity as factors in optimizing performance. Azure is unlikely to pick nodes in two different facilities unless necessary or specified.
Fabric Controller, having found its juicy young nodes bursting with unused capacity, then puts the role-defining files at the host. The host OS creates the requested virtual machines and three Virtual Hard Drives (VHDs) (count ’em, three!): a stock ‘differencing’ VHD (D:\) for the OS image, a ‘resource’ VHD (C:\) for user temporary files and a Role VHD (next available drive letter), for role specific files. The host agent starts the VM and away we go.
The load balancers, interestingly, do nothing until the instance receives its first external HTTP communication (GET); only then is the instance routed to an external endpoint and live to the network.
The Platform as a Service part
Why so complicated? Well, it’s a) Windows and b) the point is to automate maintenance and stuff. The regular updates that Windows Azure systems undergoes — same as (within the specifications of what is running) the rest of the Windows world — happen typically about once a month and require restarting the VMs.
Now for the fun part: Azure requires two instances running to enjoy its 99.9% uptime service-level agreement (SLA), and that’s one reason why. Microsoft essentially enforces a high-availability, uninterrupted fault tolerance fire drill every time the instances are updated. Minor updates and changes to configuration do not require restarts, but what Russinovich called ‘VIP swaps’ do.
Obviously, this needs to be done in such a way that the user doesn’t skip a beat. A complicated hopscotch takes place as updates are installed to the resource VHD. One instance is shut down and the resource VHD updated, then the other one. The differencing VHDa makes sure new data that comes into the Azure service is retained and synced as each VM reboots.
Virtualization and security
What is it running on, we asked? Head scratching ensued for many moons as Microsoft pushed Hyper-V to customers but claimed Azure was not compatible or interoperable with Hyper-V.
It is, in fact, a fork of Hyper-V. Russinovich said it was basically tailored from the ground up for the hardware layout that Microsoft uses, same as the Azure OSes.
Russinovich said that the virtual machine is the security boundary for Azure. At the hypervisor level, the host agents on each physical machine are trusted. The Fabric Controller OSes are trusted. The guest agent- the part the user controls—is not trusted. The VMs communicate only through the load balancers and the public (user’s endpoint) IP and back down again.
Some clever security person may now appear and make fun of this scheme, but that’s not my job.
The Fabric Controller handles network security and Hyper-V uses machine state registries (MSRs) to verify basic machine integrity. That’s not incredibly rich detail, but its more than you knew five minutes ago and I guarantee its more than you know about how Amazon secures Xen. Here’s a little more on Hyper-V security.
New additions to Azure, like full admin rights on VMs (aka elevated privileges) justify this approach, Russinovich said. “We know for a fact we have to rely on this [model] for security,” he said.
Everyone feel safe and cozy? New user-built VM Roles are implemented a little differently
Azure now offers users the ability to craft their own Windows images and run them on Microsoft’s cloud. These VM Roles are built by you (sysprep recommended) and uploaded to your blob storage. When you create a service around your custom VMs and start the instances, Fabric Controller takes pains to redundantly ensure redundancy. It makes a shadow copy of your file, caches that shadow copy (in the VHD cacher, of course) and then creates the three VHDs seen above for each VM needed. From there, you’re on your own; Microsoft does not consider having to perform your own patches an asset in Azure.
A healthy host is a happy host
Azure uses heartbeats to measure instance health: It simply pings the Fabric Controller every few seconds and that’s that. Here again, fault tolerance is in play. You have two instances running (if you’re doing it right. Azure will let you run one, but then you don’t get the SLA). If one fails, the heartbeat times out, the differencing VHD on the other VM starts ticking over and Azure restarts the faulty VM, or recreates the configuration somewhere else. Then changes are synced and you’re back in business.
Do not end these processes
Now that we have the ability to RDP into our Azure Roles and monkey around, Russinovich helpfully explains that the processes Azure runs within the VM are WaAppHost.exe (Worker Role), WaWebHost.exe (Web Role), clouddrivesvc.exe (All Roles) and a handful of others, a special w3wp.exe for IIS configuration and so forth. All of these were previously restricted from user access but can be accessed via the new admin privileges.
Many of the features set out here are in development and beta but are promised to the end user soon. Russinovich noted that the operations outlined here still could change significantly. At any rate, his PDC session provided a fascinating look into how a cloud can operate, and it’s approximately eleventy bajillion percent more than I (or anyone else, for that matter) know about how Amazon Web Services or Google App Engine works.
Azure : Microsoft’s cloud infrastructure platform
Fabric Controller: A set of modified virtual Windows Server 2008 images running across Azure that control provisioning and management
Fault Domain: A set of resources within an Azure data center that are considered non-fault tolerant and a discrete unit, like a single rack of servers. A Service by default splits virtual instances across at least two Fault Domains.
Role: Microsoft’s name for a specific configuration of Azure virtual machine. The terminology is from Hyper-V.
Service: Azure lets users run Services, which then run virtual machine instances in a few pre-configured types, like Web or Worker Roles. A Service is a batch of instances that are all governed by the Service parameters and policy.
Web Role: An instance pre-configured to run Microsoft’s Web server technology Internet Information Services (IIS)
Worker Role: An instance configured not to run IIS but instead to run applications developed and/or uploaded to the VM by the end user
VM Role: User-created, unsupported Windows Server 2008 virtual machine images that are uploaded by the user and controlled through the user portal. Unlike Web and Worker Roles, these are not updated and maintained automatically by Azure.
Amazon CTO Werner Vogels said on Twitter that AWS does not oversubscribe its services.
“If you launch an instance type you get the performance you ask (and pay) for, period. No oversubscription,” he wrote. An earlier message said that CPU performance is fixed for each instance, and customers were granted access to the full amount of virtual CPU, an Amazon designated Elastic Compute Unit (ECU).
Why is this important? For one thing, it’s another data point about AWS operations dribbled out: the company is famously tight-lipped on even completely innocuous matters, let alone operational details. This allows some more inferences to be made about what AWS actually is.
Second, EVERYBODY oversubscribes, unless they explicitly say they don’t.
Oversubscription, in the IT world, originates with having a fixed amount of bandwidth and a user base that is greater than one. It stems from the idea that you have a total capacity of resources that a single user will rarely, if ever, approach. You tell the pool of users they all have a theoretical maximum amount of bandwidth, 1GBps on the office LAN, for instance.
Your average user consumes much less than that (<10 MBbs, say), so you are pretty safe if you say that 50 users can all use 10 MBps at the same time. This is oversubscription. Clearly, not everybody can have 1 GBps at once, but some can have it sometimes. Mostly, network management takes care of making sure nobody hogs all the bandwidth, or when congestion becomes an issue, more resources are ready. Why do this?
Oversubscription is a lot easier than having conversations that go like this:
Admin to office manager: “Well, yes, these wires CAN carry 1 GBps of data. But you only need about 10MBPs, so what we do is set up rules, so that…
What? No, you DO have the capacity. Listen, we can either hard limit EVERYONE to 10 MBps like it was, or we can let usage be elastic…What?
OK, fine. Everyone has a 1 GBps connection. Goodbye now.”
Problems with this model only arise when the provider does not have enough overhead on hand to comfortably manage surges in demand, i.e., they are lying about their capacity. Comcast and AT&T do this and get rightfully pilloried for fraud from time to time, airlines as well, and so on. that’s wrong.
Fundamentally, though, this is a business practice based on statistically sound math. It makes zero sense to give everyone 1000 feet of rope when 98% only ever need 36 inches.
And everybody does it
It’s also par for the course in the world of hosting. Bear in mind that a service provider is not lying if it promises you a single CPU core and 1 GB RAM, and then puts 100 customers on a box with 16 cores and 36 GB RAM. It is counting on the fact that most people’s servers and applications can comfortably run on a pocket calculator these days. When demand spikes, the service provider turns on another server and adds more power.
“Problem” customers, who use the advertised resources, go to dedicated boxes if needed, and everyone is happy. The provider thus realizes the vaunted economy of scale, and the customer is content. Service providers often don’t oversubscribe more expensive offerings as a marketing bullet point or to meet customer wishes and provide high touch customer service. It’s a premium to get your own box.
The fact that Amazon does not oversubscribe is indicative of a few things: first, it hasn’t altered its core Xen hypervisor that much, nor are users that far from the base infrastructure. Xen does not allow oversubscription per se, but of course Amazon could show customers whatever it wanted. (This is also largely true of VPS hosters, whose ‘slice’ offerings are often comparable to Amazon’s in price: ~$70/mo for a low end virtual server instance).
This allows us to make a much better guess about the size of Amazon’s Elastic Compute Cloud (EC2) infrastructure. Every EC2 instance gets a ‘virtual core,’ posited to be about the equivalent of a 1.2 GHz Intel or AMD CPU. Virtual cores are, by convention, no more than half a real CPU core. A dual core CPU equals four virtual cores, or four server instances. AWS servers are quad CPU, quad core, for the most part(this nugget is courtesy of Morphlabs’ Winston Damarillo, who built an AWS clone and studied their environment in detail). So, 16 cores and 32 virtual cores per server.
Guy Rosen, who runs the Jack of all Clouds blog, estimates the use of AWS regularly. In September 2010, AWS was home to 3,259 websites. In September-October 2009, Rosen came up with a novel way to count how many servers (each of which had at minimum one virtual core, or half a real CPU) Amazon provisions each day.
He said that AWS’s US-EAST region (one data center with 4 Availability Zones in it) launched 50,212 servers a day. At that time, AWS overall served 1,763 websites. Assume this growth is consistent, and Amazon is now serving 184% more instances. let’s say 93,000 server requests a day at US-EAST.
Physical infrastructure thus has to consist of at least 50,000 CPU cores at this point, although this is an inductive figure, not a true calculation. It is also quite conservative. Growth at AWS might have been better than double.That’s 3,125 actual servers to run those 50,000 nodes and 93,000 virtual machine instances.
Amazon’s cloud in Virginia runs on 3125 servers?
What? No Way.
Let’s be generous, and take into account the new HPC instances, all the overhead they must keep around, and factor in the use of large and extra large EC2 instances. We’ll give them 4,000 servers, 128,000 virtual CPUs.
US-EAST runs on 4,000 servers, or 100 racks. That could fit in 10,000 sq ft of data center, if someone really knew what they were doing. Equinix’s (just picking that name out of thin air) flagship DC/Virginia facilities operate 155,000 sq ft of Tier 3 space — if i’m even remotely in the ballpark, US-EAST, including cages and crash karts, could fit on one wall.
What was that about economies of scale again?
…and it was dense. I toured an AT&T Internet Data Center (IDC) facility this morning, in dear old Watertown, MA.
It’s impressive; lots of new, gleaming white PVC and galvanized steel tubing. The facility has the ability to shutdown its chillers in the winter and get free cooling when it’s below 45 degrees outside, thanks to modern plate and frame heat exchangers (they are bright blue, probably to set off the red water pumps that connect to the PVC chiller pipes. Red, white and blue!). Solar panels can provide about 75Kw, they have a handful of megawatts worth of generators, etc, and everything is shipshape and shiny.
On the floor (36″ raised) we saw customer cages and racks for AT&T’s co-location business, the massive switching and internet hub (four 10 GB pipes to the local backbone) and we saw the AT&T managed services infrastructure, a good dozen aisles of rack space, mostly full, with alternating hot/cold aisles with about a 20 degree difference between them.
This is where AT&T puts its virtualization, hosting, application services, managed services and yes, its Synaptic cloud services for the local area. Currently, AT&T offers Synaptic Storage as an on-demand, pay-as-you-go cloud service, and Synaptic Compute is live as a “controlled launch” with a limited number of users. Currently, AT&T Business Services says that cloud is its second heaviest investment area after mobile services.
AT&T currently runs all of its production services that are virtualized on VMware, but it plans to roll out additional hypervisor support very quickly. It is also pitching its “virtual private cloud” as an area of first interest to customers. AT&T has the ability, like Verizon, to offer truly dedicated cloud environments that do not have to be exposed to standard Internet traffic at all (it’s a telecom, it can do any dedicated circuit you want from copper POTS to fiber optic, if you pony up).
A couple of quick thoughts about what I saw: Density in the AT&T infrastructure was easily 10x that of customer colocated racks, both in use of space and the floorplan. It is clearly a beast of a different order than the individual server environments.
There was a LOT of empty space. AT&T has walled off about 70,000 square feet of raised floor for later; it has the space to double its capacity for cooling, power and switching, and it isn’t using a lot of the capacity its already got online. AT&T could treble its managed services infrastructure footprint in the current ready space and still take on co-lo and hosting customers.
That says to me that the market is ready to supply IT infrastructure, and cloud infrastructure, when the demand is there. Much ado is made over Google’s uncountable, bleeding-edge servers, and Amazon’s enormous cloud environment, or Microsoft’s data centers, but here in Watertown is the plain fact that the people who really do infrastructure are ready, and waiting, for the market. The long-term future of cloud infrastructures is probably in these facilities, and not with Google or Microsoft or Amazon.
Oh, and, Verizon has a building right next door. It’s bigger than AT&T’s.
The flexibility of cloud computing services appears to be extending to the physical infrastructure itself.
Researchers point to new examples of the rapidly maturing “shipping container data center” as proof. After all, if you can sign up and get a server at Amazon Web Services anytime you like, why shouldn’t you be able to order up a physical data center almost the same way?
Rob Gillen, a researcher at the Oak Ridge National laboratories in Tennessee, is part of a team researching and building a private cloud out of commodity x86 servers to support the operations of Jaguar, ORNL’s supercomputer. It is the fastest and most powerful system in the world right now, but it’s not exactly a good fit for day-to-day computing needs.
“If you look at our Jaguar server here with 225,000 cores, frankly there’s only a few people in the world smart enough to write code that will work really well on that,” said Rob Gillen, researcher for cloud computing technologies at ORNL. Gillen is working on both the overall goal of private cloud and is heavily involved in exploring the use of Microsoft Azure, the Platform as a Service.
He said ORNL is working to develop a self-service, fully virtualized environment to handle less important or less intensive tasks, like post-processing of results from workloads run on Jaguar, and long term storage and delivery of that data.
Gillen said the advantages of using standard, widely available hardware and virtualization technologies to make a pool of resources available, a la Amazon Web Services, was very simple. There was a clear divide in raw computing power, but the pool of available programmers, not to mention existing software tools, was much wider using commodity-type services.
“If you have the opportunity to use fixed, Infiniband gear, generally your scientific problems are going to express themselves better over that,” he said. “The commodity nature[of private clouds] is tough for scientists to grapple with, but the range of solutions gets better.”
Hadoop, the massive multi-parallel next generation database, might do a much better job of processing data from Jaguar, for example, and a researcher wouldn’t need to tie up critically valuable supercomputer time noodling around with different ways to explore all that data.
“The raw generation is done on the supercomputers but much of the post processing is really done on commodity, cloud environment,” said Gillen. But, he’s chronically short of space and wants more servers for the ORNL cloud.
That’s where cloud on wheels comes in; Gillen has been looking at a demo container data center from SGI, called the ICE Cube, which is a standard shipping container with a lot of servers in it.
Gillen’s photos and a video of the interior of the unit are a treat for the gearheads:
It gets put down anywhere there’s space and half a megawatt or so of power. Just add water and presto, instant data center. It might not be pretty, but it’s a less expensive way to get data center space.
“We’re space constrained and that’s one possibility,” said Gillen.
Gillen said that the containerized data center market was pretty well established by now, but offerings from HP and IBM were usually designed to adapt to a traditional data center management process. They had standard power hookup, standard rack equipment, and put a high degree of emphasis on customer access. “Some vendors like HP or IBM really want it to fit into the traditional data center so they optimize them for that.”
SGI’s demo box is a little different. It’s built to do nothing but pack as many commodity x86 servers inside as possible, with unique cooling and rack designs that include DC bus bars connecting directly to server boards (no individual power supplies) and refrigeration ducts that run the length of each rack (no CPU coolers).
Gillen said that means it’s ideally suited for getting a medium-sized private cloud (anywhere from 15,000 to 45,000 cores) in a hurry. He also noted that containerized data centers are available in a wide variety of specialized configurations already.
“We are looking at it specifically in the context of our cloud computing projects but over the last two days a lot of people from other areas have been walking through it,” he said.
Is anyone else amused by Verizon’s puffed up claims to dominance in the cloud computing market? In the wake of the vCloud Director unveiling at VMworld 2010, industry analysts made a huge fuss of VMware’s announcement that Verizon has joined its vCloud service provider program. I, on the other hand, am not impressed.
No doubt landing one of the top telecom providers in the world is a coup from a PR perspective, but so far the partnership is a big paper tiger if you’re an IT shop looking to do anything real with this news.
The press release claims that, with “the click of a mouse,” customers can expand their internal VMware environments to Verizon’s Compute as a Service (CaaS) offering built on VMware vCloud Data Center, for instant, additional capacity. The overall effect is referred to as a hybrid cloud.
The immediacy and ease touted here is far from true; ironically, I learned this during a session at VMworld entitled “Cloud 101: What’s real, what’s relevant for enterprise IT and what role does VMware play.”
The speaker said that to move a workload from internal VMware resources to a vCloud service provider such as Verizon is currently a manual process. It require users to shut down the to-be-migrated workload, select the cloud you’ll deploy it to, then switch to the Web interface of that service provider and import the workload. I am leaving out a bunch of other steps too tedious to mention, but it’s hardly the click of a mouse!
In a follow-up conversation after the session, VMware said the missing feature that will allow automated workload migration, called the vCloud client plug-in, was still to come. No timeframe was given.
And this isn’t all the smoke and mirrors from Verizon; the telco claims its CaaS is the first cloud service to offer PCI compliance. This statement isn’t quite either because the current PCI standard, v1.2, does not support virtual infrastructures. So a real cloud infrastructure (a multi-tenant, virtualized resource) cannot be PCI compliant. The PCI Council is expected to announce v2.0 of the standard at the end of October, which will explain how to obtain PCI compliance in a virtual environment.
A word of advice to IT shops investigating hybrid cloud options: Be sure to play around with the service before you buy. In many cases, these offerings are still only half-baked.
A story we wrote last week about Amazon’s newest disclosures on its security procedures was sparked in part by a earful from one of the sources in it. Seeking reactions to the newly updated “Overview of Security Processes,” I expected a guarded statement that the paper was a good general overview of how Amazon Web Services approached security, but pertinent technical details would probably only be shared with customers who requested them, and Amazon didn’t want to give too much away.
Instead, what I heard was that Amazon not only does not disclose relevant technical information but it apparently also does not understand what customers are asking for. Potential clients were both refused operational security details and also told wildly different answers on whether or not AWS staff could access data stored in users’ S3 accounts: “No, never,” and “Yes, under some circumstances.” That’s, um, kind of a big deal. They also refuse to indemnify themselves against potential failures and data loss as a matter of course.
Typically, a big enterprise IT organization has a set of procedures and policies it has to follow when provisioning infrastructure; charts are made, checkboxes checked, and someone, somewhere, will eventually claim that information and park it somewhere. This includes minor details like “who can access our data and how,” and “how does a service provider protect our assets and will they compensate us if they fail.” A big customer and a provider will sit down, discuss how the hoster can meet the needs of the organization, assign a value to the business revenue being generated for the enterprise, and agree to pay that amount for any outages.
Everybody is aware of this
Even their biggest fans are somewhat down on AWS for this. Cloud consultant Shlomo Swidler said in an email that Amazon’s efforts to brush up their security picture, like the launch of the AWS Vulnerability Reporting and Penetration Testing program, was the right idea, but Amazon had neutered it by not letting customers use it in a meaningful way. “Without a way to test how things will really behave under simulated attack conditions — including the AWS defensive responses — I don’t understand what will happen under real attack conditions,” he said. The Vulnerability Reporting and Penetration Testing program can reportedly only be used with pre-approval from AWS staff, meaning it can never simulate an in-the-wild attack.
Others are more charitable, and point to Amazon’s track record. IT security auditor Andrew Plato was asked about the new white paper and responded via email.
“From what’s in there, they seem to be doing the right things. They’ve got a good risk management framework, good firewalls, monitoring, they’re following ISO and COBIT , They’ve got change management; they seem to be doing all the good practices that we advise clients to do,” said Plato, president of Anitian Enterprise Security. But he noted that all we had to go on was Amazon’s good word. ”The long and short of it is the content says they’re doing the right things — now, they could be lying,” he said, tongue only partly in cheek.
Plato isn’t worried about Amazon’s security. I’m positive they aren’t lying about anything in their white paper. Nobody should be worried; they have an amazing track record, but we’ll never know, at this rate, exactly what they’re so proud of.
The problem is enterprises are picky
Here’s the problem: IT does not work like baby shoes and garden rakes. It’s not enough to just deliver the goods. You have to show your work, or the IT practitioner cannot trust what you are giving him, at a certain level. All hosting providers know this, and they are proud to show off what they’ve done. After all, they’ve spent a lot of money to get best-in-class gear so they can make money off it.
Hell, Rackspace will drag a hobo off the street to show them around the data center, they’ll talk your ear off; you’ll know what color socks the hard drive guy is wearing on Tuesdays if that’s important to you.
Now, it’s OK that Amazon doesn’t work quite that way. We all understand that the amazing feat they have managed to pull off is to offer real-time self-service IT and charge for it by the hour, and that users are responsible for their own foolishness, and Amazon backs only access and uptime. Most of Amazon’s customers are more than happy with that; they can’t afford to care about what kind of firewall and load balancers run the AWS cloud.
But if Amazon is going to compete for the enterprise customer, and they are explicit that they are trying for those customers, they are going to have to get over it and spill the beans. Not to me, although that would be nice, and not to their competition (though that’s hardly relevant now since their nearest cloud competitor, Rackspace, is apparently $400 million dollars shy of eating their lunch) but definitely to enterprise customers. It’s a fact of life. Enterprises won’t come unless you play their ball game.
There are all sorts of ways AWS can address this without giving away the goose. CloudAudit is one idea; that’s self-service security audits on an API; it fits right in to the AWS worldview. Talking to analysts and professionals under NDA is another. AWS must at the very least match what other service providers offer if it is sincere in competing for enterprise users.
Googler Vijay Gill posted a quick and dirty cloud calculator a few weeks ago that has caused some head scratching. The calculator seems to show an eye popping 168% premium on using AWS versus co-locating your own servers–$118,248/year for AWS XL instances and $70,079.88 for operating a co-lo with equivalent horsepower.
Can that really be the case? AWS isn’t cheap web hosting, it’s mid-tier VPS hosting, price-wise, if you’re talking about using it consistently year over year, and those are definitely cheaper than co-lo. Gill says $743,000 to buy and install your servers, so he’s got the investment figures in there.
Editor Matt Stansberry asked an expert on data center practices and markets that questions and was told:
“There is a point at where this is a very good exercise, but the way it was undertaken was grossly inaccurate,”
That’s Tier1 analyst Antonio Piraino, who points out that not only did Gill not spell out neccessary assumptions, he took Amazon’s retail price as the base cost, and Amazon will cut that in half if a user makes a year or multi-year commitment.
But is it fair to make the comparison in the first place?
Some people will choose Amazon for large-scale, long term commitments, but they will be in the diminous minority. There are far better options for almost anyone in hosting right now. The hosting market has been mature for the better part of a decade, and cloud has many years to go on that front.
AWS isn’t hosting or co-lo, obviously; it’s cloud. First, lots of people pick off the bits they want, like using S3 for storage. That is surely less expensive that co-locating your own personal SAN for data archive or second tier storage (first tier if you’re a web app). That’s the absolutely astounding innovation that AWS has shown the world; they sell any part of the compute environment by the hour, independent of all the other parts.
Second the whole point of AWS is that you can get the entire equivalent of that $743,000 co-lo hardware, running full bore, no cable crimpers or screwdrivers needed, in a few hours (if you’re tardy) without having to buy a thing. Building out a co-lo takes months and months.
So it’s a little off base and what’s the point? To prove that Amazon can be expensive? Not a shock. Renting an apartment can seem like a waste of money if you own a home, not so much if you need a place to live.