SimpleCDN had a simple premise: people will buy into a cheap, reliable content distribution network (CDN). Turns out it was too cheap, and maybe too cavalier with its choice of customers. As a result, the service was booted off its hosting provider, leaving thousands of users without access to massive amounts of digital content. It’s a microcosm of all the things that can go wrong in the cloud model.
SimpleCDN went dark for the majority of its customers on Saturday, Dec. 11, followed by an angry, terse explanation from Frank Wilson, senior engineer for SimpleCDN. He said that his company had been summarily booted from its hosting infrastructure at Texas-based SoftLayer, which does dedicated and cloud hosting in three locations in the U.S. SimpleCDN had bought its SoftLayer capacity through a reseller called 100TB.com, a subsidiary of the UK2 Group. Customers are being pushed to a competitor.
100TB.com offered unlimited, unmetered network access at no extra charge, unlike most hosting providers, claiming you could use up to 100 TB of transit every month and still pay only roughly average VPS hosting costs ($600 per month for a decent quad-core server).
“I think they were doing about 30 Gbps sustained at the end,” mused Jason Read, professional cloud watcher at CloudHarmony. Read noted that SimpleCDN was doing business with a second-tier hoster at a level most people would consider a full-on DDoS, all day, every day. He added that SimpleCDN was able to offer its bargain prices because of 100TB.com’s marketing. “[SimpleCDN] was kind of milking that 100 TB unlimited bandwidth offer and I think SoftLayer told UK2 they had to amend their terms,” said Read.
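Back-of-envelope arithmetic shows just how far outside the 100 TB allowance that kind of traffic runs. A rough sketch, using decimal units; Read's 30 Gbps estimate is the only figure taken from above:

```python
# How much transfer does 30 Gbps sustained move in a month, versus the
# 100 TB "unlimited" allowance? (Illustrative arithmetic, decimal units.)
SECONDS_PER_MONTH = 30 * 24 * 3600   # ~2.59 million seconds
SUSTAINED_GBPS = 30                  # Read's estimate, gigabits per second

# 1 TB = 8,000 gigabits in decimal units
tb_per_month = SUSTAINED_GBPS * SECONDS_PER_MONTH / 8000

print(f"{tb_per_month:,.0f} TB/month")              # roughly 9,700 TB
print(f"{tb_per_month / 100:.0f}x the 100 TB cap")  # roughly 97x
```

In other words, if Read's figure is even close, SimpleCDN was pushing nearly two orders of magnitude more traffic than the offer nominally covered.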
That may have happened, of course. SimpleCDN’s Wilson thinks SoftLayer was getting massively undersold on its own CDN business, which is much more expensive than SimpleCDN’s, or that it was simply getting slammed by his booming traffic.
“…our best guess currently is that these organizations could not provide the services that we contracted and paid for, so instead they decided that terminating services would be the best solution for them,” he said.
Of course, CDN services are an incidental part of SoftLayer’s hosting business, probably a tiny percentage of its revenue; something they offer because they can or because customers are asking for it. Many hosters do the same and consider it a value add rather than a critical part of the business. Same with UK2 — they were reselling Akamai as a CDN offering. In the CDN market, there’s Limelight and Akamai, and then there’s everyone else. SoftLayer also lives in Dallas, one of the world’s hubs for Internet connectivity. If they were running out of bandwidth, they could simply buy more and sell it to the UK2 Group. If anyone was getting killed in this deal it was UK2, the middleman.
What led to UK2 terminating SimpleCDN was the nature of the traffic it served. SimpleCDN hosted a lot of live video streams, and anyone who’s taken even a cursory look into it knows that there is a booming business in streaming pirated content to U.S. audiences; some Web communities have users that will post entire seasons of a TV show or movies to an online service for anyone to watch for free, unauthorized marathons that entertainment companies take a very dim view of.
CDNs know this, of course, and “monitor” their networks by pulling streams when they get a DMCA takedown notice, which doesn’t mean a thing to the 99 other pirate streams going at the same time. SimpleCDN even had an automated DMCA action form. Somebody out there got sick of playing whack-a-mole with SimpleCDN and went straight to the provider of record, SoftLayer.
SoftLayer would not comment officially for this story except to say that SimpleCDN was not their customer but rather UK2’s and they had no commercial relationship with SimpleCDN. However, they are still the ones hosting all this content and it can be assumed they got a DMCA notice, which they are going to take VERY seriously, because unlike UK2 or SimpleCDN, they have actual physical infrastructure and assets. They would have gone to UK2 and said, “We are holding you responsible for this.” Wilson’s letter says that UK2 accused him of content violations and changed their Terms of Service (ToS) on the fly to put him in violation.
Why did 100TB/UK2 change the rules of the game, instead of duly passing along SoftLayer’s DMCA, as they should have done? SimpleCDN was murder on their bottom line. Their “free bandwidth” offer was a fiction when put to the test. SimpleCDN took them at face value, ran hundreds of servers and hosted thousands of terabytes of data with them, but it was gone in a flash, because it was based on a false economic premise and shady marketing.
So the lesson for cloud is two-fold: one is that “too good to be true” usually is. If SimpleCDN had started directly with SoftLayer, it would have had to pay those bandwidth costs and its prices wouldn’t have been so attractive. Likewise, the DMCA issues would have had one less hop.
Second, for the business user, it’s a new wrinkle in vetting a service. Popular online services haven’t been known to pop like a soap bubble and vanish overnight, taking massive amounts of data with them; that’s the province of shady warehouse distribution operations and basement stock brokerages. Now they do, fueled by the explosion of middlemen and easy access that drives cloud computing. It’s not enough to examine whether a provider is sound; you have to make sure you understand who they rely on too.
UPDATE: Both UK2 Group and SimpleCDN were contacted by phone and email for this article but neither responded by press time.
Startup CloudSwitch has released version 2.0 of its software that lets enterprise users connect their private data centers with cloud computing services and, crucially, extends their internal security policies into the cloud.
With CloudSwitch, applications remain integrated with enterprise data center tools and policies, and are managed as if they were running locally, the company claims.
The new features in Enterprise 2.0 include:
Provisioning of new virtual machines in the cloud (in addition to migration of existing ones), through:
- Network boot support
- ISO support (CD-ROM/DVD)
Web services and command-line interfaces for programmatic scaling to meet peak demands
Broader networking options to extend enterprise network topologies into the cloud:
- Layer-2 connectivity with option for Layer-3 support through software-based firewall/load balancing in the cloud
- Public IP access to cloud resources with full enterprise control
- Multi-subnet support
Enhanced user interface support for better scalability, control and ease of use
Broader geographic coverage (Terremark vCloud Express & eCloud, Amazon EC2 East, West, EU and Asia Pacific regions).
CloudSwitch officials said the company has landed pharmaceutical giant Novartis and Orange San Francisco, a subsidiary of telecommunications operator Orange. It has about 10 customers in total from pharma, retail and financial services. These customers are using CloudSwitch for a range of use cases including cluster scale-outs, web application hosting, application development and testing, and labs on demand.
CloudSwitch Enterprise 2.0 is available now, with a free 15-day trial. Pricing begins at $25,000 for an annual license including basic support and up to 20 concurrent virtual machines under management in the cloud. Additional server packs are available for scaling. Cloud usage fees are paid separately to the cloud provider.
Despite blowback from refusing to supply services to whistleblower website WikiLeaks, controversy over congestion, uptime and customer service (or lack thereof), it seems that cloud computing giant Amazon Web Services (AWS) has never seen better days. The cloud provider is making new services and news announcements at a record clip. Here’s a roundup of the most significant recent announcements.
Cluster Compute GPU instances: Amazon made high-performance computing (HPC) headlines when it launched a special type of high-powered compute instance based on hardware normally found only in supercomputers. It’s impossible to fake or emulate the kind of performance scientific and technical computing demands, so Amazon built a mini-supercomputer in its own Virginia data center and opened it up to the world. Now they’ve done it again, but this time it’s graphical processing unit (GPU) instances that are available.
Originally driven by the video game market, the chips that run video adapters have gotten so powerful that they’ve paved the way to new areas of HPC. Again, impossible to emulate, so Amazon has clearly laid out some serious capital to build a GPU-powered supercomputer to go along with the one that runs Cluster Compute instances.
This may speak to AWS’ operational maturity; their systems have evolved to the point where they can accommodate a variety of hardware in their billing, provisioning and management systems.
DNS in the cloud: Domain name servers, the street maps of the Internet, have long been the province of Web hosters and Internet service providers. They’re vital to delivering what customers want — Internet traffic — to the right place, and providers must be able to keep control of how traffic is distributed and account for that. Without access to DNS servers, you can’t properly control your email server, for instance.
Amazon, despite hosting one of the world’s signature collections of Internet traffic, hadn’t made DNS available to its customers; now it has. The omission might be explained by the theory that, as a pure infrastructure host, Amazon should leave users to run their own. That’s led to angst when Amazon, the provider of record, gets blocked or banned by the Internet community for users’ misbehavior. It also effectively crippled the Elastic Load Balancing (ELB) service for many users, since they were unable to point their root domain at the ELB service. “Too complicated,” they were told.
The Route 53 DNS service lets users create zone files for their own domains, a bit like handing the steering wheel back to the driver, if the driver is a network admin. It’s based (of course) on the popular free software djbdns and goes for $1 per month. Most website hosters provide DNS service for free. Regardless, AWS users are largely ecstatic over this, although it does not fix the zoning problems with Elastic Load Balancers quite yet.
UPDATE: Not all are ecstatic; Cantabrigian Unix developer Tony Finch writes a worthy critique of Route 53.
PCI DSS compliance in the cloud (sort of): This is a big deal to many. That’s because Visa and other credit card companies won’t let you take credit card payments, online or anywhere else, unless you can pass a PCI DSS audit. To date, Amazon hasn’t been able to do that, shutting it out of the market for e-commerce applications (in part). PCI DSS 2.0 was announced, which added new and vague but apparently satisfactory guidelines for virtualization; weeks later, Amazon is now PCI DSS Level 1 compliant.
Does this mean your new online business is automatically PCI compliant if you use Amazon to host it? Absolutely not. All it means is that it is now possible for merchants to consider using EC2 and S3 to process card payments. The responsibility of passing one’s own PCI audit hasn’t gone away. In the event of a data breach, you’re still on the hook if you store customers’ credit card info on AWS.
New SDKs for mobile developers: Amazon is committing to the support of mobile development, releasing new software development kits (SDKs) for the iPhone and Android devices. The SDKs facilitate writing apps that can connect directly to AWS’ infrastructure and do fun stuff. No one in their right mind who is developing for mobile isn’t already using AWS, parts of AWS or some other cloud service, so this isn’t exactly a brainteaser. It’s a significant show of support, however, and again demonstrates the increasing maturity of AWS as an environment.
CloudWatch updates: A raft of updates to AWS’ CloudWatch service, a rudimentary notification system which is less rudimentary now. A shining example of starting off with something that is basically kind of broken (the original CloudWatch was considerably more limited than the monitoring features built into your average microwave oven) and gradually becoming a truly useful part of the tool kit.
That includes features like threshold notifications, health checks and policy-based actions (so that, for instance, unexpected traffic doesn’t auto-scale your application up and leave you with an eye-watering bill).
5 TB files in S3: Users now have the ability to upload files up to 5 TB in size, several orders of magnitude greater than the previous 5 GB limit. One has to assume they’ve made a major stride in their storage architecture, Dynamo. Jeff Barr, AWS evangelist, posits direct connections between a genome sequencer, S3 and the HPC cluster on EC2 for near-real time data processing.
Of course, keeping it there will cost you a solid $125 per month per TB, and getting it back out of S3 for archiving or other purposes will cost you as well. But if you don’t have a supercomputer handy to go with your genomics research institute, this may look pretty attractive. Make sure none of your giant files are classified, too, or you might get “WikiLeaked”…
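The new limit leans on S3's multipart upload feature, which splits big objects into parts of up to 5 GB each, with a cap of 10,000 parts per object. A quick sketch of the arithmetic, reusing the article's $125/TB-month figure (illustrative only):

```python
# Feasibility check for a maximum-size S3 object under multipart upload
# limits (5 GB max per part, 10,000 parts max), plus carrying cost at the
# article's $125 per TB-month figure.
MAX_OBJECT_TB = 5
MAX_PART_GB = 5
MAX_PARTS = 10_000

object_gb = MAX_OBJECT_TB * 1024          # binary units: 5,120 GB
min_parts = -(-object_gb // MAX_PART_GB)  # ceiling division: 1,024 parts
assert min_parts <= MAX_PARTS             # comfortably within the limit

monthly_cost = MAX_OBJECT_TB * 125
print(f"{min_parts} parts minimum, ${monthly_cost}/month to keep it there")
```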
Whew. Exciting six months, right, kids? Nope, that was just the last three weeks. Crazy.
Web company Mixpanel delivered an informative tirade on why they are leaving Rackspace Cloud for Amazon Web Services (AWS) today. The story basically boils down to “AWS is better potting soil for Web apps,” although there are choice words for Rackspace support and operations failures as well.
Mixpanel makes an app that tracks your website’s use in some detail; it’s a tool for site operators and e-commerce types. It left Rackspace for a few significant reasons, one of which was the Elastic Block Store (EBS) feature of AWS, the persistent block storage volumes you attach to your virtual machines; another was the lack of a fully developed API for Rackspace. Big deal, Rackspace makes hay over customer wins, too.
What this highlights is the difference in the two offerings — Rackspace Cloud is much closer to traditional hosting, both in concept and design, than AWS. Go to the site, click on a button, get a server/website/whatever. You also have to deal with humans after a certain size, submitting a request to increase resources here and there.
AWS is a completely hands-off, completely blinded set of resources and rules that have much less to do with the way standard hosting operates; it’s fundamentally different even if the end result (you get a server) is the same.
Mixpanel wants (apparently) a relatively new but now well-established concept; they want Web stuff and they want it all the time and everywhere. They mention Amazon’s superlative CDN, the range of instance sizes and so on, but it’s really the fact that you’re not actually dealing with infrastructure, except in the loosest sense, that’s pulling them over.
Storage and CPU and bandwidth are logically connected, but so loosely that you can’t really say it’s mimicking the operation of a physical facility. It’s just buckets of ability you buy, like power-ups in a video game or something. This is ideal for a Web application, since that’s how users are looking at the application, too. Maybe not so much for someone running a different kind of application. Encoding.com, for instance, chose Rackspace because their video encoding service needed Rackspace’s superior internal connectivity and CPU, not application flexibility.
Anyway, the fun part starts in the comment section of the blog, where users come on to gripe about AWS in almost the same way Mixpanel is griping about Rackspace; one developer said he was mysteriously slapped with charges for bandwidth that couldn’t possibly have been used and is now unwilling to turn his test instance back on, since AWS simply refuses to address the issue. Sounds like some place where they put a premium on customer support might be a better fit — you know, where they have “fanatical support”…
En route to the Cloud Computing Expo this week, I ducked into Abiquo’s offices in Redwood City to catch up with CEO Pete Malcolm.
He said that by the end of the year the cloud management startup will pull in a second round of venture capital funding to add to the $5.1 million raised in March 2010. His lips were sealed on the amount, but it will be enough to see the company through 2011/12.
Abiquo has 35 employees and somewhere between 10 and 50 customers using its cloud provisioning and automation software. Most of these are hosting companies, like BlueFire in Australia, which use the software as an enabling technology to sell more advanced cloud infrastructure services to their customers.
Enterprises have tested the software and Malcolm expects real deployments next year, once the budget for it kicks in. He said most companies did not have cloud in their budget in 2010 but will in 2011.
Abiquo just announced the fourth version of its cloud management software, Abiquo 1.7, which will be available in 45 days. The biggest new feature is a policy engine that allows organizations to allocate virtual resources based on different business and IT considerations including governance, security, compliance and cost — as well as a variety of utilization models. The business rules can be applied at multiple levels, and customized for individual physical data centers, racks, servers and storage, as well as virtual enterprises and virtual data centers.
CA, VMware, Cloud.com and Eucalyptus among many others are all vying for the same market as Abiquo, and it looks like 2011 is shaping up to be a crucial year for gaining market share.
Mark Russinovich — Microsoft technical fellow, a lead on the Azure platform and a renowned Windows expert — took pains at PDC ’10 (Watch the “Inside Windows Azure” session here) to lay out a detailed, high-level overview of the Azure platform and what actually happens when users interact with it.
The Azure cloud(s) is (are) built on Microsoft’s definition of commodity infrastructure: “Microsoft Blades,” that is, bespoke OEM blade servers from several manufacturers (probably Dell or HP, just saying) in dense racks. Microsoft containerizes its data centers now and pictures abound; this is only interesting to data center nerds anyway.
For systems managements nerds, here’s a 2006 presentation from Microsoft on the rudiments of shared I/O and blade design.
Azure considers each rack a ‘node’ of compute power and puts a switch on top of it. Each node — servers plus top-of-rack switch — is considered a ‘fault domain’ (see glossary, below), i.e., a possible point of failure. An aggregator and load balancers manage groups of nodes, and all feed back to the Fabric Controller (FC), the operational heart of Azure.
The FC gets its marching orders from the “Red Dog Front End” (RDFE). RDFE takes its name from nomenclature left over from Dave Cutler’s original Red Dog project that became Azure. The RDFE acts as a kind of router for requests and traffic to and from the load balancers and Fabric Controller.
Russinovich said that the development team passed an establishment called the “Pink Poodle” while driving one day. Red Dog was deemed more suitable, and Russinovich claims not to know what sort of establishment the Pink Poodle is.
How Azure works
Azure works like this:

RDFE (Red Dog Front End)
|___Aggregators and Load Balancers
    |___Fabric Controller
        |___Nodes (racks of servers)
The Fabric Controller
The Fabric Controller does all the heavy lifting for Azure. It provisions, stores, delivers, monitors and commands the virtual machines (VMs) that make up Azure. It is a “distributed stateful application distributed across data center nodes and fault domains.”
In English, this means there are a number of Fabric Controller instances running in various racks. One is elected to act as the primary controller. If it fails, another picks up the slack. If the entire FC fails, all of the operations it started, including the nodes, keep running, albeit without much governance until it comes back online. If you start a service on Azure, the FC can fall over entirely and your service is not shut down.
The Fabric Controller automates pretty much everything, including new hardware installs. New blades are configured for PXE and the FC has a PXE boot server in it. It boots a ‘maintenance image,’ which downloads a host operating system (OS) that includes all the parts necessary to make it an Azure host machine. Sysprep is run, the system is rebooted as a unique machine and the FC sucks it into the fold.
The Fabric Controller is a modified Windows Server 2008 OS, as are the host OS and the standard pre-configured Web and Worker Role instances.
What happens when you ask for a Role
The FC has two primary objectives: to satisfy user requests and policies and to optimize and simplify deployment. It does all of this automatically, “learning as it goes” about the state of the data center, Russinovich said.
Log into Azure and ask for a new “Web Role” instance and what happens? The portal takes your request to the RDFE. The RDFE asks the Fabric Controller for the same, based on the parameters you set and your location, proximity, etc. The Fabric Controller scans the available nodes and looks for (in the standard case) two nodes that do not share a Fault Domain, and are thus fault-tolerant.
This could be two racks right next to each other. Russinovich said that FC considers network proximity and available connectivity as factors in optimizing performance. Azure is unlikely to pick nodes in two different facilities unless necessary or specified.
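The placement rule is easy to picture in code. A minimal sketch (my illustration, not Microsoft's implementation; the node names and labels are made up) of choosing two nodes that never share a fault domain:

```python
# Pick nodes for a role's instances such that no two share a fault domain,
# mirroring the constraint Russinovich describes. Toy example only.
from itertools import combinations

def place(nodes, instances_needed=2):
    """nodes: list of (node_id, fault_domain). Return a list of node ids
    spanning distinct fault domains, or None if that's impossible."""
    for combo in combinations(nodes, instances_needed):
        domains = {fd for _, fd in combo}
        if len(domains) == instances_needed:  # no two share a fault domain
            return [node_id for node_id, _ in combo]
    return None

# Two racks (fault domains) with spare capacity:
nodes = [("rack1-blade3", "FD-1"), ("rack1-blade7", "FD-1"),
         ("rack2-blade1", "FD-2")]
print(place(nodes))   # → ['rack1-blade3', 'rack2-blade1']
```

The real FC also weighs proximity and connectivity, as Russinovich notes; this captures only the fault-domain constraint.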
Fabric Controller, having found its juicy young nodes bursting with unused capacity, then puts the role-defining files on the host. The host OS creates the requested virtual machines and three Virtual Hard Drives (VHDs) (count ’em, three!): a stock ‘differencing’ VHD (D:\) for the OS image, a ‘resource’ VHD (C:\) for user temporary files and a Role VHD (next available drive letter) for role-specific files. The host agent starts the VM and away we go.
The load balancers, interestingly, do nothing until the instance receives its first external HTTP communication (GET); only then is the instance routed to an external endpoint and live to the network.
The Platform as a Service part
Why so complicated? Well, it’s a) Windows and b) the point is to automate maintenance and stuff. The regular updates that Windows Azure systems undergo — same as the rest of the Windows world (within the specifications of what is running) — typically happen about once a month and require restarting the VMs.
Now for the fun part: Azure requires two instances running to enjoy its 99.9% uptime service-level agreement (SLA), and that’s one reason why. Microsoft essentially enforces a high-availability, uninterrupted fault tolerance fire drill every time the instances are updated. Minor updates and changes to configuration do not require restarts, but what Russinovich called ‘VIP swaps’ do.
Obviously, this needs to be done in such a way that the user doesn’t skip a beat. A complicated hopscotch takes place as updates are installed to the resource VHD. One instance is shut down and its resource VHD updated, then the other one. The differencing VHD makes sure new data that comes into the Azure service is retained and synced as each VM reboots.
Virtualization and security
What is it running on, we asked? Head scratching ensued for many moons as Microsoft pushed Hyper-V to customers but claimed Azure was not compatible or interoperable with Hyper-V.
It is, in fact, a fork of Hyper-V. Russinovich said it was basically tailored from the ground up for the hardware layout that Microsoft uses, same as the Azure OSes.
Russinovich said that the virtual machine is the security boundary for Azure. At the hypervisor level, the host agents on each physical machine are trusted. The Fabric Controller OSes are trusted. The guest agent (the part the user controls) is not trusted. The VMs communicate only through the load balancers and the public (user’s endpoint) IP and back down again.
Some clever security person may now appear and make fun of this scheme, but that’s not my job.
The Fabric Controller handles network security and Hyper-V uses machine state registries (MSRs) to verify basic machine integrity. That’s not incredibly rich detail, but it’s more than you knew five minutes ago and I guarantee it’s more than you know about how Amazon secures Xen. Here’s a little more on Hyper-V security.
New additions to Azure, like full admin rights on VMs (aka elevated privileges) justify this approach, Russinovich said. “We know for a fact we have to rely on this [model] for security,” he said.
Everyone feel safe and cozy? New user-built VM Roles are implemented a little differently
Azure now offers users the ability to craft their own Windows images and run them on Microsoft’s cloud. These VM Roles are built by you (sysprep recommended) and uploaded to your blob storage. When you create a service around your custom VMs and start the instances, Fabric Controller takes pains to redundantly ensure redundancy. It makes a shadow copy of your file, caches that shadow copy (in the VHD cacher, of course) and then creates the three VHDs seen above for each VM needed. From there, you’re on your own; Microsoft does not consider having to perform your own patches an asset in Azure.
A healthy host is a happy host
Azure uses heartbeats to measure instance health: an instance simply pings the Fabric Controller every few seconds and that’s that. Here again, fault tolerance is in play. You have two instances running (if you’re doing it right; Azure will let you run one, but then you don’t get the SLA). If one fails, the heartbeat times out, the differencing VHD on the other VM starts ticking over and Azure restarts the faulty VM, or recreates the configuration somewhere else. Then changes are synced and you’re back in business.
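The heartbeat logic reduces to a freshness check. A toy sketch (mine, with an assumed timeout value; Azure's actual interval and recovery flow are more involved):

```python
# An instance is healthy while its last heartbeat is fresh; once it goes
# stale, the controller restarts or relocates it. Illustrative sketch only.
HEARTBEAT_TIMEOUT = 15.0   # seconds; assumed value for illustration

def triage(last_seen, now):
    """last_seen: {instance: timestamp of last heartbeat}. Return the
    instances whose heartbeat has timed out and that need recovery."""
    return sorted(i for i, t in last_seen.items()
                  if now - t > HEARTBEAT_TIMEOUT)

now = 100.0
last_seen = {"web-role-0": 98.0,    # pinged 2s ago: healthy
             "web-role-1": 70.0}    # silent for 30s: recover it
print(triage(last_seen, now))       # → ['web-role-1']
```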
Do not end these processes
Now that we have the ability to RDP into our Azure Roles and monkey around, Russinovich helpfully explains that the processes Azure runs within the VM are WaAppHost.exe (Worker Role), WaWebHost.exe (Web Role), clouddrivesvc.exe (All Roles) and a handful of others, such as a special w3wp.exe for IIS configuration. All of these were previously restricted from user access but can be accessed via the new admin privileges.
Many of the features set out here are in development and beta but are promised to the end user soon. Russinovich noted that the operations outlined here still could change significantly. At any rate, his PDC session provided a fascinating look into how a cloud can operate, and it’s approximately eleventy bajillion percent more than I (or anyone else, for that matter) know about how Amazon Web Services or Google App Engine works.
Azure: Microsoft’s cloud infrastructure platform
Fabric Controller: A set of modified virtual Windows Server 2008 images running across Azure that control provisioning and management
Fault Domain: A set of resources within an Azure data center that share a single point of failure and are treated as a discrete unit, like a single rack of servers. A Service by default splits virtual instances across at least two Fault Domains.
Role: Microsoft’s name for a specific configuration of Azure virtual machine. The terminology is from Hyper-V.
Service: Azure lets users run Services, which then run virtual machine instances in a few pre-configured types, like Web or Worker Roles. A Service is a batch of instances that are all governed by the Service parameters and policy.
Web Role: An instance pre-configured to run Microsoft’s Web server technology Internet Information Services (IIS)
Worker Role: An instance configured not to run IIS but instead to run applications developed and/or uploaded to the VM by the end user
VM Role: User-created, unsupported Windows Server 2008 virtual machine images that are uploaded by the user and controlled through the user portal. Unlike Web and Worker Roles, these are not updated and maintained automatically by Azure.
Amazon CTO Werner Vogels said on Twitter that AWS does not oversubscribe its services.
“If you launch an instance type you get the performance you ask (and pay) for, period. No oversubscription,” he wrote. An earlier message said that CPU performance is fixed for each instance, and customers were granted access to the full amount of virtual CPU, an Amazon designated Elastic Compute Unit (ECU).
Why is this important? For one thing, it’s another data point about AWS operations dribbled out: the company is famously tight-lipped on even completely innocuous matters, let alone operational details. This allows some more inferences to be made about what AWS actually is.
Second, EVERYBODY oversubscribes, unless they explicitly say they don’t.
Oversubscription, in the IT world, originates with having a fixed amount of bandwidth and a user base that is greater than one. It stems from the idea that you have a total capacity of resources that a single user will rarely, if ever, approach. You tell the pool of users they all have a theoretical maximum amount of bandwidth, 1 Gbps on the office LAN, for instance.
Your average user consumes much less than that (<10 Mbps, say), so you are pretty safe if you say that 50 users can all use 10 Mbps at the same time. This is oversubscription. Clearly, not everybody can have 1 Gbps at once, but some can have it sometimes. Mostly, network management takes care of making sure nobody hogs all the bandwidth, or when congestion becomes an issue, more resources are ready. Why do this?
Oversubscription is a lot easier than having conversations that go like this:
Admin to office manager: “Well, yes, these wires CAN carry 1 Gbps of data. But you only need about 10 Mbps, so what we do is set up rules, so that…
What? No, you DO have the capacity. Listen, we can either hard limit EVERYONE to 10 Mbps like it was, or we can let usage be elastic…What?
OK, fine. Everyone has a 1 Gbps connection. Goodbye now.”
Problems with this model only arise when the provider does not have enough overhead on hand to comfortably manage surges in demand, i.e., they are lying about their capacity. Comcast and AT&T do this from time to time and get rightfully pilloried for it, airlines as well, and so on. That’s wrong.
Fundamentally, though, this is a business practice based on statistically sound math. It makes zero sense to give everyone 1000 feet of rope when 98% only ever need 36 inches.
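That statistical bet is plain when you write the office-LAN example out (illustrative numbers from the example above, not any provider's actual policy):

```python
# The oversubscription bet: advertised capacity vastly exceeds the shared
# link, but expected demand fits comfortably. Illustrative arithmetic.
LINK_CAPACITY_MBPS = 1000   # the shared 1 Gbps uplink
ADVERTISED_MBPS = 1000      # what every user is told they have
TYPICAL_USE_MBPS = 10       # what the average user actually draws
USERS = 50

expected_demand = USERS * TYPICAL_USE_MBPS   # 500 Mbps: comfortable
worst_case = USERS * ADVERTISED_MBPS         # 50,000 Mbps: impossible

ratio = worst_case / LINK_CAPACITY_MBPS
print(f"Expected load: {expected_demand} Mbps on a "
      f"{LINK_CAPACITY_MBPS} Mbps link")
print(f"Oversubscription ratio if everyone cashed in at once: {ratio:.0f}:1")
```

A 50:1 ratio sounds reckless until you notice the expected load is only half the link's capacity.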
And everybody does it
It’s also par for the course in the world of hosting. Bear in mind that a service provider is not lying if it promises you a single CPU core and 1 GB RAM, and then puts 100 customers on a box with 16 cores and 36 GB RAM. It is counting on the fact that most people’s servers and applications can comfortably run on a pocket calculator these days. When demand spikes, the service provider turns on another server and adds more power.
“Problem” customers, who actually use the advertised resources, go to dedicated boxes if needed, and everyone is happy. The provider thus realizes the vaunted economy of scale, and the customer is content. Service providers often leave their more expensive offerings un-oversubscribed, as a marketing bullet point or to meet customer wishes for high-touch service. It’s a premium to get your own box.
The fact that Amazon does not oversubscribe is indicative of a few things: first, it hasn’t altered its core Xen hypervisor that much, nor are users that far from the base infrastructure. Xen does not allow oversubscription per se, but of course Amazon could show customers whatever it wanted. (This is also largely true of VPS hosters, whose ‘slice’ offerings are often comparable to Amazon’s in price: ~$70/mo for a low end virtual server instance).
This allows us to make a much better guess about the size of Amazon’s Elastic Compute Cloud (EC2) infrastructure. Every EC2 instance gets a ‘virtual core,’ posited to be about the equivalent of a 1.2 GHz Intel or AMD CPU. Virtual cores are, by convention, no more than half a real CPU core. A dual-core CPU equals four virtual cores, or four server instances. AWS servers are quad CPU, quad core, for the most part (this nugget is courtesy of Morphlabs’ Winston Damarillo, who built an AWS clone and studied their environment in detail). So, 16 cores and 32 virtual cores per server.
Guy Rosen, who runs the Jack of all Clouds blog, estimates the use of AWS regularly. In September 2010, AWS was home to 3,259 websites. In September-October 2009, Rosen came up with a novel way to count how many servers (each of which had at minimum one virtual core, or half a real CPU) Amazon provisions each day.
He said that AWS’s US-EAST region (one data center with four Availability Zones in it) launched 50,212 servers a day. At that time, AWS overall served 1,763 websites. Assume instance counts grew in step with website counts, and Amazon is now serving about 1.85 times as many instances: call it 93,000 server requests a day at US-EAST.
Physical infrastructure thus has to consist of at least 50,000 CPU cores at this point, although this is an inductive figure, not a true calculation. It is also quite conservative; growth at AWS might have been better than double. That’s 3,125 actual servers (at 16 cores each) to run those 50,000 cores and 93,000 virtual machine instances.
Amazon’s cloud in Virginia runs on 3125 servers?
What? No Way.
Let’s be generous, and take into account the new HPC instances, all the overhead they must keep around, and factor in the use of large and extra large EC2 instances. We’ll give them 4,000 servers, 128,000 virtual CPUs.
US-EAST runs on 4,000 servers, or 100 racks. That could fit in 10,000 sq ft of data center, if someone really knew what they were doing. Equinix’s (just picking that name out of thin air) flagship DC/Virginia facilities operate 155,000 sq ft of Tier 3 space — if I’m even remotely in the ballpark, US-EAST, including cages and crash carts, could fit on one wall.
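The whole estimate chain fits in a few lines. Here it is as a sketch, where every input is an assumption from the text (Rosen’s instance counts, Damarillo’s 16-core servers, the half-core convention, 40 servers per rack), not a measured fact:

```python
# Back-of-envelope sizing of AWS US-EAST, following the article's
# assumptions. None of these inputs are measured facts.

growth = 3_259 / 1_763              # website count, Sep 2010 vs Sep 2009 (~1.85x)
instances = round(50_212 * growth)  # Rosen's 2009 daily count, scaled (~93,000)

cores = instances / 2               # one virtual core = half a physical core
servers = 50_000 / 16               # cores rounded up to 50,000; 16 cores/box
racks = 4_000 / 40                  # padded to 4,000 servers, 40 per rack

print(f"~{instances:,} instances/day, ~{cores:,.0f} cores, "
      f"{servers:,.0f} servers, {racks:,.0f} racks")
```

Change any one assumption (say, larger instance types eating multiple virtual cores) and the totals move, but not by enough to make the footprint big.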
What was that about economies of scale again?
…and it was dense. I toured an AT&T Internet Data Center (IDC) facility this morning, in dear old Watertown, MA.
It’s impressive; lots of new, gleaming white PVC and galvanized steel tubing. The facility can shut down its chillers in the winter and get free cooling when it’s below 45 degrees outside, thanks to modern plate-and-frame heat exchangers (they are bright blue, probably to set off the red water pumps that connect to the PVC chiller pipes. Red, white and blue!). Solar panels can provide about 75 kW, there are a handful of megawatts’ worth of generators, and everything is shipshape and shiny.
On the floor (36″ raised) we saw customer cages and racks for AT&T’s co-location business, the massive switching and internet hub (four 10 Gbps pipes to the local backbone) and the AT&T managed services infrastructure: a good dozen aisles of rack space, mostly full, with alternating hot/cold aisles running about a 20 degree difference between them.
This is where AT&T puts its virtualization, hosting, application services, managed services and, yes, its Synaptic cloud services for the local area. AT&T currently offers Synaptic Storage as an on-demand, pay-as-you-go cloud service, and Synaptic Compute is live as a “controlled launch” with a limited number of users. AT&T Business Services says that cloud is its second-heaviest investment area after mobile services.
AT&T currently runs all of its virtualized production services on VMware, but it plans to roll out additional hypervisor support very quickly. It is also pitching its “virtual private cloud” as an area of first interest to customers. AT&T has the ability, like Verizon, to offer truly dedicated cloud environments that do not have to be exposed to standard Internet traffic at all (it’s a telecom; it can do any dedicated circuit you want, from copper POTS to fiber optic, if you pony up).
A couple of quick thoughts about what I saw: Density in the AT&T infrastructure was easily 10x that of customer colocated racks, both in use of space and the floorplan. It is clearly a beast of a different order than the individual server environments.
There was a LOT of empty space. AT&T has walled off about 70,000 square feet of raised floor for later; it has the space to double its capacity for cooling, power and switching, and it isn’t using a lot of the capacity it’s already got online. AT&T could treble its managed services infrastructure footprint in the currently ready space and still take on co-lo and hosting customers.
That says to me that the market is ready to supply IT infrastructure, and cloud infrastructure, when the demand is there. Much ado is made over Google’s uncountable, bleeding-edge servers, and Amazon’s enormous cloud environment, or Microsoft’s data centers, but here in Watertown is the plain fact that the people who really do infrastructure are ready, and waiting, for the market. The long-term future of cloud infrastructures is probably in these facilities, and not with Google or Microsoft or Amazon.
Oh, and, Verizon has a building right next door. It’s bigger than AT&T’s.
The flexibility of cloud computing services appears to be extending to the physical infrastructure itself.
Researchers point to new examples of the rapidly maturing “shipping container data center” as proof. After all, if you can sign up and get a server at Amazon Web Services anytime you like, why shouldn’t you be able to order up a physical data center almost the same way?
Rob Gillen, a researcher at Oak Ridge National Laboratory (ORNL) in Tennessee, is part of a team researching and building a private cloud out of commodity x86 servers to support the operations of Jaguar, ORNL’s supercomputer. Jaguar is the fastest and most powerful system in the world right now, but it’s not exactly a good fit for day-to-day computing needs.
“If you look at our Jaguar server here with 225,000 cores, frankly there’s only a few people in the world smart enough to write code that will work really well on that,” said Rob Gillen, researcher for cloud computing technologies at ORNL. Gillen is working on both the overall goal of private cloud and is heavily involved in exploring the use of Microsoft Azure, the Platform as a Service.
He said ORNL is working to develop a self-service, fully virtualized environment to handle less important or less intensive tasks, like post-processing of results from workloads run on Jaguar, and long term storage and delivery of that data.
Gillen said the advantage of using standard, widely available hardware and virtualization technologies to make a pool of resources available, a la Amazon Web Services, was very simple. There was a clear divide in raw computing power, but the pool of available programmers, not to mention existing software tools, was much wider with commodity-type services.
“If you have the opportunity to use fixed InfiniBand gear, generally your scientific problems are going to express themselves better over that,” he said. “The commodity nature [of private clouds] is tough for scientists to grapple with, but the range of solutions gets better.”
Hadoop, the massively parallel, next-generation data processing framework, might do a much better job of processing data from Jaguar, for example, and a researcher wouldn’t need to tie up critically valuable supercomputer time noodling around with different ways to explore all that data.
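The commodity approach Gillen describes can be sketched in miniature. Below is a toy, Hadoop-streaming-style map/reduce pass in plain Python; the sample data and key names are invented for illustration. Hadoop runs these same two phases across many commodity machines, sorting by key between them:

```python
# Toy map/reduce pass in the style of Hadoop Streaming. The data
# (simulated per-timestep readings) is invented for illustration.
from itertools import groupby

def mapper(lines):
    # Map phase: parse each record and emit (key, value) pairs.
    for line in lines:
        step, value = line.split()
        yield step, float(value)

def reducer(pairs):
    # Hadoop sorts by key between map and reduce; sort + groupby
    # mimics that shuffle, then we aggregate each key's values.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        vals = [v for _, v in group]
        yield key, sum(vals) / len(vals)

if __name__ == "__main__":
    sample = ["t1 0.5", "t2 1.5", "t1 1.5", "t2 2.5"]
    for key, avg in reducer(mapper(sample)):
        print(key, avg)
```

The point isn’t the averaging; it’s that code like this runs on any pile of cheap x86 boxes, which is exactly the divide Gillen draws between supercomputer time and commodity post-processing.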
“The raw generation is done on the supercomputers but much of the post processing is really done on commodity, cloud environment,” said Gillen. But, he’s chronically short of space and wants more servers for the ORNL cloud.
That’s where cloud on wheels comes in; Gillen has been looking at a demo container data center from SGI, called the ICE Cube, which is a standard shipping container with a lot of servers in it.
Gillen’s photos and a video of the interior of the unit are a treat for the gearheads:
It gets put down anywhere there’s space and half a megawatt or so of power. Just add water and presto, instant data center. It might not be pretty, but it’s a less expensive way to get data center space.
“We’re space constrained and that’s one possibility,” said Gillen.
Gillen said that the containerized data center market was pretty well established by now, but offerings from HP and IBM were usually designed to adapt to a traditional data center management process. They had standard power hookup, standard rack equipment, and put a high degree of emphasis on customer access. “Some vendors like HP or IBM really want it to fit into the traditional data center so they optimize them for that.”
SGI’s demo box is a little different. It’s built to do nothing but pack as many commodity x86 servers inside as possible, with unique cooling and rack designs that include DC bus bars connecting directly to server boards (no individual power supplies) and refrigeration ducts that run the length of each rack (no CPU coolers).
Gillen said that means it’s ideally suited for getting a medium-sized private cloud (anywhere from 15,000 to 45,000 cores) in a hurry. He also noted that containerized data centers are available in a wide variety of specialized configurations already.
“We are looking at it specifically in the context of our cloud computing projects but over the last two days a lot of people from other areas have been walking through it,” he said.
Is anyone else amused by Verizon’s puffed up claims to dominance in the cloud computing market? In the wake of the vCloud Director unveiling at VMworld 2010, industry analysts made a huge fuss of VMware’s announcement that Verizon has joined its vCloud service provider program. I, on the other hand, am not impressed.
No doubt landing one of the top telecom providers in the world is a coup from a PR perspective, but so far the partnership is a big paper tiger if you’re an IT shop looking to do anything real with this news.
The press release claims that, with “the click of a mouse,” customers can expand their internal VMware environments to Verizon’s Compute as a Service (CaaS) offering built on VMware vCloud Data Center, for instant, additional capacity. The overall effect is referred to as a hybrid cloud.
The immediacy and ease touted here is far from true; ironically, I learned this during a session at VMworld entitled “Cloud 101: What’s real, what’s relevant for enterprise IT and what role does VMware play.”
The speaker said that moving a workload from internal VMware resources to a vCloud service provider such as Verizon is currently a manual process. It requires users to shut down the to-be-migrated workload, select the cloud it will be deployed to, then switch to that service provider’s Web interface and import the workload. I am leaving out a bunch of other steps too tedious to mention, but it’s hardly the click of a mouse!
In a follow-up conversation after the session, VMware said the missing feature that will allow automated workload migration, called the vCloud client plug-in, was still to come. No timeframe was given.
And this isn’t all the smoke and mirrors from Verizon; the telco claims its CaaS is the first cloud service to offer PCI compliance. This statement isn’t quite true either, because the current PCI standard, v1.2, does not cover virtual infrastructures. So a real cloud infrastructure (a multi-tenant, virtualized resource) cannot be PCI compliant. The PCI Council is expected to announce v2.0 of the standard at the end of October, which will explain how to obtain PCI compliance in a virtual environment.
A word of advice to IT shops investigating hybrid cloud options: Be sure to play around with the service before you buy. In many cases, these offerings are still only half-baked.