Mark Russinovich — Microsoft technical fellow, a lead on the Azure platform and a renowned Windows expert — took pains at PDC ’10 (Watch the “Inside Windows Azure” session here) to lay out a detailed, high-level overview of the Azure platform and what actually happens when users interact with it.
The Azure cloud(s) is (are) built on Microsoft’s definition of commodity infrastructure. It’s “Microsoft Blades,” that is, bespoke OEM blade servers from several manufacturers. It’s probably Dell or HP, just saying, in dense racks. Microsoft containerizes its data centers now and pictures abound; this is only interesting to data center nerds anyway.
For systems managements nerds, here’s a 2006 presentation from Microsoft on the rudiments of shared I/O and blade design.
Azure considers each rack a ‘node’ of compute power and puts a switch on top of it. Each node — servers+top rack switch — is considered a ‘fault domain’ (see glossary, below), i.e., a possible point of failure. An aggregator and load balancers manage groups of nodes, and all feed back to the Fabric Controller (FC), the operational heart of Azure.
The FC gets it’s marching orders from the “Red Dog Front End” (RDFE). RDFE takes its name from nomenclature left over from Dave Cutler’s original Red Dog project that became Azure. The RDFE acts as kind of router for request and traffic to and from the load balancers and Fabric Controller.
Russinovich said that the development team passed an establishment called the “Pink Poodle” while driving one day. Red Dog was deemed more suitable, and Russinovich claims not to know what sort of establishment the Pink Poodle is.
How Azure works
Azure works like this:
- |___Aggregators and Load Balancers
- |___Fabric Controller
The Fabric Controller
The Fabric Controller does all the heavy lifting for Azure. It provisions, stores, delivers, monitors and commands the virtual machines (VMs) that make up Azure. It is a “distributed stateful application distributed across data center nodes and fault domains.”
In English, this means there are a number of Fabric Controller instances running in various racks. One is elected to act as the primary controller. If it fails, another picks up the slack. If the entire FC fails, all of the operations it started, including the nodes, keep running, albeit without much governance until it comes back online. If you start a service on Azure, the FC can fall over entirely and your service is not shut down.
The Fabric Controller automates pretty much everything, including new hardware installs. New blades are configured for PXE and the FC has a PXE boot server in it. It boots a ‘maintenance image,’ which downloads a host operating system (OS) that includes all the parts necessary to make it an Azure host machine. Sysprep is run, the system is rebooted as a unique machine and the FC sucks it into the fold.
The Fabric Controller is a modified Windows Server 2008 OS, as are the host OS and the standard pre-configured Web and Worker Role instances.
What happens when you ask for a Role
The FC has two primary objectives: to satisfy user requests and policies and to optimize and simplify deployment. It does all of this automatically, “learning as it goes” about the state of the data center, Russinovich said.
Log into Azure and ask for a new “Web Role” instance and what happens? The portal takes your request to the RDFE. The RDFE asks the Fabric Controller for the same, based on the parameters you set and your location, proximity, etc. The Fabric Controller scans the available nodes and looks for (in the standard case) two nodes that do not share a Fault Domain, and are thus fault-tolerant.
This could be two racks right next to each other. Russinovich said that FC considers network proximity and available connectivity as factors in optimizing performance. Azure is unlikely to pick nodes in two different facilities unless necessary or specified.
Fabric Controller, having found its juicy young nodes bursting with unused capacity, then puts the role-defining files at the host. The host OS creates the requested virtual machines and three Virtual Hard Drives (VHDs) (count ’em, three!): a stock ‘differencing’ VHD (D:\) for the OS image, a ‘resource’ VHD (C:\) for user temporary files and a Role VHD (next available drive letter), for role specific files. The host agent starts the VM and away we go.
The load balancers, interestingly, do nothing until the instance receives its first external HTTP communication (GET); only then is the instance routed to an external endpoint and live to the network.
The Platform as a Service part
Why so complicated? Well, it’s a) Windows and b) the point is to automate maintenance and stuff. The regular updates that Windows Azure systems undergoes — same as (within the specifications of what is running) the rest of the Windows world — happen typically about once a month and require restarting the VMs.
Now for the fun part: Azure requires two instances running to enjoy its 99.9% uptime service-level agreement (SLA), and that’s one reason why. Microsoft essentially enforces a high-availability, uninterrupted fault tolerance fire drill every time the instances are updated. Minor updates and changes to configuration do not require restarts, but what Russinovich called ‘VIP swaps’ do.
Obviously, this needs to be done in such a way that the user doesn’t skip a beat. A complicated hopscotch takes place as updates are installed to the resource VHD. One instance is shut down and the resource VHD updated, then the other one. The differencing VHDa makes sure new data that comes into the Azure service is retained and synced as each VM reboots.
Virtualization and security
What is it running on, we asked? Head scratching ensued for many moons as Microsoft pushed Hyper-V to customers but claimed Azure was not compatible or interoperable with Hyper-V.
It is, in fact, a fork of Hyper-V. Russinovich said it was basically tailored from the ground up for the hardware layout that Microsoft uses, same as the Azure OSes.
Russinovich said that the virtual machine is the security boundary for Azure. At the hypervisor level, the host agents on each physical machine are trusted. The Fabric Controller OSes are trusted. The guest agent- the part the user controls—is not trusted. The VMs communicate only through the load balancers and the public (user’s endpoint) IP and back down again.
Some clever security person may now appear and make fun of this scheme, but that’s not my job.
The Fabric Controller handles network security and Hyper-V uses machine state registries (MSRs) to verify basic machine integrity. That’s not incredibly rich detail, but its more than you knew five minutes ago and I guarantee its more than you know about how Amazon secures Xen. Here’s a little more on Hyper-V security.
New additions to Azure, like full admin rights on VMs (aka elevated privileges) justify this approach, Russinovich said. “We know for a fact we have to rely on this [model] for security,” he said.
Everyone feel safe and cozy? New user-built VM Roles are implemented a little differently
Azure now offers users the ability to craft their own Windows images and run them on Microsoft’s cloud. These VM Roles are built by you (sysprep recommended) and uploaded to your blob storage. When you create a service around your custom VMs and start the instances, Fabric Controller takes pains to redundantly ensure redundancy. It makes a shadow copy of your file, caches that shadow copy (in the VHD cacher, of course) and then creates the three VHDs seen above for each VM needed. From there, you’re on your own; Microsoft does not consider having to perform your own patches an asset in Azure.
A healthy host is a happy host
Azure uses heartbeats to measure instance health: It simply pings the Fabric Controller every few seconds and that’s that. Here again, fault tolerance is in play. You have two instances running (if you’re doing it right. Azure will let you run one, but then you don’t get the SLA). If one fails, the heartbeat times out, the differencing VHD on the other VM starts ticking over and Azure restarts the faulty VM, or recreates the configuration somewhere else. Then changes are synced and you’re back in business.
Do not end these processes
Now that we have the ability to RDP into our Azure Roles and monkey around, Russinovich helpfully explains that the processes Azure runs within the VM are WaAppHost.exe (Worker Role), WaWebHost.exe (Web Role), clouddrivesvc.exe (All Roles) and a handful of others, a special w3wp.exe for IIS configuration and so forth. All of these were previously restricted from user access but can be accessed via the new admin privileges.
Many of the features set out here are in development and beta but are promised to the end user soon. Russinovich noted that the operations outlined here still could change significantly. At any rate, his PDC session provided a fascinating look into how a cloud can operate, and it’s approximately eleventy bajillion percent more than I (or anyone else, for that matter) know about how Amazon Web Services or Google App Engine works.
Azure : Microsoft’s cloud infrastructure platform
Fabric Controller: A set of modified virtual Windows Server 2008 images running across Azure that control provisioning and management
Fault Domain: A set of resources within an Azure data center that are considered non-fault tolerant and a discrete unit, like a single rack of servers. A Service by default splits virtual instances across at least two Fault Domains.
Role: Microsoft’s name for a specific configuration of Azure virtual machine. The terminology is from Hyper-V.
Service: Azure lets users run Services, which then run virtual machine instances in a few pre-configured types, like Web or Worker Roles. A Service is a batch of instances that are all governed by the Service parameters and policy.
Web Role: An instance pre-configured to run Microsoft’s Web server technology Internet Information Services (IIS)
Worker Role: An instance configured not to run IIS but instead to run applications developed and/or uploaded to the VM by the end user
VM Role: User-created, unsupported Windows Server 2008 virtual machine images that are uploaded by the user and controlled through the user portal. Unlike Web and Worker Roles, these are not updated and maintained automatically by Azure.
Amazon CTO Werner Vogels said on Twitter that AWS does not oversubscribe its services.
“If you launch an instance type you get the performance you ask (and pay) for, period. No oversubscription,” he wrote. An earlier message said that CPU performance is fixed for each instance, and customers were granted access to the full amount of virtual CPU, an Amazon designated Elastic Compute Unit (ECU).
Why is this important? For one thing, it’s another data point about AWS operations dribbled out: the company is famously tight-lipped on even completely innocuous matters, let alone operational details. This allows some more inferences to be made about what AWS actually is.
Second, EVERYBODY oversubscribes, unless they explicitly say they don’t.
Oversubscription, in the IT world, originates with having a fixed amount of bandwidth and a user base that is greater than one. It stems from the idea that you have a total capacity of resources that a single user will rarely, if ever, approach. You tell the pool of users they all have a theoretical maximum amount of bandwidth, 1GBps on the office LAN, for instance.
Your average user consumes much less than that (<10 MBbs, say), so you are pretty safe if you say that 50 users can all use 10 MBps at the same time. This is oversubscription. Clearly, not everybody can have 1 GBps at once, but some can have it sometimes. Mostly, network management takes care of making sure nobody hogs all the bandwidth, or when congestion becomes an issue, more resources are ready. Why do this?
Oversubscription is a lot easier than having conversations that go like this:
Admin to office manager: “Well, yes, these wires CAN carry 1 GBps of data. But you only need about 10MBPs, so what we do is set up rules, so that…
What? No, you DO have the capacity. Listen, we can either hard limit EVERYONE to 10 MBps like it was, or we can let usage be elastic…What?
OK, fine. Everyone has a 1 GBps connection. Goodbye now.”
Problems with this model only arise when the provider does not have enough overhead on hand to comfortably manage surges in demand, i.e., they are lying about their capacity. Comcast and AT&T do this and get rightfully pilloried for fraud from time to time, airlines as well, and so on. that’s wrong.
Fundamentally, though, this is a business practice based on statistically sound math. It makes zero sense to give everyone 1000 feet of rope when 98% only ever need 36 inches.
And everybody does it
It’s also par for the course in the world of hosting. Bear in mind that a service provider is not lying if it promises you a single CPU core and 1 GB RAM, and then puts 100 customers on a box with 16 cores and 36 GB RAM. It is counting on the fact that most people’s servers and applications can comfortably run on a pocket calculator these days. When demand spikes, the service provider turns on another server and adds more power.
“Problem” customers, who use the advertised resources, go to dedicated boxes if needed, and everyone is happy. The provider thus realizes the vaunted economy of scale, and the customer is content. Service providers often don’t oversubscribe more expensive offerings as a marketing bullet point or to meet customer wishes and provide high touch customer service. It’s a premium to get your own box.
The fact that Amazon does not oversubscribe is indicative of a few things: first, it hasn’t altered its core Xen hypervisor that much, nor are users that far from the base infrastructure. Xen does not allow oversubscription per se, but of course Amazon could show customers whatever it wanted. (This is also largely true of VPS hosters, whose ‘slice’ offerings are often comparable to Amazon’s in price: ~$70/mo for a low end virtual server instance).
This allows us to make a much better guess about the size of Amazon’s Elastic Compute Cloud (EC2) infrastructure. Every EC2 instance gets a ‘virtual core,’ posited to be about the equivalent of a 1.2 GHz Intel or AMD CPU. Virtual cores are, by convention, no more than half a real CPU core. A dual core CPU equals four virtual cores, or four server instances. AWS servers are quad CPU, quad core, for the most part(this nugget is courtesy of Morphlabs’ Winston Damarillo, who built an AWS clone and studied their environment in detail). So, 16 cores and 32 virtual cores per server.
Guy Rosen, who runs the Jack of all Clouds blog, estimates the use of AWS regularly. In September 2010, AWS was home to 3,259 websites. In September-October 2009, Rosen came up with a novel way to count how many servers (each of which had at minimum one virtual core, or half a real CPU) Amazon provisions each day.
He said that AWS’s US-EAST region (one data center with 4 Availability Zones in it) launched 50,212 servers a day. At that time, AWS overall served 1,763 websites. Assume this growth is consistent, and Amazon is now serving 184% more instances. let’s say 93,000 server requests a day at US-EAST.
Physical infrastructure thus has to consist of at least 50,000 CPU cores at this point, although this is an inductive figure, not a true calculation. It is also quite conservative. Growth at AWS might have been better than double.That’s 3,125 actual servers to run those 50,000 nodes and 93,000 virtual machine instances.
Amazon’s cloud in Virginia runs on 3125 servers?
What? No Way.
Let’s be generous, and take into account the new HPC instances, all the overhead they must keep around, and factor in the use of large and extra large EC2 instances. We’ll give them 4,000 servers, 128,000 virtual CPUs.
US-EAST runs on 4,000 servers, or 100 racks. That could fit in 10,000 sq ft of data center, if someone really knew what they were doing. Equinix’s (just picking that name out of thin air) flagship DC/Virginia facilities operate 155,000 sq ft of Tier 3 space — if i’m even remotely in the ballpark, US-EAST, including cages and crash karts, could fit on one wall.
What was that about economies of scale again?
…and it was dense. I toured an AT&T Internet Data Center (IDC) facility this morning, in dear old Watertown, MA.
It’s impressive; lots of new, gleaming white PVC and galvanized steel tubing. The facility has the ability to shutdown its chillers in the winter and get free cooling when it’s below 45 degrees outside, thanks to modern plate and frame heat exchangers (they are bright blue, probably to set off the red water pumps that connect to the PVC chiller pipes. Red, white and blue!). Solar panels can provide about 75Kw, they have a handful of megawatts worth of generators, etc, and everything is shipshape and shiny.
On the floor (36″ raised) we saw customer cages and racks for AT&T’s co-location business, the massive switching and internet hub (four 10 GB pipes to the local backbone) and we saw the AT&T managed services infrastructure, a good dozen aisles of rack space, mostly full, with alternating hot/cold aisles with about a 20 degree difference between them.
This is where AT&T puts its virtualization, hosting, application services, managed services and yes, its Synaptic cloud services for the local area. Currently, AT&T offers Synaptic Storage as an on-demand, pay-as-you-go cloud service, and Synaptic Compute is live as a “controlled launch” with a limited number of users. Currently, AT&T Business Services says that cloud is its second heaviest investment area after mobile services.
AT&T currently runs all of its production services that are virtualized on VMware, but it plans to roll out additional hypervisor support very quickly. It is also pitching its “virtual private cloud” as an area of first interest to customers. AT&T has the ability, like Verizon, to offer truly dedicated cloud environments that do not have to be exposed to standard Internet traffic at all (it’s a telecom, it can do any dedicated circuit you want from copper POTS to fiber optic, if you pony up).
A couple of quick thoughts about what I saw: Density in the AT&T infrastructure was easily 10x that of customer colocated racks, both in use of space and the floorplan. It is clearly a beast of a different order than the individual server environments.
There was a LOT of empty space. AT&T has walled off about 70,000 square feet of raised floor for later; it has the space to double its capacity for cooling, power and switching, and it isn’t using a lot of the capacity its already got online. AT&T could treble its managed services infrastructure footprint in the current ready space and still take on co-lo and hosting customers.
That says to me that the market is ready to supply IT infrastructure, and cloud infrastructure, when the demand is there. Much ado is made over Google’s uncountable, bleeding-edge servers, and Amazon’s enormous cloud environment, or Microsoft’s data centers, but here in Watertown is the plain fact that the people who really do infrastructure are ready, and waiting, for the market. The long-term future of cloud infrastructures is probably in these facilities, and not with Google or Microsoft or Amazon.
Oh, and, Verizon has a building right next door. It’s bigger than AT&T’s.
The flexibility of cloud computing services appears to be extending to the physical infrastructure itself.
Researchers point to new examples of the rapidly maturing “shipping container data center” as proof. After all, if you can sign up and get a server at Amazon Web Services anytime you like, why shouldn’t you be able to order up a physical data center almost the same way?
Rob Gillen, a researcher at the Oak Ridge National laboratories in Tennessee, is part of a team researching and building a private cloud out of commodity x86 servers to support the operations of Jaguar, ORNL’s supercomputer. It is the fastest and most powerful system in the world right now, but it’s not exactly a good fit for day-to-day computing needs.
“If you look at our Jaguar server here with 225,000 cores, frankly there’s only a few people in the world smart enough to write code that will work really well on that,” said Rob Gillen, researcher for cloud computing technologies at ORNL. Gillen is working on both the overall goal of private cloud and is heavily involved in exploring the use of Microsoft Azure, the Platform as a Service.
He said ORNL is working to develop a self-service, fully virtualized environment to handle less important or less intensive tasks, like post-processing of results from workloads run on Jaguar, and long term storage and delivery of that data.
Gillen said the advantages of using standard, widely available hardware and virtualization technologies to make a pool of resources available, a la Amazon Web Services, was very simple. There was a clear divide in raw computing power, but the pool of available programmers, not to mention existing software tools, was much wider using commodity-type services.
“If you have the opportunity to use fixed, Infiniband gear, generally your scientific problems are going to express themselves better over that,” he said. “The commodity nature[of private clouds] is tough for scientists to grapple with, but the range of solutions gets better.”
Hadoop, the massive multi-parallel next generation database, might do a much better job of processing data from Jaguar, for example, and a researcher wouldn’t need to tie up critically valuable supercomputer time noodling around with different ways to explore all that data.
“The raw generation is done on the supercomputers but much of the post processing is really done on commodity, cloud environment,” said Gillen. But, he’s chronically short of space and wants more servers for the ORNL cloud.
That’s where cloud on wheels comes in; Gillen has been looking at a demo container data center from SGI, called the ICE Cube, which is a standard shipping container with a lot of servers in it.
Gillen’s photos and a video of the interior of the unit are a treat for the gearheads:
It gets put down anywhere there’s space and half a megawatt or so of power. Just add water and presto, instant data center. It might not be pretty, but it’s a less expensive way to get data center space.
“We’re space constrained and that’s one possibility,” said Gillen.
Gillen said that the containerized data center market was pretty well established by now, but offerings from HP and IBM were usually designed to adapt to a traditional data center management process. They had standard power hookup, standard rack equipment, and put a high degree of emphasis on customer access. “Some vendors like HP or IBM really want it to fit into the traditional data center so they optimize them for that.”
SGI’s demo box is a little different. It’s built to do nothing but pack as many commodity x86 servers inside as possible, with unique cooling and rack designs that include DC bus bars connecting directly to server boards (no individual power supplies) and refrigeration ducts that run the length of each rack (no CPU coolers).
Gillen said that means it’s ideally suited for getting a medium-sized private cloud (anywhere from 15,000 to 45,000 cores) in a hurry. He also noted that containerized data centers are available in a wide variety of specialized configurations already.
“We are looking at it specifically in the context of our cloud computing projects but over the last two days a lot of people from other areas have been walking through it,” he said.
Is anyone else amused by Verizon’s puffed up claims to dominance in the cloud computing market? In the wake of the vCloud Director unveiling at VMworld 2010, industry analysts made a huge fuss of VMware’s announcement that Verizon has joined its vCloud service provider program. I, on the other hand, am not impressed.
No doubt landing one of the top telecom providers in the world is a coup from a PR perspective, but so far the partnership is a big paper tiger if you’re an IT shop looking to do anything real with this news.
The press release claims that, with “the click of a mouse,” customers can expand their internal VMware environments to Verizon’s Compute as a Service (CaaS) offering built on VMware vCloud Data Center, for instant, additional capacity. The overall effect is referred to as a hybrid cloud.
The immediacy and ease touted here is far from true; ironically, I learned this during a session at VMworld entitled “Cloud 101: What’s real, what’s relevant for enterprise IT and what role does VMware play.”
The speaker said that to move a workload from internal VMware resources to a vCloud service provider such as Verizon is currently a manual process. It require users to shut down the to-be-migrated workload, select the cloud you’ll deploy it to, then switch to the Web interface of that service provider and import the workload. I am leaving out a bunch of other steps too tedious to mention, but it’s hardly the click of a mouse!
In a follow-up conversation after the session, VMware said the missing feature that will allow automated workload migration, called the vCloud client plug-in, was still to come. No timeframe was given.
And this isn’t all the smoke and mirrors from Verizon; the telco claims its CaaS is the first cloud service to offer PCI compliance. This statement isn’t quite either because the current PCI standard, v1.2, does not support virtual infrastructures. So a real cloud infrastructure (a multi-tenant, virtualized resource) cannot be PCI compliant. The PCI Council is expected to announce v2.0 of the standard at the end of October, which will explain how to obtain PCI compliance in a virtual environment.
A word of advice to IT shops investigating hybrid cloud options: Be sure to play around with the service before you buy. In many cases, these offerings are still only half-baked.
A story we wrote last week about Amazon’s newest disclosures on its security procedures was sparked in part by a earful from one of the sources in it. Seeking reactions to the newly updated “Overview of Security Processes,” I expected a guarded statement that the paper was a good general overview of how Amazon Web Services approached security, but pertinent technical details would probably only be shared with customers who requested them, and Amazon didn’t want to give too much away.
Instead, what I heard was that Amazon not only does not disclose relevant technical information but it apparently also does not understand what customers are asking for. Potential clients were both refused operational security details and also told wildly different answers on whether or not AWS staff could access data stored in users’ S3 accounts: “No, never,” and “Yes, under some circumstances.” That’s, um, kind of a big deal. They also refuse to indemnify themselves against potential failures and data loss as a matter of course.
Typically, a big enterprise IT organization has a set of procedures and policies it has to follow when provisioning infrastructure; charts are made, checkboxes checked, and someone, somewhere, will eventually claim that information and park it somewhere. This includes minor details like “who can access our data and how,” and “how does a service provider protect our assets and will they compensate us if they fail.” A big customer and a provider will sit down, discuss how the hoster can meet the needs of the organization, assign a value to the business revenue being generated for the enterprise, and agree to pay that amount for any outages.
Everybody is aware of this
Even their biggest fans are somewhat down on AWS for this. Cloud consultant Shlomo Swidler said in an email that Amazon’s efforts to brush up their security picture, like the launch of the AWS Vulnerability Reporting and Penetration Testing program, was the right idea, but Amazon had neutered it by not letting customers use it in a meaningful way. “Without a way to test how things will really behave under simulated attack conditions — including the AWS defensive responses — I don’t understand what will happen under real attack conditions,” he said. The Vulnerability Reporting and Penetration Testing program can reportedly only be used with pre-approval from AWS staff, meaning it can never simulate an in-the-wild attack.
Others are more charitable, and point to Amazon’s track record. IT security auditor Andrew Plato was asked about the new white paper and responded via email.
“From what’s in there, they seem to be doing the right things. They’ve got a good risk management framework, good firewalls, monitoring, they’re following ISO and COBIT , They’ve got change management; they seem to be doing all the good practices that we advise clients to do,” said Plato, president of Anitian Enterprise Security. But he noted that all we had to go on was Amazon’s good word. ”The long and short of it is the content says they’re doing the right things — now, they could be lying,” he said, tongue only partly in cheek.
Plato isn’t worried about Amazon’s security. I’m positive they aren’t lying about anything in their white paper. Nobody should be worried; they have an amazing track record, but we’ll never know, at this rate, exactly what they’re so proud of.
The problem is enterprises are picky
Here’s the problem: IT does not work like baby shoes and garden rakes. It’s not enough to just deliver the goods. You have to show your work, or the IT practitioner cannot trust what you are giving him, at a certain level. All hosting providers know this, and they are proud to show off what they’ve done. After all, they’ve spent a lot of money to get best-in-class gear so they can make money off it.
Hell, Rackspace will drag a hobo off the street to show them around the data center, they’ll talk your ear off; you’ll know what color socks the hard drive guy is wearing on Tuesdays if that’s important to you.
Now, it’s OK that Amazon doesn’t work quite that way. We all understand that the amazing feat they have managed to pull off is to offer real-time self-service IT and charge for it by the hour, and that users are responsible for their own foolishness, and Amazon backs only access and uptime. Most of Amazon’s customers are more than happy with that; they can’t afford to care about what kind of firewall and load balancers run the AWS cloud.
But if Amazon is going to compete for the enterprise customer, and they are explicit that they are trying for those customers, they are going to have to get over it and spill the beans. Not to me, although that would be nice, and not to their competition (though that’s hardly relevant now since their nearest cloud competitor, Rackspace, is apparently $400 million dollars shy of eating their lunch) but definitely to enterprise customers. It’s a fact of life. Enterprises won’t come unless you play their ball game.
There are all sorts of ways AWS can address this without giving away the goose. CloudAudit is one idea; that’s self-service security audits on an API; it fits right in to the AWS worldview. Talking to analysts and professionals under NDA is another. AWS must at the very least match what other service providers offer if it is sincere in competing for enterprise users.
Googler Vijay Gill posted a quick and dirty cloud calculator a few weeks ago that has caused some head scratching. The calculator seems to show an eye popping 168% premium on using AWS versus co-locating your own servers–$118,248/year for AWS XL instances and $70,079.88 for operating a co-lo with equivalent horsepower.
Can that really be the case? AWS isn’t cheap web hosting, it’s mid-tier VPS hosting, price-wise, if you’re talking about using it consistently year over year, and those are definitely cheaper than co-lo. Gill says $743,000 to buy and install your servers, so he’s got the investment figures in there.
Editor Matt Stansberry asked an expert on data center practices and markets that questions and was told:
“There is a point at where this is a very good exercise, but the way it was undertaken was grossly inaccurate,”
That’s Tier1 analyst Antonio Piraino, who points out that not only did Gill not spell out neccessary assumptions, he took Amazon’s retail price as the base cost, and Amazon will cut that in half if a user makes a year or multi-year commitment.
But is it fair to make the comparison in the first place?
Some people will choose Amazon for large-scale, long term commitments, but they will be in the diminous minority. There are far better options for almost anyone in hosting right now. The hosting market has been mature for the better part of a decade, and cloud has many years to go on that front.
AWS isn’t hosting or co-lo, obviously; it’s cloud. First, lots of people pick off the bits they want, like using S3 for storage. That is surely less expensive that co-locating your own personal SAN for data archive or second tier storage (first tier if you’re a web app). That’s the absolutely astounding innovation that AWS has shown the world; they sell any part of the compute environment by the hour, independent of all the other parts.
Second the whole point of AWS is that you can get the entire equivalent of that $743,000 co-lo hardware, running full bore, no cable crimpers or screwdrivers needed, in a few hours (if you’re tardy) without having to buy a thing. Building out a co-lo takes months and months.
So it’s a little off base and what’s the point? To prove that Amazon can be expensive? Not a shock. Renting an apartment can seem like a waste of money if you own a home, not so much if you need a place to live.
CA’s spending spree in the cloud market is far from over according to Adam Elster, SVP and general manager of CA’s services business.
The software giant has gobbled up five companies in the last 12 months including Cassatt (resource optimization), Oblicore (IT service catalog), 3Tera (application deployment in the cloud), Nimsoft (monitoring and reporting of Google Apps, Rackspace, AWS and Salesforce.com) and most recently, 4Base Techbologies (a cloud consulting and integration firm). Some back of the envelope math says that’s close to a billion dollars worth of acquistions so far.
Elster says the company is looking to make an acquistion every 60 to 90 days to build out its portfolio of cloud offerings. It’s not done with services either. “We’re looking at a couple of others from a services perspective,” Elster said. CA’s focus, as always, is on management. It’s also looking at security in the cloud.
For now, the 4Base deal is keeping CA busy. A Sunnyvale, CA-based virtualization consulting firm, 4Base has about 300 projects on the go with companies including Visa, ebay and T-mobile. It charges around $250,000 per phase of a project and most projects are at least two phases. CA found itself in many of the same deals with 4Base, but 4Base was winning the IT strategy and consulting part of the deal, hence the acquisition.
It’s seems like an expensive proposition to hire the 4Base guys, but Elster says for many large companies it’s a time to market issue versus retraining “senior” inhouse IT staff. “Your challenge is those people do not have the large virtualization and cloud project experience … for $250,000 4Base does the assessment and builds the roadmap, it’s a hot space as it gets the organization to market quicker and reduces risk,” he said.
Most of the projects 4Base is working on involve helping companies build out their virtualization environments beyond a single application or test and dev environment. Rolling out virtualization to a larger scale means getting an ITIL framework in place, updating incident and capacity management reporting tools and creating more standardized IT processes, according to 4Base.
If you’re looking for other boutique companies in the virtualization and cloud consulting market there are a lot out there. Service Mesh, CloudManage.com, New Age Technologies, AllVirtualGroup.net, VirtualServerConsulting, Green Pages Technology Solutions and IT@Once spring to mind.
This week I wrote a story about Eli Lilly’s struggle with Amazon Web Services over legal indemnification issues.
Sources told us that Eli Lilly was walking away from contract negotiations with AWS for expanding its use of AWS beyond its current footprint. AWS has chosen to hide this fact by claiming the story says Eli Lilly is leaving Amazon completely, which was not reported.
Since publishing the story Amazon’s CTO Dr. Werner Vogels has called me a liar, attempted to discredit SearchCloudComputing.com and claimed my sources are wrong, all via Twitter. I am curious if he thinks any enterprise IT professionals are following his tweets? My hunch is not many, but that’s another story.
Information Week followed up with Eli Lilly to check out the story and was given this statement:
“Lilly is currently a client of Amazon Web Services. We employ a wide variety of Amazon Web Services solutions, including the utilization of their cloud environment for hosting and analytics of information important to Lilly.”
This statement does not refute the issue at the center of my story which is that Eli Lilly has been struggling to agree to terms with AWS over legal liability issues which has prevented it from deploying more important workloads on AWS.
Yes, AWS still gets some business from Eli Lilly, but larger HPC workloads and other corporate data are off the table, right now.
The story raises lots of questions about the murky area of how much liability cloud computing service providers should assume when things go wrong with their service. So far, AWS seems unwilling to negotiate with its customers, and it’s certainly unwilling to discuss this topic in a public way.
That’s AWS’s prerogative, but the issue will not subside, especially as more big companies debate the wisdom of trusting their business information to cloud providers like AWS, Rackspace, etal.
Has the endless optimism and sunny disposition of the Google crew finally led them to bite off more than they could chew?
Reported trouble meeting security standards has stalled a high profile deal between Google and the City of LA to implement email and office software in the cloud, replacing on premise Novell GroupWise software. While 10,000 users have moved onto Gmail already, according to city CTO Randi Levin, and 6,000 more will move by mid-August, 13,000 police personnel will not be ready to switch from in-house to out in the cloud until fall.
Google and CSC have reimbursed the city a reported $145,000 dollars to help cover the costs of the delay. There was already a sense that Google was giving Los Angeles a sweetheart deal to prove that Google Apps was ready for big deployments; when we first reported this last year, it was noted that Google could give the city more than a million dollars in kickbacks if other public California agencies joined the deal, and that Google was flying-in teams of specialists to pitch and plan the move, something most customers don’t get.
Also in our original coverage, critics raised precisely these concerns; that the technology was an unknown, that there would be unexpected headaches, and that overall, choosing a technology system because Google wanted to prove something might not be the smartest way to set policy.
“Google justified its pitch by saying that the use of Google Apps will save a ton of money based on productivity gains, when everyone knows that when you put in something new, you never know if it will integrate [well] or not with existing technology,” said Kevin McDonald, who runs an outsourced IT systems management firm. That’s not prescient; that’s common sense. MarketWatch also reports that users are dissatisfied with speed and delivery of email and that’s a primary concern for the LAPD.
There was no word today on the fate of the “Government Cloud” that Google said it was building to support public sector users who had a regulatory need to have their data segregated and accounted for. Google originally said that the Government Cloud would be able to meet any and all concerns over privacy and security by the City of LA. Why that hasn’t happened ten months after the promises were made remains to seen.
Google was happy to gloss over potential roadblocks when the deal was announced, like the fact that the LAPD relies on its messaging system; email, mobile devices etc for police duties and maybe it’s right in claiming, as it often has, that Google can do security better, but I’m going to go out on a limb and guess that when the LAPD’s email goes out, the Chief of Police probably does not want to call Google Support and get placed on hold. He probably wants to be able to literally stand next to the server and scream at someone in IT until it’s back.
Maybe that’s an out of date attitude, but it’s one that is hard to shake, especially in the public sector. These people have been doing their jobs (well, showing up at the office, at least) for a very long time without Google; they are not prone to enjoy experimentation or innovation, and Google needs to recognize that and get its ducks in a row if it wants to become a serious contender for the public sector. The “perpetual beta” attitude that the company seems to revel in simply isn’t going to fly.