Ask the IT Consultant

July 14, 2012  2:30 PM

Do We Really Need Cloud Standards?

By Beth Cohen

Question:  I am working on building a cloud strategy for my company.  How can we avoid vendor lock-in?

You would think that cloud technology would have standardized a long time ago.  While the network standards that shape the Internet have been widely accepted throughout the industry, cloud standards have had a much slower adoption path.  Sadly, after 15 years, the current state of cloud standards is still far less mature than it ought to be.  As cloud infrastructure technology matures, increased interest from enterprises and emerging vendors is driving a renewed effort to create viable standards that will benefit everyone.

For companies looking for integration and the capability to build hybrid clouds, various standards and proprietary and open APIs have been proposed to provide interoperability up and down the three layers of the cloud stack.  The first, and so far only, cloud-oriented standard to be ratified is the Open Virtualization Format (OVF), approved in September 2010 after three years of work by the DMTF.  OVF's open packaging and distribution format offers some platform independence by allowing migration between some platforms, but it does not provide all the tools needed for full cloud interoperability.

I think we can all agree that everyone benefits from technology standards in the long term – the operative word being long term.  You would think that everyone would agree that cloud infrastructure standards should be a high priority, but building compelling proprietary systems and discouraging standards gives early adopters a competitive advantage in the short term.  Think of Amazon's and VMware's vast technology and market leads in cloud services and enterprise infrastructure respectively.  They have little motivation to support any standards that could undercut their dominant market positions.

Cloud users are asking when cloud computing standards will mature enough that more companies will feel comfortable implementing cloud architectures and using cloud services without feeling locked in.  Ironically, while commercial cloud offerings have been growing – built on the very standards that created the Internet itself – Amazon and others have been reluctant to publish their architectures.  Application Programming Interfaces (APIs), which hide the underlying architecture, are all well and good, but they do not guarantee true interoperability.  Downstream vendors quickly find that they need to build API interfaces for every service they support, adding significantly to development and maintenance costs.  Addressing this, and transparently transferring workloads among different vendors based on predefined business rules, will require much more comprehensive standards.
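To make that integration cost concrete, here is a minimal Python sketch of the adapter pattern downstream vendors end up building for each provider. All class and method names are invented for illustration and correspond to no real vendor SDK; the point is that without standards, every new provider means writing and maintaining another adapter like these.

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """The common interface our code targets; each vendor needs its own adapter."""
    @abstractmethod
    def launch_instance(self, image_id: str, size: str) -> str: ...
    @abstractmethod
    def terminate_instance(self, instance_id: str) -> None: ...

class VendorAAdapter(CloudProvider):
    def launch_instance(self, image_id, size):
        # translate our request into vendor A's proprietary API call here
        return f"vendor-a:{image_id}:{size}"
    def terminate_instance(self, instance_id):
        pass

class VendorBAdapter(CloudProvider):
    def launch_instance(self, image_id, size):
        # vendor B uses different names and semantics for the same concepts
        return f"vendor-b:{image_id}:{size}"
    def terminate_instance(self, instance_id):
        pass

def deploy(provider: CloudProvider, image_id: str) -> str:
    # business rules pick the provider; the calling code never changes
    return provider.launch_instance(image_id, size="medium")
```

With a ratified interoperability standard, the per-vendor adapters would shrink to almost nothing; without one, they grow with every service each vendor adds.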

One obvious question is whether the commercial cloud systems themselves could become standards.  After all, there are precedents: formerly proprietary formats, such as VMware's VMDK (Virtual Machine Disk), have become de facto standards simply by being widely adopted by the industry.  One could even argue that platforms such as Amazon's AWS are already standards.  However, as many companies have found to their chagrin, while AWS has a vast variety of services, easy-to-use tools, and a significant technological head start on the competition, it is more of a cloud-world roach motel.  Lots of companies have found it easy to get applications running quickly, but changing providers or taking applications back in-house as requirements change is fraught with unexpected perils.  Amazon's backing of Eucalyptus does not address that problem directly, but it does offer a viable option for companies that want to build what Amazon euphemistically calls on-premise services.

In conclusion, the good news is that the cloud industry is finally reaching consensus that cloud interoperability standards are long overdue.  The biggest needs remain interoperability standards that allow virtual machines to be migrated between clouds transparently, and more robust hybrid cloud solutions.  For the moment, companies that want to use multiple platforms or a mix of public and private options are stuck with complex architectures and emerging orchestration tools such as enStratus and RightScale to bridge the gap.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Transforming Businesses with Cloud Solutions

June 17, 2012  10:00 PM

Really Big Data – Cloud object storage

By Beth Cohen

Question:  What is the best way to store really large data sets?

We are commonly used to terabytes of data; 3TB hard drives are now widely available for under $200, and 4TB drives are starting to ship at premium prices.  It is not unusual these days for a company to have at least half a petabyte of data floating around on its storage systems, and a petabyte of total data if you count all those forgotten databases and buried servers.  Working in the storage industry, I cannot tell you how many times clients underestimated their actual storage data set by 50% or more.  SAN/NAS solutions, well-understood technologies that have been around for a while, are robust systems that reliably support storage pools of a petabyte or more.  However, as enterprises' appetite for ever increasing amounts of data – so-called big data – grows, there is a need for new architectures that take a different approach to managing massive amounts of data (20 petabytes or more) at lower cost.  That is where object stores have the advantage over traditional storage approaches: they can store data very efficiently on commodity hardware, scale horizontally to essentially infinite size, and seamlessly handle any type of data.

As enterprise data sets grow to tens of petabytes – i.e., beyond the scale of even the largest SAN/NAS solutions available today – there are some very attractive cloud systems that address the need for those ever expanding pools of storage.  It is worth taking a minute to understand how cloud storage works for very large amounts of data.  Object stores, first introduced in 1993, take a different approach from traditional file systems, which maintain some type of hierarchical organization using the file-and-folder analogy.  Each file is treated as an object – hence the term object store – and the objects are placed in the store using a distributed database model.  Having no central “brain” or master point of control provides greater scalability, redundancy and permanence.  An object store is not a file system or a real-time data storage system, but rather a long-term repository for more permanent, static data that can be retrieved, leveraged, and then updated if necessary.  The details vary, of course, but the ability to find objects from anywhere in the store using a distributed retrieval mechanism is what allows these stores to handle multiple petabytes of data.  They are ideal for write-once, read-many data pools.  Primary examples of data that fit this storage model are virtual machine images, photo storage, email storage and backup archiving.
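A toy Python sketch can illustrate the distributed placement idea: every client hashes the object name against a shared node list, so no central index ever needs to be consulted to locate data. Real object stores such as Swift use far more sophisticated consistent-hashing rings with weights and zones; the node names here are purely hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # storage nodes (hypothetical)
REPLICAS = 3

def place_object(name: str):
    """Map an object name to REPLICAS distinct nodes using a hash.

    Any client that knows the node list computes the same answer,
    so there is no central 'brain' to consult or to fail."""
    digest = int(hashlib.md5(name.encode()).hexdigest(), 16)
    start = digest % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

# Any machine, anywhere, locates "photo-123" the same way:
print(place_object("photo-123"))
```

Because placement is pure computation over the node list, adding capacity means updating the shared ring rather than migrating a central index, which is what lets these systems scale horizontally.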

Moving from a SAN storage solution to a cloud object storage solution makes sense for many use cases involving very large amounts of data.  Some of the advantages include:

  • Widely deployed, proven technology, with hundreds of petabytes of data storage in production today
  • Most cost-efficient solution at this scale – substantially lower per-gigabyte-per-month storage costs
  • Reduced data center floor space utilization
  • Enhanced flexibility to meet fluctuating storage demands
  • Potential for faster throughput to applications and a better end-user experience
  • Highly scalable object storage
  • Capable of creating seamless storage pools across multiple back-end systems
  • Ability to scale horizontally instead of vertically
  • Horizontal architecture scales well beyond the 20 petabyte maximum that traditional storage architectures allow
  • Uses interchangeable commodity hardware
  • Simplified operations

The average cost of commercial fully managed cloud storage is running $0.11-0.15/GB/month.  That might be a bit high for companies with massive data storage needs, but an organization with the wherewithal to build storage in-house can bring costs down substantially, easily to under $0.05/GB/month.  Remember, at 10 petabytes of data, every additional $0.01/GB/month of savings represents roughly $1.2M/year.  For one such model, check out Amar Kapadia's blog post on cost projections for building an OpenStack Swift store, Can OpenStack Swift Hit Amazon S3 like Cost Points?
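The arithmetic behind that savings figure is easy to check; a quick Python sketch, using 1 PB = 1,048,576 GB:

```python
PB = 1024 ** 2  # gigabytes per petabyte (1,048,576 GB)

def annual_cost(petabytes: float, rate_per_gb_month: float) -> float:
    """Yearly storage bill for a given capacity and per-GB monthly rate."""
    return petabytes * PB * rate_per_gb_month * 12

# Each $0.01/GB/month on a 10 PB store is worth about $1.26M per year:
print(round(annual_cost(10, 0.01)))

managed = annual_cost(10, 0.11)   # low end of managed pricing
in_house = annual_cost(10, 0.05)  # an achievable in-house target
print(round(managed - in_house))  # the gap an in-house build could close
```

The dollar rates above come straight from the ranges quoted in the text; only the function name is mine.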

In the end, if you have more than 10 petabytes of data, it might be worth checking out cloud object storage to take advantage of its ability to scale cost-effectively and transparently to hundreds of petabytes.  With the right data set, a company can achieve significant savings and support planned growth.  In addition, object storage offers a more flexible architecture for future growth and improved control over operational and capital costs.


May 28, 2012  12:30 PM

2012 MIT CIO Symposium – The surprisingly tethered untethered enterprise

By Beth Cohen

Question:  What is the leading edge thinking about the emerging role of mobile in the modern enterprise?

The purported theme of the 2012 MIT CIO Symposium was the untethered enterprise.  Ironically, for all the recent media buzz, there was little discussion of untethered enterprises, or even much mention of mobile at all.  Only one panel was slated to cover the topic, and that discussion quickly devolved into a tactical conversation about IT/vendor relationships and SLAs.  That doesn't mean there was no discussion of mobile devices, just few of the in-depth conversations about how the enterprise can best incorporate emerging technologies that I expect from this usually forward-thinking and informative conference.  Reading between the lines, the enterprise is not quite ready to tackle these difficult considerations.

Very few companies are creating the holistic enterprise mobility strategy that is needed to drive real business advantage.  To put this into perspective, an informal conference survey revealed that less than a third of the attendees have an enterprise BYOD policy in place today.  Some were simply letting their employees set the standards.  For one company this had resulted in far too many employees with two devices on their desks (one company-issued and the other a BYOD device of their own choice), and 11,000 unsupported iPads on their network.  That CIO wryly admitted that maybe it was time for the IT department to address the issue, if for no other reason than that millions were being wasted on underused computing resources.

The push to add mobility is coming from both the top and the bottom of organizations.  For every executive with a hot smartphone, there are 10 employees with three tablets.  However, IT is responding by pushing mobile apps out the door without a good understanding that mobile is a game-changing strategy, one that takes the organization from the traditional top-down IT approach that has been fashionable in recent years back to a bottom-up, consumer-driven initiative.  Of course, this completely runs counter to the traditional IT mindset.  As long as there is a disconnect between the demands of the workers and the services provided by IT, shadow IT will remain a strong force in the enterprise.

From the technology perspective, while enterprise mobility has been around in some form for 20 years, the underlying technology to support mobile apps is still quite brittle.  We are relying on a telecom infrastructure that isn't fully capable of supporting millions of mobile endpoints.  Because mobility is primarily a device-driven technology, it is completely dependent on that infrastructure; the IT organization is better off getting into the way-back machine and treating mobile devices as dumb terminals.  The good news is that mobile security is finally being taken seriously.  Clearly there are continuing issues that need to be addressed, but the technology and standards are there to make smartphones and tablets secure enough to meet even government standards.

Several times during the conference, the downsides of hyper-connectivity came up.  One panelist noted an interesting recent trend of formerly plugged-in 20-somethings choosing to disconnect as they ramped up their careers and realized that separating their private and work lives was a sensible idea.  Several others commented on the need to provide a work environment that is attractive to the tech-savvy worker, but old expectations of 100% worker availability are wearing thin.  Many American companies are realizing what the rest of the world has known forever: just because you can reach your workers 24 hours a day doesn't mean that you should.  There is an increasing awareness that, for sanity if nothing else, you need to apply reasonable business etiquette to worker communications.  My only comment is that after living the 24/7 IT worker life for 20 years, this revelation couldn't come soon enough for me.

What I realized at the end of the day was that mobile and untethered are just a red herring.  The new generation of users sees it for what it really is: shared, ubiquitous access to data in the cloud.  For that to succeed, it has to be dirt simple, and it has to be a thin-client service that delivers the functionality users need.  That is not so hard, is it?


May 5, 2012  11:00 PM

OpenStack – Will the Real Customer Please Come Forward?

By Beth Cohen

Question:  There has been much buzz in the IT community about OpenStack as the emerging cloud standard.  Just who is the OpenStack project designed for?

In mid-April 2012 I spent a week at the Folsom OpenStack Summit/Conference in San Francisco.  There was much enthusiasm for the project among the 450 developers who attended the design summit portion of the program, and even more among the 900 or so developers, product and business development folks who stayed for the two-day general conference that followed.  From both the business and technology perspectives, OpenStack has come a long way in just the past six months.  Where it continues to lag is in the noticeable lack of real users attending the conference to contribute their voices and valuable guidance.

On the vendor front, Rackspace, of course, continues to be the biggest supporter overall, but more marquee names have announced their intention to join the fledgling OpenStack Foundation, including AT&T, HP, Red Hat and IBM.  HP's motivations are clear; it is in the process of standing up one of the largest OpenStack public IaaS offerings, with the public beta scheduled to go live in mid-May 2012.  It certainly has the deep pockets and technical wherewithal to go up against Amazon and Rackspace.  IBM's incentives for supporting the project are less clear, but the company does have a long history of backing open source projects and using that support for its own ends – Linux is a prime example that immediately comes to mind.  The jury is still out on early supporters Citrix and Microsoft, neither of which has officially committed to the Foundation to date.

Looking at OpenStack from the technical perspective, in the 18 months since the project's inception it has come a long way towards becoming a viable system that could be used in a production environment.  I would still argue, as I did six months ago, that you need a cadre of systems engineers and dev/ops people to build and support it, but the attendees at the technical sessions I joined recognized the need for better user documentation in general and for more ways to engage the operations people who are stuck supporting something that is still pretty rough around the edges.  The frustration of the few end users in attendance was abundantly clear.  These people are not developers and don't have time to figure out the missing pieces.  As one person put it so succinctly, trolling through the code at 2 AM is not an acceptable method of troubleshooting a down system.

At the end of the day, OpenStack will only be successfully adopted by the enterprise if the real end users – the operations people from service providers such as HP and Rackspace, or enterprise customers – step forward and join the technical conversation.  Really, who better to set the direction of such an ambitious open source project than the users?  To rectify this situation, I encourage enterprises that are considering an OpenStack deployment to start contributing to the community today and to plan on sending their systems operations managers and architects to the next OpenStack Design Summit in the fall.  The future is literally in your hands.


April 19, 2012  3:30 PM

Megaupload – A Cloud Security Parable: Part 2

By Beth Cohen

Question:  Just when we were getting comfortable with storing our confidential data in the cloud, now I hear about the FBI shutting this company down.  Do I need to reexamine my cloud security policies?

A few years ago, I wrote a scary story about Widgitco, a fictitious company that found out the hard way that reading the fine print on cloud service contracts is important.  Widgitco's problems stemmed from a lack of clarity about who actually owned the data in the cloud.  When Widgitco discovered its customer list had fallen into the hands of its main competitor, it had little or no legal recourse, because the issue had never been properly addressed by any of the parties in the fiasco, including Widgitco, its SaaS providers or the actual owners of the data center hardware.

Now the courts are addressing a dramatic and all too real example of the question of who owns data in the cloud and, even more importantly, who is responsible for the files.  The recent legal problems of Megaupload, a file-sharing service not unlike many others in the cloud storage and file-sharing market, highlight these issues.  In a nutshell, it boils down to differing perspectives on the legal nature and purpose of file sharing.  Clearly there is lots of legitimate file sharing going on.  Dropbox, MediaFire, SugarSync, and even Megaupload, despite the government's legal actions, all have huge numbers of users sharing files for both business and personal reasons.  According to MediaFire, employees at 86 percent of the Fortune 500 use its services.  The company is not providing information on the nature of those files or what they are being used for, but I think we can safely assume at least some of the files are being legitimately used by collaborative teams to produce real work.  One can argue that using Dropbox and its ilk in the corporate setting poses a serious risk of exposing sensitive corporate data.  I agree with that sentiment, but I am also realistic about the reasons company employees are turning to these services in the first place.  As with the adoption of other consumer-driven innovations like mobile devices and IM, it is often simply because the available internal corporate file-sharing tools leave something to be desired.  How many of you have used a file-sharing service as a team collaboration tool simply because it was easy to use and expedient?

So what is the real issue?  The problem stems from a clash between the interests of media content companies such as Sony, Warner Brothers and others, who worry that these sites are primarily being used to share pirated copies of movies, e-books or music, and the far more common and benign private file sharing.  While one could argue that they should not be worried about piracy in the first place (read Charlie Stross's interesting take on that issue: What Amazon's ebook strategy means), that is a discussion for another time.  For the moment it appears that the government is siding with the large media conglomerates at the expense of everyone else.

Part of the reason for the renewed government interest in prosecuting these services is a change in the technology and service delivery model.  In many ways these cloud file-sharing systems combine the unenforceable peer-to-peer sharing strategy of BitTorrent, where the files are literally scattered all over the globe on millions of private computers, with the centralized server model of the late lamented Napster.  The government's case boils down to the fact that the files are located on an identifiable set of machines that constitute a cloud file repository, and are therefore subject to seizure.  I suspect the case rests more on the relative ease of access to the servers than on anything more valid.  A similar but more legally defensible case against Joel Tenenbaum, a Boston University graduate student caught with pirated music, garnered far more sympathy for the student than its intended anti-piracy warning message.

Where does that leave the legitimate business user?  The bottom line is that the on-going battle among pirates, business and the legal system will continue to work slowly through the courts, while the enabling technology to beat the system will remain far ahead of the law.  In the meantime, I think you can safely continue to use cloud file sharing services; just make sure they are business friendly and meet proper regulatory requirements for data security.


March 31, 2012  4:01 PM

Bringing Consumer Technology into the Enterprise

By Beth Cohen

Question: How does the widespread adoption of consumer technology affect the enterprise? What does an enterprise need to do to be prepared to benefit from implementing it?

The widespread adoption of technology at the consumer level is having profound and unexpected effects on enterprise IT. After years of IT stagnation caused by a combination of economic pressures and outsourcing, the adoption of consumer technology has been a breath of fresh air for some companies and a shock to others. For example, seemingly every corporate executive is demanding the latest tablet computer, but the casual use of mobile devices to view and transmit corporate IP is worrying to the business risk and governance folks.

At the very least, bringing consumer technology into the enterprise can ironically discourage internal IT innovation, because innovation generally means some risk and most enterprises are risk-averse. Consumer innovations like mobile devices can be disruptive technologies, but they also represent a significant challenge to the corporate view of itself as a self-contained entity. This has been positive when hard-pressed IT departments embrace the outside help these battle-tested technologies represent, and negative when IT digs in its heels with a "not invented here" attitude. The conflict is clear as the pendulum swings back to decentralized, bottom-up IT and corporate IT struggles to keep up with the rapid proliferation of cloud technologies.

Virtually every enterprise is using SaaS applications, whether the IT function knows it or not, as business managers with credit cards take back their applications by leveraging easy access to SaaS and cloud development environments. This can be seen as a positive trend, as business units take on responsibility for supporting their own IT requirements using public or private cloud technologies as the underlying infrastructure. One can argue that this is where the responsibility has always belonged, because business units can respond to the needs of the business far faster. On the other hand, the loss of centralized control and governance introduces a certain amount of inefficiency and significant risk at the enterprise level. As business units take over the applications, do you really know where your corporate data is?

Another worrying trend that will hinder the enterprise's ability to incorporate promising consumer technologies is the erosion of the skills and innovative spark needed to drive their adoption. Years of outsourcing and off-shoring have battered the core technical skills of corporate IT as the role has moved increasingly back towards being viewed as a costly utility by enterprise business managers who have little patience for, or interest in, IT as a strategic asset. As corporate IT staff have evolved into vendor managers rather than drivers of innovation, essential roles such as systems architect and senior network engineer have disappeared. I recently worked with a major corporate IT organization that had relied on its hardware vendors to manage its networks for so long that its internal networking skills had atrophied to the point that not a single person on staff knew how to design an IP addressing scheme for the new cloud implementation. The long-term effect of corporate IT downsizing and outsourcing of core functions is that IT departments are often ill-prepared for the challenges of consumer technologies, which require different approaches to organization and support processes.

I have painted a pessimistic picture for corporate IT, but there is hope. The real innovation is happening at the edges of the enterprise as smart, creative business managers take up the challenge to modernize and drive real business value from IT by using the tools they have become comfortable with. Smart IT departments can recapture the technology leadership role by seeking out promising new consumer technologies and integrating them into the enterprise IT portfolio before they get that dreaded surprise support call from the business unit!


March 4, 2012  2:00 AM

Cloud High Availability Take Two – Supporting Rack Level Failure

By Beth Cohen

Question:  I am concerned that the network is the weakest link in my private cloud. What will happen if any of my network hardware components fail?

In a previous discussion of cloud high availability, I covered in general terms some of the principles and approaches that make sense in a cloud environment. This time we will dive into the details of how this can be achieved in an OpenStack environment.

The average published MTBF for switches seems to be between 100,000 and 200,000 hours, which translates to between 11 and 22 years. This number is dependent on the ambient temperature of the switch in the data center; I am assuming that most modern data centers are properly cooled for maximum switch life. Even in the worst case of poor ventilation and high ambient temperatures, published research suggests the MTBF is still 2-3 years.

The mean time to replacement (MTTR) for a switch depends on exactly how the data center is staffed and what processes are used for replacing switches. Assuming that you keep a few spares in the data center and that it is fully staffed 24 hours a day, the average time to replace a switch, including configuration, is going to be under 2 hours. Most modern switches are auto-configured, so the actual provisioning time after the switch is powered up in the rack is under 5 minutes.
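Plugging those MTBF and MTTR figures into the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), shows how little a single switch failure contributes to downtime; a quick Python check:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A ToR switch with a 100,000-hour MTBF and a 2-hour replacement time:
a = availability(100_000, 2)
print(f"{a:.6f}")          # ~0.999980, i.e. better than four nines per switch
print(100_000 / (24 * 365))  # the same MTBF expressed in years: ~11.4
```

Of course this is the availability of one switch in isolation; the cluster-level behavior discussed next is what determines whether a switch failure matters at all.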

Let me walk through what happens in the case of a top-of-rack (ToR) switch failure in a Swift cluster. Swift is by its nature fault tolerant at the rack level, meaning the system will continue to operate without data loss if an entire rack goes off-line. The cluster detects the rack being off-line and sends out a notification that the NOC staff would see within 5 minutes. When a rack goes off-line, Swift does not automatically move any data, because the NOC staff needs to decide what caused the rack to go off-line and how long it will take to come back. In the case of a switch failure, the data in the rack is still intact, so it is far more efficient to simply replace the switch and bring the rack back on-line without moving the data. Even if the NOC staff decides to move data around, which they would only do if the fault is in the servers rather than the switch, the network overhead added to the cluster is in the range of 3-5% for a large cluster with a properly tuned ring rebuild cycle. Clearly, taking a rack off-line is not considered a problem. I would argue that you should expect to be able to take racks off-line with no impact to the system as a whole, as a matter of course, for maintenance, upgrades and other reasons.
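That rack-level fault tolerance comes from placing each replica of a partition in a different zone, so losing one rack never removes all copies. This is a simplified Python sketch of the idea, not Swift's actual ring code; the rack and device names are hypothetical:

```python
# Hypothetical layout: each rack is its own zone, Swift-style.
zones = {"rack1": ["d1", "d2"], "rack2": ["d3", "d4"], "rack3": ["d5", "d6"]}

def assign_replicas(partition: int, replicas: int = 3):
    """Place each replica of a partition in a different zone (rack),
    so losing any single rack still leaves replicas - 1 copies reachable."""
    racks = sorted(zones)
    chosen = [racks[(partition + i) % len(racks)] for i in range(replicas)]
    # pick a device within each chosen rack
    return [(rack, zones[rack][partition % len(zones[rack])]) for rack in chosen]

placement = assign_replicas(7)
assert len({rack for rack, _ in placement}) == 3  # all three racks are used
print(placement)
```

With three replicas spread across three racks, a ToR switch failure leaves two copies of every object reachable, which is exactly why the NOC can afford to wait for a switch swap rather than rebalance data.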

Nova behaves slightly differently in the case of a rack failure. Unlike Swift, the architecture does not assume a base unit of failure at the rack level. It does have the concept of an availability zone which, just to confuse things, is quite different from a Swift zone. That doesn't mean you cannot create an equally fault-tolerant Nova architecture; it just requires building high availability at the application level, combined with using availability zones as a mechanism for balancing applications across different locations. The assumption is that it is the responsibility of the application to build in fault tolerance, not of the underlying infrastructure to keep track of individual VM instances. Combining availability zones with live migration and HA application design will allow you to build support for rack-level failure. Again, the metrics for determining the next steps (replacement of the switch only, or rebuilding of the entire rack) will be based on the specific component failure. See the recent discussion of this topic for more ideas on how to architect such a system.

Another approach is to create high availability through redundant hardware; in this case you would provision each rack with two switches. However, this is an expensive option in a large data center with hundreds of racks. From a risk perspective, you have substantially increased your per-rack costs with little or no reduction in risk, since the rate of failure is so low to begin with and the architected unit of failure for a cloud infrastructure should be the rack in the first place.


February 25, 2012  2:00 AM

The Illusion of Cloud High Availability – Hardcore risk management

By Beth Cohen

Question:  I am building a private cloud and am concerned about how to meet the SLA high availability requirements.  What are my risks and how can I best manage them?

To understand your high availability options, it is important to discuss how high availability, risk and component failure work in a cloud environment.  High availability is very important for cloud environments; the cloud provider is often required to meet strict service level agreements of 99.99% or 99.999% (the so-called five nines) availability.  In theory, that is the very reason customers are interested in using cloud services.  It should be noted that most public cloud service providers have many ways of measuring availability in their own favor, so even in the case of catastrophic systems failure they are rarely accountable for the downtime.  This has been one of many sources of caution for enterprises that have wanted to leverage public cloud services.

Before going into the details of how to quantify the cost of risk mitigation for a cloud, a short discussion of the science of risk management will help with understanding how it all works.  The goal of business risk management is to detail what kinds of risks exist in your specific business and determine how to prevent them entirely or minimize their impact on the business as a whole. Business risk management essentially quantifies the probability that a given system will fail multiplied by the cost of that failure. Cost is further broken into two categories: out-of-pocket costs and lost opportunity costs.  Out-of-pocket costs are what you must pay to fix the problem, while lost opportunity costs are revenue lost while the system is unavailable.  For example, the risk of regular earthquakes in Japan is high.  The Japanese have responded to this threat with some of the strongest earthquake-resistant building codes in the world.  However, as last year’s magnitude 9.0 earthquake and the tsunami that followed so dramatically demonstrated, it is impossible to fully prepare for such extreme and rare events.
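The risk-management arithmetic above can be put into numbers. The probabilities and dollar figures below are invented purely for illustration:

```python
# Annualized expected loss = probability of the failure x total cost,
# where total cost = out-of-pocket repair cost + lost revenue during the
# outage. All figures here are made up for illustration.

def expected_annual_loss(annual_probability, out_of_pocket,
                         lost_revenue_per_hour, outage_hours):
    total_cost = out_of_pocket + lost_revenue_per_hour * outage_hours
    return annual_probability * total_cost

# A 5% yearly chance of a failure costing $20k in repairs plus 8 hours
# of downtime at $10k/hour in lost revenue:
loss = expected_annual_loss(0.05, 20_000, 10_000, 8)
print(loss)  # 5000.0
```

Comparing that expected loss against the price of a mitigation (say, a redundant switch per rack) is exactly the trade-off the rest of this column walks through.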

High availability is best addressed through redundancy, which can be achieved at several levels of the IT infrastructure: hardware, software, network, or a combination.  Traditional IT organizations have reduced the risk of downtime by concentrating almost exclusively on hardware redundancy.  At cloud scale, with thousands or hundreds of thousands of systems, hardware redundancy at the component level quickly becomes unsustainably expensive.  A telling scholarly study of reported hardware failures from several large data centers shows that, as would be expected, the components most likely to fail at data center scale are those with moving parts, such as hard drives and power supplies.

In practice, the enterprise needs to balance the cost of duplicating hardware throughout the cloud ecosystem, the traditional approach to solving the risk management problem, against the need to keep operating expenses low. Another consideration is that duplicating everything at the hardware level does not automatically guarantee that you have eliminated every single point of failure in the environment.  For example, you might have remembered to contract with multiple carriers to spread the risk of a network outage, but if their lines all enter the data center at a single location, the data center is still prone to a catastrophic “backhoe failure”, which is what happens when a backhoe severs all the up-link cables in one fell swoop.  It is an expensive and time-consuming repair that leaves many unhappy customers in its wake.  Yes, there are ways to mitigate this risk, but they are expensive and need to be balanced against the relative probability of such an event.
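The backhoe example is worth quantifying: redundant components in parallel look extremely reliable on paper, but a shared dependency caps the whole path at its own availability. The availability figures below are invented for illustration:

```python
# Why a shared conduit defeats carrier redundancy: two independent carriers
# in parallel are very reliable, but the combined path can never be more
# available than the single trench both lines run through. Numbers invented.

def parallel(*availabilities):
    """Availability of independent redundant components in parallel."""
    unavailability = 1.0
    for a in availabilities:
        unavailability *= (1 - a)
    return 1 - unavailability

carriers = parallel(0.999, 0.999)   # two 99.9% carriers: ~six nines on paper
with_conduit = carriers * 0.995     # serial dependency on one 99.5% conduit

print(round(carriers, 6))
print(round(with_conduit, 6))       # dragged below 99.5% by the shared trench
```

The parallel redundancy bought six nines, and the single shared conduit threw it all away, which is the whole argument for mapping dependencies before buying duplicate hardware.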

The best approach is to look at the probability of failure of each component in the context of the entire ecosystem.  Since hard drives fail at such a high rate, the hardware approach is to mirror or RAID the drives across thousands of systems.  This translates to data redundancy and added costs that far exceed the optimum balance of availability and cost.  At scale, building a storage system that handles data redundancy at the software level is far more efficient.  Another example is planning for power supply (or, more precisely, fan) failures.  Since these are generally the next most common components to fail, instead of filling the cloud data center with thousands of extra power supplies and fans, it is better to build the cloud to tolerate the loss of a server node without downtime.  In the end, addressing cloud high availability is not only about determining the mean time to failure (MTTF) of hard drives, cables or switch ports; it is also about balancing those figures against the likelihood of a given failure at the data center macro level.
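A quick calculation shows why drive failures dominate the planning at scale. The 3% annual failure rate below is an assumed figure in the range commonly reported for hard drives, not a measurement from any particular data center:

```python
# Scale turns rare component failures into routine events: at an assumed
# 3% annual failure rate, a fleet of 10,000 drives fails roughly daily,
# so replacement must be a routine process, not an emergency.

def expected_failures_per_year(num_drives, annual_failure_rate):
    return num_drives * annual_failure_rate

failures = expected_failures_per_year(10_000, 0.03)
print(failures)                   # 300.0 failed drives per year
print(round(failures / 365, 2))   # ~0.82 failures per day
```

With nearly a failure a day, software-level data redundancy that shrugs off a dead drive is clearly cheaper than racing a technician to every RAID rebuild.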


February 18, 2012  1:15 PM

Reducing Cloud Operations Costs with Automation

Beth Cohen Profile: Beth Cohen

Question:  How can I best realize the promised operational cost savings of my private cloud?

If you are running a typical enterprise IT organization, you are probably struggling with an organization and processes that are not optimized for delivering cloud services.  Traditional IT operations are designed to handle customized applications and a heterogeneous IT infrastructure, just the opposite of the skills and processes needed to support cloud services.  As an illustration, I recently had a conversation with a data center engineer about deployment automation.  He noted that his group was able to build a new server in four hours, so he didn’t really see the point of further automation.  Hand building systems works fine when you are building 10 servers a week.  It does not scale when you are building 10 or 100 servers a day.  Deployment automation is designed to solve the problem of how to set up hardware and systems quickly when managing hundreds of racks and thousands of servers.  Achieving this level of automation requires acquiring new staff skills, building a factory approach to operations, and developing different types of processes.  What is often overlooked is that this will in turn drive significant changes in the enterprise IT organization to meet the new demands of supporting the cloud infrastructure.
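The engineer’s four-hour build time makes the scaling argument concrete. The staffing figures below are simple arithmetic on assumed numbers, not data from any real operation:

```python
# The scaling argument in numbers: at four hand-built hours per server,
# one engineer tops out around ten servers per 40-hour week, so provisioning
# 100 servers a day by hand would require a small army. Figures assumed.

HOURS_PER_HAND_BUILD = 4
WORK_HOURS_PER_WEEK = 40
SERVERS_PER_DAY_TARGET = 100
WORK_DAYS_PER_WEEK = 5

servers_per_engineer_per_week = WORK_HOURS_PER_WEEK // HOURS_PER_HAND_BUILD
weekly_target = SERVERS_PER_DAY_TARGET * WORK_DAYS_PER_WEEK
engineers_needed = weekly_target // servers_per_engineer_per_week

print(servers_per_engineer_per_week)  # 10 servers/week per engineer
print(engineers_needed)               # 50 engineers doing nothing but builds
```

Fifty full-time engineers doing nothing but OS installs is exactly the cost that deployment automation is designed to eliminate.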

Public cloud services can be offered so cheaply because the providers have both economies of scale and, more importantly, the operations expertise to support a highly automated IT infrastructure. Amazon is estimated to have over 300,000 servers.  They do not provision them by hand; it would be an impossible task.   Any company managing cloud services, public or private, has faced this problem and has needed to build processes that allow data center administrators to quickly stand up new or replacement racks and servers.

With automation, racks and servers can be provisioned with a minimum of error-prone human labor in a few minutes or hours.  In the case of hardware failure, the administrators simply install new hardware, power it up and allow the auto-provisioning systems to complete the loading of the operating systems and applications.  The hardware is pre-wired into the rack, so that it can be easily plugged in and then automatically configured using the deployment automation.  Ideally, systems are configured and monitored to send out alarms or even automated orders directly to the vendor for new hardware when a certain usage level is reached or there is a hardware failure in the system.
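The “automated orders to the vendor” idea above can be sketched as a simple fleet-review loop. The rack inventory structure, field names, and thresholds below are all hypothetical:

```python
# Sketch of automated hardware ordering: a periodic review flags racks for
# vendor orders when a node has failed or utilization crosses a threshold.
# The inventory records and the 80% threshold are hypothetical examples.

def review_fleet(racks, usage_threshold=0.8):
    """Return (rack, reason) pairs that should trigger a vendor order."""
    orders = []
    for rack in racks:
        if rack["failed_nodes"] > 0:
            orders.append((rack["name"], "replace failed nodes"))
        elif rack["utilization"] >= usage_threshold:
            orders.append((rack["name"], "capacity expansion"))
    return orders

fleet = [
    {"name": "rack-01", "utilization": 0.85, "failed_nodes": 0},
    {"name": "rack-02", "utilization": 0.40, "failed_nodes": 2},
    {"name": "rack-03", "utilization": 0.30, "failed_nodes": 0},
]
print(review_fleet(fleet))
```

In a production system the same check would feed a monitoring alarm or a purchasing workflow rather than a print statement, but the decision logic is this simple.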

If you are serious about reducing the operational costs of your cloud investment, the smartest thing you can do is invest in serious automation for all your operations.  This includes not only building staff skills and developing the capacity for automated virtual machine deployment, but also automating the deployment of server nodes, network gear and even entire racks.  I would recommend creating automated deployment processes that allow daily or even hourly system builds for faster systems development, test cycles and production deployments.  Leverage automation processes and frameworks to automate deployment of all modules across the entire cloud architecture.

Deployment and operations automation not only allows for appropriate expansion, but it also cuts the costs of delivering high availability by reducing the need for expensive hardware redundancy.  Service level agreements, growth and scaling can all be addressed by deployment and operations automation.


February 12, 2012  11:00 AM

Robust Cloud Network Architectures, or why the Internet runs on Layer 3

Beth Cohen Profile: Beth Cohen

Question:  I am designing a network for my private cloud.  Should I use Layer 2 switches or Layer 3 routers for my cloud network architecture?

Since the dawn of the Internet there has been an ongoing debate over whether to use Layer 2 (Ethernet) or Layer 3 (IP) networking inside the data center.  In the beginning (yes, we are talking about the 1990s), networks were for the most part built on Layer 3 protocols.  L2 switches were only used for internal LANs or very small installations.  Does anyone really fondly remember NetBIOS or SPX/IPX?  While Layer 2 switches were easy to deploy (one brand was appropriately named Black Box), they became impossibly slow and unreliable once you scaled past a couple hundred machines.  Before the development of public IP address conservation mechanisms such as CIDR, DHCP and NAT, if you wanted an Internet connection you had to assign each system a public IP address anyway.

Fast forward 20 years and many network protocols later, and data centers are now typically architected to use L2 switches rather than L3 routers.  The reasoning seems to be that Ethernet is faster because you avoid the overhead of the IP hierarchy, and you don’t have to worry about reconfiguring IP addresses as systems move around.  For me the second argument doesn’t hold up, since that is exactly what DHCP and DNS are designed for!  The simplicity of Layer 2 protocols might work well in a data center with hundreds of physical machines, but cloud data centers have the additional burden of keeping track of all the virtual machine addresses and networks as well.  It is not uncommon for one physical node to support 30-40 VMs.  Layer 2 switching protocols have improved mostly by adding bolt-ons such as VLANs, RBridges, or Cisco’s L2MP.  I would argue that these are all patches, some of them proprietary, over the fundamental scale and complexity problem. They still don’t have the built-in hierarchy and resiliency of a fully routed IP network.

A better paradigm is to think of cloud data centers as miniature (or in the case of Amazon, not so miniature) versions of the Internet.  Thus applying the inherent scalability and flexibility of the IP address based Internet to a cloud data center network architecture makes perfect sense.

Cloud Network Architecture Basic Principles

  • The cloud means, “Any server, any service, any time”
  • Scalability through hierarchy
  • Simplified network management
  • Maximum network traffic flexibility
  • Flattened traffic flow over the entire network mesh
  • Minimize amount of state information maintained in network by keeping VM state (VM MACs and IPs) out of core network
  • Reduce number of protocols to manage

Layer 2 Architecture Limitations

  • Number of VLAN IDs is limited to 4096 (4094 usable)
  • Number of MACs stored in switch tables is limited
  • Need to maintain a set of Layer 4 devices to handle traffic control
  • MLAG (which is used for switch redundancy) is a proprietary solution that doesn’t scale beyond two devices and forces vendor lock-in
  • Difficult to troubleshoot network without IP addresses and ICMP
  • ARP timeouts and caching are tricky to tune on large L2 networks
  • All network devices need to be aware of all MACs, even VM MACs, so there is constant churn in MAC tables and network state changes as VMs are started or stopped
  • Migrating MACs (VM migration) to different physical locations could be a problem if ARP table timeouts aren’t set properly

Layer 3 Architecture Advantages

  • Provides the same level of resiliency and scalability as the Internet
  • Easy to control traffic with routing metrics
  • Can use BGP confederation for scalability, so core routers have state proportional to number of racks, not to number of servers or VMs
  • Keeps per-VM state (VM MACs and IPs) out of the network core, reducing state churn; the only routing state changes occur when a top-of-rack (ToR) switch or a backbone link fails
  • Uses ICMP to monitor and manage traffic
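The state-scaling contrast between the two lists is stark when you put numbers on it. The fleet sizes below are illustrative assumptions (the 30-40 VMs per node figure comes from the discussion above):

```python
# Why keeping VM state out of the core matters: in a flat L2 design every
# core switch must learn every VM MAC, while a routed L3 design summarizes
# each rack into a single prefix. Fleet sizes assumed for illustration.

RACKS = 200
SERVERS_PER_RACK = 40
VMS_PER_SERVER = 35   # within the 30-40 VMs per physical node cited above

l2_core_entries = RACKS * SERVERS_PER_RACK * VMS_PER_SERVER  # one MAC per VM
l3_core_entries = RACKS                                      # one route per rack

print(l2_core_entries)  # 280000 MAC entries churning with every VM start/stop
print(l3_core_entries)  # 200 routes that change only on rack or link failure
```

Three orders of magnitude less core state, and state that changes only on hardware failure rather than on every VM lifecycle event, is the essence of the Layer 3 argument.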

In future articles, I will dive further into some of the ways that Layer 3 networks, virtual networks, and the most exciting new networking development, software-only networks, can be used to successfully address the needs of a cloud data center network.

