Ask the IT Consultant

Boston SIM Consultants' Roundtable Blog


May 5 2012   11:00PM GMT

OpenStack - Will the Real Customer Please Come Forward?



Posted by: Beth Cohen
OpenStack, open source cloud, Openstack foundation, IaaS cloud, IaaS, public cloud services, enterprise cloud services

Question:  There has been much buzz in the IT community about OpenStack as the emerging cloud standard.   Just who is the OpenStack project designed for?

I spent a week in mid-April 2012 at the Folsom OpenStack Summit/Conference in San Francisco.  There was much enthusiasm for the project among the 450 developers who attended the development summit portion of the program and even more among the 900 or so developers, product and business development folks who stayed for the two-day general conference that followed.  From both the business and technology perspectives, OpenStack has come a long way in just the past six months.  Where it continues to lag is in the noticeable lack of real users attending the conference to contribute their voices and valuable guidance.

On the vendor front, Rackspace, of course, continues to be the biggest supporter overall, but more marquee names have announced their intention to join the fledgling OpenStack Foundation, including AT&T, HP, Red Hat and IBM.  HP’s motivations are clear; it is in the process of standing up one of the largest OpenStack public IaaS offerings, with the public beta scheduled to go live in the middle of May 2012.  It certainly has the deep pockets and technical wherewithal to successfully go up against Amazon and Rackspace.  IBM’s incentives for supporting the project are less clear, but the company does have a long history of backing open source projects and using that support for its own ends; Linux is a prime example that immediately comes to mind.  The jury still seems to be out for early supporters Citrix and Microsoft, neither of which has officially committed to the Foundation to date.

Looking at OpenStack from the technical perspective, in the 18 months since the project’s inception, it has come a long way towards becoming a viable system that could be used in a production environment.   I would still argue, as I did six months ago, that you need a cadre of systems engineers and dev/ops people to build and support it, but the attendees at the technical sessions I joined recognized the need for better user documentation in general and more ways to engage the operations people who are stuck supporting something that is still pretty rough around the edges.  The frustration from the few end users who were in attendance was abundantly clear.  These people are not developers and don’t have time to figure out the missing pieces.  As one person put it so succinctly, trolling through the code at 2 AM is not an acceptable method of troubleshooting a down system.

At the end of the day, OpenStack will only successfully be adopted by the enterprise if the real end users, the operations people from service providers such as HP and Rackspace or enterprise customers, step forward and join the technical conversation. Really, who better to set the direction of such an ambitious open source project than the users?  To rectify this situation, I encourage enterprises that are considering an OpenStack deployment project to start contributing to the community today and plan on sending their systems operations managers and architects to the next OpenStack Design Summit in the fall.  The future is literally in your hands.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Transforming Businesses with Cloud Solutions

Apr 19 2012   3:30PM GMT

Megaupload - A Cloud Security Parable: Part 2



Posted by: Beth Cohen
cloud storage, cloud compliance, consumer cloud, Consumer IT technology, consumer IT innovation, Cloud backup

Question:  Just when we were getting comfortable with storing our confidential data in the cloud, now I hear about the FBI shutting this company down.  Do I need to reexamine my cloud security policies?

A few years ago, I wrote a scary story about Widgitco, a fictitious company that found out the hard way that reading the fine print on cloud service contracts is important.  Widgitco’s problems stemmed from a lack of clarity about who actually owned the data in the cloud.  When Widgitco discovered that its customer list had fallen into the hands of its main competitor, it had little or no legal recourse because this issue had never been properly addressed by any of the parties in the fiasco, including Widgitco, its SaaS providers or the actual owners of the data center hardware.

Now the courts are addressing a dramatic and all too real example of the question of who owns data in the cloud and, even more importantly, who is responsible for the files.  The recent legal problems of Megaupload, a file sharing service typical of many others in the cloud storage and file sharing market, highlight these issues.   In a nutshell, it boils down to differing perspectives on the legal nature and purpose of file sharing.  Clearly there is lots of legitimate file sharing going on.  Dropbox, MediaFire, SugarSync, and even Megaupload, despite the government legal actions, all have a huge number of users who use them to share files for both business and personal reasons.   According to MediaFire, employees at 86 percent of the Fortune 500 use its services.  The company does not provide information on the nature of those files and what they are being used for, but I think we can safely assume at least some of the files are being legitimately used by collaborative teams to produce real work.  One can argue that using Dropbox and its ilk in the corporate setting poses a serious risk of exposing sensitive corporate data.  I do agree with this sentiment, but I am also realistic about the reasons company employees are turning to these services in the first place.  As with the adoption of other consumer driven innovations such as mobile devices and IM, it is often simply because the available internal corporate file sharing tools leave something to be desired.  How many of you have used a file sharing service as a team collaboration tool simply because it was easy to use and met the need for expediency?

So what is the real issue?  The problem stems from a clash between the interests of the media content delivery companies such as Sony, Warner Brothers and others, who are worried that these sites are primarily being used to share pirated copies of movies, e-books or music, and the far more common and benign private file sharing.  While one could argue that they should not be worried about the pirating in the first place (read Charlie Stross’s interesting take on that issue, “What Amazon’s ebook strategy means”), that is a discussion that will be left to another time.  For the moment it appears that the government is siding with the large media conglomerates at the expense of everyone else.

Part of the reason that there is renewed government interest in prosecuting these services is a change in the technology and service delivery model.  In many ways these cloud file sharing systems combine the hard to police peer-to-peer sharing strategy of BitTorrent, where the files are literally scattered all over the globe on millions of private computers, with the centralized server model of the late lamented Napster.  The government case boils down to the fact that the files are located on an identifiable set of machines that constitute a cloud file repository, and therefore they can be subject to seizure.   I suspect the government case rests more on the relative ease of access to the servers than anything more valid.  A similar but more legally defensible case against the poor Boston University graduate student Joel Tenenbaum, who got caught with pirated music, garnered far more sympathy for the student than attention for its intended anti-piracy warning message.

Where does that leave the legitimate business user?  The bottom line is that the on-going battle among pirates, business and the legal system will continue to work slowly through the courts, while the enabling technology to beat the system will remain far ahead of the law.  In the meantime, I think you can safely continue to use cloud file sharing services; just make sure they are business friendly and meet proper regulatory requirements for data security.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Transforming Businesses with Cloud Solutions


Mar 31 2012   4:01PM GMT

Bringing Consumer Technology into the Enterprise



Posted by: Beth Cohen
IT Innovation, Consumer IT technology, business innovation, enterprise architectures, IT operations, enterprise IT

Question: How does the widespread adoption of consumer technology affect the enterprise? What does an enterprise need to do to be prepared to benefit from implementing it?

The widespread adoption of technology at the consumer level is having profound and unexpected effects on enterprise IT. After years of IT stagnation caused by a combination of economic pressures and outsourcing, the adoption of consumer technology has been a breath of fresh air for some companies and a shock to others. For example, seemingly every corporate executive is demanding the latest tablet computer, but the casual use of mobile devices to view and transmit corporate IP is worrying to the business risk and governance folks.

At the very least, bringing consumer technology into the enterprise can ironically discourage internal IT innovation, because innovation generally means some risk and most enterprises are risk-averse. Consumer innovations like mobile devices can be disruptive technologies, but they also represent a significant risk to the corporate view of itself as a self-contained entity. This has been a positive where hard pressed IT departments have embraced the outside help these battle tested technologies represent, and a negative where IT has dug in its heels with a “not invented here” attitude. The conflict is clear as the pendulum swings back to decentralized or bottom up IT, leaving corporate IT struggling to keep up with the rapid proliferation of cloud technologies.

Virtually all enterprises are using SaaS applications, whether the IT function knows it or not, as business managers with credit cards take back their applications by leveraging easy access to SaaS and cloud development environments. This can be seen as a positive trend as the business units take on the responsibility for supporting their own IT requirements using (public or private) cloud technologies as the underlying infrastructure. One can argue that is where it has always belonged, because the business units are able to respond to the needs of the business far faster. On the other hand, the loss of centralized control and governance represents a certain amount of inefficiency and introduces significant risk at the enterprise level. As business units take over the applications, do you really know where your corporate data is?

Another worrying trend that will hinder the ability of the enterprise to incorporate promising consumer technologies is the erosion of the skills and the innovative spark needed to drive their adoption. Years of outsourcing and off-shoring have battered the core technical skills of corporate IT as the role has moved increasingly back towards being viewed as a costly utility by enterprise business managers who have little patience for or interest in IT as a strategic asset. As corporate IT staff have evolved into vendor managers rather than drivers of innovation, essential roles such as systems architects and senior network engineers have disappeared. I recently worked with a major corporate IT organization that had relied on its hardware vendors to manage its networks for so long that internal networking skills had atrophied to the point where not a single person on staff knew how to design an IP addressing scheme for their new cloud implementation. The long term effects of corporate IT downsizing and outsourcing of core functions have meant that IT departments are often ill-prepared for the challenges of introducing consumer technologies that require different approaches to organization and support processes.

I have painted a pessimistic picture for corporate IT, but there is hope. The real innovation is happening at the edges of the enterprise as smart, creative business managers take up the challenge to modernize and drive real business value from IT by using the tools they have become comfortable with. Smart IT departments can recapture the technology leadership role by seeking out promising new consumer technologies and integrating them into the enterprise IT portfolio before they get that dreaded surprise support call from the business unit!

About the Author

Beth Cohen, Cloud Technology Partners, Inc. Transforming Businesses with Cloud Solutions


Mar 4 2012   2:00AM GMT

Cloud High Availability Take Two – Supporting Rack Level Failure



Posted by: Beth Cohen
Cloud architectures, BD/DR, business continuity, Disaster Recovery, Cloud computing standards, cloud data center, OpenStack, cloud computing models, cloud infrastructure, cloud hardware, Cloud innovation, Cloud IT

Question:  I am concerned that the network is the weakest link in my private cloud. What will happen if any of my network hardware components fail?

In a previous discussion of cloud high availability, I covered in general terms some of the principles and approaches that make sense in a cloud environment. This time we will dive into some details of how this can be achieved in an OpenStack environment.

The average published MTBF on switches seems to be between 100,000 and 200,000 hours, which translates to roughly 11 to 23 years of continuous operation. This number is dependent on the ambient temperature of the switch in the data center; I am assuming that most modern data centers are properly cooled for maximum switch life. Even in the worst case of poor ventilation and high ambient temperatures in the data center, the MTBF is still 2-3 years based on research found at http://www.garrettcom.com/techsupport/papers/ethernet_switch_reliability.pdf.
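The hours-to-years conversion is easy to check. Here is a minimal sketch, assuming 8,760 hours in a non-leap year. The annual failure estimate additionally assumes a constant failure rate (an exponential model), which is my simplification and not something taken from the vendor data sheets; the function names are just for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def mtbf_years(mtbf_hours):
    """Convert a published MTBF in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours):
    """Chance that one switch fails within a year.

    Assumes a constant failure rate (exponential model), which is a
    simplification for illustration only.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for hours in (100_000, 200_000):
    print(
        f"MTBF {hours:,} h = {mtbf_years(hours):.1f} years, "
        f"annual failure probability about {annual_failure_probability(hours):.1%}"
    )
```

Under these assumptions a single switch has somewhere between a 4% and 9% chance of failing in any given year, which is why the number of switches in the data center matters more than the MTBF of any one of them.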

The mean time to replacement (MTTR) for a switch is going to be dependent on exactly how the data center is staffed and what processes are used for replacing switches. Assuming that you would keep a few spares in the data center and that it is fully staffed 24 hours/day, the average time to replace a switch including configuration is going to be under 2 hours. Most modern switches are auto-configured, so the actual provisioning time after the switch is powered up in the rack is under 5 minutes.
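To put the MTBF and MTTR figures together, the standard steady-state availability formula is MTBF / (MTBF + MTTR). The sketch below applies it to a single top of rack switch using the 100,000 hour MTBF and the 2 hour replacement time discussed here; treating 2 hours as the average rather than the upper bound is my assumption.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_minutes_per_year(mtbf_hours, mttr_hours):
    """Average downtime per year implied by the availability figure."""
    unavailable = 1.0 - availability(mtbf_hours, mttr_hours)
    return unavailable * HOURS_PER_YEAR * 60


a = availability(100_000, 2)
d = expected_downtime_minutes_per_year(100_000, 2)
print(f"availability: {a:.5%}, expected downtime: {d:.1f} minutes/year")
```

With these inputs a single non-redundant switch already sits between four and five nines, with roughly ten minutes of expected downtime per rack per year.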

Let me walk through what will happen in the case of a top of rack (ToR) switch failure in the Swift cluster. Swift by its nature is fault tolerant at the rack level. That means that the system will continue to operate without data loss if an entire rack goes off-line. The cluster would detect the rack being off-line and send out a notification that the NOC staff would see within 5 minutes. In the case of a rack going off-line, Swift does not automatically move any data. The reason for this is that the NOC staff needs to make a decision about the cause of the rack going off-line and how long it will take for it to come back on-line. In the case of a switch failure, the data in the rack is still intact, so it is far more efficient to just replace the switch and then bring the rack back on-line without having to move the data. Even if the NOC staff decides to move data around, which they would only do if the fault is in the servers and not the switch, the network overhead that it adds to the cluster is in the range of 3-5% for a large cluster with a properly tuned ring rebuild cycle. Clearly, taking a rack off-line is not considered a problem. I would argue that you should expect to be able to take racks off-line with no impact to the system as a whole as a matter of course for maintenance, upgrades and other reasons.

Nova behaves slightly differently in the case of a rack failure. Unlike Swift, the architecture does not have an assumed base unit of failure at the rack level. It does have the concept of an availability zone, which is quite different from a Swift zone just to confuse things. That doesn’t mean that you cannot create an equally fault tolerant Nova architecture; it just requires more development of high availability at the application level of the system combined with the use of the availability zone as a mechanism for balancing the applications in different locations. The assumption is that it is the responsibility of the application to build in fault tolerance, not the underlying infrastructure to keep track of the individual VM instances. Nova zones can be used to achieve this level of fault tolerance. Combining them with live migration functionality and HA application design will allow you to build support for rack level failure. Again, the metrics for determining the next steps (replacement of switch only or rebuilding of entire rack) will be based on the specific component failure. See the recent discussion of this at http://lists.us.dell.com/pipermail/crowbar/2012-January/000643.html for more ideas on how to architect such a system.

Another approach would be to create high availability through redundant hardware. In this case you could provision the racks with two switches. However, this is an expensive option in a large data center with hundreds of racks, since it roughly doubles the top of rack switch cost for every rack. From a risk perspective, you have substantially increased your per rack costs with little or no reduction in risk, since the rate of failure is so low and the architected unit of failure for a cloud infrastructure should be at the rack level to begin with.

About the Author

Beth Cohen, Cloud Technology Partners, Inc. Transforming Businesses with Cloud Solutions


Feb 25 2012   2:00AM GMT

The Illusion of Cloud High Availability – Hardcore risk management



Posted by: Beth Cohen
cloud infrastructure, cloud data center, IT operations, SLA, Cloud operations, Cloud architectures, cloud computing, Data center operations, risk management, IT risk management, Dev/Ops

Question:  I am building a private cloud and am concerned about how to meet the SLA high availability requirements.  What are my risks and how can I best manage them?

To understand how to manage your high availability options, it is important to have a discussion about how high availability, risk and component failure work in a cloud environment.  High availability is very important for cloud environments; often the cloud provider is required to meet strict service level agreements for 99.99% or 99.999% (the so-called five nines) availability.  In theory, that is the very reason that customers are interested in using cloud services.  It should be noted that most of the public cloud service providers have many ways of measuring availability that work in their favor, so even in the case of catastrophic systems failure, they are rarely held accountable for the downtime.  This has been one of many sources of caution for enterprises that have wanted to leverage public cloud services.
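It helps to see what those SLA percentages mean in actual minutes. The following is a small sketch, assuming a 365 day year and an SLA measured over the full calendar year with no exclusions (real contracts usually carve out maintenance windows, which is one of the measurement methods that favors the provider).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability SLA."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


for sla in (99.9, 99.99, 99.999):
    print(f"{sla}% allows {downtime_budget_minutes(sla):.2f} minutes/year")
```

Four nines leaves less than an hour of downtime per year and five nines barely five minutes, which is less time than it takes most NOC staff to notice an alert.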

Before going into the details of how to quantify the cost of risk mitigation for a cloud, a short discussion of the science of risk management will help with understanding how it all works.  The goal of business risk management is to detail what kinds of risks exist in your specific business and determine how to prevent them entirely or minimize their impact on the business as a whole. Business risk management is essentially quantifying the probability that a given system will fail multiplied by the cost of that failure.  Cost is further broken into two categories: out of pocket costs, also referred to as sunk costs, and lost opportunity costs.  Sunk costs are costs that you will need to pay out to fix the problem, while lost opportunity costs are revenue lost due to the system unavailability.  For example, the risk that there will be regular earthquakes in Japan is high.  The Japanese have responded to this threat by having some of the strongest earthquake resistant building codes in the world.  However, as last year’s 9.0 tremor and following tsunami so dramatically demonstrated, it is impossible to prepare for such extreme and rare events.
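The probability times cost calculation can be sketched in a few lines. All of the dollar figures and probabilities below are made up purely for illustration, and the function names are my own; the point is only to show how a mitigation is judged against the reduction in expected loss it buys.

```python
def expected_annual_loss(failure_probability, out_of_pocket_cost, lost_opportunity_cost):
    """Probability of failure in a year multiplied by the total cost of a failure."""
    return failure_probability * (out_of_pocket_cost + lost_opportunity_cost)


def mitigation_pays_off(loss_before, loss_after, annual_mitigation_cost):
    """A mitigation is worthwhile only if it saves more than it costs."""
    return (loss_before - loss_after) > annual_mitigation_cost


# Hypothetical system: 5% annual chance of failure, $20,000 to repair,
# $80,000 in lost revenue while it is down.
before = expected_annual_loss(0.05, 20_000, 80_000)
# Hypothetical mitigation that cuts the failure probability to 1%.
after = expected_annual_loss(0.01, 20_000, 80_000)

print(f"expected loss before: ${before:,.0f}, after: ${after:,.0f}")
print("worth $10,000/year?", mitigation_pays_off(before, after, 10_000))
print("worth $2,500/year?", mitigation_pays_off(before, after, 2_500))
```

In this made-up case the mitigation saves about $4,000 a year in expected loss, so it is worth buying at $2,500 a year but not at $10,000.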

High availability is best addressed by redundancy.  However, redundancy can be achieved at several levels of the IT infrastructure: hardware, software, network, or a combination.  Traditional IT organizations have reduced the risk of downtime by concentrating almost exclusively on hardware redundancy.  At the scale of the cloud, where there are already thousands or hundreds of thousands of systems, hardware redundancy at the component level quickly becomes unsustainably expensive.   A telling scholarly article that looked at the reported hardware failures from several large data centers shows that, as would be expected, by far the most likely components to fail at data center scale are those with moving parts, such as hard drives and power supplies.

In practice, the enterprise needs to balance the cost of duplicating hardware throughout the cloud ecosystem, which is the traditional approach to solving the risk management problem, against the need for keeping operating expenses low. Another consideration is that duplicating everything at the hardware level does not automatically guarantee that you do not still have a single point of failure in the environment.  For example, you might have remembered to contract with multiple carriers to spread the risk of a network outage, but if they all come into the data center at a single location, the data center is still prone to a catastrophic “backhoe failure”, which is what happens when a backhoe severs all the uplink cables in one fell swoop.  It is an expensive and time consuming repair that leaves many unhappy customers in its wake.  Yes, there are ways to mitigate this risk, but they are expensive and need to be balanced against the relative probability of such an event.

The best approach is to look at the probability of failure of each component in the context of the entire ecosystem.  Since hard drives fail at such a high rate, the hardware approach is to mirror or RAID the drives across thousands of systems.  This translates to data redundancy and added costs that far exceed the optimum for availability and cost reduction.  At scale, building a storage system that handles the data redundancy at the software level is far more efficient.  Another example is planning for power supply (or more precisely fan) failures.  Again, since they are generally the next most common component to fail, instead of filling the cloud data center with thousands of extra power supplies and fans, it is better to build the cloud to be resistant to downtime if a server node fails.  In the end, addressing cloud high availability is not only about determining the MTTF of hard drives, cables or switch ports; it is also about balancing it against the likelihood of a given failure at the data center macro level.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Transforming Businesses with Cloud Solutions


Feb 18 2012   1:15PM GMT

Reducing Cloud Operations Costs with Automation



Posted by: Beth Cohen
Cloud architectures, enterprise cloud services, Cloud Services, operations, Data center operations, Deployment automation, Enterprise datacenter, Dev/Ops

Question:  How can I best realize the promised operational cost savings of my private cloud?

If you are running a typical enterprise IT organization, you are probably struggling with an organization and processes that are not optimized for delivering cloud services.  Traditional IT operations are best designed to handle customized applications and a heterogeneous IT infrastructure, just the opposite of the skills and processes that are needed to support cloud services.  As an illustration of this, I recently had a conversation with a data center engineer about deployment automation.  He noted that his group was able to build a new server in four hours, so he didn’t really see the point in further automation.  Hand building systems works fine when you are building 10 servers a week.  It does not scale when you are building 10 or 100 servers a day.  Deployment automation is designed to solve the problem of how to set up hardware and systems quickly when managing hundreds of racks and thousands of servers.  Achieving this level of automation requires the acquisition of new staff skills, building a factory approach to operations, and developing different types of processes.  What is often overlooked is that it will in turn drive significant changes to the enterprise IT organization to meet the new demands for supporting the cloud infrastructure.
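The arithmetic behind the four hour server build is worth spelling out. This is a back-of-the-envelope sketch, assuming a 40 hour work week, five build days per week, and that an engineer builds one server at a time; those three assumptions are mine, not the engineer’s.

```python
HOURS_PER_BUILD = 4        # from the conversation described above
WORK_HOURS_PER_WEEK = 40   # assumption
BUILD_DAYS_PER_WEEK = 5    # assumption


def engineers_needed(servers_per_week, hours_per_build=HOURS_PER_BUILD):
    """Full-time engineers required to hand build the given weekly volume."""
    return servers_per_week * hours_per_build / WORK_HOURS_PER_WEEK


scenarios = {
    "10 servers a week": 10,
    "10 servers a day": 10 * BUILD_DAYS_PER_WEEK,
    "100 servers a day": 100 * BUILD_DAYS_PER_WEEK,
}

for label, weekly_volume in scenarios.items():
    print(f"{label}: {engineers_needed(weekly_volume):.0f} full-time engineer(s)")
```

At 10 servers a week one person can keep up. At 100 servers a day the same process needs a team of 50 doing nothing but builds, which is the point where hand building stops being a viable option.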

The reason public cloud services are offered so cheaply is that their providers have both the economies of scale and, more importantly, the operations expertise to support highly automated IT infrastructure. Amazon is estimated to have over 300,000 servers.  They do not provision them by hand; it would be an impossible task.   Any company that is managing cloud services, public or private, has faced this problem and has needed to build processes to allow data center administrators to quickly stand up new or replacement racks and servers.

With automation, racks and servers can be provisioned with a minimum of error-prone human labor in a few minutes or hours.  In the case of hardware failure, the administrators simply install new hardware, power it up and allow the auto-provisioning systems to complete the loading of the operating systems and applications.  The hardware is pre-wired into the rack, so that it can be easily plugged in and then automatically configured using the deployment automation.  Ideally, systems are configured and monitored to send out alarms or even automated orders directly to the vendor for new hardware when a certain usage level is reached or there is a hardware failure in the system.

If you are serious about reducing your operational costs for your cloud investment, the smartest thing you can do is invest in some serious automation for all your operations.  This includes not only building staff skills and developing the capacity for automated virtual machine deployment, but also automating deployment of server nodes, network gear and even entire racks.  I would recommend creating automated deployment processes to allow daily or even hourly system builds for faster systems development, test cycles and production deployments.  Leverage automation processes and frameworks to automate deployment of all modules across the entire cloud architecture.

Deployment and operations automation not only allows for appropriate expansion, but it also cuts the costs of delivering high availability by reducing the need for expensive hardware redundancy.  Service level agreements, growth and scaling can all be addressed by deployment and operations automation.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Transforming Businesses with Cloud Solutions


Feb 12 2012   11:00AM GMT

Robust Cloud Network Architectures, or why the Internet runs on Layer 3



Posted by: Beth Cohen
Cloud Network Architectures, Cloud architectures, Cloud Networks, Data center operations, Data Center Networks, Virtual Networks, IaaS, Layer 3 networks, Layer 2 networks

Question:  I am designing a network for my private cloud.  Should I use Layer 2 switches or Layer 3 routers for my cloud network architecture?

Since the dawn of the Internet there has been an ongoing debate over whether to use Layer 2 (Ethernet) or Layer 3 (IP) networking inside the data center.  In the beginning (yes, we are talking about the 1990s), networks were for the most part built on Layer 3 protocols.  L2 switches were only used for internal LANs or very small installations.  Does anyone really fondly remember NetBIOS or IPX/SPX?  While Layer 2 switches were easy to deploy - one brand was appropriately named Black Box - they were impossibly slow and unreliable once you scaled past a couple hundred machines.  Before the development of public IP address sparing protocols such as CIDR, DHCP and NAT, if you wanted to have an internet connection you had to assign each system a public IP address anyway.

Fast forward 20 years and many network protocols later, data centers are now typically architected to use L2 switches rather than L3 routers.  The reasoning seems to be that Ethernet is faster because you don’t have the overhead of the IP hierarchy, and you don’t have to worry about reconfiguring IP addresses as systems get moved around.  For me the second argument doesn’t hold up, since that is exactly what DHCP and DNS are designed for!  The simplicity of Layer 2 protocols might work well in a data center with hundreds of physical machines, but cloud data centers have the additional burden of needing to keep track of all the virtual machine addresses and networks as well.  It is not uncommon for one physical node to support 30-40 VMs.  Layer 2 switching protocols have improved mostly by adding “bolt-ons” such as VLANs, RBridges, or Cisco’s L2MP.  I would argue that these are all patches, some of them proprietary, to the fundamental scale and complexity problem. They still don’t have the built-in hierarchy and resiliency of a fully routed IP network.

A better paradigm is to think of cloud data centers as miniature (or in the case of Amazon, not so miniature) versions of the Internet.  Thus applying the inherent scalability and flexibility of the IP address based Internet to a cloud data center network architecture makes perfect sense.

Cloud Network Architecture Basic Principles

  • The cloud means, “Any server, any service, any time”
  • Scalability through hierarchy
  • Simplified network management
  • Maximum network traffic flexibility
  • Flattened traffic flow over the entire network mesh
  • Minimize amount of state information maintained in network by keeping VM state (VM MACs and IPs) out of core network
  • Reduce number of protocols to manage

Layer 2 Architecture Limitations

  • Number of VLANs is limited to 4096
  • Number of MACs stored in switch tables is limited
  • Need to maintain a set of Layer 4 devices to handle traffic control
  • MLAG (which is used for switch redundancy) is a proprietary solution that doesn’t scale beyond two devices and forces vendor lock-in
  • Difficult to troubleshoot network without IP addresses and ICMP
  • Configuring ARP is tricky on large L2 networks
  • All network devices need to be aware of all MACs, even VM MACs, so there is constant churn in MAC tables and network state changes as VMs are started or stopped
  • Migrating MACs (VM migration) to different physical locations could be a problem if ARP table timeouts aren’t set properly
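
To make the scale limits in the list above concrete, here is a rough sizing sketch. The rack, server and VM counts and the MAC table capacity are hypothetical numbers picked for illustration, not figures from any particular vendor or data center. It also assumes one MAC address per physical server and per VM. The VLAN figures follow from the 12-bit VLAN ID field in 802.1Q, where IDs 0 and 4095 are reserved.

```python
# All sizing numbers below are hypothetical, chosen only for illustration.
RACKS = 100
SERVERS_PER_RACK = 40
VMS_PER_SERVER = 40            # upper end of the 30-40 VMs per node mentioned earlier
MAC_TABLE_CAPACITY = 128_000   # assumed switch limit, not a vendor figure

VLAN_ID_BITS = 12              # width of the 802.1Q VLAN ID field


def vlan_id_space(bits=VLAN_ID_BITS):
    """Total number of VLAN ID values."""
    return 2 ** bits


def usable_vlans(bits=VLAN_ID_BITS):
    """VLAN IDs available for use; 0 and 4095 are reserved by 802.1Q."""
    return 2 ** bits - 2


def flat_l2_mac_count(racks, servers_per_rack, vms_per_server):
    """MACs every switch may need to learn in one flat Layer 2 domain.

    Assumes one MAC per physical server and one per VM.
    """
    servers = racks * servers_per_rack
    return servers + servers * vms_per_server


macs = flat_l2_mac_count(RACKS, SERVERS_PER_RACK, VMS_PER_SERVER)
print(f"VLAN IDs: {vlan_id_space()} total, {usable_vlans()} usable")
print(f"MACs in a flat L2 domain: {macs:,}")
print(f"Exceeds assumed table capacity: {macs > MAC_TABLE_CAPACITY}")
```

Even a modest 100 rack cloud generates well over 100,000 MAC addresses under these assumptions, and every one of them churns through the switch tables as VMs start, stop and migrate.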

Layer 3 Architecture Advantages

  • Provides the same level of resiliency and scalability as the Internet
  • Easy to control traffic with routing metrics
  • Can use BGP confederation for scalability, so core routers have state proportional to number of racks, not to number of servers or VMs
  • It keeps per VM state (VM MACs and IPs) out of the network core, to reduce state churn. The only routing state changes are in case of a ToR failure or backbone link failure
  • Uses ICMP to monitor and manage traffic
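
The claim that core router state is proportional to the number of racks can be illustrated with a simple addressing plan. This is only a sketch: it assumes each ToR advertises a single aggregate prefix for its rack, uses the private 10.0.0.0/8 block, and reuses hypothetical rack and VM counts. It says nothing about how BGP itself would be configured.

```python
import ipaddress
from itertools import islice

# Hypothetical sizing, for illustration only.
RACKS = 100
SERVERS_PER_RACK = 40
VMS_PER_SERVER = 40


def prefix_for(hosts):
    """Longest IPv4 prefix length whose subnet still holds the given number of hosts."""
    prefix = 32
    while 2 ** (32 - prefix) - 2 < hosts:
        prefix -= 1
    return prefix


# One address per server plus one per VM.
addresses_per_rack = SERVERS_PER_RACK * (1 + VMS_PER_SERVER)
rack_prefix_len = prefix_for(addresses_per_rack)

# Carve one aggregate prefix per rack out of 10.0.0.0/8.
pool = ipaddress.ip_network("10.0.0.0/8")
rack_prefixes = list(islice(pool.subnets(new_prefix=rack_prefix_len), RACKS))

# The core only needs one route per rack, however many VMs live behind it.
print(f"addresses per rack: {addresses_per_rack}, prefix length: /{rack_prefix_len}")
print(f"routes in the core: {len(rack_prefixes)}")
print(f"endpoints behind them: {RACKS * addresses_per_rack:,}")

# Locating a VM is a prefix match, not a lookup in a table of every MAC.
vm = ipaddress.ip_address("10.0.9.17")
rack_index = next(i for i, net in enumerate(rack_prefixes) if vm in net)
print(f"{vm} lives in rack {rack_index} ({rack_prefixes[rack_index]})")
```

Under these assumptions the core carries 100 routes for 164,000 endpoints, and starting or stopping a VM changes nothing outside its own rack.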

In future articles, there will be further dives into some of the ways that Layer 3 networks, virtual networks, and the most exciting new networking development, software-only networks, can be used to successfully address the needs of a cloud data center network.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Transforming Businesses with Cloud Solutions


Feb 4 2012   10:30AM GMT

Cloud Use Cases – Making the cloud work for you!



Posted by: Beth Cohen
cloud infrastructure, Cloud architectures, enterprise cloud services, enterprise, cloud computing models, PaaS, IaaS, Cloud Business strategy, Cloud business models, application development, IT portfolio management

Question:  My company is exploring building a private cloud.  What uses will best leverage my cloud infrastructure investment?

The magic of the cloud is that it can do anything.  It is both robust and flexible, the best of both worlds.  OK, I admit that I have been spending far too much time reading cloud marketing materials lately.  Now back to reality.  Yes, the cloud is highly flexible and it can do almost anything, but if you want to get the most out of your private cloud investment, you need to pay attention to the underlying hardware, as I discussed previously, and you need to define what you plan to use it for by creating and testing use cases.

Use case planning seems counter-intuitive.  After all, you can sign up for a web server with Amazon in about 5 minutes.  Amazon does not know what you are planning on doing with it.  Wrong.  Amazon’s product management department spends plenty of time figuring out exactly what would be attractive to their typical customer and honing the service to deliver it.  For the enterprise, the planning process is no different, except that instead of an external paying customer, the users might be, for example, internal application developers or a web portal.

To give you an idea of how this works, let us say you are planning on using the cloud for the company’s e-commerce website.  This means that you will need to plan for applications that will support thousands of sessions per second, variable workloads and lots of complex and changing data.  By identifying key metrics such as the number of concurrent transactions per second and the size of the database, you can then build a method for testing your assumptions.
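One way to make "testing your assumptions" tangible is to write the use case down as target metrics and compare them with what a test run actually measured. The sketch below is only an illustration: the class, the field names and the numbers are hypothetical, and a real test plan would cover many more metrics.

```python
from dataclasses import dataclass


@dataclass
class UseCaseTarget:
    """Planning targets for one use case (all fields are hypothetical examples)."""
    name: str
    required_sessions_per_sec: float
    required_db_size_gb: float


def check_assumptions(target, sustained_sessions_per_sec, tested_db_size_gb):
    """Return a list of shortfalls; an empty list means the targets were met."""
    shortfalls = []
    if sustained_sessions_per_sec < target.required_sessions_per_sec:
        shortfalls.append(
            f"{target.name}: sustained {sustained_sessions_per_sec} sessions/s, "
            f"need {target.required_sessions_per_sec}"
        )
    if tested_db_size_gb < target.required_db_size_gb:
        shortfalls.append(
            f"{target.name}: tested with {tested_db_size_gb} GB of data, "
            f"need {target.required_db_size_gb} GB"
        )
    return shortfalls


# Assumed targets and assumed test results, for illustration only.
ecommerce = UseCaseTarget("e-commerce site", 2000, 500)
for problem in check_assumptions(ecommerce, 1500, 500):
    print(problem)
```

The point is less the code than the discipline: each use case gets explicit numbers, and each test run either meets them or reports where it fell short.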

To get the conversation started, here is a short list of possible use cases for a private cloud.  Over the next few weeks I will be digging deeper into how to leverage the cloud model in the enterprise.

Archive storage - Many companies have moved to keeping their archives online instead of on backup tape for many excellent reasons.  Using SAN or near-line storage is still expensive.  Cloud object or block storage is an attractive alternative because of its optimized approach to high availability.  It also scales nicely as archives grow over time.

Federated hypervisor/VM management - This is one of the main reasons that the enterprise is interested in the cloud in the first place - any server, any service, any time.  Adding self-service, chargeback and transparent delivery of the right resources from a federated pool can be very cost effective.  Look for a cloud that provides cross-platform hypervisor support and robust VM management tools.

Development and test - One of the best use cases for an enterprise cloud is a shared development and test environment.  Self-service is essential, but the private version allows much more control over resource use by using a rules-based delivery model to optimize IT investments.  Creating an enterprise PaaS environment is also desirable because it allows better integration across applications and more standardized application development.

Application spaghetti rationalization - An enterprise cloud delivers better application portfolio management and more efficient deployment by leveraging self-service features and rules for deployments based on types of use.

Web services, portals and e-commerce - Web services of all sorts are a natural for the enterprise cloud.  They are well suited to take advantage of the inherent elasticity and the automated, workload-based provisioning and deployment capabilities.

VDI Support - VDI is another natural for an enterprise cloud.  VDI is often used to better maintain control over workers’ compute environments, but the workloads are inherently highly variable, which is an excellent reason for implementing such systems on the cloud.  An obvious extension is mobile application support which is a growing part of the enterprise service portfolio.

Disaster Recovery/Business Continuity - Again, the cheap storage and VM management make a good case for using the cloud as a secondary site.  The public cloud is already heavily used for these purposes, but moving the function in-house could be cost effective for a very large enterprise.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Transforming Businesses with Cloud Solutions


Jan 15 2012   3:00PM GMT

Cloud Redundancy – A different approach to component failure



Posted by: Beth Cohen
cloud computing, cloud computing models, hardware failure, business continuity, Disaster Recovery, OpenStack, Cloud business models, enterprise cloud, Cloud Services, enterprise cloud services

Question:  What is the best way to manage the thousands of components in a typical cloud?  How does managing “at scale” change my systems administration practices?

People have been managing data centers for 30-40 years now, so that should mean that there is a good set of standard best practices for building highly available, resilient components.  That is true for the old-style data center, but the old best practices are expensive and do not scale well for cloud architectures.  Duplicating hardware to protect against failure works well when you have hundreds of components, but the cost grows linearly with the number of components, so it becomes prohibitive at scale.  Unlike in traditional IT operations, over-designing to protect against obsolescence is not desirable when scaling to thousands of nodes.  For example, spending an extra $6000/rack for 10 Gb switches might seem to be a sensible way to protect against hardware obsolescence if you have 10 racks, but that extra cost is much harder to justify when you are provisioning 1,000 racks and it has turned into an extra $6 million!
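The arithmetic is worth a quick check, since the premium scales directly with the rack count. This small sketch uses the $6000/rack figure from the example; the rack counts are just illustrative steps.

```python
EXTRA_COST_PER_RACK = 6000  # dollars per rack, the premium from the example above


def extra_cost(racks, per_rack=EXTRA_COST_PER_RACK):
    """Total premium for over-specified switches across a number of racks."""
    return racks * per_rack


for racks in (10, 100, 1000):
    print(f"{racks:>5} racks: ${extra_cost(racks):,}")
```

What looks like a modest line item at 10 racks becomes a seven-figure decision at 1,000.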

The principle of ‘replacement management’ takes on great importance when managing the thousands of physical devices required for a cloud deployment.  The advantage of the cloud is that you do not need to build expensive high availability redundant systems because an assumption that components will fail is built into the architecture.  By leveraging the huge pools of cloud resources, the level of redundancy can be considerably reduced.  If a component fails, the system will continue to work until someone replaces it.  Since low-priced commodity devices typically have a high rate of failure, the whole architecture needs to be based on “availability” and “partial failure”.

In a cloud environment, it makes much more sense to just replace a component than to worry about what caused the failure and try to troubleshoot it.  The most common components to fail are disks, since they have mechanical moving parts.  A typical disk failure rate in a cloud data center is about 10-15%.  Fans, power supplies and memory will also fail, though less frequently.  For example, the OpenStack Swift architecture assumes that disks, systems and entire zones can and will disappear (fail) at any time.  Yet, there are only three copies of every file, and no additional redundancy in the hardware.
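The idea of surviving a zone failure with only three copies can be sketched in a few lines. This is a toy illustration, not Swift's actual ring implementation: the zone names, the hash-based placement and the function names are all assumptions made for the example.

```python
import hashlib

ZONES = ["z1", "z2", "z3", "z4", "z5"]  # hypothetical zone names
REPLICAS = 3                            # three copies of every object


def place_replicas(object_name, zones=ZONES, replicas=REPLICAS):
    """Pick `replicas` distinct zones for an object, deterministically from its name."""
    digest = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    start = digest % len(zones)
    return [zones[(start + i) % len(zones)] for i in range(replicas)]


def is_available(object_name, failed_zones):
    """An object stays readable while at least one replica is in a healthy zone."""
    return any(zone not in failed_zones for zone in place_replicas(object_name))


placement = place_replicas("backups/2012-01-15.tar")
print("replica zones:", placement)
print("readable with two of its zones down:",
      is_available("backups/2012-01-15.tar", set(placement[:2])))
```

Because each copy lands in a different zone, losing a disk, a server or a whole zone removes at most one copy, and the software can re-create it elsewhere without any hardware-level redundancy.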

This approach to failure at scale can be very cost effective, but it takes a different mindset from traditional operations.  Every cloud operations engineer should learn what is in the service, where the critical parts are located, and how to replace a failed component, then incorporate that knowledge into standard operations processes.  Automated tools need to be written to help identify the location of failed disks and other components so they can quickly be isolated from the environment and replaced.  To maintain a high level of robustness without sacrificing cost efficiency, the system needs to be designed to replicate data at the application/software level, not at the disk or network level.
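As a minimal sketch of the kind of tool meant here, the snippet below joins a health report against an inventory to produce a replacement worklist ordered by physical location. The inventory, the disk IDs and the health report format are hypothetical; in practice they would come from your asset database and monitoring system.

```python
from typing import NamedTuple


class Location(NamedTuple):
    rack: str
    server: str
    slot: int


# Hypothetical inventory: disk ID -> physical location.
INVENTORY = {
    "disk-0001": Location("rack-01", "stor-01", 3),
    "disk-0002": Location("rack-01", "stor-02", 7),
    "disk-0003": Location("rack-02", "stor-05", 1),
}


def replacement_worklist(health, inventory=INVENTORY):
    """Return (disk_id, location) for each failed disk, ordered by rack, server, slot."""
    failed = [
        (disk_id, inventory[disk_id])
        for disk_id, healthy in health.items()
        if not healthy and disk_id in inventory
    ]
    return sorted(failed, key=lambda item: item[1])


# Assumed health report: True means healthy, False means failed.
report = {"disk-0001": True, "disk-0002": False, "disk-0003": False}
for disk_id, loc in replacement_worklist(report):
    print(f"replace {disk_id}: {loc.rack} / {loc.server} / slot {loc.slot}")
```

Ordering the list by rack and server means a technician can walk the floor once and swap every failed part along the way.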

In conclusion, the biggest paradigm shift is that development and operations groups need to work together to optimize the systems and drive down costs.  Tests and metrics need to be created to determine the optimum system configurations.  Understanding how changes in the components affect the system as a whole will allow you to flexibly configure the systems to meet the application requirements as they change.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Transforming Businesses with Cloud Solutions


Jan 3 2012   1:30PM GMT

Cloud Hardware – Sacrificing system efficiency for low cost



Posted by: Beth Cohen
Cloud architectures, cloud hardware, Cloud Data Storage, Data center operations, IT hardware, disk failure, cloud computing models, Cloud Reference architecture

Question:  Is the cloud really hardware agnostic?

The wonderful thing about cloud architectures is that they are designed to be cost effective at massive scales.  The major cloud providers are profitable not only because they can aggregate customers and use the available equipment more efficiently, but also because they can leverage their considerable market muscle to purchase truckloads of components at steep discounts.  As Google discovered and published in Failure Trends in a Large Disk Drive Population, the brand and cost of a hard drive had little to do with its reliability.  Another paper delivered at the same 2007 USENIX conference, Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?, came to similar conclusions.  The key to building reliability in the cloud is not quality components; it is building a hardware architecture that assumes that the components will fail and plans for that failure.  Since the individual components are essentially interchangeable, it stands to reason that a good cloud architecture should be completely hardware agnostic.

The fallacy of that kind of thinking is the assumption that failure rates are the only criterion for choosing a given component.  As you know, hardware is a moving target; new and better hardware is always coming around the next corner.  Any good storage engineer knows that enterprise customers do not pay the EMC or NetApp premium just because they feel more comfortable buying from a known brand.  They are typically paying for the better tools, faster performance or bigger capacity that they need for their high performance applications.

It turns out that this applies to cloud hardware architectures as well.  Hardware does in fact matter if a cloud is going to run at peak efficiency.  Which hardware components are chosen can make a significant difference under stress conditions.  When the objective is to optimize the environment, the ideal cloud environment should be running at close to peak capacity - essentially under some stress - most of the time.  For example, in a storage array, the two constraints are always going to be system network bandwidth and disk I/O, i.e. how fast the disks can push the data around.  By specifying a faster disk controller and tweaking the configuration to boost the throughput, by eliminating disk write caching for example, the entire system will run that much more efficiently.  Yes, in this case you will be reducing disk reliability, but since you already have a mechanism that provides disk failure resiliency in other ways, that risk can be tolerated in exchange for the faster throughput.
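A simple way to reason about the two constraints is that a storage node can move data no faster than its slowest stage. The sketch below models that as a minimum over the network link, the disk controller and the combined disks. The function names and all the throughput figures are illustrative assumptions, not vendor specifications.

```python
def stage_limits(network_mb_s, disks, per_disk_mb_s, controller_mb_s):
    """Throughput ceiling of each stage in the data path, in MB/s."""
    return {
        "network": network_mb_s,
        "controller": controller_mb_s,
        "disks": disks * per_disk_mb_s,
    }


def node_throughput(network_mb_s, disks, per_disk_mb_s, controller_mb_s):
    """Return (bottleneck_name, throughput_mb_s) for a storage node."""
    limits = stage_limits(network_mb_s, disks, per_disk_mb_s, controller_mb_s)
    name = min(limits, key=limits.get)
    return name, limits[name]


# Assumed figures: 1250 MB/s network, 12 disks at 100 MB/s each.
print(node_throughput(1250, 12, 100, controller_mb_s=600))   # slower controller
print(node_throughput(1250, 12, 100, controller_mb_s=2000))  # faster controller
```

With the assumed numbers, upgrading the controller moves the bottleneck from the controller to the disks themselves, which is exactly the kind of shift that should be found by testing rather than by guesswork.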

In conclusion, at the proof of concept and small system level, cloud hardware agnosticism works just fine, but for massive cloud installations that want to run at peak efficiency, paying attention to specifying the right hardware components to eliminate the throughput bottlenecks has the potential to boost overall performance significantly.  The trick is determining whether the increased performance is worth the hardware cost differential.  Of course, at truly Amazonian scales, that cost differential essentially disappears.  However, at more modest enterprise scales, in my opinion, the TCO business case for the better hardware will prevail in most cases.

About the Author

Beth Cohen, Cloud Technology Partners, Inc.  Moving companies’ IT services into the cloud the right way, the first time!

