Posted by: Beth Cohen
Cloud architectures, cloud computing, cloud data center, cloud infrastructure, Cloud operations, Data center operations, Dev/Ops, IT operations, IT risk management, risk management, SLA
Question: I am building a private cloud and am concerned about how to meet the SLA high availability requirements. What are my risks and how can I best manage them?
To understand how to manage your high availability options, it helps to start with how high availability, risk, and component failure interact in a cloud environment. High availability is critical for clouds; the provider is often required to meet strict service level agreements of 99.99% ("four nines") or 99.999% ("five nines") availability. In theory, that reliability is the very reason customers are interested in cloud services in the first place. It should be noted, however, that most public cloud service providers measure availability in ways that favor them, so even after a catastrophic systems failure they are rarely held accountable for the downtime. This has been one of many sources of caution for enterprises that want to leverage public cloud services.
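To make those SLA figures concrete, here is a quick sketch of how much downtime per year each level of "nines" actually permits (simple arithmetic, not tied to any particular provider's SLA definitions):

```python
# Annual downtime allowed at a given availability level.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (ignoring leap years)

for availability in (0.9999, 0.99999):  # "four nines" and "five nines"
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} availability allows "
          f"{downtime_min:.1f} minutes of downtime per year")
```

Four nines works out to roughly 53 minutes of downtime a year, and five nines to barely 5 minutes, which is why how the provider chooses to measure "downtime" matters so much.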
Before going into the details of how to quantify the cost of risk mitigation for a cloud, a short discussion of the science of risk management will help explain how it all works. The goal of business risk management is to identify the kinds of risks your specific business faces and determine how to prevent them entirely or minimize their impact on the business as a whole. In essence, business risk management quantifies the probability that a given system will fail multiplied by the cost of that failure. Cost breaks down into two further categories: out-of-pocket costs, the money you must pay out to fix the problem, and lost opportunity costs, the revenue lost while the system is unavailable. For example, the risk of regular earthquakes in Japan is high, and the Japanese have responded to this threat with some of the strictest earthquake-resistant building codes in the world. However, as last year's 9.0 tremor and the ensuing tsunami so dramatically demonstrated, it is impossible to fully prepare for such extreme and rare events.
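The risk formula described above can be sketched in a few lines. All of the numbers below are invented purely for illustration:

```python
# Expected annual loss = probability of failure x total cost of the failure,
# where total cost = out-of-pocket repair cost + lost opportunity cost.
def expected_annual_loss(p_failure_per_year, out_of_pocket, lost_opportunity):
    return p_failure_per_year * (out_of_pocket + lost_opportunity)

# Hypothetical example: a 5% yearly chance of an outage that costs
# $20,000 to repair and $80,000 in lost revenue.
loss = expected_annual_loss(0.05, 20_000, 80_000)
print(f"Expected annual loss: ${loss:,.0f}")  # Expected annual loss: $5,000
```

A mitigation that costs less per year than this expected loss is, at least by this simple model, worth considering; one that costs far more probably is not.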
High availability is best addressed through redundancy. Redundancy, however, can be achieved at several levels of the IT infrastructure: hardware, software, network, or a combination. Traditional IT organizations have reduced the risk of downtime by concentrating almost exclusively on hardware redundancy. At cloud scale, with thousands or even hundreds of thousands of systems, hardware redundancy at the component level quickly becomes unsustainably expensive. A telling scholarly study of reported hardware failures at several large data centers found that, as one would expect, the components most likely to fail at data center scale are those with moving parts, such as hard drives and power supplies.
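A small calculation shows why component failures dominate at scale. Assuming (purely for illustration) that each drive independently has a 3% chance of failing in a given year, the probability that at least one drive in a fleet fails approaches certainty as the fleet grows:

```python
# Probability that at least one of n independent components fails,
# given a per-component failure probability p_single.
def p_at_least_one_failure(p_single, n):
    return 1 - (1 - p_single) ** n

print(p_at_least_one_failure(0.03, 10))    # ~0.26 for a 10-drive server
print(p_at_least_one_failure(0.03, 1000))  # effectively 1.0 for 1,000 drives
```

In other words, at cloud scale a drive failure somewhere is not a rare event to be prevented but a routine event to be designed around.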
In practice, the enterprise needs to balance the cost of duplicating hardware throughout the cloud ecosystem, the traditional approach to the risk management problem, against the need to keep operating expenses low. Another consideration is that duplicating everything at the hardware level does not automatically guarantee that no single point of failure remains in the environment. For example, you might have remembered to contract with multiple carriers to spread the risk of a network outage, but if all their lines enter the data center at a single location, the facility is still prone to a catastrophic "backhoe failure": a backhoe severing all the uplink cables in one fell swoop. That is an expensive, time-consuming repair that leaves many unhappy customers in its wake. Yes, there are ways to mitigate this risk, but they are expensive and need to be balanced against the relative probability of such an event.
The best approach is to look at the probability of failure of each component in the context of the entire ecosystem. Because hard drives fail at such a high rate, the hardware approach is to mirror or RAID the drives across thousands of systems; this produces data redundancy, and added cost, far beyond the optimum balance of availability and expense. At scale, building a storage system that handles data redundancy at the software level is far more efficient. Another example is planning for power supply (or, more precisely, fan) failures. Since these are generally the next most common components to fail, instead of filling the cloud data center with thousands of extra power supplies and fans, it is better to build the cloud to tolerate the loss of a server node without downtime. In the end, addressing cloud high availability is not only about determining the MTTF of hard drives, cables, or switch ports; it is also about balancing those figures against the likelihood of a given failure at the data center macro level.
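The software-level redundancy argument above can be sketched numerically. Assuming again a hypothetical 3% annual failure rate per drive and independent failures, keeping k software-managed replicas of each object makes simultaneous loss of all copies geometrically unlikely:

```python
# Probability that every one of k independent replicas is lost,
# given a per-drive failure probability p_drive.
def p_all_replicas_fail(p_drive, k):
    return p_drive ** k

print(p_all_replicas_fail(0.03, 2))  # 0.0009  (two replicas)
print(p_all_replicas_fail(0.03, 3))  # 2.7e-05 (three replicas)
```

This is the intuition behind replicated object stores: rather than paying for RAID and duplicate power in every chassis, the software layer treats whole-node loss as normal and spreads copies across failure domains.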
About the Author
Beth Cohen, Cloud Technology Partners, Inc. Transforming Businesses with Cloud Solutions