Question: I am concerned that the network is the weakest link in my private cloud. What will happen if any of my network hardware components fail?
In a previous discussion of cloud high availability, I covered in general terms some of the principles and approaches that make sense in a cloud environment. This time we will dive into the details of how high availability can be achieved in an OpenStack environment.
The average published mean time between failures (MTBF) for switches seems to be between 100,000 and 200,000 hours, which translates to roughly 11 to 23 years of continuous operation. This number depends on the ambient temperature of the switch in the data center; I am assuming that most modern data centers are properly cooled for maximum switch life. Even in the worst case of poor ventilation and high ambient temperatures in the data center, the MTBF is still 2-3 years, based on research found at http://www.garrettcom.com/techsupport/papers/ethernet_switch_reliability.pdf.
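The hours-to-years conversion above is simple arithmetic, and it can be taken one step further to estimate the chance that a given switch fails within a year. The following is a minimal sketch, not a vendor formula: it assumes failures follow an exponential model with a constant failure rate (my assumption, not something the published MTBF figures guarantee), and the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years are ignored


def mtbf_years(mtbf_hours: float) -> float:
    """Convert an MTBF quoted in hours into years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one switch fails within a year.

    Assumes an exponential failure model (constant failure rate),
    which is an illustrative simplification.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for hours in (100_000, 200_000):
    print(
        f"{hours:,} h MTBF = {mtbf_years(hours):.1f} years, "
        f"{annual_failure_probability(hours):.1%} chance of failure per year"
    )
```

Under these assumptions, even the lower MTBF figure leaves any individual switch with a failure chance of well under 10% in a given year.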
The mean time to replacement (MTTR) for a switch depends on exactly how the data center is staffed and what processes are used for replacing switches. Assuming that you keep a few spares in the data center and that it is fully staffed 24 hours a day, the average time to replace a switch, including configuration, is going to be under 2 hours. Most modern switches are auto-configured, so the actual provisioning time after the switch is powered up in the rack is under 5 minutes.
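Putting the MTBF and replacement-time figures together gives a rough availability number for a single switch. This sketch uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR), and treats the 2-hour replacement time as the MTTR. That ignores detection and travel time, so read the result as an optimistic estimate under my assumptions rather than a measured value.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours
MINUTES_PER_HOUR = 60


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the switch is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected minutes per year that the switch is out of service."""
    unavailable = 1.0 - availability(mtbf_hours, mttr_hours)
    return unavailable * HOURS_PER_YEAR * MINUTES_PER_HOUR


MTTR_HOURS = 2  # upper bound on replacement time quoted in the text

for mtbf in (100_000, 200_000):
    print(
        f"MTBF {mtbf:,} h: availability {availability(mtbf, MTTR_HOURS):.5%}, "
        f"about {downtime_minutes_per_year(mtbf, MTTR_HOURS):.1f} minutes down per year"
    )
```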
Let me walk through what will happen in the case of a top of rack (ToR) switch failure in the Swift cluster. Swift by its nature is fault tolerant at the rack level. That means that the system will continue to operate without data loss if an entire rack goes off-line. The cluster would detect the rack being off-line and send out a notification that the NOC staff would see within 5 minutes.
When a rack goes off-line, Swift does not automatically move any data. The reason is that the NOC staff needs to make a decision about the cause of the outage and how long it will take for the rack to come back on-line. In the case of a switch failure, the data in the rack is still intact, so it is far more efficient to just replace the switch and then bring the rack back on-line without having to move the data. Even if the NOC staff decides to move data around, which they would only do if the fault is in the servers and not the switch, the network overhead that it adds to the cluster is in the range of 3-5% for a large cluster with a properly tuned ring rebuild cycle. Clearly, taking a rack off-line is not considered a problem. I would argue that you should expect to be able to take racks off-line with no impact to the system as a whole, as a matter of course, for maintenance, upgrades and other reasons.
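To see why losing a rack does not lose data, here is a toy model of rack-level replica placement. It is deliberately not Swift's ring algorithm. It only assumes that every object has three replicas and that no two replicas of the same object share a rack. The replica count, rack names and function names are all hypothetical, chosen for illustration.

```python
def place_replicas(object_ids, racks, replicas=3):
    """Toy placement: put each object's replicas on distinct racks, round-robin.

    This is NOT how the Swift ring assigns partitions; it only illustrates
    the "no two replicas in the same rack" property.
    """
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for index, object_id in enumerate(object_ids):
        placement[object_id] = [
            racks[(index + offset) % len(racks)] for offset in range(replicas)
        ]
    return placement


def replicas_after_rack_failure(placement, failed_rack):
    """Return the replicas of each object that are still reachable."""
    return {
        object_id: [rack for rack in rack_list if rack != failed_rack]
        for object_id, rack_list in placement.items()
    }


racks = ["rack-1", "rack-2", "rack-3", "rack-4", "rack-5"]
placement = place_replicas([f"object-{n}" for n in range(10)], racks)
survivors = replicas_after_rack_failure(placement, "rack-3")
worst_case = min(len(rack_list) for rack_list in survivors.values())
print(f"Fewest reachable replicas after losing rack-3: {worst_case}")
```

In this model every object keeps at least two reachable replicas while the rack is down, which is why there is no urgency to copy data elsewhere before the switch is replaced.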
Nova behaves slightly differently in the case of a rack failure. Unlike Swift, its architecture does not assume the rack as the base unit of failure. It does have the concept of an availability zone, which, just to confuse things, is quite different from a Swift zone. That doesn’t mean that you cannot create an equally fault-tolerant Nova architecture; it just requires more development of high availability at the application level, combined with the use of the availability zone as a mechanism for balancing the applications across different locations. The assumption is that it is the responsibility of the application to build in fault tolerance, not of the underlying infrastructure to keep track of the individual VM instances. Nova zones, combined with live migration functionality and HA application design, will allow you to build support for rack-level failure. Again, the decision about the next steps (replacing only the switch or rebuilding the entire rack) will be based on the specific component that failed. See the recent discussion of this at http://lists.us.dell.com/pipermail/crowbar/2012-January/000643.html for more ideas on how to architect such a system.
Another approach would be to create high availability through redundant hardware. In this case you could provision each rack with two switches. However, this is an expensive option in a large data center with hundreds of racks, because it doubles the number of ToR switches you have to buy, cable and manage. From a risk perspective, you have substantially increased your per-rack costs with little or no reduction in risk, since the rate of failure is so low to begin with and the architected unit of failure for a cloud infrastructure should be the rack anyway.
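The cost versus risk trade-off can be roughed out with the same figures used earlier. The sketch below compares one switch per rack against two, assuming a 300-rack data center, a 150,000-hour MTBF (the midpoint of the published range), a 2-hour replacement time, and independent switch failures. All of these are illustrative assumptions of mine, and no prices are included because none are given here.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours
MINUTES_PER_HOUR = 60


def unavailability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a single switch is out of service."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def rack_downtime_minutes_per_year(mtbf_hours, mttr_hours, switches_per_rack=1):
    """Expected minutes per year a rack is cut off from the network.

    The rack is unreachable only while ALL of its switches are down at once;
    failures are assumed to be independent.
    """
    all_down = unavailability(mtbf_hours, mttr_hours) ** switches_per_rack
    return all_down * HOURS_PER_YEAR * MINUTES_PER_HOUR


def switches_needed(racks: int, switches_per_rack: int) -> int:
    """Total ToR switches to purchase and manage (spares not counted)."""
    return racks * switches_per_rack


RACKS, MTBF_HOURS, MTTR_HOURS = 300, 150_000, 2  # illustrative assumptions

for per_rack in (1, 2):
    downtime = rack_downtime_minutes_per_year(MTBF_HOURS, MTTR_HOURS, per_rack)
    print(
        f"{per_rack} switch(es) per rack: {switches_needed(RACKS, per_rack)} switches, "
        f"{downtime:.4f} minutes of rack isolation per year"
    )
```

Under these assumptions the second switch buys back only a few minutes of isolation per rack per year, in a design where a rack going off-line is already a tolerated event.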
About the Author
Beth Cohen, Cloud Technology Partners, Inc. Transforming Businesses with Cloud Solutions