Question: What is the best way to manage the thousands of components in a typical cloud? How does managing “at scale” change my systems administration practices?
People have been managing data centers for 30-40 years now, so there should be a good set of standard best practices for building highly available, resilient components. That is true for the old-style data center, but the old best practices are expensive and do not scale well to cloud architectures. Duplicating hardware to protect against failure works well when you have hundreds of components, but the cost grows linearly with the number of components, so it becomes prohibitive at cloud scale. Unlike in traditional IT operations, over-designing to protect against obsolescence is not desirable when scaling to thousands of nodes. For example, spending an extra $6000/rack for 10Gb switches might seem to be a sensible way to protect against hardware obsolescence if you have 10 racks, but that extra cost is much harder to justify when you are provisioning 1,000 racks and it has turned into an extra $6 million!
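The linear growth of a per-rack premium is easy to check with a few lines of arithmetic. The sketch below uses the $6000 per-rack figure from the example above; the rack counts and the function name are illustrative assumptions.

```python
PREMIUM_PER_RACK = 6_000  # dollars per rack, the figure used in the example above


def upgrade_premium(rack_count: int, premium_per_rack: int = PREMIUM_PER_RACK) -> int:
    """Total extra spend when every rack carries the same per-rack premium."""
    return rack_count * premium_per_rack


# Illustrative rack counts (assumed, not taken from a real deployment).
for racks in (10, 100, 1_000):
    print(f"{racks:>5} racks: ${upgrade_premium(racks):,}")
```

At 10 racks the premium is $60,000, at 100 racks it is $600,000, and at 1,000 racks it reaches $6 million.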
The principle of ‘replacement management’ takes on great importance when managing the thousands of physical devices required for a cloud deployment. The advantage of the cloud is that you do not need to build expensive, highly available redundant systems, because the architecture assumes from the start that components will fail. By leveraging the huge pools of cloud resources, you can considerably reduce the level of hardware redundancy. If a component fails, the system will continue to work until someone replaces it. Since low-priced commodity devices typically have a high rate of failure, the whole architecture needs to be built around “availability” in the face of “partial failure”.
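One way to picture “partial failure” is a read path that skips a failed node and carries on with the remaining copies. The following is a minimal sketch under stated assumptions: the Replica class, the node names and the read_with_failover function are hypothetical in-memory stand-ins, not part of any real cloud API.

```python
class ReplicaUnavailable(Exception):
    """Raised when a single replica cannot serve a request."""


class Replica:
    """Hypothetical in-memory stand-in for one storage node."""

    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = data
        self.healthy = healthy

    def read(self, key):
        if not self.healthy:
            raise ReplicaUnavailable(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Return the value from the first healthy replica and the names of the replicas that failed."""
    failed = []
    for replica in replicas:
        try:
            return replica.read(key), failed
        except ReplicaUnavailable:
            # Record the failure so the node can be replaced later, then keep going.
            failed.append(replica.name)
    raise RuntimeError(f"all replicas failed for {key!r}: {failed}")


# Three copies of the same object; one node is down (assumed example data).
replicas = [
    Replica("node-a", {"report.pdf": b"contents"}, healthy=False),
    Replica("node-b", {"report.pdf": b"contents"}),
    Replica("node-c", {"report.pdf": b"contents"}),
]
value, failed = read_with_failover(replicas, "report.pdf")
print(value, failed)
```

The request still succeeds with one node down, and the list of failed nodes becomes a work item for replacement rather than an outage.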
In a cloud environment, it makes much more sense to simply replace a failed component than to spend time troubleshooting what caused the failure. The most common components to fail are disks, since they have mechanical moving parts. A typical disk failure rate in a cloud data center is about 10-15%. Fans, power supplies and memory also fail, although less frequently. For example, the OpenStack Swift architecture assumes that disks, systems and entire zones can and will disappear (fail) at any time. Yet it keeps only three copies of every file, with no additional redundancy in the hardware.
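To see why replacement has to become a routine process, it helps to turn the failure rate into an expected workload. The sketch below assumes the 10-15% figure is an annual rate and uses a hypothetical fleet of 10,000 disks; both are assumptions made for illustration.

```python
def expected_disk_failures(disk_count: int, annual_failure_rate: float) -> float:
    """Expected number of failed disks per year for a fleet of the given size."""
    return disk_count * annual_failure_rate


FLEET_SIZE = 10_000  # hypothetical fleet size, not from the text

# Assumption: the 10-15% failure rate quoted above is per year.
for rate in (0.10, 0.15):
    per_year = expected_disk_failures(FLEET_SIZE, rate)
    print(f"{rate:.0%} rate: {per_year:.0f} failures per year, about {per_year / 365:.1f} per day")
```

Under these assumptions the operations team replaces roughly three to four disks every day, which is why troubleshooting each failure individually does not pay off.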
This approach to failure at scale can be very cost effective, but it takes a different mindset from traditional operations. Every cloud operations engineer should learn what is in the service, where the critical parts are located, and how to replace a failed component, and then incorporate that knowledge into standard operations processes. Automated tools need to be written to help identify the location of failed disks and other components so they can be quickly isolated from the environment and replaced. To maintain a high level of robustness without sacrificing cost efficiency, the system needs to be designed to replicate data at the application/software level, not at the disk or network level.
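The automated tooling for locating failed disks can start out very small. The sketch below joins a health report against a hardware inventory to produce a replacement list ordered by physical location. The inventory layout, the disk identifiers and the locate_failed_disks function are all hypothetical; a real deployment would pull this data from its own monitoring and asset systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskLocation:
    rack: str
    chassis: str
    slot: int


# Hypothetical inventory mapping disk IDs to physical locations.
INVENTORY = {
    "disk-0001": DiskLocation("rack-07", "chassis-2", 4),
    "disk-0002": DiskLocation("rack-03", "chassis-1", 9),
    "disk-0003": DiskLocation("rack-03", "chassis-1", 2),
}


def locate_failed_disks(health, inventory):
    """Split unhealthy disks into (location, disk_id) pairs sorted by location, and IDs missing from the inventory."""
    located, unknown = [], []
    for disk_id, is_healthy in health.items():
        if is_healthy:
            continue
        location = inventory.get(disk_id)
        if location is None:
            unknown.append(disk_id)  # failed, but nobody knows where it is
        else:
            located.append((location, disk_id))
    # Sort by rack, chassis and slot so a technician can walk the floor once.
    located.sort(key=lambda item: (item[0].rack, item[0].chassis, item[0].slot))
    return located, sorted(unknown)


# Assumed health report: True means healthy, False means failed.
health_report = {"disk-0001": False, "disk-0002": True, "disk-0003": False, "disk-0099": False}
to_replace, missing = locate_failed_disks(health_report, INVENTORY)
for location, disk_id in to_replace:
    print(f"{disk_id}: {location.rack} / {location.chassis} / slot {location.slot}")
print("not in inventory:", missing)
```

Reporting disks that are missing from the inventory separately is a deliberate choice: a failed disk that cannot be found is a gap in the operations process, not just a hardware problem.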
In conclusion, the biggest paradigm shift is that development and operations groups need to work together to optimize the systems and drive down costs. Tests and metrics need to be created to determine the optimum system configurations. Understanding how changes in the components affect the system as a whole allows you to flexibly configure the systems to meet the application requirements as they change.
About the Author
Beth Cohen, Cloud Technology Partners, Inc. Transforming Businesses with Cloud Solutions