Every few months (or weeks), the Cloud Computing industry seems to pick a topic and beat it to death from a technology or religious point of view. The concept of “Cloud SLAs” has been doing the rounds lately. Conveniently, these particular discussions came up after a few well-publicized Public Cloud outages.
Lydia Leong (Gartner, @cloudpundit) recently got the pot stirring with her piece about HP and Amazon AWS SLAs. Lydia is very well respected in the industry and she does a nice job of digging into the details of various vendors’ SLAs. She obviously has a deep understanding of this space, especially as it relates to Enterprise customers, as she leads the Gartner IaaS Magic Quadrant program.
There is some interesting back and forth in the comments about the proper definition of an SLA. That would be all well and good if Cloud Computing used lawyers or auditors to solve business problems. But it doesn’t. It uses technology. And quite honestly, the business leaders who are paying for various Cloud Computing services don’t care about the legalese or the underlying technology. They care about the business. They care about moving the business forward and managing business risks. Cloud SLAs, in their current form (in most cases), don’t align the business risk and the technology risk very well.
Let’s step back a second and look at this in a slightly different context…
Now, let’s compare that to the SLAs discussed in some of the Cloud Computing scenarios from Gartner. They focus on the redemption value of the SLAs, measured in things like “refunded computing hours”. So after the Cloud provider has a 48hr outage, your business gets back the equivalent of 48hrs * $0.12/hr, or ~$6.00. I’m sure that more than covers any lost revenues your business might have had during the outage, right?
Hello, still with me?
Or are you back to looking at the fine print of the SLA?
Regardless of the parameters of the SLA being honored or not, Availability Zone or Region or Act of God, can you now see how the risk potentially far outweighs the cost-savings of certain types of Cloud Computing services? And even with the best engineering teams in the world, how does an IT organization communicate this to the business?
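To make the mismatch concrete, here is a rough sketch of the refund arithmetic set against a lost-revenue estimate. The 48-hour outage and the $0.12/hr rate come from the example above (48 * 0.12 is $5.76, which rounds to the ~$6.00 quoted). The `revenue_per_hour` figure is purely hypothetical and is not taken from any real business or SLA; swap in your own number.

```python
def sla_credit(outage_hours: float, hourly_rate: float) -> float:
    """Refund for the compute hours that were unavailable during the outage."""
    return outage_hours * hourly_rate


def lost_revenue(outage_hours: float, revenue_per_hour: float) -> float:
    """Naive estimate: revenue that would have been earned during the outage."""
    return outage_hours * revenue_per_hour


outage_hours = 48           # outage length from the example above
hourly_rate = 0.12          # instance price in $/hr from the example above
revenue_per_hour = 1_000.0  # HYPOTHETICAL assumption, replace with your own figure

credit = sla_credit(outage_hours, hourly_rate)
loss = lost_revenue(outage_hours, revenue_per_hour)

print(f"SLA credit:   ${credit:,.2f}")
print(f"Lost revenue: ${loss:,.2f}")
print(f"The credit covers {credit / loss:.4%} of the estimated loss")
```

The lost-revenue model here is deliberately simplistic (flat revenue per hour, no reputational or recovery costs), so treat the output as a conversation starter with the business rather than a risk assessment.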
With more companies looking to move applications to Cloud Computing services, it feels like it’s time to look at SLAs through a different prism. SLAs should be about more than just uptime or downtime; they should also have an element of performance and responsiveness for the day-to-day operations. This would get us somewhat closer to a model that both IT and the business can understand.
- How will the application perform during low-activity times?
- How will the application perform during high-activity times?
- What failure scenarios are we currently protected against?
- How will we recover from those failures, and approximately how fast?
- What can we expect to lose during those failures? (expressed in data and lost revenue)
- What additional steps can we take to further minimize those losses?
With more and more Cloud services offering new ways to manage Enterprise IT needs, it’ll be important to create discussions that can be understood by BOTH the IT organization and the business. We all know that failures happen – in hardware, in software, in operations, in facilities – but understanding the results and corrective actions is the true value of an SLA.
$6.00 doesn’t cover the cost of the digital ink the current SLAs are written on.