Cloud outages are always big news – and for good reason, because they usually affect many people. Last month’s Microsoft Azure outage was no exception. But at least Microsoft appears to be trying to learn from its mistakes.
The software giant released detailed findings of its root cause analysis of the Azure outage earlier this month, and said it would to use lessons learned from the incident to improve its cloud service. The analysis, posted by Azure engineering team leader Bill Laing, provides a detailed description of the Leap Day bug that triggered the Feb. 28 outage. The analysis was prefaced by an apology and an offer of service credits to customers, and included a description of the steps Microsoft is taking to improve its engineering, operations and communication in the wake of the outage.
“Rest assured that we are already hard at work using our learnings to improve Windows Azure,” Laing said.
Microsoft’s plans include improved testing to detect time-related bugs, strengthening its Azure dashboard, and improved customer communication during an incident.
Kyle Hilgendorf, principal research analyst at Gartner, said he was impressed with the level of detail in Microsoft’s analysis.
“I encourage all current and prospective Azure customers to read and digest the Azure RCA [root cause analysis],” he wrote in a blog post. “There is significant insight and knowledge around how Azure is architected, much more so than customers have received in the past.”
The 33% service credit offered by Microsoft, he added, is becoming a de facto standard for cloud outages. “Customers appreciate this offer as it benefits both customers and providers alike from having to deal with SLA claims and the administrative overhead involved,” he said.
In a previous blog post, Hilgendorf summarized Azure customers concerns after the outage. Customers told him Microsoft’s communication during the outage was lacking; the company needed to be more transparent, and they were looking into options for protecting themselves against future outages.
So while Microsoft is applying lessons learned from the Azure outage, it appears Azure customers got a harsh reminder of the need to plan for service disruption. At last year’s Gartner Catalyst Conference, Richard Jones, managing vice president for cloud and data center strategies at Gartner, advised attendees to prepare for cloud failure by planning for resilience into their cloud infrastructure and services. Experts have also said organizations need to plan for outages in their cloud contracts.
“Cloud outages are a sad and unfortunate event,” Hilgendorf wrote. “However, if we learn from them, build better services, increase transparency, and guide towards better application design, then we can make something great out of something bad.”