Posted by: Mark Fontecchio
five nines of availability, uptime
Google’s Gmail service was down for “about 100 minutes” yesterday, according to Ben Treynor, its VP of engineering and site reliability czar. The outage caused a flurry of conversation online, particularly from Gmail users on Twitter, who flooded the site with messages using the “#googlefail” hashtag.
According to Treynor, the problem stemmed from insufficient capacity and inadequate routing of network traffic during routine maintenance.
“(W)e took a small fraction of Gmail’s servers offline to perform routine upgrades,” Treynor wrote. “However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response.”
Some request routers became overloaded and transferred the load to other request routers, which in turn became overloaded themselves, and the situation snowballed.
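To see how a small miscalculation can snowball like that, here is a toy sketch in Python. It is not a model of Google's actual architecture: the function name `simulate_cascade`, the router capacities, and the load figures are all invented for illustration. It assumes traffic is split evenly across whichever routers are still up, and that a router drops out as soon as its share exceeds its capacity.

```python
def simulate_cascade(capacities, total_load):
    """Toy cascade model (hypothetical, not Google's real routing logic).

    Load is split evenly across live routers. Any router whose share
    exceeds its capacity drops out, and its traffic is redistributed
    to the survivors in the next round.
    Returns (rounds, survivors), where each round lists the capacities
    of the routers that overloaded in that round.
    """
    live = list(capacities)
    rounds = []
    while live:
        share = total_load / len(live)
        failed = [c for c in live if c < share]
        if not failed:
            break  # every remaining router can carry its share
        rounds.append(failed)
        live = [c for c in live if c >= share]
    return rounds, live


# Invented capacities, in arbitrary "requests per second" units.
routers = [100, 120, 150, 200]

# At the expected load, every router copes.
print(simulate_cascade(routers, 400))

# A 5% underestimate is enough to take all of them down, one round at a time.
print(simulate_cascade(routers, 420))
```

In this made-up scenario the total capacity (570) comfortably exceeds the load in both cases. The failure comes from the weakest router tipping over first and pushing its traffic onto the rest.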
Treynor concluded that the Gmail team will make improvements over the next few weeks, but tempered that by writing that “Gmail remains more than 99.9% available to all users…”
Wow, that sounds like a lot, doesn’t it? 99.9% available? When you break it down, though, 99.9% availability still allows almost nine hours (about 8.8) of downtime over the course of a year. That might be OK for someone using Gmail for personal email, but businesses that rely on it might not be happy. For comparison’s sake, if Gmail had four nines of availability (99.99%), it could be down for about 53 minutes a year; the holy grail of five nines (99.999%) would allow only about five minutes.
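If you want to check the math yourself, a few lines of Python will do it. This sketch assumes a 365-day year (525,600 minutes) and ignores leap years; the function name `downtime_minutes_per_year` is just a label chosen for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year allowed at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% available: {minutes:.1f} minutes "
          f"({minutes / 60:.2f} hours) of downtime per year")

# Yesterday's roughly 100-minute outage, taken on its own, as a share of a year.
outage_minutes = 100
print(f"A {outage_minutes}-minute outage is "
      f"{outage_minutes / MINUTES_PER_YEAR:.3%} of a year")
```

By the same arithmetic, a single 100-minute outage uses up about a fifth of a year's downtime budget at 99.9%, and nearly twice the entire annual budget at four nines.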