Posted by: Troy Tate
antispam, antivirus, Cloud Services, corrective actions, Google, incident report, root cause analysis, saas, service level, service outage
I recently posted about delivery issues with Google’s Postini cloud email security service. This is a follow-up post about the incident’s root cause analysis and corrective actions. Maybe there are some lessons learned here that you can use in your organization’s service delivery.
The impact on customer email services lasted more than 24 hours while Postini engineers worked to resolve the issues, so this was not an insignificant event. During this period, messages were delayed and users could not get to their quarantines to release messages trapped by filters. Administrators were also unable to access the administration console. The Postini support portal was unreachable at times due to the high volume of users trying to get updates on the event, and the support phone queues were so long that it took ages to reach an agent. Nothing like this has happened before in all the years we have been a Postini customer.
I just received the incident report about the service disruption and wanted to share some of the information with IT Trenches readers.
The event started at about 6:25 PM GMT on Tuesday, October 13, when customers began experiencing severe mail delays and disruption. Some senders received delivery failure notifications after multiple resend attempts failed. About an hour later, automated monitoring systems detected the mail flow issues and traffic was automatically failed over to a secondary data center. Message flow through the secondary data center was also poor.
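To make the failover step concrete, here is a minimal sketch of the kind of latency-based data center selection the post describes. Everything in it is an illustrative assumption on my part (the function name, the threshold, the latency numbers), not Postini's actual implementation:

```python
# Hypothetical sketch of latency-based failover: route traffic to the
# first data center whose mail latency is acceptable. The threshold and
# all names below are illustrative assumptions, not Postini's design.

LATENCY_THRESHOLD_SECS = 300  # assumed acceptable delivery latency

def choose_datacenter(latencies):
    """Pick the first data center whose observed latency is acceptable.

    `latencies` maps data center name -> measured delivery latency in
    seconds. If every data center is degraded (as happened in this
    incident), fall back to the least-bad one, since mail must still
    flow somewhere.
    """
    for name, latency in latencies.items():
        if latency <= LATENCY_THRESHOLD_SECS:
            return name
    # Every data center is over threshold: pick the least-bad option.
    return min(latencies, key=latencies.get)

print(choose_datacenter({"primary": 45, "secondary": 60}))    # primary stays
print(choose_datacenter({"primary": 900, "secondary": 60}))   # fail over
print(choose_datacenter({"primary": 900, "secondary": 700}))  # least-bad wins
```

The interesting case is the last one: when both data centers are degraded, as they were here, failover alone can't save you, which is exactly why Postini ended up spreading traffic across both.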
To improve message flow, engineers directed traffic across both the primary and secondary data centers. To reduce the load on data center resources, they also disabled access to the administration and other web consoles. Eventually, the engineers were able to identify the causes of the message flow issues:
- A message filter update inadvertently caused performance issues.
- Unusually malformed messages increased scan processing time and, in tandem with the bad filter update, caused mail delivery problems.
- Processing capacity was reduced due to a power supply failure on a database storage system. This increased latency in message processing.
As you can see, this event was caused by a series of compounding issues. The hardware was repaired and the filter update was rolled back, but not before many messages were either deferred or never delivered.
The corrective actions to prevent future outages due to similar conditions include:
- Create a standard procedure for reverting message filter updates. Isn’t it always a good idea to be able to back out updates?
- Improve monitoring of database server power failures. Apparently this power failure was not detected by their current monitoring process.
- Improve communication with customers during service outage events. This would help relieve some frustration and help customers understand the severity and scope of the outage.
Fortunately, email services have now returned to normal. During that 24-hour period, however, there was a high level of frustration and concern about how to work around the impact on email delivery. I think most IT organizations can learn something and improve their own service delivery by reading about incidents like this and the lessons learned. I know my team will have some discussions around this event and work to improve the resiliency of our service delivery.
Thanks for reading & let’s continue to be good network citizens!