Infrastructure 2.0 Blog

Sep 24 2010   1:21PM GMT

Facebook is the Latest Organization to Get Tripped Up By Change



Posted by: Guest Author
Networking

Today’s guest post is brought to us by Matt Gowarty.

Facebook going down for a few hours may have caused a spike in employee productivity.

But the culprit in this event shines a bright light on the biggest problem for IT teams worldwide: change. In its blog post about the outage, Facebook wrote, “Today we made a change to the persistent copy of a configuration value that was interpreted as invalid.” That change affected the site's performance for more than 2.5 hours.

When dealing with network changes and configurations, organizations must be more proactive in testing, validating and monitoring the ongoing status of both the critical network infrastructure (routers, switches, firewalls, etc.) and the Web services (applications, databases and servers). In fact, analyst reports show that approximately two-thirds of network performance issues are tied to change.

Too often, organizations spend the majority of their time and resources focusing on the “outside attack” or on employees with malicious intent to harm the organization. The Facebook example highlights what is by far the biggest issue for enterprises worldwide: a change made with the best intentions can yield unintended and damaging consequences. For most IT organizations, simple changes are often the hardest to find and troubleshoot. Even with some of the most advanced network and application development capabilities and IT staff, the entire Facebook community was brought to its knees for more than two hours.

Tony Bradley from PCWorld highlighted the impact. “The Facebook outage was caused by implementing a configuration value on the live Web site without proper testing and validation. Had Facebook tested the new configuration value in a lab environment designed to mirror the real-world database cluster, it should have identified the problem with the new configuration value, and the error loop that caused this problem before allowing it to take the entire Facebook site offline.”
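The idea of validating a value before it reaches the live site can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of Facebook's system: the `ConfigStore` class, the `db_timeout_ms` key and the range check are all hypothetical names invented for the example. The point is simply that the write path runs the same check a client would run on read, so a value that clients would treat as invalid never becomes the persistent copy.

```python
# Hypothetical sketch: every name and rule here is an illustrative assumption,
# not Facebook's actual configuration system.


class InvalidConfigError(ValueError):
    """Raised when a candidate value fails validation."""


def validate_timeout_ms(value):
    """Apply the same check that clients would apply when reading the value."""
    if not isinstance(value, int) or isinstance(value, bool):
        raise InvalidConfigError(f"expected an integer, got {value!r}")
    if not 1 <= value <= 60_000:
        raise InvalidConfigError(f"{value} ms is outside the range 1..60000")
    return value


class ConfigStore:
    """In-memory stand-in for a persistent configuration store."""

    def __init__(self, validators):
        self._validators = validators
        self._values = {}

    def set(self, key, value):
        # Validate before writing, so an invalid value never becomes
        # the persistent copy that every client reads.
        validator = self._validators[key]
        self._values[key] = validator(value)

    def get(self, key):
        return self._values[key]


store = ConfigStore({"db_timeout_ms": validate_timeout_ms})
store.set("db_timeout_ms", 500)

try:
    store.set("db_timeout_ms", -1)  # rejected; the stored value stays at 500
except InvalidConfigError as exc:
    print(f"change rejected: {exc}")

print(store.get("db_timeout_ms"))
```

A check like this only catches values that are wrong in isolation. It does not replace testing the change against a lab environment that mirrors production, which is what Bradley recommends.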

While this problem was extremely painful for Facebook and its users, it appears to have been an easy one to detect because the change caused an immediate failure. The IT staff probably identified the culprit relatively quickly; the lengthy troubleshooting effort was due to the scale of the issue. For many organizations, the biggest headaches come from change and configuration errors that lurk in the network for days, weeks or months.

These issues are the most complex because it takes another event, change or configuration issue to cause enough of a reaction for end users to experience pain or for monitoring tools to trip a threshold. Finding and solving these problems takes far longer because the initial culprit was hiding for days or weeks and only a later event made the pain show up. It’s times like these that force IT and networking teams to spend days or weeks sifting through all of the changes made over the preceding days or weeks to find the one that triggered the failure.
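To make that search concrete, here is a small Python sketch of narrowing a change history down to the candidates that preceded an incident. It assumes a change log is available as timestamped records, which is itself a big assumption for many shops. The device names, dates and summaries below are made up for illustration, and `candidate_changes` is a hypothetical helper, not part of any product.

```python
from datetime import datetime, timedelta

# Hypothetical change records; in practice these might come from a
# change-management system or a configuration archive.
CHANGES = [
    {"when": datetime(2010, 8, 20, 9, 0), "device": "core-rtr-1", "summary": "adjusted MTU"},
    {"when": datetime(2010, 9, 2, 14, 30), "device": "fw-edge-2", "summary": "new ACL entry"},
    {"when": datetime(2010, 9, 21, 11, 15), "device": "db-cluster", "summary": "raised connection limit"},
    {"when": datetime(2010, 9, 25, 8, 0), "device": "core-rtr-1", "summary": "firmware upgrade"},
]


def candidate_changes(changes, incident_time, lookback_days):
    """Return changes made in the lookback window before the incident, newest first."""
    window_start = incident_time - timedelta(days=lookback_days)
    in_window = [c for c in changes if window_start <= c["when"] <= incident_time]
    return sorted(in_window, key=lambda c: c["when"], reverse=True)


incident = datetime(2010, 9, 23, 12, 0)
for change in candidate_changes(CHANGES, incident, lookback_days=30):
    print(change["when"].isoformat(), change["device"], change["summary"])
```

The wider the lookback window has to be, the longer the list of suspects, which is why a change that lurks for weeks costs so much more to trace than one that fails immediately.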

Using Facebook’s issue as the example, how long do you think it would have taken to find the problem if the change had been made a month earlier and was a suboptimal setting that didn’t cause trouble right away? The odds are it would have been much longer than a few hours. I’d estimate it would have taken days or weeks to find the change-related issue within the network infrastructure of a company the size of Facebook.

The silver lining: if Facebook should go down again in the next few months and you can’t update your status every few hours, you’ll have more time to fine-tune your fantasy football team. Oops, I mean work on that critical project for your boss.

