By now most here have heard about Research In Motion’s (RIM) outages, which affected approximately 30 to 40 million BlackBerry users.
That’s about half of all BlackBerry subscribers worldwide. Affected areas included the U.S., Canada, Europe, the Middle East, India, Africa, and Latin America. Not a pretty picture for a company that, according to the Financial Times of London, advertises a 99.999% network reliability rating (no mention of who the rating entity is, however) – and the timing is particularly poor, given that competition just increased with the debut of the new iPhone model.
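For a sense of scale, here is a rough sketch of what a 99.999% rating implies if it is read as uptime over a full year. That reading is my assumption, since the advertised figure doesn't say how it is measured, and the function name is just for illustration.

```python
# Minutes in a non-leap year (assumption: the rating is measured per year).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime, in minutes, that a given uptime percentage allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# "Five nines" leaves only a few minutes of downtime in an entire year.
print(round(allowed_downtime_minutes(99.999), 2))  # 5.26
```

In other words, five nines leaves room for a little over five minutes of downtime per year, which an outage that spills into a second day blows through many times over.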
Of particular concern is that RIM reported the problem as fixed on the night of the first day, after reports of initial outages in Europe, Africa and the Middle East. By the next day, however, outages and service disruptions were spreading: RIM was forced to correct its position, and report that the disruption was caused by the failure of a “core switch” – responsible for routing traffic across what I guess we must assume is the near-totality of RIM’s network. Hmmm… I’m wondering if this “core switch” issue is an over-simplification of an infrastructure failure… or the alternative?
The alternative, and the face-value assumption, would be that this was a single-point-of-failure type of incident. In other words, there was a core switch, with no parallel backup infrastructure or process standing by to carry the data traffic. When that switch popped… data dropped. I am so sorry for that rhyme. No, I’m not :^)
I find it difficult to believe that this was a single-point issue – but you never know. It well might have been: I’ve seen many surprising things in the businesses I survey and counsel. But the RIM/Blackberry incident, and its high-profile newsworthiness, makes for a great lesson. And – it came just in time for October’s National Cyber-security Awareness Month (here in the U.S.).
Cyber-security is not just about thwarting malware, hacks, breaches, thefts, viruses and other malfeasance initiated by nefarious human activity. Cyber-security includes basic best practices regarding infrastructure wellness and backstopping. Survey your environment for single points of failure: servers, processes, infrastructure, connectivity, data. Also, include the human element: if someone is sick or injured, and they’re removed from the environment for an extended period, do you have someone who can step into their duties? If not someone internal, then an identified vendor? Are positions and procedures well-documented?
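If you keep even a simple inventory, that survey can start as a few lines of code. What follows is a minimal sketch, not a real audit tool: the inventory entries, the role names, and the rule that "fewer than two means a single point of failure" are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical inventory: each entry is (role, thing or person that fills it).
# People and vendors count too, not just hardware.
inventory = [
    ("core switch", "switch-a"),
    ("mail server", "mail-1"),
    ("mail server", "mail-2"),
    ("backup administrator", "alice"),
]


def single_points_of_failure(items):
    """Return the roles covered by fewer than two entries in the inventory."""
    by_role = defaultdict(list)
    for role, instance in items:
        by_role[role].append(instance)
    return sorted(role for role, instances in by_role.items() if len(instances) < 2)


print(single_points_of_failure(inventory))
# ['backup administrator', 'core switch']
```

Anything on that list deserves a second look: a standby unit, a documented procedure, or an identified vendor who can step in.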
Think about it. And RIM – are you listening?
NP: Interplay, Bill Evans, jazz24.org