Data Center Apparatus

Mar 27 2017   4:02PM GMT

You did what? Blunders, boo-boos and bloopers from data center outages

Profile: Robert Gates


Data center outages at Delta Airlines and Amazon Web Services stole the headlines in recent months, but there are plenty of other outages at everyday enterprises that fly under the radar.

IT pros dished the dirt last week on the show floor at IBM Interconnect, anonymously sharing tales about their data center outages at the hybrid cloud booth. They illustrated the various problems behind data center downtime, and offered a reality check: the next outage could be caused by just about anything.

A CIO, two weeks into the new position, claimed he was hired to implement a "transformational agenda" – but first he endured a one-week outage of a core, externally facing customer system. "I spent months delaying my agenda to focus on sustainability," wrote the unnamed CIO.

An insurance company in Connecticut performed a data migration from its original system to a new platform, then shut down the old system, claimed another contributor. But when it attempted to bring up the new system, the data was corrupt.

In a networking tale of woe, an F5 refresh took out an entire website when a parameter meant to direct traffic to the least-loaded server instead sent it to a test server. You can probably guess what happened next.
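The failure mode is easy to reproduce in miniature: a "least loaded" (least-connections) policy always favors the idlest pool member, so a test server accidentally left in the pool – sitting at zero active connections – soaks up every new request. A minimal sketch, with hypothetical server names and connection counts:

```python
# Hypothetical pool state: two production servers under load, plus a
# test server that was mistakenly left in the pool.
active_connections = {
    "prod-1": 120,
    "prod-2": 115,
    "test-1": 0,  # idle test server
}

def least_loaded(pool):
    """Pick the pool member with the fewest active connections."""
    return min(pool, key=pool.get)

target = least_loaded(active_connections)
print(target)  # the idle test server wins every time: "test-1"
```

Until the test server's connection count catches up (or it falls over), every selection lands on it – which is roughly how one stray pool member takes down a whole site.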

Another debacle cited the failure of an unspecified storage component, which degraded performance and ultimately triggered the disaster recovery plan. But there was one problem: "We had no way to fail back – not good," wrote the IT pro.

Nature was blamed for one data center takedown – a squirrel chewed into a main power feed during maintenance to the data center's battery backup. That caused a blackout – albeit a short one – with the data center down for about five seconds until the generators kicked in. No word on whether any data – or the squirrel – was lost.

One IT pro lamented that a load test was conducted on production storage during working hours. In the virtualized environment, nothing should have happened, but the ports became saturated, the network couldn't handle the load, and downtime followed.

Timing can be everything, and that was certainly the case when the hard drive died in a network staging server at one company — just before a new product was to be launched, according to the anonymous writer.

Backups for data center cooling and power systems are especially important, as shown by one story in which an IT pro claimed there was no UPS or generator backup for the cooling towers on the roof of the data center. When the power went out, the CPUs overheated with no working cooling system.

Don’t blame me

Notice a common theme? None of the authors accepts guilt in these stories of data center downtime. In fact, in most cases nobody is blamed at all. So much for the blameless postmortem, even when it is anonymous. A majority of data center outages are caused by human error, which leaves us wondering what the painful truth behind these outages really was.

Now that you’ve read some tales from the data center trenches, what’s your best story about an outage and downtime?

1  Comment on this Post

  • BigKat
    Back when RPGLE first came out, a consultant (a different consulting firm than mine :P ) decided ON HIS OWN to convert everything to RPGLE over the weekend via the built-in conversion utility, and changed all of the CLP source types to CLLE. (He didn't change any of the overrides' scoping, though, so they defaulted to activation group level.) He recompiled everything, deleted the original RPG source and went home. Monday morning, when they went to start up, nothing was working correctly. The entire division was down for a week while they scrambled to go through essentially every CL and set the scoping correctly, and to verify the RPGLE where MOVEs were changed to EVALs but didn't work quite right, along with other subtle quirks the utility didn't handle correctly. $10M/day in lost revenue. Needless to say, he wasn't working there any longer.
