Yottabytes: Storage and Disaster Recovery

Aug 15 2016   5:43PM GMT

Airlines’ Disaster Recovery Criticized After Outages

Sharon Fisher

Tags:
Disaster Recovery

If you sell disaster recovery products and solutions, you have a big batch of potential new customers: Major U.S. airlines. Four airlines — Delta, Southwest, United, and American, which just happen to control 80 percent of all domestic travel in the U.S. — have suffered major outages in recent months that have been attributed to a lack of a proper disaster recovery plan.

In August, Delta customers suffered major delays because of…a power failure? “A power outage in Atlanta, which began at approximately 2:30 a.m. ET, has impacted Delta computer systems and operations worldwide, resulting in flight delays,” the company reportedly wrote on its blog at the time. “Following the power loss, some critical systems and network equipment didn’t switch over to Delta’s backup systems,” reported Kim Nash in the Wall Street Journal.

“How could a company as technologically savvy and mature with its business processes as Delta not have a working disaster recovery plan?” writes M.J. Shoer in Seacoast Online. “It’s a fair question and we are still waiting to learn the details. I can’t for a minute believe Delta does not have a disaster recovery plan to deal with an event like this, but it failed. That begs the question as to when it was last tested and how often this plan is reviewed, revised and retested.”

In July, “Southwest Airlines canceled more than 2,000 flights over several days after an outage that it blamed on a faulty network router,” write Alastair Jamieson, Shamar Walters, Kurt Chirbas and Gabe Gutierrez for NBC News. While the company had redundancies built into its equipment, they didn’t work, according to CEO Gary Kelly. “A back-up system also failed, extending the outage. Ultimately the company had to replace the router and reboot 400 servers,” the company’s chief operating officer told Conor Shine of the Dallas Morning News.

Delta and Southwest aren’t alone. “Computer network outages have affected nearly all the major carriers in recent years,” writes David Koenig for the Associated Press. “After it combined IT systems with merger partner Continental, United suffered shutdowns on several days, most recently in 2015. American also experienced breakdowns in 2015, including technology problems that briefly stopped flights at its big hub airports in Dallas, Chicago and Miami.”

Repercussions

Altogether, thousands of flights were delayed, costing millions of dollars. Southwest’s Kelly could even lose his job over the incident, after criticism from a number of employee unions.

“The meltdown highlights the vulnerability in Delta’s computer system, and raises questions about whether a recent wave of four U.S. airline mergers that created four large carriers controlling 85 percent of domestic capacity has built companies too large and too reliant on IT systems that date from the 1990s,” writes Susan Carey in the Wall Street Journal.

Moreover, it points to how vulnerable airlines are to terrorist attacks, writes Hugo Martin in the Los Angeles Times. In fact, at first, some even attributed Delta’s outage to a terrorist attack – maybe because it just seemed so unbelievable that an airline could be brought down by a power failure. But it’s happening.

So What’s Causing It?

“There have been several reservation system outages that have hit worldwide airline ops with distressing regularity over the past few years,” risk management specialist Robert Charette told Willie Jones of IEEE Spectrum. “Southwest Airlines had one just a few weeks ago. (It had another big one June of 2013 and another in October 2015.) What you’ll see in reviewing them is recurring problems with infrastructure (i.e., power, networks, routers, servers, etc.) that seem to keep surprising the airlines. In every case I can recall, there were backup systems in place, but they failed—another recurring theme. The Southwest CEO claimed that the last outage—caused by a router—was equivalent to a 1000-year flood. Not only was that a comical overstatement, but it also shows the thinking that is probably [leading to the airlines] skimping on contingency management preparations.”

Some have attributed the failures to too much consolidation in the airline industry and too much emphasis on efficiency. One study, Delivering the Benefits? Efficiencies and Airline Mergers, found that mergers not only failed to save operational money but often cost much more than expected, particularly in terms of integrating IT. Some have attributed Delta’s failure, in particular, to the January retirement of its previous CIO, Theresa Wise, who merged the Delta and Northwest IT teams.

“Is this a sign that airlines aren’t investing enough money in their IT infrastructure?” writes Adam Levine-Weinberg in The Motley Fool.

Airlines Aren’t Alone

On the other hand, airlines aren’t alone in having insufficient disaster recovery protection – just the most conspicuous. A survey – admittedly by a disaster recovery vendor – found that companies in general aren’t testing their disaster recovery systems enough. “When asked about how frequently they tested their DR environment, more than half of the respondents indicated that they tested less than once a year; even worse, a third said that they tested infrequently or never,” the survey found.

The complexity of airlines’ IT systems makes it worse. “Since they are needed on a 24/7 basis, 365 days a year, it’s hard to fully test every potential scenario that could cause problems,” Levine-Weinberg writes. “As a result, it may be impossible to fully eliminate large-scale IT outages across the airline industry.”

The biggest problem with this failure in disaster recovery? The computer networks and systems can be repaired. Disaster recovery plans can be created for the next time. But repairing customers’ trust may not be so easy.

5 Comments on this Post

 
  • GSTZGS

    Pretty often, old systems are blamed for outages, including in this article, which cites a remark that "companies (are) too large and too reliant on IT systems that date from the 1990s".

    However, regarding the Southwest Airlines outage it was stated: "Ultimately the company had to replace the (faulty) router and reboot 400 servers" - which sounds much more like one of those contemporary and pretty complex server farms than like old-fashioned big iron.

    By the way, complexity tends to kill reliability ...

  • MarkUnderwoodisKnowlengr
    We try to keep up with these outages at GlitchReporter.com, and there are plenty of examples to choose from -- not just the airlines, but also the FAA itself and related support infrastructure.

    I share the outrage of this writer, but as with the supposed fiasco with Healthcare.gov a few years ago, the blame should stretch beyond one organization or one CEO or even one IT department.

    There is an acceptance in the IT community of both processes and architectures that lead to instability and poor risk management. Evidence: you have to look hard to find press reports from the industry (this one excepted!) that deconstruct the problem and address the issue.

    Everyone would rather write about Spark or Alluxio than DR or the underlying causes -- me included, most of the time.

    Complexity is part of the challenge, but there are known mitigating techniques, such as more explicit dependency-aware playbooks and rehearsal strategies. It can mean additional staffing and resources. Consider just a restart coupled with an avalanche of queued requests (the airlines faced both their regular daily workload and the exceptional load caused by the outages). Many DR plans would have inadequately exercised that scenario -- for obvious reasons.

    The desire to build resilient and reliable systems is there for AWS, Azure, GCE and military systems like SIPRNet. But outside of these deep-pocketed monoliths, resiliency is not baked in, and is only lightly tested.

    Where's the career glamour in that?

    Logic would suggest that increased deployment of complex systems, concentrated in fewer and fewer IT suppliers as well as airlines, will lead to repeats.

    Stand by to retweet this spot-on story. Maybe Netflix next time. Oh wait, that already happened.

  • Sharon Fisher
    Thanks for the well-thought-out replies!
  • thorzite
    I'm sorry, but yeah, I call BS on all accounts. Anyone who is anyone in the IT infrastructure field knows that best practice is to run on UPS power full time. There is no "if the power goes out, we flip to battery"; no, WE ARE ALWAYS ON BATTERY. The only differentiating factor is whether or not those batteries are being charged by transformer power or generator power. End of story: a power outage is not even a believable lie. I have been a network engineer for over 20 years, and I am telling you point blank: if ANY network I was in charge of that had high availability for mission-critical app delivery 24/7 went down for three days, someone would be fired, and I mean immediately!
  • PeteDeLeon122
    I'm not technically inclined enough to understand a lot of this stuff. Software has to be among the most arduous and multitasked programs out there; even the oldest stuff can make your head spin right off your shoulders.
    The stuff coming out is going to be more of a hologram, and even more technical than its counterparts. I hope I could be part of the solution.
