Yottabytes: Storage and Disaster Recovery

Jul 31 2013   10:47PM GMT

Storage Upgrade Takes State of Oregon Offline for a Day

Sharon Fisher

A large number of Oregonians looking for state services — including 63,000 unemployed people expecting checks for a total of $18 million in benefits — were left high and dry for a day recently due to problems with a Hitachi storage upgrade.

Hitachi contractors were doing what was supposed to be a routine upgrade to the State Data Center in Salem when a connectivity issue caused the system to go down, KGW News reported state spokesman Matt Shelby as saying. “Hitachi worked overnight to fix the problem. All state agency websites were affected, but no data was lost,” the station said. The outage started at 7 p.m. Monday and was repaired by Tuesday morning, while state services were restored by midday.

Up to 90 percent of the weekly unemployment benefits are normally processed on Monday nights, according to an AP story in The Columbian.

Other issues, according to Oregon Public Radio and The Oregonian, included:

  • Inability for the state’s more than 90 agencies to communicate directly with each other via email
  • Any jobs that needed to pull data from the data center couldn’t run
  • The Department of Transportation TripCheck was down
  • The Department of Forestry, which was fighting a fire in Prineville (ironically, where Facebook has one of its data centers) didn’t have access to email or database forms
  • 35 applications for food stamps scheduled for overnight processing were delayed

Ironically, to a certain extent Oregon brought this on itself by planning to consolidate its various state data centers into the single State Data Center in 2004. “The State Data Center was authorized in July 2004 to consolidate the computer operations of the 12 largest agencies,” notes the Statesman-Journal. “A $20 million building on Airport Road SE houses the center, which opened in fall 2005. Lawmakers in 2005 approved $43.6 million for the consolidation process.”  But in July, 2008 — almost exactly five years ago — the state’s plan for consolidating data centers was sharply criticized for not adequately consolidating the servers themselves.

The system has also been plagued by crashes. In October, 2009, a network failure on the State Data Center system caused an overload on the unemployment system, shutting it down for 12 hours. In October, 2011, unemployment payments were delayed a day because a computer upgrade had “unintended consequences.” Then in May, 2012, a number of state websites were down for most of a day due to problems in a Texas data center that stored their content.

That was just two months after the Secretary of State’s office performed an audit of the department, noting that it needed improvement in the area of disaster recovery. That letter referenced the Federal Information Systems Controls Audit Manual, which notes, among other things, that “Spare or backup hardware is used to provide a high level of system availability for critical and sensitive applications.”

And, a month ago, three senior officials in the Department of Employment lost their jobs due in part to problems with the department’s computer systems. “Audit after audit exposed leadership problems that festered as the agency wasted as much as $30 million on computer software programs that didn’t work,” reported The Oregonian. “IT employees ‘are appointed to positions that they may or may not be suitable for, they are not coached and then their job duties were significantly changed.’ It said that the IT division needed ‘leadership, governance, priority setting, methodology, contract administration and appropriate HR practices.’”

State officials pointed out that no data was lost in the recent incident, and that it was simply a matter of access to the systems that was lost for a day.

This is not to pick on Oregon; as IEEE Spectrum pointed out, the state government computer systems of New Mexico, Kansas, North Carolina, New Jersey, and Iowa all ran into problems that same week. These incidents do demonstrate, though, the challenges that citizens needing services, who tend to be the less computer-savvy ones, face when increasingly computerized state systems run into problems.

“Just who in their right mind upgrades a live system?” noted one commenter.

Analyst Greg Schulz of Storage I/O agrees, calling it “CYA 101.” “Anytime there is a person involved — regardless of if it’s hardware, cables, software, firmware, configurations or physical environments — something can happen,” he writes. “If the vendor drops the ball or a cable or card or something else and causes an outage or downtime, it is their responsibility to discuss those issues. However, it is also the customer’s responsibility to discuss why they let the vendor do something during that time without taking adequate precautions. Likewise, if the storage system was a single point of failure for an important system, then there is the responsibility to discuss the cost cutting concerns of others and have them justify why a redundant solution is not needed.”

6  Comments on this Post

  • ToddN2000
    Did they really schedule this upgrade on a Monday night at 7:00 pm, when 90 percent of the benefits are processed? If so, it's bad planning. Did they have a back-up plan in case the upgrade failed?
  • Michael Tidmarsh
    I agree with Todd...wouldn't you think they would schedule an upgrade presumably on a Friday night?
  • TomLiotta
    Without knowing operational schedules, there's no way to know what an appropriate scheduling time might be. We often imagine that "office hours" is the worst time slot, but large organizations can have regular schedules that have the truly critical work happening during "off hours". From a number of years of personal experience, I'm aware that a State Government is definitely a large and complex organization. There often is no such thing as a good time for an upgrade. Whatever time it happens, there will be a significant disruption to some segment if a problem arises. In this case, we heard about a few agencies feeling some pain. Had it happened during some other time slot, it possibly wouldn't have created the same headlines, but I have no doubt that the same degree of disruption would have happened, just to a different list of offices or citizens. -- Tom
  • bhannah
    I have to agree with Michael and Todd. It sounds to me like bad planning and bad disaster recovery. It should have been planned for a low-volume time and was not. A lot of organizations have no options about their hardware upgrades and their systems are always live. You always plan for a worst-case scenario, and then are very happy when it does not happen.
  • evertonlf
    If they needed this upgrade for Monday, the environment needs a DA solution (disaster avoidance) with two sites active-active.
  • alany
    “Just who in their right mind upgrades a live system?” noted one commenter.

    Funny.  I thought people in their right minds only bought systems that are DESIGNED AND ENGINEERED to be upgraded live. I was upgrading AIX systems live back in the 90's, for pity's sake, and Tandem was doing the same for storage in the same timeframe.
