Posted by: Sharon Fisher
data center, digital government, disaster recovery, hitachi
A large number of Oregonians looking for state services — including 63,000 unemployed people expecting checks for a total of $18 million in benefits — were left high and dry for a day recently due to problems with a Hitachi storage upgrade.
Hitachi contractors were doing what was supposed to be a routine upgrade to the State Data Center in Salem when a connectivity issue caused the system to go down, KGW News reported state spokesman Matt Shelby as saying. “Hitachi worked overnight to fix the problem. All state agency websites were affected, but no data was lost,” the station said. The outage started at 7 p.m. Monday and was repaired by Tuesday morning, while state services were restored by midday.
Up to 90 percent of the weekly unemployment benefits are normally processed on Monday nights, according to an AP story in The Columbian.
Other issues, according to Oregon Public Radio and The Oregonian, included:
- The state’s more than 90 agencies couldn’t communicate directly with each other via email
- Any jobs that needed to pull data from the data center couldn’t run
- The Department of Transportation TripCheck was down
- The Department of Forestry, which was fighting a fire in Prineville (ironically, where Facebook has one of its data centers), didn’t have access to email or database forms
- 35 applications for food stamps scheduled for overnight processing were delayed
Ironically, to a certain extent Oregon brought this on itself by planning to consolidate its various state data centers into the single State Data Center in 2004. “The State Data Center was authorized in July 2004 to consolidate the computer operations of the 12 largest agencies,” notes the Statesman-Journal. “A $20 million building on Airport Road SE houses the center, which opened in fall 2005. Lawmakers in 2005 approved $43.6 million for the consolidation process.” But in July, 2008 — almost exactly five years ago — the state’s plan for consolidating data centers was sharply criticized for not adequately consolidating the servers themselves.
The system has also been plagued by crashes. In October, 2009, a network failure on the State Data Center system caused an overload on the unemployment system, shutting it down for 12 hours. In October, 2011, unemployment payments were delayed a day because a computer upgrade had “unintended consequences.” Then in May, 2012, a number of state websites were down for most of a day due to problems in a Texas data center that stored their content.
That was just two months after the Secretary of State’s office performed an audit of the department, noting that it needed improvement in the area of disaster recovery. That letter referenced the Federal Information Systems Controls Audit Manual, which notes, among other things, that “Spare or backup hardware is used to provide a high level of system availability for critical and sensitive applications.”
And, a month ago, three senior officials in the Department of Employment lost their jobs due in part to problems with the department’s computer systems. “Audit after audit exposed leadership problems that festered as the agency wasted as much as $30 million on computer software programs that didn’t work,” reported The Oregonian. “IT employees ‘are appointed to positions that they may or may not be suitable for, they are not coached and then their job duties were significantly changed.’” It said that the IT division needed “leadership, governance, priority setting, methodology, contract administration and appropriate HR practices.”
State officials pointed out that no data was lost in the recent incident; what was lost, for a day, was access to the systems.
This is not to pick on Oregon; as IEEE Spectrum pointed out, the state government computer systems of New Mexico, Kansas, North Carolina, New Jersey, and Iowa all ran into problems that same week. These incidents do demonstrate, though, the challenges facing citizens who need services, and who tend to be the less computer-savvy ones, when increasingly computerized state systems run into problems.
“Just who in their right mind upgrades a live system?” asked one commenter.
Analyst Greg Schulz of Storage I/O agrees, calling it “CYA 101.” “Anytime there is a person involved, regardless of if it’s hardware, cables, software, firmware, configurations or physical environments, something can happen,” he writes. “If the vendor drops the ball or a cable or card or something else and causes an outage or downtime, it is their responsibility to discuss those issues. However, it is also the customer’s responsibility to discuss why they let the vendor do something during that time without taking adequate precautions. Likewise, if the storage system was a single point of failure for an important system, then there is the responsibility to discuss the cost cutting concerns of others and have them justify why a redundant solution is not needed.”