Storage Soup:

Data center disaster recovery planning

Jun 18 2009   2:49PM GMT

HDS disk array failure suspected in Barclays outage; where’s the HAM?



Posted by: Beth Pariseau
Data center disaster recovery planning

According to reports out of the U.K. yesterday, Barclays’ ATMs stopped working Tuesday because of a fault with one of the bank’s disk arrays.

The exact nature of the problem has not been specified, but the company is publicly known as a customer of Hitachi Data Systems’ (HDS) USP-V. HDS supplied a SAN subsystem based on its high-end USP-V hardware in February to bring capacity to 1 PB at a new 28,000 square foot Gloucester data center. That is the data center where the outage occurred.

Reached for comment, an HDS spokesperson wrote to Storage Soup in an email:

Not much to respond to as Barclays’ operations are now fully back online as of end of business day yesterday local time. Barclays and Hitachi Data Systems are investigating the cause of the problem. As a trusted storage partner to customers around the globe, it is our commitment to deliver on high standards of customer service and support excellence to Barclays and all of our customers worldwide.

U.K. storage consultant Chris M. Evans, who has worked with HDS products and customers, came to the vendor’s defense. He pointed the finger at the lack of redundancy of Barclays’ architecture.

What surprises me with this story is the time Barclays appeared to take to recover from the original incident.  If a storage array is supporting a number of critical applications including online banking and ATMs, then surely a high degree of resilience has been built in that caters for more than just simple hardware failures?  Surely the data and servers supporting ATMs and the web are replicated (in real time) with automated clustered failover or similar technology?

We shouldn’t be focusing here on the technology that failed.  We should be focusing on the process, design and support of the environment that wasn’t able to manage the hardware failure and “re-route” around the problem.  

One other thought.  I wonder if this problem would have been avoided with a bit of Hitachi HAM?

Feb 17 2009   7:26PM GMT

VMware reports glitch with SRM and NetApp systems



Posted by: Beth Pariseau
Data center disaster recovery planning

According to a note posted on VMware’s KnowledgeBase website, the server virtualization software maker is recommending that users of NetApp’s FAS arrays in a High Availability System Configuration not upgrade to VMware Site Recovery Manager (SRM) 1.0 Update 1.

The note says these users have a 50% chance of encountering a bug that means “replicated datastores are not detected correctly” within the application. It’s unclear whether the bug is on the NetApp or VMware side of the equation, but the companies are investigating and can assist customers with downgrading to previous versions if they experience problems.

This is not the first time integration between SRM, VMware’s disaster recovery/failover application for virtual servers, and array-based replication from storage vendors has proven tricky. An HP user also told SearchDisasterRecovery.com last week that he experienced some pain while trying to bring his EVA environment up to speed with SRM.


Dec 11 2008   2:48PM GMT

The vagaries of disaster recovery, cont’d



Posted by: Beth Pariseau
Data center disaster recovery planning

Last week, Tory Skyers wrote a post about the unforeseen complexities of disaster recovery after his PC’s motherboard fried. This week, I had an interesting discussion with somebody working with a much bigger enterprise infrastructure who also found that people, process, and sometimes luck–good and bad–can influence disaster recovery planning more than any technology.

Mark Zwartz, manager of information technologies for privately held real estate conglomerate JMB Companies, has been signed on with SunGard’s Availability Services for virtual server-based DR since August.  Last week Zwartz gave me a deeper look behind the scenes at his disaster recovery planning process, and the ways it wasn’t so simple.

For one thing, it might not have happened at all without a contract re-negotiation with SunGard. “Our original contract was a wacky deal with [one subsidiary] that expanded to the other entities, but the contracts were so goofy and webbed into each other that if we called with a problem or needing to fail over, the people at SunGard might not have any idea which company or machines we were talking about,” he said. “We saved money on lawyers and negotiations, but quite honestly, if we had to failover, nobody at SunGard might’ve known what to turn on.”

The renegotiation of that contract happened to coincide with the beta program for the SunGard service. “That was the foot in the door to change things,” Zwartz said. Because it was a beta program, the companies got three free months of testing, which caught the attention of Zwartz’s management.

Meanwhile, JMB was in the midst of a hardware refresh, as well as rolling out virtualization. Still, “one of the hardest parts was selling virtualization–these are highly intelligent people who are handling millions if not billions of dollars, and it’s hard to explain the concept that they don’t actually ‘own’ anything [with virtual servers],” he said.

A broker at a hedge fund firm in the conglomerate was highly reliant on Outlook contacts and notes in each contact file to do her daily business. Meanwhile, the company still worked with traditional tape backup, which wouldn’t offer the granular protection to recover all of those contacts in the event of an outage.

I’m sure most of you out there in blogland know what happened next. “She was synching her BlackBerry herself, and blew out her contacts,” according to Zwartz. The only option for restoring Exchange backups was to restore the entire Exchange database from tape to a separate server, which the company had declined to buy. “They lost a significant part of a trade before that,” he said. “Nobody realized such a small thing would make her unproductive.”

Of course, without a fully staffed and replicated secondary environment, Zwartz acknowledged, it’s impossible to be disaster-proof. But the incident also had a silver lining when it came to convincing management to participate in the new SunGard program. “It would’ve cost $1,500 to get a new backup device,” he said. “It wound up taking six weeks to get one contact back and cost between $15,000 and $20,000. It was a selling point when it came to virtual servers with SunGard.”


Dec 1 2008   4:33PM GMT

Lessons learned from personal disaster recovery



Posted by: Tory Skyers
Data center disaster recovery planning, small business storage

I’ve recently been at the fuzzy end of the data recovery/data availability lollipop. I lost a motherboard due to some crazy unknown issue/interaction with my front-mounted headphone jack, the motherboard and the sound card. During this nightmare I’ve come to appreciate even more the process of making sure that, in the event of a disaster, companies (even small ones like my home business) have access not only to their data but to their critical systems as well.

I’ve passed through all the phases of grief with this motherboard. At first, I was in denial for a good 24 hours, thinking “there’s no way this could be happening, something just tripped and all I have to do is reset a switch or jumper.” Well, I moved around the three jumpers on the board, and the myriad of switches at least 10 times each, and it was still dead. I took out the CPU, the memory, all the cards, and tried a new power supply. No go.

By now I’d been down for about 48 hours and panic was setting in. So I set out to try and at least recover my data. I have most (funny thing, I thought I had all) of my important data on my file server in the basement, my email via IMAP (replicated from a protected server on the Internet to a server in my home virtual server farm) and the applications I’d need to carry out my work functions available via ISOs on another file server. I figured these steps would be good enough to get me up and running in case I lost my desktop. But I was wrong, ooooh so wrong. As it turned out, neither repairing the motherboard nor restoring data from other devices even came close to solving the whole problem.

The first tenet of data recovery planning is “Know the value of thy data” (Jon Toigo). The second tenet is “Know where it is, dummy” (Curtis Preston). I thought I knew the value of all my data, and I was absolutely certain I knew where it was. I had scripts built to move that data around from where I created it (my now-dead desktop) to a “safer” place (my super-redundant file server), while some of the smaller file size and text-based items were created directly on the file server.

I routinely categorize my documents, images, invoices and other data I create as well. As far as data classification is concerned, I really do eat my own dog food.

But apparently, this wasn’t enough (or I need something new) because I still wasn’t able to work after my desktop went down. I was literally dead in the water–production in my office came to a screeching halt with terabytes of storage, servers and such still happily whirring away.

Why? Here’s the kicker. I was so used to my dual monitor setup with that fast storage subsystem that most of the things I  was creating I couldn’t easily (or in some cases at all) shift to working on a laptop. Not only that, but I missed small things that I thought were unimportant, like Outlook email filters I created to organize my email (I get about 100 or so real messages out of the 500+ total messages on a weekday). I found it almost impossible to sift through all the email to get at the bits I needed. I kept running into situations where documents I was creating depended on some bit of data that was easily accessible when I was working on my desktop but took me close to two hours to find when I was on my laptop (I have a desktop search engine setup that indexes my document stores). 

I’d also gotten so used to the notepad gadget on Vista’s sidebar that I stored all kinds of little notes to myself,  URLs and such. All now inaccessible. While I could technically “work,” it was taking me eight hours to do what normally took 30 minutes.

Being caught completely off guard by this made all the steps I took to prepare for this situation seem all the more pointless. I had most of my data. I could access most of my data. But I was having serious problems with productivity because key pieces were missing.

This cost me. . .and not just in terms of productivity. I actually ended up paying $100 more for goods for my hobby e-shop because I couldn’t locate the original quote the company sent me and it had been a relatively long period of time between quoting and purchasing. Aargh!

Trying to find a motherboard (the same brand and model) locally was an exercise in futility. The board was out of production and stock had dried up everywhere but on the Internet, where the price was astronomical. I ended up having to RMA a second board and had to switch manufacturers and reinstall Vista three times.

What’s more, there are always complicating factors at work in any recovery situation. Right before my motherboard shorted, my wife and I — given the economy — had revisited our budget looking to cut costs, and, seeing how much we were paying for communications and television, decided to switch to Comcast VoIP from a Verizon land line. 

In doing so, we discovered that the cable line coming into our house had a crack in it, and when the wind blew or a bird sat on the cable the cable swayed, and the signal strength would fluctuate too much for the VoIP Terminal Adapter.  So the cable had to be replaced. This meant that when the motherboard died, not only was my main computer down, but I also had no reliable communications besides my cell phone. The only way for me to get on the Internet reliably was to tether with my cell phone–all this only a week after I got my computer back to a semi-productive state!

Comcast would replace our modem four times, and send five different technicians out to diagnose the issue. After two weeks of no (or nearly no) Internet, they replaced the cable all the way out to multiple poles along the street.

And those were just the infrastructure disasters. The work stoppages caused by them were disasters in and of themselves. I have a home office, and my wife works exclusively from her home office. Without the Internet she is, for all intents and purposes, out of business, and I’m not too far behind her. Over the five weeks it took for these events to unfurl we’ve calculated the lost man (and woman) hours at about 350. . .give or take a few working Saturdays.

Lessons learned for me:

  • Have a spare board. Sure, it’s costly, but after almost two weeks of lost productivity just waiting for a board, I realized it’s cheaper to have a board on a shelf.
  • As an infrastructure engineer I do my best to plan for disasters by building in replication facilities and sourcing storage subsystems that lend themselves to replication and can operate in hot/warm and hot/hot configurations. This, however, is not disaster recovery planning, as much as I’d like to pat myself on the back and say it is. That part of the process is simply being prudent about hardware choices.  While it helps with DR, it cannot be relied on as your main plan no matter what hardware vendors tell you.
  • Really planning for DR involves things that I’ve always felt should be left to folks with proven expertise. My recent experiences have firmly cemented that belief. A storage professional is not a DR professional by default, no matter how many storage professionals happen to be extremely proficient at DR. Having a great protection plan for data with SRDF, snapshots and gigawatts of backup power does not mean that you or your business will actually be able to function in the event of a disaster.
  • Make efforts to truly understand the value of metadata, indexes and other things required to conduct business in the event of a disaster, not just the Word file and a copy of Microsoft Office.
  • Internet access has become a requirement. It is no longer a luxury. Plan for a backup line (DSL, cellular, etc.).
  • If your computers not working means you will lose money at your business, pay someone to help you with a REAL DR plan. If you are a home-based business, do research on what you should be planning for and talk with a professional about DR.
  • Have spares. . .wait, did I say that already?

Hopefully all this will spare some folks this nightmare by pushing them to take a real look at how they work and how they can continue working in the event of a disaster, whether it’s on a small or large scale.


Nov 17 2008   4:43AM GMT

CA goes SaaS route with DR



Posted by: Dave Raffo
Data center disaster recovery planning, Storage backup, Storage Software as a Service, data backup, small business storage

CA jumped into the software as a service (SaaS) game by launching three offerings at CA World. The SaaS offerings include a disaster recovery/business continuity service called CA Instant Recovery On Demand, which is built on technology acquired when CA bought XOsoft in 2006.

CA will sell the service through resellers and other channel partners. A participating reseller will establish a VPN connection between the customer and CA, and use that to automatically fail over a server that goes down. The service supports Microsoft Exchange, SQL Server and IIS, as well as Oracle applications.

Instant Recovery On Demand costs around $900 per server for a one-year subscription.

Adam Famularo, CA’s general manager for recovery management and data modeling, expects the service to appeal mostly to SMBs because larger organizations are more likely to use the XOsoft packaged software for high availability and replication. “If an enterprise customer says ‘We love this model, too,’ they can buy it,” he says. “But most enterprises want to buy it as a product.”

Famularo says he sees the service more for common server problems than for large disasters. “It’s not just for hurricane season, but for everyday problems,” he says.


Oct 14 2008   4:29PM GMT

Symantec question marks: email archiving SaaS and VMware competition



Posted by: Beth Pariseau
Data center disaster recovery planning, VMware, Strategic storage vendors, data compliance and archiving

There are some potential inferences that can be made from two moves Symantec Corp. made last week: the acquisition of MessageLabs and the launch of Veritas Cluster Server One. For now, though, clear answers about whether those inferences are correct are not forthcoming.

MessageLabs is partnered with Fortiva to offer email archiving SaaS, so I wondered if the acquisition might mean that Symantec will get into that kind of offering as well.  SearchSecurity.com reported that MessageLabs CEO Adrian Chamberlain will be heading up a new SaaS group at Symantec, though Symantec officials also told SearchSecurity they won’t be SaaS-enabling all products. This remains an open question for now, as a Symantec spokesperson told me that product roadmaps will be decided after the acquisition closes, which might not happen until year end.

Symantec also launched VCS One, with the goal of allowing organizations to keep active farms of virtual servers running at a disaster recovery site, as well as recover tiered applications with dependencies intact. “Right now, this process is dependent on a lot of tribal knowledge in the heads of individuals who know the right order and design scripts to run this kind of recovery,” said Mark Lohmeyer, vice president and general manager of the VCS product group at Symantec.

If any of that sounds familiar, it might be because this past May, VMware and its storage partners launched VMware Site Recovery Manager, allowing VMware’s VirtualCenter to execute commands against storage arrays at primary and secondary sites during recoveries and enable VirtualCenter-generated metadata about virtual machines to be replicated, along with system and application data. In part, SRM is designed to help server virtualization customers automate their disaster recovery checklists, which many of them keep on paper and check off manually.

Meanwhile, Symantec has been among the most outspoken of storage vendors about friction with the server virtualization giant, and at Vision this year took VMware rival Citrix XenServer under its wing and into its product line, claiming the resultant Veritas Virtual Infrastructure product will be a better approach than VMware’s Virtual Machine File System (VMFS)  for server virtualization in large environments.

However, Symantec positions VCS One as complementary to SRM, rather than competitive with it. A Symantec spokesperson emailed me the following statement when I asked about it late last week:

VCS One is a complementary solution for VMware environments that can help improve overall availability of the environment in production, for mission-critical apps, by taking an application-centric approach to HA/DR.  And, we work closely with VMware to integrate with, and leverage VMware technologies such as Vmotion (for reducing planned downtime) and DRS today, and we’re looking at how we can also integrate with SRM in the future.  Finally, our solution is ideal for heterogeneous physical and virtual environments, that includes VMware as well as other platforms (which is the case in virtually every data center).

It’s important to note, though, that VCS One only supports VMware virtual machines at present, which might make the kind of competitive statements made earlier this year a bit awkward at this stage. Lohmeyer says VCS One was under development before Xen came on the scene. “[Support for Xen] will be in our very next release,” he said. Once that happens, I wonder if Symantec’s messaging might change somewhat.


Oct 1 2008   3:23PM GMT

Unitrends unites backup, DR management



Posted by: Beth Pariseau
Data center disaster recovery planning, data backup, small business storage

SMB backup and DR vendor Unitrends has released version 4.0 of its RapidRecovery management software for its Data Protection Unit disk-to-disk backup hardware. The new version completes a yearlong effort from Unitrends to bring together what were once separate GUIs for managing backup and offsite vaulting using the DPU devices.

A year ago, the company removed the command line interface, which CEO Duncan MacPherson described as “a late ’90s level GUI that looked old and slow.” At that time, Unitrends gave backup and configuration management interfaces a facelift. The current release pulls in offsite vaulting and data recovery. Other new features include the ability to create customized reports based on the GUI, test DR plans, recover single files from a secondary site, and support for new operating systems including Novell Netware. MacPherson said Windows 2008 will be supported by the end of the year.

Unitrends’ goal is to package all data protection processes and hardware into one product. Combining operational backup and disaster recovery practices also seems to be an emerging trend. This is also being done through backup service providers whose backups by definition are offsite, and who are beginning to offer more affordable system state recovery of hosts using virtual servers. Stay tuned to the SearchDataBackup.com and SearchDisasterRecovery.com sites for more on this.


Aug 25 2008   10:58AM GMT

Symantec says C-level execs not involved enough in DR



Posted by: Beth Pariseau
Data center disaster recovery planning

Symantec Corp. released the results of its survey of 1000 IT managers and decision makers about disaster recovery for 2008 today. Among its findings was a decrease in C-level executive involvement in DR planning compared to the results for the 2007 survey, which Symantec officials said they found alarming.  

In the 2007 DR survey, 55 percent of respondents said that their DR committees involved the CIO/CTO/IT director.  In 2008, that number dropped to 33 percent worldwide.  

“Executive complacency could be attributed to the improvement in DR testing successes,” according to the company’s survey report. Delegation of tasks to lower-level managers once the C-suite sets overall DR goals could also be at play, conceded Symantec director of product marketing for Data Protection Marty Ward.  However, the survey results remain a cause for concern at Symantec, Ward said. “It’s more likely that DR is still just not seen as a basic requirement for companies - there also haven’t been as many current events lately that spur people into thinking about disaster recovery.”

As for that last statement, let’s all just take a moment to knock on wood. Meanwhile, Symantec says other results of the survey, like the fact that only 14% of chief security officers are involved in DR, point to complacency rather than delegation.

Other key findings of the study:

  • Although one third of organizations have had to execute a disaster recovery plan, just under half say they can get fully operational in a week.
  • The number of applications that IT managers believe are business critical has increased 20 percentage points over data from the previous year, and only about half of these applications are covered in DR plans.
  • Virtualization is driving organizations to reevaluate their DR plans.
  • Organizations report that DR testing impacts customers, sales and revenue because of the lack of tools that can address both virtual and physical environments.

On that last one, a recent customer case study we ran on the site can attest to that issue. It’s tough enough for companies to classify all data and arrange for tiered recovery while maintaining accurate and realistic RTOs and RPOs. So tough, in fact, that very few companies I’ve come across have even reached the frontier Northeast Utilities came up against - keeping the DR plan current and in working order without the operational bandwidth to complete live tests.

The analogy I’d use for this situation is to another unpleasant task - dieting. If initial DR planning is like losing weight, continued monitoring and updating for the environment is like keeping it off–in other words, the really hard part. According to the 2008 Symantec survey results, only 30 percent of tests meet RTO objectives. Only 31 percent of respondents reported that they could achieve baseline operations within one day if a significant disaster occurred that obliterated their main data center. Only 3 percent believed they could have skeleton operations within 12 hours.

Not all is doom and gloom, though. “Don’t get me wrong, there has been a 10 fold increase in testing over the last decade, and one of the most encouraging things about the 2008 survey is that it showed that not only are people testing, but more people are testing successfully,” Ward said. Last year, 50% of DR tests failed. This year, that number was 30 percent. “But there are still ongoing issues.”


May 2 2008   8:12AM GMT

The Storage Admin, DR, and the Down Market



Posted by: Tory Skyers
Data center disaster recovery planning

The economy has been on the mind of just about everybody recently, and with good reason. With gas at near-record highs, unemployment rising, housing values reportedly dropping, the credit crunch and foreclosures numbering in the bazillions, it is easy to see why people are not exactly upbeat about the state of our economy.

In the storage market, however, it’s looking like a blockbuster year. EMC and others are reportedly on track to meet or beat financial analysts’ estimates, and that leads me to today’s blog.

As it turns out, the impetus for this blog post was my recent attendance at a DR seminar put on by Storage Decisions featuring Jon Toigo. Looking around the room, I couldn’t help but think of what it looked like in the early days of “network administrators” when people didn’t think of network pros as any different from the server guys. Today, the storage admin is being called on to be part lawyer, part business analyst, part networking guru and all-knowing about all things storage, but there are very few companies with a dedicated storage team (outside the Fortune 500’s that have Exabytes of storage to manage).

For the most part (and please chime in with your experience) storage folks are still viewed as “server guys”. This is, of course, changing, and I wouldn’t bring it up if there weren’t a bigger point to be made: if you do a quick scan of Monster, Dice or Jobcircle, there are more and more listings specifically calling for a “Storage Administrator”. Storage is fast becoming the segment to be in–the information infrastructure could not function without it, and it is increasingly becoming the focus of much planning and resource allocation, in terms of both time and money. Talk to most companies, and they have storage budgets that are going up even in a down market, and they are hiring people to dedicate to the task of storage. Storage pros are more highly valued, and their pay is going up.

So what does this have to do with DR? DR is, at its basic level, moving data from one place to another on a regular basis, far enough away that if you had a disaster you could recover your data and continue operations. This, in almost every case that I can think of, requires storage, storage networking technologies and someone who knows enough about them to set it all up and keep it working in a changing environment. Hence all the storage pros in the room vs. the business types who normally involve themselves in DR.
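
To make the "moving data from one place to another, on a regular basis" part concrete, here is a minimal sketch in Python. It is not anything presented at the seminar and not a real DR product: the function name `replicate` and the directory names are hypothetical, and it assumes the "far enough away" location is simply reachable as a mounted path. It copies only files that are new or changed, which is the bare minimum a scheduled replication job has to do.

```python
import shutil
from pathlib import Path


def replicate(source: Path, target: Path) -> list[Path]:
    """Copy files that are new or changed under source into target.

    Hypothetical helper for illustration only. Returns the relative
    paths of the files that were copied on this run.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = target / rel
        # Copy when the target copy is missing, older, or a different size.
        if (not dst.exists()
                or dst.stat().st_mtime < src.stat().st_mtime
                or dst.stat().st_size != src.stat().st_size):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
            copied.append(rel)
    return copied
```

Run on a schedule (cron, Task Scheduler or similar), something like this covers only the data movement. It says nothing about whether the business can actually function on the copied data, which is the harder part of the plan.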

Toigo put on a great presentation. It was filled with a ton of valuable information and even if you have nothing to do with the DR planning and implementation at your company, I would enthusiastically recommend attending one in your area. I walked in thinking I had a passable grasp of DR best practices and walked out realizing I had barely scratched the surface, and that as a storage professional I needed to understand more about business practices as they relate to DR.

For example, Toigo discussed what a data model was and not only how to build one but suggestions on explaining it to non-technical analysts so we could all use it together to ultimately build a workable DR plan around valuable data instead of putting together a set of technologies to make our systems highly available but unable to really recover from a disaster. And it’s the storage guy who should be taking the lead on that.

Think of the value you bring to the table when you can not only provide the information infrastructure, but also assist in developing a DR plan that will keep the business functioning, and generating revenue in a disaster. In the process, you can also create things that have intrinsic value to multiple business units–think of what information security can do if they know what a document or document type is worth as compared to other documents. My fellow storage pros, I’m seeing a bright future for us.


Mar 5 2008   3:47PM GMT

New data protection gadgetry hits the streets



Posted by: Beth Pariseau
Data center disaster recovery planning, data backup

Two storage-related announcements came out of CeBIT this week that have turned a few heads.

The first is the FlashBack Adapter from thumb-drive king SanDisk. The device fits into the ExpressCard slot of a user’s PC, and automatically and continuously backs up and encrypts data onto a flash memory card. This way, to quote SanDisk, when “you’re at a conference and someone spills coffee on your laptop PC, shorting out the system and cutting you off from your presentation and notes. Or your computer slips out of your hands and crashes to the floor,” you can extract the memory card from the smoking wreckage, find another PC and be on your way.

The second announcement comes from a UK company called Retrodata, which is reportedly getting ready to release a do-it-yourself drive recovery system. The beast, which has yet to be photographed, reportedly weighs 75 kg (165 lbs.) and will be priced at around $7000. But for all you Austin Powers fans out there, it does come equipped with…“lasers”.

According to techchee, a blog dedicated to high-tech products:

The device uses laser-guided positioning to help it accurately extract platters from any 3.5 inch hard drive with minimal user intervention. What’s unusual is that such devices normally require highly skilled operators, whereas the System P. EX can be used by a relative novice at a data recovery company.

Maybe if Retrodata plays its cards right, it’ll get an order for…one million dollars.
