Posted by: Tskyers
Data center disaster recovery planning, small business storage
I’ve recently been at the fuzzy end of the data recovery/data availability lollipop. I lost a motherboard due to some crazy unknown issue/interaction with my front-mounted headphone jack, the motherboard and the sound card. During this nightmare I’ve come to appreciate even more the process of making sure that, in the event of a disaster, companies (even small ones like my home business) have access not only to their data but to their critical systems as well.
I’ve passed through all the phases of grief with this motherboard. At first, I was in denial for a good 24 hours, thinking, “There’s no way this could be happening, something just tripped and all I have to do is reset a switch or jumper.” Well, I moved around the three jumpers on the board and the myriad of switches at least 10 times each, and it was still dead. I took out the CPU, the memory, all the cards, and tried a new power supply. No go.
By now I’d been down for about 48 hours and panic was setting in. So I set out to try and at least recover my data. I have most (funny thing, I thought I had all) of my important data on my file server in the basement, my email via IMAP (replicated from a protected server on the Internet to a server in my home virtual server farm) and the applications I’d need to carry out my work functions available via ISOs on another file server. I figured these steps would be good enough to get me up and running in case I lost my desktop. But I was wrong, ooooh so wrong. As it turned out, neither repairing the motherboard nor restoring data from other devices even came close to solving the whole problem.
The first tenet of data recovery planning is “Know the value of thy data” (Jon Toigo). The second tenet is “Know where it is, dummy” (Curtis Preston). I thought I knew the value of all my data, and I was absolutely certain I knew where it was. I had scripts built to move that data around from where I created it (my now-dead desktop) to a “safer” place (my super-redundant file server), while some of the smaller file size and text-based items were created directly on the file server.
I routinely categorize my documents, images, invoices and other data I create as well. As far as data classification is concerned, I really do eat my own dog food.
But apparently this wasn’t enough (or I need something new), because I still wasn’t able to work after my desktop went down. I was literally dead in the water. Production in my office came to a screeching halt while terabytes of storage, servers and such were still happily whirring away.
Why? Here’s the kicker. I was so used to my dual monitor setup with that fast storage subsystem that most of the things I was creating I couldn’t easily (or in some cases at all) shift to working on a laptop. Not only that, but I missed small things that I thought were unimportant, like Outlook email filters I created to organize my email (I get about 100 or so real messages out of the 500+ total messages on a weekday). I found it almost impossible to sift through all the email to get at the bits I needed. I kept running into situations where documents I was creating depended on some bit of data that was easily accessible when I was working on my desktop but took me close to two hours to find when I was on my laptop (I have a desktop search engine setup that indexes my document stores).
I’d also gotten so used to the notepad gadget on Vista’s sidebar that I stored all kinds of little notes to myself, URLs and such. All now inaccessible. While I could technically “work,” it was taking me eight hours to do what normally took 30 minutes.
Being caught completely off guard by this made all the steps I took to prepare for this situation seem all the more pointless. I had most of my data. I could access most of my data. But I was having serious problems with productivity because key pieces were missing.
This cost me. . .and not just in terms of productivity. I actually ended up paying $100 more for goods for my hobby e-shop because I couldn’t locate the original quote the company sent me and it had been a relatively long period of time between quoting and purchasing. Aargh!
Trying to find a motherboard (the same brand and model) locally was an exercise in futility. The board was out of production and stock had dried up everywhere but on the Internet, where the price was astronomical. I ended up having to RMA a second board and had to switch manufacturers and reinstall Vista three times.
What’s more, there are always complicating factors at work in any recovery situation. Right before my motherboard shorted, my wife and I — given the economy — had revisited our budget looking to cut costs, and, seeing how much we were paying for communications and television, decided to switch to Comcast VoIP from a Verizon land line.
In doing so, we discovered that the cable line coming into our house had a crack in it, and when the wind blew or a bird sat on the cable the cable swayed, and the signal strength would fluctuate too much for the VoIP Terminal Adapter. So the cable had to be replaced. This meant that when the motherboard died, not only was my main computer down, but I also had no reliable communications besides my cell phone. The only way for me to get on the Internet reliably was to tether with my cell phone–all this only a week after I got my computer back to a semi-productive state!
Comcast would replace our modem four times, and send five different technicians out to diagnose the issue. After two weeks of no (or nearly no) Internet, they replaced the cable all the way out to multiple poles along the street.
And those were just the infrastructure disasters. The work stoppages caused by them were disasters in and of themselves. I have a home office, and my wife works exclusively from her home office. Without the Internet she is, for all intents and purposes, out of business, and I’m not too far behind her. Over the five weeks it took for these events to unfurl we’ve calculated the lost man (and woman) hours at about 350. . .give or take a few working Saturdays.
Lessons learned for me:
- Have a spare board. Sure, it’s costly, but after almost two weeks of lost productivity just waiting for a board, I realized it’s cheaper to have a board on a shelf.
- As an infrastructure engineer I do my best to plan for disasters by building in replication facilities and sourcing storage subsystems that lend themselves to replication and can operate in hot/warm and hot/hot configurations. This, however, is not disaster recovery planning, as much as I’d like to pat myself on the back and say it is. That part of the process is simply being prudent about hardware choices. While it helps with DR, it cannot be relied on as your main plan no matter what hardware vendors tell you.
- Really planning for DR involves things that I’ve always felt should be left to folks with proven expertise. My recent experiences have firmly cemented that belief. A storage professional is not a DR professional by default, no matter how many storage professionals happen to be extremely proficient at DR. Having a great protection plan for data with SRDF, snapshots and gigawatts of backup power does not mean that you or your business will actually be able to function in the event of a disaster.
- Make efforts to truly understand the value of metadata, indexes and other things required to conduct business in the event of a disaster, not just the Word file and a copy of Microsoft Office.
- Internet access has become a requirement. It is no longer a luxury. Plan for a backup line (DSL, cellular, etc.).
- If your computers being down means you will lose money at your business, pay someone to help you with a REAL DR plan. If you are a home-based business, do research on what you should be planning for and talk with a professional about DR.
- Have spares. . .wait, did I say that already?
Hopefully all this will spare some folks this nightmare by pushing them to take a real look at how they work and how they can continue working in the event of a disaster, whether it’s on a small or large scale.
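One way to act on the metadata lesson above is to keep a short list of the "small things" you depend on (mail filter exports, sidebar notes, search indexes) and check regularly that each one has a current copy on the file server. The following is only a minimal sketch of that idea, not a DR plan: the function name, the labels and the relative paths are all hypothetical placeholders, and it assumes each item can be represented as a file that your existing copy scripts mirror to the same relative path under a backup root.

```python
from pathlib import Path


def find_unprotected(working_root, backup_root, items):
    """Return the labels of items with no up-to-date copy under backup_root.

    items maps a human-readable label to a path relative to both roots.
    (Hypothetical layout: the backup mirrors the working tree.)
    """
    unprotected = []
    for label, rel_path in items.items():
        src = Path(working_root) / rel_path
        dst = Path(backup_root) / rel_path
        if not src.exists():
            # Nothing on this machine to protect, so nothing to report.
            continue
        # Flag the item if the copy is missing or older than the original.
        if not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime:
            unprotected.append(label)
    return unprotected


if __name__ == "__main__":
    # Placeholder labels and paths; substitute whatever you actually rely on.
    working_state = {
        "mail filter export": "exports/mail-rules.rwz",
        "sidebar notes": "notes/sidebar-notes.txt",
        "desktop search index": "index/search.db",
    }
    for label in find_unprotected("C:/work", "//fileserver/backup", working_state):
        print("No current backup for:", label)
```

The check only compares modification times, so it will not notice a corrupted copy. Its value is mostly in forcing you to write the list down in the first place.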