I am trying to develop the process which my company will follow in the event of a data center disaster.
Part of that process is to question whether or not a major incident has the characteristics of a true disaster and whether a disaster should be declared (which would obviously kick off our recovery processes).
Can anyone advise on questions or things to consider when evaluating the impact, risk, and exposure of a major data center incident?
Thanks for any help!
ASKED:
April 7, 2010 8:12 PM
UPDATED:
April 8, 2010 5:31 PM
Thanks Bitraptor.
If the Incident/Problem management teams determine that production will be affected for more than 4 hours, we must start our disaster recovery processes, since it takes us a good 24 hours to recover to our hot site and begin taking production traffic again.
So what I’m really after is a set of criteria that, once an incident has met them or is projected to meet them, will trigger a disaster declaration.
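One way to make that kind of trigger concrete is to compare the projected total outage against the declaration threshold. The sketch below is only an illustration under stated assumptions: the function name `should_declare_disaster` and its parameters are hypothetical, and the 4-hour threshold is simply the figure mentioned above.

```python
from datetime import timedelta

# Threshold taken from the post above: production affected for more than 4 hours.
DECLARATION_THRESHOLD = timedelta(hours=4)


def should_declare_disaster(elapsed: timedelta,
                            estimated_remaining: timedelta,
                            threshold: timedelta = DECLARATION_THRESHOLD) -> bool:
    """Return True when the projected total outage exceeds the threshold.

    elapsed: how long production has already been affected.
    estimated_remaining: the incident team's current estimate of the time
        still needed to restore service (hypothetical input).
    """
    projected_outage = elapsed + estimated_remaining
    return projected_outage > threshold
```

Re-running a check like this each time the incident team revises its estimate would flag an incident that is on course to cross the threshold before it actually does.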
My place uses an estimated 24 hours of outage as the determination of a disaster. We do not have a “hot site” for failover; we have to go to a cold site and reconstruct our systems. During D/R testing it takes 24 hours to build our “tier one” (critical) systems.
While I understand trying to develop good procedures and plans, I personally do not get too bogged down in the semantics of incident vs disaster. It isn’t the “word”, it is the “outcome”. After all, geologists refer to a volcanic eruption as an “event.”
We use a multi-pronged approach to defining a disaster. It involves an estimate of down-time, application priority, impact size, hydro power and hardware availability.
If our online access is down for more than 30 minutes, or if a high priority application (inventory management) is experiencing problems, or our email system crashes (affecting hundreds of users), or we lose electrical power for more than 15 minutes, or if we don’t have a duplicate replacement piece of equipment that can be installed within 4 hours, we quickly communicate with our disaster co-ordination team, discuss options and implement the appropriate recovery plan.
This communication process usually involves about 3 to 5 people and is finished within 10 to 20 minutes. We have pre-defined issues that will cause us pain, along with solutions. As we encounter issues not covered by our solutions, we update our processes accordingly. As you can see, this is a never-ending process.
We also contact our senior and departmental management to inform them of our decisions/progress.
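For illustration, criteria like the ones listed above could be written down as a simple checklist that reports which triggers have fired. This is a rough sketch, not a description of any real tool: the class `IncidentStatus`, its field names, and the function `triggered_criteria` are all hypothetical, and only the thresholds (30 minutes, 15 minutes, 4 hours) come from the post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncidentStatus:
    # All field names are hypothetical; only the thresholds come from the post.
    online_access_down_minutes: float = 0
    high_priority_app_impaired: bool = False
    email_system_down: bool = False
    power_loss_minutes: float = 0
    hardware_failed: bool = False
    # Hours needed to install a duplicate spare; None means no spare exists.
    spare_install_hours: Optional[float] = None


def triggered_criteria(status: IncidentStatus) -> List[str]:
    """Return the list of criteria that the current incident has met."""
    reasons = []
    if status.online_access_down_minutes > 30:
        reasons.append("online access down for more than 30 minutes")
    if status.high_priority_app_impaired:
        reasons.append("high priority application experiencing problems")
    if status.email_system_down:
        reasons.append("email system down")
    if status.power_loss_minutes > 15:
        reasons.append("electrical power lost for more than 15 minutes")
    if status.hardware_failed and (
        status.spare_install_hours is None or status.spare_install_hours > 4
    ):
        reasons.append("no replacement equipment installable within 4 hours")
    return reasons
```

Under this sketch, any non-empty result would be the cue to convene the disaster co-ordination team; the decision itself still rests with the people on that call.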
This is a really good discussion, everyone! I really appreciate the responses thus far…