By David Boyd, GlassHouse Technologies (UK), principal consultant
Disasters occur. Water pipes burst, roofs leak, electricity supplies fail, network paths get dug up, and much worse. Fortunately, most IT managers will never get to see their disaster recovery plans put into practice for real. But some will and all too often it is only at that point that their shortcomings are realised.
Once the dust has settled and normal service has resumed, IT managers will be under pressure to fix what went wrong with the original plan. Knee-jerk reactions and over-engineered solutions are too often implemented when a measured approach would be more efficient and much less costly.
I have a customer who is in exactly this situation. A water leak resulted in the failure of the entire storage environment. Following the disaster our customer rapidly acquired a new storage array and set about configuring it and allocating storage back to the clients.
Fortunately, the backup environment was unaffected and all services were resumed within a couple of weeks. It could have been much worse. If the disaster had happened at month end or if their hardware suppliers had not been able to mobilise so quickly the impact of the disaster could have been catastrophic.
Valuable lessons can be learned from a disaster. As a result of this exercise my customer has a detailed knowledge of the relative importance of each application. They know exactly which components link together to provide each service — in my experience, something many organisations don't have a handle on. They also know exactly how long it takes to recover a service, including build, restore and configuration times. Despite being armed with that information, senior management have decreed that, going forward, a recovery time objective (RTO) of 24 hours be implemented for all services, including test and development.
There is no doubt that the outage caused severe pain to the organisation, but reacting by stipulating a single recovery tier across the organisation is going to cost a lot of money. Replacement hardware can rarely be sourced within such a timeframe, so an exact replica of the environment must be purchased up front — hardware that will only ever be used in the event of a disaster.
A better way would be to take the hard lessons learned and align applications to defined disaster recovery service offerings. If you understand your disaster recovery requirements, you can start building a service catalogue that reflects them. Create distinct service tiers based upon RTO and RPO (recovery point objective), and align a technology configuration to each tier.
For example, the “platinum” tier might include application level clustering and synchronous data replication whereas your “bronze” tier would involve data recovery from tape after the procurement of replacement hardware. In this way, applications that have little impact on the organisation’s daily activities can be assigned to a lower tier which gives sufficient time to procure replacement hardware, negating the need to have everything duplicated.
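A catalogue like this can be as simple as a lookup from application to tier. The sketch below illustrates the idea; the tier names, RTO/RPO values, technology descriptions and application names are all hypothetical examples, not drawn from any real environment.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class DRTier:
    name: str
    rto: timedelta    # maximum tolerable time to restore service
    rpo: timedelta    # maximum tolerable data loss
    technology: str   # technology configuration aligned to the tier

# Illustrative service catalogue: three tiers with progressively
# relaxed objectives and correspondingly cheaper technology.
CATALOGUE = {
    "platinum": DRTier("platinum", timedelta(hours=1), timedelta(0),
                       "application-level clustering, synchronous replication"),
    "silver":   DRTier("silver", timedelta(hours=24), timedelta(hours=4),
                       "asynchronous replication to standby hardware"),
    "bronze":   DRTier("bronze", timedelta(days=7), timedelta(hours=24),
                       "tape restore after procurement of replacement hardware"),
}

# Applications mapped to tiers by business impact (hypothetical names).
ASSIGNMENTS = {
    "order-processing": "platinum",
    "reporting": "silver",
    "test-and-dev": "bronze",
}

def tier_for(application: str) -> DRTier:
    """Look up the DR tier an application has been aligned to."""
    return CATALOGUE[ASSIGNMENTS[application]]
```

Keeping the mapping explicit like this makes it easy to answer, during a real incident, the question that matters most: in what order do services come back, and what technology does each one rely on?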
Once the service catalogue is in place, solutions that meet those requirements can be investigated and purchased. Of course, a full schedule of DR testing is a vital part of any solution.
Although it takes longer, this approach will help to prevent the knee-jerk reactions that so often follow disasters.