One of the more complex problems in network architecture design is redundancy. Sometimes the principles of redundancy and disaster recovery become confused.
So the first question is: what is the difference between redundancy and disaster recovery? The simplest way to tell them apart is to ask when the system engages. A system designed to act after a failure is a disaster recovery system. A system designed to prevent a failure is usually a redundant system.
For example, think about a server with two power supplies. If one power supply fails, does the server fail? It does not: the second, redundant power supply is designed to keep the server running even if the first should fail. Two power supplies are an example of redundancy.
Hard disk RAID arrays are another example of redundancy. If one drive in the array fails, the system continues to run. Often the disk can be replaced without shutting down the server. Most of the components in a modern server are redundant. For a redundant component to take the server down, every copy of that component must fail.
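To put a rough number on why duplicated components help, here is a minimal sketch in Python. It assumes the copies fail independently of one another, which real hardware only approximates, and the 99% availability figure is an illustrative assumption, not a measurement. The function name `parallel_availability` is hypothetical.

```python
def parallel_availability(component_availability: float, count: int) -> float:
    """Availability of `count` redundant copies of one component.

    Assumes failures are independent: the group is down only when
    every copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    chance_one_copy_is_down = 1.0 - component_availability
    chance_all_copies_are_down = chance_one_copy_is_down ** count
    return 1.0 - chance_all_copies_are_down


if __name__ == "__main__":
    # 0.99 is an assumed, illustrative availability for a single power supply.
    for copies in (1, 2, 3):
        print(copies, f"{parallel_availability(0.99, copies):.6f}")
```

Under these assumptions each extra copy shrinks the chance of a total outage by the same factor, which is why a second power supply or a mirrored drive buys so much uptime.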
So how is this different from a disaster recovery system? A disaster recovery system kicks in after the system is down. The classic example is a hard drive failure. When a server fails, management's number one worry is the safety of the data. In this situation the disaster recovery team restores the data from a backup. A disaster recovery plan starts with the question: what if we lose a file? Then it asks: what do we do if we lose a drive or drive array? Then: what if the server fails? Finally: what if the site fails? If there is no redundant system in place, a disaster recovery plan is the last chance to recover from a data or system loss.
In a redundant system the architect asks the same questions but answers them before the failure happens. If the server is lost, there must be another server that has the same information. If an internet connection is lost, another internet connection must be available. If a site is lost, there must be another site with the same information available to switch over to. When considering the cost of the network design, the rule is, “The more redundant the system, the more expensive the system.” With unlimited funds every network would have unlimited redundancy, but in a world where funds and resources are limited, most networks are a mixture of redundancy and disaster recovery.
The architect meets with the business owner to decide the priorities associated with the system design. Business-critical production systems tend to be highly redundant, because while a business-critical system is down, the company is losing money. Profitability and organizational productivity are directly related. Statistically, when these types of systems are down it costs a small to medium-sized business $7,314.00 or more each hour in productivity losses. Successful large businesses see this rise to $50,000 per hour. Some systems within a Fortune 500 company could easily cost $1,000,000 per minute in lost productivity. Consider that for a successful small business, a 12-hour outage can potentially cost nearly $88,000 in lost productivity (12 × $7,314 = $87,768). Redundancy is statistically one-seventh the cost of repairing the system after a failure.
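The arithmetic behind the 12-hour example is simple enough to sketch in Python. The hourly and per-minute figures are the ones quoted in this article. The constant names and the `downtime_cost` function are hypothetical, chosen only for illustration.

```python
# Figures quoted in the article; the names are illustrative assumptions.
HOURLY_COST_SMALL_BUSINESS = 7_314.00
HOURLY_COST_LARGE_BUSINESS = 50_000.00
COST_PER_MINUTE_FORTUNE_500 = 1_000_000.00


def downtime_cost(hours_down: float, hourly_cost: float) -> float:
    """Estimated productivity loss for an outage of `hours_down` hours."""
    if hours_down < 0 or hourly_cost < 0:
        raise ValueError("hours_down and hourly_cost must be non-negative")
    return hours_down * hourly_cost


if __name__ == "__main__":
    # The article's example: a small business that is down for 12 hours.
    print(f"${downtime_cost(12, HOURLY_COST_SMALL_BUSINESS):,.2f}")  # $87,768.00
    # The same outage at the large-business rate.
    print(f"${downtime_cost(12, HOURLY_COST_LARGE_BUSINESS):,.2f}")  # $600,000.00
```

Running the same estimate for each system is one way to give management a number to weigh against the price of making that system redundant.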
Yet failures in non-critical systems can sometimes go unnoticed for weeks, so making non-critical business systems redundant may not be worth the cost to the management teams. Deciding which systems are business critical and which aren’t is a managerial decision. The problem is that the management teams are often left out of this decision process.
When making availability decisions in my architecture, here are the questions I ask myself:
The logic tree behind this…
- All strategic business decisions should be made by management.
- All technical decisions are tactical and should be made by the technician.
Therefore: If a change made by a technical expert does not support the vision or goals of the company, a department, or a team, this is a strategic decision. This change should be run by management before it is made.
Therefore: If the change is made to reduce time or costs for the company, this is also a strategic decision. Therefore this change should be run by management before the change is made.
Finally: If the change supports the present tactics of the organization, without affecting cost or time (positively or negatively) this is an appropriate decision for the technical team to make without running the problem by management.
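The logic tree above can be written down as a small routing function. This is only a sketch of one reading of the three rules. The `Change` fields, the `Decider` enum, and the `who_decides` function are hypothetical names, and deciding whether a change really "supports the goals" remains a human judgment.

```python
from dataclasses import dataclass
from enum import Enum


class Decider(Enum):
    MANAGEMENT = "management"
    TECHNICAL_TEAM = "technical team"


@dataclass(frozen=True)
class Change:
    # Does the change support the vision or goals of the company,
    # department, or team?
    supports_goals: bool
    # Does it change cost or time, in either direction?
    affects_cost_or_time: bool


def who_decides(change: Change) -> Decider:
    """Route a proposed change according to the three rules in the logic tree."""
    if not change.supports_goals:
        return Decider.MANAGEMENT  # strategic: outside the stated goals
    if change.affects_cost_or_time:
        return Decider.MANAGEMENT  # strategic: changes cost or time
    return Decider.TECHNICAL_TEAM  # tactical: safe for the technical team


if __name__ == "__main__":
    # A change that supports current tactics and leaves cost and time alone.
    print(who_decides(Change(supports_goals=True, affects_cost_or_time=False)).value)
```

Note that a change that saves money still goes to management under these rules, because a saving affects cost as much as an expense does.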
In this article I’ve discussed the difference between redundancy and disaster recovery. Most network systems are a combination of these two types of systems. Changes that affect the business vision, goals, and strategy need to be run by management before they are implemented. This is true no matter whether the change is made by a technician, system administrator, or system architect.
I’m curious whether others have seen the effect of confusing redundancy and disaster recovery. I also wonder whether others have seen negative effects when technicians made decisions that affected the strategy of the organization without realizing the cost of what they had done.