 A random thought about failovers and reboots
I had the opportunity to examine tens of thousands of problems in multiple server farms and was struck by how many were resolved by rebooting. I then looked at what happened to those servers later and found that many experienced the same failure repeatedly until someone finally took the problem seriously and found a real solution.

Based on this experience I began to wonder if reboots were more likely to encourage managers to kick the problem down the road, rather than resolve it.

I now wonder if failovers are going to make what I observed even worse. At least a reboot took time, so a user might be motivated to make sure that whatever the problem was, it didn’t repeat itself. But to me a failover is a reboot without the time factor, so even more problems get kicked down the road.

I know that failover looks different because you are moving from one device to another, but since most reboots seemed to solve the problem on the same device, why wouldn’t moving to a new device also seem to solve it? My own feeling is that the inability to diagnose failures on servers has led to a solution of "when in doubt…reboot". And now the new idea is "when in doubt…failover". Any thoughts or comments? Jim4522



ASKED: April 16, 2010  3:40 PM
UPDATED: April 20, 2010  9:25 PM

Answer Wiki:
<b><i>"...a failover is a reboot without the time factor, so even more problems get kicked down the road."</i></b> Since downtime usually means money, it makes a lot of sense to replace reboots with failover and reduce downtime to zero. The cost would be too high if every incident had to be investigated in real time before bringing the system back into service, so when time is crucial it makes sense to find a way to reduce downtime, and many times that way includes a reboot. However, the incident should still be investigated afterward until the root cause is found and corrected, <b>but this depends on people</b>, and unfortunately, many times it is not done. And I agree, failover <b>could</b> make this worse if people don't do their jobs responsibly. -CarlosDL
Last Wiki Answer Submitted:  April 16, 2010  5:24 pm  by  carlosdl   63,535 pts.


Discuss This Question:


 

CarlosDL, it is interesting that you mention downtime. I traditionally think of downtime as applying to hardware: the time from the failure until the hardware is functioning again. But in your answer you are referring to the downtime of the application, not the hardware. That makes sense, because when someone refers to the high cost of downtime they are not talking about the hardware being down, which is only a problem for IT, but about how long the application is down, since that is where the potential for real expense lies. Jim4522
