I had the opportunity to examine tens of thousands of problems across multiple server farms and was struck by how many were resolved by rebooting. I then looked at what happened to those servers later and found that many experienced the same failure repeatedly, until someone finally took the problem seriously and found a real solution.
Based on this experience, I began to wonder whether reboots make managers more likely to kick the problem down the road than to resolve it.
I now wonder whether failovers are going to make what I observed even worse. At least a reboot took time, so a user might be tempted to make sure that whatever the problem was didn’t repeat itself. But to me a failover is a reboot without the time factor, so even more problems get kicked down the road.
I know that failover looks different because you are moving from one device to another. But since most reboots seemed to solve the problem on the same device, why wouldn’t moving to a new device also seem to solve it? My own feeling is that the inability to diagnose failures on servers has led to a policy of "when in doubt…reboot". And now the new idea is "when in doubt…failover". Any thoughts or comments? Jim4522
ASKED:
April 16, 2010 3:40 PM
UPDATED:
April 20, 2010 9:25 PM
CarlosDL, it is interesting that you mention downtime. I traditionally think of downtime as applying to hardware: the time from the failure until the hardware is functioning again. But in your answer you are referring to the downtime of the application, not the hardware. And that makes sense, because when someone refers to the high cost of downtime they are not referring to how long the hardware is down, which is only a problem for IT, but to how long the application is down, since that is where the potential for real expense lies. Jim4522