I had the opportunity to examine tens of thousands of problems in multiple server farms and was struck by how many were resolved by rebooting. I then examined what happened to those servers later and found that many repeatedly experienced the same failure until someone finally took the problem seriously and found a real solution.
Based on this experience, I began to wonder whether reboots were more likely to encourage managers to kick the problem down the road than to resolve it.
I now wonder if failovers are going to make what I observed even worse. At least a reboot took time, so a user might be tempted to make sure that whatever the problem was, it didn’t repeat itself. But to me a failover is a reboot without the time factor, so even more problems get kicked down the road.
I know that failover looks different because you are moving from one device to another, but since most reboots seemed to solve the problem on the same device, why wouldn’t moving to a new device also seem to solve it? My own feeling is that the inability to diagnose failures on servers has led to a solution of "when in doubt…reboot". And now the new idea is "when in doubt…failover". Any thoughts or comments?
Jim4522
April 16, 2010 3:40 PM
April 20, 2010 9:25 PM