Technical failures: Are they really technical failures?
Posted by: James Murray
Walking into a new client site as a Seattle IT consultant, it would be easy to blame the last guy for all the mistakes. I have a six-month window to repair the failures. I also know that I can sell replacement equipment that will last at least another 18 months before those systems fail. So it’s easy to take potshots for a year or so rather than identify the real problem. I know I’ve said this before, but I believe all technical failures are actually business failures in disguise. The reason we as IT experts don’t realize this is that we are IT experts, not business experts.
In my blog article “The biggest root cause for IT failure…” I gave an example of how putting a server on a 5-year depreciation cycle could have saved a business from a catastrophic server failure. Not just the failure, but also the lost productivity while the workers were fixing the problem. We, as technicians, are all consciously aware that downtime means lost worker productivity. The business owner realizes that lost productivity means lost profitability. Since we don’t think in terms of profits, we often don’t think about the consequences of our behavior on the bottom line. Since technology seems like magic to many of our managers (and sometimes to our technical peers), perhaps we should understand those consequences a little better.
Statistically, 70% of all IT failures are due to human error and 30% are due to hardware failure. Usually we see the hardware failures before we put the systems into production. Later on we see hardware failures again, after the server has passed its designed service life. If this is true, then for the first 5 years in production the majority of failures are caused by human error. Should we be condemning the IT department for these human failures?
If, as I say, all technical failures are actually business failures in disguise, then I would have to say no. Human error is part of the risk that managers are trained to account for in all business systems. Ultimately, an IT system is actually a business system supporting a business process. If we look at other departments, we see that they have systems for reducing or eliminating the effect of human error.
For example, in accounting the firm’s CPA is held accountable by the organization’s controller. Management is held accountable by the board of directors and the stockholders. Each business system includes a check and balance to keep everyone honest. In most business departments, if there is a failure, it’s usually because a business check was not being maintained. In the IT department, though, these types of business checks are not always in place.
As an example, when I started, RAID arrays were monitored by the technician assigned to the server. One of the technician’s jobs was to check the green lights on the servers. If there was a red light, a drive had failed and needed to be replaced. Yet after year after year of seeing green lights, a technician might miss a red light. Not just miss the light, but miss it month after month, until finally the second drive in the array also failed. Then the technician was often fired. While this was a big mistake, was it an appropriate business process?
On a ship during a war, there was always more than one sailor searching the horizon for submarines. Why? Because one tired sailor could make a mistake. We also can’t depend on radar or sonar to always be right. Sometimes bleeding-edge technology can outfox cutting-edge technology, leaving the ship open to surprise attacks. Our fired technician may have functioned just fine up until that moment. I know I’ve missed a red light periodically. Thank goodness I had more than one set of eyes looking with me. As a result, we always caught the failed drive before it became a problem. Perhaps the real problem wasn’t that the technician missed the red light, but that there weren’t enough eyes reviewing the lights.
Today we have automated systems doing much of the review, but is it wise to depend exclusively on those systems? We know that systems report only what they are programmed to report. So, as part of our business process, perhaps an additional audit of the systems might identify additional risk. When we assume that the weakest link will never fail, eventually something will fail. Whether that system is a technical or a manual process, there will be failures. Ultimately it’s not the technical system’s fault any more than it is the technician’s fault. There is a business process that should catch the failure before the system actually falls over. This is why I say… all technical failures are actually business failures in disguise.
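To make the idea of an additional audit a little more concrete, here is a minimal sketch in Python of a second set of eyes over an automated monitor. Everything in it is hypothetical and for illustration only: the `DriveReport` record, the `audit` function, the server and drive names, and the one-day staleness threshold are my assumptions, not the output of any real monitoring product. The point is that the audit flags not only drives reported as failed, but also drives the monitor has gone quiet about, which is exactly the case a system that reports only what it is programmed to report would miss.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DriveReport:
    """One status line from a (hypothetical) automated drive monitor."""
    server: str
    drive: str
    healthy: bool
    checked_at: datetime


def audit(reports, expected, now, max_age=timedelta(days=1)):
    """Return findings that a second reviewer should look at.

    reports  -- iterable of DriveReport records from the monitor
    expected -- set of (server, drive) pairs that should be reporting
    now      -- the time of the audit
    max_age  -- how old the latest report may be before it counts as stale
    """
    # Keep only the most recent report for each drive.
    latest = {}
    for report in reports:
        key = (report.server, report.drive)
        if key not in latest or report.checked_at > latest[key].checked_at:
            latest[key] = report

    findings = []
    for key in sorted(expected):
        name = f"{key[0]}/{key[1]}"
        report = latest.get(key)
        if report is None:
            # The monitor has never mentioned this drive at all.
            findings.append(f"{name}: never reported")
        elif now - report.checked_at > max_age:
            # The monitor itself may have stopped working.
            findings.append(f"{name}: last report is stale")
        elif not report.healthy:
            # The "red light" case.
            findings.append(f"{name}: drive failed")
    return findings


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    expected = {("srv1", "d0"), ("srv1", "d1"), ("srv2", "d0")}
    reports = [
        DriveReport("srv1", "d0", True, now - timedelta(hours=1)),
        DriveReport("srv1", "d1", False, now - timedelta(hours=2)),
    ]
    for finding in audit(reports, expected, now):
        print(finding)
```

In this sketch the list of expected drives is maintained separately from the monitor, so the audit does not depend on the monitor knowing what it is supposed to watch. A business process would still need a person, or better yet more than one, to read the findings on a schedule.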




