In previous postings I’ve discussed the healthy conflict between the Incident Management and Problem Management roles in IT. When I was an IT cowboy, I used to play both roles. Over the years networks got bigger and bigger, and it became impossible for one person to do everything. As the number of technicians on a network grows, so too does inefficiency, and doing things the same old way causes even more problems. As a Seattle IT Consultant, I’ve begun looking to ITIL for an understanding of how a network should be run. Under a managed services model, here are some of the basic roles I’ve found necessary in larger environments.
All of the operations management roles are designed to support the two tactical roles, Incident Management and Problem Management.
Monitoring – Systems don’t just fail overnight. Statistically there are six minor problems on a system before a catastrophic failure; it’s the seventh minor failure that brings the system down. Monitoring systems are designed to catch those errors before anyone notices the seventh. They are programmed to alert system administrators to minor problems on the network, so the administrators can fix each minor problem before the seventh error occurs.
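To make the idea concrete, here is a minimal sketch of a counter that raises an alert on every minor error and escalates as the count climbs. The class name, the `warn_at` threshold and the message format are all made up for illustration, and the limit of seven is just the rule of thumb from this post, not a setting from any real monitoring product.

```python
from collections import Counter

# Rule of thumb from this post, not a measured constant.
FAILURE_COUNT = 7


class MinorErrorMonitor:
    """Hypothetical tracker for unresolved minor errors per system."""

    def __init__(self, warn_at=3):
        self.warn_at = warn_at      # assumed escalation threshold
        self.counts = Counter()     # system name -> unresolved minor errors

    def record(self, system):
        """Register one minor error and return the alert text to send."""
        self.counts[system] += 1
        n = self.counts[system]
        level = "URGENT" if n >= self.warn_at else "notice"
        remaining = max(FAILURE_COUNT - n, 0)
        return (f"{level}: {system} has {n} unresolved minor error(s), "
                f"{remaining} before the rule-of-thumb limit")

    def resolve(self, system):
        """Mark one minor error on the system as fixed."""
        if self.counts[system] > 0:
            self.counts[system] -= 1


monitor = MinorErrorMonitor()
for _ in range(3):
    print(monitor.record("mail-server"))
```

The point of the sketch is that every minor error produces a message, and fixing one lowers the count again, which is what keeps a system away from the failure that takes it down.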
Before monitoring, all systems were managed in a break/fix model, where systems were ignored until they failed. Managed services is a tactical strategy for addressing issues early. Break/fix strategies run with 85% to 95% availability, which means the systems are down 5% to 15% of the time. A well-run managed services model, on the other hand, will keep systems up 99.9% of the time or better. For the owner this can be the difference between a month or more of network downtime per year (break/fix) and less than nine hours of downtime per year (managed services). Figuring that productivity losses average $7314 / hour, this adds up quickly for the business.
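The arithmetic behind those availability figures is easy to check. This is a small sketch, assuming a 365-day year of round-the-clock operation and the $7314 per hour figure quoted above; the function names are my own.

```python
HOURS_PER_YEAR = 365 * 24   # 8760 hours, ignoring leap years
COST_PER_HOUR = 7314        # productivity loss per hour quoted above, in dollars


def annual_downtime_hours(availability_pct):
    """Hours per year a system is down at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def annual_downtime_cost(availability_pct, cost_per_hour=COST_PER_HOUR):
    """Estimated yearly productivity loss for a given availability."""
    return annual_downtime_hours(availability_pct) * cost_per_hour


for pct in (85, 95, 99.9):
    hours = annual_downtime_hours(pct)
    cost = annual_downtime_cost(pct)
    print(f"{pct}% available: {hours:,.1f} hours down, ${cost:,.0f} lost per year")
```

At 85% and 95% availability this works out to roughly 1,314 and 438 hours of downtime a year (about 55 and 18 days), while 99.9% comes to just under 9 hours.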
Change Management – When a stable network fails, the first question asked is, “What changed?” Change management is the process for implementing and documenting changes to the network as they occur. As networks grow, it becomes hard to remember every change that has been made. One administrator can make a change that counteracts another administrator’s change, and suddenly the network fails.
Change management systems are designed to answer the question “What has changed on the network?” Incident Management and Problem Management use the change logs to identify the causes of failures on the network. These change logs can reduce troubleshooting time from hours or days to minutes.
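As a rough illustration of how a change log answers “What changed?”, the sketch below filters a log for changes made shortly before a failure. The `ChangeRecord` fields, the 24-hour window and the sample entries are all hypothetical; a real change management tool will have its own schema and query interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    """Hypothetical change log entry; real tools use their own fields."""
    when: datetime
    system: str
    author: str
    summary: str


def recent_changes(log, failure_time, window=timedelta(hours=24), system=None):
    """Return changes made within `window` before the failure, newest first."""
    start = failure_time - window
    hits = [
        c for c in log
        if start <= c.when <= failure_time
        and (system is None or c.system == system)
    ]
    return sorted(hits, key=lambda c: c.when, reverse=True)


# Made-up sample data for illustration only.
log = [
    ChangeRecord(datetime(2024, 3, 1, 9, 0), "router", "alice", "Updated routing table"),
    ChangeRecord(datetime(2024, 3, 4, 22, 30), "mail", "bob", "Applied security patch"),
    ChangeRecord(datetime(2024, 3, 5, 1, 15), "router", "carol", "Changed firewall rule"),
]

failure = datetime(2024, 3, 5, 2, 0)
for change in recent_changes(log, failure):
    print(change.when, change.system, change.author, change.summary)
```

The newest change is listed first because the most recent change is usually the first suspect.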
Deployment – Network changes and upgrades, when executed, are called deployments. A deployment can be as simple as a patch or a version upgrade, or as complex as a complete replacement system. Coordinating with the Change team, Incident Management and the business groups affected by the change, deployment technicians make known and documented changes to the network.
Infrastructure Management – Unlike deployment and incident management, infrastructure management maintains the day-to-day integrity of the core systems. This includes routing, email, databases and other systems on the network. Change requests from deployment are passed through the infrastructure management team. This team functions as the guardian of the core infrastructure.
Customer Management – From the standpoint of IT, all users of the system are customers. Customer management is the liaison between the business departments and the technology groups. The customer management team shares information about changes that might affect a department. (Imagine payroll being scheduled for the same day as a major IT deployment.) The team also works with the business departments to identify changes to the technology that will benefit the department. The customer manager identifies the business requirements and converts them into a technical requirements document for the system architect.
Quality Assurance (QA) – Imagine an IT trouble ticket completed during an emergency. Now problem management is trying to identify the cause, and the resolution description reads “System broken, System fixed.” In order to perform root cause analysis, the problem manager has to interview everyone involved to figure out what actually happened. QA defines the technical standards for systems, including ticket documentation, and then enforces those standards.
For example, QA is responsible for establishing, training on and enforcing ticket documentation standards, so that problem management can close tickets without having to reconstruct what has already been done.
Understanding these roles at a high level allows the owner and the IT management team to communicate better with their IT teams. As networks move into the cloud, the work stops being about the technology and becomes more about the process of supporting the network. As IT experts, it’s easy for us to get lost in the 1s and 0s of the new technologies that keep sprouting up. Yet I find that getting lost in the technology can bring the business to a halt. We can’t forget that the reason for the technology is the business, and building the value of the business.