I’ve noticed some questions about how to run a help desk, so I wrote up a quick article about IT operations roles.
So what does a standard IT department look like? We are all familiar with calling the help desk when there’s a failure. Work in an IT department, though, and you realize that there is much more to managing it than the help desk technician on the phone. By understanding standard IT management and support practices, the modern network architect can build more effective network infrastructure.
In general, operations roles are made up of those who manage the daily systems of the network and those who fix problems found on the network. The management of the network infrastructure is the role of Operations, which is headed up by the operations manager. The Facilities team falls under Operations and manages the day-to-day hardware and software of the company’s network infrastructure. Incident Management, also under Operations, manages the resolution of failures across the network that are known issues. Problem Management resolves unknown issues across the infrastructure. Event Management is the early warning system for the network, designed to catch problems before an actual failure occurs on the system.
Over 18 years ago there was no automated event management system. Review entailed watching the autoexec.bat file for errors when the system booted, then manually checking disk space, memory allocation and log files on the system. The first two hours of a shift were spent reviewing log files and researching (without the internet) the possible meanings of each error. Networked systems caught on quickly, though, and networks soon became too large to manage manually. Event management systems automated this process. Managed services are built around the event management system, which makes possible the types of SLAs we see in cloud systems. Working with the Facilities team, the event management team allows small teams to manage hundreds or even thousands of servers.
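To make the idea concrete, here is a minimal sketch in Python of the kind of check an event management system automates. It is only an illustration: the `Event` class, the `evaluate` function, the metric names and the threshold values are all assumptions of mine, not taken from any particular product.

```python
from dataclasses import dataclass
import shutil


@dataclass
class Event:
    host: str
    metric: str
    value: float
    severity: str  # "warning" or "critical"


# Hypothetical thresholds (percent in use): (warning, critical).
THRESHOLDS = {
    "disk_used_pct": (80.0, 95.0),
    "memory_used_pct": (85.0, 95.0),
}


def disk_used_pct(path="/"):
    """Return the percentage of disk space in use at `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate(host, metrics, thresholds=THRESHOLDS):
    """Compare each metric to its thresholds and return any events raised."""
    events = []
    for name, value in metrics.items():
        if name not in thresholds:
            continue  # no rule defined for this metric
        warning, critical = thresholds[name]
        if value >= critical:
            events.append(Event(host, name, value, "critical"))
        elif value >= warning:
            events.append(Event(host, name, value, "warning"))
    return events
```

Calling `evaluate("web01", {"disk_used_pct": disk_used_pct()})` on a schedule is the automated version of the manual disk space check described above. The warning level is what lets the team act before an actual failure.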
There are two teams that manage failures on the network.
- Incident Management is responsible for bringing systems back online as quickly as possible. The idea is that every minute a system is down is lost profit for the business.
- In contrast, Problem Management is responsible for making sure the same incidents don’t keep recurring.
At first you might ask, how can an incident be resolved if it’s not fixed?
Well here’s a typical example:
A failing server is rebooted.
At first it’s running fine, but after a day or two it begins to slow and fail.
Later it’s discovered that memory is not being released after an operation completes (called a memory leak). Rebooting clears the memory and the system runs fine. Yet inevitably the RAM fills again because the software is still not releasing memory. The system slows down until it once again begins failing.
In this example, Incident Management’s job is to reboot the server to get the system up and running. Problem Management’s job is to review the memory dump to verify the memory leak. The conflict is that capturing the memory dump takes time and has to happen before the system is rebooted. Incident Management presses for the reboot, while Problem Management presses for time to download the memory dump.
Rebooting a server may make the system operational again, but it does not address the root cause of the problem. Eventually the problem will happen again. Incident Management is not concerned with the root cause of the problem, only with getting the system back online. Problem Management is focused on stopping the incident from happening again. These competing priorities put the two teams in constant conflict with one another. This is expected in a healthy operations team, and the operations manager is the ultimate referee.
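The memory leak in this example has a recognizable signature: memory use climbs steadily between reboots instead of leveling off. As a rough sketch, and not a real diagnostic tool, a check for that pattern might look like the following in Python. The function name `looks_like_leak` and both default cut-off values are assumptions chosen for illustration.

```python
def looks_like_leak(samples, min_growth=20.0, min_rising_fraction=0.8):
    """Return True if memory use (percent) climbs steadily across samples.

    `samples` are readings taken at regular intervals since the last reboot.
    Both cut-off values are illustrative assumptions.
    """
    if len(samples) < 3:
        return False  # too few points to call it a trend
    steps = [later - earlier for earlier, later in zip(samples, samples[1:])]
    rising = sum(1 for step in steps if step > 0) / len(steps)
    growth = samples[-1] - samples[0]
    return growth >= min_growth and rising >= min_rising_fraction
```

A trend like this gives Problem Management evidence of the leak without delaying the reboot, though it does not replace the memory dump for finding the faulty code.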
The incident group is broken into support levels, sometimes called tiers. The first tier is a triage level. The tier 1 support technician will try to identify the problem in order to
- a) Fix a known error or
- b) Pass the incident to the appropriate tier 2 support team.
The tier 2 teams investigate known issues to find a solution to the problem. The tier 3 support technicians have the deepest level of training in the specific technology. The tier 3 team is responsible for final resolution of all incidents.
Tier 3 support may put together a major incident team. The tier 3 team’s responsibility includes contacting anyone and everyone needed to resolve the incident, including outside vendors and the technical support teams of software and hardware manufacturers. Major incident teams are put together to coordinate, document and manage this final stage of the incident process.
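The tier structure described above can be sketched as a small routing routine. This is a simplified Python illustration, not how any specific ticketing system works: the `Incident` class, the `triage` and `escalate` functions, and the sample fixes and team names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    summary: str
    category: str  # e.g. "network", "database"
    tier: int = 1
    history: list = field(default_factory=list)


# Hypothetical sample data for illustration only.
KNOWN_FIXES = {"password expired": "Reset the password"}
TIER2_TEAMS = {"network": "Network support", "database": "Database support"}


def triage(incident, known_fixes=KNOWN_FIXES, teams=TIER2_TEAMS):
    """Tier 1: fix a known error, or pass the incident to a tier 2 team."""
    fix = known_fixes.get(incident.summary)
    if fix is not None:
        incident.history.append(f"tier 1 resolved: {fix}")
        return "resolved"
    incident.tier = 2
    team = teams.get(incident.category, "General support")
    incident.history.append(f"tier 1 passed to {team}")
    return team


def escalate(incident):
    """Move an unresolved incident up one tier, stopping at tier 3."""
    if incident.tier < 3:
        incident.tier += 1
        incident.history.append(f"escalated to tier {incident.tier}")
    return incident.tier
```

Tier 3 is the ceiling here because, as described, that team owns final resolution. Keeping a `history` on each incident mirrors the documentation duty of a major incident team.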
Resolving the incident means bringing the system back online and functional. Once the incident is resolved, the incident team’s job is complete. For Problem Management the job has just begun. Resolved major incidents are discussed by the operations management team to determine whether the major incident is a known issue. If not, the incident becomes a problem and is taken over by the Problem Management team.
Problem Management’s job is to find the root cause for each problem ticket. Problem Management teams spend time looking at the hardware, software, drivers and other possible causes. The teams will bring in people from the manufacturers who developed the components that failed. Once the root cause is determined, the cause, symptoms, fix and/or workaround are documented. The solution to the problem is placed into the incident team’s database. The incident team can now resolve the known issue without escalating it to the top tiers.
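As a rough picture of that hand-off, here is a minimal Python sketch of a known-error store that Problem Management writes to and the incident team searches. The `KnownError` and `KnownErrorDatabase` names, the fields, and the simple substring search are assumptions for illustration; a real support database would be far richer.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    cause: str
    symptoms: list
    fix: str = ""
    workaround: str = ""


class KnownErrorDatabase:
    """In-memory stand-in for the incident team's database."""

    def __init__(self):
        self._records = []

    def record(self, known_error):
        """Problem Management adds a documented root cause."""
        self._records.append(known_error)

    def search(self, observed_symptom):
        """The incident team looks up records by a symptom they observe."""
        needle = observed_symptom.lower()
        return [
            record
            for record in self._records
            if any(needle in symptom.lower() for symptom in record.symptoms)
        ]
```

For the memory leak example, Problem Management would record the cause ("application does not release memory"), the symptom ("server slows down after a day or two") and the workaround ("reboot the server"), so a tier 1 technician searching on "slows down" finds the answer without escalating.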
In this way an IT department maintains the network. Day-to-day management is handled by the Facilities team. Incidents are failures that are managed by the incident team through three levels of support. Problems are failures without a known cause. Problem Management determines the cause and the solution and records them in the incident support database. Finally, the entire team is managed by the operations manager.