Here are 10 best practices for system monitoring that Javier Soltero, CTO of Hyperic Inc., has seen succeed in his years in IT.
- Define what it means for a given resource — a server, an application or a service — to be labeled “production”.
- Figure out what monitoring you need to satisfy the production requirement.
- Implement the monitoring capability, either manually, with open source tools such as Nagios, or with commercial tools.
- Define what it means for something to be “broken/unavailable/on fire” — also referred to as WARN/ERROR/CRITICAL.
- Implement alerts in your monitoring system to capture these thresholds.
- Define the notification process to follow for each alarm level.
- Make sure your alerting process follows that notification process.
- Create roles and responsibilities so that groups of people share the alerts, control and detailed access relevant to their job function. Focusing individuals on their own area generally means better performance in that area.
- Designate a small number of super-users who architect your entire system of alerts, monitoring protocols, roles, etc., so that everything follows a single blueprint.
- Lather, rinse, and repeat if necessary.
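
To make the threshold and notification steps a little more concrete, here is a minimal Python sketch of how alarm levels and a per-level notification process might fit together. Everything in it is an assumption for illustration: the metric names, the threshold numbers, the channel names (`log`, `email`, `page`) and the team names are hypothetical, and nothing here comes from Hyperic, Nagios or any other real tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Alarm levels, ordered by severity."""
    OK = 0
    WARN = 1
    ERROR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Thresholds:
    """Values at or above which a metric enters each alarm level."""
    warn: float
    error: float
    critical: float

    def classify(self, value: float) -> Level:
        # Check the most severe level first so the highest match wins.
        if value >= self.critical:
            return Level.CRITICAL
        if value >= self.error:
            return Level.ERROR
        if value >= self.warn:
            return Level.WARN
        return Level.OK


# Hypothetical thresholds: what "broken" means for each monitored metric.
THRESHOLDS = {
    "cpu_percent": Thresholds(warn=70, error=85, critical=95),
    "disk_percent": Thresholds(warn=80, error=90, critical=97),
}

# Hypothetical notification process: which channels fire at each level.
ESCALATION = {
    Level.WARN: ["log"],
    Level.ERROR: ["log", "email"],
    Level.CRITICAL: ["log", "email", "page"],
}

# Hypothetical roles: which group owns alerts for each metric.
ROLE_FOR_METRIC = {
    "cpu_percent": "app-team",
    "disk_percent": "storage-team",
}


def evaluate(metric, value):
    """Return (channel, role, message) tuples for one metric reading."""
    level = THRESHOLDS[metric].classify(value)
    if level is Level.OK:
        return []  # Nothing to notify when the reading is healthy.
    role = ROLE_FOR_METRIC[metric]
    message = f"{level.name}: {metric}={value}"
    return [(channel, role, message) for channel in ESCALATION[level]]


if __name__ == "__main__":
    for notification in evaluate("disk_percent", 98):
        print(notification)
```

Keeping the thresholds, the escalation table and the role mapping in three separate tables is one way to let a small group of super-users own the blueprint while each team only sees the alerts routed to it.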
I pulled these tips from a LinuxWorld 2007 preview interview with Javier Soltero. In another excerpt from that interview, “Virtualization boosts Linux adoption big-time,” he talks about the synergy between Linux and virtualization and the challenges of managing environments with multiple operating systems and of identifying and tracking virtual machines. Javier also offered some great comments on other subjects, which can be found in articles from our LinuxWorld and Next Generation Data Center Conference 2007 coverage.