We've all been there. We look at the monitoring dashboard and see the collection of Christmas lights staring back at us. The hope is that they are all green and happy. But there is the occasional problem that generates a red light. We stop and look at it for a moment.
- How long has this been red?
- If someone already looked at it, why are there no notes?
- Did this get fixed? Is it even able to get fixed?
- Do we need to have a meeting to figure this out?
All these questions come up as we try to figure out how to solve the problem. But what if the light itself is the problem?
Red Means Stop
How many monitoring systems get set up with the traditional red/yellow/green thresholds that never get modified? We spend a large amount of time looking at dashboards trying to glean information from the status indicators, but do we even know what those indicators represent? I would wager that a large portion of the average IT department doesn't know what it means for the indicator light to transition from green to yellow.
Threshold settings are critical to the action plan. Is a CPU running at 90% enough to trigger a red warning? CPU spikes happen frequently. Does utilization need to stay at that level for a specific period of time before an alert is created? When does it fall back to yellow? And is that same 90% threshold appropriate for primary storage? What about tracking allocated storage? A thick-provisioned LUN takes up the same amount of space on a SAN whether data is stored there or not.
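The CPU questions above describe duration-plus-hysteresis alerting: require a sustained breach before going red, and step back down through yellow rather than flapping straight to green. A minimal sketch of that idea (the function name, thresholds, and sample counts here are invented for illustration, not taken from any particular monitoring product):

```python
def alert_state(samples, red=90, yellow=75, sustain=3):
    """Classify a series of CPU-utilization samples (percent).

    The light only goes red after `sustain` consecutive samples at or
    above `red`. Once red, it decays to yellow (not straight to green)
    when utilization drops, so brief spikes don't flap the indicator.
    """
    state = "green"
    hot = 0  # consecutive samples at or above the red threshold
    for s in samples:
        if s >= red:
            hot += 1
            if hot >= sustain:
                state = "red"
            elif state == "green":
                state = "yellow"
        else:
            hot = 0
            if s >= yellow:
                state = "yellow"
            elif state == "red":
                state = "yellow"  # red steps down to yellow first
            else:
                state = "green"
    return state
```

A single 90%+ spike surrounded by normal readings ends up yellow at worst, while three sustained readings turn the light red.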
Defining the things that we want to monitor takes up a huge amount of time in an organization. Adding too many pieces to the system causes overhead and light fatigue. Trying to keep an eye on the status of 6,000 switch ports is overwhelming. Selecting the smartest collection of data points to monitor is critical to making the best of your monitoring system.
But at the same time, it is even more critical to know the range for the data that you're monitoring. If your alerts are triggering on bad thresholds, or are staying red or yellow for long periods of time, then the problem isn't with the device. The problem is that the data you're monitoring is bad or incorrect. When lights stay in a trouble state for longer than necessary, you create an aversion to fixing problems at all.
I tested this once by doing just that. I went into a monitoring system and manually triggered a red light on a connectivity map to see how long it would take for that light to get changed back, or for someone to notice that a fault had occurred. The results of my test were less than stellar. I finally had to point it out some weeks later when I was talking to the team about the importance of keeping a close eye on faults.
Green Means Go!
How can we fix light fatigue? The simple solution is to go through your monitoring system with a fine-toothed comb and make sure you know what thresholds are triggering status changes. Make notes about those levels as well. Don't just take the defaults for granted. Is this level a "best practice"? Whose practice? If it doesn't fit your organization, then change it.
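One way to keep those notes from getting lost is to store each threshold alongside its rationale, so a bare default can't survive review unexamined. A sketch of the idea, assuming a registry you maintain yourself (all metric names, values, and notes here are hypothetical):

```python
# Hypothetical threshold registry: every level carries a note recording
# why it was chosen, so "we kept the vendor default" is a visible decision.
THRESHOLDS = {
    "cpu_percent": {
        "yellow": 75,
        "red": 90,
        "note": "Vendor default was 80/95; lowered after a capacity review.",
    },
    "san_allocated_percent": {
        "yellow": 70,
        "red": 85,
        "note": "Tracks allocation, not usage; thick-provisioned LUNs count in full.",
    },
}

def check(metric, value):
    """Return the light color for a metric reading."""
    t = THRESHOLDS[metric]
    if value >= t["red"]:
        return "red"
    if value >= t["yellow"]:
        return "yellow"
    return "green"
```

The payoff comes at the reevaluation meetings: the note field tells you whether a level was a deliberate choice or an inherited default.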
Once you've figured out your thresholds, make sure you implement the changes and tell your teams that things have changed. Tell them that the alerts are real and meaningful now. Also make sure you keep tabs on things to ensure that triggered alerts are dealt with in a timely manner. Nothing should be left for long without a resolution. That prevents the kind of fatigue that makes problems slip under the radar.
Lastly, don't let your new system stay in place for long without reevaluation. Make sure to call regular meetings to reassess things when new hardware is installed or major changes happen. That way you can stay on top of the lights before the lights go out.