Red Light Fatigue

We've all been there. We look at the monitoring dashboard and see the collection of Christmas lights staring back at us. The hope is that they are all green and happy. But there is the occasional problem that generates a red light. We stop and look at it for a moment.

  • How long has this been red?
  • If someone already looked at it, why are there no notes?
  • Did this get fixed? Can it even be fixed?
  • Do we need to have a meeting to figure this out?

All of these questions come up as we try to figure out how to solve the problem. But what if the light itself is the problem?

Red Means Stop

How many monitoring systems get set up with the traditional red/yellow/green monitoring thresholds that never get modified? We spend a large amount of time looking at dashboards trying to glean information from the status indicators, but do we even know what those indicators represent? I would wager that a large portion of the average IT department doesn't know what it means when an indicator light transitions from green to yellow.

Threshold settings are critical to the action plan. Does a CPU running at 90% trigger a red warning? CPU spikes happen frequently. Does it need to run at that level for a specific period of time before an alert is created? When does it fall back to yellow? And does that same 90% threshold apply to primary storage? What about tracking allocated storage? A thick-provisioned LUN takes up the same amount of space on a SAN whether it has data stored in it or not.
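
As a sketch of what that kind of tuning might look like, here is a tiny Python evaluator that only goes red after the metric stays above the trigger level for a sustained window, and only drops back to green once it falls below a separate, lower clear level. The 90%, 75%, and five-minute values are hypothetical placeholders, not recommendations.

    # Hypothetical example: a CPU light that only goes red after the metric
    # stays above the trigger threshold for a sustained window, and only
    # clears once it drops below a separate, lower threshold (hysteresis).
    import time

    RED_THRESHOLD = 90.0     # percent; hypothetical trigger level
    CLEAR_THRESHOLD = 75.0   # percent; must fall below this to go green again
    SUSTAIN_SECONDS = 300    # load must persist this long before going red


    class CpuAlert:
        def __init__(self):
            self.state = "green"
            self.breach_started = None  # when the metric first crossed the trigger

        def update(self, cpu_percent, now=None):
            now = time.time() if now is None else now
            if cpu_percent >= RED_THRESHOLD:
                if self.breach_started is None:
                    self.breach_started = now
                # Brief spikes stay yellow; only sustained load goes red.
                if now - self.breach_started >= SUSTAIN_SECONDS:
                    self.state = "red"
                elif self.state != "red":
                    self.state = "yellow"
            else:
                self.breach_started = None
                if cpu_percent < CLEAR_THRESHOLD:
                    self.state = "green"
                # Between the two thresholds, keep the current state so the
                # light doesn't flap on every sample.
            return self.state

Dropping back to green at a lower level than the one that triggered red is the piece most default setups skip; without it the light flaps every time the metric hovers near the trigger value.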

Defining the things that we want to monitor takes up a huge amount of time in an organization. Adding too many pieces to the system causes overhead and light fatigue. Trying to keep an eye on the status of 6,000 switch ports is overwhelming. Selecting the smartest collection of data points to monitor is critical to getting the most out of your monitoring system.

But at the same time, it is even more critical to know the expected range for the data that you're monitoring. If your alerts are triggering on bad thresholds, or are staying red or yellow for long periods of time, then the problem isn't with the device. The problem is that the data you're monitoring is bad or incorrect. When lights stay in a trouble state for longer than necessary, you create an aversion to fixing problems at all.

I tested this once by doing exactly that. I went into a monitoring system and manually triggered a red light on a connectivity map to see how long it would take for that light to get changed back, or for someone to notice that a fault had occurred. The results were less than stellar. I finally had to point it out some weeks later when I was talking to the team about the importance of keeping a close eye on faults.

Green Means Go!

How can we fix light fatigue? The simple solution is to go through your monitoring system with a fine-toothed comb and make sure you know what thresholds are triggering status changes. Make notes about those levels as well. Don't just take the defaults for granted. Is this level a "best practice"? Whose practice? If it doesn't fit your organization, then change it.
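
One lightweight way to capture that review is to keep the thresholds somewhere version-controlled, with a note about why each value was chosen and who owns it. Everything below is invented for illustration; the point is the note and owner fields, not the numbers.

    # Hypothetical threshold register: the values, the reasoning behind them,
    # and an owner to ask when they stop making sense. Numbers are illustrative.
    THRESHOLDS = {
        "cpu_percent": {
            "yellow": 80,
            "red": 90,
            "sustain_seconds": 300,
            "note": "Vendor default had no sustain window; short batch-job "
                    "spikes kept tripping red, so we require sustained load.",
            "owner": "server-team",
        },
        "san_allocated_percent": {
            "yellow": 70,
            "red": 85,
            "sustain_seconds": 0,
            "note": "Tracks allocated (thick-provisioned) capacity rather than "
                    "written data, so it climbs in steps and never self-clears.",
            "owner": "storage-team",
        },
    }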

Once you've figured out your thresholds, make sure you implement the changes and tell your teams that things have changed. Tell them that the alerts are real and meaningful now. Also make sure you keep tabs on things to ensure that triggered alerts are dealt with in a timely manner. Nothing should be left for long without a resolution. That prevents the kind of fatigue that makes problems slip under the radar.

Lastly, don't let your new system stay in place for long without reevaluation. Make sure to call regular meetings to reassess things when new hardware is installed or major changes happen. That way you can stay on top of the lights before the lights go out.

  • Drives me nuts at this company. They seem perfectly happy with having red lights on our dashboards. Not only that, but in some instances red lights are kept off the main panel entirely because they're filtered out.

  • I was in a vendor presentation last week and the presenter pulled up a dashboard for the appliance. It was quite vivid, with all the contrasting colors representing different services, performance, and alerts. And then a co-worker at the end of the table chimed in: "It all looks gray to me. I'm color blind." That is one way of dealing with all these alerts. :-)

  • Establish a rule regarding alerts: if it is not actionable by a human (critical), then don't show it.

    It is hard to do in a shop that started as a mom-and-pop operation and has had rapid growth. Some people are info junkies and need to see all the data all the time.

    To them I say: go dig through Splunk or whatever log aggregation tool you have. Too many alerts that don't have the NOC calling someone are a waste of everyone's time and will likely increase the odds of missing something because it fell through the cracks. In some places, people want the happy messages along with the bad ones... again, more traffic. If it is a pertinent happy message, then wrap some heartbeat logic around it (a rough sketch follows after these comments), but don't show it until it isn't there when it should be.

  • Reducing some of the red lights is a big part of my plan right now... too much information is a problem.

  • Glad to see we share this technique. I only build NOC dashboards to show problems; if things are fine, I simply do not need to know about it.

    I also move the node Notes resource to the top of the screen so it gets more attention. The default has it at the bottom of the last column, and half my customers don't even know that SolarWinds has a notes feature.
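
On the heartbeat suggestion in the comments above: a rough sketch, with invented service names and intervals, would be to record when each expected "all good" message last arrived and surface only the ones that are overdue, instead of displaying every happy message.

    # Hypothetical sketch: suppress routine "all good" messages and alert only
    # when an expected heartbeat goes missing. Names and intervals are made up.
    import time

    EXPECTED = {
        "backup-job": 86400,   # seconds allowed between heartbeats
        "replication": 300,
    }

    last_seen = {name: time.time() for name in EXPECTED}


    def record_heartbeat(name):
        """Called whenever a happy message arrives; nothing is displayed."""
        last_seen[name] = time.time()


    def overdue(now=None):
        """Return only the services whose heartbeat is missing: the actionable set."""
        now = time.time() if now is None else now
        return [
            name
            for name, interval in EXPECTED.items()
            if now - last_seen[name] > interval
        ]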
