Red Light Fatigue

We've all been there. We look at the monitoring dashboard and see the collection of Christmas lights staring back at us. The hope is that they are all green and happy. But there is the occasional problem that generates a red light. We stop and look at it for a moment.

  • How long has this been red?
  • If someone already looked at it, why are there no notes?
  • Did this get fixed? Can it even be fixed?
  • Do we need to have a meeting to figure this out?

All of these questions come up as we try to figure out how to solve the problem. But what if the light itself is the problem?

Red Means Stop

How many monitoring systems get set up with the traditional red/yellow/green monitoring thresholds that never get modified? We spend a large amount of time looking at dashboards trying to glean information from the status indicators, but do we even know what those indicators represent? I would wager that a large portion of the average IT department doesn't know what it means when an indicator light transitions from green to yellow.

Threshold settings are critical to the action plan. Does a CPU running at 90% trigger a red warning? CPU spikes happen frequently. Does it need to run at that level for a specific period of time before an alert is created? When does it fall back to yellow? And does that same 90% threshold apply to primary storage? What about tracking allocated storage? A thick-provisioned LUN takes up the same amount of space on a SAN whether it has data stored in it or not.
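
As a sketch of what that kind of tuning might look like, here is a tiny Python evaluator that only goes red after the metric stays above the trigger level for a sustained window, and only drops back to green once it falls below a separate, lower clear level. The 90%, 75%, and five-minute values are hypothetical placeholders, not recommendations.

    # Hypothetical example: a CPU light that only goes red after the metric
    # stays above the trigger threshold for a sustained window, and only
    # clears once it drops below a separate, lower threshold (hysteresis).
    import time

    RED_THRESHOLD = 90.0     # percent; hypothetical trigger level
    CLEAR_THRESHOLD = 75.0   # percent; must fall below this to go green again
    SUSTAIN_SECONDS = 300    # load must persist this long before going red


    class CpuAlert:
        def __init__(self):
            self.state = "green"
            self.breach_started = None  # when the metric first crossed the trigger

        def update(self, cpu_percent, now=None):
            now = time.time() if now is None else now
            if cpu_percent >= RED_THRESHOLD:
                if self.breach_started is None:
                    self.breach_started = now
                # Brief spikes stay yellow; only sustained load goes red.
                if now - self.breach_started >= SUSTAIN_SECONDS:
                    self.state = "red"
                elif self.state != "red":
                    self.state = "yellow"
            else:
                self.breach_started = None
                if cpu_percent < CLEAR_THRESHOLD:
                    self.state = "green"
                # Between the two thresholds, keep the current state so the
                # light doesn't flap on every sample.
            return self.state

Dropping back to green at a lower level than the one that triggered red is the piece most default setups skip; without it the light flaps every time the metric hovers near the trigger value.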

Defining the things that we want to monitor takes up a huge amount of time in an organization. Adding too many pieces to the system causes overhead and light fatigue. Trying to keep an eye on the status of 6,000 switch ports is overwhelming. Selecting the smartest collection of data points to monitor is critical to getting the most out of your monitoring system.

But at the same time, it is even more critical to know the expected range for the data that you're monitoring. If your alerts are triggering on bad thresholds, or are staying red or yellow for long periods of time, then the problem isn't with the device. The problem is that the data you're monitoring is bad or incorrect. When lights stay in a trouble state for longer than necessary, you create an aversion to fixing problems at all.

I tested this once by doing exactly that. I went into a monitoring system and manually triggered a red light on a connectivity map to see how long it would take for that light to get changed back, or for someone to notice that a fault had occurred. The results were less than stellar. I finally had to point it out some weeks later when I was talking to the team about the importance of keeping a close eye on faults.

Green Means Go!

How can we fix light fatigue? The simple solution is to go through your monitoring system with a fine-toothed comb and make sure you know what thresholds are triggering status changes. Make notes about those levels as well. Don't just take the defaults for granted. Is this level a "best practice"? Whose practice? If it doesn't fit your organization, then change it.
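
One lightweight way to capture that review is to keep the thresholds somewhere version-controlled, with a note about why each value was chosen and who owns it. Everything below is invented for illustration; the point is the note and owner fields, not the numbers.

    # Hypothetical threshold register: the values, the reasoning behind them,
    # and an owner to ask when they stop making sense. Numbers are illustrative.
    THRESHOLDS = {
        "cpu_percent": {
            "yellow": 80,
            "red": 90,
            "sustain_seconds": 300,
            "note": "Vendor default had no sustain window; short batch-job "
                    "spikes kept tripping red, so we require sustained load.",
            "owner": "server-team",
        },
        "san_allocated_percent": {
            "yellow": 70,
            "red": 85,
            "sustain_seconds": 0,
            "note": "Tracks allocated (thick-provisioned) capacity rather than "
                    "written data, so it climbs in steps and never self-clears.",
            "owner": "storage-team",
        },
    }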

Once you've figured out your thresholds, make sure you implement the changes and tell your teams that things have changed. Tell them that the alerts are real and meaningful now. Also make sure you keep tabs on things to ensure that triggered alerts are dealt with in a timely manner. Nothing should be left for long without a resolution. That prevents the kind of fatigue that makes problems slip under the radar.

Lastly, don't let your new system stay in place for long without reevaluation. Make sure to call regular meetings to reassess things when new hardware is installed or major changes happen. That way you can stay on top of the lights before the lights go out.

  • Drives me nuts at this company. They seem perfectly happy with having red lights on our dashboards. Not only that, but in some instances red lights are kept off the main panel entirely because they're filtered out.

  • I was in a vendor presentation last week and the presenter pulled up a dashboard for the appliance. It was quite vivid, with all the contrasting colors representing different services, performance, and alerts. And then a co-worker at the end of the table chimed in: "It all looks gray to me. I'm color blind." That is one way of dealing with all these alerts. :-)

  • Establish a rule regarding alerts: if it is not actionable by a human (critical), then don't show it.

    It is hard to do in a shop that started as a mom-and-pop operation and has had rapid growth. Some people are info junkies and need to see all the data all the time.

    To them I say: go dig through Splunk or whatever log aggregation tool you have. Too many alerts that don't have the NOC calling someone are a waste of everyone's time and will likely increase the odds of missing something because it fell through the cracks. In some places, people want the happy messages along with the bad ones... again, more traffic. If it is a pertinent happy message, then wrap some heartbeat logic around it (a rough sketch follows after these comments), but don't show it until it isn't there when it should be.

  • Reducing some of the red lights is a big part of my plan right now... too much information is a problem.

  • Glad to see we share this technique. I only build NOC dashboards to show problems; if things are fine, I simply do not need to know about it.

    I also move the node Notes resource to the top of the screen so it gets more attention. The default has it at the bottom of the last column, and half my customers don't even know that SolarWinds has a notes feature.
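
On the heartbeat suggestion in the comments above: a rough sketch, with invented service names and intervals, would be to record when each expected "all good" message last arrived and surface only the ones that are overdue, instead of displaying every happy message.

    # Hypothetical sketch: suppress routine "all good" messages and alert only
    # when an expected heartbeat goes missing. Names and intervals are made up.
    import time

    EXPECTED = {
        "backup-job": 86400,   # seconds allowed between heartbeats
        "replication": 300,
    }

    last_seen = {name: time.time() for name in EXPECTED}


    def record_heartbeat(name):
        """Called whenever a happy message arrives; nothing is displayed."""
        last_seen[name] = time.time()


    def overdue(now=None):
        """Return only the services whose heartbeat is missing: the actionable set."""
        now = time.time() if now is None else now
        return [
            name
            for name, interval in EXPECTED.items()
            if now - last_seen[name] > interval
        ]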
