Question 2: Fixing False Positives

“False positives. It would be good to train that concept into those who have access to our monitoring environment”

Within a monitoring context, a false positive is an alert telling you something is wrong when everything is actually OK. It is strongly related to one of the “Four Questions of Monitoring” Kevin Sparenberg and I spoke about during THWACKcamp 2019: “Why did I get this alert?”

Before going any further, I want to qualify this answer by stating I’m presuming the alert is something you want overall, but you’re getting it at a point in time when a problem isn’t actually occurring. If you’re getting notifications for things you never wanted at all, that’s a different issue.

Identifying why a false positive was triggered comes down to a few common cases. Because this is a manager’s forum, I’m not going to dive deep into the technical, but rather offer an overview and some hints on how to direct the work of the technical staff.

The Hair Trigger

One of the most common reasons for false positives is failing to take the duration of the issue into consideration. Do you want to know the instant a machine fails a single ping? Or (more likely) when a machine has been unavailable for a short-but-still-not-ridiculous period of time (say, 1 minute)? For non-critical systems, this duration may be even longer (5 or even 10 minutes). Including this kind of lead time helps you avoid all manner of things that go bump in the night but don’t indicate a true outage.

As a manager, I encourage you to ask (yourself and your team) how long a system might be down before it has an actual impact on the business, and use this information as the basis for the delay you impose on alerts and notifications. The result is two-fold: first, you’ll get far fewer false alerts. Second, you and your team will respond faster to alerts, knowing they’re truly urgent and actionable.
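For the technical folks on your team, here is a rough sketch of what that delay looks like in practice. It is not how any particular monitoring product implements it; the class name DelayedAlert, the 60-second delay, and the sample poll times are all made up for illustration. The idea is simply that a single failed poll starts a clock, a successful poll resets it, and the alert only fires once the failure has lasted as long as the delay you chose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayedAlert:
    """Hypothetical helper: fire only after a check has failed
    continuously for at least `delay_seconds`."""
    delay_seconds: float
    failing_since: Optional[float] = None  # time of the first failure in the current streak

    def observe(self, timestamp: float, check_ok: bool) -> bool:
        """Record one poll result; return True if the alert should fire now."""
        if check_ok:
            # Any successful poll ends the streak and resets the clock.
            self.failing_since = None
            return False
        if self.failing_since is None:
            # First failure in a new streak: start the clock, do not alert yet.
            self.failing_since = timestamp
        return timestamp - self.failing_since >= self.delay_seconds


# Illustrative polls as (seconds, check passed?). The blip at t=10 clears by t=20,
# so it never alerts; the failure that starts at t=30 alerts once it reaches 60 seconds.
alert = DelayedAlert(delay_seconds=60)
polls = [(0, True), (10, False), (20, True), (30, False), (60, False), (90, False)]
fired = [alert.observe(t, ok) for t, ok in polls]
print(fired)
```

The business question from the paragraph above (how long can this system be down before it matters?) is what determines delay_seconds; the code is the easy part.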

Setting Your Sights Too High (or Too Low)

The second most common reason for false positives is when the threshold to trigger is simply too low. A common misconception held by managers I encounter is thinking of monitoring and alerts like a gas tank instead of a speedometer. Every time my gas tank gets low, I want to know about it, so I can stop and fill up, but I do NOT need an alert whenever I go above 35mph. Most monitoring is more like the latter.

My advice is to have you and your team write out each monitor in natural language (“collect CPU statistics and notify the team when it’s over 80%”) and then add “and then the team will...” at the end. If you can’t finish the sentence, chances are good your threshold is wrong.
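If your team prefers a checklist they can run, the same exercise can be sketched in a few lines. Everything here is hypothetical: the AlertDefinition structure, the then_team_will field, and the two sample alerts are invented to illustrate the idea, not taken from any product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertDefinition:
    """Hypothetical record of one alert, written as a sentence."""
    monitor: str         # what is collected and when the team is notified
    then_team_will: str  # the end of the sentence; empty if nobody can finish it


def needs_review(alerts: List[AlertDefinition]) -> List[str]:
    """Return the monitors whose 'and then the team will...' is still blank."""
    return [a.monitor for a in alerts if not a.then_team_will.strip()]


# Made-up examples for illustration only.
alerts = [
    AlertDefinition(
        monitor="collect CPU statistics and notify the team when it's over 80%",
        then_team_will="",
    ),
    AlertDefinition(
        monitor="check free disk space and notify the team when it's under 10%",
        then_team_will="clear old logs or extend the volume",
    ),
]
print(needs_review(alerts))
```

Anything this returns is a candidate for a changed threshold, or for becoming a report instead of an alert.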

Asking the Wrong Questions

My last common reason for false alerts (for this post, at least) is when you’re considering the wrong thing in the first place. Let’s take the CPU alert from the previous example. Why would you ever alert on high CPU? There are reasons, to be sure, but if you don’t even know why (and more specifically, what you and your team will DO about it when it’s over threshold), it’s likely because you’re not collecting the right information. To continue with the example, high CPU is often a symptom of a problem, but not the root cause. Like a person with a fever, the symptom tells you *something* is happening, but you need to collect more data to understand what it is.

As a manager, the key lies, again, in asking the team what they’ll do when they receive an alert. If you get a bunch of shrugged shoulders or simplistic answers (“we’ll reboot the box and hope it clears up”) you know you need to keep digging.

Thank you for joining me on this second installment in the Monitoring for Managers forum. I’m looking forward to your questions or thoughts in the comments section below. Or you can reach out to me via private message here.

Until next time,
  -  Leon

  • These are great steers on why we always discuss with our clients the need to:

    1. Baseline - set a period of data collection, so you can see what your norms are. For the CPU example above, is it standard behaviour for a server to run at 83%, which under normal alert settings (see next point) would generate an alert? If so, learn and adapt your thresholds
    2. The out of the box alerts are GENERIC, they are not purpose designed for you as an individual customer. Review the alerts provided and then make them work for you and your needs
    3. Don't forget thresholds can work in both directions. I had a customer only last week who was complaining about alerts being generated on Linux Cache volumes. "I get an alert it is 100% as soon as I add a new Linux server". Well yes, as the default alert says it should do so. Do you need to monitor your cache volumes? Change the thresholds from Critical being '> 90' to '< 98'
    4. Identify what you are actually going to read and DO SOMETHING ABOUT. If you look at an alert and say 'that is good to know', it is a report and not an alert. Go configure reports to provide you this information and you will also find you get much more benefit with all the supporting data and information you can inject into it

    Nice Post Leon


  • Yes to all of this.

    And to the managers reading, the point here is two-fold:

    1) "Salt to taste" is for more than cooking. Adjustments and customizations are necessary to make monitoring fit your organization. That means everything from thresholds to scan schedules to alert flow. Everything needs to be considered and calibrated to the goals and needs of your business. Your team (the technical folks) can do the work, but they can't do it right if you, the business leader, don't provide a clear idea of what those business goals are.

    2) Ask "Then what" until you get to meteors. What I mean is to keep asking your team questions like: "after the system does that, what happens next?" Keep pushing out the boundaries of what monitoring is able to both accommodate and resolve. Eventually you'll get to a point where the team says "well, at that point we're covered against anything except a meteor hitting the facility" (or something along those lines) and you can stop.