Question 2: Fixing False Positives

“False positives. It would be good to train that concept into those who have access to our monitoring environment”

Within a monitoring context, a false positive is when you get an alert saying something is wrong, but everything is actually OK. It is strongly related to one of the “Four Questions of Monitoring” Kevin Sparenberg and I spoke about during THWACKcamp 2019 - “Why did I get this alert?”

Before going any further, I want to qualify this answer by stating I’m presuming the alert is something you want overall, but you’re getting it at a point in time when a problem isn’t actually occurring. If you’re getting notifications for things you never wanted at all, that’s a different issue.

Identifying why a false positive was triggered comes down to a few common cases. Because this is a manager’s forum, I’m not going to dive deep into the technical, but rather offer an overview and some hints on how to direct the work of the technical staff.

The Hair Trigger

One of the most common reasons for false positives is that the duration of the issue wasn’t taken into consideration. Do you want to know the instant a machine fails a single ping? Or (more likely) when a machine has been unavailable for a short-but-still-not-ridiculous period of time (say, 1 minute)? And for non-critical systems, this duration may be even longer (5 or even 10 minutes). Building in this kind of lead time helps avoid all manner of things that go bump in the night but don’t indicate a true outage.
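For the technical folks, here’s a minimal sketch of the idea (in Python, with made-up check data - your monitoring tool will have its own way to express this) where an alert fires only after the failure has persisted for the full delay window:

```python
ALERT_DELAY_SECONDS = 60  # how long a failure must persist before alerting

def should_alert(check_results, delay=ALERT_DELAY_SECONDS):
    """check_results: list of (timestamp, is_up) tuples, oldest first.
    Alert only if the node has been continuously down for `delay` seconds."""
    down_since = None
    for ts, is_up in check_results:
        if is_up:
            down_since = None          # recovery resets the clock
        elif down_since is None:
            down_since = ts            # start of the current outage
    if down_since is None:
        return False                   # currently up: nothing to alert on
    latest_ts = check_results[-1][0]
    return (latest_ts - down_since) >= delay

# A node that dropped one ping and recovered stays quiet...
flapping = [(0, True), (10, False), (20, True), (30, True)]
# ...while a node down for a full minute raises the alarm.
outage = [(0, True), (10, False), (40, False), (70, False)]
print(should_alert(flapping))  # False
print(should_alert(outage))    # True
```

The same logic is what most monitoring platforms call a “trigger condition duration” or “sustained condition”; the point is that the delay is a deliberate, business-driven number, not zero.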

As a manager, I encourage you to ask (yourself and your team) how long a system might be down before it has an actual impact on the business, and use this information as the basis for the delay you impose on alerts and notifications. The result is two-fold: first, you’ll get far fewer false alerts. Second, you and your team will respond faster to alerts, knowing each one is truly urgent and actionable.

Setting Your Sights Too High (or Too Low)

The second most common reason for false positives is a trigger threshold that is simply too low. A common misconception among managers I encounter is thinking of monitoring and alerts like a gas tank instead of a speedometer. Every time my gas tank gets low I want to know about it so I can stop and fill up, but I do NOT need an alert whenever I go above 35mph. Most monitoring is more like the latter.

My advice is to have you and your team write out each monitor in natural language (“collect CPU statistics and notify the team when it’s over 80%”) and then add “and then the team will...” at the end. If you can’t finish the sentence, chances are good your threshold is wrong.
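That exercise can even be automated as a simple lint check on your alert catalogue. This is a toy sketch - the field names are my own invention, not any particular tool’s - but it shows the principle: an alert with no “and then the team will...” clause is a candidate for a wrong threshold:

```python
# Hypothetical alert catalogue. Each entry must be able to finish the
# sentence "...and then the team will ..." or the threshold is suspect.
alerts = [
    {"metric": "cpu_percent", "threshold": "> 80 for 10 min",
     "then_the_team_will": "check the top processes and page the app owner"},
    {"metric": "disk_free_percent", "threshold": "< 10",
     "then_the_team_will": ""},  # nobody could finish the sentence
]

def review(alert_catalogue):
    """Return the alerts nobody has defined an action for."""
    return [a for a in alert_catalogue if not a["then_the_team_will"].strip()]

for bad in review(alerts):
    print(f"Rethink '{bad['metric']} {bad['threshold']}': no action defined")
```

Running the review on a real catalogue every quarter is a cheap way to keep the “gas tank vs. speedometer” question in front of the team.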

Asking the Wrong Questions

My last common reason for false alerts (for this post, at least) is when you’re measuring the wrong thing in the first place. Let’s take the CPU alert from the previous example. Why would you ever alert on high CPU? There are reasons, to be sure, but if you don’t know why (and more specifically, what you and your team will DO about it when it’s over threshold), it’s likely because you’re not collecting the right information. To continue with the example, high CPU is often a symptom of a problem, but not the root cause. Like a person with a fever, the symptom tells you *something* is happening, but you need to collect more data to understand what it is.
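One concrete form of “collect more data”: when the high-CPU symptom fires, automatically capture the context a human would need for a diagnosis. A minimal sketch, assuming a Linux host with GNU `ps` available (the function name is mine, not a monitoring product’s):

```python
import subprocess

def cpu_context():
    """When a high-CPU alert fires, snapshot *why*: the top CPU consumers
    at that moment, to attach to the alert. Assumes Linux/GNU ps."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm", "--sort=-pcpu"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "\n".join(out.splitlines()[:6])  # header plus top 5 processes

print(cpu_context())
```

Attaching that snapshot to the notification turns “CPU is over 80%” from a fever reading into the start of a diagnosis.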

As a manager, the key lies, again, in asking the team what they’ll do when they receive an alert. If you get a bunch of shrugged shoulders or simplistic answers (“we’ll reboot the box and hope it clears up”) you know you need to keep digging.

Thank you for joining me on this second installment in the Monitoring for Managers forum. I’m looking forward to your questions or thoughts in the comments section below. Or you can reach out to me via private message.

Until next time,
  -  Leon

  • These are great steers on why we always discuss with our clients the need to:

    1. Baseline - set a period of data collection so you can see what your norms are. For the CPU example above, is it standard behaviour for a server to run at 83%, which under default alert settings (see next point) would generate an alert? If so, learn from that and adapt your thresholds
    2. Out-of-the-box alerts are GENERIC; they are not purpose-designed for you as an individual customer. Review the alerts provided and then make them work for you and your needs
    3. Don't forget thresholds can work in both directions. I had a customer only last week who was complaining about alerts being generated on Linux cache volumes: "I get an alert that it is 100% as soon as I add a new Linux server". Well yes, that's exactly what the default alert says it should do. Do you need to monitor your cache volumes at all? If so, change the thresholds from Critical being '> 90' to '< 98'
    4. Identify what you are actually going to read and DO SOMETHING ABOUT. If you look at an alert and say 'that is good to know', it is a report and not an alert. Go configure reports to provide you this information, and you will also find you get much more benefit from all the supporting data and information you can inject into it
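    To put the baselining point (1) in code - a toy sketch with made-up sample data, not any vendor's API - collect samples over a baseline period, then derive the threshold from what "normal" actually looks like rather than accepting the generic default:

```python
import statistics

# Hypothetical baseline: CPU % samples collected over the baseline period
# for the server that "normally runs at 83%" from the example above.
baseline_cpu = [78, 81, 83, 80, 84, 79, 82, 85, 81, 83]

mean = statistics.mean(baseline_cpu)
stdev = statistics.stdev(baseline_cpu)
# Alert on a genuine departure from normal, not on "normal is above 80".
threshold = round(mean + 3 * stdev)

print(f"normal = {mean:.0f}%, alert above {threshold}%")
```

    With the default "> 80%" rule this box would alert constantly; with a baseline-derived threshold it only alerts when behaviour actually changes.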

    Nice Post Leon

    Mark

  • Yes to all of this.

    And to the managers reading, the point here is two-fold:

    1) "salt to taste" is for more than cooking. Adjustments and customizations are necessary to make monitoring fit your organization. That means everything from thresholds to scan schedules to alert flow. Everything needs to be considered and calibrated to the goals and needs of your business. Your team (the technical folks) can do the work, but they can't do it right if you, the business leader, don't provide a clear idea of what those business goals are.

    2) Ask "then what" until you get to meteors. What I mean is to keep asking your team questions like: "after the system does that, what happens next?" Keep pushing out the boundaries of what monitoring is able to both accommodate and resolve. Eventually you'll get to a point where the team says "well, at that point we're covered against anything except a meteor hitting the facility" (or some response like that) and you can stop.

  • When I work with newer monitoring teams, I mention how monitoring can often fall into the trap of 'The Boy Who Cried Wolf'. When you get false positives and these generate emails from your platform, not only are they useless in themselves, but they bring down the psychological importance of the rest of the monitoring emails. After the 10th time you open an email and find it isn't actionable, you will unconsciously assume the email that just came in is just more of the same.

    The same can also be said about 'Reset Actions': if you are getting an email simply to tell you something is working again, you assign it a mindset of 'I don't need to do anything with that email'. After a while, this same mindset blurs into the other emails, and before long you've missed the email about an important service going down and everything becomes chaos.

    Appreciate the posts so far!

  • My last common reason for false alerts (for this post, at least) is when you’re considering the wrong thing in the first place. Let’s take the CPU alert from the previous example. Why would you alert on high CPU ever? There are reasons, to be sure, but if you don’t even know why (and more specifically, what you and your team will DO about it when it’s over threshold) it’s likely because you’re not collecting the right information. To continue with the example, high CPU is often a symptom of a problem, but not the root cause. Like a person with a fever, the symptom tells you *something* is happening, but you need to collect more data to understand what it is.

    I think here you actually combine two different separately actionable insights together.

    First, you have symptom based alerting.  Symptom based alerting in itself isn't bad - we all have the experience of going to the doctor because "it hurts here" and making that the jumping-off point from which you actually get a diagnosis.  Patients can't always turn up and say "I have a T6 spinal cord injury, please fix me."  If your system can give better insight as to its own diagnosis, of course it should, but if all you know is "this machine is no longer responding and that's bad" then we can't let the perfect be the enemy of the good; we just have to pay the penalty by spending more operations time drilling into root-causing instead.  (That additional time may mean you have to fire alerts sooner as well, factoring in the length of time root-causing takes.)  Different teams have different philosophies about whether they're OK with symptom-based alerting, especially as a catchall in case nobody has considered the specific failure mode of a component that would allow a more specific alert to exist.

    (In fact, as an aside, some businesses try to push for alerting to be tied to an SLO or SLA claiming it's the most aligned alerting towards business requirements.  In these cases, they are almost enforcing the use of a symptom based alert - "we are breaching SLA but we don't know exactly why")

    The second is "vanity metrics" which are almost always a bad smell.  Just because a platform allows you to monitor a given value, doesn't mean you should.  The fact that the system knows what the CPU is, knows what the network throughput is, knows all the running processes... it's extremely tempting to make all kinds of meaningless assertions because either (1) surely more alerts is better at catching problems, or (2) we don't currently have a metric for the real thing we want to monitor, but maybe if we fudge these four together (machine is up and process is running and CPU is greater than 0 and disk activity is greater than 0) then we can use that as a proxy for what we really care about (the system is processing data as intended).  Of course, the answer here is 'no, actually make the thing you want to monitor, don't rely on the fragility of your approximation remaining true'.