“False positives. It would be good to train that concept into those who have access to our monitoring environment”
Within a monitoring context, a false positive is when you get an alert telling you something is wrong when, in fact, everything is OK. It’s strongly related to one of the “Four Questions of Monitoring” Kevin Sparenberg ( KMSigma.SWI ) and I spoke about during THWACKcamp 2019 - “Why did I get this alert?”
Before going any further, I want to qualify this answer by saying I’m presuming the alert is one you want overall, but you’re getting it at a point in time when a problem isn’t actually occurring. If you’re getting notifications for things you never wanted at all, that’s a different issue.
Identifying why a false positive was triggered comes down to a few common cases. Because this is a manager’s forum, I’m not going to dive deep into the technical, but rather offer an overview and some hints on how to direct the work of the technical staff.
The Hair Trigger
One of the most common reasons for false positives is when the duration of the issue isn’t taken into consideration. Do you want to know the instant a machine fails a single ping? Or (more likely) when a machine has been unavailable for a short-but-still-not-ridiculous period of time (say, 1 minute)? For non-critical systems, this duration may be even longer (5 or even 10 minutes). Building in this kind of lead time helps you avoid alerting on all manner of things that go bump in the night but don’t indicate a true outage.
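For the technical folks on your team, here’s roughly what that delay looks like in principle. This is a generic Python sketch, not a SolarWinds feature or configuration; the is_reachable() check, the 60-second window, and the 10-second polling interval are all placeholder assumptions standing in for whatever your tooling actually provides.

```python
import time

# Illustrative sketch only -- not tied to any specific monitoring product.
# is_reachable(), the 60-second window, and the 10-second poll interval are
# placeholder assumptions; your tooling will have its own way of saying
# "the condition must be sustained for N seconds before alerting."

DOWN_FOR_SECONDS = 60   # how long a node must stay unreachable before we alert
CHECK_INTERVAL = 10     # seconds between polls

def is_reachable(node: str) -> bool:
    """Stand-in for whatever check your monitoring tool performs (ping, SNMP, agent)."""
    raise NotImplementedError

def watch(node: str) -> None:
    first_failure = None
    while True:
        if is_reachable(node):
            first_failure = None  # a single good poll resets the clock
        else:
            first_failure = first_failure or time.time()
            if time.time() - first_failure >= DOWN_FOR_SECONDS:
                print(f"ALERT: {node} has been unreachable for {DOWN_FOR_SECONDS}+ seconds")
        time.sleep(CHECK_INTERVAL)
```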
As a manager, I encourage you to ask (yourself and your team) how long a system might be down before it has an actual impact on the business, and use that information as the basis for the delay you build into alerts and notifications. The result is two-fold: first, you’ll get far fewer false alerts. Second, you and your team will respond faster to the alerts you do get, knowing each one is truly urgent and actionable.
Setting Your Sights Too High (or Too Low)
The second most common reason for false positives is when the threshold to trigger the alert is simply too low. A common misconception I encounter among managers is thinking of monitoring and alerts like a gas tank instead of a speedometer. Every time my gas tank gets low, I want to know about it so I can stop and fill up, but I do NOT need an alert every time I go above 35mph. Most monitoring is more like the latter.
My advice is for you and your team to write out each alert in natural language (“collect CPU statistics and notify the team when it’s over 80%”) and then add “and then the team will...” at the end. If you can’t finish the sentence, chances are good your threshold is wrong.
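If it helps to make that exercise concrete, here’s one way a team might capture it. This is an illustrative Python sketch, not a feature of any monitoring platform; the 95% threshold and the runbook wording are assumptions invented for the example.

```python
from dataclasses import dataclass

# Illustrative sketch: every alert definition must finish the sentence
# "...and then the team will...". The thresholds and runbook text below
# are made-up examples, not recommendations.

@dataclass
class AlertDefinition:
    description: str             # the alert, written out in natural language
    threshold: float
    and_then_the_team_will: str  # if you can't fill this in, rethink the alert

alerts = [
    AlertDefinition(
        description="Collect CPU statistics and notify the team when sustained CPU is over 95%",
        threshold=95.0,
        and_then_the_team_will="identify the runaway process and engage the application owner",
    ),
    # AlertDefinition(
    #     description="Notify the team when CPU is over 80%",
    #     threshold=80.0,
    #     and_then_the_team_will="...?",  # can't finish the sentence -- the threshold is suspect
    # ),
]
```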
Asking the Wrong Questions
The last common reason for false alerts (for this post, at least) is monitoring the wrong thing in the first place. Let’s take the CPU alert from the previous example. Why would you alert on high CPU at all? There are reasons, to be sure, but if you don’t know why (and more specifically, what you and your team will DO about it when it’s over threshold), it’s likely because you’re not collecting the right information. To continue the example, high CPU is often a symptom of a problem, not the root cause. Like a person with a fever, the symptom tells you *something* is happening, but you need to collect more data to understand what it is.
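As one illustration of “collect more data”: instead of an alert that only reports the number, the check could also capture the likely culprits at the moment the symptom fires. This sketch uses the third-party psutil library purely as an example; it isn’t how any particular monitoring product works, and the 90% threshold is an arbitrary placeholder.

```python
from typing import Optional

import psutil  # third-party library, used here only for illustration

def cpu_alert_context(threshold: float = 90.0) -> Optional[dict]:
    """If overall CPU exceeds the (placeholder) threshold, return diagnostic
    context rather than just the number: the five busiest processes."""
    overall = psutil.cpu_percent(interval=1)
    if overall < threshold:
        return None
    processes = sorted(
        psutil.process_iter(["name", "cpu_percent"]),
        key=lambda p: p.info["cpu_percent"] or 0.0,
        reverse=True,
    )[:5]
    return {
        "overall_cpu": overall,
        "top_processes": [(p.info["name"], p.info["cpu_percent"]) for p in processes],
    }
```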
As a manager, the key lies, again, in asking the team what they’ll do when they receive an alert. If you get a bunch of shrugged shoulders or simplistic answers (“we’ll reboot the box and hope it clears up”), you know you need to keep digging.
Thank you for joining me on this second installment in the Monitoring for Managers forum. I’m looking forward to your questions or thoughts in the comments section below. Or you can reach out to me via private message here ( adatole ).
Until next time,
- Leon