Cutting Down On Alerting Noise: Guest Post From Support

I've been with SolarWinds for going on eight years now, working on the Support Team, so it came as no surprise when so many of you answered last month's poll saying that alerts, and specifically filtering out the noise from the real issues, were your top priority to tackle in the next 30 days.

So, with that in mind, I've put together three powerful tips that could help you achieve this goal, and along the way I'll cover some of the common alerting questions we tend to get in Support.

NPM has awesome alerting capability, and when coupled with some of the other Orion features, it’ll let you get really granular – so yes, you really can cut down on a lot of that noise!

Tip #1 – Use custom properties with alerts for a powerful combo

Custom properties allow you to define any property you like that doesn’t already exist in Orion. The possibilities are endless here, and you can use multiple custom properties together to provide a high level of granularity. Not heard of custom properties before? Check out a video tutorial on them here.

I'll walk through three case studies we're regularly asked about in Support, each of which could go a long way toward reducing that alerting noise.

Let's start really simple and assume you've got a handful of devices that you need to monitor but don't need alerts for. Create a Boolean (Yes/No) custom property called DoNotAlertOnThis. Boolean custom properties default to False, so all you need to do is set it to True for these 'noisy' devices.

Set up your alert like this:

[Screenshot pic1.png: alert trigger condition using the DoNotAlertOnThis custom property]
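If you've got a lot of these noisy devices, you don't have to click through each one in the web console. Here's a minimal sketch using the SolarWinds Orion SDK for Python (orionsdk); the server name, credentials, and node captions are placeholders, and it assumes the DoNotAlertOnThis custom property has already been created for nodes.

from orionsdk import SwisClient

# Placeholders: point these at your own Orion server and 'noisy' nodes
swis = SwisClient("your-orion-server", "admin", "password")
noisy_nodes = ["lab-switch-01", "lab-switch-02", "test-router-01"]

for caption in noisy_nodes:
    nodes = swis.query(
        "SELECT Caption, Uri FROM Orion.Nodes WHERE Caption = @caption",
        caption=caption)["results"]
    for node in nodes:
        # Custom properties hang off the node's SWIS Uri at /CustomProperties
        swis.update(node["Uri"] + "/CustomProperties", DoNotAlertOnThis=True)
        print("Suppressing alerts for", node["Caption"])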

Next, let's take a look at creating a single alert that can email a different contact for each node. When you've got a lot of departments, you don't want all of your admin teams receiving alerts for devices that other teams are responsible for, right? Well, what if I told you that you could create a custom property on your nodes and store the email address or distribution list address for the Primary Contact(s) within it?
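Here's a hedged sketch of that idea, assuming you've created a text custom property called AlertEmail (the name is just an example) on your nodes; the node-to-address mapping below is made up. Once the property is populated, you can insert it into the email action's 'To' field using the custom property variable from the variable picker, and each node's alert will go to the right team.

from orionsdk import SwisClient

swis = SwisClient("your-orion-server", "admin", "password")

# Hypothetical mapping of node captions to the team that should be notified
contacts = {
    "hr-fileserver-01": "hr-admins@example.com",
    "eng-buildserver-01": "eng-oncall@example.com",
}

for caption, email in contacts.items():
    for node in swis.query(
            "SELECT Uri FROM Orion.Nodes WHERE Caption = @caption",
            caption=caption)["results"]:
        # AlertEmail is an example custom property name, not a built-in field
        swis.update(node["Uri"] + "/CustomProperties", AlertEmail=email)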

And finally, do you ever find yourself wishing that you could define your alert thresholds per device, per interface, or per volume? Not all volumes are created equal: maybe some volumes are a worry at 50% used, while others aren't really a panic until they hit 95%. It's not only possible, it's quite easy to set up with custom properties.
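As a sketch of how this plays out, assume a numeric custom property on volumes called WarningPercent (again, just an example name). The alert condition compares each volume's used percentage against its own property instead of one hard-coded number, and you can preview which volumes would trigger right now with a quick SWQL query; the field and navigation names below are assumptions based on the standard Orion schema, so verify them against your own install.

from orionsdk import SwisClient

swis = SwisClient("your-orion-server", "admin", "password")

# Volumes whose usage has crossed their own per-volume threshold
# (WarningPercent is a hypothetical volume custom property)
rows = swis.query("""
    SELECT v.Node.Caption AS NodeName, v.Caption AS Volume,
           v.VolumePercentUsed,
           v.CustomProperties.WarningPercent AS WarningPercent
    FROM Orion.Volumes v
    WHERE v.VolumePercentUsed >= v.CustomProperties.WarningPercent
""")["results"]

for r in rows:
    print(r["NodeName"], r["Volume"], r["VolumePercentUsed"], ">=", r["WarningPercent"])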

Tip #2 – Use Dependencies to add intelligence to your alerts

Dependencies are another awesome way to build alerting intelligence, and they cut down drastically on noise.

Let's assume, for example, that you've got a remote site with a few hundred devices. Using just the default 'Node Down' alert, you're probably getting an email for every single one of those remote devices if the link to that site goes down. It's not that those devices are actually down, but because Orion can't contact them, it has no choice but to mark them that way. Right? Thankfully, wrong!

Using Groups and Dependencies, you can tell Orion that those devices are Unreachable when that link goes down – that way, you’ll only receive your interface down alert. Sound good?

Here’s how you do it.

First, create a Group for your remote site. Let's call it FieldOffice. Then set a dependency with the HQ-side interface of the link to that site as the parent and your FieldOffice group as the child.

That’s it. It’s really that simple – Orion will do all the rest for you automatically.
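If you'd rather script the dependency than click through the UI, here's a hedged sketch using the Orion SDK for Python. It assumes the FieldOffice group already exists, the interface caption and server details are placeholders, and you should double-check the Orion.Dependencies entity against your own SWIS schema before relying on it.

from orionsdk import SwisClient

swis = SwisClient("your-orion-server", "admin", "password")

# Parent: the HQ interface carrying the link to the remote site (placeholder caption)
parent = swis.query(
    "SELECT Uri FROM Orion.NPM.Interfaces WHERE Caption = @cap",
    cap="HQ-Router - GigabitEthernet0/1")["results"][0]

# Child: the FieldOffice group created in the web console
child = swis.query(
    "SELECT Uri FROM Orion.Container WHERE Name = @name",
    name="FieldOffice")["results"][0]

# If the parent interface goes down, everything in FieldOffice
# is marked Unreachable instead of Down
swis.create("Orion.Dependencies",
            Name="FieldOffice depends on HQ uplink",
            ParentUri=parent["Uri"],
            ChildUri=child["Uri"])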

If that interface goes down, Orion will set all the nodes in your Field Office to Unreachable – a specific status used for Dependencies. Because these nodes are in the 'Unreachable' status, and not 'Down', the logic for 'alert me when a node goes down' does not apply.

Speaking of groups – this is a little off topic, but we do get asked about it from time to time: how do you set up an alert to tell you when a group member goes down (or which of your group members caused the group to go down)? You can actually use the out-of-the-box 'Alert me when a group goes down' alert. By default, this tells you the group has gone down, since the status bubbles up to the group level, but it doesn't include the detail about which group member keeled over to cause it. Add the following variable to your trigger action to get that information:

Root Cause: ${GroupStatusRootCause}
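If you ever want that same member-level detail outside of an alert email, a quick SWQL query against the group's members lists each member with its current status. This is only a hedged sketch; the status values come from Orion.StatusInfo, where 2 means Down on a standard install.

from orionsdk import SwisClient

swis = SwisClient("your-orion-server", "admin", "password")

# List every member of the FieldOffice group along with its current status
members = swis.query("""
    SELECT m.Name, m.Status
    FROM Orion.ContainerMembers m
    INNER JOIN Orion.Container c ON c.ContainerID = m.ContainerID
    WHERE c.Name = @group
""", group="FieldOffice")["results"]

for m in members:
    flag = "  <-- possible root cause" if m["Status"] == 2 else ""
    print(m["Name"], m["Status"], flag)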

Tip #3 – Check out the complex condition options in 11.5

With the complex condition options in 11.5, the possibilities are endless, so rather than walking through specific case studies, let's talk about some of the awesome logic you can now apply to your alerts.

To start, enable the “Complex Conditions” option on the “Trigger Conditions” tab of your new alert:

[Screenshot pic2.png: enabling Complex Conditions on the Trigger Conditions tab]

First, let's take a look at a really cool noise-reducing option: only getting an alert when a certain number of objects meet the alert criteria.

Yes, you heard that right – instead of receiving a separate alert for each individual interface with high utilization, you can choose to receive a single alert only when a certain number of interfaces matching your alert conditions go over their thresholds. With Complex Conditions enabled, this option will appear at the end of the primary section:

[Screenshot pic3.png: the objects-count option at the end of the primary condition section]
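To make that concrete, here's a hedged sketch that counts how many interfaces are currently over a utilization threshold, which is essentially the count the complex condition evaluates for you; InPercentUtil and OutPercentUtil are assumed field names from the standard NPM interface schema, so verify them on your own install.

from orionsdk import SwisClient

swis = SwisClient("your-orion-server", "admin", "password")
THRESHOLD = 80        # percent utilization; pick your own value
MIN_INTERFACES = 5    # only worry when at least this many are hot

# Count interfaces currently over the utilization threshold
rows = swis.query("""
    SELECT COUNT(InterfaceID) AS HotCount
    FROM Orion.NPM.Interfaces
    WHERE InPercentUtil >= @t OR OutPercentUtil >= @t
""", t=THRESHOLD)["results"]

hot = rows[0]["HotCount"]
if hot >= MIN_INTERFACES:
    print(f"{hot} interfaces over {THRESHOLD}% - this is when the alert would fire")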

And finally, last but definitely not least, let's take a look at stringing conditions together. There will be times when you only want an alert when two (or more) very distinct conditions occur at once, and when you're fighting fires, it can be hard to correlate a bunch of emails to see whether that scenario actually happened.

The classic example of this would be to check the status of two separate applications on two separate servers – maybe you can hobble along with just one, but you might need a critical alert sent out if you lose both.

[Screenshot pic4.png: a standard trigger condition for a single application on ServerA]

Using the standard alert conditions, you can only alert on one of those applications. If you were to add 'Node name is equal to ServerB' to the same condition block, the alert could never trigger, as no server can be named both 'ServerA' and 'ServerB' at the same time.

This is where complex conditions are king. You can now set up the alert like this:

[Screenshot pic5.png: two condition sections combined so the alert requires the application to be down on both servers]

So now an application would have to fail on both Server A and also fail on Server B in order to generate this alert.
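As a hedged sketch of what that AND across two distinct objects amounts to, the check below pulls the status of the same application on ServerA and ServerB; the server and application names are placeholders, and it assumes SAM's Orion.APM.Application entity with the standard Orion status values, where 1 means Up.

from orionsdk import SwisClient

swis = SwisClient("your-orion-server", "admin", "password")

# Placeholder names: the same application monitored on each of the two servers
rows = swis.query("""
    SELECT n.Caption AS Server, a.Name AS App, a.Status
    FROM Orion.APM.Application a
    INNER JOIN Orion.Nodes n ON n.NodeID = a.NodeID
    WHERE a.Name = @app AND n.Caption IN ('ServerA', 'ServerB')
""", app="Payroll Service")["results"]

# The complex condition only fires when BOTH instances are unhealthy
both_down = len(rows) == 2 and all(r["Status"] != 1 for r in rows)
print("Critical: application lost on both servers" if both_down else "Still hobbling along on one")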

Here’s where it gets super interesting – the conditions don’t have to match. In fact, they don’t have to have anything to do with each other at all, other than both resolving to ‘true’ conditions in order for the alert to fire. And you can add as many of these additional conditions as you like to get truly granular and complex alerts.

So – over to you! What tips and tricks have you picked up on Alerts over the years? What excited you about the changes to alerts in 11.5?

Want to learn more about Orion Alerts? Take a look at these resources:

Introduction to Alerts

Level 2 Training – Service Groups, Alerts and Dependencies (53 mins)

Building Complex Alert Conditions

Information about Advanced SQL Alerts

Alert Variables
