Cutting Down On Alerting Noise: Guest Post From Support

Level 11

I’ve been with SolarWinds for going on 8 years now, working with the Support Team, so it came as no surprise when so many of you answered last month’s poll by saying that alerts, and specifically filtering the noise out from the real issues, were your top-priority problem to solve in the next 30 days.

So, with that in mind, I’ve put together three powerful tips that could help you to achieve this goal, and I’ll discuss some of the common alerting questions that we tend to get in Support along the way.

NPM has awesome alerting capability, and when coupled with some of the other Orion features, it’ll let you get really granular – so yes, you really can cut down on a lot of that noise!

Tip #1 – Use custom properties with alerts for a powerful combo

Custom properties allow you to define any property you like that doesn’t already exist in Orion. The possibilities are endless here, and you can use multiple custom properties together to provide a high level of granularity. Not heard of custom properties before? Check out a video tutorial on them here.

I’ll take three case studies as examples that we’re regularly asked about, which could go a long way to reducing that alerting noise.

Let’s start really simple, and assume you’ve got a handful of devices that you need to monitor, but don’t need alerts for. Create a Boolean (Yes/No) custom property called DoNotAlertOnThis. Boolean custom properties default to False, so all you need to do is set this to True for these ‘noisy’ devices.

Set up your alert like this:

pic1.png

Next, let’s take a look at creating a single alert that can email a different contact for each node. When you’ve got a lot of departments, you don’t want all of your admin teams to be receiving alerts for devices that others are responsible for, right? Well, what if I told you that you could create a custom property on your nodes, and store the email addres...

And finally, do you ever find yourself wishing that you could define your alert thresholds per device, per interface, or per volume? Not all volumes are created equal - have you ever wished you could alert on some volumes when they reach 50%, but then others aren’t really a panic until they hit 95%? It’s not only possible, it’s quite easy to set up with custom properties.
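To make the logic of the first and third case studies concrete, here is a minimal Python sketch of what such an alert condition evaluates. It is not Orion code: the `Volume` class and the `alert_threshold` field are hypothetical names made up for illustration, standing in for a volume and a numeric custom property you would define yourself. The `do_not_alert_on_this` field mirrors the Boolean DoNotAlertOnThis property described above.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Illustrative stand-in for a monitored volume and its custom properties."""
    name: str
    percent_used: float
    alert_threshold: float = 90.0       # hypothetical numeric custom property
    do_not_alert_on_this: bool = False  # Boolean custom property, defaults to False


def should_alert(volume: Volume) -> bool:
    """Alert only if the volume is not muted and has reached its own threshold."""
    if volume.do_not_alert_on_this:
        return False
    return volume.percent_used >= volume.alert_threshold


volumes = [
    Volume("C:", 55.0, alert_threshold=50.0),      # low threshold, so this alerts
    Volume("D:", 80.0, alert_threshold=95.0),      # not a panic until 95%
    Volume("E:", 99.0, do_not_alert_on_this=True)  # noisy volume, muted
]
alerting = [v.name for v in volumes if should_alert(v)]
print(alerting)
```

The point of the design is that one alert definition covers every volume; the per-object differences live in the custom properties rather than in separate alerts.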

Tip #2 – Using Dependencies to add intelligence to the Alerts

Dependencies are another awesome way to build alerting intelligence, and they cut down drastically on noise.

Let’s assume, for example, that you’ve got a remote site with a few hundred devices. Using just the default ‘Node Down’ alert, you’re probably getting an email for every single one of those remote devices if the link to that remote site goes down. It’s not that those devices are actually down, but as Orion can’t contact them, it has no choice but to mark them that way. Right? Thankfully – Wrong!

Using Groups and Dependencies, you can tell Orion that those devices are Unreachable when that link goes down – that way, you’ll only receive your interface down alert. Sound good?

Here’s how you do it.

First create a Group for your remote site. Let’s call it FieldOffice. Set a dependency for your FieldOffice group, with the interface of the network device in your HQ as the parent.

That’s it. It’s really that simple – Orion will do all the rest for you automatically.

If that interface goes down, Orion will set all the nodes in your Field Office as Unreachable – a specific status used for Dependencies. As these nodes have ‘Unreachable’ status, and not ‘Down’, the logic for ‘alert me when a node goes down’ does not apply.
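The effect of the dependency can be sketched in a few lines of Python. This is only a model of the behaviour described above, not how Orion implements it; the function names and the `polls` dictionary are assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    UP = "Up"
    DOWN = "Down"
    UNREACHABLE = "Unreachable"


def effective_status(responds: bool, parent_up: bool) -> Status:
    """A silent child is Down only if its parent is still up; otherwise Unreachable."""
    if responds:
        return Status.UP
    return Status.DOWN if parent_up else Status.UNREACHABLE


def node_down_alerts(polls: dict, parent_up: bool) -> list:
    """Return the nodes that would match an 'alert me when a node goes down' rule."""
    return [
        name
        for name, responds in polls.items()
        if effective_status(responds, parent_up) is Status.DOWN
    ]


# No FieldOffice node answers its poll.
field_office = {"fo-switch-1": False, "fo-switch-2": False}

# Parent interface is down: children are Unreachable, so no node-down alerts.
print(node_down_alerts(field_office, parent_up=False))

# Parent interface is up: the same silence means the nodes really are Down.
print(node_down_alerts(field_office, parent_up=True))
```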

Speaking of groups – this is a little off topic, but we do get asked about it from time to time. How do you set up an alert to tell you when a group member goes down (or tell you which of your group members caused the group to go down)? You can actually use the out-of-the-box ‘Alert me when a group goes down’ alert. By default this tells you about the group going down (as the status bubbles up to group level), but it doesn’t include the detail about which group member keeled over to cause that to happen. Add the following variable to your trigger action to get that information:

Root Cause: ${GroupStatusRootCause}

Tip #3 - Check out the complex condition options in 11.5

With the complex conditions options in 11.5 the possibilities are endless, so rather than talk about specific case studies, let’s talk about some of the awesome logic options you can now apply to your alerts.

First, enable the “Complex Conditions” on the “Trigger Conditions” tab of your new alert:

pic2.png

Now, let’s take a look at a really cool noise-reducing option – only get an alert when a certain number of objects meet the alert criteria.

Yes, you heard that right – instead of receiving a separate alert for each separate interface with high utilization, you can choose to receive an alert only when a certain threshold of interfaces that match your alert requirements go over their thresholds. With the Complex Conditions enabled, this option will appear at the end of the Primary Section:

pic3.png
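As a rough illustration of what this option does, the Python sketch below counts how many interfaces currently match the condition and only triggers once that count reaches a minimum. The function names, the sample interface names, and the exact comparison (greater than or equal to the minimum) are assumptions for the example, not taken from the product.

```python
def interfaces_over_limit(utilization_by_interface: dict, limit_percent: float) -> list:
    """Names of the interfaces whose utilization exceeds the limit."""
    return sorted(
        name
        for name, percent in utilization_by_interface.items()
        if percent > limit_percent
    )


def should_trigger(utilization_by_interface: dict, limit_percent: float, min_objects: int) -> bool:
    """Trigger one alert only when at least min_objects interfaces match at once."""
    matching = interfaces_over_limit(utilization_by_interface, limit_percent)
    return len(matching) >= min_objects


utilization = {"Gi0/1": 92.0, "Gi0/2": 40.0, "Gi0/3": 88.0, "Gi0/4": 95.0}

# Three interfaces are above 85%, so a minimum of 3 triggers and a minimum of 4 does not.
print(should_trigger(utilization, limit_percent=85.0, min_objects=3))
print(should_trigger(utilization, limit_percent=85.0, min_objects=4))
```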

And finally, as my last but definitely not least tip of the day – let’s take a look at stringing conditions together. There’ll be times when you might only want to get an alert when two (or more) very distinct conditions occur at once, and when you’re fighting fires, it can be hard to correlate a bunch of emails together to see if that scenario actually happened or not.

The classic example of this would be to check the status of two separate applications on two separate servers – maybe you can hobble along with just one, but you might need a critical alert sent out if you lose both.

pic4.png

Using the standard alert conditions, you can create an alert that fires on just one of those applications. If you were to add ‘Node name is equal to ServerB’ in here, it could never trigger – as no server can be named both ‘ServerA’ and ‘ServerB’ at the same time.

This is where the complex condition is king. You can now alert as:

pic5.png

So now an application would have to fail on both Server A and also fail on Server B in order to generate this alert.

Here’s where it gets super interesting – the conditions don’t have to match. In fact, they don’t have to have anything to do with each other at all, other than both resolving to ‘true’ conditions in order for the alert to fire. And you can add as many of these additional conditions as you like to get truly granular and complex alerts.
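One way to picture this is to treat each condition section as an independent true/false check, with the alert firing only when every section is true. The Python sketch below models that idea; `application_down_on` and `all_of` are hypothetical helpers written for this example and do not correspond to anything in the product.

```python
from typing import Callable, Dict

Statuses = Dict[str, str]
Condition = Callable[[Statuses], bool]


def application_down_on(server: str) -> Condition:
    """Build a condition that is true when the application on `server` is Down."""
    def check(statuses: Statuses) -> bool:
        return statuses.get(server) == "Down"
    return check


def all_of(*conditions: Condition) -> Condition:
    """Combine independent conditions; the result is true only if all of them are."""
    def check(statuses: Statuses) -> bool:
        return all(condition(statuses) for condition in conditions)
    return check


both_down = all_of(application_down_on("ServerA"), application_down_on("ServerB"))

one_down = {"ServerA": "Down", "ServerB": "Up"}
two_down = {"ServerA": "Down", "ServerB": "Down"}

print(both_down(one_down))  # hobbling along on one server
print(both_down(two_down))  # both lost
```

Because each section is evaluated on its own, nothing requires the sections to look at the same kind of object; any check that resolves to true or false can be added to `all_of`.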

So – over to you! What tips and tricks have you picked up on Alerts over the years? What excited you about the changes to alerts in 11.5?

Want to learn more about Orion Alerts? Take a look at these resources:

Introduction to Alerts

Level 2 Training – Service Groups, Alerts and Dependencies (53 mins)

Building Complex Alert Conditions

Information about Advanced SQL Alerts

Alert Variables

32 Comments
Level 15

‌all excellent tips.

I would say the number one thing we do to lessen the alert noise is to alert at the correct level of object. By that I mean rather than keep the out-of-the-box alerts for application components, we disable those and add the field {components with problems} to our alert action email. This way we get one alert per application, which simply lists out the 1 or 20 components that are not up and their status. You can do the same for node/interfaces, node/volumes, nodes or apps/groups, or just about anything parent-child.

Just watch out as it silences secondary alerts till the parent status is reset.

An entire post could be written about timing as well.

Level 7

But is it possible to alert on the list of components that go down in a SINGLE alert instead of one alert for each component as well as include the list of components that went up in that application monitor since the last poll, the components that are still down, as well as a final alert when all the components in the application monitor are back up? This is so crucial to reduce noise but I unfortunately haven't found a good way to handle this.

Level 11

I think bluefunelemental's method would actually work quite well for you - have you tried basing your alert on Application, rather than on Component? When a component goes down, it will set the parent Application status to down as well. By using the {components with problems} field, you'll get the list of the components that are warning / down etc, and the alert will only reset once all components are up. What you won't see with this alert, though, is a second component going down during the first component's outage - you'll be alerted only when the Application status was changed by that first component going down. If you need it to alert for every component status change within this one alert, add a feature request / idea for it here.

Level 12

Great post, appreciate this as we have the 'too many alerts to take notice' issue. Please check out my question on your second tip: Group Dependencies - Parent/Child Objects would love to find out more about this, if you could shed some light that would be great.

Level 11

Great questions - I hope you don't mind me answering them here, as some of the questions you've brought up are ones we regularly get about how dependencies and parent nodes work, and it may help others.

Matt's question:

"So I recently found out the dependency feature (can't believe I never knew about this). So now I can create a dependency that sets up a parent object (in our case, a remote sites router) and the child object (in our case the sites group), so basically if the parent (router) goes down, the child (all the network nodes in the group) becomes 'unreachable' rather than down; suppressing alerts. Brilliant. This is exactly what we need. However, I had already set up groups for each remote site and the router (parent) sits inside of that group. The problem is, you can't do that. It states that the child can't be in the same group as the parent. Fair enough. Now I have found that if you remove the router from the group and set up the dependency, it works fine. Now... I am able to remove it from the group, set up the dependency and then add it back into the group, without it throwing up any errors that there is a dependency in place. My question is, does this nullify the dependency or will this still work?

I'm quite particular about my grouping and I've got it all set up neatly (finally) for each remote site (over 100 sites) and would like to leave the routers in the groups. If I can't, that's fine, i'll just create a group above the sites called 'routers' or something. But... if my above work around does work, i'd rather keep it like that.

Any thoughts would be appreciated. Thought I would throw this out there, before spending an age altering everything."

Glad you're liking the dependencies feature! So - you bring up two interesting questions here.

The first - whether it's possible to have a parent node within a child group - that's a definite no. You're essentially making that router dependent on itself, leading to an interesting conundrum if that router goes down. It sounds like you found a way around our blocks to make sure that situation can't happen - so, we'll need to fix that.

The second situation you discussed, was whether it would work to put all the routers in one group, and make the child sites dependent on that group. What that would actually mean for your alert is, every router in that parent group would need to be down for any alerts on the child site to be suppressed. If you place all routers within the one group, we assume that you're telling us you've got a few backup routes to that site, and that losing access to that site would require all nodes in the parent group to be down.

You'll need to set the parent for a site group to either the router on that site, or you could potentially use the WAN link out of that remote site as the parent.
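To illustrate why a single group holding all routers doesn't work as a parent, here is a small Python sketch of the behaviour described above, where the parent group only counts as down once every member is down. The function name and router names are made up for the example.

```python
def parent_group_is_down(router_up: dict) -> bool:
    """A parent group counts as down only when every router in it is down."""
    return bool(router_up) and not any(router_up.values())


# One shared 'routers' group as parent: site A's router fails, the others are fine.
all_routers = {"site-a-rtr": False, "site-b-rtr": True, "site-c-rtr": True}
print(parent_group_is_down(all_routers))  # parent still up, so site A alerts are not suppressed

# Site A depends only on its own router: the failure suppresses site A's child alerts.
site_a_parent = {"site-a-rtr": False}
print(parent_group_is_down(site_a_parent))
```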

Level 12

Right, great, thanks for clearing this up. I assumed that first one was a bug, as that didn't seem right. Secondly, what I meant by putting all the routers in one group wasn't to make the child dependent on the router group, but to make the child dependent on a router in the router group. It's basically a workaround, so that I can take all the routers out of the current groups, keep them organised and still set up dependencies that work. I'm guessing this would work, as the routers in the router group are not associated with the child object in any way, since they would be in their own separate group?

Level 11

Exactly right - once the site is dependent only on its site router (but not on your router group), it will work just as you want it.

Level 12

Great, I will get on this. Thanks for your help, this is going to help a lot when it comes to managing our alerts!

Level 15

We have the same thing in a site code group and another sitecode firewall group.

We learned the hard way that if you place a whole router in the group but just the WAN links go down, you will still have the original problem, and a whole site full of devices goes down instead of unreachable. Add just the WAN interfaces and set the group's status calculation to best, so it waits until both links in the pair are down before triggering; then it works as expected.

We did something fun that you might consider - I set up a dynamic query in the firewall or site router group driving off the custom properties sitecode and wan_link. Now rather than having to use the groups tool, our engineers simply mark links as wan_links and they get added to the firewall parent group. It's easy enough to do a traceroute and pretty much copy and paste.

Remember that if you have redundant links, then because of the best status you would need a new group for each hop, so a single hop doesn't get masked by the other up pairs before or after.

Make sense?

Level 12

Ye makes sense, may give this a try thanks bluefunelemental!

Level 10

I configured a 'Node Down'  alert using the new 'alert when a certain number of objects meet the alert criteria' feature. (Yay!! I've been waiting for this one). When the alert is triggered and shows up in All Active Alerts it shows the number of objects that triggered the alert under the 'Object that triggered this alert' column. I'd like to include that number in alert message (trigger and reset actions), but can't seem to locate that variable. Is that variable available? Is there an example an alert configured with the 'multiple objects' criteria I could look at?

Level 17

${N=Alerting;M=AlertTriggerCount} should get you the desired count.

Level 10

Thanks Rob, but it didn't give me the count I was expecting. I had two nodes go down, and it returned '55'. If I go to the 'All Active Alerts' view in the GUI, it shows '2 Nodes' under 'Objects that triggered this alert'. 55 may correlate to other devices on that network that were in an unmanaged state.

Level 13

Great tips. I've used Custom Properties to route issues to the relevant recipients for years already. I'll check out the other tips for 11.5 soon, thanks.

I'd add one simple tip for operators who are often flooded with alerts: use the sender address as an alert class. For example, A_alert for an alert that needs action, AA_alert for a higher-priority alert, and so on (R_alert for an alert reset). Also, put the most important variables in the alert message subject, so the operator doesn't need to open the message as often and can address the problem straight from the mailbox.

Level 9

We've found spike smoothing to be a good technique, especially for things like CPU monitoring - I'm not sure you can do this in Orion but setting a monitor to only fire when the error condition exists in say 2 out of 3 job iterations works well.

You still check a metric regularly but short spikes are ignored & alerts only fire for extended CPU flat-lining.
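The "2 out of 3" idea can be sketched in Python with a small sliding window. This is a generic illustration of the technique, not an Orion feature; the `SpikeSmoother` class, its defaults, and the sample CPU readings are all assumptions made for the example.

```python
from collections import deque


class SpikeSmoother:
    """Fire only when the condition was breached in `required` of the last `window` polls."""

    def __init__(self, window: int = 3, required: int = 2) -> None:
        self.required = required
        self.samples = deque(maxlen=window)  # oldest sample drops off automatically

    def observe(self, breached: bool) -> bool:
        """Record one poll result and report whether the alert should fire now."""
        self.samples.append(breached)
        return sum(self.samples) >= self.required


cpu_limit = 90.0
cpu_readings = [95.0, 40.0, 35.0, 97.0, 98.0]  # one short spike, then sustained load

smoother = SpikeSmoother(window=3, required=2)
fired = [smoother.observe(reading > cpu_limit) for reading in cpu_readings]
print(fired)
```

The single spike at the start never fires on its own; the alert only fires once two of the last three polls are over the limit.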

I'd also like to see advanced features where you can group alerts together in one ITIL type incident & manage the "incident" containing multiple alerts as one entity.

Level 7

Good Article.

I have a question around Custom Property Alerting.

Essentially I have a server which suffers from continual high CPU / processor use.

I have created a Custom Property called "Do_Not_Monitor_CPU", deployed this to my infrastructure servers, and set it to "yes" on the affected server.

My intention was then to add it to the Alerting policy that processes Processor / CPU alerts.

The problem I have is that the Alert I need to add this to, called "Alert me when a component goes into warning or critical state", doesn't only process CPU / processor alerts, so I would not just be disabling CPU alerts from this server, as was my intention.

FYI --  This is essentially "an out of the box" alert.

Any suggestions ??


Level 15

I would suggest your custom property to be "No" for normal and "Yes" for outlier as it will default to no.

Mine is called Mute_CPU.

You are correct about the alert, as it triggers on status regardless of reason: CPU, memory, virtual, metric, etc. I'm not in front of it right now, but is there a type field that shows the reporting object?

Level 15

I've seen and liked the incident concept but in our case we do that in ServiceNow.

Today I report on parent objects and join child to minimize counts of alerts. For example reporting on volumes or apps on a node so we only get one alert no matter how many objects alert. Plenty of downsides though.

Later this year I hope to look further into consolidating multiple alerts as events into a single SN incident.

Level 7

Are you able to screenshot your trigger policy ??

Regards,

Gawain (Robbie) Parry

parro008@gmail.com

Tel: 01722 790659

Mob: 07570 957990

Level 13

To cut down on the alert noise, and on the sheer number of alerts and alert conditions we manage, we need:

  • conditional trigger actions with access to all variables in trigger conditions
  • conditional escalation logic - again, with access to all variables in trigger conditions
  • site-wide trigger actions conditions


E.g.

  • if "node tier" is equal to 1, keep sending alerts every hour. For all others, once is enough.
  • if the status is "unknown" for only one polling interval, do not send an alert; otherwise do
  • if the status is "unknown" for N consecutive polls (or N minutes), send a "critical" alert rather than just a "warning"; do not change anything for status other than "unknown"
  • site-wide (rather than in each and every alert) conditions such as "no alerts for objects that are 'in maintenance' or 'powered down'".

Until then, we are forced to maintain separate alerts for each set of custom properties or conditional trigger actions - and that's unmanageable - as in, not a working solution.

Level 7

Hey there!

Can you elaborate a little more on your "It's easy enough to do a traceroute and pretty much copy and paste." comment and also a little more on the new group for each hop for a redundant link?  We have redundant links to our remote locations and I am trying to figure out how to monitor them.

Thanks!

Jake

Level 12

I made a custom property to tier the alerts as needed. Alert levels high-low 1-3. It was hard placing all nodes from the start, but worth it now.

It's easy to demote a device to a lower level if you know it will not be reaching the performance metrics for level 1 regularly and it's a non-factor.

Level 8

Hello,

Thanks for the great article!  It helped me get a couple (directly connected) sites setup with the reduced noise.  We also have several sites connected using ASE (AT&T Switched Ethernet).  They are pretty much connected with a cloud for the lack of a better word.  How would you recommend us setting this up to get the same result? 

Thank you in advance for any help.

Level 11

Hey mesteiger

It really depends on your environment, and what you need to achieve - and also on what's causing the noise. Submit a ticket with our support folks, who would be able to have a deeper conversation with you about what steps you could take to get the most out of your alerts.

MVP

Tip #0 - Get to know your noise first

Repetitive Email Alerts (Noise) - MUST HAVE REPORT!!!

Level 11

Nice one Alex!

Level 12

Good one Alex!

Level 7

In case the Exchange Server is down in the organization, no email can be sent. How can we set up text alerts to a cell phone so that it sends a paging sound?

Level 11

ssiddappa You can use third-party software called Notepage, hooked up with a modem, to achieve this. Orion integrates with this software, allowing you to send an SMS, page or beep someone.

Edit your alert, and on the "Trigger Actions" page, click "Add Action". You should see an option here for "Dial Paging or SMS Service". Click "Configure Action". If Notepage hasn't yet been installed, you'll get a link for downloading it. Install this on your Orion server, and follow the instructions for configuring SMS / paging here:

Set up Orion SMS alerts - SolarWinds Worldwide, LLC. Help and Support

Level 11

Hi Alex,

Great share!

I created a custom property called email1, and I would like to see it in the report, as we are using custom properties in the EMAIL TO field and want to see which email address we put in that custom property.

How can I make this possible? Adding ,n.email1 AS 'email1' to the script is not working when I update the script.

Thanks in advance!

MVP

Hi,

Instead of using simple text alerts (SMS) that will also require an SMS gateway, you should look at options like Slack

Alerting using #Slack

This step-by-step guide from Leon should help

The Incomplete Guide to Integrating SolarWinds Orion and Slack

A much cleaner option

Level 12

If Slack is too much of a culture shift, I think most people use email to send text alerts.

Key things for that are:

1. Each cell carrier has a different email address format. See: https://www.digitaltrends.com/mobile/how-to-send-a-text-from-your-email-account/

2. You get both the Subject and the Message body in the text when emailing to cell text. Often, that's duplicate info in the email. So I create separate actions for email vs email to cell phones.

=Foon=

About the Author
Hello! I'm Caroline, and I've been working for SolarWinds in various roles on the Support Team since 2007, and absolutely love it here. I started with SolarWinds as a Technical Support Engineer, a bright-eyed SNMP geek with a love of problem solving and a World of Warcraft habit. Some of you may remember me as Caroline Doyle when I was working tickets on the queues - these days, I go by my married name of Caroline Toomey and spend my days analyzing data - as the Support Team's Analyst, the customer experience is always at the heart of everything I do. Still haven't nixed that Warcraft habit though - I've added to it instead with Dragon Age and Elder Scrolls :)