
TIPS & TRICKS: Stop The Madness: How to set alert thresholds per-device

This is the second in a series of posts where, in the name of giving back to the community, I'm going to share some of the customizations that make SolarWinds a little more robust for us and our customers.

First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.

One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer's premises and set up NATs for the devices they want us (read "pay us") to monitor, and we're good to go. This is a perfect fit for our customer base, where they don't want to divert resources for the ongoing investment in staff, software, and skills to set up an enterprise-wide monitoring and management solution (not to mention figuring out who's going to handle all those pesky tickets).

So our model - where we have many independent customers with different sets of values, different monitoring requirements and so on - has driven us to come up with some customizations that focus on:

  • How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
  • How to set thresholds for devices when that could be different on nearly a device-by-device basis
  • How to ignore alerts based on the built-in monitors for CPU/RAM, etc. on older or closed-architecture devices where a custom OID gives better data

This post is going to look at our solution for the second bullet - how to set thresholds for devices on a device-by-device basis. You can find the discussion about the first item here.

If you've worked with SolarWinds alerts for more than 15 minutes, you probably already know the slippery slope it presents. You start by setting an alert for CPU with a pretty logical threshold of "> 90% for 10 minutes". Soon after that, one of two things happens (or both, depending on your environment):

  1. Device "owners" complain about all the events you are missing because the threshold is too high
  2. The people receiving alerts complain they are getting too many false alarms because the threshold is too low

About this time you realize that various devices (depending on their machine type, OS, role, or even the specifics of that particular system) require custom thresholds.

So you start copying alerts and modifying them. And when you turn around, you realize you've got 237 different "high CPU" alerts and the logic of each of them ("machine type = "Windows" and IP_Address contains 1.2.3 and (custom field) IS_IMPORTANT = 1 and....") is enough to constipate Einstein.

In a fit of pique during a monitoring review meeting, you throw your hands up in the air and say "why don't I set up a separate threshold for Every. Flipping. Device?!?!?!"

Assuming you retained employment at your company after that outburst, I want to let you in on a secret:
You can.

The key here, much like the one presented earlier for muting, is a couple of custom fields and a little bit of Alert logic.

The Custom Fields

You can call them anything you want, but they should be numeric. Here at Sentinel, we've got ALERT_CPU, ALERT_RAM and ALERT_VOL. The first two go in the nodes table, the last one (logically enough) goes in the volume table.

The Alert Logic

Now we can alert on individualized thresholds for those elements on a node-by-node basis, leveraging the alert system's "complex conditions" option: "where (field or value) xxx is greater/less/equal to (field or value) yyy".

The alert logic for CPU would look something like this:

Where ANY of the following are true
  Where ALL of the following are true
     ALERT_CPU is empty
     CPU_LOAD is greater than 90
  Where ALL of the following are true
     ALERT_CPU is not empty
     the field CPU_LOAD is greater than the field ALERT_CPU

This has the effect of setting a default threshold for any device that doesn't have a specific value in the custom alert field (that's the first "Where ALL" section). If the device DOES have a value, CPU_LOAD is compared against whatever number is in the field ALERT_CPU instead (that's the second "Where ALL" section).
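
If it helps to see that decision as code, here is a minimal Python sketch of the same logic. It is only an illustration, not anything SolarWinds actually runs: the function name cpu_alert_fires and the constant DEFAULT_CPU_THRESHOLD are made up for this example, and None stands in for an empty custom field.

```python
from typing import Optional

# Hypothetical name; 90 is the default from the first "Where ALL" block.
DEFAULT_CPU_THRESHOLD = 90.0


def cpu_alert_fires(cpu_load: float, alert_cpu: Optional[float]) -> bool:
    """Return True when the CPU trigger condition described above is met."""
    if alert_cpu is None:
        # ALERT_CPU is empty: fall back to the default threshold.
        return cpu_load > DEFAULT_CPU_THRESHOLD
    # ALERT_CPU is not empty: compare CPU_LOAD to the per-device value.
    return cpu_load > alert_cpu
```

So a node with no ALERT_CPU value alerts above 90, while a node with ALERT_CPU set to 80 alerts above 80 and never consults the default.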

For those who are following along from my previous article, here's the logic that includes the "mute" options:

Where ANY of the following are true
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is empty
     CPU_LOAD is greater than 90
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is not empty
     the field CPU_LOAD is greater than the field ALERT_CPU

This is also useful if you want to MUTE just one element - say CPU. You have a device that simply "runs hot". You don't want CPU alerts, but you also don't want to mute the whole node, because you still want RAM alerts, interface alerts, etc. Set the ALERT_CPU to 105, and you will continue to collect CPU stats, but (since the CPU can never go above 100), you won't ever get a CPU alert.
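
A rough Python sketch of the muted version, including the "set it to 105" trick, might look like the following. The function name cpu_alert_fires_with_mute is hypothetical, None stands in for an empty field, and treating an empty N_MUTE as "not muted" is an assumption of this sketch rather than a statement about how the alert engine evaluates empty values.

```python
from typing import Optional


def cpu_alert_fires_with_mute(
    cpu_load: float,
    alert_cpu: Optional[float],
    n_mute: Optional[str],
    default_threshold: float = 90.0,
) -> bool:
    """Sketch of the CPU trigger logic with the node-level mute check added."""
    if n_mute == "YES":
        # Whole node is muted: no alert, stats are still collected elsewhere.
        return False
    # Use the per-device threshold when one is set, else the default.
    threshold = default_threshold if alert_cpu is None else alert_cpu
    return cpu_load > threshold


# The "runs hot" device: ALERT_CPU = 105 can never be exceeded by a
# percentage, so the CPU alert stays quiet while the node is not muted.
runs_hot_alerts = cpu_alert_fires_with_mute(100.0, 105.0, None)
```

Here runs_hot_alerts is False even at 100% CPU, which is exactly the single-element mute described above.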

IN THE NEXT (AND FINAL) POST: How to ignore built-in alerts for CPU, RAM, etc. in favor of custom OIDs.


Leon Adato is a monitoring engineer at Sentinel Technologies. Sentinel is an independent technology company providing integrated, customized IT solutions including remote systems Monitoring and Management. Find out more at http://www.sentinel.com/

Leon Adato | Head Geek
------
"Measure what is measurable,
and make measurable what is not so." - Galileo

23 Replies

This post really helped me out. However, in the newer version the Alert Manager is very different, so I thought it might help to post the newer Alert Language:

(In my case I was creating Volume size exceptions)

At least one child condition must be satisfied (OR)

     All child conditions must be satisfied (AND)

          VOLUME Alert_VOL is empty

          VOLUME Volume Type is equal to Fixed Disk

          VOLUME Volume Percent Available is less than 10 %

   or

     All child conditions must be satisfied (AND)

          VOLUME Alert_VOL is not empty

          VOLUME Volume Type is equal to Fixed Disk

          VOLUME Volume Percent Available is less than Volume ALERT_VOL  <-This one requires a Double Value Comparison
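
The volume version flips the comparison (alert when free space drops BELOW the threshold) and adds a volume-type filter. A small Python sketch of the conditions listed above, assuming a hypothetical function name volume_alert_fires and None for an empty Alert_VOL:

```python
from typing import Optional


def volume_alert_fires(
    volume_type: str,
    percent_available: float,
    alert_vol: Optional[float],
    default_threshold: float = 10.0,
) -> bool:
    """Sketch of the volume trigger: low free space on fixed disks only."""
    if volume_type != "Fixed Disk":
        # Both condition groups require Volume Type = Fixed Disk.
        return False
    # Per-volume threshold when Alert_VOL is set, else the 10 % default.
    threshold = default_threshold if alert_vol is None else alert_vol
    return percent_available < threshold
```

For example, a fixed disk with 8% available and no Alert_VOL value would trigger, while the same disk with Alert_VOL set to 5 would not.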



Thanks for keeping this fresh


good 2nd posting


😉

After the move to the new thwack forum, I didn't want stuff to get lost in the mists of time.

Leon Adato | Head Geek


I did the same thing in my environment as well. We have over 900 monitored nodes, and I set up the alert thresholds based on a 1-year average. I wanted to group them into 20%-wide pools to simplify the effective thresholds. I used the following formula in Excel:

=CEILING(CPU_AVG/20,1)*20


This of course was not without faults. I found two cases where manual intervention was needed:

1. For some devices, the 1-year average CPU turned out to be higher than their normal operating conditions. This can be easily fixed by using a percentile value instead of the average.

2. Some devices kept spiking above their average value even with the significant headroom allowed by the 20% pools. They were exhibiting periodic fluctuations based on user activity. In the end we manually raised their thresholds, because you shouldn't be alerted to something if there is no follow-up action for it, and in this case the periodic fluctuations didn't seem to indicate any trouble.
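
For anyone who would rather script this than use Excel, here is a rough Python equivalent of the pooling formula, plus a percentile helper for case 1. The function names are made up for this sketch, and the choice of the 95th percentile is an assumption for illustration; the post above doesn't say which percentile to use.

```python
import math
import statistics
from typing import Sequence


def bucket_threshold(cpu_value: float, pool_width: int = 20) -> int:
    """Round a CPU figure up to the next multiple of pool_width.

    Mirrors the Excel formula =CEILING(CPU_AVG/20,1)*20.
    """
    return math.ceil(cpu_value / pool_width) * pool_width


def percentile_95(samples: Sequence[float]) -> float:
    """95th percentile of the samples (needs at least two data points).

    statistics.quantiles with n=100 returns 99 cut points; index 94 is
    the 95th percentile.
    """
    return statistics.quantiles(samples, n=100)[94]


# A 1-year average of 37% lands in the 40% pool.
threshold_from_avg = bucket_threshold(37.0)
```

To base a threshold on a percentile instead of the average, feed the helper's result through the same pooling step: bucket_threshold(percentile_95(samples)).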


These are great Leon,

Can't wait to see the next one you have up your sleeve.

Will definitely be sharing these with our other customers.

Thanks,


Having trouble with the actual alert conditions for the Volume usage

Here is my alert

As soon as I fill out the highlighted field, the other field in the same condition reverts back to *...I then set that back to Volume Space available, and then the highlighted field goes back to *...It only accepts one field statement in the condition.  Any thoughts?


I have a similar problem, but mine pertains to CPU_Load and my custom property ALERT_CPU.

 

I created a custom property "Alert_CPU" (configured as Property Type "Integer Number")

 

Added "80" to the custom property ALERT_CPU on a test system

 

When I try to configure the advanced alert, I am unable to create the comparison between "CPU_Load" and custom field "ALERT_CPU": either the first or the second field resets to *

 

I thought the logic was correct, "field CPU_Load is greater than field ALERT_CPU" and this appears to be a similar configuration as others have in place.

- Andy


I'm not in front of my system right now, but I would create a new field (because you can't change them - 😞  grrr! ) that is the long number format rather than integer.

That's good for two reasons:

1) it's how I have mine set up
2) it lets you set a threshold when the level is at >95.5 instead of a straight 96 (or whatever). Maybe not a big deal for CPU, but certainly useful for disk when you are dealing with terabytes.

Let me know how that works.

Leon Adato | Head Geek


It would be nice if you could change them; I have a few that could be updated to make things cleaner. That said, when I created the new field ALERT_CPU, the property types appeared limited.

Maybe I have an issue to open a case for review?


Pick "Floating Point" number and see how that works.

Leon Adato | Head Geek


Tried this, the field values continue to reset to default.


I'm clean out of ideas. Time to open a ticket! Please remember to come back to this thread and let us know how it turned out.

Leon Adato | Head Geek


I have submitted a case. Will update the thread when the issue is determined. Thanks for the tips!


Did this ever get resolved?  I'm having the same problem...

Never mind...creating a new custom property and selecting the floating point value seems to have done the trick.  I also exited out of Advanced Alert Manager prior to trying to create the trigger conditions again...don't know if that had anything to do with it...


JoelGarnick: Glad your issue was resolvable. I would definitely close the alert window before expecting it to pick up new custom fields you created in that utility (i.e., don't do both activities at the same time).

YourVillageIdiot: Any update from that ticket? Just curious.

Leon Adato | Head Geek


MDRISKELL: A couple of things I noticed:

First, I would change the field in the last line of your graphic to "Volume Percent Used", the same as the condition group above. Just for consistency's sake. You don't HAVE to, but it's the way I'd do it. That also may be why you can't compare the two fields (maybe a type mismatch).

Second, I want to double-check that you made the ALERT_VOL field numeric, not text. Again, that might be why you can't do a "greater than/less than" comparison.

Finally, it looks like your first condition group (ALERT_VOL is empty, etc.) is on the same level as your first "Trigger alert when ALL of the following apply". You need to click the ellipsis (... button) next to "Trigger alert when ALL of the following apply" and pick "simple condition" so that those filters are nested inside the ALL and not inside the ANY block.

Let me know if any of this isn't making sense.

Leon Adato | Head Geek



MDRISKELL: A couple of things I noticed:

First, I would change the field in the last line of your graphic to "Volume Percent Used", the same as the condition group above. Just for consistency's sake. You don't HAVE to, but it's the way I'd do it. That also may be why you can't compare the two fields (maybe a type mismatch).



I specifically want it as a percentage if the field isn't filled out, due to the large storage we have available on our systems.  I am using actual space free for the alerts where it is defined.

Second, I want to double-check that you made the ALERT_VOL field numeric, not text. Again, that might be why you can't do a "greater than/less than" comparison



This was the problem on the condition.  Now it is accepting it as valid.

Finally, it looks like your first condition group (ALERT_VOL is empty, etc.) is on the same level as your first "Trigger alert when ALL of the following apply". You need to click the ellipsis (... button) next to "Trigger alert when ALL of the following apply" and pick "simple condition" so that those filters are nested inside the ALL and not inside the ANY block.



I recreated the alert just to capture the screenshot for the post...in doing so I had the formats wrong.  I have corrected it.

 

Thanks for the help.


Thanks for the tips.  Now I need a system to track all the custom fields and ensure they are set at the correct values.  🙂

Adatole,

First of all I want to say great post.  I just found this thread yesterday.  Until I changed employers a month ago, I managed a NOC where we provided much the same type of managed services you described.  I had created several custom alerts based on different thresholds.

If only I had found this, or thought of the idea myself, when I started using SolarWinds over 5 years ago. Oh, the hours I could have saved.

I am in the beginning stages of implementing SolarWinds at my new employer to monitor our internal network, and you can bet money that I will be using this tip from the start to make my job much easier.

Fantastic post!!!!!
