TIPS & TRICKS: Stop The Madness: How to set alert thresholds per-device

This is the second in a series of posts where, in the name of giving back to the community, I'm going to share some of the customizations that make SolarWinds a little more robust for us and our customers.

First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.

One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer's premises and set up NATs for the devices they want (read "pay us") to monitor, and we're good to go. This is a perfect fit for our customer base, who don't want to divert resources to the ongoing investment in staff, software, and skills needed to set up an enterprise-wide monitoring and management solution (not to mention figuring out who's going to handle all those pesky tickets).

So our model, where we have many independent customers with different sets of values, different monitoring requirements, and so on, has driven us to come up with some customizations that focus on:

  • How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
  • How to set thresholds for devices when they could differ on nearly a device-by-device basis
  • How to ignore alerts based on the built-in monitors for CPU/RAM, etc. on older or closed-architecture devices where a custom OID gives better data

This post is going to look at our solution for the second bullet - how to set thresholds for devices on a device-by-device basis. You can find the discussion about the first item here.

If you've worked with SolarWinds alerts for more than 15 minutes, you probably already know the slippery slope it presents. You start by setting an alert for CPU with a pretty logical threshold of "> 90% for 10 minutes". Soon after that, one of two things happens (or both, depending on your environment):

  1. Device "owners" complain about all the events you are missing because the threshold is too high
  2. The people receiving alerts complain they are getting too many false alarms because the threshold is too low

About this time you realize that various devices (depending on their machine type, OS, role, or even the specifics of that particular system) require custom thresholds.

So you start copying alerts and modifying them. And when you turn around, you realize you've got 237 different "high CPU" alerts and the logic of each of them ("machine type = "Windows" and IP_Address contains 1.2.3 and (custom field) IS_IMPORTANT = 1 and....") is enough to constipate Einstein.

In a fit of pique during a monitoring review meeting, you throw your hands up in the air and say "why don't I set up a separate threshold for Every. Flipping. Device?!?!?!"

Assuming you retained employment at your company after that outburst, I want to let you in on a secret:
You can.

The key here, much like the one presented earlier for muting, is a couple of custom fields and a little bit of Alert logic.

The Custom Fields

You can call them anything you want, but they should be numeric. Here at Sentinel, we've got ALERT_CPU, ALERT_RAM and ALERT_VOL. The first two go in the nodes table, the last one (logically enough) goes in the volume table.

The Alert Logic

Now we can alert on individualized thresholds for those elements on a node-by-node basis, leveraging the alert system's "complex conditions" option: "where (field or value) xxx is greater/less/equal to (field or value) yyy".

The alert logic for CPU would look something like this:

Where ANY of the following are true
  Where ALL of the following are true
     ALERT_CPU is empty
     CPU_LOAD is greater than 90
  Where ALL of the following are true
     ALERT_CPU is not empty
     the field CPU_LOAD is greater than the field ALERT_CPU

This has the effect of setting a default threshold for any device that doesn't have a specific value in the custom alert field (that's the first "Where ALL" section). But if the device DOES have a value, the second section compares CPU_LOAD to whatever number is in the ALERT_CPU field.
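
For anyone who thinks better in code than in nested condition groups, here is a rough Python sketch of the same decision. This is not anything SolarWinds runs; the function name, the constant, and the use of None to stand in for an empty custom field are all assumptions made purely for illustration.

```python
from typing import Optional

# Default from the post: used only when ALERT_CPU is empty.
DEFAULT_CPU_THRESHOLD = 90


def cpu_alert_fires(cpu_load: float, alert_cpu: Optional[float]) -> bool:
    """Hypothetical model of the 'Where ANY' block with two 'Where ALL' groups."""
    # Group 1: ALERT_CPU is empty AND CPU_LOAD is greater than 90
    default_group = alert_cpu is None and cpu_load > DEFAULT_CPU_THRESHOLD
    # Group 2: ALERT_CPU is not empty AND CPU_LOAD is greater than ALERT_CPU
    custom_group = alert_cpu is not None and cpu_load > alert_cpu
    # 'Where ANY of the following are true'
    return default_group or custom_group


# A node with no custom value falls back to the default threshold.
print(cpu_alert_fires(95, None))
# A node with ALERT_CPU = 80 alerts earlier than the default would.
print(cpu_alert_fires(85, 80))
```

The two groups can never both be true for the same node, because one requires the field to be empty and the other requires it to be filled in. That is what keeps the default from firing on devices that have their own threshold.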

For those who are following along from my previous article, here's the logic that includes the "mute" options:

Where ANY of the following are true
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is empty
     CPU_LOAD is greater than 90
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is not empty
     the field CPU_LOAD is greater than the field ALERT_CPU

This is also useful if you want to MUTE just one element - say CPU. You have a device that simply "runs hot". You don't want CPU alerts, but you also don't want to mute the whole node, because you still want RAM alerts, interface alerts, etc. Set the ALERT_CPU to 105, and you will continue to collect CPU stats, but (since the CPU can never go above 100), you won't ever get a CPU alert.
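
Here is a small Python sketch of the muted version of the logic, including the "set it to 105" trick. As before, this only models the conditions described in this post. The function name is hypothetical, None stands in for an empty custom field, and I am assuming an unset N_MUTE counts as "not equal to YES", which may not match how your alert engine treats empty values.

```python
from typing import Optional

# Default from the post: used only when ALERT_CPU is empty.
DEFAULT_CPU_THRESHOLD = 90


def cpu_alert_fires_with_mute(
    cpu_load: float,
    alert_cpu: Optional[float],
    n_mute: Optional[str],
) -> bool:
    """Hypothetical model of the muted CPU alert conditions."""
    # Both 'Where ALL' groups require N_MUTE is not equal to YES,
    # so a muted node can never satisfy either group.
    if n_mute == "YES":
        return False
    # Empty ALERT_CPU means the default applies; otherwise use the node's value.
    threshold = DEFAULT_CPU_THRESHOLD if alert_cpu is None else alert_cpu
    return cpu_load > threshold


# A muted node stays quiet even at 99% CPU.
print(cpu_alert_fires_with_mute(99, None, "YES"))
# With ALERT_CPU = 105, no load from 0 to 100 can trigger the alert.
print(any(cpu_alert_fires_with_mute(load, 105, None) for load in range(101)))
```

Collapsing the two groups into a single threshold lookup gives the same result as the nested conditions, since exactly one group applies to any given node.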

IN THE NEXT (AND FINAL) POST: How to ignore built-in alerts for CPU, RAM, etc. in favor of custom OIDs.


Leon Adato is a monitoring engineer at Sentinel Technologies. Sentinel is an independent technology company providing integrated, customized IT solutions including remote systems Monitoring and Management. Find out more at http://www.sentinel.com/

  • Thumbs all the way up...great post.

  • Thanks for the tips.  Now I need a system to track all the custom fields and ensure they are set at the correct values.  :-)

  • Adatole,

    First of all, I want to say great post.  I just found this thread yesterday.  Until I changed employers a month ago, I managed a NOC where we provided much the same type of managed services you described.  I had created several custom alerts based on different thresholds.

    If only I had found this, or thought of this idea myself, over 5 years ago when I started using SolarWinds. Oh, the hours I could have saved.

    I am in the beginning stages of implementing SolarWinds at my new employer to monitor our internal network, and you can bet money that I will be using this tip from the start to make my job much easier.

    Fantastic post!!!!!

  • Thanks....I had seen that one this morning as well.

    These got me thinking about some tricks I have used throughout the years with the product.  May have to think on it a little and then post a couple.

  • These are great Leon,

    Can't wait to see the next one you have up your sleeve.

    Will definitely be sharing these with our other customers.

    Thanks,

  • Having trouble with the actual alert conditions for the Volume usage

    Here is my alert

    As soon as I fill out the highlighted field, the other field in the same condition reverts back to *. I then set that back to Volume Space Available, and then the highlighted field goes back to *. It only accepts one field statement in the condition.  Any thoughts?

  • MDRISKELL: A couple of things I noticed:

    First, I would change the field in the last line of your graphic to "Volume Percent Used", the same as the condition group above. Just for consistency's sake. You don't HAVE to, but it's the way I'd do it. That also may be why you can't compare the two fields (maybe a type mismatch).

    Second, I want to double-check that you made the ALERT_VOL field numeric, not text. Again, that might be why you can't do a "greater than/less than" comparison.

    Finally, it looks like your first condition group (ALERT_VOL is empty, etc.) is on the same level as your first "Trigger alert when ALL of the following apply". You need to click the ellipsis (... button) next to "Trigger alert when ALL of the following apply" and pick "simple condition" so that those filters are nested inside the ALL and not inside the ANY block.

    Let me know if any of this isn't making sense.



  • MDRISKELL: A couple of things I noticed:

    First, I would change the field in the last line of your graphic to "Volume Percent Used", the same as the condition group above. Just for consistency's sake. You don't HAVE to, but it's the way I'd do it. That also may be why you can't compare the two fields (maybe a type mismatch).



    I specifically want it as a percentage if the field isn't filled out, due to the large storage we have available on our systems.  I am using actual space free for the alerts where it is defined.

    Second, I want to double-check that you made the ALERT_VOL field numeric, not text. Again, that might be why you can't do a "greater than/less than" comparison.



    This was the problem on the condition.  Now it is accepting it as valid.

    Finally, it looks like your first condition group (ALERT_VOL is empty, etc.) is on the same level as your first "Trigger alert when ALL of the following apply". You need to click the ellipsis (... button) next to "Trigger alert when ALL of the following apply" and pick "simple condition" so that those filters are nested inside the ALL and not inside the ANY block.



    I recreated the alert just to capture the screenshot for the post, and in doing so I had the formats wrong.  I have corrected them.

     

    Thanks for the help.

  • I have a similar problem, but mine pertains to CPU_Load and my custom property ALERT_CPU.

     

    I created a custom property "Alert_CPU" (configured as Property Type "Integer Number")

     

    Added "80" to the custom property ALERT_CPU on a test system

     

    When I try to configure the advanced alert, I am unable to create the comparison between "CPU_Load" and custom field "ALERT_CPU": either the first or the second field resets to *

     

    I thought the logic was correct, "field CPU_Load is greater than field ALERT_CPU", and this appears to be similar to the configuration others have in place.

    - Andy