Latest reply: Jun 1, 2011 7:58 AM by Leon Adato

Stop the Madness, Part III: ignoring Orion's built-in poller alerts in favor of custom OIDs

Leon Adato

This is the third in a series of posts where, in the name of giving back to the community, I'm going to share some of the customizations that make SolarWinds a little more robust for us and our customers.

First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.

One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer's premises and set up NATs for the devices they want (read "pay us") to monitor, and we're good to go. This is a perfect fit for our customer base: they don't want to divert resources into the ongoing investment in staff, software, and skills needed to set up an enterprise-wide monitoring and management solution (not to mention figuring out who's going to handle all those pesky tickets).

So our model, where we have many independent customers with different sets of values, different monitoring requirements, and so on, has driven us to come up with some customizations that focus on:

  • How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
  • How to set thresholds for devices when that could be different on nearly a device-by-device basis
  • How to ignore alerts based on the built-in monitors for CPU, RAM, etc. on older or closed-architecture devices where a custom OID gives better data

This post is going to look at our solution for the third bullet: how to ignore built-in SolarWinds values in favor of custom OIDs. You can find the discussion of the first item in "TIPS & TRICKS: Stop the madness! Avoiding alerts but continuing to pull statistics." and the second item in "TIPS & TRICKS: Stop The Madness: How to set alert thresholds per-device."

If you've been playing along at home, you now have custom fields and alert logic to mute nodes, interfaces, volumes, and maybe even specialized items like APM; you have fields (and associated alert logic) to allow custom alert thresholds for CPU, RAM, disk space, bandwidth, and whatever else makes your heart beat faster.

But then you run into a situation where the built-in SolarWinds pollers don't work correctly for a particular device. Usually, it's possible to set up a custom Universal Device Poller (UnDP), but that doesn't stop the default poller from spewing false alarms.

We have that situation with a series of old Cisco 6500s where the standard SW poller mis-reports CPU, and on some Linux-based appliances where the vendor has hacked the kernel to the point that the standard Linux OIDs show extremely... let's just say "festive" metrics. But because Orion detects the machine type as "net-snmp", it attempts to pull CPU, RAM, etc. using the standard OIDs.

The problem (with regard to the ALERT_CPU, ALERT_RAM, etc, custom fields described in part 2 of this series) is that they are all using the standard CPU_LOAD element as the point of comparison.

You COULD set the ALERT_CPU to some ridiculously high number, and then implement a custom alert. We tried that, but ran into two problems:

  1. It became difficult to figure out why an alert triggered. We'd see a CPU alert and then notice that the threshold was set to 105%, and things got really confusing until we realized the device in question used a custom CPU OID.
  2. Remember those Linux-based appliances I mentioned earlier? For some of them, Orion reports 200% RAM usage or more using the standard OID. Which is always exciting news in the Operations center. Good times, good times.
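To make the second problem concrete, here is a minimal Python sketch of the "ridiculously high threshold" workaround. The function name and the sample readings are illustrative assumptions; nothing here is Orion's own logic.

```python
def threshold_alert(reading_pct, threshold_pct):
    """Plain threshold comparison, standing in for the standard alert."""
    return reading_pct > threshold_pct

# Workaround: park the threshold at 105% so a sane 0-100 reading never trips it.
SENTINEL_THRESHOLD = 105

# A well-behaved device reporting 97% stays quiet, as intended.
quiet = threshold_alert(97, SENTINEL_THRESHOLD)

# An appliance whose standard OID reports 200% sails past the sentinel anyway.
noisy = threshold_alert(200, SENTINEL_THRESHOLD)

print(quiet, noisy)  # False True
```

No sentinel value is safe once the standard OID can return numbers above 100, which is why a separate flag is the cleaner fix.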

So we've implemented fields like OVR_STD_CPU and OVR_STD_RAM (simple Yes/No custom properties) to get around this. Effectively, this tells SolarWinds that a non-standard OID is being used as the key element, and thus the standard OID should be skipped.

In an alert, the simplified logic for STANDARD devices looks like this:

Where ALL of the following are true
  OVR_STD_CPU is not equal to YES
  CPU_LOAD is greater than 90

What we are saying is that we don't want a standard CPU alert if the OVR_STD_CPU is set to "Yes". The assumption is that there is also a custom alert in place to grab the correct value. And of course, THAT one will only fire for nodes that have the custom UnDP applied. The complete alert logic for standard devices (including muting and standard ALERT_CPU) would now look like this:

Where ANY of the following are true
  Where ALL of the following are true
     N_MUTE is not equal to YES
     OVR_STD_CPU is not equal to YES
     ALERT_CPU is empty
     CPU_LOAD is greater than 90
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is not empty
     OVR_STD_CPU is not equal to YES
     the field CPU_LOAD is greater than the field  ALERT_CPU
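
The same logic can be sketched in Python as a sanity check. This is only an illustration: the node is modeled as a plain dictionary keyed by the property names above, and both the helper names and the case-insensitive "yes" comparison are assumptions rather than anything the alert engine exposes.

```python
DEFAULT_CPU_THRESHOLD = 90  # fallback used when ALERT_CPU is empty

def is_yes(value):
    """Treat a Yes/No custom property as set only when it reads 'yes'."""
    return str(value or "").strip().lower() == "yes"

def standard_cpu_alert(node):
    """Return True when the standard CPU alert should fire for this node."""
    if is_yes(node.get("N_MUTE")):
        return False  # muted nodes never alert
    if is_yes(node.get("OVR_STD_CPU")):
        return False  # the standard CPU_LOAD value is not trusted here
    threshold = node.get("ALERT_CPU")
    if threshold is None:  # ALERT_CPU is empty: use the default
        threshold = DEFAULT_CPU_THRESHOLD
    return node["CPU_LOAD"] > threshold

print(standard_cpu_alert({"CPU_LOAD": 95}))                        # True
print(standard_cpu_alert({"CPU_LOAD": 95, "ALERT_CPU": 98}))       # False
print(standard_cpu_alert({"CPU_LOAD": 200, "OVR_STD_CPU": "Yes"})) # False
```

The two nested "ALL" groups collapse into one function because they differ only in which threshold they compare against.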

This ensures that the standard CPU alert will NEVER trigger for a node where OVR_STD_CPU is set to "Yes". Then we can set up a different alert that uses the custom OID along with the existing MUTE and ALERT_xxx logic. Of course, it will only trigger when the custom OID has been applied to a node.

Where ANY of the following are true
  Where ALL of the following are true
     N_MUTE is not equal to YES
     OVR_STD_CPU is equal to YES
     ALERT_CPU is empty
     [custom CPU poller value] is greater than 90
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is not empty
     OVR_STD_CPU is equal to YES
     the field [custom CPU poller value] is greater than the field ALERT_CPU
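
A rough Python sketch of the custom-OID alert follows. It is an illustration under stated assumptions: the function name and the `custom_cpu_pct` argument are hypothetical, the node is a plain dictionary of custom properties, and the sketch deliberately keys on whether a custom reading exists for the node (None means no custom poller is assigned), which is what "it will only trigger when the custom OID was applied" amounts to.

```python
DEFAULT_CPU_THRESHOLD = 90  # fallback used when ALERT_CPU is empty

def custom_cpu_alert(node, custom_cpu_pct):
    """Return True when the custom-OID CPU alert should fire.

    custom_cpu_pct is the latest reading from the custom poller,
    or None when no custom poller is assigned to the node.
    """
    if custom_cpu_pct is None:
        return False  # no custom OID on this node, so nothing to evaluate
    if str(node.get("N_MUTE") or "").strip().lower() == "yes":
        return False  # muted nodes never alert
    threshold = node.get("ALERT_CPU")
    if threshold is None:  # ALERT_CPU is empty: use the default
        threshold = DEFAULT_CPU_THRESHOLD
    return custom_cpu_pct > threshold

print(custom_cpu_alert({"OVR_STD_CPU": "Yes"}, 96))                   # True
print(custom_cpu_alert({"OVR_STD_CPU": "Yes", "ALERT_CPU": 98}, 96))  # False
print(custom_cpu_alert({}, None))                                     # False
```

The mute flag and the per-device threshold are reused unchanged; only the source of the CPU reading differs from the standard alert.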

If you've read this far...

I'm going to indulge in just one paragraph of self-promotion. Sentinel Technologies provides a remote monitoring and management service which is based on nearly the full SolarWinds suite (NPM, NCM, APM, IPSLA and Netflow) but includes significant customizations such as those discussed in this series. In addition, we've integrated extensive SNMP trap and syslog filters, an event correlation engine, pop-up alerting, and a knowledge base to enhance our service even more. All of this is backed by a 24-hour Network Operations Center as well as options for hardware support, sparing, and proactive maintenance. Contact Sentinel at 800.769.4343 or http://www.sentinel.com/ to find out more.


Leon Adato is a monitoring engineer at Sentinel Technologies. Sentinel is an independent technology company providing integrated, customized IT solutions including remote systems Monitoring and Management. Find out more at http://www.sentinel.com/