This is the third in a series of posts where, in the name of giving back to the community, I'm going to share some of the customizations that make SolarWinds a little more robust for us and our customers.
First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.
One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer's premises and set up NATs for the devices they want us (read "pay us") to monitor, and we're good to go. This is a perfect fit for customers who don't want to divert resources to the ongoing investment in staff, software, and skills needed to set up an enterprise-wide monitoring and management solution (not to mention figuring out who's going to handle all those pesky tickets).
So our model, where we have many independent customers with different sets of values, different monitoring requirements, and so on, has driven us to come up with some customizations that focus on:
- How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
- How to set thresholds for devices when those thresholds could differ on nearly a device-by-device basis
- How to ignore alerts based on the built-in monitors for CPU, RAM, etc. on older or closed-architecture devices where a custom OID gives better data
This post is going to look at our solution for the third bullet: how to ignore built-in SolarWinds values in favor of custom OIDs. You can find the discussion of the first item in "TIPS & TRICKS: Stop the madness! Avoiding alerts but continuing to pull statistics" and the second item in "TIPS & TRICKS: Stop The Madness: How to set alert thresholds per-device".
If you've been playing along at home, you now have custom fields and alert logic to mute nodes, interfaces, volumes, and maybe even specialized items like APM; you have fields (and associated alert logic) to allow custom alert thresholds for CPU, RAM, disk space, bandwidth, and whatever else makes your heart beat faster.
But then you run into a situation where the built-in SolarWinds pollers don't work correctly for a particular device. Of course you can set up a custom Universal Device Poller (UnDP), but that doesn't stop the default poller from spewing false alarms.
We have that situation with a series of old Cisco 6500s, where the standard SW poller misreports CPU, and on some Linux-based appliances, where the vendor has locked out the standard Linux OIDs in favor of their own. Because Orion detects the machine type as "net-snmp", it still attempts to pull CPU, RAM, etc. using the standard OIDs.
The problem (with regard to the ALERT_CPU, ALERT_RAM, etc, custom fields described in part 2 of this series) is that they are all using the standard CPU_LOAD element to compare against.
Of course, you COULD set the ALERT_CPU to some ridiculously high number, and then implement a custom alert. We did, but ran into two problems:
- It became difficult to figure out why an alert triggered. We'd see a CPU alert and then notice that the threshold was set to 105%, and things got really confusing until we realized the device in question used a custom CPU OID.
- Remember those Linux-based appliances I mentioned earlier? On some of them the standard CPU OID reports 200% or more. Which always makes for jolly good times in the Ops center when they see THAT gauge on the screen.
So we've implemented OVR_STD_CPU and OVR_STD_RAM fields (both simple Yes/No custom properties) to get around this. Effectively, this tells SolarWinds that a non-standard OID is being used as the key element, and the standard OID should be skipped.
Where ALL of the following are true
    OVR_STD_CPU is not equal to YES
    CPU_LOAD is greater than 90
The complete alert logic (including muting and standard ALERT_CPU) would now look like this:
Where ANY of the following are true
    Where ALL of the following are true
        N_MUTE is not equal to YES
        OVR_STD_CPU is not equal to YES
        ALERT_CPU is empty
        CPU_LOAD is greater than 90
    Where ALL of the following are true
        N_MUTE is not equal to YES
        ALERT_CPU is not empty
        OVR_STD_CPU is not equal to YES
        the field CPU_LOAD is greater than the field ALERT_CPU
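If it helps to see that condition tree as plain logic, here is a minimal Python sketch of the same decision. It is only an illustration, not anything SolarWinds runs: the `Node` class and the `standard_cpu_alert` function are hypothetical names, an empty ALERT_CPU is modeled as `None`, and treating "YES" as case-insensitive is an assumption on my part.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_CPU_THRESHOLD = 90  # percent; the fallback used when ALERT_CPU is empty


@dataclass
class Node:
    """Hypothetical stand-in for a node record; not a SolarWinds API object."""
    cpu_load: float                    # CPU_LOAD from the standard poller
    n_mute: str = ""                   # N_MUTE custom property
    ovr_std_cpu: str = ""              # OVR_STD_CPU custom property
    alert_cpu: Optional[float] = None  # ALERT_CPU; None models "is empty"


def standard_cpu_alert(node: Node) -> bool:
    """Return True when the standard CPU alert should trigger."""
    # Both ALL-groups require that the node is not muted...
    if node.n_mute.upper() == "YES":
        return False
    # ...and that the standard OID has not been overridden.
    if node.ovr_std_cpu.upper() == "YES":
        return False
    # The two groups differ only in which threshold applies.
    if node.alert_cpu is None:
        threshold = DEFAULT_CPU_THRESHOLD
    else:
        threshold = node.alert_cpu
    return node.cpu_load > threshold
```

With this sketch, an appliance whose standard OID reports 200% but has OVR_STD_CPU set to YES stays quiet, while an ordinary node at 95% with no ALERT_CPU value still alerts.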
This ensures that the standard CPU alert will NEVER trigger for the node in question. Then we can set up a different alert that uses the custom OID along with the existing MUTE and ALERT_xxx logic. Of course, it will only trigger when the custom OID has been applied to a node.
Where ANY of the following are true
    Where ALL of the following are true
        N_MUTE is not equal to YES
        OVR_STD_CPU is not equal to YES
        ALERT_CPU is empty
        [custom OID value] is greater than 90
    Where ALL of the following are true
        N_MUTE is not equal to YES
        ALERT_CPU is not empty
        OVR_STD_CPU is not equal to YES
        the field [custom OID value] is greater than the field ALERT_CPU
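The mute and threshold parts of the custom-OID alert can be sketched the same way. Again, this is a rough Python illustration under stated assumptions: `custom_cpu_alert` is a hypothetical name, the custom poller's reading is passed in as `custom_cpu`, and `None` stands for "no custom poller assigned to this node". The sketch deliberately does not model the OVR_STD_CPU condition; it only shows how the custom value reuses the N_MUTE and ALERT_CPU logic.

```python
from typing import Optional

DEFAULT_CPU_THRESHOLD = 90  # percent; the fallback used when ALERT_CPU is empty


def custom_cpu_alert(
    custom_cpu: Optional[float],        # value from the custom poller, or None
    n_mute: str = "",                   # N_MUTE custom property
    alert_cpu: Optional[float] = None,  # ALERT_CPU; None models "is empty"
) -> bool:
    """Return True when the custom-OID CPU alert should trigger."""
    # With no custom poller on the node there is nothing to compare.
    if custom_cpu is None:
        return False
    # Muted nodes never alert (case-insensitive match is an assumption).
    if n_mute.upper() == "YES":
        return False
    # Use the per-device threshold when set, else the default.
    threshold = DEFAULT_CPU_THRESHOLD if alert_cpu is None else alert_cpu
    return custom_cpu > threshold
```

For example, `custom_cpu_alert(None)` never fires, while `custom_cpu_alert(97.0)` does, because 97 exceeds the default threshold of 90.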
If you've read this far...
I'm going to indulge in just one paragraph of self-promotion. Sentinel Technologies provides a remote monitoring and management service which (as should be obvious by now) is based on nearly the full SolarWinds suite (NPM, NCM, APM, IPSLA and Netflow) but includes significant customizations such as those discussed in this series. In addition, we've integrated extensive SNMP trap and syslog filters, an event correlation engine, pop-up alerting, and a knowledge base to enhance our service even more. All of this is backed by a 24-hour Network Operations Center as well as options for hardware support, sparing, and proactive maintenance. Contact Sentinel at 800.769.4343 or http://www.sentinel.com/ to find out more.
Leon Adato is a monitoring engineer at Sentinel Technologies. Sentinel is an independent technology company providing integrated, customized IT solutions including remote systems Monitoring and Management. Find out more at http://www.sentinel.com/