TIPS & TRICKS: Stop the madness! Avoiding alerts but continuing to pull statistics.

This is the first in a series of posts where, in the name of giving back to the community, I’m going to share some of the customizations that make SolarWinds a little more robust for us and our customers.

First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.

One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer’s premises and set up NATs for the devices they want (read “pay us”) to monitor, and we’re good to go. This is a perfect fit for our customer base, where they don’t want to divert resources for the ongoing investment in staff, software, and skills to set up an enterprise-wide monitoring and management solution (not to mention figuring out who’s going to handle all those pesky tickets).

So our model, where we have many independent customers with different sets of values, different monitoring requirements, and so on, has driven us to come up with some customizations that focus on:
•    How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
•    How to set thresholds for devices when they could differ on nearly a device-by-device basis
•    How to ignore alerts based on the built-in monitors for CPU/RAM, etc. on older or closed-architecture devices where a custom OID gives better data

This post is going to look at our solution for the first bullet – how to stop alerting but continue to collect statistics.

Of course, we all know that SolarWinds has the “unmanage” feature. This is a nifty little function that even has a scheduler associated with it, and can handle one-time or recurring events.

But our problem was that in some cases we needed to continue to collect the statistics even during the window where alerts would be a problem. For example, when a circuit goes down, our Network Operations Center (NOC) staff contact the customer’s carrier and act as the point of contact for testing and resolution. During that time, *we* want to know the status of the WAN circuit, but we don’t want additional alerts (read “tickets, where we have an SLA that carries $$$ penalties if we fail to acknowledge and close”). Unmanage would certainly turn off the alerts, but we’d have no way of knowing what SolarWinds thought about the circuit status until we managed the interface again and – you guessed it – potentially cut another ticket.

So we developed the “MUTE” field.  The logic is very simple:
1.    Set up a custom property (a yes/no field) labeled "mute"
2.    For specific nodes, set that property to "yes"
3.    Within your alerts, make sure one of your logic checks is something like "MUTE is not equal to YES"

That’s the basic idea. But here at Sentinel we’ve made it a bit more granular. The following mute fields are in place:

•    n_mute - node mute. This is an overall mute. All alerts should check for node-mute, and if it is set to "yes", the alert should be ignored.
•    i_mute - interface mute. This is, as the name implies, used in any interface-related alert
•    v_mute - volume mute. Again, the name should be a good clue to the usage. Very valuable when you have disks that are always at the edge of being full, but (for whatever reason) you don't care.
•    APM_mute - application mute. This option is very useful when you are bringing new applications online and want to pilot them, but you still need to get hardware alerts (CPU, RAM, etc.).

The logic for any alert then looks like this:

Where ALL of the following are true
  N_MUTE is not equal to YES
  <the rest of your alert criteria>

For an interface alert, the logic would simply include both mute checks:

Where ALL of the following are true
  N_MUTE is not equal to YES
  I_MUTE is not equal to YES
  <the rest of your alert criteria>
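
To make the evaluation order concrete, here is a minimal Python sketch of the same logic. It is only an illustration: SolarWinds evaluates these conditions inside its own alert engine, and the dictionaries and function names below (`is_muted`, `should_alert`) are hypothetical stand-ins for the custom properties.

```python
# Hypothetical sketch: custom properties are modeled as plain dict entries.
def is_muted(props, field):
    """True when a yes/no mute property is set to YES (case-insensitive)."""
    return str(props.get(field, "")).strip().upper() == "YES"


def should_alert(node_props, criteria_met, element_props=None, element_mute=None):
    """Mirror 'ALL of the following are true': no mute is set and criteria are met."""
    if is_muted(node_props, "n_mute"):
        return False  # node-level mute silences every alert on the node
    if element_mute and is_muted(element_props or {}, element_mute):
        return False  # element-level mute (i_mute, v_mute, APM_mute)
    return criteria_met


node = {"n_mute": "No"}
wan_interface = {"i_mute": "Yes"}

print(should_alert(node, True))                           # node alert still fires
print(should_alert(node, True, wan_interface, "i_mute"))  # interface alert is suppressed
```

Note that polling is untouched in this model. The mute fields only gate whether an alert fires, which is the whole point of the approach.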

Along with the MUTE fields, there are associated DESCRIPTION fields (n_mute_desc, i_mute_desc, etc.). That way we can add comments about when and why the element was muted.

As long as everything stays nice and standard, views and reports can be designed that let you know which elements are muted and why.

We’ve developed a standard set of terms for use in these description fields. That way we can, for example, create a view that shows all the muted nodes, so we know when a device has been muted for too long, while ignoring ones that are purposely muted forever based on customer requirements.
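
As a rough sketch of that kind of view, the Python below filters a list of nodes down to the muted ones that still need a second look. The term `PERMANENT`, the sample captions, and the function name are all invented for illustration; the actual terms we standardized on are not listed in this post.

```python
# Hypothetical data: the description terms below are invented for illustration.
PERMANENT_TERMS = ("PERMANENT",)


def muted_needing_review(nodes):
    """Return captions of muted nodes whose description is not a 'forever' term."""
    result = []
    for node in nodes:
        if str(node.get("n_mute", "")).strip().upper() != "YES":
            continue  # not muted, nothing to review
        desc = str(node.get("n_mute_desc", "")).strip().upper()
        if any(desc.startswith(term) for term in PERMANENT_TERMS):
            continue  # muted on purpose, per customer requirements
        result.append(node["caption"])
    return result


nodes = [
    {"caption": "rtr-a", "n_mute": "Yes", "n_mute_desc": "MAINT: carrier work"},
    {"caption": "sw-b", "n_mute": "Yes", "n_mute_desc": "PERMANENT: customer request"},
    {"caption": "srv-c", "n_mute": "No", "n_mute_desc": ""},
]
print(muted_needing_review(nodes))
```

Matching on a fixed prefix only works if everyone sticks to the agreed terms, which is why the standard vocabulary matters.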

IN THE NEXT POST: How to easily set per-device thresholds.


Leon Adato is a monitoring engineer at Sentinel Technologies. Sentinel is an independent technology company providing integrated, customized IT solutions including remote systems Monitoring and Management. Find out more at http://www.sentinel.com/

Parents
  • adatole,

    Please forgive my denseness here, but I need a little help understanding how the MUTE custom property works.

    Will the value of the MUTE property always be Yes for those nodes, or is it manually set to Yes under certain circumstances?

Children
  • Freemen,

    It is up to the administrator.  If you have the device/vol/app muted, it will not alert.  Once you set that property, the element is excluded from any alert whose conditions check the mute field.

    Here is an example of where I would use muting.  We get a proactive maintenance notification from Verizon telling us that a circuit is going to be down somewhere between 12AM and 5AM tonight.  I would have the NOC mute that device at 12AM and unmute it at 5AM (unless I can figure out how to script it).  Why do this instead of simply unmanaging?  Well, if I unmanage, SolarWinds doesn't collect anything for that node during the unmanage period.  If I use the muting concept, it will collect stats but simply not alert.  I use the Verizon proactive maintenance as an example because these maintenance windows are typically about 5 hours but the impact is usually 15 minutes or less.  I don't want to lose 5 hours of data for a 15 minute hit.

    If you left a node always muted, SolarWinds would collect data but never alert.  The reason you would do this is so you don't have to create unique exclusions in your alerts based on node name or something else, but rather simply always check for the node muted field.

    I hope that this helps explain the concept a little better.  I really love the idea of doing this but I need to figure out how to script this from a web front end for it to be really useful in our cases.

  • Yes, I understand the idea now, but this only matters on interfaces, volumes, and applications, correct? If the node is what will actually be down, there is no sense in polling it - you will get no data. Correct?

    If a child element or application is going down, then you can still get data on the node itself. Is that the idea?

  • For a scheduler, you could use a Windows scheduled task that runs a batch file:

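    @echo off
    rem Alternative, mute by node ID:
    rem Set SQLCMD="Update Nodes Set N_Mute='Yes' where NodeID in (1,2,3,4)"
    rem Replace the servername and ORION_DB_SERVER placeholders before running.
    rem A literal percent sign must be doubled in a batch file, hence the doubled wildcard in the Like pattern.
    Set SQLCMD="Update NetPerfMon.dbo.Nodes Set N_Mute='Yes' where Caption Like '<servername>.%%'"
    sqlcmd -S <ORION_DB_SERVER> -E -Q %SQLCMD% -h -1 -W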

     

    Another cool concept might be to use a date/time instead of "yes"/"no" that would auto expire a maintenance window (like SolarWinds' built-in one).  I think you would have to do an advanced SQL alert then though.
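
A minimal Python sketch of the date/time idea, just to show the comparison an alert would need to make. The `mute_until` value is a hypothetical replacement for the yes/no field; nothing like it exists in the setup described above, and the dates are arbitrary.

```python
from datetime import datetime


def is_muted(mute_until, now):
    """Muted only while 'now' is earlier than the stored expiry time."""
    return mute_until is not None and now < mute_until


# Hypothetical maintenance window that ends at 5:00 AM.
window_end = datetime(2024, 1, 1, 5, 0)

print(is_muted(window_end, datetime(2024, 1, 1, 2, 30)))  # inside the window
print(is_muted(window_end, datetime(2024, 1, 1, 5, 1)))   # window has expired
print(is_muted(None, datetime(2024, 1, 1, 2, 30)))        # no expiry set, never muted
```

The appeal is that nobody has to remember to unmute: once the expiry passes, the check fails on its own and alerts resume.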

  • Let me give you another example.

    Say you are dealing with a circuit issue.  You have a node going up and down constantly and the carrier can't fix the issue until Monday.  You would potentially still want SolarWinds collecting data so that you have a history of outage information (to potentially recoup money from the carrier for an uptime agreement), but do you need to get alerted every time it happens if it's a known issue?

    The entire point of this method is to suppress the alert but still poll.  If you don't need to poll, then yes, unmanage works perfectly for you and you don't need to utilize this method.

  • Freemen:

    Yes, BUT...

    You might conceivably mute a node during a maintenance window; when the node is just coming online but in pilot mode; when it's having an intermittent long-term problem that is already being looked into; etc.

    Otherwise, if the node was just down-down and it was going to stay down, no sense in muting.

  • PS to everyone:

    I'm glad this has sparked such a good discussion! Thanks for participating.

  • Netlogix:

    Absolutely right. What's slowing me down is the actual user interface and the back-end data to keep track of the on/off. It's not HARD, but I just don't feel like writing the web-based calendar applet, then the shim to write the nodeID, mute-on, mute-off data to the db, and the OTHER interface to list out the upcoming blackouts with a set of MAC (move add change) options. I know it's small potatoes for someone who codes all day. Just not something I do a lot of.

    When I get annoyed enough, it'll happen. And a week later SolarWinds will announce it's integrated into their system, and I will weep in self-pity.

    ;-)

     - Leon

  • Let me pass on my own thanks for the original posting and all the questions and comments. That's why this forum is so valuable. All the people are so knowledgeable and nice. We're probably very good looking too!

  • You don't need a SQL query. See my post below...