
How are people managing traps?

I am curious how other people out there using SolarWinds Orion are managing their SNMP traps.

I have found that I run into a few different problems:

  1. Devices such as Cisco gear can throw literally thousands of different traps, so it doesn't seem practical to sort through and set an alert for each and every one; that would take forever.  On the flip side, the Cisco devices we have generate a lot of traps, most of which are just noise, so I can't alert on everything either.
  2. On devices such as HP/Compaq, many traps have an indeterminate severity: depending on the details (var-binds), the same trap definition can indicate something perfectly normal or something critical.

These are two significant examples, but in general, getting the important trap data to my techs while filtering out the unwanted traps is a constant battle, so I am curious how other people are accomplishing this.

P.S.  To give a bit of background info: we are an MSP and we manage both network devices and servers.  We have Orion NPM, APM, and NCM.

  • I'm guessing most folks have moved away from traps in favor of syslog, any chance you can do that? It'll simplify your life so I'd seriously consider it if possible...


    -SK

  • Technically, switching to syslogs from traps would be possible; getting people to accept that change may be a bit more difficult.

    If I do switch to syslogs, we still run into a lot of the same or similar problems, though: alerting my techs to the important stuff while squelching all of the noise.

    There are certainly logs that we know we want alerts for, and we have set those up without a problem.  The problem is the messages you don't know about yet but that are important and will need alerts when they come through.

  • One approach I've found useful to filter out syslog noise is the "artificial ignorance" technique. I believe this technique was named by Marcus Ranum, probably close to 20 years ago by now. The idea is that you filter out stuff that you know isn't interesting, and refine that filter list over time. The simplest way is with little shell scripts, like:

    grep -vf exclusions.txt syslog.txt

    where exclusions.txt is a list of stuff you know you don't want to see, e.g. LINEPROTO-5-UPDOWN, etc. I've seen people write small VB apps to do the same sort of thing in a pure Windows environment.
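    To make the refinement part of the technique concrete, here is a minimal sketch of how the exclusion list might grow over time. The file names (syslog.txt, exclusions.txt, unexplained.txt) and the example pattern are just illustrative, not anything Ranum prescribed:

    ```shell
    # "Artificial ignorance" review loop (file names are illustrative).
    # 1. Keep only lines that match no known-boring pattern.
    grep -vf exclusions.txt syslog.txt > unexplained.txt || true

    # 2. After reviewing unexplained.txt, append patterns you have
    #    decided are noise, e.g.:
    echo 'LINEPROTO-5-UPDOWN' >> exclusions.txt

    # 3. Re-run; over time unexplained.txt shrinks until only genuinely
    #    new or interesting messages remain.
    grep -vf exclusions.txt syslog.txt > unexplained.txt || true
    ```

    The `|| true` just keeps the loop usable in scripts with `set -e`, since grep exits non-zero when nothing survives the filter.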

    A more sophisticated way would be to create syslog summary scripts that give you a quick list of how many instances of each message type occurred during a window. This thread has an example of how to do this with Kiwi, and a link to a blog post by Terry Slattery on how to do it with Perl scripts. (I haven't tried the Kiwi method yet, but I will probably get to it next week.)
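    For a rough Unix-style equivalent of those summary scripts, assuming Cisco-style messages whose mnemonics look like %FACILITY-SEVERITY-MNEMONIC (the file name syslog.txt is illustrative), a one-liner like this gives a per-type count for the window:

    ```shell
    # Count occurrences of each Cisco-style message type, most frequent first.
    grep -oE '%[A-Z0-9_]+-[0-7]-[A-Z0-9_]+' syslog.txt | sort | uniq -c | sort -rn
    ```

    Scanning that output once a day makes the high-volume noise obvious, which in turn tells you what to add to the exclusion list.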

  • For the reasons you have outlined, we have over time moved away from both SNMP traps and syslog.

    Instead, we tend to rely more on active probes (i.e., Orion's built-in polling for interface status, up/down, etc., and custom pollers for array status, temperature, power status, fan status, and so on). We do still have some alerts defined for both syslogs and SNMP traps - just not a huge number (considering the size of our implementation and number of device types).

    This may delay alert detection - but in most cases that is outweighed by ease of setup/maintenance, no false positives, and certainty about what we have monitored in our environment. Given how many times we have discussed this internally (with differing viewpoints), I'm sure that not everyone will agree with this approach. The alternative, however, is the need to spend a lot of time tailoring device configs, alert filters, testing, etc. (and re-doing these activities for each software upgrade or hardware refresh).

    A big part of the problem is vendors like Cisco that just don't implement standard message formats (or even a consistent message-ID format). If you have ever looked at the number of different messages Cisco devices can generate to indicate that a power supply fan has failed, you will understand the difficulty of defining an alert (the messages even change within the same IOS train). Cisco could learn a lot from the IBM mainframe approach: once a message format is published (with a unique message ID), it never changes.

    We do still collect syslog and SNMP traps, but these are mainly used for ad hoc reporting. For example, we look for the most frequently occurring alerts, top talkers, etc. They are of course also available to the support staff if we want to see what alerts were produced for a particular failure event.

    Dave.