Alert on Gateway Down?

We tried using the "alert me when a neighbor goes down" alert, but I'm getting lots of noise and false positives on nodes that aren't direct neighbors. (One site goes down and 8 other sites all report that their neighbor (via VPN tunnel) dropped, resulting in 9 alerts.)

Is there a way to alert on the next hop/interface gateway down so that only the next hop is monitored for status?

  • Is alerting on syslogs or traps an option for you?

  • Don't get me started on dependencies and groups. Mine only work half the time. I have dynamic dependencies set on all of my sites, yet more than half the time when a parent goes down, I still get a page for every device behind it.

    And all of our sites have redundant connections with auto failover. So when a connection drops, I don't see a site go down, and nothing generates an alert. Either my users complain that the network is slow (it failed over from cable to T1), or the cable goes out and they go down hard (because the T1 failed a week prior and nobody noticed).

  • The issue you are seeing probably has to do with the polling intervals. The most common situation I see with intermittent dependency issues goes like this:

    Your poller is rolling through the giant list of devices it needs to hit. At some point it stops getting pings back from something that should be a dependency child device, but because of the polling cycle it hasn't polled the parent yet, so it doesn't know that the parent is also down. The child device has to go 120 seconds without answering before it gets marked down (assuming you didn't change the default value). During that time the parent goes into warning as well, since we aren't getting pings there either. The child device eventually gets marked as down, which triggers the email alert. Somewhere in the next 120 seconds your parent gateway gets marked as down too. Now that SolarWinds has figured out a parent is down, it changes the status of every child object to "Unreachable" and begins suppressing the alerts that would fire from that point forward.

    There are a couple of things that can help with this situation. First, I try to make sure my parent devices get their up/down status polled more frequently than child objects, so there is a smaller window of time where SolarWinds would see children going down before it knows the parent is down. So if 120 seconds is the standard across your environment, set your parents to 60 seconds. You can override the global polling intervals on the edit node screen. The second step is to slow down the speed at which the node down alert fires. This one is a bit more controversial because it depends on how your organization weighs the benefit of faster notifications against more spam messages. Setting the Node down alert frequency to every 2-3 minutes instead of once every 60 seconds, combined with a higher polling frequency on the parent nodes, gives the parents enough time to get their status updated and the suppression to kick in before the children trigger their node down alerts.
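
The timing race described above can be sketched with a small model. This is only a rough illustration, not how the SolarWinds polling or alerting engine is actually implemented: the `Node` class, `poll_offset`, `alert_delay`, and the function names are all made up for the example, and it assumes both nodes use the same 120-second down delay mentioned in the post.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of a polled node; not a SolarWinds API."""
    poll_interval: float       # seconds between status polls
    poll_offset: float         # seconds from the outage until this node's next poll
    down_delay: float = 120.0  # seconds unanswered before being marked down (assumed)

    def marked_down_at(self) -> float:
        """Seconds after the outage at which the node is marked down."""
        if not 0 <= self.poll_offset < self.poll_interval:
            raise ValueError("poll_offset must be in [0, poll_interval)")
        return self.poll_offset + self.down_delay


def child_alert_fires(parent: Node, child: Node, alert_delay: float) -> bool:
    """True if the child's alert fires before the parent is marked down.

    alert_delay is the assumed wait between the child being marked down
    and its alert actually firing. A tie counts as suppressed.
    """
    child_alert_time = child.marked_down_at() + alert_delay
    return child_alert_time < parent.marked_down_at()


# Same 120 s interval everywhere, child happens to be polled first,
# alert fires as soon as the child is marked down.
child = Node(poll_interval=120, poll_offset=10)
parent = Node(poll_interval=120, poll_offset=90)
print(child_alert_fires(parent, child, alert_delay=0))    # True

# Parent polled every 60 s and the alert held back by 120 s.
parent = Node(poll_interval=60, poll_offset=50)
print(child_alert_fires(parent, child, alert_delay=120))  # False
```

In this model the child can be marked down ahead of the parent by up to almost one full parent polling interval, so the alert would need to wait at least that long for suppression to win every time. Polling the parent faster shrinks the window but does not close it by itself.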

  • Tried that already. Polling LAN devices every 5 minutes. Routers every 60 seconds. Still manages to do it.

    I think it has something to do with dynamic groups vs. manually assigned devices. I see it more at sites whose members are added via dynamic query rules ("if node IP address begins with...") than at sites where I add the devices to the group by hand.