3 Replies Latest reply on Jun 30, 2009 12:12 PM by jcooler

    Alert rollups and Alert suppression



      I've read several posts on how alert suppression should be configured and how alert rollup should work.  My problem is finding the best way to make it all work together properly to get valuable triggered alerts, and properly displayed alerts on the maps.

      Here's what I'm seeing on my large "hub-n-spoke" network (yes, I know this is 2 different issues, but they are closely linked) [timeline in minutes]:

      1.  Time 0:00 - An alert comes in that a remote router goes down and the map turns red (proper behavior).

      2.  Time 2:00 - After being polled again, the router turns grey (unknown) on the map, but the alert still says the router is down.  I would think it should stay red.

      3.  Time 4:00 - Router is still grey on map, alert still says down.

      4.  Time 5:00 - Alerts come in that the remote switches are down (5-minute polling cycle).  I would like the switches to remain unchanged and no alert to come in.  They aren't down, just unreachable because of the router.
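
      To make the desired behavior concrete, here is a minimal sketch in Python. It is not NPM functionality; the `classify` function, the `parent` mapping, and the node names are all hypothetical. The idea is that a node that stops responding is only called "down" if the device it sits behind is still up; otherwise it is "unreachable" and would not need its own alert.

```python
from typing import Dict, Optional


def classify(polled_up: Dict[str, bool],
             parent: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Label every node 'up', 'down', or 'unreachable'.

    polled_up: node name -> True if the last poll got a response.
    parent:    node name -> the upstream device it depends on (None for the hub side).
    Both mappings are illustrative assumptions, not an NPM data model.
    """
    result: Dict[str, str] = {}

    def status(node: str) -> str:
        if node in result:
            return result[node]
        if polled_up[node]:
            result[node] = "up"
        else:
            upstream = parent.get(node)
            # A silent node behind a silent parent is unreachable, not down.
            if upstream is not None and status(upstream) != "up":
                result[node] = "unreachable"
            else:
                result[node] = "down"
        return result[node]

    for name in polled_up:
        status(name)
    return result


# The T1 to the remote site fails: nothing at the site answers polls.
site = classify(
    polled_up={"router": False, "sw1": False, "sw2": False},
    parent={"router": None, "sw1": "router", "sw2": "router"},
)
```

      With the inputs shown, only the router is labeled "down"; both switches are labeled "unreachable", which is the map and alert behavior described in the timeline above.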

      I may have missed something in all my research (forum posts, admin guide, and online videos), but shouldn't the above be default behavior?  I really don't like the thought of configuring custom alerts for the entire network (1,000s of devices and interfaces).

      Does anyone have some ideas on how best to implement the solution described above?  Some best practices or how-to guides?  Maybe I missed something in my research and it's easier than I think.  Please, any insight would be appreciated.


        • Re: Alert rollups and Alert suppression

          As of now, without some complicated SQL/alerting hackery, you're stuck with building two alerts per spoke.  One alert is for the router that provides access to the site and the second alert is for the switches.  You should configure the router alert to go off all the time, but configure the switch alert with a suppression that follows the logic of "If router X status is Down, don't alert".
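
          The two-alerts-per-spoke logic can be sketched roughly like this in Python. This is only an illustration of the rule, not how alerts are defined in the product; `Spoke`, `alerts_to_send`, and the status strings are assumed names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Spoke:
    """One remote site: the router that provides access, plus its switches."""
    router: str
    switches: List[str] = field(default_factory=list)


def alerts_to_send(spoke: Spoke, status: Dict[str, str]) -> List[str]:
    """Return the node names that should raise a 'Down' alert for this spoke."""
    alerts: List[str] = []
    if status.get(spoke.router) == "Down":
        # Alert 1: the router alert always fires when the router is Down.
        # Alert 2 is suppressed: "If router X status is Down, don't alert".
        alerts.append(spoke.router)
    else:
        # Router is not Down, so switch alerts are allowed through.
        alerts.extend(s for s in spoke.switches if status.get(s) == "Down")
    return alerts


branch = Spoke(router="branch-rtr", switches=["branch-sw1", "branch-sw2"])
all_down = {"branch-rtr": "Down", "branch-sw1": "Down", "branch-sw2": "Down"}
one_switch = {"branch-rtr": "Up", "branch-sw1": "Down", "branch-sw2": "Up"}
```

          Here `alerts_to_send(branch, all_down)` yields only the router, while `alerts_to_send(branch, one_switch)` yields only `branch-sw1`. Note that the rule is literal: in this sketch, a router reported as "Unknown" instead of "Down" would not suppress the switch alerts.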

          That said, I've never seen a router that's dropping pings go into "Unknown" status - it always goes into "Warning", then "Down".  About the only time I've seen things go into "Unknown" is when an interface is not responding to SNMP (or the node itself is no longer responding to SNMP).  There are other reasons for "Unknown", but that's the majority of what I come up against.

            • Re: Alert rollups and Alert suppression

              Thank you for the quick response!

              As Elizabeth is also monitoring this...is functionality being built to address this suppression need?  Seeing/using other similar tools (came from the OpenView world), I understand how difficult this is to develop; but I also understand the need for this.  If there is anything I can do to help develop such functionality (beta test, discussions on what I would like to see, etc.), I'm all in...  :)

              As for the "unknown," the router is not available at all (ICMP or SNMP) because of a data T1 failure.  So, the serial interface is down, the router becomes unavailable, and the router shows down...and on the second poll after the outage, the router becomes unknown.  Subsequently, the switches follow the same behavior.

              Am I not monitoring devices correctly?  As stated above, I'm not new to the monitoring game, but that doesn't mean I'm doing it correctly.  I just need NPM to provide value to our team.  It is doing so, but not per management expectations.