6 Replies Latest reply on Jun 12, 2014 7:27 AM by jim8424

    Problem with alerting on nodes down.


      I created a new alert to trigger an email any time we have a Cisco device down for more than one minute. The problem is, when I test the alert I get an email for ANY node that goes down, even if I tell it to test on a monitored Windows server instead of a Cisco device. I even tried setting suppression for Windows nodes to prevent the trigger, but that didn't work. Specific information on the alert configuration follows:


      1. I create the alert by checking "Alert me when a node goes down" in the Alert Manager and clicking edit, then giving it a new alert name.

      2. My Trigger conditions are set to All of the following, Node is equal to down, Vendor is equal to Cisco. At the bottom, I set Do not trigger this action until condition exists more than 1 minute.

      3. On the Suppression conditions tab I first tried Vendor is not equal to Cisco. When that didn't work I set suppression conditions of ANY of the following: Vendor is equal to unknown, Vendor is equal to VMWare Inc, and Vendor is equal to Windows. That covers all the Vendor types I have monitored.

      4. I configured the trigger action to send an email to me and set the message to contain the variable of the node name.


      So, if I use the test alert panel and select a Cisco device as the trigger node, I get an email formatted appropriately with the Cisco node name in the message. If I trigger the alert by selecting a Windows server, it should suppress it. Instead, I still get an email, which sometimes contains the node name and sometimes leaves the node name blank.


      Have I misunderstood suppression? Shouldn't it prevent the email from being sent?



        • Re: Problem with alerting on nodes down.
          Leon Adato

          You've been caught by the "suppression tab" trap. It gets us all.




          Upshot: Don't use the suppress tab. EVER.


          Put all your "if not" conditions in the alert trigger. In your case you don't need any: since you explicitly state "Vendor = Cisco", there's no need to also indicate "Vendor <> Cisco".
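          To make the point concrete, here is a rough sketch of how an "All of the following" trigger behaves. This is plain Python, not anything from the product; the `Node` class, the `should_trigger` function, and the node names are made up for illustration. It only shows that once the trigger requires the vendor to be Cisco, every other vendor is already excluded without a suppression rule.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a monitored node; not a product data structure.
    name: str
    vendor: str
    status: str             # "Up" or "Down"
    seconds_in_status: int  # how long the node has been in this status


def should_trigger(node: Node, min_seconds: int = 60) -> bool:
    """Model of an 'All of the following' trigger: every condition must hold."""
    return (
        node.status == "Down"
        and node.vendor == "Cisco"
        and node.seconds_in_status > min_seconds
    )


nodes = [
    Node("core-sw-01", "Cisco", "Down", 90),    # Cisco, down long enough
    Node("win-app-01", "Windows", "Down", 90),  # down, but wrong vendor
    Node("core-sw-02", "Cisco", "Down", 30),    # Cisco, but not down long enough
]

# Only nodes that satisfy every trigger condition end up in the list.
alerting = [n.name for n in nodes if should_trigger(n)]
print(alerting)
```

          The Windows node never matches, so there is nothing left for a "Vendor <> Cisco" suppression to do.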


          Finally, the test button doesn't REALLY test. It triggers the alert for the specified node/interface/disk/ham sandwich, but it uses CURRENT values. So often you won't get the right variable population since those values don't exist or don't exist correctly at the time you hit "test".


          A better test method is:

          1. ADD a line to the trigger that names one specific node. That way you don't get hammered.
          2. Change your thresholds. If you want to trigger on CPU, set it to be >= zero. If you want to test a down node, set it to "UP", etc.


          NOW your alert will trigger, with all the correct variables, etc. If you want to re-test it, just use the "CLEAR" button in the alert list.
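          The two steps above can be sketched the same way. Again, this is illustrative Python only; `TEST_NODE`, `temp_trigger`, and the sample nodes are hypothetical. The idea is that the temporary trigger is pinned to one node and its status condition is flipped to "Up", so it fires against real, current data for exactly one device.

```python
TEST_NODE = "core-sw-01"  # hypothetical: the one node you pin the test to


def temp_trigger(node: dict) -> bool:
    """Temporary test version of the trigger.

    Step 1: an extra condition limits the alert to a single node.
    Step 2: the status condition is flipped from "Down" to "Up" so a
            healthy node satisfies it right away.
    """
    return (
        node["name"] == TEST_NODE
        and node["vendor"] == "Cisco"
        and node["status"] == "Up"
    )


nodes = [
    {"name": "core-sw-01", "vendor": "Cisco", "status": "Up"},
    {"name": "core-sw-02", "vendor": "Cisco", "status": "Up"},
    {"name": "win-app-01", "vendor": "Windows", "status": "Up"},
]

# Only the pinned node fires, even though every node is currently up.
fired = [n["name"] for n in nodes if temp_trigger(n)]
print(fired)
```

          Remember to put the real conditions back once you've seen the email come through correctly.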


          Hope that helps.
