5 Replies Latest reply on Mar 26, 2015 11:29 PM by superfly99

    How to troubleshoot NPM alerting?

    rtobyr

      I recently had several nodes go down, and didn't get any alerts from Orion. I checked the Alert Manager on the server. Nothing wrong there. I sent and received a test alert, so it isn't SMTP. I went into active alerts, and despite the node still being down, there isn't an alert for it. What else can I check?

        • Re: How to troubleshoot NPM alerting?
          dhanson

          Semper Fi.

           

          Can you take a couple screenshots of your alerts in alert manager? We'll need to know more about the rule to determine what the cause is.

           

          Off the top of my head, you could have a Trigger Condition that excludes all possible nodes, an Alert Suppression rule that filters out all possible nodes, or a Trigger Action that doesn't notify you properly (if you tested with this alert and received the e-mail, that more or less rules this one out). Your Time of Day settings may also get in the way (unlikely, but hey, it's still a possibility).

           

          Specifically within Alert Manager, I'd verify that the Trigger Actions include posting something to NPM's Event Log in addition to your e-mail notification.

           

          If you post screenshots from your "Node Down" alert, include your trigger condition, alert suppression and trigger actions, and confirm your Time of Day page is the default (12:00 AM to 11:59 PM with every day selected).

            • Re: How to troubleshoot NPM alerting?
              rtobyr

              Just for reference, the last time an alert should have been triggered was for a downed APC UPS at 7:30 AM.

              Trigger Condition.png

              Alert Suppression.png

              Time of Day.png

              Alert Action.png

                • Re: How to troubleshoot NPM alerting?
                  mdecima

                  The problem is the logic of alert suppression: the alert will suppress if it finds condition X, but it will suppress the WHOLE alert.

                   

                  You should delete everything in alert suppression and handle the extra conditions in the trigger instead, like this:

                   

                  Trigger alert when all of the following apply

                   

                  Node status is equal to down

                  Node name is not equal to SLO-R24.srv.courts-tc.ca.gov

                  Machine Type is not equal to Windows 7 workstation

                    • Re: How to troubleshoot NPM alerting?
                      superfly99

                      mdecima is correct in saying to delete everything out of alert suppression. It never works, and in the latest NPM I believe this tab has been removed for good!

                       

                      But I would modify the above trigger alert to this.

                       

                      Trigger alert when all of the following apply

                      Node status is equal to down

                           Trigger alert when any of the following apply

                           Node name is not equal to SLO-R24.srv.courts-tc.ca.gov

                           Machine Type is not equal to Windows 7 workstation
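
To see how the two trigger layouts in this thread differ, here is a minimal Python sketch. It is not how Orion evaluates alerts internally; it assumes "all of the following" behaves like a logical AND and a nested "any of the following" behaves like a logical OR. The `Node` class, the sample node names and the machine types other than the two quoted in the thread are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    machine_type: str
    status: str

EXCLUDED_NAME = "SLO-R24.srv.courts-tc.ca.gov"
EXCLUDED_TYPE = "Windows 7 workstation"

def trigger_flat_all(node: Node) -> bool:
    # Flat "all of the following": every condition must hold (AND).
    return (
        node.status == "down"
        and node.name != EXCLUDED_NAME
        and node.machine_type != EXCLUDED_TYPE
    )

def trigger_nested_any(node: Node) -> bool:
    # "Down" AND a nested "any of the following" group (OR).
    return node.status == "down" and (
        node.name != EXCLUDED_NAME or node.machine_type != EXCLUDED_TYPE
    )

# Hypothetical sample nodes, all currently down.
nodes = [
    Node("core-sw-01", "Cisco Catalyst", "down"),
    Node(EXCLUDED_NAME, "Linux server", "down"),
    Node("desk-pc-17", EXCLUDED_TYPE, "down"),
]

for n in nodes:
    print(f"{n.name}: flat_all={trigger_flat_all(n)} nested_any={trigger_nested_any(n)}")
```

Under these assumptions the flat "all" version skips both the excluded node name and the excluded machine type. The nested "any" version still fires for them, because a node only fails the OR group when it matches both exclusions at once. If the goal is to exclude either one, the inner conditions need to be combined with "all".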

                    • Re: How to troubleshoot NPM alerting?
                      dhanson

                      1. Is it possible Time Zone may have been a factor? Your time of day limitations exclude 12 hours, and if there's a significant timezone difference between the APC and the server, you might not get an alert.

                      2. It could be a hang up in your alerting service. Have you received any other alerts since? If not, restart the SolarWinds Alerting Engine.

                      3. Does the object show as down in SolarWinds? If the object's Status isn't "Down", your alert won't trigger.

                      4. Why not set your Time of Day restrictions in the E-Mail/Page instead of in the Alert? According to the logic you have deployed here, even if a node dropped outside of your business hours, you won't get any information on it. But if you put the "time of day" within the e-mail notification, you could still record to Event Log when something occurs outside of business hours, just not receive an e-mail on it.

                      5. Do you have a stipulation at the bottom of your trigger for "Do not trigger this action until condition exists for more than ____"? It could be that the negative condition hasn't existed long enough for the alert to trigger. For instance, if this was set to something crazy like 12 hours, this might not allow you to trigger.
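
Points 1 and 5 above can be illustrated with a small Python sketch. It only models the idea and is not SolarWinds code: the 08:00 to 20:00 window, the UTC offsets, the date and the function names are all hypothetical, and it assumes the window is checked against the server's local clock.

```python
from datetime import datetime, time, timedelta, timezone

def in_window(moment: datetime, start: time, end: time) -> bool:
    """True if the wall-clock time of `moment` falls inside [start, end]."""
    t = moment.time()
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end  # window wraps past midnight

def should_trigger(down_since: datetime, now: datetime,
                   start: time, end: time, min_duration: timedelta) -> bool:
    """Trigger only inside the window and once the condition has lasted long enough."""
    return in_window(now, start, end) and (now - down_since) >= min_duration

# Hypothetical offsets: device site at UTC-8, polling server at UTC-5.
device_tz = timezone(timedelta(hours=-8))
server_tz = timezone(timedelta(hours=-5))
window = (time(8, 0), time(20, 0))  # hypothetical Time of Day setting

down_at = datetime(2015, 1, 1, 7, 30, tzinfo=device_tz)  # 7:30 AM at the device
now = (down_at + timedelta(minutes=5)).astimezone(server_tz)

print(in_window(down_at, *window))                        # device clock: 07:30
print(in_window(down_at.astimezone(server_tz), *window))  # server clock: 10:30
print(should_trigger(down_at, now, *window, timedelta(hours=12)))
print(should_trigger(down_at, now, *window, timedelta(minutes=2)))
```

The same instant lands inside or outside the window depending on whose clock is used, and a long "condition must exist for more than" delay blocks the trigger even when the window check passes.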

                       

                      What version of NPM are you running? 11.0.1?