3 Replies Latest reply on Mar 6, 2012 1:13 PM by jrich

    NPM does not alert when a Windows device goes unresponsive when polling method is SNMP

    jrich

      Yesterday around 3 PM, one of my Windows 2003 servers ran into a problem.  Windows devices are polled via SNMP in NPM.  We were not alerted to the issue until this morning.  For that period there is a gap in the device's performance history for drive status, CPU usage, and memory usage.  I have one alert in Alert Manager that is supposed to fire when a managed node has not been polled during the last 5 tries, and another that fires when the managed node's last poll time is more than 20 minutes old. The conditions are as follows:

      20-minute alert

      Trigger Alert when all of the following apply
           (Now - Last Sync) in minutes is greater than 20
           Trigger alert when not all of the following apply
                Node Status is equal to Unmanaged
                Node Status is equal to External
                Node Status is equal to Down

      Not been polled during last 5 tries
      Trigger alert when all of the following apply
           Skipped polling cycles is greater than 5
           Vendor is equal to Windows
           Trigger alert when not all of the following apply
                 Node Status is equal to UnManaged
                 Node Status is equal to External

      The server could still respond to ping requests, but no other TCP connection could be made to it.  If SNMP was not able to get performance data back, why did neither of these two alerts trigger? What should my alert look like if a device still pings but I can't get any performance data back?  Wouldn't NPM mark that as unresponsive?
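A side note on the exclusion groups in both alerts, sketched in Python. This assumes that "not all of the following apply" means "at least one of the listed comparisons is false"; the function names below are hypothetical and are not part of NPM. Under that assumption, a node can never hold all three statuses at once, so the group is always true and excludes nothing. That would not by itself explain why the alerts stayed silent, but it suggests "none of the following apply" may be the intended operator.

```python
# Hypothetical model of two Alert Manager group operators.
# Assumption: "not all apply" is true when at least one comparison is false.
EXCLUDED = ("Unmanaged", "External", "Down")


def not_all_apply(status: str) -> bool:
    """'Not all of the following apply': at least one comparison is false."""
    return not all(status == s for s in EXCLUDED)


def none_apply(status: str) -> bool:
    """'None of the following apply': every comparison is false."""
    return not any(status == s for s in EXCLUDED)


if __name__ == "__main__":
    for status in ("Up", "Down", "Unmanaged"):
        print(status, not_all_apply(status), none_apply(status))
```

In this model, `not_all_apply` returns True for every status, while `none_apply` returns False for the three excluded statuses.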

        • Re: NPM does not alert when a Windows device goes unresponsive when polling method is SNMP
          smargh

          Don't use LastSync. I tried it for this exact scenario, but it's not reliable, and I think it is only intended to show the last DB sync rather than the "last successful poll" time.

          I use this custom SQL node alert. It will only work on recent versions of NPM. I've only been testing it for a few days, but it seems to work okay so far. It is set to trigger only after the node has been unresponsive for 30 minutes, so that it won't alert on brief outages. Sometimes you'll find that when a Windows system is under heavy load, e.g. during OS/SQL backups, SNMP stops responding for a while, in which case you might want to restrict the alert to times outside the affected window.

          Trigger:

          WHERE
          (
            (DATEDIFF(mi, Nodes.LastSystemUpTimePollUtc, getutcdate()) > 30) AND
            (Nodes.Status = '1')
          )


          Reset:

          WHERE
          (
            (DATEDIFF(mi, Nodes.LastSystemUpTimePollUtc, getutcdate()) < 30) AND
            (Nodes.Status = '1')
          )
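
The trigger and reset conditions above can be sketched in Python to show how they behave. This is only an illustration under stated assumptions: that `Nodes.Status = 1` means the node is reported as up (it still answers ping), and that `DATEDIFF(mi, ...)` counts minute boundaries crossed rather than elapsed seconds. The function names and the threshold constant are hypothetical.

```python
from datetime import datetime, timedelta, timezone

THRESHOLD_MINUTES = 30  # same 30-minute threshold as the SQL conditions
STATUS_UP = 1           # assumption: Nodes.Status = 1 means "Up"


def datediff_minutes(start: datetime, end: datetime) -> int:
    """Count minute boundaries crossed between start and end."""
    start_min = start.replace(second=0, microsecond=0)
    end_min = end.replace(second=0, microsecond=0)
    return int((end_min - start_min).total_seconds() // 60)


def should_trigger(status: int, last_uptime_poll: datetime, now: datetime) -> bool:
    """Node still reports Up, but the last uptime poll is over 30 minutes old."""
    age = datediff_minutes(last_uptime_poll, now)
    return status == STATUS_UP and age > THRESHOLD_MINUTES


def should_reset(status: int, last_uptime_poll: datetime, now: datetime) -> bool:
    """Node reports Up and the uptime poll has succeeded within 30 minutes."""
    age = datediff_minutes(last_uptime_poll, now)
    return status == STATUS_UP and age < THRESHOLD_MINUTES


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stale = now - timedelta(minutes=45)
    print(should_trigger(STATUS_UP, stale, now), should_reset(STATUS_UP, stale, now))
```

Note that when the age is exactly 30 minutes, neither condition is true, so the alert simply keeps whatever state it already had.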