Yesterday around 3 PM, one of my Windows 2003 servers ran into a problem. We poll Windows devices via SNMP in NPM, but we were not alerted to the issue until this morning. For that period there is a gap in the device's performance history for drive status, CPU usage, and memory usage. I have two alerts in Alert Manager that should have caught this: one is supposed to fire when a managed node has not been polled during the last 5 tries, and the other when a managed node's last poll time is more than 20 minutes old. The conditions are as follows:
20-minute alert:
  Trigger alert when all of the following apply:
    (Now - Last Sync) in minutes is greater than 20
  Trigger alert when not all of the following apply:
    Node Status is equal to Unmanaged
    Node Status is equal to External
    Node Status is equal to Down

Not been polled during last 5 tries:
  Trigger alert when all of the following apply:
    Skipped polling cycles is greater than 5
    Vendor is equal to Windows
  Trigger alert when not all of the following apply:
    Node Status is equal to Unmanaged
    Node Status is equal to External
The server still responded to ping requests, but no TCP connection could be made to the device. If SNMP could not retrieve any performance data, why did neither of these two trigger conditions fire? What should my alert look like for a device that still pings but returns no performance statistics? Wouldn't NPM mark that node as unresponsive?
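
One way to sanity-check the logic of the two alerts is to model them as plain boolean predicates. The following is only a minimal sketch in Python, not anything NPM runs. It assumes that "all of the following apply" means a logical AND, that "not all of the following apply" means NOT(AND), and that the second group in each alert is combined with the first by AND. The Node class and its field names are hypothetical stand-ins for the Alert Manager fields.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-ins for the Alert Manager fields used above.
    status: str                     # e.g. "Up", "Down", "Unmanaged", "External"
    vendor: str
    minutes_since_last_sync: float  # (Now - Last Sync) in minutes
    skipped_polling_cycles: int


def not_all(*conditions: bool) -> bool:
    """Assumed meaning of 'not all of the following apply': NOT(AND)."""
    return not all(conditions)


def twenty_minute_alert(node: Node) -> bool:
    """Model of the 20-minute alert, assuming the two groups are ANDed."""
    return node.minutes_since_last_sync > 20 and not_all(
        node.status == "Unmanaged",
        node.status == "External",
        node.status == "Down",
    )


def skipped_polls_alert(node: Node) -> bool:
    """Model of the 'not been polled during last 5 tries' alert."""
    return (
        node.skipped_polling_cycles > 5
        and node.vendor == "Windows"
        and not_all(
            node.status == "Unmanaged",
            node.status == "External",
        )
    )


# A node whose counters stay low does not satisfy either condition.
pinging = Node(status="Up", vendor="Windows",
               minutes_since_last_sync=5, skipped_polling_cycles=0)
print(twenty_minute_alert(pinging), skipped_polls_alert(pinging))

# A node with a stale last sync and many skipped cycles satisfies both.
stale = Node(status="Up", vendor="Windows",
             minutes_since_last_sync=25, skipped_polling_cycles=6)
print(twenty_minute_alert(stale), skipped_polls_alert(stale))
```

Under these assumptions, both alerts depend entirely on the Last Sync and skipped-cycle values: if those stay low while the node still answers ping, neither condition is met. The sketch does not establish whether NPM actually keeps updating those values when only ICMP succeeds. It also suggests that, read as NOT(AND), the status group is always true, since a node cannot have several statuses at once, so those exclusions may not be filtering anything.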