
NPM does not alert when a Windows device goes unresponsive when polling method is SNMP

Yesterday around 3 PM, one of my Windows 2003 servers ran into a problem. We poll Windows devices via SNMP in NPM, and we were not alerted to the issue until this morning. For that whole period there is a gap in the device's performance history, whether you look at drive status, CPU usage, or memory usage. I have two alerts in Alert Manager that should cover this: one is supposed to fire when a managed node has not been polled during the last 5 tries, and the other when a managed node's last poll time is more than 20 minutes old. The conditions are as follows:

20-minute alert

Trigger alert when all of the following apply
     (Now - Last Sync) in minutes is greater than 20
     Trigger alert when not all of the following apply
          Node Status is equal to Unmanaged
          Node Status is equal to External
          Node Status is equal to Down

Not been polled during last 5 tries

Trigger alert when all of the following apply
     Skipped polling cycles is greater than 5
     Vendor is equal to Windows
     Trigger alert when not all of the following apply
          Node Status is equal to Unmanaged
          Node Status is equal to External

The server still responded to ping, but no TCP connection could be made to it. If SNMP could not return any performance data, why did neither of these alerts fire? What should my alert look like for a device that still pings but returns no performance statistics? Shouldn't NPM mark that as unresponsive?
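A side note on the nested groups in both alerts: a node has only one status at a time, so a "not all of the following apply" group made of several status checks is true for every node, including Down and Unmanaged ones. This probably does not explain the missed alert, but it means the exclusions are not doing anything. The sketch below models the two group types in plain Python; the function names are hypothetical and this is not NPM code, just the boolean logic.

```python
def not_all(conditions):
    """'Not all of the following apply': at least one condition is false."""
    return not all(conditions)


def none_of(conditions):
    """'None of the following apply': every condition is false."""
    return not any(conditions)


def status_checks(node_status, excluded):
    """One equality test per excluded status, as in the nested group."""
    return [node_status == s for s in excluded]


EXCLUDED = ["Unmanaged", "External", "Down"]

for status in ["Up", "Down", "Unmanaged"]:
    checks = status_checks(status, EXCLUDED)
    # "not all" passes for every status; "none of" passes only for Up
    print(status, not_all(checks), none_of(checks))
```

If the intent is to skip Unmanaged, External, and Down nodes, a "none of the following apply" group matches that intent more closely.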


Accepted answers

smargh

Don't use LastSync. I tried it for this exact scenario, but it's not reliable, and I think it is only intended to show the last DB sync rather than the "last successful poll" time.

I use the custom SQL node alert below. It will only work on recent versions of NPM. I've only been testing it for a few days, but it seems to work okay so far. It is set to trigger only after 30 minutes of no response, so that it won't alert on brief outages. Sometimes when a Windows system is under heavy load, e.g. during OS or SQL backups, SNMP might stop responding for a while, in which case you might want to restrict the alert to run only outside the affected timeframe.

Trigger:

WHERE
(
  (DATEDIFF(mi, Nodes.LastSystemUpTimePollUtc, getutcdate()) > 30) AND
  (Nodes.Status = '1')
)

Reset:

WHERE
(
  (DATEDIFF(mi, Nodes.LastSystemUpTimePollUtc, getutcdate()) < 30) AND
  (Nodes.Status = '1')
)
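To make the behaviour of the two conditions easier to reason about, here is a rough Python model of them. It is only a sketch: the function names are made up for illustration, it assumes Status 1 means Up (as the thread implies), and it imitates the T-SQL DATEDIFF(mi, ...) rule of counting minute boundaries crossed rather than full elapsed minutes.

```python
from datetime import datetime, timedelta

STATUS_UP = 1  # assumption: the '1' in Nodes.Status = '1' means Up


def datediff_minutes(start, end):
    """Imitate T-SQL DATEDIFF(mi, start, end): minute boundaries crossed."""
    start_floor = start.replace(second=0, microsecond=0)
    end_floor = end.replace(second=0, microsecond=0)
    return int((end_floor - start_floor).total_seconds() // 60)


def trigger(last_poll_utc, status, now_utc, threshold=30):
    """Trigger condition: node is Up but the last uptime poll is too old."""
    return status == STATUS_UP and datediff_minutes(last_poll_utc, now_utc) > threshold


def reset(last_poll_utc, status, now_utc, threshold=30):
    """Reset condition: node is Up and the last uptime poll is recent."""
    return status == STATUS_UP and datediff_minutes(last_poll_utc, now_utc) < threshold


now = datetime(2012, 1, 1, 12, 0, 0)  # arbitrary fixed time for the example
for age in (5, 30, 45):
    last_poll = now - timedelta(minutes=age)
    print(age, trigger(last_poll, STATUS_UP, now), reset(last_poll, STATUS_UP, now))
```

Two things fall out of the model. At exactly 30 minutes neither condition is true, which is harmless in practice but could be closed by using <= in the reset. And because both conditions require Status = 1, a node that goes fully Down while the alert is active satisfies neither of them until it is Up and polling again.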

All comments

jrich

Yeah, I think 30 minutes is a little long, maybe 20, but either way the customer would also alert us if it was going on. I'll give this a try. I just wish NPM exposed LastSystemUpTimePollUtc as a trigger condition in the alert builder. Why does NPM hide this stuff in SQL? I'll build an alert and see what it catches. Thanks again for your help; I'll come back and verify the answer if it picks it up.

I'm currently using Orion Core 2011.2.0, APM 4.2.0 SP1, NPM 10.2, IVIM 1.2.0

jrich

Just tested your SQL and I can confirm it does work. I found 15 devices that have failed a poll. One was a Windows device that someone had disabled SNMP on, so I couldn't even query the server, yet NPM still showed the device as up. Kudos to you, sir, and thanks again!