SNMP stops responding on Windows servers (affecting half of our nodes)

cmgurley over 13 years ago

This has been a long-standing issue that I'm finally getting around to troubleshooting. We have recurring issues monitoring our Windows servers where SNMP (native Windows SNMP service) responds for a while and then simply stops responding. The Windows services are running, but Orion NPM's queries to the servers apparently are not answered. Restarting the Windows SNMP service resolves the issue temporarily, but after a few days (time varies), it stops responding again.
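One way to catch the "service is running but not answering" state described above is to probe the agent directly from outside Orion. The following is a minimal sketch, not anything NPM does internally: it hand-encodes an SNMPv1 GetRequest for sysUpTime.0 and reports whether any SNMP-looking reply comes back. The function names, the default community string "public", and the 2-second timeout are assumptions for illustration.

```python
import random
import socket

# OID 1.3.6.1.2.1.1.3.0 (sysUpTime.0) in BER form: "1.3" packs into 0x2B.
SYS_UPTIME_OID = bytes([0x2B, 0x06, 0x01, 0x02, 0x01, 0x01, 0x03, 0x00])


def tlv(tag: int, value: bytes) -> bytes:
    """Encode one BER tag-length-value triple (short-form length only)."""
    if len(value) > 127:
        raise ValueError("value too long for short-form BER length")
    return bytes([tag, len(value)]) + value


def build_get_request(community: str, request_id: int) -> bytes:
    """Build an SNMPv1 GetRequest for sysUpTime.0."""
    # Keep the request ID in a range that encodes as exactly 4 BER bytes.
    if not 0x01000000 <= request_id <= 0x7FFFFFFF:
        raise ValueError("request_id must fit a positive 4-byte integer")
    varbind = tlv(0x30, tlv(0x06, SYS_UPTIME_OID) + tlv(0x05, b""))
    pdu = tlv(
        0xA0,                                       # GetRequest PDU
        tlv(0x02, request_id.to_bytes(4, "big"))    # request-id
        + tlv(0x02, b"\x00")                        # error-status
        + tlv(0x02, b"\x00")                        # error-index
        + tlv(0x30, varbind),                       # variable bindings
    )
    return tlv(
        0x30,
        tlv(0x02, b"\x00")                          # version 0 = SNMPv1
        + tlv(0x04, community.encode("ascii"))
        + pdu,
    )


def snmp_answers(host: str, community: str = "public", timeout: float = 2.0) -> bool:
    """Return True if the agent on UDP port 161 sends back any reply."""
    packet = build_get_request(community, random.randint(0x01000000, 0x7FFFFFFF))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (host, 161))
            data, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and ICMP port-unreachable errors
            return False
    # A reply that starts with a BER SEQUENCE tag is treated as an answer;
    # this sketch does not decode it or match the request ID.
    return len(data) > 0 and data[0] == 0x30
```

Run on a schedule against the affected servers, something like `snmp_answers("10.0.0.5", "mycommunity")` (placeholder values) would timestamp when an agent goes quiet, which helps correlate the failure with other events on the box.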

At this time, roughly half of my 95 Windows nodes are flashing with interfaces in an "unknown" state. In fact, everything dependent on SNMP (CPU, memory, disks, interfaces, etc.) is effectively "unknown"; NPM just doesn't report it that way (for some reason, it thinks it is still gathering data).

We have seen this issue on Windows 2003, 2003 R2, 2008, and 2008 R2. Most of our servers are now 2008 R2. From a quick glance, I see 2008 and 2008 R2 in "unknown" states, while an equal number are polling fine.

We are running NPM 10.1.2 (SolarWinds Orion Core 2011.1.0, IPSLAMGR 3.5.1, NPM 10.1.2, IVIM 1.1.0). Anyone else out there seeing this? SW Staff: any ideas for troubleshooting?

Thanks,

Chris Gurley
www.bctechnet.com

Top Replies

0 ET over 13 years ago

Hi,
I currently have one similar case, and it is exactly the same issue: the SNMP service randomly stops responding without any error, and only restarting the service fixes it. I don't have a solution for it right now.
It would be great if you could set up a Wireshark trace on one of your affected machines and check when it stops responding (or send the trace to us).

Thanks

0 netlogix over 13 years ago

I had an issue like this every time the server lost access to the DNS server, even for only a second or two. I fixed it by using an IP address in the SNMP config instead of a host name. Of course, if you are already using IPs, this doesn't apply. If you want, you could try it on just the one or two servers that have the issue most often.
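To check whether the host-name situation applies, it can help to scan the manager entries configured for the SNMP service and flag any that are not literal IP addresses. The sketch below assumes you have already exported those entries into a list of strings (on Windows they are usually found under the SNMP service's `PermittedManagers` registry key, but treat that location as an assumption to verify on your build). The function name and sample values are hypothetical.

```python
import ipaddress


def hostname_entries(managers):
    """Return the entries that are not literal IPv4/IPv6 addresses."""
    flagged = []
    for entry in managers:
        try:
            ipaddress.ip_address(entry.strip())
        except ValueError:
            # Not parseable as an IP, so the agent would need DNS to resolve it.
            flagged.append(entry)
    return flagged


# Hypothetical entries exported from one server's SNMP configuration.
configured = ["10.1.2.3", "orion.example.local", "localhost"]
print(hostname_entries(configured))
```

Any entry this flags is a candidate for replacement with the management station's IP address, as described above.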

0 jrich over 13 years ago

I have a similar issue. If the SNMP service on a server (Linux or Windows) stops, SolarWinds NPM doesn't even blink as far as reporting that the server did not respond to a poll request. I've confirmed this in a couple of instances where I stopped the SNMP service on a Windows device and received NO notification from SolarWinds that the device had in fact stopped polling (CPU, memory, and disk statuses all stay green, but graphing and data collection halt).
Is this by design? How do I get NPM to alert me that a node failed to respond to the SNMPGET? I would have thought this functionality and level of alerting would have come out-of-the-box. SolarWinds Orion Core 2010.2.1 SP1, APM 4.0.1, NPM 10.1.1 SP1, IVIM 1.0.0

0 sean.martinez over 13 years ago in reply to jrich

As of Orion 10.1, there are new default Advanced Alerts that alert you when a node has not returned SNMP information:
Alert me when a managed node has not been polled during the last 5 tries
Alert me when a managed node last poll time is 10 minutes old

I have also built a Custom SQL Alert, targeting Custom Node Pollers, that fires when a UnDP poller has not been updated in 10 minutes. Here is what you will need to put into the Trigger Condition Query field:
Inner Join Nodes ON Nodes.NodeID = CustomPollerAssignment.NodeID
Inner Join CustomPollerStatus ON CustomPollerAssignment.CustomPollerAssignmentID = CustomPollerStatus.CustomPollerAssignmentID

WHERE ((DATEDIFF(mi, CustomPollerStatus.DateTime, getutcdate()) > 10)
AND (NOT (Nodes.Status = '9')
AND NOT (Nodes.Status = '11')
AND NOT (Nodes.Status = '2')))
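
For readers who want to reason about when this condition fires, here is a rough Python restatement of the WHERE clause, offered as a sketch rather than anything Orion runs. It mirrors the fact that SQL Server's `DATEDIFF(mi, ...)` counts minute boundaries crossed rather than full elapsed minutes. The function name is hypothetical, and the status labels in the comment (2 = Down, 9 = Unmanaged, 11 = External) are my reading of the Orion status codes, so verify them against your install.

```python
from datetime import datetime

# Node statuses the query excludes; assumed meanings:
# "2" = Down, "9" = Unmanaged, "11" = External.
EXCLUDED_STATUSES = {"2", "9", "11"}


def undp_is_stale(last_update, node_status, now, max_minutes=10):
    """Mirror the trigger condition: stale poller on a node that is not excluded."""
    if node_status in EXCLUDED_STATUSES:
        return False
    # DATEDIFF(mi, a, b) counts minute boundaries, so truncate both to the minute.
    start = last_update.replace(second=0, microsecond=0)
    end = now.replace(second=0, microsecond=0)
    boundaries = int((end - start).total_seconds() // 60)
    return boundaries > max_minutes


now = datetime(2011, 5, 1, 10, 11, 0)  # stand-in for getutcdate()
print(undp_is_stale(datetime(2011, 5, 1, 10, 0, 59), "1", now))
print(undp_is_stale(datetime(2011, 5, 1, 10, 1, 0), "1", now))
```

The first call crosses 11 minute boundaries and so counts as stale even though only 10 minutes and 1 second have elapsed; the second crosses exactly 10 and does not.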

0 jrich over 13 years ago in reply to sean.martinez

@ Sean,
I don't see those new Advanced Alerts in my Alert Manager. Where can I get those Advanced Alerts? I did a search on the downloads section and haven't had any luck.

0 sean.martinez over 13 years ago in reply to jrich

I have added the Alerts to our Content Exchange in case you did not receive the Alerts in the 10.1 Upgrade:

Alert me when a managed node has not been polled during the last 5 tries
Alert me when a managed node last poll time is 10 minutes old

I have also added my Custom SQL Alert, which verifies that UnDP polling is still updating and helps troubleshoot the UnDP application when the node itself is otherwise working correctly:
UnDP Last update 10+ Minutes ago

0 netlogix over 13 years ago in reply to sean.martinez

I have also copied and modified ones for volumes and network cards (just in case SNMP is working, but somehow the NIC or volume got switched).

0 cmgurley over 13 years ago in reply to ET

@ET: I'll load up Wireshark on two boxes and hope they flag soon. We just did a network-wide restart of the SNMP agents, in order to add another management station for troubleshooting this, so everything is actually working for the moment. I'm sure that won't last long, though, so we should have more data soon.
Let us know if you come up with anything on your end as well.
Thanks,
Chris | www.bctechnet.com

0 ET over 13 years ago in reply to cmgurley

A Wireshark trace would be great; I'd appreciate it.

0 cmgurley over 13 years ago in reply to ET

Forgot to ask: can you suggest a filter to use in Wireshark to grab the data you want?
Yesterday some of my servers began to stop responding to SNMP polls, so I'm about to load up Wireshark on one or more. In fact, one of them is the Orion NPM server itself, which is where I intend to do the capture. Since it does a ton of polling, I'm going to need to trim down what we capture.
I'm sure I could bang my head a while and figure it out, but since I never do captures like this, I'd appreciate any direction to expedite it.
Thanks,
Chris
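
As a possible starting point for trimming the capture on a busy poller, the sketch below builds two filter strings for a single peer: a capture filter in pcap (BPF) syntax and a Wireshark display filter. It assumes the agent listens on the standard SNMP port, UDP 161, and that the peer is addressed by IPv4. The function names and the sample address are hypothetical, and this is not a filter ET suggested in the thread.

```python
import ipaddress

SNMP_PORT = 161  # standard SNMP query port; traps normally use 162


def snmp_capture_filter(peer_ip: str) -> str:
    """Capture filter (pcap/BPF syntax) limiting traffic to SNMP with one peer."""
    ip = ipaddress.IPv4Address(peer_ip)  # raises ValueError on bad input
    return f"udp port {SNMP_PORT} and host {ip}"


def snmp_display_filter(peer_ip: str) -> str:
    """Wireshark display filter for SNMP packets to or from one peer."""
    ip = ipaddress.IPv4Address(peer_ip)
    return f"snmp && ip.addr == {ip}"


# Hypothetical address of one affected Windows server.
print(snmp_capture_filter("10.0.0.5"))
print(snmp_display_filter("10.0.0.5"))
```

Applying the capture filter on the Orion server keeps the trace limited to polls and replies for the one node under investigation, so the moment the replies stop is easy to spot.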
