Good afternoon fellow thwackers.
I have some very weird things happening on some servers. I don't understand why, but sometimes nodes apparently stop responding to SNMP, and there's no obvious reason for it. The service doesn't stop. Responses don't resume when I restart the service, nor when I restart the server.
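For reference, the restart-and-check is just the standard Windows SNMP service commands (assuming the built-in Windows SNMP agent):

net stop SNMP
net start SNMP
sc query SNMP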
I am getting the networks team to check whether anything is stopping the service from sending responses; likewise, I can ping from server to server. There is no pattern to when it starts and stops, and no real pattern to which servers are affected. It just randomly stops and then, later on, starts again.
It's not confined to a particular VLAN/cluster/domain. It's spread out evenly across our infrastructure.
The networks team has come back to me stating that no SNMP is being blocked.
The only thing left I could try was restarting the SolarWinds box.
Any other ideas?
Some causes in my network for nodes not getting the snmp information back to NPM have included:
Could either of these be part of the source of your observed problems?
In either case, "packet capture is your friend!" Knowing what traffic is actually hitting the switches, and where it comes from, can help you identify flows that may be limiting your switches' ability to respond to SNMP in a timely manner.
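If you don't have a capture tool installed on the box, Windows has a built-in option (a sketch; netsh trace is available on Server 2008 R2 and later, and the resulting .etl can be converted with Microsoft's etl2pcapng and opened in Wireshark, filtered on udp.port == 161):

netsh trace start capture=yes tracefile=C:\temp\snmp.etl maxsize=256
REM ...reproduce the polling failure, then:
netsh trace stop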
I've found an issue on our servers where after a while they stop responding to SNMP.
Strangely, when I open the SNMP Service properties window, go to the Security tab, and click OK, it starts working again.
I took a couple of snapshots of the registry before and after, and found that a new value was created. I then made a script to add the missing value:
REG ADD HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\SNMP\Parameters\RFC1156Agent /v sysName /d %computername% /t REG_SZ /f
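A quick sanity check that the value landed:

REG QUERY HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\SNMP\Parameters\RFC1156Agent /v sysName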
I've seen similar problems--nodes showing Hardware Status Unknown, and it turns out they can't be polled via SNMP by NPM.
It started two weeks ago, and the issue grew daily from only affecting a couple of nodes to 62 nodes unknown as of last Friday. I've opened several tickets with SW Support on the topic. Various recommendations have come through, including applying new hot fixes, a new Buddy Drop, replacing a DLL file on all pollers, etc. Nothing has shown promise for this particular issue, although other issues have been addressed by these repairs.
Today I ran across a Thwack note that mentioned a customer with (apparently) the same issue, and their SW Support engineer suggested removing NetPath monitors that frequently showed failure to reach end nodes. The thought had to do with a CPU leak. See this Thwack entry: Re: Checklist to Rebuild Orion Application
I recall adding a lot more NetPath monitors a few weeks ago, many of which were failing. Today I removed ALL the failing NetPath monitors and then rebooted all the pollers. When they came back up, the 62 "Unknown" nodes dropped to just seven, all of which were ASA 5525-Xs; those have a known hardware-monitoring bug, an incompatibility between NPM 12.2 and Cisco's IOS that was NOT a problem with NPM 12.0.1.
Other symptoms that caused confusion in the troubleshooting process include the fact that SW Support was only intermittently able to run SNMP walks against these "Hardware Unknown" nodes from the APEs. Sometimes the nodes (Cisco switches) would not respond at all, sometimes they'd respond with very few OIDs discovered, sometimes they'd respond normally. SW suggested this wasn't their application's issue but a problem with the Cisco switches, and that I should open TAC cases to find and fix the problem. I did, and the only thing TAC saw was that the switches were being polled a LOT via SNMP--and not from the APEs, but from other devices on my network.
I replicated the MIB walk / SNMP walk from my own PC (wanting to keep it completely isolated from the SW APE's), and found that the switches still were responding unpredictably.
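(For anyone who wants to repeat the test: a walk of just the system subtree with the net-snmp tools looks something like this, with the community string and address as placeholders:

snmpwalk -v 2c -c <community> 10.1.2.3 system

If even that small subtree comes back incomplete or times out, the problem is upstream of NPM.)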
This wasn't a problem at NPM 12.0.1, but showed up a couple of months after I upgraded to 12.2 this past September. At the time of the upgrade, I also went to NAM licensing and added another APE, plus IPAM and UDT. IPAM and UDT add quite a bit more polling pressure on switches and routers, and I still get errors about UDT jobs running when (apparently) they should have completed.
Worse, I'm getting more instances of APE C: drives filling up. Two SW Support cases for this resulted in different fixes that temporarily resolved the issue (one was enabling TLS 1.2 on all APE's). But the problem is back on two different APE's. SW Support isn't being helpful on this front at the moment, but I'll escalate and get it resolved (at least temporarily again).
Regarding the original concern of nodes showing up as Hardware Unknown due to not being pollable via SNMP, I'm hopeful that removing NetPath monitoring of failed destinations may help a lot. I removed the "bad" NetPath monitors late today and rebooted the pollers. If the "Unknown" devices stay below ten by EOB tomorrow, I'll think I'm on to something useful. If they start climbing . . . sigh.
Have you tried increasing your SNMP Timeout setting? I am wondering if you would see fewer holes in data / fewer Unknown events if you gave the devices a little more time to respond.
I am also curious: when you get an Unknown, can you hit Poll Now and get it to pick the status back up, or does it just stay Unknown?
Sorry, I missed your second question. When I try to poll it, it says polling failed. I can ping from my SolarWinds box to the devices themselves with no dropped packets, which is why I mentioned restarting the services.
But that seems like a very wasteful amount of time to spend every time devices start failing to respond to SNMP requests.
I would check your devices to see if the SNMP query is even making it to the device, and see what happens to it if it does.
Make sure the packet is not being dropped and that the SNMP service stays up on the boxes. If you have multiple routes, you might need to assign some priority so the same routes are used when the main links are up. You could also use NetPath to the IP:SNMP port of a device to see if that maintains a good status. NetPath is less granular, but it has the potential to show whether the port is inaccessible or something has changed on your network (i.e., the path to the device is not consistent).
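On the Windows boxes, a quick way to confirm the service is up and actually bound to UDP 161 (run on the monitored server itself):

sc query SNMP
netstat -ano -p UDP | findstr ":161"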
If you are changing which poller the boxes are polled from, you would need to allow the IPs of ALL your Orion servers access to those servers, since moving a node to a poller that does not have access to it will cause failures. (Just make sure all the poller IPs are in the list of hosts the SNMP service accepts packets from on the server.)
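If you'd rather script that than click through the Security tab on every server, the accepted hosts live in the registry; a sketch, with the IP as a placeholder and the value name being the next free number (1, 2, 3...):

REG ADD HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SNMP\Parameters\PermittedManagers /v 2 /t REG_SZ /d 10.0.0.10 /f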
I know some of these may be a little outside the current scope of TS. I'm just trying to cover the bases here.
Someone mentioned that; I have increased it to 10,000 ms. They haven't failed yet, but I will monitor them as the days progress.
I hope it's something as simple as this (let's be honest: unless something is on fire, or you've driven a car into a device, a lot of IT problems are usually simple).
I can change them to Windows domain-based authentication, but that's not the point. It's the fact that it's not doing what I want it to do.
Just a clarification, since you mention you *could* turn on Windows-based authentication: are all of the nodes with issues Windows servers, Linux/*nix servers, or "network devices" like routers/switches/UPSes, etc.?
If this is Server 2008 and you are seeing error 1501 relating to SNMP in the system log, then this article may be relevant.
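You can check for those events from the command line with the built-in wevtutil (a sketch; the provider name is assumed to be SNMP):

wevtutil qe System /q:"*[System[Provider[@Name='SNMP'] and (EventID=1501)]]" /f:text /c:10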
SNMP uses UDP, so packets could simply be getting dropped. There could also be multiple links to the same destination that aren't routing consistently. Perhaps try setting up some tests in NetPath and see if there are network issues you are not aware of.
As a development: the networks team did patching on the switches, and now everything on one customer domain has failed. Doing a test in Edit Properties fails, and no stats are being shown. It's a combination of Windows 2012 R2, 2008 R2, and Linux OS.
I have done both a connection test and an SNMP walk.
To get to the customer router, the path is as follows:
(Customer environment) ------ (Switch 1) ----- (Switch 2) ----- (Router) ----- (Customer router) ----- (Customer Switch)
I do a test within SolarWinds and it gets to the first switch (via port 161, but with 0 bytes received) and then nothing else. It doesn't get dropped or accepted. Nothing happens to it.
There isn't a policy to block it or make it vanish; it just doesn't exist beyond that point. I then did an SNMP walk from both the SolarWinds server and a different server, and nothing got through. So it's a bit of a weird one. I can't see what it would be. My other plan is to get the networks team to Wireshark it and see if anything is making it vanish.
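If you do get them to capture it, something like tshark on a SPAN port of Switch 1 would show whether the request ever leaves it (interface name is a placeholder):

tshark -i <interface> -f "udp port 161 or udp port 162"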
I'm guessing you have, but have you checked that these are triggering correctly? So you can see visible data loss when you look at the node in SolarWinds?
You also said the service doesn't stop. I'm assuming you can't see it stopping in the Windows logs? Being Windows, it's also worth checking the Windows Firewall settings.
Do you have an additional polling engine? If so, it's very much worth adding the node again on that polling engine and seeing if they stop at the same time. If they don't, log a call with SolarWinds, as it will be a collector issue or a fault to the same effect.
Next time I receive them, I will take a capture of them.
But the alert is this:
The Windows Firewall is turned off via GPO.
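(For what it's worth, running netsh advfirewall show allprofiles on a given box is a quick way to confirm the GPO is actually applying.)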
We don't have an additional polling engine, but I am trying to get them to at least buy HA, so that should the link to one data centre die, we will have the other data centre to fail over to.
When I try to poll the node, it says failed, which I initially thought was the service, as there was no packet loss. But I came to a dead end.
I then thought it might be networks. That's very unlikely, as it's only a handful of all our nodes, and it's the same seven failing at random.
Finally, having almost convinced myself it wasn't any of those, I decided it must be something on the box itself: hardware, software, or services.
This might help? Alert on Nodes that stopped responding to SNMP
The initial post is using a very similar query and the first commenter, adatole indicated there may be issues with it because: "...I'm not sure if it was because the pollers were getting behind, or because LastSystemUptimePollUTC wasn't getting updated even when data was being collected, OR that data was being collected but the database was behind..."
The post also gives alternate queries to use/try.