31 Replies Latest reply on Jan 10, 2018 2:57 PM by gary9193

    Node stops responding to SNMP

    shaun_9999

      Good afternoon fellow thwackers.

       

      I have some very weird things happening on some servers.  Every so often, nodes stop responding to SNMP for no obvious reason.  The SNMP service doesn't stop, but restarting the service doesn't bring the responses back, and neither does rebooting the server.

      I am getting the network team to check whether anything is stopping the service from sending responses.  Meanwhile, I can ping from server to server.  There is no pattern to when it starts and stops, and no real pattern to which servers are affected.  It just randomly stops and then later on starts again.

      It's not in a particular VLAN/cluster/domain.  It's all spread out evenly among our infrastructure.

       

      The network team has come back to me stating that no SNMP traffic was blocked.

       

      The only thing I could try was restarting the solarwinds box.

      Any other ideas?

        • Re: Node stops responding to SNMP
          pratikmehta003

          It would be a good idea to check the timeout settings. Maybe the poller isn't allowing enough time for the request to reach the end machine and for the response to come back...

          • Re: Node stops responding to SNMP
            grantallenby

            How do you know it stops? Do you get gaps in CPU and Memory Data, or does the node just appear as down?

            • Re: Node stops responding to SNMP
              bobmarley

              SNMP uses UDP, so the packets could just be getting dropped. It could also be that there are multiple links to the same destination that aren't routing correctly. Perhaps try setting up some tests in NetPath and see if there are network issues you are not aware of.
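
A successful ping only shows that ICMP gets through; it says nothing about UDP port 161. The sketch below is one rough way to test the SNMP path directly from the polling server: it hand-builds a minimal SNMPv2c GetRequest for sysName.0 and waits for any reply. The function names, the "public" community string, and the timeout and retry values are all placeholders for illustration, not settings taken from Orion.

```python
import socket


def tlv(tag: int, content: bytes) -> bytes:
    """Encode one BER tag-length-value item (short-form length only)."""
    if len(content) >= 128:
        raise ValueError("content too long for short-form BER length")
    return bytes([tag, len(content)]) + content


def ber_int(value: int) -> bytes:
    """Encode a non-negative INTEGER, adding a leading zero byte when needed."""
    return tlv(0x02, value.to_bytes(value.bit_length() // 8 + 1, "big"))


# sysName.0 is 1.3.6.1.2.1.1.5.0; the first two arcs (1.3) pack into 0x2b.
SYSNAME_OID = bytes([0x2B, 0x06, 0x01, 0x02, 0x01, 0x01, 0x05, 0x00])


def build_get_request(community: str, request_id: int = 1) -> bytes:
    """Build an SNMPv2c GetRequest for sysName.0."""
    varbind = tlv(0x30, tlv(0x06, SYSNAME_OID) + tlv(0x05, b""))
    pdu = tlv(
        0xA0,  # GetRequest-PDU: request-id, error-status, error-index, varbinds
        ber_int(request_id) + ber_int(0) + ber_int(0) + tlv(0x30, varbind),
    )
    # Version field 1 means SNMPv2c.
    return tlv(0x30, ber_int(1) + tlv(0x04, community.encode("ascii")) + pdu)


def snmp_probe(host, community="public", port=161, timeout=2.5, retries=2):
    """Return True if any datagram comes back, False if every attempt fails."""
    packet = build_get_request(community)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(retries + 1):
            try:
                sock.sendto(packet, (host, port))
                sock.recvfrom(65535)  # reply content is not validated here
                return True
            except (socket.timeout, ConnectionError):
                continue
    return False
```

Calling `snmp_probe("10.0.0.5", community="mystring")` (hypothetical address and community) from the poller while the node shows as unresponsive would tell you whether anything at all comes back on UDP 161. Note that an agent normally discards requests with a wrong community string or from a host it is not configured to accept, so a False result can mean either a network drop or a rejection by the agent.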

              • Re: Node stops responding to SNMP
                steshi

                If this is Server 2008 and you are seeing error 1501 relating to SNMP in the system log, then this article may be relevant:

                 

                https://support.microsoft.com/en-us/help/967800/an-snmp-transport-is-disabled--and-snmp-event-id-1501-is-logged-when-l

                • Re: Node stops responding to SNMP
                  cahunt

                  Have you tried increasing your SNMP Timeout setting? I am wondering if you would see fewer holes in data / unknown events if you gave the devices a little more time to respond.

                   

                  I am also curious: when a node goes unknown, can you hit Poll Now and get it to pick up the status again, or does it just stay unknown?

                    • Re: Node stops responding to SNMP
                      shaun_9999

                      Someone mentioned that, so I have increased it to 10,000 seconds.  The nodes haven't failed yet, but I will monitor them as the days progress.

                      I hope it's something as simple as this (let's be honest: unless something is on fire, or you've driven a car into a device, a lot of IT problems are usually simple).

                      I can change them to Windows domain-based authentication, but that's not the point.  It's the fact that it's not doing what I want it to do.

                      • Re: Node stops responding to SNMP
                        shaun_9999

                        Sorry, I didn't see your second question.  When I try to poll it, it says polling failed.  I can ping from my SolarWinds box to the devices themselves with no dropped packets, so that's why I mentioned restarting the services.

                        But that seems a very wasteful use of time whenever devices start failing to respond to SNMP requests.

                          • Re: Node stops responding to SNMP
                            cahunt

                            I would check your devices to see if the SNMP query is even making it to the device, and see what is happening to it (if it makes it).

                             

                            Make sure the packet is not being dropped and that the SNMP service stays up on the boxes. If you have multiple routes, you might need to assign some priority so that the same routes are used whenever the main links are up. You could also point NetPath at the IP and SNMP port of a device to see if that maintains a good status. NetPath is less granular, but it has the potential to show if the port is inaccessible or something has changed on your network (i.e. the path to the device is not consistent).

                             

                            If you are changing which poller the boxes are polled from, you would need to allow the IPs of ALL your Orion servers access to the servers, as moving a node to a poller that is not permitted to query it could cause polling to fail.

                            (Just make sure all the poller IPs are in the SNMP access setup on the server.)

                             

                            I know some of these may be a little outside the current scope of troubleshooting. I'm just trying to cover the bases here.

                        • Re: Node stops responding to SNMP
                          rschroeder

                          I've seen similar problems--nodes showing Hardware Status Unknown, and it turns out they're not able to be polled with snmp by NPM.

                           

                          It started two weeks ago, and the issue grew daily from only affecting a couple of nodes to 62 nodes unknown as of last Friday.  I've opened several tickets with SW Support on the topic.  Various recommendations have come through, including applying new hot fixes, a new Buddy Drop, replacing a DLL file on all pollers, etc.  Nothing has shown promise for this particular issue, although other issues have been addressed by these repairs.

                           

                          Today I ran across a Thwack note that mentioned a customer with (apparently) the same issue, and their SW Support engineer suggested removing NetPath monitors that frequently showed failure to reach end nodes.  The thought had to do with a CPU leak.  See this Thwack entry:  Re: Checklist to Rebuild Orion Application

                           

                          I recall adding a lot more NetPath monitors a few weeks ago, many of which were failing. Today I removed ALL the failing NetPath monitors and then rebooted all the pollers.  When they came back up, the 62 "Unknown" nodes dropped to just seven, all of which were ASA 5525-Xs, which have a known bug for hardware monitoring that's an incompatibility between NPM 12.2 and Cisco's IOS, which was NOT a problem with NPM 12.0.1.

                           

                          Other symptoms that caused confusion in the troubleshooting process include the fact that SW Support was only intermittently able to run SNMP Walks against these "Hardware Unknown" nodes from APEs.  Sometimes the nodes (Cisco switches) would not respond at all, sometimes they'd respond with very few OIDs discovered, sometimes they'd respond normally.  SW suggested this wasn't their application's issue, but was a problem with the Cisco switches, and that I should open TAC cases to find the problem and fix it.  I did do this, and the only thing TAC saw was that the switches were being polled a LOT by SNMP--and not from APEs, but from other devices on my network.

                           

                          I replicated the MIB walk / SNMP walk from my own PC (wanting to keep it completely isolated from the SW APE's), and found that the switches still were responding unpredictably.

                           

                          This wasn't a problem at NPM 12.0.1, but showed up a couple of months after I upgraded to 12.2 this past September.  At the same time as the upgrade, I also went to NAM licensing and added another APE, IPAM, and UDT.  IPAM and UDT add quite a bit more polling pressure on switches & routers, and I still get errors about UDT jobs still running when (apparently) they should be completed.

                           

                          Worse, I'm getting more instances of APE C: drives filling up.  Two SW Support cases for this resulted in different fixes that temporarily resolved the issue (one was enabling TLS 1.2 on all APE's).  But the problem is back on two different APE's.  SW Support isn't being helpful on this front at the moment, but I'll escalate and get it resolved (at least temporarily again).

                           

                          Regarding the original concern of nodes showing up as Hardware Unknown due to not being able to be polled via SNMP, I'm hopeful that removing NetPath monitoring of failed destinations may help a lot.  I removed the "bad" NetPath monitors late today and rebooted the pollers.  If the number of "unknown" devices stays below ten by EOB tomorrow, I'll think I'm on to something useful.  If it starts climbing . . .   sigh.

                          • Re: Node stops responding to SNMP
                            gary9193

                            I've found an issue on our servers where after a while they stop responding to SNMP.

                             

                            Strangely, when I open the SNMP Service properties window, go to the Security tab, and click OK, it starts working again.

                             

                            I took a couple of snapshots of the registry before and after and found that a new key was created.

                             

                            I then made a script to add the missing key:

                             

                            REG ADD HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\SNMP\Parameters\RFC1156Agent /v sysName /d %computername% /t REG_SZ /f
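
If you want to run that fix repeatedly (for example from a scheduled task), a check-before-write wrapper may be safer than adding the value unconditionally. The following is only a sketch, written in Python around the same `reg` command: the function names are made up for illustration, and it assumes it runs from an elevated session on the affected Windows server. The thread does not say whether the SNMP service must be restarted after the value is added, so that step is left out.

```python
import subprocess
import sys

# Key and value name are copied from the REG ADD command above.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\SNMP\Parameters\RFC1156Agent"
VALUE = "sysName"


def query_args() -> list:
    """Arguments for `reg query`, which exits non-zero if the value is missing."""
    return ["reg", "query", KEY, "/v", VALUE]


def add_args(computer_name: str) -> list:
    """Arguments equivalent to the REG ADD command above."""
    return ["reg", "add", KEY, "/v", VALUE, "/t", "REG_SZ", "/d", computer_name, "/f"]


def ensure_sysname(computer_name: str) -> bool:
    """Add sysName only if it is missing; return True if it was added."""
    if subprocess.run(query_args(), capture_output=True).returncode == 0:
        return False  # value already present, leave it alone
    subprocess.run(add_args(computer_name), check=True)
    return True


# Only touch the registry when run directly on Windows.
if __name__ == "__main__" and sys.platform == "win32":
    import os

    added = ensure_sysname(os.environ["COMPUTERNAME"])
    print("sysName added" if added else "sysName already present")
```

Logging whether the value had to be re-added each time could also help show how often it disappears, which might point to whatever is removing it.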