
    Node stops responding to SNMP

    shaun_9999

      Good afternoon fellow thwackers.

       

      I have something very weird happening on some servers.  I don't understand why, but sometimes nodes apparently stop responding to SNMP.  There's no obvious reason for it.  The SNMP service doesn't stop, the node doesn't start responding again when I restart the service, and it doesn't come back when I restart the server.

      I am getting the network team to check whether anything is stopping the service from sending responses; meanwhile, I can ping from server to server.  There is no pattern to when it starts and stops, and no real pattern to which servers are affected.  It just randomly stops and then later on starts again.

      It's not confined to a particular VLAN/cluster/domain.  It's spread out evenly across our infrastructure.

       

      The network team has come back to me stating that SNMP isn't being blocked anywhere.

       

      The only thing I could try was restarting the SolarWinds box.

      Any other ideas?

        • Re: Node stops responding to SNMP
          pratikmehta003

          It would be a good idea to check the timeout settings.  Maybe the poller isn't getting sufficient time to reach the end machine and get the response back...
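
          If the net-snmp command-line tools are installed on the poller, a quick sanity check along these lines (the target IP 192.0.2.10 and the community string "public" are placeholders) would show whether a longer timeout actually makes a difference:

          REM Query sysUpTime by numeric OID (no MIBs needed), first with a short timeout, then with a much longer one
          snmpget -v 2c -c public -t 2 -r 1 192.0.2.10 1.3.6.1.2.1.1.3.0
          snmpget -v 2c -c public -t 10 -r 1 192.0.2.10 1.3.6.1.2.1.1.3.0

          If the second query succeeds where the first times out, raising the SNMP timeout in Orion is likely to help.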

          • Re: Node stops responding to SNMP
            grantallenby

            How do you know it stops? Do you get gaps in CPU and Memory Data, or does the node just appear as down?

              • Re: Node stops responding to SNMP
                shaun_9999

                We receive email alerts about failing SNMP.  These were set up by the person who managed the environment before me.

                 

                • Re: Node stops responding to SNMP
                  shaun_9999

                  As a development, the network team did some patching on the switches and now everything on one customer domain has failed.  The test in the node's edit settings fails and no stats are being shown.  The affected machines are a mix of Windows 2012 R2, Windows 2008 R2 and Linux.

                   

                  I have done both a connection test and an SNMP walk.

                  To get to the customer router, the path is as follows:

                   

                  (Customer environment) ------ (Switch 1) ----- (Switch 2) ----- (Router) ----- (Customer router) ----- (Customer Switch)

                   

                  When I run a test within SolarWinds, the request gets to the first switch (via port 161, but with 0 bytes received) and then nothing else.  It doesn't get dropped or accepted.  Nothing happens to it.

                   

                  There isn't a policy to block it or make it vanish; the traffic just doesn't appear.  I then ran an SNMP walk from both the SolarWinds server and a different server and nothing got through, so it's a bit of a weird one.  I can't see what it could be.  My other plan is to get the network team to run Wireshark and see if anything is making it vanish.
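
                  For reference, this is roughly the walk I'd run from both boxes, assuming the net-snmp command-line tools are installed (the community string and target IP are placeholders):

                  REM Walk only the system subtree (1.3.6.1.2.1.1) so the test stays small and quick
                  snmpwalk -v 2c -c public -t 5 -r 1 192.0.2.10 1.3.6.1.2.1.1

                  If the walk stalls at the first switch the same way the SolarWinds test does, that points at the path rather than the agents on the servers.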

                • Re: Node stops responding to SNMP
                  bobmarley

                  SNMP uses UDP, so the packets could simply be getting dropped.  There could also be multiple links to the same destination that aren't routing correctly.  Perhaps try setting up some tests in NetPath and see if there are network issues you are not aware of.
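
                  Before building the NetPath tests, a quick way to check the multiple-links theory from a cmd prompt on the poller (the target IP is a placeholder; double the % signs if you drop this into a batch file):

                  REM Trace the route a few times in a row; if the hops change between runs, the path to the device is not consistent
                  for /L %i in (1,1,5) do tracert -d -w 1000 192.0.2.10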

                  • Re: Node stops responding to SNMP
                    steshi

                    If this is Server 2008 and you are seeing event ID 1501 relating to SNMP in the System log, then this article may be of relevance:

                     

                    https://support.microsoft.com/en-us/help/967800/an-snmp-transport-is-disabled--and-snmp-event-id-1501-is-logged-when-l
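
                    A quick way to check whether the affected boxes are actually logging that event (assuming the event source is named SNMP) is something like:

                    REM Show the five most recent SNMP events with ID 1501 from the System log, newest first
                    wevtutil qe System /q:"*[System[Provider[@Name='SNMP'] and (EventID=1501)]]" /f:text /c:5 /rd:true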

                    • Re: Node stops responding to SNMP
                      cahunt

                      Have you tried increasing your SNMP Timeout setting?  I am wondering if you would see fewer holes in the data / fewer unknown events if you gave the devices a little more time to respond.

                       

                      I am also curious: when a node goes unknown, can you hit Poll Now and get it to pick the status up again, or does it just stay unknown?

                        • Re: Node stops responding to SNMP
                          shaun_9999

                          Someone mentioned that; I have increased it to 10,000.  The nodes haven't failed yet, but I will monitor them as the days progress.

                          I hope it's something as simple as this (let's be honest: unless something is on fire, or you've driven a car into a device, most IT problems are usually simple).

                          I could change them to Windows domain-based authentication, but that's not the point.  It's the fact that it's not doing what I want it to do.

                          • Re: Node stops responding to SNMP
                            shaun_9999

                            I didn't see your second question.  When I try to poll a node, it says polling failed, yet I can ping from my SolarWinds box to the devices themselves with no dropped packets, which is why I mentioned restarting the services.

                            But that seems a very wasteful use of time every time devices stop responding to SNMP requests.
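
                            To pin down exactly when SNMP drops out while ping keeps working, something like this hypothetical watcher batch file could run on the poller (target IP, community string and interval are all placeholders; it assumes the net-snmp tools are installed):

                            @echo off
                            REM Every 60 seconds, log whether the box answers ping and whether it answers an SNMP get of sysUpTime
                            :loop
                            echo ---- %date% %time% >> snmp_watch.log
                            ping -n 1 192.0.2.10 >nul && (echo ping OK >> snmp_watch.log) || (echo ping FAILED >> snmp_watch.log)
                            snmpget -v 2c -c public -t 5 192.0.2.10 1.3.6.1.2.1.1.3.0 >nul 2>&1 && (echo snmp OK >> snmp_watch.log) || (echo snmp FAILED >> snmp_watch.log)
                            timeout /t 60 /nobreak >nul
                            goto loop

                            The timestamps in snmp_watch.log should show whether the SNMP failures line up with anything else happening at the time.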

                              • Re: Node stops responding to SNMP
                                cahunt

                                I would check your devices to see if the SNMP query is even making it to the device, and see what is happening to it (if it makes it).

                                 

                                Make sure the packet is not being dropped and that the SNMP service stays up on the boxes.  If you have multiple routes, you might need to assign some priority so the same routes are used while the main links are up.  You could also point NetPath at the IP:SNMP port of a device to see if that maintains a good status.  NetPath is less granular, but it has the potential to show whether the port is inaccessible or something has changed on your network (i.e. the path to the device is not consistent).

                                 

                                If you are changing which poller the boxes are polled from, then you would need to allow ALL of your Orion servers' IPs access to those servers, as moving a node to a polling engine that the server does not accept SNMP from could cause failures.

                                (Just make sure all of the poller IPs are in the SNMP security setup on the server.)
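
                                On the Windows boxes that list lives in the registry, so it can be audited quickly.  A rough sketch, assuming the standard Windows SNMP service and placeholder values (the value name 3 and the IP 192.0.2.20 are examples - use the next free number and your poller's address):

                                REM List the hosts the SNMP service will currently accept packets from
                                reg query HKLM\SYSTEM\CurrentControlSet\Services\SNMP\Parameters\PermittedManagers
                                REM Add an extra poller IP, then bounce the service so the change is picked up
                                reg add HKLM\SYSTEM\CurrentControlSet\Services\SNMP\Parameters\PermittedManagers /v 3 /t REG_SZ /d 192.0.2.20 /f
                                net stop SNMP && net start SNMP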

                                 

                                I know some of these may be a little outside the current scope of TS. I'm just trying to cover the bases here.

                            • Re: Node stops responding to SNMP
                              rschroeder

                              I've seen similar problems--nodes showing Hardware Status Unknown, and it turns out NPM can't poll them via snmp.

                               

                              It started two weeks ago, and the issue grew daily from only affecting a couple of nodes to 62 nodes unknown as of last Friday.  I've opened several tickets with SW Support on the topic.  Various recommendations have come through, including applying new hot fixes, a new Buddy Drop, replacing a DLL file on all pollers, etc.  Nothing has shown promise for this particular issue, although other issues have been addressed by these repairs.

                               

                              Today I ran across a Thwack note that mentioned a customer with (apparently) the same issue, and their SW Support engineer suggested removing NetPath monitors that frequently showed failure to reach end nodes.  The thought had to do with a CPU leak.  See this Thwack entry:  Re: Checklist to Rebuild Orion Application

                               

                              I recall adding a lot more NetPath monitors a few weeks ago, many of which were failing.  Today I removed ALL the failing NetPath monitors and then rebooted all the pollers.  When they came back up, the 62 "Unknown" nodes dropped to just seven, all of which were ASA 5525-Xs; those have a known hardware-monitoring bug, an incompatibility between NPM 12.2 and Cisco's IOS that was NOT a problem with NPM 12.0.1.

                               

                              Other symptoms that caused confusion in the troubleshooting process include the fact that SW Support was only intermittently able to run SNMP walks against these "Hardware Unknown" nodes from the APE's.  Sometimes the nodes (Cisco switches) would not respond at all, sometimes they'd respond with very few OID's discovered, and sometimes they'd respond normally.  SW suggested this wasn't their application's issue but a problem with the Cisco switches, and that I should open TAC cases to find and fix the problem.  I did, and the only thing TAC saw was that the switches were being polled a LOT via snmp--and not from APE's, but from other devices on my network.

                               

                              I replicated the MIB walk / SNMP walk from my own PC (wanting to keep it completely isolated from the SW APE's), and found that the switches still were responding unpredictably.

                               

                              This wasn't a problem on NPM 12.0.1, but it showed up a couple of months after I upgraded to 12.2 this past September.  At the time of the upgrade, I also went to NAM licensing and added another APE plus IPAM and UDT.  IPAM and UDT add quite a bit more polling pressure on switches & routers, and I still get errors about UDT jobs still running when (apparently) they should already have completed.

                               

                              Worse, I'm getting more instances of APE C: drives filling up.  Two SW Support cases for this resulted in different fixes that temporarily resolved the issue (one was enabling TLS 1.2 on all APE's).  But the problem is back on two different APE's.  SW Support isn't being helpful on this front at the moment, but I'll escalate and get it resolved (at least temporarily again).

                               

                              Regarding the original concern of nodes showing up as Hardware Unknown because they can't be polled via snmp, I'm hopeful that removing NetPath monitoring of failed destinations may help a lot.  I removed the "bad" NetPath monitors late today and rebooted the pollers.  If the "unknown" device count stays below ten by EOB tomorrow, I'll think I'm on to something useful.  If they start climbing . . .   sigh.

                              • Re: Node stops responding to SNMP
                                gary9193

                                I've found an issue on our servers where after a while they stop responding to SNMP.

                                 

                                Strangely, when I open the SNMP Service properties window, go to the Security tab and click OK, it starts working again.

                                 

                                I took a couple of snapshots of the registry before and after and found that a new value had been created.

                                 

                                Then I made a script to add the missing value:

                                 

                                REM Re-create the sysName value the SNMP service expects under the RFC1156Agent key
                                REG ADD HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\SNMP\Parameters\RFC1156Agent /v sysName /d %computername% /t REG_SZ /f
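
                                A minimal sketch of how the same fix could be wrapped so it only fires when the value has actually gone missing, and bounces the service afterwards (it assumes the standard SNMP service name, and it uses CurrentControlSet rather than ControlSet001 - that path is my substitution, not part of the original script):

                                @echo off
                                set KEY=HKLM\SYSTEM\CurrentControlSet\Services\SNMP\Parameters\RFC1156Agent
                                REM Only touch the registry and restart the service if sysName is missing
                                reg query %KEY% /v sysName >nul 2>&1
                                if errorlevel 1 (
                                    reg add %KEY% /v sysName /t REG_SZ /d %COMPUTERNAME% /f
                                    net stop SNMP
                                    net start SNMP
                                )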

                                • Re: Node stops responding to SNMP
                                  rschroeder

                                  Some causes in my network for nodes not getting the snmp information back to NPM have included:

                                  • Too many snmp polling engines are hitting the Distribution Switches.  This isn't limited to Solarwinds, but also includes things like HP's printer discovery / management applications.  My Distribution switches' logs show snmp packets being dropped because they're overloading the switches' snmp queues.  These aren't small distribution switches--they're Cisco 6509's with dual supe's and only two line cards.  They have plenty of CPU and RAM and TCAM resources for "normal" polling.  The trick is to work with TAC and discover the sources of the snmp polling, then contact the folks who manage those sources and explain the problem.  They may need to stop polling, poll less frequently, poll through other paths, etc.  Or you may need to build ACL's to deny that polling from passing through your switches.
                                  • Cisco promised us that the multiple pallets of switches we purchased in the last few years were fully compatible with our plans and had sufficient CPU and memory resources to handle any needs.  When we started deploying ISE on them, they stopped responding to snmp.  Closer examination revealed they do NOT have sufficient CPU or memory to handle all the demands of ISE, and they become "Unknown" to SolarWinds.  The only solution was to replace those multiple pallets of switches.  Happily, they were reaching their planned hardware lifetime replacement cycle, and Cisco gave us a 75% discount on their replacements (we went from 2960S stacks to 3850's).

                                   

                                  Could either of these be part of the source of your observed problems?

                                   

                                  In either case, "Packet-Capture is your friend!"  Knowing what traffic is actually hitting the switches, and its source, can help you identify potential flows that may be responsible for limiting your switches' ability to respond to snmp in a timely manner.
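
                                  If it helps, this is roughly the kind of capture I'd run from a machine plugged into a SPAN/mirror port facing the distribution switch, assuming tshark is installed ("Ethernet" is a placeholder interface name and the five-minute duration is arbitrary):

                                  REM Capture SNMP polls and traps for five minutes and list who is talking to whom, plus the community string used
                                  tshark -i "Ethernet" -f "udp port 161 or udp port 162" -a duration:300 -T fields -e ip.src -e ip.dst -e snmp.community > snmp_talkers.txt

                                  Sorting snmp_talkers.txt by source address makes it pretty obvious which pollers are hammering the switch hardest.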