18 Replies Latest reply on Jun 27, 2011 1:25 PM by netlogix

    Many nodes lose SNMP when..

      With Orion v9.1.0 SLX, I monitor all kinds of nodes: Cisco devices, Unix, Linux, Windows servers (2000, 2003), UPSes, AS/400s, etc.  Some are in my LAN or my department's WAN and some (mostly Windows nodes) are in other LANs in the corporate MAN.

      I have problems with some of those that are in other parts of the MAN.  We recently had a DR test where we turned off some key interfaces on Cisco routers to isolate the building being tested.  Doing this also cuts my monitoring server off from those nodes.  That's fine; it is expected.  The problem is that when we restore connectivity, more than 75% of the nodes monitored in the MAN do not respond to Orion's SNMP requests.  The SysAdmins have to stop and start the SNMP service on every non-responsive node.  Yet some nodes still work fine, even in the same subnet.

      What could be causing SNMP to stop replying?

        • Re: Many nodes lose SNMP when..
          netlogix

          I have the same problem and I think it has to do with DNS.

          I have hostnames in the "Accept SNMP packets from these hosts" field for some nodes, and those are the ones that tend to have the most issues like that for me.

            • Re: Many nodes lose SNMP when..

              Thanks for adding to the thread!

              DNS could be your problem, but I don't see how it would affect this on my end: I only use IP addresses when adding a host in the "Accept SNMP packets..." field and, right now, on the servers with the problem, there isn't any IP in that field.

                • Re: Many nodes lose SNMP when..
                  grantsewell

                  What type of message do you get, if any? Is it something like "Node is in an Unknown State"? Or is it just down?

                    • Re: Many nodes lose SNMP when..

                      This is what I get.  Pinging is still fine.  When trying to "Validate SNMP" it just never works (unless there is a start/stop of the SNMP service on the server).  Once that's done, Orion picks it up normally.

                        • Re: Many nodes lose SNMP when..
                          grantsewell

                          Yes! That looks like the same thing I get... that is when the node itself is up, but the interface has gone into an "Unknown State". I have about 6 nodes of multiple types that show up this way right now (a combination of firewalls, appliances, and servers) that suspend collection and do not respond to SNMP requests until the SNMP service is restarted, or in the case of firewalls and appliances, they are rebooted.

                          This has been a real issue for me for a LONG time. See these previous posts:

                          http://thwack.com/forums/thread/28212.aspx - Initial Problem
                          http://thwack.com/forums/thread/32991.aspx - Trying to resolve with an automatic script
                          http://thwack.com/forums/thread/52043.aspx - Other people with same issue

                          From what I can tell, this is not related to a specific NIC, firmware version, or brand. It will be interesting to see what's causing your particular issue as well. This has been a grievance of mine for a while, but we have never been able to show specifically that it's related to the SolarWinds poller or to another software issue.

                            • Re: Many nodes lose SNMP when..

                              Grant, I'm trying to see if there could be an issue with firewalls.

                              Is there a firewall between the nodes not responding and your poller?  On my end, there are 2 firewalls and one PIX between the nodes not responding and my poller.

                              But, as mentioned in my OP, there are nodes still working in the same subnet!  Ex:  xxx.xxx.33.15/24 (not working) and xxx.xxx.33.16/24 (working)!  If a firewall or PIX were interfering, it would affect both.. that's why I have no idea what else could be happening!

                              Keep posting your findings!

                              BTW, I've asked the SysAdmins to check the event logs on the Win servers and there is nothing for the SNMP service.

                                • Re: Many nodes lose SNMP when..

                                  Grant, just to add..

                                  I had searched and read your 2 previous threads (thanks for mentioning them again) about this, but I felt my issue was different because, for me, this only happens when there is some loss of communications between my poller and the affected nodes.  In the past 2 weekends, we've had a DR test and a building electrical shutdown (for emergency electrical/ups/generator maintenance).

                                  So, in both instances, we cut telecomms to the polled nodes, and that seems to have triggered SNMP to become non-responsive.

                                  I think I'll be looking into your "restart script" pretty soon!
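For readers wondering what a restart script of this kind might look like, here is a minimal sketch. It is not the script from Grant's earlier thread. It assumes the affected nodes are Windows servers whose SNMP service has the short name `SNMP`, that the script runs on a Windows machine with rights to control services remotely through `sc`, and that a fixed pause between stop and start is long enough for the service to finish stopping. The function names and the `wait_seconds` parameter are made up for illustration.

```python
import subprocess
import time

SERVICE = "SNMP"  # assumed short name of the Windows SNMP service


def sc_command(host, action, service=SERVICE):
    """Build the `sc` command line that controls a service on a remote host."""
    if action not in ("stop", "start", "query"):
        raise ValueError(f"unsupported action: {action}")
    # sc expects the remote machine in UNC form, e.g. \\server01
    return ["sc", "\\\\" + host, action, service]


def restart_snmp(host, wait_seconds=5.0, run=subprocess.run):
    """Stop, pause, then start the SNMP service; return (action, returncode) pairs.

    `run` is injectable so the logic can be exercised without touching a server.
    """
    results = []
    for action in ("stop", "start"):
        proc = run(sc_command(host, action), capture_output=True, text=True)
        results.append((action, proc.returncode))
        if action == "stop":
            # sc returns before the service has fully stopped; a fixed pause
            # is a crude stand-in for polling `sc query` until STOPPED.
            time.sleep(wait_seconds)
    return results
```

Calling `restart_snmp("xxx.xxx.33.15")` for each non-responsive node would replace the manual stop/start, though it treats the symptom rather than the cause.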

                                  • Re: Many nodes lose SNMP when..
                                    grantsewell

                                    The devices I have the most problems with are behind additional firewalls. Usually, it's a Windows Server (Domain Controller, specifically Dell 2850 / 2950) at a remote location. Again, I've never been able to determine the cause, but a drop in connectivity could certainly be related. As in your case, not all of the devices have the problem at once; in fact, it's rarely more than 1 or 2.

                                    That being said, I do have this problem crop up every now and then on servers that are local to the Orion poller. The problem is typically resolved with a restart of SNMP. However, some devices running highly specialized operating systems have no way to restart just SNMP, so I have to reboot the entire box, which is often inconvenient in the middle of a workday.

                                    I can't imagine the problem is firewall related... come to think of it, have you tried accessing these devices with an SNMP program other than Orion when they fail? That could be the true test of exactly where the problem lies. I have not done so myself, though I might check as soon as I get a moment.
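One way to run that test without installing another tool is to send a single SNMP GET by hand. The sketch below uses only the Python standard library: it hand-encodes an SNMPv2c GetRequest for sysUpTime.0 and reports whether anything that looks like an SNMP message comes back. The community string `public`, the 2-second timeout, and the function names are assumptions for illustration. It has to be run from a host the node accepts SNMP packets from, and it does not decode the reply.

```python
import socket

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0


def ber_len(n):
    """BER length octets: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def tlv(tag, value):
    """Wrap a value in a BER tag-length-value triple."""
    return bytes([tag]) + ber_len(len(value)) + value


def ber_int(n):
    """Encode a non-negative INTEGER (extra leading byte keeps the sign bit clear)."""
    return tlv(0x02, n.to_bytes(n.bit_length() // 8 + 1, "big"))


def ber_oid(dotted):
    """Encode a dotted OID; sub-identifiers use base-128 with continuation bits."""
    parts = [int(p) for p in dotted.split(".")]
    out = bytearray([parts[0] * 40 + parts[1]])
    for sub in parts[2:]:
        chunk = [sub & 0x7F]
        sub >>= 7
        while sub:
            chunk.append(0x80 | (sub & 0x7F))
            sub >>= 7
        out.extend(reversed(chunk))
    return tlv(0x06, bytes(out))


def build_get(community, oid, request_id=1):
    """Build an SNMPv2c GetRequest message for a single OID."""
    varbind = tlv(0x30, ber_oid(oid) + b"\x05\x00")  # OID paired with NULL
    pdu = tlv(0xA0, ber_int(request_id) + ber_int(0) + ber_int(0)
              + tlv(0x30, varbind))
    return tlv(0x30, ber_int(1) + tlv(0x04, community.encode()) + pdu)  # 1 = v2c


def snmp_responds(host, community="public", timeout=2.0, port=161):
    """Return True if the host answers the GET with something SEQUENCE-shaped."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_get(community, SYS_UPTIME_OID), (host, port))
            data, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and unreachable or unresolvable hosts
            return False
    return len(data) > 0 and data[0] == 0x30
```

If `snmp_responds("xxx.xxx.33.15")` returns False while a ping to the same address succeeds, the SNMP agent (or something filtering UDP 161 in between) is at fault rather than Orion.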

                        • Re: Many nodes lose SNMP when..
                          Craig Norborg

                          I had a similar problem with some Unix boxes a while back.  I upgraded the SNMP daemon and the problem went away...

                            • Re: Many nodes lose SNMP when..
                              mhh351

                              Have you tried putting a scope on a SPAN port to see what one of these servers is doing? Try putting Wireshark on a PC that is on a spanned port and see whether Orion is sending an SNMP poll and whether the server replies.

                              If there is a reply, what is the update time for the web page? Also, check the polling interval that Orion cycles on. Also, check whether the device is a downstream member of an alert group, where device 1 being down means device 1.a will not even be looked at.

                              It might just be a matter of timing in Orion.
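The capture suggested above boils down to two questions: did the poller send a request to UDP 161, and did the node answer from UDP 161? A small sketch of that logic follows. The `Packet` tuple, the function names, and the three result labels are hypothetical; the filter string uses standard pcap filter syntax and could be pasted into Wireshark's capture filter box.

```python
from typing import Iterable, NamedTuple


class Packet(NamedTuple):
    """Minimal view of one captured UDP packet (illustrative only)."""
    src: str
    dst: str
    sport: int
    dport: int


def capture_filter(poller, node, port=161):
    """pcap capture filter limiting the trace to SNMP between poller and node."""
    return f"udp port {port} and host {poller} and host {node}"


def diagnose(packets: Iterable[Packet], poller, node, port=161):
    """Classify a capture as 'no_poll', 'no_reply', or 'ok'."""
    packets = list(packets)
    polls = sum(1 for p in packets
                if p.src == poller and p.dst == node and p.dport == port)
    replies = sum(1 for p in packets
                  if p.src == node and p.dst == poller and p.sport == port)
    if polls == 0:
        return "no_poll"   # the poller never asked: look at Orion or the path out
    if replies == 0:
        return "no_reply"  # requests arrive but the agent stays silent
    return "ok"            # both directions seen: look at timing or parsing
```

A result of `no_reply` on the server-side SPAN port would point at the SNMP agent itself, while `no_poll` would point back at the poller or at something dropping requests along the way.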

                                • Re: Many nodes lose SNMP when..
                                  grantsewell

                                  mhh351 wrote: "If there is a reply, what is the update time for the web page? Also, check the polling interval that Orion cycles on. Also, check whether the device is a downstream member of an alert group, where device 1 being down means device 1.a will not even be looked at.

                                  It might just be a matter of timing in Orion."

                                  There is no reply. It's as if something has overrun the buffer of the SNMP service and caused it to fail. It could be a timing issue in Orion, as most of my devices are high-latency; however, that can't really be the explanation if they never come back and never respond to a local SNMP query.

                                    • Re: Many nodes lose SNMP when..

                                      I've had no success with another SNMP poller, but I was able to find out that all my Linux servers are still responding to SNMP queries while none of the Win2K3 servers are returning any SNMP data.  I tested 4 different subnets and they all behave the same.

                                      So, it looks like it might be a Windows problem.. I'll concentrate on that with the sysadmins..

                                        • Re: Many nodes lose SNMP when..
                                          mhh351

                                          I just want to ask one more question for you.

                                          We have had cases where devices behind firewalls take different routes if the backup firewall takes over. Also, if you are connecting to sites using MPLS and VPN, Windows boxes do not always update how to get to a destination. I have used Engineer's Tools to look at devices that have been in the state that you describe. I have the Tools on the Orion server. When Tools goes out to ACTIVELY monitor the device and interfaces, it takes a while but it finds the device.

                                          I am thinking that it is doing two things. First, it is starting a new poll service on the WINDOWS server housing itself and Orion. Second, it is updating the Windows Server and the following network devices to change the route matrix.

                                          After I run this, I seem to find that Orion picks up the polling within a couple of polling cycles.

                                          Just some things to try. This is certainly not exhaustive. Hope that something helps.

                                            • Re: Many nodes lose SNMP when..
                                              grantsewell

                                              I've been using Tembria SNMP Browser (great free tool by the way - www.tembria.com) to scan new devices. I just tried this on a local server that has lost SNMP - response timed out. I do have my devices locked down to a specific management machine - is anyone else configured this way?

                                              It seems like the system has just overloaded the SNMP service; the only way I have ever been able to resolve this is restarting the service, whether it be local or remote (VPN), Windows or Linux. The common denominator is Orion.

                                              • Re: Many nodes lose SNMP when..
                                                netlogix

                                                I had a strange issue with my SNMP this weekend; maybe it will help you.

                                                I just moved one of my networks from routing off my 4510 switch to my ASA.  After that, Orion was able to list resources but not poll interfaces.  I did a network capture on both interfaces of the ASA and found that a lot more packets were being received on the Orion side of the ASA than were being transmitted to the server side (about 100 more for each poll).  Once I restarted the Orion services, everything started working.  I suspect there is some state table on the ASA that didn't like the poll because it was a continuation of an existing poll, not a newly initiated one.  I know SNMP runs over UDP and is therefore stateless, but Cisco may have some type of 'fixup' for SNMP that stores session information.

                                                Just a thought I wanted to pass on.