Many nodes lose SNMP when..

With Orion v9.1.0 SLX, I monitor all kinds of nodes: Cisco devices, Unix, Linux, Windows servers (2000, 2003), UPSes, AS400s, etc.  Some are in my LAN or my department's WAN, and some (mostly Windows nodes) are in other LANs in the corporate MAN.

I have problems with some of those that are in other parts of the MAN.  We recently had a DR test where we turned off some key interfaces on Cisco routers to isolate the building being tested.  Doing this also cuts my monitoring server off from those nodes.  That's fine; it is expected.  The problem is that when we restore connectivity, more than 75% of the nodes monitored in the MAN do not respond to Orion's SNMP requests.  The SysAdmins have to stop and start the SNMP service on all the non-responsive nodes.  But some nodes still work fine, even in the same subnet.

What could be causing SNMP to stop replying?

0 Kudos
18 Replies

I had a similar problem with some Unix boxes a while back.  I upgraded the SNMP daemon and the problem went away...

0 Kudos

Have you tried putting a scope on a span port to see what one of these servers is doing? Try putting Wireshark on a PC that is on a spanned port and see if Orion is sending an SNMP poll and/or the server replies.

If there is a reply, what is the update time for the web page? Also, check the polling interval that Orion cycles on. Also, check whether the device is a downstream member of an alert group where, because device 1 is down, device 1.a is not even looked at.

It might just be a matter of timing in Orion.

0 Kudos

If there is a reply, what is the update time for the web page? Also, check the polling interval that Orion cycles on. Also, check whether the device is a downstream member of an alert group where, because device 1 is down, device 1.a is not even looked at.

It might just be a matter of timing in Orion.

There is no reply. It's as if something has overrun the buffer of the SNMP service and caused it to fail. It's possible it could be a timing issue in Orion, as most of my devices are high-latency; however, that can't really be the explanation if they never come back and never respond to a local SNMP query either.

0 Kudos

I've had no success with another SNMP poller, but I was able to find out that all my Linux servers are still responding to SNMP queries while none of the Win2K3 servers are returning any SNMP data.  I tested 4 different subnets and they all behave the same.

So, it looks like it might be Windows problems.. I'll concentrate on that with the sysadmins..

0 Kudos

I just want to ask one more question for you.

We have had cases where devices behind firewalls take different routes if the backup firewall takes over. Also, if you are connecting to sites using MPLS and VPN, Windows boxes do not always update their route to a destination. I have used Engineer's Tools to look at devices that have been in the state that you describe. I have the Tools on the Orion server. When Tools goes out to ACTIVELY monitor the device and interfaces, it takes a while but it finds the device.

I am thinking that it is doing two things. First, it is starting a new poll service on the WINDOWS server housing itself and Orion. Second, it is updating the Windows Server and the following network devices to change the route matrix.

After I run this, I seem to find that Orion picks up the polling within a couple of polling cycles.

Just some things to try. This is certainly not exhaustive. Hope that something helps.

0 Kudos

I had a strange issue with my SNMP this weekend, maybe it might help you.

I just moved one of my networks from routing off my 4510 switch to my ASA.  After that, Orion was able to list resources but not poll interfaces.  I did a network capture on both interfaces of the ASA and found that a lot more packets were being received on the Orion side of the ASA than were being transmitted to the server side (like 100 more for each poll).  Once I restarted the Orion services, everything started working.  I suspect there is some state table on the ASA that didn't like the poll because it was a continuation of an existing poll, not a newly initiated poll.  I know SNMP is over UDP and therefore is stateless, but Cisco may have some type of 'fixup' for SNMP that stores session information.

Just a thought I wanted to pass on.

0 Kudos

I've been using Tembria SNMP Browser (great free tool by the way - www.tembria.com) to scan new devices. I just tried this on a local server that has lost SNMP - response timed out. I do have my devices locked down to a specific management machine - is anyone else configured this way?

It seems like the system has just overloaded the SNMP service. The only way I have ever been able to resolve this is by restarting the service, whether the node is local or remote (VPN), Windows or Linux. The common denominator is Orion.

0 Kudos

Do not know if anyone is still having this issue. I found that nodes with high latency had issues sending the authentication trap in the SNMP settings. I unchecked it and SNMP began working again.

0 Kudos

This is a pretty old thread, what version are you on?  I am not aware of these types of issues anymore.

0 Kudos

I have the same problem and I think it has to do with DNS.

I have hostnames in the "Accept SNMP packets from these hosts" list for some nodes, and those are the ones that tend to have the most issues like that for me.

0 Kudos

Thanks for adding to the thread!

The DNS thing could be your problem, but I don't see how DNS would affect this on my end, as I only use IP addresses when adding a host in the "Accept SNMP packets..." field and, right now, on the servers with the problem, there isn't any IP in that field.
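
For anyone who does use hostnames in that list, here is a minimal Python sketch for checking whether each entry currently resolves. The function name `resolve_permitted_managers` and the example entries are made up for illustration. Whether the Windows SNMP service resolves these names at start-up or on every request has not been established in this thread, so treat this only as a way to rule DNS in or out.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve_permitted_managers(
    entries: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each permitted-manager entry to its IPv4 address, or None if it fails to resolve.

    Entries that are already IPv4 addresses are returned unchanged by
    socket.gethostbyname, so a mixed list of names and addresses is fine.
    """
    results: Dict[str, Optional[str]] = {}
    for entry in entries:
        try:
            results[entry] = resolver(entry)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[entry] = None
    return results


if __name__ == "__main__":
    # Hypothetical entries; replace with the names configured on the affected node.
    for name, address in resolve_permitted_managers(["orion-poller", "10.0.0.5"]).items():
        print(name, "->", address or "DOES NOT RESOLVE")
```

Run it on the affected node itself (or at least against the same DNS servers), ideally right after connectivity is restored, since that is when resolution is most likely to be failing.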

0 Kudos

What type of message do you get, if any? Is it something like "Node is in an Unknown State"? Or is it just down?

0 Kudos

This is what I get.  Pinging is still fine.  When trying to "Validate SNMP" it just never works (unless there is a start/stop of the SNMP service on the server).  Once that's done, Orion picks it up normally.

0 Kudos

Yes! That looks like the same thing I get... that is when the node itself is up, but the interface has gone into an "Unknown State". I have about 6 nodes of multiple types that show up this way right now (a combination of firewalls, appliances, and servers) that suspend collection and do not respond to SNMP requests until the SNMP service is restarted, or in the case of firewalls and appliances, they are rebooted.

This has been a real issue for me for a LONG time. See these previous posts:

http://thwack.com/forums/thread/28212.aspx - Initial Problem
http://thwack.com/forums/thread/32991.aspx - Trying to resolve with an automatic script
http://thwack.com/forums/thread/52043.aspx - Other people with same issue

From what I can tell, this is not related to a specific NIC, firmware version, or brand. It's interesting to see what's causing your particular issue as well. This has been a grievance of mine for a while, but we have never been able to show specifically whether it's related to the SolarWinds poller or to another software issue.

0 Kudos

Grant, I'm trying to see if there could be an issue with firewalls?

Is there a firewall between the nodes not responding and your poller?  On my end, there are 2 firewalls and one pix between the nodes not responding and my poller.

But, as mentioned in my OP, there are nodes still working in the same subnet!  Ex:  xxx.xxx.33.15/24 (not working) and xxx.xxx.33.16/24 (working)!  If a firewall/PIX were interfering, it would affect both. That's why I have no idea what else could be happening!

Keep posting your findings!

BTW, I've asked the SysAdmins to check the event logs on the Win servers and there is nothing for the SNMP service.

0 Kudos

The devices I have the most problems with are behind additional firewalls. Usually, it's a Windows Server (Domain Controller, specifically Dell 2850 / 2950) at a remote location. Again, the cause I've never been able to determine, but a drop in connectivity could certainly be the related issue. Like you, not all of the devices always have a problem, in fact, it's rarely more than 1 or 2.

That being said, I do have this problem creep up every now and then on servers that are local to the Orion poller. The problem is typically resolved with a restart of SNMP, however some devices running highly specialized operating systems have no way to just restart SNMP, so I have to reboot the entire box, which is often inconvenient to do in the middle of a workday.

I can't imagine the problem is firewall related... come to think of it, have you tried accessing these devices using an SNMP program other than Orion when they fail? This could be the true test to see exactly where the problem lies. I myself have not, though I might check it as soon as I get a moment.

0 Kudos

come to think of it, have you tried accessing these devices using an SNMP program other than Orion when they fail?

OK, well, using another SNMP browser I was unable to connect to an "unknown state" device. Not a big surprise, but it would have been nice.

0 Kudos

Grant, just to add..

I had searched and read your 2 previous threads (thanks for mentioning them again) about this, but I felt my issue was different because, for me, this only happens when there is some loss of communications between my poller and the affected nodes.  In the past 2 weekends, we've had a DR test and a building electrical shutdown (for emergency electrical/ups/generator maintenance).

So, in both instances, we have cut telecomms from the polled nodes and that seemed to have triggered SNMP to become non responsive.

I think I'll be looking into your "restart script" pretty soon!
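
In case it helps while looking into that, here is a rough sketch of what a remote restart could look like in Python. This is not the script from the linked thread (I have not reproduced it here). It assumes the affected nodes are Windows servers whose service name is "SNMP", that it runs from a Windows machine with admin rights on the targets, and that the built-in `sc` command can reach them. The function names and host names are made up.

```python
import subprocess
import time
from typing import Callable, List


def restart_commands(host: str, service: str = "SNMP") -> List[List[str]]:
    """Return the two `sc` command lines that stop and then start a service on a remote host."""
    target = "\\\\" + host  # sc expects the server as \\hostname
    return [["sc", target, "stop", service], ["sc", target, "start", service]]


def restart_snmp(host: str, service: str = "SNMP", pause: float = 5.0,
                 run: Callable = subprocess.run) -> bool:
    """Stop and start the service; return True if the start command exits with 0."""
    stop_cmd, start_cmd = restart_commands(host, service)
    run(stop_cmd, check=False)   # ignore failures here: the service may already be stopped
    time.sleep(pause)            # sc returns before the stop completes, so wait a little
    return run(start_cmd, check=False).returncode == 0


if __name__ == "__main__":
    # Hypothetical host names; feed in whichever nodes fail an SNMP check after the outage.
    for node in ["srv-mtl-01", "srv-mtl-02"]:
        print(node, "restarted" if restart_snmp(node) else "restart FAILED")
```

The fixed pause is a simplification; a more careful version would poll `sc query` until the service reports STOPPED before starting it again.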

0 Kudos