I've been having a persistent issue with SolarWinds. Our company has a main facility as well as a number of remote facilities. A few times each week (usually in the morning), SolarWinds will show that most or all of the Cisco 3700 series switches at one of our non-primary locations are down. I can still ping these devices just fine, and toward the end of the day these false alerts will usually have sorted themselves out. Thinking it might just be network latency, I have tried tweaking our polling settings, but that hasn't yielded any results so far. I have changed the polling intervals for the nodes in question, and I've also changed the "node down" alert to allow more time before it fires.
I was wondering if anyone else has had similar issues and what steps were taken to resolve it.
I appreciate any and all feedback and will gladly share any more information that may be useful.
You might also try the poller tool in the install directory, if you have access; it may help you diagnose the problem. You can change the polling method in there and detect which one, if any, it's using.
Mine was in C:\Program Files (x86)\SolarWinds\Orion
Are you using ICMP or SNMP for Status Polling? ICMP is normally the default unless it’s restricted.
Edit: This is located in List Resources of the Node and is different from the polling method found in Edit Node.
I would suggest changing that to ICMP if possible; that way you still get an up/down status even if there are problems with SNMP connectivity or the SNMP agent.
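To make the distinction concrete, here is a minimal sketch (not SolarWinds internals; the names and status labels are made up for illustration) of why separating ICMP reachability from SNMP agent health keeps a reachable node from being reported as down when only SNMP is broken:

```python
# Hypothetical sketch: classify a node from one polling cycle using
# both ICMP reachability and SNMP agent health. The statuses and
# class names are illustrative, not SolarWinds internals.
from dataclasses import dataclass

@dataclass
class PollResult:
    icmp_ok: bool   # did the node answer ping?
    snmp_ok: bool   # did the SNMP agent respond?

def node_status(result: PollResult) -> str:
    """Return a status string for a single polling cycle."""
    if result.icmp_ok and result.snmp_ok:
        return "Up"
    if result.icmp_ok and not result.snmp_ok:
        # Reachable but the agent isn't answering: likely a community
        # string, ACL, or firewall problem rather than an outage.
        return "Up (SNMP degraded)"
    return "Down"

# A node that answers ping but not SNMP is flagged, not declared down.
print(node_status(PollResult(icmp_ok=True, snmp_ok=False)))
# → Up (SNMP degraded)
```

With status polling set to SNMP only, the middle case collapses into "Down," which matches the symptom described in the original post.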
To validate what is or is not happening, you could configure IP SLA jobs on your affected Cisco devices. Even if it's just ICMP, you may gain some insight. Have them ping a SolarWinds poller. Another option is NetPath.
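If you go the IP SLA route, a minimal sketch of an ICMP-echo probe on IOS might look like the following; the SLA entry number, frequency, and the poller address 10.0.0.10 are placeholders to adjust for your environment:

```
! Probe the SolarWinds poller once a minute from the switch itself,
! so you have an independent record of reachability in each direction.
ip sla 10
 icmp-echo 10.0.0.10
 frequency 60
ip sla schedule 10 life forever start-time now
```

Comparing the switch-side probe history against the poller's graphs will tell you whether the morning outages are real loss on the path or something local to the poller.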
Might someone have changed the access control list rules that allow your pollers to reach these switches?
Could one or more firewall rules have changed that would deny your pollers access to the switches via SNMP?
Did one or more SNMP community strings change, either on the switches or on the poller(s)? They have to stay in sync on both sides of the link for systems to be properly monitored and show "up" if you're monitoring via SNMP instead of ICMP.
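One quick way to test whether the strings are in sync is a manual query run from the poller itself using the Net-SNMP tools, assuming they're installed there; the address 192.0.2.25 and community "public" below are placeholders for one of your switches and its configured string:

```
# 1.3.6.1.2.1.1.5.0 is sysName.0; a timeout here while ping succeeds
# points at the community string, an ACL, or a firewall rule.
snmpget -v2c -c public -t 5 192.0.2.25 1.3.6.1.2.1.1.5.0
```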
Nothing with the string has changed, and I don't believe anyone has touched the access control list lately (I'll ask around just to be safe). I was actually thinking this might have something to do with our firewall, so it's reassuring that you mentioned it. Some of our other devices that are polled via SNMP don't pull all of the info they should, i.e., no system name, device type, etc., just an IP (which is very annoying). Those devices don't go down, though. It only seems to be our 3700 series switches.
Thank you for your feedback!
Ask your firewall administrator to check their firewall logs for traffic between your poller(s) and the nodes to see what's being denied (if anything). Then request it be allowed.
Thank you! You were right. All the servers are showing 100% packet loss. I can still ping these devices; could this be an issue with our polling method (SNMP) not working correctly?
100%? All of them?
That doesn't sound good.
Can you post a one-week screenshot of an example node like I did above?
I suggest applying a packet-loss threshold to your node-down alerts. For example, we alert above 90% loss to cut down on noise.
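As a hedged sketch of checking that threshold by hand, a SWQL query like this (run from SWQL Studio or the Orion SDK; `Orion.Nodes` and its `PercentLoss` field are standard Orion schema names, but verify them against your own version) would list the nodes currently above 90% loss:

```sql
SELECT Caption, IPAddress, PercentLoss
FROM Orion.Nodes
WHERE PercentLoss > 90
ORDER BY PercentLoss DESC
```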
Also, the obscure setting below is discussed a lot on this very topic. What is yours set to? I believe 120 is the default.
This is what the graphs for all three of the down nodes look like. All exactly the same.
As for the "node warning level," it is set to 120 seconds. I've reached out to the net admin at that location to see if he can confirm or deny these servers' state. Thanks for the info about packet loss; I'm still relatively new to SolarWinds and didn't know that screen existed!