Nodes continue to show as down when they are up

We lost connection on our firewalls at two sites today.

Everything came back up, but in SolarWinds several nodes (not all of them) still show as down.

It has been over an hour and the polling interval is 2 minutes.  I tried manually polling them, too, and they still show as down.

I can ping them and log in to them (the nodes showing as down include servers, routers, and wireless access points), so I know they're up, and I can verify SNMP on the devices, yet they still show as down.

I went to the app server and stopped all the SolarWinds/Orion services, verified that I couldn't get to the web console, waited for 5 minutes and restarted them.  They still show as down.

Any idea why these devices are polling as down?  They are set up for both SNMPv2 and ICMP but if I run a test, it fails.

I rebooted the server.  Same problem.

Why is it that I can log in to these servers, and people are able to get files, authenticate, and pull GP on domain controllers, but they still show as down in NPM?

I also see a strange notification that I'm running an evaluation version, even though this has been a licensed install from the start, on 2 new servers.


Edit:  an hour after posting this, one of the domain controllers showed up as being up, but the rest are still down.

On a hunch, I pinged one of the other servers from the app server and sure enough, the ping didn't come back.  It resolved to the correct IP address, but did not ping.

The DNS servers are up and running and pingable.

I can ping it from my machine, which is on the same switch and routers as the server, and I can even remote desktop into it from my machine, but the SolarWinds app server can't see it.
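
For anyone wanting to repeat that comparison, here is a rough Python sketch (not anything built into SolarWinds) that could be run once on the app server and once on a workstation to compare results. It resolves the name, pings the resolved IP, and tries a TCP connection. The function names and the default port 3389 (RDP) are my own assumptions for illustration.

```python
import platform
import socket
import subprocess


def ping_command(host, count=2, system=None):
    """Build a ping command; Windows takes -n for the count, most others take -c."""
    system = system or platform.system()
    flag = "-n" if system == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def check_host(name, tcp_port=3389, timeout=3.0):
    """Return the resolved IP plus ICMP and TCP reachability for one host."""
    result = {"name": name, "ip": None, "icmp": False, "tcp": False}
    try:
        result["ip"] = socket.gethostbyname(name)
    except OSError:
        return result  # name did not resolve, so there is nothing else to test

    # ICMP check via the OS ping tool. Caveat: on Windows, ping can exit 0
    # when a router answers "Destination host unreachable", so read the
    # output as well if a result looks suspicious.
    try:
        proc = subprocess.run(
            ping_command(result["ip"]),
            capture_output=True, text=True, timeout=30,
        )
        result["icmp"] = proc.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        pass

    # TCP check, e.g. 3389 for RDP (assumed port; change it to suit the node).
    try:
        with socket.create_connection((result["ip"], tcp_port), timeout=timeout):
            result["tcp"] = True
    except OSError:
        pass
    return result
```

Calling `check_host("dc01.example.local")` (a made-up hostname) on both machines and comparing the two dictionaries would show whether only ICMP fails from the app server or whether TCP fails from there as well.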

Why would 2 firewalls dropping off for a few minutes and coming back up cause anything like this?

  • Did these firewalls have changes that weren't committed to the startup config?

    SolarWinds determines up/down based solely on pings unless you have specified another method on the particular node, so you pretty clearly have something in the path blocking ICMP between your polling server and the nodes.  Traceroute from the server and see where it stops.  Check the inbound and outbound firewall/ACL rules to make sure that there isn't a directional issue.

    Some of the fancier firewalls can simulate packet flows to verify that the specified traffic can make it from x.x.x.x to y.y.y.y across specified interfaces.

    Whenever I am troubleshooting ANYTHING in Orion I do it from the polling server itself.  I can't count the number of times I have had firewall/ACL/antivirus situations that cut communication to the polling server but didn't affect whatever workstation network we happened to be sitting on.
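
The "traceroute from the server and see where it stops" step can be scripted if there are a lot of nodes to check. This is only a sketch under my own assumptions: the function names are hypothetical, it relies on the standard `tracert` (Windows) or `traceroute` (Linux/macOS) tools being installed, and the parser simply looks for the last hop line that contains an IPv4 address.

```python
import platform
import re
import subprocess


def trace_command(host, system=None):
    """Build a numeric traceroute command for the current OS."""
    system = system or platform.system()
    if system == "Windows":
        return ["tracert", "-d", host]   # -d: do not resolve hop addresses to names
    return ["traceroute", "-n", host]    # -n: print hop addresses numerically


def last_responding_hop(output):
    """Return (hop_number, address) of the last hop that answered, or None."""
    last = None
    for line in output.splitlines():
        match = re.match(r"\s*(\d+)\s+(.*)", line)  # hop lines start with a number
        if not match:
            continue
        hop, rest = int(match.group(1)), match.group(2)
        addresses = re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", rest)
        if addresses:  # lines of "* * *" timeouts contain no address
            last = (hop, addresses[0])
    return last


def run_trace(host):
    """Run the trace from this machine and report where replies stop."""
    proc = subprocess.run(trace_command(host), capture_output=True, text=True)
    return last_responding_hop(proc.stdout)
```

Running `run_trace("10.20.0.15")` (a made-up address) on the polling server returns the last hop that replied. The device just past that hop is the first place to look for a firewall or ACL rule dropping ICMP.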

