Nodes continue to show as down when they are up

We lost connection on our firewalls on two sites today.

Everything came back up, but several nodes (not all of them) still show as down in SolarWinds.

It has been over an hour, and the polling interval is 2 minutes.  I tried polling them manually, too, and they still show as down.

I can ping them and connect/log in to them (the down nodes include servers, routers, and wireless access points), so I know they're up.  I can also verify SNMP on the devices, yet they still show as down.

I went to the app server and stopped all the SolarWinds/Orion services, verified that I couldn't get to the web console, waited for 5 minutes and restarted them.  They still show as down.

Any idea why these devices are polling as down?  They are set up for both SNMPv2 and ICMP but if I run a test, it fails.

I rebooted the server.  Same problem.

Why is it that I can log in to these servers, and people are able to get files, authenticate, and pull GP on domain controllers, but they still show as down in NPM?

I also see a strange notification that I'm in an evaluation version, but this has been a licensed install from the start, on 2 new servers.

Ideas?

Edit:  an hour after posting this, one of the domain controllers showed up as being up, but the rest are still down.

On a hunch, I pinged one of the other servers from the app server and sure enough, the ping didn't come back.  It resolved to the correct IP address, but did not ping.

The DNS servers are up and running and pingable.

I can ping it from my machine, which is on the same switch and routers as the server, and I can even remote desktop into it from my machine, but the SolarWinds app server can't see it.

Why would 2 firewalls dropping off for a few minutes and coming back up cause anything like this?

  • Did these firewalls have changes that weren't committed to the startup config?

    SolarWinds determines up/down based solely on pings, except when you have specified another method on the particular node, so you pretty clearly have something in the pipe blocking ICMP between your polling server and the nodes.  Traceroute from the server and see where it stops.  Check the inbound and outbound firewall/ACL rules to make sure that there isn't a directional issue.

    If you have a fancy firewall, some of them can simulate packet flows to verify that the specified traffic can make it from x.x.x.x to y.y.y.y across specified interfaces.

    Whenever I am troubleshooting ANYTHING in Orion, I do it from the polling server itself.  I can't count the number of times that I have had firewall/ACL/antivirus situations that cut communication to the polling server but didn't impact whatever workstation network we happened to be sitting on.

    -Marc Netterfield

        Loop1 Systems: SolarWinds Training and Professional Services
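
The advice above (test from the polling server itself, not from a workstation) can be sketched as a small ping sweep. This is only an illustration, not anything SolarWinds ships: the function names `build_ping_command`, `is_reachable`, and `sweep` are hypothetical, and it assumes Python is available on the polling server and that the stock `ping` command is on the path.

```python
import platform
import subprocess


def build_ping_command(host, system=None):
    """Return a one-echo ping command for the given OS name."""
    system = system or platform.system()
    if system == "Windows":
        # -n 1: send one echo request; -w 2000: wait up to 2000 ms for a reply
        return ["ping", "-n", "1", "-w", "2000", host]
    # -c 1: send one echo request (Linux/macOS style)
    return ["ping", "-c", "1", host]


def is_reachable(host, runner=subprocess.run):
    """True if a single ping to host exits with status 0."""
    try:
        result = runner(
            build_ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,  # give up if ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def sweep(hosts, check=is_reachable):
    """Map each host to 'up' or 'down' as seen from this machine."""
    return {host: ("up" if check(host) else "down") for host in hosts}
```

Calling `sweep([...])` with the addresses of the nodes Orion shows as down, once on the polling server and once on a workstation, gives two result sets to compare. One caveat: on Windows, `ping` can exit with status 0 when an intermediate device answers "Destination host unreachable", so an "up" result is a hint rather than proof.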

  • Is your firewall a Checkpoint firewall by any chance?

  • What mesverrum suggested.

    I would also do a courtesy stop-all and restart of all SW services on your polling engine.

    Can you confirm that the services on the nodes that are back up are working?

    Seems firewall-ish from the symptoms.

  • If you go to the web console and run Poll from the settings on both the SolarWinds server and the server the website is on, you should be able to see the website then. Let me know if it works.

  • Nothing changed on the firewall.

    The only thing that seems to be affected is the SolarWinds app server.  ICMP works for other devices on that server that follow the same network path, so I'm not sure why it's not working on these few.

    Tracert goes one hop and then times out the rest of the way, so it doesn't look like it's getting to the firewall when I tracert to one of those nodes.  When I do the others, it hits all the correct hops over the bridge to the server, so if it were an issue with ICMP or SNMP on the SolarWinds server, it wouldn't be picking up the other nodes at that site.

    I'm not sure how that got marked as the right answer, because, although helpful, it only reinforced all the troubleshooting I already did and didn't give me any new options.

    I appreciate it, but it shouldn't be marked as a correct answer for this issue.
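
The comparison described above (one trace dies after the first hop, another to the same site completes) can be made mechanical. The sketch below is an illustration only: `parse_hops` and `first_divergence` are hypothetical helper names, and the parser assumes Windows `tracert -d` style output, where each hop line starts with the hop number and a timed-out hop carries no address.

```python
import re

HOP_LINE = re.compile(r"^\s*(\d+)\s")                # lines that begin with a hop number
IPV4 = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")  # first dotted-quad on the line


def parse_hops(tracert_output):
    """Return hop addresses in order; None marks a hop that timed out."""
    hops = []
    for line in tracert_output.splitlines():
        if not HOP_LINE.match(line):
            continue  # skip the header, blank lines, and "Trace complete."
        found = IPV4.search(line)
        hops.append(found.group(1) if found else None)
    return hops


def first_divergence(good_hops, bad_hops):
    """Return the 1-based hop number where the paths first differ, or None."""
    for number, (good, bad) in enumerate(zip(good_hops, bad_hops), start=1):
        if good != bad:
            return number
    return None
```

Feeding it the saved output of a trace to a working node and a trace to a failing node at the same site points at the first hop where they part ways, which is the device to look at next. The final hop will differ whenever the two destinations differ, so only an earlier divergence is interesting.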

  • I already stopped all the services and restarted them.  It's in my original post along with a full reboot.

    All the services on all the machines are working.

    We have a file server, a domain controller, and DNS/DHCP, and all are running properly.  Users can pull GP, they are pulling IPs, and DNS is resolving.  As a matter of fact, DNS resolves on the SolarWinds server when I ping some of these devices, but the pings just time out.  I've tried with the name and the IP.  Same thing.

    If it's a firewall issue, we can't find it.  Nothing should have changed.  We have had these dip several times before and had no issues.

    Routing tables look fine on the SolarWinds server as well, and it is getting information from other nodes on that site.

  • There's nothing wrong with the console.  That is working on all the nodes on the other site and some on the one we're having issues on.  I've done manual polls, I've polled each node individually, and I've even reset IP addresses and SNMP strings on them and tried again and it still doesn't work.

  • The reason I asked is because of my comments on this thread:

    False alerts on Checkpoint devices

    But I don't believe it is applicable in your case, based on the troubleshooting steps you have described. Throwing it out there anyway.

    The only other thing that comes to mind is the ARP cache:

    Look to the ARP Cache when troubleshooting flaky connectivity issues - TechRepublic
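
One way to act on the ARP cache suggestion is to compare the polling server's cache with that of a machine that can reach the nodes. The sketch below is an assumption-laden illustration: `parse_arp_table` and `mac_mismatches` are hypothetical names, and the parser expects Windows `arp -a` style rows (IP address, dash-separated MAC, entry type). Only on-subnet addresses such as the default gateway appear in the cache, so this checks the first hop rather than the remote nodes themselves.

```python
import re

# One row of Windows-style `arp -a` output: IP, MAC with dashes, entry type.
ARP_ROW = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+(\w+)\s*$"
)


def parse_arp_table(arp_output):
    """Return {ip: (mac, type)}; if an IP repeats, the last row wins."""
    entries = {}
    for line in arp_output.splitlines():
        row = ARP_ROW.match(line)
        if row:
            ip, mac, kind = row.groups()
            entries[ip] = (mac.lower(), kind)
    return entries


def mac_mismatches(poller_entries, reference_entries):
    """Return {ip: (poller_mac, reference_mac)} where the two caches disagree."""
    mismatches = {}
    for ip, (mac, _kind) in poller_entries.items():
        if ip in reference_entries and reference_entries[ip][0] != mac:
            mismatches[ip] = (mac, reference_entries[ip][0])
    return mismatches
```

A mismatch for the gateway address would suggest the polling server is holding a stale entry for its next hop; an empty result means the ARP cache is probably not the culprit.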

  • Yea, that didn't work either.

    We lost another domain controller on the other site this morning for no apparent reason, along with users over there not being able to connect to their file server.

    I would say it might be a switch issue, but some of the ones that are coming up as good are on the same switch.