Aye, the dreaded Node Down alerts caused by minor packet loss, methinks.
Check the Packet Loss chart on one of the Nodes. Look for Red.
Sounds like a use case for configuring dependencies.
Set up SNMP traps for additional visibility. Trap on change of status for critical uplink interfaces. You could also supplement with syslog to understand what is or isn't happening with critical interfaces.
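To illustrate the syslog side of that suggestion, here's a minimal sketch that scans syslog lines for link state changes on watched uplink interfaces. The message format, interface name, and sample lines below are hypothetical (Cisco-style `%LINEPROTO-5-UPDOWN`); adjust the pattern to whatever your devices actually emit.

```python
import re

# Hypothetical Cisco-style link-state message; tune to your devices' syslog format.
LINK_EVENT = re.compile(
    r"%LINEPROTO-5-UPDOWN: Line protocol on Interface (\S+), "
    r"changed state to (up|down)"
)

def interface_events(lines, watch=("GigabitEthernet0/1",)):
    """Yield (interface, state) for state changes on watched uplinks."""
    for line in lines:
        m = LINK_EVENT.search(line)
        if m and m.group(1) in watch:
            yield m.group(1), m.group(2)

# Illustrative sample log lines (not real data from this thread).
sample = [
    "Dec 14 10:02:11 core-sw1 %LINEPROTO-5-UPDOWN: Line protocol on "
    "Interface GigabitEthernet0/1, changed state to down",
    "Dec 14 10:02:40 core-sw1 %LINEPROTO-5-UPDOWN: Line protocol on "
    "Interface GigabitEthernet0/1, changed state to up",
]
for intf, state in interface_events(sample):
    print(intf, state)
```

A short down/up flap on an uplink right around each Node Down alert would point straight at the intermittent path problem discussed below.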
Have you considered agent-based monitoring or a remote poller?
So here is the interesting point about that graph (thank you for including it, by the way!) - SOMETHING is wrong with the connection between the SolarWinds server and the node. That's a lot of packet loss to be having, and if the SolarWinds server is seeing it, there's a chance that other machines are experiencing it too. But since it's intermittent, and since there are retries and packets can be resent, users experience it as "the app is slow," not "the app is down."
It sounds like you are providing monitoring as a service - meaning that this could just be a problem with the connection between your data center and theirs, but not a problem on the customer's network. If this is the case, you need to figure out why. Check the monitoring of all the network devices between the SolarWinds server and that node - the VPN gateway, etc. NetPath would help a lot in this instance.
If I'm wrong and when you say "customer" you mean an internal customer, then you might have a latent but un-diagnosed issue within your network. And guess what? NetPath can *still* help you identify it!
For now, you may want to increase the delay in the alert so that it only triggers if the device is "down" for > 12 minutes (or whatever the average packet loss duration is). This isn't optimal, but it will cut out the noise until you can get the network issue resolved.
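The "delay the alert" idea above is just debouncing: only fire when the node has been down continuously for longer than a threshold. In Orion this is the "Condition must exist for more than X minutes" setting on the trigger; the sketch below only illustrates the logic, with a hypothetical 2-minute poll interval (so ">12 minutes" is roughly 7 consecutive failed polls).

```python
def should_alert(poll_history, threshold_polls):
    """Debounced node-down check.

    poll_history: newest-last list of booleans, True = node responded.
    Fire only if the last `threshold_polls` polls were ALL failures,
    so brief intermittent loss never triggers the alert.
    """
    if len(poll_history) < threshold_polls:
        return False
    return not any(poll_history[-threshold_polls:])
```

With a 2-minute poll cycle, `threshold_polls=7` approximates "down for more than 12 minutes"; a single successful poll in that window resets the clock, which is exactly what suppresses the flapping.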
As a side note, this could be a simple issue of a bad NIC on the SolarWinds server (or the switch port the SW server is plugged into). Once again, NetPath will show you the slowdown point, and from there you can turn up monitoring on THAT device to see the specific root cause.
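While you chase the root cause, you can independently measure the loss from the SolarWinds server's point of view with a quick probe script. This is a rough sketch, not a replacement for NetPath: it shells out to the system `ping` (the `-c`/`-W` flags below are Linux-style; macOS interprets `-W` in milliseconds, and Windows uses different flags entirely).

```python
import subprocess

def ping_once(host, timeout_s=1):
    """Return True if one ICMP echo gets a reply (Linux-style `ping -c 1`)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def loss_percent(results):
    """Percentage of failed probes in a list of booleans (True = reply)."""
    if not results:
        return 0.0
    return 100.0 * results.count(False) / len(results)

# Example run (hypothetical host): probe 100 times and report loss.
# results = [ping_once("10.0.0.5") for _ in range(100)]
# print(f"loss: {loss_percent(results):.1f}%")
```

Running this from the SolarWinds server and from another host on the same segment at the same time would tell you whether the loss is specific to the SW server (bad NIC / switch port) or shared by the whole path.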
This is what I am being told:
The node in question appears not to be having any issues and is up the entire time that it is unreachable from SolarWinds. I just noticed it went down a minute ago and ran some quick tests; it was up from the management network and was up inside the routing domain. This appears to be a case of the random dropping that the customer's FortiGates have done since the beginning of time. The migration to the Palo Altos will fix this issue. Also, from the customer's perspective, the server has never actually gone down or even been slow.
My personal stake in this is that all of these Node Down events add up to a significant amount of downtime, which seriously impacts Availability - a key metric and deliverable to the client via SLA. My fear is that this type of packet loss is not unique to this one server. We have multiple customers with a total of over 4,500 servers. We only found this one because there were enough long-lasting events to generate a significant number of alerts, and the alerts tipped us off.
There have been 244 node down Alerts in December. And these are alerts, not events. In the month of December so far there have been 920 Node Down Events.
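To put those 920 events in SLA terms, here's a back-of-the-envelope availability calculation. The per-event duration is an assumption I'm making for illustration (say each false "down" keeps the node marked down ~4 minutes until the next successful poll); plug in your real numbers.

```python
def availability_pct(events, minutes_per_event, period_minutes):
    """Reported availability after `events` outages of a given length."""
    downtime = events * minutes_per_event
    return 100.0 * (1 - downtime / period_minutes)

december = 31 * 24 * 60  # 44,640 minutes in December

# 920 events x an assumed ~4 minutes each of reported downtime.
print(round(availability_pct(920, 4, december), 2))
```

Even at only a few minutes per event, 920 events drags reported availability well below "three nines," which is why false Node Down events are an SLA problem and not just alert noise.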
Is this node managed via SNMP? If so, you can change the method that is used to monitor status, availability, and response time via 'List Resources' from ICMP to SNMP. This may resolve the issue. If it's a server, a better option would be to install the Agent on that machine.