6 Replies Latest reply on Dec 28, 2017 10:44 AM by aLTeReGo

    Node Down Events are misleading me.

    pflanz

      I am getting an extraordinary number of Node Down Events, some of which eventually trigger an Alert because the node stays down for over 10 minutes.

      Since December 4th, when a server came online, I have had 523 such events, and until today I had not had a single day this month without any.

       

      I am being told that the node is not actually down, and that this reflects the link between SolarWinds and the server. Supposedly the customer sees no degradation of service, no slowness, and no indication that the network is down.

       

      I just got my first event today, but was unable to ping while it was down... too busy writing this Thwack.


      This has a severe impact on that node's availability, which we report to the customer.

       

      Any advice?

        • Re: Node Down Events are misleading me.
          hpstech

          Aye, the dreaded Node Down alerts caused by minor packet loss, methinks.

           

          Check the Packet Loss chart on one of the Nodes. Look for Red.

          • Re: Node Down Events are misleading me.
            d09h

            Sounds like a use case for configuring dependencies

             

            Also:

             

            Set up SNMP traps for additional visibility. Trap on change of status for critical uplink interfaces. You could also supplement with syslog to understand what is or isn't happening with critical interfaces.

             

            Have you considered agent-based monitoring or a remote poller?

            • Re: Node Down Events are misleading me.
              pflanz

              This is the packet loss for this server, sampled every 15 minutes.

              I am told the node is not really down and that everything is functional, even though during this timeframe I have over 525 Node Down Events due to no response.

                • Re: Node Down Events are misleading me.
                  Leon Adato

                  So here is the interesting point about that graph (thank you for including it, by the way!) - SOMETHING is wrong with the connection between the SolarWinds server and the node. That's a lot of packet loss to be having, and if the SolarWinds server is having it, there's a chance that other machines are experiencing it too. But since it's intermittent, and since there are re-tries and packets can be re-sent, users experience it as "the app is slow" not "the app is down".

                   

                  It sounds like you are providing monitoring as a service - meaning that this could just be a problem with the connection between your data center and theirs, but not a problem on the customer's network. If this is the case, you need to figure out why. Check the monitoring of all the network devices between the SolarWinds server and that node - the VPN gateway, etc. NetPath would help a lot in this instance.

                   

                  If I'm wrong and when you say "customer" you mean an internal customer, then you might have a latent but un-diagnosed issue within your network. And guess what? NetPath can *still* help you identify it!

                   

                  For now, you may want to increase the delay in the alert so that it only triggers if the device is "down" for > 12 minutes (or whatever the average packet loss duration is). This isn't optimal, but it will cut out the noise until you can get the network issue resolved.
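To get a feel for how much noise a longer delay would remove, here is a minimal sketch that filters a list of down/up timestamp pairs by outage duration. It is not SolarWinds code and does not use any SolarWinds API; the function name `events_exceeding_delay` and the sample timestamps are made up for illustration, and the 12-minute threshold is just the figure suggested above.

```python
from datetime import datetime, timedelta


def events_exceeding_delay(events, delay=timedelta(minutes=12)):
    """Return only the (down, up) pairs whose outage lasted longer than `delay`.

    `events` is a list of (down_time, up_time) datetime pairs. This mirrors
    the idea of an alert that fires only when the node has been "down" for
    more than the configured delay.
    """
    return [(down, up) for down, up in events if up - down > delay]


# Hypothetical sample data, not taken from the thread.
events = [
    (datetime(2017, 12, 20, 9, 0), datetime(2017, 12, 20, 9, 3)),     # 3 minutes
    (datetime(2017, 12, 20, 11, 0), datetime(2017, 12, 20, 11, 15)),  # 15 minutes
    (datetime(2017, 12, 20, 14, 0), datetime(2017, 12, 20, 14, 12)),  # exactly 12 minutes
]

alerting = events_exceeding_delay(events)
print(len(alerting), "of", len(events), "events would still alert")
```

Running the same filter over an export of your real Node Down and Node Up events, with a few different delay values, should show which threshold cuts the noise without hiding the longer outages.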

                   

                  As a side note, this could be a simple issue of a bad NIC on the SolarWinds server (or the switch port the SolarWinds server is plugged into). Once again, NetPath will show you the slowdown point and from there you can turn up monitoring on THAT device to see the specific root cause.

                   

                  HTH

                   

                  - Leon

                    • Re: Node Down Events are misleading me.
                      pflanz

                      This is what I am being told:

                       

                      The node in question appears not to be having any issues and is up the entire time that it is unreachable from SolarWinds. I just noticed it went down a minute ago, ran some quick tests, and it was up from the management network and up inside the routing domain. This appears to be a case of the random dropping that the customer's Fortigates have done since the beginning of time. The migration to the Palo Altos will fix this issue. Also, from the customer's perspective, the server has never actually gone down or even been slow.

                       

                      My personal stake in this is that all of these Node Down Events add up to a significant amount of downtime, which seriously impacts Availability, a key metric and deliverable to the client via SLA. My fear is that this type of packet loss is not unique to this one server. We have multiple customers with a total of over 4500 servers. We only found this one because there were enough long-lasting events to trigger a significant number of Alerts, and the alerts tipped us off.

                       

                      There have been 244 Node Down Alerts in December. And these are alerts, not events. In the month of December so far there have been 920 Node Down Events.
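A rough sketch of how many short outages add up against an availability figure. Only the count of 920 events comes from the numbers above; the 2-minute duration per event and the 28-day reporting period are assumptions for illustration, and the way SolarWinds itself calculates availability may differ from this simple ratio.

```python
def availability_percent(period_minutes, outage_minutes):
    """Percentage of the reporting period during which the node counted as up.

    `outage_minutes` is an iterable of individual outage durations in minutes.
    """
    down = sum(outage_minutes)
    if down > period_minutes:
        raise ValueError("total outage time exceeds the reporting period")
    return 100.0 * (period_minutes - down) / period_minutes


# Assumption: a 28-day reporting period, expressed in minutes.
period = 28 * 24 * 60

# Assumption: each of the 920 events lasted 2 minutes (real durations are unknown).
outages = [2.0] * 920

print(round(availability_percent(period, outages), 2))
```

Feeding in the real event durations would show how much of the reported downtime comes from these short drops rather than from genuine outages.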

                        • Re: Node Down Events are misleading me.
                          aLTeReGo

                          Is this node managed via SNMP? If so, you can change the method that is used to monitor status, availability, and response time via 'List Resources' from ICMP to SNMP. This may resolve the issue. If it's a server, a better option would be to install the Agent on that machine.

                           
