3 Replies Latest reply on Oct 25, 2016 7:30 AM by deverts

    False reboot node alerts

    mr.e

      We set up a SolarWinds alert for whenever a node is rebooted.  See the trigger condition below.

      Trigger Condition.gif

       

      The problem is that we get SolarWinds alerts not only when a node is rebooted but also when its SNMP service is restarted.  As a result, our NOC team is complaining about a lot of false alerts from SolarWinds.  Have you seen this behavior before?  If so, have you come up with a workaround so that you get alerted only when the node is truly rebooted?

       

      By the way, when the SNMP service restarts on a router or a switch, the Last Boot timestamp never changes. We have seen this over and over on multiple routers and switches. The only thing that changes is the Event Time, which makes sense.  So I cannot understand why the alert would be triggered.

       

       

      Thanks in advance!!!

          • Re: False reboot node alerts
            mr.e

            Thanks for the clarification and info, amarnath_r.  I understand now that we'd need each vendor to supply a MIB that allows for better (and more reliable) uptime reporting. I also ran an SNMP walk on one of our Cisco switches, looking for the hrSystemUptime value, but it was not present.  So, I guess we'll need to continue to live with the false reboot alerts until Cisco (and the other vendors) supply the MIB that we need.

             

            Again, many thanks for clearing this up for me.  Good day!
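
For anyone who wants to tell the two cases apart outside of the alert engine, here is a rough Python sketch. It assumes the reboot alert is driven by sysUpTime, which restarts whenever the SNMP agent re-initializes, and that hrSystemUptime (HOST-RESOURCES-MIB), where a device exposes it, only resets on a real reboot. The function name and the return labels are made up for illustration. The inputs are TimeTicks values you would have collected yourself from two consecutive polls, and counter wrap is ignored.

```python
from typing import Optional

# Standard OIDs, listed for reference only; this sketch does no SNMP I/O.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"           # sysUpTime.0 (MIB-II)
HR_SYSTEM_UPTIME_OID = "1.3.6.1.2.1.25.1.1.0"  # hrSystemUptime.0 (HOST-RESOURCES-MIB)


def classify_uptime_change(
    prev_sys: int,
    curr_sys: int,
    prev_hr: Optional[int] = None,
    curr_hr: Optional[int] = None,
) -> str:
    """Classify what happened between two polls (hypothetical helper).

    All values are TimeTicks (hundredths of a second). Pass None for the
    hrSystemUptime values when the device does not expose that object.
    """
    # sysUpTime kept counting up: neither the agent nor the node restarted.
    if curr_sys >= prev_sys:
        return "no_restart"
    # sysUpTime went backwards but there is no second counter to compare with.
    if prev_hr is None or curr_hr is None:
        return "ambiguous"
    # Both counters went backwards: the node itself was re-initialized.
    if curr_hr < prev_hr:
        return "reboot"
    # Only sysUpTime reset: most likely just the SNMP agent restarting.
    return "agent_restart"
```

On a switch like the Cisco one above, where hrSystemUptime is missing, this can only ever return "no_restart" or "ambiguous", which is exactly the situation that produces the false alerts.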

              • Re: False reboot node alerts
                deverts

                mr.e

                 

                You could be waiting a very long time for SNMP to get corrected. This issue has been around since SNMP was invented, and no one has ever fixed it. I actually turn the "reboot" alert off and just monitor the up/down state of a device. However, this doesn't come tuned for your environment out of the box. When a switch reboot only takes 2 minutes, and default polling is every 2 minutes (120 seconds), you are going to miss some outages. I use a combination of polling alerts to catch most outages. Up/down on the node catches a lot, and up/down on critical interfaces catches most of the rest. It's still not 100%, but I'm OK with 99%. Yes, this does sometimes give me duplicate "root cause" alerts, but I'm good with that too. I'd rather have 2+ alerts for an outage than no clue it happened. It's a give-and-take relationship.

                 

                D
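
To put a rough number on the polling point above, here is a small Python sketch. It is an idealized model, not how SolarWinds actually schedules polls: it assumes polls are instantaneous, evenly spaced, and that the outage starts at a random moment relative to the poll schedule. The function name is made up for illustration.

```python
def detection_probability(outage_seconds: float, poll_interval_seconds: float) -> float:
    """Chance that at least one status poll lands inside the outage window.

    Idealized assumptions: instantaneous polls, fixed interval, and an
    outage start time that is uniformly random relative to the polls.
    """
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    if outage_seconds <= 0:
        return 0.0
    # An outage at least as long as the interval always contains a poll;
    # a shorter one is caught only in proportion to its length.
    return min(1.0, outage_seconds / poll_interval_seconds)


# A node that is unreachable for 90 seconds, polled every 120 seconds.
print(detection_probability(90, 120))
```

Under these assumptions, anything unreachable for less than the full polling interval can slip between two polls, which is why adding interface up/down alerts on top of node up/down helps close the gap.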