3 Replies Latest reply on Apr 15, 2014 10:51 AM by pstewart726

    Continuous Down Events (no Up events)

    pstewart726

      Hi there.

       

      We had a site get hit by lightning overnight and lost several pieces of equipment.  One of those pieces of equipment is physically disconnected at this point, and yet we are seeing the following in our logs.  There are no "up" events (remember, it's unplugged).  We are getting continuous email alerts from this device (and from a couple of dozen others at the same site showing the same behavior).  Some of the email alerts report a status of unknown, but we do not have an alert set to trigger on that status, only on down and up.

       

      Can anyone shed any light on why we are getting a down event every 15 minutes with no up events?

       

      Thanks,

       

      Paul

       

      TIME OF EVENT        MESSAGE
      4/14/2014 5:00 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 4:45 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 4:30 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 4:15 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 4:00 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 3:45 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 3:30 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 3:15 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 3:00 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 2:45 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 2:30 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 2:15 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 2:00 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 1:45 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 1:30 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 1:15 PM    acs2-3map-nw.nexicom.net has stopped responding ()
      4/14/2014 1:00 PM    acs2-3map-nw.nexicom.net has stopped responding ()
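
One quick way to confirm the spacing of these events is to parse the timestamps and compute the gaps between them. The following is only a minimal sketch: the helper name `intervals_in_minutes` is made up for illustration, and just a few of the timestamps from the log above are included as sample input.

```python
from datetime import datetime

# A few event times copied from the log above (newest first).
event_times = [
    "4/14/2014 5:00 PM",
    "4/14/2014 4:45 PM",
    "4/14/2014 4:30 PM",
    "4/14/2014 4:15 PM",
    "4/14/2014 4:00 PM",
]


def intervals_in_minutes(stamps):
    """Return the gaps, in whole minutes, between consecutive events."""
    # Parse each timestamp and sort oldest to newest.
    parsed = sorted(datetime.strptime(s, "%m/%d/%Y %I:%M %p") for s in stamps)
    # Pair each event with the next one and measure the difference.
    return [int((b - a).total_seconds() // 60) for a, b in zip(parsed, parsed[1:])]


gaps = intervals_in_minutes(event_times)
print(gaps)
```

If every gap comes out the same, the events are being generated on a fixed schedule rather than by the device itself.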
        • Re: Continuous Down Events (no Up events)
          pstewart726

          Sorry to bump my own post

           

          While the event log I posted does not show this, I can see in groups that this node (and others) is being reported as becoming unknown.

           

          So what appears to be happening is that many of our nodes that are down are flipping between "unknown" and "down".  Each flip back to "down" triggers a new email alert, and it seems to happen on exactly the 15 minute mark - why?

           

          We could change all of our alerts to say "not up", but these alerts have worked for years and only now have things gone strange.  We did start building dependencies recently, so could that have broken things?
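
To illustrate why a status that flips between "unknown" and "down" could re-send a "down" alert, here is a toy model. It is only a sketch under assumptions: it is not how the monitoring product's alert engine is actually implemented, and the function `count_alerts` and the sample poll sequence are hypothetical.

```python
def count_alerts(statuses, trigger):
    """Count how many times an alert fires over a sequence of polled statuses.

    Assumed behavior: the alert fires when the trigger condition becomes
    true, stays quiet while it remains true, and resets once it is false.
    """
    fired = 0
    active = False
    for status in statuses:
        if trigger(status):
            if not active:
                fired += 1  # condition just became true: send an alert
                active = True
        else:
            active = False  # condition no longer true: alert resets
    return fired


# Hypothetical poll results for a node that is really down the whole time.
polls = ["up", "down", "unknown", "down", "unknown", "down"]

only_down = count_alerts(polls, lambda s: s == "down")
not_up = count_alerts(polls, lambda s: s != "up")
print(only_down, not_up)
```

In this model the "down" trigger fires once per flip back to "down", while the "not up" trigger fires only once, because "unknown" never resets it.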

            • Re: Continuous Down Events (no Up events)
              RichardLetts

              Can you bump up the logging on the alert manager to see what it is saying?

               

              I think you are right.

               

              There is probably some kind of implicit/explicit circular dependency at work here: an interface alert being cleared because a node is down, and a node-down alert being reset because an interface is down?

              That is, the implicit dependency of an interface on a node (which changes an interface from down to unknown) is conflicting with the explicit dependency you have created (which changes a node from down to unknown).

                • Re: Continuous Down Events (no Up events)
                  pstewart726

                  Thanks - and to make it more interesting, the nodes in question are all being restored at the moment, so re-creating this may be challenging.

                   

                  There is definitely something going on relating to dependencies, in my opinion - the email alerts keep showing the same devices going from unknown to down and then back to unknown (and repeating).  The interval between these status changes is exactly 15 minutes.