Continuous Down Events (no Up events)

pstewart726 over 10 years ago

Hi there.

We had a site hit by lightning overnight and lost several pieces of equipment. One of those devices is now physically disconnected, yet we are still seeing the following in our logs. There are no "up" events (remember, it's unplugged). We keep receiving email alerts from this device (and from a couple of dozen others at the same site showing the same behavior). Some of the email alerts report a status of unknown, but we do not have an alert set to trigger on that status, only on down and up.

Can anyone shed any light on why we are getting a down event every 15 minutes with no up events?

Thanks,

Paul


TIME OF EVENT		MESSAGE
	4/14/2014 5:00 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 4:45 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 4:30 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 4:15 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 4:00 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 3:45 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 3:30 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 3:15 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 3:00 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 2:45 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 2:30 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 2:15 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 2:00 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 1:45 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 1:30 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 1:15 PM	acs2-3map-nw.nexicom.net has stopped responding ()
	4/14/2014 1:00 PM	acs2-3map-nw.nexicom.net has stopped responding ()

pstewart726 over 10 years ago

Sorry to bump my own post.
While the event log I posted does not show this, I can see in Groups that this node (and others) is being reported as unknown.
So what appears to be happening is that many of our down nodes are flipping between "unknown" and "down". Each flip back to "down" triggers a new email alert, and it seems to happen exactly on the 15-minute mark. Why?
We could change all of our alerts to trigger on "not up", but these alerts have worked for years and things have only now gone strange. We did start building dependencies recently; could that have broken things?
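The repeated emails are consistent with an edge-triggered alert: a condition of "status is down" becomes true again every time a node flips from unknown back to down, whereas "status is not up" stays true throughout. The sketch below is a minimal model under that assumption only; `count_triggers` and the status history are hypothetical and are not taken from the monitoring product's actual alert engine.

```python
from typing import Callable, List

# Hypothetical model (assumption): an alert fires when its condition goes from
# False to True between two consecutive status polls, i.e. it is edge-triggered.

def count_triggers(statuses: List[str], condition: Callable[[str], bool]) -> int:
    """Count how many times `condition` newly becomes true across a status history."""
    triggers = 0
    previously_true = False
    for status in statuses:
        now_true = condition(status)
        if now_true and not previously_true:
            triggers += 1  # rising edge: a new alert (and email) would be sent
        previously_true = now_true
    return triggers

# Hypothetical poll history of one node flapping between down and unknown.
history = ["up", "down", "unknown", "down", "unknown", "down"]

fires_on_down = count_triggers(history, lambda s: s == "down")
fires_on_not_up = count_triggers(history, lambda s: s != "up")

print(fires_on_down)    # 3
print(fires_on_not_up)  # 1
```

Under this model, switching the trigger to "not up" would hide the symptom but not remove whatever is causing the flapping.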
RichardLetts over 10 years ago in reply to pstewart726

Can you bump up the logging on the alert manager to see what it is saying?
I think you are right.
There is probably some kind of implicit/explicit circular dependency at work here: an interface alert being cleared because the node is down, and a node-down alert being reset because an interface is down?
That is, the implicit dependency of an interface on its node (which changes an interface from down to unknown) conflicts with the explicit dependency you have created (which changes a node from down to unknown).
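The suspected loop can be illustrated with a toy model. It assumes that both the node and its interface are genuinely unreachable and that each recalculation cycle (taken to be the 15-minute interval, by assumption) applies both dependency rules to the previous cycle's statuses. The function names, the rule wording, and the synchronous update order are all hypothetical; this is not how the monitoring product is documented to evaluate dependencies.

```python
from typing import Dict, List

# Hypothetical model of the circular dependency. Both devices are actually
# unreachable; each cycle recomputes statuses from the PREVIOUS cycle's values:
#   implicit rule: the interface is "unknown" if its node is "down"
#   explicit rule (user-created, optional): the node is "unknown" if the
#   interface it depends on is "down"

def recalc(prev: Dict[str, str], explicit_dependency: bool = True) -> Dict[str, str]:
    interface = "unknown" if prev["node"] == "down" else "down"
    if explicit_dependency and prev["interface"] == "down":
        node = "unknown"
    else:
        node = "down"
    return {"interface": interface, "node": node}

def simulate(cycles: int, explicit_dependency: bool = True) -> List[str]:
    """Return the node's status after each cycle, starting from both devices down."""
    state = {"node": "down", "interface": "down"}
    node_history = [state["node"]]
    for _ in range(cycles):
        state = recalc(state, explicit_dependency)
        node_history.append(state["node"])
    return node_history

print(simulate(4))                             # ['down', 'unknown', 'down', 'unknown', 'down']
print(simulate(4, explicit_dependency=False))  # ['down', 'down', 'down', 'down', 'down']
```

With both rules active, the node alternates between down and unknown on every cycle, so each return to down is a fresh down event; with the explicit dependency removed, the node simply stays down.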
pstewart726 over 10 years ago in reply to RichardLetts

Thanks. To make it more interesting, the nodes in question are all being restored at the moment, so re-creating this may be challenging.
In my opinion there is definitely something going on with dependencies: the email alerts keep showing the same devices going from unknown to down, then back to unknown, and so on. The interval between these status changes is exactly 15 minutes.