Consider a Dependency Tweak

I was just reading the and there is a lot of conversation about dependencies there. Of course, asked questions about dependencies prompting the conversation but if you are interested, he also got good feedback from readers. It reminded me that I have been meaning to post this suggestion about adding functionality to dependencies, which might resolve a lot of people's headaches in this area. We read throughout other posts as well that dependencies work for some and not as well for others. I understand why and believe I may have a solution.

Polling times of the Parent vs. the Child of a Dependency are important in the current design to help eliminate Child alerts when the Parent is down. Sometimes, with the current design, those Child alerts will slip through even if the true problem lies with a down Parent. The reason is that the Dependency bases its evaluation on historical and not necessarily current data. To help explain the current design, let's consider that the Parent is polled every 2 minutes and the Child every 5 minutes. This means that for every one Child poll, the Parent has been polled two or three times. The chances are pretty good that the Parent will catch the outage first and push the Children to an Unreachable state label. However, if the Child is found to be down after the last Parent poll reported the Parent up and before the next Parent poll finds it down, the Child will likely cause a down event before the Dependency takes over.
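To put a rough number on that window, here is a minimal sketch in Python. It is only a model under my own assumptions, not how the product schedules polls: polls are instantaneous, both schedules start at time 0, intervals are whole seconds, and the outage takes out Parent and Child at the same moment. The function names are made up for illustration.

```python
import math


def next_poll(t: int, interval: int) -> int:
    """Time (seconds) of the first poll strictly after time t.

    Assumes polls happen at 0, interval, 2*interval, ... (a simplification).
    """
    return (t // interval + 1) * interval


def child_alerts_first(outage_start: int, parent_interval: int, child_interval: int) -> bool:
    """True if the Child is polled (and found down) before the Parent is."""
    return next_poll(outage_start, child_interval) < next_poll(outage_start, parent_interval)


def slip_fraction(parent_interval: int, child_interval: int) -> float:
    """Fraction of outage start times for which the Child alert slips through."""
    # Both schedules repeat after the least common multiple of the intervals.
    cycle = parent_interval * child_interval // math.gcd(parent_interval, child_interval)
    slips = sum(
        child_alerts_first(t, parent_interval, child_interval) for t in range(cycle)
    )
    return slips / cycle


if __name__ == "__main__":
    # Parent every 2 minutes, Child every 5 minutes, as in the example above.
    print(slip_fraction(120, 300))
```

In this simplified model the Child only wins the race when the outage starts in the minute between a Parent poll and the Child poll that follows it, so the slip is occasional rather than constant, which matches what people report.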

The above can get even more complicated when you start creating generations of dependencies. In my case I have login ports that are dependent on an application running, which is dependent on another application running, which is dependent on the node running (which automatically catches everything running on the node but is mentioned here for completeness), which is dependent on an intermediate switch running, which is dependent on a firewall at the remote site, which is then dependent on the firewall at my site being able to access the Internet. Adjusting timings for all of those dependent relationships so that only my local firewall reports can get pretty convoluted.

My suggestion would be that anytime a node is reported down or an Application Monitor triggers an event, if that node or Application Monitor is part of a dependency, perform a Poll Now against the parent(s) of the dependency. If a parent turns out to be down, set the status of that parent to down and all children and below to unreachable. This collection of current data is more likely to eliminate misleading Child events than basing the decision on historical data.
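Roughly, the logic I have in mind could look like the Python sketch below. Everything in it is hypothetical: the Node class, the resolve_down_event function and the poll_now callable are placeholders I made up to show the idea, not anything that exists in the product. It only walks the single chain above the node that was reported down; a real implementation would also have to mark the siblings under the failed parent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Node:
    """Hypothetical monitored object with at most one dependency parent."""
    name: str
    parent: Optional["Node"] = None


def resolve_down_event(node: Node, poll_now: Callable[[Node], bool]) -> Dict[str, str]:
    """Decide what to report when `node` has just been found down.

    poll_now(n) is assumed to poll n immediately and return True if it is up.
    Returns a status per affected object: the topmost down ancestor (or the
    node itself) is "Down", everything between it and the node is "Unreachable".
    """
    root_cause = node
    ancestor = node.parent
    while ancestor is not None:
        # Use fresh data for every ancestor instead of its last scheduled poll.
        if not poll_now(ancestor):
            root_cause = ancestor
        ancestor = ancestor.parent

    statuses = {root_cause.name: "Down"}
    current = node
    while current is not root_cause:
        statuses[current.name] = "Unreachable"
        current = current.parent
    return statuses


if __name__ == "__main__":
    firewall = Node("remote-firewall")
    switch = Node("switch", parent=firewall)
    server = Node("server", parent=switch)
    app = Node("app", parent=server)

    # Pretend live poll results: the switch has failed, the firewall is fine.
    live = {"remote-firewall": True, "switch": False, "server": False}
    print(resolve_down_event(app, lambda n: live[n.name]))
```

With the pretend results above, only the switch would be reported Down, and the server and app would be Unreachable, so a single event comes out instead of three.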

I used to do something similar with another monitoring tool. When a ping reported a node down, I would run a script before logging the event. The script would ping the other nodes on that same remote site and, if they were all down as well, would report the site's firewall down rather than reporting any of those nodes down. Again, this used current and not historical data.
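I no longer have that script, so the following is only a reconstruction of the idea in Python, with made-up names. The ping callable is a stand-in for whatever check your tool offers (it should return True when the host answers); passing it in keeps the decision logic separate from how the ping is actually done.

```python
from typing import Callable, Iterable


def what_to_report(
    down_node: str,
    site_nodes: Iterable[str],
    site_firewall: str,
    ping: Callable[[str], bool],
) -> str:
    """Return the name of the object that should be reported down.

    If every other node at the site also fails a fresh ping, blame the
    site's firewall; otherwise report the node that was originally down.
    """
    peers = [n for n in site_nodes if n != down_node]
    # With no peers to compare against there is no evidence for a site outage.
    if peers and not any(ping(peer) for peer in peers):
        return site_firewall
    return down_node


if __name__ == "__main__":
    site = ["srv-a", "srv-b", "printer"]
    nobody_answers = lambda host: False
    print(what_to_report("srv-a", site, "site-fw", nobody_answers))
```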

This tweak would slow down reporting of the event, but some of us would rather wait a bit longer for more precise event triggering than get a storm of misleading alerts. That was my impression reading .  For those who would like to stay with the current method for SLA or other reasons, the additional feature described here should be made optional.

It would seem to me that this would be an easy addition for the SW Development team and would resolve a long-standing issue in the Dependency controversy.

7 Comments
Level 12

Mike,

I agree with your message. I do see the child alerts sometimes slip by if the children are noted down before the parent. Dependencies also need to be extended to thin APs, which can have the controller as the parent.

Zak Kahl

Loop1 Systems

http://www.loop1systems.com

Level 8

I have the following scenario:

1) parent node goes down

2) child node alerts are properly suppressed

3) parent node comes back up; thus releasing the dependency

4) alerts are generated for child nodes although they are now reachable

Is there an option to poll all the child nodes when a parent node comes back up?

This request needs more love - it basically solves most headaches all of us suffer through dependency slippage. Y'all are smart, right? Program it on up!!

Level 9

Dependency slippage hurts indeed...

These are good points, Mike and nfrancis. I voted this up and hope the SW guys will add it to the next version.

Level 7

I have this issue with the WLC (Cisco Wireless LAN Controller). I figured out how to suppress AP alerts if the WLC is down, but then when the WLC comes up, all of the APs alert that they are up. So far I can't figure out how to stop this. I know the WLC is a bit different due to the WLC and AP relationship.

Level 12

mvanbavel,

I think this is just a timing issue you are experiencing. The problem is that even though the parent is reporting Up after a Poll of it, the children may not yet have completed polls indicating they are now in an Up state. When the parent went Down, it was moved to the Down state and its children were moved to Unreachable. Unreachable pauses polling of the children until the parent is again Up. When the parent moves to Up, the state of the children changes to Down because no completed Polls have been received recently. Basically, they now have to prove, through new Polls, that they are actually Up.

By default, for all my Alert Definitions, I make sure I have valid Trigger Conditions before the Trigger Actions are applied. I don't just take the first sign that the parent is back up as gospel. Usually I will wait until two Polls are returned with a Down state. I do the same when I am evaluating over or under threshold situations. Then I Trigger the Alert. This almost always works. Where I haven't waited long enough, as in your case where the children still alert, I would extend it to 3 polls, though 2 polls will likely work.

If you are not familiar with how to do this, figure out the Polling Interval for the particular component the Alert Definition is based on and set the "Condition must exist for ?? [seconds/minutes/hours]" to that amount of time plus 1 minute, which I add for the response time of the last Poll. As an example, if the poll is every 5 minutes and I want to Reset if the condition still exists with the second poll, I will set this value to 6 minutes (first poll + time between polls + second poll, including some time for a delayed response). Now the polls have more time to find out the current state of the children before causing more Alerts to trigger.
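If it helps, the arithmetic can be written as a tiny Python helper. The function name and parameters are just mine for illustration. The 2-poll case reproduces the 6-minute example above; extending the same rule to 3 or more polls is my own extrapolation (one extra polling interval per extra poll), so treat those values as a starting point.

```python
def condition_must_exist_minutes(
    poll_interval_min: float,
    polls_required: int = 2,
    response_margin_min: float = 1.0,
) -> float:
    """Minutes to enter in "Condition must exist for ...".

    The first poll starts the clock, each further poll adds one polling
    interval, and the margin allows for a slow response to the last poll.
    """
    if polls_required < 1:
        raise ValueError("polls_required must be at least 1")
    return (polls_required - 1) * poll_interval_min + response_margin_min


if __name__ == "__main__":
    # 5-minute polling, wait for the second poll: the 6 minutes from the example.
    print(condition_must_exist_minutes(5, polls_required=2))
    # Same polling interval, waiting for a third poll.
    print(condition_must_exist_minutes(5, polls_required=3))
```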

Community Manager
Status changed to: Open for Voting