I was just reading a thread where the poster asked questions about dependencies, which prompted the conversation; if you are interested, he also got good feedback from readers. It reminded me that I have been meaning to post this suggestion about adding functionality to dependencies, which might resolve a lot of people's headaches in this area. Other posts also make it clear that dependencies work well for some and not as well for others. I understand why and believe I may have a solution.
There is a lot of conversation about dependencies in those posts. Of course, the polling times of the Parent vs. the Child of the Dependency are important in the current design to help eliminate Child alerts if the Parent is down. Sometimes, with the current design, those Child alerts will slip through even if the true problem lies with a down Parent. The reason is that the Dependency bases its evaluation on historical and not necessarily current data. To help explain the current design, let's consider that the Parent is polled every 2 minutes and the Child every 5 minutes. This means that for every one Child poll, the Parent has been polled at least twice. The chances are pretty good that the Parent will catch the outage first and push the Children to an Unreachable state. However, if the Child is found to be down after the last Parent poll reported the Parent up and before the next Parent poll finds it down, the Child will likely cause a down event before the Dependency takes over.
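To put a rough number on that window, here is a minimal sketch of the 2-minute / 5-minute example. It is only a model under my own assumptions: both polling schedules start at time 0, the Child becomes unreachable the instant the Parent fails, and a tie (both polled at the same moment) is counted as the Parent winning. The function names are made up for illustration and are not part of any product.

```python
def next_poll(outage_start: int, interval: int) -> int:
    """First poll time (seconds) at or after the outage begins.

    Assumes polls happen at 0, interval, 2*interval, ...
    """
    return -(-outage_start // interval) * interval  # integer ceiling division


def child_alerts_first(outage_start: int,
                       parent_interval: int = 120,
                       child_interval: int = 300) -> bool:
    """True if the Child is polled (and found down) before the Parent is."""
    return (next_poll(outage_start, child_interval)
            < next_poll(outage_start, parent_interval))


def slip_through_fraction(parent_interval: int = 120,
                          child_interval: int = 300,
                          cycle: int = 600) -> float:
    """Share of outage start times (one per second over one cycle)
    for which the Child alert would fire before the Parent is seen down."""
    hits = sum(child_alerts_first(t, parent_interval, child_interval)
               for t in range(1, cycle + 1))
    return hits / cycle


if __name__ == "__main__":
    print(child_alerts_first(270))   # outage 4.5 minutes in
    print(slip_through_fraction())
```

Under these assumptions, an outage that starts between the Parent poll at minute 4 and the Child poll at minute 5 is the one that slips through, which works out to one minute in every ten-minute cycle.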
The above can get even more complicated when you start creating generations of dependencies. In my case I have login ports that are dependent on an application running, which is dependent on another application running, which is dependent on the node running (which automatically catches everything running on the node, but is mentioned here for completeness), which is dependent on an intermediate switch running, which is dependent on a firewall at the remote site, which is then dependent on the firewall at my site being able to access the Internet. Adjusting timings for all of those dependent relationships so that only my local firewall reports can get pretty convoluted.
My suggestion would be that anytime a node is reported down or an Application Monitor triggers an event, if that node or Application Monitor is part of a dependency, perform a Poll Now against the parent(s) of the dependency. If a parent is found to be down, set the status of that parent to down and all children and below to unreachable. This collection of current data is more likely to eliminate misleading Child events than basing the decision on historical data.
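The idea can be sketched roughly as follows. This is my own illustration, not how the product works today: `poll_now` is a hypothetical callback standing in for an on-demand poll, each entity is assumed to have at most one parent for simplicity, and the walk stops at the first ancestor that answers, on the assumption that everything above a reachable parent is also reachable.

```python
from typing import Callable, Dict, List, Optional


def resolve_statuses(failed: str,
                     parent_of: Dict[str, Optional[str]],
                     poll_now: Callable[[str], bool]) -> Dict[str, str]:
    """Return status labels after freshly polling the ancestors of `failed`.

    poll_now(name) is a hypothetical callback: True means the entity
    answered an on-demand poll just now.
    """
    # Walk up the dependency chain with current data, remembering the
    # highest consecutive ancestor that is down.
    root_cause = failed
    parent = parent_of.get(failed)
    while parent is not None and not poll_now(parent):
        root_cause = parent
        parent = parent_of.get(parent)

    # Invert the parent map so descendants can be found.
    children: Dict[str, List[str]] = {}
    for child, par in parent_of.items():
        if par is not None:
            children.setdefault(par, []).append(child)

    # Only the root cause is reported Down; everything below it
    # is labelled Unreachable instead of raising its own event.
    statuses = {root_cause: "Down"}
    stack = list(children.get(root_cause, []))
    while stack:
        entity = stack.pop()
        statuses[entity] = "Unreachable"
        stack.extend(children.get(entity, []))
    return statuses


if __name__ == "__main__":
    chain = {"login_port": "app", "app": "node", "node": "switch",
             "switch": "remote_fw", "remote_fw": "local_fw", "local_fw": None}
    down_now = {"app", "node", "switch", "remote_fw"}
    print(resolve_statuses("login_port", chain, lambda n: n not in down_now))
```

With a chain like the one described above, a failed login port check would end up reporting only the remote firewall as Down, with the switch, node, application, and login port marked Unreachable.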
I used to do something similar with another monitoring tool. When a ping reported a node down, I would run a script before logging the event. The script would ping the other nodes at that same remote site, and if they were all down as well, it would report the site's firewall down rather than reporting any of those nodes down. Again, this uses current and not historical data.
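For anyone curious, the logic of that kind of script looks roughly like the sketch below. It is a reconstruction of the idea rather than the original script: `ping` is a hypothetical callable you would supply (True if the host answers), and the node and firewall names are placeholders.

```python
from typing import Callable, Iterable


def classify_outage(down_node: str,
                    site_nodes: Iterable[str],
                    firewall: str,
                    ping: Callable[[str], bool]) -> str:
    """Decide what to report when `down_node` fails a ping.

    ping(host) is a hypothetical callable returning True if the host
    answers right now. Returns the name of the entity to report as down.
    """
    # Check every other node at the same remote site with a fresh ping.
    others = [n for n in site_nodes if n != down_node]

    # If there are other nodes and none of them answer, blame the
    # site's firewall instead of the individual node.
    if others and not any(ping(n) for n in others):
        return firewall
    return down_node


if __name__ == "__main__":
    site = ["node_a", "node_b", "node_c"]
    print(classify_outage("node_a", site, "site_firewall", lambda h: False))
    print(classify_outage("node_a", site, "site_firewall", lambda h: h == "node_b"))
```

The first call reports the firewall because no other node answers; the second reports only the original node because a neighbor is still reachable.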
This tweak would slow down reporting of the event, but some of us would rather wait a bit longer to get more precise event triggering than get a storm of misleading alerts. That was my impression from reading the other posts.
But for those who would like to stay with the current method for SLA or other reasons, the additional feature described here should be made optional. It would seem to me that this would be an easy addition for the SW Development team and would resolve a long-standing issue around the Dependency controversy.