16 Replies Latest reply: Jun 15, 2011 1:17 PM by Karlo.Zatylny

Tips for Defining Groups and Dependencies - a running list?

borgan

At the risk of being presumptuous, I thought it might be valuable to condense some of the knowledge that has been disseminated so far regarding the great new feature in Orion NPM version 10.1 - Groups and Dependencies.

It is easy to see that this topic will be a source of much discussion for the foreseeable future, so how about we Thwack users start compiling a list of essential tips to remember as we use this feature?

I'll start with two that seem obvious and correct and others can add to it or correct me:

(1) A Group can consist of any number and type of elements, including other groups.

(2) In order for a child to be seen as "unreachable", ALL of its parents have to be down or unreachable. (Thanks Karlo!)
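As a rough illustration of tips (1) and (2), here is a minimal Python sketch. The `Group` class, the `flatten` and `is_unreachable` helpers, and the status strings are hypothetical names made up for this example; they are not Orion's data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical container: members are node names (str) or other Groups."""
    name: str
    members: list = field(default_factory=list)


def flatten(group):
    """Return every node name in a group, descending into nested groups."""
    nodes = []
    for member in group.members:
        if isinstance(member, Group):
            nodes.extend(flatten(member))
        else:
            nodes.append(member)
    return nodes


def is_unreachable(parent_statuses):
    """Tip (2): unreachable only if ALL parents are down or unreachable."""
    return bool(parent_statuses) and all(
        status in ("Down", "Unreachable") for status in parent_statuses
    )


branch = Group("branch-switches", ["sw1", "sw2"])
site = Group("remote-site", ["router1", branch])  # a group inside a group
print(flatten(site))                            # ['router1', 'sw1', 'sw2']
print(is_unreachable(["Down", "Up"]))           # False
print(is_unreachable(["Down", "Unreachable"]))  # True
```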

 

Anyone else care to add something?

 
  • Re: Tips for Defining Groups and Dependencies - a running list?

    borgan--

    This sounds like a great idea. I will speak with development and see if we can create a forum for it. Can't promise anything but I will check.

    Thanks,

    M

  • Re: Tips for Defining Groups and Dependencies - a running list?
    patriot

    I'll add one I "think" is correct.

    A parent cannot be a member of its own child group. Makes logical sense, right?

    • Re: Tips for Defining Groups and Dependencies - a running list?
      patriot

      I'll post another thing I think I have learned about groups and dependencies.

      Make sure your dependencies are correctly defined to match your actual network topology. If you don't, and Orion can find an alternate path to a child object that is down, the object will show as down rather than "unreachable" even if the defined parent is down at the time.

      Does that seem correct?

      • Re: Tips for Defining Groups and Dependencies - a running list?
        bshopp

        No, not completely.

        Here is the accurate version:

        You have defined a parent-child relationship.
        The parent goes down.
        From the Orion server there is another path in your network to poll the device, and it is up.
        We won't mark it unreachable; it will remain up.

        • Re: Tips for Defining Groups and Dependencies - a running list?
          Karlo.Zatylny

          Brandon is correct with the above statement and let me explain why:

          Node A is a parent of Node B but Orion can get to Node B through some route not through Node A.

          Node A goes down, but Node B does not so Orion can still ping it.  Node A Down - alert fired.  Node B Up - no alert.

          Now if Node B does go Down when Node A is down, then Node B will be Unreachable and no Down alert will fire.  This is not accurate given the actual topology: you now don't know whether Node B is actually Down or whether the alternate route is down somehow, so you can't tell whether Node B needs to be reset or whether you need to check some other part of the network.

          Hopefully this helps.

          • Re: Tips for Defining Groups and Dependencies - a running list?
            patriot

            Thanks Karlo and Brandon.

            So, in the scenario where Orion has an alternate path to B when both A and B are down, the only way to get a down alert on B is by designating the alternate path as a second parent to B. In that case, one of two parents will be up, therefore allowing Orion to see B as truly down.

            Do I have that right?

            • Re: Tips for Defining Groups and Dependencies - a running list?
              patriot

              If I may add one more thing to this discussion please?

              What is the step by step process by which Orion processes a node down alert with Dependencies in place?

              When a node is non-responding after the node warning interval expires, does Orion then immediately check to see if the node is dependent on one or more parents? Then if all parents are also down, set the status of the node to unreachable?

              Is that the order of things?

              • Re: Tips for Defining Groups and Dependencies - a running list?
                Karlo.Zatylny

                Hi Patriot,

                To answer your questions:

                patriot wrote:

                "So, in the scenario where Orion has an alternate path to B when both A and B are down, the only way to get a down alert on B is by designating the alternate path as a second parent to B. In that case, one of two parents will be up, therefore allowing Orion to see B as truly down.

                Do I have that right?"

                 



                Correct.  You will want to define the two parents either with two separate dependency definitions, or by creating a group for the parents, setting the status rollup of the group to Best or Mixed, and making the group the parent in a single dependency definition. This is similar to what is described in "Re: How Dependencies Work".
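To show why Best or Mixed is the suggestion here, below is a simplified Python sketch. The `rollup` and `child_is_unreachable` functions are hypothetical, and the three modes only approximate the rollup options; in particular, treating Mixed as Warning whenever the members disagree is an assumption, not a statement of how Orion computes it.

```python
def rollup(member_statuses, mode):
    """Approximate a group status from member statuses ('Up' or 'Down').

    Assumed, simplified semantics:
      best  -> Up if any member is Up
      worst -> Down if any member is Down
      mixed -> Up if all Up, Down if all Down, otherwise Warning
    """
    ups = [status == "Up" for status in member_statuses]
    if mode == "best":
        return "Up" if any(ups) else "Down"
    if mode == "worst":
        return "Up" if all(ups) else "Down"
    if mode == "mixed":
        if all(ups):
            return "Up"
        return "Warning" if any(ups) else "Down"
    raise ValueError(f"unknown roll-up mode: {mode}")


def child_is_unreachable(child_status, parent_group_status):
    """A Down child becomes Unreachable only if its parent group is Down."""
    return child_status == "Down" and parent_group_status == "Down"


# Node A is down, the alternate path is up, and Node B really is down.
parents = ["Down", "Up"]
for mode in ("best", "mixed", "worst"):
    group_status = rollup(parents, mode)
    print(mode, group_status, child_is_unreachable("Down", group_status))
# best Up False
# mixed Warning False
# worst Down True
```

Under these assumptions, Best and Mixed keep the parent group out of the Down state while one path is still up, so Node B stays Down and alerts. A Worst rollup would mark the group Down and hide the real outage on Node B.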

                 



                patriot wrote:

                "If I may add one more thing to this discussion please?

                What is the step by step process by which Orion processes a node down alert with Dependencies in place?

                When a node is non-responding after the node warning interval expires, does Orion then immediately check to see if the node is dependent on one or more parents? Then if all parents are also down, set the status of the node to unreachable?

                Is that the order of things?"

                 



                The algorithm used for determining if a Node status should be Unreachable is this:

                1. Is the Node Down (we have already done Fast Polling and set the Node Status to Warning)? Yes: go to step 2. No: exit the algorithm.
                2. Does the Node have any parents? Yes: go to step 3. No: keep the Node Status as Down and exit the algorithm.
                3. Are all the parents Down? Yes: set this Node's status to Unreachable and exit the algorithm. No: go to step 4.
                4. Have we waited an additional polling cycle to be sure the parents aren't Down? Yes: keep the Node Status as Down. No: set the Node Status to Warning and wait until the next poll.
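The four steps above can be sketched as a small function. This is only an illustration in Python: the function name, its arguments, and the status strings are assumptions for this example, not Orion's actual code.

```python
def resolve_status(node_is_down, parent_statuses, waited_extra_cycle):
    """Return the status one pass of the four steps would assign.

    Returns None when the algorithm exits without changing anything.
    """
    # Step 1: only nodes found Down after Fast Polling are considered.
    if not node_is_down:
        return None
    # Step 2: with no parents there is nothing to check; the node is Down.
    if not parent_statuses:
        return "Down"
    # Step 3: every parent is Down, so the node is Unreachable.
    if all(status == "Down" for status in parent_statuses):
        return "Unreachable"
    # Step 4: give the parents one more polling cycle before deciding.
    return "Down" if waited_extra_cycle else "Warning"


print(resolve_status(True, [], False))              # Down
print(resolve_status(True, ["Down"], False))        # Unreachable
print(resolve_status(True, ["Warning"], False))     # Warning
print(resolve_status(True, ["Up", "Down"], True))   # Down
```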

                Once we write the new status to the database then the next Advanced Alert poll will pick up the change and alert on any Down status Nodes.  Basic alerts will trigger almost immediately after the algorithm finishes.

                Let me know if I can explain this better.

                • Re: Tips for Defining Groups and Dependencies - a running list?
                  patriot

                  Yes, Karlo, that helps, but Step 4 is new to me. Are you saying that if a child has more than one parent, and not all of them are down, Orion will wait another polling cycle to verify the status of each parent before setting a status for the child?

                  After you answer that one, I have another question. Thanks for your patience.

                  • Re: Tips for Defining Groups and Dependencies - a running list?
                    Karlo.Zatylny

                    Hi,

                    Even if a node has only one parent we will wait an additional polling cycle to verify that the node is Down and here's why:

                    Node A is a parent of Node B.  At 12:00:01 we poll Node A and it is up, but immediately goes Down (power outage). At 12:00:02 we poll Node B which will not respond now because Node A is actually down, even though Orion doesn't know it yet.  Node B goes into Warning and begins to Fast Poll.  At 12:02:01 we poll Node A again and determine that it might be down, it goes into Warning and Fast Poll.  At 12:02:02 Fast Poll on Node B ends and we determine that it should be Down prior to dependency checks.

                    Now we check dependencies and see that Node B's parent, Node A, is not Down (but is in Warning), so we wait another polling cycle because we don't want to mistakenly set Node B to Down and fire alerts when they are not necessary.  At 12:04:01, we set Node A as Down, and at 12:04:02 we determine Node B is actually Unreachable and set it as such.

                    Now you get the one alert that matters: Node A is Down.

                    So now we have exposed some best practices:

                    1. Poll parents at least as frequently as children, because children wait one of their own polling cycles to be sure their parent is not down.  This usually makes sense anyway, since key nodes such as a switch are the ones you most want to poll often.

                    2. Set your Fast Poll - Node Warning Level setting to the polling interval of your parents.  This ensures a parent will not be in Warning longer than its poll cycle.
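Best practice 1 is easy to check if you have a list of dependencies and polling intervals to hand. The Python sketch below assumes you have exported that information yourself; the function name, the `(parent, child)` tuples, and the node names are hypothetical and do not come from Orion.

```python
def find_slow_parents(dependencies, poll_interval_s):
    """Return (parent, child) pairs where the parent is polled less often."""
    return [
        (parent, child)
        for parent, child in dependencies
        if poll_interval_s[parent] > poll_interval_s[child]
    ]


# Hypothetical dependency list and polling intervals in seconds.
deps = [("core-switch", "server1"), ("vpn-router", "core-switch")]
intervals = {"core-switch": 120, "server1": 120, "vpn-router": 300}
print(find_slow_parents(deps, intervals))  # [('vpn-router', 'core-switch')]
```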

                    Note in my example that if Node A were polled every 60 seconds, Node B would not have to wait until 12:04:02 to be Unreachable: Node A would have become Down at 12:02:01, and we would know then that Node B is Unreachable.
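The timings in this example can be replayed with a short Python sketch. It rests on several assumptions that are not Orion internals: Node A fails right after its 12:00:01 poll, Node A's Warning (Fast Poll) period equals its polling interval as in best practice 2, and Node B uses a 120-second polling interval and warning period. The `simulate` function and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def simulate(parent_poll_s, child_poll_s=120, child_warning_s=120):
    """Replay the Node A / Node B example.

    Returns (time Node A is set Down, time Node B is resolved, Node B status).
    """
    a_last_good = datetime(2011, 6, 15, 12, 0, 1)  # the date is arbitrary
    b_first_fail = a_last_good + timedelta(seconds=1)

    # Node A: its next poll fails (Warning), then it is set Down after a
    # warning period assumed equal to its polling interval.
    a_down = a_last_good + timedelta(seconds=2 * parent_poll_s)

    # Node B: Fast Poll ends, then dependencies are checked.
    check = b_first_fail + timedelta(seconds=child_warning_s)
    if a_down <= check:
        return a_down, check, "Unreachable"

    # Parent is not Down yet: wait one more of Node B's polling cycles.
    check += timedelta(seconds=child_poll_s)
    status = "Unreachable" if a_down <= check else "Down"
    return a_down, check, status


for poll in (120, 60):
    a_down, b_time, b_status = simulate(poll)
    print(poll, a_down.time(), b_time.time(), b_status)
# 120 12:04:01 12:04:02 Unreachable
# 60 12:02:01 12:02:02 Unreachable
```

Under the same assumptions, a parent polled every 300 seconds would not be Down in time, and the sketch returns Down for Node B. That is the unnecessary alert best practice 1 is meant to avoid.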

                    So do dependencies possibly add some time before you receive an alert?  Yes, but with the benefit of not being flooded by 2000 false alerts when a key switch or router is Down.

                    • Re: Tips for Defining Groups and Dependencies - a running list?
                      patriot

                      Thanks Karlo. That's the kind of detailed feedback I was seeking. The best practices are particularly appreciated and make sense.

                    • Re: Tips for Defining Groups and Dependencies - a running list?
                      patriot

                      By the way Karlo, I think this thread would be a good subject for a Blog Tip!

                    • Re: Tips for Defining Groups and Dependencies - a running list?
                      Leon Adato

                      Thanks for this (and all the other) posts on this feature.

                      One challenge we've run into is that over our very wide-spread environment, almost everything is a parent of something and a child of something else - with the exception of our very top level core equipment and our end-of-the-line devices.

                      An example is:

                      Orion server -> our internet link -> remote site internet link -> our remote VPN router -> remote site core switch -> (multiple) remote site distribution switches -> endpoint devices (servers, WAPs, etc.)

                      Making matters even more complex, often the remote sites have their own remote sites, i.e., our VPN router is at the remote corporate center, but there are satellite offices from there. It is not feasible to put separate VPN routers from our core to those sites.

                      In any case, what we are seeing is that alert suppression via dependencies is a hit-or-miss situation because of the polling frequency (which we've kept to be the same across the board).

                      What we've done to compensate is to set a systemwide node-down delay on alerts, so that things have time to sort themselves out before we send out redundant alarms. But of course that decreases the real-time aspect of Orion monitoring.

                      Any thoughts are appreciated.

                      • Re: Tips for Defining Groups and Dependencies - a running list?
                        Karlo.Zatylny

                        Hi Adatole,

                        Dependencies and Unreachable status should not be hit and miss.  If you get Down status alerts when a node should be Unreachable, that is an issue we want to know about and deal with.

                        Keeping your polling frequency the same across the board is the best practice.  It sounds like you are having issues with great grandchildren firing down alerts when a great grandparent is Down.  Is that true?  About how many levels does it take before this becomes an issue?  Is the problem across multiple polling engines?

                        If you are able to replicate this situation then DEBUG logs from the Collector Data Processor during this time period would help us narrow down any timing issues that are out there. Please open a support case with these logs and reference this thwack post so that support can let me know.

                        Thanks