10 Replies Latest reply on Dec 16, 2011 4:59 AM by bleearg13

    Alerts, Dependencies, and Groups

    bleearg13

      I'm going insane trying to configure an incredibly simple dependency so we don't get multiple alerts when a parent goes down.  Unfortunately, we are still getting alerts on child nodes.  Here's what I have:

      Hardware Node A = Parent
      Group A = List of virtual servers on hardware node A (currently consists of a single member, see below)
      Virtual Server A = Member of 'Group A'

      I have an Advanced Alert set up (Node alert, not Group alert) that will trigger when nodes go into 'Down' status.

      To test this out, I block access to both the hardware node and the virtual server so Orion cannot reach either one.  The problem is, even though the hardware node transitions to the 'Down' status, the virtual server does not go into 'Unreachable' status.  The virtual server is a member of only this one dependency relationship, as shown by the 'All Dependencies' resource on the Orion page for the virtual server.

      I've read through various PDFs, posts, and the Admin Guide and I cannot see what it is that I'm doing wrong here.

        • Re: Alerts, Dependencies, and Groups
          bleearg13

          So I found something else out and I'd like SolarWinds staff to confirm.  Do both the parent and child have to exist on the same poller?  I have two pollers; my hardware node is assigned to one and the virtual server to the other.  When I moved the virtual server to the same poller as the hardware node, the virtual server then became 'Unreachable'.  Why would this be?

            • Re: Alerts, Dependencies, and Groups
              Karlo.Zatylny

              The parent and child do not have to exist on the same poller.  The Dependencies algorithm checks the entire Dependencies table and will ask the system for the status of the parent.  Have you opened a case with support?  If not, send me a message and I can get on a GoToMeeting with you and see why this is not working.  I know of a couple of bugs in pre-10.2 releases that delay the child becoming unreachable but do not outright block it.
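To illustrate the point that the parent lookup does not depend on which poller owns each node, here is a minimal sketch. It is not Orion's implementation: the poller names, the `poller_reports` and `dependencies` structures, and both functions are hypothetical stand-ins, and the rule "a down child with a down parent is shown as 'Unreachable'" is the behavior described in this thread.

```python
# Hypothetical data: the status each poller last reported for its nodes.
# None of these names come from Orion; they only illustrate the lookup.
poller_reports = {
    "poller-1": {"Hardware Node A": "Down"},
    "poller-2": {"Virtual Server A": "Down"},
}

# Hypothetical dependency table: child -> parent.
dependencies = {"Virtual Server A": "Hardware Node A"}


def merged_status(reports):
    """Combine per-poller reports into one node -> status mapping."""
    merged = {}
    for statuses in reports.values():
        merged.update(statuses)
    return merged


def effective_status(node, reports, deps):
    """Return 'Unreachable' for a down node whose parent is also down."""
    status = merged_status(reports)
    own = status[node]
    parent = deps.get(node)
    if own == "Down" and parent is not None and status.get(parent) == "Down":
        return "Unreachable"
    return own


print(effective_status("Virtual Server A", poller_reports, dependencies))
```

Because the parent's status is read from the merged view rather than from the child's own poller, the result is the same whether the two nodes share a poller or not.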

              Since you are so easily able to reproduce this, we should be able to track down the issue quickly.  Let me know.

              Thanks

                • Re: Alerts, Dependencies, and Groups
                  bleearg13

                  Ticket #295166 - just opened.

                    • Re: Alerts, Dependencies, and Groups
                      Karlo.Zatylny

                      Hi,

                      I looked at the case notes, and your explanation of what was going on makes a good point that I would like to reiterate.  I also want to explain why this close-timing issue works the way it does.

                      In there you state "had set the polling intervals for the parent and child to 120 seconds and 60 seconds, respectively" - which to me means parent = 120 seconds, child = 60 seconds.  However, the text that follows makes me think that you actually meant parent = 60, child = 120.  So I'll talk about both issues.

                      Part 1: Always poll a parent at least as frequently as the child, if not more frequently.  See my "best practices" recommendations here:

                      Tips for Defining Groups and Dependencies - a running list?

                      When your devices are on different pollers, polling the parent more frequently is going to be necessary, even if only by a few seconds, because the polling frequency of devices, while quite accurate in 10.2, is subject to slight drift or delay (typically less than a couple of seconds).  However, this delay can mean the difference between a parent still being in warning and being set as down when checked by the child.  This typically is not an issue on a single poller, as nodes on the same poller are subject to the same system's slight delay.  However, on different pollers that delay might be what causes the false alert.

                      By polling the parent more frequently (or, again, at least as frequently), you lower your risk of false alerts.
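The timing argument above can be sketched with a toy model. This is a simplification under stated assumptions, not how Orion schedules polls: polls run on a fixed interval from a starting offset (the offset stands in for drift between pollers), the 'Warning' stage is ignored, and both function names are hypothetical.

```python
import math


def first_poll_at_or_after(t, interval, offset=0.0):
    """Return the time of the first poll at or after t.

    Polls are assumed to run at offset, offset + interval, offset + 2*interval, ...
    """
    if t <= offset:
        return offset
    return offset + math.ceil((t - offset) / interval) * interval


def parent_marked_down_first(outage_time, parent_interval, child_interval,
                             parent_offset=0.0, child_offset=0.0):
    """True if the parent's first post-outage poll is no later than the child's."""
    parent_detects = first_poll_at_or_after(outage_time, parent_interval, parent_offset)
    child_detects = first_poll_at_or_after(outage_time, child_interval, child_offset)
    return parent_detects <= child_detects


# Parent polled every 60 s, child every 120 s, no drift: parent is seen down first.
print(parent_marked_down_first(10, 60, 120))
# Same 60 s interval, but the parent's poller lags by 2 s: the child is checked first.
print(parent_marked_down_first(10, 60, 60, parent_offset=2.0))
```

In this model, a child whose first post-outage poll lands before the parent's would be evaluated while the parent is not yet marked down, which is the window in which a false 'Down' alert could fire.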

                      Part 2: Why would this not work when you are polling the parent more frequently and the parent and child are on different pollers?

                      This is what I would like to see in action.  We found an issue in the 10.1.X code base where if dependencies hadn't been checked in a while due to all nodes in the network being up (common scenario), then the connection to the Information Service went stale, fired an exception and you might get a false alert.  This was addressed in 10.2.  If you were seeing this behavior in 10.2, then your case is still interesting as polling a parent at 60 seconds and a child at 120 should never fire the down alert.  You would be able to see this error in the DataProcessor log of the Collector, where the exception would be logged by the DependencyManager.

                      Let me know if this is what you were seeing.

                      Thanks

                        • Re: Alerts, Dependencies, and Groups
                          bleearg13


                          > In there you state "had set the polling intervals for the parent and child to 120 seconds and 60 seconds, respectively" - which to me means parent = 120 seconds, child = 60 seconds.  However, the text that follows makes me think that you actually meant parent = 60, child = 120.  So I'll talk about both issues.



                          Argh - yes, that's what I meant.  The parent was a shorter interval than the child.

                           



                          > This is what I would like to see in action.  We found an issue in the 10.1.X code base where if dependencies hadn't been checked in a while due to all nodes in the network being up (common scenario), then the connection to the Information Service went stale, fired an exception and you might get a false alert.  This was addressed in 10.2.  If you were seeing this behavior in 10.2, then your case is still interesting as polling a parent at 60 seconds and a child at 120 should never fire the down alert.  You would be able to see this error in the DataProcessor log of the Collector, where the exception would be logged by the DependencyManager.



                          I myself am a little confused, as I was testing the other day and found that the *only* time the child went into 'Unreachable' status was when I moved it to the same poller as the parent.  I had let them sit for a good 10 minutes at one point while I got sidetracked doing other stuff and they both always showed 'Down' as the status.  Immediately after moving the child to the same poller, I saw that it went into 'Unreachable'.  I can try doing some more testing, but I can't guarantee that I will have enough time to really get into it.

                  • Re: Alerts, Dependencies, and Groups
                    klc2009

                    I'm just guessing here, but the issue may be that a virtual server already has a parent (its host).  So when you add another parent, the only time the dependency will work is if both are down.
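The guess above can be made concrete with a small sketch. It assumes, as the post suggests, that a child with more than one parent is only treated as 'Unreachable' when every parent is down. The function `child_status` is hypothetical and only illustrates that rule; it is not taken from Orion.

```python
def child_status(child_polled, parent_statuses):
    """Apply the assumed rule: 'Unreachable' only if all parents are down."""
    if child_polled != "Down":
        return child_polled
    if parent_statuses and all(s == "Down" for s in parent_statuses):
        return "Unreachable"
    return "Down"


# One parent down, one still up: under this rule the child stays 'Down'.
print(child_status("Down", ["Down", "Up"]))
# Both parents down: the child is suppressed as 'Unreachable'.
print(child_status("Down", ["Down", "Down"]))
```

If this rule holds, a second, implicit parent that stays up would explain a child that keeps alerting as 'Down' even though the configured parent is down.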