4 Replies Latest reply on Jan 5, 2017 12:54 PM by nstrebel

    Nodes showing as Warning instead of Down

    nstrebel

      We've been working with one of our end customers to set up WAN alert suppression for each of their locations. At each site, the edge routers are the parent in the dependency and the other devices at that location are the child group, so that the customer receives one alert rather than hundreds when a site goes down. We ran a test yesterday to make sure everything was working as intended. However, when our customer took down the two WAN interfaces, their Orion server showed all of the devices at that site with 100% packet loss and in a Warning state instead of Down. We had the WAN links disconnected for at least 30 minutes with no change in results. My understanding is that if the Orion server is not able to reach a device via ICMP, it's supposed to mark it as down. Our customer recently moved their Orion server to version 12, so I'm not sure if something changed from version 11 to 12. Let me know your thoughts; as always, your assistance is greatly appreciated.

       

      -Nathan
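      For readers following along, the suppression behavior described above can be sketched roughly as below. This is only a simplified model of the intent, not how Orion implements dependencies: the `Node` class, the `alerts_for_site` function, and the rule that children are suppressed only when every parent router is unresponsive are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a monitored device."""
    name: str
    responding: bool  # True if the node answered its last ICMP poll


def alerts_for_site(parents, children):
    """Return the names of nodes that should raise a 'down' alert.

    Assumed rule: if every parent (edge router) is unresponsive, the
    children are treated as unreachable and their alerts are suppressed.
    """
    parents_down = [p for p in parents if not p.responding]
    alerts = [p.name for p in parents_down]

    site_isolated = len(parents) > 0 and len(parents_down) == len(parents)
    if not site_isolated:
        # At least one path to the site is still up, so an unresponsive
        # child is a genuine failure and should alert on its own.
        alerts += [c.name for c in children if not c.responding]
    return alerts


# Both WAN routers down: only the routers alert, not the 100 site devices.
routers = [Node("edge1", False), Node("edge2", False)]
devices = [Node(f"sw{i}", False) for i in range(100)]
print(alerts_for_site(routers, devices))
```

      With one router still responding, the same call would also list any site device that has stopped answering.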

        • Re: Nodes showing as Warning instead of Down
          borgan

          How often are you status polling the parent in the dependency? If its interval is the same as or longer than that of the nodes in the child group, you might lower the poll interval for the edge router so that its status is determined more often. It could also be that using a Group as the child slows the decision, depending on the rollup method for the child group.

          • Re: Nodes showing as Warning instead of Down
            nstrebel

            borgan,

             

            The status of the node is being polled every 120 seconds. Everything went to a Warning state and stayed that way for over 30 minutes. Looking through the Orion polling settings, I see that the Node Warning Level is set to 120 seconds, so from my understanding (and correct me if I'm wrong), those devices should have stayed in Warning for 120 seconds and then gone to Down. I changed the groups to show the worst status instead of showing Warning for mixed status, but I don't think that's the issue. The two edge routers should have eventually gone into a Down state.
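            To make the expectation concrete, here is a minimal sketch of the timing as understood in this post. It models only that understanding (Warning until the Node Warning Level elapses, then Down); the function name and the status strings are assumptions, and it does not reproduce Orion's actual polling logic.

```python
def expected_node_status(seconds_unresponsive, warning_level=120):
    """Status a node is expected to show after failing ICMP polls.

    Assumed model: the node stays in Warning until it has been
    unresponsive for `warning_level` seconds, then it goes Down.
    """
    if seconds_unresponsive <= 0:
        return "Up"
    if seconds_unresponsive < warning_level:
        return "Warning"
    return "Down"


# The WAN links were disconnected for at least 30 minutes (1800 seconds).
print(expected_node_status(60))       # inside the warning window
print(expected_node_status(30 * 60))  # well past the warning window
```

            Under this model the edge routers would be Down long before the 30-minute mark, which is why the persistent Warning state looks wrong.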

              • Re: Nodes showing as Warning instead of Down
                borgan

                You are correct about the function of the Node Warning Level, and it is a global setting. Here is one thing you might try: temporarily replace your current dependency, which uses the group as the child, with a dependency that has the edge router as the parent and the next downstream device as the child. I'm wondering if the Group is the problem. Another alternative would be to avoid dependencies altogether and create a Group of all edge routers with a show-worst rollup, then create an alert for only the edge routers. This assumes that all personnel know enough to ignore alerts from site devices if the edge router for that site is down at the time.

                 

                Does any of this help?
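                As a rough illustration of the two group rollup modes mentioned in this thread, the sketch below compares "show worst status" with "mixed status shows warning". The status names, their severity ordering, and both function names are assumptions for the example; Orion supports more statuses than the three modeled here.

```python
# Assumed severity ordering for the three statuses used in this sketch.
SEVERITY = {"Up": 0, "Warning": 1, "Down": 2}


def rollup_worst(statuses):
    """Group status is the most severe member status (non-empty input)."""
    return max(statuses, key=SEVERITY.__getitem__)


def rollup_mixed_as_warning(statuses):
    """Group status is Warning unless all members agree (non-empty input)."""
    unique = set(statuses)
    if len(unique) == 1:
        return unique.pop()
    return "Warning"


members = ["Down", "Up", "Up"]
print(rollup_worst(members))             # Down
print(rollup_mixed_as_warning(members))  # Warning
```

                The difference matters for a parent or child group: with the mixed mode, a group can sit in Warning even when some members are Down, whereas show-worst reports Down as soon as any member is Down.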

              • Re: Nodes showing as Warning instead of Down
                nstrebel

                I can certainly try changing the parent/child to see what happens; however, we'll need this setup to work. Each device alert notification creates a ticket for our engineers to work, and since some of these locations have over 100 devices, we don't want a ticket created for every device that goes down at a site. From my understanding, our client ran a test in early December at a site with one router, and that worked fine. I would like to do more testing at locations with one or more routers to get some consistency, but it would all have to be done outside of business hours.