Dependencies and Alerts

Hi,

I have been reading up on dependencies and alerts in order to customize the alerting for our environment. We have around 90 sites globally and would like some level of intelligent alerting: when a site goes down (all nodes within the group), only the site should alert, not each node within the site. If I understand correctly, this should work if I create a group for the site and a group for all the nodes within the site. I have also tried automatic dependency creation to flush out routes, but I am not convinced this is the right way.

I would then also like a second-level dependency. Each site has one or two Internet routers; if they go down, none of the child nodes should raise an alert. However, this does not mean that the site is down (first level), especially when there are two Internet routers, as one router may be a failover or may simply supply Internet to another rack on the site.

It sounds incredibly complex, and I am not sure this is even possible in SolarWinds; I might have to do this with the alerts as a second level instead of with dependencies.

  • As I read it, you have 2 scenarios to solve for:

    1. Suppress "downstream node" alerts when only the WAN link is dropped
    2. Suppress alerts to a single notification if the entire site goes offline

    Based on that, I would look at this:

    WAN Link:

    1. Create 'Child Groups' for each site, with membership including everything but the Internet routers
    2. Create manual dependencies where you place each router as the parent of the child group
      1. This means if you have 2 routers, you make 2 dependencies for the site

    This will suppress all node alerts for the site if a router goes into a 'down' status
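The WAN Link steps above can be pictured with a small toy model. This is only a sketch of the idea, not anything Orion exposes: `effective_status`, `alerting_nodes`, and the node names are hypothetical. Whether a child with two parents is suppressed when either router is down or only when both are down is left as a switch (`require_all_parents_down`), because that behavior is an assumption here and worth verifying in your own environment.

```python
from typing import Dict, List


def effective_status(
    polled: Dict[str, str],
    parents: Dict[str, List[str]],
    require_all_parents_down: bool = True,
) -> Dict[str, str]:
    """Return the status each node would alert on (toy model).

    polled  : raw poll result per node, "up" or "down"
    parents : child node -> parent nodes (one entry per manual dependency)
    require_all_parents_down : ASSUMPTION, not confirmed product behavior;
        True  = child is suppressed only when every parent is down
        False = child is suppressed as soon as any parent is down
    """
    result = {}
    for node, status in polled.items():
        node_parents = parents.get(node, [])
        parent_down = [polled[p] == "down" for p in node_parents]
        if not node_parents:
            suppressed = False
        elif require_all_parents_down:
            suppressed = all(parent_down)
        else:
            suppressed = any(parent_down)
        # A down child behind failed parent(s) is reported as unreachable
        result[node] = "unreachable" if status == "down" and suppressed else status
    return result


def alerting_nodes(status: Dict[str, str]) -> List[str]:
    """Only nodes that are really 'down' would raise a node-down alert."""
    return sorted(n for n, s in status.items() if s == "down")


# Hypothetical site: two routers as parents of everything else
site_parents = {
    "switch_1": ["router_a", "router_b"],
    "server_1": ["router_a", "router_b"],
}
site_offline = {n: "down" for n in ["router_a", "router_b", "switch_1", "server_1"]}
print(alerting_nodes(effective_status(site_offline, site_parents)))
```

With both routers down, only the routers are left as "down"; the switch and server become "unreachable" and stay quiet.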

    Site Offline: There are some challenges with this one

    1. If all nodes are down, the Internet routers are down too, so the rest of the nodes at the site would be marked "unreachable" instead of "down". This makes a standard group alert hard to create, because the "unreachable" status doesn't affect the group status; its purpose is to suppress alerts.
      1. I would recommend testing this theory. I am 99% sure "unreachable" doesn't impact group status, but I have been wrong before.
    2. Given that the site would only show 100% down if all the Internet routers are down, the WAN Link scenario will cover all sites with 1 router. If you have sites with 2 routers, you can create 'Parent Groups' containing them with the status rollup set to "Show Best", then create a group status alert for those.
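To show why a "Show Best" rollup suits the two-router case, here is a rough sketch of best/worst rollup. The `rollup` function and the `SEVERITY` ordering are hypothetical and simplified to three statuses; they only illustrate the idea that a "best" group goes down when every member is down.

```python
from typing import List

# Lower number = "better" status. Simplified ordering, assumed for illustration.
SEVERITY = {"up": 0, "warning": 1, "down": 2}


def rollup(member_statuses: List[str], mode: str = "best") -> str:
    """Roll member statuses up into one group status (toy model)."""
    if not member_statuses:
        raise ValueError("group has no members")
    if mode not in ("best", "worst"):
        raise ValueError(f"unknown rollup mode: {mode}")
    pick = min if mode == "best" else max
    return pick(member_statuses, key=lambda s: SEVERITY[s])


# Parent group holding the two Internet routers of one site
print(rollup(["down", "up"], mode="best"))    # one router still up -> "up"
print(rollup(["down", "down"], mode="best"))  # both routers down   -> "down"
print(rollup(["down", "up"], mode="worst"))   # worst-case rollup   -> "down"
```

A group status alert on the "best" parent group therefore fires once, and only when both routers are down.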

    There are multiple other ways, some more complex, some simpler but more manual to implement. If you have more questions, let's get a discussion going here.

  • My question today is kind of twofold.

    I have set up dependencies for the different sites across the globe so that we only receive one alert when all the nodes in a group go down. I created a root group for the country, then a subgroup for each site within the country (some countries have more than 5 sites), and then another subgroup containing all the nodes. Thus, if the nodes for site 1 are down, an alert is raised only for that site, not for the whole country. I created the dependencies first, and only then added the subgroups into their parent groups for better display. We had a site down last night, but no alert was triggered: the alert checks for the site being DOWN, and this site remained in a Warning state for more than 5 hours. My questions are as follows:

    1. Why would the nodes and the site remain in a Warning state for that long? In theory they should change to down after 30 seconds (my polling settings are set to that).
    2. Does the dependency have anything to do with it, or does my dependency no longer work because I added the subgroups as objects into the parent group?

    EXAMPLE:

    [Image: pastedImage_3.png]

    Sorry, I can't display the node names.

    Please help, this is making me brain dead.

  • Are your alerts on the nodes themselves or on the Groups?

  • Yes, they are on the nodes as well as on the main group (Portsmouth), as a site-down alert.

  • The site would only have stayed in warning if at least one node at the site was up. Can you confirm that everything showed as being down in Orion?

  • Another possibility is that the unmanaged node was being counted against the "are all the nodes down?" logic

  • They weren't showing as down. They were all showing as Warning, which caused the site to show only Warning as well. In fact they were all down: there was a major power failure that left them all off for more than 5 hours, yet SolarWinds only showed them as Warning.

  • SolarWinds only shows nodes in warning when they are responding intermittently to the ping requests from the poller. If all settings are at default, they should have gone down after 120 seconds of continuous missed pings.
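The warning-then-down behavior described in the reply above can be sketched as a tiny state machine. This is a simplified model, not Orion's actual polling code: `node_status` is a hypothetical helper, and the fixed 10-second interval between polls is an assumption. Only the 120-second default and the 30-second override come from this thread.

```python
from typing import List


def node_status(
    poll_results: List[bool],
    poll_interval_s: int = 10,  # ASSUMED spacing between pings
    down_after_s: int = 120,    # default mentioned in the thread
) -> str:
    """Return the node status after the last poll (toy model).

    poll_results: True = ping answered, False = ping missed.
    """
    status = "up"
    missed_s = 0
    for answered in poll_results:
        if answered:
            # Any reply resets the counter, so intermittent nodes never reach "down"
            missed_s = 0
            status = "up"
        else:
            missed_s += poll_interval_s
            status = "down" if missed_s >= down_after_s else "warning"
    return status


print(node_status([False] * 11))                   # 110 s missed
print(node_status([False] * 12))                   # 120 s missed
print(node_status([False] * 3, down_after_s=30))   # 30 s override
```

Under this model, a node that stays in warning for hours must have answered at least one ping every so often; a node that misses every ping should reach "down" once the threshold passes.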

  • This is exactly why I posed the question: why did they not go into down status? I have even changed the default from 120 seconds to 30 seconds.

  • Do any nodes ever show down in your environment? I've never run into a situation where nodes failed to go down when the Orion server couldn't ping them, so you will likely need to do some testing within your environment to pin down whether the change you made is a factor.