This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.
You can no longer post new replies to this discussion. If you have a similar question you can start a new discussion in this forum.

How do we adjust how often SolarWinds checks node status? (We're seeing node statuses flap, causing alerts to fire unnecessarily.)

Summary: We want to configure one of our alerts to fire only after 4 consecutive node status checks report the node as "down" (we don't want false positives, hence the desire for 4 consecutive status checks).

To do this, we need to reliably know how often SolarWinds checks the node status, and right now that isn't clear to me. Is the status check done at the frequency shown on the node details page (the "Polling Interval")? Or at the frequency defined on the Polling Settings page?

Details:

We're using SolarWinds to validate that our nodes (servers) are up and working. We are also using Server & Application Monitor and have configured alerts to page us when a node is in a "down" status.

But it isn't clear to me how to configure the frequency at which SolarWinds checks the node status (nor what the default is).

On the node details page, the "Polling Details" section says the "Polling Interval" is 300 seconds.

Is that the frequency at which the polling engine will check the node status?

If I go to Polling settings I see this:

Does this mean that the polling interval is 120 seconds for every node? If so, why does this differ from the polling interval on the node details page? Or is this a global default, meaning that newly created nodes will get a 120-second polling interval?

Note: I've tried clicking "Re-Apply Polling Intervals" on the above admin page, thinking it would change the polling interval for each node, but when I go to the node details page the polling interval remains at 300 seconds. (I tried to follow the instructions here: https://documentation.solarwinds.com/en/success_center/orionplatform/content/core-polling-intervals-sw1826.htm , but as mentioned, after clicking the Re-Apply Polling Intervals button I see no change to the polling interval on the node details page.)

Thus, it isn't clear to me how often my nodes are being checked for status. Is it every 300 seconds, since that is what the node details page says?

Thank you~!

Note: I'm asking because we are getting paged when a node goes down and last night we got paged at 2am.  But when we inspected the node, everything was fine.

Message Center is telling us the following:

9:08pm - node has stopped responding

9:11pm - node is responding again

10:48pm - node has stopped responding

10:51pm - node is responding again

1:43am - node is not responding

1:46am - node is responding again

1:53am - node is not responding

2:04am - alert fires

2:06am - node is responding again

2:09am - alert recovers

Our alert is configured to evaluate the trigger condition every 2 minutes, and the condition must exist for more than 7 minutes. That explains why the alert didn't fire until 2:04am: the node had been down for 11 minutes at that point, but if the condition is only checked every 2 minutes and must exist for 7 minutes, that accounts for the delay.

Obviously we have a communication issue; however, we don't want our alerts to be so sensitive that they page us when the node is actually fine. So we are adjusting our trigger condition, and we also want to adjust the frequency at which SolarWinds checks the node status.

We're thinking we want our alert to fire only if 4 consecutive checks of the node status indicate a down node. To do that, though, we need to reliably know how often SolarWinds is checking the node status. If it is checking every 300 seconds, we'll adjust the "condition must exist for" setting to 20 minutes (4 consecutive polls). If it is checking every 120 seconds, we'll probably set it to 8 minutes (or perhaps a little more, to avoid false positives).
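The arithmetic we're using to size the "condition must exist for" timer can be sketched like this (`condition_timer_seconds` is a hypothetical helper for illustration, not a SolarWinds setting; the two intervals are just the candidate values from above):

```python
def condition_timer_seconds(poll_interval_s, consecutive_polls, slack_s=0):
    """How long the alert's 'condition must exist for' timer should be so
    that roughly `consecutive_polls` consecutive status polls must fail
    before the alert fires. `slack_s` is optional padding so the timer
    doesn't expire exactly on a poll boundary."""
    return poll_interval_s * consecutive_polls + slack_s

print(condition_timer_seconds(300, 4))   # 1200 s = 20 minutes
print(condition_timer_seconds(120, 4))   # 480 s  = 8 minutes
```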

Thanks again

  • A few things to get through here.

    Polling Settings

    The values shown here are the default intervals when the node/interface/volume is added to the system. E.g. if you add a new node it will default to check for status (up/down, response time, packet loss) every 120 seconds and for statistics (CPU/mem/etc) every 10 minutes.

    You can force the system to change every object back to the defaults, but you will need to untick the "Lock custom values" checkbox first. If someone has set a custom interval and that box is ticked, re-applying won't update it (the UI isn't very clear about this).

    If you untick the Lock Custom Values and then click Re-Apply Polling Interval it will update every object. You will lose all custom configured intervals and will have to go edit those objects to apply the custom intervals again if needed.

    Node Polling Details

    You can override the global default settings and configure particular nodes/interfaces/volume to use different polling intervals. In this case, the node in your screenshot has been configured for 300 seconds.

    If you edit that node in your screenshot, you should be able to see the configured polling intervals and I'm guessing it will have the 300 seconds there.
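    If you want to confirm the per-node status-poll intervals in bulk rather than clicking through each node, a SWQL query (via SWQL Studio or the Orion SDK) should show them - `PollInterval` here is the status-poll interval in seconds. Field names are from the Orion.Nodes schema as I remember it, so verify against your version:

```sql
-- Status-poll interval (seconds) for every node
SELECT Caption, PollInterval
FROM Orion.Nodes
ORDER BY PollInterval DESC
```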

    Alerting

    The "evaluate the trigger condition" is how often the alert will check the trigger conditions against the database. In your case it will check the database every 2 minutes.

    The "condition must exist for" timer starts when the trigger conditions are first met, and progress against it is only checked on your 2-minute "evaluate" boundaries.

    Depending on the exact time the node goes down and when your 2 minute alert evaluation occurs, you'll find some variation in your notifications.

    e.g. If you're lucky the alert checks immediately after the node goes down:

    • T0 - Node goes down.
    • T1 - the alert checks and matches condition
    • T121 - alert confirms still active (waiting for 7 minute timer)
    • T241 - alert confirms still active (waiting for 7 minute timer)
    • T361 - alert confirms still active (waiting for 7 minute timer)
    • T481 - alert confirms still active. 7 mins confirmed and triggers alert.
    • Node has been down for 481 seconds

    vs if the node goes down immediately after an alert check

    • T0 - the alert checks and matches condition
    • T1 - Node goes down.
    • T120 - the alert checks and matches condition
    • T240 - alert confirms still active (waiting for 7 minute timer)
    • T360 - alert confirms still active (waiting for 7 minute timer)
    • T480 - alert confirms still active (waiting for 7 minute timer)
    • T600 - alert confirms still active. 7 mins confirmed and triggers alert.
    • Node has been down for 599 seconds
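    These two timelines can be reproduced with a small simulation (a sketch, assuming the "condition must exist for" timer anchors to the first evaluation that sees the node down):

```python
def alert_fire_time(down_at, eval_interval=120, condition_s=420, first_eval=0):
    """Return the evaluation time (seconds) at which the alert fires.

    The alert engine evaluates every `eval_interval` seconds; the
    'condition must exist for' timer starts at the first evaluation
    that sees the node down, and the alert fires at the first later
    evaluation where the condition has held for >= condition_s seconds.
    """
    t = first_eval
    first_match = None
    while True:
        if t >= down_at:                # this evaluation sees the node down
            if first_match is None:
                first_match = t         # condition timer starts here
            elif t - first_match >= condition_s:
                return t                # condition held long enough: fire
        t += eval_interval

# Best case: an evaluation lands right after the node goes down.
print(alert_fire_time(down_at=0, first_eval=1))   # 481 (node down 481 s)
# Worst case: the node goes down just after an evaluation.
print(alert_fire_time(down_at=1, first_eval=0))   # 600 (node down 599 s)
```

    The 481 vs 600 second spread is exactly the evaluation-boundary jitter in the two timelines above.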

    The other factor you should be aware of is the Node Warning Interval (configured further down on the Polling Settings page). When a node stops responding, it doesn't immediately go "Down". It will go into "warning" and a "fast poll mode" where SolarWinds will ping it every 10 seconds for a set period of time (default 120 seconds). If at the end of that period it still doesn't have a response, then it is changed to Down status in SolarWinds.

    You can narrow the node down -> alert window by modifying all of these values: the polling interval, the alert evaluation interval, the condition must exist for timer, the node warning period. Make them too tight and you may get some false positives and catch nodes as they reboot. Make them too loose and you delay how long it takes to get notified of a problem.

    The middle ground is somewhere in between, based on your alerting requirements - you could have devices with varying polling intervals as well as alerts with different intervals (e.g. high-priority devices poll more frequently, alert evaluation is frequent, and there is little or no condition timer, vs lower-priority devices that poll less frequently, alert evaluation is every few minutes, and you have a condition timer).
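    Putting the four knobs together, here is a back-of-envelope worst-case "node dies -> alert fires" estimate (a sketch under the assumptions above: fixed status-poll interval, node sits in Warning for the full warning interval, and the condition timer anchors to an evaluation boundary):

```python
import math

def worst_case_seconds_to_alert(poll_s, warning_s, eval_s, condition_s):
    # Node dies just after a successful status poll, so the failure is
    # first seen one full poll interval later, then the node sits in
    # Warning (fast-poll) for warning_s before being marked Down.
    time_to_down = poll_s + warning_s
    # Down status just misses an evaluation: one eval interval until the
    # condition timer starts, then the timer must complete and be
    # confirmed on an evaluation boundary.
    time_to_fire = eval_s * (1 + math.ceil(condition_s / eval_s))
    return time_to_down + time_to_fire

# 120 s poll, 120 s warning, 2 min evaluation, 7 min condition timer
print(worst_case_seconds_to_alert(120, 120, 120, 420) / 60)   # 14.0 minutes
# Same alert settings with a 300 s poll interval
print(worst_case_seconds_to_alert(300, 120, 120, 420) / 60)   # 17.0 minutes
```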

  • Thank you!

    I didn't know about the "Node Warning Interval" and how, when a node first stops responding, the polling engine will start fast-polling it (my guess is ICMP pings?) until the warning interval elapses... and if the node is still not responding at that point, it is marked as down.

    This is most helpful!

