
Alert cleanup - node down and high packet loss

We are working on alert cleanup in the system and setting up our alerts so that every configured alert integrates with WHD and creates a ticket.

I'm currently having issues with the node down alerts and the high packet loss alerts.  From reading online, the system determines a node's status by pinging it every 120 seconds (we left the default here).  Once the first ping is missed, the node status is set to warning and it goes into "fast ping", where it pings the node every 10 seconds for 120 seconds.  If all of those pings are missed, it sets the status to down.

How does Node Down Work?
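
Roughly, my understanding of that behavior as a sketch (just an illustration of the description above, not SolarWinds' actual code; the ping() callback and check_node name are made up):

```python
def check_node(ping, fast_interval=10, fast_window=120):
    """One 120-second poll cycle; ping() returns True on a reply, False on a miss."""
    if ping():
        return "Up"                     # the regular poll answered
    # First missed ping: the node drops to Warning and "fast ping" starts,
    # one ping every 10 seconds for 120 seconds (12 attempts).
    for _ in range(fast_window // fast_interval):
        if ping():
            return "Up"                 # any fast-ping reply recovers the node
    return "Down"                       # every fast ping missed -> node is down

# Example: a node that never answers ends up Down after the fast-ping window.
print(check_node(lambda: False))        # Down
```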

Now for % packet loss: this is calculated from the last 10 pings held in memory, and from what I have read that includes the "fast ping" responses.

How is % Packet Loss calculated?
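
And my understanding of the % loss math as a sketch (again illustrative only; the PacketLossTracker class is invented for this example, not an Orion API):

```python
from collections import deque

class PacketLossTracker:
    """Keeps the last 10 ping results in memory; True = reply, False = miss."""
    def __init__(self, window=10):
        self.results = deque(maxlen=window)

    def record(self, replied):
        self.results.append(replied)

    def percent_loss(self):
        if not self.results:
            return 0.0
        missed = sum(1 for r in self.results if not r)
        return 100.0 * missed / len(self.results)

# A node that answered only 6 of its last 10 pings (regular plus fast pings)
# already shows 40% loss while it is still on its way to being marked down.
tracker = PacketLossTracker()
for replied in [True] * 6 + [False] * 4:
    tracker.record(replied)
print(tracker.percent_loss())   # 40.0
```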

This means that we will get an alert triggered on % packet loss just before the node is considered down.  Now I have two separate tickets for what really should just be a down node.  I am trying to figure out what I can do to fix this logic and still get notified of issues as soon as possible.  Our site uses NPM and SAM; any recommendations on this would be appreciated.

  • Generally I don't trust a packet loss alert unless it has the timer set to at least 3 minutes (2 minutes for the polling plus some time for the node to be switched to down status) and the node hasn't been marked down.  There's no way around it: all nodes will trigger a packet loss alert on their way down if you don't put that timer on there.  You either get duplicate alerts or you wait.

  • Agreed. Personally I always edit the Packet Loss alert to include a rule of Node must be Up as well as a slight time increase. Otherwise for each Node down you will get a packet loss alert, which is a false alert. I always see this alert as a precursor to high traffic utilisation so you don’t want to confuse it with Node down or unreachable.

  • We deal with the high packet loss alert in exactly the way you suggest - we add a condition that says the node must be up.


    But how do you deal with the alert when the node actually is up? For example, we are getting high packet loss alerts in the minutes after a node comes back online. In this case, the alert trigger condition is now TRUE and we get 80%, 70%, 60%, etc. packet loss warnings.


  • So currently at our site I have added in both mesverrum's and @david smith's recommendations, as well as one other check.

    I have my alerts scoped to include only nodes whose status is not equal to down and whose system uptime is greater than 600 seconds.

    Then I have the alert trigger condition for our site set to percent loss greater than 40%.

    Finally, the condition has to exist for 5 minutes as well.

    This seems to have cleared up most of the false positives; in particular, the scope limitation requiring the node to have been up for a while seems to cut down on the alerts that trigger right after a node reboots. A rough sketch of the combined check is below.
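
For reference, here's a rough sketch of how that combined check behaves, written as plain Python outside of Orion; the should_alert helper, field names, and the way state is tracked are purely illustrative:

```python
import time

SUSTAIN_SECONDS = 5 * 60       # the "condition must exist for 5 minutes" part
_loss_exceeded_since = {}      # node id -> when loss first crossed the threshold

def should_alert(node_id, status, uptime_seconds, percent_loss, now=None):
    """Alert only for in-scope nodes whose loss has stayed above 40% for 5 minutes."""
    now = time.time() if now is None else now
    in_scope = status != "Down" and uptime_seconds > 600
    if not in_scope or percent_loss <= 40:
        _loss_exceeded_since.pop(node_id, None)   # reset the sustain timer
        return False
    started = _loss_exceeded_since.setdefault(node_id, now)
    return (now - started) >= SUSTAIN_SECONDS

# Example: the second call, 5 minutes later, is the one that fires.
print(should_alert("node-1", "Up", 3600, 80, now=0))     # False (timer just started)
print(should_alert("node-1", "Up", 3600, 80, now=300))   # True  (held for 5 minutes)
```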

  • Glad you got it working as required. Good job.

  • Still seeing a high packet loss alert immediately after a device reboots, despite configuring exactly as you have done above:

    Status equals UP
    System uptime is greater than 900 (assume this is seconds?)
    % loss is greater than 40%
    Condition exists for 5 minutes




  • Here's a screen cap of my trigger.  Just a note: our email alerts go to a different group depending on whether it's a Windows server or a switch/router, which is the reason for the "is server" logic in my trigger.  This still isn't foolproof, but it cleans up a lot of the junk.  You also might make the condition exist for longer.

    [Screenshot of the alert trigger conditions]