This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.

Critical Value Reached (Percent Loss) Triggering Falsely

fdamstra over 8 years ago

Since upgrading from 11.5.0 to 11.5.2, our alert on packet loss has started triggering (and recovering, and triggering, and recovering, ad nauseam). The alert uses the "Critical Value Reached (Percent Loss)" and "Warning Value Reached (Percent Loss)" conditions (aka SWQL E0.[PercentLossThreshold].[IsLevel2State]).

The nodes are reporting 0% loss. There is nothing in the graph, and even in the trigger of the alert I send out ${Node.PercentLoss}, which is always showing 0%.

I opened ticket 821972 on Tuesday, but I think with all the releases, support is very backed up, so the case hasn't even been started.

Anybody using thresholds for alerting? Are you experiencing the same issues?

Top Replies

glink over 8 years ago

I also noticed issues after upgrading to 11.5.2, with the "Critical Value Reached (Percent Loss)" alerts for CPU and Memory reporting erroneously. In my case the alerts in question are not even monitoring packet loss but CPU and Memory thresholds; the results are the same, however. I have since adjusted my alerts to look only at UP\Active Nodes and to use hard-coded percentages instead of relying solely on the Critical Value % variable. It definitely needs to be looked into, though, as I never had this problem before the minor version upgrade from 11.5.1 to 11.5.2.

cnorcross over 8 years ago in reply to glink

I'm seeing the same problem with the High CPU alert after upgrading from 11.5 to 11.5.2 this morning. Some of the affected servers have never gone above 25% CPU usage, and in one case the alert fired for a server that has been powered off for three days.
I've temporarily moved the alert off the threshold flag and it seems to have cleared the problem up for now.

pschellhaas over 8 years ago in reply to cnorcross

I have opened a case about this problem. I think all "Critical Value Reached" triggers are affected.

humejo over 8 years ago

I recently observed this with my latest client and figured out what the problem is. It is definitely a bug. When you create an alert condition using any of the Threshold choices, the alerting engine queries a threshold table that contains all types of node thresholds, including CPU, PercentLoss, Response Time, and Memory. When it then checks the current value against the threshold value, the SWQL query that the Web Alerting Engine Trigger Condition GUI builds doesn't specify the threshold type. In other words, if you create a CPU alert to trigger when the current CPULoad goes above the CPU Critical Threshold, it actually alerts when a node's current CPULoad is above any Critical Threshold set on that node. If a node with a Critical CPU threshold of 90% is currently at 70% CPULoad and has a PercentLoss Critical Threshold set at 50%, the alert will trigger as if its CPU threshold were 50%.
To make things even more confusing, the email action variables resolve correctly, so if your alert message is set up to display the current CPULoad percentage and the CPU Load Critical Threshold percentage, you'll get a line like this in your email: "Node SERVER1 has a Critical CPU Load of 70% which is over its Critical Threshold value of 90%". What?? Umm, OK...
To figure this out I used the "Show SWQL" option in the Trigger Condition, copied the SWQL into SWQL Studio, and ran the query. Then I added a couple of extra items to the SELECT statement to see what kind of info it was pulling, and saw that the threshold it pulled was the PercentLoss threshold. So a temporary fix would be to write the SWQL query manually and add an extra condition to the WHERE clause saying that the threshold type must also be CPULoad, or whatever threshold type you are creating the alert for. Hopefully they will fix this in the next release. Sorry I'm not giving exact info or including any examples, but I'm on a computer that doesn't have access to an Orion install at the moment.
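For readers who want to see the shape of the problem, here is a minimal sketch in Python using an in-memory SQLite database. It is not the SWQL that Orion generates: the table names (`nodes`, `thresholds`) and columns (`threshold_type`, `critical`, `cpu_load`) are hypothetical stand-ins chosen only to mirror the 70% / 90% / 50% example above.

```python
import sqlite3

# Hypothetical stand-in for the threshold table described above.
# These names are illustrative only, NOT the real Orion schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, caption TEXT, cpu_load REAL);
    CREATE TABLE thresholds (node_id INTEGER, threshold_type TEXT, critical REAL);
    INSERT INTO nodes VALUES (1, 'SERVER1', 70);
    INSERT INTO thresholds VALUES (1, 'CPULoad', 90);
    INSERT INTO thresholds VALUES (1, 'PercentLoss', 50);
""")

# Buggy shape: compares CPU load against every threshold row for the node,
# whatever its type.
BUGGY = """
    SELECT n.caption, t.threshold_type, t.critical
    FROM nodes n JOIN thresholds t ON t.node_id = n.node_id
    WHERE n.cpu_load > t.critical
"""

# Workaround shape: the same query plus a filter on the threshold type.
FIXED = BUGGY + " AND t.threshold_type = 'CPULoad'"

buggy_rows = conn.execute(BUGGY).fetchall()
fixed_rows = conn.execute(FIXED).fetchall()

print("without type filter:", buggy_rows)
print("with type filter:   ", fixed_rows)
```

Without the type filter, SERVER1 matches on its PercentLoss row (70 > 50) even though it is well under its 90% CPU threshold; with the filter, nothing matches. The real workaround would be the equivalent extra WHERE condition in a custom SWQL alert, using whatever the actual threshold-type field is called in your install.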
Thanks,

Jordan Hume
Field Systems Engineer
Loop1 Systems, Inc.
Specializing in SolarWinds Training and Consultation

kasaff over 8 years ago

Just FYI, we should have this corrected in hot fix 1 for NPM 11.5.2. It's due out in the next couple of weeks.

syncopix over 8 years ago

Something got broken in 11.5.2, so all alerts that trigger based on a threshold won't work as expected. As you've seen, you'll get alerts when packet loss is 0%, for example. Even if a node is of type "external" (i.e. no polling), you'll sometimes see it triggering an alert with negative packet loss reported.
For a workaround until the hotfix is released, take a look here: Re: NPM 11.5.2 RC1 is in the customer portal

Diner over 8 years ago in reply to kasaff

I'd like to make a small correction. The hotfix for NPM 11.5.2 (Orion platform 2015.1.2) is already public. You can download it here.

syncopix over 8 years ago in reply to Diner

Thanks. And great to see that SolarWinds have released an MSP! No more having to manually replace files.

tyoung1 over 8 years ago

Same issue here: on 11.5.2, response time thresholds on Warning Value Reached and Critical Value Reached generate hundreds of invalid alarms all day long. I opened case #838923 but closed it because I cannot devote time to troubleshooting a simple monitoring concept. I don't have any fix.

humejo over 8 years ago in reply to tyoung1

Did you apply the hotfix? It fixes that exact issue (amongst several other items). If the issue is still present after installing it, then either the fix did not apply properly for some reason (did you run the Configuration Wizard after applying the hotfix? This one requires it), or there may actually be something wrong with the alert logic itself.