11 Replies Latest reply on Jul 31, 2015 10:55 AM by tyoung@proskauer.com

    Critical Value Reached (Percent Loss) Triggering Falsely

    fdamstra

      Since upgrading from 11.5.0 to 11.5.2, our alert on packet loss has started triggering (and recovering, and triggering, and recovering, ad nauseam). The alert uses the "Critical Value Reached (Percent Loss)" and "Warning Value Reached (Percent Loss)" conditions (aka SWQL E0.[PercentLossThreshold].[IsLevel2State]).

       

      The nodes are reporting 0% loss. The graph shows nothing, and the alert's trigger action sends out ${Node.PercentLoss}, which always shows 0%.

       

      I opened ticket 821972 on Tuesday, but I think with all the releases, support is very backed up, so the case hasn't even been started.

       

      Anybody using thresholds for alerting? Are you experiencing the same issues?

        • Re: Critical Value Reached (Percent Loss) Triggering Falsely
          glink

          I also noticed issues after upgrading to 11.5.2, with the "Critical Value Reached (Percent Loss)" alerts for CPU and Memory reporting erroneously. In my case the alerts in question are not even monitoring packet loss; they look at CPU and Memory thresholds, but the results are the same. I have since adjusted my alerts to look only at UP\Active nodes and to use hard-coded percentages instead of relying solely on the Critical Value % variable. It definitely needs to be looked into, though, as I never had this problem before the minor version upgrade from 11.5.1 to 11.5.2.

          • Re: Critical Value Reached (Percent Loss) Triggering Falsely
            humejo

            I recently observed this with my latest client and figured out what the problem is. Definitely a bug. When you create an Alert Condition using any of the Threshold choices, the alerting engine queries a threshold table that contains all types of Node thresholds, including CPU, PercentLoss, Response Time, and Memory. Then, when it checks the current value against the threshold value, the SWQL query that the Web Alerting Engine Trigger Condition GUI builds doesn't specify the threshold type. In other words, if you create a CPU alert to trigger when the current CPULoad goes above the CPU Critical Threshold, it alerts when a Node's current CPULoad is above any Critical Threshold set on that node. If a Node with a Critical CPU threshold of 90% is currently at 70% CPULoad and has a PercentLoss Critical Threshold set at 50%, the alert will trigger as if its CPU Threshold were 50%. To make things even more confusing, the Email action variables resolve correctly, so if you have your alert message set up to display the current CPULoad percentage and the CPU Load Critical Threshold percent, you'll get a line like this in your email: "Node SERVER1 has a Critical CPU Load of 70% which is over its Critical Threshold value of 90%". What?? Umm, OK...
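
            The mismatch can be reproduced with a toy model. This is a minimal sketch, not Orion's actual schema or alerting code: the Threshold tuple, the type names, and both functions are hypothetical, and the numbers are the ones from the example above (CPU critical at 90%, PercentLoss critical at 50%, current CPULoad at 70%).

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    node: str
    kind: str        # hypothetical threshold type name, e.g. "CPULoad"
    critical: float  # critical level, in percent


# One shared table holding every kind of node threshold.
THRESHOLDS = [
    Threshold("SERVER1", "CPULoad", 90.0),
    Threshold("SERVER1", "PercentLoss", 50.0),
]


def triggers_without_type(node: str, value: float) -> bool:
    """Buggy check: compares the value against ANY critical threshold on the node."""
    return any(t.node == node and value > t.critical for t in THRESHOLDS)


def triggers_with_type(node: str, kind: str, value: float) -> bool:
    """Intended check: only thresholds of the matching type are considered."""
    return any(
        t.node == node and t.kind == kind and value > t.critical
        for t in THRESHOLDS
    )


cpu_load = 70.0
print(triggers_without_type("SERVER1", cpu_load))          # True: 70 > 50 (PercentLoss row)
print(triggers_with_type("SERVER1", "CPULoad", cpu_load))  # False: 70 is not > 90
```

            The only difference between the two checks is the extra comparison on the threshold type, which is the piece the generated query appears to leave out.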

             

            To figure this out, I used the "Show SWQL" option in the Trigger Condition, copied the SWQL into SWQL Studio, and ran the query. Then I added a couple of extra items to the SELECT statement to see what kind of info it was pulling, and saw that the threshold it pulled was the PercentLoss threshold. So a temporary fix would be to manually create the SWQL query and add an extra item to the WHERE clause saying that the Threshold Type must also be CPULoad, or whatever threshold type you are creating an alert for. Hopefully they will fix this in the next release. Sorry I'm not giving exact info or putting any examples in, but I'm on a computer that doesn't have access to an Orion install at the moment.
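
            In place of a real Orion query, the debugging steps and the workaround can be sketched with a generic SQL stand-in. This is not SWQL and does not use Orion's schema: the nodes and thresholds tables, their columns, and the sample rows are all hypothetical, run here against an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (name TEXT, cpu_load REAL);
    CREATE TABLE thresholds (node TEXT, kind TEXT, critical REAL);
    INSERT INTO nodes VALUES ('SERVER1', 70);
    INSERT INTO thresholds VALUES ('SERVER1', 'CPULoad', 90);
    INSERT INTO thresholds VALUES ('SERVER1', 'PercentLoss', 50);
""")

# Stand-in for the generated query, with two extra SELECT columns
# (kind, critical) added to show which threshold row actually matched.
generated = """
    SELECT n.name, n.cpu_load, t.kind, t.critical
    FROM nodes AS n
    JOIN thresholds AS t ON t.node = n.name
    WHERE n.cpu_load > t.critical
"""
matched = conn.execute(generated).fetchall()
print(matched)  # the matching row is the PercentLoss threshold, not CPULoad

# Workaround: the same query with the threshold type pinned in the WHERE clause.
fixed = generated + " AND t.kind = 'CPULoad'"
fixed_rows = conn.execute(fixed).fetchall()
print(fixed_rows)  # no rows: 70 does not exceed the CPU threshold of 90
```

            The real condition would need the actual entity and property names shown by "Show SWQL"; only the shape of the extra WHERE filter carries over.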

             

            Thanks,


            Jordan Hume

            Field Systems Engineer

            Loop1 Systems, Inc.

            Specializing in SolarWinds Training and Consultation

            • Re: Critical Value Reached (Percent Loss) Triggering Falsely
              kasaff

              Just FYI, we should have this corrected in hot fix 1 for NPM 11.5.2.  It's due out in the next couple of weeks. 

              • Re: Critical Value Reached (Percent Loss) Triggering Falsely
                syncopix

                Something got broken in 11.5.2, so alerts that trigger based on a threshold won't work as expected. As you've seen, you'll get alerts when packet loss is 0%, for example. Even if a node is of type "external" (i.e. no polling), you'll sometimes see it trigger an alert with negative packet loss reported.

                 

                For a workaround until the hotfix is released, take a look here: Re: NPM 11.5.2 RC1 is in the customer portal

                • Re: Critical Value Reached (Percent Loss) Triggering Falsely
                  tyoung@proskauer.com

                  Same issue here on 11.5.2: response time thresholds on Warning Value Reached and Critical Value Reached generate hundreds of invalid alarms all day long. I opened case #838923 but closed it because I cannot devote time to troubleshooting a simple monitoring concept. I don't have a fix.