3 Replies Latest reply on Jul 10, 2015 4:07 PM by daleharrisonips

    Seeing a huge increase in Response Time Alerts since upgrading to NPM 11.5

    cfwalker8

      We upgraded from NPM 11.0.1 to NPM 11.5 two weeks ago. Since the upgrade we have seen a huge increase in Response Time Alerts for our remote sites. No changes have been made to the network, and when we call the remote sites that show in alert, they are not seeing any issues. Has anyone else seen this? Could it be that we are experiencing latency with the poller?

        • Re: Seeing a huge increase in Response Time Alerts since upgrading to NPM 11.5
          daleharrisonips

          I too just upgraded from 11.0.1 to 11.5.2 yesterday.  Today I'm seeing thousands of high response time alerts that weren't firing before.

          In my case, the node's Response Time Threshold - Critical Value Reached condition keeps evaluating as true even though the measured response time is well below the threshold.

          It looks like the alert matches the Reset condition on the next poll, but the initial Trigger keeps firing even though the node doesn't seem to meet the criteria.

          It was also firing against all my unmanaged nodes, which prompted me to add an extra qualifier to the trigger condition.

           

          Trigger Condition:

          The actual trigger condition:

          All child conditions must be satisfied (AND)

          Node Custom Properties - AlertableNode - is equal to - 1

          Node - Status - is not equal to - Unmanaged

          Response Time Threshold - Critical Value Reached - is equal to - 1

           

          All the other alerts seem to be fine, so I figure it has to be related to some change in either the way the database tables now store/represent the latency thresholds, or the new web-based alerting module that interacts with the latency data, causing the condition to match erroneously.  I'm going to keep looking for other users' experiences with these alert problems as well.

           

          Cheers,

          Dale.

              • Re: Seeing a huge increase in Response Time Alerts since upgrading to NPM 11.5
                daleharrisonips

                I did eventually submit a ticket for this (#832270) and I have received a response from them with this:

                 

                     I have checked this with our team and it appeared to be a regression bug regarding filtering of the Threshold fragments. Currently even the Orion.CpuLoadThreshold contain the records of all the thresholds, so basically the alert is triggered, when random threshold is reached for node.
                     As current workaround there's needed to specify the appropriate Threshold name in the alert condition of the affected alerts
                     Node > Threshold Name (Response Time) is equal to Nodes.Stats.ResponseTime


                So basically, because my condition only checked whether any critical value had been reached, without naming the threshold, it was almost always evaluating as true.
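
To make the support explanation concrete, here is a rough Python model of the behaviour they describe. It is only a sketch, not SolarWinds code: the record layout, the field names, and the two `Hypothetical.*` threshold names are made up for illustration. Only `Nodes.Stats.ResponseTime` comes from the support reply.

```python
# Illustrative model only; this is not how Orion stores or evaluates thresholds.
# Each dict stands for one threshold record attached to the same node.
records = [
    {"name": "Nodes.Stats.ResponseTime", "critical_reached": False},
    {"name": "Hypothetical.CpuLoad", "critical_reached": True},      # made-up name
    {"name": "Hypothetical.MemoryUsed", "critical_reached": False},  # made-up name
]


def critical_reached_unfiltered(rows):
    """Model of the regression: every threshold record is considered."""
    return any(r["critical_reached"] for r in rows)


def critical_reached_for(rows, threshold_name):
    """Model of the workaround: only the named threshold is considered."""
    return any(
        r["critical_reached"] for r in rows if r["name"] == threshold_name
    )


print(critical_reached_unfiltered(records))
print(critical_reached_for(records, "Nodes.Stats.ResponseTime"))
```

In this model the unfiltered check is true as soon as any threshold on the node is critical, which would explain alerts firing while the measured response time stays below its own threshold.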

                So now my trigger looks like this to make it work for me:


                Trigger Condition:

                The actual trigger condition:

                All child conditions must be satisfied (AND)

                  Node Custom Properties - AlertableNode - is equal to - 1

                  Node - Status - is not equal to - Unmanaged

                All child conditions must be satisfied (AND)

                Response Time Threshold - Threshold Name - is equal to - Nodes.Stats.ResponseTime

                Response Time Threshold - Critical Value Reached - is equal to - 1
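
For anyone who finds the nested AND blocks hard to read, the trigger above can be paraphrased as a small Python function. This is a sketch under assumptions: the dictionary keys (`AlertableNode`, `Status`, `Name`, `CriticalReached`) are hypothetical stand-ins for the fields shown in the alert editor, and the function only mirrors the logic, not the alerting engine itself.

```python
# Paraphrase of the trigger condition; key names are hypothetical stand-ins.
RESPONSE_TIME = "Nodes.Stats.ResponseTime"


def should_trigger(node, thresholds):
    """Return True when both AND blocks of the trigger are satisfied."""
    # First AND block: the node is flagged as alertable and is not unmanaged.
    node_ok = node["AlertableNode"] == 1 and node["Status"] != "Unmanaged"

    # Second AND block: both tests must hold on the SAME threshold record,
    # so a critical value on some other threshold does not count.
    threshold_hit = any(
        t["Name"] == RESPONSE_TIME and t["CriticalReached"] == 1
        for t in thresholds
    )
    return node_ok and threshold_hit


node = {"AlertableNode": 1, "Status": "Up"}
other_threshold_only = [
    {"Name": RESPONSE_TIME, "CriticalReached": 0},
    {"Name": "Hypothetical.Other", "CriticalReached": 1},  # made-up name
]
print(should_trigger(node, other_threshold_only))
```

The point of the sketch is that the name check and the critical check sit in the same block, so they are tested together rather than independently.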


                I enabled this about a day ago and so far it hasn't triggered any erroneous alerts.  So it looks like the condition has to specify which threshold it means before checking whether the critical value was reached.


                So far so good.


                Cheers,

                Dale.