3 Replies Latest reply on Dec 12, 2017 5:48 PM by mesverrum

    Evaluation Frequency of Alert - Explanation

    llemieux

      I am having a little trouble grasping the "Evaluation Frequency of Alert" option on alerts and the effect it has on when alerts get triggered. Would someone be kind enough to give me their explanation of this option? The NPM admin guide unfortunately gives a very broad definition for it.

       

      Thanks!

        • Re: Evaluation Frequency of Alert - Explanation
          mesverrum

          So the evaluation frequency is just how often the tool will query the database to see if anything matches the conditions in your alert.  As an example, every 60 seconds it will check to see "Are any of the nodes down?" and trigger alerts for all of them that match the criteria at once.
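
          In other words, each evaluation pass runs the alert's condition as a query against the latest polled data and fires for everything that matches.  Very roughly, the behavior is like the loop below (a conceptual sketch in Python, not the product's actual code; the node names and the "Down" condition are made-up stand-ins):

import time

EVALUATION_FREQUENCY = 60   # seconds; the "Evaluation Frequency of Alert" setting

# Hypothetical stand-in for the latest polled values sitting in the database
nodes = {"core-sw-01": "Up", "edge-rtr-02": "Down", "wan-rtr-03": "Down"}

while True:
    # One evaluation pass: find every object that currently matches the condition
    down = [name for name, status in nodes.items() if status == "Down"]
    for name in down:
        print(f"trigger alert actions for {name}")   # every match fires in the same pass
    time.sleep(EVALUATION_FREQUENCY)                 # wait for the next evaluation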

           

          There is no intrinsic relationship to the polling cycles of the actual stats involved in the alert, so you may want to consider those polling intervals when setting these up yourself.  For example, if you have a stat such as topology data being updated every 30 minutes, then checking the db every minute to evaluate a rule against it is probably putting a bit more load on your db with those queries than is absolutely necessary.  The job scheduler rolls through its list of objects to poll, so in theory you could get new data at any minute, but you want to factor the size of your environment, the number of alerts, and the amount of sql db resources you have available into the decision making process.

           

          In a small environment with resources to spare, leaving most alerts at the default 60 second frequency is generally fine; if you have performance problems with your Orion server, then it is just one more straw getting thrown onto the pile that you might want to tune down.  I generally avoid setting any alerts more frequent than once a minute because, at the end of the day, a difference of a few seconds in my reaction time is not going to make much of a difference, but your situation may vary.
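
          To put rough numbers on that trade-off (hypothetical figures, just to show the scale of the decision):

# One alert watching topology data that only refreshes every 30 minutes:
print(30 * 60 // 60)    # 30 condition queries per new data point at a 60 second frequency
print(30 * 60 // 300)   #  6 queries per data point if relaxed to 5 minutes

# The same math scales with the number of alert definitions in the environment:
alerts = 200                    # hypothetical count
print(alerts * 3600 // 60)      # 12000 condition queries per hour at 60 seconds
print(alerts * 3600 // 300)     # 2400 per hour at 5 minutes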

            • Re: Evaluation Frequency of Alert - Explanation
              gundamunit1

              What is the difference between the Evaluation Frequency of Alert and "Condition must exist for more than x minutes" in the Trigger Condition section?

                • Re: Evaluation Frequency of Alert - Explanation
                  mesverrum

                  So let's take an example CPU load type of alert.  The relevant pieces of info to understand are that the status is going to be polled every 10 minutes, I set this alert with an evaluation frequency of 5 minutes, and it has to be over 90% for more than 30 minutes to trigger my alert, because I don't want to chase short term spikes in load.  The job scheduler rolls through all its tasks on a big to-do list, so the exact time our events happen is, for our purposes, random, but each event always repeats every 'x' minutes.

                   

                  So,

                   

                  00:00:00 poll the node, CPU load is 50%, everything is normal

                  00:03:00 my alert gets evaluated, based on current data cpuload is showing 50, no matches for the alert conditions

                  00:08:00 alert evaluated, cpuload is still showing 50 and has not been updated

                  00:10:00 poll the node, CPU load is 95%, seems bad, node cpuload charts will turn red but nothing else happens.

                  00:13:00 alert evaluated, cpuload is showing 95, checks trigger condition for the timer and it hasn't been bad for >30 mins, no alert triggered.

                  00:18:00 alert evaluated, cpuload is showing 95, checks trigger condition for the timer and it hasn't been bad for >30 mins, no alert triggered.

                  00:20:00 poll the node, CPU load is 85%

                  00:23:00 alert evaluated, cpuload is showing 85, no matches for alert conditions

                  00:28:00 alert evaluated, cpuload is still showing 85

                  00:30:00 poll the node, CPU load is 99%

                  00:33:00 alert evaluated, cpuload is showing 99, checks trigger condition for the timer and it hasn't been bad for >30 mins, no alert triggered.

                  00:38:00 alert evaluated, cpuload is showing 99, checks trigger condition for the timer and it hasn't been bad for >30 mins, no alert triggered.

                  00:40:00 poll the node, CPU load is 100%

                  00:43:00 alert evaluated, cpuload is showing 100, checks trigger condition for the timer and it hasn't been bad for >30 mins, no alert triggered.

                  00:48:00 alert evaluated, cpuload is showing 100, checks trigger condition for the timer and it hasn't been bad for >30 mins, no alert triggered.

                  00:50:00 poll the node, CPU load is 91%

                  00:53:00 alert evaluated, cpuload is showing 91, checks trigger condition for the timer, but because the 85% reading at 00:20 reset the timer the condition has only been bad for 23 minutes uninterrupted (counting from the 00:30 poll), not yet >30 mins, no alert triggered.

                  00:58:00 alert evaluated, cpuload is still showing 91, the condition has been bad for 28 minutes, still not >30 mins, no alert triggered.

                  01:00:00 poll the node, CPU load is 93%

                  01:03:00 alert evaluated, cpuload is showing 93, checks trigger condition for the timer and it has now been bad for 33 minutes uninterrupted, matches the alert conditions and triggers the actions.
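
                  If it helps to see that timer bookkeeping spelled out, here is a small sketch that replays the timeline above (my own illustration in Python, nothing from the product; the poll values are the hypothetical ones used above, and the timer is counted from the poll that first showed the condition true, the same way I counted it in the walkthrough):

# Hypothetical polled CPU values (minute -> CPU %) from the walkthrough above
polls = {0: 50, 10: 95, 20: 85, 30: 99, 40: 100, 50: 91, 60: 93}

def simulate(polls, eval_every, first_eval, last_eval, threshold=90, sustain=30):
    """Replay alert evaluations against pre-recorded poll data.

    threshold -- trigger condition: CPU load > threshold
    sustain   -- 'condition must exist for more than X minutes'
    """
    def latest_poll(minute):
        t = max(p for p in polls if p <= minute)    # newest value the engine can see
        return t, polls[t]

    bad_since = None                                # poll time the condition became true
    for minute in range(first_eval, last_eval + 1, eval_every):
        poll_time, cpu = latest_poll(minute)
        stamp = f"{minute // 60:02d}:{minute % 60:02d}"
        if cpu <= threshold:
            bad_since = None                        # a good reading resets the timer
            print(f"{stamp}  cpu={cpu}%  below threshold, no match")
            continue
        if bad_since is None:
            bad_since = poll_time                   # count from the poll that showed it bad
        duration = minute - bad_since
        if duration > sustain:
            print(f"{stamp}  cpu={cpu}%  bad for {duration} min -> TRIGGER alert actions")
            break
        print(f"{stamp}  cpu={cpu}%  bad for {duration} min -> not yet over {sustain} min")

# Evaluations at 00:03, 00:08, ... every 5 minutes; the first trigger lands at 01:03
simulate(polls, eval_every=5, first_eval=3, last_eval=63)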

                   

                   

                  Let's demonstrate an extreme case of a badly put together alert: this time cpu is only being polled once an hour, the alert frequency is 20 minutes, and the condition says it has to be over 90% for more than 30 minutes.

                  00:00:00 poll the node, CPU load is 50%, everything is normal; we have no idea what cpuload is doing between now and the next poll an hour from now

                  00:20:00 my alert gets evaluated, based on current data cpuload is showing 50, no matches for the alert conditions

                  00:40:00 alert evaluated, cpuload is still showing 50 and has not been updated

                  01:00:00 poll the node, CPU load is 95%, seems bad, node cpuload charts will turn red but nothing else happens.

                  01:20:00 alert evaluated, cpuload is showing 95, checks trigger condition for the timer and it hasn't been bad for >30 mins, no alert triggered.

                  01:40:00 alert evaluated, cpuload is still showing 95 from the poll we took 40 minutes ago; we have no new data to indicate whether that was a one time spike or a trend, but as far as the engine can tell the condition has been bad for more than 30 minutes, so we are going to trigger the alert and take the actions.

                   

                  In the background of this series of events the cpuload probably did all kinds of things, but we only had one data point where it was high, and we sat on it for 40 minutes before we finally triggered an action on it.  By the time the admin team logs in, everything is fine and they think the monitoring tool is giving them annoying false positives again.
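
                  Reusing the simulate() sketch from above with this second set of numbers shows the same thing: with hourly polls there are only two data points in the whole window, and the one high reading at 01:00 is enough to fire the alert 40 minutes later (the sketch also evaluates at 01:00, which I skipped in the walkthrough, but the trigger still lands at 01:40):

# Hourly polling, 20 minute evaluation frequency, same >90% for >30 minutes condition
simulate({0: 50, 60: 95}, eval_every=20, first_eval=20, last_eval=100)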