CPU Monitors

For reporting purposes I've implemented CPU monitors, but I set their max value threshold to 100% so I never get alerts. Now that I've got some historical data on CPU averages, I want to set my thresholds so I get alerts when I go above 10% average. I'm having some trouble figuring out how the Last-Value and Short-Term thresholds work with one another. Would someone please provide a quick synopsis of how the two interact, along with an example of how the accumulated failures per alert play into it? To put it into context, let's say I want to alert if my CPU hits 60%, but only if it's been higher than 40% for the last 5 minutes.


Thanks,

  • Analysis of Test Results

    The CPU Usage Monitor utilizes two separate threshold values when analyzing test results.

    1. Last-Value Threshold - This is the primary threshold rate, which is compared to the value acquired by the Monitor during the most recently performed test.

    • Max CPU Usage (%)
      The maximum CPU load allowed on the system, represented as a percentage (%). You may need to adjust the default value based on the type of processor being monitored; however, we recommend maintaining this value at 90% or above.

    2. Short-Term Threshold - This is the secondary threshold rate. It uses the test results accumulated over the period of time set within the Sample Size field to distinguish a slow creep in CPU load from a momentary spike in processor usage.

    • Max CPU Usage (%)
      The amount of CPU load allowed on the system, represented as a percentage (%). Although the proper settings for this threshold will depend on the type of processor being monitored, we recommend maintaining this threshold value at 40% or above (you may choose to increase it even further, to 70-80%).
    • Sample Size
      The data obtained from the monitoring tests performed during the length of time specified here will be averaged and compared against the most recent test result. By default, the Sample Size value is set to 915 seconds (15 minutes and 15 seconds).

    If the Last-Value Threshold rate is exceeded, the CPU Usage Monitor will re-analyze the data by comparing it against the Short-Term Threshold value specified. During this comparison process, the most recent test results are measured against the data obtained from tests that occurred within the length of time specified in the Sample Size field.

    This dual threshold method prevents false Alerts from occurring each time a spike in CPU consumption is detected. Alerts will only be triggered if processor load does not return to normal within the required period of time, indicating a steady CPU load increase that may affect system performance.
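The two-stage check described above can be sketched roughly as follows. This is only one reading of the documentation, not the product's actual implementation: the names `CpuThresholds` and `should_fail` are made up for illustration, and it assumes that the second stage compares the average of the samples inside the Sample Size window against the Short-Term Threshold.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class CpuThresholds:
    # Hypothetical container; field names are illustrative, not product settings.
    last_value_max: float          # Last-Value Threshold, Max CPU Usage (%)
    short_term_max: float          # Short-Term Threshold, Max CPU Usage (%)
    sample_size_s: float = 915.0   # Sample Size in seconds (default quoted above)


def should_fail(samples: Sequence[Tuple[float, float]], t: CpuThresholds) -> bool:
    """samples holds (timestamp_seconds, cpu_percent) pairs, oldest first."""
    if not samples:
        return False
    now, latest = samples[-1]
    # Stage 1: nothing happens unless the newest reading exceeds the
    # Last-Value Threshold.
    if latest <= t.last_value_max:
        return False
    # Stage 2 (assumed): average the readings taken within the Sample Size
    # window and compare that average against the Short-Term Threshold.
    window = [cpu for ts, cpu in samples if now - ts <= t.sample_size_s]
    average = sum(window) / len(window)
    return average > t.short_term_max


thresholds = CpuThresholds(last_value_max=90.0, short_term_max=40.0)
# A brief spike on an otherwise idle machine: the average stays low.
spike = [(0, 10.0), (60, 10.0), (120, 95.0)]
# A machine that has been busy throughout the window and then crosses 90%.
sustained = [(0, 70.0), (60, 80.0), (120, 95.0)]
print(should_fail(spike, thresholds), should_fail(sustained, thresholds))
```

Under the same assumptions, the scenario in the original question (alert at 60%, but only if usage has been above 40% for 5 minutes) would map to something like `CpuThresholds(last_value_max=60.0, short_term_max=40.0, sample_size_s=300.0)`. Note that this sketch checks the average over the window, not that every individual reading was above 40%.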

  • Thanks for the reply. I was playing around with this for a while and I think I'm missing something still. I set my Last-Value Threshold to 99% and my Short-Term Threshold to 5% for 120 seconds to see if it would cause the monitor to fail. The machine in question runs at 60-80% all the time so I thought I'd get an alert for certain, but I never saw the monitor go into a failed state. If the monitor works as explained I should have seen it fail because it exceeded the Short-Term Threshold even though it never hit the Last-Value Threshold, right?


     Thanks,


    Kendal

  • The Last-Value Threshold is the key. It would need to be set below 60% in order for the alert to trigger, because exceeding it is what tells IPM that it needs to start further analysis. As you have it set now, the CPU would have to spike to 99% or above for 120 seconds.
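To make the point above concrete, here is a small sketch of Kendal's test configuration. The readings are invented, the helper `evaluate` is hypothetical, and it assumes that all readings fall inside the 120-second sample window and that the Short-Term check compares their average against the Short-Term Threshold.

```python
from typing import Sequence


def evaluate(readings: Sequence[float], last_value_max: float, short_term_max: float) -> bool:
    """Return True if the monitor would fail under the assumed two-stage logic."""
    latest = readings[-1]
    # The Last-Value Threshold acts as a gate: if the newest reading does not
    # exceed it, the Short-Term Threshold is never consulted.
    if latest <= last_value_max:
        return False
    average = sum(readings) / len(readings)
    return average > short_term_max


# Made-up readings for a machine that runs at 60-80% all the time.
readings = [62.0, 75.0, 80.0, 68.0]

# Last-Value 99%, Short-Term 5%: the gate never opens, so no failure.
as_configured = evaluate(readings, last_value_max=99.0, short_term_max=5.0)

# Last-Value lowered below the normal load: the gate opens, and the average
# is well above 5%, so the monitor fails.
gate_lowered = evaluate(readings, last_value_max=50.0, short_term_max=5.0)

print(as_configured, gate_lowered)
```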