
Help needed setting up alert triggers on CPU & memory utilization

Hi all,

I'm hoping the wise gurus on Thwack can help me with this.

I'm having a massive headache configuring alerts for my servers. I'm trying to configure CPU & memory alerts that send me an email whenever utilization crosses a certain threshold. A special condition that must be fulfilled before alerting: the utilization must stay high for a period of time.

So my general setup would be

Evaluate the trigger condition every:  1min

The actual trigger condition: Node | CPU Load | Is Greater than | 85%

Condition must exist for more than: 2min

Looking at it, I think this should fulfill what I need.
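
In other words, the behaviour I'm expecting from these settings could be sketched like this (an illustrative Python model with made-up names, not how SolarWinds actually evaluates alerts):

```python
# Illustrative sketch of the behaviour I'm expecting from these settings
# (assumed logic and made-up names, not SolarWinds internals).
CHECK_EVERY_S = 60     # "Evaluate the trigger condition every: 1 min"
THRESHOLD = 85         # "CPU Load is greater than 85%"
MUST_EXIST_S = 120     # "Condition must exist for more than: 2 min"

def should_alert(checks):
    """checks: list of (time_s, cpu_pct) seen at each evaluation.
    Fire only if the condition held at every check covering the last
    MUST_EXIST_S seconds."""
    now = checks[-1][0]
    window = [cpu for t, cpu in checks if t >= now - MUST_EXIST_S]
    covered = len(window) * CHECK_EVERY_S
    return covered > MUST_EXIST_S and all(cpu > THRESHOLD for cpu in window)

# Sustained high CPU across the whole window -> alert:
print(should_alert([(0, 40), (60, 90), (120, 95), (180, 92)]))   # True
# A brief spike that came straight back down -> no alert:
print(should_alert([(0, 40), (60, 95), (120, 40), (180, 40)]))   # False
```

In this model a short spike never fires, because at least one check inside the 2-minute window sees a normal value.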

However, the alert doesn't behave like that at all! This is what frustrates me: I'm working with the support guys and trying to explain it to them, but I doubt they even understand me.

Here's what I'm getting.

  1. I receive an email alert at 1:18 PM stating I have high CPU.
  2. I log on to the SolarWinds console to check the server: it's normal, no high CPU!
  3. I open Perf Analyzer and zoom in to 1 PM-2 PM. The average CPU load chart confirms there is indeed a spike at 1:15 PM.
  4. However, the spike didn't last long: by around 1:17 PM utilization had come down, yet at 1:18 PM I receive an email alert for an event that happened at 1:15 PM!!

What has gone wrong here? I've been explaining my situation to the support guys, and they just don't understand me; they state that this behaviour is normal and that I should increase "Condition must exist for more than" to a higher value. However, in a running production environment, any server with high CPU/memory for longer than 2 minutes, or even 5 minutes, will suffer terrible performance issues.

Dear Thwack community, is there any way I can adjust this to fulfill my requirement? I'm pulling my hair out every time I try to explain it to the support guys.

  • What is the polling frequency set to?
    Did you try checking the raw data you can get via Performance Analyzer? It shows when the value went high and when it came down...

    Also, what is the reset condition set to?
  • I did not change the default polling time, so I guess it's every 120 seconds.

    How do I view the raw data from Performance Analyzer?

    There's no reset action for my alerts currently; given the volume of email being sent out, adding a reset-action email would trigger even more email. I will configure a reset action once everything is stabilized.

  • In Performance Analyzer, you have the option to export... when you do that, it will give you the data in raw format...
  • The raw data also shows the same problem.

    CPU utilization was 100% at 10:29 AM, then at 10:31 AM I received a CPU 100% email alert, which is not accurate because by that time the CPU had already come down. Any idea how I can fix this?

    Average CPU Load (%)   11/2/2020 10:29   100
    Average CPU Load (%)   11/2/2020 10:31   40
  • I have exactly the same issue: the alert is catching only short CPU spikes and still triggering!!! The "Condition must exist" time does not work, please help!

  • Not sure if you've managed to get this sorted. Your issue is that you need to increase the polling frequency for this node (or all nodes) so the polling interval is shorter than the conditional check.

    So if you want to trigger the alert after 2 minutes, you need to poll every 1 minute to have a chance of capturing the data.

    It's also worth considering that each poll captures a single point in time within the polling interval. So let's say you poll the server at 13:00 and the CPU is 40%, and you have your polling interval set to 1 minute. If the CPU rises to 100% up until 13:00:59 and then drops back to 40%, you will never capture the event.

    I have our system tuned down to 90 seconds for everything and it's OK. Turning down the polling interval results in higher load, so check your polling engines under the settings to compare utilisation before and after and see the impact on polling success.
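
To illustrate the point above with a sketch (an assumed model, not SolarWinds internals): the alert engine can only read the most recent sample in the database, so with a 120 s statistics interval a single spiky sample is the only evidence for an entire 2-minute window.

```python
# Assumed model of the alert engine's view (illustrative, not SolarWinds
# internals): it can only read the most recent sample in the database.
POLL_EVERY_S = 120   # default statistics collection interval
THRESHOLD = 85

# Samples actually written to the database; the spike was caught at t=240 s.
stored = [(0, 40), (120, 41), (240, 100), (360, 40)]

def db_value(t):
    """The value the alert engine sees at time t: the latest stored sample."""
    return max((p, v) for p, v in stored if p <= t)[1]

# Every evaluation between the 240 s poll and the 360 s poll sees 100%,
# so one momentary spike appears to "exist" for a full 120 s:
for t in (240, 300, 350):
    print(t, db_value(t))   # all three report 100
print(db_value(360))        # 40 once the next poll lands
```

That is why a "condition must exist for 2 min" check can be satisfied by a single high sample when the collection interval is 2 minutes.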

  • I would also add, for reference, the various time intervals and configurations in place to be aware of. It is a delicate balance to get them right. What you need to focus on is how often item 2 below checks for new data, how often item 3 checks the database for the trigger condition, and how long item 4 delays that condition before triggering an alert. Setting them all appropriately for your environment makes for good alerting.

    Also take into account any possible delay in the SMTP mail process. But note that you're talking about a few minutes. It ultimately depends on your environment, but server resources can spike regularly, and I would be more concerned if those resources exceeded my thresholds for longer than 2 minutes, maybe 5+. That seems extreme to alert on, but that's just my 2 cents.

    Try looking into the enhanced thresholds. That avoids some of the timing issues, and you could alert on CPU/memory specifically crossing a threshold, e.g. 5 consecutive polls > 90%. If statistics collection runs every 1 minute, then that's a sustained 5-minute sample of 90%+ utilization.

    documentation.solarwinds.com/.../core-orion-thresholds-sw1775.htm

    1. Polling interval (this is the generic up/down status etc.)
    2. Statistics collection interval (this is the cpu/mem polling)
    3. Alert check interval (this is how often your alert checks the database for any trigger conditions)
    4. Alert Condition must exist for x time (this is essentially adding delay to the trigger)

    Hope this info helps and good luck!
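
The "consecutive polls" idea behind the enhanced thresholds could be sketched like this (illustrative Python with assumed names; the linked documentation describes the real feature):

```python
# Sketch of the "N consecutive polls over threshold" idea behind enhanced
# thresholds (illustrative logic only, not the SolarWinds implementation).
THRESHOLD = 90
CONSECUTIVE_POLLS = 5   # with 1-minute collection, roughly 5 minutes sustained

def sustained(samples, n=CONSECUTIVE_POLLS, limit=THRESHOLD):
    """True once the last n samples are all above the threshold."""
    return len(samples) >= n and all(v > limit for v in samples[-n:])

print(sustained([40, 95, 40, 92, 93, 94]))   # False: spike, not sustained
print(sustained([40, 91, 92, 95, 93, 97]))   # True: 5 polls in a row > 90
```

With a 1-minute collection interval, requiring 5 consecutive samples over the threshold means a single momentary spike can never fire the alert.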