High CPU Alert Discrepancies

Question

Scenario:

We have a "High CPU Alert" set so that if a particular Production Node stays at High CPU for 5 minutes, it triggers an email, text, etc.

* Evaluation Frequency of the alert is 1 minute
* The condition "CPU Load is equal to 100%" must exist for more than 5 minutes

Observation:

We are seeing these alerts with a time lag between our Cloud Infrastructure and SolarWinds SAM: SAM reports 100% CPU, but on the VM it's fine. We've also seen the opposite, where there was an issue but we didn't get alerted; we happened to be looking at the Node in SAM, noticed the issue, and only then got the Alert.

How can we "fine tune" this process between polling and actual results? I know there's likely always going to be some inherent lag, but we should be getting the Alerts before our customers see any downtime so that we can address the issue, which is usually solved by an IIS Reset because Worker Processes are hogging the CPU.

vinay.by · Answer

1. Some tips on polling frequency and alerting

What is your polling frequency?

I understand you check your trigger condition every 1 min and have a wait period of 5 mins on it, but how many data points are you collecting in this interval?

This might sound a little weird, but I am sure you know polling and alerting are 2 different concepts within SolarWinds.

When you say High CPU Alert for 5 mins, it is not exactly checking for 5 mins. Below is the scenario that you may have encountered.

Scenario 1:

00:00 - SolarWinds polls this device; CPULoad is 100

00:01 - Your alert checks the value in the database. It is 100, but the alert waits because the condition must exist for more than 5 mins

00:02 - Alert will wait

00:03 - Alert will wait

00:04 - Alert will wait

00:05 - SolarWinds polls this device again; CPULoad is 100. The alert also checks the value but keeps waiting, as the 5 min wait is still active

00:06 - An alert is sent from SolarWinds

In this case you are collecting 2 data points, one at 00:00 and another at 00:05, so the chances that CPULoad really is 100% on the VM are high.

Scenario 2:

00:00 - SolarWinds polls this device; CPULoad is 100. Your alert checks the value in the database in the same minute. It is 100, but the alert waits because the condition must exist for more than 5 mins

00:01 - Alert will wait

00:02 - Alert will wait

00:03 - Alert will wait

00:04 - Alert will wait - But note ** In this case the 5 min alert wait is completed **

00:05 - Alert is fired, and SolarWinds polls this device again; CPULoad is now 97%

The problem with this scenario is that you have just 1 data point, so the alert fires on a reading that is already 5 minutes old.
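To make the difference between the two scenarios above concrete, here is a rough Python sketch that counts how many polled readings an alert has seen by the time it fires. It is a simplified model, not how the SolarWinds alerting engine actually works: the function name and parameters are made up for illustration, polls are assumed to land at exact multiples of the polling interval, and a poll that lands at the same moment the alert fires is assumed to be too late (as in Scenario 2).

```python
def samples_backing_alert(poll_interval_s: int, wait_s: int, first_seen_delay_s: int) -> int:
    """Count polled readings seen by the alert before it fires (simplified model).

    Assumptions (hypothetical, for illustration only):
      - a poll at t=0 stores the reading that first satisfies the condition
      - the alert first notices that reading first_seen_delay_s seconds later
      - the alert fires wait_s seconds after it first noticed the condition
      - later polls land at poll_interval_s, 2 * poll_interval_s, ...
      - a poll landing exactly at the firing time is not seen by the alert
    """
    if poll_interval_s <= 0 or wait_s <= 0 or first_seen_delay_s < 0:
        raise ValueError("intervals must be positive and the delay non-negative")

    fire_time_s = first_seen_delay_s + wait_s
    # Number of polls at times t with 0 < t < fire_time_s.
    later_polls = (fire_time_s - 1) // poll_interval_s
    # The initial poll at t=0 always counts as one reading.
    return 1 + later_polls


# Scenario 1: alert first sees the value 1 minute after the poll.
scenario_1 = samples_backing_alert(poll_interval_s=300, wait_s=300, first_seen_delay_s=60)
# Scenario 2: alert sees the value in the same minute as the poll.
scenario_2 = samples_backing_alert(poll_interval_s=300, wait_s=300, first_seen_delay_s=0)

print("Scenario 1 readings:", scenario_1)
print("Scenario 2 readings:", scenario_2)
```

Under these assumptions the same alert definition is backed by either one or two readings, depending only on how the poll and the alert evaluation happen to line up.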

--------------

Baseline (this is completely based on my experience): if you have a 5 min (300 seconds) polling interval and you want to check whether CPULoad is 100 for 5 mins, set the wait period to 6 mins rather than 5 mins. This might reduce false alerts a bit, as you will collect 2 data points.

In the Trigger Condition, change the setting to "Condition must exist for more than 6 mins"; this should help. Alternatively, reduce your polling interval from the default 5 mins to collect more data points.
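The 6 min rule of thumb can be worked out with a small Python sketch. It assumes a simplified model rather than the real SolarWinds engine: the alert fires one wait period after it first sees the condition, a poll only counts if it lands strictly before that moment, the worst case is the alert seeing the first reading immediately, and wait periods are set in whole minutes. The function name is hypothetical.

```python
def min_wait_minutes(poll_interval_s: int, samples_needed: int) -> int:
    """Smallest whole-minute wait that guarantees samples_needed readings.

    Simplified model (an assumption, not SolarWinds behaviour): the first
    reading is polled at t=0 and, in the worst case, seen by the alert at
    once. The alert fires wait seconds later, and the k-th later poll lands
    at k * poll_interval_s and must land strictly before the firing time.
    """
    if poll_interval_s <= 0 or samples_needed < 1:
        raise ValueError("poll interval must be positive and samples_needed >= 1")

    # Time at which the last required reading is polled.
    last_poll_s = (samples_needed - 1) * poll_interval_s
    # The wait must be strictly longer than that, rounded up to whole minutes.
    return last_poll_s // 60 + 1


# 5 min polling, 2 readings wanted: matches the "6 mins rather than 5" baseline.
print(min_wait_minutes(poll_interval_s=300, samples_needed=2))
# 1 min polling, 5 readings wanted: a 5 min wait is enough in this model.
print(min_wait_minutes(poll_interval_s=60, samples_needed=5))
```

This shows the two options side by side: lengthen the wait slightly past the polling interval, or keep the 5 min wait and poll more often.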

--------------------------------------------------------------

2. On "We are seeing these alerts with a lag in time between our Cloud Infrastructure and SolarWinds SAM": if you think it's something related to your infrastructure or to how SolarWinds is set up within your environment, then refer to the link below.

https://support.solarwinds.com/SuccessCenter/s/article/Orion-Platform-Troubleshooting-Database-Performance?language=en_US

Make sure your SolarWinds VMs are healthy, with no performance issues

Make sure you don't have any bandwidth issues

Make sure latency between the pollers and the SolarWinds DB is consistently low

Hope it helps.