
The Ultimate CPU Alert

CPU alerts are a yawner. Grab the CPULoad, check it against a threshold (maybe even a per-node custom threshold, as explained here: TIPS & TRICKS: Stop The Madness: How to set alert thresholds per-device), cut the alert, move on, right?

Here's the problem: If you are working with sophisticated Operations or server staff, you probably already know that they hate CPU alerts because they are

  1. always vague
  2. frequently invalid
  3. way too frequent because they are tuned too low OR
  4. never triggered when you need them because they are tuned too high.

At the heart of the issue is the fact that high CPU, by itself, tells you nothing of use. So the CPU is high? So what? If I've got a box that is constantly running hot but it is keeping up with the work, that's called "correctly sized".

What you really need to know about CPU are 3 things:

  1. How many processors are in the box
  2. How many jobs are in the Processor Queue
  3. What's the current CPU load

If you've got more jobs in the queue than you have CPUs and you also have high CPU, then you have the makings of a meaningful, actionable issue.
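The rule above can be sketched as a tiny predicate. This is only an illustration of the logic, not anything SolarWinds ships; the function name, the parameter names, and the default threshold of 90 are assumptions chosen to match the trigger built later in this post.

```python
def is_actionable(cpu_count: int, queue_length: float, cpu_load: float,
                  load_threshold: float = 90.0) -> bool:
    """Return True only when the box is busy AND falling behind.

    cpu_count      -- number of processors in the box
    queue_length   -- current Processor Queue Length
    cpu_load       -- current CPU load in percent
    load_threshold -- hypothetical default; tune it per environment
    """
    queue_is_backed_up = queue_length > cpu_count
    cpu_is_hot = cpu_load > load_threshold
    return queue_is_backed_up and cpu_is_hot


if __name__ == "__main__":
    # Hot CPU and more queued jobs than CPUs: worth waking someone up.
    print(is_actionable(cpu_count=4, queue_length=10, cpu_load=95))
    # Hot CPU but the queue is keeping up: "correctly sized", no alert.
    print(is_actionable(cpu_count=4, queue_length=1, cpu_load=95))
```

High CPU alone or a long queue alone is not enough; both conditions have to hold at the same time.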

Let's add a little icing on the cake: When the condition above occurs, I want to know what the top 10 processes are at that moment, so I can get an idea of the likely culprits.

Interested? Let's get to work!

For this to work, you need NPM and SAM. You will be assigning one Perfmon counter to all your servers, and doing a little bit of SQL voodoo in the alert.

The Perfmon Counter:

In SAM, set up a new template. In it, you want to add a perfmon counter monitor named “Win_Processor_Queue_Len” that points to

  • Counter: “Processor Queue Length”,
  • Instance: (blank)
  • Category: “System”

processor_queue_AM.png

After appropriate testing, adjustments, etc, you will eventually roll this template out to all your Windows systems.

The Alert Trigger

Your alert trigger is going to require some hardcore SQL. So you are setting up a Custom SQL Alert, with “Nodes” as the target table.

Along with the top part of the query that is automatically provided, you will add the following:

INNER JOIN APM_AlertsAndReportsData
   ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
   FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
      FROM CPUMultiLoad) c1
   GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE
   APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
   AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
   AND Nodes.CPULoad > 90

alert_trigger.png

What this is doing is

  1. Pulling the count of CPUs for this node from the CPUMultiLoad table
  2. Pulling the current statistic for the Win_Processor_Queue_Len perfmon counter
  3. Checking that the number of processes in the queue is greater than the number of CPUs
  4. And finally checking that the CPULoad is over 90%

If the conditions in items 3 and 4 are both true, you will get an alert.
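If you want to see the join logic in isolation before pointing it at a production database, here is a minimal sketch that runs the same conditions against an in-memory SQLite database. Everything about the setup is an assumption for illustration: the three tables are stripped-down stand-ins with only the columns the query touches (in Orion the real objects have many more columns), the SampleNo column and the sample data are made up, and SQLite is not the SQL Server engine Orion uses. The SELECT and ORDER BY lines are added only so the fragment can run on its own.

```python
import sqlite3

# Simplified stand-ins for the Orion objects (assumed schema, demo only).
SCHEMA = """
CREATE TABLE Nodes (NodeID INTEGER PRIMARY KEY, CPULoad REAL);
CREATE TABLE CPUMultiLoad (NodeID INTEGER, CPUIndex INTEGER, SampleNo INTEGER);
CREATE TABLE APM_AlertsAndReportsData (NodeId INTEGER, ComponentName TEXT, StatisticData REAL);
"""

# Same joins and WHERE clause as the alert trigger above.
TRIGGER_SQL = """
SELECT Nodes.NodeID
FROM Nodes
INNER JOIN APM_AlertsAndReportsData
   ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
   FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
      FROM CPUMultiLoad) c1
   GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
   AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
   AND Nodes.CPULoad > 90
ORDER BY Nodes.NodeID
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with three made-up nodes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # (NodeID, CPULoad %, number of CPUs, processor queue length)
    demo_nodes = [(1, 95.0, 4, 10.0), (2, 2.0, 8, 2.0), (3, 50.0, 2, 6.0)]
    for node_id, load, cpus, queue in demo_nodes:
        conn.execute("INSERT INTO Nodes VALUES (?, ?)", (node_id, load))
        conn.execute(
            "INSERT INTO APM_AlertsAndReportsData VALUES (?, ?, ?)",
            (node_id, "Win_Processor_Queue_Len", queue),
        )
        # Two samples per CPU index, so a plain COUNT would double the CPU
        # count; the SELECT DISTINCT in the query collapses them again.
        conn.executemany(
            "INSERT INTO CPUMultiLoad VALUES (?, ?, ?)",
            [(node_id, idx, sample) for idx in range(cpus) for sample in range(2)],
        )
    return conn


def triggered_nodes(conn: sqlite3.Connection) -> list:
    """Return the NodeIDs that satisfy the trigger conditions."""
    return [row[0] for row in conn.execute(TRIGGER_SQL)]


if __name__ == "__main__":
    print(triggered_nodes(build_demo_db()))
```

In this made-up data set only node 1 has both a queue longer than its CPU count and a load over 90, so it is the only one returned. Node 3 has a backed-up queue but a moderate load, which is exactly the kind of case the alert is meant to ignore.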

If you stop here, you have a nifty alert that will tell you when something meaningful (and bad) is going on with your server. But let’s kick it up a notch.

Trigger Action

Your alert action is going to have two key steps:

  1. Run the “Solarwinds.APM.RealTimeProcessPoller.exe” utility to get the top 10 processes
  2. After a 60 second delay, send your message

alert_action.png

Get the Processes

The “Solarwinds.APM.RealTimeProcessPoller.exe” comes as part of SolarWinds SAM.

NOTE: If you installed SolarWinds somewhere other than the default location (C:\program files (x86)) then you will need to provide the full path to \SolarWinds\Orion\APM\Solarwinds.APM.RealTimeProcessPoller.exe

Otherwise, your command will look like this:

  SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alert=${AlertDefID} -timeout=60

The only thing you may want to adjust is the -timeout value, if you find alerts are coming back with no process information (i.e., the servers are taking longer to respond).

Send Your Message

At its most basic, your alert message needs to look like this:

CPU on Node ${NodeName} is at ${CPULoad} at ${AlertTriggerTime}.

Top 10 processes at the time of the alert are:

${Notes}

NOTE: The ${Notes} field is populated with the top 10 processes as part of the previous action.

However, if you want to dress it up, you can include more information using more SQL voodoo:

CPU on Node ${NodeName} is at ${CPULoad} at ${AlertTriggerTime}.

There are ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'} items in the process queue and only ${SQL:Select COUNT(c1.CPUIndex) from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad where CPUMultiLoad.nodeid = ${NodeID} ) c1 } CPUs to process them.

Top 10 processes at the time of the alert are:

${Notes}

If there is no list of processes, it's because it took longer than 2 minutes to collect off the server. We felt that delivering the alert fast was more important.

What that big ${SQL… block in the middle does is pull the current Win_Processor_Queue_Len statistic, along with the count of CPUs for this node from the CPUMultiLoad table. The result would read:

There are 10 items in the process queue and only 4 CPUs to process them.

After setting up the message, make sure you go to the “Alert Escalation” tab and set the “Delay the execution of this action” to at least 1 minute.

alert_escalation.png

Summary

So there you have it. A CPU alert that not only tells you when something meaningful and actionable is happening, but also gives you (or your support staff) some initial information to get you started finding and resolving the problem.

As anecdotal proof of how valuable this is, within 24 hours of rolling out this alert at my company, we found 3 different applications which were chronically misbehaving across the enterprise. 2 of them resulted in our being able to prove an issue to the vendor (who didn’t believe us) and get a bug-fix under way.

EDIT 2014-10-31:

As discovered by Joshua Biggley in this post: Custom SQL Alerts - Do reset conditions also need to be custom?, the reset trigger is problematic for this alert (as with all custom SQL alerts). You can't just select "reset when the condition is no longer true". The solution, as elaborated by Richard Letts here: Warning about custom SQL alerts (reset trigger), is to set the reset trigger to:

INNER JOIN APM_AlertsAndReportsData
   ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
   FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
      FROM CPUMultiLoad) c1
   GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE
   (APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len' AND APM_AlertsAndReportsData.StatisticData <= c2.CPUCount)
   OR Nodes.CPULoad <= 90

The key change here is that you want to reset when EITHER the number of queued processes is no longer greater than the number of CPUs, OR the CPU load is at or below the threshold.
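A quick way to convince yourself that the comparisons point the right way (the greater-than / less-than direction is easy to flip) is to check that the reset condition is the exact opposite of the trigger condition. The sketch below does that by brute force over a small grid of values. It models only the two numeric comparisons, not the joins or the ComponentName filter, and the function names and the sample grid are assumptions made for illustration.

```python
from itertools import product

THRESHOLD = 90  # same CPU load threshold as the trigger above


def trigger(queue_length: int, cpu_count: int, cpu_load: int) -> bool:
    """Trigger: queue longer than the CPU count AND load over the threshold."""
    return queue_length > cpu_count and cpu_load > THRESHOLD


def reset(queue_length: int, cpu_count: int, cpu_load: int) -> bool:
    """Reset: queue within the CPU count OR load at or under the threshold."""
    return queue_length <= cpu_count or cpu_load <= THRESHOLD


def reset_is_complement_of_trigger() -> bool:
    """Check that exactly one of trigger/reset holds for every sample point."""
    queue_lengths = range(0, 13)
    cpu_counts = (1, 2, 4, 8)
    cpu_loads = (0, 50, 89, 90, 91, 100)
    return all(
        trigger(q, n, load) != reset(q, n, load)
        for q, n, load in product(queue_lengths, cpu_counts, cpu_loads)
    )


if __name__ == "__main__":
    print(reset_is_complement_of_trigger())
```

This is just De Morgan's law: the negation of "A AND B" is "NOT A OR NOT B", so each ">" in the trigger becomes "<=" in the reset and the AND becomes an OR.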

EDIT 2015-02-23

Hat-Tip to garyuk who caught my greater-than / less-than confusion in the reset logic above. It's fixed now.

Leon Adato | Head Geek
------
"Measure what is measurable,
and make measurable what is not so." - Galileo

104 Replies
Level 7

This is great. I had some trouble with the reset trigger not applying, but after some digging around I came up with this which works well. 

WHERE Nodes.NodeID NOT IN
    (SELECT DISTINCT Nodes.NodeID
     FROM Nodes
     INNER JOIN APM_AlertsAndReportsData ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
     INNER JOIN
       (SELECT c1.NodeID,
               COUNT(c1.CPUIndex) AS CPUCount
        FROM
          (SELECT DISTINCT CPUMultiLoad.NodeID,
                           CPUMultiLoad.CPUIndex
           FROM CPUMultiLoad) c1
        GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
     WHERE APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
       AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
       AND nodes.CPULoad > 90 )

 

Level 7

Excellent article! I've just created this alert to replace our generic "CPU is over X%" alert, which was triggering 100 times a day for no reason, so that's pretty awesome.

Quick thing: I couldn't get RealTimeProcessPoller.exe to work in the email. It worked in the console, so the exe was running, but the process list didn't make it into the email. I spoke to support and a very helpful engineer suggested replacing

${Notes}

with

${N=Alerting;M=Notes}

And that suddenly made it work.


Cheers

Level 9

I believe I have this mostly working as well, however I'm not getting the top 10 processes to populate in my email alert.

pastedImage_0.png


Has anyone been able to figure out the trick here? I cannot get the process list to show up.

I have tried both suggestions for the Notes variables and have not had any luck yet.

 

GNordin_0-1606140939082.png

 


Any chance we can get updated screenshots for the web-only version rather than the old console?

Is this actually built into the product and supported, or is it just a joint community effort?

Level 9

I have everything working except no processes are listed. 

For example, it states there are 6 items in the process queue and only 2 CPUs to process them.

When I run the exe from the command line on the server, it returns the results just fine. I'm wondering how the data gets populated into the Notes field?

Thank you.

Level 7

I am using the following entry in my trigger action, but when I simulate it I get the error FILE NOT FOUND.

SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alert=${AlertDefID}

The executable is located in the default install location (C:\Program Files (x86)).

It looks like I have everything else ready to go, just haven't received any messages. Hard to believe that I can't get one server out of 300+ to trigger.

Where else should I be looking?


That did it for me. I've scoured this forum for days and overlooked that entry. Thank you for pointing it out. The simulation succeeded.

I submit the hypothesis that sometimes we'd be better off receiving a Quality Of Experience (or other Application functionality) alert to eliminate false positive high CPU alerts (or any other system process or hardware alert that could end up causing performance degradation).

The assumption is that high CPU utilization is causing decreased performance for users, degraded completion of tasks by the device because of insufficient CPU or hung processes.

Why not put some weight on the QoE part of the equation?  Suppose you DO have a valid high CPU utilization alert, but that it doesn't result in current or future issues for users.  Is the time spent chasing it down partially wasted?  Sure, you get to learn the processes better, and understand the alert is something that can be ignored, but maybe a User Experience alarm would be more useful when triaging alerts.  Less time spent spinning your wheels on those occasions when high CPU utilization is normal and expected--that's a win.

And when your QoE (or equivalent) alert comes in, tie it directly to the CPU alert if the high utilization is the cause of the QoE problem.  Now you not only have useful information immediately, without digging, but you've also got a quick link to the high CPU utilization smoking gun.

Level 7

I apologize for opening this thread back up and for my total noobieness, but how do you go about adding the Component Monitor?

I went into the SAM settings and clicked on 'Create New Template' in the 'Application Monitor Templates' section.

I name it Win_Processor_Queue_Len, give it a description, then click the 'Add Component Monitors' button below and select 'Browse for Component Monitors'.

I make sure 'Windows Performance Counter Monitor' is selected and click 'Next'.

It asks me to select a target server, so I plug in the IP of one of my servers and click 'Next' (I tried one that is having CPU issues and one that isn't; same effect).

When I get to the 'Select Components' section, I don't see anywhere that I can select "Win_Processor_Queue_Len".

Am I missing something or just simply overlooking it?

Thanks,

Tyler


It sounds as though you're naming the application monitor 'Win_Processor_Queue_Len' instead of the component monitor. What you want to do is create a template, add a performance counter monitor named 'Win_Processor_Queue_Len', then set the counter value to 'Processor Queue Length' as shown below.

2017-06-28 15_13_41-Edit Application Template - Alert Monitor.png


When I go into SAM Settings, the only option I have for creating a new template is "Create a new template" in the Application Monitor Templates section. If I look at the Component Monitor Library, I don't see anything for Win_Processor_Queue_Len.

If I create a new template and click on Browse for Component Monitor, I get a bunch of options; "Windows Performance Counter Monitor" is one of them. I select that and click on Next, but in the next screen I can't locate anything that says Win_Processor_Queue_Len.

The component isn't called "Win_Processor_Queue_Len" by default; you have to add a performance counter monitor and configure it as shown in the screenshot I attached to my previous reply. After that, just tick it and rename it as shown below.

pastedImage_0.png

This remains a classic example of excellent information being shared to the world--very useful data!--for free.

Thwack continues to rock, on and on . . .   In fact, you could call it:

pastedImage_0.png

Level 8

I’ve had partial luck with this alert, but I still have a couple issues I need some assistance getting sorted. 

For starters, we only want to apply this alert to five Windows servers.  I’m not clear on how to achieve this via SQL, so I enabled complex conditions.

The screen shot shows the trigger conditions. The undesired result is the alert always triggers for all five machines instead of just one.  Is there a way to achieve the desired result via SQL or complex conditions?

The second issue has to do with the fact that the alert will trigger with a low CPU load even though I configured it to trigger at greater than 99%.  Additionally, it will trigger when there are more processors than items in the process queue.  I have included an example alert below exhibiting these problems.

Any advice that might help me resolve these two issues will be greatly appreciated!

pastedImage_1.png

ALERT TIME: Friday, June 23, 2017 5:42 AM

CURRENT STATUS: 2 % load

SOLARWINDS URL: https://ISDRWCSWAPP:443/Orion/View.aspx?NetObject=N:1116

There are 2 items in the process queue and only 8 CPUs to process them.

Top 10 processes at the time of the alert are:

Name            Process ID    CPU
mcshield.exe    2388          11.21 %
WmiPrvSE.exe    3268          1.78 %
sqlservr.exe    2052          0.6 %
BESClient.exe   5908          0.25 %
svchost.exe     1020          0.19 %
services.exe    652           0.11 %
lsass.exe       660           0.05 %
WmiPrvSE.exe    6600          0.05 %
System          4             0.03 %
WmiPrvSE.exe    3260          0.03 %


I believe the alert triggers correctly but the email is generated afterwards so it can be out of sync?

This is an issue I had when playing around with this alert, and in the end I dumbed down the email output. It was consistently showing the cumulative CPU usage at about 10% even though the alert threshold was 90%.

You could create and populate a custom parameter for these nodes and alert on this group in the trigger?

AND Nodes.exampleparameter = 'groupname'

If this is no help at all, don't mind me. I'm pretty new to Solarwinds.


Thanks for the feedback.  My initial hunch was that by the time the process list function had completed, the CPU utilization had dropped.  I tried to get around this by only alerting on nodes that had CPU > 99% for 20 minutes, but I got the same results.

I believe there is a flaw in the logic that I am using which causes all five nodes to send alerts even if only one of them meets the alert thresholds.  I am going to try and resolve this by digging into SQL.  SQL is new to me so it might take me a bit before I find the correct query.

I'm going to try the steps outlined in the Ultimate CPU alert Reloaded and run the pre-calculated CPU count.  At the very least it should free some cycles on the database.

If anyone else has any feedback I'm open to ideas.  If I find a solution, I'll post it here.

I have tried the CPU reloaded guide. Obviously it frees up load on the DB server, but the email output issue remains. It's a shame; it would be ideal if the email could snapshot the values that caused the alert to trigger instead of pulling them after the fact!

EDIT: Oh and if you ever figure it out, let me know.
