Level 14

The Ultimate CPU Alert ... for Linux!


Back in Oct 2014 Leon Adato made a beautiful CPU alert (The Ultimate CPU Alert) using SAM to capture the Windows processor queue length.  Using some SQL goodness he showed us how to count the number of CPUs and compare that count against the CPU queue length so that only truly constrained servers were alerting.

Enter Linux. (insert ominous music)

I have the good fortune of following in Leon's footsteps in my current role, so I've inherited a bunch of his innovation to play with as a foundation.  Here's what you are going to need to play along:

1.  NPM - We'll be building a Universal Device Poller and a transform

2.  A Linux server or two to test against (but you wouldn't be doing this if you didn't have Linux servers, right?)

3.  An understanding of Load Average and why it is important.  (Read this if you are wondering why you need to monitor for load average)

Remember, load average alone isn't going to be enough to alert with clarity just as CPU load isn't enough to tell you your server isn't correctly sized for the current load.  We're going to bundle the two of them up to help give some intelligence to this whole little sordid affair.
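For a feel of why load average is always read relative to the CPU count, here is a small Python sketch. It is not part of the alert itself; the function name `load_per_cpu` is made up for illustration, and the "below 1.0 / above 1.0" reading is a rule of thumb rather than a hard limit.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs.

    Rule of thumb: below 1.0 there is spare capacity, above 1.0 work is queuing.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


# The same load average means very different things on different hardware.
print(load_per_cpu(4.0, 2))   # 2.0  (twice as much work as the CPUs can run)
print(load_per_cpu(4.0, 16))  # 0.25 (mostly idle)

if __name__ == "__main__" and hasattr(os, "getloadavg"):
    try:
        # Live 15 minute value from the local machine (Unix-like systems only).
        _, _, load15 = os.getloadavg()
        print(load_per_cpu(load15, os.cpu_count() or 1))
    except OSError:
        print("load average is not available on this system")
```

A 4.0 load average is a problem on a 2-CPU box and a non-event on a 16-CPU box, which is why the alert compares the polled value against the CPU count instead of a fixed number.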

The Universal Device Poller

  1. Log into your NPM primary polling engine (or, if you only have 1 NPM server, the *only* polling engine) and open up Orion Universal Device Poller. (Start > Programs > SolarWinds Orion > Universal Device Poller if you are using Server 2008 or earlier).
  2. Click on New Universal Device Poller
  3. Configure your UnDP as per the screenshot below.  Note the name of loadAverage15MinInt -- this is an integer.  We'll be transforming this result starting in step 4.  The OID is 1.3.6.1.4.1.2021.10.1.5.3 for those of you who want to cut and paste.  Click through the remaining screens as per normal (testing, assignment, etc.)

2014-12-17_15h20_40.png

     4.  Click Transform Result.

     5.  Name your transform loadAverage15Min (unless you want to name it something else, but just remember you'll need to change the name in the alert later in the process!)

     6.  Configure the transform as is shown in the screenshots below.  You can name the group whatever you want -- but it is a good idea to group the UnDP and transform in the same group, at least in my mind.

2014-12-17_15h28_51.png     2014-12-17_15h30_00.png

     You have now created a Universal Device Poller that will query an SNMP-enabled Linux server for the 15 minute load average.  If you want the 1 Minute Load Average (OID: 1.3.6.1.4.1.2021.10.1.5.1) or the 5 Minute Load Average (OID: 1.3.6.1.4.1.2021.10.1.5.2), repeat the steps above with the corresponding OID and give each poller and transform its own name.
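As a rough illustration of what the transform is for: the UCD-SNMP-MIB laLoadInt objects behind these OIDs report the load average multiplied by 100, as an integer. The sketch below assumes the transform in the screenshots (not reproduced here) simply divides by 100 to recover the familiar value; the dictionary and function names are invented for the example.

```python
# OIDs quoted in this post (the laLoadInt column of the UCD-SNMP-MIB load table).
LOAD_AVERAGE_INT_OIDS = {
    "1min": "1.3.6.1.4.1.2021.10.1.5.1",
    "5min": "1.3.6.1.4.1.2021.10.1.5.2",
    "15min": "1.3.6.1.4.1.2021.10.1.5.3",
}


def transform_load_average(raw_int: int) -> float:
    """Convert a polled laLoadInt value (load average x 100) to a load average."""
    return raw_int / 100


# A polled integer of 425 corresponds to a load average of 4.25.
print(transform_load_average(425))  # 4.25
```

Without the division, the alert would be comparing something like 425 against a CPU count of 4 and would fire on nearly every server.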

The Alert

You've built yourself some Universal Device Pollers (15 minute and, if you are a keener, 1 minute and 5 minute) and the associated transforms.  Now you are going to build an alert.  This is going to be a custom SQL alert, so remember that leaving the reset condition as "when no longer true" isn't going to fly. You're going to need to build a reset trigger in SQL as well.  The query is a little complex (and I did steal the hardest part from Leon's Ultimate CPU Alert post), but once you understand what it is up to it will all make sense.

     1. Open up Advanced Alert Manager on your Orion Server.  (If you are reading this after NPM 12 has been released some time in 2015, remember the olden days of non-web accessible alerts!?!)

     2. Name your alert and set your evaluation frequency according to your standards.  (You do have standards for that sort of thing, right?)

     3. On the Trigger Condition tab select Custom SQL Query in the 'Type of Property to Monitor' drop down and then Custom Node Poller in the 'Set up your Trigger Query' drop down.

2014-12-17_15h44_43.png

     4. Selecting these values will pre-populate SELECT statements in the gray box.  This is a good thing as there are a whack of table joins in there to make all of this goodness work.  Copy and paste the code below in the white section of the trigger condition tab.

INNER join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount

from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex

from CPUMultiLoad) c1

group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID

WHERE  

  (  

  (Nodes.n_mute = 0) AND   

  (Nodes.OwnerGroup = 'LINUX') AND   

  (Nodes.Prod_State = 'PROD') AND   

  (   

  (CustomPollers.UniqueName = 'loadAverage15Min') AND

  (CustomPollerStatus.Rate > CPUCount)

  ) AND   

  (Nodes.CPULoad >= 95)

  )

     5. Set the 'Do not trigger until this condition exist for more than' to 15 minutes or some other value that is tied to your statistic collection interval.  In our case, we poll our nodes every 5 minutes.  This means that we have to experience this condition for at least 2 and up to 3 polling intervals for it to trigger.

     6. Click the Reset Condition tab (don't worry, I'll explain the query logic below!) and click the 'Reset this alert when the following conditions are met' radio button.

     7. Paste the following query text into the white space and set the 'Do not reset until this condition exist for more than' to 0 seconds (or a value that makes you happy)

INNER join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount

from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex

from CPUMultiLoad) c1

group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID

WHERE

(

  (Nodes.n_mute = 0) AND

  (Nodes.OwnerGroup = 'LINUX') AND

  (Nodes.Prod_State = 'PROD') AND

  (

   (CustomPollers.UniqueName = 'loadAverage15Min') AND

   ((CustomPollerStatus.Rate <= CPUCount) OR

   (Nodes.CPULoad < 95))

  )

)

     8.  Finish configuring your alert as per your standards for things like Time of Day, Trigger Actions, Reset Actions (you use them both, right?) and click OK.

What does it all mean, you ask?

Let's look at the trigger query.

Using Leon's code for the INNER JOIN we're going to ask NPM to select and count the number of CPUs on a given node by using the CPUMultiLoad table.

INNER join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount

from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex

from CPUMultiLoad) c1

group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID
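To see what that subquery returns, here is a small runnable sketch that uses Python's built-in sqlite3 module as a stand-in for the Orion SQL Server database. The toy CPUMultiLoad table, its AvgLoad column, and the sample rows are assumptions for illustration; the real table has more columns. The sketch also assumes the table keeps more than one sample per CPU, which is why the DISTINCT is there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for Orion's CPUMultiLoad table: one row per CPU per poll.
conn.execute(
    "CREATE TABLE CPUMultiLoad (NodeID INTEGER, CPUIndex INTEGER, AvgLoad INTEGER)"
)
conn.executemany(
    "INSERT INTO CPUMultiLoad VALUES (?, ?, ?)",
    [
        (1, 0, 90), (1, 1, 95),                          # node 1, first poll, 2 CPUs
        (1, 0, 85), (1, 1, 99),                          # node 1, second poll, same 2 CPUs
        (2, 0, 10), (2, 1, 12), (2, 2, 8), (2, 3, 15),   # node 2, 4 CPUs
    ],
)

# Same shape as the INNER JOIN subquery: de-duplicate (NodeID, CPUIndex)
# pairs first, then count the CPUs per node.
cpu_counts = conn.execute(
    """
    SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
    FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
          FROM CPUMultiLoad) c1
    GROUP BY c1.NodeID
    ORDER BY c1.NodeID
    """
).fetchall()

print(cpu_counts)  # [(1, 2), (2, 4)]
```

Without the DISTINCT, node 1 would be counted as having 4 CPUs because each CPU appears once per poll.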

Armed with a CPU count we're going to do some checking to limit our scope of influence.  In our environment we're using custom properties to limit the scope of alerts (and you should too - it sure beats letting a SQL query check against every node in your environment for CPULoad when only your Linux servers will have the UnDP assigned!).

WHERE  

  (  

  (Nodes.n_mute = 0) AND    

  (Nodes.OwnerGroup = 'LINUX') AND    

  (Nodes.Prod_State = 'PROD') AND   

  (   

  (CustomPollers.UniqueName = 'loadAverage15Min') AND

  (CustomPollerStatus.Rate > CPUCount)

  ) AND   

  (Nodes.CPULoad >= 95)

  )

Next, we ensure that the stat from the UnDP is available AND the Custom Poller value is greater than the CPU count AND CPU load is greater than or equal to 95%.  All 3 conditions have to be true.  (Remember, if you named your transform something different, this is the place to change it!)

WHERE  

  (  

  (Nodes.n_mute = 0) AND   

  (Nodes.OwnerGroup = 'LINUX') AND   

  (Nodes.Prod_State = 'PROD') AND   

  (   

  (CustomPollers.UniqueName = 'loadAverage15Min') AND

  (CustomPollerStatus.Rate > CPUCount)

  ) AND    

  (Nodes.CPULoad >= 95)

  )
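If the nested parentheses are hard to follow, the same decision can be sketched in Python. The `PolledNode` class and `should_trigger` function are hypothetical; the fields simply mirror the columns referenced in the WHERE clause, including our custom properties (n_mute, OwnerGroup, Prod_State), which you would swap for your own.

```python
from dataclasses import dataclass


@dataclass
class PolledNode:
    # Field names mirror the columns in the WHERE clause; the class is hypothetical.
    name: str
    n_mute: int
    owner_group: str
    prod_state: str
    load_average_15min: float  # CustomPollerStatus.Rate for loadAverage15Min
    cpu_count: int             # CPUCount from the INNER JOIN
    cpu_load: int              # Nodes.CPULoad, in percent


def should_trigger(node: PolledNode) -> bool:
    """Scope filters first, then both load conditions must hold."""
    in_scope = (node.n_mute == 0
                and node.owner_group == "LINUX"
                and node.prod_state == "PROD")
    constrained = (node.load_average_15min > node.cpu_count
                   and node.cpu_load >= 95)
    return in_scope and constrained


nodes = [
    PolledNode("web01", 0, "LINUX", "PROD", 6.1, 4, 98),  # in scope and constrained
    PolledNode("web02", 0, "LINUX", "PROD", 6.1, 4, 60),  # work queuing, CPU not pegged
    PolledNode("db01", 1, "LINUX", "PROD", 9.0, 4, 99),   # muted
    PolledNode("dev01", 0, "LINUX", "DEV", 9.0, 4, 99),   # not production
]
print([n.name for n in nodes if should_trigger(n)])  # ['web01']
```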

The reset query looks almost exactly the same but the reset conditions are slightly different (of course!)  For the reset we want to check for the UnDP AND either the load average is less than or equal to the CPU count OR CPU load is less than 95%.  Your tolerance for CPU load will be different than our threshold, so feel free to adjust accordingly.  The key is that the reset will happen if either the CPU load is less than 95% OR the load average is less than or equal to the CPU count for all nodes that are polled for the loadAverage15Min.

INNER join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount

from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex

from CPUMultiLoad) c1

group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID

WHERE

(

  (Nodes.n_mute = 0) AND

  (Nodes.OwnerGroup = 'LINUX') AND

  (Nodes.Prod_State = 'PROD') AND

  (

   (CustomPollers.UniqueName = 'loadAverage15Min') AND

   ((CustomPollerStatus.Rate <= CPUCount) OR

   (Nodes.CPULoad < 95))

  )

)
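One thing worth noticing: the load and CPU parts of the two queries are exact logical opposites (De Morgan's law), so an in-scope node always satisfies exactly one of them. The short Python check below illustrates that for a node with 4 CPUs; the function names and sample values are made up for the example, and the custom property filters are left out because they are identical in both queries.

```python
from itertools import product


def trigger(rate: float, cpu_count: int, cpu_load: int) -> bool:
    """Load/CPU part of the trigger query."""
    return rate > cpu_count and cpu_load >= 95


def reset(rate: float, cpu_count: int, cpu_load: int) -> bool:
    """Load/CPU part of the reset query."""
    return rate <= cpu_count or cpu_load < 95


# Try values on both sides of, and exactly at, each boundary.
rates = [0.0, 3.99, 4.0, 4.01, 12.5]
cpu_loads = [0, 94, 95, 96, 100]
mismatches = [
    (r, c) for r, c in product(rates, cpu_loads)
    if reset(r, 4, c) == trigger(r, 4, c)
]
print(mismatches)  # []
```

If you change a threshold or an operator in the trigger, make the matching change in the reset, or you can end up with alerts that never clear (or that clear while the condition is still true).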

That's it!  Empowered with a UnDP and this alert logic you are ready to ditch the out-of-the-box CPU alerts and start giving your Linux (and with Leon's Ultimate CPU Alert, Windows) support teams really refined alerts, giving them more time for important things (like Thwack Monthly contests!)

Feel free to comment, criticize or otherwise critique.

Message was edited by: Joshua Biggley - fixed grammar and added a link to the custom SQL reset condition discussion on Thwack.

1 Solution
Level 17

How can I click "like" about a million more times? This is AWESOMESAUCE!!!

Well done! Bravo! Cheers! Yasher Koach! Salut!

Leon Adato | Head Geek
------
"Measure what is measurable,
and make measurable what is not so." - Galileo


16 Replies
Level 17

I revisited this post so that we could include it in an upcoming SolarWinds lab episode. I realized the trigger query needed some tweaking. Here is what works for me for an alert trigger:

The top area will show:

SELECT CustomPollerAssignmentView.CustomPollerAssignmentID AS NetObjectID,

CustomPollerAssignmentView.AssignmentName AS Name

FROM CustomPollerAssignmentView

So paste the following below the line:

INNER JOIN CustomPollerStatus ON CustomPollerAssignmentView.CustomPollerAssignmentID = CustomPollerStatus.CustomPollerAssignmentID

INNER JOIN Nodes ON CustomPollerAssignmentView.NodeID = Nodes.NodeID

LEFT OUTER JOIN CustomPollers ON CustomPollerAssignmentView.CustomPollerID = CustomPollers.CustomPollerID

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) as CPUCount

            FROM

            (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex FROM CPUMultiLoad) c1

            GROUP BY c1.NodeID) c2 on Nodes.NodeID = c2.NodeID

WHERE

CustomPollers.UniqueName = 'loadaverage15min'

AND Nodes.CPULoad >= 95

AND CustomPollerStatus.Rate >= CPUCount

Leon Adato | Head Geek
------
"Measure what is measurable,
and make measurable what is not so." - Galileo

Level 7

Even after getting the above SQL to validate, when I tried to test the alert by changing the last two lines of query to:

AND Nodes.CPULoad >= 0

AND CustomPollerStatus.Rate >= 0

the Summary tab of the test always showed that 0 nodes would fire the alert. I opened a ticket with SW Support and after a great suggestion from Jennifer I came up with the following:

ForApproval.png

I tested this alert with our company Linux guru (thanks Bill) and it worked as expected. But, as I am relatively new to SolarWinds administration, I would like to submit my alert to the greater THWACK community for review to see if there is anything I may have missed in the translation from SQL.

Any feedback is welcome and appreciated.

0 Kudos
Level 7

When implementing this alert, validation of the query kept failing. After comparing it to the Windows counterpart I changed the last line of the query to:

AND CustomPollerStatus.Rate >= c2.CPUCount

Validation passed after this change.
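For anyone who wants to see the assembled query (with the qualified c2.CPUCount) run end to end, here is a sketch using Python's sqlite3 module in place of the Orion SQL Server database. Everything about the schema is a heavily simplified assumption: the table layouts, IDs, and sample rows are invented, and the real CustomPollerAssignmentView is a view rather than a table. The poller name is spelled as it was created in the post because SQLite compares strings case-sensitively by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Heavily simplified, hypothetical stand-ins for the Orion tables and view.
CREATE TABLE Nodes (NodeID INTEGER, Caption TEXT, CPULoad INTEGER);
CREATE TABLE CustomPollers (CustomPollerID TEXT, UniqueName TEXT);
CREATE TABLE CustomPollerAssignmentView (
    CustomPollerAssignmentID TEXT, AssignmentName TEXT,
    NodeID INTEGER, CustomPollerID TEXT);
CREATE TABLE CustomPollerStatus (CustomPollerAssignmentID TEXT, Rate REAL);
CREATE TABLE CPUMultiLoad (NodeID INTEGER, CPUIndex INTEGER);

INSERT INTO Nodes VALUES (1, 'web01', 98), (2, 'web02', 97);
INSERT INTO CustomPollers VALUES ('p15', 'loadAverage15Min');
INSERT INTO CustomPollerAssignmentView VALUES
    ('a1', 'loadAverage15Min - web01', 1, 'p15'),
    ('a2', 'loadAverage15Min - web02', 2, 'p15');
INSERT INTO CustomPollerStatus VALUES ('a1', 5.5), ('a2', 1.2);
INSERT INTO CPUMultiLoad VALUES (1, 0), (1, 1), (2, 0), (2, 1);
""")

rows = conn.execute("""
SELECT CustomPollerAssignmentView.CustomPollerAssignmentID AS NetObjectID,
       CustomPollerAssignmentView.AssignmentName AS Name
FROM CustomPollerAssignmentView
INNER JOIN CustomPollerStatus
    ON CustomPollerAssignmentView.CustomPollerAssignmentID
       = CustomPollerStatus.CustomPollerAssignmentID
INNER JOIN Nodes ON CustomPollerAssignmentView.NodeID = Nodes.NodeID
LEFT OUTER JOIN CustomPollers
    ON CustomPollerAssignmentView.CustomPollerID = CustomPollers.CustomPollerID
INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE CustomPollers.UniqueName = 'loadAverage15Min'
  AND Nodes.CPULoad >= 95
  AND CustomPollerStatus.Rate >= c2.CPUCount
""").fetchall()

# web01: load 5.5 on 2 CPUs at 98% -> returned. web02: load 1.2 -> filtered out.
print(rows)  # [('a1', 'loadAverage15Min - web01')]
```

Both toy nodes are above 95% CPU, but only the one whose load average is at or above its CPU count comes back, which is the behaviour the alert is after.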




Level 17

Nice catch!

Leon Adato | Head Geek
------
"Measure what is measurable,
and make measurable what is not so." - Galileo

0 Kudos

Guys, let me clarify one thing - so, what we are doing here is getting the average over 15 minutes and then comparing it to threshold. Is this correct? Whereas in Leon's solution we do not calculate average, but instead - getting last CPU reading together with number of queued threads to work out contingencies. Correct?

0 Kudos
Level 14

Correct!  We use the 15 minute load average in our environment as that is what our Linux team wants to see whereas the Windows team was OK with the last processor queue length value.  I should also mention that we collect stats every 5 minutes and both alerts have a trigger delay built into them.  The delay was defined by the team and could be as short as 2 polling cycles (10 minutes) or as long as 30 minutes (for a critical alert.)

0 Kudos

Ok, I got this, thanks. Follow up question if I may - if you were to have SolarWinds collect stats every minute and then fire an alert based on the average calculated from the SolarWinds stats data - would that be a workable solution?

0 Kudos
Level 17

Collecting the 5-minute-average every minute would be aggressive and probably not as helpful as you might like. Yes, the 5 minute average will change (since it's a rolling 5 minutes - actually, it's the average of the 1-minute load average stat), but not a ton (unless you see a huge spike in load in the last minute).

Better in your situation would be to collect the 1 minute load average (every minute), and then have your alert trigger if that value (along with the other variables) are over threshold for xx minutes.

The "problem" with that plan is that you are collecting stats every minute. I try to avoid doing that for all but the most critical of systems, because it's such a drain on poller resources. I've also found that collecting stats so often introduces a level of sensitivity (in terms of alerts) that support teams are unhappy to experience (meaning: it triggers too often, and by the time support gets to the system the problem has disappeared).

It comes down to what I affectionately call "the prozac moment" - that point when the MANAGER of the system realizes it's not as rock-solid steady as they imagined it. That the system actually does have frequent spikes (and valleys). They didn't notice it before because the metric collection wasn't that granular. But now that they have the data, it takes time to come to grips with it. The first urge is to UN-ALERT ALL THE THINGS!!

After a while they realize that all systems behave this way, and they are willing to ratchet down the polling cycles and/or extend the trigger timing so that only the actionable issues come through.
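The "over threshold for xx minutes" idea above can be sketched in a few lines of Python. This is only an illustration of the logic, not how the Orion alert engine is implemented; the function name `sustained_breach` and the sample values are made up.

```python
def sustained_breach(samples: list[float], threshold: float, required: int) -> bool:
    """Return True if the most recent `required` samples are all above threshold."""
    if required < 1 or len(samples) < required:
        return False
    return all(value > threshold for value in samples[-required:])


cpu_count = 4
# One-minute load averages, polled once a minute (oldest first).
spike = [1.2, 1.0, 9.5, 1.1, 1.3, 1.0]
sustained = [1.2, 5.1, 6.0, 5.8, 6.4, 7.2]

print(sustained_breach(spike, cpu_count, 5))      # False: a one-minute blip
print(sustained_breach(sustained, cpu_count, 5))  # True: five minutes over threshold
```

Lengthening `required` plays the same role as extending the trigger delay: fewer alerts, but only for problems that are still there when someone goes to look.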

Leon Adato | Head Geek
------
"Measure what is measurable,
and make measurable what is not so." - Galileo

0 Kudos
Level 11

With the newer NPM versions there is an even better way. By selecting Manage Pollers from your settings, you can create a CPU and Memory poller that will replace the current SolarWinds CPU and Memory poller if the MIB is found when you List Resources for the node.

You can make the poller here, define the data sources (OIDs), and even do the math right here:

ss1.PNG

Once done, anytime you scan for nodes or list resources for a node that has these MIBs, it will give you the option below:

ss2.PNG

This way you don't have to worry even about custom pollers and this is picked up by default CPU and Memory alerts.

Level 14

Looks like this is limited to only CPU & Memory and Multi-CPU pollers currently and there are some limitations on how the data is presented.  I expect NPM 12 will have all sorts of magical goodness for pollers, amongst other things.

0 Kudos
Level 11

It's a bit of a pain to get it working, yeah, but you have to give it a little love and it will do what ya need it to. We have it assigned to all of our Unix nodes and it works perfectly.

0 Kudos
Level 14

Holy smokes, how did I not know this already?  To quote Leon, "Awesomesauce!"

Level 13

Hey Joshua Biggley ...

Leon Adato stole "awesomesauce" from me!   I've been saying that for years...LONG before Discover started using it! LOL

Level 14

I knew we three were kindred spirits -- awesomesauce is my go-to exclamation too!  You, Leon Adato and I would be dangerous together.

download (1).jpg

0 Kudos
Level 11

I shall have a lot of use for this thank you!
