The Ultimate CPU Alert ... for Linux!

Back in Oct 2014 Leon Adato built a beautiful CPU alert (The Ultimate CPU Alert) using SAM to capture the Windows processor queue length.  Using some SQL goodness he showed us how to count the number of CPUs and compare that count with the processor queue length so that only truly constrained servers were alerting.

Enter Linux. (insert ominous music)

I have the good fortune of following in Leon's footsteps in my current role, so I've inherited a bunch of his innovation to play with as a foundation.  Here's what you are going to need to play along:

1.  NPM - We'll be building a Universal Device Poller and a transform

2.  A Linux server or two to test against (but you wouldn't be doing this if you didn't have Linux servers, right?)

3.  An understanding of Load Average and why it is important.  (Read this if you are wondering why you need to monitor for load average)

Remember, load average alone isn't going to be enough to alert with clarity just as CPU load isn't enough to tell you your server isn't correctly sized for the current load.  We're going to bundle the two of them up to help give some intelligence to this whole little sordid affair.
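If you want to see what "load average versus CPU count" looks like on a box before building anything in NPM, here is a minimal Python sketch. The helper names load_per_cpu and describe are mine, and the 1.0-per-CPU cut-off is the usual rule of thumb rather than anything NPM enforces.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Normalise a load average by the number of CPUs on the host."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def describe(load_average, cpu_count):
    """Summarise whether runnable work exceeds the available CPUs."""
    ratio = load_per_cpu(load_average, cpu_count)
    state = "work is queuing" if ratio > 1.0 else "CPUs are keeping up"
    return f"load {load_average:.2f} on {cpu_count} CPUs ({ratio:.2f} per CPU): {state}"


# os.getloadavg() only exists on Unix-like systems, so guard the live reading.
if hasattr(os, "getloadavg"):
    try:
        _, _, fifteen_minute = os.getloadavg()
        print(describe(fifteen_minute, os.cpu_count() or 1))
    except OSError:
        print("load average is not available on this host")
```

A load of 9 on an 8 CPU server means work is waiting; the same load on a 16 CPU server does not, which is why the raw number alone is a poor alert condition.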

The Universal Device Poller

  1. Log into your NPM primary polling engine (or, if you only have 1 NPM server, the *only* polling engine) and open up Orion Universal Device Poller. (Start > Programs > SolarWinds Orion > Universal Device Poller if you are using Server 2008 or earlier).
  2. Click on New Universal Device Poller
  3. Configure your UnDP as per the screenshot below.  Note the name of loadAverage15MinInt -- this is an integer.  We'll be transforming this result starting in step 4.  The OID is 1.3.6.1.4.1.2021.10.1.5.3 for those of you who want to cut and paste.  Click through the remaining screens as per normal (testing, assignment, etc.)

2014-12-17_15h20_40.png

  4. Click Transform Result.

  5. Name your transform loadAverage15Min (unless you want to name it something else, but just remember you'll need to change the name in the alert later in the process!)

  6. Configure the transform as shown in the screenshots below.  You can name the group whatever you want -- but it is a good idea to keep the UnDP and transform in the same group, at least in my mind.

2014-12-17_15h28_51.png     2014-12-17_15h30_00.png

You have now created a Universal Device Poller that will query an SNMP-enabled Linux server for the 15 minute load average.  If you also want the 1 minute load average (OID: 1.3.6.1.4.1.2021.10.1.5.1) or the 5 minute load average (OID: 1.3.6.1.4.1.2021.10.1.5.2), repeat the steps above with that OID and a matching name.
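The transform screenshots are not reproduced here, so as a rough guide: these OIDs are the laLoadInt values from UCD-SNMP-MIB, which report the load average multiplied by 100 as an integer. The sketch below assumes the transform simply divides the polled integer by 100; the function and dictionary names are illustrative only.

```python
# OIDs quoted in the post (UCD-SNMP-MIB laLoadInt.1, .2 and .3).
LOAD_AVERAGE_OIDS = {
    "1min": "1.3.6.1.4.1.2021.10.1.5.1",
    "5min": "1.3.6.1.4.1.2021.10.1.5.2",
    "15min": "1.3.6.1.4.1.2021.10.1.5.3",
}


def la_load_int_to_load_average(la_load_int):
    """Convert a polled laLoadInt integer back into a load average.

    Assumption: the transform is a plain division by 100, mirroring how
    the agent builds the integer in the first place.
    """
    return la_load_int / 100.0


# A polled value of 425 corresponds to a load average of 4.25.
print(la_load_int_to_load_average(425))
```

If your transform does something different, the comparison against CPU count in the alert will be off by that factor, so it is worth checking one node by hand.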

The Alert

You've built yourself some Universal Device Pollers (15 minute and, if you are a keener, 1 minute and 5 minute) and the associated transforms.  Now you are going to build an alert.  This is going to be a custom SQL alert, so remember that leaving the reset condition as "when no longer true" isn't going to fly. You're going to need to build a reset trigger in SQL as well.  The query is a little complex (and I did steal the hardest part from Leon's Ultimate CPU Alert post), but once you understand what it is up to it will all make sense.

     1. Open up Advanced Alert Manager on your Orion Server.  (If you are reading this after NPM 12 has been released some time in 2015, remember the olden days of non-web accessible alerts!?!)

     2. Name your alert and set your evaluation frequency according to your standards.  (You do have standards for that sort of thing, right?)

     3. On the Trigger Condition tab select Custom SQL Query in the 'Type of Property to Monitor' drop down and then Custom Node Poller in the 'Set up your Trigger Query' drop down.

2014-12-17_15h44_43.png

     4. Selecting these values will pre-populate SELECT statements in the gray box.  This is a good thing as there are a whack of table joins in there to make all of this goodness work.  Copy and paste the code below into the white section of the trigger condition tab.

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE
  (
    (Nodes.n_mute = 0) AND
    (Nodes.OwnerGroup = 'LINUX') AND
    (Nodes.Prod_State = 'PROD') AND
    (
      (CustomPollers.UniqueName = 'loadAverage15Min') AND
      (CustomPollerStatus.Rate > CPUCount)
    ) AND
    (Nodes.CPULoad >= 95)
  )

     5. Set the 'Do not trigger until this condition exist for more than' to 15 minutes or some other value that is tied to your statistic collection interval.  In our case, we poll our nodes every 5 minutes.  This means that we have to experience this condition for at least 2 and up to 3 polling intervals for it to trigger.

     6. Click the Reset Condition tab (don't worry, I'll explain the query logic below!) and click the 'Reset this alert when the following conditions are met' radio button.

     7. Paste the following query text into the white space and set the 'Do not reset until this condition exist for more than' to 0 seconds (or a value that makes you happy)

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE
  (
    (Nodes.n_mute = 0) AND
    (Nodes.OwnerGroup = 'LINUX') AND
    (Nodes.Prod_State = 'PROD') AND
    (
      (CustomPollers.UniqueName = 'loadAverage15Min') AND
      ((CustomPollerStatus.Rate <= CPUCount) OR
       (Nodes.CPULoad < 95))
    )
  )

     8.  Finish configuring your alert as per your standards for things like Time of Day, Trigger Actions, Reset Actions (you use them both, right?) and click OK.
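To get a feel for what the trigger delay in step 5 does with 5 minute polling, here is a simplified Python model. It is only a sketch of "the condition must hold continuously for the delay period", not a description of how the Orion alert engine evaluates internally, and first_trigger_time is a made-up name.

```python
def first_trigger_time(samples, delay_minutes):
    """Return the minute at which a sustained condition would fire.

    samples is an iterable of (minute, condition_met) pairs in time order.
    The condition has to be true on every poll for at least delay_minutes
    before the alert fires; any false poll restarts the clock.
    """
    started = None
    for minute, condition_met in samples:
        if not condition_met:
            started = None          # condition cleared, start over
            continue
        if started is None:
            started = minute        # first poll of a new run
        if minute - started >= delay_minutes:
            return minute
    return None


# 5 minute polls: a short spike clears at minute 10, a sustained problem starts at 15.
polls = [(0, True), (5, True), (10, False), (15, True), (20, True), (25, True), (30, True)]
print(first_trigger_time(polls, delay_minutes=15))
```

The short spike at the start never fires, which is the point of tying the delay to your polling interval.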

What does it all mean, you ask?

Let's look at the trigger query.

Using Leon's code for the INNER JOIN we're going to ask NPM to select and count the number of CPUs on a given node by using the CPUMultiLoad table.

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
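If you want to convince yourself that the subquery really counts CPUs, you can run it against a toy table. The sketch below uses Python's built-in sqlite3 module with a cut-down, made-up CPUMultiLoad table (the real Orion table has more columns, and AvgLoad here is just filler); only the subquery itself is taken from the alert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the Orion CPUMultiLoad table.
conn.execute(
    "CREATE TABLE CPUMultiLoad (NodeID INTEGER, CPUIndex INTEGER, AvgLoad INTEGER)"
)
conn.executemany(
    "INSERT INTO CPUMultiLoad VALUES (?, ?, ?)",
    [
        (1, 0, 40), (1, 1, 55),                          # node 1, first poll
        (1, 0, 42), (1, 1, 60),                          # node 1, second poll
        (2, 0, 90), (2, 1, 91), (2, 2, 88), (2, 3, 95),  # node 2, one poll
    ],
)

# The inner DISTINCT collapses repeated polls to one row per (node, CPU);
# the outer COUNT then yields one CPU count per node.
CPU_COUNT_SQL = """
SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
      FROM CPUMultiLoad) c1
GROUP BY c1.NodeID
ORDER BY c1.NodeID
"""
cpu_counts = dict(conn.execute(CPU_COUNT_SQL).fetchall())
print(cpu_counts)
```

Without the DISTINCT, node 1 would be counted as having four CPUs because each CPU shows up once per poll.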

Armed with a CPU count we're going to do some checking to limit our scope of influence.  In our environment we're using custom properties to limit the scope of alerts (and you should too - it sure beats letting a SQL query check against every node in your environment for CPULoad when only your Linux servers will have the UnDP assigned!).

WHERE
  (
    (Nodes.n_mute = 0) AND
    (Nodes.OwnerGroup = 'LINUX') AND
    (Nodes.Prod_State = 'PROD') AND
    (
      (CustomPollers.UniqueName = 'loadAverage15Min') AND
      (CustomPollerStatus.Rate > CPUCount)
    ) AND
    (Nodes.CPULoad >= 95)
  )

Next, we ensure that the stat from the UnDP is available AND the Custom Poller value is greater than the CPU count AND CPU load is at least 95%.  All 3 conditions have to be true.  (Remember, if you named your transform something different, this is the place to change it!)

WHERE
  (
    (Nodes.n_mute = 0) AND
    (Nodes.OwnerGroup = 'LINUX') AND
    (Nodes.Prod_State = 'PROD') AND
    (
      (CustomPollers.UniqueName = 'loadAverage15Min') AND
      (CustomPollerStatus.Rate > CPUCount)
    ) AND
    (Nodes.CPULoad >= 95)
  )

The reset query looks almost exactly the same but the reset conditions are slightly different (of course!)  For the reset we want to check for the UnDP AND either the load average is less than or equal to the CPU count OR CPU load is less than 95%.  Your tolerance for CPU load may differ from ours; feel free to adjust accordingly.  The key is that the reset will happen if either the CPU load is less than 95% OR the load average is no more than the CPU count for any node that is polled for loadAverage15Min.

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE
  (
    (Nodes.n_mute = 0) AND
    (Nodes.OwnerGroup = 'LINUX') AND
    (Nodes.Prod_State = 'PROD') AND
    (
      (CustomPollers.UniqueName = 'loadAverage15Min') AND
      ((CustomPollerStatus.Rate <= CPUCount) OR
       (Nodes.CPULoad < 95))
    )
  )
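A quick way to sanity check the pair of queries is to confirm that the reset condition is the exact opposite of the trigger condition, so a node is always in one state or the other. Here is a small Python sketch of just the load and CPU comparisons (the custom property filters are left out, and the function names are mine):

```python
from itertools import product

CPU_LOAD_THRESHOLD = 95  # percent, as used in both queries


def should_trigger(load_average, cpu_count, cpu_load):
    """Trigger: load average above CPU count AND CPU load at or above threshold."""
    return load_average > cpu_count and cpu_load >= CPU_LOAD_THRESHOLD


def should_reset(load_average, cpu_count, cpu_load):
    """Reset: load average at or below CPU count OR CPU load below threshold."""
    return load_average <= cpu_count or cpu_load < CPU_LOAD_THRESHOLD


# Walk a grid of boundary values: exactly one of the two must hold every time.
for load, cpus, cpu in product([0.5, 4.0, 4.01, 12.0], [1, 4, 8], [0, 94, 95, 100]):
    assert should_trigger(load, cpus, cpu) != should_reset(load, cpus, cpu)

print(should_trigger(9.2, 8, 97), should_reset(9.2, 8, 97))
```

If you change the threshold or the comparison in one query, change it in the other too, otherwise you can end up with a node that neither triggers nor resets.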

That's it!  Empowered with a UnDP and this alert logic you are ready to ditch the out of the box CPU alerts and start giving your Linux (and with Leon's Ultimate CPU Alert, Windows) support teams really refined alerts, giving them more time for important things (like Thwack Monthly contests!).

Feel free to comment, criticize or otherwise critique.

Message was edited by: Joshua Biggley - fixed grammar and added a link to the custom SQL reset condition discussion on Thwack.

  • Guys, let me clarify one thing: what we are doing here is getting the average over 15 minutes and then comparing it to a threshold. Is this correct? Whereas in Leon's solution we do not calculate an average, but instead get the last CPU reading together with the number of queued threads to work out contention. Correct?

  • Correct!  We use the 15 minute load average in our environment as that is what our Linux team wants to see whereas the Windows team was OK with the last processor queue length value.  I should also mention that we collect stats every 5 minutes and both alerts have a trigger delay built into them.  The delay was defined by the team and could be as short as 2 polling cycles (10 minutes) or as long as 30 minutes (for a critical alert.)

  • Ok, I got this, thanks. Follow up question if I may: if you were to have SolarWinds collect stats every minute and then fire an alert based on the average calculated from the SolarWinds stats data, would that be a working solution?

  • Collecting the 5 minute average every minute would be aggressive and probably not as helpful as you might like. Yes, the 5 minute average will change between polls (it's a rolling figure; on Linux it is an exponentially damped moving average rather than a plain mean), but not a ton (unless you see a huge spike in load in the last minute).

    Better in your situation would be to collect the 1 minute load average (every minute), and then have your alert trigger if that value (along with the other variables) is over threshold for xx minutes.

    The "problem" with that plan is that you are collecting stats every minute. I try to avoid doing that for all but the most critical of systems, because it's such a drain on poller resources. I've also found that collecting stats so often introduces a level of sensitivity (in terms of alerts) that support teams are unhappy to experience (meaning: it triggers too often, and by the time support gets to the system the problem has disappeared).

    It comes down to what I affectionately call "the prozac moment" - that point when the MANAGER of the system realizes it's not as rock-solid steady as they imagined it. That the system actually does have frequent spikes (and valleys). They didn't notice it before because the metric collection wasn't that granular. But now that they have the data, it takes time to come to grips with it. The first urge is to UN-ALERT ALL THE THINGS!!

    After a while they realize that all systems behave this way, and they are willing to ratchet down the polling cycles and/or extend the trigger timing so that only the actionable issues come through.
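    To put rough numbers on how little the 5 minute figure moves between one-minute polls, here is a Python sketch. It assumes the Linux approach of updating each load average about every 5 seconds with an exponential decay, and uses floating point rather than the kernel's fixed-point maths, so treat the output as indicative only. The function names are mine.

```python
import math

SAMPLE_SECONDS = 5  # assumed update interval for the load averages


def step(load, runnable, window_minutes):
    """Apply one exponentially damped update towards the runnable count."""
    decay = math.exp(-SAMPLE_SECONDS / (window_minutes * 60))
    return load * decay + runnable * (1 - decay)


def simulate(runnable, seconds, window_minutes, start=0.0):
    """Run the update for the given duration with a constant runnable count."""
    load = start
    for _ in range(seconds // SAMPLE_SECONDS):
        load = step(load, runnable, window_minutes)
    return load


# An idle box suddenly has 8 runnable tasks for one minute.
one_minute = simulate(8, 60, window_minutes=1)
five_minute = simulate(8, 60, window_minutes=5)
print(f"after 60s: 1 min average = {one_minute:.2f}, 5 min average = {five_minute:.2f}")
```

    After a minute of sustained load the 1 minute average has covered most of the distance while the 5 minute average has barely started to climb, which is why polling the 5 minute value every minute buys you very little.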