The Ultimate CPU Alert ... for Linux!

Question

Back in Oct 2014 Leon Adato made a beautiful CPU alert (The Ultimate CPU Alert) using SAM to capture the Windows processor queue length. Using some SQL goodness he showed us how to count up the number of CPUs and compare that count with the CPU queue length so that only truly constrained servers were alerting.

Enter Linux. (insert ominous music)

I have the good fortune of following in Leon's footsteps in my current role, so I've inherited a bunch of his innovation to play with as a foundation. Here's what you are going to need to play along:

1. NPM - We'll be building a Universal Device Poller and a transform

2. A Linux server or two to test against (but you wouldn't be doing this if you didn't have Linux servers, right?)

3. An understanding of Load Average and why it is important. (Read this if you are wondering why you need to monitor for load average)

Remember, load average alone isn't going to be enough to alert with clarity just as CPU load isn't enough to tell you your server isn't correctly sized for the current load. We're going to bundle the two of them up to help give some intelligence to this whole little sordid affair.
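To get a feel for the two numbers before building anything in NPM, here is a minimal Python sketch that reads the load averages on a Linux host and compares the 15 minute value with the CPU count. The function name load_per_cpu is made up for illustration, and this is only a local sanity check, not part of the NPM setup.

```python
import os


def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average divided by the CPU count.

    A result above 1.0 means more runnable work than CPUs to run it on.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1
    ratio = load_per_cpu(fifteen, cpus)
    print(f"15 min load {fifteen:.2f} on {cpus} CPUs -> {ratio:.2f} per CPU")
```

A ratio above 1.0 on its own only says work is queuing. The alert later in this post pairs it with CPU load for exactly that reason.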

The Universal Device Poller

1. Log into your NPM primary polling engine (or, if you only have 1 NPM server, the *only* polling engine) and open up Orion Universal Device Poller. (Start > Programs > SolarWinds Orion > Universal Device Poller if you are using Server 2008 or earlier).

2. Click on New Universal Device Poller.

3. Configure your UnDP as per the screenshot below. Note the name of loadAverage15MinInt -- this is an integer. We'll be transforming this result starting in step 4. The OID is 1.3.6.1.4.1.2021.10.1.5.3 for those of you who want to cut and paste. Click through the remaining screens as per normal (testing, assignment, etc.).

4. Click Transform Result.

5. Name your transform loadAverage15Min (unless you want to name it something else, but just remember you'll need to change the name in the alert later in the process!)

6. Configure the transform as is shown in the screenshots below. You can name the group whatever you want -- but it is a good idea to group the UnDP and transform in the same group, at least in my mind.

You have now created a Universal Device Poller that will query an SNMP-enabled Linux server for the 15 minute load average. If you want the 1 Minute Load Average (OID: 1.3.6.1.4.1.2021.10.1.5.1) or the 5 Minute Load Average (OID: 1.3.6.1.4.1.2021.10.1.5.2), repeat the same steps with those OIDs and give each poller and transform its own name.
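The screenshots don't come through here, so the exact transform formula is an assumption on my part: the UCD-SNMP-MIB laLoadInt values at these OIDs are the load average multiplied by 100, so the transform presumably divides the raw integer by 100. A rough Python sketch of that arithmetic (the names LOAD_INT_OIDS and transform_load_int are invented for illustration):

```python
# OIDs from the post: laLoadInt for the 1, 5 and 15 minute load averages.
LOAD_INT_OIDS = {
    "1min": "1.3.6.1.4.1.2021.10.1.5.1",
    "5min": "1.3.6.1.4.1.2021.10.1.5.2",
    "15min": "1.3.6.1.4.1.2021.10.1.5.3",
}


def transform_load_int(raw_value: int) -> float:
    """Assumed transform: laLoadInt is load average x 100, so divide by 100."""
    return raw_value / 100


# A polled integer of 412 would correspond to a load average of 4.12.
print(transform_load_int(412))
```

If your transform screen shows a different formula, trust the screen over this sketch.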

The Alert

You've built yourself some Universal Device Pollers (15 minute and, if you are a keener, 1 minute and 5 minute) and the associated transforms. Now you are going to build an alert. This is going to be a custom SQL alert, so remember that leaving the reset condition as "when no longer true" isn't going to fly. You're going to need to build a reset trigger in SQL as well. The query is a little complex (and I did steal the hardest part from Leon's Ultimate CPU Alert post), but once you understand what it is up to it will all make sense.

1. Open up Advanced Alert Manager on your Orion Server. (If you are reading this after NPM 12 has been released some time in 2015, remember the olden days of non-web accessible alerts!?!)

2. Name your alert and set your evaluation frequency according to your standards. (You do have standards for that sort of thing, right?)

3. On the Trigger Condition tab select Custom SQL Query in the 'Type of Property to Monitor' drop down and then Custom Node Poller in the 'Set up your Trigger Query' drop down.

4. Selecting these values will pre-populate SELECT statements in the gray box. This is a good thing as there are a whack of table joins in there to make all of this goodness work. Copy and paste the code below in the white section of the trigger condition tab.

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE
(
  (Nodes.n_mute = 0) AND
  (Nodes.OwnerGroup = 'LINUX') AND
  (Nodes.Prod_State = 'PROD') AND
  (
    (CustomPollers.UniqueName = 'loadAverage15Min') AND
    (CustomPollerStatus.Rate > CPUCount)
  ) AND
  (Nodes.CPULoad >= 95)
)

5. Set the 'Do not trigger until this condition exists for more than' value to 15 minutes or some other value that is tied to your statistic collection interval. In our case, we poll our nodes every 5 minutes. This means that we have to experience this condition for at least 2 and up to 3 polling intervals for it to trigger.

6. Click the Reset Condition tab (don't worry, I'll explain the query logic below!) and click the 'Reset this alert when the following conditions are met' radio button.

7. Paste the following query text into the white space and set the 'Do not reset until this condition exists for more than' value to 0 seconds (or a value that makes you happy).

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE
(
  (Nodes.n_mute = 0) AND
  (Nodes.OwnerGroup = 'LINUX') AND
  (Nodes.Prod_State = 'PROD') AND
  (
    (CustomPollers.UniqueName = 'loadAverage15Min') AND
    ((CustomPollerStatus.Rate <= CPUCount) OR
     (Nodes.CPULoad < 95))
  )
)

8. Finish configuring your alert as per your standards for things like Time of Day, Trigger Actions, Reset Actions (you use them both, right?) and click OK.

What does it all mean, you ask?

Let's look at the trigger query.

Using Leon's code for the INNER JOIN we're going to ask NPM to select and count the number of CPUs on a given node by using the CPUMultiLoad table.

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
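If you want to see what that subquery does without touching your Orion database, here is a small sketch that runs the same SELECT against a toy table in an in-memory SQLite database. The rows and the SampleLoad column are invented for illustration; only NodeID and CPUIndex come from the query above, and SQLite is standing in for SQL Server here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy stand-in for CPUMultiLoad: one row per CPU per poll (invented data).
conn.execute(
    "CREATE TABLE CPUMultiLoad (NodeID INTEGER, CPUIndex INTEGER, SampleLoad INTEGER)"
)
conn.executemany(
    "INSERT INTO CPUMultiLoad VALUES (?, ?, ?)",
    [
        (1, 0, 90), (1, 1, 95),                          # node 1, first poll
        (1, 0, 97), (1, 1, 99),                          # node 1, second poll
        (2, 0, 10), (2, 1, 12), (2, 2, 8), (2, 3, 11),   # node 2, four CPUs
    ],
)

# Same shape as the alert's subquery: de-duplicate (NodeID, CPUIndex) pairs
# across polls, then count the distinct CPUs per node.
rows = conn.execute(
    """
    SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
    FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
          FROM CPUMultiLoad) c1
    GROUP BY c1.NodeID
    """
).fetchall()

cpu_counts = dict(rows)
print(sorted(cpu_counts.items()))  # [(1, 2), (2, 4)]
```

The DISTINCT is what keeps repeated polls of the same CPU from inflating the count.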

Armed with a CPU count we're going to do some checking to limit our scope of influence. In our environment we're using custom properties to limit the scope of alerts (and you should too - it sure beats letting a SQL query check against every node in your environment for CPULoad when only your Linux servers will have the UnDP assigned!).

WHERE
(
  (Nodes.n_mute = 0) AND
  (Nodes.OwnerGroup = 'LINUX') AND
  (Nodes.Prod_State = 'PROD') AND
  (
    (CustomPollers.UniqueName = 'loadAverage15Min') AND
    (CustomPollerStatus.Rate > CPUCount)
  ) AND
  (Nodes.CPULoad >= 95)
)

Next, we ensure that the stat from the UnDP is available AND the Custom Poller value is greater than the CPU count AND CPU load is at least 95%. All 3 conditions have to be true. (Remember, if you named your transform something different, this is the place to change it!)

WHERE
(
  (Nodes.n_mute = 0) AND
  (Nodes.OwnerGroup = 'LINUX') AND
  (Nodes.Prod_State = 'PROD') AND
  (
    (CustomPollers.UniqueName = 'loadAverage15Min') AND
    (CustomPollerStatus.Rate > CPUCount)
  ) AND
  (Nodes.CPULoad >= 95)
)
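Stripped of the table names, the trigger boils down to a two-part test per node. Here is a rough Python sketch of that logic, assuming the node already passed the custom property filters and has the loadAverage15Min transform assigned. The function name and the sample values are made up for illustration.

```python
def should_trigger(load_avg_15: float, cpu_count: int, cpu_load_pct: float,
                   cpu_threshold: float = 95.0) -> bool:
    """Mirror of the trigger WHERE clause for a single node.

    Fires only when the 15 minute load average exceeds the CPU count
    AND CPU load is at or above the threshold.
    """
    return load_avg_15 > cpu_count and cpu_load_pct >= cpu_threshold


# Hypothetical 4-CPU server in three different states.
print(should_trigger(6.2, 4, 98))  # True: work is queuing and the CPUs are pegged
print(should_trigger(6.2, 4, 60))  # False: high load average but CPU has headroom
print(should_trigger(3.1, 4, 98))  # False: busy CPUs but nothing is queuing
```

The second and third cases are the ones a plain CPU or plain load average alert would have paged somebody for.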

The reset query looks almost exactly the same, but the reset conditions are slightly different (of course!). For the reset we want to check for the UnDP AND either the load average being less than or equal to the CPUCount OR CPULoad being less than 95%. Your tolerance for CPU load may be different from ours, so feel free to adjust the threshold accordingly. The key is that the reset will happen if either the CPU load is less than 95% OR the load average is no greater than the CPU count, for all nodes that are polled for the loadAverage15Min.

INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE
(
  (Nodes.n_mute = 0) AND
  (Nodes.OwnerGroup = 'LINUX') AND
  (Nodes.Prod_State = 'PROD') AND
  (
    (CustomPollers.UniqueName = 'loadAverage15Min') AND
    ((CustomPollerStatus.Rate <= CPUCount) OR
     (Nodes.CPULoad < 95))
  )
)
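One way to convince yourself the reset is right: for a node that has the poller, the reset condition is simply the logical opposite of the trigger condition. A quick Python sketch that checks this over a handful of made-up values (function names and sample numbers are illustrative only, and it assumes both values are actually present for the node):

```python
from itertools import product


def trigger(load_avg: float, cpu_count: int, cpu_load: float,
            threshold: float = 95.0) -> bool:
    """Trigger: load average above CPU count AND CPU load at/above threshold."""
    return load_avg > cpu_count and cpu_load >= threshold


def reset(load_avg: float, cpu_count: int, cpu_load: float,
          threshold: float = 95.0) -> bool:
    """Reset: load average at/below CPU count OR CPU load below threshold."""
    return load_avg <= cpu_count or cpu_load < threshold


# Every combination of a few load averages and CPU loads on a 4-CPU node.
samples = list(product([2.0, 4.0, 6.5], [4], [50, 95, 99]))

# The reset fires exactly when the trigger does not.
is_complement = all(reset(*s) == (not trigger(*s)) for s in samples)
print(is_complement)  # True
```

If you change the threshold or the comparison in one query, make the matching change in the other or the alert can get stuck.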

That's it! Empowered with a UnDP and this alert logic, you are ready to ditch the out-of-the-box CPU alerts and start giving your Linux (and, with Leon's Ultimate CPU Alert, Windows) support teams really refined alerts, giving them more time for important things (like Thwack Monthly contests!).

Feel free to comment, criticize or otherwise critique.

Message was edited by: Joshua Biggley - fixed grammar and added a link to the custom SQL reset condition discussion on Thwack.

adatole · Accepted Answer

How can I click "like" about a million more times? This is AWESOMESAUCE!!! Well done! Bravo! Cheers! Yasher Koach! Salut!
