A team approached us the other day and asked us to build an alert for an unusual condition. They had hit a scenario where a single CPU on a specific F5 load balancer was railed at 100% while the other 7 CPUs in the appliance were just fine. Because our existing alerts were based on average CPU utilization, they did not trigger. The problem was in the way the F5 handled specific requests: it assigned different pools to different CPUs and did not (or does not) move them across CPUs as long as there are active connections.
When we started to build the alert, we realized that the only metric exposed is average CPU % utilization. That wasn't going to work for us, so we created a custom SWQL alert to do the job.
First, we use a bunch of different custom properties. I've left them in so that you can see what our filtering on node names, custom properties, etc. looks like. If you want to pull them out but keeping the brackets balanced in SWQL scares you (it scares me), just copy and paste the query into Notepad++ and switch the language to SQL. It will help you keep them straight.
Second, Nodes.[CpuLoadThreshold].[Level2Value] confuses some people. This is the critical threshold for CPU utilization on a node. You can set it globally on the Orion Thresholds page (/Orion/NetPerfMon/Admin/NetPerfMonSettings.aspx). If you want to change it for an individual node, simply click Edit Properties on that node, check the Override Orion General Thresholds box, and put in your values. You can also use Dynamic Thresholds, but those require some thinking when designing your alerting. More on that at a later date.
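To see which critical threshold a node will actually be compared against, a quick check in SWQL Studio can help. This is only a sketch that reuses the entity and property names from the alert query. The F5 vendor filter and the CriticalCpuThreshold alias are illustrative, so adjust them to your environment, and drop the comment line if your editor rejects it.

```sql
-- Sketch: list the effective critical CPU threshold for each F5 node
SELECT Nodes.[DisplayName]
     , Nodes.[CpuLoadThreshold].[Level2Value] AS CriticalCpuThreshold
FROM Orion.Nodes AS Nodes
WHERE Nodes.[Vendor] LIKE 'F5%'
ORDER BY Nodes.[DisplayName]
```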
And now, the SWQL query! Remember that the first 3 lines of the query below are auto-populated when you create a custom SWQL alert for a node. If you are using SWQL Studio (part of the Orion SDK installer), copy the entire query.
This query checks for any CPU value in the last 4 minutes, 59 seconds. Why? Because we collect CPU performance metrics every 5 minutes. If we had used <=, a sample taken exactly 300 seconds ago could be matched along with the newest one, and we would risk duplicate entries in that 1-second overlap. (It happens when you have a very large environment.) Modify line 28 below to match your specific CPU polling interval. If you have a mix of polling intervals, you are going to have to create a separate alert for each polling interval.
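As a sketch of that change, assuming a hypothetical 10-minute CPU polling interval (600 seconds), the time filter at the end of the query would become:

```sql
-- Assumed 10-minute polling: keep only samples newer than 600 seconds
AND SECONDDIFF(Nodes.[CPUMultiLoadHistory].[TimeStampUTC], GETUTCDATE()) < 600
```

The strict < is kept for the same reason as in the 5-minute version.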
And remember to pull out the custom properties that you don't need/want/use from lines 6 through 15.
SELECT Nodes.[Uri] , Nodes.[DisplayName]
FROM Orion.Nodes AS Nodes
WHERE
(
  (
    ( Nodes.[CustomProperties].[n_mute] = '0' )
    AND ( Nodes.[Vendor] LIKE 'F5%' )
    AND (
      ( Nodes.[CustomProperties].[OwnerGroup] = 'NETWORK' )
      OR ( Nodes.[CustomProperties].[n_sn_assignment_group] = 'SP-Networking' )
    )
    AND (
      ( Nodes.[CustomProperties].[Prod_State] = 'PROD' )
      OR ( Nodes.[CustomProperties].[n_sn_environment] = 'Prod' )
      OR ( Nodes.[CustomProperties].[n_sn_environment] = 'Production' )
    )
  )
  AND (
    (
      ( Nodes.[CPUMultiLoadHistory].[AvgLoad] >= Nodes.[CustomProperties].[CPU_Crit] )
      AND ( Nodes.[CustomProperties].[CPU_Crit] IS NOT NULL )
    )
    OR (
      ( Nodes.[CustomProperties].[CPU_Crit] IS NULL )
      AND ( Nodes.[CPUMultiLoadHistory].[AvgLoad] >= Nodes.[CpuLoadThreshold].[Level2Value] )
    )
  )
  AND SECONDDIFF(Nodes.[CPUMultiLoadHistory].[TimeStampUTC],GETUTCDATE()) < 300
)
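If none of the custom properties apply to your environment, the query boils down to something like the following. Treat it as a sketch rather than a drop-in replacement: it uses only names that appear in the query above, keeps the F5 vendor filter, relies solely on the node's critical CPU threshold (no CPU_Crit override), and assumes the same 5-minute polling interval.

```sql
-- Trimmed variant (sketch): no custom properties, 5-minute polling assumed
SELECT Nodes.[Uri] , Nodes.[DisplayName]
FROM Orion.Nodes AS Nodes
WHERE
(
  ( Nodes.[Vendor] LIKE 'F5%' )
  -- a CPU sample at or above the node's critical CPU threshold
  AND ( Nodes.[CPUMultiLoadHistory].[AvgLoad] >= Nodes.[CpuLoadThreshold].[Level2Value] )
  -- only samples from the most recent polling cycle
  AND SECONDDIFF(Nodes.[CPUMultiLoadHistory].[TimeStampUTC],GETUTCDATE()) < 300
)
```

As with the full query, the first 3 lines are the ones the alert editor fills in for you. The line numbers mentioned earlier refer to the full query, not to this shortened one.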