This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.
You can no longer post new replies to this discussion. If you have a similar question you can start a new discussion in this forum.

The Ultimate CPU Alert - Reloaded!

When adatole‌ posted his The Ultimate CPU Alert we all cheered.  His alert successfully paired CPU utilization and CPU queue length to accurately measure true load on a Windows server.  His alert successfully unlocked the mystery of CPU utilization to a whole generation of Thwacksters. Scores of administrators implemented the SAM template and custom SQL alert logic and rejoiced.  There was, however, an Achilles heel. The query relies on CPU count -- a statistic that is available with the data in the Orion 2014.1 version of the DB but requires that it be calculated.

NOTE:  The queries posted below use a couple of node custom properties.  You should create a n_CPUCount but do not populate it and, if you want to use the queries as pasted below, a n_CPUCrit which is a whole number threshold between 1 and 100 for when you want the CPU alert to trigger.  The logic below assumes a 95% threshold if n_CPUCrit isn't populated.

When ciulei‌ started digging through our Database Performance Analyzer instance he noticed that the SQL logic for The Ultimate CPU Alert kept showing up as one of the top queries that was generating wait time.  Our DB server had recently been moved to a brand new box with 256GB RAM and a 4 x SSD RAID10 array to support our 17 additional pollers and 11,000 nodes/72,000 elements.  It didn't look like a hardware problem so we decided to jump into the SQL logic of the query.  After brainstorming with ciulei and adatole we came up with a plan of attack.

The foundation of the plan was pre-calculating the CPUCount for the query using a stored procedure in our DB.  This is what ciulei came up with for us.  It calculates the CPU count (note the +1 since CPUIndex starts with 0) and writes that value into the nodes table to a custom property called n_CPUCount.  Create the custom property first then schedule the stored procedure to run at an interval that suits your environment.  We run our procedure once per day and we only check for data in the last 24 hours.

SELECT DISTINCT NodeID

              ,MAX(CPUIndex)+1 AS CPUCount

      INTO #CPUCount

      FROM CPUMultiLoad

      WHERE DATEDIFF(hh,TimeStampUTC,GETDATE()) <= 24

      GROUP BY NodeID;

      UPDATE n

              SET n.n_CPUCount = t.CPUCount

      FROM Nodes n

      JOIN #CPUCOUNT t ON n.NodeID = t.NodeID;

      DROP TABLE #CPUCount;

To take advantage of the new nodes.n_CPUCount you will need to modify both the trigger and reset queries for the Windows CPU alert. Note that we forced a check for n_CPUCount IS NOT NULL.  Why?  We have some nodes (usually network hardware) that is accessible via SNMP but doesn't not provide CPU statistics.  We're working on resolving those with custom pollers that we'll assign to specific hardware but without CPU data in the CPUMultiLoad table there is no way to gather the number of CPUs.

Trigger Query

SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name

FROM Nodes

INNER JOIN APM_AlertsAndReportsData on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)

WHERE

  1. Nodes.n_mute <> 1

AND Nodes.Prod_State = 'PROD'

AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'

AND Nodes.n_CPUCount IS NOT NULL

AND APM_AlertsAndReportsData.StatisticData > nodes.n_CPUCount

AND

    (

    (nodes.CPU_Crit is null

          AND nodes.CPULoad >= 90)

    OR (nodes.CPU_Crit is not null

              AND nodes.CPULoad >= nodes.CPU_Crit)

      )

Reset Query

SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name

FROM Nodes

INNER join APM_AlertsAndReportsData on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)

WHERE nodes.n_mute <> 1

AND Nodes.Prod_State = 'PROD'

AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'

AND Nodes.n_CPUCount IS NOT NULL

AND APM_AlertsAndReportsData.StatisticData <= Nodes.n_CPUCount

AND

    (

    (nodes.CPU_Crit is null

          AND nodes.CPULoad < 90)

    OR (nodes.CPU_Crit is not null

              AND nodes.CPULoad < nodes.CPU_Crit)

      )

That's it!  If you roll out these changes your DB will be much happier.  When we implemented the changes to the The Ultimate CPU Alert and The Ultimate CPU Alert ... for Linux! (I'll post the Linux version later) we reduced waits ~40%  on our DB.  Read that again -- we reduced waits by 40%  What is the moral of the story?  For peak application performance you *must* be watching your database and tuning for performance.  If we can do it, so can you!

EDIT:  After our initial exuberance we realized that the actual improvements in wait times was not 150% but closer to 40%.  Most notable, however, was the change in the types of waits and the sources of blocking that caused those waits.  Our waits shift from about 40% LCK_M_IX waits to CPU waits.  We also noticed an uptick in the amount of ASYNC_NETWORK_IO waits.  This is to be expected as we do have a number of polling engines in another data center in another state.  Most important was our change in blockers.  Initially we had seen the Windows CPU alert as one of the top blockers in every hour.

Parents
  • A straight copy of the trigger query results in this error. The entire pasted query is:

    INNER JOIN APM_AlertsAndReportsData on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)

    WHERE Nodes.n_mute <> 1

    AND Nodes.Prod_State = 'PROD'

    AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'

    AND Nodes.n_CPUCount IS NOT NULL

    AND APM_AlertsAndReportsData.StatisticData > nodes.n_CPUCount

    AND

    (

    (nodes.CPU_Crit is null

    AND nodes.CPULoad >= 90)

    OR (nodes.CPU_Crit is not null

    AND nodes.CPULoad >= nodes.CPU_Crit)

    )

Reply
  • A straight copy of the trigger query results in this error. The entire pasted query is:

    INNER JOIN APM_AlertsAndReportsData on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)

    WHERE Nodes.n_mute <> 1

    AND Nodes.Prod_State = 'PROD'

    AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'

    AND Nodes.n_CPUCount IS NOT NULL

    AND APM_AlertsAndReportsData.StatisticData > nodes.n_CPUCount

    AND

    (

    (nodes.CPU_Crit is null

    AND nodes.CPULoad >= 90)

    OR (nodes.CPU_Crit is not null

    AND nodes.CPULoad >= nodes.CPU_Crit)

    )

Children
No Data