This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.
You can no longer post new replies to this discussion. If you have a similar question you can start a new discussion in this forum.

The Ultimate CPU Alert for Large Environments

If you've been playing along at home you will have likely implemented adatole‌'s The Ultimate CPU Alert and the slightly modified version (by yours truly!) for Linux The Ultimate CPU Alert ... for Linux!‌  While I don't have a problem with either alert for small and medium sized environments when you start scaling to large environments things can get a little hairy.

What is large?  We have nearly 11,000 nodes and almost 145 million entries in the CPUMultiLoad view.  (You can find out how many rows you have by running SELECT COUNT(NodeID) FROM CPUMultiLoad WITH (NOLOCK) against your DB)  While examining our database via Database Performance Analyzer (you have DPA, don't you!?) we noticed that the Alert Status query associated with our CPU alerts kept showing up as a source of blocking.  Blocking queries generate wait time for other queries while the consume resources to complete.  In this case the blocking appeared to be caused by a query that was taking a long time to execute.  I put on my SQL query detective's hat and went to work.

The core of the The Ultimate CPU Alert is a pair of INNER JOINS to get the number of CPUs for each node so that it can be compared to the Win_Processor_Queue_Len value that is captured by the Server and Application Monitor component.  The top of the query looks like this:

SELECT DISTINCT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name

   FROM Nodes

   INNER JOIN APM_AlertsAndReportsData  ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)

   INNER JOIN

      (SELECT c1.NodeID, COUNT(c1.CPUIndex) as CPUCount

     FROM

         ( SELECT DISTINCT CPUMultiLoad.NodeID,  CPUMultiLoad.CPUIndex FROM CPUMultiLoad )  c1

      GROUP BY c1.NodeID

      )  c2

   ON Nodes.NodeID = c2.NodeID

That innermost SELECT statement is where we run through the entire CPUMultiLoad table so that we can then select the NodeID and count of the CPUIndex.  Given that this view (as CPUMultiLoad is a view, not a table) has 144 million rows in our DB (and since you don't really change the number of CPUs on a server all that often) there might be a better way to perform this query.  Here's what we did:

   INNER JOIN APM_AlertsAndReportsData

   ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)

   INNER JOIN

      (

  SELECT DISTINCT CPUMultiLoad.NodeID, MAX(CPUMultiLoad.CPUIndex)+1 AS CPUCount

  FROM CPUMultiLoad

  WHERE DATEDIFF(hh,TimeStampUTC,GETDATE()) <=4

  GROUP BY CPUMultiLoad.NodeID

    ) c1

  ON Nodes.NodeID = c1.NodeID

Instead of grabbing the data from CPUMultiLoad and then selecting it again we removed an INNER JOIN and selected the NodeID and MAX(CPUMultiLoad.CPUIndex) instead.  Of course, CPUIndex values start with zero and, since we want a count for comparison against the number of processes running against those CPUs, we added 1 and called it CPUCount.  In order to trim the number of rows returned from this SELECT statement (remembering that we have to INNER JOIN the select results (we called it c1) with the Nodes table) we added a WHERE DATEDIFF clause.  We take the time stamp in the CPUMultiLoad table and compare the difference, in hours, against the current date and then return only those rows where the hours are less or equal to 4.  Why 4?  Our Orion environment is set to Eastern time and UTC is 4 hours difference from Eastern time.  (I might do 6 hours -- just to be safe for daylight savings time, etc. -- but you get the idea!)

When we ran the two queries back-to-back we found that the updated query returned results 250% faster!  I'm not a SQL whiz by any stretch, but I definitely think this is a great step in the right direction.

How would you improve the query?

For the record, this is the entire query for the The Ultimate CPU Alert:

SELECT DISTINCT Nodes.NodeID AS NetObjectID, 

      Nodes.Caption AS Name 

   FROM Nodes 

   INNER JOIN APM_AlertsAndReportsData 

   ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId) 

   INNER JOIN 

      (

  SELECT DISTINCT CPUMultiLoad.NodeID, 

        MAX(CPUMultiLoad.CPUIndex)+1 AS CPUCount

  FROM CPUMultiLoad

  WHERE DATEDIFF(hh,TimeStampUTC,GETDATE()) <=4

  GROUP BY CPUMultiLoad.NodeID

    ) c1 

     

   ON Nodes.NodeID = c1.NodeID 

   WHERE Nodes.n_mute <> 1 

  AND Nodes.Prod_State = 'PROD' 

  AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len' 

  AND APM_AlertsAndReportsData.StatisticData <= c1.CPUCount 

  AND ( (nodes.CPU_Crit is null 

  AND nodes.CPULoad < 90) 

   OR (nodes.CPU_Crit is not null 

  AND nodes.CPULoad < nodes.CPU_Crit) )

  • My first thought - when you bounced this off me in email - was that it could be further improved by taking the "Count the CPU" piece offline. But I'm curious about people's thoughts.

    Briefly, I'm thinking about:

    1. Creating a query that creates a table from the query that pulls NodeID and MAX(CPUMultiLoad.CPUIndex)+1.
    2. Set that query to run every 1-4 hours as a scheduled job
    3. Change the alert query to hit THAT table instead of pulling the records from CPUMultiLoad each time

    The questions I have include:

    • will doing that make a meaningful difference?
    • is anyone nervous about creating and maintaining non-SolarWinds-owned tables in the schema
    • Does the addition of scheduled jobs in the database add a layer of complexity that people are unwilling to own?

    I'd love to hear what sqlrockstar‌ thought about all this. As well as ciulei‌, of course.

  • After chatting with ciulei‌ a little we're thinking that a stored procedure to update a custom property on a regular interval (nightly?).  It means not having to maintain a table and gives us a static property in the Nodes table to us for this (and other!) queries.

  • To support that tweak to the query I was doing some testing today using my last 4 hours query and I found a fatal flaw.  In *some* cases a server may not see activity on all processors in the previous 4 hours.  Why?  I'm not exactly sure.  It could be a poorly coded application, a CPU pinned process, etc.  Regardless, when we ran out two queries side-by-side we saw that the CPUCount calculated inside the alert query and the one run earlier in the day produced different results.

    We're still doing some testing.  I'll post the query we're using in our stored procedure (courtesy of ciulei‌) as well as the versions of the alert for both Orion 2014 and Orion 2015 DB schemas.  I know we could use the Nodes view and use the same alert query but we already re-wrote the logic in our dev environment and it might help others who want to deploy this with their NPM 11.5/SAM 6.2 deployments.

  • Are there updated queries for this using NPM 12/SAM 6.2?

  • You know, I was just thinking about that the other day.  I think that ciulei​ and I need to put our heads together and convert this (and the other versions) to SWQL.

    I'll add it to the to-do list -- but anyone else could jump in to.  I won't be offended one bit emoticons_wink.png

  • That will be great, looking forward for new version of this mostly used Alert ...