The Ultimate CPU Alert...For Linux - Reloaded!

Version 5

    As promised in The Ultimate CPU Alert - Reloaded! this post details the steps required to modify The Ultimate CPU Alert ... for Linux!  As a recap, as our environment grew we started to notice database performance issues.  We invested in new hardware to get us from RAID5 to RAID10 on SSDs and, while performance definitely increased, we still saw significant wait times.  How much wait?  We often saw 22 hours of wait during each 60 minute period.  Database Performance Analyzer provided the insight into the queries that were the top blockers -- and The Ultimate CPU Alert ... for Linux! and The Ultimate CPU Alert were consistently in the top 25 with the latter almost always number 1 or 2 on the list.

     

    NOTE:  The queries posted below use a couple of node custom properties.  You should create a n_CPUCount but do not populate it and, if you want to use the queries as pasted below, a n_CPUCrit which is a whole number threshold between 1 and 100 for when you want the CPU alert to trigger.  The logic below assumes a 95% threshold if n_CPUCrit isn't populated.

     

    Let's stop for a moment to ask why.  The The Ultimate CPU Alert and The Ultimate CPU Alert ... for Linux! both relied on a INNER JOIN query that was executed every time the alert logic was checked and, because we all use custom SQL reset queries (you do don't you?), when the alert status is checked.  As a recap, this is this is the logic that is used.  See that CPUMultiLoad table?  Our table had 41 million entries!  That's a pretty huge table scan for every execution.

     

    INNER join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount

      from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex

      from CPUMultiLoad) c1

      group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID

     

    Just like with the Windows alert the foundation of the plan is pre-calculating the CPUCount for the query using a stored procedure in our DB.  This is what ciulei came up with for us.  It calculates the CPU count (note the +1 since CPUIndex starts with 0) and writes that value into the nodes table to a custom property called n_CPUCount.  Create the custom property first then schedule the stored procedure to run at an interval that suits your environment.  We run our procedure once per day and we only check for data in the last 24 hours.

     

    SELECT DISTINCT NodeID

                  ,MAX(CPUIndex)+1 AS CPUCount

          INTO #CPUCount

          FROM CPUMultiLoad

          WHERE DATEDIFF(hh,TimeStampUTC,GETDATE()) <= 24

          GROUP BY NodeID;

     

          UPDATE n

                  SET n.n_CPUCount = t.CPUCount

          FROM Nodes n

          JOIN #CPUCOUNT t ON n.NodeID = t.NodeID;

     

          DROP TABLE #CPUCount;

     

    To take advantage of the new nodes.n_CPUCount you will need to modify both the trigger and reset queries for the Linux CPU alert. Note that we forced a check for n_CPUCount IS NOT NULL.  Why?  We have some nodes (usually network hardware) that is accessible via SNMP but doesn't not provide CPU statistics.  We're working on resolving those with custom pollers that we'll assign to specific hardware but without CPU data in the CPUMultiLoad table there is no way to gather the number of CPUs.

     

    (Remember that both the 32-bit Alert Manager and the web-based Alert Manager both insert some mandatory text.  I've posted the entire text below so that you can just copy and paste into SSMS for testing purposes.  If you are pasting into Alert Manager start with the INNER JOIN APM_AlertsAndReportsData... query.)

     

    One last heads up -- we use LoadAverage15Min in our query.  Whatever custom poller you defined when you built the original The Ultimate CPU Alert ... for Linux! should be referenced in your query below.  The names must match -- spelling counts folks!

     

    EDIT:  Notice the /*SplitMarker*/ The text before this marker is unalterable and is predefined by the objects that you are building your alert against, in this case a custom poller alert.  Anything after that splitmarker is fair game and can be changed.

     

    Trigger Query

     

    SELECT CustomPollerAssignmentView.CustomPollerAssignmentID AS NetObjectID, CustomPollerAssignmentView.AssignmentName AS Name

    FROM ((CustomPollerStatus INNER JOIN CustomPollerAssignmentView ON (CustomPollerStatus.CustomPollerAssignmentID = CustomPollerAssignmentView.CustomPollerAssignmentID)) 

    LEFT OUTER JOIN CustomPollers ON (CustomPollerAssignmentView.CustomPollerID = CustomPollers.CustomPollerID))  INNER JOIN Nodes ON (CustomPollerAssignmentView.NodeID = Nodes.NodeID)

    /*SplitMarker*/

     

     

    WHERE    (   

                       

                      (Nodes.OwnerGroup = 'LINUX') AND

                      (Nodes.Prod_State = 'PROD') AND

                      (Nodes.n_mute = 0) AND

                      (Nodes.n_CPUCount IS NOT NULL) AND

                      (CustomPollers.UniqueName = 'loadAverage15Min') AND

     

                    (

                      ((CustomPollerStatus.Rate >= (n_CPUCount))

                        AND

                      (Nodes.CPU_Crit IS NULL) AND

                      (Nodes.CPULoad >= 95)

                    )

                      OR

                      (

                        (CustomPollerStatus.Rate >= (n_CPUCount)) AND

                      (Nodes.CPU_Crit IS NOT NULL) AND

                      (Nodes.CPULoad >= Nodes.CPU_Crit))

                      )

                  )

     

     

    Reset Query

    SELECT CustomPollerAssignmentView.CustomPollerAssignmentID AS NetObjectID, CustomPollerAssignmentView.AssignmentName AS Name

    FROM ((CustomPollerStatus INNER JOIN CustomPollerAssignmentView ON (CustomPollerStatus.CustomPollerAssignmentID = CustomPollerAssignmentView.CustomPollerAssignmentID)) 

    LEFT OUTER JOIN CustomPollers ON (CustomPollerAssignmentView.CustomPollerID = CustomPollers.CustomPollerID)) 

    INNER JOIN Nodes ON (CustomPollerAssignmentView.NodeID = Nodes.NodeID)

    /*SplitMarker*/

     

     

    WHERE    (   

                      (Nodes.OwnerGroup = 'LINUX') AND

                      (Nodes.Prod_State = 'PROD') AND

                      (Nodes.n_mute = 0) AND

                      (CustomPollers.UniqueName = 'loadAverage15Min') AND

                    (

                        ((CustomPollerStatus.Rate < (n_CPUCount))

                        OR

                      ((Nodes.CPU_Crit IS NULL) AND (Nodes.CPULoad < 95))

                        )

                      OR

                      ((CustomPollerStatus.Rate < (n_CPUCount)) AND

                      ((Nodes.CPU_Crit IS NOT NULL) AND

                      (Nodes.CPULoad < Nodes.CPU_Crit)))

                      )

                  )

     

    That's it!  If you roll out these changes your DB will be much happier.  When we implemented the changes to the The Ultimate CPU Alert and The Ultimate CPU Alert ... for Linux! we reduced waits 40% on our DB.  Read that again -- we reduced waits by 40%  What is the moral of the story?  For peak application performance you *must* be watching your database and tuning for performance.  If we can do it, so can you!

     

    EDIT:  After our initial exuberance we realized that the actual improvements in wait times was not 150% but closer to 40%.  Most notable, however, was the change in the types of waits and the sources of blocking that caused those waits.  Our waits shift from about 40% LCK_M_IX waits to CPU waits.  We also noticed an uptick in the amount of ASYNC_NETWORK_IO waits.  This is to be expected as we do have a number of polling engines in another data center in another state.  Most important was our change in blockers.  Initially we had seen the Windows CPU alert as one of the top blockers in every hour.


    EDIT2:  zackm pointed out that my cut and paste skills need some work.  (Thanks Zack!)  I've updated both the trigger and reset queries for this Linux alert.  Make sure you have all of the custom properties (Prod_State, n_mute, n_CPUCount, etc.) or else you'll need to start trimming.  n_CPUCount and CPU_Crit are mandatory for the query to work as designed.  The other custom properties can be removed without impact. 


    EDIT3: zackm noted that the INNER JOIN APM_AlertsAndReportsData on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId) after the /*SplitMarker*/ was an artifact from our old style alert and is not required.  Not only that but it makes the results much cleaner. Extra points for Zack!