102 Replies Latest reply on Nov 26, 2018 9:29 AM by greg.white

    The Ultimate CPU Alert

    Leon Adato

      CPU alerts are a yawner. Grab the CPULoad, check it against a threshold (maybe even a per-node custom threshold, as explained here: TIPS & TRICKS: Stop The Madness: How to set alert thresholds per-device), cut the alert, move on, right?

       

      Here's the problem: If you are working with sophisticated Operations or server staff, you probably already know that they hate CPU alerts because they are

      1. always vague
      2. frequently invalid
      3. way too frequent because they are tuned too low OR
      4. never triggered when you need them because they are tuned too high.

       

      At the heart of the issue is the fact that high CPU, by itself, tells you nothing of use. So the CPU is high? So what? If I've got a box that is constantly running hot but it is keeping up with the work, that's called "correctly sized".

       

      What you really need to know about CPU are 3 things:

      1. How many processors are in the box
      2. How many jobs are in the Processor Queue
      3. What's the current CPU load

       

      If you've got more jobs in the queue than you have CPUs and you also have high CPU, then you have the makings of a meaningful, actionable issue.

       

      Let's add a little icing on the cake: When the condition above occurs, I want to know what the top 10 processes are at that moment, so I can get an idea of the likely culprits.

       

      Interested? Let's get to work!

       

      For this to work, you need NPM and SAM. You will be assigning one Perfmon counter to all your servers, and doing a little bit of SQL voodoo in the alert.

       

      The Perfmon Counter:

      In SAM, set up a new template. In it, you want to add a perfmon counter monitor named “Win_Processor_Queue_Len” that points to

      • Counter: “Processor Queue Length”,
      • Instance: (blank)
      • Category: “System”

       

      processor_queue_AM.png

       

      After appropriate testing, adjustments, etc., you will eventually roll this template out to all your Windows systems.

       

      The Alert Trigger

      Your alert trigger is going to require some hardcore SQL. So you are setting up a Custom SQL Alert, with “Nodes” as the target table.

       

      Along with the top part of the query that is automatically provided, you will add the following:

      inner join APM_AlertsAndReportsData
         on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
      inner join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount
         from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
            from CPUMultiLoad) c1
         group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID
      where
         APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
         AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
         AND nodes.CPULoad > 90

       

      alert_trigger.png

       

      What this is doing is

      1. Pulling the count of CPUs for this node from the CPUMultiLoad table
      2. Pulling the current statistic for the Win_Processor_Queue_Len perfmon counter
      3. Checking that the number of processes in the queue is greater than the number of CPUs
      4. And finally checking that the CPULoad is over 90%

       

      If the conditions in items 3 and 4 are both true, you will get an alert.
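
      If you want to see the trigger logic behave before touching a real Orion database, here is a minimal Python sketch that runs the same joins against an in-memory SQLite database. Everything about the setup is an assumption for illustration: the three toy tables contain only the columns the trigger touches (not the real Orion schema), the "SELECT ... FROM Nodes" line stands in for the top part that the alert editor provides, and the node names and numbers are made up. The column aliases inside the subqueries are added only to keep SQLite happy.

```python
import sqlite3

# Toy schema: only the columns the trigger touches (assumed, not the real Orion schema).
SCHEMA = """
CREATE TABLE Nodes (NodeID INTEGER PRIMARY KEY, Caption TEXT, CPULoad REAL);
CREATE TABLE CPUMultiLoad (NodeID INTEGER, CPUIndex INTEGER);
CREATE TABLE APM_AlertsAndReportsData (NodeId INTEGER, ComponentName TEXT, StatisticData REAL);
"""

# The SELECT/FROM lines stand in for the part the alert editor supplies (assumed);
# the joins and WHERE clause follow the trigger from the post.
TRIGGER_QUERY = """
SELECT Nodes.NodeID
FROM Nodes
INNER JOIN APM_AlertsAndReportsData
    ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
INNER JOIN (SELECT c1.NodeID AS NodeID, COUNT(c1.CPUIndex) AS CPUCount
            FROM (SELECT DISTINCT CPUMultiLoad.NodeID AS NodeID,
                                  CPUMultiLoad.CPUIndex AS CPUIndex
                  FROM CPUMultiLoad) c1
            GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
WHERE APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
  AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
  AND Nodes.CPULoad > 90
ORDER BY Nodes.NodeID
"""


def build_demo_db():
    """Create an in-memory database with three made-up nodes."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO Nodes VALUES (?, ?, ?)", [
        (1, "busy-and-backed-up", 95.0),   # high CPU, long queue
        (2, "busy-but-keeping-up", 97.0),  # high CPU, short queue
        (3, "queued-but-idle", 60.0),      # long queue, CPU under threshold
    ])
    # Two rows per CPU index, which is why the query needs DISTINCT.
    cpus_per_node = [(1, 4), (2, 4), (3, 2)]
    cpu_rows = [(node, idx)
                for node, count in cpus_per_node
                for idx in range(count)
                for _ in range(2)]
    db.executemany("INSERT INTO CPUMultiLoad VALUES (?, ?)", cpu_rows)
    db.executemany("INSERT INTO APM_AlertsAndReportsData VALUES (?, ?, ?)", [
        (1, "Win_Processor_Queue_Len", 10.0),
        (2, "Win_Processor_Queue_Len", 2.0),
        (3, "Win_Processor_Queue_Len", 5.0),
    ])
    return db


def triggering_nodes(db):
    """Return the NodeIDs that satisfy the trigger condition."""
    return [row[0] for row in db.execute(TRIGGER_QUERY)]


if __name__ == "__main__":
    print(triggering_nodes(build_demo_db()))
```

      In this made-up data set only node 1 has both a queue longer than its CPU count and a CPU load over 90, so it is the only node the query returns.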

       

      If you stop here, you have a nifty alert that will tell you when something meaningful (and bad) is going on with your server. But let’s kick it up a notch.

       

      Trigger Action

      Your alert action is going to have two key steps:

      1. Run the “Solarwinds.APM.RealTimeProcessPoller.exe” utility to get the top 10 processes
      2. After a 60 second delay, send your message

       

      alert_action.png

       

      Get the Processes

      The “Solarwinds.APM.RealTimeProcessPoller.exe” comes as part of SolarWinds SAM.

       

      NOTE: If you installed SolarWinds somewhere other than the default location (C:\program files (x86)) then you will need to provide the full path to \SolarWinds\Orion\APM\Solarwinds.APM.RealTimeProcessPoller.exe

       

      Otherwise, your command will look like this:

      1. SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alert=${AlertDefID} -timeout=60

       

      The only thing you may want to adjust is the -timeout value, if you find you are getting alerts coming back with no process information (i.e., it’s taking longer for the servers to respond).
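
      To make the path note and the -timeout adjustment concrete, here is a small Python sketch that assembles the command string. The helper name poller_command, the example install root, and the choice to wrap a full path in double quotes are all assumptions for illustration; only the executable name, the \SolarWinds\Orion\APM\ folder, and the three flags come from the post. The ${NodeID} and ${AlertDefID} macros are left as literal text for the alert engine to expand.

```python
from pathlib import PureWindowsPath

EXE_NAME = "SolarWinds.APM.RealTimeProcessPoller.exe"
DEFAULT_ROOT = PureWindowsPath(r"C:\Program Files (x86)")


def poller_command(install_root=DEFAULT_ROOT, timeout_s=60):
    """Build the alert action command line (hypothetical helper)."""
    root = PureWindowsPath(install_root)
    if root == DEFAULT_ROOT:
        # Default install location: the bare executable name is enough.
        program = EXE_NAME
    else:
        # Non-default location: spell out the full path, quoted in case it contains spaces.
        full_path = root / "SolarWinds" / "Orion" / "APM" / EXE_NAME
        program = f'"{full_path}"'
    # Doubled braces keep ${NodeID} and ${AlertDefID} as literal macros.
    return f"{program} -n=${{NodeID}} -alert=${{AlertDefID}} -timeout={timeout_s}"


if __name__ == "__main__":
    print(poller_command())
    print(poller_command(r"D:\Apps", timeout_s=120))
```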

       

      Send Your Message

      At its most basic, your alert message needs to look like this:

      CPU on Node ${NodeName} is at ${CPULoad}  at ${AlertTriggerTime}.

       

      Top 10 processes at the time of the alert are:

      ${Notes}

       

      NOTE: The ${Notes} field is populated with the top 10 processes as part of the previous action.

       

      However, if you want to dress it up, you can include more information using more SQL voodoo:

      CPU on Node ${NodeName} is at ${CPULoad}  at ${AlertTriggerTime}.

       

      There are ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = ${NodeID} and APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'} items in the process queue and only ${SQL:Select COUNT(c1.CPUIndex) from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad where CPUMultiLoad.nodeid = ${NodeID} ) c1 } CPUs to process them.

       

      Top 10 processes at the time of the alert are:

      ${Notes}

       

      If there is no list of alerts, it's because it took longer than 2 minutes to collect off the server. We felt that delivering the alert fast was more important.

       

      What that big ${SQL… block in the middle does is pull the current Win_Processor_Queue_Len statistic, along with the count of CPUs for this node from the CPUMultiLoad table. The result would read:

      There are 10 items in the process queue and only 4 CPUs to process them.

       

      After setting up the message, make sure you go to the “Alert Escalation” tab and set the “Delay the execution of this action” to at least 1 minute.
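
      The reason for the delay is ordering: the email should not go out until the process poller has had its chance to fill in ${Notes}. A simplified way to reason about it, sketched in Python with hypothetical names (this is a mental model of the two action delays, not how the alert engine is implemented):

```python
def email_sees_process_list(poller_delay_s, poller_timeout_s, email_delay_s):
    """Return True if the email fires only after the poller's full timeout window.

    poller_delay_s   -- delay configured on the "run the poller" action
    poller_timeout_s -- the -timeout value passed to the poller
    email_delay_s    -- delay configured on the email action
    """
    poller_deadline_s = poller_delay_s + poller_timeout_s
    return email_delay_s >= poller_deadline_s


if __name__ == "__main__":
    # Poller runs immediately with -timeout=60, email delayed 1 minute.
    print(email_sees_process_list(0, 60, 60))
    # Both actions delayed by 60 seconds: the email does not wait for the poller.
    print(email_sees_process_list(60, 60, 60))
```

      If you raise the -timeout value, raise the email delay along with it.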

       

      alert_escalation.png

       

      Summary

      So there you have it. A CPU alert that not only tells you when something meaningful and actionable is happening, but also gives you (or your support staff) some initial information to get you started finding and resolving the problem.

       

      As anecdotal proof of how valuable this is, within 24 hours of rolling out this alert at my company, we found 3 different applications which were chronically misbehaving across the enterprise. 2 resulted in our being able to prove an issue to the vendor (who didn’t believe us) and get a bug-fix under way.

       

      EDIT 2014-10-31:

      As discovered by jbiggley in this post: Custom SQL Alerts - Do reset conditions also need to be custom?, the reset trigger is problematic for this alert (as with all custom SQL alerts). You can't just select "reset when the condition is no longer true". The solution, as elaborated by RichardLetts here: Warning about custom SQL alerts (reset trigger), is that the reset trigger needs to be:

      inner join APM_AlertsAndReportsData
         on (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
      inner join (select c1.NodeID, COUNT(c1.CPUIndex) as CPUCount
         from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
            from CPUMultiLoad) c1
         group by c1.NodeID) c2 on Nodes.NodeID = c2.NodeID
      where
         (APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len' AND APM_AlertsAndReportsData.StatisticData <= c2.CPUCount)
         OR nodes.CPULoad <= 90

       

      The key change here is that you want to reset when EITHER the number of processes in the queue is no longer greater than the number of CPUs, OR the CPU load is no longer over the threshold.
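
      Put another way, the reset is the logical opposite of the trigger: "queue > CPUs AND load > 90" flips to "queue <= CPUs OR load <= 90". A minimal Python sketch of the per-node logic, assuming the node is already collecting Win_Processor_Queue_Len (the function names and the threshold parameter are made up for illustration):

```python
def should_trigger(queue_length, cpu_count, cpu_load, threshold=90):
    """Trigger: more jobs queued than CPUs AND CPU load over the threshold."""
    return queue_length > cpu_count and cpu_load > threshold


def should_reset(queue_length, cpu_count, cpu_load, threshold=90):
    """Reset: queue no longer exceeds the CPU count OR load is back at or under the threshold."""
    return queue_length <= cpu_count or cpu_load <= threshold


if __name__ == "__main__":
    # Walk a 4-CPU node through a few states and show both decisions.
    for queue_length, cpu_load in [(10, 95), (10, 60), (3, 95), (4, 91)]:
        print(queue_length, cpu_load,
              should_trigger(queue_length, 4, cpu_load),
              should_reset(queue_length, 4, cpu_load))
```

      For any combination of inputs exactly one of the two functions returns True, which is the property you want between a trigger and its reset.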

       

      EDIT 2015-02-23

      Hat-Tip to garyuk who caught my greater-than / less-than confusion in the reset logic above. It's fixed now.

        • 1. Re: The Ultimate CPU Alert
          Jfrazier

          Bravo !!!

          This is what I have been trying to describe to my engineers...now there is a gotcha.

          A Windows server that is a VM has X number of CPUs, as the VM has "told" it, but the server doesn't know it is virtual.

          If one of the VM guests is so busy that it is taking a large amount of resources, it is possible that this server thinks it's maxed out when in actuality it is being starved for resources.   See this article:   http://blog.logicmonitor.com/2013/02/25/a-tale-of-two-metrics-windows-cpu-or-vcenter-vm-cpu/

           

          If we can tie in cpu ready from the ESX side then we'd have the "More Ultimate CPU Alert !"

           

          Ah...the joys of virtual environments.  It's all shell games.

          • 2. Re: The Ultimate CPU Alert
            Leon Adato

            I have a couple of thoughts about this:

             

            • the alert works equally well for VMware systems, so you could tune the ESX alert to have a shorter window than the guest alert.
            • I have to believe there's a way to alert for guests that are consuming too many host resources. It's not a canned alert, but I'm going to look into this now that there's a fly in my ointment.
            • ...and of course, SolarWinds would tell you to take a look at Virtualization Manager
            • 3. Re: The Ultimate CPU Alert
              Jfrazier

              Thanks Leon...  The fun part is knowing when and if the guest is being starved for resources which is why the utilization is "high".

              Then proper steps can be taken to improve resource availability...which reduces alerts, improves productivity, and reduces "app is slow" tickets.

               

              This wasn't an intent to put a fly in your ointment...but rather using the collaborative power of Thwack for good.

              • 4. Re: The Ultimate CPU Alert
                Leon Adato

                I did some digging and the fact is that setting up this alert is pretty simple - no SQL needed.

                 

                When you create an alert using the "Property to Monitor" of "VirtualMachine" then you can get the CPU of the guest, OR the actual CPU consumption on the host machine.

                Simple trigger, Virtual Machines, Current CPU and Memory, CPU Utilization.

                 

                That statistic is NOT the same as the CPU on the host. I'm still testing it out but I'm pretty sure that's the case.

                 

                - Leon

                • 5. Re: The Ultimate CPU Alert
                  rhidians

                  I think I need a follow adatole button.

                   

                  Nice work on the above and nice work on the ICMP Status monitoring but no SNMP stats alert too!

                  • 6. Re: The Ultimate CPU Alert
                    Leon Adato

                    Awwwwww!

                     

                    Just click the picture of my fuzzy face, and in the upper-right corner you should see a "follow" button. I'll even friend you if you ask!

                    • 7. Re: The Ultimate CPU Alert
                      fitzy141

                      So I believe I followed your steps above to a T, and when I test the alert I get this:

                       

                      CPU on Node ASC-PRD-SQL13 is at 60 %  at 5/13/2014 3:55:40 PM.

                       

                      There are ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = $@NodeID@ and APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'} items in the process queue and only 4 CPUs to process them.

                       

                      Top 10 processes at the time of the alert are:

                       

                       

                      If there is no list of alerts, it's because it took longer than 2 minutes to collect off the server. We felt that delivering the alert fast was more important.

                       

                       

                      With no cool info :-) what am I doing wrong

                      • 8. Re: The Ultimate CPU Alert
                        Leon Adato

                        Are you collecting the metric named "Win_Processor_Queue_Len" for that node? (spellling counts, cApiTaliZatioN counts). If not, that's why you are getting the error.

                        • 9. Re: The Ultimate CPU Alert
                          patriot

                          The part that is not working for me in my alert message is the Notes. How do I populate the Notes field with the list of Top 10 components? Thanks for the very nice alert by the way Leon.

                          • 10. Re: The Ultimate CPU Alert
                            Leon Adato

                            Are you making sure you execute the program to get the processes

                             

                            SolarWinds.APM.RealTimeProcessPoller.exe -n=${NodeID} -alert=${AlertDefID} -timeout=60

                             

                            In my post above, it was bullet-ed with an "A. " before it.

                             

                            Also, you have to make sure you put the PATH to that executable (SolarWinds.APM.RealTimeProcessPoller.exe) if you installed SolarWinds anywhere EXCEPT the default (C:\program files (x86)\solarwinds).

                             

                            Finally, you have to put a delay for the next action - the email - so that the first action has time to run. And sometimes it still doesn't, because - DUH! - the CPU on the box is so busy it can't complete. So that's one of those hit-or-miss kinds of things.

                            • 11. Re: The Ultimate CPU Alert
                              patriot

                              Well, here is a screenshot of my Trigger tab. I verified the location of the EXE file:

                               

                              5-13-2014 4-52-49 PM.jpg

                              • 12. Re: The Ultimate CPU Alert
                                fitzy141

                                yeah, maybe I am missing something. I thought I did it right ... SAM is not my strong point. It's this Win_Processor_Queue_Len: I am trying to figure out how to get it into the template .. let me try this again

                                • 13. Re: The Ultimate CPU Alert
                                  Jfrazier

                                  And sometimes it still doesn't, because - DUH! - the CPU on the box is so busy it can't complete. So that's one of those hit-or-miss kinds of things.

                                  Inconceivable !!!

                                  • 14. Re: The Ultimate CPU Alert
                                    jwilson2013

                                    I do not think that word means what you think it means....

                                    • 16. Re: The Ultimate CPU Alert
                                      patriot

                                      Leon,

                                       

                                      Here is the alert message I get. The Top 10 processes are not shown. What am I not doing?

                                      test image.jpg

                                      • 17. Re: The Ultimate CPU Alert
                                        Leon Adato

                                        I looked back at your screenshot. You have a T+60 delay on the RealTimeProcess action, AND ALSO a T+60 delay for the email. That means the email is happening at the same time as the process collection. There's no time for the "get processes" task to execute before the email is sent.

                                         

                                        Remove the delay from the first task, and then add *at least* a 5 minute delay to the email action.

                                        • 18. Re: The Ultimate CPU Alert
                                          jbiggley

                                          I was going to point out something very similar, but more along the lines that, in a virtualized environment you *want* to see boxes running as lean as possible so that you can increase consolidation ratios.  Isn't that why we spend so much money on redundant network, storage and server hardware?  Stack 'em deep and run 'em cheap!

                                           

                                          Layering alerts from OSEs and the virtualization layer.  #mindblown

                                          • 19. Re: The Ultimate CPU Alert
                                            Jfrazier

                                            Problem is when you pack them too deep you don't have any wiggle room for resources...then you get servers that are maxed out and vCops indicating they aren't using everything they have been assigned and the environment is further degraded because they are "right-sized".  When in fact the resources are being hogged by other servers..... 

                                            It's kind of like insurance...  The insurance company (hosts for the virtual servers) gets money (resources) from all the insured (the virtual servers), gambling that all the insured won't file claims at the same time, because there is not enough money (resources) to go around if everyone needed everything at the same time.

                                            • 20. Re: The Ultimate CPU Alert
                                              browntd

                                              I know I am late to the game on this one but I have a question.  

                                              Does this only work with the full SQL version?

                                              I am trying in a lab environment with the free SQL bits but I get errors in the join command syntax.  

                                              • 21. Re: The Ultimate CPU Alert
                                                augiedc

                                                This is Great! Thanks for sharing.

                                                Quick question - for machines with multiple CPUs, does the SQL query check the individual CPU usage or the average of all?

                                                • 22. Re: The Ultimate CPU Alert
                                                  Leon Adato

                                                  Thank you for the kind words. It uses the average of all CPUs.

                                                  • 23. Re: The Ultimate CPU Alert
                                                    cfizz34

                                                    Thanks for posting but I just can't get the top processes to show up in the email (and note, we are using the default Orion install location). 

                                                    Any help would be most appreciated.

                                                    11-13-2014 5-02-34 PM.jpg

                                                    • 24. Re: The Ultimate CPU Alert
                                                      ayegel

                                                      Leon,

                                                       

                                                      Thanks for this wonderful tool! I'm having the same issue as fitzy141. I've checked and double-checked spelling and capitalization and I'm at the end of the rope on this one. Any suggestions are greatly appreciated!

                                                       

                                                      CPU on Node XXXXXXXX is at 18 % at 1/28/2015 11:02:16 AM.

                                                       

                                                      There are ${SQL:Select APM_AlertsAndReportsData.StatisticData from APM_AlertsAndReportsData where APM_AlertsAndReportsData.NodeId = $@NodeID@ and APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'} items in the process queue and only 4 CPUs to process them.

                                                       

                                                      Top 10 processes at the time of the alert are:

                                                       

                                                       

                                                      If there is no list of alerts, it's because it took longer than 2 minutes to collect off the server. I felt that delivering the alert fast was more important.
