1 Reply Latest reply on Jul 4, 2018 11:22 AM by yaquaholic

    CORE CPU MONITORING

    uniswtc

      We have a server with 4 CPUs where CPU1 reached 100% and the system hung. At other times CPU2, CPU3 or CPU4 reached 100%, and sometimes two CPUs reached 100% at once. On all of these occasions the system hung.

       

      Because SolarWinds monitors the average across the 4 CPUs rather than each CPU individually, the problem goes undetected and the system hangs every time.

       

      Is there a way to monitor each CPU individually and alert the server owner when any one of them reaches its Warning/Critical threshold? Please advise.

        • Re: CORE CPU MONITORING
          yaquaholic

          Yes, you need to query the Orion database to get CPU metrics per core:

           

          SELECT n.Caption, cml.TimeStampUTC, cml.CPUIndex, cml.MaxLoad, cml.AvgLoad
            ,'/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:' + CAST(n.NodeID AS varchar(256)) AS [DetailsURL]
            FROM CPUMultiLoad cml
            INNER JOIN Nodes n ON n.NodeID = cml.NodeID
            WHERE cml.NodeID = <NodeID of server in question>
            AND cml.MaxLoad > 99
            AND cml.TimeStampUTC > DATEADD(mi, -10, GETUTCDATE())   --TimeStampUTC is in UTC, so compare against GETUTCDATE() rather than GETDATE()
            ORDER BY cml.TimeStampUTC DESC

           

          This will show any core that has exceeded 99% utilisation in the past 10 minutes.
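
          If the row-by-row output is too noisy, a per-core summary can be easier to read. The following is only a sketch that reuses the table and column names from the query above; the NodeID value 123 is a placeholder, and the PeakLoad, MeanLoad and Samples aliases are made up for illustration. It uses GETUTCDATE() so the comparison against TimeStampUTC does not depend on the SQL server's timezone.

          ```sql
          -- Sketch: one row per core for the last 10 minutes, busiest core first.
          -- 123 is a placeholder; substitute the NodeID of the server in question.
          SELECT n.Caption
                ,cml.CPUIndex
                ,MAX(cml.MaxLoad) AS PeakLoad   -- highest load seen on this core
                ,AVG(cml.AvgLoad) AS MeanLoad   -- mean of the polled averages
                ,COUNT(*)         AS Samples    -- number of polls in the window
            FROM CPUMultiLoad cml
            INNER JOIN Nodes n ON n.NodeID = cml.NodeID
            WHERE cml.NodeID = 123
            AND cml.TimeStampUTC > DATEADD(mi, -10, GETUTCDATE())
            GROUP BY n.Caption, cml.CPUIndex
            ORDER BY PeakLoad DESC
          ```

          A core that shows a high PeakLoad across several samples is more likely to be genuinely pegged than one with a single spike.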

           

          Next you need to wrap this into an alert. This is a little harder, as node alerts need to be based on the Nodes table. Open the alert's Trigger Condition, set it to Custom SQL Alert (Advanced) and you'll see what I mean.

          Select Node under "Set up your SQL condition" and try this SQL underneath the pre-populated grey box:

           

          INNER JOIN CPUMultiLoad ON Nodes.NodeID = CPUMultiLoad.NodeID
          WHERE CPUMultiLoad.MaxLoad > 99
          AND CPUMultiLoad.TimeStampUTC > DATEADD(mi, -10, GETUTCDATE())

           

          That will trigger when any core on a multi-core device exceeds 99% utilisation within the past 10 minutes.

          Obviously, adjust the time window and threshold to suit your environment and requirements.

           

          I hope it helps.