4 Replies Latest reply on Mar 5, 2015 7:46 AM by jbiggley

    CPU and Memory Load - Why don't they age?

    jbiggley

      We recently had a node where the platform owner asked us to disable CPU and Memory utilization data collection.  The node is still polled by SNMP and we are still collecting other statistics (disk and interface stats, etc.) but not CPU and memory usage.  To remove the stats we did a 'List Resources' and unchecked the 'CPU and Memory' check box.  Easy as pie, right?

       

      Wrong.

       

      A few days later the platform team complained that they were still getting CPU load alerts.  How could that be?  Granted, our CPU alerts were The Ultimate CPU Alert (care of adatole) but we didn't have any current CPU or memory usage data.  Why would this alert trigger?  We did some digging and this is what we found:

       

      1)     The CPULoad_Detail table contains a detailed collection of the CPU and memory usage for the node for up to the last 192 hours (8 days -- depending on when the roll-up jobs run).  The last values in CPULoad_Detail.AvgLoad and CPULoad_Detail.AvgPercentMemoryUsed were 98 and 16.0877 respectively.

      2)     The Nodes table holds the values from the LAST entry in the CPULoad_Detail table: Nodes.CPULoad (= CPULoad_Detail.AvgLoad) and Nodes.PercentMemoryUsed (= CPULoad_Detail.AvgPercentMemoryUsed).

      3)     The gauges on the node view show the (average) CPU and memory utilization values from the Nodes table.
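
      A minimal sketch of how one could confirm finding 2, using a toy in-memory SQLite copy of the two tables rather than the real Orion database.  The simplified schema, the node caption, and the sample timestamps are made up for illustration; only the table and column names and the 98 / 16.0877 values come from the post.

```python
import sqlite3

def build_toy_db():
    """Create a toy stand-in for the two tables (hypothetical simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Nodes (NodeID INTEGER PRIMARY KEY, Caption TEXT,
                            CPULoad REAL, PercentMemoryUsed REAL);
        CREATE TABLE CPULoad_Detail (NodeID INTEGER, DateTime TEXT,
                                     AvgLoad REAL, AvgPercentMemoryUsed REAL);
        -- Made-up sample rows; the last detail row matches the values in the post.
        INSERT INTO Nodes VALUES (1, 'example-node', 98, 16.0877);
        INSERT INTO CPULoad_Detail VALUES (1, '2015-02-20 10:00:00', 40, 15.0);
        INSERT INTO CPULoad_Detail VALUES (1, '2015-02-20 10:05:00', 98, 16.0877);
    """)
    return conn

def nodes_vs_latest_detail(conn):
    """Return each node's Nodes values next to its newest CPULoad_Detail row."""
    return conn.execute("""
        SELECT n.Caption, n.CPULoad, n.PercentMemoryUsed,
               d.DateTime, d.AvgLoad, d.AvgPercentMemoryUsed
        FROM Nodes n
        JOIN CPULoad_Detail d ON d.NodeID = n.NodeID
        WHERE d.DateTime = (SELECT MAX(DateTime)
                            FROM CPULoad_Detail
                            WHERE NodeID = n.NodeID)
    """).fetchall()

rows = nodes_vs_latest_detail(build_toy_db())
```

      If the Nodes columns match the newest detail row while that row's timestamp is days old, the gauges and any alert reading Nodes.CPULoad are working from stale data.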

       

      A couple of outstanding questions:

       

      1)     Since we are no longer updating the CPULoad_Detail table for this node, will Nodes.CPULoad and Nodes.PercentMemoryUsed clear once the existing CPULoad_Detail data has been rolled up to the CPULoad_Hourly table?

      2)     If the values in the Nodes table do not clear, is there any way to clear them without deleting and re-adding the node (without the CPU and memory resources, of course)?

      3)     Thoughts on modifying The Ultimate CPU Alert (and likely The Ultimate CPU Alert ... for Linux!, since it is built on the same logic) to catch aged data?  It could be something as simple as a datediff on the most recent entry in the CPULoad_Detail table, treating the data as stale when its age exceeds Nodes.StatCollection.
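
      The age check in question 3 can be sketched outside the database as well.  This is only an illustration of the idea, not the alert's actual logic: the function names, the default 5-minute interval, and the 1-minute grace period are assumptions.

```python
from datetime import datetime, timedelta

def cpu_sample_is_fresh(last_sample, now, poll_interval_minutes=5, grace_minutes=1):
    """Return True when the newest CPU sample is recent enough to trust.

    last_sample is the timestamp of the newest detail row, or None if the
    node has no detail rows at all (treated as not fresh).
    """
    if last_sample is None:
        return False
    max_age = timedelta(minutes=poll_interval_minutes + grace_minutes)
    return now - last_sample < max_age

def should_alert(cpu_load, threshold, last_sample, now):
    """Alert only when the load is over threshold AND the data is current."""
    return cpu_load >= threshold and cpu_sample_is_fresh(last_sample, now)
```

      With this shape, a node whose last sample is days old never alerts, no matter how high its last recorded load was.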

       

      Case 737340 is open with support for this one.  The first response was to delete the node and re-add it without the CPU and memory resources, but I've asked for an alternative so we don't lose the historical data.  As a last resort, I'll delete and re-add.

       

      Cheers.

        • Re: CPU and Memory Load - Why don't they age?
          jbiggley

          This is a mashup of two alerts that we already use, offered as a possible answer to question 3 above.  I chose 6 minutes because our data collection interval is 5 minutes.  I mashed the added parts (the LEFT JOIN on CPULoad_Detail and the DATEDIFF conditions) into The Ultimate CPU Alert so that we can check that the data in the CPULoad_Detail table is current.  I've tested it in our environment and it does strip out the node above from the alerts.  The questions I am still pondering are whether I *want* to strip out that node and how often I will run into this condition.

           

          -- Nodes.n_mute, Nodes.Prod_State and Nodes.CPU_Crit are presumably custom properties in this environment.
          SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
          FROM Nodes
          INNER JOIN APM_AlertsAndReportsData ON (Nodes.NodeID = APM_AlertsAndReportsData.NodeId)
          -- c2: number of distinct CPUs per node
          INNER JOIN (SELECT c1.NodeID, COUNT(c1.CPUIndex) AS CPUCount
                      FROM (SELECT DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex
                            FROM CPUMultiLoad) c1
                      GROUP BY c1.NodeID) c2 ON Nodes.NodeID = c2.NodeID
          -- lc: timestamp of the newest CPULoad_Detail row per node (used for the freshness check)
          LEFT JOIN (SELECT CPULoad_Detail.NodeID, MAX(CPULoad_Detail.DateTime) AS LastCPU
                     FROM CPULoad_Detail
                     GROUP BY CPULoad_Detail.NodeID) lc ON Nodes.NodeID = lc.NodeID
          WHERE
              Nodes.n_mute <> 1
              AND Nodes.Prod_State = 'PROD'
              AND APM_AlertsAndReportsData.ComponentName = 'Win_Processor_Queue_Len'
              AND APM_AlertsAndReportsData.StatisticData > c2.CPUCount
              AND (
                  -- no per-node threshold: fall back to 90, and require data newer than 6 minutes
                  (Nodes.CPU_Crit IS NULL
                   AND Nodes.CPULoad >= 90
                   AND DATEDIFF(mi, lc.LastCPU, GETDATE()) < 6)
                  -- per-node threshold set: use it, with the same freshness requirement
                  OR (Nodes.CPU_Crit IS NOT NULL
                   AND Nodes.CPULoad >= Nodes.CPU_Crit
                   AND DATEDIFF(mi, lc.LastCPU, GETDATE()) < 6)
              )

          • Re: CPU and Memory Load - Why don't they age?
            Leon Adato

            I wonder if this is related to the well-known SAM issue, where components that are disabled while in a "critical" state remain in a critical state and re-fire alerts. It might be worth asking support about.

            • Re: CPU and Memory Load - Why don't they age?
              jbiggley

              To close off this thread: we had opened incident 737340 with support.  What we determined was that in our environment (NPM 11.0 and SAM 6.1.1 installed), when you have been polling CPU and memory utilization via WMI and then remove those resources from monitoring, the Nodes table is not updated and keeps the last polled values.  The table needs to have a value of -2 for CPULoad and PercentMemoryUsed to remove the gauges.
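
              As a rough illustration of the -2 convention described above, here is a sketch of how a consumer of those columns might treat the value.  The helper names are hypothetical, and treating a missing (None) value the same way as -2 is an assumption, not something support confirmed.

```python
NOT_MONITORED = -2  # value reported in this thread for "resource not monitored"

def gauge_value(raw):
    """Return the value to display, or None when there is nothing to show.

    Assumption: a missing value (None) is handled the same as the -2 sentinel.
    """
    if raw is None or raw == NOT_MONITORED:
        return None
    return raw

def cpu_over_threshold(raw_cpu_load, threshold=90):
    """Only compare against the threshold when the node is actually monitored."""
    value = gauge_value(raw_cpu_load)
    return value is not None and value >= threshold
```

              In this sketch a node stuck at a stale 98 would still trip the threshold, while one correctly reset to -2 would not.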

               

              I asked support to open a bug report with development on this.  Unless someone can point me in a different direction, I'm going to chalk this one up to a bug in our current releases.  One more reason to update to NPM 11.5 and SAM 6.2.