Open for Voting

FEATURE REQUEST - Set up an alert to monitor multiple CPU cores (Case #603886)

We set up a CPU alert with the condition (CPU Load > 80%). However, one of our switches had a CPU problem and we did not receive any alert. We eventually found that the switch has multiple CPU cores and one of them was above 90%, but CPU Load is the average of all CPU cores. It is not reasonable to build a CPU alert on this condition. Could alerts be set up to monitor each CPU core in a device? Thanks
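The dilution described above can be checked with a little arithmetic. The sketch below is illustrative only: the four core readings are made-up values, and `average_load` and `hot_cores` are hypothetical helper names, not anything Orion provides.

```python
from statistics import mean

def average_load(core_loads):
    """Return the mean load across all cores (what an averaged CPU Load reports)."""
    return mean(core_loads)

def hot_cores(core_loads, threshold):
    """Return the indexes of cores whose individual load exceeds the threshold."""
    return [i for i, load in enumerate(core_loads) if load > threshold]

# Hypothetical readings: one busy core, three mostly idle ones.
loads = [95, 10, 12, 8]

avg = average_load(loads)              # mean of the four readings
alert_on_average = avg > 80            # the averaged condition never fires here
alert_on_any_core = bool(hot_cores(loads, 80))  # a per-core condition does

print(avg, alert_on_average, alert_on_any_core)
```

With these assumed numbers the average stays far below 80% even though one core is nearly saturated, which is why an alert on the averaged value stays silent.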

  • I would still like this to become a built-in feature.  We have a small handful of stack switches where the master is running at 90-99%, but won't trip the built-in average CPU load alert because the other switches in the stack average the CPU load down to around 30%.  Due to recent instability with our Orion instance, we have been banned from testing/using SQL/SWQL based queries or really any customization until our instance is stable.

    Yes, we know, bring up a lab environment and test implementations there.  One problem, the firewall folks blocked all access to any devices outside the lab to test against.  In other words, our lab environment has no active nodes it can access, and no one has been able to convince the powers that be to change that.

  • Has this issue been resolved?  Was anyone able to get the alert working correctly to trigger on a single switch CPU?

    Thank you.

  • FormerMember

    I am surprised this needs to be a feature request.

  • We just got burned by this "new polling feature".

    The old Orion pollers used the SNMP "GET" function, so they always polled only the first CPU in the table.

    The new CPU poller uses the "GET BULK" function.

    Because the average CPU was diluted by 8 other CPUs running normally, the Supervisor CPU going from 20% to 75% utilization hardly changed the overall CPU load of the Cisco 6500.

    Now I'm trying to get this poller working to alert us when things go south on any of our 6500s.

    I have this working in Report Writer, but it still does not work in Alert Manager.

    Here is my Report Writer query which shows any CPU above 25%:

    SELECT
      Nodes.NodeID AS NodeID,
      Nodes.MachineType AS MachineType,
      Nodes.Caption AS NodeName,
      CPUMultiLoad_detail.CPUIndex,
      CPUMultiLoad_detail.AvgLoad AS '5 Minute CPU Load',
      CPUMultiLoad_detail.TimeStampUTC AS PolledTime
    FROM Nodes
    INNER JOIN CPUMultiLoad_detail ON (Nodes.NodeID = CPUMultiLoad_detail.NodeID)
    WHERE
      (DATEDIFF(mi, CPUMultiLoad_detail.TimeStampUTC, SYSUTCDATETIME()) < 5) AND
      (CPUMultiLoad_detail.AvgLoad >= 25) AND
      (Nodes.Type = 'DR')

    And here is the Alert Manager query:

    JOIN CPUMultiLoad_detail ON (Nodes.NodeID = CPUMultiLoad_detail.NodeID)
    WHERE DATEDIFF(mi, CPUMultiLoad_detail.TimeStampUTC, SYSUTCDATETIME()) < 5
    AND CPUMultiLoad_detail.AvgLoad > 25
    AND (Nodes.Type = 'DR')

    The previous SQL queries in this thread were incorrectly querying the CPUMultiLoad table, since the actual table name is CPUMultiLoad_detail.

    But even after fixing that, it is still broken.
    Any help would be appreciated (note I am not interested in having a Reset for this alert).

  • Hi Leon,

    I already tried "when no longer true"; however, the reset action keeps firing while the trigger action never fires. Do you know why? Thanks a lot

  • Your problem is in the reset condition. Think about it:

    You have 4 CPUs which are reading 1%, 20%, 5% and 95%

    Your trigger returns the fact that CPU 4 is > 80%

    But your reset condition returns the fact that CPUs 1, 2 and 3 are < 80, so it resets as soon as it triggers.

    As mentioned elsewhere, SQL queries are notoriously difficult to create reset actions for, so you are best to simply use the "when no longer true" radio button. Along with that, make sure you have a sufficient delay on the trigger so that you don't get repeat alerts if you have a CPU that is sawtoothing (going up and down). Say 15 minutes or so.

    I know that's less than perfect and may lead to repeat alerts. In my environment we use NetCool to de-duplicate alerts (if there's already a ticket open, we don't create another one). But not every environment has that. However given the nature of this alert that's probably about the best you'll be able to do.
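The trigger/reset clash described in this reply can be reproduced with a small model. This is a simplified sketch, not Orion's actual alert engine: it assumes each condition counts as "true" for a node when at least one per-CPU row matches it, and the function names are hypothetical.

```python
def trigger_rows(loads, threshold=80):
    """Rows that satisfy the trigger condition (load > threshold)."""
    return [load for load in loads if load > threshold]

def reset_rows(loads, threshold=80):
    """Rows that satisfy a separately written reset condition (load < threshold)."""
    return [load for load in loads if load < threshold]

def no_longer_true(loads, threshold=80):
    """Reset only when the trigger condition matches no rows at all."""
    return not trigger_rows(loads, threshold)

# The four readings from the reply above: CPUs 1-4 at 1%, 20%, 5% and 95%.
loads = [1, 20, 5, 95]

triggered = bool(trigger_rows(loads))      # CPU 4 matches the trigger
reset_by_query = bool(reset_rows(loads))   # CPUs 1-3 match the reset at the same time
reset_by_negation = no_longer_true(loads)  # stays unset while any CPU is still hot

print(triggered, reset_by_query, reset_by_negation)
```

Under this assumption the hand-written reset condition is satisfied in the same poll as the trigger, whereas resetting only once the trigger stops matching waits until no CPU is above the threshold.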

  • Hi Leon,

    I have tried this one, but it does not seem to work, so I created a case. The reset action keeps triggering when this alert is set up (at that moment, one of the CPU cores in the test switch had reached 99%). Do you have any idea why? Thanks

    Trigger Condition:

    join CPUMultiLoad on Nodes.NodeID = CPUMultiLoad.NodeID
    where DATEDIFF(mi, CPUMultiLoad.TimeStampUTC, SYSUTCDATETIME()) < 5
    and CPUMultiLoad.AvgLoad > 80
    and (Nodes.Monitor = 'y') AND
      (
       (Nodes.Caption LIKE '%rtr%') OR
       (Nodes.Caption LIKE '%cat%') OR
       (Nodes.Caption LIKE '%wan%'))

    Reset Condition:

    join CPUMultiLoad on Nodes.NodeID = CPUMultiLoad.NodeID
    where DATEDIFF(mi, CPUMultiLoad.TimeStampUTC, SYSUTCDATETIME()) < 5
    and CPUMultiLoad.AvgLoad < 80
    and (Nodes.Monitor = 'y') AND
      (
       (Nodes.Caption LIKE '%rtr%') OR
       (Nodes.Caption LIKE '%cat%') OR
       (Nodes.Caption LIKE '%wan%'))

  • Per my note to your question here: http://thwack.solarwinds.com/thread/63999, where I referenced another thread (http://thwack.solarwinds.com/thread/49393), you *can* set up an alert on a single CPU. It would look like this:

    Start a new alert. Set it to be a Node alert. Then change it to be a Custom SQL query (with the sub-type still Node).

    In the box where you can type your query, add:

    join CPUMultiLoad on Nodes.NodeID = CPUMultiLoad.NodeID
    where DATEDIFF(mi, CPUMultiLoad.TimeStampUTC, SYSUTCDATETIME()) < 5
    and CPUMultiLoad.AvgLoad > 85

    This will trigger when any single "CPU" (core or otherwise) is over 85 within the last 5 minutes (i.e., one polling cycle).
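For readers who want to see the selection logic outside of SQL, here is a rough Python model of that condition. It is a sketch under assumptions: the sample tuples stand in for rows of the per-CPU table, `nodes_with_hot_cpu` is a hypothetical name, and the time window is modeled as plain elapsed time, which is not exactly how T-SQL `DATEDIFF(mi, ...)` counts minutes.

```python
from datetime import datetime, timedelta, timezone

def nodes_with_hot_cpu(samples, now, threshold=85, window=timedelta(minutes=5)):
    """Return sorted node IDs having any CPU sample above the threshold within the window.

    Each sample is a (node_id, cpu_index, avg_load, timestamp_utc) tuple.
    """
    hot = set()
    for node_id, cpu_index, avg_load, timestamp_utc in samples:
        recent = now - timestamp_utc < window
        if recent and avg_load > threshold:
            hot.add(node_id)
    return sorted(hot)

# Hypothetical poll results, evaluated at a fixed "now" so the outcome is repeatable.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
samples = [
    (1, 0, 12, now - timedelta(minutes=2)),   # node 1, CPU 0: recent, idle
    (1, 1, 97, now - timedelta(minutes=2)),   # node 1, CPU 1: recent, hot
    (2, 0, 40, now - timedelta(minutes=1)),   # node 2: recent, moderate
    (3, 0, 99, now - timedelta(minutes=30)),  # node 3: hot, but outside the window
]

print(nodes_with_hot_cpu(samples, now))
```

Only the node with a recent sample above the threshold is selected; an old high reading or a recent low reading does not qualify.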