Alert on Nodes that stopped responding to SNMP

Custom alert for nodes that have stopped responding to SNMP:

Use Advanced Alert Manager to create a custom SQL alert on Nodes with this custom SQL query:

SELECT LastSystemUptimePollUtc FROM Nodes WHERE
ObjectSubType='SNMP' AND
DATEDIFF(s, LastSystemUptimePollUtc, GETUTCDATE())>PollInterval


  • We tested this alert for about 3 months in a large (10,000 node) environment and found that it generated a lot of false alarms (around 1-200 per day).

    By "false alarm" I mean that the alert would trigger even though we had CPU, RAM, etc. data for the node for that time period.

    I'm not sure of the cause. The pollers may have been getting behind, LastSystemUptimePollUtc may not have been updated even when data was being collected, or the data was collected but the database was behind, so we saw the alert and the data for that time period was written into the database afterwards.

    In any case, we found it was better to check for the most recent entry in the CPU table and alert when it was between 30 and 120 minutes old:

    SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
    FROM Nodes
    LEFT JOIN (SELECT CPULoad_Detail.NodeID, MAX(CPULoad_Detail.DateTime) AS LastCPU
               FROM CPULoad_Detail
               GROUP BY CPULoad_Detail.NodeID) c1 ON Nodes.NodeID = c1.NodeID
    WHERE
    Nodes.Status = 1
    AND Nodes.Unmanaged = 0
    AND Nodes.ObjectSubType = 'SNMP'
    AND DATEDIFF(mi, c1.LastCPU, GETDATE()) > 30
    AND DATEDIFF(mi, c1.LastCPU, GETDATE()) < 120
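
    One side effect of the query above is worth checking before relying on it: a node with no rows at all in CPULoad_Detail gets a NULL LastCPU, both DATEDIFF comparisons evaluate to unknown, and the node never matches. A minimal sketch for listing those nodes, using only the tables and columns already referenced in this thread (not tested against any particular Orion version):

    ```sql
    -- Up, managed SNMP nodes that have no CPU samples at all.
    -- The 30-120 minute window check can never fire for these nodes.
    SELECT Nodes.NodeID, Nodes.Caption
    FROM Nodes
    WHERE Nodes.Status = 1
      AND Nodes.Unmanaged = 0
      AND Nodes.ObjectSubType = 'SNMP'
      AND NOT EXISTS (SELECT 1
                      FROM CPULoad_Detail
                      WHERE CPULoad_Detail.NodeID = Nodes.NodeID);
    ```

    Any node returned here would need a different staleness signal, such as the last uptime poll time.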

  • Hi Alex,

    Any chance you could post the underlying SQL for that alert?

    Thanks,

    Tony

  • Hi Tony,

    This must be driven by SQL, for sure, but I don't know how to get it. You can probably ask some SQL gurus here...

    Why do you need SQL anyway? The above works perfectly fine the way it is...

  • I like it Leon.   Plagiarizing for my instances!   eerrr...  I mean, copy/paste...  eerrr...availing myself of your good idea. 

    Okay - none of that sounds good.    Doing what all of us do anyway.  Thanks!

  • Alex - You can get the trigger query by going into the SolarWinds db, opening the dbo.AlertDefinitions table, and searching for the name of the alert you created. There is a column in that table called "Trigger Query"; it holds the underlying SQL query generated for your alert definition.
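
    The lookup described above could be sketched roughly as follows. The column names AlertName and TriggerQuery are assumptions (the reply only mentions a column called "Trigger Query"), and the LIKE pattern is a hypothetical filter, so check the table's actual columns first and adjust:

    ```sql
    -- Assumed column names: AlertName, TriggerQuery.
    -- If the column name really contains a space, use [Trigger Query] instead.
    SELECT AlertName, TriggerQuery
    FROM dbo.AlertDefinitions
    WHERE AlertName LIKE '%SNMP%';  -- hypothetical filter on the alert's name
    ```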

  • "good artists copy but great artists steal"

    - Pablo Picasso

  • Hey Alex,

    I like to be able to test the SQL when creating a new alert, just as a sanity check. Especially with multiple nested conditions, it helps me understand precisely which objects the alert will trigger against before I enable it. With NPM 10.6+ you can go to Settings, Manage Advanced Alerts, edit the alert, and view the read-only properties, which give you the SQL for the trigger and reset actions.

    Thanks

    Tony

  • We took an entirely different approach that works great for us. We made a SAM monitor that executes a PowerShell script to request sysUpTime over SNMP. If we get a valid response, the monitor is up. If we don't, we use the new sustained threshold in SolarWinds to look for 3 consecutive failures. What I found with the other methods was that they didn't seem to cover all scenarios (some nodes don't report CPU data).

  • It's interesting to see all the different ways people approach this. We have had such an alert for a while now, and it has been really useful for catching servers that have had their SNMP string changed, etc.

    We do the following:

    SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name

    From Nodes

    WHERE

    (

         (LastSystemUpTimePollUtc < DATEADD(minute, -120, GETUTCDATE())) AND

         (Nodes.Status = 1)

    )

    If the node is up but hasn't been polled for more than 2 hours it fires an alert. We've not had any issues with this sending false positives either.
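
    To sanity-check what this condition would match before enabling the alert, the same WHERE clause can be run by hand with the poll age shown alongside each node. This is a sketch that uses only the columns already named in this thread; the MinutesSinceLastPoll alias is added here purely for illustration:

    ```sql
    -- Preview of nodes the 2-hour condition would fire on, oldest poll first.
    SELECT Nodes.NodeID AS NetObjectID,
           Nodes.Caption AS Name,
           Nodes.LastSystemUpTimePollUtc,
           DATEDIFF(minute, Nodes.LastSystemUpTimePollUtc, GETUTCDATE()) AS MinutesSinceLastPoll
    FROM Nodes
    WHERE Nodes.Status = 1
      AND Nodes.LastSystemUpTimePollUtc < DATEADD(minute, -120, GETUTCDATE())
    ORDER BY MinutesSinceLastPoll DESC;
    ```

    Nodes whose LastSystemUpTimePollUtc is NULL do not appear, because a comparison against NULL is never true.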