Alert on Nodes that stopped responding to SNMP

michal.hrncirik over 9 years ago

Custom alert for nodes that have stopped responding to SNMP:

Use Advanced Alert Manager to create a custom SQL alert on Nodes with this custom SQL query:

SELECT LastSystemUptimePollUtc FROM Nodes WHERE
ObjectSubType='SNMP' AND
DATEDIFF(s, LastSystemUptimePollUtc, GETUTCDATE())>PollInterval

0 adatole over 9 years ago

We tested this alert out for about 3 months in a large (10,000 node) environment and found that it was generating a lot of false alarms (around 1-200 per day).
By "false alarm" I mean that this alert would trigger, but we would have CPU, RAM, etc. data for the node for that time period.
I'm not sure of the cause. The pollers may have been getting behind, LastSystemUptimePollUtc may not have been updated even though data was being collected, or the database may have been behind, so that we saw the alert first and the data for that time period was written afterwards.
In any case, we found it was better to check for a recent entry in the CPU table, and to alert when the most recent entry was between 30 and 120 minutes old:
SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
FROM Nodes
LEFT JOIN (SELECT CPULoad_Detail.NodeID, MAX(CPULoad_Detail.DateTime) AS LastCPU
           FROM CPULoad_Detail
           GROUP BY CPULoad_Detail.NodeID) c1 ON Nodes.NodeID = c1.NodeID
WHERE
    Nodes.Status = 1
    AND Nodes.Unmanaged = 0
    AND Nodes.ObjectSubType = 'SNMP'
    AND DATEDIFF(mi, c1.LastCPU, GETDATE()) > 30
    AND DATEDIFF(mi, c1.LastCPU, GETDATE()) < 120
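
For anyone who wants to reason about the window condition outside the database, here is a rough Python sketch of the same check. It is only an illustration, not anything SolarWinds ships: the names datediff_minutes and in_alert_window are made up for this example. It assumes DATEDIFF(mi, ...) counts minute boundaries crossed (as T-SQL does), and it shows that a node with no CPU rows at all (LastCPU is NULL after the left join) never matches.

```python
from datetime import datetime
from typing import Optional


def datediff_minutes(start: datetime, end: datetime) -> int:
    """Count minute boundaries crossed, like T-SQL DATEDIFF(mi, start, end)."""
    start_floor = start.replace(second=0, microsecond=0)
    end_floor = end.replace(second=0, microsecond=0)
    return int((end_floor - start_floor).total_seconds() // 60)


def in_alert_window(last_cpu: Optional[datetime], now: datetime,
                    min_age: int = 30, max_age: int = 120) -> bool:
    """True when the newest CPU sample is more than min_age and less than max_age minutes old."""
    if last_cpu is None:
        # NULL in SQL: both DATEDIFF comparisons are unknown, so the row is filtered out.
        return False
    age = datediff_minutes(last_cpu, now)
    return min_age < age < max_age
```

The upper bound means a node drops out of the result set once its last CPU sample is 120 minutes old or more, so the alert covers a limited window rather than every silent node.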

0 AlexSoul over 9 years ago

How about this approach?

0 tony.johnson over 9 years ago

Hi Alex,
Any chance you could post the underlying SQL for that alert?
Thanks,
Tony

0 AlexSoul over 9 years ago

Hi Tony,
This must be driven by SQL, for sure, but I don't know how to get at it. You can probably ask some of the SQL gurus here...
Why do you need the SQL anyway? The above works perfectly fine the way it is...

0 rstoney00 over 9 years ago

I like it Leon. Plagiarizing for my instances! eerrr... I mean, copy/paste... eerrr...availing myself of your good idea.
Okay - none of that sounds good. Doing what all of us do anyway. Thanks!

0 automag928 over 9 years ago

Alex - You can get the trigger query by going into the SolarWinds database, opening the dbo.AlertDefinitions table, and searching for the name of the alert you created. There is a column in that table called "Trigger Query"; that is the underlying SQL query generated for your alert definition.

0 adatole over 9 years ago

"good artists copy but great artists steal"
- Pablo Picasso

0 tony.johnson over 9 years ago

Hey Alex,
I like to be able to test the SQL when creating a new alert, just as a sanity check. Especially with multiple nested conditions, it helps me understand precisely which objects the alert will trigger against before I enable it. With NPM 10.6+ you can go to Settings, Manage Advanced Alerts, edit the alert, and view the read-only properties, which give you the SQL for the trigger and reset actions.
Thanks
Tony

0 mdriskell over 9 years ago

We took an entirely different approach that works great for us. We made a SAM monitor that executes a PowerShell script to request sysUpTime over SNMP. If we get a valid response, the monitor is up. If we don't, we use the new sustained threshold feature in SolarWinds to look for 3 consecutive failures. What I found with the other methods was that they didn't seem to cover all scenarios (some nodes don't report CPU data).
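
The "3 consecutive failures" idea can be sketched in a few lines of Python. This is not how SolarWinds implements sustained thresholds internally; it is a minimal illustration, and the class name ConsecutiveFailureTracker is hypothetical. The point is that one missed poll does nothing, and any valid response resets the count.

```python
class ConsecutiveFailureTracker:
    """Flags a node only after `threshold` failed polls in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, poll_succeeded: bool) -> bool:
        """Record one poll result; return True while the alert condition holds."""
        if poll_succeeded:
            self.failures = 0  # any valid sysUpTime response resets the streak
        else:
            self.failures += 1
        return self.failures >= self.threshold
```

Feeding it one result per poll, the return value only becomes True on the third failure in a row, which is what filters out isolated timeouts.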

0 syncopix over 9 years ago

It's interesting to see all the different ways people approach this. We have had such an alert for a while now and it has been really useful to catch servers that have their SNMP string changed etc.
We do the following:
SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name
From Nodes
WHERE
(
(LastSystemUpTimePollUtc < DATEADD(minute, -120, GETUTCDATE())) AND
(Nodes.Status = 1)
)
If the node is up but hasn't been polled for more than 2 hours it fires an alert. We've not had any issues with this sending false positives either.
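
As a rough illustration of what that WHERE clause selects, here is the same filter in Python. The Node tuple and the function name stale_up_nodes are invented for this sketch, and status 1 is taken to mean "up" as described above; a node whose last poll time is missing is skipped, the same way a NULL never satisfies the "<" comparison in SQL.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Optional


class Node(NamedTuple):
    node_id: int
    caption: str
    status: int  # 1 = up, matching Nodes.Status = 1 in the query
    last_uptime_poll_utc: Optional[datetime]


def stale_up_nodes(nodes: Iterable[Node], now_utc: datetime,
                   max_age: timedelta = timedelta(minutes=120)) -> List[Node]:
    """Return nodes that are up but whose last uptime poll is older than max_age."""
    cutoff = now_utc - max_age
    return [
        n for n in nodes
        if n.status == 1
        and n.last_uptime_poll_utc is not None  # NULL never satisfies '<' in SQL
        and n.last_uptime_poll_utc < cutoff
    ]
```

Passing a much smaller max_age, such as a node's poll interval, approximates the stricter check from the original post, which is presumably more sensitive to pollers running slightly late.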
