cancel
Showing results for 
Search instead for 
Did you mean: 
Create Post

Alert on Nodes that stopped responding to SNMP

Custom alert for Nodes which stopped respond to SNMP:

use Advanced Alert Manager and create custom SQL alert on Nodes with this custom SQL Query:

SELECT LastSystemUptimePollUtc FROM Nodes WHERE

ObjectSubType='SNMP' AND

DATEDIFF(s, LastSystemUptimePollUtc, GETUTCDATE())>PollInterval


pastedImage_0.png

Labels (2)
Comments

We tested this alert out for about 3 months in a large (10,000 node) environment and found that we were cutting a lot of false alarms (around 1-200 per day).

By "false alarm" I mean that this alert would trigger but we would have CPU, RAM, etc data for the node for that time period.

I'm not sure if it was because the pollers were getting behind, or because LastSystemUptimePollUTC wasn't getting updated even when data was being collected, OR that data was being collected but the database was behind, so we saw the alert and subsequently the data for that time period was written into the database.

In any case, we found it was better to check for a recent entry in the CPU table, and alert when that was absent from 30 to 120 minutes ago:

SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name

From Nodes

left join (select CPULoad_Detail.NodeID, MAX(CPULOad_Detail.DateTime) as LastCPU

        from CPULOad_Detail

        group by CPULoad_Detail.NodeID) c1 on Nodes.NodeID = c1.NodeID

where

nodes.status = 1

and nodes.Unmanaged = 0

and nodes.ObjectSubtype = 'SNMP'

and DateDiff(mi, c1.LastCPU, getdate()) > 30

and DateDiff(mi, c1.LastCPU, getdate()) < 120

How about this approach?

Capture.JPG

Hi Alex,

Any chance you could post the underlying SQL for that alert?

Thanks,

Tony

Hi Tony,

THis must be driven by SQL, for sure, but how to get this I don't know. You can probably ask some SQL gurus here...

Why do you need SQL anyway? The above works perfectly fine the way it is...

I like it Leon.   Plagiarizing for my instances!   eerrr...  I mean, copy/paste...  eerrr...availing myself of your good idea. 

Okay - none of that sounds good.    Doing what all of us do anyway.  Thanks!

Alex - You can get the trigger query by going into the Solarwinds db, opening up the dbo.AlertDefintions table and searching for the alert name for that alert you have created there.  There is a column in that table called "Trigger Query", that is the underlying SQL querying created for your alert defintion

"good artists copy but great artists steal"

- Pablo Picasso

Hey Alex,

I like to be able to test the sql when creating an new alert just as a sanity check, especially with multiple nested conditions etc it help me understand precisly which objecte the alert will trigger against before i enable it. With NPM 10.6+ you can go to settings, manage advanced alerts, edit the alert and view the read only properties which gives you the SQL for the trigger and reset actions.

Thanks

Tony

We took an entirely different approach that works great for us.  We made a SAM monitor.  It executes a powershell script that calls the sysuptime from SNMP.  If we get a valid response the monitor is up.  If we don't we utilize the new sustained threshold in solarwinds to look for 3 consecutive failures.  What I was finding with the other methods was that they didn't seem to cover all scenarios (some nodes don't have CPU that report back). 

It's interesting to see all the different ways people approach this. We have had such an alert for a while now and it has been really useful to catch servers that have their SNMP string changed etc.

We do the following:

SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name

From Nodes

WHERE

(

     (LastSystemUpTimePollUtc < DATEADD(minute, -120, GETUTCDATE())) AND

     (Nodes.Status = 1)

)

If the node is up but hasn't been polled for more than 2 hours it fires an alert. We've not had any issues with this sending false positives either.

Cool, thanks, I didn't know that. Always learning...

Here it is:

SELECT DISTINCT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name FROM Nodes WHERE ( ( (Nodes.Status <> '9') AND (Nodes.Status <> '11')) AND (DATEDIFF(ss, Nodes.LastSync, getdate())/Nodes.PollInterval > 5) AND (Nodes.ObjectSubType = 'SNMP') )  

Interesting... I have picked up 3 nodes with this SQL, which was not picked up by the one I used earlier (above). Thank you for sharing

Two of them have had string changed indeed.

The other one (with "Microsoft Windows Embedded Standard" OS) just doesn't return CPU or Memory. Now, why I do not have CPU/Memory on this box for more than 2 month is a separate issue, but your SQL did a great job digging it out

Thank you

syncopix you have inspired me to improve this alert. The slight issue I had with your version is that if some nodes are specifically configured to be polled every 2 hours or more - this will trigger an alert. This is rare case, but is still possible. Instead, I have combined yours one and my one together - if there were no polls for the last 5 tries - it will fire an alert (will work equally good for any poll time)

WHERE (

Nodes.Status = '1' AND

((DATEDIFF(ss, Nodes.LastSystemUpTimePollUtc, getUTCdate()) / Nodes.PollInterval) > 5) AND

Nodes.ObjectSubType <> 'ICMP'

)

--

Alex

Awesome! Thank you for sharing.

lol, that's a story about Steve's MAC ... Steve Jobs' 2005 Stanford Commencement Address - YouTube

Leon, what is the reason for filtering out all that < 120 minutes?

Mike, do you mean "LastSystemUpTimeUtc"? What exactly are you checking?

Just wanted to "chime in" here & say thanks to Alex Slv for putting together the query shown above. I've been looking for something along these lines for a while & 'stumbled' across this this morning...

Testing this in our environment this morning & it's happily reporting, accurately, on devices that are displaying issues! Brilliant!

So... Thanks Alex, nice one!

Thanks danielgilbey for your feedback - you made my morning ... possibly the whole day

I have improved it since then and have combined both checks together - availability of CPU readings as well as Last Poll Time. I found that it gives more accurate results that way.

Here it is:

SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name, Nodes.NodeID

FROM Nodes

------------------------------------------------------------

LEFT JOIN

(SELECT CPULoad_Detail.NodeID, MAX(CPULOad_Detail.DateTime) as LastCPU

   FROM CPULOad_Detail

   GROUP BY CPULoad_Detail.NodeID) c1 on Nodes.NodeID = c1.NodeID

(

Nodes.Status = 1 OR

Nodes.Status = 3

) AND

Nodes.Unmanaged = 0 AND

Nodes.ObjectSubType <> 'ICMP' AND

-- Nodes.ObjectSubtype = 'SNMP' AND -- in case you want SNMP only

(

(DATEDIFF(ss, c1.LastCPU, getdate()) / Nodes.PollInterval > 15 ) OR

(DATEDIFF(ss, Nodes.LastSystemUpTimePollUtc, getUTCdate()) / Nodes.PollInterval > 15)

)

Just copy-paste directly into your Alerts and Report Writer. If you will then add resource "Custom Report" here is what you will get (we have just upgraded firmware and for some reason lost CPU readings via SNMP for iDRAC - good example here):

Capture.JPG

CREDITS TO michal.hrncirik, syncopix and Leon Adato

Thank you all,

Alex

Incorrect syntax near "=".

Line 15 ( Nodes.Status = 1 OR )

I believe a WHERE statement is missing:

SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name, Nodes.NodeID

FROM Nodes

LEFT JOIN

(SELECT CPULoad_Detail.NodeID, MAX(CPULOad_Detail.DateTime) as LastCPU

   FROM CPULOad_Detail

   GROUP BY CPULoad_Detail.NodeID) c1 on Nodes.NodeID = c1.NodeID

WHERE

(Nodes.Status = 1 OR

Nodes.Status = 3) AND

Nodes.Unmanaged = 0 AND

Nodes.ObjectSubType <> 'ICMP' AND

-- Nodes.ObjectSubtype = 'SNMP' AND -- in case you want SNMP only

(

(DATEDIFF(ss, c1.LastCPU, getdate()) / Nodes.PollInterval > 15 ) OR

(DATEDIFF(ss, Nodes.LastSystemUpTimePollUtc, getUTCdate()) / Nodes.PollInterval > 15)

)

Thanks for this.

Any way to make it into an alert in 11.5 now that it's web based?  I can't find a place to put in SQL.

Hope you found your answer by now, if not (and for the sake of others that stumble across this thread) here is the place to enter/show SQL for a Web-Based Alert (SWQL is now an option and what you probably should start using since SolarWinds has and will make schema changes to the SQL database and may not always be as nice as they were with the 11.5 upgrade to create SQL Views for tables that no longer exist.)

The SQL/SWQL option is actually in the same place it was with Advanced Alerts.  When creating the Trigger Condition just hit the "I want to alert on:" drop-down box and go all the way to the bottom.  See screenshot below:

pastedImage_1.png

Jordan  Hume

Field Systems Engineer

Loop1 Systems, Inc.

Thanks. I actually hadn't found this, but made it part of our routine to check the custom query box on our summary page. An alert will be more useful to stir others into action.

Be careful with default reset statement logic. I found that the below structure will work for almost all my custom SQL alerts:

ALERT:

WHERE

<SQL TRIGGER>

RESET:

WHERE NodeID NOT IN

(

SELECT DISTINCT NodeID

FROM Nodes

WHERE

<SQL TRIGGER>

)

Works!! Thanks for sharing.

Works! Used it with snmp response.

Thanks Alex. This works!!!

Thanks to everyone for the ideas.  This is working great in report writer, not so much as an alert.  When I try to use this query as an alert, it fires on every node in the environment. I'm thinking there has to be something basic that I'm missing.  Any pointers would be welcome.  Thanks!

What time frame are you allowing for the alert criteria to be met ? In the instance that I have (for this) I've set it quite 'aggressively' in so much that the condition needs to exist for a couple of hours before the alert fires (in order to rule out network "blips").

Perhaps you can share a screen shot of the alert criteria you have in place?

Thanks for the reply Daniel.  I had it set to alert after 30 minutes, precisely to avoid the blips you mentioned.  I got a query by alexslv from upthread to work as expected:

pastedImage_0.png

The latest query provided by jspanitz‌ was firing on all nodes after the 30 minutes (and yes, it did warn me before saving, and thankfully disabling the alert prevented all 800 some-odd emails from getting sent):

pastedImage_2.png

pastedImage_1.png

And yet when placed in report writer, works as expected.

Here is a SWQL Query that finds Nodes whose Last System UpTime Poll has been unsuccessful for over 60 minutes.  Adjust to your liking.  As stated in a previous reply, you should get in the habit of using SWQL instead of SQL for any custom reports/alerts (when possible of course) since SolarWinds can and will change the database schema at any time, potentially breaking any custom SQL queries you may have.  They were nice enough to create a SQL view to combine the multiple Nodes tables they created in their last schema change (they broke the old Nodes table up into 3 or 4 separate tables and did away with the Nodes table completely during the NPM 11.5 update), but they may not always do that in the future, so use SWQL as much as possible.

This will return all nodes that are Up and not responding to SNMP/WMI polls.

SELECT

n.Caption AS Node,

e.ServerName AS PollingEngine,

n.NodeID,

n.ObjectSubType AS PollingMethod,

n.Status,

n.ChildStatus,

n.DetailsUrl,

n.IPAddress,

n.MachineType,

TOLOCAL(n.LastSystemUpTimePollUtc) AS LastUptimeLocal

FROM Orion.Nodes n

JOIN Orion.Engines e ON n.EngineID = e.EngineID

WHERE n.ObjectSubType <> 'ICMP' AND n.Status = 1

AND MINUTEDIFF(TOLOCAL(n.LastSystemUpTimePollUtc), GETDATE()) > 60

Good advice about using SWQL for custom queries. One question... it looks like I could change the "n.Status=1" to a value of 2, and then I could use the query to display a list of nodes that have been down for some period of time (in this case 10 minutes), right?

Yes, although there is probably a better value to filter on then LastSystemUpTimePollUtc, since node status is based on ping response, not SNMP response (unless you've set your nodes to base their Status on SNMP, which isn't the default setting).

Good point. I am still using ICMP as default status polling. So, do we know what field should be used? What I am ultimately after is a report that will show nodes that have not polled up in some designated number of hours or days. That data would then be used to determine "stale" devices that could be deleted from the database.

I think I may have found what I need re: the last successful status poll. In the NetObjectDowntime table are the "DateFromTime" and "DateTimeUntil" fields. In the screenshot you can see one node that I purposely took down a little earlier today. The DateFromTime shows when the node went down, while the DateTimeUntil field shows NULL because it is still down. So, it seems that my report or query should filter on nodes that have a NULL value for the DateTimeUntil field. I just need to configure the query to calculate the time differential between the two fields.

Agreed? If so, can you help with the SQL/SWQL syntax?

Thanks2015-12-10_12-53-27.jpg

Humejo or anyone?

There are several customization's for Node Down with Duration.  Just search for the term NODE DOWN DURATION in Thwack and you'll get a good amount of results.  No need to reinvent the wheel.

  This post right here is an old Report Writer style report, but you can open it up and view the SQL and see how they did it (although I myself would create a SWQL version instead for reasons already stated in a previous reply):  NODES_Currently_Down_with_Duration_.OrionReport

Hello,

I want to report only for nodes that are not responding to SNMP, I don't need the CPU part. In that case can anyone help with the updated query?

Not good with SQL so requesting for this.

Also my polling method is ICMP and not SNMP...

Any reason why SNMP polling is not happening even though the node is up ? I can understand the pro-active measures taken in order to find the nodes that is up but not being polled for X minutes , But any reason why it is not being polled when it is up ?

Version history
Revision #:
1 of 1
Last update:
‎05-26-2014 07:18 AM
Updated by: