
NPM Alert on Node UP (Ping) SNMP DOWN (unknown / unresponsive)

jspanitz over 10 years ago

As many are aware, one of the shortcomings of Orion is its inability to correctly show a node that is up but whose data is no longer being collected (SNMP issues). We are trying to set up alerts for this condition, but first want to verify how the current version of Orion handles it.

The two advanced alerts we are looking at are:

Alert me when a managed node has not been polled during the last 5 tries
Alert me when a managed node last poll time is 10 minutes old

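Reading through thwack, it looks as if these two alerts really won't do what we need, as they seem to use PING status and not SNMP status, as per the threads below:

Alert me when a managed node last poll time is 10 minutes
NPM does not alert when a Windows device goes unresponsive when polling method is SNMP

We just want to be sure this is still the case in NPM 10.7. Do we still need to use custom SQL to do this?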

Top Replies

rob.hock over 10 years ago

John,
In 10.7, we did introduce the ability to determine status via SNMP response. Run "List Resources" on the node and you should see the option. You can also set it in bulk via the new "Manage Pollers" interface.

jspanitz over 10 years ago in reply to rob.hock

rob.hock thanks for pointing that out. We did see the option, but it presents another issue for us. If we use just SNMP, the status is still as inaccurate as using just ICMP, as SNMP not responding does not mean the node is down. Is there a way to have our cake and eat it too?

rob.hock over 10 years ago in reply to jspanitz

John,
Not at present, sir, as it's a binary option. It is certainly something we will bear in mind for a future release.

jspanitz over 10 years ago in reply to rob.hock

Put in a feature request here - http://thwack.solarwinds.com/ideas/3422

adatole over 10 years ago
While I am excited for the new "up/down via SNMP" feature, I don't think it will ultimately resolve anything. Do up/down via SNMP and inevitably some group will say that it isn't as clear as ping whether it's up or down. Do it by ping and you get the opposite. You need both and you can have both - even under pre-10.7.
Let everything continue to give status via PING, and add this alert:
Type of property: Custom SQL
Trigger Query: Node
(after the initial "Select Nodes.NodeID as NetObjectID..." stuff)
left join (select CPULoad_Detail.NodeID, MAX(CPULoad_Detail.DateTime) as LastCPU
           from CPULoad_Detail
           group by CPULoad_Detail.NodeID) c1
    on Nodes.NodeID = c1.NodeID
where Nodes.Status = 1
  and Nodes.UnManaged = 0
  and DATEDIFF(mi, c1.LastCPU, getdate()) > 30
  and DATEDIFF(mi, c1.LastCPU, getdate()) < 120
For those who don't read SQL, what this is saying is:
- Grab the LAST (i.e., most recent) CPULoad collection for each node
- Trigger the alert for any node where
  - the node is UP
  - AND the node is MANAGED
  - AND the last CPU collection is older than 30 min
  - AND the last CPU collection is younger than 120 min
That last bullet point is there to avoid re-triggering alerts for stuff that is anciently out of date. You can build a report for that.
The key here is to ensure enough of a delay on the alert trigger. Otherwise, when a device that has been down for more than 30 minutes comes back, it would trigger this alert (i.e., ping shows it's UP, but in the first 2-5 minutes it may very well not have any SNMP data collected yet). We set up a 15-minute delay on ours, so that we don't cut a ticket unless a device has been out of date for 45 minutes.
THIS alert, along with regular ping, lets you know the true status of devices.
Hope it helps.
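As a rough illustration of the logic described in the post above (not SolarWinds code), the alert window can be sketched in Python. The function name `classify_node`, the constant names, and the return labels are hypothetical; only the 30 and 120 minute thresholds and the `Status = 1` / `UnManaged = 0` checks come from the query. Elapsed time is compared directly here, which only approximates T-SQL `DATEDIFF(mi, ...)`, since that function counts minute boundaries crossed.

```python
from datetime import datetime, timedelta
from typing import Optional

# Thresholds taken from the query in the post; the names are illustrative.
STALE_AFTER = timedelta(minutes=30)    # older than this: SNMP data is stale
IGNORE_AFTER = timedelta(minutes=120)  # older than this: leave it to a report

NODE_UP = 1  # the Nodes.Status value the query treats as "up"


def classify_node(status: int, unmanaged: bool,
                  last_cpu: Optional[datetime], now: datetime) -> str:
    """Return 'alert', 'report', or 'none' for a single node."""
    if status != NODE_UP or unmanaged:
        return "none"  # down or unmanaged nodes are out of scope here
    if last_cpu is None:
        # No CPU rows at all (e.g. an ICMP-only node). In the SQL, the
        # left join yields NULL and the DATEDIFF comparisons match nothing.
        return "none"
    age = now - last_cpu
    if STALE_AFTER < age < IGNORE_AFTER:
        return "alert"   # up by ping, but no recent SNMP-collected data
    if age >= IGNORE_AFTER:
        return "report"  # anciently out of date: report, don't re-alert
    return "none"        # CPU data is recent enough


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    for minutes in (5, 45, 300):
        print(minutes, classify_node(NODE_UP, False,
                                     now - timedelta(minutes=minutes), now))
```

The 15-minute trigger delay mentioned in the post is configured on the alert itself and is not modeled in this sketch.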

thamizh85 over 9 years ago in reply to adatole

I wish I got this answer earlier. I was looking for the exact same solution before.
http://thwack.solarwinds.com/message/189391#189391

FormerMember over 9 years ago in reply to rob.hock

Hi rob.hock
In my case, the issue is that the interface on the following device is up, but somehow the neighborship is not there. Is there any way to get an alert for this issue?
Below is the screen shot.
Thanks.

automag928 over 9 years ago in reply to adatole

Leon Adato - nice job! Great idea to query a table that needs snmp to update. I've literally just implemented it as a new alert for devices we have issues with snmp zonking out on, but icmp remaining up - and of course no alert going out.

mwb over 9 years ago in reply to adatole
Can anyone lend a hand getting this cleanly into SWQL format? Runs great as a SQL query for reports, but what I'd love to do is stick it on the dashboard with a custom SWQL query. I've tried my hand at it but something isn't formatted quite right.
SELECT
    a.NodeID,
    a.Caption,
    a.Status,
    a.MachineType,
    MAX(c.DateTime) as Last_Poll
FROM Orion.Nodes a
JOIN Orion.CPULoad(nolock=true) c ON a.NodeID = c.NodeID
where a.Status = 1
  and a.UnManaged = 0
  and MinuteDIFF((MAX(c.DateTime)), GetDate()) > 30

mwb over 9 years ago in reply to mwb

SWQL query for those that may want to add this to a dashboard:
Assistance with SWQL Query for Node UP/SNMP DOWN