I want to configure an alert that checks whether SNMP is working on all nodes and sends an email identifying any problematic node.
I've tried to do this and had a few discussions with MVPs, and the end result was that the best way to do this is by monitoring/reporting on 'Nodes.LastSystemUpTimePollUtc'. However, so long as ICMP is still working we don't deem this a critical issue. So, the other side of this coin is monitoring whether an APE fails in some manner. As we only have NPM (and not SAM) we have taken a two-fold approach. For the APEs we have an alert using custom SWQL [happy to share if needs be] that looks at the percentage of nodes not responding and, if it is greater than X, triggers and sends an email to a team of folks who can respond and check. The other side is a daily report that shows all the nodes that haven't collected for the prior 3 polls. The report is sent on a schedule once a day.
@stuartd thanks, yes, please share the custom SWQL for the APEs. That would be very helpful.
Set your trigger condition to be like this:
Then in the bottom box copy/paste the following:
RIGHT JOIN (Select EngineID, brokepercent
from (
select sum(broken.total) as total
, sum(broken.count_broken) as broke
, sum(broken.count_broken)*100/sum(broken.total) as brokepercent
, broken.engineID
from (
Select
Case
When minutediff(lastsystemuptimepollutc, getutcdate())>=5*StatCollection then 1
ELSE 0
END AS count_broken
, n.EngineID
, 1 as total
FROM orion.nodes n
WHERE objectsubtype<>'ICMP' and status<>2 and status<>9 and status<>11 and status<>12
) as broken
group by broken.engineID
order by broken.engineID
) as counts
where counts.brokepercent > 60
) broke on broke.EngineID = engines.EngineID
The figures you may want to adjust are the 60 in the last but one line and the 5 in the WHEN minutediff line. What this is essentially saying is: if more than 60% of the non-ICMP nodes on any one APE have not had a successful poll for 5 or more polling intervals, then trigger.
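To make the threshold arithmetic easier to follow, here is a minimal Python sketch of what the query appears to compute per polling engine. It is only an illustration, not something to paste into Orion: the dictionary keys, the function names and the sample data are all made up for the example, elapsed minutes are used as an approximation of MINUTEDIFF, and the percentage is kept as a float, so rounding may differ from what SWQL returns.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Status codes the alert query filters out (status<>2, 9, 11, 12).
EXCLUDED_STATUSES = {2, 9, 11, 12}


def broken_percent_by_engine(nodes, now, missed_polls=5):
    """Return {engine_id: percent of counted nodes with a stale last poll}."""
    totals = defaultdict(int)
    broken = defaultdict(int)
    for node in nodes:
        # Mirror the WHERE clause: skip ICMP-only nodes and excluded statuses.
        if node["object_sub_type"] == "ICMP" or node["status"] in EXCLUDED_STATUSES:
            continue
        engine = node["engine_id"]
        totals[engine] += 1
        # Approximation of MINUTEDIFF(LastSystemUpTimePollUtc, GETUTCDATE()).
        minutes_since_poll = (now - node["last_poll_utc"]).total_seconds() // 60
        # Stale if the last poll is at least `missed_polls` intervals old.
        if minutes_since_poll >= missed_polls * node["stat_collection"]:
            broken[engine] += 1
    return {engine: broken[engine] * 100 / totals[engine] for engine in totals}


def engines_over_threshold(percentages, threshold=60):
    """Engines whose broken percentage is strictly greater than the threshold."""
    return sorted(e for e, pct in percentages.items() if pct > threshold)


# Hypothetical sample data: engine 1 has 2 of 3 nodes stale, engine 2 has 1 of 2.
NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
STALE = NOW - timedelta(hours=2)
FRESH = NOW - timedelta(minutes=5)
SAMPLE_NODES = [
    {"engine_id": 1, "object_sub_type": "SNMP", "status": 1, "stat_collection": 10, "last_poll_utc": STALE},
    {"engine_id": 1, "object_sub_type": "SNMP", "status": 1, "stat_collection": 10, "last_poll_utc": STALE},
    {"engine_id": 1, "object_sub_type": "SNMP", "status": 1, "stat_collection": 10, "last_poll_utc": FRESH},
    {"engine_id": 2, "object_sub_type": "SNMP", "status": 1, "stat_collection": 10, "last_poll_utc": STALE},
    {"engine_id": 2, "object_sub_type": "WMI", "status": 3, "stat_collection": 10, "last_poll_utc": FRESH},
    # Ignored: ICMP-only node and a node in an excluded status.
    {"engine_id": 2, "object_sub_type": "ICMP", "status": 1, "stat_collection": 10, "last_poll_utc": STALE},
    {"engine_id": 2, "object_sub_type": "SNMP", "status": 9, "stat_collection": 10, "last_poll_utc": STALE},
]

if __name__ == "__main__":
    percentages = broken_percent_by_engine(SAMPLE_NODES, NOW)
    print(percentages)
    print(engines_over_threshold(percentages))
```

In this made-up data only engine 1 crosses the 60 threshold, since engine 2 sits at exactly half once the ICMP-only and excluded-status nodes are dropped. Raising missed_polls or threshold here corresponds to changing the 5 and the 60 in the query.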
The query below will help you create an alert for when a node is not responding to SNMP, WMI or Agent polling.
SELECT Nodes.Uri, Nodes.DisplayName FROM Orion.Nodes AS Nodes
WHERE Nodes.Status IN(1,3,14) AND Nodes.ObjectSubType<>'ICMP' AND MINUTEDIFF(Nodes.LastSystemUpTimePollUtc,GETUTCDATE())>(Nodes.StatCollection*6)