I have "Node not polled in last 10 minutes" alerts enabled in our environment. For the last few days I have been receiving around 700 alerts every minute because one polling engine is showing up as down and the last database sync is also having some issues.
Can you please tell me how this can be fixed?
The problem isn't with the alert; it's with your polling engine. The alert is correctly telling you that the nodes on that polling engine are not being polled properly, because the polled data isn't being written back into the database.
You need to repair the APE or move the Nodes to a polling engine that is working.
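To confirm that the flood really traces back to one engine rather than to hundreds of separate node problems, it can help to group the stale nodes by polling engine. Here is a minimal Python sketch, assuming the node rows have already been exported from Orion. The dictionary keys (`engine_id`, `minutes_since_last_sync`) and the 90% cutoff are illustrative choices, not Orion API names or defaults.

```python
from collections import Counter


def stale_share_by_engine(nodes, threshold_minutes=10):
    """Return {engine_id: fraction of its nodes not synced within the threshold}."""
    total = Counter()
    stale = Counter()
    for node in nodes:
        engine = node["engine_id"]
        total[engine] += 1
        if node["minutes_since_last_sync"] > threshold_minutes:
            stale[engine] += 1
    return {engine: stale[engine] / total[engine] for engine in total}


def suspect_engines(nodes, threshold_minutes=10, min_share=0.9):
    """Engines where nearly every node is stale: likely an engine fault, not node faults."""
    shares = stale_share_by_engine(nodes, threshold_minutes)
    return sorted(engine for engine, share in shares.items() if share >= min_share)
```

If one engine comes back with nearly all of its nodes stale while the others look normal, that points at the engine (or its database connection) rather than at the nodes.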
As suggested above by adam.beedell, using custom properties and alerting will be the best way to manage the situation.
On top of that, there is an acknowledged problem with the current 2019.2 HF1 where polling falls behind on any APE that is connected to the database over a slower connection (even when the APE worked just fine in previous versions). We're experiencing this issue on only four or five APEs out of 20+ on two Orion instances.
Support has told us that our only options are to wait for HF2 or rollback - and rollback isn't an option when we're a month of data in.
So to that end, the alert may be fine, the APE may be as fine as can be and you may be stuck as we are waiting for HF2.
How slow are we talking? Haven't heard of that one, and I'm due to have moved up to 2019.2 already.
That said, since I last posted in this thread I've set up a couple of alerts: one to mute alerts for nodes that are out of sync, and one to restart the collector services when a polling engine unsyncs. Between the two it's been fairly clean for a bit.
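The muting idea can be sketched as a small filter: when a node's polling engine is known to be out of sync, raise one engine-level alert and drop the per-node ones. This is a rough Python sketch under assumed inputs; the keys `caption` and `engine_id` and the `unsynced_engines` set are hypothetical stand-ins for whatever your alert conditions actually read, and the service restart itself is left out.

```python
def collapse_alerts(stale_nodes, unsynced_engines):
    """Split stale nodes into engine-level alerts and remaining node-level alerts.

    stale_nodes: iterable of dicts with "caption" and "engine_id" keys (assumed shape).
    unsynced_engines: set of engine IDs currently considered out of sync.
    Returns (engine_ids_to_alert_on, node_captions_to_alert_on).
    """
    stale_nodes = list(stale_nodes)
    # One alert per affected engine instead of one per node behind it.
    engine_alerts = sorted(
        {n["engine_id"] for n in stale_nodes if n["engine_id"] in unsynced_engines}
    )
    # Nodes on healthy engines still get their own alert.
    node_alerts = [
        n["caption"] for n in stale_nodes if n["engine_id"] not in unsynced_engines
    ]
    return engine_alerts, node_alerts
```

With this shape, 700 stale nodes behind one broken engine turn into a single engine alert, while a stale node on a healthy engine still gets through.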
That's the curious part - the latency isn't bad, but I know the throughput isn't great - we're talking about an APE and database across the country from each other.
That's the only correlation I currently have for the APEs that are falling behind. We've gone 20 rounds with Support and escalated the ticket, and they've basically acknowledged a problem with the pubsub (I don't know precisely whether it's timing out or what). ICMP data in particular (and we poll a lot, and twice as fast as the default) just sort of backs up on this handful of APEs until it dumps into the database, seemingly all at once, anywhere from 20 minutes to a few hours later.
It puts us at risk because we don't necessarily see that a server went down in a timely fashion at this point.
But the most curious thing is that this just got introduced when we rolled out 2019.2 HF1 at the start of this month. These connections never had this sort of delay issue with previous software iterations going back several years.
They've promised a fix in HF2 but are reluctant to say anything about the release of HF2 except "in a few weeks". Ugh.
Odd, I’ve not heard of or seen that either. Try to find out whether support will tell you what the root cause is, and ask whether there is a buddy drop available to get your system operational while you wait for the hotfix.
I wrote a few dashboards to keep an eye on nodes/components/elements that stop getting polled.
While not an alert, it might inspire someone
"LastSystemUpTimePollUtc" is the key I look for to know when they stop collecting SNMP/WMI metrics. Status is the ICMP status indicator.
SELECT TOP 1000 NodeID, Caption, ObjectSubType, Status, ChildStatus, IPAddress, E0.LastSystemUpTimePollUtc, ToLocal(LastSync) as LastSync, PollInterval, EngineID, ToLocal(NextPoll) as NextPoll, ToLocal(NextRediscovery) as NextRediscovery, SkippedPollingCycles, MinutesSinceLastSync, E0.Engine.DisplayName as PollingEngine, E0.DetailsUrl
FROM Orion.Nodes as E0
WHERE E0.UnManaged = 0 -- node has not been unmanaged
AND E0.LastSystemUpTimePollUtc < ADDDATE('Minute', -30, GETUTCDATE())
Order by LastSystemUpTimePollUtc
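The distinction described above, with Status reflecting ICMP and LastSystemUpTimePollUtc reflecting SNMP/WMI collection, can be sketched as a small classifier. This is only an illustration in Python: the function name, the labels, and the boolean `icmp_up` input are made up for the example, and the 30-minute window just mirrors the query above.

```python
from datetime import datetime, timedelta, timezone


def classify_node(icmp_up, last_uptime_poll_utc, now_utc, max_age=timedelta(minutes=30)):
    """Label a node from its ICMP status and the age of its last SNMP/WMI uptime poll.

    last_uptime_poll_utc may be None if the node has never been polled.
    Both datetimes are expected to be timezone-aware UTC values.
    """
    if not icmp_up:
        return "down"
    if last_uptime_poll_utc is None or now_utc - last_uptime_poll_utc > max_age:
        # Responds to ping, but metric collection has stalled.
        return "up but not polled"
    return "healthy"
```

The "up but not polled" case is the one the dashboard query is meant to surface: the node answers ICMP, so it looks fine at a glance, yet no fresh metrics are arriving.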
My own dashboard resource for the same:
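SELECT Caption, -- assumed: the start of this query was lost; Caption is implied by _LinkFor_Caption
'/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N%3a' + ToString(NodeID) AS [_LinkFor_Caption], EngineID as Poller,
TOLOCAL(NextPoll) as NextPoll,
MINUTEDIFF(NextPoll, GETUTCDATE()) as Poll
FROM Orion.Nodes -- assumed: source entity implied by the node columns used here
WHERE (UnManaged<>1 and Status<>11)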
and MINUTEDIFF(NextPoll, GETUTCDATE())>10
Order by NextPoll DESC, Caption
I've been meaning to fix this in my own environment for a while. Haven't done it yet, but there are a couple of paths open:
Maybe all of the above
You might also want to look out for when the clocks go back. Depending on how you phrase your alert, the time between the last poll and now will be > 10 mins for everything.
If you work out anything clever, let me know
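On the clock-change point, here is a small Python sketch of why comparing wall-clock readings goes wrong and comparing in UTC does not. It assumes a zone that shifts by one hour (fixed offsets stand in for real time zone rules), and the function names are made up for the example. Which direction the false result falls in depends on how the alert is phrased, so the sketch shows both.

```python
from datetime import datetime, timedelta, timezone

WINTER = timezone(timedelta(hours=0))  # assumed standard-time offset
SUMMER = timezone(timedelta(hours=1))  # assumed daylight-time offset


def minutes_since_wall_clock(last_poll, now):
    """Naive comparison: ignore offsets and subtract the local clock readings."""
    delta = now.replace(tzinfo=None) - last_poll.replace(tzinfo=None)
    return delta.total_seconds() / 60


def minutes_since_utc(last_poll, now):
    """Offset-aware comparison: Python normalises both values to UTC first."""
    return (now - last_poll).total_seconds() / 60


# Clocks go back: polled at 01:55 summer time, checked 7 real minutes later at 01:02.
back_last = datetime(2019, 10, 27, 1, 55, tzinfo=SUMMER)
back_now = datetime(2019, 10, 27, 1, 2, tzinfo=WINTER)

# Clocks go forward: polled at 00:55 winter time, checked 7 real minutes later at 02:02.
fwd_last = datetime(2019, 3, 31, 0, 55, tzinfo=WINTER)
fwd_now = datetime(2019, 3, 31, 2, 2, tzinfo=SUMMER)
```

In both cases the UTC comparison gives 7 minutes, while the wall-clock comparison is off by the full hour, which is enough to either trip a 10-minute threshold for every node or silence it entirely. Keeping both sides of the comparison in UTC (as the queries above do with GETUTCDATE()) sidesteps this.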