
Node not polled in last 10 minutes alerts receiving frequently

Hi,

I have the "Node not polled in last 10 minutes" alert enabled in our environment. For the last few days I have been receiving around 700 alerts every minute, because one polling engine is showing up as down and the last database sync is also having some issues.

Can you please tell me how this can be fixed?

Thanks,

Ankit

0 Kudos
11 Replies

The problem isn't with the alert; it's with your polling engine. The alert is correctly telling you that the nodes on that polling engine are not being polled properly, because the polled data isn't being written back into the database.

You need to repair the APE or move the Nodes to a polling engine that is working.

As adam.beedell suggested above, using custom properties and alerting will be the best way to manage the situation.

- David Smith
0 Kudos

On top of that, there is an acknowledged problem with the current 2019.2 HF1 where polling falls behind on any APE that is connected to the database over a slower connection (even when the APE worked just fine in previous versions).  We're experiencing this issue on only four or five APEs out of 20+ on two Orion instances.

Support has told us that our only options are to wait for HF2 or roll back, and rolling back isn't an option when we're a month of data in.

So to that end, the alert may be fine, the APE may be as fine as can be, and you may be stuck, as we are, waiting for HF2.

0 Kudos

How slow are we talking? I haven't heard of that one, and I'm due to have moved up to 2019.2 already.

That said, since I last posted in this thread I've set up a couple of alerts: one to mute alerts for nodes that are out of sync, and one to restart the collector services when the polling engines unsync. Between the two it's been fairly clean for a bit.

0 Kudos

That's the curious part - the latency isn't bad, but I know the throughput isn't great - we're talking about an APE and database across the country from each other.


That's the only correlation I currently have for the APEs that are falling behind. We've gone 20 rounds with Support and escalated the ticket, and they've basically acknowledged a problem with the pubsub (I don't know precisely whether it's timing out or what). ICMP data in particular (and we poll a lot, and twice as fast as the default) just sort of backs up on this handful of APEs until it dumps into the database seemingly all at once, anywhere from 20 minutes to a few hours later.

It puts us at risk because we don't necessarily see that a server went down in a timely fashion at this point.

But the most curious thing is that this just got introduced when we rolled out 2019.2 HF1 at the start of this month.  These connections never had this sort of delay issue with previous software iterations going back several years.

They've promised a fix in HF2 but are reluctant to say anything about the release of HF2 except "in a few weeks". Ugh.

0 Kudos

Odd, I've not heard of or seen that either. Try to find out from Support what the root cause is, and ask whether a buddy drop is available to get your system operational while you wait for the HF.

- David Smith
0 Kudos

HF2 released today.  I'm just about done slapping it on everywhere.  Fingers crossed.

0 Kudos

Good luck - Let us know if it makes any difference.

- David Smith
0 Kudos

Definitely not fixed with HF2.  Continuing with Support.

0 Kudos
Level 12

I wrote a few dashboards to keep an eye on nodes/components/elements that stop getting polled.

While not an alert, it might inspire someone

"LastSystemUpTimePollUtc" is the key I look for to know when they stop collecting SNMP/WMI metrics.  Status is the ICMP status indicator.


SWQL:

SELECT TOP 1000 NodeID, Caption, ObjectSubType, Status, ChildStatus, IPAddress,
E0.LastSystemUpTimePollUtc, ToLocal(LastSync) AS LastSync, PollInterval, EngineID,
ToLocal(NextPoll) AS NextPoll, ToLocal(NextRediscovery) AS NextRediscovery,
SkippedPollingCycles, MinutesSinceLastSync,
E0.Engine.DisplayName AS PollingEngine, E0.DetailsUrl
FROM Orion.Nodes AS E0
WHERE
    E0.UnManaged = 0 -- node has not been unmanaged
    AND E0.LastSystemUpTimePollUtc < ADDDATE('Minute', -30, GETUTCDATE())
ORDER BY LastSystemUpTimePollUtc

My own dashboard resource for the same:

SELECT Caption,
'/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N%3a' + ToString(NodeID) AS [_LinkFor_Caption], EngineID as Poller,
TOLOCAL(NextPoll) as NextPoll,
MINUTEDIFF(NextPoll, GETUTCDATE()) as Poll
FROM Orion.Nodes

WHERE (UnManaged<>1 and Status<>11)
and MINUTEDIFF(NextPoll, GETUTCDATE())>10

Order by NextPoll DESC, Caption
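
With hundreds of overdue nodes, a per-engine tally makes it obvious whether one polling engine is responsible. Below is a minimal Python sketch of that tally. It assumes the query results have already been fetched into dictionaries keyed by the column aliases used above (Caption, Poller, Poll); the function name and the way rows are obtained are hypothetical and not part of Orion.

```python
from collections import Counter


def overdue_by_engine(rows):
    """Count overdue nodes per polling engine, busiest engine first.

    `rows` is an iterable of dicts using the aliases from the query above:
    "Caption" (node name), "Poller" (EngineID) and "Poll" (minutes overdue).
    """
    counts = Counter(row["Poller"] for row in rows)
    return counts.most_common()


# Illustrative rows only; real rows would come from the dashboard query.
sample_rows = [
    {"Caption": "core-sw-01", "Poller": 3, "Poll": 42},
    {"Caption": "core-sw-02", "Poller": 3, "Poll": 41},
    {"Caption": "edge-rtr-07", "Poller": 3, "Poll": 39},
    {"Caption": "lab-fw-01", "Poller": 1, "Poll": 12},
]

for engine_id, node_count in overdue_by_engine(sample_rows):
    print(f"engine {engine_id}: {node_count} overdue node(s)")
```

If one engine accounts for nearly all of the rows, the engine is the likelier culprit rather than the individual nodes.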
0 Kudos
Level 12

I've been meaning to fix this in my own environment for a while. I haven't done it yet, but a couple of paths are open:

  1. Include a check in the alert that the polling engine is connected to the DB and polling normally
  2. Create a new alert that auto-fixes SLW when the database sync drops out (perhaps restarts collector service and a couple others)
  3. Create an alert that updates a custom property when database sync drops out, and add a check to the node alert that the custom property isn't filled in.
  4. Create an alert that pauses actions or alerting when SLW is broken

Maybe all of the above
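
The idea shared by options 1, 3 and 4 is to let a stale polling engine raise one alert of its own and hold back the per-node alerts behind it. Here is a minimal Python sketch of that decision logic only. It is not an Orion alert definition: the `Node` class, the `split_alerts` function and the 5-minute engine threshold are all assumptions made for illustration, and the 10-minute node threshold simply mirrors the alert name.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Node:
    caption: str
    engine_id: int
    last_poll_utc: datetime


def split_alerts(nodes, engine_last_sync_utc, now_utc,
                 node_limit=timedelta(minutes=10),
                 engine_limit=timedelta(minutes=5)):
    """Return (stale engine IDs, captions of nodes that should still alert).

    A node alert is held back when the node's own polling engine has not
    synced to the database recently, so one broken engine produces a single
    engine-level alert rather than one alert per node.
    """
    stale_engines = {
        engine_id
        for engine_id, last_sync in engine_last_sync_utc.items()
        if now_utc - last_sync > engine_limit
    }
    node_alerts = [
        node.caption
        for node in nodes
        if now_utc - node.last_poll_utc > node_limit
        and node.engine_id not in stale_engines
    ]
    return sorted(stale_engines), node_alerts


# Illustrative data: engine 2 stopped syncing 40 minutes ago.
now = datetime(2019, 7, 1, 12, 0, tzinfo=timezone.utc)
engines = {1: now - timedelta(minutes=1), 2: now - timedelta(minutes=40)}
nodes = [
    Node("a", 1, now - timedelta(minutes=2)),   # healthy
    Node("b", 1, now - timedelta(minutes=15)),  # stale on a healthy engine
    Node("c", 2, now - timedelta(minutes=30)),  # stale, but engine is down
    Node("d", 2, now - timedelta(minutes=45)),  # stale, but engine is down
]
stale, alerts = split_alerts(nodes, engines, now)
print("engine alerts:", stale)
print("node alerts:", alerts)
```

Only node "b" produces a node alert here, because its engine is still syncing; nodes "c" and "d" are covered by the single alert for engine 2.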

You might also want to look out for when the clocks go back: depending on how you phrase your alert, the time between the last poll and now will be > 10 mins for everything.
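
To illustrate the clock-change pitfall, here is a small Python sketch comparing an elapsed time taken from local wall-clock readings with the same elapsed time computed in UTC. It uses fixed UK offsets (GMT and BST) and the 2019 changeover dates purely as an example; the function names are made up, and how a real alert behaves depends on which timestamps it actually compares.

```python
from datetime import datetime, timedelta, timezone

GMT = timezone(timedelta(hours=0), "GMT")
BST = timezone(timedelta(hours=1), "BST")


def wall_clock_minutes(earlier, later):
    """Elapsed minutes using only the local clock readings (offsets ignored)."""
    delta = later.replace(tzinfo=None) - earlier.replace(tzinfo=None)
    return delta.total_seconds() / 60


def utc_minutes(earlier, later):
    """Elapsed minutes between the two instants, compared in UTC."""
    delta = later.astimezone(timezone.utc) - earlier.astimezone(timezone.utc)
    return delta.total_seconds() / 60


# Clocks go back (27 Oct 2019): polled at 01:55 BST, checked at 01:05 GMT.
autumn_poll = datetime(2019, 10, 27, 1, 55, tzinfo=BST)
autumn_now = datetime(2019, 10, 27, 1, 5, tzinfo=GMT)

# Clocks go forward (31 Mar 2019): polled at 00:55 GMT, checked at 02:05 BST.
spring_poll = datetime(2019, 3, 31, 0, 55, tzinfo=GMT)
spring_now = datetime(2019, 3, 31, 2, 5, tzinfo=BST)

print(wall_clock_minutes(autumn_poll, autumn_now), utc_minutes(autumn_poll, autumn_now))
print(wall_clock_minutes(spring_poll, spring_now), utc_minutes(spring_poll, spring_now))
```

In both cases only 10 real minutes have passed, but the wall-clock figure is off by an hour in one direction or the other. Which changeover trips a "more than 10 minutes" condition depends on how the comparison is phrased; keeping both sides in UTC, as the queries above do with GETUTCDATE(), avoids the problem.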

If you work out anything clever, let me know