SNMP randomly stops polling

Question

Hi All,

I've searched for an answer but haven't found anything that describes my scenario.

We've got Orion NPM 500 (10.3) and periodically (usually overnight) one or more of my nodes trigger an alert I set up to fire when a node is no longer being polled via SNMP.

The alert is based on the LastSystemUptimePollUtc column in the database, and when I check the underlying table the alert correctly reflects this value.
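For anyone comparing notes, the staleness test behind that alert can be sketched roughly as below. Only the LastSystemUptimePollUtc column name comes from this setup; the row shape, the 30-minute threshold and the function name are assumptions made for illustration, not Orion's actual alert logic.

```python
from datetime import datetime, timedelta, timezone


def stale_nodes(rows, now, max_age=timedelta(minutes=30)):
    """Return the names of nodes whose last SNMP uptime poll is too old.

    rows    -- iterable of (node_name, last_system_uptime_poll_utc) pairs,
               e.g. fetched from the database (assumed shape)
    now     -- current time as a timezone-aware UTC datetime
    max_age -- assumed threshold; tune it to the polling interval in use
    """
    stale = []
    for name, last_poll_utc in rows:
        # A missing timestamp means the node has never been polled via SNMP.
        if last_poll_utc is None or now - last_poll_utc > max_age:
            stale.append(name)
    return stale


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        ("switch-a", now - timedelta(minutes=5)),   # hypothetical node names
        ("firewall-b", now - timedelta(hours=6)),
    ]
    print(stale_nodes(sample, now))
```

Running the same check on a schedule outside Orion would show whether the timestamp really stops advancing or whether only the alert is misbehaving.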

When this occurs the only fix seems to be either a restart of all Orion services or a reboot of the polling server. Restarting the nodes doesn't seem to make any difference.

Whilst the alert is active I can still go to a node, list resources and get the correct resources back, so the polling server can communicate with the node via SNMP. Clicking Poll Now or Rediscover makes no difference to the alert.

The nodes that have issues are primarily split over two sites, but the nodes are not all identical. One site has a Windows 2008 R2 server, a Cisco 3750 switch and an ASA5505 (accessible through the ASA via VPN). The other site has two Windows 2008 servers (behind a Draytek 2820, no VPN).

It's almost like there's a break in internet connectivity, Orion gets upset somehow and then refuses to acknowledge they're accessible via SNMP.

Has anybody seen this before or have any idea how it can be fixed?

Cheers,

Alex

hwtechnology · Answer

Thanks Jan, I've just opened a support ticket.

I've restarted the Job Engine v2 service and the node is now being polled correctly. I'll wait to see what tech support say.

jan1 · Answer

You've said that this was happening at two sites, each behind a different firewall. But if all the traffic is going through that core site firewall, it looks like the problem might be there. It's really unfortunate that you can't restart it.

When this happens again, please check whether restarting only the Job Engine v2 service gets polling working again (so you don't have to restart everything).

And yes, please do open a support ticket. We'll need to check the firewall configuration; hopefully there's something we can do to fix this.

Thanks

Jan

hwtechnology · Answer

Thanks Jan,

That's exactly what we're seeing. The strange thing is that a "List Resources" works fine, so there is SNMP connectivity to the node.

I've popped Wireshark on the polling server and I see SNMP going out but nothing coming back apart from a TTL exceeded from our core site firewall (Cisco ASA5510).

The problem node continues to respond via ICMP whilst this is going on (it's connected via an ASA to ASA VPN tunnel).
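To cross-check the Wireshark capture independently of Orion, a hand-built SNMP GET for sysUpTime can be sent from the polling server. The following is a minimal sketch using only the Python standard library; it assumes SNMPv2c, and the community string "public", the 2-second timeout and the function names are placeholders rather than anything taken from this environment.

```python
import socket

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0 from the standard MIB-2 system group


def tlv(tag, payload):
    """BER tag-length-value with short-form length (payload under 128 bytes)."""
    if len(payload) > 127:
        raise ValueError("payload too long for this simple encoder")
    return bytes([tag, len(payload)]) + payload


def encode_int(value):
    """BER INTEGER for a non-negative value, padded so it stays positive."""
    return tlv(0x02, value.to_bytes((value.bit_length() + 8) // 8, "big"))


def encode_oid(dotted):
    """BER OBJECT IDENTIFIER: first two arcs combined, the rest in base 128."""
    arcs = [int(a) for a in dotted.split(".")]
    body = bytearray([40 * arcs[0] + arcs[1]])
    for arc in arcs[2:]:
        chunk = [arc & 0x7F]
        arc >>= 7
        while arc:
            chunk.append((arc & 0x7F) | 0x80)
            arc >>= 7
        body.extend(reversed(chunk))
    return tlv(0x06, bytes(body))


def build_get_request(community, oid, request_id=1):
    """Build an SNMPv2c GetRequest message for a single OID."""
    varbind = tlv(0x30, encode_oid(oid) + tlv(0x05, b""))  # OID + NULL value
    pdu = tlv(0xA0, encode_int(request_id) + encode_int(0) + encode_int(0)
              + tlv(0x30, varbind))
    # Version field 1 means SNMPv2c.
    return tlv(0x30, encode_int(1) + tlv(0x04, community.encode("ascii")) + pdu)


def snmp_responds(host, community="public", timeout=2.0, port=161):
    """Return True if any UDP reply arrives; the reply itself is not decoded."""
    packet = build_get_request(community, SYS_UPTIME_OID)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (host, port))
            data, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and ICMP-triggered socket errors
            return False
    return len(data) > 0
```

Calling `snmp_responds("192.0.2.10", "public")` (a documentation address standing in for the problem node) while the alert is active would show whether a fresh SNMP request from the same server gets an answer when the poller's own requests do not.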

I've restarted the firewall at the problem site but it's not made any difference and unfortunately I can't restart the core site firewall.

The thing that's throwing me is that if I restart the Orion services on the polling server (or restart the server itself) the system starts polling the node correctly via SNMP again, making me think that it's something on the polling server that's going wrong.

Am I best opening a support ticket about this?