4 Replies Latest reply on Jul 11, 2012 11:33 AM by hwtechnology

    SNMP randomly stops polling

    hwtechnology

      Hi All,

       

      I've searched for the answer but not really found anything that describes my scenario.

       

      We're running Orion NPM 500 (10.3). Periodically (usually overnight), one or more of my nodes trigger an alert that I set up to fire when a node is no longer being polled via SNMP.


      The alert is based on the LastSystemUptimePollUtc column in the database, and when I check the underlying table the alert correctly reflects this value.
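
For illustration, the kind of check such an alert performs can be sketched as below. This is only a rough sketch, not Orion's alert engine: the column name LastSystemUptimePollUtc comes from the post, while the function name, the 30-minute threshold and the sample node names and timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_nodes(last_polls, now, max_age=timedelta(minutes=30)):
    """Return the names of nodes whose last SNMP uptime poll is too old.

    last_polls maps a node name to its LastSystemUptimePollUtc value
    (a timezone-aware datetime), or None if no poll was ever recorded.
    The 30-minute default is an assumed threshold, not an Orion setting.
    """
    stale = []
    for name, last_poll in last_polls.items():
        # A missing timestamp is treated the same as an outdated one.
        if last_poll is None or now - last_poll > max_age:
            stale.append(name)
    return sorted(stale)


# Hypothetical sample data, standing in for rows read from the database.
now = datetime(2012, 7, 11, 6, 0, tzinfo=timezone.utc)
sample = {
    "site-a-switch": now - timedelta(hours=8),    # stopped overnight
    "site-b-server": now - timedelta(minutes=5),  # still being polled
    "never-polled": None,
}
print(stale_nodes(sample, now))
```

Comparing against a fixed age rather than against the polling interval keeps the sketch simple; a real alert would probably allow for a few missed polls before firing.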

       

      When this occurs, the only fix seems to be either a restart of all Orion services or a reboot of the polling server. Restarting the nodes doesn't seem to make any difference.

       

      Whilst the alert is active I can still go to a node, list resources and get the correct resources back, so the polling server can communicate with the node via SNMP. Clicking Poll Now or Rediscover makes no difference to the alert.

       

      The nodes that have issues are primarily split over two sites, but the nodes are not all identical. One site has a Windows 2008 R2 server, a Cisco 3750 switch and an ASA5505 (accessible through the ASA via VPN). The other site has two Windows 2008 servers (behind a Draytek 2820, no VPN).

       

      It's almost as if there's a break in internet connectivity, Orion gets upset somehow and then refuses to acknowledge that the nodes are accessible via SNMP.

       

      Has anybody seen this before or have any idea how it can be fixed?

       

      Cheers,


      Alex

        • Re: SNMP randomly stops polling
          jan

          Hi Alex,

           

          We've seen a similar case before, where it was caused by something between Orion and the polled device. It looked like the SNMP requests were being blocked or dropped by a firewall.

          In the Wireshark trace (run on the Orion server), we saw that the requests were being sent but no response was coming back.

          Can you run Wireshark on your server at the time the issue occurs to check whether a response is coming back for that particular node?

          If it isn't, can you restart the firewall to see if it makes any difference?

           

          Thanks

          Jan

            • Re: SNMP randomly stops polling
              hwtechnology

              Thanks Jan,

               

              That's exactly what we're seeing. The strange thing is that "List Resources" works fine, so there is SNMP connectivity to the node.

               

              I've popped Wireshark on the polling server and I see SNMP requests going out but nothing coming back, apart from a TTL exceeded message from our core site firewall (Cisco ASA5510).

               

              The problem node continues to respond via ICMP whilst this is going on (it's connected via an ASA to ASA VPN tunnel).

               

              I've restarted the firewall at the problem site but it hasn't made any difference, and unfortunately I can't restart the core site firewall.

               

              The thing that's throwing me is that if I restart the Orion services on the polling server (or restart the server itself), the system starts polling the node correctly via SNMP again, which makes me think that something on the polling server is going wrong.

               

              Am I best opening a support ticket about this?

                • Re: SNMP randomly stops polling
                  jan

                  You've said that this is happening at two sites, each behind a different firewall. But if all the traffic goes through that core site firewall, the problem might be there. It's really unfortunate that you can't restart it.

                  When this happens again, please check whether restarting only the Job Engine v2 service gets polling working again (so you don't have to restart everything).

                  And yes, please do open a support ticket. We'll need to check the firewall configuration. Hopefully there's something we can do to fix this.

                   

                  Thanks

                  Jan