
NPM server is sending too much ICMP traffic to our Nexus Core Switches.

Our NPM server is sending so much ICMP traffic to our Nexus core switches that the switches are dropping most of it.

What can I do to make this better? The core switches constantly show as down in NPM because they are designed to ignore excess ICMP traffic.

All I have done so far is increase the polling intervals, and only for the Nexus switches.

Hope this works.

Collin.

  • Can you not poll them with SNMP instead of ICMP?

  • How much ICMP traffic is being sent from SolarWinds to your Nexus core switches? Curious how/why it's too much.

    In any case, you could configure your Nexus core switches to send relevant traps to your SolarWinds server. That could offset the detection delay introduced by the longer polling intervals, assuming the traps cover the issues you would otherwise have caught via polling. You could also send syslog from the Nexus core switches to your SolarWinds server, or to another syslog server on your network, to improve your chances of detecting issues. You could also monitor the upstream device's interface(s) through which your Nexus core switches connect. A rough config sketch follows at the end of this reply.

    I would be surprised if your SolarWinds server is really sending too much ICMP traffic; the more likely scenario is that ICMP is simply a lower priority than the other traffic traversing your Nexus core switches. Hence the question about the quantity of ICMP traffic.
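
    On the NX-OS side, something like the following would cover both the traps and the syslog suggestions. This is only a sketch; the 10.1.1.50 collector address, the npm-traps community string, and the severity level are placeholders for your environment:

    ! hypothetical SolarWinds/syslog collector at 10.1.1.50
    snmp-server enable traps
    snmp-server host 10.1.1.50 traps version 2c npm-traps
    ! forward syslog at severity 6 (informational) and below
    logging server 10.1.1.50 6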

  • I had a somewhat similar experience, and I opened a TAC case with Cisco while simultaneously working with SolarWinds Support. I found that misconfigured traffic/credentials were causing high CPU utilization on the Nexus 7009s, which resulted in lost ICMP responses due to insufficient Nexus resources.

    Work with TAC and SolarWinds to identify the cause and correct it; they have some great diagnostic commands designed specifically for this kind of troubleshooting (a couple of starting points are sketched at the end of this reply). In one case it was non-Orion traffic causing the root issue, which resulted in the unexpected ICMP stats.

    Don't give up until TAC identifies all causes of high CPU or high memory on the Nexus equipment. It's powerful gear, but your network apps, servers, security, and monitoring solutions must be properly configured so they don't abuse the big switches.

    Poll the Nexus with SNMPv3 instead of ICMP if you want better reliability and more detail.
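
    If you go the SNMPv3 route, here is a rough sketch of the Nexus side, along with the CPU-checking commands mentioned above. The npm-poller username and both passphrases are placeholders, and network-operator is the built-in read-only role:

    ! where is the supervisor CPU going right now?
    show processes cpu sort

    ! CPU trend over time, to correlate with the outages
    show processes cpu history

    ! hypothetical read-only SNMPv3 user for the Orion poller
    snmp-server user npm-poller network-operator auth sha MyAuthPass123 priv aes-128 MyPrivPass123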

  • Thanks rschroeder, I will rediscover and poll using SNMP instead of ICMP pings. I guess I could then go back to the default SolarWinds polling intervals.

    Will let you know what happens here soon.

    Collin.

  • Try polling the Nexus boxes with SNMP v1 - I have seen this work before :-)

    It seems like the GetBulk operation in v2c can stress the Nexus nodes in certain scenarios.
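
    If you experiment with versions, note that the SNMP version is selected per node in the Orion node settings; on the Nexus side the same read-only community string serves both v1 and v2c. A minimal sketch, with npm-ro as a placeholder:

    ! hypothetical read-only community for v1/v2c polling
    snmp-server community npm-ro ro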

  • How often were you polling it? I poll our Nexus core switches with SNMP at 5-minute intervals with no problems. Is it possible other things (non-SolarWinds) are also hitting them?

    If you changed the settings from the SolarWinds defaults, then also make sure you understand how these metrics function.

    For example, polling topology every 5 minutes is probably not necessary. Once a day is good for my environment.

    [screenshot: Orion node polling settings showing the topology polling interval]

    The same goes when you turn on polling for Routing and Layer 2/3 topology discovery. It all adds up to more SNMP traffic.

    Try going back to the defaults first; if you are still having issues, I would take rschroeder's advice and open a TAC case, or try using the commands yourself to track down the issue. A quick way to measure the SNMP load is sketched below.
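
    One way to sanity-check the effect of reverting to defaults is to watch the SNMP packet counters on the Nexus itself: snapshot them, wait a polling cycle or two, and compare. Assuming you can get on the switch during the busy window:

    ! SNMP packets input/output counters appear in the output
    show snmp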

  • collins4941

    Are you positive NPM is causing all the pain? I suspected this a while back and found that someone had placed a new print server on the network that was scanning all the RFC 1918 IP ranges. I've also seen IPAM create a nightmare during its discovery process. The next time you see the issue, try running this command and review the packet capture with Wireshark:

    ethanalyzer local interface inband capture-filter "icmp" limit-captured-frames 5000 write bootflash:icmp

    https://supportforums.cisco.com/document/132151/using-ethanalyzer-nexus-platform-control-plane-and-data-plane-traffic-an…

    There are other examples in that link to help you pull more data. If it is the NPM polling, the packet capture might also tell you why, and how to adjust your settings. You might also need to adjust your Control Plane Policing (CoPP) configuration:

    How to Verify CoPP Policy and Drops in NX-OS | interc0nnect--A Networking Blog
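
    To read the capture back on-box, and to see whether CoPP is what's eating the pings, something along these lines should work (bootflash:icmp is just the filename from the capture command above):

    ! replay the capture taken above
    ethanalyzer local read bootflash:icmp

    ! confirm CoPP is applied, then look for violated/drop counters in the ICMP-related classes
    show copp status
    show policy-map interface control-plane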


    D

  • Seems all good now guys, thanks for the help. I used SNMP instead of ICMP. At some point we also need to look at the CoPP policy.

    Good for now.

    Collin.