Good Morning All,
We recently installed and configured SolarWinds NPM with NTA on our transport network. The network is a Cisco Carrier Ethernet design based on ASR 920 core routers. It is set up to run five Layer 2 VLANs (Bridge Domains in Carrier Ethernet). The top Bridge Domain is for management, and the other four Bridge Domains transport customer traffic between work sites. The network seems to work very well, but we have one large problem.
Issue: If a node IP fails to respond to a poll, NPM sends a ping request to the lost node to try to find it. For some reason this causes a broadcast storm on the network, where Wireshark sees tens of thousands of ICMP packets circulating on the management VLAN. The storm is bad enough to cause a DoS condition for about 10-20 seconds each time the NPM server sends the ping.
We have been trying to work out why the ping would storm the network, but cannot nail it down.
1. Tried to localize the problem by shutting a port on polled nodes at various places in the network topology. SAME RESULT
2. Tried to determine if the storm is at layer 2 or 3 by implementing MSTP on all nodes and inspecting all of the spanning tree properties. NO ISSUE FOUND.
3. Tried to replicate the problem by manually pinging the shutdown node address from the Windows command line on the SolarWinds server. The ping behaves normally: ARP can't resolve the address and the ping fails. NO STORM OCCURRED.
Has anyone seen this issue before, or maybe have some pointers on how we can nail this down?
At this point we are considering moving the entire monitoring system off SolarWinds and over to Cisco Ethernet OAM.
Example Router Config is attached.
This reminded me of a ping-related issue I encountered. My case had some devices showing as unpingable (and therefore down) though pings were successful from a Windows command prompt.
It turned out that the default PING / ICMP data size from the command prompt was 32 bytes, but from NPM it was only 23 bytes. From my old case notes: "You can check the ICMP data by going to settings>all settings>polling settings and look under Network at the ICMP data field. The default text is SolarWinds Status Query. You can try to adjust the length of the text to see if the nodes will respond." Once I increased the data size of NPM's ICMP Echo Requests to 32 characters, my issue was fixed.
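To make the size difference concrete, here is a rough Python sketch that counts the bytes in the default ICMP data text and pads it out to 32 bytes. The function names and the "." fill character are my own choices for illustration, not anything NPM defines; the padded string is just one possible value to paste into the ICMP data field.

```python
# Sketch only: measures the default NPM ICMP data text and pads it to a target size.
# The names below and the fill character are illustrative assumptions, not NPM settings.

DEFAULT_NPM_ICMP_DATA = "SolarWinds Status Query"  # default text quoted in the case notes
WINDOWS_PING_DEFAULT_SIZE = 32                     # default data size of the Windows ping command


def payload_size(text: str) -> int:
    """Return the number of bytes the text occupies as an ASCII ICMP payload."""
    return len(text.encode("ascii"))


def pad_payload(text: str, target_size: int, fill: str = ".") -> str:
    """Pad the text with a single-character fill until it reaches target_size bytes.

    Text that is already at least target_size bytes long is returned unchanged.
    """
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    size = payload_size(text)
    if size >= target_size:
        return text
    return text + fill * (target_size - size)


if __name__ == "__main__":
    print("default NPM payload bytes:", payload_size(DEFAULT_NPM_ICMP_DATA))
    padded = pad_payload(DEFAULT_NPM_ICMP_DATA, WINDOWS_PING_DEFAULT_SIZE)
    print("padded text:", repr(padded))
    print("padded payload bytes:", payload_size(padded))
```

For comparison in the other direction, the Windows ping command accepts a buffer size with -l, so something like "ping -l 23" followed by the node address sends a 23-byte payload from the command prompt. The payload content will still differ from NPM's text, so treat it only as an approximation of the NPM probe.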
I have no idea whether this may fix your issue, but it's rather unlikely to harm anything; might be worth a try.