Why would the Orion server respond with "protocol unreachable"? I would appreciate another take on this problem.
Maybe because the ping timed out?
We are looking into the ping timing out. Right now we do not see that as an issue. If we look at the packets right before and right after the issue, there is no change in delay, etc.
Where is the packet capture happening? At the Orion server or some point in between?
We have looked in two different areas.
We have a tap installed between our core switch and the MPLS router. This allows us to capture data flowing between the core and the router. This is the traffic moving to and from the central office and the branch offices.
We have also captured the same data packets on the core switch.
The Orion server is plugged into the core switch.
So, to make a long story short, we see the same results at the core and as the packets flow from the core to the MPLS router and out to the sites.
That's pretty weird. As you probably know, a "protocol unreachable" is supposed to mean that the IP protocol (i.e., the layer 4 protocol) is unknown to the device. Are there any clues in the data field of the ICMP error packet? The ICMP error message should include a copy of the header of the offending packet which produced the error.
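For anyone who wants to pull that embedded header out of a capture by hand, here is a minimal Python sketch. It assumes you already have the raw bytes of the ICMP message (starting at the ICMP type field) exported from the capture. The function names and the sample addresses (192.0.2.x documentation range) are made up for illustration and do not come from any tool mentioned in this thread.

```python
import socket
import struct


def parse_icmp_error(icmp_bytes: bytes) -> dict:
    """Decode an ICMPv4 error and the embedded header of the offending packet."""
    # 8-byte ICMP header plus at least a 20-byte embedded IPv4 header.
    if len(icmp_bytes) < 28:
        raise ValueError("too short for an ICMP error with an embedded IPv4 header")
    icmp_type, icmp_code = icmp_bytes[0], icmp_bytes[1]
    inner = icmp_bytes[8:]  # the copy of the offending packet follows the ICMP header
    version, ihl = inner[0] >> 4, (inner[0] & 0x0F) * 4
    if version != 4 or ihl < 20 or len(inner) < ihl:
        raise ValueError("embedded packet is not a well-formed IPv4 header")
    return {
        "icmp_type": icmp_type,
        "icmp_code": icmp_code,
        "inner_protocol": inner[9],  # IP protocol number of the offending packet
        "inner_src": socket.inet_ntoa(inner[12:16]),
        "inner_dst": socket.inet_ntoa(inner[16:20]),
        "inner_leading_bytes": inner[ihl:ihl + 8],  # start of the offending payload
    }


def build_sample_error() -> bytes:
    """Hypothetical Type 3, Code 2 error quoting an ICMP packet (checksums left at zero)."""
    icmp_header = struct.pack("!BBHI", 3, 2, 0, 0)  # type, code, checksum, unused
    inner_ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 84, 0x1234, 0, 64, 1, 0,  # version/IHL ... TTL, protocol=1, checksum
        socket.inet_aton("192.0.2.10"), socket.inet_aton("192.0.2.1"),
    )
    inner_icmp = struct.pack("!BBHHH", 0, 0, 0, 1, 1)  # type 0 = echo reply
    return icmp_header + inner_ip + inner_icmp


info = parse_icmp_error(build_sample_error())
print(info)
```

The field to look at is inner_protocol: if it is a protocol number the sending host should obviously understand, the error is suspicious.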
I would very much like to hear how you make out with this....
I have opened several tickets with Solarwinds on a similar issue. They have told me that it's my Windows 2008 OS that's the issue but I'm having a hard time believing that it is. We have an issue where once every couple of weeks, two certain Juniper switches go into alarm in Solarwinds....
Each time, we can ping and reach these switches with no issues. I know this sounds like a network issue on the surface but I can guarantee it's not. These switches are totally independent of one another and not even in the same geographic region of the world.
When I log in to the Solarwinds server, I cannot ping the device from that server, but if I check from any other location in our network the switch is reachable. Yes, I've opened tickets with Juniper, and a trace on their switches shows the ICMP echo request not even arriving at their switches.
I'm writing this because I was just paged about this incident. The usual fix is to reboot the server and the alarm clears (which it just did now).
I can appreciate what Solarwinds is saying about this being an OS issue but we completely rebuilt the box with a brand new fresh copy of Windows 2008 and it continues.
Based on everything I have read, this is a transport layer issue. The host (the Orion server) stops listening for these protocols.
My thought on this issue right now is the Windows 2008 OS and/or the Dell server is having some type of issue. Areas I plan to check are:
1) is NIC teaming causing the issue?
2) is the Dell firmware fully upgraded?
3) is Cisco etherchannel at fault?
4) is Windows fully upgraded?
That's very interesting - we are running on a Dell PowerEdge 2950 with built-in NICs.
In our case I believe the NIC drivers and server firmware are all up to date, but I will double check.
No NIC teaming or Cisco etherchannel involved in our side.
It appears this may be an issue with Windows Server 2008 R2.
I'm having this same exact problem with Nimsoft's net_connect probe which sends out 3 immediate pings, then receives 3 pings back from the remote host.
Intermittently, I'm seeing my Nimsoft server send out an ICMP Type 3, Code 2: Protocol Unreachable. When I check the headers of the "offending packet" (the ICMP echo reply), it has an IP protocol number of 1 (ICMP), which is correct and expected. For some unknown reason, the TCP/IP stack on the server is intermittently unable to process the ICMP echo reply (received too fast?) and sends out a Protocol Unreachable message. But, based on my knowledge of the TCP/IP stack, this is NOT generated by the application (Solarwinds / Nimsoft). It's the kernel's TCP/IP stack that generates these messages.
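If you want to pick these out of a capture automatically, the check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it expects a full raw IPv4 packet (no Ethernet header), the helper names are hypothetical, and the addresses are from the 192.0.2.x documentation range rather than from my network.

```python
import socket
import struct

ICMP = 1  # IP protocol number for ICMP


def ipv4_header(protocol: int, src: str, dst: str) -> bytes:
    """Minimal 20-byte IPv4 header for building sample packets (checksum left at zero)."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 0, 0, 0, 64, protocol, 0,
                       socket.inet_aton(src), socket.inet_aton(dst))


def is_unreachable_for_echo_reply(ip_packet: bytes) -> bool:
    """True if the packet is a Type 3, Code 2 error that quotes an ICMP echo reply."""
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        return False
    outer_ihl = (ip_packet[0] & 0x0F) * 4
    if outer_ihl < 20 or ip_packet[9] != ICMP:
        return False
    icmp = ip_packet[outer_ihl:]
    if len(icmp) < 8 or icmp[0] != 3 or icmp[1] != 2:
        return False  # not Destination Unreachable / Protocol Unreachable
    inner = icmp[8:]  # quoted header of the offending packet
    if len(inner) < 20 or inner[0] >> 4 != 4:
        return False
    inner_ihl = (inner[0] & 0x0F) * 4
    if inner_ihl < 20 or len(inner) <= inner_ihl:
        return False
    # Offending packet was ICMP (protocol 1) and its ICMP type was 0 (echo reply).
    return inner[9] == ICMP and inner[inner_ihl] == 0


monitor, remote = "192.0.2.1", "192.0.2.10"  # hypothetical addresses
quoted_reply = ipv4_header(ICMP, remote, monitor) + struct.pack("!BBHHH", 0, 0, 0, 1, 1)
suspicious = ipv4_header(ICMP, monitor, remote) + struct.pack("!BBHI", 3, 2, 0, 0) + quoted_reply
print(is_unreachable_for_echo_reply(suspicious))
```

A genuine Protocol Unreachable would quote a protocol the host really does not implement, so the function returns False for those and only flags the odd case where the host rejects a reply to its own ping.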
Now, once this "ICMP Protocol Unreachable" message is sent out to the remote host, it's up to the remote end to decide how to proceed. In most cases, additional pings are still replied to. However, if the customer has some type of security appliance such as an Intrusion Prevention System, many have signatures that prevent additional traffic from being sent after an ICMP Protocol Unreachable message has been received. This is what's happening to me for some customers. Once the ICMP Protocol Unreachable is sent to the remote host after processing its ping reply, I receive no further ping replies to my requests. (I can, however, still telnet / http / https to these devices...)
We're still investigating what is actually preventing the ICMP replies from being sent (IPS, ASA, or host OS) after the ICMP Type 3, Code 2 is received. However, I've come up with a simple bandaid for the problem that leaves monitoring unaffected.
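Just block ICMP protocol unreachable messages from the Orion / Nimsoft server that is sending out pings. You can do this with the Windows Firewall pretty easily. If you have a Cisco router, you can apply the following access-list to the interface/vlan attached to your monitoring server. (Note that the "unreachable" keyword matches every ICMP destination unreachable code; "protocol-unreachable" is the narrower keyword if you only want to drop Code 2.)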
access-list 101 deny icmp host 10.10.254.254 any unreachable
access-list 101 permit ip any any
ip access-group 101 in
You can see below how many it's catching after a 24-48hr period.
Extended IP access list 198
    10 deny icmp host 10.255.254.211 any unreachable (27340 matches)
    20 permit ip any any (5944388 matches)
Since I implemented this bandaid, no further hosts stop responding to pings.