I would like to start off saying I am not the Orion admin. I am the network engineer. I hope I can provide enough information to help solve the problem we are having.
We use Orion to monitor our branch offices, which are connected by an MPLS network. Each site has a T1. Some have secondary connections. We monitor routers, switches, and servers at each office. From what I have been told, we poll the routers more often than other devices so we can detect a router that is down or in trouble, hopefully before the branch office knows there is an issue.
My basic understanding of Orion is Orion uses ICMP pings to determine if the router is up. Orion will mark the router in a warning state if we miss a couple of pings. I understand this may seem too quick, but this is what we have to work under.
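To make the "miss a couple of pings" behavior concrete, here is a minimal sketch of that kind of threshold logic. It is not Orion's actual implementation; the function name `node_states` and the `warn_after=2` threshold are assumptions chosen only for illustration.

```python
def node_states(ping_results, warn_after=2):
    """Return 'up' or 'warning' for each poll.

    ping_results: iterable of booleans, True when an Echo Reply was received.
    warn_after:   hypothetical threshold of consecutive missed replies.
    """
    missed = 0
    states = []
    for ok in ping_results:
        # Any successful reply resets the consecutive-miss counter.
        missed = 0 if ok else missed + 1
        states.append("warning" if missed >= warn_after else "up")
    return states
```

With a threshold this low, two lost or rejected replies in a row are enough to trigger the warning state and, in our case, the trace route script.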
The Orion admin has written a script that will do trace routes from the Windows server (to the router, switch, and branch server) if the router enters a warning state. The problem is we see way too many of these trace files. I can correlate the time on the trace routes to packet captures. Each time a trace is generated I see in the packet capture:
The Orion server sends an ICMP Echo.
The router will respond with an ICMP Echo Reply.
Almost immediately the Orion server will respond with an ICMP Destination Unreachable, Protocol Unreachable. This corresponds to an ICMP Type 3, Code 2.
Most of the ICMP traffic between Orion and the router is normal: Echo->Echo Reply.
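For anyone matching these packets up in a capture, the type and code are the first two bytes of the ICMP header. The sketch below decodes them from raw bytes; the function name `describe_icmp` is made up for this example, and the lookup tables only cover the few values discussed here (taken from RFC 792).

```python
import struct

# ICMP types seen in this capture (RFC 792).
ICMP_TYPES = {0: "Echo Reply", 3: "Destination Unreachable", 8: "Echo"}

# A subset of the Destination Unreachable codes (RFC 792).
UNREACH_CODES = {
    0: "Net Unreachable",
    1: "Host Unreachable",
    2: "Protocol Unreachable",
    3: "Port Unreachable",
}

def describe_icmp(icmp_bytes: bytes) -> str:
    """Label an ICMP message from its type and code fields."""
    if len(icmp_bytes) < 4:
        raise ValueError("ICMP header needs at least 4 bytes")
    # Header layout: type (1 byte), code (1 byte), checksum (2 bytes).
    icmp_type, code, _checksum = struct.unpack("!BBH", icmp_bytes[:4])
    name = ICMP_TYPES.get(icmp_type, f"Type {icmp_type}")
    if icmp_type == 3:
        return f"{name}, {UNREACH_CODES.get(code, f'Code {code}')}"
    return name
```

Feeding it the start of the third packet in the sequence above (type 3, code 2) gives "Destination Unreachable, Protocol Unreachable".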
There is no apparent problem on the network. We run voice and many other applications across the network. No issues. What do we need to look at? Why would the Orion server respond with the "protocol unreachable"? I would appreciate another take on this problem.
It appears this may be an issue with Windows Server 2008 R2.
I'm having this same exact problem with Nimsoft's net_connect probe which sends out 3 immediate pings, then receives 3 pings back from the remote host.
Intermittently, I'm seeing my Nimsoft server send out an ICMP Type 3, Code 2: Protocol Unreachable. When I check the headers of the "offending packet" (the ICMP echo reply), it has an IP Protocol Type of 1 (ICMP), which is correct and expected. For some unknown reason, the TCP/IP stack on the server is intermittently unable to process the ICMP echo reply (received too fast?) and sends out a Protocol Unreachable message. But, based on my knowledge of the TCP/IP stack, this is NOT generated by the application (Solarwinds / Nimsoft). It's the kernel's TCP/IP stack that generates these messages.
Now, once this "ICMP Protocol Unreachable" message is sent out to the remote host, it's up to the remote end to decide how to proceed. In most cases, additional pings are still replied to. However, if the customer has some type of security appliance such as an Intrusion Prevention System, many have signatures that prevent additional traffic from being sent after an ICMP Protocol Unreachable message has been received. This is what's happening to me for some customers. Once the ICMP Protocol Unreachable is sent to the remote host after processing its ping reply, I receive no further ping replies to my requests. (I can however still telnet / http / https to these devices...)
We're still investigating what is actually preventing the ICMP replies from being sent (IPS, ASA, or Host OS) after the ICMP Type 3, Code 2 is received. However, I've come up with a simple band-aid for the problem that will allow monitoring to be unaffected.
Just block ICMP protocol unreachable messages from your Orion / Nimsoft server that is sending out pings. You can do this with the Windows Firewall pretty easily. If you have a Cisco router, you can apply the following access-list to your interface/vlan attached to your monitoring server.
access-list 101 deny icmp host 10.10.254.254 any unreachable
access-list 101 permit ip any any
ip access-group 101 in
You can see below how many it's catching after a 24-48hr period.
Extended IP access list 198
10 deny icmp host 10.255.254.211 any unreachable (27340 matches)
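To put that counter in perspective, 27340 matches over 24 to 48 hours works out to roughly 570 to 1,140 blocked messages per hour. A small sketch of that arithmetic follows; the helper names `match_count` and `per_hour` are invented for this example, and the regular expression assumes the counter appears in the "(N matches)" form shown above.

```python
import re

def match_count(acl_line: str) -> int:
    """Pull the hit counter out of one access-list entry line."""
    m = re.search(r"\((\d+) match(?:es)?\)", acl_line)
    if m is None:
        raise ValueError("no match counter found in line")
    return int(m.group(1))

def per_hour(count: int, hours: float) -> float:
    """Average number of matches per hour over the given window."""
    return count / hours

line = "10 deny icmp host 10.255.254.211 any unreachable (27340 matches)"
count = match_count(line)
print(round(per_hour(count, 48)), "to", round(per_hour(count, 24)), "per hour")
```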
I would very much like to hear how you make out with this....
I have opened several tickets with Solarwinds on a similar issue. They have told me that it's my Windows 2008 OS that's the issue but I'm having a hard time believing that it is. We have an issue where once every couple of weeks, two certain Juniper switches go into alarm in Solarwinds....
Each time, we can ping and reach these switches with no issues. I know this sounds like a network issue on the surface but I can guarantee it's not. These switches are totally independent of one another and not even in the same geographic region of the world.
When I log in to the Solarwinds server, I cannot ping the device from that server, but if I check from any other location in our network the switch is reachable. Yes, I've opened tickets with Juniper, and a trace on their switches shows the ICMP echo request not even arriving at the switches.
I'm writing this because I was just paged about this incident. The usual fix is to reboot the server and the alarm clears (which it just did now).
I can appreciate what Solarwinds is saying about this being an OS issue but we completely rebuilt the box with a brand new fresh copy of Windows 2008 and it continues.
Based on everything I have read, this is a transport layer issue. The host (the Orion server) stops listening for these protocols.
My thought on this issue right now is the Windows 2008 OS and/or the Dell server is having some type of issue. Areas I plan to check are:
1) is NIC teaming causing the issue?
2) is the Dell firmware fully upgraded?
3) is Cisco etherchannel at fault?
4) is Windows fully upgraded?
That's very interesting - we are running on a Dell PowerEdge 2950 with built-in NICs.
In our case I believe the NIC drivers and server firmware are all up to date, but I will double check.
No NIC teaming or Cisco etherchannel involved on our side.
We have looked in two different areas.
We have a tap installed between our core switch and the MPLS router. This allows us to capture data flowing between the core and the router. This is the traffic moving to and from the central office and the branch offices.
We have also captured the same data packets on the core switch.
The Orion server is plugged into the core switch.
So, to make a long story short, we see the same results at the core and as the packets flow from the core through the MPLS router out to the sites.
That's pretty weird. As you probably know, a "protocol unreachable" is supposed to mean that the IP protocol (i.e., the layer 4 protocol) is unknown to the device. Are there any clues in the data field of the ICMP error packet? The ICMP error message should include a copy of the header of the offending packet which produced the error.
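If it helps with checking the data field, here is a rough sketch of pulling the embedded header out of an ICMP error. It assumes you have the raw ICMP bytes (starting at the ICMP type field) exported from the capture and that the embedded packet is IPv4; the function name `parse_icmp_error` and the dictionary keys are made up for this example.

```python
import socket
import struct

def parse_icmp_error(icmp_bytes: bytes) -> dict:
    """Extract the offending packet's header from an ICMP error message.

    Layout per RFC 792: 8 bytes of ICMP header, then the original IP
    header plus at least the first 8 bytes of the original datagram.
    """
    if len(icmp_bytes) < 8 + 20:
        raise ValueError("too short to hold an embedded IPv4 header")
    inner = icmp_bytes[8:]
    ihl = (inner[0] & 0x0F) * 4          # embedded IP header length in bytes
    protocol = inner[9]                  # 1 = ICMP, 6 = TCP, 17 = UDP
    result = {
        "type": icmp_bytes[0],
        "code": icmp_bytes[1],
        "inner_protocol": protocol,
        "inner_src": socket.inet_ntoa(inner[12:16]),
        "inner_dst": socket.inet_ntoa(inner[16:20]),
    }
    # If the offending packet was itself ICMP, decode its first 8 bytes.
    if protocol == 1 and len(inner) >= ihl + 8:
        t, c, _cksum, ident, seq = struct.unpack("!BBHHH", inner[ihl:ihl + 8])
        result.update(inner_icmp_type=t, inner_icmp_code=c,
                      inner_id=ident, inner_seq=seq)
    return result
```

For the case described in this thread you would expect `inner_protocol` to be 1 and `inner_icmp_type` to be 0 (Echo Reply), with `inner_id` and `inner_seq` matching one of the Echo requests the server sent.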
We are looking into the ping timing out. Right now we do not see that as an issue. If we look at the packets right before and right after the issue, there is no change in delay, etc.