Level 11

Weird Juniper Ping Problem

We have an issue that occurs about once a week.  It only involves Juniper EX3200/EX4200 switches, and only some of the ones we have deployed and monitor.

About once a week, we will get an alert from Solarwinds that one of these devices is down (this ALWAYS occurs in the same order of devices making it even more strange).  When we check the device it's actually up, responding to pings etc.  BUT it will not respond to pings from the Solarwinds platform itself.

I opened several tickets with Solarwinds and they have told me that they just rely on the ping utility within Windows 2008 so if the device isn't pingable then it's got to be a Windows 2008 problem.  I have opened tickets with Juniper and we have proven that the ping request is arriving at the EX switch and that the response is going back out.  A Wireshark on the Windows server shows the ping going out but never coming back.

We have the latest Windows patches and the server is considered "clean" from a software perspective (and free of viruses, of course).  There is nothing in between on the network that can cause this issue, so we focused on the Windows server.

Bit the bullet per Solarwinds' suggestion and blew away the Solarwinds server that does the polling.  Rebuilt it, restored everything, and got the same problem.  The server now also has the latest network card drivers.

Have totally run out of ideas - frustrating for sure and no idea what to do next.  Solarwinds said they are bringing out a feature where ping of the remote device isn't required (thank goodness, we were shocked when that was a requirement during our initial installation).  This new "feature" would at least alleviate this issue.

It's also worth noting that while the pings fail, SNMP continues to poll data successfully.  It's also worth noting that the alert will trigger every time on the same initial device and follow the same pattern through other Juniper EX switches at the same time interval.

Thoughts? 😉

34 Replies

Thanks - that's quite interesting.  We have had IPv6 enabled and operational on that box for quite some time actually.  I do recall actually disabling it to see if it could be an issue.

This problem is still occurring and we use IPv6 now for active monitoring (with latest Solarwinds release that finally included IPv6 support).  I am slowly moving over some of the "problem switches" to be monitored via IPv6 ... I'm trying to see if they still appear unresponsive from the Solarwinds system.  Unfortunately this takes some time internally but I was hoping to prove that it's only an IPv4 problem.

Yes, this is on win2k8 R2

Thanks,

Paul

0 Kudos
Level 14

Try changing the size of the data packet that gets sent.  You can do this under the NPM Polling Settings/ICMP Data.  I've seen odd issues before where the default text is either not big enough or is too large.  It makes absolutely no sense whatsoever, but it happens.  Mess around with smaller and larger sizes of text data to see if that is the issue.
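If you want to run the same experiment outside NPM, here is a minimal Python sketch that sweeps ICMP payload sizes from the polling server. It assumes the stock Windows `ping` utility (`-n` count, `-l` payload size, `-w` timeout in milliseconds) and IPv4 targets; the function names and the size list are made up for illustration and have nothing to do with NPM itself.

```python
import subprocess
import sys

# Hypothetical payload sizes to try; adjust as needed.
DEFAULT_SIZES = (8, 23, 32, 64, 512, 1024, 1472)


def build_ping_cmd(host, size, count=1, timeout_ms=2000):
    """Build a Windows ping command line for one payload size."""
    return ["ping", "-n", str(count), "-l", str(size), "-w", str(timeout_ms), host]


def sweep(host, sizes=DEFAULT_SIZES):
    """Ping the host once per payload size; return {size: True/False}."""
    results = {}
    for size in sizes:
        proc = subprocess.run(build_ping_cmd(host, size),
                              capture_output=True, text=True)
        # Windows ping can exit 0 on "Destination host unreachable",
        # so also require a TTL field, which IPv4 echo replies carry.
        results[size] = proc.returncode == 0 and "TTL=" in proc.stdout
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    for size, ok in sweep(sys.argv[1]).items():
        print(f"{size:5d} bytes: {'reply' if ok else 'NO REPLY'}")
```

Run it with the switch address as the only argument while the problem is happening, and again when it is not, then compare which sizes get a reply.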

Also, does the behavior occur at a certain time of day consistently?

0 Kudos

Various times.. happens roughly every 10 days ... but that varies too.

 

Changed the size of the data packet and it made no difference, unfortunately.

 






0 Kudos
Level 10

Check the host's ARP table when the issue occurs. We saw something similar and it was related to a new NIC/driver combo some of our newer hosts were shipping with.

0 Kudos

You mean the NIC driver on the Windows 2008 server, right?  It's a Broadcom GigE card (can't remember the model but could look it up)... we updated all the drivers on it a few weeks ago and the same issue keeps occurring.

Thanks for the reply! 😉

0 Kudos

Two things on this.  

 

First, if you are doing active/active NIC teaming on that server try active/standby.

 

Also, what might be the problem is indeed NIC drivers, but not on the device you are trying to ping.  Broadcom (as far as I am concerned) is responsible for making drivers from time to time that do all sorts of horrid things.  Most times you will not notice, because some switches will disregard those things while others will react to them and things will break.

When the issue occurs do indeed check the ARP tables of whatever is being routed through and compare that to the actual MAC of the NIC as it shows.  Make sure they are all the same.

I really do not know how Juniper handles this, but I have seen certain Cisco switches learn ARP entries in different ways, and I know for a fact that several Broadcom driver versions out there will hose that up: one machine reports itself as another, which corrupts the tables in your switches.
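The ARP comparison can be scripted so it is quick to run while the outage is in progress. This is only a rough Python sketch: it assumes Windows-style `arp -a` output (IP address first, then the MAC), and the function names and the known-good table are hypothetical. Output from a switch would need its own parser.

```python
import re

# Matches an IPv4 address followed by a MAC written with '-' or ':' separators.
ARP_LINE = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3})\s+((?:[0-9A-Fa-f]{2}[-:]){5}[0-9A-Fa-f]{2})"
)


def normalize_mac(mac):
    """Lower-case a MAC and use ':' separators so different tools compare equal."""
    return mac.lower().replace("-", ":")


def parse_arp_table(text):
    """Return {ip: mac} from `arp -a` style output."""
    return {ip: normalize_mac(mac) for ip, mac in ARP_LINE.findall(text)}


def find_mismatches(arp_text, expected):
    """Compare learned entries against a known-good {ip: mac} table.

    Returns {ip: learned_mac} for every entry that differs; a value of
    None means the address had no ARP entry at all.
    """
    learned = parse_arp_table(arp_text)
    bad = {}
    for ip, mac in expected.items():
        seen = learned.get(ip)
        if seen != normalize_mac(mac):
            bad[ip] = seen
    return bad
```

Feed it the text of `arp -a` captured during the outage plus the MACs read off the actual NICs; an empty result means the tables agree.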

 

Hope it helps.

0 Kudos

Thanks - no NIC teaming currently.

ARP issue - interesting thought but wouldn't explain why SNMP can reach the device properly while ICMP Ping can't? 😉

Cheers,

Paul

 

 






0 Kudos

The fact that it is so intermittent really makes troubleshooting difficult.  If it happens every tenth day, can you keep a console session open that morning, drop to shell, and do the following?

 router$ tcpdump -i <input_interface> -w /var/tmp/icmp.cap -s 65000 icmp and host <npm_ip> 

You might be able to see if the packets are at least making it to the switch and see if it's sending back an error or anything else once you export it from the switch and check it out in Wireshark.  I've had problems where SNMP didn't work (when it worked fine querying with the Toolset, but not Orion) and ping did, but this is a new one to me. You could also try setting up a firewall filter to log ICMP packets on your ingress interface, assuming it fits within the strict parameters of what will work on Juniper switch firewall filter, but that won't give you nearly as much detail as a tcpdump.
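Once you have captures from both ends, pairing echo requests with replies by identifier and sequence number shows where a reply goes missing. Below is a minimal Python sketch of just the matching step. It assumes you have already extracted the raw ICMP message bytes from each capture (that extraction is not shown), and the function names are invented for illustration. The header layout and type values (8 for echo request, 0 for echo reply) are standard ICMPv4.

```python
import struct

ECHO_REPLY, ECHO_REQUEST = 0, 8  # ICMPv4 type values


def parse_echo(icmp_bytes):
    """Return (type, identifier, sequence) for an ICMP echo message, else None."""
    if len(icmp_bytes) < 8:
        return None  # too short to hold an ICMP echo header
    icmp_type, _code, _cksum, ident, seq = struct.unpack("!BBHHH", icmp_bytes[:8])
    if icmp_type not in (ECHO_REQUEST, ECHO_REPLY):
        return None
    return icmp_type, ident, seq


def unanswered(messages):
    """List (identifier, sequence) of echo requests with no matching reply.

    `messages` is an iterable of raw ICMP messages seen at ONE capture point.
    """
    requests, replies = [], set()
    for raw in messages:
        parsed = parse_echo(raw)
        if parsed is None:
            continue
        icmp_type, ident, seq = parsed
        if icmp_type == ECHO_REQUEST:
            requests.append((ident, seq))
        else:
            replies.add((ident, seq))
    return [key for key in requests if key not in replies]
```

If a request shows up as unanswered in the server-side capture but answered in the switch-side capture, the reply left the switch and was lost somewhere on the return path.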

I know it doesn't make sense that ARP would be the culprit, but I've seen very weird issues caused by ARP, so I certainly wouldn't discount it as being the problem.

One other thing to check, if you haven't already, is to see if there have been any STP changes on your network during that time.  If all these switches are in one location and they all seem to have the same problem at the same time, you might be having an STP topology change.  Doesn't necessarily explain why SNMP works the entire time, but hey, I've seen weird issues with STP, too.

-evt

0 Kudos

Thanks - we can do tcpdump directly on the Juniper switch (using one of them that the problem appears on).  It shows the icmp request come in and the echo response go back out.  I will double check the ARP condition when this occurs again.  Appreciate it...

Paul

0 Kudos

I am just curious, does a ping from cmd work and the solarwinds one have the problem?

Also, do you have, or could you get, a hardware network tap (or a homemade one) to see if the packet is reaching the server, outside of the OS?  The homemade one can be a little flaky, but you get what you pay for.

0 Kudos

Thanks for the responses.  Ping from cmd does not work either.

We just moved this server in behind a Juniper SRX firewall (where it used to be, but we moved it out in case the SRX was causing some surprises).  On the SRX I can perform a full packet capture which I'm hoping to do when it happens again.

😉

0 Kudos

Hi nexicom--

Have you tried contacting support at all? If so, I'm curious what they told you. If not, you may want to at this point.

When you open a support ticket, would you:

--Reference this thread to Support.
--Post back here with a case number.
--Post any solutions you get from Support.


Many thx,

M

0 Kudos

Thank you.

 

Yes, we have opened cases on this in the past - each time we were told that Solarwinds just uses the ping utility within Windows itself.  We were also unable to ping this device from the command prompt, which lent credibility to the responses we got.  We have an open feature request to remove ping as a requirement to determine if a node is up (a valid request, as we have some customers we monitor who block ping requests - something I'm not a fan of, but since it's their network it's difficult to tell them they must adjust their "security policy" because of our needs).

Feature request 244447 - PING device - required?

Most recent ticket was: #168251 - "Unable to Ping Host Occassionally"

Thank you! 😉

0 Kudos

Great Nexicom--

Thx for updating the community.

M

0 Kudos