Weird Juniper Ping Problem

We have an issue that occurs about once a week.  It only involves Juniper EX3200/EX4200 switches, and only some of the ones we have deployed/monitored.

About once a week, we will get an alert from Solarwinds that one of these devices is down (this ALWAYS occurs in the same order of devices making it even more strange).  When we check the device it's actually up, responding to pings etc.  BUT it will not respond to pings from the Solarwinds platform itself.

I opened several tickets with Solarwinds and they have told me that they just rely on the ping utility within Windows 2008, so if the device isn't pingable then it's got to be a Windows 2008 problem.  I have opened tickets with Juniper and we have proven that the ping request is arriving at the EX switch and that the response is going back out.  A Wireshark capture on the Windows server shows the ping going out but never coming back.

We have the latest Windows patches and the server is considered "clean" from a software perspective (and viruses of course).  There is nothing in between on the network that can cause this issue so we focused on the Windows server.

Bit the bullet as per Solarwinds' suggestion and blew away the Solarwinds server that does the polling.  Rebuilt it, restored everything, and same problem.  Server now also has the latest network card drivers.

Have totally run out of ideas - frustrating for sure and no idea what to do next.  Solarwinds said they are bringing out a feature where ping of the remote device isn't required (thank goodness, we were shocked when that was a requirement during our initial installation).  This new "feature" would at least alleviate this issue.

It's also worth noting that while the pings fail, SNMP continues to poll data successfully.  It's also worth noting that the alert will trigger every time on the same initial device and follow the same pattern through other Juniper EX switches at the same time interval.

Thoughts? ;)

  • Check the host's ARP table when the issue occurs. We saw something similar, and it was related to a new NIC/driver combo that some of our newer hosts were shipping with.

  • You mean the NIC driver on the Windows 2008 server, right?  It's a Broadcom GigE card (can't remember model but could look it up)... we updated all the drivers on it a few weeks ago and the same issue keeps occurring.

    Thanks for the reply! ;)

  • Two things on this.  

     

    First, if you are doing active/active NIC teaming on that server try active/standby.

     

    Also, what might be the problem is indeed NIC drivers, but not on the one you are trying to ping.  Broadcom (as far as I am concerned) is responsible for making drivers from time to time that do all sorts of horrid things.  Most times you will not notice, because some switches will disregard those things while others will react to them and things will break.

    When the issue occurs do indeed check the ARP tables of whatever is being routed through and compare that to the actual MAC of the NIC as it shows.  Make sure they are all the same.

    I really do not know how Juniper handles this, but I have seen certain Cisco switches learn ARP entries in different ways, and I know for a fact Broadcom has several driver versions out there that will hose that up: one machine will report itself as another, hosing up the tables in your switches.

     

    Hope it helps.
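
    The ARP comparison suggested above could be scripted roughly as follows. This is only a sketch: it assumes the table text comes from `arp -a` on the Windows polling server, the sample addresses are made up, and `parse_arp_table` / `find_mismatches` are hypothetical helper names, not part of any Solarwinds or Juniper tool.

```python
import re

# Matches lines such as "  10.0.0.1   00-1a-2b-3c-4d-5e   dynamic"
# as printed by "arp -a" on Windows (assumed format).
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:[-:][0-9A-Fa-f]{2}){5})"
)


def normalize_mac(mac):
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.lower().replace("-", ":")


def parse_arp_table(text):
    """Return {ip: mac} for every entry line found in the ARP table text."""
    table = {}
    for line in text.splitlines():
        match = ARP_LINE.match(line)
        if match:
            table[match.group(1)] = normalize_mac(match.group(2))
    return table


def find_mismatches(table, expected):
    """Return {ip: mac_seen} for entries that differ from the expected MAC.

    mac_seen is None when the IP has no ARP entry at all.
    """
    problems = {}
    for ip, mac in expected.items():
        seen = table.get(ip)
        if seen != normalize_mac(mac):
            problems[ip] = seen
    return problems


# Made-up sample output, for illustration only.
SAMPLE = """
Interface: 10.0.0.50 --- 0xb
  Internet Address      Physical Address      Type
  10.0.0.1              00-1a-2b-3c-4d-5e     dynamic
  10.0.0.2              00-1a-2b-3c-4d-5f     dynamic
"""

if __name__ == "__main__":
    expected = {"10.0.0.1": "00:1a:2b:3c:4d:5e", "10.0.0.2": "aa:bb:cc:dd:ee:ff"}
    print(find_mismatches(parse_arp_table(SAMPLE), expected))
```

    Saving the `arp -a` output while the alert is active and again once it clears, then running both through the same comparison, would show whether an entry changes during the outage. If the server has several interfaces, a later entry for the same IP overwrites an earlier one in this sketch.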

  • Thanks - no NIC teaming currently.

    ARP issue - interesting thought but wouldn't explain why SNMP can reach the device properly while ICMP Ping can't? ;)

    Cheers,

    Paul

     

     






  • The fact that it is so intermittent really makes troubleshooting difficult.  If it happens every tenth day, can you keep a console session open that morning, drop to shell, and do the following?

     router$ tcpdump -i <input_interface> -w /var/tmp/icmp.cap -s 65000 icmp and host <npm_ip>

    Once you export the capture from the switch and open it in Wireshark, you should be able to see whether the packets are at least making it to the switch and whether it's sending back an error or anything else.  I've had problems where SNMP didn't work (it worked fine querying with the Toolset, but not Orion) and ping did, but this is a new one to me. You could also try setting up a firewall filter to log ICMP packets on your ingress interface, assuming it fits within the strict parameters of what will work in a Juniper switch firewall filter, but that won't give you nearly as much detail as a tcpdump.

    I know it doesn't make sense that ARP would be the culprit, but I've seen very weird issues caused by ARP, so I certainly wouldn't discount it as being the problem.

    One other thing to check, if you haven't already, is to see if there have been any STP changes on your network during that time.  If all these switches are in one location and they all seem to have the same problem at the same time, you might be having an STP topology change.  Doesn't necessarily explain why SNMP works the entire time, but hey, I've seen weird issues with STP, too.

    -evt

  • I am just curious, does a ping from cmd work and the solarwinds one have the problem?

    Also, do you have or could you get a hardware network tap or this to see if the packet is coming to the server? (outside of the OS)  the homemade one can be a little flaky, but you get what you pay for.

  • Thanks - we can do tcpdump directly on the Juniper switch (using one of them that the problem appears on).  It shows the ICMP echo request come in and the echo reply go back out.  I will double check the ARP condition when this occurs again.  Appreciate it...

    Paul
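
    One way to line up the switch-side and server-side captures is to compare the ICMP echo replies seen at each point by identifier and sequence number. A minimal sketch, assuming the (identifier, sequence) pairs have already been pulled out of each capture (for example from a Wireshark export); `replies_lost_in_transit` is a hypothetical helper and the numbers are made up:

```python
def replies_lost_in_transit(switch_replies, server_replies):
    """Return echo replies that left the switch but never reached the server.

    Each argument is an iterable of (icmp_identifier, icmp_sequence) tuples
    taken from the capture at that point in the path.
    """
    return sorted(set(switch_replies) - set(server_replies))


if __name__ == "__main__":
    # Made-up values for illustration only.
    seen_leaving_switch = [(1, 10), (1, 11), (1, 12)]
    seen_at_server = [(1, 10)]
    print(replies_lost_in_transit(seen_leaving_switch, seen_at_server))
```

    Any pair that shows up in the switch capture but not in the server capture was dropped somewhere between the two capture points, which narrows down where to look next.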

  • Thanks for the responses.  Ping from cmd does not work either.

    We just moved this server in behind a Juniper SRX firewall (where it used to be, but we moved it out in case the SRX was causing some surprises).  On the SRX I can perform a full packet capture which I'm hoping to do when it happens again.

    ;)

  • Hi nexicom--

    Have you tried contacting support at all? If so, I'm curious what they told you. If not, you may want to at this point.

    When you open a support ticket, would you:

    --Reference this thread to Support.
    --Post back here with a case number.
    --Post any solutions you get from Support.


    Many thx,

    M

  • Thank you.

     

    Yes, we have opened cases on this in the past - each time we were told that Solarwinds just uses the ping utility within Windows itself.  We were also unable to ping this device from the command prompt, which lent credibility to the responses we got.  We have an open feature request to remove ping as a requirement to determine if a node is up (a valid request, as we have some customers we monitor who block ping requests - something I'm not a fan of, but since it's their network it's difficult to tell them they must adjust their "security policy" because of our needs).

    Feature request 244447 - PING device - required?

    Most recent ticket was: #168251 - "Unable to Ping Host Occassionally"

    Thank you! ;)

  • Great Nexicom--

    Thx for updating the community.

    M