34 Replies Latest reply on Oct 22, 2015 6:36 PM by callmejon

    Weird Juniper Ping Problem

    pstewart726

      We have an issue that occurs about once a week.  It only involves Juniper EX3200/EX4200 switches, and only some of the ones we have deployed and monitor.

      About once a week, we will get an alert from Solarwinds that one of these devices is down (this ALWAYS occurs in the same order of devices making it even more strange).  When we check the device it's actually up, responding to pings etc.  BUT it will not respond to pings from the Solarwinds platform itself.

      I opened several tickets with Solarwinds and they have told me that they just rely on the ping utility within Windows 2008 so if the device isn't pingable then it's got to be a Windows 2008 problem.  I have opened tickets with Juniper and we have proven that the ping request is arriving at the EX switch and that the response is going back out.  A Wireshark on the Windows server shows the ping going out but never coming back.

      We have the latest Windows patches and the server is considered "clean" from a software perspective (and viruses of course).  There is nothing in between on the network that can cause this issue so we focused on the Windows server.

      Bit the bullet as per Solarwinds suggestion and blew away the Solarwinds server that does the polling.  Rebuilt it, restored everything and same problem.  Server now also has the latest network card drivers.

      Have totally run out of ideas - frustrating for sure and no idea what to do next.  Solarwinds said they are bringing out a feature where ping of the remote device isn't required (thank goodness, we were shocked when that was a requirement during our initial installation).  This new "feature" would at least alleviate this issue.

      It's also worth noting that while the pings fail, SNMP continues to poll data successfully.  It's also worth noting that the alert will trigger every time on the same initial device and follow the same pattern through other Juniper EX switches at the same time interval.

      Thoughts? ;)

        • Re: Weird Juniper Ping Problem
          adeimel

          Check the host's ARP table when the issue occurs. We saw something similar and it was related to a new NIC / driver combo that some of our newer hosts were shipping with.

            • Re: Weird Juniper Ping Problem
              pstewart726

              You mean the NIC driver on the Windows 2008 server right?  It's a Broadcom GigE card (can't remember model but could look it up)... we updated all the drivers on it a few weeks ago and the same issue keeps occurring.

              Thanks for the reply! ;)

                • Re: Weird Juniper Ping Problem
                  Donald_Francis

                  Two things on this.  

                   

                  First, if you are doing active/active NIC teaming on that server try active/standby.

                   

                  Also what might be the problem is indeed NIC drivers but not the one you are trying to ping.  Broadcom (as far as I am concerned) is responsible for making drivers from time to time that do all sorts of horrid things.  Most times you will not notice because some switches will disregard those things and others will react to them and things will break.

                  When the issue occurs do indeed check the ARP tables of whatever is being routed through and compare that to the actual MAC of the NIC as it shows.  Make sure they are all the same.

                  I really do not know how Juniper handles this, but I have seen certain Cisco switches learn ARP entries in different ways, and I know for a fact Broadcom has several driver versions out there that will hose that up: one machine will report itself as another, hosing up the tables in your switches.

                   

                  Hope it helps.
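
One way to make the ARP comparison suggested above repeatable on the Windows side is to save the output of `arp -a` while the issue is occurring and compare it against the MAC addresses you expect to see. What follows is only a minimal Python sketch and not something anyone in this thread ran: the helper names (`parse_arp_table`, `find_mismatches`) are made up for illustration, and it assumes the usual English-language `arp -a` column layout (IP address, physical address, type).

```python
import re

# Matches data rows of Windows "arp -a" output, e.g.
#   10.0.0.1              00-1a-2b-3c-4d-5e     dynamic
ARP_ROW = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:[-:][0-9A-Fa-f]{2}){5})\s+(\w+)"
)


def normalize_mac(mac):
    """Lower-case a MAC address and strip separators so formats compare equal."""
    return re.sub(r"[^0-9a-f]", "", mac.lower())


def parse_arp_table(text):
    """Return {ip: normalized_mac} for every data row in the given text."""
    table = {}
    for line in text.splitlines():
        match = ARP_ROW.match(line)
        if match:
            ip, mac, _entry_type = match.groups()
            table[ip] = normalize_mac(mac)
    return table


def find_mismatches(table, expected):
    """Compare the parsed table with {ip: expected_mac}.

    Returns {ip: seen_mac_or_None} for every expected entry that is
    missing from the table or has a different MAC address.
    """
    problems = {}
    for ip, mac in expected.items():
        seen = table.get(ip)
        if seen != normalize_mac(mac):
            problems[ip] = seen
    return problems
```

Feeding it the saved text plus a dictionary such as `{"10.0.0.1": "00:1a:2b:3c:4d:5e"}` (a hypothetical gateway address) returns the entries that are missing or different; an empty result means the table matches what you expected.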

                    • Re: Weird Juniper Ping Problem
                      pstewart726

                      Thanks - no NIC teaming currently.

                      ARP issue - interesting thought but wouldn't explain why SNMP can reach the device properly while ICMP Ping can't? ;)

                      Cheers,

                      Paul

                       

                       

                       




                       



                        • Re: Weird Juniper Ping Problem
                          bleearg13

                          The fact that it is so intermittent really makes troubleshooting difficult.  If it happens every tenth day, can you keep a console session open that morning, drop to shell, and do the following?

                           router$ tcpdump -i <input_interface> -w /var/tmp/icmp.cap -s 65000 icmp and ip host <npm_ip>

                          You might be able to see if the packets are at least making it to the switch and see if it's sending back an error or anything else once you export it from the switch and check it out in Wireshark.  I've had problems where SNMP didn't work (when it worked fine querying with the Toolset, but not Orion) and ping did, but this is a new one to me. You could also try setting up a firewall filter to log ICMP packets on your ingress interface, assuming it fits within the strict parameters of what will work in a Juniper switch firewall filter, but that won't give you nearly as much detail as a tcpdump.

                          I know it doesn't make sense that ARP would be the culprit, but I've seen very weird issues caused by ARP, so I certainly wouldn't discount it as being the problem.

                          One other thing to check, if you haven't already, is to see if there have been any STP changes on your network during that time.  If all these switches are in one location and they all seem to have the same problem at the same time, you might be having an STP topology change.  Doesn't necessarily explain why SNMP works the entire time, but hey, I've seen weird issues with STP, too.

                          -evt

                            • Re: Weird Juniper Ping Problem
                              netlogix

                              I am just curious, does a ping from cmd work and the solarwinds one have the problem?

                              Also, do you have, or could you get, a hardware network tap (or a homemade one) to see if the packet is coming to the server (outside of the OS)?  The homemade one can be a little flaky, but you get what you pay for.

                                • Re: Weird Juniper Ping Problem
                                  pstewart726

                                  Thanks for the responses.  Ping from cmd does not work either.

                                  We just moved this server in behind a Juniper SRX firewall (where it used to be, but we moved it out in case the SRX was causing some surprises).  On the SRX I can perform a full packet capture which I'm hoping to do when it happens again.

                                  ;)

                                    • Re: Weird Juniper Ping Problem

                                      Hi nexicom--

                                      Have you tried contacting support at all? If so, I'm curious what they told you. If not, you may want to at this point.

                                      When you open a support ticket, would you:

                                      --Reference this thread to Support.
                                      --Post back here with a case number.
                                      --Post any solutions you get from Support.


                                      Many thx,

                                      M

                                        • Re: Weird Juniper Ping Problem
                                          pstewart726

                                          Thank you.

                                           

                                          Yes, we have opened cases on this in the past - each time we were told that Solarwinds just uses the ping utility within Windows itself.  We were also unable to ping this device from the command prompt which lent credibility to the responses we got.  We have an open feature request to remove ping as a requirement to determine if a node is up (a valid request as we have some customers we monitor who block ping requests - something I'm not a fan of, but since it's their networks it's difficult to tell them they must adjust their "security policy" because of our needs).

                                          Feature request 244447 - PING device - required?

                                          Most recent ticket was: #168251 - "Unable to Ping Host Occassionally"

                                           

                                          Thank you! ;)

                                    • Re: Weird Juniper Ping Problem
                                      pstewart726

                                      Thanks - we can do tcpdump directly on the Juniper switch (using one of them that the problem appears on).  It shows the icmp request come in and the echo response go back out.  I will double check the ARP condition when this occurs again.  Appreciate it...

                                      Paul

                            • Re: Weird Juniper Ping Problem
                              bleearg13

                              Try changing the size of the data packet that gets sent.  You can do this under the NPM Polling Settings/ICMP Data.  I've seen odd issues before where the default text is either not big enough or is too large.  It makes absolutely no sense whatsoever, but it happens.  Mess around with smaller and larger sizes of text data to see if that is the issue.

                              Also, does the behavior occur at a certain time of day consistently?

                                • Re: Weird Juniper Ping Problem
                                  pstewart726

                                  Various times.. happens roughly every 10 days ... but that varies too.

                                   

                                  Changed the size of the data packet and made no difference unfortunately.

                                   

                                   




                                   



                                • Re: Weird Juniper Ping Problem
                                  gibjim01

                                  Hi There!

                                  Did you find a solution to this problem? I guess I have the same issue as described in this post. I also have many Juniper EX4200 switches. Is this related?

                                  The weird thing is that once the problem happens, Orion is unable to ping two or three devices. To make these devices reappear again, you only have to ping them from your workstation... and there they come!

                                  Any idea?

                                  Thx!

                                    • Re: Weird Juniper Ping Problem
                                      pstewart726

                                      No solution yet found to this issue.  We can always ping the remote Juniper switch from a workstation but not from the solarwinds server itself when this occurs.

                                      Yes, when it happens it is very predictable - at least a few Juniper EX switches will not be pingable at first, then after many hours they will become pingable again.  After the first few switches are responsive again, then another "batch" of switches will go into the same state.  This is occurring for us on Juniper EX2200, EX3200 and EX4200 switches and no other Juniper based hardware is affected.

                                      From the Juniper EX switch, we can run a packet trace that proves the ping is reaching the switch and that the switch is responding back with an echo - but on the windows 2008 server we only see the ping go out and not come back (it's dropping at the Windows box for some reason).  Turning off the firewall doesn't help - we have tried.

                                      Thanks,

                                      Paul

                                        • Re: Weird Juniper Ping Problem
                                          pstewart726

                                          I should have added that we have engaged Microsoft and they have come up with no answers on the dropped ping.  We have tested another monitoring solution (which doesn't require ping) and no issues.  Also, during the time that the ping stops responding we are able to still get SNMP based data from the switches.

                                            • Re: Weird Juniper Ping Problem
                                              adeimel

                                              What did you find in your ARP tables during these events?

                                              I *believe* we had this issue on a lot of HP dlXXX G7s that were shipping with a newer Broadcom card. We swapped out to another NIC and the issues went away. The only thing I saw during my events was missing ARP entries. Our SEs swapped NICs after I showed them that, so I didn't get to dig much deeper.

                                                • Re: Weird Juniper Ping Problem
                                                  pstewart726

                                                  Thanks for reminding me - we were going to order new NIC cards and see if that helped with this issue.  This is something we haven't tried yet but definitely will, and I'll report back....

                                                    • Re: Weird Juniper Ping Problem
                                                      gibjim01

                                                      Looks like I also have Broadcom onboard cards mounted on an IBM X3650 machine. The NIC model is BCM5709C. Driver version is 5.2.17.0 (2009-12-28). It seems that the latest driver version is 14.4.8.4, which was released on March 18th 2011. Did you try to update the driver?

                                                        • Re: Weird Juniper Ping Problem
                                                          pstewart726

                                                          We replaced the NIC card last week - took three days and same issue started again.  This time we used a "tried and true" Intel Pro 1000 NIC card.

                                                          I am perplexed by this issue - literally running out of ideas.  Solarwinds still has not brought out a feature we've asked for since purchase - being able to monitor devices without pinging them!!!  I did a search before posting and there's lots of folks who have asked for this feature...;(

                                                          When it occurred the last time, I managed to fire up Wireshark and could see echo requests leaving the server and never coming back - this ONLY happens on Juniper devices.  When I do a packet trace on the Juniper devices I can see the echo request come in and the response leave the device..... logic would say it's something in between our Solarwinds boxes and the Juniper devices but we can't find it - spent a LOT of time on this issue.....

                                                          Thanks,

                                                          Paul

                                                            • Re: Weird Juniper Ping Problem
                                                              kenosmith3

                                                              Paul,

                                                              What version of JunOS are you running? We have had multiple issues with JunOS doing very basic network operations. At one point, I was pulling the routing table using SNMP and it was causing the CPU on our 8600s to sawtooth from 50 to 90 every 6 hours.

                                                              I would lean toward an issue with the JunOS code versus an issue with the NIC (even though Broadcom NICs are pretty buggy).

                                                               

                                                              Ken

                                                                • Re: Weird Juniper Ping Problem
                                                                  pstewart726

                                                                  Hi Ken...

                                                                  We have been through several revisions of JunOS code over the timeframe this issue has been occurring.  Our "luck" with Juniper has been pretty good overall, especially on EX switches - will admit though that the 9.x code on EX switches in particular was not very impressive.  Can you describe the issue and what code version was causing your SNMP issues with Juniper (8600??) ?

                                                                  Today in the Juniper world, we run M, MX, E, EX, ERX, J, SRX .. I'm sure I missed something in the alphabet there ;)  But this event with Solarwinds *only* happens to EX based equipment .... I was going with the logic that it has to be a code issue on the Juniper for a long time, and perhaps it is - but every time I work on this I end up ruling the Juniper boxes out of the equation.....

                                                                  Thanks,

                                                                  Paul

                                                                    • Re: Weird Juniper Ping Problem
                                                                      kenosmith3

                                                                      I actually meant to say that we had issues with the code on our EX8208s, not 8600s (we have some Nortels on the network). We're currently running 10.4r7, so far the most stable code for us. We had been bug chasing through the early versions of 10. We'll probably stay here for a while..

                                                                      When I was polling the following OIDs from the EX8208s we were seeing the RPD process running up the CPU:

                                                                      1.3.6.1.2.1.4.24.4.1.1 ipCidrRouteDest

                                                                      1.3.6.1.2.1.4.24.4.1.2 ipCidrRouteMask

                                                                      1.3.6.1.2.1.4.24.4.1.4 ipCidrRouteNextHop

                                                                      1.3.6.1.2.1.4.24.4.1.5 ipCidrRouteIfIndex

                                                                      1.3.6.1.2.1.4.24.4.1.6 ipCidrRouteType

                                                                      1.3.6.1.2.1.4.24.4.1.7 ipCidrRouteProto

                                                                      Below is a chart that reflects the CPU change when we disabled those OIDs...

                                                                       

                                                                      With your issue, have you tried using a demo of WhatsUp Gold or a similar network monitoring application to see if you have similar results?

                                                                      Ken

                                                                        • Re: Weird Juniper Ping Problem
                                                                          pstewart726

                                                                          Ken - how big is your routing table?  By polling all those OIDs you are in effect downloading the entire routing table in your network, no?  Perhaps I need more caffeine this morning ;)

                                                                          Good question - yes, I have run a couple of other open source packages against this same equipment during the same time periods and we did not have the issue.  I do plan though to keep a demo of What's Up or similar running again - wondering if there's a certain bug that's hit after "X ICMP pings" with "certain characteristics" type of situation occurring.

                                                                          I have also tried to change the ping size in Solarwinds from one extreme to the next and no change in behaviour.

                                                                          The incredibly strange thing that occurs is that it's always the same devices, in the same order, with the exact same time intervals between each device.  So once we see Device1 go "down" (can't ping it, but Solarwinds can poll SNMP no problem), we know it's exactly 3.5 hours, and once it comes up we know device2 and device3 will show "down" etc. (I have summarized this, there's a bit more to it).  But it's completely predictable *once* it starts - the time for the event to occur does vary to some degree but the sequence once started is exactly the same.  The devices affected (about half of the EX switches) are geographically diverse in our network with very few common points, which makes this a real "head scratcher".

                                                                          thanks,

                                                                          Paul
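
To put numbers on the "same devices, same order, same intervals" pattern described above, the down/up alert times could be reduced to per-node outage durations and compared from one event to the next. The Python below is only a rough sketch under assumptions: the event format (ISO timestamp, node name, "down" or "up") and the function name `outage_durations` are hypothetical and would have to be adapted to however the alert history is actually exported.

```python
from datetime import datetime, timedelta


def outage_durations(events):
    """Pair each 'down' event with the next 'up' event for the same node.

    events: iterable of (iso_timestamp, node, state) tuples sorted by time,
            where state is either "down" or "up".
    Returns a list of (node, outage_start, duration) in the order the
    outages ended.
    """
    open_outages = {}
    outages = []
    for timestamp, node, state in events:
        when = datetime.fromisoformat(timestamp)
        if state == "down":
            # Keep the first "down" if several arrive before an "up".
            open_outages.setdefault(node, when)
        elif state == "up" and node in open_outages:
            start = open_outages.pop(node)
            outages.append((node, start, when - start))
    return outages


def format_report(outages):
    """Render one line per outage: start time, node, duration in hours."""
    lines = []
    for node, start, duration in outages:
        hours = duration / timedelta(hours=1)
        lines.append(f"{start.isoformat()}  {node}  {hours:.2f} h")
    return lines
```

If the durations and the node order come out identical across several events, that would support the idea of a fixed timeout somewhere rather than a random failure.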

                                                                            • Re: Weird Juniper Ping Problem
                                                                              bleearg13

                                                                              Paul,

                                                                              Is there a firewall in front of the Orion server?  3.5 hours seems like a timeout of some sort, like session, NAT tables, or ARP cache.  Also, if the IP is pingable from the server itself during this time, I'd say the issue is definitely related to Solarwinds.  Can you do a port mirror on the switchport that connects to the Orion server?  That should at least verify the responses are getting all the way back to the server and help in showing SW that the problem isn't network connectivity.  It's a very strange problem, indeed!  If you find the answer, please post, as this problem has my curiosity piqued.

                                                                               

                                                                              -evt

                                                                                • Re: Weird Juniper Ping Problem
                                                                                  pstewart726

                                                                                  Thank you evt....

                                                                                  There *was* a Juniper SRX firewall in front of the Solarwinds installation until we moved things around this morning (we were beginning to think of some kind of flow timeouts).  The Solarwinds systems are now connected directly to a Juniper MX80 router.  My hopes will be to run a full packet dump on the MX80 when this event occurs next.  We also had an IPv6 issue with the SRX firewall and need to start monitoring via IPv6 as well.  I have taken a couple of the EX switches that are "problems" and started monitoring them via IPv6.

                                                                                  It is worth noting however that we did previously move the Solarwinds system away from the SRX firewall and still had issues - but it's worth trying again (and avoids a nasty IPv6 problem we were encountering).

                                                                                  My hope will be to get the packet trace directly in front of the Solarwinds server (on the MX80) and also see if the nodes now on IPv6 monitoring behave any differently.  This will hopefully provide some clues.

                                                                                  When this issue does happen, I cannot ping from the Solarwinds server itself - Solarwinds support told me they utilize the "builtin" ping of Windows 2008 .... so that really points back to a network layer issue....

                                                                                  Appreciate it,

                                                                                  Paul

                                                                                    • Re: Weird Juniper Ping Problem
                                                                                      bleearg13

                                                                                      Hi Paul,

                                                                                       



                                                                                      When this issue does happen, I cannot ping from the Solarwinds server itself - Solarwinds support told me they utilize the "builtin" ping of Windows 2008 .... so that really points back to a network layer issue....

                                                                                       



                                                                                      Sorry, I think you mentioned that before, but I was apparently too lazy to browse back.  If you are the same Paul Stewart who is on the j-nsp list, someone just posted a resolution to a strange EX4200 "missing ARP" problem, so you might want to respond/check that out if you haven't seen it yet:

                                                                                      http://www.gossamer-threads.com/lists/nsp/juniper/32941

                                                                                      It sounds like it might be similar to what you are seeing.  I personally have not seen this problem, but you might have something enabled in your EX switches that we do not.  It could be a particular flavor of STP that you're running which is triggering the bug or perhaps multicast.

                                                                                      -evt

                                                                                        • Re: Weird Juniper Ping Problem
                                                                                          pstewart726

                                                                                          Thanks - yes, same guy! ;)

                                                                                          I'm pretty sure this isn't an ARP issue - the switches would not have an ARP entry for the Solarwinds server that is at least a few layer3 hops away.

                                                                                          Hopefully when this occurs again I can get a good capture from the MX80 directly in front of the Solarwinds server - then if I can see the ping go out and come back it'll prove it out....

                                                                                          For reference, in the past I was able to do a packet capture directly on the Juniper EX switches that were reported as unreachable - the capture did show the ping coming in and the echo response going out.

                                                                                          Cheers,

                                                                                          Paul

                                                                                            • Re: Weird Juniper Ping Problem
                                                                                              techprof

                                                                                              I had the same issue.  By chance is your Orion server running on Win2k8 R2?  I had noticed that my ipv6 was disabled for the NIC, and once enabled the every 10 days or so issue with Juniper false alerts stopped.

                                                                                                • Re: Weird Juniper Ping Problem
                                                                                                  pstewart726

                                                                                                  Thanks - that's quite interesting.  We have had IPv6 enabled and operational on that box for quite some time actually.  I do recall actually disabling it to see if it could be an issue.

                                                                                                  This problem is still occurring and we use IPv6 now for active monitoring (with latest Solarwinds release that finally included IPv6 support).  I am slowly moving over some of the "problem switches" to be monitored via IPv6 ... I'm trying to see if they still appear unresponsive from the Solarwinds system.  Unfortunately this takes some time internally but I was hoping to prove that it's only an IPv4 problem.

                                                                                                  Yes, this is on win2k8 R2

                                                                                                  Thanks,

                                                                                                  Paul

                                                                  • Re: Weird Juniper Ping Problem
                                                                    pstewart726

                                                                    I wanted to follow up on this again as we have made some progress.

                                                                     

                                                                    We spun up an entirely new set of VMs in our lab and loaded in the list of "problem nodes".  To date none of them have falsely tripped - but yet in our production environment they have tripped.

                                                                     

                                                                    This would seem like some kind of Solarwinds issue; however, we have no way to prove it.  Next, I'm going to try deleting and re-adding these problem nodes to see if that has any impact.

                                                                     

                                                                    Paul

                                                                      • Re: Weird Juniper Ping Problem
                                                                        v_kosarev

                                                                        I suspect that in this case you could be hitting a simple ICMP rate-limit policer on the Juniper.

                                                                        The JunOS kernel is based on FreeBSD, which has a built-in limit on responding to ICMP requests. In FreeBSD you can change this limit with the "sysctl net.inet.icmp.icmplim=Value" command; however, in JunOS this limit is hardcoded and cannot be changed.
                                                                        Unfortunately I don't have a Juniper device to test on right now, but in FreeBSD, when the kernel discards ICMP packets, it writes a message to dmesg like "Limiting ICMP ping response from xxx to 200 packets per second".

                                                                        I guess this limiting is performed on a per-host basis, so the Juniper still responds to other hosts on the network, ignoring only Solarwinds.

                                                                        To check this, you could run cmd>ping -t juniper_ip_addr1 on the SolarWinds server and leave it pinging for, let's say, a week. If device juniper_ip_addr1 goes Down much more often during this week than the other Junipers, we have found the issue.
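
A plain `ping -t` left running for a week gives no timestamps, so a small logger may be easier to compare against the alert history. The following Python is only a sketch of that idea, with assumptions stated up front: the function names and the `ping_log.csv` file name are made up, it shells out to the Windows `ping` command (`-n 1 -w <ms>`), and it treats a reply as genuine only if the output contains "TTL=", which is how English-language Windows ping prints echo replies.

```python
import subprocess
import time
from datetime import datetime


def reply_received(ping_output):
    """True if the ping output contains a real echo reply.

    Assumption: Windows ping prints "TTL=" only on echo-reply lines,
    not on timeouts or "Destination host unreachable" lines.
    """
    return "TTL=" in ping_output.upper()


def ping_once(host, timeout_ms=1000):
    """Send one echo request using the Windows ping utility."""
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), host],
        capture_output=True,
        text=True,
    )
    return reply_received(result.stdout)


def transitions(samples):
    """Collapse (timestamp, ok) samples into the points where the state changes."""
    changes = []
    last = None
    for timestamp, ok in samples:
        if ok != last:
            changes.append((timestamp, ok))
            last = ok
    return changes


def monitor(host, interval_s=5.0, log_path="ping_log.csv"):
    """Ping forever, appending a CSV line whenever the host changes state."""
    last = None
    with open(log_path, "a") as log:
        while True:
            ok = ping_once(host)
            if ok != last:
                state = "up" if ok else "down"
                log.write(f"{datetime.now().isoformat()},{host},{state}\n")
                log.flush()
                last = ok
            time.sleep(interval_s)
```

Calling `monitor("juniper_ip_addr1")` on the SolarWinds server would then leave a record of when that switch stopped and resumed answering, which can be lined up against the switches that are not being pinged continuously.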

                                                                        • Re: Weird Juniper Ping Problem
                                                                          dgerald69

                                                                          We are having the exact same problem but with Meraki switches and APs instead of Juniper.  Same issue, every week or two, these devices will show as having a node down.  We cannot ping the devices from a command prompt on the SolarWinds server, but they are pingable from every other workstation/server on the network.  Within a couple of hours, each node comes back online over different time intervals.  All other IPs on the network are pingable, just not these devices for a time.

                                                                           

                                                                          Currently the SolarWinds server is also a Windows 2008 R2 Hyper-V VM running under a 2012 R2 Datacenter hypervisor.

                                                                           

                                                                          Interesting fact, we recently migrated our SolarWinds server from a VMware VM to a Hyper-V VM.  We did NOT have this problem until migrating the server to Hyper-V.  

                                                                           

                                                                          Does anyone have any more insight or progress on finding out what this issue is?

                                                                           

                                                                          Thanks!