Level 11

Weird Juniper Ping Problem

We have an issue that occurs about once a week.  It only involves Juniper EX3200/EX4200 switches, and only some of the ones we have deployed and monitor.

About once a week, we will get an alert from Solarwinds that one of these devices is down (this ALWAYS occurs in the same order of devices making it even more strange).  When we check the device it's actually up, responding to pings etc.  BUT it will not respond to pings from the Solarwinds platform itself.

I opened several tickets with Solarwinds and they have told me that they just rely on the ping utility within Windows 2008, so if the device isn't pingable then it's got to be a Windows 2008 problem.  I have opened tickets with Juniper and we have proven that the ping request is arriving at the EX switch and that the response is going back out.  A Wireshark capture on the Windows server shows the ping going out but the reply never coming back.

We have the latest Windows patches and the server is considered "clean" from a software perspective (and viruses of course).  There is nothing in between on the network that can cause this issue so we focused on the Windows server.

Bit the bullet as per Solarwinds suggestion and blew away the Solarwinds server that does the polling.  Rebuilt it, restored everything and same problem.  Server now also has the latest network card drivers.

Have totally run out of ideas - frustrating for sure and no idea what to do next.  Solarwinds said they are bringing out a feature where ping of the remote device isn't required (thank goodness, we were shocked when that was a requirement during our initial installation).  This new "feature" would at least alleviate this issue.

It's also worth noting that while the pings fail, SNMP continues to poll data successfully.  It's also worth noting that the alert will trigger every time on the same initial device and follow the same pattern through other Juniper EX switches at the same time interval.

Thoughts? 😉

34 Replies
Level 11

I wanted to follow up on this again as we have made some progress.

We spun up an entirely new set of VMs in our lab and loaded in the list of "problem nodes".  To date none of them have falsely tripped in the lab, yet in our production environment they have.

This would seem like some kind of Solarwinds issue however we have no way to prove it.  I'm going to try to delete and re add these problem nodes to see if that has any impact next.

Paul

0 Kudos

We are having the exact same problem but with Meraki switches and APs instead of Juniper.  Same issue: every week or two, these devices will show as having a node down.  We cannot ping the devices from a command prompt on the SolarWinds server, but they are pingable from every other workstation/server on the network.  Within a couple of hours, each node comes back online, at different time intervals.  All other IPs on the network are pingable, just not these devices for a time.

Currently the SolarWinds server is also a Windows 2008 R2 VM on Hyper-V, running under a 2012 R2 Datacenter hypervisor.

Interesting fact, we recently migrated our SolarWinds server from a VMware VM to a Hyper-V VM.  We did NOT have this problem until migrating the server to Hyper-V.  

Does anyone have any more insight or progress on finding out what this issue is?

Thanks!

0 Kudos

Had the same random issue. Changing the polling method to SNMP when you do List Resources seems to have fixed it.

0 Kudos

I suspect you could be running into a simple ICMP rate-limit policer on the Juniper.

The JunOS kernel is based on FreeBSD, which has a built-in limit on ICMP responses. In FreeBSD you can change this limit with the "sysctl net.inet.icmp.icmplim=Value" command; as far as I know, in JunOS this limit is hardcoded and cannot be changed.
Unfortunately I don't have a Juniper device to test on right now, but in FreeBSD, when the kernel discards ICMP packets, it logs a message to dmesg like "Limiting ICMP ping response from xxx to 200 packets per second".

I guess this limiting is performed on a per-host basis, so the Juniper still responds to other hosts on the network and ignores only Solarwinds.

To check this, you could run "ping -t juniper_ip_addr1" from a command prompt on the SolarWinds server and leave it pinging for, say, a week. If juniper_ip_addr1 then shows as down much more often during that week than the other Junipers, we have found the issue.
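If a bare "ping -t" window is hard to review after a week, a small logger that records only the up/down transitions with timestamps may be easier to read. The following is a rough sketch, not something tested against this environment: the function names are made up for illustration, the Windows flags are the standard `ping -n` / `-w` ones, and the non-Windows branch assumes Linux-style flags.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_command(host, timeout_ms=1000):
    """Build a single-echo ping command for the local OS."""
    if platform.system() == "Windows":
        # -n 1: one echo request, -w: reply timeout in milliseconds
        return ["ping", "-n", "1", "-w", str(timeout_ms), host]
    # Assumption: Linux-style ping, where -W is a timeout in whole seconds
    return ["ping", "-c", "1", "-W", str(max(1, timeout_ms // 1000)), host]


def ping_once(host, timeout_ms=1000):
    """Return True if the host answered one echo request."""
    result = subprocess.run(
        ping_command(host, timeout_ms), capture_output=True, text=True
    )
    if platform.system() == "Windows":
        # Windows ping can exit 0 on "Destination host unreachable",
        # so look for the TTL field of a real echo reply instead.
        return "TTL=" in result.stdout
    return result.returncode == 0


def state_changes(samples):
    """Collapse (timestamp, ok) samples to the points where ok flips."""
    changes = []
    previous = None
    for stamp, ok in samples:
        if ok != previous:
            changes.append((stamp, ok))
            previous = ok
    return changes


def monitor(hosts, interval_s=60, rounds=None):
    """Ping each host every interval_s seconds and print transitions only."""
    last = {}
    done = 0
    while rounds is None or done < rounds:
        for host in hosts:
            ok = ping_once(host)
            if last.get(host) != ok:
                stamp = datetime.now().isoformat(timespec="seconds")
                print(f"{stamp} {host} {'up' if ok else 'DOWN'}")
                last[host] = ok
        done += 1
        time.sleep(interval_s)
```

Calling something like `monitor(["juniper_ip_addr1", "juniper_ip_addr2"])` on the SolarWinds server (with real addresses substituted) would leave a short list of transitions to compare between the heavily pinged switch and the others.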

0 Kudos
Level 9

Hi There!

Did you find a solution to this problem? I think I have the same issue as described in this post. I also have many Juniper EX4200 switches. Is this related?

The weird thing is that once the problem happens, Orion is unable to ping two or three devices. To make these devices reappear again, you only have to ping them from your workstation... and there they come!

Any idea?

Thx!

0 Kudos

No solution yet found to this issue.  We can always ping the remote Juniper switch from a workstation but not from the solarwinds server itself when this occurs.

Yes, when it happens it is very predictable - at least a few Juniper EX switches will not be pingable at first, then after many hours they will become pingable again.  After the first few switches are responsive again, another "batch" of switches will go into the same state.  This is occurring for us on Juniper EX2200, EX3200 and EX4200 switches, and no other Juniper-based hardware is affected.

From the Juniper EX switch, we can run a packet trace that proves the ping is reaching the switch and that the switch is sending back an echo reply - but on the Windows 2008 server we only see the ping go out and not come back (it's dropping at the Windows box for some reason).  Turning off the firewall doesn't help - we have tried.

Thanks,

Paul

0 Kudos

I should have added that we have engaged Microsoft and they have come up with no answers on the dropped ping.  We have tested another monitoring solution (which doesn't require ping) and no issues.  Also, during the time that the ping stops responding we are able to still get SNMP based data from the switches.

0 Kudos

What did you find in your ARP tables during these events?

I *believe* we had this issue on a lot of HP dlXXX G7s that were shipping with a newer Broadcom card. We swapped out to another NIC and the issues went away. The only thing I saw during my events was missing ARP entries. Our SEs swapped NICs after I showed them that, so I didn't get to dig much deeper.
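For anyone who wants to check for missing ARP entries the next time this happens, one option is to snapshot `arp -a` on the polling server while things are healthy and again during an event, then diff the two. A minimal sketch, assuming the usual Windows `arp -a` column layout (Internet Address, Physical Address, Type); the helper names are hypothetical.

```python
import re
import subprocess

# Matches entry lines such as:
#   192.168.1.1           00-11-22-33-44-55     dynamic
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:[-:][0-9A-Fa-f]{2}){5})\s+(\w+)",
    re.MULTILINE,
)


def read_arp_table():
    """Return the raw text of `arp -a` from the local machine."""
    return subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout


def parse_arp_table(text):
    """Map each IP address in `arp -a` output to its MAC address."""
    return {
        ip: mac.lower().replace(":", "-")
        for ip, mac, _entry_type in ARP_LINE.findall(text)
    }


def missing_entries(before, during):
    """IPs present in the healthy snapshot but absent during the event."""
    return sorted(ip for ip in before if ip not in during)
```

Saving `parse_arp_table(read_arp_table())` once when all nodes are green and once during an outage, then passing both to `missing_entries`, would show whether the next hop toward the affected switches has dropped out of the table.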

0 Kudos

Thanks for reminding me - we were going to order new NICs and see if that helped with this issue.  This is something we haven't tried yet but definitely will, and I'll report back....

0 Kudos

Looks like I also have Broadcom onboard cards, in an IBM X3650 machine. The NIC model is BCM5709C. Driver version is 5.2.17.0 (2009-12-28). It seems that the latest driver version is 14.4.8.4, which was released on March 18th, 2011. Did you try updating the driver?

0 Kudos

We replaced the NIC last week - it took three days and the same issue started again.  This time we used a "tried and true" Intel Pro 1000 NIC.

I am perplexed by this issue - literally running out of ideas.  Solarwinds still has not brought out a feature we've asked for since purchase - being able to monitor devices without pinging them!!!  I did a search before posting and there's lots of folks who have asked for this feature...;(

When it occurred the last time, I managed to fire up Wireshark and could see echo requests leaving the server and replies never coming back - this ONLY happens with Juniper devices.  When I do a packet trace on the Juniper devices I can see the echo request come in and the response leave the device..... logic would say it's something in between our Solarwinds boxes and the Juniper devices but we can't find it - spent a LOT of time on this issue.....

Thanks,

Paul

0 Kudos

Paul,

What version of JunOS are you running? We have had multiple issues with JunOS doing very basic network operations. At one point, I was pulling the routing table using SNMP and it was causing the CPU on our 8600s to sawtooth from 50 to 90 every 6 hours.

I would lean toward an issue with the JunOS code versus an issue with the NIC (even though Broadcoms are pretty buggy).

 

Ken

0 Kudos

Hi Ken...

We have been through several revisions of JunOS code over the timeframe this issue has been occurring.  Our "luck" with Juniper has been pretty good overall, especially on EX switches - will admit though that the 9.x code on EX switches in particular was not very impressive.  Can you describe the issue and what code version was causing your SNMP issues with Juniper (8600??) ?

Today in the Juniper world, we run M, MX, E, EX, ERX, J, SRX .. I'm sure I missed something in the alphabet there 😉  But this event with Solarwinds *only* happens to EX-based equipment .... I was going with the logic that it has to be a code issue on the Juniper for a long time, and perhaps it is - but every time I work on this I end up ruling the Juniper boxes out of the equation.....

Thanks,

Paul

0 Kudos

I actually meant to say that we had issues with the code on our EX8208s, not 8600s (we have some Nortels on the network). We're currently running 10.4r7, so far the most stable code for us. We had been bug chasing through the early versions of 10. We'll probably stay here for a while.

When I was polling the following OIDs from the EX8208s, we were seeing the RPD process running up the CPU:

1.3.6.1.2.1.4.24.4.1.1 ipCidrRouteDest

1.3.6.1.2.1.4.24.4.1.2 ipCidrRouteMask

1.3.6.1.2.1.4.24.4.1.4 ipCidrRouteNextHop

1.3.6.1.2.1.4.24.4.1.5 ipCidrRouteIfIndex

1.3.6.1.2.1.4.24.4.1.6 ipCidrRouteType

1.3.6.1.2.1.4.24.4.1.7 ipCidrRouteProto

Below is a chart that reflects the CPU change after we disabled those OIDs...

 

With your issue, have you tried using a demo of WhatsUp Gold or a similar network monitoring application to see if you get similar results?

Ken

0 Kudos

Ken - how big is your routing table?  By polling all those OIDs you are in effect downloading the entire routing table in your network, no?  Perhaps I need more caffeine this morning 😉

Good question - yes, I have run a couple of other open source packages against this same equipment during the same time periods and we did not have the issue.  I do plan though to keep a demo of WhatsUp or similar running again - wondering if there's a certain bug that's hit after "X ICMP pings" with "certain characteristics" type of situation occurring.

I have also tried to change the ping size in Solarwinds from one extreme to the next and no change in behaviour.

The incredibly strange thing that occurs is that it's always the same devices, in the same order, with the exact same time intervals between each device.  So once we see Device1 go "down" (can't ping it, but Solarwinds can poll SNMP no problem), we know it will last exactly 3.5 hours, and once it comes up we know Device2 and Device3 will show "down" etc. (I have summarized this, there's a bit more to it).  But it's completely predictable *once* it starts - the time between events does vary to some degree, but the sequence once started is exactly the same.  The devices affected (about half of the EX switches) are geographically diverse in our network with very few common points, which makes this a real "head scratcher".
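To put numbers on "same devices, same order, same intervals", the node-down alerts from two separate incidents can be reduced to an ordered node list plus the gaps between them and then compared. This is only a sketch on made-up data; the node names, the tuple layout and the five-minute tolerance are assumptions, not anything exported from Solarwinds.

```python
from datetime import datetime, timedelta


def sequence_signature(events):
    """Reduce (node, down_time) events to (node order, gaps in minutes)."""
    ordered = sorted(events, key=lambda event: event[1])
    order = [node for node, _ in ordered]
    gaps = [
        round((later[1] - earlier[1]).total_seconds() / 60)
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return order, gaps


def same_pattern(incident_a, incident_b, tolerance_min=5):
    """True if both incidents hit the same nodes in the same order
    with gaps that agree to within tolerance_min minutes."""
    order_a, gaps_a = sequence_signature(incident_a)
    order_b, gaps_b = sequence_signature(incident_b)
    if order_a != order_b:
        return False
    return all(abs(a - b) <= tolerance_min for a, b in zip(gaps_a, gaps_b))
```

If `same_pattern` holds across several incidents, the fixed gap (3.5 hours here) is worth comparing against timer values on the path, such as session, NAT or ARP cache timeouts.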

thanks,

Paul

0 Kudos

Paul,

Is there a firewall in front of the Orion server?  3.5 hours seems like a timeout of some sort, like session, NAT tables, or ARP cache.  Also, if the IP is pingable from the server itself during this time, I'd say the issue is definitely related to Solarwinds.  Can you do a port mirror on the switchport that connects to the Orion server?  That should at least verify the responses are getting all the way back to the server and help in showing SW that the problem isn't network connectivity.  It's a very strange problem, indeed!  If you find the answer, please post, as this problem has my curiosity piqued.
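Once there is a capture from the mirror port, the echo requests can be paired with their replies by ICMP identifier and sequence number to list exactly which requests went unanswered at that point in the path. A rough sketch, assuming the packets have already been exported from the capture into simple records; the field names (`time`, `src`, `dst`, `type`, `ident`, `seq`) are invented for this example, while the ICMP type values 8 (echo request) and 0 (echo reply) are standard.

```python
ECHO_REQUEST = 8
ECHO_REPLY = 0


def unanswered_requests(packets, timeout=2.0):
    """Return echo requests with no matching echo reply within timeout seconds.

    Each packet is a dict with keys: time (seconds, float), src, dst,
    type (ICMP type), ident (ICMP identifier) and seq (sequence number).
    """
    pending = {}
    for packet in sorted(packets, key=lambda p: p["time"]):
        if packet["type"] == ECHO_REQUEST:
            key = (packet["src"], packet["dst"], packet["ident"], packet["seq"])
            pending[key] = packet
        elif packet["type"] == ECHO_REPLY:
            # A reply travels in the opposite direction to its request.
            key = (packet["dst"], packet["src"], packet["ident"], packet["seq"])
            request = pending.get(key)
            if request is not None and packet["time"] - request["time"] <= timeout:
                del pending[key]
    return sorted(pending.values(), key=lambda p: p["time"])
```

Running the same function over a capture taken on the mirror port and one taken on the server itself would show whether the replies vanish before the switchport or inside the Windows box.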

 

-evt

0 Kudos

Thank you evt....

There *was* a Juniper SRX firewall in front of the Solarwinds installation until we moved things around this morning (we were beginning to suspect some kind of flow timeout).  The Solarwinds systems are now connected directly to a Juniper MX80 router.  My hope is to run a full packet dump on the MX80 the next time this event occurs.  We also had an IPv6 issue with the SRX firewall and need to start monitoring via IPv6 as well.  I have taken a couple of the EX switches that are "problems" and started monitoring them via IPv6.

It is worth noting however that we did previously move the Solarwinds system away from the SRX firewall and still had issues - but it's worth trying again (and avoids a nasty IPv6 problem we were encountering).

My hope will be to get the packet trace directly in front of the Solarwinds server (on the MX80) and also see if the nodes now on IPv6 monitoring behave any differently.  This will hopefully provide some clues.

When this issue does happen, I cannot ping from the Solarwinds server itself - Solarwinds support told me they utilize the "builtin" ping of Windows 2008 .... so that really points back to a network layer issue....

Appreciate it,

Paul

0 Kudos

Hi Paul,



When this issue does happen, I cannot ping from the Solarwinds server itself - Solarwinds support told me they utilize the "builtin" ping of Windows 2008 .... so that really points back to a network layer issue....



Sorry, I think you mentioned that before, but I was apparently too lazy to browse back.  If you are the same Paul Stewart who is on the j-nsp list, someone just posted a resolution to a strange EX4200 "missing ARP" problem, so you might want to respond/check that out if you haven't seen it yet:

http://www.gossamer-threads.com/lists/nsp/juniper/32941

It sounds like it might be similar to what you are seeing.  I personally have not seen this problem, but you might have something enabled in your EX switches that we do not.  It could be a particular flavor of STP that you're running which is triggering the bug or perhaps multicast.

-evt

0 Kudos

Thanks - yes, same guy! 😉

I'm pretty sure this isn't an ARP issue - the switches would not have an ARP entry for the Solarwinds server that is at least a few layer3 hops away.

Hopefully when this occurs again I can get a good capture from the MX80 directly in front of the Solarwinds server - then if I can see the ping go out and come back it'll prove it out....

For reference, in the past I was able to do a packet capture directly on the Juniper EX switches that were reported as unreachable - the capture did show the ping coming in and the echo response going out.

Cheers,

Paul

0 Kudos

I had the same issue.  By chance is your Orion server running on Win2k8 R2?  I noticed that IPv6 was disabled on my NIC, and once I enabled it, the Juniper false alerts that had been showing up every 10 days or so stopped.

0 Kudos