Switches show as down when backup circuit goes down

I have a strange issue with SolarWinds that came to a head when a bunch of our backup circuits went down at once. At each of our remote branches we have two routers, each with its own MPLS circuit. We do not load balance because our backup circuit is a 4G wireless connection, so we just use a VRRP address for failover. When the backup router goes down, about half the switches in the branch show as down. I can ping these switches from anywhere else in the company except the SolarWinds server. I can ping from VPN, from my desk, from the branch itself, and even from a VM on the same host as Orion. When this happens, I can't even ping from the command prompt on the Orion server. I ran a PCAP from the core switch at these locations, and the pings from SolarWinds are arriving and getting a response from the switch.

When I log into the affected switches, they all have the backup router's real IP in their ARP table, even though they should only have the VRRP address. The switches that don't have the issue do not have that IP in their ARP table. If I clear the ARP table on an affected switch, SolarWinds can immediately start pinging it again.
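
To find which switches hold the extra entry without logging into each one, saved ARP output could be scanned with something like the rough Python sketch below. Everything in it is assumed for illustration: the switch names, the documentation-range addresses, and the Cisco-style sample output are made up, and real output differs by vendor.

```python
import re

# Matches dotted-quad IPv4 addresses; dotted MAC addresses do not match
# because they have only three groups.
IPV4_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def arp_ips(arp_output):
    """Collect every IPv4 address that appears in ARP table text."""
    return set(IPV4_RE.findall(arp_output))


def switches_with_entry(tables, ip):
    """Return the names of switches whose ARP output contains ip."""
    return sorted(name for name, out in tables.items() if ip in arp_ips(out))


# Hypothetical sample data: 192.0.2.1 stands in for the VRRP address,
# 192.0.2.3 for the backup router's real address.
TABLES = {
    "sw1": (
        "Internet  192.0.2.1    5   0000.5e00.0101  ARPA  Vlan10\n"
        "Internet  192.0.2.3   12   0011.2233.4455  ARPA  Vlan10\n"
    ),
    "sw2": "Internet  192.0.2.1    3   0000.5e00.0101  ARPA  Vlan10\n",
}

print(switches_with_entry(TABLES, "192.0.2.3"))
```

With the sample data, only sw1 is reported, since it is the only table that lists the backup router's real address.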

Does anyone have any insight into what might be going on here? Is it possible that SolarWinds tries to poll across both circuits somehow?

  • I have so many questions... but first, some comments.

    The path your SolarWinds server takes to poll these devices has very little to do with SolarWinds itself. It depends almost entirely on how your network infrastructure is configured, e.g. your routing (route metrics, route-maps, and protocol configuration for BGP, OSPF, etc.).

    VRRP is a way to provide high availability for first-hop routers (basically your gateways), although it can be used anywhere to provide redundancy at Layer 3. It does not guarantee that your routing path is consistent; in particular, it says nothing about which path return traffic takes.

    When the issue occurs, are traceroutes to and from the affected nodes any different from those taken during normal operation?

    What does show ip route on the routers show for the route to the SolarWinds network?

    What does the VRRP configuration look like? Are you setting priorities (e.g. a higher priority on the router with the preferred circuit and a lower one on the backup)?

    What is your ARP aging timeout configured as, or is it the default?

    What is the method of determining a failover? E.g., is there a tracked object (even a simple ICMP probe) doing an upstream health check to trigger a failover?

    What route-maps are in place, if any?

    Is this part of a DMVPN topology? Hub-and-spoke?

    These are some of the areas I’d focus on to determine how complex this issue may or may not be. However, my first move would be to gather those traceroutes and look at the route tables on the routers, both during normal operation and during a failure.
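
    If it helps, comparing the two traceroute captures can be scripted. Below is a minimal Python sketch, assuming you have saved the traceroute text from both states; the function names and the sample hops (documentation-range addresses) are made up for illustration, and the parsing is a simple heuristic that takes the first IPv4 address on each numbered line.

```python
import re

IPV4_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def parse_hops(text):
    """Return {hop_number: first_address} from traceroute/tracert-style text."""
    hops = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s", line)  # hop lines start with a hop number
        if not m:
            continue  # skip headers, blank lines, and footers
        ips = IPV4_RE.findall(line)
        hops[int(m.group(1))] = ips[0] if ips else "*"  # "*" = no reply
    return hops


def diff_paths(normal, failure):
    """List (hop, normal_addr, failure_addr) for every hop that differs."""
    a, b = parse_hops(normal), parse_hops(failure)
    return [
        (h, a.get(h, "-"), b.get(h, "-"))
        for h in sorted(set(a) | set(b))
        if a.get(h) != b.get(h)
    ]


# Hypothetical captures; real ones would come from the SolarWinds server
# and from the affected switches.
NORMAL = """
 1  192.0.2.1  1 ms
 2  198.51.100.1  12 ms
 3  203.0.113.10  14 ms
"""
FAILURE = """
 1  192.0.2.1  1 ms
 2  198.51.100.9  80 ms
 3  203.0.113.10  95 ms
"""

for hop, before, after in diff_paths(NORMAL, FAILURE):
    print(f"hop {hop}: {before} -> {after}")
```

    Running the comparison in both directions (server to switch and switch to server) is worth doing, since an asymmetric return path only shows up in one of them.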

    Wish you the best of luck!