I have been working on Support Case #00074553: when we fail over between HA devices, the vIP is uncontactable for long enough for most of the monitored environment to report as down.
We performed a failover whilst on a SW support session and, true enough, we lost access to the vIP for about 10 minutes from outside the VLAN where our Orion platform lives.
I could ping the vIP from the Orion servers, but from outside the subnet it was unreachable, and all polling failed.
Our production servers are customer facing, so downtime isn't an option, and neither is flooding customers with false node-down alerts on every failover. I therefore replicated the environment in test & dev, fired up Wireshark and started testing.
After several attempts and much head scratching, it appears that we aren't sending GARPs (gratuitous ARPs) when the vIP moves. I can even see the server that holds the vIP sending ARP requests asking who has the vIP.
There is then quite a long period in which IPs are mapped to two MAC addresses, with a lot of duplicate IP address activity in the capture, until the ARP caches sort themselves out.
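For anyone wanting to check a capture the same way, here is a minimal Python sketch of what I mean by "no GARPs" and "two MAC addresses per IP". It assumes you already have the raw Ethernet frames as bytes (for example exported from the Wireshark capture); the function names `parse_arp` and `summarize_arp` are just ones I made up for illustration. It treats an ARP packet as gratuitous when the sender IP equals the target IP, and it ignores ARP probes (sender IP 0.0.0.0) when counting which MAC claims which IP.

```python
import socket
import struct
from collections import defaultdict


def parse_arp(frame: bytes):
    """Return (oper, sender_mac, sender_ip, target_ip), or None if the
    frame is not an untagged Ethernet/IPv4 ARP packet."""
    # EtherType 0x0806 = ARP. 802.1Q-tagged frames are skipped by this sketch.
    if len(frame) < 42 or frame[12:14] != b"\x08\x06":
        return None
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    sender_mac = frame[22:28].hex(":")  # bytes.hex(sep) needs Python 3.8+
    sender_ip = socket.inet_ntoa(frame[28:32])
    target_ip = socket.inet_ntoa(frame[38:42])
    return oper, sender_mac, sender_ip, target_ip


def summarize_arp(frames):
    """Return (garps, conflicts): a list of (ip, mac) gratuitous ARPs seen,
    and a dict of IPs that were claimed by more than one MAC."""
    garps = []
    claims = defaultdict(set)
    for frame in frames:
        parsed = parse_arp(frame)
        if parsed is None:
            continue
        _oper, mac, sender_ip, target_ip = parsed
        if sender_ip == "0.0.0.0":
            continue  # ARP probe: asks about an IP without claiming it
        claims[sender_ip].add(mac)
        if sender_ip == target_ip:
            garps.append((sender_ip, mac))
    conflicts = {ip: macs for ip, macs in claims.items() if len(macs) > 1}
    return garps, conflicts
```

In my test captures the equivalent of `garps` stays empty for the vIP after a failover, while `conflicts` shows it against both servers' MACs for a while.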
I am more confused now than I was at the beginning of this exercise, and I understood from the start that this was a needle-in-a-haystack quest (finding what was broken between the VMware, application and network layers).
Can anyone give me a full breakdown of the HA function on failover?
Do we have any control over it, or can we add steps to the procedure?
If I could add an arping, or attempt to force an ARP spoof or something to force its hand…
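To make the idea concrete, this is roughly the frame I would want the new active server to broadcast after taking over the vIP. It is only a sketch in Python that builds the bytes of a gratuitous ARP request; `build_gratuitous_arp` is a hypothetical helper name, and the address values in the usage line are placeholders, not our real ones. Actually putting the frame on the wire needs raw layer 2 access (a packet capture driver on Windows, or a raw socket on Linux), which is not shown here, and I am assuming rather than asserting that the upstream device would accept it and update its ARP cache.

```python
import socket
import struct


def build_gratuitous_arp(vip: str, mac: str) -> bytes:
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP request
    that announces `vip` as belonging to `mac`."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_bytes = socket.inet_aton(vip)

    # Ethernet header: broadcast destination, our MAC as source, EtherType ARP.
    eth_header = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender and target protocol address are both the vIP; that is what
    # makes the ARP "gratuitous". Target hardware address is left as zeros.
    arp_body = mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes
    return eth_header + arp_header + arp_body


# Placeholder values for illustration only.
frame = build_gratuitous_arp("192.0.2.10", "00:50:56:00:00:01")
```

If the HA service exposed some kind of post-failover hook, something like this (or a plain arping) is what I would want to run from it.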
From the packet captures, it appears that the vIP move is not updating the ARP cache; instead we have to wait for the ASA's ARP cache to age out and refresh naturally, which takes ~10 minutes, with the vast majority of our monitored nodes reporting as down in the meantime.
Can anyone assist please?