Had a network blip last night that caused split-brain in our HA (again)... working with TAC at the moment, but I was poking around with SWQL to confirm something and found this...
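For reference, the query was along these lines. It's just a rough sketch, and it assumes your version exposes the Orion.HA.PoolMembers entity with PoolId/HostName/Status columns (names may differ on your build), but a healthy pool should only ever show one active member:

    SELECT PoolId, HostName, Status
    FROM Orion.HA.PoolMembers
    ORDER BY PoolId, HostName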
Beginning to take this personally.
I always tell my clients that I've seen more downtime from fixing split-brain scenarios than from actual single-server failures. Worst case, I can fix a broken single instance pretty quickly, often before any users realize anything happened. Fixing the problem and then sorting out all the reg keys and db entries for split brain to get it running again almost always adds another 15-20 minutes to the recovery time. If I had to live with an HA environment full time, I'd probably keep a saved set of scripts that set all the keys and db entries for HA back to my "default" position, and I'd fire that off immediately upon any failed HA handoff.
@danbert If you can share the case number, we can help validate the issue. Several customers have run into this issue in 2024.2, and we uncovered a potential fix for these scenarios in 2024.2.1. This was added to the release notes as well.
Hi @Kita, here is the case number: #01734665. The engineer fixed the HA, but it went split-brain again shortly after. We're aware of the 2024.2.1 fix, but need a stable platform first. Plus we're dealing with approximately 60 nodes showing as down with false-positive alerts, potentially related to the first issue, that we're still waiting for an engineer on.
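For the node side of it, this is roughly how we're pulling the list of affected nodes to hand over. Just a quick sketch against Orion.Nodes, where Status = 2 is the standard Down value:

    SELECT NodeID, Caption, IPAddress, Status
    FROM Orion.Nodes
    WHERE Status = 2
    ORDER BY Caption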
Ha... we'll be building out those scripts. SolarWinds does not do HA well...
@danbert The fix in 2024.2.1 is to address the split-brain issue. We will dig into why the false positives are occurring.
Awesome, thanks @Kita. Don't hesitate to have the engineer reach out to us, as we're all on a call trying to repair these issues without an engineer at the moment.
15-20?! Argh, you're a machine!