This FR comes off the back of a recent upgrade that moved the customer to an almost fully HA/DR setup. They are running a 3-node SQL Always On AG, with the primary Orion server in a multi-subnet HA pool and 4 additional pollers in HA VIP pools.
The scenario that triggered this FR: the customer had an issue with the SQL Always On AG that made it unreachable, so SQL was down and therefore the application was down. This caused HA to go nuts. The HA primary pool member lost its connection to SQL and failed over to the standby, at which point the HA service disabled all of the SolarWinds services. The HA standby was then also unable to connect to the SQL Server and stopped working as well.
When SQL came back up, neither HA pool member was able to self-heal. Both servers thought they were the HA standby, so neither came back up. Running the Configuration Wizard on the primary didn't work because the HA service had locked it into standby mode, which sets all the services to Disabled, meaning the Configuration Wizard can't start them. To recover, you have to edit the registry to remove that flag, manually change the services to Manual, and then run the Configuration Wizard again to get the primary back up.
As each HA pool member must communicate on some level with the SQL DB as well as with each other, I would like to suggest a feature that adds a SQL connection test. If the primary server can't connect to the DB, then rather than failing over instantly it should ask the standby whether it can reach SQL. If that check also comes back negative, don't fail over: just remain the primary and accept that the application is down.
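To make the suggested behaviour concrete, here is a rough Python sketch of the decision I have in mind. It is only an illustration of the logic, not how the HA service is actually implemented: the names `Action`, `decide_on_sql_loss`, and the two check callbacks are all made up for this example, and the real connectivity checks would be whatever the HA service already uses to talk to SQL and to its pool partner.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    """Possible outcomes of the (hypothetical) pre-failover check."""
    STAY_PRIMARY = "stay_primary"
    FAIL_OVER = "fail_over"


def decide_on_sql_loss(
    primary_can_reach_sql: Callable[[], bool],
    standby_can_reach_sql: Callable[[], bool],
) -> Action:
    """Decide what the primary should do when its SQL check runs.

    Both arguments are hypothetical callbacks that return True when the
    given pool member can open a connection to the SQL database.
    """
    # Primary still has SQL: nothing to do.
    if primary_can_reach_sql():
        return Action.STAY_PRIMARY

    # Primary lost SQL: ask the standby before giving up the primary role.
    try:
        standby_ok = standby_can_reach_sql()
    except Exception:
        # Assumption: if the standby can't be asked, treat it as unable
        # to reach SQL, so we never fail over blindly.
        standby_ok = False

    # Only fail over when the standby can actually do better.
    return Action.FAIL_OVER if standby_ok else Action.STAY_PRIMARY
```

With this logic, the outage described above (SQL unreachable from both members) would return `Action.STAY_PRIMARY`, so the primary would keep its role and could reconnect once SQL came back, instead of both members ending up in standby.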
Happy to take suggestions on additional ways to improve this part of the HA solution.
dsimpkins, feel free to add any extra feedback.