HA Database Availability Check

dgsmith80

This FR comes off the back of a recent upgrade process moving to an almost fully HA/DR scenario. The customer is running a 3 Node SQL AlwaysON AG, with Primary Orion Server in a Multi-Subnet HA Pool, with 4 Additional Pollers in HA VIP Pools.

The scenario that triggered this FR is that the customer had an issue with the SQL AlwaysON, which made it unreachable, so SQL was down and therefore the application was down. This caused HA to go nuts. The HA-Primary Pool Member lost its connection to SQL and failed over to the Standby, at which point the HA service disabled all of the SolarWinds services. The HA-Standby was then also unable to connect to the SQL Server and stopped working too.

When the SQL came back up, neither HA Pool member was able to self-heal. Both servers then thought they were HA Standby and so neither came back up. Running the config wizard on the primary didn't work as the HA Service had locked it into Standby mode, which sets all the services to Disabled, meaning the config wizard can't start them. You have to edit the registry to remove that flag and manually change the services to Manual. Then run config wizard again to get the Primary back up.

As each of the HA Pool Members must communicate on some level with the SQL DB as well as with each other, I would like to suggest a feature that adds an SQL connection test. If the Primary server isn't able to connect to the DB, rather than instantly failing over, it should ask the Standby whether it can reach the SQL. If that check comes back negative, then don't fail over: just remain the Primary and accept that the application is down.
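To make the suggestion concrete, here is a rough sketch of the decision logic in Python. This is not how the HA service works today, and none of these names come from SolarWinds: `decide_failover`, `Action` and the two check callables are all made up for illustration. A failed or errored check is treated as "cannot reach SQL".

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    STAY_PRIMARY = "stay_primary"                  # Primary can reach SQL, nothing to do
    FAIL_OVER = "fail_over"                        # only the Standby can reach SQL
    HOLD_PRIMARY_DB_DOWN = "hold_primary_db_down"  # nobody can reach SQL, keep roles


def _safe_check(check: Callable[[], bool]) -> bool:
    """Run a connectivity check; any exception counts as 'not reachable'."""
    try:
        return bool(check())
    except Exception:
        return False


def decide_failover(primary_can_reach_sql: Callable[[], bool],
                    standby_can_reach_sql: Callable[[], bool]) -> Action:
    """Hypothetical decision rule for the proposed SQL connection test."""
    if _safe_check(primary_can_reach_sql):
        return Action.STAY_PRIMARY
    # Primary lost SQL: ask the Standby before giving up the active role.
    if _safe_check(standby_can_reach_sql):
        return Action.FAIL_OVER
    # SQL is down for everyone: failing over would not help, so stay Primary.
    return Action.HOLD_PRIMARY_DB_DOWN


if __name__ == "__main__":
    # SQL outage seen by both pool members: remain Primary, app is down.
    print(decide_failover(lambda: False, lambda: False))
```

The point of the third branch is that a SQL outage seen by both pool members never changes roles, so when SQL comes back the Primary is still the Primary and can pick up again without registry edits.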

Happy to take suggestions on additional ways to improve this part of the HA solution.

dsimpkins feel free to add any extra feedback.

Status: None

Comments

garystorr

We have just built an HA setup in our lab to test different failure scenarios, and this was one of them. We killed SQL to see what happened and it resulted in a similar "stuck" state. We had to shut down the Standby NPM and then reboot the Primary one to bring it all back into service. The SQL connection test would be a great idea.

dsimpkins

We didn't have to reboot servers, but we did have to stop the HA service on both servers and check that the registry key for HA was correct for each server (there is a registry key that specifies install and running mode).

Once SQL was back up, we started the HA service on the primary and started all the services. Once the software was loading OK, we started the HA service on the secondary.

The big issue we had is that both servers seemed to change the registry key to "MainPollerStandby". It was almost as if, rather than fight over who would be active, they both went into standby.