Hello all and happy new (ish) year!
Hopefully, you’ve made it through the holidays and into the new year safe in the knowledge that your critical network, infrastructure, and applications are highly available and fault tolerant. You have tested your failover scenarios for planned and unplanned outages, your systems are redundant, you have automation in place for disaster recovery, and all of it is monitored and observed appropriately. Time for some well-earned rest and uninterrupted downtime.
But what of the SolarWinds Platform? What about its database!? Your anxiety levels quickly rise. What if it goes down? Will I receive my usual alerts and reports? What happens if the SQL server fails? What if I completely lose visibility into the health and performance of all my systems and have no observability?
Don’t panic! Hopefully, most of you reading this will have already answered the questions above and said confidently, “We have SolarWinds Platform High Availability in place with the necessary database redundancy.” If that’s you, I salute you!
If you are wondering what SolarWinds Platform High Availability is, or you are just starting your journey into making applications like SQL highly available and want to know how that ties into your SolarWinds Platform architecture, then this is for you. Or, if you prefer to just get started and want to skip the preamble, take a look at the SolarWinds Platform High Availability Requirements. Or you can skip ahead to read How to Set up a SolarWinds Platform High Availability Pool. Easier still, you can follow the walkthrough.
What is SolarWinds Platform High Availability (HA)?
SolarWinds Platform High Availability (HA) provides failover protection for your SolarWinds Platform server and, optionally, the additional polling engines to reduce data loss if a server goes down.
If a server fails, the HA feature allows your secondary server to take over all services, such as polling and alerting, with minimal downtime. HA protects your main server, also known as your main polling engine, and additional polling engines. It does not protect your databases or your additional web servers.
The key features of High Availability are:
- Failover deployment
- Near-instantaneous failover
- Automatic failback
- Failover to cloud
- Notification and alerting
- Failover rules
You’re likely asking yourself the following questions: Do I want automatic failover? Should I fail over to the cloud? What cloud? Whose cloud? Suffice it to say that HA is easy to use and highly configurable to meet your needs.
When a monitored service is down, the SolarWinds Platform server first tries to let the service recover before failing over to the secondary server. A failover occurs only if the same service fails again within the default self-recovery period.
When a failover condition is met and failover occurs in a pool, a failover event is logged and can be viewed in the Event Summary resource or the Events view. An email is also sent to your default recipients.
For example, if the job engine service is down, HA attempts to start it. If the job engine fails again within one hour, a failover occurs and the event is logged. If the job engine fails again after 61 minutes, a failover does not occur.
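The self-recovery rule in the example above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how the product implements it. The function name `should_fail_over` and the one-hour constant are assumptions taken from the example, and treating a failure at exactly 60 minutes as "within" the period is also an assumption.

```python
from datetime import datetime, timedelta

# Assumption: the default self-recovery period is one hour, as in the example above.
SELF_RECOVERY_PERIOD = timedelta(hours=1)


def should_fail_over(previous_failure, current_failure, window=SELF_RECOVERY_PERIOD):
    """Return True when a service fails again within the self-recovery period.

    previous_failure is None when this is the first failure seen, in which
    case HA would try to restart the service instead of failing over.
    """
    if previous_failure is None:
        return False
    # Assumption: a repeat failure exactly at the boundary still counts as "within".
    return current_failure - previous_failure <= window


if __name__ == "__main__":
    first = datetime(2024, 1, 1, 12, 0)
    print(should_fail_over(None, first))                           # first failure: restart only
    print(should_fail_over(first, first + timedelta(minutes=30)))  # repeat inside the window
    print(should_fail_over(first, first + timedelta(minutes=61)))  # repeat outside the window
```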
For more details on these features, read more.
How does HA work?
When you configure your environment for SolarWinds High Availability, for each protected server, HA uses a pool of two servers: a primary server, and a secondary server. Both servers will share the same database resources. Only one server in a pool is active at any given time.
The High Availability software monitors the health of both servers in the pool, and both servers keep an open communication channel between them to perform failover tasks.
When a critical service or process, such as the SolarWinds Information Service, goes down or becomes unresponsive, the software initiates a failover to the standby server.
After a failover to the standby server, the standby becomes the active server and will continue to act as the active server until another failover event occurs.
The standby server assumes all the responsibilities of the main server, including receiving syslogs, traps, and NetFlow information.
To access the active member of the pool, you can use a virtual IP address (VIP) when you are protecting a server on a single subnet, or a virtual hostname on either a single subnet or across multiple subnets.
Not sure which to use? See When do I use a VIP or a virtual hostname and SolarWinds High Availability in Cloud.
After SolarWinds Platform High Availability (HA) is enabled and you have set up a pool, each pool monitors itself for failover conditions such as:
- Inability to connect to the network
- Stopped SolarWinds services
- Power loss
- Network connection loss to the primary server
Once created, a pool will have one of the following statuses:
- Pool is running fine; both members are up.
- A member of the pool does not have an Up status (switchover is disabled).
- Pool is disabled (e.g., manually or because of a missing license).
For more on failover conditions and behavior, read more.
The SolarWinds Information Service runs only on the active member, so it should be used as your health check when front-ending HA with a load balancer. The Information Service listens on TCP port 17777.
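As a rough illustration of that health check, the Python sketch below treats a pool member as active when a plain TCP connection to port 17777 succeeds. This is an assumption about what a suitable probe looks like; a real load balancer would use its own monitor configuration. The function names and any host names you pass in are hypothetical.

```python
import socket

SWIS_PORT = 17777  # Information Service port mentioned above


def is_active_member(host, port=SWIS_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat the member as not active.
        return False


def find_active(members, port=SWIS_PORT, timeout=2.0):
    """Return the first member that accepts connections, or None."""
    for host in members:
        if is_active_member(host, port, timeout):
            return host
    return None
```

For example, `find_active(["primary.example.local", "secondary.example.local"])` would return whichever of those two hypothetical hosts currently accepts connections on the Information Service port.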
High availability or disaster recovery?
The net result is the same, although there is a slight difference in implementation. For the sake of simplicity, let's describe HA as having the two pool members on the same subnet, and DR as having them across multiple subnets, such as LAN -> WAN.
When you configure your environment for SolarWinds Platform High Availability on a single subnet, you place your secondary server on the same subnet as the primary server.
These two servers are then added to a High Availability pool and placed behind a Virtual IP (VIP) address so the servers in the HA pool can share the same incoming IP address.
Once the servers are part of the High Availability pool, all incoming requests through the VIP can be addressed by the server that is currently active.
When configuring High Availability across multiple subnets, there are additional configuration items available. A virtual hostname is used when configuring HA across multiple subnets. During pool creation, you can choose your DNS settings to manage the creation of the virtual hostname.
Just to note, a virtual DNS hostname is completely optional. It is provided simply as a convenience mechanism for accessing the ‘active’ member through a consistent URL or FQDN. Some customers prefer not to use DNS, and instead access the individual servers through each of their unique IPs or hostnames. If this applies to you, simply select ‘other’ when asked which DNS server type you would like to use with HA. This is the same option customers can use if they run a DNS server that is neither BIND compatible nor Microsoft DNS.
DNS can also be used by some devices to send syslog, SNMP traps, or flow data to a poller in a multi-subnet failover configuration, including when a load balancer is unavailable. If you have your own solution for directing incoming traffic to the active SolarWinds Platform (formerly Orion) server, there is no need to configure HA to update DNS records.
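If you do use a virtual hostname, one way to see which pool member it currently points to is to resolve it and compare the result with the members' own addresses. The Python sketch below illustrates that idea only. The function names, the hostname, and the IP addresses are hypothetical, and it assumes the virtual hostname resolves to the active member's IPv4 address.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return {info[4][0] for info in infos}


def active_member_for(virtual_hostname, members, resolve=resolve_ipv4):
    """Return the name of the pool member the virtual hostname points to.

    members maps a member name to its own IP address. Returns None when the
    virtual hostname resolves to an address that belongs to neither member.
    """
    addresses = resolve(virtual_hostname)
    for name, ip in members.items():
        if ip in addresses:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical pool: replace with your own virtual hostname and member IPs.
    pool = {"primary": "10.0.1.15", "secondary": "10.0.2.15"}
    print(active_member_for("swplatform.example.local", pool))
```

Passing the resolver in as a parameter keeps the lookup easy to swap out, for instance to query a specific DNS server or to test the logic without a network.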
Why and when to failover
After you have set up your HA environment, there may be occasions when you want to manage a pool or perform a failover action. For example, when testing SolarWinds Platform High Availability, making network configuration changes, or upgrading, you can manually fail over to the standby pool member. Some common actions include:
- Disabling HA Pools
- Forcing a manual Failover
- Updating credentials, VIP address, virtual hostname, or active server
- Removing HA Pools
How long does failover take?
Since we prioritize polling and alerting first and foremost, these services typically recover in roughly two minutes, and faster in some instances depending on hardware. The web interface is the last thing to return to service when a recovery condition occurs. This typically takes three to four minutes but also depends somewhat on the speed and performance of the machine.
Tech Tip! If the HA pool uses a virtual hostname, you may need to flush your browser's DNS cache by closing and reopening your browser.
What about the database?
OK, now that we have the how, when, why, and where covered, let’s look at the database aspects of failover scenarios. In very simple terms, both the primary and secondary servers must be able to connect to your SQL database.
- The SQL server does not need to reside on the same subnet as either the primary or secondary servers.
- Both primary and secondary servers must be able to connect to your database.
- SolarWinds does not provide failover support for the database.
SQL AlwaysOn is the recommended solution for database HA, although SQL clustering is another popular option. In fact, no special configuration is needed on the SolarWinds Platform side. Simply specify the listener name of the SQL AlwaysOn availability group rather than the SQL instance name in the database step of the Configuration Wizard, and away you go.
With that in mind, the actual configuration of the SQL environment is rather transparent to the SolarWinds Platform server. Those details are best described in this article.
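To illustrate the "listener name instead of instance name" idea outside the Configuration Wizard, the Python sketch below builds an ODBC connection string that targets an availability group listener. You might use something like it to confirm that both pool members can reach the database. It is not what the Configuration Wizard does internally. The listener name, database name, and function name are hypothetical, and the driver name assumes Microsoft's ODBC Driver 18 for SQL Server is installed.

```python
def build_connection_string(listener, database,
                            driver="ODBC Driver 18 for SQL Server"):
    """Build an ODBC connection string that points at an AG listener."""
    parts = {
        "Driver": "{" + driver + "}",
        "Server": listener,            # listener name, not the SQL instance name
        "Database": database,
        "Trusted_Connection": "Yes",   # assumption: Windows authentication
        "MultiSubnetFailover": "Yes",  # helps reconnects when replicas span subnets
    }
    return ";".join(f"{key}={value}" for key, value in parts.items()) + ";"


if __name__ == "__main__":
    # Hypothetical names: substitute your own listener and database.
    print(build_connection_string("sql-ag-listener.example.local", "SolarWindsDB"))
```

The resulting string could be passed to an ODBC client such as `pyodbc.connect(...)` from each pool member to check connectivity through the listener.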