A few of you have asked for High Availability failover test scenarios you can try yourself. Below I outline a few that can be tested easily in your own environment. This is by no means an exhaustive list, but these are some of the most popular.
What to do: Unplug Network Cable or Disable Network Interface on the 'Active' member in the pool
What to Expect: Failover should occur within a minute or two of disconnecting the server from the network. The server which was previously in 'Standby' mode should now be 'Active'.
Note: Ensure you re-enable the network interface or reconnect the network cable before moving on to test #2.
What to do: Pull Power Plug or Forcibly Power Off The Virtual Machine of the 'Active' member in the pool.
Alternative Test Path: Crash Windows with the Blue Screen of Death
What to Expect: Failover should occur within a minute or two of powering off the server. The server which was previously in 'Standby' mode should now be 'Active'.
Note: Be sure to power the server you shut down back on before moving on to test #3.
What to do: Forcibly terminate critical Orion processes via Task Manager or Stop Orion Services on the 'Active' member in the pool.
What to Expect: Failover should occur within a minute or two of stopping Orion services or terminating a critical Orion process. The server which was previously in 'Standby' mode should now be 'Active'.
From the 'Orion Deployment Summary' located under [Settings -> All Settings -> High Availability Deployment Summary] select the Pool. From the right panel, click the 'Commands' drop down and select 'Force Failover'.
What to do: Power off, disconnect or otherwise cause the database server to become inaccessible to both the primary and secondary servers in the HA pool.
What to Expect: When this occurs both members are in isolation mode, meaning neither can communicate with the other or with the database. In this situation, failover does not occur because neither member is better off than the other. Polling remains on the active member, which queues its results until database connectivity is restored. The passive member remains passive, since it can communicate neither with the database nor with the active pool member.
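The "neither member is better off" reasoning above can be sketched as a small decision function. This is only an illustration of the rule as described in this post, not the product's actual logic: the `MemberView` type, its field names, and the "standby must be strictly better off" test are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MemberView:
    """What one pool member can currently reach (hypothetical model)."""
    sees_database: bool
    sees_peer: bool

    @property
    def isolated(self) -> bool:
        # Isolation mode as described above: no database and no peer.
        return not self.sees_database and not self.sees_peer


def should_fail_over(active: MemberView, standby: MemberView) -> bool:
    """Return True only when the standby is strictly better off."""
    if active.isolated and standby.isolated:
        # Database outage: neither member gains anything by switching,
        # so polling stays on the active member.
        return False
    # Assumed rule: switch when the standby can reach the database
    # and the active member cannot.
    return standby.sees_database and not active.sees_database


# Database down for everyone: no failover.
print(should_fail_over(MemberView(False, False), MemberView(False, False)))
# Active member unplugged from the network: failover.
print(should_fail_over(MemberView(False, False), MemberView(True, False)))
```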
aLTeReGo My testing results and comments...
Overall, these new features and functions are awesome to say the least...
Torture Testing HA:
Test 1 – Unplug Network of primary:
Eventually came up (less than 5 minutes), had to re-log in,
and the Network Discovery Polling and Application Summary was blank – had to
re-run the Discovery.
Torture Test 2 – Power Off Primary unexpectedly:
Torture Test 3:
All tests passed. My secondary server came up with only
minor issues (had to restart my Network polling and the adding of application
monitors, as I disabled the network card in the middle of it).
The only other comment would be to add a setting to be able
to fail back automatically as some managers/administrators <sic> would
like the primary to always be the primary…
Nice catch email@example.com! That test was written for Beta 2. Now in Beta 3, we have implemented local recovery attempts to prevent unnecessary failovers. When/if the server can recover locally without failing over, it will do so. If for some reason the server is unable to fully recover after 3 attempts in a one-hour timeframe, the server will fail over to the other member in the pool.
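For anyone wondering how "3 attempts in a one hour timeframe" might play out, here is a rough sketch using a sliding one-hour window. The class and method names are hypothetical, and whether the product uses a sliding or a fixed window is not stated in this thread, so treat the window handling as an assumption.

```python
from collections import deque

MAX_ATTEMPTS = 3        # local recovery attempts allowed, per the reply above
WINDOW_SECONDS = 3600   # one-hour timeframe, per the reply above


class RecoveryPolicy:
    """Decide between local recovery and failover (illustrative only)."""

    def __init__(self, max_attempts: int = MAX_ATTEMPTS,
                 window: float = WINDOW_SECONDS) -> None:
        self.max_attempts = max_attempts
        self.window = window
        self._attempts = deque()  # timestamps of recent local recoveries

    def on_failure(self, now: float) -> str:
        # Forget attempts that have aged out of the window.
        while self._attempts and now - self._attempts[0] >= self.window:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_attempts:
            # Local recovery has already been tried too often recently.
            return "failover"
        self._attempts.append(now)
        return "recover_locally"


policy = RecoveryPolicy()
for t in (0, 600, 1200, 1800):  # four failures within 30 minutes
    print(t, policy.on_failure(t))
```

With these inputs the first three failures are handled locally and the fourth triggers a failover.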
The other thing I liked is the fact that EVERYTHING failed over and was accessible...website, polling, reporting and everything. I also didn't have to put in a new address - I kept using http://solartest1 even though it was off the network.
aLTeReGo, may I confirm one scenario where I am not quite sure about the expected behaviour? We have a stretched subnet between 2 DCs. The SolarWinds HA cluster sits in it, as does a SQL Always On cluster:
* DC1 (active poller + active SQL)
* DC2 (standby poller + read-only SQL).
So, my question is: what if WAN connectivity issues isolate DC2 from DC1 while both continue to be operational? What will happen then? Will DC2 spring to life, potentially leaving us with 2 live instances, one at each end?
alexslv, split brain conditions are not possible with HA. Despite appearance, there is always a quorum to break an electoral tie. That third tie breaker is always the SQL server itself. This eliminates any possibility of both members running in isolation mode.
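To illustrate the general idea of a third party breaking the tie, here is a sketch of a lease kept in a shared store. This is a generic leader-election technique swapped in for illustration; the thread does not describe how the product actually arbitrates, and `LeaseTable`, `try_claim`, and the 60-second TTL are all made up for the example.

```python
from typing import Optional


class LeaseTable:
    """Stand-in for a row in the shared database (hypothetical)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_claim(self, member: str, now: float) -> bool:
        """Claim or renew the active role; only one member can hold it."""
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == member:
            self.holder = member
            self.expires_at = now + self.ttl
            return True
        return False


lease = LeaseTable(ttl=60)
print(lease.try_claim("dc1", now=0))    # dc1 becomes active
print(lease.try_claim("dc2", now=10))   # dc2 cannot see dc1, but dc1 still holds the lease
print(lease.try_claim("dc1", now=30))   # dc1 renews
print(lease.try_claim("dc2", now=95))   # dc1 stopped renewing, so dc2 takes over
```

Because both members ask the same arbiter, a WAN break between them cannot make both active at once: the member that cannot renew or claim the lease stays out of the active role.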
One very basic question on this scenario: when the DB is unavailable for any reason, will polling of the devices continue, with the updates queued and pushed to the DB once it is available?
And where exactly does the queue pile up in this case?
So later, when I pull the data, will it show everything?
If the database is inaccessible, the polling results queue on the polling engine they were collected from. These results are stored in MSMQ, then flushed and persisted to the database once the SQL instance is brought back online.
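The queue-then-flush behaviour can be pictured with a small store-and-forward sketch. An in-memory `deque` stands in for MSMQ here (MSMQ persists messages on disk, this does not), and `BufferedWriter`, `FakeDb`, and the use of `ConnectionError` to signal an outage are all assumptions made for the example.

```python
from collections import deque


class BufferedWriter:
    """Hold polling results locally until the database accepts them."""

    def __init__(self, db_write) -> None:
        self._db_write = db_write   # callable; raises ConnectionError when the DB is down
        self._pending = deque()     # stand-in for the local message queue

    @property
    def pending(self) -> int:
        return len(self._pending)

    def submit(self, result) -> None:
        self._pending.append(result)
        self.flush()

    def flush(self) -> int:
        """Write queued results oldest first; stop at the first failure."""
        written = 0
        while self._pending:
            try:
                self._db_write(self._pending[0])
            except ConnectionError:
                break               # database still down, keep the rest queued
            self._pending.popleft()
            written += 1
        return written


class FakeDb:
    """Toy database that can be switched off and on."""

    def __init__(self) -> None:
        self.up = False
        self.rows = []

    def write(self, row) -> None:
        if not self.up:
            raise ConnectionError("database unreachable")
        self.rows.append(row)


db = FakeDb()
writer = BufferedWriter(db.write)
writer.submit({"node": "router1", "cpu": 12})
writer.submit({"node": "router1", "cpu": 15})
print(writer.pending, len(db.rows))   # results are queued, nothing written yet
db.up = True
writer.flush()
print(writer.pending, len(db.rows))   # queue drained into the database
```

Results are removed from the queue only after a successful write, so nothing polled during the outage is dropped and the history shows up once the flush completes.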