
Torture Testing High Availability

A few of you have asked for failover test scenarios for High Availability that you can try yourself. Below I outline a few that can be tested easily in your own environment. This is by no means an exhaustive list, but these are some of the most popular.

Test #1 - Network Connectivity Failure

What to do: Unplug Network Cable or Disable Network Interface on the 'Active' member in the pool

What to Expect: Failover should occur within a minute or two of disconnecting the server from the network. The server which was previously in 'Standby' mode should now be 'Active'.

Connectivity Failure.png
Disable Windows Adapter.png
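
If you would rather script this test than click through the adapter settings, a minimal sketch is below. It assumes Python runs from an elevated prompt on the 'Active' member and that the adapter is named 'Ethernet'; check 'netsh interface show interface' for the real name in your environment.

    import subprocess
    import time

    ADAPTER = "Ethernet"  # assumption: replace with your adapter's actual name

    def set_adapter(state: str) -> None:
        # Enable or disable the adapter via netsh; requires an elevated prompt.
        subprocess.run(
            ["netsh", "interface", "set", "interface", f"name={ADAPTER}", f"admin={state}"],
            check=True,
        )

    set_adapter("disabled")   # simulate the connectivity failure on the active member
    time.sleep(180)           # give the standby member a couple of minutes to go 'Active'
    set_adapter("enabled")    # reconnect before moving on to test #2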

Note: Ensure you re-enable the network interface or reconnect the network cable before moving on to test #2.

Test #2 - Power Failure

What to do: Pull Power Plug or Forcibly Power Off The Virtual Machine of the 'Active' member in the pool.

Alternative Test Path: Crash Windows with the Blue Screen of Death

What to Expect: Failover should occur within a minute or two of powering off the server. The server which was previously in 'Standby' mode should now be 'Active'.

Power Failure.png
Power Off.png
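
If the 'Active' member is a virtual machine, pulling the virtual power plug from the hypervisor is the most realistic option. As a rough in-guest stand-in that can be scripted, the sketch below issues an immediate forced shutdown; it assumes an elevated prompt on the 'Active' member.

    import subprocess

    # Force an immediate shutdown, skipping any graceful service stop; this is
    # the closest approximation of a power failure that can be triggered from
    # inside the guest OS.
    subprocess.run(["shutdown", "/s", "/f", "/t", "0"], check=True)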

Note: Be sure to power the server you shut down back on before moving on to test #3.

Test #3 - Application Failure

What to do: Forcibly terminate critical Orion processes via Task Manager or Stop Orion Services on the 'Active' member in the pool.

What to Expect: Failover should occur within a minute or two of stopping Orion services or terminating a critical Orion process. The server which was previously in 'Standby' mode should now be 'Active'.

Terminate Process.png

Stop Service.png
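
To drive this test from a script instead of Task Manager or the Services console, something along these lines should work. It assumes an elevated prompt on the 'Active' member; the service name shown is only an example, so substitute whichever Orion service or process you want to fail.

    import subprocess

    SERVICE = "SolarWinds Job Engine v2"   # assumption: example service name only

    # Stop the service cleanly...
    subprocess.run(["net", "stop", SERVICE], check=True)

    # ...or simulate a hard application crash by force-terminating its process:
    # subprocess.run(["taskkill", "/F", "/IM", "<OrionProcess>.exe"], check=True)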

Test #4 - Force a Manual Failover

From the 'Orion Deployment Summary', located under [Settings -> All Settings -> High Availability Deployment Summary], select the pool. From the right panel, click the 'Commands' drop-down and select 'Force Failover'.

Force Failover.png

Test #5 - Catastrophic Database Failure

What to do: Power off, disconnect or otherwise cause the database server to become inaccessible to both the primary and secondary servers in the HA pool.

What to Expect: When this occurs, both members are in isolation mode, meaning neither can communicate with the other or with the database. In this situation, failover does not occur because neither member is better off than the other. Polling remains on the active member, which queues its results until database connectivity is restored. The passive member remains in this state because it is unable to communicate with either the database or the active pool member.

Catastrophic Database Failure.png
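
If you would rather not power off the database server itself, one lab-friendly way to make it unreachable is a temporary outbound firewall rule on each pool member. The sketch below assumes SQL Server is listening on its default port 1433 and uses a made-up rule name; run it on both members, and delete the rule once the test is finished.

    import subprocess

    RULE = "HA test - block SQL"   # hypothetical rule name

    def block_sql(port: str = "1433") -> None:
        # Add an outbound Windows Firewall rule that blocks traffic to SQL Server.
        subprocess.run(
            ["netsh", "advfirewall", "firewall", "add", "rule",
             f"name={RULE}", "dir=out", "action=block",
             "protocol=TCP", f"remoteport={port}"],
            check=True,
        )

    def unblock_sql() -> None:
        # Remove the temporary rule to restore database connectivity.
        subprocess.run(
            ["netsh", "advfirewall", "firewall", "delete", "rule", f"name={RULE}"],
            check=True,
        )

    block_sql()     # run on both pool members to start the test
    # unblock_sql() # run on both members when you are done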

  • Make sure you disable the services' Recovery settings, or the system will restart them automatically in test #3...

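    A quick way to confirm how a service's Recovery actions are currently configured (before changing them in the services.msc Recovery tab) is to query them with sc.exe; the service name below is only an example.

        import subprocess

        # Query the configured failure/recovery actions for a service (run elevated).
        # "SolarWinds Job Engine v2" is only an example; check each Orion service.
        subprocess.run(["sc", "qfailure", "SolarWinds Job Engine v2"], check=True)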

  • aLTeReGo  My testing results and comments...

    Overall, these new features and functions are awesome to say the least...

    Torture Testing HA:

    Test 1 – Unplug Network of primary:

    Eventually came up (less than 5 minutes), had to re-log in, and the Network Discovery Polling and Application Summary was blank – had to re-run the Discovery.
    Torture Test 2 – Power Off Primary unexpectedly:

    Torture Test 3:

    Service Failure:

    All tests passed. My secondary server came up with only minor issues (I had to restart my network polling and re-add application monitors, as I disabled the network card in the middle of it).

    The only other comment would be to add a setting to be able to fail back automatically, as some managers/administrators would like the primary to always be the primary…

  • Nice catch steven.melnichuk@tbdssab.ca! That test was written for Beta 2. In Beta 3, we have implemented local recovery attempts to prevent unnecessary failovers: when the server can recover locally without failing over, it will do so. If for some reason the server is unable to fully recover after 3 attempts within a one-hour timeframe, it will fail over to the other member in the pool.

  • The other thing I liked is the fact that EVERYTHING failed over and was accessible: website, polling, reporting, and everything. I also didn't have to put in a new address - I kept using http://solartest1 even though it was off the network.

  • Just bumping up for those who didn't spot that awesome article yet. Thanks aLTeReGo

  • aLTeReGo, may I please confirm one scenario whose expected behaviour I am not quite sure about. We have a stretched subnet between 2 DCs, with the SolarWinds HA cluster sitting in it as well as a SQL Always On cluster:

    * DC1 (active poller + active SQL)

    * DC2 (standby poller + read-only SQL).

    So, my question is: what if there are some WAN-related connectivity issues, so that DC2 becomes isolated from DC1, but both continue to be operational? What will happen then? Will DC2 spring into life, potentially leaving us with 2 live instances, one at each end?

  • alexslv, split-brain conditions are not possible with HA. Despite appearances, there is always a quorum to break an electoral tie. That third tie-breaker is always the SQL server itself. This eliminates any possibility of both members running in isolation mode.

  • Hi aLTeReGo,

    One very basic question on this scenario: when the DB is unavailable for any reason, will all polling of devices continue, with the updates queued and pushed to the DB once it is available again?

    And where exactly does the queue pile up in this case?

    So later, when I pull the data, will it show everything?

  • If the database is inaccessible, the polling results queue on the polling engine they were collected from. These results are stored in MSMQ and are flushed and persisted to the database once the SQL instance is brought back online.
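
    If you want to watch that backlog build and drain during test #5, the MSMQ performance counters expose the queue depth. A minimal sketch using the built-in typeperf tool is below; the counter path shown assumes an English Windows install, so verify it with 'typeperf -q "MSMQ Queue"' first.

        import subprocess

        # Sample the number of messages sitting in each local MSMQ queue once per
        # second, five times, using the Windows performance counter for MSMQ.
        subprocess.run(
            ["typeperf", r"\MSMQ Queue(*)\Messages in Queue", "-si", "1", "-sc", "5"],
            check=True,
        )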

  • ... just on the same subject, we also do additional checks to identify this sort of issue:

    Watch The Watcher... Or Self-Monitoring For MSMQs