
Torture Testing High Availability

A few of you have asked for failover test scenarios for High Availability that you can try yourself. Below I outline a few that can be tested easily in your own environment. This is by no means an exhaustive list, but these are some of the most popular.

Test #1 - Network Connectivity Failure

What to do: Unplug Network Cable or Disable Network Interface on the 'Active' member in the pool

What to Expect: Failover should occur within a minute or two of disconnecting the server from the network. The server which was previously in 'Standby' mode should now be 'Active'.

Connectivity Failure.png
Disable Windows Adapter.png
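
If you would rather script this test than click through the adapter settings, a minimal sketch is below. It assumes Python runs from an elevated prompt on the 'Active' member and that the adapter is named 'Ethernet'; check 'netsh interface show interface' for the real name in your environment.

    import subprocess
    import time

    ADAPTER = "Ethernet"  # assumption: replace with your adapter's actual name

    def set_adapter(state: str) -> None:
        # Enable or disable the adapter via netsh; requires an elevated prompt.
        subprocess.run(
            ["netsh", "interface", "set", "interface", f"name={ADAPTER}", f"admin={state}"],
            check=True,
        )

    set_adapter("disabled")   # simulate the connectivity failure on the active member
    time.sleep(180)           # give the standby member a couple of minutes to go 'Active'
    set_adapter("enabled")    # reconnect before moving on to test #2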

Note: Ensure you re-enable the network interface or reconnect the network cable before moving on to test #2.

Test #2 - Power Failure

What to do: Pull Power Plug or Forcibly Power Off The Virtual Machine of the 'Active' member in the pool.

Alternative Test Path: Crash Windows with the Blue Screen of Death

What to Expect: Failover should occur within a minute or two of powering off the server. The server which was previously in 'Standby' mode should now be 'Active'.

Power Failure.png
Power Off.png
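
If the 'Active' member is a virtual machine, pulling the virtual power plug from the hypervisor is the most realistic option. As a rough in-guest stand-in that can be scripted, the sketch below issues an immediate forced shutdown; it assumes an elevated prompt on the 'Active' member.

    import subprocess

    # Force an immediate shutdown, skipping any graceful service stop; this is
    # the closest approximation of a power failure that can be triggered from
    # inside the guest OS.
    subprocess.run(["shutdown", "/s", "/f", "/t", "0"], check=True)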

Note: Be sure to power the server you shut down back on before moving on to test #3.

Test #3 - Application Failure

What to do: Forcibly terminate critical Orion processes via Task Manager or Stop Orion Services on the 'Active' member in the pool.

What to Expect: Failover should occur within a minute or two of stopping Orion services or terminating a critical Orion process. The server which was previously in 'Standby' mode should now be 'Active'.

Terminate Process.png

Stop Service.png
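
To drive this test from a script instead of Task Manager or the Services console, something along these lines should work. It assumes an elevated prompt on the 'Active' member; the service name shown is only an example, so substitute whichever Orion service or process you want to fail.

    import subprocess

    SERVICE = "SolarWinds Job Engine v2"   # assumption: example service name only

    # Stop the service cleanly...
    subprocess.run(["net", "stop", SERVICE], check=True)

    # ...or simulate a hard application crash by force-terminating its process:
    # subprocess.run(["taskkill", "/F", "/IM", "<OrionProcess>.exe"], check=True)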

Test #4 - Force a Manual Failover

From the 'Orion Deployment Summary', located under [Settings -> All Settings -> High Availability Deployment Summary], select the pool. From the right panel, click the 'Commands' drop-down and select 'Force Failover'.

Force Failover.png

Test #5 - Catastrophic Database Failure

What to do: Power off, disconnect or otherwise cause the database server to become inaccessible to both the primary and secondary servers in the HA pool.

What to Expect: When this occurs, both members are in isolation mode, meaning neither can communicate with the other or with the database. In this situation, failover does not occur because neither member is better off than the other. Polling remains on the active member, which queues its results until database connectivity is restored. The passive member remains in this state because it is unable to communicate with either the database or the active pool member.

Catastrophic Database Failure.png
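
If you would rather not power off the database server itself, one lab-friendly way to make it unreachable is a temporary outbound firewall rule on each pool member. The sketch below assumes SQL Server is listening on its default port 1433 and uses a made-up rule name; run it on both members, and delete the rule once the test is finished.

    import subprocess

    RULE = "HA test - block SQL"   # hypothetical rule name

    def block_sql(port: str = "1433") -> None:
        # Add an outbound Windows Firewall rule that blocks traffic to SQL Server.
        subprocess.run(
            ["netsh", "advfirewall", "firewall", "add", "rule",
             f"name={RULE}", "dir=out", "action=block",
             "protocol=TCP", f"remoteport={port}"],
            check=True,
        )

    def unblock_sql() -> None:
        # Remove the temporary rule to restore database connectivity.
        subprocess.run(
            ["netsh", "advfirewall", "firewall", "delete", "rule", f"name={RULE}"],
            check=True,
        )

    block_sql()     # run on both pool members to start the test
    # unblock_sql() # run on both members when you are done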

  • Make sure you disable the services' Recovery settings, or the system will restart them automatically in test #3...

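    A quick way to confirm how a service's Recovery actions are currently configured (before changing them in the services.msc Recovery tab) is to query them with sc.exe; the service name below is only an example.

        import subprocess

        # Query the configured failure/recovery actions for a service (run elevated).
        # "SolarWinds Job Engine v2" is only an example; check each Orion service.
        subprocess.run(["sc", "qfailure", "SolarWinds Job Engine v2"], check=True)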

  • aLTeReGo  My testing results and comments...

    Overall, these new features and functions are awesome to say the least...

    Torture Testing HA:

    Test 1 – Unplug Network of primary:

    Eventually came up (less than 5 minutes), had to re-log in, and the Network Discovery Polling and Application Summary was blank – had to re-run the Discovery.
    Torture Test 2 – Power Off Primary unexpectedly:

    Torture Test 3:

    Service Failure:

    All tests passed. My secondary server came up with only minor issues (I had to restart my network polling and re-add application monitors, as I disabled the network card in the middle of it).

    The only other comment would be to add a setting to be able to fail back automatically, as some managers/administrators would like the primary to always be the primary…

  • Nice catch steven.melnichuk@tbdssab.ca! That test was written for Beta 2. In Beta 3, we have implemented local recovery attempts to prevent unnecessary failovers: when the server can recover locally without failing over, it will do so. If for some reason the server is unable to fully recover after 3 attempts within a one-hour timeframe, it will fail over to the other member in the pool.

  • The other thing I liked is the fact that EVERYTHING failed over and was accessible: website, polling, reporting, and everything. I also didn't have to put in a new address - I kept using http://solartest1 even though it was off the network.

  • Just bumping up for those who didn't spot that awesome article yet. Thanks aLTeReGo

  • aLTeReGo, may I please confirm one scenario whose expected behaviour I am not quite sure about. We have a stretched subnet between 2 DCs, with the SolarWinds HA cluster sitting in it as well as a SQL Always On cluster:

    * DC1 (active poller + active SQL)

    * DC2 (standby poller + read-only SQL).

    So, my question is: what if there are some WAN-related connectivity issues, so that DC2 becomes isolated from DC1, but both continue to be operational? What will happen then? Will DC2 spring into life, potentially leaving us with 2 live instances, one at each end?

  • alexslv, split-brain conditions are not possible with HA. Despite appearances, there is always a quorum to break an electoral tie. That third tie-breaker is always the SQL server itself. This eliminates any possibility of both members running in isolation mode.

  • Hi aLTeReGo,

    One very basic question on this scenario: when the DB is unavailable for any reason, will all polling of devices continue, with the updates queued and pushed to the DB once it is available again?

    And where exactly does the queue pile up in this case?

    So later, when I pull the data, will it show everything?

  • If the database is inaccessible, the polling results queue on the polling engine they were collected from. These results are stored in MSMQ and are flushed and persisted to the database once the SQL instance is brought back online.
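
    If you want to watch that backlog build and drain during test #5, the MSMQ performance counters expose the queue depth. A minimal sketch using the built-in typeperf tool is below; the counter path shown assumes an English Windows install, so verify it with 'typeperf -q "MSMQ Queue"' first.

        import subprocess

        # Sample the number of messages sitting in each local MSMQ queue once per
        # second, five times, using the Windows performance counter for MSMQ.
        subprocess.run(
            ["typeperf", r"\MSMQ Queue(*)\Messages in Queue", "-si", "1", "-sc", "5"],
            check=True,
        )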

  • ... just on the same subject, we also do additional checks to identify this sort of issue:

    Watch The Watcher... Or Self-Monitoring For MSMQs