In two previous blog posts I have introduced the Failover Engine and walked through some common Q&A we received right after launching the Failover Engine last year.

In this blog I wanted to get down into the nuts and bolts of how the Failover Engine actually works under the covers.  When it comes to protecting an application, there is more to it than just watching the services. You need to watch and protect the entire application stack.  On your Orion server, the following components exist:

  1. Services
  2. Registry Settings
  3. File System Structure
  4. Web Server (IIS)

We could just watch the Orion services and be done with it, but then you would have to manually keep your secondary failover server at exactly the same configuration and settings.  If a failover occurs and your end users notice reports or settings changed or missing, you are going to get a call.  Also, what if the problem is not with Orion itself, but with Microsoft IIS?  Since the Failover Engine watches and protects the four areas above, as also illustrated in the image below, you do not have to worry about these scenarios.

[Image: overview of the four areas the Failover Engine protects]

Let’s walk through each of these four areas in further detail.

  1. Services   
    As shown above, the Heartbeat periodically checks that all the services are up and running.  The Heartbeat portion of the application is responsible for the data replication, switchover and failover processes.  The protected service list is created dynamically from a specific registry setting that we write to on product installs, so as long as you have the appropriate license, a newly installed module is picked up for protection automatically.  For the services being monitored, you can specify behavior within the Failover Management client (up to three steps per service), such as which services are the most critical and when to initiate a failover.  For example, for the SolarWinds Syslog Service:
    1. Service fails
    2. First attempt: restart the service
    3. If the restart fails, second attempt: restart the entire Orion application
    4. If the Syslog service still fails, initiate a failover to the secondary server
  2. Registry Settings   
    As I discussed above, there are critical registry settings that the Failover Engine watches and replicates between the primary and secondary servers (more details on replication below).  These include things like licensing info, SolarWinds directory locations and registered SolarWinds services.  This is the reason you don’t have to buy two copies of Orion: since only one copy of Orion is running at any given time and the registry is in sync, you don’t need two license keys.
  3. File System Structure   
    With Orion most data is stored in the database, so why is this important?  There are files that are important to the use and operation of the product and need to be replicated across servers.  Examples include report template definitions, since you may have created a set of custom reports.  From a back-end operational standpoint, the SolarWinds Job Engine service has a small database we install that handles job dispatching and processing, which is also very important to keep in sync.  Here are some key items to know about the replication portion of the Failover Engine:
    • If for some reason the channel between the two servers is broken, the Failover Engine will queue up the replication changes.  When the connection is re-established, we will restart replication to verify that the data sets are the same.
    • Near real-time byte level data replication is provided between the active and passive servers.  Byte level replication ensures that only file deltas are replicated and not whole files or transactions.
    • Near real-time byte level replication works within the Windows kernel to ensure that data changes are sent from the active to the passive (secondary) server.  Below is a basic overview of how this process works.
      1. Data change is requested
      2. Failover Engine Filter Driver intercepts the request at the I/O level
      3. Failover Engine Filter Driver checks the replication settings to see if this change needs replicating
      4. Failover Engine Filter Driver generates a unique sequence number for the replication request
      5. Failover Engine replicates the data and also sends the change on to the Windows file system
      6. Windows commits the data change and sends confirmation to the application layer
      7. Failover Engine Filter Driver intercepts the confirmation
      8. Failover Engine replicates the confirmation to the passive server if required
      9. Data change process is now complete
  4. Web Server (IIS)   
    Orion services can be up and running just fine, but users may be complaining that the web console does not come up or is slow.  Is IIS running?  Failover Engine watches IIS at a service level; you can also define checks & tests to ensure the website is up and responding within an acceptable period of time.
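
The three-step Syslog example from item 1 can be sketched as a small escalation policy. This is only an illustration in Python, not how the Failover Engine is implemented: the action names, the `RecoveryPolicy` class and the `run_action` and `is_healthy` callbacks are all hypothetical stand-ins for what you would configure in the Failover Management client.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical action names, chosen for this sketch only.
RESTART_SERVICE = "restart service"
RESTART_APPLICATION = "restart application"
FAILOVER = "initiate failover"


@dataclass
class RecoveryPolicy:
    service: str
    steps: List[str]  # the post allows up to three steps per service

    def __post_init__(self) -> None:
        if not 1 <= len(self.steps) <= 3:
            raise ValueError("a policy needs between one and three steps")


def handle_failure(policy: RecoveryPolicy,
                   run_action: Callable[[str], None],
                   is_healthy: Callable[[], bool]) -> List[str]:
    """Run each step in order, stopping as soon as the service is healthy.

    Returns the actions that were actually taken.
    """
    taken: List[str] = []
    for action in policy.steps:
        run_action(action)
        taken.append(action)
        if is_healthy():
            break
    return taken


# The Syslog example from the post, expressed as a policy.
SYSLOG_POLICY = RecoveryPolicy(
    "SolarWinds Syslog Service",
    [RESTART_SERVICE, RESTART_APPLICATION, FAILOVER],
)
```

If the first restart brings the service back, `handle_failure` stops there; a failover is only reached when both earlier steps leave the service unhealthy.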

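The queueing behavior and the sequence numbers described under item 3 can be pictured with a short Python sketch. It is a simplified model under stated assumptions, not the kernel filter driver itself: the `Change` and `ReplicationQueue` names, the `send` callback and the `connected` flag are invented for illustration, and the verification pass that runs after a reconnect is left out.

```python
from collections import deque
from dataclasses import dataclass
from itertools import count
from typing import Callable, Deque


@dataclass(frozen=True)
class Change:
    seq: int      # unique, increasing sequence number for the request
    path: str     # file the change applies to
    offset: int   # byte offset of the delta within the file
    data: bytes   # only the changed bytes, not the whole file


class ReplicationQueue:
    """Queues byte-level changes and sends them to the passive server in order."""

    def __init__(self, send: Callable[[Change], None]) -> None:
        self._send = send              # hypothetical channel to the passive server
        self._seq = count(1)
        self._pending: Deque[Change] = deque()
        self.connected = True

    def record(self, path: str, offset: int, data: bytes) -> Change:
        """Assign a sequence number to a change and try to replicate it."""
        change = Change(next(self._seq), path, offset, data)
        self._pending.append(change)
        self.flush()
        return change

    def flush(self) -> None:
        """Send queued changes oldest first while the channel is up."""
        while self.connected and self._pending:
            self._send(self._pending[0])
            self._pending.popleft()    # drop only after a successful send
```

While `connected` is false, changes simply accumulate; calling `flush()` after the channel returns delivers them in their original sequence order.
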
Let’s switch gears now to licensing/packaging. Since one of my previous Failover Engine posts, APM and IPAM have released new versions.  Both can still be deployed as a module, as was always the case, but they can now also be installed standalone without requiring NPM.

We still license by what we call a “primary product” per server. Previously, the products classified as primary products were Orion NPM, APM and NCM.  This is where the change comes in: prior to IPAM 2.0 you could only install IPAM as a module, so we didn’t charge for protecting it.  If you still deploy it as a module, this remains true.  If you purchase IPAM, choose to deploy it standalone and want Failover Engine protection, then you will need to purchase the Failover Engine for One Primary Product.  Now that we have released SolarWinds User Device Tracker (UDT), it behaves the same way IPAM does here.

Let’s walk through two different examples.

  1. Orion NPM, IPAM and NTA – you will need Failover Engine for One Primary Product.  Since NPM is considered a primary product and IPAM is installed as a module, you get protection for IPAM for free.
  2. Orion IPAM only – you will need Failover Engine for One Primary Product.  Since you are deploying IPAM as a standalone you will need to purchase a license to protect it.

One last scenario with the release of UDT, or User Device Tracker.  Since IPAM and UDT are not considered "primary products" as described above, what do you need to buy if you purchase UDT and IPAM and want to protect both of them with the Failover Engine (FoE)?  The answer is that you just need a single FoE for One Primary Product in this scenario.
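
To make the rules easier to check against your own deployment, here is a rough Python sketch that encodes only the scenarios described in this post. The function name and the return strings are made up for illustration, and it deliberately reports servers with more than one primary product as not covered, since this post does not discuss that case.

```python
from typing import Set

# Product groupings as described in this post.
PRIMARY_PRODUCTS = {"NPM", "APM", "NCM"}
STANDALONE_CAPABLE = {"IPAM", "UDT"}   # free as modules, licensed when standalone

ONE_PRIMARY = "Failover Engine for One Primary Product"
NOTHING = "nothing to license"
NOT_COVERED = "not covered by the examples in this post"


def foe_license_needed(installed: Set[str]) -> str:
    """Return the FoE license a single server needs, per the post's examples."""
    primaries = installed & PRIMARY_PRODUCTS
    if len(primaries) > 1:
        # The post only walks through servers with one primary product.
        return NOT_COVERED
    if primaries:
        # Modules installed alongside the primary product are protected for free.
        return ONE_PRIMARY
    if installed & STANDALONE_CAPABLE:
        # Standalone IPAM and/or UDT together need one license.
        return ONE_PRIMARY
    return NOTHING
```

For the scenarios above, `foe_license_needed({"NPM", "IPAM", "NTA"})`, `foe_license_needed({"IPAM"})` and `foe_license_needed({"UDT", "IPAM"})` all come back with the One Primary Product license.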

If you have any questions or comments, please post them.  As I have illustrated in this post and the two previous posts I referenced at the start, the Failover Engine is very feature rich: it is about more than just High Availability/Disaster Recovery, it is about Application Availability.