Product Manager

Multi-Subnet Failover (WAN/DR) Deployment

High Availability 2.0 provides the first peek into supporting redundancy for Orion across subnets. This was previously referred to as a WAN deployment or Disaster Recovery with the Failover Engine, but under High Availability we refer to it simply as a multi-subnet failover configuration. In other words, it provides the same automated, near-instantaneous failover and recovery mechanisms as High Availability did in its first release, but extends that functionality to support pollers spread across different subnets. Those could be different sites, a dedicated disaster recovery location, or possibly even the cloud.

HIGH AVAILABILITY REQUIREMENTS

  • High Availability 2.0 Installer (built-in and located under [Settings -> All Settings -> High Availability Deployment Summary -> Setup A New HA Server -> Get Started Setting Up a Server -> Download Installer Now])
    • High Availability 2.0 can be used only with product modules running on Orion Core 2017.3
  • Two servers running Windows Server 2012 or later
    • Both primary and secondary servers must reside on different subnets for multi-subnet failover
      • Primary and secondary servers which reside on the same subnet can be used for same-subnet failover using a traditional VIP
    • Windows or BIND DNS Server credentials for configuring the virtual hostname
    • Windows Server OS version, edition, or bitness need not match between primary and secondary servers.
    • Primary and secondary servers may be optionally joined to a Windows domain
    • High Availability supports the following configurations of primary and secondary servers.
      • Physical to Physical
      • Physical to Virtual
      • Virtual to Virtual
      • Virtual to Physical
  • A separate server running SQL 2012 or later.
    • This server does not need to reside on the same subnet as either the primary or secondary Orion server
    • Any Microsoft SQL edition may be used, including SQL Express
    • Bonus points for utilizing a SQL Cluster

pastedImage_4.png

PRIMARY SERVER INSTALL

When installing the primary Orion server, follow the normal 'Advanced' installation process that you would for any other Orion product. Be sure not to select the 'Express' install option, as a separate server running Microsoft SQL 2012 or later is required. When the Configuration Wizard runs, you will be prompted to provide the username, password, and IP address of the SQL server you will be using for the installation.

SECONDARY SERVER INSTALL

Once the primary server is up and running using the NPM 12.2 installer, you will need to perform a similar installation on the secondary server using the separate High Availability installer which can be downloaded from within the Orion web interface under [Settings -> All Settings -> High Availability Deployment Summary -> Setup A New HA Server -> Get Started Setting Up a Server -> Download Installer Now].

Download the High Availability Secondary Server Installer

All Settings.png
High Availability Settings.png
High Availability Deployment Summary.png
pastedImage_7.png
Evaluate High Availability.png
pastedImage_9.png

Next, run the installation by double-clicking the "SolarWinds-Orion-Installer.exe" downloaded or copied to the secondary server. Enter the IP address or fully qualified domain name (FQDN) of your main Orion server, along with the 'Admin' or equivalent credentials used to log into the Orion web interface, and click 'Next'. On the following step of the wizard, select the additional server role you wish to install. Since this will be a High Availability backup for the main Orion server, select 'Backup Server for Main Server Protection' and click 'Next'.

Enter IP of Main Orion Server & Provide 'Admin' Credentials

Select Server Role to Install

pastedImage_0.png
pastedImage_1.png

Once the installation completes, the Configuration Wizard starts. When prompted for information about the SQL server database, make sure you use the same SQL instance and SQL database that were chosen for the primary Orion server.

The following video, while arguably boring to watch, demonstrates the secondary server installation process.

CLUSTER POOL CREATION

As soon as both the primary and secondary servers are installed, return to the Orion web interface under [Settings -> All Settings -> High Availability Deployment Summary]. There you will be able to join the two servers into a multi-subnet failover pool.

Click 'Set up High Availability Pool'
Setup High Availability Pool.png
Enter a Virtual Hostname and click 'Next'
Pool Properties.png
Select your DNS Server Type
DNS Settings.png
Microsoft DNS

Enter the IP address of your DNS server, the DNS zone (e.g. solarwinds.com), and administrative credentials for the DNS server to create the shared virtual hostname.

Microsoft DNS.png

BIND DNS

If you are running BIND DNS, enter the IP address of your BIND DNS server, the DNS Zone, your TSIG secret key name, and the TSIG shared secret key value.

BIND.png

Summary

Once complete, review the summary and click "Create Pool".

Summary.png

Success

When done, you will have pooled two Orion servers together across multiple subnets into a redundant, high availability pool.

Setup Complete.png

The following short video walks through this process in under a minute.

110 Replies

The scenario I'm currently in is this: the customer is not ready to provide access to DNS for configuring SolarWinds in HA. In that case, how will the load balancer work without the HA config in place?

They also mentioned that, to use the F5 methodology, SolarWinds should be in active-active mode.

I tried to explain from all angles, but they did not get my point, so I thought of checking with folks here as to how they are utilizing F5.


There is absolutely zero requirement that a load balancer front an Active/Active pair. Think of an Active/Passive relationship as what happens when a member of a load-balanced cluster fails. The only difference here is that it's normal behavior for one member of the pool to be in a 'down' state. If they frontended two web servers with a load balancer and one web server failed, the whole website wouldn't become completely unavailable. In that same scenario, if the load balancer is configured properly, 50% of the connections wouldn't fail either; 100% of the traffic would be redirected to the surviving member. This is also how Active/Passive pairs are handled when frontended by a load balancer.

So if I understand correctly, I still need the HA config at the SolarWinds level? The reason I keep coming back to this is that we have both SolarWinds servers in different subnets, which means we need to use a virtual hostname. If it were a VIP, I don't think it would have been much of a problem.

So the F5 config will sit on top of this, am I right?


Yes, Orion High Availability is required regardless of whether the servers are on the same or different subnets, whether or not a load balancer is used, and whether a virtual hostname or a VIP is used. A load balancer will not be able to fail the Orion server over to the secondary server. Nor will it be able to determine whether the Orion server has lost connectivity to the SQL database server, has run out of free disk space, has had a service crash, etc. The load balancer is able to tell you if the website (IIS) is up and serving pages, but that's really about it.
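
To make that limitation concrete, here is a minimal Python sketch of what a load balancer style health check amounts to: probe each pool member over HTTP and route to the first one that answers. The function names and the example URLs are hypothetical, and this is not how any particular load balancer or Orion itself is implemented. Note that the probe only sees whether a web page comes back; it knows nothing about SQL connectivity, disk space, or crashed services.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=3.0):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def pick_active(members, probe=http_probe):
    """Return the first pool member whose probe succeeds, or None."""
    for member in members:
        if probe(member):
            return member
    return None


if __name__ == "__main__":
    # Hypothetical member URLs; replace with your own servers.
    pool = ["http://orion-a.example.com/", "http://orion-b.example.com/"]
    print(pick_active(pool))
```

The probe is passed in as a parameter so the selection logic can be exercised without touching the network.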

Perfect! This is what I was trying to arrive at...

Thanks a lot for all the inputs...you clarify it with complete details....


I'm running into a similar question. I'm not highly knowledgeable on "the network side" of things, so hopefully I'm not going too far down the rabbit hole for something obvious.

To be clear on how setting up a multi-subnet failover works: The only method supported today is to use a "virtual hostname" which is a DNS CNAME/Alias record (I don't know the record type).

This means that anything sending data *to* Orion, via SNMP/Syslog/etc will have to use the DNS "virtual hostname" name so it will route to the current IP address/active server.

When an Orion failover occurs the new "active" Orion server updates the DNS record of the "virtual hostname" with a new IP address (of the new active server).

My questions revolve around the caching of the old IP associated with the DNS virtual hostname scenario:

There are a few warnings in the docs about the IP address caching on anything connecting to Orion, since you're using a DNS name with a changing IP address.

For users, this means they may have to refresh their browser cache.  I'm not too concerned about them for this scenario

For external devices sending in SNMP/Syslog data, I'm not sure how this is handled, as we have old (ancient?) and "weirdo" things sending in SNMP/Syslog.  I don't think I could get all of the device owners to make sure their devices flush their DNS caches, nor do I know whether it's even possible to configure this for some devices.

In addition, I've asked around and apparently some devices can *only* be configured to use a single IP address (no DNS names) to send SNMP/Syslog data to.

This means that, when Orion fails over I really can't say how much SNMP/Syslog data I may lose due to external devices not being able to pick up the new IP, and some can't even use a DNS name so what do I do with those?

What it looks like is I need to have some network device with a static IP address that all the remote devices connect to, that then routes to my DNS "virtual hostname" entry. This device then has a low DNS cache refresh time...or something.

I talked to my network team and they indicated that something like a load balancer can route traffic based on testing what node is "available".  I'm not too sure, but both Orion primary/secondary servers should both be "up", so it comes to running some specific tests from the load balancer to determine which one is the primary. They mentioned checking an HTTP status page/URL, etc. but I don't know what Orion services would be "up" on the secondary or what's a good test.

My questions:

- Has anyone decided to use a network load balancer or other solution to handle the above scenario?  If so, are you running "tests" for determining traffic routing or just keeping the DNS caching of the load balancer time low?

- If you didn't load balance and just used a network device to replicate all incoming SNMP/Syslog data to both IP addresses of the servers in the HA pool (bypassing the virtual hostname) will the secondary Orion server pick it up?  From what I've read, some secondary services will be running to handle failover but I don't know what's "not running".  I assume SNMP/Syslog would not be running services on the non-active server.

tigger2  wrote:

The only method supported today is to use a "virtual hostname" which is a DNS CNAME/Alias record (I don't know the record type).

In a multi-subnet failover configuration a virtual hostname is optional and provided as a convenience feature. Some customers opt to use alternative means of directing traffic to the Orion server, such as a Network Load Balancer.

This means that anything sending data *to* Orion, via SNMP/Syslog/etc will have to use the DNS "virtual hostname" name so it will route to the current IP address/active server.

That's certainly one option, though most customers opt instead to configure their devices to send NetFlow, Syslog & SNMP Traps to both members of the pool. A few have created NCM Configuration Alert Actions to update the Syslog, Trap, NetFlow destinations on their devices to point to the 'Active' pool member when a failover occurs. There really are quite a few options available. You just need to pick the option that works best for you in your environment.
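
As a rough illustration of the "send to both members" option, the Python sketch below sends the same UDP datagram to every destination in a list. The addresses, port, and message are hypothetical placeholders, and a real device would do this through its own Syslog/Trap/NetFlow destination settings rather than a script.

```python
import socket


def send_to_all(message, destinations, sock=None):
    """Send one UDP datagram to every (host, port) destination.

    Returns the list of destinations for which the send raised an OSError.
    """
    own_socket = sock is None
    if own_socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    failed = []
    try:
        for destination in destinations:
            try:
                sock.sendto(message, destination)
            except OSError:
                failed.append(destination)
    finally:
        # Only close the socket if this function created it.
        if own_socket:
            sock.close()
    return failed


if __name__ == "__main__":
    # Hypothetical addresses of the two pool members.
    pool = [("10.1.0.10", 514), ("10.2.0.10", 514)]
    print(send_to_all(b"<134>myhost myapp: test message", pool))
```

Because UDP is connectionless, a successful send says nothing about whether the receiving member was listening, which is why sending to both members sidesteps the question of which one is active.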

tigger2

For users, this means they may have to refresh their browser cache.

Modern browsers maintain their own DNS cache, separate from the operating system. Unfortunately, this browser cache does not respect certain key components of DNS, such as the TTL for when a DNS entry should expire from the cache. This means that users who are actively working in the Orion web interface when a failover occurs may need to close their browser and reopen it before they can resume their session.  A load balancer or a transparent proxy such as nginx can be used as a workaround if this is bothersome.

tigger2

What it looks like is I need to have some network device with a static IP address that all the remote devices and users connect to, that then routes to my DNS "virtual hostname" entry. This device then has a low DNS cache refresh time...or something.

The TTL used by HA for the virtual hostname is already very low, at one minute. The issue is that browsers do not respect that value within their own cache, even though the operating system fully does.
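
For anything you script yourself, the practical consequence is to ask the operating system for the address each time rather than holding on to an IP. The Python sketch below shows the idea under that assumption; `FailoverWatcher` and the hostname are hypothetical, and how quickly a change becomes visible still depends on the resolver and any caches in between.

```python
import socket


def resolve_ipv4(hostname):
    """Ask the operating system resolver for the host's IPv4 addresses."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_DGRAM)
    return sorted({info[4][0] for info in infos})


class FailoverWatcher:
    """Remember the last resolved addresses and report when they change."""

    def __init__(self, hostname, resolver=resolve_ipv4):
        self.hostname = hostname
        self.resolver = resolver
        self.last = None

    def check(self):
        """Re-resolve and return True if the addresses differ from last time."""
        current = self.resolver(self.hostname)
        changed = self.last is not None and current != self.last
        self.last = current
        return changed


if __name__ == "__main__":
    # Hypothetical virtual hostname; replace with the one set on your pool.
    watcher = FailoverWatcher("orion.example.com")
    print(watcher.check(), watcher.last)
```

Calling `check()` on a timer around the one-minute TTL would flag a failover shortly after the record is updated.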

Oh wow, an LB just seems like overkill, won't be doing that for sure.  Virtual hostname it is, thanks for your help.


Hi aLTeReGo

Have some queries on the failover setup.

1. Once we download the installer for secondary server and complete the installation it will redirect to console of Primary server, correct? Will all the services on secondary server be in running state?

2. Then once we configure the HA pool by using VIP and finish it, will the services still show in running mode in both?

3. For testing failover, what all scenarios it will work? Service restart is one, how about other scenarios?

4. In case of using VIP, console be accessible from whichever is active and VIP, right?

5. Any specific settings to be done so that we can access VIP to access the console?


1. Once we download the installer for secondary server and complete the installation it will redirect to console of Primary server, correct? Will all the services on secondary server be in running state?

Negative. Only a few critical services will be running on the standby server: the SolarWinds Administration Service, the SolarWinds Agent, SolarWinds HighAvailability, and the SolarWinds Orion Module Engine.

2. Then once we configure the HA pool by using VIP and finish it, will the services still show in running mode in both?

The same services I listed above will be running on the standby server. All other services will remain stopped and disabled until a failover occurs and the standby server becomes the active member of the pool.

3. For testing failover, what all scenarios it will work? Service restart is one, how about other scenarios?

I recommend reviewing my post here -> Torture Testing High Availability

4. In case of using VIP, console be accessible from whichever is active and VIP, right?

Yes, that is correct.

5. Any specific settings to be done so that we can access VIP to access the console?

On the off chance you configured your Orion web console to only be accessible from one specific IP address, you will need to change this so IIS is bound to all adapters, e.g. (All Unassigned).

pastedImage_25.png

Thanks for the response 🙂 We are trying to set up HA for evaluation purposes and found some issues. Let me go through all your points and let you know in case I am still not able to resolve them.


aLTeReGo

One quick query: if there is an HA setup and I have an integration with ServiceNow in which I am using a PowerShell script in the alert trigger action to send the details to SNOW, will my trigger actions and alerting stay intact when the instance fails over to the secondary?

The same question applies when I have the SNOW integration without a PowerShell script.

This might be a very stupid question but just want to ensure i am not missing anything critical...


Yes, the state of the Alerts, as well as their escalations (if any) are preserved when failovers occur. This means that the alerts will not reset or retrigger unless you've explicitly defined them to through alert escalation. Alert Acknowledgement state is also perfectly preserved throughout the failover/failback process.


Great... Thanks a lot for confirming


Thanks a lot for sharing this. It makes it easier to understand and implement.

One query on the virtual hostname part: which ports need to be allowed to and from it? Do the prerequisites for monitoring a device have to be opened toward both the primary and secondary, or also toward the virtual hostname? If there is a link covering this, please send it so I can refer to it directly.


For BIND, updating the DNS name is performed over port 53. For Windows DNS, updating the virtual hostname is done over WMI. In either case, this needs to be allowed from both members of the pool.
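
If you want a quick pre-check from each pool member, something like the Python sketch below can confirm that a TCP connection to the DNS server can be opened. It is only a rough reachability test with a hypothetical server address: it covers TCP, not UDP, and it does not prove that a dynamic update or a WMI call would be authorized.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name did not resolve.
        return False


if __name__ == "__main__":
    # Hypothetical DNS server address; run this from both pool members.
    print(tcp_port_open("10.0.0.53", 53))
```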


is there a description somewhere of how the DNS interaction works? All of the documentation and links are on screenshots of an actual NPM12.2 install which makes it hard to figure out what rights are needed.

Today with the FoE we're running an nsupdate script that adds and removes the IP addresses of the polling engines from a zone, but since we explicitly say what to do, it just works, and I can test it outside of the FoE.

i.e. nsupdate primary.txt

server A.B.C.D
update del app.swo.local A
update add app.swo.local 3600 A W.X.Y.Z
send

I don't know that I have a way to test what is needed with WAN-HA before I actually do a production upgrade.


yes, but for BIND.

The BIND.png screenshot doesn't tell me what the actual DNS changes are going to be. It wants a zone, but doesn't tell me what records are going to be added or removed.

I can't click on the 'What changes will be made' link until I've done the install, which makes planning harder

I am concerned about it asking for a username and password for setting up access control -- I already have that bit working for an existing domain, I assume I can leave it blank and it'll work?

I've opened a support case asking technical support for the information as well.


RichardLetts

What kind of account did you create for this? Did the non-admin creds work, and were they able to authenticate from SolarWinds?

We are having a tough time with one of our customers who is not willing to allow a connection to DNS, in spite of being informed that no changes will be performed apart from the entry being created for HA...


BIND does not use account credentials. Instead, a TSIG key is used.
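
For anyone who wants to rehearse a TSIG-signed update against a test zone before upgrading, the Python sketch below builds an nsupdate input script in the same style as the FoE example earlier in the thread. The exact records HA writes are not spelled out here, so the delete-then-add of a single A record is an assumption, as are the function name and every value in the example; the 60-second TTL mirrors the one-minute TTL mentioned above.

```python
def build_nsupdate_script(server, zone, hostname, new_ip,
                          key_name, key_secret, ttl=60):
    """Build nsupdate input that replaces the A record for hostname.zone.

    Assumption: a single A record is deleted and re-added, signed with a
    TSIG key. This is an illustration, not a dump of what Orion HA sends.
    """
    fqdn = f"{hostname}.{zone}"
    lines = [
        f"server {server}",
        f"zone {zone}",
        f"key {key_name} {key_secret}",
        f"update delete {fqdn} A",
        f"update add {fqdn} {ttl} A {new_ip}",
        "send",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(build_nsupdate_script(
        server="192.0.2.53",
        zone="swo.local",
        hostname="orion",
        new_ip="192.0.2.20",
        key_name="orion-ha-key",
        key_secret="REPLACE_WITH_BASE64_SECRET",
    ))
```

Saving the output to a file and running `nsupdate` against it on a lab zone shows whether the key is allowed to change that record.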