Force Orion High Availability to Utilize VIP as Source IP Address

Orion High Availability (HA) is designed to provide uninterrupted access to the Orion Web Console and uninterrupted handling of incoming traffic (e.g. NetFlow, Syslog, Traps, Agents). For this purpose, HA allows using a Virtual IP address (VIP), a Virtual Host Name (VHN), or both, depending on the environment where HA is implemented.

From the monitoring perspective, however, outgoing traffic can use any of the IP addresses associated with the HA pool: the VIP (if configured) or any of the IP addresses of the active pool members. That means devices have to answer polling queries (e.g. SNMP) coming from any of these three IP addresses. In many environments this poses no trouble, because devices allow traffic from any source, or because network access control lists are centralized and easily modified to allow these exceptions. In some cases, however, devices are locked down to a single IP address. This is typically the VIP, which was previously the IP address of the original Orion server before HA was implemented. Polling from any other address then fails (packets are dropped or refused, and no data is returned). Unfortunately, this behavior results from the way the operating system (Windows) decides which IP address to use as the source, which is beyond an application's ability to control.

The Windows implementation of the TCP/IP stack provides a mechanism for telling the system which IP addresses to skip during this decision-making process. Each IP address has a property, SkipAsSource, which can be modified on the fly and immediately affects the way outgoing traffic is sent. One of the easiest ways of doing this is to use PowerShell in conjunction with Windows Task Scheduler. Below you can find an example PowerShell script which:

  1. Checks whether the VIP ('10.160.198.8' in the example below) exists on the server. If it does, the script sets the VIP's SkipAsSource to False and sets SkipAsSource to True on all remaining IP addresses, which means that Windows will use the VIP for outgoing traffic
  2. If the VIP does not exist on the server, sets SkipAsSource to False on all remaining IP addresses, which means Windows will use any of the available IP addresses for outgoing traffic

<#
.SYNOPSIS
  Script adjusts SkipAsSource setting on IP addresses.

.DESCRIPTION
  Adjusting the SkipAsSource setting on IP addresses makes Windows send outgoing traffic from an IP address whose SkipAsSource is set to False.

.INPUTS
  None

.OUTPUTS
  None

.NOTES
  Version: 1.0
  Author: Mariusz Handke
  Creation Date: 2018-08-31
  Purpose/Change: Initial release
#>

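$VIP = "10.160.198.8"
# Collect every IPv4 address bound to the server's network adapters
$IPS = Get-NetAdapter | Get-NetIPAddress -AddressFamily IPv4 | ForEach-Object { $_.IPAddress }
# -contains compares whole strings; -Match would treat $VIP as a regex and also match e.g. 10.160.198.80
If ($IPS -contains $VIP) {
  # VIP is present: skip every other address so Windows sources traffic from the VIP
  foreach ($IP in $IPS) {
    if ($IP -ne $VIP) { Set-NetIPAddress -IPAddress $IP -SkipAsSource $True }
  }
  Set-NetIPAddress -IPAddress $VIP -SkipAsSource $False
} Else {
  # VIP is absent: make all addresses eligible as source again
  foreach ($IP in $IPS) {
    Set-NetIPAddress -IPAddress $IP -SkipAsSource $False
  }
}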
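
To confirm what the script did, the SkipAsSource flag can be inspected directly. The lines below are a minimal sketch rather than part of the original script; the remote address passed to Find-NetRoute is a hypothetical monitored device and should be replaced with a real node from your environment.

```powershell
# List every IPv4 address with its SkipAsSource flag
Get-NetIPAddress -AddressFamily IPv4 |
  Sort-Object InterfaceAlias, IPAddress |
  Format-Table IPAddress, InterfaceAlias, SkipAsSource -AutoSize

# Ask Windows which local address and route it would pick for a given destination.
# 10.160.198.50 is a hypothetical monitored device, used for illustration only.
Find-NetRoute -RemoteIPAddress "10.160.198.50" |
  Select-Object IPAddress, InterfaceAlias, NextHop
```

On the active server, after the script has run, only the VIP should show SkipAsSource as False.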

Implementing it as a continuously running solution:

  1. Save the above script to a file on the server (e.g. C:\Orion_HA_set_IP_addresses.ps1)
  2. Using Windows Task Scheduler, create a simple task which executes the above script on a recurring schedule. Be aware that the shortest repetition interval offered in the Task Scheduler drop-down list is five minutes. If you require more frequent execution, simply create multiple triggers within the task (e.g. 00:00, 00:01, 00:02, 00:03, 00:04, each one repeated every 5 minutes, resulting in execution every minute)
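
The task can also be created from PowerShell instead of the Task Scheduler interface. The following is a sketch under a few assumptions: the task name is made up for illustration, the script path is the example path from step 1, the task runs as SYSTEM, and the server's execution policy allows the script to run. The ScheduledTasks cmdlets take the repetition interval as a time span, so a one-minute interval is used here with a single trigger; verify that your Windows version accepts it before relying on it.

```powershell
# Hypothetical task name; the script path is the example path from step 1
$TaskName   = "Orion HA SkipAsSource"
$ScriptPath = "C:\Orion_HA_set_IP_addresses.ps1"

# Run the script with Windows PowerShell, without loading a user profile
$Action = New-ScheduledTaskAction -Execute "powershell.exe" `
  -Argument "-NoProfile -File `"$ScriptPath`""

# Start now and repeat every minute. Newer Windows versions repeat indefinitely
# when -RepetitionDuration is omitted; older versions may need it set explicitly.
$Trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
  -RepetitionInterval (New-TimeSpan -Minutes 1)

# Set-NetIPAddress needs elevation, so run as SYSTEM with highest privileges
$Principal = New-ScheduledTaskPrincipal -UserId "SYSTEM" `
  -LogonType ServiceAccount -RunLevel Highest

Register-ScheduledTask -TaskName $TaskName -Action $Action `
  -Trigger $Trigger -Principal $Principal
```

The task has to exist on every member of the HA pool, since any of them can become the active server.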

Script Behaviour description:

  1. When the HA pool is set up with a VIP and the pool is enabled, the HA service assigns the VIP to the network interface card (NIC) of the active server
  2. At this point, all IP addresses have their SkipAsSource set to False
  3. When the script executes, it adjusts the 'SkipAsSource' property of the IP addresses, resulting in the active server sending traffic with the VIP as source
  4. When failover occurs, the HA service removes the VIP from the server, resulting in a short period of outgoing traffic failure because the remaining IP addresses are still set to be skipped
  5. When the script executes again (the sooner the better), the failover process completes, as the IP addresses are now available for outgoing traffic
  6. At this point, HA completes the process by letting the standby server take over, after which the process repeats from step 1.

Disclaimer:

  • I have created a small test environment with a single node and ICMP polling only. Since the test started (set up as described above, with Orion services being killed at random), the BAD/TOTAL percentage looks to be well below 0.1%, and I believe increasing the frequency of script execution may lower this figure even further. The attached screenshot of the spreadsheet shows the results, and I will keep the test environment running for a while longer.

    Capture.PNG
  • Unfortunately, this is the last batch of results, since I overlooked server patching and the servers rebooted... But it looks like 0.1% of "baddies" is a realistic overall result.

    thwack-20180924.png
  • It is worth noting that using AD-integrated DNS with "Register this connection's addresses in DNS" enabled on the servers' NIC(s) may result in some unexplained behaviour in DNS. I have seen host (A) records being removed upon failover by Windows updating/registering connections in DNS. Sometimes this results in the Web Console not even responding to button clicks, for example when trying to Force Failover.

    My advice is to:

    1. make sure "Register this connection's addresses in DNS" option for IPv4 is disabled
    2. make sure "Register this connection's addresses in DNS" option for IPv6 is disabled
    3. make sure the required DNS host records (A/AAAA) are added manually to the DNS
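
The registration option can also be checked and switched off from PowerShell. This is a sketch, not something taken from the test environment above: "Ethernet" is a hypothetical adapter name, and it assumes the DnsClient module is available on the server. As far as I can tell the setting applies to the connection as a whole, so check afterwards in the adapter's advanced TCP/IP settings that both the IPv4 and IPv6 options end up disabled.

```powershell
# "Ethernet" is a hypothetical adapter name; list yours with Get-NetAdapter
$Alias = "Ethernet"

# Show the current DNS registration setting for the adapter
Get-DnsClient -InterfaceAlias $Alias |
  Select-Object InterfaceAlias, RegisterThisConnectionsAddress

# Stop Windows from registering this connection's addresses in DNS
Set-DnsClient -InterfaceAlias $Alias -RegisterThisConnectionsAddress $false
```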
  • Awesome workaround, hats off aLTeReGo. Ideas like this would be really great if incorporated into the solution itself. Maybe there could be a separate piece of code from SolarWinds itself to handle HA, instead of relying on Windows TCP/IP.

    oiram, great testing, and thanks for posting results we can rely on!!

  • Thanks for this aLTeReGo.

    I do have a question on how HA manages DNS during failover.

    Let's say we have these servers in a single subnet configuration:

    Primary + additional pollers:
    poller1
    poller2
    poller3

    and the following HA pollers:
    poller4
    poller5
    poller6

    I understand we can use the IP of poller1 as the VIP, but the documentation recommends not to use the same hostname if using virtual hostnames. Is this recommendation just to avoid cases of confusion on what is going on during a failover?

    I can't find any documentation about what HA does to DNS on the primary or failover pollers. Ideally we could use poller1 as the virtual hostname, and HA would do its magic and change the hostname of poller4 to poller1 if/as needed? And then use your script to make sure it only uses the one NIC/interface to send outbound SNMP traffic to our monitored nodes. If you could provide some more detail on the specifics of what is happening there with DNS, that would be most appreciated. I've searched and found nothing related to the specific technical details.

    Thanks.

  • Each server has its own DNS record (most likely), and this remains fully intact and unchanged. The Virtual Hostname is shared between both members in the pool but points only to the IP address of the 'Active' member. When a failover occurs that virtual hostname DNS entry is updated to point to the IP address of the secondary server that is now 'active'.

  • aLTeReGo Is it correct in assuming that when HA is enabled, that our primary server in a same subnet environment is now sending all outbound SNMP polling traffic from the VHN IP?

    Thanks

    EDIT - I guess you answered that but wanted to triple confirm - "But from the other perspective of monitoring the outgoing traffic can utilize any of IP addresses associated with the HA pool - VIP (if configured) or any of IP addresses of the active pool members."

  • hpstech  wrote:

    aLTeReGo  Is it correct in assuming that when HA is enabled, that our primary server in a same subnet environment is now sending all outbound SNMP polling traffic from the VHN IP?

    Thanks

    EDIT - I guess you answered that but wanted to triple confirm - "But from the other perspective of monitoring the outgoing traffic can utilize any of IP addresses associated with the HA pool - VIP (if configured) or any of IP addresses of the active pool members."

    By default, the Windows operating system decides which IP address to use as the source IP address. The choice goes to the address that shares the longest run of matching high-order bits with the next-hop (gateway) IP. This is outlined in some detail in the Networking Blog post "Source IP address selection on a Multi-Homed Windows Computer".

    You can, however, force Windows to use a specific IP address as the source IP using the method outlined above.

  • Awesome, this saves me all types of headaches, I am surprised this isn't part of HA already.

    Would it be better to trigger the script run from an alert action to execute an external program or something?