Getting more Throughput out of your Solarwinds Pollers

Have you ever wondered why there are plenty of system resources (CPU/Memory/Network/Disk) on your Solarwinds Poller box but the throughput is still not where you need it to be?

From my years as a C programmer (sockets) and my knowledge of the OSI 7-layer stack, it was apparent to me that the issue was in the Transport layer. This post explains how to tune the Transport layer to maximize TCP throughput.

First, to prove this was the case, I wrote a quick PowerShell script to list the TCP connection counts for each of the Solarwinds Pollers in our monitoring environment. We were seeing "unpredictable results" (i.e. random failures) when the Pollers were too busy, which occurred when the total TCP connections per poller were in the 12,000 to 15,000 range, even though there were plenty of resources (CPU/Memory) on the poller box. That is where my PowerShell script came in handy to periodically check the TCP connection counts. As suspected, the bottleneck was in Layer 4 (the Transport layer), where TCP connections were not closing quickly enough. A simple change to the TCP settings (described below) allows the TCP connections to close faster, which gives a higher TCP throughput, and the "unpredictable results" went away. Granted, there may be a future time when the TCP connections start to fail again, at which point we may look at purchasing another poller license.

After the TCP kernel parameters are changed and you have restarted Solarwinds, run your PowerShell script for checking the TCP connection counts again. You should see a drastic decrease in the number of connections, as they are now closing more quickly.

Bottom line: if you want to squeeze the most throughput out of a poller, look into tuning your TCP parameters. This worked for us. Disclaimer: as with any kernel change, you will need to perform your own due diligence and testing for your own environment.

On Windows platforms, if the following TCP parameters are not explicitly defined in the registry, the default values are used. For example, if the default timeout is 120 seconds and the maximum number of ports is approximately 4,000, the maximum rate is about 33 connections per second. If your index has four partitions, each search requires four ports, which gives a maximum query rate of about 8.3 queries per second.

(maximum ports/timeout period)/number of partitions = maximum query rate
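To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The function name max_query_rate is my own invention for illustration, and the second call simply reuses the same 4,000-port figure with the 30-second timeout suggested later in this post; your real port count will depend on your own settings.

```python
def max_query_rate(max_ports: int, timeout_seconds: float, partitions: int = 1) -> float:
    """Return (maximum ports / timeout period) / number of partitions.

    Hypothetical helper for illustration; the inputs are whatever values
    apply to your own server.
    """
    if timeout_seconds <= 0 or partitions <= 0:
        raise ValueError("timeout_seconds and partitions must be positive")
    return (max_ports / timeout_seconds) / partitions


if __name__ == "__main__":
    # Figures from the text: ~4,000 ports, 120 second timeout, 4 partitions.
    print(f"{max_query_rate(4000, 120, 4):.1f}")  # prints 8.3
    # Same port count, but with the timeout shortened to 30 seconds.
    print(f"{max_query_rate(4000, 30, 4):.1f}")   # prints 33.3
```

Shortening the timeout raises the ceiling in direct proportion, which is why the TIME_WAIT setting matters so much here.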

If this rate is exceeded, you may see failures as the supply of TCP/IP ports is exhausted. Symptoms include drops in throughput and errors indicating failed network connections. You can diagnose this problem by observing the system while it is under load, using the netstat utility provided on most operating systems.
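As a rough sketch of that netstat check, the Python below tallies TCP connections by state from captured netstat text. It assumes the Windows "netstat -an" column layout (Proto, Local Address, Foreign Address, State); the function name count_tcp_states and the sample addresses are made up for illustration and are not from our environment.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state in Windows-style `netstat -an` text.

    Assumes rows of the form: Proto  Local Address  Foreign Address  State.
    UDP rows and header lines are skipped because they have no TCP state.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            counts[fields[3]] += 1
    return counts


# Hypothetical sample output, for illustration only.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  TCP    10.0.0.5:49712         10.0.1.20:443          TIME_WAIT
  TCP    10.0.0.5:49713         10.0.1.21:443          TIME_WAIT
  TCP    10.0.0.5:49714         10.0.1.22:443          ESTABLISHED
  UDP    0.0.0.0:123            *:*
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(f"{state}: {count}")
```

In practice you would feed it the saved output of netstat taken while the poller is under load. A large and growing TIME_WAIT count is the pattern to look for.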

Changing the TCP parameters involves two (2) steps:

Step #1: Configure the TCP settings for the server

To set TcpTimedWaitDelay (TIME_WAIT):

  1. Use the regedit command to access the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters registry subkey.
  2. Create a new REG_DWORD value named TcpTimedWaitDelay.
  3. Set the value to 30.
  4. Stop and restart the system.

NOTE: TcpTimedWaitDelay will not work unless StrictTimeWaitSeqCheck is set to 1.

To set MaxUserPort (ephemeral port range):

  1. Use the regedit command to access the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters registry subkey.
  2. Create a new REG_DWORD value named MaxUserPort.
  3. Set this value to 32768.
  4. Stop and restart the system.
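If you have several pollers to change, one possible alternative to hand-editing each box in regedit is to generate a .reg file and import it. The Python below is only a sketch under that assumption: the helper name build_reg_file is hypothetical, it only builds the file text (it does not touch the registry), and it includes StrictTimeWaitSeqCheck because of the note above. Review the output before importing it anywhere.

```python
from typing import Dict

# Registry subkey named in the steps above.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters"


def build_reg_file(values: Dict[str, int], key: str = KEY) -> str:
    """Build .reg file text that sets REG_DWORD values under one key.

    Hypothetical helper: it returns text only and changes nothing itself.
    """
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, number in values.items():
        if not 0 <= number <= 0xFFFFFFFF:
            raise ValueError(f"{name} does not fit in a DWORD: {number}")
        # .reg files write DWORDs as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{number:08x}')
    return "\r\n".join(lines) + "\r\n"


# Values taken from the steps and the NOTE in this post.
TCP_SETTINGS = {
    "TcpTimedWaitDelay": 30,
    "StrictTimeWaitSeqCheck": 1,
    "MaxUserPort": 32768,
}

if __name__ == "__main__":
    print(build_reg_file(TCP_SETTINGS))
```

The decimal values from the steps come out in hexadecimal in the file, so 30 appears as 0000001e and 32768 as 00008000.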

Step #2: Reboot the server

A reboot of the server is required after these changes.

  • I have implemented 6 X (hard metal instances) - each instance is double stack licensed for every poller that is distributed across North America.

    We are starting to see major limits on the scaling we have deployed and on what we plan to implement in the future.

    I am intrigued and hopeful to see a nice delta increase. It makes sense once you go down the rabbit hole of fact checking this process: you are shortening the timeout per port and then increasing the number of open ports per session.

    I also wonder if a patch / release / upgrade on the OS level may change / delete / modify the registry keys being modified.

  • We upgraded the hardware, OS, and Solarwinds (Orion 2019.2 HF3) after quite a lot of slowness issues.

    These steps seem to have improved the performance greatly, probably through improved DB calls.

    Regardless of which one had the most impact, this combination has much improved the Solarwinds UI and poller response times.