Getting More Throughput out of Your SolarWinds Pollers
Have you ever wondered why there are plenty of system resources (CPU/Memory/Network/Disk) on your SolarWinds poller box, yet the throughput is still not where you need it to be?
From my years as a C programmer (sockets) and my knowledge of the OSI 7-layer model, it was apparent to me that the issue was in the Transport layer. This post explains how to tune the Transport layer to maximize TCP throughput.
First, to prove this was the case, I wrote a quick PowerShell script to list the TCP connection counts for each of the SolarWinds pollers in our monitoring environment. We were seeing "unpredictable results" (i.e. random failures) when the pollers were too busy, which occurred when the total TCP connections per poller were in the 12,000 to 15,000 range, even though there were plenty of resources (CPU/Memory) on the poller box. That is where the PowerShell script came in handy for periodically checking the TCP connection counts. As suspected, the bottleneck was in Layer 4 (the Transport layer): TCP connections were not closing quickly enough. A simple change to the TCP settings (described below) allows the TCP connections to close faster, which gives you higher TCP throughput, and the "unpredictable results" went away. Granted, there may come a time when the TCP connections start to fail again, at which point we may look at purchasing another poller license.
After you change the TCP kernel parameters and restart SolarWinds, run the PowerShell script again to check the TCP connection counts. You should see a drastic decrease in the number of connections, since they now close more quickly.
Bottom line: if you want to squeeze the most throughput out of a poller, look into tuning your TCP parameters. This worked for us. Disclaimer: as with any kernel change, you will need to perform your own due diligence and testing in your own environment.
On Windows platforms, if the following TCP parameters are not explicitly defined in the registry, the default values are used. If the default timeout is 120 seconds and the maximum number of ports is approximately 4,000, the result is a maximum rate of about 33 connections per second. If your index has four partitions, each search requires four ports, which provides a maximum query rate of 8.3 queries per second.
(maximum ports/timeout period)/number of partitions = maximum query rate
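The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula as stated in this post; the function and parameter names are my own, and the 30-second timeout in the second call is an illustrative value, not a measured or recommended one.

```python
def max_query_rate(max_ports, timeout_seconds, ports_per_query):
    """(maximum ports / timeout period) / number of partitions."""
    # Each port stays unavailable for the full timeout period, so this is
    # the sustained number of new connections per second.
    connections_per_second = max_ports / timeout_seconds
    # Every query consumes one port per partition.
    return connections_per_second / ports_per_query


if __name__ == "__main__":
    # Figures quoted in the post: ~4,000 ports, 120 s timeout, 4 partitions.
    print(f"default:  {max_query_rate(4000, 120, 4):.1f} queries per second")
    # Hypothetical shorter timeout, to show how the rate scales.
    print(f"30 s:     {max_query_rate(4000, 30, 4):.1f} queries per second")
```

Because the timeout is in the denominator, cutting it to a quarter quadruples the sustainable rate for the same number of ports.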
If this rate is exceeded, you may see failures as the supply of TCP/IP ports is exhausted. Symptoms include drops in throughput and errors indicating failed network connections. You can diagnose this problem by observing the system while it is under load, using the netstat utility provided on most operating systems.
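As a rough illustration of that netstat check, here is a minimal Python sketch that tallies TCP connections by state. It is not the PowerShell script mentioned earlier in this post. It assumes the column layout of `netstat -an`, where the state is the last field on each TCP row, and the sample text and addresses are made up for the example.

```python
import subprocess
from collections import Counter

# Made-up sample in the layout printed by `netstat -an` on Windows.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.5:49152         10.0.0.9:443           ESTABLISHED
  TCP    10.0.0.5:49153         10.0.0.9:443           TIME_WAIT
  TCP    10.0.0.5:49154         10.0.0.9:443           TIME_WAIT
  UDP    0.0.0.0:161            *:*
"""


def count_tcp_states(netstat_output):
    """Count TCP rows per connection state (assumes state is the last field)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and UDP rows (which have no state).
        if len(fields) >= 4 and fields[0].upper().startswith("TCP"):
            counts[fields[-1]] += 1
    return counts


def run_netstat():
    """Return live `netstat -an` output; not called in this demo."""
    return subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    ).stdout


if __name__ == "__main__":
    counts = count_tcp_states(SAMPLE)
    for state, n in counts.most_common():
        print(f"{state:<12} {n}")
    print(f"{'TOTAL':<12} {sum(counts.values())}")
```

To check a live poller, pass `run_netstat()` to `count_tcp_states` instead of `SAMPLE`. A large and growing TIME_WAIT count relative to ESTABLISHED is the pattern to watch for under load.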
Changing the TCP parameters involves two (2) steps:
Step #1: Configure the TCP settings for the server
To set TcpTimedWaitDelay (TIME_WAIT):
NOTE: TcpTimedWaitDelay has no effect unless StrictTimeWaitSeqCheck is set to 1.
To set MaxUserPort (ephemeral port range):
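For readers who prefer to script the change, the sketch below renders a .reg file for the three values named above. Everything numeric in it is an assumption for illustration only: the values 30, 1, and 65534 are not taken from this post, so pick and test your own. The key path is the standard Tcpip\Parameters location, and the helper name is hypothetical.

```python
from typing import Mapping

# Standard location of the TCP/IP parameters in the Windows registry.
TCPIP_PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
)


def render_reg_file(dword_values: Mapping[str, int]) -> str:
    """Render DWORD values as .reg file text for the Tcpip\\Parameters key."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TCPIP_PARAMETERS_KEY}]",
    ]
    for name, value in dword_values.items():
        # .reg files store DWORDs as eight lowercase hex digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # ASSUMED example values, not the author's; test before using.
    example = {
        "TcpTimedWaitDelay": 30,       # seconds a closed connection waits
        "StrictTimeWaitSeqCheck": 1,   # per the NOTE in this step
        "MaxUserPort": 65534,          # upper bound of the ephemeral ports
    }
    print(render_reg_file(example))
```

The script only prints the text. Save the output as a .reg file, review it, and import it yourself so that nothing touches the registry until you have checked the values.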
Step #2: Reboot the server
A server reboot is required for these changes to take effect.