Network Configuration Manager Scalability – An Extreme Example

So, you Network Configuration Manager (NCM) folks are probably aware of this link explaining its scalability.

To summarize, a single poller can deal with up to 10K devices, and you can add two additional polling engines (APEs) to the primary poller, which adds up to 30K devices.

That’s a lot, and I’m assuming not many deployments exist in that range, because most customers I have talked to in recent years run between 500 and 1,000 devices.

Don’t we all like big things? Well, here’s something bigger!
Have a look at this pic:


The deployment above started as a project in 2011, run by our partner I.Tresor in Germany and led by fellow THWACKster HerrDoktor, Holger Mundt.
Oh wait, Holger lives in Bavaria, but let’s ignore that fact for now.

The end user had been running Kiwi CatTools® for network automation and configuration management for a few years, but required more features like AD-Auth, better reporting, and auditing.

Right from the early days, scalability was the most crucial factor in the planning process. The first version of the deployment managed 10K devices, but it was already sized for the full 30K.

Holger faced various challenges during the deployment.

The first one was to get NCM running without using the network discovery feature, which was declined by the end user’s security team.

Instead, they offered to provide a daily updated spreadsheet with a list of managed network nodes.

Fortunately, our partner is no stranger to the SDK and created an excellent script to automate the import process.

Each time a new CSV file appeared on a network location, the script compared it against the database and acted on the differences. It detected removed, changed, and added nodes, to the point where no manual intervention was required.
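The partner's script is not shown in this post, so what follows is only a minimal Python sketch of the compare step described above (the original was written in PowerShell against the SDK). The column names `name` and `ip`, the sample rows, and the choice to key nodes by name are all assumptions for illustration. The SDK calls that would actually add, update, or remove nodes are left out.

```python
import csv
import io


def load_nodes(csv_text, key="name"):
    """Parse CSV text into a dict keyed by the (assumed) unique node name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_nodes(current, incoming):
    """Compare two node dicts and return (added, removed, changed) name lists."""
    added = sorted(incoming.keys() - current.keys())
    removed = sorted(current.keys() - incoming.keys())
    changed = sorted(
        name
        for name in current.keys() & incoming.keys()
        if current[name] != incoming[name]
    )
    return added, removed, changed


# Hypothetical sample data: what is already managed vs. today's spreadsheet.
managed = load_nodes("name,ip\nsw-01,10.0.0.1\nsw-02,10.0.0.2\nrt-01,10.0.1.1\n")
todays_file = load_nodes("name,ip\nsw-01,10.0.0.1\nsw-02,10.0.0.22\nrt-02,10.0.1.2\n")

added, removed, changed = diff_nodes(managed, todays_file)
print("add:", added)
print("remove:", removed)
print("update:", changed)
```

Loading both sides into memory once and comparing them with set operations avoids a database round trip per node, which is one common way to keep a job like this fast. The post does not say whether this resembles the optimizations Holger actually used.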

The first shot of the script was not optimized and ran for about 72 hours, making implementation impossible. A few nifty PowerShell functions and optimizations later, the runtime was down to 35 minutes.

Over the years, the deployment grew to 35K devices, and while this is beyond what NCM officially supports, it kept running.

Thirty-five thousand devices put the database under a lot of stress, and more optimization than usual was required.

Long-running jobs pushed lock wait times to an average of 7 seconds each, but optimization brought that down to 1.5 seconds. That doesn’t sound like a lot, but it made a significant difference.

A few performance hindrances could be identified and eliminated by the SolarWinds support teams: they set up SQL scripts that cleared out log files growing too large, and additional SDK scripts that double-checked nodes that were imported with wrong information.

Around that time, the customer wasn’t sure if growing the SolarWinds estate or switching to another vendor would be the way to go, but they remembered what happened to the dinosaurs and started planning for an entirely new NCM setup.

The planning for the current deployment started in 2017 and went live quite recently.
It is designed to fit 90K devices and runs on three separate instances, each with its own dedicated database. It also uses one of the best tricks to max out performance: Additional Web Servers.

In a nutshell, each instance is based on 5 VMs:

1 x Additional Web Server (AWS) – 4 vCPUs, 8 GB RAM
3 x Polling Engines incl. Core – 8 vCPUs, 16 GB RAM
1 x SQL Server – 24 vCPUs, 64 GB RAM

And, on a global level:
1 x Enterprise Operations Console (EOC) – 4 vCPUs, 8 GB RAM
The EOC database sits on the least-used instance.
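
As a quick sanity check on the sizing, here is a small Python sketch that adds up the VM specs listed above and multiplies out the device capacity. It assumes the 10K-devices-per-poller figure quoted at the top of this post applies to each of the three polling engines in an instance. The `VM` class and the variable names are made up for the example.

```python
from dataclasses import dataclass

DEVICES_PER_POLLER = 10_000  # supported figure quoted earlier in the post
INSTANCE_COUNT = 3


@dataclass(frozen=True)
class VM:
    role: str
    count: int
    vcpus: int
    ram_gb: int


# One instance, as listed in the post.
INSTANCE = [
    VM("Additional Web Server", 1, 4, 8),
    VM("Polling Engine (incl. Core)", 3, 8, 16),
    VM("SQL Server", 1, 24, 64),
]
EOC = VM("Enterprise Operations Console", 1, 4, 8)


def totals(vms):
    """Return (vm_count, vcpus, ram_gb) summed over a list of VM specs."""
    return (
        sum(v.count for v in vms),
        sum(v.count * v.vcpus for v in vms),
        sum(v.count * v.ram_gb for v in vms),
    )


inst_vms, inst_vcpus, inst_ram = totals(INSTANCE)
pollers = sum(v.count for v in INSTANCE if v.role.startswith("Polling Engine"))
inst_capacity = pollers * DEVICES_PER_POLLER

# Three identical instances plus the single global EOC machine.
all_vms, all_vcpus, all_ram = totals(INSTANCE * INSTANCE_COUNT + [EOC])
total_capacity = inst_capacity * INSTANCE_COUNT

print(f"per instance: {inst_vms} VMs, {inst_vcpus} vCPUs, {inst_ram} GB RAM, {inst_capacity:,} devices")
print(f"overall:      {all_vms} VMs, {all_vcpus} vCPUs, {all_ram} GB RAM, {total_capacity:,} devices")
```

With these figures, each instance comes to 5 VMs, 52 vCPUs, and 120 GB RAM for 30K devices. The whole estate comes to 16 VMs, 160 vCPUs, and 368 GB RAM for the 90K design target.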

As of right now, the instances are running at less than 70% utilization, so there’s room for growth.
The nodes are assigned to the instances based on their regional location. The location is encoded in the node name and picked up by the SDK script.
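
The naming scheme itself isn't spelled out here, so the following Python sketch only illustrates the idea of routing a node to an instance by a location code in its name. The `CODE-rest` name format, the three location codes, and the instance names are all hypothetical.

```python
# Hypothetical mapping from a location code to one of the three instances.
REGION_TO_INSTANCE = {
    "MUC": "instance-south",
    "HAM": "instance-north",
    "FRA": "instance-central",
}


def instance_for(node_name, default=None):
    """Pick the instance from the location code assumed to prefix the node name."""
    code = node_name.split("-", 1)[0].upper()
    return REGION_TO_INSTANCE.get(code, default)


for name in ["muc-sw-0042", "HAM-rt-0007", "ber-sw-0001"]:
    print(name, "->", instance_for(name, default="unassigned"))
```

Returning a default for unknown codes, instead of raising an error, lets an import run flag badly named nodes for review without stopping the whole batch.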

Even at this scale, NCM runs fine. The routers and switches, which are located all over Germany, report status and response, daily backups are stored, and loads of compliance rules are in place to keep everything as tidy as my desk.

Various jobs are running daily, weekly, and monthly.

For example, significant config changes are rolled out via config snippets as tasks, but essential settings are always managed with compliance rules so that changes get corrected no matter what.

Additionally, NCM sends weekly reports to all network teams, listing devices whose configs couldn’t be saved or that took more than one attempt.
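
The selection rule behind such a report is simple enough to sketch in Python. This is not how NCM builds its reports. The record fields `node`, `saved`, and `attempts` and the sample data are assumptions for illustration.

```python
def needs_attention(results):
    """Return results where the backup failed or needed more than one attempt."""
    return [r for r in results if not r["saved"] or r["attempts"] > 1]


# Hypothetical backup results for one week.
weekly_results = [
    {"node": "muc-sw-0042", "saved": True, "attempts": 1},
    {"node": "ham-rt-0007", "saved": True, "attempts": 3},
    {"node": "fra-sw-0113", "saved": False, "attempts": 3},
]

for r in needs_attention(weekly_results):
    status = "saved after retries" if r["saved"] else "not saved"
    print(f'{r["node"]}: {status} ({r["attempts"]} attempts)')
```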

This is quite an impressive setup and an excellent real-world example of the abilities of the platform.
I am looking forward to hearing similar stories from all of you out there!

Top Replies

  • This is a fantastic case study, saschg, and I am impressed how SolarWinds rose to the challenge and provided a solution that supported such a massive implementation. My company is nowhere near this level or complexity, but it is nice to know that not only do Orion and NCM have the horsepower to accommodate large-scale setups, the team (THWACK included in that) can put its collective brainpower together to tune and tweak things to "top-like" performance.


  • Nice we know it can!!

    I don't see why not 30,000 per NCM poller (if you run NCM only)..

    10K per poller applies to a poller that runs NPM & NCM...

  • Hi sja, we were questioning this restriction as well. As it is in the "what SolarWinds supports" restrictions, we took it as a given. As we had a few performance cases, I doubt the handling would have been as smooth as it was had we been outside the supported limits.

    Let's be honest, every support engineer first checks whether you are using the product as you should.

    We were running one poller with 20,000 nodes with no major problems... that’s the other side of the story ;)

  • Great share about the capabilities! We are significantly smaller but nice to know there's loads of capability!

    @saschg can you or anyone else elaborate on this point: uses one of the best tricks to max out the performance: Additional Web Servers.

    When do you find this becomes necessary and what amount of impact do you associate with it - anything measurable or quantifiable or qualitative that you can add?

    The reason for my interest is that our load isn't high enough to require a secondary web server, but sometimes the load times feel slow and I've often wondered what it would be like. It might just be the database on our side, but we tried optimizing it and giving it more resources as well. Overall it loads pretty well, I just always want it to load faster no matter what! After all, we are sometimes responding to a mission-critical crisis when we load SolarWinds...

    Thanks in advance for any thoughts!