So, you Network Configuration Manager (NCM) folks are probably aware of this link explaining its scalability.
To summarize, a single poller can handle up to 10K devices, and you can add two Additional Polling Engines (APEs) to the primary poller, which adds up to 30K devices.
That’s a lot, and I assume not many deployments exist in that range, because most customers I’ve talked to over the past few years run between 500 and 1,000 devices.
Don’t we all like big things? Well, here’s something bigger!
Have a look at this pic:
The deployment above started as a project in 2011, run by our partner I.Tresor in Germany and led by fellow THWACKster HerrDoktor, Holger Mundt.
Oh wait, Holger lives in Bavaria, but let’s ignore that fact for now.
The end user had been running Kiwi CatTools for network automation and configuration management for a few years, but required more features, such as AD authentication, better reporting, and auditing.
Right from the early days, scalability was the most crucial factor in the planning process. The first version of the deployment managed 10K devices, but the design already allowed scaling to the full 30K.
Holger faced various challenges during the deployment.
The first one was to get NCM running without using the network discovery feature, which the end user’s security team had declined.
Instead, they offered to provide a daily updated spreadsheet with a list of managed network nodes.
Fortunately, our partner is no stranger to the SDK and created an excellent script to automate the import process.
Each time a new CSV file was found on a network location, the script checked it against the database and applied the differences automatically: it detected removed, changed, and added nodes, so no manual intervention was required.
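The heart of that automation is a three-way diff between the incoming CSV and the current inventory. Here's a minimal sketch of that comparison logic in Python; the column names (`IP`, `Caption`) and the dictionary-based inventory are my own assumptions, since the actual script was PowerShell against the SolarWinds SDK.

```python
import csv
import io

def load_csv(text):
    """Parse a CSV export into {ip: row}; the 'IP' column name is an
    assumption about the customer's spreadsheet format."""
    return {row["IP"]: row for row in csv.DictReader(io.StringIO(text))}

def diff_nodes(current, incoming):
    """Classify nodes as added, removed, or changed by comparing the
    existing inventory (current) against a fresh export (incoming).
    Both arguments map an IP address to that node's properties."""
    added = {ip: props for ip, props in incoming.items() if ip not in current}
    removed = {ip: props for ip, props in current.items() if ip not in incoming}
    changed = {ip: props for ip, props in incoming.items()
               if ip in current and current[ip] != props}
    return added, removed, changed
```

In a real deployment, each bucket would then be fed to the SDK: added nodes get created, removed nodes get unmanaged or deleted, and changed nodes get their properties updated.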
The first version of the script was not optimized and ran for about 72 hours, which made it unusable in production. A few nifty PowerShell functions and optimizations later, the runtime was down to 35 minutes.
Over the years, the deployment grew to 35K devices, and while that is beyond what NCM officially supports, it kept running.
Thirty-five thousand devices put the database under a lot of stress, and more optimization than usual was required.
Long-running jobs pushed wait locks to an average of 7 seconds each; optimization brought that down to 1.5 seconds. That doesn’t sound like much, but it made a significant difference.
The SolarWinds support teams identified and eliminated a few performance hindrances: they set up SQL scripts that cleared out log files before they grew too large, plus additional SDK scripts that double-checked nodes imported with incorrect information.
Around that time, the customer wasn’t sure if growing the SolarWinds estate or switching to another vendor would be the way to go, but they remembered what happened to the dinosaurs and started planning for an entirely new NCM setup.
The planning for the current deployment started in 2017 and went live quite recently.
It is designed to fit 90K devices, runs on three separate instances, each with its own dedicated database, and uses one of the best tricks to maximize performance: Additional Web Servers.
In a nutshell, each instance is based on 5 VMs:
1 x Additional Web Server (AWS) – 4 vCPUs, 8 GB RAM
3 x Polling Engines incl. Core – 8 vCPUs, 16 GB RAM
1 x SQL Server – 24 vCPUs, 64 GB RAM
And, on a global level:
1 x Enterprise Operations Console (EOC) – 4 vCPUs, 8 GB RAM
The EOC database sits on the least-used instance.
As of right now, the instances are running at less than 70% utilization, so there’s room for growth.
The nodes are assigned to the instances based on their regional location. The location is coded in the node name and is picked up by the SDK Script.
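That assignment step can be sketched in a few lines. The region prefixes and instance names below are purely hypothetical, since the post doesn't reveal the actual naming convention; the point is that the region code embedded in the node name deterministically selects the target instance.

```python
# Hypothetical naming convention: a region prefix such as "ber-", "muc-",
# or "fra-" at the start of each node name selects the target instance.
REGION_TO_INSTANCE = {
    "ber": "instance-north",
    "muc": "instance-south",
    "fra": "instance-west",
}

def instance_for(node_name):
    """Pick the NCM instance for a node based on the region code at the
    start of its name; unknown regions raise so they don't go unnoticed."""
    region = node_name.split("-", 1)[0].lower()
    try:
        return REGION_TO_INSTANCE[region]
    except KeyError:
        raise ValueError(f"unknown region code in node name: {node_name!r}")
```

Raising on an unknown prefix (rather than silently defaulting) matters in a hands-off import pipeline: a mistyped node name surfaces as an error instead of landing on the wrong instance.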
Even at this scale, NCM runs fine: the routers and switches, located all over Germany, report status and response times, daily backups are stored, and loads of compliance rules are in place to keep everything as tidy as my desk.
Various jobs are running daily, weekly, and monthly.
For example, significant config changes are rolled out via config snippets as scheduled tasks, but essential settings are always managed with compliance rules, so that any deviation gets corrected no matter what.
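The check-and-correct idea behind those compliance rules is simple: each rule pairs a pattern the running config must contain with the snippet to push when it's missing. Here's a minimal sketch, with two made-up IOS-style "essential settings" standing in for the customer's real rules.

```python
import re

# Hypothetical essential settings: each rule pairs a pattern the running
# config must match with the snippet to (re)apply when it is missing.
RULES = [
    (re.compile(r"^service password-encryption$", re.M),
     "service password-encryption"),
    (re.compile(r"^no ip http server$", re.M),
     "no ip http server"),
]

def remediation_lines(running_config):
    """Return the config snippets that must be applied so the device
    satisfies every rule, mirroring NCM's check-then-remediate flow."""
    return [snippet for pattern, snippet in RULES
            if not pattern.search(running_config)]
```

A device whose config already matches every pattern yields an empty list; anything else produces exactly the snippets needed to bring it back into compliance.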
Additionally, NCM sends weekly reports to all network teams listing devices whose configs couldn’t be backed up or took more than one attempt.
This is quite an impressive setup and an excellent real-world example of the abilities of the platform.
I am looking forward to hearing similar stories from all of you out there!