Strategies for Scaling the World’s Largest Networks

Level 12

By Joe Kim, SolarWinds Chief Technology Officer

It can be truly astounding to think about the scale of today’s largest government networks, which are growing larger and more complex every day.

As a public sector IT pro, it may seem like an impossible challenge to manage this growing behemoth. Ever-increasing numbers of network devices, servers, and applications give you less leeway for downtime, hiccups, or problems of any sort.

There is a range of strategies that government IT pros can employ to support network growth and scalability while helping to ensure that all architectural and infrastructural requirements are met, and system failover scenarios are accounted for.

As the IT environment expands, it becomes more important for monitoring and management systems to scale to keep up with growth. Most monitoring systems are built with the following elements, each with its own requirements and challenges to scale:

  • A server that hosts the monitoring product and polls for status and performance
  • A database where the polled information is stored for historical data access and reporting
  • A web console for software management, data visualization, and reporting

Within this environment, three primary variables will affect a system’s scalability:

  1. Infrastructure size: The number of monitored elements (where an element is defined as a single, identifiable node, interface, or volume), or the number of servers and applications that can be monitored.
  2. Polling frequency: The interval at which the monitoring system polls for information. For example, statistics collected every few seconds instead of every minute will make the system work harder, and requirements will increase.
  3. Concurrent users: The number of simultaneous users accessing the monitoring system.
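As a rough illustration of how the first two variables interact, the sketch below estimates the average polling rate as element count divided by polling interval. The class name, element counts, and intervals are hypothetical values chosen for the example, not figures from any product, and the model ignores the third variable (concurrent users) entirely.

```python
from dataclasses import dataclass


@dataclass
class PollingProfile:
    """Hypothetical description of one class of monitored elements."""
    name: str
    element_count: int        # nodes, interfaces, or volumes
    interval_seconds: float   # how often each element is polled


def polls_per_second(profiles):
    """Return the average number of polls the system must issue each second."""
    return sum(p.element_count / p.interval_seconds for p in profiles)


# Assumed example environment; replace with measured values.
profiles = [
    PollingProfile("nodes", 2_000, 120),
    PollingProfile("interfaces", 30_000, 300),
    PollingProfile("volumes", 4_000, 600),
]

baseline = polls_per_second(profiles)

# Halving every interval doubles the load without adding a single element.
faster = polls_per_second(
    [PollingProfile(p.name, p.element_count, p.interval_seconds / 2) for p in profiles]
)
print(f"baseline: {baseline:.1f} polls/s, halved intervals: {faster:.1f} polls/s")
```

Even a back-of-the-envelope number like this can help show whether a change in polling frequency or a change in infrastructure size is the larger driver of load.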

Those are the basics of understanding the feasibility of scalability. Now, let’s move on to ways to manage that environment.

A command center is particularly well suited to agencies with multiple regions or sites where the number of nodes to be monitored in each region would warrant both localized data collection and storage. It works well for regional teams that are responsible for their own environments and require autonomy over their monitoring platform. While the systems are segregated between regions, all data can still be accessed from the centrally located console.

Additional scalability tips

There are several additional strategies that will help manage an agency’s growing infrastructure:

Add polling engines: Distributing the polling load for the monitoring system among multiple servers will provide scalability for large networks.
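One way to picture distributing the polling load is a simple greedy split that hands each site to whichever engine currently carries the least load. This is only a planning sketch under assumed numbers: the function name, site names, and per-site loads are hypothetical, and real products decide polling assignments in their own way.

```python
import heapq


def assign_to_engines(site_loads, engine_count):
    """Greedily assign each site to the currently least-loaded engine.

    site_loads maps a site name to its estimated polls per second.
    Returns a list of (total_load, [site names]), one entry per engine.
    """
    # Heap entries are (current load, engine index); the lightest engine is on top.
    heap = [(0.0, i) for i in range(engine_count)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(engine_count)]

    # Placing the heaviest sites first tends to give a more even split.
    for site, load in sorted(site_loads.items(), key=lambda kv: kv[1], reverse=True):
        current, idx = heapq.heappop(heap)
        assignments[idx].append(site)
        heapq.heappush(heap, (current + load, idx))

    totals = {idx: load for load, idx in heap}
    return [(totals[i], assignments[i]) for i in range(engine_count)]


# Assumed example loads in polls per second.
site_loads = {"hq": 120.0, "east": 80.0, "west": 70.0, "lab": 30.0}
result = assign_to_engines(site_loads, 2)
for total, sites in result:
    print(f"{total:.0f} polls/s -> {sites}")
```

The greedy approach does not guarantee a perfectly even split, but it is easy to reason about and usually close enough for capacity planning.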

Add web servers: Additional web servers can help support increasing numbers of concurrent monitoring sessions, helping to ensure that more users have uninterrupted web access to network monitoring software.

Add a failover server: To help ensure the monitoring system is always available, install a failover mechanism that will switch monitoring system operation to a secondary server if the primary server should fail.
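The general idea behind a failover mechanism can be sketched as a small state machine that switches to the secondary server after several consecutive missed health checks. The class name, the threshold of three, and the automatic fail-back behavior are all assumptions made for illustration; they do not describe how any particular monitoring product implements failover.

```python
class FailoverMonitor:
    """Tracks health checks of the primary and decides which server is active."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = "primary"

    def record_heartbeat(self, primary_ok):
        """Update state from one health check and return the active server."""
        if primary_ok:
            self.consecutive_failures = 0
            self.active = "primary"  # assumed policy: fail back once the primary recovers
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "secondary"
        return self.active


monitor = FailoverMonitor(failure_threshold=3)
checks = [True, False, False, False, True]
history = [monitor.record_heartbeat(ok) for ok in checks]
print(history)
```

Requiring several consecutive failures before switching avoids flapping between servers on a single dropped health check, at the cost of a slightly longer detection time.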

Agency networks will certainly get large. It's the nature of an increasingly technically driven government. While it may seem overwhelming, implementing these few tactics will help IT managers embrace the growth and ultimately realize its value.

Find the full article on Government Computer News.

Level 19

Scalability and reliability are very important.  We added additional polling engines and also have systems on both coasts.

Level 13

yeah, I'm planning on adding a second engine here as well...

Level 13

Yes, we have gotten to the point of needing to add another engine to poll.

Level 16

Nice article

Level 21

Missing from this topic:  Training

Scaling is built on the right practices, and our house of cards is made stronger by great training of those who will design and deploy it.  You might have the best Network people in the world, but if you don't get them the necessary training, you definitely won't have the network necessary for scaling larger and larger.

When I think about every line in a Cisco switch, router, and firewall, and how they interact and depend on each other, it can be intimidating.  Setting up the right templates, knowing how to deploy things correctly the first time, and doing so consistently--these are the ingredients to a cake that cannot be remixed and remade affordably if something is missed in a template.  Rebuilding thousands of routers or switches or firewalls just because something wasn't added in, or wasn't deployed properly for a best-practice environment that scales to large sizes is a huge expense--one that need never be incurred if management simply invests in training their staff correctly, with the big goal in mind.

Some will say that they can't afford to pay for training--it's on the employee's back to get it on their own dollar and on their own time.  Or that training staff will make them more valuable, and they'll ask for raises or move elsewhere.

This is the cost of doing the job right.  Budget for training so you have competent staff.  Budget to pay them so they'll stay.  And build your business so it's a place people love to work.  Remember, you want an organization to which people are lined up trying to get in, not one where they're watching hoping to get out.

Level 18

You measure/test and plan based on usage and growth.

Estimate the amount of growth and scale accordingly, and then go back and measure/test to see how you did and to monitor capacity for future planning.

Regarding monitoring tools, you go through the same process.

Usually it is better to have a small toolset watching the production toolset so you can do proper capacity planning, where your monitoring of the monitoring tool doesn't influence the results. You always need to be on the outside looking in.

Level 14

There is still one other point of the solution that needs attention, the database.

All of the additional web servers and pollers, although spreading the Orion load, increase the database's workload.

Level 14

The database server, its connectivity, and its backups are also very important.  A non-optimized and/or minimally specced DB server will really cause issues.  Don't forget about your network connectivity to it.

Level 15

The ability of the database server to sync with its peer from a geographically remote location and the pollers to adjust is what we are keeping our eye on.

Level 13

hmmm, can it fix itself, deploy and expand itself as needed as well? That'd be supercool.

Level 16

Nice article. A nice cookbook when planning for a monitoring solution for large-scale networks.

Level 10

Totally agree with adding a failover server.

Level 13

We have added polling engines, we are about to add web servers, and my database gets re-allocated resources about every other month (always more). Failover is on the horizon, but not seen as urgent. Yet.

Level 15

We have a secondary polling engine, but haven't found the need for HA or additional web servers yet. I'm still trying to get others on the team excited about the products so hopefully usage will grow.

Level 7

I'm confused. We bought 2 web servers having been told that they were able to be deployed in an HA pair. Turns out that is apparently not the case. We then tried DNS round robin load balancing. Every time the web page is refreshed, the connection reverts to the other web server, which requires the user to log in again. I'm also told that SolarWinds does not support F5 load balancing. If the web servers are not session state aware with each other, how can this be a seamless redundancy? Am I missing something or do I have bad information?


About the Author
Joseph is a software executive with a track record of successfully running strategic and execution-focused organizations with multi-million dollar budgets and globally distributed teams. He has demonstrated the ability to bring together disparate organizations through his leadership, vision and technical expertise to deliver on common business objectives. As an expert in process and technology standards and various industry verticals, Joseph brings a unique 360-degree perspective to help the business create successful strategies and connect the “Big Picture” to execution. Currently, Joseph serves as the EVP, Engineering and Global CTO for SolarWinds and is responsible for the technology strategy, direction and execution for SolarWinds products and systems. Working directly for the CEO and partnering across the executive staff in product strategy, marketing and sales, he and his team are tasked with providing overall technology strategy, product architecture, platform advancement and engineering execution for Core IT, Cloud and MSP business units. Joseph is also responsible for leading the internal business application and information technology activities to ensure that all SolarWinds functions, such as HR, Marketing, Finance, Sales, Product, Support, Renewals, etc. are aligned from a systems perspective; and that we use the company's products to continuously improve their functionality and performance, which ensures success and expansion for both SolarWinds and customers.