The pain of network variation - part 1/2

Level 9

Network variation is hurting us

Network devices like switches, routers, firewalls and load-balancers ship with many powerful features. These features can be configured by each engineer to fit the unique needs of every network. This flexibility is extremely useful and, in many ways, it's what makes networking cool. But there comes a point at which this flexibility starts to backfire and become a source of pain for network engineers.

Variation creeps up on you.  It can start with harmless requests for some non-standard connectivity, but I've seen those requests grow to the point where servers were plugging straight into the network core routers.  In time, these one-off solutions start to accumulate and you can lose sight of what the network ‘should’ look like.  Every part of the network becomes its own special snowflake.

I’m not judging here. I've managed quite a few networks and all of them ended up with high degrees of variation and technical debt. In fact, it takes considerable effort to fight the storm of snowflakes. But if you want a stable and useful network you need to drive out variation. Of course, you still need to meet the demands of the business, but only up to a point. If you're too flexible you will end up hurting your business by creating a brittle network which cannot handle changes.

Your network becomes easier and faster to deploy, monitor, map, audit, understand and fix if you limit your network to a subset of standard components. Of course there are great monitoring tools to help you manage messy networks, but you’ll get greater value from your tools when you point them towards a simple structured network.

What’s so bad about variety?

Before we can start simplifying our networks we have to see the value in driving out that variability. Here are some thoughts on how highly variable (or heterogeneous) networks can make our lives harder as network engineers:

  • Change control - Making safe network changes is extremely difficult without standard topologies or configurations. Making a change safely requires a deep understanding of the current traffic flows, and this will take a lot of time. Documentation makes this easier, but a simple standardized topology is best. The most frustrating thing is that when you do eventually cause an outage, the lessons learned from your failed change cannot be applied to other dissimilar parts of your network.
  • Discovery time can be high. How do you learn the topology of your network in advance of problems occurring? A topology mapping tool can be really helpful to reduce the pain here, but most people have just an outdated Visio diagram to rely on.
  • Operations can be a nightmare in snowflake networks.  Every problem will be a new one, but probably one that could have been avoided - it's likely that you'll go slowly mad. Often you'll start troubleshooting a problem and then realize, ‘oh yeah, I caused this outage with the shortcut I took last week. Oops’.  By the way, it’s a really good sign when you start to see the same problems repeatedly. Operations should be boring. It means you can re-orient your Ops time towards 80/20 analysis of issues, rather than spending your days firefighting.
  • Stagnation - You won't be able to improve your network until you simplify and standardize it. Runbooks are fantastic tools for your Ops and Deployment teams, but the runbook will be useless if the steps are different for every switch in your network. Think about documenting a simple task... if network Y do step 1, except if feature Z is enabled then do something else, except if it’s raining or if it's a leap year.  You get the message.
  • No-Automation - If your process is too complicated to capture in a runbook you shouldn't automate it. Simplify your network, then your process, then automate.
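
One rough way to see how many snowflakes you actually have is to fingerprint each device's configuration and count how many devices share each fingerprint. The sketch below is only an illustration under stated assumptions: the device names, the sample configs and the choice to ignore `hostname` lines and `!` comment lines are all hypothetical, and a real network would need a much more careful normalization step.

```python
import hashlib
from collections import defaultdict

def normalize(config: str) -> str:
    """Drop blank lines, '!' comment lines and the per-device hostname line,
    so that only the shared structure of the config is compared.
    (Which lines to ignore is an assumption made for this sketch.)"""
    kept = []
    for line in config.splitlines():
        line = line.strip()
        if not line or line.startswith("!") or line.startswith("hostname "):
            continue
        kept.append(line)
    return "\n".join(kept)

def group_by_fingerprint(configs: dict[str, str]) -> dict[str, list[str]]:
    """Map a short hash of each normalized config to the devices that share it."""
    groups = defaultdict(list)
    for device, config in configs.items():
        digest = hashlib.sha256(normalize(config).encode()).hexdigest()[:12]
        groups[digest].append(device)
    return dict(groups)

def snowflakes(configs: dict[str, str]) -> list[str]:
    """Devices whose normalized config is shared with no other device."""
    groups = group_by_fingerprint(configs)
    return sorted(d for devices in groups.values() if len(devices) == 1 for d in devices)

# Hypothetical sample data: two identical access switches and one one-off.
CONFIGS = {
    "sw1": "hostname sw1\nvlan 10\nvlan 20\n",
    "sw2": "hostname sw2\nvlan 10\nvlan 20\n",
    "sw3": "hostname sw3\nvlan 10\nvlan 99\n",
}

if __name__ == "__main__":
    print(snowflakes(CONFIGS))
```

The fewer distinct fingerprints you end up with, the closer you are to a runbook that can use the same steps for every device.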

Summary

Network variation can be a real source of pain for us engineers. In this post we looked at the pain it causes and why we need to simplify and standardize our networks. In Part 2 we'll look at the root causes for these complicated, heterogeneous networks and how we can begin tackling the problem.

13 Comments

I've pushed at our organization for years to understand that having the fewest variables results in the highest up time.  Selecting a common router and switch, for example--or a minimum number of different router or switch hardware solutions.

Ideally I'd put the same model router and switch at every site and be done with worrying about what to keep on the shelf as a hot spare.  That would be so sweet and simple to maintain!  In practicality, what's appropriate for a site with a thousand users and gig WAN speeds is monetary overkill for a site with four users and a non-symmetrical DSL WAN service.

So my team evaluates a site's size and WAN speed needs and selects from a large, medium, or small standard set of hardware for routers & switches, and deploys them accordingly.  When all is said and done, we have limited the total number of platforms we must support.

Yes, it's occasionally a little more expensive than need be, for a site that falls between small, medium, or large.  They'll get a router with somewhat more capacity than they strictly require, but we believe that gives us a good upward growth path as they expand their demand.
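
The size-based selection described above could be sketched roughly like this. The tier names come from the comment, but the user and WAN thresholds are invented for illustration and are not the values the team actually uses. A site that falls between tiers is rounded up to the next one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    max_users: int
    max_wan_mbps: int

# Hypothetical thresholds, ordered from smallest to largest tier.
TIERS = [
    Tier("small", max_users=25, max_wan_mbps=50),
    Tier("medium", max_users=250, max_wan_mbps=500),
    Tier("large", max_users=5000, max_wan_mbps=10000),
]

def pick_tier(users: int, wan_mbps: int) -> Tier:
    """Return the smallest standard tier that covers both the user count
    and the WAN speed, so in-between sites get the next size up."""
    for tier in TIERS:
        if users <= tier.max_users and wan_mbps <= tier.max_wan_mbps:
            return tier
    raise ValueError("site exceeds the largest standard tier")

if __name__ == "__main__":
    print(pick_tier(users=4, wan_mbps=10).name)
    print(pick_tier(users=30, wan_mbps=20).name)
```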

Similarly, we do not put in custom-sized UPS's for each network room, preferring instead to have fewer models and some oversizing for power, rather than keeping a dozen different models on hand.

While variety is interesting, the larger the network, the better we're able to support it with less variety in hardware and IOS versions. 

Then to keep configurations uniform, our Network Architect took our individual scripted playbooks and built a Configurator in Excel that prompts us for all the needed information and automatically builds out the full running-config for every switch, router, and ASA firewall.  It really ensures newly deployed gear is set up the way we need it, instead of allowing individual one-offs that may be caused by our separate configuration skills or scripts.
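
The same idea, prompting for a handful of values and generating the full config from one shared template, can be sketched in a few lines of Python. This is not the Excel Configurator described above, just a minimal illustration; the template contents and parameter names are assumptions.

```python
from string import Template

# Hypothetical, heavily shortened switch template. A real one would cover
# the full standard build.
SWITCH_TEMPLATE = Template("""\
hostname $hostname
interface Vlan$mgmt_vlan
 ip address $mgmt_ip $mgmt_mask
ip default-gateway $gateway
""")

def render_config(params: dict[str, str]) -> str:
    """Fill the standard template. substitute() raises KeyError when a
    required value is missing, so incomplete input cannot produce a config."""
    return SWITCH_TEMPLATE.substitute(params)

if __name__ == "__main__":
    print(render_config({
        "hostname": "branch-sw01",
        "mgmt_vlan": "10",
        "mgmt_ip": "192.0.2.10",
        "mgmt_mask": "255.255.255.0",
        "gateway": "192.0.2.1",
    }))
```

Because every device is rendered from the same template, the only differences between configs are the values that were typed in.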

Lastly, we use NCM to ensure one-offs don't pop up in configurations by creating and running Compliance and Remediation reports and scripts against our equipment.  NCM is a sweet tool for ensuring configuration conformity.
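
For readers without such a tool, the basic idea of a compliance check can be approximated with a short script. This sketch does not use or imitate NCM's own rule format or API; the required lines and forbidden pattern are hypothetical examples of a site standard.

```python
import re

# Hypothetical policy: lines every config must contain, and patterns
# that must never appear.
REQUIRED_LINES = ["service password-encryption", "no ip http server"]
FORBIDDEN_PATTERNS = [r"^snmp-server community public\b"]

def audit(config: str) -> list[str]:
    """Return a list of findings; an empty list means the config conforms."""
    lines = [line.strip() for line in config.splitlines()]
    findings = []
    for required in REQUIRED_LINES:
        if required not in lines:
            findings.append(f"missing: {required}")
    for pattern in FORBIDDEN_PATTERNS:
        regex = re.compile(pattern)
        for line in lines:
            if regex.search(line):
                findings.append(f"forbidden: {line}")
    return findings

if __name__ == "__main__":
    sample = "service password-encryption\nsnmp-server community public RO\n"
    for finding in audit(sample):
        print(finding)
```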

Thanks for bringing this topic up!  The less variation and more conformity that's present in network hardware and configurations, the more predictable and manageable the network becomes.

MVP

Good  points.  Keep it simple and standardized. 

Level 14

Change control is critical.  Keeping things simple and consistent should be the rule.  With one offs, you can lose control of your environment.  Once control is lost, regaining it can be a herculean event.

I see it here at my company as well. They will invest in a vendor for a technology that is $5 cheaper than what our incumbent vendor will charge. I go hoarse trying to convince them that the upfront value is lost in the Support area as it means we have another vendor to manage, another Client Portal we have to deal with, another interface we have to learn, etc. The fewer vendors, the better: easier integration, the "one throat to choke" benefit, etc. The money isn't only saved, but then it gives my team an opportunity to increase the ROI.

Level 9

Hey rschroeder@,

     Thanks for your detailed comment - you make some great points there.  Using templated configuration on deployment, and using a config management tool for in-life auditing & enforcement is the ideal approach. 

The small, medium & large architectures are a nice approach. We used to call it the 't-shirt size approach' in my last company.  It's really helpful for planning also, you can quickly forecast what a branch's network costs will be next year as they grow into the next architecture.

You're right that you'll occasionally deploy more horsepower than you need but it's still the right approach. Budget owners will argue for the minimum deployment, but none of those folks will be around when you ask for budget to consolidate and clean up the resultant mess. In fact the same finance folks who pressure you to 'right-size' the network will go nuts if you try to swap out after two years, whilst complaining that your Opex budget to maintain the mess is too high.

Thanks Again,

   John H

Thanks, John.  I like your "T-Shirt size approach" description.

;^)

Level 12

We try to maintain a cookie-cutter approach to hardware deployments. Our biggest problem is management keeps changing the cookie dough. We started a refresh of our wireless deployment in our 90+ retail locations about three years ago, made it almost half-way through the project and now they decided all new deployments will get something different for wireless, which means that beginning next month, I will have four different wireless systems to maintain. And yes we still have that battle of "well this switch is $20 less than that switch and a switch is a switch, right?". Our run book has nearly as many options as we have locations!

Level 14

Sounds like the time our lead engineer ordered new laptops for the network team.  Procurement folks found a model that was $20 cheaper apiece and ordered them instead of the model we requested.  When these "better value" laptops came in, we found they had no serial ports.  Hello!  Network team here!  We promptly ordered USB to serial adapters at a cost of $20 apiece.

My experience with variation is it generally leads to more downtime, longer time to repair, and management is a nightmare. 

I was expecting to read you'd ordered serial adapters at $50 each--something more than the savings.  Silly Procurement, making assumptions again.

MVP

I've been lucky. I've worked at my current employer for 21 years and they've always maintained a standard to our network. First it was all DEC gear and then we became a Cisco shop. That's what we still use today. We use standard config templates and everyone maintains the policies (pretty much). So all good here

Level 10

Nice article. This is what my company is going through right now: "Standardization". When you have almost 150 sites across 70 countries, with networks that are too varied and do not follow a single, specific standard, it can really be a pain in the @#% to manage and maintain. I'm glad that management has started to take notice, and network standardization projects are now in progress.

Level 20

It also makes it harder to get into NPM somehow!

About the Author
"John Harrington is a network engineer who loves network design, deployment and testing. He has designed and deployed enterprise, mobile telecoms and public cloud data center networks. He values efficient processes and business-driven networking. John enjoys sharing his mistakes, learnings, and insights on his blog The Network Sherpa and on Twitter."