
Geek Speak


The Pareto Principle

 

The Pareto principle, also known as the 80-20 principle, says that 20% of the issues will cause you 80% of the headaches. This principle is also known as The Law of the Vital Few. In this post, I'll describe how the Pareto principle can guide your work to provide maximum benefit. I'll also describe a way to question the information at hand using a technique known as 5 Whys.

 

The 80-20 rule states that when you address the top 20% of your issues, you'll remove 80% of the pain. That is a bold statement. You need to judge its accuracy yourself, but I've found it to be uncannily accurate.

 

The implications of this principle can take a while to sink in. On the positive side, it means you can make a significant impact if you address the right problems. On the downside, if you randomly choose which issues to work on, it's quite likely you're working on a low-value problem.

 

Not quite enough time

 

When I first heard of the 80-20 rule I was bothered by another concern: What about the remaining problems? You should hold high standards and strive for a high-quality network, but maintaining the illusion of a perfect network is damaging. If you feel that you can address 100% of the issues, there's no real incentive to prioritize. I heard a great quote a few months back:

 

     "To achieve great things, two things are needed; a plan, and not quite enough time." - Leonard Bernstein

 

We all have too much to do, so why not focus our efforts on the issues that will produce the most value? This is where having Top-N reports from your management system is really helpful. Sometimes you need to see the full list of issues, but only occasionally. More often, this restricted view of the top issues is a great way to get started on your Pareto analysis.
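
If you want to run this kind of Pareto breakdown yourself, the arithmetic is trivial. Here's a minimal Python sketch; the incident categories and counts are invented for illustration, and in practice you'd export them from your ticketing or management system.

    from collections import Counter

    # Hypothetical incident counts by category, e.g. exported from a ticketing system.
    incidents = Counter({
        "WAN saturation": 420,
        "Wi-Fi auth failures": 310,
        "DNS timeouts": 95,
        "Switch port errors": 60,
        "Firewall rule requests": 40,
        "Printer connectivity": 25,
    })

    total = sum(incidents.values())
    cumulative = 0
    print(f"{'Category':<25}{'Count':>8}{'Cum %':>8}")
    for category, count in incidents.most_common():
        cumulative += count
        print(f"{category:<25}{count:>8}{cumulative / total:>8.0%}")
    # The cumulative column makes it obvious how few categories cause most of the pain.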

 

3G WAN and the 80-20 rule

 

A few years back, I was asked to design a solution for rapid deployment warehouses in remote locations. After an analysis of the options I ran a trial using a 3G-based WAN. We ran some controlled tests, cutting over traffic for 15 minutes, using some restrictive QoS policies. The first tests failed with a saturated downlink.

 

When I analyzed the top-talkers report for the site I saw something odd. It seemed that 80% of the traffic to the site was print traffic. It didn't make any sense to me, but the systems team verified that the shipping label printers use an 'inefficient' print driver.
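
The analysis itself is simple once you have the flow data. Here's a rough Python sketch of summarizing traffic share by application; the flow records and the port-to-application mapping are made up for the example, and a real report would come straight from your flow collector.

    # Summarize traffic share by application from exported flow records.
    # The records and port mapping below are invented for illustration.
    flows = [
        {"dst_port": 9100, "bytes": 48_000_000},  # raw print (JetDirect)
        {"dst_port": 9100, "bytes": 35_000_000},
        {"dst_port": 443, "bytes": 12_000_000},
        {"dst_port": 1521, "bytes": 6_000_000},   # database
    ]

    app_by_port = {9100: "print", 443: "https", 1521: "oracle"}

    totals = {}
    for flow in flows:
        app = app_by_port.get(flow["dst_port"], "other")
        totals[app] = totals.get(app, 0) + flow["bytes"]

    grand_total = sum(totals.values())
    for app, nbytes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{app:<8}{nbytes / grand_total:.0%}")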

 

At this point I could have ordered WAN optimizers to compress the files, but we did a 5 Whys analysis instead. Briefly, '5 Whys' is a problem solving technique that helps you identify the true root cause of issues.

 

  • Why is the bandwidth so high? - Printer traffic taking 80% of bandwidth
  • Why is printer traffic such a high percentage? - High volume of large transactions
  • Why is the file size so large? - Don't know - oh yeah we use PostScript (or something)
  • Why can't we use an alternative print format? - We can, let's do it, yay, it worked!
  • Why do we need to ask 5 whys? - We don't, you can stop when you solve the problem

 

The best form of WAN optimization is to suppress or redirect the demand. We don't all have the luxury of a software engineer who can modify code to reduce bandwidth, but in this case it was the most elegant solution. We combined a trial, reporting, Top-N views and deeper analysis with a flexible team, and the result was a valuable trial and a great outcome.

 

Summary

 

Here's a quick summary of what I covered in this post:

 

  • The 80/20 principle can help you get real value from your efforts.
  • Top-N reports are a great starting point to help you find that top 20%.
  • The 5 Whys principle can help you dig deeper into your data and choose the most effective actions.

 

Of course a single example doesn't prove the rule.  Does this principle ring true for you, or perhaps you think it is nonsense? Let me know in the comments.

In part one of this series we looked at the pain that network variation causes. In this second and final post we’ll explore how the network begins to drift and how you can regain control. 

 

How does the network drift?

It’s very hard to provide a lasting solution to a problem without knowing how the problem occurred in the first instance. Before we look at our defenses we should examine the primary causes of highly variable networks.

 

  • Time - The number one reason for shortcuts is that it takes too long to do it the ‘right way’.
  • Budget - Sure, it’s an unmanaged switch. That means low maintenance, right?
  • Capacity - Sometimes you run out of switch ports at the correct layer, so new stuff gets connected to the wrong layer. It happens.
  • No design or standards - The time, budget and capacity problems are exacerbated by a lack of designs or standards.

 

Let’s walk through an example scenario. You have a de-facto standard of using layer-2 access switches, and an L3 aggregation pair of chassis switches. You’ve just found out there’s a new fifth-floor office expansion happening in two weeks, with 40 new GigE ports required.

 

You hadn’t noticed that your aggregation switch pair is out of ports so you can’t easily add a new access-switch. You try valiantly to defend your design standards, but you don’t yet have a design for an expanded aggregation-layer, you have no budget for new chassis and you’re out of time. 

 

So, you reluctantly daisy-chain a single switch off an existing L2 access switch using a single 1Gbps uplink. You don't need redundancy; it's only temporary. Skip forward a few months: you've moved on to the next crisis and you're getting complaints of the dreaded ‘slow internet’ from the users on the fifth floor. Erm..

 

The defense against drift

Your first defense is knowing this situation will arise. It's inevitable. Don't waste your time trying to eliminate variation; your primary role is to manage the variation and limit the drift. Basic capacity planning can be really helpful in this regard.
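
Even a crude check can catch the "aggregation pair is out of ports" surprise before a deadline does. Here's a minimal Python sketch; the inventory figures are invented, and in practice you'd pull the used and total port counts from your monitoring system.

    # Flag switches whose port utilization is above a planning threshold.
    # Inventory figures are invented; pull real counts from your NMS via SNMP or its API.
    THRESHOLD = 0.80

    inventory = [
        {"name": "agg-sw-01", "ports_total": 48, "ports_used": 46},
        {"name": "agg-sw-02", "ports_total": 48, "ports_used": 45},
        {"name": "acc-sw-5f", "ports_total": 24, "ports_used": 12},
    ]

    for switch in inventory:
        utilization = switch["ports_used"] / switch["ports_total"]
        if utilization >= THRESHOLD:
            print(f"WARNING: {switch['name']} is at {utilization:.0%} port capacity")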

 

Another solution is to use ‘generations’ of designs. The network is in constant flux but you can control it by trying to migrate from one standard design to the next. You can use naming schemes to distinguish between the different architectures, and use t-shirt sizes for different sized sites: S, M, L, XL. 
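
As a sketch of what that catalogue of generations and t-shirt sizes might look like as data (the names, sizes and attributes here are invented), something as simple as this is enough to record which design each site should converge towards:

    from dataclasses import dataclass

    @dataclass
    class SiteDesign:
        """One entry in a hypothetical catalogue of standard site designs."""
        generation: str   # "legacy" or "next-gen"
        size: str         # t-shirt size: S, M, L, XL
        access_model: str
        uplink_gbps: int

    CATALOGUE = {
        "campus-legacy-M": SiteDesign("legacy", "M", "L2 access / L3 aggregation", 1),
        "campus-nextgen-M": SiteDesign("next-gen", "M", "routed access", 10),
    }

    # Tag each site with the design it should converge towards.
    sites = {"fifth-floor": "campus-nextgen-M", "hq-campus": "campus-legacy-M"}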

 

At any given time, you would ideally have two architectures in place, legacy and next-gen. Of course the ultimate challenge is to age-out old designs, but capacity and end-of-life drivers can help you build the business case to justify the next gen design.

 

But how do you regain control of that beast you created on the fifth floor? It's useful to have documentation of negative user feedback, but if you can map and measure the performance of this network and show that impact, then you've got a really solid business case.

 

A report from a network performance tool showing loss, latency and user pain, coupled with a solid network design, makes a compelling argument and strong justification for an upgrade investment.

Network variation is hurting us

Network devices like switches, routers, firewalls and load-balancers ship with many powerful features. These features can be configured by each engineer to fit the unique needs of every network. This flexibility is extremely useful and, in many ways, it's what makes networking cool. But there comes a point at which this flexibility starts to backfire and become a source of pain for network engineers.

Variation creeps up on you. It can start with harmless requests for some non-standard connectivity, but I've seen those requests grow to the point where servers were plugged straight into the network core routers. In time, these one-off solutions start to accumulate and you can lose sight of what the network ‘should’ look like. Every part of the network becomes its own special snowflake.

I’m not judging here. I've managed quite a few networks, and all of them end up with high degrees of variation and technical debt. In fact, it takes considerable effort to fight the storm of snowflakes. But if you want a stable and useful network you need to drive out variation. Of course you still need to meet the demands of the business, but only up to a point. If you're too flexible you will end up hurting your business by creating a brittle network which cannot handle changes.

Your network becomes easier and faster to deploy, monitor, map, audit, understand and fix if you limit your network to a subset of standard components. Of course there are great monitoring tools to help you manage messy networks, but you’ll get greater value from your tools when you point them towards a simple structured network.
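
To make "audit" concrete, here's a toy Python sketch that reports which standard configuration lines are missing from each device. The baseline lines and device configs are invented; real configs would come from your backup or management tool.

    # Toy audit: report standard configuration lines missing from each device.
    # Baseline and configs are invented for illustration.
    BASELINE = {
        "service timestamps log datetime msec",
        "no ip http server",
        "logging host 192.0.2.10",
    }

    configs = {
        "acc-sw-01": "service timestamps log datetime msec\nlogging host 192.0.2.10\n",
        "acc-sw-5f": "no ip http server\n",
    }

    for device, config in configs.items():
        present = {line.strip() for line in config.splitlines()}
        missing = BASELINE - present
        if missing:
            print(f"{device}: missing {sorted(missing)}")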

What’s so bad about variety?

Before we can start simplifying our networks we have to see the value in driving out that variability. Here are some thoughts on how highly variable (or heterogeneous) networks can make our lives harder as network engineers:

  • Change control - Making safe network changes is extremely difficult without standard topologies or configurations. Making a change safely requires a deep understanding of the current traffic flows, and that takes a lot of time. Documentation makes this easier, but a simple standardized topology is best. The most frustrating thing is that when you do eventually cause an outage, the lessons learned from your failed change cannot be applied to other, dissimilar parts of your network.
  • Discovery time can be high. How do you learn the topology of your network in advance of problems occurring? A topology mapping tool can be really helpful to reduce the pain here, but most people have just an outdated Visio diagram to rely on (see the sketch after this list).
  • Operations can be a nightmare in snowflake networks. Every problem will be a new one, but probably one that could have been avoided - it's likely that you'll go slowly mad. Often you'll start troubleshooting a problem and then realize, ‘oh yeah, I caused this outage with the shortcut I took last week. Oops’. By the way, it's a really good sign when you start to see the same problems repeatedly; operations should be boring. It means you can re-orient your Ops time towards 80/20 analysis of issues rather than spending your days firefighting.
  • Stagnation - You won't be able to improve your network until you simplify and standardize it. Runbooks are fantastic tools for your Ops and Deployment teams, but a runbook is useless if the steps are different for every switch in your network. Think about documenting a simple task... if network Y, do step 1, except if feature Z is enabled, then do something else, except if it's raining or if it's a leap year. You get the message.
  • No automation - If your process is too complicated to capture in a runbook, you shouldn't automate it. Simplify your network, then your process, then automate.
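
On the discovery point above, even without a mapping tool you can get surprisingly far with neighbor data you already have. Here's a rough Python sketch that builds an adjacency list from LLDP/CDP neighbor records; the records are invented, and parsing them out of 'show lldp neighbors' output is left to your own tooling.

    # Build a simple adjacency map from already-collected LLDP/CDP neighbor data.
    # The neighbor records are invented for illustration.
    neighbors = [
        ("agg-sw-01", "Gi1/0/1", "acc-sw-01", "Gi0/48"),
        ("agg-sw-01", "Gi1/0/2", "acc-sw-02", "Gi0/48"),
        ("acc-sw-02", "Gi0/47", "acc-sw-5f", "Gi0/24"),  # a daisy chain shows up here
    ]

    topology = {}
    for local, local_if, remote, remote_if in neighbors:
        topology.setdefault(local, []).append((local_if, remote, remote_if))

    for device, links in topology.items():
        for local_if, remote, remote_if in links:
            print(f"{device} {local_if} <-> {remote} {remote_if}")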

 

Summary

Network variation can be a real source of pain for us engineers. In this post we looked at the pain it causes and why we need to simplify and standardize our networks. In Part 2 we'll look at the root causes of these complicated, heterogeneous networks and how we can begin tackling the problem.

TL;DR: 'Continuous Improvement' promotes a leaner cycle of data gathering and prioritized fixes which can deliver big benefits for Ops teams without large investments in organizational change.

People are thinking and talking about network automation more than ever before. There is a bewildering array of terms and acronyms bandied about. Whether people are speaking about DevOps or SDN, the central proposition is that you'll reach a nirvana of automation where all the nasty grunt work is removed and your engineering time is spent, erm... engineering.

Yet many engineers and network managers are rejecting the notion of SDN and DevOps. These folk run real, warts-and-all networks and are overwhelmed by the day-to-day firefighting, escalations, repetitive manual processes, inter-departmental friction, and so on. They can see the elegance and power of software-defined networks and long for the smooth-running harmony of a DevOps environment, but they see no path - they simply don't know how to get to the promised land.

Network equipment vendors purport to solve your management and stability problems by swapping your old equipment with newer SDN-capable equipment. Call me skeptical, but without better processes your shiny new equipment is just another system to automate and manage. I don't blame network equipment vendors for selling solutions, but it's unlikely their solution will solve your technical debt and stability issues. Until your operational and process problems are sufficiently well defined, you'll waste time and money hopelessly trying to find solutions.

DevOps is an IT philosophy that promises many benefits, including holistic IT, silo elimination, developers who are aware of operational realities, continuous integration, tighter processes, increased automation and so on. I'm a complete fan of the DevOps movement, but it requires nothing short of a cultural transformation, and that's kinda hard.

I propose that you start with 'Continuous Improvement', which is an extremely powerful component of the DevOps philosophy. You start by focusing your limited resources on high-leverage tasks that increase your view of the network. You gather data, use it to identify your top pain point, and focus your efforts on eliminating it. If you've chosen the top pain point, you should have enough extra hours to start the process again. In this virtuous circle you have something to show for every improvement cycle, and a network which becomes more stable as time passes.

Adopting 'Continuous Improvement' can deliver the fundamental benefits of DevOps without needing to bring in any other teams or engage in a transformation process.

Let's work through one 'Continuous Improvement' scenario:

  1. Harmonize SNMP and SSH access - The single biggest step you can take towards automation is to increase the visibility of your network devices, and inconsistent SNMP and SSH access is one of the biggest barriers to that visibility. Ensure you have correct and consistent SNMP configuration, and make sure you have TACACS- or RADIUS-authenticated SSH access from a common bastion. This can be tedious work, but it provides a massive return on investment; all of the other gains come from simplifying and harmonizing access to your network devices.
  2. Programmatically pull SNMP data and configurations - This step should be easy: just gather the running config and basic SNMP information for now (see the sketch after this list). You can tune it all later.
  3. Analyze - Analyze the configuration and SNMP data you gathered. Talk to your team and your customers about their pain points.
  4. Prioritize - Prioritize one high-leverage pain point that you can measure and improve. Don't pick the gnarliest problem; pick something that can be quickly resolved but saves a lot of operational hours for your team. That is high leverage.
  5. Eliminate the primary pain point - Put one person on this task and make it a priority. You desperately need a quick win here.
  6. Celebrate - Woot! This is what investment looks like. You spent some engineer hours, but you got more engineer hours back.
  7. Tune up - Identify the additional data that would help you make better decisions in your next cycle, and tune your management system to gather that extra data... then saddle up your horse and start again.
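
To make steps 2 and 3 concrete, here's a rough Python sketch that pulls running configurations over SSH and flags devices missing a standard SNMP line. It assumes netmiko is installed and the devices are Cisco IOS; the hosts, credentials and 'standard' line are placeholders, so adapt it to your own platform and inventory.

    # Sketch of steps 2-3: pull running configs and flag devices that are missing
    # a standard SNMP configuration line. Assumes netmiko and Cisco IOS devices;
    # the hosts, credentials and standard line below are placeholders.
    from netmiko import ConnectHandler

    DEVICES = ["192.0.2.11", "192.0.2.12"]                        # from your inventory
    STANDARD_SNMP_LINE = "snmp-server community example-ro RO"    # hypothetical standard

    for host in DEVICES:
        conn = ConnectHandler(
            device_type="cisco_ios",
            host=host,
            username="netops",
            password="changeme",   # use a vault or environment variable in practice
        )
        config = conn.send_command("show running-config")
        conn.disconnect()

        with open(f"{host}.cfg", "w") as f:   # keep a copy for later analysis
            f.write(config)

        if STANDARD_SNMP_LINE not in config:
            print(f"{host}: non-standard or missing SNMP configuration")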

Summary

You don't need to overhaul your organization to see the benefits of DevOps. You can make a real difference to your team's operational load and happiness by focusing your limited resources on greater network visibility and intelligent prioritization. Once you see real results and buy yourself some breathing room, you'll be able to dive deeper into DevOps with a more stable network and an intimate knowledge of your requirements.
