
Geek Speak

34 Posts authored by: jgherbert

Having convinced myself in previous posts that my automation projects will never be finished and that I will forever be supporting 101 different device connection paradigms, I thought that in this post perhaps I should try to find some sliver of dry land in the fetid Swamp of Automation.


In The Swamp of Automation


Failing that, let’s look at why consistency is so important when deploying infrastructure.


Your Infrastructure Sucks


It does. Don’t deny it. Despite best efforts, it’s still riddled with inconsistencies, plagued with one-off solutions, and besmirched with workarounds for badly-behaving applications.


Don’t feel bad; you are part of an unspoken global society held together by the silent bonds of engineering prowess. Our motto is “When every infrastructure is ‘special,’ every infrastructure is normal.”


Deploy Consistently


Uniqueness, arguably, is the enemy of automation.


Effective automation requires consistency so the same encoded business logic and automation processes can be used across all devices. Exceptions and design variances mean that the generalized automation code has to be customized to deal with each unique situation, requiring further coding and creating additional complexity. This in turn increases both the cost of developing automation tools and the time it takes.


This is one of the reasons why brownfield (swamp) automation deployments can be so difficult; years of inherited workarounds and one-off solutions can make even simple automation tasks seem impossibly complex because of the myriad “What if?” scenarios that have to be accounted for.


Greenfield solutions in contrast can provide fertile grounds for automation, but only if the infrastructure is designed with automation in mind from the start. A badly-designed greenfield infrastructure can be just as unmanageable as a brownfield.


If Proof Is Needed...


Amazon Web Services (AWS) for example—but in fact, almost any cloud provider—demonstrates these principles very effectively. When an application needs a feature or mechanism that AWS doesn’t offer, the choices are:


  • Rewrite the application; or
  • Don’t use AWS.


Because the AWS infrastructure is managed entirely by automation, there is no opportunity for one-off, non-standard solutions because the automation tools don’t support anything except the standard solutions. AWS is the concept of “one size fits all” taken to an absurd extreme. In fact, it’s almost risible how quickly programmers who previously would have had fits arguing that the infrastructure should do some incredible feats of engineering to support or otherwise mitigate lazy coding on their part, have adapted to the new paradigm where they have to code to meet the capabilities of the infrastructure. What is this crazy magic?


There is a lesson here, however. If vendors provide “nerd knobs” to allow us to engineer crazy solutions, and if we thrive on finding ways to make the infrastructure meet the needs of our users, then our users will expect us to keep using the nerd knobs and keep on coming up with insane solutions. And let’s be honest, as engineers, we’re kind of proud when we find a really clever way to accomplish what’s needed and we solve a problem. It goes against every problem-solving bone in our bodies to say “no.”


The difference is that in the cloud, the conversation goes like this:


 Programmer: “I need you to do this crazy thing so my app works”
      Cloud: “No.”
Programmer: “But don’t you have a nerd knob you can turn?”
      Cloud: “No.”
Programmer: “But without this, the project will fail.”
      Cloud: “Shame, but no.”
Programmer: “Seriously, you’re killing me here.”
      Cloud: “Don’t tempt me.”

The Sliver of Land


Imagine that the Swamp of Automation represents a brownfield automation deployment. It’s not that it can’t be done, but the inconsistencies can turn out to be lurking alligators with bad breath and worse tempers. Move slowly and tread carefully.


The sliver of buildable land in the middle of this swamp, if there is one, is a greenfield deployment where there’s an opportunity to build a deep foundation of automation before the first resident moves in. However, get the foundation wrong, and it may all yet sink into the ground.


Your Homework for the Holidays


Thinking about your current, sucky infrastructure, if you had the opportunity to build it from the ground up, what would you do differently? How would you make it easier to operate? Or do you already have amazing systems in place? If you’ve been involved in a greenfield deployment, was there any resistance to doing things in a way that would pay off down the road?


I’d love to hear your tales of competency, incompetency, triumph, and alligator bites.

As somebody not lucky enough to have a nice clicky interface with which to manage and automate all my equipment, I have to develop my own tools to do so. One aspect of developing those tools that drives me up the wall is the variety of mechanisms I have to support to communicate with the various devices deployed in the network. Why can’t we have one, consistent way to manage the devices?




I can already hear voices saying “But we have SSH. SSH is available on almost all devices. Is that not the consistency you desire?” No, it isn’t. In terms of configuring network devices, SSH is just a(n encrypted) transport mechanism and provides no help whatsoever with configuring the devices. Once I’ve connected using SSH, I have to develop the appropriate customized code to screen-scrape the particular operating system I connect to. Anybody who has done this will testify that this is not particularly straightforward and, worse, the reward for doing so is to be able to issue commands and receive, in return, wads of unstructured data (i.e., command output) which can change between code versions, making parsing a nightmare.


So here’s my first requirement: structured data.


Structured Data


Typical command line output blurts out the requested information in such a way that the surrounding label text and the position of a piece of text impart the information necessary to infer what the text itself represents. Decoding data in this format usually means developing regular expressions to identify and pull apart the text so that the constituent data can be processed. Identification of data is implicit, based on contextual clues. Unstructured data is nice (or at least tolerable) for a human to look at and make sense of, but is frequently very difficult to interpret in code.


Structured data, on the other hand, follows a specific set of rules to present the data points such that they are explicitly and unambiguously identified and labeled for consumption by code. Structured data usually presents data in a way that mimics programmatic hierarchical data structures, which lends itself to easy integration into code.
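To make the contrast concrete, here is a minimal Python sketch. Both the CLI text and the JSON document are invented for illustration and do not reflect any particular vendor's output; the point is only how much guesswork each parser needs.

```python
import json
import re

# Hypothetical CLI output; real devices format this differently per OS and version.
CLI_OUTPUT = """\
Interface   Status   Speed
ge-0/0/0    up       1000
ge-0/0/1    down     100
"""

# The same (made-up) information presented as structured data.
JSON_OUTPUT = (
    '{"interfaces": ['
    '{"name": "ge-0/0/0", "status": "up", "speed": 1000}, '
    '{"name": "ge-0/0/1", "status": "down", "speed": 100}]}'
)


def parse_cli(text):
    """Screen-scrape: the meaning of each value is inferred from its column."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        match = re.match(r"(\S+)\s+(\S+)\s+(\d+)$", line)
        if match:
            name, status, speed = match.groups()
            rows.append({"name": name, "status": status, "speed": int(speed)})
    return rows


def parse_json(text):
    """Structured: every value arrives explicitly labeled and typed."""
    return json.loads(text)["interfaces"]
```

Both functions return the same list of dictionaries here, but the regular expression breaks as soon as a column is added or reordered, whereas the JSON parser keeps working.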


Junos has historically favored XML for structured data, but — subjectively — I would argue that XML is ugly and can get complex very quickly, especially where multiple namespaces are being used. Personally I have a soft spot for JSON, and I know other people like YAML, but ultimately I’m at the point where I’d say “I don’t care, just pick one and stick with it,” so that I can focus my time and effort on handling the one format.


So here’s my second requirement: consistent encoding.


What Gets Encoded?


Once we have a way to encode the data in structured format, we then need to be more consistent about what’s being encoded. Again, this may point at a project like OpenConfig, but failing that, perhaps it might be worth it if everything could be described using YANG. By default, YANG maps to XML, but RFC7951 (JSON Encoding of Data Modeled With YANG) helpfully shows how my pet preference JSON can be used instead.


The point here is that if YANG is used, the mapping to JSON or XML is almost a side issue, so long as the data is modeled in YANG to start with; both XML and JSON fans can translate from YANG to their favorite encoding and—and this is the key part—so can the devices, so clients can request the encoding of their choice.
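As a rough illustration of why the encoding becomes a side issue, the sketch below renders one piece of data both ways using only the Python standard library. It assumes the standard `ietf-interfaces` module name and XML namespace, simplifies the data down to two leaves, and is not a substitute for a real YANG toolchain.

```python
import json
import xml.etree.ElementTree as ET

# Assumed module name and namespace (the standard ietf-interfaces model).
MODULE = "ietf-interfaces"
NAMESPACE = "urn:ietf:params:xml:ns:yang:ietf-interfaces"

# One list entry, reduced to two leaves for the example.
interface = {"name": "eth0", "enabled": True}


def to_json(data):
    """JSON in the RFC 7951 style: the top-level name is module-qualified."""
    return json.dumps({f"{MODULE}:interface": [data]}, sort_keys=True)


def to_xml(data):
    """XML: the module is identified by the element's namespace instead."""
    root = ET.Element("interface", xmlns=NAMESPACE)
    for leaf, value in data.items():
        child = ET.SubElement(root, leaf)
        # Booleans are written in lowercase, as in the JSON rendering.
        child.text = str(value).lower() if isinstance(value, bool) else str(value)
    return ET.tostring(root, encoding="unicode")
```

The data and its structure are identical in both outputs; only the serialization differs, which is exactly what lets a client ask for whichever one it prefers.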


One Transport To Rule Them All


So now that I’ve determined that we need YANG models with support for both JSON and XML encoding, optionally following a common OpenConfig data model, let’s address how we communicate with the devices.


I don’t want to have to figure out what kind of device I’m connecting to, then based on that information, decide what connection transport I should be using (e.g., HTTP, HTTPS, SSH possibly on a non-standard TCP port). What I want to be able to do is to connect to the device in a standard way on a standard port. I don’t mind reconnecting to a non-standard port, but I want to be told about that port after I connect to the standard port.


That’s my third requirement: one transport for all.


OpenConfig takes this approach, by having a “well-known” URL on the device to report back information about the device and connection details in a standard format. I’d like to take it a step further, though.


REST API or Bust


Let’s use a REST API for all this communication. REST APIs are ubiquitous now; every programming language has the ability to send and receive requests over HTTP(S) and to decode the XML/JSON responses. It makes things easy!


My last requirement: access via REST API.


Wait a moment, let me check again:

  • Structured data: YES
  • Consistent encoding: YES
  • One transport: YES


I believe I’ve accidentally just defined RESTCONF (RFC8040), whose introductory paragraph reads:


“[...] an HTTP-based protocol that provides a programmatic interface for accessing data defined in YANG, using the datastore concepts defined in the Network Configuration Protocol (NETCONF).”


I am not a huge fan of NETCONF/XML over SSH, but NETCONF/JSON over HTTP? Count me in!
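For a flavor of what that looks like from the client side, here is a small sketch that builds (but does not send) a RESTCONF GET request with the Python standard library. The hostname is made up, and the `/restconf` root is an assumption: RFC 8040 lets a client discover the real root from the device's `/.well-known/host-meta` resource.

```python
from urllib.parse import quote
from urllib.request import Request


def restconf_request(host, interface, root="/restconf"):
    """Build a GET for one interface in the ietf-interfaces model.

    `host` and `root` are illustrative assumptions, not discovered values.
    """
    # List keys go into the path, so characters like "/" must be percent-encoded.
    key = quote(interface, safe="")
    url = f"https://{host}{root}/data/ietf-interfaces:interfaces/interface={key}"
    # Ask for the JSON encoding of the YANG-modeled data.
    headers = {"Accept": "application/yang-data+json"}
    return Request(url, headers=headers, method="GET")


req = restconf_request("router1.example.net", "ge-0/0/0")
```

Passing `req` to `urllib.request.urlopen` (plus authentication and TLS settings appropriate to the device) would issue the request; the same URL pattern would apply to any device that implements the model.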


My Thoughts


Automating the infrastructure is hard enough without battling against multiple protocols. How about we all just agree that RESTCONF is a good compromise and start supporting it across all devices?


For what it’s worth, there is some level of RESTCONF support in more recent software releases, including:

  • Cisco IOS XE
  • Cisco IOS XR
  • Cisco NXOS
  • Juniper Junos OS
  • Arista EOS
  • Extreme XOS
  • DellEMC OS10
  • and more...


But here’s the problem: when did you last hear of anybody trying to automate with RESTCONF?


That’s what I’d like to see. What about you?

There Can Be Only One



I’ve heard repeatedly from people in this industry that what we need is a single interface to our infrastructure. For status monitoring, perhaps that’s best represented by the ubiquitous “Single Pane of Glass” (Drink!); for general network management, perhaps it’s a Manager of Managers (MoM); and for infrastructure configuration it’s, well, that’s where it gets tricky.


Configurator of Configurators




I was once hopeful of a future where I could configure and monitor any network device I wanted using a standardized operational interface, courtesy of the efforts of OpenConfig. If you’ve not heard of this before, you can check out an introduction to OpenConfig I wrote in early 2016. However, after a really promising start with lots of activity, OpenConfig appears to have gone dark; there are no updates on the site, the copyright notice still says 2016, the latest “News” was a 2015 year-end round-up, and there’s really little to learn about any progress that might have been made. OpenConfig is a mysterious and powerful project, and its mystery is only exceeded by its power.




I mention OpenConfig because one of the biggest battles that project faced was to reconcile the desire for consistency across multiple vendors’ equipment while still permitting each vendor to have their own proprietary features. Looking back, SNMP started off the right way too by having a standard MIB where everybody would store, say, network information, but it became clear quite quickly that this standard didn’t support the vendors’ needs sufficiently. Instead, they started putting all the useful information in their own data structures within an Enterprise MIB specific to the vendor’s implementation. Consequently, with SNMP, the idea of commonality has almost been reduced to being able to query the hostname and some very basic information. If OpenConfig goes the same way, then it will have solved very little of the original problem.


Puppet faces a similar problem in that its whole architecture is based on the idea that you can tell any device to do something, and not need to worry about how it’s actually accomplished by the agent. This works well with the basic, common commands that apply to all infrastructure devices (say, configuring an access VLAN on a switchport), but the moment things get vendor-specific, it gets more difficult.


The CoC


It should therefore be fairly obvious that writing a tool that can fully automate a typical heterogeneous infrastructure (that is, one containing a mix of device types from multiple vendors) is potentially an incredibly steep task. Presenting a homogeneous-style front end to configure a heterogeneous infrastructure is tricky at best, and to create a single, bespoke tool to accomplish this would require skills in every platform and configuration type in use. The fact that there isn’t even a common transport and protocol that can be used to configure all the devices is a huge pain, and the subject of another post coming soon. But what choice do we have?


APIs Calling APIs


One of the solutions I proposed to the silo problem is for each team to provide documented APIs to their tools so that other teams can include those elements within their own workflows. Most likely, within a technical area, things may work best if a similar approach is used:

API Hierarchy


Arguably the Translation API could itself contain the device-specific code, but there’s no getting around the fact that each vendor’s equipment will require a different syntax, protocol, and transport. As such, that complexity should not be in the orchestration tools themselves but should be hidden behind a layer of abstraction (in this case, the translation API). In this example, the translation API changes the Cisco spanning-tree “portfast” into a more generic “edge” type:

API Translation At Work
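In code, that translation layer might be sketched roughly as follows. The platform names and command templates are illustrative assumptions and should be checked against each vendor's documentation before being trusted; the point is that the caller only ever asks for a generic "edge" port.

```python
# Illustrative per-platform templates for a generic "edge port" request.
# These are assumptions for the sketch, not verified against every release.
TEMPLATES = {
    "cisco_ios": ["interface {port}", " spanning-tree portfast"],
    "junos": ["set protocols rstp interface {port} edge"],
}


def translate_edge_port(platform, port):
    """Turn the generic request 'make this port an edge port' into device syntax."""
    try:
        template = TEMPLATES[platform]
    except KeyError:
        # The orchestration layer gets a clear failure, not a half-applied config.
        raise ValueError(f"no translation for platform {platform!r}") from None
    return [line.format(port=port) for line in template]
```

An orchestration script would call `translate_edge_port("cisco_ios", "Gi1/0/1")` without knowing or caring that the device calls the feature "portfast"; supporting a new vendor means adding a template, not touching the callers.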


There’s also no way to avoid the exact problem faced by OpenConfig, that some vendors, models, features, or licenses, will offer capabilities not offered by others. OpenConfig aimed to make this even simpler, by pushing that Translation API right down to the device itself, creating a lingua franca for all requests to the device. However, until that glorious day arrives, and all devices have been upgraded to code supporting such awesomeness, there’s a stark fact that should be considered:


Most automation in heterogeneous environments will, by necessity, cater to the lowest common denominator.


Lowest of the Low


Let’s think about that for a moment. If we want automation to function across our varied inventory, then the fact that it ends up catering to the lowest common denominator means that it should be possible to deploy almost any equipment into the network, because the fancy proprietary features aren’t going to be used by automation. While that sounds dangerously like ad-copy for white box switching, the fact remains that if any port on the network can be configured the same way (using an API) then the reality of which hardware is deployed in the field is only a matter of whether or not a device-specific API can be created to act as middleware between the scripts and the device. That could almost open up a whole new way of thinking about our networks...
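One way to picture the lowest common denominator is as a set intersection. In the sketch below the device names and feature labels are entirely hypothetical; only the features every platform supports survive into the shared automation API.

```python
# Hypothetical inventory: platform name -> features its device-specific API exposes.
CAPABILITIES = {
    "vendor_a_switch": {"access_vlan", "trunk", "edge_port", "proprietary_fabric"},
    "vendor_b_switch": {"access_vlan", "trunk", "edge_port", "fancy_telemetry"},
    "white_box_switch": {"access_vlan", "trunk", "edge_port"},
}


def common_features(inventory):
    """Return the features that every platform in the inventory supports."""
    return set.intersection(*inventory.values())


usable = common_features(CAPABILITIES)
```

Here `usable` contains only `access_vlan`, `trunk`, and `edge_port`: the proprietary extras drop out, which is why the white box switch ends up looking just as capable as its pricier neighbors from the automation's point of view.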


Abstract the Abstraction


Will we end up dumbing down our networks to allow this kind of homogeneous operation of heterogeneous networks? I don’t know, but it seems to me that as soon as there’s a feature disparity between two devices with a similar role in the network, we end up looking right back at the LCD.


I’m a fan of creating abstractions to abstractions, so that—as much as possible—the dirty details are hidden well out of sight. And while it would be lovely to think that all vendors will eventually deploy a single interface that our tools can talk to, until that point, we’re on the hook to provide those translations, and to build that common language for our configuration and monitoring needs.


Qu'est-ce qui pourrait mal se passer?

With the popularity of Agile methodologies and the ubiquity of people claiming they were embracing NetOps/DevOps, I could swear we were supposed to have adopted a new silo-busting software-defined paradigm shift which would deliver the critical foundational framework we needed to become the company of tomorrow, today.

Warning: Bull-dung Ahead


Brace For Cynicism!


I recently discussed Automation Paralysis and the difficulties of climbing the Cliffs of Despair from small, uni-functional automation tools at the bottom (the Trough of Small Successes) to larger, integrated tools at the top (the Plateau of Near-Completion). The way I see it, in most cases, the cliffs are being climbed individually by each of the infrastructure sub-specialties (network, compute, storage, and security), and even though each group goes through similar pains, there's no shared experience here. Each group works through its own problems on their own, in their own way, using their own tools.


For the sake of argument, let's assume that all infrastructure teams have successfully scaled the Cliffs of Despair with only a few casualties along the way, and are making their individual base camps on the Plateau of Near Completion. What happens now? What does the company actually have? What the company has is four complex automation products which are most likely totally incompatible with one another.


Introducing the Silo Family!


IT Silos: Network, Compute, Storage, Security


If I may, I'd like to introduce you all to the Silo Family. While they may not show up in genetic test results, I'll wager we all have relatives in these groups:



Netbeard Picture








NetBeard is proud of having automated a large chunk of what many said could not be automated: the network. Given the right inputs, NetBeard's tools can save many hours by pushing configs out to devices in record time. NetBeard's team was the first one in the world to ever have to solve these problems, and it was made all the more difficult by the fact that there were no existing tools available that could do what was needed.


Compute Monkey

Computer Monkey Picture

Looking more confident than the rest of the cohort, Compute Monkey can't understand what the fuss is all about. Compute Monkey's servers have been Puppeted and Cheffed and Whatever-Elsed for years, and it's an open secret that deploying compute now requires little more than a couple of mouse clicks.



Storebot Picture

StoreBot is pleasant enough, but while everybody can hear the noises coming out of StoreBot's mouth, few have the ability to interpret what it all means. If you've ever heard the teacher talking in the Peanuts TV cartoon series it's a bit like that: Whaa waawaa scuzzy whaaaa LUN wawabyte whaaaaw.


Security Fox

SecurityFox Picture

Nobody knows anything about Security Fox. Security Fox likes to keep things secret.


Family Matters


The problem is, each group works in a silo. They don't collaborate with automation, they don't believe that the other groups would really understand what they do (come on, admit it), and they keep their competitive edge to themselves. I don't believe that any of the groups really means to be insular, but, well, each team has knowledge, and to work together on automation would mean having to share knowledge and be patient while the other groups try to understand what, how, and why the group operates the way it does. And once somebody else understands that role, why should they be the ones to automate it? Isn't that automating another group out of a job? Ultimately, I am cynical about the chances of success based on most of the companies I've seen over the years.


However, if success is desired, I do have a few thoughts, and I'm sure that the THWACK community will have some too.


Bye Bye, Silos


Getting rid of silos does not mean expecting everybody to do everything. Indeed, expertise in each technology in use is required just as it was when the organization was siloed. However, merging all these skills into a single team does mean that it's possible to introduce the idea of shared fate, where the team as a whole is responsible – and hopefully rewarded – for achieving tighter integrations between the different technologies so that there can be a single workflow.


Create APIs Between Groups


If it's not possible to unite the teams, and especially where there is a legacy of automation dragged up to the Plateau, make that automation available to fellow teams via APIs, and the other teams should do the same in return. That way each team gets to feel accomplished and maintains their expertise, management team, and so on, but now automation in each group can use, and be used by, automation from other groups. For example, when deploying a server based on a request to the Compute group, wouldn't it be nice if the Compute group's automation obtained IPs, VLANs, trunks, etc., via an API provided by the Network group? Storage could be instantiated the same way. Everybody gets to do their own thing, but by publishing APIs, everybody gets smarter.
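As a toy illustration of that server example, the sketch below uses two in-process Python classes as stand-ins for each team's automation. All class names, the subnet, and the VLAN number are hypothetical, and in real life the Network group's API would more likely sit behind HTTP than a method call.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class NetworkAllocation:
    ip: str
    vlan: int


class NetworkAPI:
    """Stand-in for the API published by the Network group."""

    def __init__(self, subnet, vlan):
        self._hosts = iter(ipaddress.ip_network(subnet).hosts())
        self._vlan = vlan

    def allocate(self):
        # Hand out the next free address; the caller never sees how it is tracked.
        return NetworkAllocation(ip=str(next(self._hosts)), vlan=self._vlan)


class ComputeAutomation:
    """Stand-in for the Compute group's tooling, consuming the Network API."""

    def __init__(self, network_api):
        self._network = network_api

    def deploy_server(self, hostname):
        allocation = self._network.allocate()
        return {"hostname": hostname, "ip": allocation.ip, "vlan": allocation.vlan}


compute = ComputeAutomation(NetworkAPI("192.0.2.0/29", vlan=100))
server = compute.deploy_server("web01")
```

The Compute tooling knows nothing about how addresses are managed, and the Network group can change its internals freely as long as `allocate()` keeps its contract.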


Go Hyperconverged


Hyperconvergence is not only Buzzword Approved™, but for some it's the perfect workaround for having to create all this automation in a bespoke fashion. Of course, with convenience comes caveat, and there are quite a few to consider, perhaps including:

  • Vendor lock-in (typically only vendor-approved hardware can be used)
  • Solution lock-in (no choice but to run the vendor's software)
  • Delivers a one-size-fits-most solution, which is good if you're that size
  • May not be able to customize to particular needs if not supported by the software


I'm not against hyperconverged infrastructure (HCI) by any means, but it seems to me that it's always a compromise in one way or another.


Use Another Solution


Why write all this coordinated automation when somebody else can do it for you? Well, because somebody else might not do it quite the way you had in mind. I mean, why not spin up some OpenStack in the corporate DC? OpenStack has a component for everything, I hear, including compute, storage, network, vegan recipes, key management, 18th century French poetry, and orchestration. OpenStack can be incredibly powerful, but last I heard it's really not fun to install and maintain for oneself; it's much nicer to let somebody else run it and just subscribe to the service; sounds a bit like cloud doesn't it? On which note:




Make It Somebody Else's Problem (MISEP). The big cloud providers have managed to de-silo their teams, or maybe they were never siloed in the first place. The point is, services like AWS are halfway up the Asymptotic Dream of Full Automation; they pull together all those automation tools, make them work together, orchestrate them, then provide pointy-clicky access via a web browser. What's not to love? All the hard work is done, it's cheaper*, there will be no need to write scripts any more**, you can do anything you like***, and life will be wonderful****.


* Rarely true with any reasonable number of servers running

** Also very rarely true

*** I made this up

**** It won't


As ever, if you read between the lines, you might guess that as with HCI (another form of MISEP), such simplicity comes at a price, both literally and figuratively. With cloud services it's usually a many-sizes-fit-most model, but if what you want to do isn't supported, that's just tough luck and you need to find another way. While skills in the previous silos may be less necessary, a new silo appears instead: Cloud Cost Optimization. Make of that what you will.


Why The Long Face?


It may seem that this is an unreasonably negative view of automation – and some of it is a tiny bit tongue-in-cheek – but I have tried to highlight some of the very real challenges standing in the way of a beautifully cost-efficient, highly agile, high-quality automated network. Wait, that's reminding me of something, and allows me to make one last dig at the dream:


Pick Two: Cheap, Fast, Good


We can get there. At least, we can get much of the way there, but we have to break out of our silos and start sharing what we know. We also need to go into this with eyes wide open, an understanding of what the alternatives might be, and a reasonable expectation of what we're going to get out of it in the end.

Despite all the talk of all our jobs being replaced by automation, my experience is that the majority of enterprises are still very much employing engineers to design, build, and operate the infrastructure. Why, then, are most of us stuck in a position where despite experimenting with infrastructure automation, we have only managed to build tools to take over small, mission-specific tasks, and we've not achieved the promised nirvana of Click Once To Deploy?


We Are Not Stupid


Before digging into some reasons we're in this situation, it's important to first address the elephant in the room, which is the idea that nobody is stupid enough to automate themselves out of a job. Uhh, sure we are. We're geeks, we're driven by an obsession with technology, and the vast majority of us suffer from a terrible case of Computer Disease. I believe that the technical satisfaction of managing to successfully automate an entire process is a far stronger short-term motivation than any fear of the potential long-term consequences of doing so. In the same way that hoarding information as a form of job security is a self-defeating action (as Greg Ferro correctly says, "if you can't be replaced, you can't be promoted"), avoiding automation because it takes a task away from the meatbags is an equally silly idea.


Time Is Money


Why do we want to automate? Well, automation is the path to money. Automation leads to time-saving; time-saving leads to agility; agility leads to money. Save time, you must! Every trivial task that can be accomplished by automation frees up time for more important things. Let's be honest, we all have a huge backlog of things we've been meaning to do, but don't have time to get to.


However, building automation takes time too. There can be many nuances to even simple tasks, and codifying those nuances and handling exceptions can be a significant effort, and large scale automation is exponentially more complex. Because of that, we start small, and try to automate small steps within the larger task because that's a manageable project. Soon enough, there will be a collection of small, automated tasks built up, each of which requires its own inputs and generates its own outputs, and--usually--none of which can talk to each other because each element was written independently. Even so, this is not a bad approach, because if the tasks chosen for automation occur frequently, the time saved by the automation can outweigh the time spent developing it.
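The "time saved versus time spent" trade-off is simple arithmetic, sketched below. The figures in the example call are invented purely to show the calculation, not measurements from any real project.

```python
import math


def break_even_runs(dev_hours, manual_minutes, automated_minutes):
    """Number of runs after which the automation has paid back its development time."""
    saved_per_run = manual_minutes - automated_minutes
    if saved_per_run <= 0:
        raise ValueError("automation must be faster than the manual task")
    # Convert development time to minutes, then round up to whole runs.
    return math.ceil(dev_hours * 60 / saved_per_run)


# Invented example: 16 hours to write a tool that turns a 30-minute task into 2 minutes.
runs_needed = break_even_runs(dev_hours=16, manual_minutes=30, automated_minutes=2)
```

With those made-up numbers the tool pays for itself after 35 runs, which is why tasks that occur frequently are the ones worth automating first.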


This kind of automation still needs hand-holding and guidance from a human, so while the humans are now more productive, they haven't replaced themselves yet.


Resource Crunch


There's an oft-cited problem that infrastructure engineers don't understand programming and programmers don't understand infrastructure, and there's more than a grain of truth to this idea. Automating small tasks is something that many infrastructure engineers will be capable of, courtesy of some great module/package support in scripting languages like Python. Automating big tasks end-to-end is a different ball game, and typically requires a level of planning and structure in the code exceeding that which most infrastructure engineers have in their skills portfolio. That's to be expected: if coding was an engineer's primary skill, they'd more likely be a programmer, not an infrastructure engineer.


Ultimately, scaling an automation project will almost always require dedicated and skilled programmers, who are not usually found in the existing team, and that means spending money on those programming resources, potentially for an extended period of time. While the project is running, it's likely that there will be little to no return on the investment. This is a classic demonstration of the maxim that you have to speculate to accumulate, but many companies are not in a position--or are simply unwilling--to invest that money up front.


The Cliffs Of Despair


With this in mind, in my opinion, one of the reasons companies get stuck with lots of small automation is that it's relatively easy to automate multiple, small tasks, but taking the next step and automating a full end-to-end process is a step too far for many companies. It's simply too great a conceptual and/or financial leap from where things are today. Automating every task is somewhere so far off in the distance, nobody can even forecast it.


They say a picture is worth a thousand words, which probably means I should have just posted this chart and said "Discuss," but nonetheless, as a huge fan of analyst firms, I thought that I could really drive my point home by creating a top quality chart representing the ups and downs of infrastructure automation.


The Cliffs Of Despair


As is clearly illustrated here, after the Initial Learning Pains, we fall into the Trough Of Small Successes, where there's enough knowledge now to create many, small automation tools. However, the Cliffs Of Despair loom ahead as it becomes necessary to integrate these tools together and orchestrate larger flows. Finally–and after much effort–a mechanism emerges by which the automation flows can be integrated, and the automation project enters the Plateau of Near Completion where the new mechanism is applied to the many smaller tools and good progress is made towards the end goal of full automation. However, just as the project manager announces that there are only a few remaining tasks before the project can be considered a wrap, development enters the Asymptotic Dream Of Full Automation, whereby no matter how close the team gets to achieving full automation, there's always just one more feature to include, one more edge case that hadn't arisen before, or one more device OS update which breaks the existing automation, thereby ensuring that the programming team has a job for life and will never achieve the sweet satisfaction of knowing that the job is finished.


Single Threaded Operation


There's one more problem to consider. Within the overall infrastructure, each resource area (e.g., compute, storage, network, security) is likely working their own way towards the Asymptotic Dream Of Full Automation and at some point will discover that full, end-to-end automation means orchestrating tasks between teams. And that's a whole new discussion, perhaps for a later post.


Change My Mind



Ask a good server engineer where their server configuration is defined and the answer will likely be something similar to "In my Puppet manifests." Ask a network administrator the same thing about the network devices and they'll probably look at you in confusion. Likely responses may include:


  • Uh, the device configuration is on the device, of course.
  • We take configuration backups every day!


Why is it that the server team seems to have gotten their act together while the network team is still working the same way they were twenty years ago?


The Device As The Master Configuration Source


To clarify the issue described, for many companies, the instantiation of the company's network policy is the configuration currently active on the network devices. To understand a full security policy, it's necessary to look at the configuration on a firewall. To review load balancer VIP configurations, one would log into the load balancer and view the VIPs. There's nothing wrong with that, as such, except that by viewing the configuration on a running device, we see what the configuration is, not what it was intended to be.

"We see what the configuration IS, not what it was intended to be"

Think about that for a moment: taking daily backups of a device configuration tells us absolutely nothing about what we had intended for the policy to be; rather, it's just a series of snapshots of the current implemented configuration. Unless an additional step is taken to compare each configuration snapshot against some record of the intended policy, errors (and malicious changes) will simply be perpetuated as the new latest configuration for a device.
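To make that "additional step" concrete, here is a minimal sketch in Python of comparing a configuration snapshot against a record of intent. The two configuration texts, the Cisco-style lines in them, and the function name config_drift are all made up for illustration; a real comparison would need to cope with vendor-specific ordering and formatting.

```python
import difflib


def config_drift(intended: str, running: str) -> list[str]:
    """Return the lines that differ between the intended and running config.

    Lines prefixed "- " exist only in the intended policy; lines prefixed
    "+ " exist only on the device.
    """
    diff = difflib.ndiff(intended.splitlines(), running.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical record of what the policy is supposed to be.
INTENDED = """\
hostname edge-fw-01
ntp server 192.0.2.10
snmp-server community s3cr3t-ro RO
"""

# Hypothetical daily backup taken from the device.
RUNNING = """\
hostname edge-fw-01
ntp server 192.0.2.10
snmp-server community public RO
"""

drift = config_drift(INTENDED, RUNNING)
for line in drift:
    print(line)
```

Without the INTENDED text, the backup containing the weaker community string would simply become the newest "good" snapshot.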


Contrast this to a Linux server managed by, for example, Puppet. The server team can define a policy saying that the server should run Perl v5.10.1, and code that into a Puppet manifest. A user with appropriate permissions may decide that for some code they are writing, they need to have Perl v5.16.1, so they install the new version, overwriting the old one. In the network world, a daily backup of the server configuration would now include Perl 5.16.1 and from then on that would implicitly be the version of Perl running on that device, even though that wasn't the owning team's intent. Puppet, on the other hand, runs periodically and checks the policy (as represented by the manifest) against what's running on the device itself. When the Perl version is checked, the discrepancy will be identified, and Puppet will automatically restore v5.10.1 because that's the version specified in the policy. If the server itself dies, all a replacement server really needs is to load the OS with a basic configuration and a Puppet agent, and all the policies defined in the manifest can be instantiated on the new server just as they were on the old server. The main takeaways are that the running configuration is just an instantiation of the policy, and the running configuration is checked regularly to ensure that it is still an accurate representation of that policy.
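The check-and-restore cycle can be sketched in a few lines of Python. This is not how Puppet is implemented; it's a toy model of the idea, in which the dictionaries, the reconcile function, and the "install" step (a plain assignment) are all stand-ins I've invented for illustration.

```python
def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Bring `actual` in line with `desired` and return a log of corrections."""
    changes = []
    for package, version in desired.items():
        found = actual.get(package)
        if found != version:
            changes.append(f"{package}: {found} -> {version}")
            actual[package] = version  # stand-in for the real install step
    return changes


policy = {"perl": "5.10.1"}   # what the owning team intends
server = {"perl": "5.16.1"}   # what a user left behind

first_run = reconcile(policy, server)
second_run = reconcile(policy, server)

print(first_run)   # the drift is found and corrected
print(second_run)  # nothing left to do, so the run is a no-op
```

The second run doing nothing is the important property: the policy can be applied as often as you like, and only genuine drift triggers a change.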

"The running configuration is just an instantiation of policy"

Let's Run The Network On Puppet!


Ok, nice idea, but let's not get too far ahead of ourselves here. Puppet requires an agent to run on the device. This is easy to do on a server operating system, but many network devices run a proprietary OS, or limit access to the system sufficiently that it wouldn't be possible to install an agent (there are some notable exceptions to this). Even if a device offers a Puppet agent, creating the configuration manifests may not be straightforward, and will certainly require network engineers learning a new skillset.


Picking on Junos OS as an example, the standard Puppet library supports the configuration of physical interfaces, VLANs, LAGs, and layer 2 switching, and, well, that's it. Of course, there's something deeper here worth considering: the same manifest configuration works on an EX and an MX, despite the fact that the implemented configurations will look different, and that's quite a benefit. For example, consider this snippet of a manifest:


Puppet manifest snippet


On a Juniper EX switch, this would result in configuration similar to this:


Juniper EX configuration sample


On a Juniper MX router, the configuration created by the manifest is quite different:


Juniper MX configuration sample


The trade-off for learning the syntax for the Puppet manifest is that the one syntax can be applied to any platform supporting VLANs, without needing to worry about whether the device uses VLANs or bridge-domains. Now if this could be supported on every Juniper device and OS version and the general manifest configuration could be made to apply to multiple vendors as well, that would be very helpful.
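As a rough illustration of "one definition, different renderings," here's a Python sketch that turns a single VLAN definition into a set-style command for each platform. The Vlan class, the STANZA mapping, and the VLAN name and ID are my own inventions and are far simpler than what a real Puppet module does; the only thing borrowed from Junos is that an EX holds the VLAN under vlans while an MX holds it under bridge-domains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vlan:
    """Platform-neutral description of a VLAN (illustrative only)."""
    name: str
    vlan_id: int


# Assumed mapping of platform family to the configuration stanza used there.
STANZA = {"ex": "vlans", "mx": "bridge-domains"}


def render(vlan: Vlan, platform: str) -> str:
    """Render one VLAN definition as a set-style command for the platform."""
    stanza = STANZA[platform]
    return f"set {stanza} {vlan.name} vlan-id {vlan.vlan_id}"


blue = Vlan(name="blue", vlan_id=100)
for platform in ("ex", "mx"):
    print(platform, "->", render(blue, platform))
```

The person defining the VLAN only ever touches the Vlan object; the platform differences live in one place.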




A manifest in this instance is a text file. Text files are easy for a script to create and edit, which makes automating the changes to these files relatively straightforward. Certainly compared to managing the process of logging into a device and issuing commands directly, creating a text file containing an updated manifest seems fairly trivial, and this may open the door to more automated configuration than might otherwise be possible.


Centralized Configuration Policy


Puppet has been used as an example above, but that does not imply that Puppet is the (only) solution to this problem; it's just one way to push out a policy manifest and ensure that the instantiated configuration matches what's defined by the policy. The main point is that as network engineers, we need to be looking at how we can migrate our configurations from a manual, vendor- (and even platform-) specific system to one which allows the key elements to be defined centrally, deployed (instantiated) to the target device, and for that configuration to be regularly validated against the master policy.


It's extremely difficult and, I suspect, risky, to jump in and attempt to deploy an entire configuration this way. Instead, maybe it's possible to pick something simple, like interface configurations or VLAN definitions, and see if those elements can be moved to a centralized location while the rest of the configuration is on-device. Over time, as confidence increases, additional parts of the configuration can be pulled into the policy manifest (or repository).


Roadblocks and Traffic Jams


There's one big issue with moving entire configurations into a centralized repo, which is that each vendor offers different ways to remotely configure the devices, some methods do not offer full coverage of the configuration syntax available via the CLI (I'm squinting at you, Cisco), and some operating systems are much more amenable to receiving and seamlessly (i.e., without disruption) applying configuration patches than others. Network device vendors are notoriously slow to make progress when it comes to network management, at least where it doesn't allow them to charge for their own solution to a problem, and developing a single configuration mechanism which could be applied to devices from all vendors is a non-trivial challenge (cf: OpenConfig). Nonetheless, we owe it to ourselves to keep nagging our vendors to make serious progress in this area and keep it high on the radar. When I look at trying to implement this kind of centralized configuration across my own company's range and age of hardware models and vendors, my head spins. We have to have a consistent way to configure our network devices, and given that most companies keep network devices for at least a few years, even if that was implemented today, it would still be 3-4 years before every device in a network supported that configuration mechanism.

"We owe it to ourselves to keep nagging our vendors"


On a more positive note, however, I will raise a glass to Juniper for being perhaps the most netdev-friendly network device vendor for a number of years now, and I will nod respectfully in the direction of Cumulus Networks, who have kept their configurations as Unix standard as possible within the underlying Linux OS, thus opening them up to configuration via existing server configuration tools.


What Do You Do?


How do you manage the expectations that devices are implementing the policies they were intended to, and do not become an ever-changing source of truth for the intended policy? How do you push configurations to your devices, or does that idea scare you or seem impossible to do? If automation means swapping the CLI for a GUI, are you on board?

What do you do?

Please let me know; I hope to see a light at the end of the tunnel (and I hope it's not an oncoming train).

I am fascinated by the fact that in over twenty years, the networking industry still deploys firewalls in most of our networks exactly the way it did back in the day. And for many networks, that's it. The reality is that the attack vectors today are different from what they were twenty years ago, and we now need something more than just edge security.


The Ferro Doctrine

Listeners to the Packet Pushers podcast may have heard the inimitable Greg Ferro expound on the concept that firewalls are worthless at this point because the cost of purchasing, supporting, and maintaining them exceeds the cost to the business of any data breach that may occur as a result. To some extent, Greg has a point. After the breach has been cleared up and the costs of investigation, fines, and compensatory actions have been taken into account, the numbers in many cases do seem to be quite close. With that in mind, if you're willing to bet on not being breached for a period of time, it might actually be a money-saving strategy to just wait and hope. There's a little more to this than meets the eye, however.


Certificate of Participation

It's all very well to argue that a firewall would not have prevented a breach (or delayed it any longer than it already took for a company to be breached), but I'd hate to be the person trying to make that argument to my shareholders, or (in the U.S.) the Securities and Exchange Commission or the Department of Health and Human Services, to pick a couple of random examples. At least if you have a firewall, you get to claim "Well, at least we tried." As a parallel, imagine that two friends have their bicycles stolen from the local railway station where they had left them. One friend used a chain and padlock to secure their bicycle, but the other just left their bicycle there because the thieves can cut through the chain easily anyway. Which friend would you feel more sympathy for? The chain and padlock at least raised the barrier of entry to only include thieves with bolt cutters.


The Nature Of Attacks

Greg's assertion that firewalls are not needed does have a subtle truth to it -- if it's coupled with the idea that some kind of port-based filtering at the edge is still necessary. But perhaps it doesn't need to be stateful and, typically, expensive. What if edge security was implemented on the existing routers using (by definition, stateless) access control lists instead? The obvious initial reaction might be to think, "Ah, but we must have session state!" Why? When's the last TCP sequence prediction attack you heard of? Maybe it's a long time ago, because we have stateful firewalls, but maybe it's also because the attack surface has changed.


Once upon a time, firewalls protected devices from attacks on open ports, but I would posit that the majority of attacks today are focused on applications accessed via a legitimate port (e.g. tcp/80 or tcp/443), and thus a firewall does little more than increment a few byte and sequence counters as an application-layer attack is taking place. A quick glance at the OWASP 2017 Top 10 List release candidate shows the wide range of ways in which applications are being assaulted. (I should note that this release candidate, RC1, was rejected, but it's a good example of what's at stake even if some specifics change when it's finally approved.)


If an attack takes place using a port which the firewall will permit, how is the firewall protecting the business assets? Some web application security might help here too, of course.


Edge Firewalls Only Protect The Edge

Another change which has become especially prevalent in the last five years is the idea of using distributed security (usually firewalls!) to move the enforcement point down toward the servers. Once upon a time, it was sometimes necessary to do this simply because centralized firewalls did not scale well enough to cope with the traffic they were expected to handle. The obvious solution is to have more firewalls and place them closer to the assets they are being asked to protect.


Host-based firewalls are perhaps the ultimate in distributed firewalls, and whether implemented within the host or at the host edge (e.g. within a vSwitch or equivalent within a hypervisor), flows within a data center environment can now be controlled, preventing the spread of attacks between hosts. VMware's NSX is probably the most commonly seen implementation of a microsegmentation solution, but whether using NSX or another solution, the key to managing so many firewalls is to have a front end where policy is defined, then let the system figure out where to deploy which rules. It's all very well spinning up a Juniper cSRX (an SRX firewall implemented as a container), for example, on every virtualization host, but somebody has to configure the firewalls, and that's a task that, if performed manually, would rapidly spiral out of control.


Containers bring another level of security angst too since they can communicate with each other within a host. This has led to the creation of nanosegmentation security, which controls traffic within a host, at the container level.


Distributed firewalls are incredibly scalable because every new virtualization host can have a new firewall, which means that security capacity expands at the same rate as the compute capacity. Sure, licensing costs likely grow at the same rate as well, but it's the principle that's important.


Extending the distributed firewall idea to end-user devices isn't a bad idea either. Imagine how the spread of a worm like WannaCry could have been limited if the user host firewalls could have been configured to block SMB while the worm was rampant within a network.


Trusted Platforms

In God we trust; all others must pay cash. For all the efforts we make to secure our networks and applications, we are usually also making the assumption that the hardware on which our network and compute run is secure in the first place. After the many releases of NSA data, I think many have come to question whether this is actually the case. To that end, trusted platforms have become available, where components and software are monitored all the way from the original manufacturer through to assembly, and the hardware/firmware is designed to identify and warn about any kind of tampering that may have been attempted. There's a catch here, which is that the customer always has to decide to trust someone, but I get the feeling that many people would believe a third-party company's claims of non-interference over a government's. If this is important to you, there are trusted compute platforms available, and now even some trusted network platforms with a similar chain of custody-type procedures in place to help ensure legitimacy.


There's Always Another Tool

The good news is that security continues to be such a hot topic that there is no shortage of options when it comes to adding tools to your network (and there are many I have chosen not to mention here for the sake of brevity). There's no perfect security architecture, and whatever tools are currently running, there's usually another that could be added to fill a hole in the security stance. Many tools, at least the inline ones, add latency to the packet flows; it's unavoidable. In an environment where transaction speed is critical (e.g. high-speed trading), what's the trade-off between security and latency?


Does this mean that we should give up on in-depth security and go back to ACLs? I don't think so. However, a security posture isn't something that can be created once then never updated. It has to be a dynamic strategy that is updated based on new technologies, new threats, and budgetary concerns. Maybe at some point, ACLs will become the right answer in a given situation. It's also not usually possible to protect against every known threat, so every decision is going to be a balance between cost, staffing, risk, and exposure. Security will always be a best effort given the known constraints.


We've come so far since the early firewall days, and it looks like things will continue changing, refreshing, and improving going forward as well. Today's security is not your mama's security architecture, indeed.

I'm not aware of an antivirus product for network operating systems, but in many ways, our routers and switches are just as vulnerable as a desktop computer. So, why don't we all protect them in the same way as our compute assets? In this post, I'll look at some basic tenets of securing the network infrastructure that underpins the entire business.


Authentication, authorization, and accounting (AAA)

Network devices intentionally leave themselves open to user access, so controlling who can get past the login prompt (authentication) is a key part of securing devices. Once logged in, it's important to control what a user can do (authorization). Ideally, what the user does should also be logged (accounting).


Local accounts are bad, mkay?

Local accounts (those created on the device itself) should be limited solely to backup credentials that allow access when the regular authentication service is unavailable. The password should be complex and changed regularly. In highly secure networks, access to the password should be restricted (kind of a "break glass for password" concept). Local accounts don't automatically disable themselves when an employee leaves, and far too often, I've seen accounts still active on devices for users who left the company years ago, with some of those accessible from the internet. Don't do it.


Use a centralized authentication service

If local accounts are bad, then the alternative is to use an authentication service like RADIUS or TACACS. Ideally, those services should, in turn, defer authentication to the company's existing authentication service, which in most cases, is Microsoft Active Directory (AD) or a similar LDAP service. This not only makes it easier to manage who has access in one place, but by using things like AD groups, it's possible to determine not just who is allowed to authenticate successfully, but what access rights they will have once logged in. The final, perhaps obvious, benefit is that it's only necessary to grant a user access in one place (AD), and they are implicitly granted access to all network devices.


The term process

A term (termination) process defines the list of steps to be taken when an employee leaves the company. While many of the steps relate to HR and payroll, the network team should also have a well-defined term process to help ensure that after a network employee leaves, things such as local fall back admin passwords are changed, or perhaps SNMP read/write strings are changed. The term process should also include disabling the employee's Active Directory account, which will also lock them out of all network devices because we're using an authentication service that authenticates against AD. It's magic! This is a particularly important process to have when an employee is terminated by the company, or may for any other reason be disgruntled.


Principle of least privilege

One of the basic security tenets is the principle of least privilege, which in basic terms, says "Don't give people access to things unless they actually need it; default to giving no access at all." The same applies to network device logins, where users should be mapped to the privilege group that allows them to meet their (job) goals, while not granting permissions to do anything for which they are not authorized. For example, a NOC team might need read-only access to all devices to run show commands, but they likely should not be making configuration changes. If that's the case, one should ensure that the NOC AD group is mapped to have only read-only privileges.


Command authorization

Command authorization is a long-standing security feature of Cisco's TACACS+, and while sometimes painful to configure, it can allow granular control of issued commands. It's often possible to configure command filtering within the network OS configuration, often by defining privilege levels or user classes at which a command can be issued, and using RADIUS or TACACS to map the user to that group or user class at login. One company I worked for created a "staging" account on Juniper devices, which allowed the user to enter configuration mode and enter commands, and allowed the user to run commit check to confirm the configuration's validity, but did not allow an actual commit to make the changes active on the device. This provided a safe environment in which to validate proposed changes without ever having the risk of the user forgetting to add check to their commit statement. Juniper users: tell me I'm not the only one who ever did that, right?
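The logic behind a "staging" account like that boils down to an ordered list of permit and deny rules with a default deny at the end. Here's a hedged Python sketch of that decision process; the rule list and the is_authorized function are hypothetical and don't reflect the syntax of TACACS+ or of a Junos login class.

```python
import re

# Hypothetical "staging" role: rules are checked in order, first match wins.
STAGING_RULES = [
    ("permit", re.compile(r"^commit check$")),
    ("deny", re.compile(r"^commit\b")),
    ("permit", re.compile(r"^(show|set|delete|configure)\b")),
]


def is_authorized(command: str, rules=STAGING_RULES) -> bool:
    """Return True if the command is permitted for the role."""
    command = " ".join(command.split())  # collapse stray whitespace
    for action, pattern in rules:
        if pattern.search(command):
            return action == "permit"
    return False  # anything not explicitly permitted is denied


for cmd in ("show interfaces terse", "commit check", "commit", "request system reboot"):
    print(f"{cmd!r}: {'permit' if is_authorized(cmd) else 'deny'}")
```

Rule order matters: commit check has to be permitted before the broader commit rule denies everything else that starts with commit.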


Command accounting

This one is simple: log everything that happens on a device. More than once in the past, we have found the root cause of an outage by checking the command logs on a device and confirming that, contrary to the claimed innocence of the engineer concerned, they actually did log in and make a change (without change control either, naturally). In the wild, I see command accounting configured on network devices far less often than I would have expected, but it's an important part of a secure network infrastructure.


Network time protocol (NTP)

It's great to have logs, but if the timestamps aren't accurate, it's very difficult to align events from different devices to analyze a problem. Every device should be using NTP to ensure that it has an accurate clock to use. Additionally, I advise choosing one time zone for all devices (servers included) and sticking to it. Configuring each device with its local time zone sounds like a good idea until, again, you're trying to put those logs together, and suddenly it's a huge pain. Typically, I lean towards UTC (Coordinated Universal Time, despite the letters being in the wrong order), mainly because it does not implement summer time (daylight saving time), so it's consistent all year round.
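To show why mixed time zones hurt, here's a small Python sketch that takes log timestamps recorded in three different local zones and puts them on one UTC timeline. The hostnames, timestamps, and offsets are invented, and fixed offsets are used to keep the example self-contained; real logs would need proper time zone rules.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_timestamp: str, utc_offset_hours: float) -> datetime:
    """Convert a local 'YYYY-MM-DD HH:MM:SS' timestamp to an aware UTC datetime."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(local_timestamp, "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=tz).astimezone(timezone.utc)


# (host, local timestamp, offset from UTC in hours) - all hypothetical.
events = [
    ("nyc-rtr-01", "2017-06-01 09:15:02", -4),
    ("lon-sw-07", "2017-06-01 14:15:01", +1),
    ("tok-fw-02", "2017-06-01 22:15:03", +9),
]

# Sorting on the UTC value reveals the true order of events.
timeline = sorted((to_utc(ts, offset), host) for host, ts, offset in events)
for when, host in timeline:
    print(when.isoformat(), host)
```

Read at face value, the local timestamps look hours apart; in UTC the three events turn out to be within two seconds of each other.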


Encrypt all the things

Don't allow telnet to the device if you can use SSH instead. Don't run an HTTP server on the device if you can run HTTPS instead. Basically, if it's possible to avoid using an unencrypted protocol, that's the right choice. Don't just enable the encrypted protocol; go back and disable the unencrypted one. If you can run SSHv2 instead of SSHv1, you know what to do.


Password all the protocols

Not all protocols implement passwords perfectly, with some treating them more like SNMP strings. Nonetheless, consider using passwords (preferably using something like MD5) on any network protocols that support it, e.g., OSPF, BGP, EIGRP, NTP, VRRP, HSRP.


Change defaults

If I catch you with SNMP strings of public and private, I'm going to send you straight to the principal's office for a stern talking to. Seriously, this is so common and so stupid. It's worth scanning servers as well for this; quite often, if SNMP is running on a server, it's running the defaults.


Control access sources

Use the network operating system's features to control who can connect to the devices in the first place. This may take the form of a simple access list (e.g., a vty access-class in Cisco speak) or could fall within a wider Control Plane Policing (CoPP) policy, where the control for any protocol can be implemented. Access Control Lists (ACLs) aren't in themselves secure, but it's another step to overcome for any bad actor wishing to illicitly connect to the devices. If there are bastion management devices (aka jump boxes), perhaps make only those devices able to connect. Restrict from where SNMP commands can be issued. This all applies doubly for any internet-facing devices, where such protections are crucial. Don't allow management connections to a network device on an interface with a public IP. Basically, protect yourself at the IP layer as well as by using passwords and AAA.


Ideally, all devices would be managed using their dedicated management ports, accessed through a separate management network. However, not everybody has the funding to build an out-of-band management network, and many are reliant on in-band access.
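The source restriction described above is conceptually just a membership test: is the connecting address inside one of the permitted management networks? A minimal Python sketch, with made-up prefixes standing in for a jump box subnet and a single monitoring host:

```python
import ipaddress

# Hypothetical permitted sources: a jump box subnet and one monitoring host.
MGMT_SOURCES = [
    ipaddress.ip_network("10.0.100.0/28"),
    ipaddress.ip_network("192.0.2.10/32"),
]


def may_manage(source_ip: str) -> bool:
    """Return True if the source address falls inside a permitted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in MGMT_SOURCES)


for ip in ("10.0.100.5", "10.0.100.20", "192.0.2.10", "203.0.113.7"):
    print(ip, "permit" if may_manage(ip) else "deny")
```

A check like this is also handy offline, for example to confirm that every address in a proposed ACL change still covers the jump boxes before the change is pushed.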


Define security standards and audit yer stuff

It's really worth creating a standard security policy (with reference configurations) for the network devices, and then periodically auditing the devices against it. If a device goes out of compliance, is that a mistake or did somebody intentionally weaken the device security posture? Either way, just because a configuration was implemented once, it would be risky to assume it had remained in place from then on, so a regular check is worthwhile.
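An audit of this kind can start out very simply: a list of lines that must be present and a list of lines that must not be. The Python sketch below assumes Cisco-style configuration text; the REQUIRED and FORBIDDEN lists are examples I've picked to match the points made in this post, not a complete standard.

```python
# Illustrative standard: lines that must exist, and prefixes that must not.
REQUIRED = ["service password-encryption", "ip ssh version 2"]
FORBIDDEN = [
    "snmp-server community public",
    "snmp-server community private",
    "transport input telnet",
]


def audit(config: str) -> dict[str, list[str]]:
    """Report required lines that are missing and forbidden lines that are present."""
    lines = [line.strip() for line in config.splitlines()]
    missing = [req for req in REQUIRED if req not in lines]
    present = [bad for bad in FORBIDDEN if any(l.startswith(bad) for l in lines)]
    return {"missing": missing, "forbidden": present}


# Hypothetical device configuration pulled from a backup.
SAMPLE = """\
service password-encryption
snmp-server community public RO
line vty 0 4
 transport input telnet
"""

report = audit(SAMPLE)
print(report)
```

Run regularly against the latest backups, even a crude check like this will flag a device that has quietly drifted away from the standard.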


Remember why

Why are we doing all of this? The business runs over the network. If the network is impacted by a bad actor, the business can be impacted in turn. These steps are one part of a layered security plan; by protecting the underlying infrastructure, we help maintain the availability of the applications. Remember the security CIA triad —Confidentiality, Integrity, and Availability? The steps I have outlined above—and much more that I can think of—help maintain network availability and ensure that the network is not compromised. This means that we have a higher level of trust that the data we entrust to the network transport is not being siphoned off or altered in transit.


What steps do you take to keep your infrastructure secure?

Whatever business one might choose to examine, the network is the glue that holds everything together. Whether the network is the product (e.g. for a service provider) or simply an enabler for business operations, it is extremely important for the network to be both fast and reliable.


IP telephony and video conferencing have become commonplace, taking communications that previously required dedicated hardware and phone lines and moving them to the network. I have also seen many companies mothball their dedicated Storage Area Networks (SANs) and move them closer to Network Attached Storage, using iSCSI and NFS for data mounts. I also see applications utilizing cloud-based storage provided by services like Amazon's S3, which also depend on the network to move the data around. Put simply, the network is critical to modern companies.


Despite the importance of the network, many companies seem to have only a very basic understanding of their own network performance even though the ability to move data quickly around the network is key to success. It's important to set up monitoring to identify when performance is deviating from the norm, but in this post, I will share a few other thoughts to consider when looking at why network performance might not be what people expect it to be.



MTU (Maximum Transmission Unit) determines the largest frame of data that can be sent over an ethernet interface. It's important because every frame that's put on the wire contains overhead; that is, data that is not the actual payload. A typical ethernet interface might default to a physical MTU of around 1518 bytes, so let's look at how that might compare to a system that offers an MTU of 9000 bytes instead.


What's in a frame?

A typical TCP datagram has overhead like this:


  • Ethernet header (14 bytes)
  • IPv4 header (20 bytes)
  • TCP header (usually 20 bytes, up to 60 if TCP options are in play)
  • Ethernet Frame Check Sum (4 bytes)


That's a total of 58 bytes. The rest of the frame can be data itself, so that leaves 1460 bytes for data. The overhead for each frame represents just under 4% of the transmitted data.


The same frame with a 9000 byte MTU can carry 8942 bytes of data with just 0.65% overhead. Less overhead means that the data is sent more efficiently, and transfer speeds can be higher. Enabling jumbo frames (frames larger than 1500 bytes) and raising the MTU to 9000 if the hardware supports it can make a huge difference, especially for systems moving a lot of data around the network, such as Network Attached Storage.
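The arithmetic above is easy to reproduce. This Python sketch follows the same simplification as the text, treating 1518 and 9000 as the total frame size and expressing the overhead relative to the data carried; the function and dictionary names are my own.

```python
# Per-frame overhead in bytes, as listed above (TCP without options).
HEADERS = {"ethernet": 14, "ipv4": 20, "tcp": 20, "fcs": 4}


def frame_efficiency(frame_size: int) -> tuple[int, float]:
    """Return (data bytes per frame, overhead as a percentage of that data)."""
    overhead = sum(HEADERS.values())
    data = frame_size - overhead
    return data, 100 * overhead / data


for size in (1518, 9000):
    data, pct = frame_efficiency(size)
    print(f"{size}-byte frame: {data} bytes of data, {pct:.2f}% overhead")
```

Adding TCP options or extra encapsulations to HEADERS shows how quickly the small-frame case gets worse while the jumbo-frame case barely moves.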


What's the catch?

Not all equipment supports a high MTU because it's hardware dependent, although most modern switches I've seen can handle 9000-byte frames reasonably well. Within a data center environment, large MTU transfers can often be achieved successfully, with positive benefits to applications as a result.


However, Wide Area Networks (WANs) and the internet are almost always limited to 1500 bytes, and that's a problem because those 9000-byte frames won't fit into 1500 bytes. In theory, a router can break large packets up into appropriately sized smaller chunks (fragments) and send them over links with reduced MTU, but many firewalls are configured to block fragments, and many routers refuse to fragment because of the need for the receiver to hold on to all the fragments until they arrive, reassemble the packet, then route it toward its destination. The solution to this is PMTUD (Path MTU Discovery). When a packet doesn't fit on a link without being fragmented, the router can send a message back to the sender saying, "It doesn't fit, the MTU is..." Great! Unfortunately, many firewalls have not been configured to allow the ICMP messages back in, for a variety of technical or security reasons, but with the ultimate result of breaking PMTUD. One way around this is to use one ethernet interface on a server for traffic internal to a data center (like storage) using a large MTU, and another interface with a smaller MTU for all other traffic. Messy, but it can help if PMTUD is broken.


Other encapsulations

The ethernet frame encapsulations don't end there. Don't forget there might be an additional 4 bytes required for VLAN tagging over trunk links, VXLAN encapsulation (50 bytes), and maybe even GRE or MPLS encapsulations (4 bytes each). I've found that despite the slight increase in the ratio of overhead to data, 1460 bytes is a reasonably safe MTU for most environments, but it's very dependent on exactly how the network is set up.



I had a complaint one time that while file transfers between servers within the New York data center were nice and fast, when the user transferred the same file to the Florida data center (basically going from near the top to the bottom of the Eastern coast of the United States) transfer rates were very disappointing, and they said the network must be broken. Of course, maybe it was, but the bigger problem without a doubt was the time it took for an IP packet to get from New York to Florida, versus the time it takes for an IP packet to move within a data center.


AT&T publishes a handy chart showing their current U.S. network latencies between pairs of cities. The New York to Orlando pair currently shows a latency of 33ms, which is about what we were seeing on our internal network as well. Within a data center, I can move data in a millisecond or less, which is 33 times faster. What many people forget is that when using TCP, it doesn't matter how much bandwidth is available between two sites. A combination of end-to-end latency and congestion window (CWND) size will determine the maximum throughput for a single TCP session.


TCP session example

If it's necessary to transfer 100,000 files from NY to Orlando, which is faster:


  1. Transfer the files one by one?
  2. Transfer ten files in parallel?


It might seem that the outcome would be the same because a server with a 1G connection can only transfer 1Gbps, so whether you have one stream at 1Gbps or ten streams at 100Mbps, it's the same result. But actually, it isn't because the latency between the two sites will effectively limit the maximum bandwidth of each file transfer's TCP session. Therefore, to maximize throughput, it's necessary to utilize multiple parallel TCP streams (an approach taken very successfully for FTP/SCP transfers by the open source FileZilla tool). It's also the way that tools like those from Aspera can move data faster than a regular Windows file copy.
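A back-of-the-envelope calculation shows how hard latency bites. A single TCP session can have at most one window of data in flight per round trip, so its ceiling is roughly window size divided by round-trip time. The Python sketch below assumes a 65,535-byte window (no window scaling) and treats the 33ms figure as the round-trip time; both are assumptions chosen for illustration, and real stacks with window scaling will do better.

```python
def max_tcp_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound for one TCP session: one window of data per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1_000_000


def aggregate_mbps(streams: int, window_bytes: int, rtt_ms: float, link_mbps: float) -> float:
    """Parallel sessions add up until the link itself becomes the limit."""
    return min(link_mbps, streams * max_tcp_throughput_mbps(window_bytes, rtt_ms))


WINDOW = 65_535  # bytes; assumed window without scaling
LINK = 1000      # Mbps; the server's 1G connection

print(f"1 stream, 33ms:   {aggregate_mbps(1, WINDOW, 33, LINK):.1f} Mbps")
print(f"10 streams, 33ms: {aggregate_mbps(10, WINDOW, 33, LINK):.1f} Mbps")
print(f"1 stream, 1ms:    {aggregate_mbps(1, WINDOW, 1, LINK):.1f} Mbps")
```

Under these assumptions a single long-distance stream uses only a small fraction of the 1G link, which is why ten parallel transfers finish so much sooner than ten sequential ones.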


The same logic also applies to web browsers, which typically will open five or six parallel connections to a single site if there are sufficient resource requests to justify it. Of course, each TCP session requires a certain amount of overhead for connection setup: usually a three-way handshake, and if the session is encrypted, there may be a certificate or similar exchange to deal with as well. Another optimization that is available here is pipelining.



Pipelining uses a single TCP connection to issue multiple requests back to back. In HTTP, this is accomplished by the header Connection: keep-alive, which is the default behavior in HTTP/1.1. This header asks the destination server to keep the TCP connection open after completing the HTTP request in case the client has another request to make. Being able to do this allows the transfer of multiple resources with only a single TCP connection overhead (or, as many TCP connection overheads as there are parallel connections). Given that a typical web page may make many tens of calls to the same site (50+ is not unusual), this efficiency stacks up quite quickly. There's another benefit too, and that's the avoidance of TCP slow start.


TCP slow start

TCP is a reliable protocol. If a datagram (packet) is lost in transit, TCP can detect the loss and resend the data. To protect itself against unknown network conditions, however, TCP starts off each connection being fairly cautious about how much data it can send to the remote destination before getting confirmation back that each sent datagram was received successfully. With each successful loss-free confirmation, the sender exponentially increases the amount of data it is willing to send without a response, increasing the value of its congestion window (CWND). Packet loss causes CWND to shrink again, as does an idle connection during which TCP can't tell if network conditions changed, so to be safe it starts from a smaller number again. The problem is, as latency between endpoints increases, it takes progressively longer for TCP to get to its maximum CWND value, and thus longer to achieve maximum throughput. Pipelining can allow a connection to reach maximum CWND and keep it there while pushing multiple requests, which is another speed benefit.
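
As a rough illustration of why latency hurts here: if the congestion window doubles every round trip during slow start, the time needed to reach a given window is the number of doublings multiplied by the RTT. This sketch assumes an initial window of 10 segments, a target of 640 segments, and no packet loss. Those numbers are made up for illustration, and real TCP stacks are considerably more nuanced.

```python
def slow_start_rtts(initial_cwnd: int, target_cwnd: int) -> int:
    """Count the round trips needed for cwnd to reach the target,
    assuming it doubles once per round trip and nothing is lost."""
    cwnd, rtts = initial_cwnd, 0
    while cwnd < target_cwnd:
        cwnd *= 2
        rtts += 1
    return rtts


rtts = slow_start_rtts(initial_cwnd=10, target_cwnd=640)  # assumed values
for rtt_ms in (1, 30, 150):
    print(f"RTT {rtt_ms:>3} ms: {rtts * rtt_ms} ms to reach the target window")
```

The number of round trips is the same in every case; only the cost of each round trip changes, so the same ramp-up that is barely noticeable on a LAN becomes very visible on a long-haul link.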



I won't dwell on compression other than to say that it should be obvious that transferring compressed data is faster than transferring uncompressed data. For proof, ask any web browser or any streaming video provider.


Application vs network performance

Much of the TCP tuning and optimization that can take place is a server OS/application layer concern, but I mention it because even on the world's fastest network, an inefficiently designed application will still run inefficiently. If there is a load balancer front-ending an application, it may be able to do a lot to improve performance for a client by enabling compression or Connection: keep-alive, for example, even when an application does not.


Network monitoring

In the network itself, for the most part, things just work. And truthfully, there's not much one can do to make it work faster. However, the network devices should be monitored for packet loss (output drops, queue drops, and similar). One of the bigger causes of this is microbursting.



Modern servers are often connected using 10Gbps Ethernet, which is wonderful except they are often over-eager to send out frames. Data is prepared and buffered by the server, then BLUURRRGGGGHH it is spewed at the maximum rate into the network. Even if this burst of traffic is relatively short, at 10Gbps it can fill a port's frame buffer and overflow it before you know what's happened, and suddenly the later datagrams in the communication are being dropped because there's no more space to receive them. Anytime the switch can't move frames from input to output port at least as fast as they are coming in on a given port, the input buffer comes into play and is at risk of getting overfilled. These are called microbursts because a lot of data is sent over a very short period. Short enough, in fact, for it to be highly unlikely that it will ever be identifiable in the interface throughput statistics that we all like to monitor. Remember, an interface running at 100% for half the time and 0% for the rest will likely show up as running at 50% capacity in a monitoring tool. What's the solution? MOAR BUFFERZ?! No.
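
The averaging effect is easy to demonstrate. This sketch assumes a made-up traffic pattern (line rate for 5 ms, then idle for 5 ms, repeated for one second) and compares what a one-second average reports against the busiest 1 ms slot.

```python
LINK_BPS = 10_000_000_000  # 10Gbps port


def average_utilization(samples_bps: list) -> float:
    """Average utilization (0.0 to 1.0) across equal-length sample slots."""
    return sum(samples_bps) / (len(samples_bps) * LINK_BPS)


# Assumed pattern: 5 ms at line rate, 5 ms idle, repeated 100 times (1 second).
one_ms_slots = ([LINK_BPS] * 5 + [0] * 5) * 100

print(f"average over the second: {average_utilization(one_ms_slots):.0%}")
print(f"busiest 1 ms slot:       {max(one_ms_slots) / LINK_BPS:.0%}")
```

A monitoring tool polling once a second (let alone every five minutes) only ever sees the first number, which is why drop counters are a better indicator of microbursts than throughput graphs.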


Buffer bloat

I don't have space to go into detail here, so let me point you to a site that explains buffer bloat, and why it's a problem. The short story is that adding more buffers in the path can actually make things worse because it actively works against the algorithms within TCP that are designed to handle packet loss and congestion issues.


Monitor capacity

It sounds obvious, but a link that is fully utilized will lead to slower network speeds, whether through higher delays via queuing, or packet loss leading to connection slowdowns. We all monitor interface utilization, right? I thought so.


The perfect network

There is no perfect network, let's be honest. However, having an understanding not only of how the network itself (especially latency) can impact throughput, but also of the way the network is used by the protocols running over it, might help with the next complaint that comes along. Optimizing and maintaining network performance is rarely a simple task, but given the network's key role in the business as a whole, the more we understand, the more we can deliver.


While not a comprehensive guide to all aspects of performance, I hope that this post might have raised something new, confirmed what you already know, or just provided something interesting to look into a bit more. I'd love to hear your own tales of bad network performance reports, application design stupidity, crazy user/application owner expectations (usually involving packets needing to exceed the speed of light) and hear how you investigated and hopefully fixed them!

It sounds obvious, perhaps, but without configurations, our network, compute, and storage environments won't do very much for us. Configurations develop over time as we add new equipment, change architectures, improve our standards, and deploy new technologies. The sum of knowledge within a given configuration is quite high. Despite that, many companies still don't have any kind of configuration management in place, so in this article, I will outline some reasons why configuration management is a must, and look at some of the benefits that come with having it.


Recovery from total loss

As THWACK users, I think we're all pretty technically savvy, yet if I were to ask right now if you had an up-to-date backup of your computer and its critical data, what would the answer be? If your laptop's hard drive died right now, how much data would be lost after you replaced it?


Our infrastructure devices are no different. Every now and then a device will die without warning, and the replacement hardware will need to have the same configuration that the (now dead) old device had. Where's that configuration coming from?


Total loss is perhaps the most obvious reason to have a system of configuration backups in place. Configuration management is an insurance policy against the worst eventuality, and it's something we should all have in place. Potential ways to achieve this include:



At a minimum, having the current configuration safely stored on another system is of value. Some related thoughts on this:


  • Make sure you can get to the backup system when a device has failed.
  • Back up / mirror / help ensure redundancy of your backup system.
  • If "rolling your own scripts," make sure that, say, a failed login attempt doesn't overwrite a valid configuration file (he said, speaking from experience). In other words, some basic validation is required to make sure that the script output is actually a configuration file and not an error message.
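
As a sketch of that last point, the validation can be as simple as checking that the retrieved text is long enough and contains strings that always appear in a real configuration. The marker strings and the minimum line count below are assumptions for illustration; pick values that suit your own devices.

```python
import os
import tempfile
from pathlib import Path


def looks_like_config(text: str, min_lines: int = 5) -> bool:
    """Crude check: enough non-blank lines plus at least one expected marker.

    The markers are assumptions; use strings that always appear in your
    own devices' configurations.
    """
    markers = ("hostname", "host-name", "interface")
    lines = [line for line in text.splitlines() if line.strip()]
    return len(lines) >= min_lines and any(m in text for m in markers)


def save_backup(path: Path, text: str) -> bool:
    """Replace the stored config only if the new text passes validation."""
    if not looks_like_config(text):
        return False  # leave the previous good copy untouched
    # Write to a temporary file first, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent))
    with os.fdopen(fd, "w") as tmp:
        tmp.write(text)
    os.replace(tmp_name, str(path))
    return True
```

A failed login that returns a one-line error message is rejected, and the last valid backup survives.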



Better than a copy of the current configurations, a configuration archive tracks all -- or some number of -- the previous configurations for a device.


An archive gives us the ability to see what changes occurred to the configuration and when. If a device doesn't support configuration rollback natively, it may be possible to create a kind of rollback script based on the difference between the two latest configurations. If the configuration management tool (or other systems) can react to SNMP traps that indicate a configuration change, the archive can be kept very current by triggering a grab of the configuration as soon as a change is noted.
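
Seeing what changed between two archived configurations needs nothing more than a text diff. Here is a minimal sketch using Python's standard difflib module; the two configuration snippets are invented for the example. Turning the diff into a safe rollback script is device-specific and is not attempted here.

```python
import difflib


def config_diff(old: str, new: str) -> list:
    """Return unified-diff lines between two configuration snapshots."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="latest", lineterm="",
    ))


# Invented snapshots standing in for two entries from the archive.
previous = "hostname sw1\nntp server 10.1.1.1\nlogging host 10.2.2.2\n"
latest = "hostname sw1\nntp server 10.1.1.1\nlogging host 10.3.3.3\n"

for line in config_diff(previous, latest):
    print(line)
```

An empty result means nothing changed, which also makes this a cheap way to decide whether a new archive entry is worth storing at all.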


Further, home-grown scripts or configuration management products can easily identify device changes and generate notifications and alerts when changes occur. This can provide an early warning of unauthorized configurations or changes made outside scheduled maintenance windows.


Compliance / Audit

Internal Memo

We need confirmation that all your devices are sending their syslogs to these seventeen IP addresses.


-- love from, Your Friendly Internal Security Group xxx

"Putting the 'no' in Innovation since 2003"


A request like this can be approached in a couple of different ways. Without configuration management, it's necessary to log in to each device and check the syslog server configuration. With a collection of stored configurations, however, checking this becomes a matter of processing configuration files. Even grepping them could extract the necessary information. I've written my own tools to do the same thing, using configuration templates to allow support for the varying configuration stanzas used by different flavors of vendor and OS to achieve the same thing.
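
Here is a sketch of the "grep the stored configurations" approach. It assumes IOS-style "logging host" lines and a two-address list of required servers; both are assumptions for illustration, and other vendors would need their own patterns. The configurations are inlined so the example stands alone, but in practice they would be read from the archive.

```python
import re

REQUIRED_SYSLOG = {"10.2.2.2", "10.3.3.3"}  # assumed list from the security team
# Assumed IOS-style syntax; other platforms need their own pattern.
SYSLOG_RE = re.compile(r"^logging host (\S+)", re.MULTILINE)


def missing_syslog_servers(config_text: str) -> set:
    """Return the required syslog servers that a configuration lacks."""
    configured = set(SYSLOG_RE.findall(config_text))
    return REQUIRED_SYSLOG - configured


configs = {  # invented examples; normally loaded from stored config files
    "sw1": "hostname sw1\nlogging host 10.2.2.2\nlogging host 10.3.3.3\n",
    "sw2": "hostname sw2\nlogging host 10.2.2.2\n",
}
for device, text in sorted(configs.items()):
    missing = missing_syslog_servers(text)
    print(device, "OK" if not missing else f"missing {sorted(missing)}")
```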


Some tools (SolarWinds NCM is one of them) can also compare the latest configuration against a configuration snippet and report back on compliance. This kind of capability makes configuration audits extremely simple.


Even without a security group making requests, the ability to audit configurations against defined standards is an important capability to have. Having discussed the importance of configuration consistency, it seems like a no-brainer to want a tool of some sort to help ensure that the carefully crafted standards have been applied everywhere.


Pushing configuration to devices

I'm never quite sure whether the ability to issue configuration commands to devices falls under automation or configuration management, but I'll mention it briefly here since NCM includes this capability. I believe I've said in a previous Geek Speak post that it's abstractions that are most useful to most of us. I don't want to write the code to log into a device and deal with all the different prompts and error conditions. Instead, I'd much rather hand off to a tool that somebody else wrote and say, "Send this. Lemme know how it goes." If you have the ability to do that and you aren't the one who has to support it, take that as a win. And while you're enjoying the golden trophy, give some consideration to my next point.


Where is your one true configuration source?

Why do we fall into the trap of using hardware devices as the definitive source of each configuration? Bearing in mind that most of us claim that we're working toward building a software-defined network of some sort, it does seem odd that the configuration sits on the device. Why does it not sit in a database or other managed repository, with the device programmed based on the latest approved configuration in that repo?


Picture this for example:


  • Configurations are stored in a git repo.
  • Network engineers fork the repo so they have a local copy.
  • When a change is required, the engineer makes the necessary changes to their fork, then issues a pull request back to the main repo.
  • Pull requests can be reviewed as part of the Change Control process, and if approved, the pull request is accepted and merged into the configuration.
  • The repo update triggers the changes to be propagated to the end device.


Such a process would give us a configuration archive with a complete (and commented) audit trail for each change made. Additionally, if the device fails, the latest configuration is in the git repo, not on the device, so by definition, it's available for use when setting up the replacement device. If you're really on the ball, it may be possible to do some form of integration testing/syntax validation of the change prior to accepting the pull request.


There are some gotchas with this, not the least of which is that going from a configuration diff to something you can safely deploy on a device may not be as straightforward as it first appears. That said, thanks to commands like Junos' load replace and load override and IOS XR's commit replace, such things are made a little easier.


The point of this is not really to get into the implementation details, but more to raise the question of how we think about network device configurations in particular. Compute teams get it; using tools like Puppet and Chef to build and maintain the state of a server OS, it's possible to rebuild an identical server. The same applies to building images in Docker. The configuration should not be within the image because it's housed in the Dockerfile. So why not network devices, too? I'm sure you'll tell me, and I welcome it.


Get. Configuration. Management. Don't risk being the person everybody feels pity for after their hard drive crashes.

As a network engineer, I don't think I've ever had the pleasure of having every device configured consistently in a network. But what does that even mean? What is consistency when we're potentially talking about multiple vendors and models of equipment?


There Can Only Be One (Operating System)


Claim: For any given model of hardware there should be one approved version of code deployed on that hardware everywhere across an organization.


Response: And if that version has a bug, then all your devices have that bug. This is the same basic security paradigm that leads us to have multiple firewall tiers comprising different vendors for extra protection against bugs in one vendor's code. I get it, but it just isn't practical. The reality is that it's hard enough upgrading device software to keep up with critical security patches, let alone doing so while maintaining multiple versions of code.

Why do we care? Because different versions of code can behave differently. Default command options can change between versions; previously unavailable options and features are added in new versions. Basically, having a consistent revision of code running means that you have a consistent platform on which to make changes. In most cases, that is probably worth the relatively rare occasions on which a serious enough bug forces an emergency code upgrade.


Corollary: The approved code version should be changing over time, as necessitated by feature requirements, stability improvements, and critical bugs. To that end, developing a repeatable method by which to upgrade code is kind of important.


Consistency in Device Management


Claim: Every device type should have a baseline template that implements a consistent management and administration configuration, with specific localized changes as necessary. For example, a template might include:


  • NTP / time zone
  • Syslog
  • SNMP configuration
  • Management interface ACLs
  • Control plane policing
  • AAA (authentication, authorization, and accounting) configuration
  • Local account if AAA authentication server fails*


(*) There are those who would argue, quite successfully, that such a local account should have a password unique to each device. The password would be extracted from a secure location (a break glass type of repository) on demand when needed and changed immediately afterward to prevent reuse of the local account. The argument is that if a shared password is compromised, every device is left exposed. I agree, and I tip my hat to anybody who successfully implements this.


Response: Local accounts are for emergency access only because we all use a centralized authentication service, right? If not, why not? Local accounts for users are a terrible idea, and have a habit of being left in place for years after a user has left the organization.


NTP is a must for all devices so that syslog/SNMP timestamps are synced up. Choose one time zone (I suggest UTC) and implement it on your devices worldwide. Using a local time zone is a guaranteed way to mess up log analysis the first time a problem spans time zones; whatever time zone makes the most sense, use it, and use it everywhere. The same time zone should be configured in all network management and alerting software.


Other elements of the template are there to make sure that the same access is available to every device. Why wouldn't you want to do that?


Corollary: Each device and software version could have its own limitations, so multiple templates will be needed, adapted to the capabilities of each device.


Naming Standards


Claim: Pick a device naming standard and stick with it. If it's necessary to change it, go back and change all the existing devices as well.


Response: I feel my hat tipping again, but in principle this is a really good idea. I did work for one company where all servers were given six-letter dictionary words as their names, a policy driven by the security group who worried that any kind of semantically meaningful naming policy would reveal too much to an attacker. Fair play, but having to remember that the syslog servers are called WINDOW, BELFRY, CUPPED, and ORANGE is not exactly friendly. Particularly in office space, it can really help to be able to identify which floor or closet a device is in. I personally lean toward naming devices by role (e.g. leaf, access, core, etc.) and never by device model. How many places have switches called Chicago-6500-01 or similar? And when you upgrade that switch, what happens? And is that 6500 a core, distribution, access, or maybe a service-module switch?


Corollary: Think the naming standard through carefully, including giving thought to future changes.


Why Do This?


There are more areas that could and should be consistent. Maybe consider things like:


  • an interface naming standard
  • standard login banners
  • routing protocol process numbers
  • VLAN assignments
  • BFD parameters
  • MTU (oh my goodness, yes, MTU)


But why bother? Consistency brings a number of obvious operational benefits.


  • Configuring a new device using a standard template means a security baseline is built into the deployment process
  • Consistent administrative configuration reduces the number of devices which, at a critical moment in troubleshooting, turn out to be inaccessible
  • Logs and events are consistently and accurately timestamped
  • Things work, in general, the same way everywhere
  • Every device looks familiar when connecting
  • Devices are accessible, so configurations can be backed up into a configuration management tool, and changes can be pushed out, too
  • Configuration audit becomes easier


The only way to know if the configurations are consistent is to define a standard and then audit against it. If things are set up well, such an audit could even be automated. After a software upgrade, run the audit tool again to help ensure that nothing was lost or altered during the process.


What does your network look like? Is it consistent, or is it, shall we say, a product of organic growth? What are the upsides -- or downsides -- to consistency like this?

You may be wondering why, after creating four blog posts encouraging non-coders to give it a shot, select a language and break down a problem into manageable pieces, I would now say to stop. The answer is simple, really: not everything is worth automating (unless, perhaps, you are operating at a similar scale to somebody like Amazon).


The 80-20 Rule


Here's my guideline: figure out what tasks take up the majority (i.e. 80%) of your time in a given time period (in a typical week perhaps). Those are the tasks where making the time investment to develop an automated solution is most likely to see a payback. The other 20% are usually much worse candidates for automation, where the cost of automating them likely outweighs the time savings.


As a side note, the tasks that take up the time may not necessarily be related to a specific work request type. For example, I may spend 40% of my week processing firewall requests, and another 20% processing routing requests, and another 20% troubleshooting connectivity issues. In all of these activities, I spend time identifying what device, firewall zone, or VRF various IP addresses are in, so that I can write the correct firewall rule, or add routing in the right places, or track next-hops in a traceroute where DNS is missing. In this case, I would gain the most immediate benefits if I could automate IP address research.
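
A tool like that doesn't have to be elaborate to be useful. This sketch uses Python's standard ipaddress module to find the most specific known subnet for an address. The inventory table (subnets, device names, zones, and VRFs) is entirely hypothetical; a real one would come from an IPAM system, routing tables, or firewall configurations.

```python
import ipaddress

# Hypothetical inventory: subnet -> (device, firewall zone, VRF).
SUBNETS = {
    "10.1.0.0/16": ("core-01", "inside", "CORP"),
    "10.1.20.0/24": ("access-07", "inside", "CORP"),
    "192.0.2.0/24": ("fw-edge-01", "dmz", "DMZ"),
}


def lookup(ip: str):
    """Return the record for the most specific subnet containing ip,
    or None if no known subnet matches."""
    addr = ipaddress.ip_address(ip)
    matches = []
    for subnet, info in SUBNETS.items():
        net = ipaddress.ip_network(subnet)
        if addr in net:
            matches.append((net.prefixlen, info))
    if not matches:
        return None
    # Longest prefix wins, just as it would in a routing table.
    return max(matches, key=lambda m: m[0])[1]


print(lookup("10.1.20.15"))
print(lookup("192.0.2.9"))
print(lookup("172.16.0.1"))
```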


I don't want to be misunderstood; there is value in creating process and automation around how a firewall request comes into the queue, for example, but the value overall is lower than for a tool that can tell me lots of information about an IP address.


That Seems Obvious


You'd think that it was intuitive that we would do the right thing, but sometimes things don't go according to plan:


Feeping Creatures!


Once you write a helpful tool or an automation, somebody will come back and say, "Ah, what if I need to know X information too? I need that once a month when I do the Y report." As a helpful person, it's tempting to immediately try and adapt the code to cover every conceivable corner case and usage example, but having been down that path, I counsel against doing so. It typically makes the code unmanageably complex due to all the conditions being evaluated and worse, it goes firmly against the 80-20 rule above. Feeping Creatures is a Spoonerism referring to Creeping Features, i.e. an ever-expanding feature list for a product.


A Desire to Automate Everything


There's a great story in Surely You're Joking, Mr. Feynman! (Richard Feynman) that talks about Mr. Frankel, who had developed a system using a suite of IBM machines to run the calculations for the atomic bomb that was being developed at Los Alamos.


"Well, Mr. Frankel, who started this program, began to suffer from the computer disease that anybody who works with computers now knows about. [...] Frankel wasn't paying any attention; he wasn't supervising anybody. [...] (H)e was sitting in a room figuring out how to make one tabulator automatically print arctangent X, and then it would start and it would print columns and then bitsi, bitsi, bitsi, and calculate the arc-tangent automatically by integrating as it went along and make a whole table in one operation.


Absolutely useless. We had tables of arc-tangents. But if you've ever worked with computers, you understand the disease -- the delight in being able to see how much you can do. But he got the disease for the first time, the poor fellow who invented the thing."


It's exciting to automate things or to take a task that previously took minutes, and turn it into a task that takes seconds. It's amazing to watch the 80% shrink down and down and see productivity go up. It's addictive. And so, inevitably, once one task is automated, we begin looking for the next task we can feel good about, or we start thinking of ways we could make what we already did even better. Sometimes the coder is the source of creeping features.


It's very easy to lose touch with the larger picture and stop focusing on the tasks that will generate measurable gains. I've fallen foul of this myself in the past, and have been delighted, for example, with a script I spent four days writing, which pulled apart log entries from a firewall and ran all kinds of analyses on them, allowing you to slice the data any which way and generate statistics. Truly amazing! The problem is, I didn't have a use for most of the stats I was able to produce, and actually I could have fairly easily worked out the most useful ones in Excel in about 30 minutes. I got caught up in being able to do something, rather than actually needing to do it.


And So...


Solve A Real Problem


Despite my cautions above, I maintain that the best way to learn to code is to find a real problem that you want to solve and try to write code to do it. Okay, there are some cautions to add here, not the least of which is to run tests and confirm the output. More than once, I've written code that seemed great when I ran it on a couple of lines of test data, but then when I ran it on thousands of lines of actual data, I discovered oddities in the input data, or bugs in the loop that processes all the data, such as carelessly reused variables. Just like I tell my kids with their math homework, sanity check the output. If a script claims that a 10Gbps link was running at 30Gbps, maybe there's a problem with how that figure is being calculated.
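
A sanity check like that can be built into the code itself. In this small sketch, the function name and the error wording are my own for illustration; the idea is simply to refuse to return a result that is physically impossible.

```python
def link_utilization(bits_sent: int, seconds: float, link_bps: int) -> float:
    """Return utilization as a fraction of link capacity, rejecting
    results that cannot be true."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    fraction = bits_sent / seconds / link_bps
    if fraction > 1.0:
        raise ValueError(
            f"computed {fraction:.0%} of link capacity; check the counters, "
            "the units (bits vs bytes), and the polling interval"
        )
    return fraction


# 3 Gbit sent in one second on a 10Gbps link.
print(f"{link_utilization(3_000_000_000, 1.0, 10_000_000_000):.0%}")
```

Failing loudly is the point: a script that raises an error gets investigated, while one that quietly reports 300% utilization ends up in a report.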


Don't Be Afraid to Start Small


Writing a Hello World! script may feel like one of the most pointless activities you may ever undertake, but for a total beginner, it means something was achieved and, if nothing else, you learned how to output text to the screen. The phrase, "Don't try to boil the ocean," speaks to this concept quite nicely, too.


Be Safe!


If your ultimate aim is to automate production device configurations or orchestrate various APIs to dance to your will, that's great, but don't start off by testing your scripts in production. Use device VMs where possible to develop interactions with different pieces of software. I also recommend starting by working with read commands before jumping right in to the potentially destructive stuff. After all, after writing a change to a device, it's important to know how to verify that the change was successful. Developing those skills first will prove useful later on.


Learn how to test for, detect, and safely handle errors that arise along the way, particularly the responses from the devices you are trying to control. Sanitize your inputs! If your script expects an IPv4 address as an input, validate that what you were given is actually a valid IPv4 address. Add your own business rules to that validation if required (e.g. a script might only work with 10.x.x.x addresses, and all other IPs require human input). The phrase "Garbage in, garbage out" is all too true when humans provide the garbage.
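
The IPv4 example can be sketched with Python's standard ipaddress module. The 10.0.0.0/8 rule mirrors the business rule mentioned above; the function name and error messages are just illustrative choices.

```python
import ipaddress

ALLOWED = ipaddress.ip_network("10.0.0.0/8")  # the example business rule


def validate_ipv4(text: str) -> ipaddress.IPv4Address:
    """Return the parsed address, or raise ValueError explaining why not."""
    try:
        addr = ipaddress.IPv4Address(text.strip())
    except ipaddress.AddressValueError as exc:
        raise ValueError(f"not a valid IPv4 address: {text!r}") from exc
    if addr not in ALLOWED:
        raise ValueError(f"{addr} is outside {ALLOWED}; needs human review")
    return addr


for candidate in ("10.1.2.3", "10.1.2", "192.168.1.1"):
    try:
        print(candidate, "->", validate_ipv4(candidate))
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```

Letting the library do the parsing avoids the classic hand-rolled regular expression that happily accepts 999.999.999.999.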


Scale Out Carefully


To paraphrase a common saying, automation allows you to make mistakes on hundreds of devices much faster than you could possibly do it by hand. Start small with a proof of concept, and demonstrate that the code is solid. Once there's confidence that the code is reliable, it's more likely to be accepted for use on a wider scale. That leads neatly into the last point:


Good Luck Convincing People


It seems to me that everybody loves scripting and automation right up to the point where it needs to be allowed to run autonomously. Think of it like the Google autonomous car: for sure, the engineering team was pretty confident that the code was fairly solid, but they wouldn't let that car out on the highway without human supervision. And so it is with automation; when the results of some kind of process automation can be reviewed by a human before deployment, that appears to be an acceptable risk from a management team's perspective. Now suggest that the human intervention is no longer required, and that the software can be trusted, and see what response you get.


A coder I respect quite a bit used to talk about blast radius, or what's the impact of a change beyond the box on which the change is taking place? Or what's the potential impact of this change as a whole? We do this all the time when evaluating change risk categories (is it low, medium, or high?) by considering what happens if a change goes wrong. Scripts are no different. A change that adds an SNMP string to every device in the network, for example, is probably fairly harmless. A change that creates a new SSH access-list, on the other hand, could end up locking everybody out of every device if it is implemented incorrectly. What impact would that have on device management and operations?




I really recommend giving programming a shot. It isn't necessary to be a hotshot coder to have success (trust me, I am not a hotshot coder), but having an understanding of coding will, I believe, positively impact other areas of your work. Sometimes a programming mindset can reveal ways to approach problems that didn't show themselves before. And while you're learning to code, if you don't already know how to work in a UNIX (Linux, BSD, macOS, etc.) shell, that would be a great stretch goal to add to your list!


I hope that this mini-series of posts has been useful. If you do decide to start coding, I would love to hear back from you on how you got on, what challenges you faced and, ultimately, if you were able to code something (no matter how small) that helped you with your job!

In this post, part of a miniseries on coding for non-coders, I thought it might be interesting to look at a real-world example of breaking a task down for automation. I won't be digging hard into the actual code but instead looking at how the task could be approached and turned into a sequence of events that will take a sad task and transform it into a happy one.


The Task - Deploying a New VLAN


Deploying a new VLAN is simple enough, but in my environment it means connecting to around 20 fabric switches to build the VLAN. I suppose one solution would be to use an Ethernet fabric that had its own unified control plane, but ripping out my Cisco FabricPath™ switches would take a while, so let's just put that aside for the moment.


When a new VLAN is deployed, it almost always also requires that a layer 3 (IP) gateway with HSRP is created on the routers and that VLAN needs to be trunked from the fabric edge to the routers. If I can automate this process, for every VLAN I deploy, I can avoid logging in to 22 devices by hand, and I can also hopefully complete the task significantly faster.


Putting this together, I now have a list of three main steps I need to accomplish:


  1. Create the VLAN on every FabricPath switch
  2. Trunk the VLAN from the edge switches to the router
  3. Create the L3 interface on the routers, and configure HSRP


Don't Reinvent the Wheel


Much in the same way that one uses modules when coding to avoid rewriting something that has been created already, I believe that the same logic applies to automation. For example, I run Cisco Data Center Network Manager (DCNM) to manage my Ethernet fabric. DCNM has the capability to deploy changes (it calls them Templates) to the fabric on demand. The implementation of this feature involves DCNM creating an SSH session to the device and configuring it just like a real user would. I could, of course, implement the same functionality for myself in my language of choice, but why would I? Cisco has spent time making the deployment process as bulletproof as possible; DCNM recognizes error messages and can deal with them. DCNM also has the logic built in to configure all the switches in parallel, and in the event of an error on one switch, to either roll back that switch alone or all switches in the change. I don't want to have to figure all that out for myself when DCNM already does it.


For the moment, therefore, I will use DCNM to deploy the VLAN configurations to my 20 switches. Ultimately it might be better if I had full control and no dependency on a third-party product, but in terms of achieving the goal rapidly, this works for me. To assist with trunking VLANs toward the routers, in my environment the edge switches facing the routers have a unique name structure, so I was also able to tweak the DCNM template so that if it detects that it is configuring one of those switches, it also adds the VLANs to the trunked list on the relevant router uplinks. Again, that's one less task I'll have to do in my code.


Similarly, to configure the routers (IOS XR-based), I could write a Python script based on the Paramiko SSH library, or use the Pexpect library to launch ssh and control the program's actions based on what it sees in the session. Alternatively, I could use NetMiko which already understands how to connect to an IOS XR router and interact with it. The latter choice seems like it's preferable, if for no other reason than to speed up development.


Creating the VLAN


DCNM has a REST API through which I can trigger a template deployment. All I need is a VLAN number and an optional description, and I can feed that information to DCNM and let it run. First, though, I need the list of devices on which to apply the configuration template. This information can be retrieved using another REST API call. I can then process the list, apply the VLAN/Description to each item and submit the configuration "job." After submitting the request, assuming success, DCNM will return the JobID that was created. That's handy because it will be necessary to keep checking the status of that JobID afterward to see if it succeeded. So here are the steps so far:


  • Get VLAN ID and VLAN Description from user
  • Retrieve list of devices to which the template should be applied
  • Request a configuration job
  • Request job status until it has some kind of resolution (Success, Failed, etc)


Sound good? Wait; the script needs to log in as well. In the DCNM REST API that means authenticating to a particular URL, receiving a token (a string of characters), then using that token as a cookie in all future requests within that session. Also, as a good citizen, the script should log out after completing its requests, so the list now reads:

  • Get VLAN ID and VLAN Description from user
  • Authenticate to DCNM and extract session token
  • Retrieve list of devices to which the template should be applied
  • Request a configuration job
  • Request job status until it has some kind of resolution (Success, Failed, etc)
  • Log out of DCNM


That should work for the VLAN creation but I'm also missing a crucial step which is to sanitize and validate the inputs provided to the script. I need to ensure, for example, that:


  • VLAN ID is in the range 1-4094, but for legacy Cisco purposes perhaps, does not include 1002-1005
  • VLAN Description must be 63 characters or fewer, and the rules I want to apply will only allow [a-z], [A-Z], [0-9], dash [-] and underscore [_]; no spaces or other odd characters


Maybe the final list looks like this then:


  • Get VLAN ID and VLAN Description from user
  • Confirm that VLANID and VLAN Description are valid
  • Authenticate to DCNM and extract session token
  • Retrieve list of devices to which the template should be applied
  • Request a configuration job
  • Request job status until it has some kind of resolution (Success, Failed, etc)
  • Log out of DCNM
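
The "request job status until it has some kind of resolution" step is a polling loop. Here is a minimal sketch of it in Python, assuming the status check is wrapped in a function that returns a string. The state names SUCCESS and FAILED, the timings, and the get_status callable are placeholders for illustration, not values taken from the DCNM API.

```python
import time


def wait_for_job(get_status, job_id, interval=5.0, timeout=300.0,
                 final_states=("SUCCESS", "FAILED")):
    """Poll get_status(job_id) until it returns a final state or we time out.

    get_status is any callable returning the job's current state as a string;
    in a real script it would wrap the job-status REST call. The state names
    used here are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_status(job_id)
        if state in final_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state!r} after {timeout}s")
        time.sleep(interval)


# Stand-in for the real status call: reports RUNNING twice, then SUCCESS.
replies = iter(["RUNNING", "RUNNING", "SUCCESS"])
print(wait_for_job(lambda job_id: next(replies), job_id=42, interval=0.01))
```

Passing the status check in as a function keeps the loop independent of how the REST call is made, which also makes it easy to test without a live DCNM server.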


Configuring IOS XR


In this example, I'll use Python+NetMiko to do the hard work for me. My inputs are going to be:


  • IPv4 Subnet and prefix length
  • IPv6 Subnet and prefix length
  • L3 Interface Description


As before, I will sanity check the data provided to ensure that the IPs are valid. I have found that IOS XR's configuration for HSRP, while totally logical and elegantly hierarchical, is a bit of a mouthful to type out, so to speak. It is therefore great to have a script take basic information like a subnet and apply some standard rules to it: the second IP in the subnet is the HSRP gateway (e.g. .1 on a /24 subnet), the next address up (e.g. .2) goes on the A router, and the one after that (.3) goes on the B router. For my HSRP group number, I use the VLAN ID. The subinterface number where I'll be configuring layer 3 matches the VLAN ID too, and with that information I can also configure the HSRP BFD peer between the routers. By applying some simple standardized templating, I can take a bare minimum of information from the user and create configurations that would take much longer to create manually and that quite often (based on my own experience) would have mistakes in them.
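
The addressing rules can be expressed with Python's standard ipaddress module. This is only a sketch of the convention described here (gateway first, then router A, then router B). The function name, the dictionary keys, and the minimum subnet size check are my own assumptions.

```python
import ipaddress


def hsrp_address_plan(subnet):
    """Return the gateway, A-router and B-router addresses for a subnet.

    Convention: the address after the network address is the HSRP virtual
    gateway, the next is router A, and the one after that is router B.
    Works for both IPv4 and IPv6 subnets.
    """
    # Raises ValueError if the text is not a valid subnet (e.g. host bits set).
    net = ipaddress.ip_network(subnet)
    # Assumed sanity check: need room for three addresses plus network/broadcast.
    if net.num_addresses < 8:
        raise ValueError(f"{net} is too small for a gateway and two routers")
    return {
        "gateway": f"{net[1]}/{net.prefixlen}",
        "router_a": f"{net[2]}/{net.prefixlen}",
        "router_b": f"{net[3]}/{net.prefixlen}",
    }


print(hsrp_address_plan("10.1.20.0/24"))
print(hsrp_address_plan("2001:db8:20::/64"))
```

Because ip_network() rejects malformed input, the same call covers the "sanity check" step for the subnet as well as the address calculation.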


The process then might look like this:


  • Get IPv4 subnet, IPv6 subnet, VLAN ID and L3 interface description from user
  • Confirm that IPv4 subnet, IPv6 subnet, VLANID and interface description are valid
  • Generate templated configuration for the A and B routers
  • Create session to A router and authenticate
  • Take a snapshot of the configuration
  • Apply changes (check for errors)
  • Assuming success, log out
  • Rinse and repeat for B router


Breaking Up is Easy


Note that the sequences of actions above have been created without requiring any coding. Implementation can come next, in the preferred language, but if we don't have an idea of where we're going, especially as a new coder, it's likely that the project will go wrong very quickly.


For implementation, I now have a list of tasks which I can attack, to some degree, separately from one another; each one is a kind of milestone. Looking at the DCNM process again:


  • Get VLAN ID and VLAN Description from user


Perhaps this data comes from a web page, but for the purposes of my script, I will assume that these values are provided as arguments to the script. For reference, an argument is anything that comes after the name of the script when you type it on the command line, e.g. in the command sayhello.py John, the program sayhello.py would see one argument, with a value of John.
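
As an illustration, Python's standard argparse module can collect those arguments. This is a sketch under my own assumptions: the argument names are invented, and the description is treated as optional. The argument list is passed in explicitly so the example runs anywhere; a real script would call parse_args() with no arguments to read the actual command line.

```python
import argparse


def parse_args(argv=None):
    """Parse the VLAN ID and optional description from the command line."""
    parser = argparse.ArgumentParser(description="Create a VLAN")
    parser.add_argument("vlan_id", type=int, help="VLAN number, e.g. 100")
    parser.add_argument("description", nargs="?", default="",
                        help="optional VLAN description")
    return parser.parse_args(argv)


# Equivalent to typing: <script name> 100 web_servers
args = parse_args(["100", "web_servers"])
print(args.vlan_id, args.description)   # 100 web_servers
```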


  • Confirm that VLANID and VLAN Description are valid


This sounds like a perfect opportunity to write a function/subroutine which can take a VLAN ID as its own argument, and will return a boolean (true/false) value indicating whether or not the VLAN ID is valid. Similarly, a function could be written for the description, either to enforce the allowed characters by removing anything that doesn't match, or by simply validating whether what's provided meets the criteria or not. These may be useful in other scripts later too, so writing a simple function now may save time later on.
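
Such functions might look something like the following sketch, which applies the rules listed earlier (VLAN 1-4094 excluding 1002-1005; descriptions of up to 63 letters, digits, dashes, and underscores). The function names are my own, and allowing an empty description is an assumption based on the description being optional.

```python
import re

# Letters, digits, dash and underscore only; 0 to 63 characters.
# An empty description is allowed here because the description is optional.
_DESCRIPTION_RE = re.compile(r"[A-Za-z0-9_-]{0,63}")


def is_valid_vlan_id(vlan_id):
    """Return True if vlan_id is an integer in 1-4094, excluding 1002-1005."""
    if isinstance(vlan_id, bool) or not isinstance(vlan_id, int):
        return False
    return 1 <= vlan_id <= 4094 and not 1002 <= vlan_id <= 1005


def is_valid_description(description):
    """Return True if the description uses only the allowed characters."""
    return _DESCRIPTION_RE.fullmatch(description) is not None


def sanitize_description(description):
    """Alternative approach: strip disallowed characters, truncate to 63."""
    return re.sub(r"[^A-Za-z0-9_-]", "", description)[:63]


print(is_valid_vlan_id(100), is_valid_vlan_id(1003))      # True False
print(is_valid_description("web servers"))                # False
print(sanitize_description("web servers!"))               # webservers
```

Validating and rejecting is the safer default; silently sanitizing is friendlier but can surprise a user whose description comes out different from what they typed.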


  • Authenticate to DCNM and extract session token
  • Retrieve list of devices to which the template should be applied
  • Request a configuration job
  • Request job status until it has some kind of resolution (Success, Failed, etc)
  • Log out of DCNM


These five actions are all really the same kind of thing. For each one, some data will be sent to a REST API, and something will be returned to the script by the REST API. The process of submitting to the REST API only requires a few pieces of information:


  • What kind of HTTP request is it? GET / POST / etc?
  • What is the URL?
  • What data needs to be sent, if any, to the URL?
  • How to process the data returned. (What format is it in?)


It should be possible to write some functions to handle GET and POST requests so that it's not necessary to repeat the HTTP request code every time it's needed. The idea is not to repeat code multiple times if it can be more simply put in a single function and called from many places. This also means that fixing a bug in that code only requires it to be fixed in one place.
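
Here is one way such helpers could be sketched using only Python's standard library. It assumes the API exchanges JSON and accepts the session token as a cookie; the cookie name "session-token" is a placeholder, and none of this is taken from the DCNM documentation.

```python
import json
import urllib.request


def build_request(method, url, token=None, payload=None):
    """Build the HTTP request object shared by every API call in the script."""
    body = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=body, method=method)
    request.add_header("Accept", "application/json")
    if body is not None:
        request.add_header("Content-Type", "application/json")
    if token is not None:
        # Hypothetical cookie name; check the API documentation for the real one.
        request.add_header("Cookie", f"session-token={token}")
    return request


def call_api(method, url, token=None, payload=None, timeout=30):
    """Send the request and return the decoded JSON reply (None if empty)."""
    request = build_request(method, url, token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return json.loads(text) if text else None


def api_get(url, token=None):
    """Convenience wrapper for GET requests."""
    return call_api("GET", url, token)


def api_post(url, payload, token=None):
    """Convenience wrapper for POST requests with a JSON body."""
    return call_api("POST", url, token, payload)
```

With this in place, login, device retrieval, job submission, status checks, and logout each become a one-line call to api_get() or api_post(), and any bug in the HTTP handling is fixed in call_api() alone.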


For the IOS XR configuration, each step can be processed in a similar fashion, creating what are hopefully more manageable chunks of code to create and test.


Achieving Coding Goals


I really do believe that sometimes coders want to jump right into the coding itself before taking the time to think through how the code might actually work, and what the needs will be. In the example above, I've run through taking a single large task (Create a VLAN on 20 devices and configure two attached routers with an L3 interface and HSRP) which might seem rather daunting at first, and breaking it down into smaller functional pieces so that a) it's clearer how the code will work, and in what order; and b) each small piece of code is now a more achievable task. I'd be interested to know if you as a reader feel that the task lists, while daunting in terms of length, perhaps, seemed more accomplishable from a coding perspective than just the project headline. To me, at least, they absolutely are.


I said I wouldn't dig into the actual code, and I'll keep that promise. Before I end, though, here's a thought to consider: when is it right to code a solution, and when is it not? I'll be taking a look at that in the next, and final, article in this miniseries.

You've decided it's time to learn how to code, so the next step is to find some resources and start programming your first masterpiece. Hopefully, you've decided that my advice on which language to choose was useful, and you're going to start with either Python, Go or PowerShell. There are a number of ways to learn, and a number of approaches to take. In this post, I'll share my thoughts on different ways to achieve success, and I'll link to some learning resources that I feel are pretty good.


How I Began Coding


When I was a young lad, my first foray into programming was using Sinclair BASIC on a Sinclair ZX81 (which in the United States was sold as the Timex Sinclair 1000). BASIC was the only language available on that particular powerhouse of computing excellence, so my options were limited. I continued by using BBC BASIC on the Acorn BBC Micro Model B, where I learned to use functions and procedures to avoid repetition of code. On the PC I got interested in what could be accomplished by scripting in MS-DOS. On Macintosh, I rediscovered a little bit of C (via MPW). When I was finally introduced to NetBSD, things got interesting.


I wanted to automate activities that manipulated text files, and UNIX is just an amazing platform for that. I learned to edit text in vi (aka vim, these days) because it was one tool that I could pretty much guarantee was installed on every installation I got my hands on. I began writing shell scripts which looped around calling various instantiations of text processing utilities like grep, sed, awk, sort, uniq, fmt and more, just to get the results I wanted. I found that often, awk was the only tool with the power to extract and process the data I needed, so I ended up writing more and more little awk scripts to fill in. To be honest, some of the pipelines I was creating for my poor old text files were tricky at best. Finally, somebody with more experience than me looked at it and said, "Have you considered doing this in Perl instead?"


Challenge accepted! At that point, my mission became to create the same functionality in Perl as I had created from my shell scripts. Once I did so, I never looked back. Those and other scripts that I wrote at the time are still running. Periodically, I may go back and refactor some code, or extract it into a module so I can use the same code in multiple related scripts, but I have fully converted to using a proper scripting language, leaving shell scripts to history.


How I Learned Perl


With my extensive experience with BASIC and my shallow knowledge of C, I was not prepared to take on Perl. I knew what strings and arrays were, but what was a hash? I'd heard of references but didn't really understand them. In the end (and try not to laugh, because this was in the very early days of the internet) I bought a book (Learn Perl in 21 Days), and started reading. As I learned something, I'd try it in a script, I'd play with it, and I'd keep using it until I found a problem it didn't solve. Then back to the book, and I'd continue. I used the book more as a reference than as a true training guide (I don't think I read much beyond about Day 10 in a single stretch; after that it was on an as-needed basis).


The point is, I did not learn Perl by working through a series of 100 exercises on a website. Nor did I learn Perl by reading through the 21 Days book, and then the ubiquitous Camel book. I can't learn by reading theory and then applying it. And in any case, I didn't necessarily want to learn Perl as such; what I really wanted was to solve my text processing problems at that time. And then as new problems arose, I would use Perl to solve those, and if I found something I didn't know how to do, I'd go back to the books as a reference to find out what the language could do for me. As a result, I did not always do things the most efficient way, and I look back at my early code and think, "Oh, yuck. If I did that now I'd take a completely different approach." But that's okay, because learning means getting better over time and (this is the real kicker) my scripts worked. This might matter more if I were writing code to be used in a high-performance environment where every millisecond counts, but for my purposes, "It works" was more than enough for me to feel that I had met my goals.


In my research, I stumbled across a great video which put all of that more succinctly than I did:


Link: How to Learn to Code - YouTube


In the video, (spoiler alert!) CheersKevin states that you don't want to learn a language; you want to solve problems, and that's exactly it. My attitude is that I need to learn enough about a language to be dangerous, and over time I will hone that skill so that I'm dangerous in the right direction, but my focus has always been on producing an end product that satisfies me in some way. To that end, I simply cannot sit through 30 progressive exercises teaching me to program a poker game simulator bit by bit. I don't want to play poker; I don't have any motivation to engage with the problem.


A Few Basics


Having said that you don't want to learn a language, it is nonetheless important to understand the ways in which data can be stored and some basic code structure. Here are a few things I believe it's important to understand as you start programming, regardless of which language you choose to learn:


  • scalar variable: a way to store a single value, e.g. a string (letters/numbers/symbols), a number, a pointer to a memory location, and so on.
  • array / list / collection: a way to store an (ordered) list of values, e.g. a list of colors ("red", "blue", "green") or numbers (1,1,2,3,5,8).
  • hash / dictionary / lookup table / associative array: a way to store data by associating a unique key with a value, e.g. the key might be "red", and the value might be the HTML hex value for that color, "#ff0000". Many key/value pairs can be stored in the same object, e.g. colors=("red"=>"#ff0000", "blue"=>"#0000ff", "green"=>"#00ff00")
  • zero-based numbering: the number (or index) of the first element in a list (array) is zero; the second element is 1, and so on. Each element in a list is typically accessed by putting the index (the position in the list) in square brackets after the name. In our previously defined array colors=("red", "blue", "green") the elements in the list are colors[0]="red", colors[1]="blue", and colors[2]="green".
  • function / procedure / subroutine: a way to group a set of commands together so that the whole block can be called with a single command. This avoids repetition within the code.
  • objects, properties and methods: an object can have properties (which are information about, or characteristics of, the object), and methods (which are actually properties which execute a function when called). The properties and methods are usually accessed using dot notation. For example, I might have an object mycircle which has a property called radius; this would be accessed as mycircle.radius. I could then have a method called area which will calculate the area of the circle (πr²) based on the current value of mycircle.radius; the result would be accessed as mycircle.area(), where parentheses are conventionally used to indicate that this is a method rather than a property.
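
To make these terms concrete, here is how each one might look in Python. This is purely an illustration; the variable names and the Circle class are invented for the example, and other languages will spell the same ideas differently.

```python
import math

# Scalar variables: each holds a single value.
name = "John"
vlan_id = 100

# List (array): an ordered collection, indexed from zero.
colors = ["red", "blue", "green"]
first_color = colors[0]           # "red"
last_color = colors[2]            # "green"

# Dictionary (hash): values looked up by a unique key.
hex_codes = {"red": "#ff0000", "green": "#00ff00", "blue": "#0000ff"}


# Function: a named block of code that can be called repeatedly.
def describe(color):
    return f"{color} is {hex_codes[color]}"


# Object with a property (radius) and a method (area).
class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2


mycircle = Circle(2)
print(describe("blue"))            # blue is #0000ff
print(mycircle.radius)             # 2
print(round(mycircle.area(), 2))   # 12.57
```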


All three languages here (and indeed most other modern languages) use data types and structures like the above to store and access information. It's, therefore, important to have just a basic understanding before diving in too far. This is in some ways the same logic as gaining an understanding of IP before trying to configure a router; each router may have a different configuration syntax for routing protocols and IP addresses, but they're all fundamentally configuring IP ... so it's important to understand IP!


Some Training Resources


This section is really the impossible part, because we all learn things in different ways, at different speeds, and have different tolerances. However, I will share some resources which either I have personally found useful, or that others have recommended as being among the best:





The last course is a great example of learning in order to accomplish a goal, although perhaps only useful to network engineers as the title suggests. Kirk is the author of the NetMiko Python Library and uses it in his course to allow new programmers to jump straight into connecting to network devices, extracting information and executing commands.




Go is not, as I think I indicated previously, a good language for a total beginner. However, if you have some experience of programming, these resources will get you going fairly quickly:



As a relatively new, and still changing, language, Go does not have a wealth of training resources available. However, there is a strong community supporting it, and the online documentation is a good resource even though it's more a statement of fact than a learning experience.





Parting Thoughts


Satisfaction with learning resources is so subjective, it's hard to be sure if I'm offering a helpful list or not, but I've tried to recommend courses which have a reputation for being good for complete beginners. Whether these resources appeal may depend on your learning style and your tolerance for repetition. Additionally, if you have previous programming experience you may find that they move too slowly or are too low level; that's okay because there are other resources out there aimed at people with more experience. There are many resources I haven't mentioned which you may think are amazing, and if so I would encourage you to share those in the comments because if it worked for you, it will almost certainly work for somebody else where other resources will fail.


Coincidentally a few days ago I was listening to Scott Lowe's Full Stack Journey podcast (now part of the Packet Pushers network), and as he interviewed Brent Salisbury in Episode 4, Brent talked about those lucky people who can simply read a book about a technology (or in this case a programming language) and understand it, but his own learning style requires a lot of hands-on, and the repetition is what drills home his learning. Those two categories of people are going to succeed in quite different ways.


Since it's fresh in my mind, I'd also like to recommend listening to Episode 8 with Ivan Pepelnjak. As I listened, I realized that Ivan had stolen many of the things I wanted to say, and said them to Scott late in 2016. In the spirit that everything old is new again, I'll leave you with some of the axioms from RFC 1925, The Twelve Networking Truths (one of Ivan's favorites), which seem oddly relevant to this post, and to the art of programming too:


     (6a)  (corollary). It is always possible to add another
           level of indirection.
     (8)   It is more complicated than you think.
     (9)   For all resources, whatever it is, you need more.
     (10)  One size never fits all.
     (11)  Every old idea will be proposed again with a different
           name and a different presentation, regardless of whether
           it works.
     (11a) (corollary). See rule 6a.

To paraphrase a lyric from Hamilton, "Deciding to code is easy; choosing a language is harder." There are many programming languages that are good candidates for any would-be programmer, but selecting the one that will be most beneficial to each individual's needs is a very challenging decision. In this post, I will attempt to give some background on programming languages in general, as well as examine a few of the most popular options and attempt to identify where each one might be the most appropriate choice.


Programming Types and Terminology


Before digging into any specific languages, I'm going to explain some of the properties of programming languages in general, because these will contribute to your decision as well.


Interpreted vs Compiled



An interpreted language is one where a program called the interpreter reads the script and turns it into machine-level instructions on the fly. When an interpreted program is run, it's actually the language interpreter that is running, with the script as an input; the interpreter does the work of translating each instruction into something the hardware can execute. The advantages of interpreted languages are that they are typically quick to edit and debug, but they are also slower to run because that translation has to happen in real time. Distributing a program written in an interpreted language effectively means distributing the source code.





A compiled language is one where the source code is processed by the language compiler and turned into an executable file containing machine-specific instructions (machine code). It is this output file that is run when the program is executed. It isn't necessary to have the language installed on the target machine to run the executable, so this is the way most commercial software is created and distributed. Compiled code runs quickly because the hard work of generating the machine code has already been done, and all the target machine needs to do is execute it.




Strongly Typed vs Weakly Typed

What is Type?


In programming languages, type is the concept that each piece of data is of a particular kind. For example, 17 is an integer. "John" is a string. 2017-05-07 10:11:17.112 UTC is a time. The reason languages like to keep track of type is to determine how to react when operations are performed on the data.


As an example, I have created a simple program where I assign a value of some sort to a variable (a place to store a value), imaginatively called x. My program looks something like this:

x = 6
print x + x

I tested my script and changed the value of x to see how each of five languages would process the answer. It should be noted that putting a value in quotes (") implies that the value is a string, i.e. a sequence of characters. "John" is a string, but there's no reason "678" can't be a string too. The values of x are listed at the top, and the table shows the result of adding x to x:




Weakly Typed Languages

Why does this happen? Perl and Bash are weakly (or loosely) typed; that is, while they understand what a string is and what an integer is, they're pretty flexible about how those are used. In this case, Perl and Bash made a best-effort guess at whether to treat the values as numbers or strings; although the value "6" was defined in quotes (and quotes mean a string), the determination was that in the context of a plus sign, the program must be trying to add numbers together. Python and Ruby, on the other hand, respected "6" as a string and decided that the intent was to concatenate the strings, hence the answer of 66.


The flexibility of the weak typing offered by a language like Perl is both a blessing and a curse. It's great because the programmer doesn't have to think about what data type each variable represents, and can use them anywhere and let the language determine the right type to use based on context. It's awful because the programmer doesn't have to think about what data type each variable represents, and can use them anywhere. I speak from bitter experience when I say that the ability to (mis)use variables in this way will, eventually, lead to the collapse of civilization. Or worse, unexpected and hard-to-track-down behavior in the code.


That Bash error? Bash for a moment pretends to have strong typing and dislikes being asked to add variables whose value begins with a number but is not a proper integer. It's too little, too late if you ask me.


Strongly Typed Languages

In contrast, Python and Ruby are strongly-typed languages (as are C and Go). In these languages to add two numbers means adding two integers (or floating point numbers, aka floats). Concatenating strings requires two or more strings. Any attempt to mix and match the types will generate an error. For example in Python:

>>> a = 6
>>> b = "6"
>>> print a + b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Strongly typed languages have the advantage that accidentally adding the wrong variable to an equation, for example, will not be permitted if the type is incorrect. In theory, it reduces errors and encourages a more explicit programming style. It also ensures that the programmer is clear that the value of an int(eger) will never have decimal places. On the other hand, sometimes it's a real pain to have to convert variables from one format to another to use its value in a different context.
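
The conversion step mentioned above looks like this in practice. This short sketch uses Python 3 syntax (print is a function there) and shows both ways of resolving the mismatch; which one is right depends entirely on what the programmer intended.

```python
a = 6
b = "6"

try:
    result = a + b               # int + str is rejected outright
except TypeError as error:
    result = f"TypeError: {error}"

as_number = a + int(b)           # convert the string to an integer first
as_text = str(a) + b             # or convert the integer to a string first

print(result)
print(as_number)                 # 12
print(as_text)                   # 66
```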

PowerShell appears to want to pretend to be strongly typed, but a short test reveals some scary behavior. I've included a brief demonstration at the end in the section titled Addendum: PowerShell Bonus Content.


Dynamic / Statically Typed


There's one more twist to the above definitions. Separately from whether a language is strongly or weakly typed, it may allow a variable to change its type at any time. For example, it is just fine in Perl to initialize a variable with an integer, then give it a new value which is a string:

$a = 1;
$a = "hello";

Dynamic typing is typically a property of interpreted languages, presumably because they have more flexibility to change memory allocations at runtime. Compiled languages, on the other hand, tend to be statically typed; if a variable is defined as a string, it cannot change later on.
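
Python is a handy illustration that the two axes are independent: it is strongly typed, yet a variable name can be rebound to a value of a different type at any time. A minimal sketch:

```python
value = 1
first_type = type(value).__name__     # "int"

value = "hello"                       # same name, now bound to a string
second_type = type(value).__name__    # "str"

print(first_type, second_type)        # int str
```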


Modules / Includes / Packages / Imports / Libraries


Almost every language has some system whereby the functionality can be expanded by installing and referencing some code written by somebody else. For example, Perl does not have SSH support built in, but there is a Net::SSH module which can be installed and used. Modules are the easiest way to avoid reinventing the wheel and allow us to ride the back of somebody else's hard work. Python has packages, Ruby has modules which are commonly distributed in a format called a "gem," and Go has packages. These expansion systems are critical to writing good code; it's not a failure to use them, it's common sense.
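
As a small example of the idea, Python's standard json module turns the text returned by a typical REST API into a dictionary, so there is no need to write a parser by hand. The reply string and its field names below are made up for the illustration.

```python
import json   # part of Python's standard library

reply = '{"jobId": 1234, "status": "SUCCESS"}'   # made-up API reply
job = json.loads(reply)

print(job["jobId"])     # 1234
print(job["status"])    # SUCCESS
```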


Choosing a Language


With some understanding of type, modules and interpreted/compiled languages, now it's time to figure out how to choose the best language. First, here's a quick summary of the most common scripting languages:


Language    | C / I       | Type             | S / D   | Expansion
PowerShell  | Interpreted | It's complicated | Dynamic | Modules


I've chosen not to include Bash mainly because I consider it to be more of a wrapper than a fully fledged scripting language suitable for infrastructure tasks. Okay, okay. Put your sandals down. I know how amazing Bash is. You do, too.





Ten years ago I would have said that Perl (version 5.x, definitely not v6) was the obvious option. Perl is flexible, powerful, has roughly eleventy-billion modules written for it, and there are many training guides available. Perl's regular expression handling is exemplary and it's amazingly simple and fast to use. Perl has been my go-to language since I first started using it around twenty-five years ago, and when I need to code in a hurry, it's the language I use because I'm so familiar with it. With that said, for scripting involving IP communications, I find that Perl can be finicky, inconsistent and slow. Additionally, vendor support for Perl (e.g. providing a module for interfacing with their equipment) has declined significantly in the last 5-10 years, which also makes Perl less desirable. Don't get me wrong; I doubt I will stop writing Perl scripts in the foreseeable future, but I'm not sure that I could, in all honesty, recommend it for somebody looking to control their infrastructure with code.




It probably won't be a surprise to learn that for network automation, Python is probably the best choice of language. I'm not entirely clear why people love Python so much, and why even the people who love Python seem stuck on v2.7 and are avoiding the move to v3.0. Still, Python has established itself as the de facto standard for networking automation. Many vendors provide Python packages, and there is a strong and active community developing and enhancing packages. Personally, I have had problems adjusting to the use of whitespace (indent) to indicate code block hierarchy, and it makes my eyes twitch that a block of code doesn't end with a closing brace of some kind, but I know I'm in the minority here. Python has a rich library of packages to choose from, but just like Perl, it's important to choose carefully and find a modern, actively supported package. If you think that semicolons at the end of lines and braces surrounding code make things look horribly complicated, then you will love Python. A new Python user really should learn version 3, but note that v3 code is not backward compatible with v2.x, and it may be important to check the availability of relevant vendor packages in a Python3-compatible form.





Oh Ruby, how beautiful you are. I look at Ruby as being like Python, but cleaner. Ruby is three or four years younger than Python, and borrows parts of its syntax from languages like Perl, C, Java, Python, and Smalltalk. At first, I think Ruby can seem a little confusing compared to Python, but there's no question that it's a terrifically powerful language. Coupled with Rails (Ruby on Rails) on a web server, Ruby can be used to quickly create database-driven web applications, for example. I think there's almost a kind of snobbery surrounding Ruby, where those who prefer Ruby look down on Python almost like it's something used by amateurs, whereas Ruby is for professionals. I suspect there are many who would disagree with that, but that's the perception I've detected. However, for network automation, Ruby has not got the same momentum as Python and is less well supported by vendors. Consequently, while I think Ruby is a great language, I would not recommend it at the moment as a network automation tool. For a wide range of other purposes though, Ruby would be a good language to learn.




PowerShell (that Microsoft thing) used to be just for Windows, but now it has been ported to Linux and macOS as well. PowerShell has garnered strong support from many Windows system administrators since its release in 2006 because of the ease with which it can interact with Windows systems. PowerShell excels at automation and configuration management of Windows installations. As a Mac user, my exposure to PowerShell has been limited, and I have not heard about it being much use for network automation purposes. However, if compute is your thing, PowerShell might just be the perfect language to learn, not least because it's native in Windows Server 2008 onwards. Interestingly, Microsoft is trying to offer network switch configuration within PowerShell, and released its Open Management Infrastructure (OMI) specification in 2012, encouraging vendors to use this standard interface to which PowerShell could then interface. For a Windows administrator, I think PowerShell would be an obvious choice.





Go is definitely the baby of the group here, and with its first release in 2012, the only one of the languages here created in this decade! Go is an open source language developed by Google, and is still mutating fairly quickly with each release, as new functionality is being added. This is good because things that are perceived as missing are frequently added in the next release. It's bad because not all code will be forward compatible (i.e. will run in the next version). As Go is so new, the number of packages available for use is much more limited than for Ruby, Perl, or Python. This is obviously a potential downside because it may mean doing more work oneself.


Where Go wins, for me, is on speed and portability. Because Go is a compiled language, the machine running the program doesn't need to have Go installed; it just needs the compiled binary. This makes distributing software incredibly simple, and also makes Go pretty much immune to anything else the user might do on their platform with their interpreter (e.g. upgrade modules, upgrade the language version, etc). More to the point, it's trivial to get Go to cross-compile for other platforms; I happen to write my code on a Mac, but I can (and do) compile tools into binaries for Mac, Linux, and Windows and share them with my colleagues. For speed, a compiled language should always beat an interpreted language, and Go delivers that in spades. In particular, I have found that Go's HTTP(S) library is incredibly fast. I've written tools relying on REST API transactions in both Go and Perl, and Go versions blow Perl out of the water. If you can handle a strongly, statically typed language (it means some extra work at times) and need to distribute code, I would strongly recommend Go. The vendor support is almost non-existent, however, so be prepared to do some work on your own.




There is a lot to consider when choosing a language to learn, and I feel that this post may only scratch the surface of all the potential issues to take into account. Unfortunately, sometimes the issues may not be obvious until a program is mostly completed. Nonetheless, my personal recommendations can be summarized thus:


  • For Windows automation: PowerShell
  • For general automation: Python (easier), or Go (harder, but fast!)


If you're a coder/scripter, what would you recommend to others based on your own experience? In the next post in this series, I'll look at ways to learn a language, both in terms of approach and some specific resources.


Addendum: PowerShell Bonus Content


In the earlier table where I showed the results from adding x + x, PowerShell behaves perfectly. However, when I started to add int and string variable types, it was not so good:

PS /> $a = 6
PS /> $b = "6"
PS /> $y = $a + $b
PS /> $y
12

In this example, PowerShell just interpreted the string 6 as an integer and added it to 6. What if I do it the other way around and try adding an integer to a string?

PS /> $a = "6"
PS /> $b = 6
PS /> $y = $a + $b
PS /> $y
66

This time, PowerShell treated both variables as strings; whatever type the first variable is, that's what gets applied to the other. In my opinion that is a disaster waiting to happen. I am inexperienced with PowerShell, so perhaps somebody here can explain to me why this might be desirable behavior because I'm just not getting it.
