Network Automation: the Good, the Bad, and the Ugly

Level 10

Infrastructure automation is nothing new. We’ve been automating our server environments for years, for example. Automating network devices isn’t necessarily brand new either, but it’s never been nearly as popular as it has been in recent days.

Part of the reason network engineers are embracing this new paradigm is the time that can be saved by scripting common tasks. For example, I recently worked with someone to figure out how to script a new AAA configuration on hundreds of access switches in order to centralize authentication. Imagine having to add those few lines of configuration one switch at a time – especially in a network with several different platforms and several different local usernames and passwords. Now imagine how much time can be saved, and how many typos avoided, by automating the process instead.

That’s the good.

However, planning the pseudocode alone became a rabbit hole in which we chased modules on GitHub, snippets from previous scripts, and random links in Google trying to figure out the best way to accommodate all the funny nuances of this customer’s network. In the long run, if this were a very common task, we would have benefited greatly from putting in the time and effort needed to nail our script down, simply because it would then be reusable and shareable with the rest of the community. However, by the time I checked in again with some more ideas, my friend was already well underway configuring the switches manually, simply because it was billable time and he needed to get the job done right away. There’s a balance between diminishing returns and long-term benefits when writing code for certain tasks.

That’s the bad.

We had some semblance of a script going, however, and after some quick peer review we wanted to use it on the remaining switches. Rather than modify the code to remove the switches my friend already configured, we left it alone because we assumed it wouldn’t hurt to run the script against everything.

So we ran the script, and several hundred switches became unreachable on the management network. Nothing went hard down, mind you, but we were locked out of almost the entire access layer. Thankfully this was a single campus of several buildings using a lot of switch stacks, so with the help of the local IT staff, the management access configuration on all the switches was rolled back the hard way in one afternoon. This happened as a result of a couple of guys with a bad script. We still don’t really know what happened, but we know that this was a human error issue – not a device issue.

That’s the ugly.

Network automation seeks to decrease human error, but the process requires skill, careful peer review, and maybe even a small test pool. Otherwise, the blast radius of a failure could be very large and impactful. There is also great automation software out there with easy-to-use interfaces that can enable you to save time without struggling to learn a new programming language.
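One way to picture the "small test pool" idea is a rollout that touches a few pilot devices first and stops the moment a device fails a reachability check. This is only a minimal sketch, not the script from the story: `staged_rollout`, `RolloutHalted`, `apply_change`, and `is_reachable` are hypothetical names, and the two callables stand in for whatever your tooling uses to push a change and to confirm management access afterwards.

```python
from typing import Callable, Iterable, List


class RolloutHalted(RuntimeError):
    """Raised when a device fails its check; carries what was verified so far."""

    def __init__(self, device: str, stage: str, verified: List[str]) -> None:
        super().__init__(
            f"{device} unreachable during {stage}; "
            f"{len(verified)} device(s) verified before halt"
        )
        self.device = device
        self.stage = stage
        self.verified = list(verified)


def staged_rollout(
    devices: Iterable[str],
    apply_change: Callable[[str], None],   # hypothetical: pushes the config
    is_reachable: Callable[[str], bool],   # hypothetical: post-change check
    pilot_size: int = 3,
) -> List[str]:
    """Change the first `pilot_size` devices as a pilot, then the rest.

    Every device is checked right after it is changed. The first failed
    check stops the run, so a bad change reaches one device, not hundreds.
    """
    verified: List[str] = []
    for index, device in enumerate(devices):
        stage = "pilot" if index < pilot_size else "full rollout"
        apply_change(device)
        if not is_reachable(device):
            raise RolloutHalted(device, stage, verified)
        verified.append(device)
    return verified
```

Ordering the device list so that the pilot entries are lab or low-impact switches keeps the blast radius of a bad script down to something one person can fix by hand.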

But don’t let that dissuade you from jumping with both feet into learning Python and experimenting with scripting common tasks. In fact, there are even methods for preventing scripting misconfigurations. Just remember that along with the good, there can be some bad, and if ignored, that bad could get ugly.

9 Comments
MVP

Ahh..the joys of scripting.

I used to write wrappers to update configs for 5000+ 5GT VPN appliances and a number of network devices.

I had a basic wrapper script to which I added the device-specific commands provided by the LAN/WAN team.

We'd test on a few test devices to make sure it worked as expected before I turned it loose on many things.

It worked rather well...I had a set of wrappers for the 5GTs, Cisco devices, and one for the Extreme switches.

One thing I learned was to log what the script did so that you could follow the log back and see where something went awry....
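As a rough illustration of that logging habit, the sketch below records every command and reply per device using Python's standard `logging` module. The names `build_audit_logger`, `push_with_log`, and `send` are made up for this example; `send` is a placeholder for whatever actually talks to the device.

```python
import logging
from typing import Callable, Iterable


def build_audit_logger(handler: logging.Handler, name: str = "config_push") -> logging.Logger:
    """Return a logger that timestamps every line it writes to `handler`."""
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


def push_with_log(
    device: str,
    commands: Iterable[str],
    send: Callable[[str, str], str],  # hypothetical transport: (device, command) -> reply
    logger: logging.Logger,
) -> None:
    """Send commands one at a time, logging each attempt, reply, and failure."""
    for command in commands:
        logger.info("device=%s sending=%r", device, command)
        try:
            reply = send(device, command)
        except Exception:
            # Record the failing command with a traceback, then let it propagate.
            logger.exception("device=%s failed=%r", device, command)
            raise
        logger.info("device=%s reply=%r", device, reply)
```

Passing `logging.FileHandler("push.log")` as the handler gives a file you can read back after the fact to see exactly which device and which command things went wrong on.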

Level 21

Always build in an "oh $h!t" button to recover from the blast if/when it happens.

Level 20

Nothing like a bad route or ACL or f/w rule pushed out with NCM to ruin your day!

This is fantastic advice. Typically I script large scary things so that the first run only tells me what it would try to do. Then I run it with breaks so that I can audit what it did on the first few on the list, and finally I let it rip, but I always have a plan to undo what I did.
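That three-step routine (report only, pause to audit the first few, then run with an undo ready) might look something like the sketch below. It is an illustration under assumptions rather than anyone's production script: `run_change`, `do`, `undo`, and `confirm` are hypothetical names, and `confirm` stands in for the manual audit step.

```python
from typing import Callable, List, Sequence


def run_change(
    devices: Sequence[str],
    do: Callable[[str], None],      # hypothetical: applies the change
    undo: Callable[[str], None],    # hypothetical: reverts the change
    dry_run: bool = True,
    audit_first: int = 3,
    confirm: Callable[[List[str]], bool] = lambda touched: True,
) -> List[str]:
    """Report-only by default; otherwise apply with an audit pause and rollback."""
    if dry_run:
        # First pass: touch nothing, just say what would happen.
        return [f"would change {device}" for device in devices]

    touched: List[str] = []
    try:
        for index, device in enumerate(devices):
            # Pause after the first few so a human (or a check) can audit them.
            if index == audit_first and not confirm(touched):
                raise RuntimeError("audit rejected; rolling back")
            touched.append(device)  # recorded first so a half-applied change is undone too
            do(device)
    except Exception:
        # The "plan to undo": revert in reverse order, then re-raise.
        for device in reversed(touched):
            undo(device)
        raise
    return [f"changed {device}" for device in touched]
```

Making the dry run the default means that forgetting a flag produces a report instead of an outage.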

Level 10

Running anything in prod always scares the life out of me, so having that solid rollback plan is so so important

I like this geekpost a lot. I don't think automation is limited to infrastructure, either; everyone should try to save time by automating where appropriate and when they can. It comes back to the basic question: you can spend 20 minutes doing a task, or you can spend 5 minutes doing it so that you can spend 15 doing something else.

This applies to scripting:

[Image: pastedImage_0.png]

Level 10

Ha - yeah I've seen that one before - so true!

Level 13

I've never seen it...but I like it.

About the Author
I've been in IT for almost 30 years, beginning in the stockroom and working my way up through operations to help build and develop the Automated Operations Team at Radioshack before Enterprise Management was a cool thing. Working in several different shops over the years has exposed me to a number of different challenges regarding monitoring and alerting. I am an amateur radio operator, a Skywarn spotter for the National Weather Service, and a volunteer firefighter in a rural county just west of Fort Worth.