
2016 IT Resolution #1: Stop Blaming The Tools!

Level 11

I read an interesting thread the other day about a network engineer who tried to use an automated tool to schedule a rolling switch upgrade. Only after the job completed and the switches restarted did he realize that he had the wrong image for the devices and they weren't coming back up. About fifty switches were affected in total, which resulted in a major outage for his organization.

What struck me about the discussion thread was, first, that he wondered why the tool hadn't stopped him from doing that. The second was that the commenters responded that it wasn't the tool's job to sanity-check his inputs. The end result was likely a severe disciplinary discussion for the original engineer.

Tools are critical in network infrastructure today. Things have become far too complicated for us to manage on our own. SolarWinds makes a large number of tools to help us keep our sanity. But is it the fault of the tool when it is used incorrectly?

Tools are only as good as their users. If you smash your fingers over and over again with a hammer, does that mean the hammer is defective? Or is the way you're holding it to blame? Tools do their jobs whether you're using them correctly or incorrectly.

2016 is the year when we should all stop blaming the tools for our problems. We need to reassess our policies and procedures and find out how we can use tools effectively. We need to stop pushing all of the strange coincidences and problems off onto software that only does what it's told to do. Software tools should be treated like the tools we use to build a house. They are only useful if used properly.

Best practices should include sanity-checking things before letting the tool do the job. A test run of an upgrade on one device before proceeding to do twenty more. A last-minute write-up of the procedure before implementing it. Checking the inputs for a monitoring tool before swamping the system with data. Tapping the nail a couple of times with the hammer to make sure you can pull your fingers away before you start striking.
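The "sanity check, then canary" practice above can be sketched in a few lines of Python. This is purely illustrative: the model names, image prefixes, and functions here are invented for the example and are not part of any real SolarWinds or vendor API.

```python
# Hypothetical sketch: validate every device/image pair BEFORE scheduling,
# and upgrade a single canary device before touching the rest.

# Illustrative map of device model -> image filename prefixes valid for it.
APPROVED_IMAGES = {
    "WS-C3850": ("cat3k_caa-universalk9",),
    "WS-C2960X": ("c2960x-universalk9",),
}

def image_is_valid(model: str, image: str) -> bool:
    """Return True only if the image filename matches the device model."""
    prefixes = APPROVED_IMAGES.get(model, ())
    return any(image.startswith(p) for p in prefixes)

def plan_rollout(devices: list, image: str):
    """Reject the whole job if any device fails the sanity check, then
    return a canary-first ordering: one device now, the rest after review."""
    bad = [d for d in devices if not image_is_valid(d["model"], image)]
    if bad:
        raise ValueError("image %r not approved for: %s"
                         % (image, ", ".join(d["name"] for d in bad)))
    # Upgrade the canary, verify it boots, then proceed with the remainder.
    return [devices[0]], devices[1:]
```

Had the engineer's tool run even a check like this, the wrong-image job would have been rejected before a single switch rebooted.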

It's a new year full of new resolutions. Let's make sure that we stop blaming our tools and use them correctly. What do you think? Will you resolve to stop blaming the tools? Leave me a comment with some of your biggest tool mishaps!

Level 18

I agree that the tool should not have been responsible for the engineer's debacle.  He did not do his part to ensure what he scheduled was correct.  Thus, policies and procedures were either inadequate or ignored.  This is a situation where peer review might have saved his bacon.

Level 12

I'm right there with y'all. The program did exactly what it was told to do, which makes it a useful tool if used correctly. This should definitely have been tested in some sort of test environment first. I like Jfrazier's point about the peer review. We meet twice a week to run new scripts and monitors by each other before we put them into production.

Level 17

Totally agree! I don't blame the tools when something goes wrong. I'll then modify something within the tool to make sure it works the next time. Not testing before doing automated IOS upgrades to 50 devices is a major oversight. In fact, most automated scripts should be tested first to make sure they'll work fine against all the production devices.

Level 10

Very true that tools are only as good as their users. If the user doesn't handle the tool properly, the tool won't do the right thing on its own; tools don't operate themselves without human intervention.

Level 12

Yes, it's true.

Level 15

It is not only the tool's issue for not alerting or providing a sanity check; the engineer should have followed some sort of change management process.  Each organization has its own wickets to jump through and processes to follow.  I have worked for some organizations where I was the "Change Management Guy" and others that were more formal, with division chairs and tech leads sitting on a formal board giving verbal approvals.  At the end of the day, we all need to slow down a bit and realize the scope of our actions.  Even if it means eating a bit of crow from time to time, or just asking for a second set of eyes to verify the process.

We as network engineers are being asked to do more with less and frequently get blamed for computer processing shortfalls.  This is where SolarWinds comes in, along with my fellow Thwackers, to help us along.  The shortcuts, how-tos, labs, and training sessions (there are more resources) are there for us to use to get more out of the company's investment in SolarWinds.

I have been through many organizations in my day and consistently see a plethora of tools in use, with only 10-20% of each tool's capability being exercised.  We need to use fewer tools and use the tools we have better.  The more you develop your tools, the better data you will get out of them.  I have supplied a couple of lists of things users can do to improve their SolarWinds solution, in hopes of freeing up "firefighting time" and turning it into development time.

......Pausing for discussion and thought................

Level 16

If you use a Windows patch deployment product, most (if not all) of the good ones prevent you from, say, deploying a Windows 95 patch to a Windows 2012 server. Why don't network automation tools provide the same level of functionality?

A bad workman blames his tools; a master craftsman hates bad tools.

As an example, in NCM it would be really nice to have a repository of approved software for distribution to network equipment, then a set of rules that says which types of equipment can receive a particular image, and finally a tool to schedule the transfer, reconfiguration, and reboot. The tools would support the peer review of changes.
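A minimal sketch of that repository-plus-rules idea might look like the following. To be clear, this is not an NCM feature; the class, method names, and image filenames are all invented here to illustrate the suggestion.

```python
# Hypothetical approved-image repository: each image carries an expected
# SHA-256 and the set of platform families allowed to run it, so a wrong
# image can be rejected before any reboot is scheduled.
import hashlib

class ImageRepository:
    def __init__(self):
        self._images = {}

    def approve(self, filename: str, sha256: str, platforms) -> None:
        """Register an image with its checksum and permitted platforms."""
        self._images[filename] = {"sha256": sha256,
                                  "platforms": set(platforms)}

    def allowed_for(self, filename: str, platform: str) -> bool:
        """Rule check: may this platform family receive this image?"""
        entry = self._images.get(filename)
        return entry is not None and platform in entry["platforms"]

    def verify_transfer(self, filename: str, payload: bytes) -> bool:
        """Compare the bytes that actually landed on the device against
        the approved checksum, catching corrupt or swapped files."""
        entry = self._images.get(filename)
        return (entry is not None
                and hashlib.sha256(payload).hexdigest() == entry["sha256"])
```

With a gate like this in the scheduling path, the rolling-upgrade disaster from the original story becomes a rejected job instead of a fifty-switch outage.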

Level 14

A network engineer needs to know his/her network, period.  One engineer I used to work for always said, if you can't draw it, you don't know it.  No tool can replace an engineer's in-depth knowledge of their network.

Level 21

I've seen some fun scripts shared for NCM that allow a chief network analyst to have technicians perform switch or router IOS upgrades, where the script can determine whether the right IOS image is being applied to the proper device.

I've made and seen enough errors that I don't trust scripts, or human actions.  "Put not your faith in works of men" applies literally here.

If I leave a Cisco switch's original image and startup config alone, and only upload the new image and reference it as the image to which the switch should boot, the switch can recover on its own if that image turns out to be incorrect.  But only if the original "correct" IOS image is still present, AND only if the boot variable references that old image as a second image to use in case the primary image fails.

It takes extra planning to do this correctly, and it's easy to make mistakes.  Plus, older hardware may not have enough space to allow two IOS images to be present.
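The fallback arrangement described above amounts to configuration along these lines. This is only a sketch with placeholder filenames; the exact boot-system syntax varies by Cisco platform and IOS version, so check the documentation for your hardware before relying on it.

```
! Point the switch at the new image first, but keep the known-good
! image on flash and list it second, so the switch can fall back to
! it if the new image fails to load.
boot system flash:new-image.bin
boot system flash:known-good-image.bin
```

The order matters: entries are tried top to bottom, and deleting the old image to free up flash space removes this safety net entirely.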

If you're deleting/overwriting old images & config files, you're running a risk of having an extended and unexpected outage, and eliminating some automatic recovery capabilities Cisco has built into some models.

Cisco's ISSU solution has worked perfectly for me with about thirty 4510's I purchased--but it requires dual supervisors in each chassis.  If I do something incorrect (wrong IOS image, wrong boot variable, etc.) the 4510 In-Service-Software-Upgrade process tries to do as I told it to.  If it fails, it automatically recovers and brings everything back the way it was.  It downgrades back to the original working version of IOS code, and then tells me what was wrong.  And it does it hitlessly.

Like the person in the original topic of this thread, I'd like all my systems to do what I mean instead of the system doing what I erroneously told it to do.  But I know GIGO.  Everyone else who works with computerized systems MUST understand GIGO, too.

Don't waste time blaming the tool when you used it incorrectly.  Blame yourself.  Next time anticipate problems, do peer review of your steps, get Change Management approval, and guard against Murphy's Law in every way you can.

Level 17

Agreed Indeed!

Level 8

50/50. It depends on the quality of the tools.

Level 20

It is funny/sad for me, because I work with lots of different companies all the time and I often see administrators who are running themselves to the ragged edge in terms of their workload. But when I show them how to use these kinds of features in NCM, they often give me something along the lines of "looks nice, but I can't trust my guys with that kind of power."  And in most cases it is absolutely true: they don't have the man-hours available to debug and test the scripts they'd want to use. But because they don't have time to work on automating away the tedious little things in their environment, they are never quite able to keep up with all the things the business is expecting from them.  And if they put the tool out there without a good plan and structure around it, they might get bitten hard when something like that happens.  Definitely a rock and a hard place.

Planning contingencies, developing a testing procedure, and maintaining change control are all time-intensive, but there is only so far your organization can go before it has to move into that kind of more structured methodology.

I have gotten to where I hate the term false alert.

Even when the LOB is responsible (removing a process or server without notifying anyone, for example), they still want to call it a false alert.  I say take personal responsibility.

About the Author
A nerd that happens to live and breathe networking of all kinds. Also known to dip into voice, security, wireless, and servers from time to time. Warning - snark abounds.