
2016 IT Resolution #1: Stop Blaming The Tools!

Level 11

I read an interesting thread the other day about a network engineer who tried to use an automated tool to schedule a rolling switch upgrade. Only after the job completed and the switches restarted did he realize that he had the wrong image for the devices and they weren't coming back up. About fifty switches were affected in total, which resulted in a major outage for his organization.

What struck me about the discussion thread was, first, that he wondered why the tool hadn't stopped him from doing that. The second was that the commenters responded that it wasn't the tool's job to sanity-check his inputs. The end result was likely a severe disciplinary discussion for the original engineer.

Tools are critical in network infrastructure today. Things have become far too complicated for us to manage on our own. SolarWinds makes a large number of tools to help us keep our sanity. But is it the fault of the tool when it is used incorrectly?

Tools are only as good as their users. If you smash your fingers over and over again with a hammer, does that mean the hammer is defective? Or is the way you're holding it to blame? Tools do their jobs whether you're using them correctly or incorrectly.

2016 is the year when we should all stop blaming the tools for our problems. We need to reassess our policies and procedures and find out how we can use tools effectively. We need to stop pushing all of the strange coincidences and problems off onto software that only does what it's told to do. Software tools should be treated like the tools we use to build a house. They are only useful if used properly.

Best practices should include sanity-checking things before letting the tool do the job. A test run of an upgrade on one device before proceeding to do twenty more. A last-minute write-up of the procedure before implementing it. Checking the inputs for a monitoring tool before swamping the system with data. Tapping the nail a couple of times with the hammer to make sure you can pull your fingers away before you start striking.
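The "sanity check, then canary" practice above can be sketched in a few lines of Python. This is purely illustrative: the model names, image prefixes, and functions here are invented for the example and are not part of any real SolarWinds or vendor API.

```python
# Hypothetical sketch: validate every device/image pair BEFORE scheduling,
# and upgrade a single canary device before touching the rest.

# Illustrative map of device model -> image filename prefixes valid for it.
APPROVED_IMAGES = {
    "WS-C3850": ("cat3k_caa-universalk9",),
    "WS-C2960X": ("c2960x-universalk9",),
}

def image_is_valid(model: str, image: str) -> bool:
    """Return True only if the image filename matches the device model."""
    prefixes = APPROVED_IMAGES.get(model, ())
    return any(image.startswith(p) for p in prefixes)

def plan_rollout(devices: list, image: str):
    """Reject the whole job if any device fails the sanity check, then
    return a canary-first ordering: one device now, the rest after review."""
    bad = [d for d in devices if not image_is_valid(d["model"], image)]
    if bad:
        raise ValueError("image %r not approved for: %s"
                         % (image, ", ".join(d["name"] for d in bad)))
    # Upgrade the canary, verify it boots, then proceed with the remainder.
    return [devices[0]], devices[1:]
```

Had the engineer's tool run even a check like this, the wrong-image job would have been rejected before a single switch rebooted.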

It's a new year full of new resolutions. Let's make sure that we stop blaming our tools and use them correctly. What do you think? Will you resolve to stop blaming the tools? Leave me a comment with some of your biggest tool mishaps!

Level 18

I agree that the tool should not have been responsible for the engineer's debacle.  He did not do his part to ensure what he scheduled was correct.  Thus, policies and procedures were either inadequate or ignored.  This is a situation where peer review might have saved his bacon.

Level 12

I'm right there with y'all. The program did exactly what it was told to do, which makes it a useful tool if used correctly. This should definitely have been tested in some sort of test environment first. I like Jfrazier's point about the peer review. We meet twice a week to run new scripts and monitors by each other before we put them into production.

Level 17

Totally agree! I don't blame the tools when something goes wrong. I'll then modify something within the tool to make sure it works the next time. Not testing before doing automated IOS upgrades to 50 devices is a major oversight. In fact, most automated scripts should be tested first to make sure they'll work fine against all the production devices.

Level 10

Very true that tools are only as good as their users. If the user doesn't handle the tool properly, the tool won't do the right thing on its own; tools don't operate themselves without human intervention.

Level 12

Yes, it's true.

Level 15

It is not only the tool's issue for not alerting or providing a sanity check; the engineer should have followed some sort of change management process.  Each organization has its own wickets to jump through and processes to follow.  I have worked for some organizations where I was the "Change Management Guy" and others that were more formal, with division chairs and tech leads sitting on a formal board giving verbal approvals.  At the end of the day, we all need to slow down a bit and realize the scope of our actions.  Even if it means eating a bit of crow from time to time, or just asking for a second set of eyes to verify the process.

We as network engineers are being asked to do more with less and frequently get blamed for computer processing shortfalls.  This is where SolarWinds comes in, along with my fellow Thwackers, to help us along.  The shortcuts, how-tos, labs, and training sessions (there are more resources) are there for us to use to get more out of the company's investment in SolarWinds.

I have been through many organizations in my day and consistently see a plethora of tools in use, with only 10-20% of each tool's capability being exercised.  We need to use fewer tools and use the tools we have better.  The more you develop your tools, the better data you will get out of them.  I have supplied a couple of lists of things users can do to improve their SolarWinds solution, in hopes of freeing up "firefighting time" and turning it into development time.

......Pausing for discussion and thought................

Level 16

If you use a Windows patch deployment product, most (if not all) of the good ones prevent you from, say, deploying a Windows 95 patch to a Windows 2012 server. Why don't network automation tools provide the same level of functionality?

A bad workman blames his tools; a master craftsman hates bad tools.

As an example, in NCM it would be really nice to have a repository of approved software for distribution to network equipment, then a set of rules that says which types of equipment can receive a particular image, and finally a tool to schedule the transfer, reconfiguration, and reboot. The tools would support the peer review of changes.
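A minimal sketch of that repository-plus-rules idea might look like the following. To be clear, this is not an NCM feature; the class, method names, and image filenames are all invented here to illustrate the suggestion.

```python
# Hypothetical approved-image repository: each image carries an expected
# SHA-256 and the set of platform families allowed to run it, so a wrong
# image can be rejected before any reboot is scheduled.
import hashlib

class ImageRepository:
    def __init__(self):
        self._images = {}

    def approve(self, filename: str, sha256: str, platforms) -> None:
        """Register an image with its checksum and permitted platforms."""
        self._images[filename] = {"sha256": sha256,
                                  "platforms": set(platforms)}

    def allowed_for(self, filename: str, platform: str) -> bool:
        """Rule check: may this platform family receive this image?"""
        entry = self._images.get(filename)
        return entry is not None and platform in entry["platforms"]

    def verify_transfer(self, filename: str, payload: bytes) -> bool:
        """Compare the bytes that actually landed on the device against
        the approved checksum, catching corrupt or swapped files."""
        entry = self._images.get(filename)
        return (entry is not None
                and hashlib.sha256(payload).hexdigest() == entry["sha256"])
```

With a gate like this in the scheduling path, the rolling-upgrade disaster from the original story becomes a rejected job instead of a fifty-switch outage.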

Level 14

A network engineer needs to know his/her network, period.  One engineer I used to work for always said, if you can't draw it, you don't know it.  No tool can replace an engineer's in-depth knowledge of their network.

Level 21

I've seen some fun scripts shared for NCM that allow a chief network analyst to have technicians perform switch or router IOS upgrades, where the script can determine whether the right IOS image is being applied to the proper device.

I've made and seen enough errors that I don't trust scripts, or human actions.  "Put not your faith in works of men" applies literally here.

If I leave a Cisco switch's original image and startup config alone, and only upload the new image and reference it as the image to which the switch should boot, the switch can recover on its own if that image turns out to be incorrect.  But only if the original "correct" IOS image is still present, AND only if the boot variable references that old image as a second image to use in case the primary image fails.

It takes extra planning to do this correctly, and it's easy to make mistakes.  Plus, older hardware may not have enough space to allow two IOS images to be present.
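The fallback arrangement described above amounts to configuration along these lines. This is only a sketch with placeholder filenames; the exact boot-system syntax varies by Cisco platform and IOS version, so check the documentation for your hardware before relying on it.

```
! Point the switch at the new image first, but keep the known-good
! image on flash and list it second, so the switch can fall back to
! it if the new image fails to load.
boot system flash:new-image.bin
boot system flash:known-good-image.bin
```

The order matters: entries are tried top to bottom, and deleting the old image to free up flash space removes this safety net entirely.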

If you're deleting/overwriting old images & config files, you're running a risk of having an extended and unexpected outage, and eliminating some automatic recovery capabilities Cisco has built into some models.

Cisco's ISSU solution has worked perfectly for me with about thirty 4510's I purchased--but it requires dual supervisors in each chassis.  If I do something incorrect (wrong IOS image, wrong boot variable, etc.) the 4510 In-Service-Software-Upgrade process tries to do as I told it to.  If it fails, it automatically recovers and brings everything back the way it was.  It downgrades back to the original working version of IOS code, and then tells me what was wrong.  And it does it hitlessly.

Like the person in the original topic of this thread, I'd like all my systems to do what I mean instead of the system doing what I erroneously told it to do.  But I know GIGO.  Everyone else who works with computerized systems MUST understand GIGO, too.

Don't waste time blaming the tool when you used it incorrectly.  Blame yourself.  Next time anticipate problems, do peer review of your steps, get Change Management approval, and guard against Murphy's Law in every way you can.

Level 17

Agreed Indeed!

Level 8

50/50. It depends on the quality of the tools.

Level 20

It is funny/sad for me, because I work with lots of different companies all the time and I often see administrators who are running themselves to the ragged edge in terms of their workload. But when I show them how to use these kinds of features in NCM, they often give me something along the lines of "looks nice, but I can't trust my guys with that kind of power."  And in most cases it is absolutely true: they don't have the man-hours available to debug and test the scripts they'd want to use. But because they don't have time to work on automating away the tedious little things in their environment, they are never quite able to keep up with all the things the business is expecting from them.  And if they put the tool out there without a good plan and structure around it, they might get bitten hard when something like that happens.  Definitely a rock and a hard place.

Planning contingencies, developing a testing procedure, and maintaining change control are all time-intensive, but there is only so far your organization can go before it has to move into that kind of more structured methodology.

I have gotten to where I hate the term false alert.

Even when the LOB is responsible (removing a process or server without notifying anyone, for example), they still want to call it a false alert.  I say take personal responsibility.

About the Author
A nerd that happens to live and breathe networking of all kinds. Also known to dip into voice, security, wireless, and servers from time to time. Warning - snark abounds.