Stop And Smell The Documentation

Stop And Smell The Documentation

A recent issue with a network reminded me of the importance of documentation. I was helping a friend find out why destinations in the core of the network were unable to ping other locations. We took the time to solve some routing neighbor issues but couldn't figure out why none of the core could get out to the Internet. We were both confused and working through any issues in the network. After a bit more troubleshooting with his team, it turned out to be a firewall issue. In the process of helping the network team, someone had added a rule to the firewall that blocked the core from getting out. A lot of brainpower was wasted because this engineer was trying to help.

We reinforce the idea that documentation is imperative. As-built documentation is delivered when a solution is put together. Operational docs are delivered when the solution is ready to be turned up. We have backup plans, disaster plans, upgrade procedures, migration guidelines, and even a plan to take equipment out of service when it reaches the end of life. But all of these plans, while important, are the result of an entire process. What isn't captured is the process itself. This becomes very important when you are troubleshooting a problem.

When I worked on a national helpdesk doing basic system support, we used the CAR method of documentation:

  • C - Cause: What do we think caused the problem? Often, this was one of the last things filled in because we didn't want to taint out troubleshooting method with wild guesses. Often I would put in "doesn't work" or "broken".
  • A - Actions: This was the bulk of the documentation. What did you do to try and fix the problem? I'll expand on this in a minute, but Actions was the most critical part of the documentation that almost never captured what was needed.
  • R - Resolution: Did you fix the problem? If not, why? How was the customer left? If you were part of a multiple call process, Resolution needed to reflect where you were in the process so the next support technician could pick up where you left off.

Cause and Resolution are easy and usually just one or two line entries. What broke and how did you fix it? Actions, on the other hand, was usually a spartan region of half sentences and jumbled thoughts. And this is where most of the troubleshooting problems occurred.

When you're trying to fix a problem, it's tempting to only write down the things that worked. Why record things that failed to fix the problem? If you try something and it doesn't affect the issue, just move on to the next attempt. However, in the real world, not recording the attempts to fix the problem are just as detrimental as the issue in the first place.

Writing In The Real World

Let's take the above example. We were concentrating on the network and all the issues there. No one thought to look at the firewall until we found out the issue was with outbound traffic. No one mentioned there was a new rule in the firewall that directly affected traffic flow. No one wrote down that they put the rule in the firewall just for this issue, which made is less apparent how long the rule had been there.

Best practice says to document your network. Common practice says to write down as much as you can think of. But common sense practice is that you should write down everything you've done to during troubleshooting. If you swap a cable or change a port description it should be documented. If you tweak a little setting in the corner of the network or you delete a file it should be noted somewhere. Write down everything and decide what to keep later.

The reason for writing it all down is because troubleshooting is rarely a clean process. Fixing one problem will often uncover other issues. Without knowing the path we took to get from problem to solution, we can't know which of these issues were introduced by our own meddling and which issues were already there. Without an audit trail we could end up chasing our own tails for hours not knowing that the little setting we tweaked five minutes in caused the big issue three hours later.

It doesn't matter if you write down your steps on a tablet or jot them on the back of a receipt for lunch. What is crucial is that all that information makes it to a central location by the end of the process. You need great tools, like the ones from Solarwinds, to help you make sense of all these changes and correlate them into solutions to problems. And for those of us, like me, that often forget to write down every little step, Solarwinds makes log capture programs that can let the device report every command that was entered. It's another way to make sure someone is writing it all down.

How do you document? What are you favorite methods? Have you ever caused a problem in the process of fixing another one? Let me know in the comments!

Parents Comment Children
No Data
Thwack - Symbolize TM, R, and C