If you work in engineering or support, you've probably spent a lot of time troubleshooting things. You've probably spent just as much time trying to figure out why things were broken in the first place. As much as we might like IT troubleshooting to be simple, the fact of the matter is that most problems are too complex to solve at first glance.
What we're really looking for here is root cause analysis. It's a fancy term for "find out what caused things to break." Root cause analysis is focused on a proven, repeatable methodology for determining the root cause of a problem. And the process is deceptively simple: if you remove a factor and the problem still happens, that factor is not part of the root cause. How can we do root cause analysis on problems that are organization-wide or that have so many component factors that they're difficult to isolate? That's where the structure comes into play.
Step One: Do You Have A Problem?
It may sound silly, but to do root cause analysis on a problem, first you have to figure out if you have a problem. I originally talked about problem determination when I first started writing, as it was one of the biggest issues I saw in the field. People can't do problem determination. They can't figure out if something isn't behaving properly unless there is an error message.
Problems have causes. Most of the time they are periodic or triggered. Rarely, they may appear to be random but are, in fact, just really, really oddly periodic. To determine root cause, you first must figure out that the thing you are looking at is a problem. Is it something that is happening by design of the protocol or the implementation? Is it happening because of environmental factors or other external sources? You're going to be mighty upset if you spend cycles troubleshooting what you think is a failing power supply only to find out someone keeps shutting off the power to the room and causing the outage.
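One way to check whether a "random" problem is really just oddly periodic is to look at the gaps between incidents. Here's a minimal sketch in Python. The timestamps are made up for illustration, and the function name is my own; real data would come from your logs or ticketing system.

```python
from collections import Counter
from datetime import datetime
from typing import List, Optional, Tuple


def most_common_gap_minutes(timestamps: List[datetime]) -> Optional[Tuple[int, int]]:
    """Return (gap_in_minutes, count) for the most frequent gap between incidents."""
    ordered = sorted(timestamps)
    # Gap between each incident and the next one, rounded to whole minutes.
    gaps = [
        round((later - earlier).total_seconds() / 60)
        for earlier, later in zip(ordered, ordered[1:])
    ]
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0]


# Hypothetical outage times, not real data.
incidents = [
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 14, 0),
    datetime(2024, 1, 1, 20, 0),
]

gap = most_common_gap_minutes(incidents)
print(gap)
```

If one gap dominates, go looking for something that runs on that schedule: a backup job, a cron task, or a person who flips the breaker at the end of every shift.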
Problems also need to be repeatable. If something can't be triggered or observed on a schedule, you need to dig further until you can make it happen. Random chance isn't a problem. Cosmic rays causing data loss isn't something you can replicate easily. Real problems that can be solved with root cause analysis can be repeated until they are resolved.
Step Two: Box Your Problem
The next step in the troubleshooting process is the part we're the most familiar with: the actual troubleshooting. I wrote about my troubleshooting process years ago. I just start determining symptoms and isolating them until I get to the real problem. Sometimes that means erasing those boxes and redrawing them. You can't assume that any one solution will be the right one until you can determine for a fact that it solves the root cause.
This is where a lot of people tend to get caught up in the euphoria of troubleshooting. They like solving problems. They like seeing something get fixed. What they don't like is finding out their elegant solution didn't work. So, they'll often stop when they've drawn a pretty box around the issue and isolated it. Deal with things until you don't have to deal with them any longer. But with root cause analysis, you have to keep digging. You have to know that your process fixes the issue and is repeatable.
When I worked for Gateway 2000, every call we took had to follow the ARC method of documentation: steps to ADDRESS the issue, RESOLUTION of the issue, reason for the CALL. I always thought it should have been CAR - CALL, ADDRESS, RESOLUTION, but I kept getting overruled. We loved filling in C and R: why did they call and what eventually fixed it. What we didn't do so well was the A, the part in between. Things don't get fixed by magic. You need to write down every step along the way and make sure that the process you followed fixes the problem. If you leave out the steps, you'll never know what fixed things.
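To make the A harder to skip, you can build it into the record itself. This is only a sketch of how an ARC-style record might look in Python. The class and field names are my own invention, not anything Gateway 2000 actually used, and the example call is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: str             # what was tried
    outcome: str            # what was observed afterward
    reverted: bool = False  # whether the change was backed out


@dataclass
class ArcRecord:
    call: str                                          # reason for the CALL
    address: List[Step] = field(default_factory=list)  # steps to ADDRESS the issue
    resolution: str = ""                               # RESOLUTION of the issue

    def log_step(self, action: str, outcome: str, reverted: bool = False) -> None:
        self.address.append(Step(action, outcome, reverted))

    def is_complete(self) -> bool:
        # A record with no documented steps is not complete, even if it has a fix.
        return bool(self.call and self.address and self.resolution)


# Hypothetical support call.
record = ArcRecord(call="Customer cannot resolve internal hostnames")
record.resolution = "Corrected the DNS server address"
complete_without_steps = record.is_complete()

record.log_step("Rebooted the workstation", "No change", reverted=False)
record.log_step("Corrected the DNS server address", "Hostnames resolve")
complete_with_steps = record.is_complete()
```

The point of is_complete() is that C and R alone don't count. Until at least one step is written down, the record stays open.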
Step Three: Make Sure You Really Fixed It
This is the part of root cause analysis that most people really hate. Not only do you have to prove you fixed the thing, but you also have to prove that the steps you took fixed it. As with determining the root cause above, if one of your steps didn't fix the problem, you have to eliminate it from the root cause analysis as irrelevant.
Think about it like this. If you successfully solve a problem by kicking a server and fixing DNS, what actually fixed the issue? Root cause analysis says you have to try both solutions next time you're presented with the same issue. It's very likely that DNS fixes were the real solution and the root cause was DNS misconfiguration. But you can't discount the kick until you can prove it didn't fix the issue. Maybe you jostled a fan loose and made the CPU run cooler?
We have a real problem with isolating issues. Sometimes that means that when we change a setting and it doesn't fix the problem, we need to change it back. That sounds counter-intuitive until you realize that making fourteen changes until you find the right setting to fix the issue means you're not really sure which one solved the problem. That means you have to isolate everything to make sure that Solution Nine wasn't really the right one and it just took 30 minutes to kick in while you tried Solutions Ten through Fourteen.
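The one-change-at-a-time idea can be sketched in a few lines of Python. Everything here is a stand-in: the config keys, the candidate changes, and the health check are hypothetical, and a real check would poke at the actual system rather than a dictionary.

```python
from typing import Callable, Dict, List

Config = Dict[str, object]


def isolate_fixes(
    baseline: Config,
    candidates: Dict[str, Callable[[Config], None]],
    is_healthy: Callable[[Config], bool],
) -> List[str]:
    """Try each change alone against a fresh copy of the baseline."""
    fixes = []
    for name, change in candidates.items():
        trial = dict(baseline)  # starting from a copy is the "change it back" step
        change(trial)
        if is_healthy(trial):
            fixes.append(name)
    return fixes


# Hypothetical broken starting state.
baseline = {"dns_server": "10.0.0.99", "mtu": 1500, "server_kicked": False}


def fix_dns(cfg: Config) -> None:
    cfg["dns_server"] = "10.0.0.2"


def lower_mtu(cfg: Config) -> None:
    cfg["mtu"] = 1400


def kick_server(cfg: Config) -> None:
    cfg["server_kicked"] = True


def is_healthy(cfg: Config) -> bool:
    # Stand-in check: in this toy model only the DNS setting matters.
    return cfg["dns_server"] == "10.0.0.2"


candidates = {"fix_dns": fix_dns, "lower_mtu": lower_mtu, "kick_server": kick_server}
working = isolate_fixes(baseline, candidates, is_healthy)
print(working)
```

Because every trial starts from the same baseline, a change that does nothing can't ride along with the one that works. In a live environment the health check would also need to wait long enough for slow changes to take effect, or Solution Nine will still get credited to Solution Fourteen.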
Once you know that you fixed the issue and that this particular solution or solution path fixed the issue, you've successfully completed the majority of your root cause analysis. But you're not quite done yet.
Step Four: Blameless Reporting
This is a hard one. You need to do a report about the root cause. But what if the cause is something someone changed or did that made the issue come up? How do you do a report without throwing someone under the bus?
Fix the problem, not the blame. You can't have proper root cause analysis if the root cause is "David." People aren't the cause of issues. David's existence didn't cause the server to reboot. David's actions caused it. Focus on the actions that caused the problem and the resolution. Maybe it's as simple as revoking reboot rights from the group that David and other junior admins belong to. Maybe the root cause really is that David was mad at management and just wanted to reboot a production server to make them mad. But you have to focus on the actions and not the people. Blaming people doesn't solve problems. Correcting actions does.
Root cause analysis isn't easy. It's designed to help people get to the bottom of their problems with a repeatable process every time. If you follow these steps and make sure you're honest with yourself along the way, you'll quickly find that your problems are getting resolved faster and more accurately. You'll also find that people are more willing to quickly admit mistakes and get them rectified if they know the whole thing isn't going to come down on their heads. And a better overall environment for work means fewer problems for everyone else to have to solve.