
What Is Root Cause Analysis?

Level 11

If you work in engineering or support, you've probably spent a lot of time troubleshooting things. You've probably spent just as much time trying to figure out why things were broken in the first place. As much as we'd like IT troubleshooting to be simple, most of the time the problems are too complex to solve at first glance.

What we're really after here is root cause analysis. It's a fancy term for "find out what caused things to break." Root cause analysis is a proven, repeatable methodology for determining the root cause of a problem. And the process is deceptively simple: if you remove a factor and the problem still happens, that factor isn't part of the root cause. How can we do root cause analysis on problems that are organization-wide or that have so many component factors that they're difficult to isolate? That's where the structure comes in.

Step One: Do You Have A Problem?

It may sound silly, but to do root cause analysis on a problem, first you have to figure out if you have a problem. I originally talked about problem determination when I first started writing, as it was one of the biggest issues I saw in the field. People can't do problem determination. They can't figure out if something isn't behaving properly unless there is an error message.

Problems have causes. Most of the time they are periodic or triggered. Rarely, they may appear to be random but are, in fact, just really, really oddly periodic. To determine root cause, you first must figure out that the thing you are looking at is a problem. Is it something that is happening by design of the protocol or the implementation? Is it happening because of environmental factors or other external sources? You're going to be mighty upset if you spend cycles troubleshooting what you think is a failing power supply only to find out someone keeps shutting off the power to the room and causing the outage.

Problems also need to be repeatable. If something can't be triggered or observed on a schedule, you need to dig further until you can make it happen. Random chance isn't a problem. Cosmic rays causing data loss isn't something you can replicate easily. Real problems that can be solved with root cause analysis can be repeated until they are resolved.
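One rough way to test whether a "random" problem is really periodic is to look at the gaps between occurrences. The sketch below is only an illustration: the function name looks_periodic, the 10% tolerance, and the sample timestamps are all assumptions, not something from a particular tool.

```python
from statistics import mean, pstdev


def looks_periodic(timestamps, tolerance=0.1):
    """Guess whether events recur on a schedule.

    timestamps: sorted event times in seconds (hypothetical input format).
    tolerance: allowed spread of the gaps relative to the average gap.
    Returns (is_periodic, average_gap_in_seconds).
    """
    if len(timestamps) < 3:
        return False, None  # too few events to judge a pattern
    # Time between each occurrence and the next one.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    average_gap = mean(gaps)
    if average_gap <= 0:
        return False, None
    # Small relative spread means the gaps are nearly identical.
    relative_spread = pstdev(gaps) / average_gap
    return relative_spread <= tolerance, average_gap


# Outages roughly every hour versus outages with no obvious rhythm.
hourly, hourly_gap = looks_periodic([0, 3600, 7200, 10805])
scattered, _ = looks_periodic([0, 100, 5000, 5200])
```

If the gaps come out nearly identical, the next question is what else happens on that schedule: a backup job, a lease renewal, or someone shutting off the power to the room.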

Step Two: Box Your Problem

The next step in the troubleshooting process is the part we're the most familiar with: the actual troubleshooting. I wrote about my troubleshooting process years ago. I just start determining symptoms and drawing boxes around them to isolate them until I get to the real problem. Sometimes that means erasing those boxes and redrawing them. You can't assume that any one solution will be the right one until you can determine for a fact that it solves the root cause.

This is where a lot of people tend to get caught up in the euphoria of troubleshooting. They like solving problems. They like seeing something get fixed. What they don't like is finding out their elegant solution didn't work. So, they'll often stop when they've drawn a pretty box around the issue and isolated it. The mindset is to deal with things until you don't have to deal with them any longer. But with root cause analysis, you have to keep digging. You have to know that your process fixes the issue and is repeatable.

When I worked for Gateway 2000, every call we took had to follow the ARC method of documentation: steps to ADDRESS the issue, RESOLUTION of the issue, reason for the CALL. I always thought it should have been CAR - CALL, ADDRESS, RESOLUTION, but I kept getting overruled. We loved filling in C and R: why did they call and what eventually fixed it. What we didn't do so well was the middle part. Things don't get fixed by magic. You need to write down every step along the way and make sure that the process you followed fixes the problem. If you leave out the steps, you'll never know what fixed things.
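If it helps to picture the ARC fields as a record, here is a minimal sketch. The ArcRecord class, its field names, and the sample ticket are hypothetical and are not how the actual call-tracking system worked. The one idea it tries to capture is that a record isn't complete until the middle part has steps in it.

```python
from dataclasses import dataclass, field


@dataclass
class ArcRecord:
    """Hypothetical ticket record following the ARC fields."""
    call: str                                     # reason for the CALL
    address: list = field(default_factory=list)   # steps taken to ADDRESS the issue
    resolution: str = ""                          # RESOLUTION of the issue

    def add_step(self, action, outcome):
        """Record one troubleshooting step and what happened afterward."""
        self.address.append((action, outcome))

    def is_complete(self):
        """A record with no documented steps does not count as done."""
        return bool(self.call and self.address and self.resolution)


ticket = ArcRecord(call="Customer cannot reach the intranet site")
ticket.resolution = "Corrected the DNS server setting"
complete_without_steps = ticket.is_complete()

ticket.add_step("Pinged the site by IP address", "Replies received")
ticket.add_step("Pinged the site by name", "Name did not resolve")
complete_with_steps = ticket.is_complete()
```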

Step Three: Make Sure You Really Fixed It

This is the part of root cause analysis that most people really hate. Not only do you have to prove you fixed the thing, but you also have to prove that the steps you took fixed it. Like determining the root cause above, if one of your steps didn't fix the problem, you have to eliminate it from the root cause analysis as being irrelevant.

Think about it like this. If you successfully solve a problem by kicking a server and fixing DNS, what actually fixed the issue? Root cause analysis says you have to try both solutions next time you're presented with the same issue. It's very likely that DNS fixes were the real solution and the root cause was DNS misconfiguration. But you can't discount the kick until you can prove it didn't fix the issue. Maybe you jostled a fan loose and made the CPU run cooler?

We have a real problem with isolating issues. Sometimes that means that when we change a setting and it doesn't fix the problem, we need to change it back. That sounds counter-intuitive until you realize that making fourteen changes until you find the right setting to fix the issue means you're not really sure which one solved the problem. That means you have to isolate everything to make sure that Solution Nine wasn't really the right one and it just took 30 minutes to kick in while you tried Solutions Ten through Fourteen.
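The change-it-back discipline can be sketched as a loop: apply one change, give it time to take effect, check whether the problem still reproduces, and revert the change if it didn't help. Everything below is an assumption made for illustration, including the isolate_fix function, the settle_seconds wait, and the toy settings dictionary standing in for a real system.

```python
import time


def isolate_fix(candidates, problem_present, settle_seconds=0.0):
    """Try candidate fixes one at a time, reverting each one that does not help.

    candidates: list of (name, apply_change, revert_change) tuples of callables.
    problem_present: callable that returns True while the problem reproduces.
    settle_seconds: wait after each change so a slow fix is not misattributed.
    Returns (name_of_fix_or_None, log_of_what_was_tried).
    """
    log = []
    for name, apply_change, revert_change in candidates:
        apply_change()
        time.sleep(settle_seconds)
        if not problem_present():
            log.append((name, "fixed"))
            return name, log
        revert_change()  # put the setting back before trying the next one
        log.append((name, "reverted"))
    return None, log


# Toy stand-in for a real system: the "problem" is a wrong DNS server.
settings = {"mtu": 1500, "dns": "10.0.0.1"}


def set_value(key, value):
    def change():
        settings[key] = value
    return change


candidates = [
    ("lower MTU", set_value("mtu", 1400), set_value("mtu", 1500)),
    ("correct DNS server", set_value("dns", "10.0.0.53"), set_value("dns", "10.0.0.1")),
]
fix, log = isolate_fix(candidates, lambda: settings["dns"] != "10.0.0.53")
```

Because each failed change is reverted before the next one is tried, only the change that actually made the problem go away is still in place at the end. In a real environment, settle_seconds would need to be long enough to cover a fix that takes 30 minutes to kick in.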

Once you know that you fixed the issue and that this particular solution or solution path fixed the issue, you've successfully completed the majority of your root cause analysis. But you're not quite done yet.

Step Four: Blameless Reporting

This is a hard one. You need to do a report about the root cause. But what if the cause is something someone changed or did that made the issue come up? How do you do a report without throwing someone under the bus?

Fix the problem, not the blame. You can't have proper root cause analysis if the root cause is "David." People aren't the cause of issues. David's existence didn't cause the server to reboot. David's actions caused it. Focus on the actions that caused the problem and the resolution. Maybe it's as simple as revoking reboot rights from the group that David and other junior admins belong to. Maybe the root cause really is that David was mad at management and just wanted to reboot a production server to make them mad. But you have to focus on the actions and not the people. Blaming people doesn't solve problems. Correcting actions does.

Root cause analysis isn't easy. It's designed to help people get to the bottom of their problems with a repeatable process every time. If you follow these steps and make sure you're honest with yourself along the way, you'll quickly find that your problems are getting resolved faster and more accurately. You'll also find that people are more willing to quickly admit mistakes and get them rectified if they know the whole thing isn't going to come down on their head. And a better overall environment for work means fewer problems for everyone else to have to solve.


Nice write up

Blameless Reporting is the mark of a good environment.  It's part of being professional, and it shows your team and organization have good people-skills.

When finger-pointing occurs, individuals and teams may tend to obfuscate or cover-up problems and their true causes.  This results in the inability to find, correct, and prevent the causes of outages.

We have PIRs (Post-Incident Reviews) and RCAs (Root Cause Analyses) every time there is an event that negatively impacts the network use of customers or employees.  And our sessions involve non-accusatory analyses that are focused on understanding the problem, its cause, its impact, and imparting an understanding of the pain the outage caused to customers or the organization.  And we learn to not make that mistake a second time.

Then we all move on to a better tomorrow, not worried about being "written up" or having our jobs be put at risk when another mistake is made in the future. We're human, trying to be perfect, as best we can.

Level 15

Nice article.  These things are so ingrained in me that I don't have to think about them, they just occur.  However, I thought this was important to share with some of my junior co-workers.  Hopefully, they take them to heart and gain the skills through practice to utilize these steps.


Level 13

Good Article

Level 15

How did you get to that level of maturity?  I have long experience, but many of my team are first-time IT people or not too far out of school.  They all have the fear factor about making mistakes or pointing blame.  One of the biggest blockages to getting there is that people can't admit they're wrong or that they don't know.  I think it would be great to be able to have proper after-incident reviews and RCAs, but when the people present can't be honest...   Suggestions?

Level 10

Really good article, thanks. It conjures up memories of fishbone analysis. I really liked your discussion of Blameless Reporting. That is a real challenge in most environments. Looks like some, such as rschroeder, have it figured out.

Level 20

Getting to the root of the problem isn't always an easy thing to do.  Often things can be fixed without knowing why something actually happened.

I was involved with at least one of these a week at my old job, a managed web hosting provider. "Blameless Reporting" can sometimes rely heavily on spin when preparing an AAR for an angry customer. An old boss taught me a lesson that I still use today:

     When determining blame you first look at the process. If the process is solid then you look at the person.


Good article - that point about blameless reporting is so critical. I've seen environments where management encourages people to hide things. If a person feels that they are going to be brought into the "high carpet" for an incident, they are more likely to try to hide it. There is a big difference between getting in trouble and being held accountable. My previous boss never handled anything in anger or with blame. He would, after the incident was resolved, ask everyone involved "What did you learn?" and "How can we prevent this in the future?" Emphasis on the individual learning and on the We of the team working together.

Level 14

Thanks for the article. 

About the Author
A nerd that happens to live and breathe networking of all kinds. Also known to dip into voice, security, wireless, and servers from time to time. Warning - snark abounds.