Root Cause Paralysis

MVP

So far this month, we've talked about the difficulty of monitoring complex, interconnected systems; the merging of traditional IT skills; and tool sprawl. You've shared some great insights into these common problems. I'd like to finish up my diplomatic tenure with yet another dreaded reality of life in IT: Root Cause Analysis.

Except... I've seen more occurrences of Root Cause Paralysis lately. I'll explain. I recently watched a complex system suffer a major outage because of a simple misconfiguration on an unmonitored storage array. That simple misconfiguration in turn revealed several bad design decisions that had been built on top of it. Once the incident was resolved, management demanded a root cause analysis to determine the exact cause of the outage and to implement a permanent corrective action. All normal, reasonable stuff.

The Paralysis began when representatives from multiple engineering groups arrived at the RCA meeting. It was the usual suspects: network, application, storage, and virtualization. We began with a discussion of the network, and the network engineers presented a ton of performance and log data from the morning of the outage to indicate that all was well in Cisco-land. (To their credit, the network guys even suggested a few highly unlikely scenarios in which their equipment could have caused the problem.) We moved to the application team, who presented some SCOM reports that showed high latency just before and during the outage. But when we got to the virtualization and storage components, all we had was a hearty, "everything looked good." That was it. No data, no reports, no graphs to quantify "good."

So my final questions for you:

  1. Has this situation played out in your office before?
  2. What types of information do you bring with you to defend your part of the infrastructure?
  3. Do you prep for these meetings, or do you just show up and hope for the best?

Go!

30 Comments
Level 19

I think that if the root cause can be identified in a reasonable time, it should be examined. Recently I lived through the analysis-paralysis part of root cause, and it was just a pain. The information I brought to the table in this case was the exact device that caused a bottleneck, plus graphs showing how memory and CPU spikes caused the device to choke. Then the question became what traffic caused the spikes. I had anticipated this, so I referred to the NetFlow data, which showed no changes in the traffic pattern, just more of everything. To me the issue was the device and the trend of increasing CPU and memory utilization. The packets that broke the camel's back were just that: they arrived at the device at the wrong time and were the last to overload it. I usually don't enter these meetings feeling that I have to defend my network devices, just that I want to find the culprit, a reasonable explanation of the cause, and a plan of action.
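The "no change in pattern, just more of everything" argument above can be made concrete with a quick check like the following. This is a minimal sketch with made-up byte counts, not actual NetFlow export data: compare each protocol's share of total traffic in a baseline window against the incident window. If the shares match while the totals grow, the mix is unchanged and the problem is load, not a new traffic source.

```python
# Hedged sketch (hypothetical numbers): did the traffic mix change,
# or did everything just scale up?

def traffic_mix(flows_bytes):
    """Return each protocol's share of total bytes."""
    total = sum(flows_bytes.values())
    return {proto: b / total for proto, b in flows_bytes.items()}

# Made-up per-protocol byte counts for a baseline and an incident window.
baseline = {"http": 600, "smb": 300, "dns": 100}
incident = {"http": 1800, "smb": 900, "dns": 300}   # 3x everything

base_mix = traffic_mix(baseline)
inc_mix = traffic_mix(incident)

# Shares are identical even though volume tripled: the pattern held,
# only the load grew.
for proto in baseline:
    print(f"{proto}: baseline {base_mix[proto]:.0%}, incident {inc_mix[proto]:.0%}")
```

A two-column table of these shares is exactly the kind of artifact that shifts an RCA meeting from "prove it wasn't you" to "the device simply ran out of headroom."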

Level 12

Yes, it happens all the time. When we start down the RCA road, it turns into an infrastructure/applications donnybrook. I tend to walk into those meetings only if I'm invited (don't ask the monitoring guy) to join. I'll bring the data I'm seeing off the server (CPU, memory, IOPS, network stats, and NetFlow) to back up the infrastructure team. But if I tell the applications/dev team we had slightly high CPU or memory, the next response is that the box isn't resourced right and we should add CPU and memory, because it can't be the application... and the battle starts.

Level 19

Well of course! You weren't implying that the app actually uses CPU and memory, were you??

Level 12

It's even better when I show them 30-90 days of usage and it still has to be CPU.

MVP

If you show up to these meetings with some NMS data, you're putting the focus on the data, not you. Otherwise you end up playing defense, and that's when tensions can mount. And +1 for being more interested in the resolution than the blame.

MVP

uh, oh. better add another 4 vCPUs to that single-threaded application VM.

Level 13

You have to come ready to defend your technology domain. Just saying it's good would never work in any of the environments I have worked in.

Level 21

I love reading posts such as these where I feel like "holy crap, I have totally been in that exact same situation just last week".

When we do an RCA, we are expected to provide all of our data to our manager in advance of any meeting that we may have. I am on the Windows OS team and am responsible for providing the OS- and application-level information, so I am normally covered with the performance data out of SAM in addition to any logs that may be available. I do have an advantage in being the SCP, as I know how to use the SolarWinds products to produce data better than anyone else at the company.

In any RCA meeting I think it's important to focus on data and not people; this keeps the team focused on understanding the problem and coming up with a solution. If you focus on the people, you end up with folks going on the defensive, and then you won't get anything productive out of the meeting or the process; unfortunately, I have been in that meeting more than once.

Level 12

+1 byrona... I concur.

MVP

Totally agree. Bringing data redirects attention away from people. RCAs shouldn't be a time to attack your organizational rivals.

Level 13

Oh yes, been there.  It's always the network's fault, and especially the firewall's fault.  So being the firewall admin I show up prepared to do battle.  The battle isn't really to point blame at someone else but rather to make sure the network/firewall isn't blamed without reason and that we actually get down to finding out what happened.

Preparation includes having a Frappuccino right before the meeting - it ensures I'm awake and makes me a nicer person to play with.

Data includes the usual graphs of bandwidth usage, a list of the hosts involved as we understand them, and a diagram to help depict what each of those graphs represents and the traffic path between the hosts. When a firewall is in the path, I will also bring a list of all the rules relevant to the hosts involved, along with the hit counts on those rules, to help show what the firewall did or didn't do to the traffic.
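The rule-plus-hit-count prep described above is easy to automate once the rules are exported to a structured format. The sketch below uses entirely hypothetical rule records (no real firewall API is assumed): filter the ruleset down to the hosts involved in the incident and sort by hit count, so the meeting sees exactly which rules touched the traffic and which never fired.

```python
# Hedged sketch (made-up rule data, not a vendor export format):
# extract only the rules relevant to the incident hosts, with hit counts.

rules = [
    {"id": 10, "src": "10.0.1.5", "dst": "10.0.2.9", "action": "allow", "hits": 4821},
    {"id": 20, "src": "10.0.1.5", "dst": "10.0.3.7", "action": "deny",  "hits": 0},
    {"id": 30, "src": "10.0.9.9", "dst": "10.0.4.4", "action": "allow", "hits": 77},
]

incident_hosts = {"10.0.1.5", "10.0.2.9"}

# A rule is relevant if either endpoint is one of the incident hosts.
relevant = [r for r in rules
            if r["src"] in incident_hosts or r["dst"] in incident_hosts]

# Highest hit counts first: the rules that actually handled the traffic.
for r in sorted(relevant, key=lambda r: r["hits"], reverse=True):
    print(f'rule {r["id"]}: {r["src"]} -> {r["dst"]} {r["action"]} ({r["hits"]} hits)')
```

A deny rule with zero hits is often the most persuasive line in the report: it shows the firewall never blocked the traffic in question.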

Level 11

It's always the firewall's fault.  See this Dilbert comic for reference: http://dilbert.com/strips/comic/2013-04-07


If the firewall can't be blamed, then it's the network's fault.  After all, the network arbitrarily breaks applications at random.  The RCA meeting is for everyone to defend their area of responsibility, or point blame at someone else.  My job title is technically Network Engineer 2, but I spend more and more time defending the network instead of configuring it.  Thankfully for me, packets don't lie, and I've become quite quick to break out packet analysis whenever a finger gets pointed at the network.  In my efforts to exonerate the network, I usually uncover the actual issue. 

Level 10

We typically don't have a formal meeting about stuff like this, but there is always research done to try to find out what caused it. An incident manager, rotating every week, is in charge of driving the issue to completion and writing up a report on what happened, what was done to correct it, and how to prevent it in the future. The tone has always been not about who is to blame, because the blame is shared equally among all the groups. The users as well as upper management pretty much think of us as "the helpdesk" and beyond that don't care what the difference between app support and the storage team is. All they care about is making sure it doesn't happen again. Our director of IT has done a good job explaining that the amount of downtime we actually have is much less than the average for companies of our size, but such is life; there are going to be problems that we won't be able to foresee and that will cause interruptions in service. SolarWinds helps us reduce these significantly.

Level 13
After all, the network arbitrarily breaks applications at random

This is especially funny when you work with developers claiming you changed something in the network and now their program (which they're actively developing, i.e., changing) doesn't work right anymore.

Level 10

I've always wanted a network that no-one is allowed to use...  A place of calm serenity, far away from disturbing hands of users or devs.  The perfect network has no users, and, alas, is therefore not the perfect network.

Level 13

I often joke that the network would run much better if it weren't for the damn users.

MVP

Dude, you need a home lab.

Level 13

Computerized users programmed to split the network time evenly so that all of the computer users get the best speeds possible while the others wait?

Virtual user-token ring?

1. We don't have RCA meetings per se, but we have had major outages with high visibility that required an organized followup for RCA.

2. Since I'm involved with infrastructure, virtualization, and storage in addition to owning all of our Solarwinds modules, I tend to have all the info that I (or our applications team, or anyone else) would need at hand and produce appropriate stuff based on the problem. I don't know about other shops, but our app devs and app managers use their Solarwinds views sparingly since we're fortunate enough not to have a ton of problems - and since I was intimately involved with the construction and config of all the SAM stuff, I can decipher the SAM info just as quickly as they can. There's not a lot of fingerpointing going on and I'm grateful for that. In the past, I worked at shops where infrastructure and applications groups had outright adversarial relationships, and it showed.

3. See above - definitely prep and maybe save/export a few relevant things for easy viewing. Nowadays, between NPM/SAM/NTA and LEM, I have tons of real info on tap.

MVP

Sounds like management style has a lot of influence over these RCA meetings. You must have a good relationship with your peers if you're all willing to accept responsibility for outages and work together to prevent them. Nice!

Level 10

I am not usually involved in these meetings, but I am asked to supply data. One thing I always find entertaining is that during 35-40% of these outages, the Operations Group (NOC) will receive a group Jabber from a member of the Engineering Team with one or two words: "Ooops," "My Bad," or "D@mn It." My personal favorite: "Sorry, I Fat Gingered that one." Now the NOC refers to outages as Fat Gingers, and the legend continues.

Level 10

Just recently we had an issue crop up at the wrong time, which of course brought a lot of attention and "help" from our customer. I am a DoD contractor, and any of you who have experience in my world know what I'm talking about. I am the network engineer for a center which supports several different contracts, all in theory supporting the same mission. Typically when we convene an RCA, each contract comes in immediately with the position "it wasn't us," and not much happens fast. However, in the most recent RCA, we were able to quickly move past the defensiveness as we focused on the data presented. We had sniffer data and firewall logs showing what was happening on the wire, and we were able to follow it and identify the offending device. No fingers were pointed at individuals or contracts; how refreshing. Hopefully in the future, all RCAs can work like this one did and keep the paralysis due to predetermined defenses to a minimum.

Level 10

Exactly rharland2012. I concur with your answers.

Level 10

Not everyone holds RCA meetings, but that doesn't mean we don't have such outages. I remember experiencing one, and every concerned party began to panic. We had a meeting afterwards to explain what happened, the cause, the resolution, and how to prevent it from recurring. All I can say now is that it wasn't child's play.

Level 12

As I mentioned in a previous ambassador post, I used to work for a company where the different teams were actually different companies in an outsourcing arrangement.  That led to a lot of finger pointing when it came time to do an RCA.  To add to the frustration, there was no central NMS on which to rely for data.

Level 9

I'd be willing to bet we have all learned (or will one day) that going to any meeting without data to back up the "all is good here" statement ends in fingers getting pointed your direction, simply because you don't have the data to back up the comment. Now, before I go into a meeting, I'll get all the ducks I can find and put them in a row (neat little graphs, charts, screenshots, etc.) and be prepared for anything that might sling fault my direction. Even if the fact is that my component went down, failed, or was misconfigured, I have the data to clearly show what the problem was. Leaving the meeting with no real conclusion is sometimes worse, because you end up looking like you don't even know what you're managing. If you don't know something is wrong, why do they have you managing it? If it's wrong, take the blame, figure out the solution, and move forward. Sometimes we get stuck on not wanting to admit something is our (department's, component's, etc.) fault.

Level 9

It seems that nowadays network engineers have to be men and women of many hats. I have found myself in many situations troubleshooting an issue that had nothing to do with the network, but because we were blamed first, we had to run with the issue as far as we could, which a lot of times got the issue/problem isolated and eventually fixed. It's a never-ending struggle...

Level 15

This was a good discussion.

MVP

wow...I missed this one.

I have seen this several times in several shops. Most everyone brings logs, graphs, and charts... usually to defend their domain.

In some cases it was a real "what happened?" event and we were all digging into scenarios. RCAs should be a group effort to improve things while preventing incidents in the future.

I just reached the point where I've been stepping back to re-evaluate my team's tool sprawl.  It seems we're going in the right direction, but still increasing our time spent with tools, learning new ones, not retiring old ones, and configuring existing ones to get more bang for the buck.

I'm at the point where I need more bodies to efficiently utilize the information that's available to us, and I don't have them.

At the same time I'm advocating for more Orion tools to be made available to our IT teams that aren't using anything that can give them great insight into the conditions of their systems and their users' experiences.

About the Author
Long-time SolarWinds implementer and user. Spend my days now with vSphere, HDS, Nexus, and pretty much everything else.