
How to Use Monitoring Tools to Sniff Out the Root Cause

Level 11

When it comes to IT, things go wrong from time to time. Servers crash, memory goes bad, power supplies die, files get corrupted, backups get corrupted...there are so many things that can break. When they do, you troubleshoot the issue and bring everything back online as quickly as humanly possible. It feels good; you might even high-five or fist-bump your co-worker. For the admin, this is a win. For the higher-ups, however, this is where the finger pointing begins. Have you ever had a manager ask you “So what was the root cause?” or say “Let’s drill down and find the root cause.”

I have nightmares of having to write after action reports (AARs) on what happened and what the root cause was. In my imagination, the root cause is a nasty monster that wreaks havoc in your data center, the kind of monster that lived under your bed when you were eight years old, only now it lives in your data center. This monster barely leaves a trace of evidence as to what it did to bring your systems down or corrupt them. This is where a good systems monitoring tool steps in to save the day and help sniff out the root cause.

Three Things to Look for in a Good Root Cause Analysis Tool

A good root cause analysis (RCA) tool accomplishes three things for you, which together give you the best read on what the root cause most likely is and how to prevent it in the future.

  1. A good RCA tool will…be both reactive and predictive. You don’t want a tool that simply points to logs or directories where there might be issues. You want a tool that can describe what happened in detail and point to the location of the issue; you can’t begin to track down a problem if you don’t understand what happened and have a clear timeline of events. On the predictive side, the tool should learn patterns of activity within the data center so it can warn you when it sees things going downhill.
  2. A good RCA tool will…build a baseline and continue to update that baseline as time goes by. The idea here is for the RCA tool to really understand what “normal” looks like for you: what set of activities and events normally takes place within your systems. Once a consistent and accurate baseline is learned, the RCA tool can be far more precise about what a root cause might be when something happens outside the norm.
  3. A good RCA tool will…sort out what matters and what doesn’t. The last thing you want is a false positive when it comes to root cause analysis. The best tools can reliably separate false positives from the real events that can do serious damage to your systems (see the sketch after this list for one way to think about baselining and filtering).
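
To make the baselining and filtering ideas in points 2 and 3 concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works; the `Baseline` class, the 3-sigma threshold, and the sample window are all assumptions chosen for the example.

```python
from collections import deque
import math

class Baseline:
    """Rolling baseline for a single metric (illustrative only).

    Keeps the most recent samples, recomputes mean and standard deviation
    as new data arrives, and flags values that stray more than `threshold`
    standard deviations from what has been "normal" so far.
    """

    def __init__(self, window=288, threshold=3.0):
        self.samples = deque(maxlen=window)   # e.g., one day of 5-minute polls
        self.threshold = threshold

    def update(self, value):
        """Record a new sample and report whether it looks abnormal."""
        abnormal = False
        if len(self.samples) >= 30:           # need some history before judging
            mean = sum(self.samples) / len(self.samples)
            variance = sum((s - mean) ** 2 for s in self.samples) / len(self.samples)
            stdev = math.sqrt(variance)
            abnormal = stdev > 0 and abs(value - mean) > self.threshold * stdev
        self.samples.append(value)
        return abnormal

# Example: steady CPU utilization, then one spike that should stand out
cpu = Baseline()
for pct in [22, 25, 24, 23, 26, 21, 24, 25, 23, 22] * 4:
    cpu.update(pct)
print(cpu.update(24))   # False -- within the learned baseline, filtered as noise
print(cpu.update(97))   # True  -- outside normal, worth a closer look
```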

Use More Than One Method if Necessary

Letting your RCA tool become a crutch to your team can be problematic. There will be times that an issue is so severe and confusing that it’s sometimes necessary to reach out for help. The best monitoring tools do a good job of bundling log files for export should you need to bring in a vendor support technician. Use the info gathered from logs, plus the RCA tool output and vendor support for those times when critical systems are down hard, and your business is losing money every minute that it’s down.
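
As a rough illustration of that last point, here is a small Python sketch that gathers the past week of log files into a single archive you could attach to a support case. The directory names, the `*.log` pattern, and the seven-day cutoff are assumptions for the example, not the output of any specific monitoring tool.

```python
import tarfile
import time
from pathlib import Path

def bundle_logs(log_dirs, dest_dir="."):
    """Collect recent *.log files into one compressed archive for a support case."""
    cutoff = time.time() - 7 * 24 * 3600    # only the last week of logs
    name = f"support-logs-{time.strftime('%Y%m%d-%H%M%S')}.tar.gz"
    bundle = Path(dest_dir) / name
    with tarfile.open(bundle, "w:gz") as tar:
        for log_dir in log_dirs:
            base = Path(log_dir)
            for path in base.rglob("*.log"):
                if path.is_file() and path.stat().st_mtime >= cutoff:
                    # keep a readable directory structure inside the archive
                    tar.add(path, arcname=f"{base.name}/{path.relative_to(base)}")
    return bundle

# Example (hypothetical paths): bundle system and application logs before opening a ticket
# print(bundle_logs(["/var/log", "/opt/myapp/logs"]))
```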

14 Comments
Level 14

Thanks!  Nice write up.

Level 13

Good Article. Thanks

Level 11

No problem, thanks for reading.

Level 11

Thank you, hope it was helpful.

Level 16

Thanks for the write up! We do a Failure Impact Analysis (FIA), and it includes an AAR and an Incident Action Plan (IAP), which covers what you are going to do to keep it from happening again and how you are going to monitor it.

This is usually when monitoring gets engaged because that step was skipped when they originally set the technology up. 

Level 14

Good stuff.  One of the problems we have is that some of the "technical" people aren't good enough to do root cause analysis (and we have no proper tools).  They just jump at the first thing they see and blame that.  I then have to take hours (sometimes days) to convince people that my correct analysis is actually correct.  Fortunately I have Solarwinds to help me.

Level 14

I remember years ago, when I was a mainframe engineer, I was on site where I looked after one mainframe and someone from another company looked after the other one.  Overnight an error was reported on one of the reel-to-reel tape drives (you know, the big ones you always see in old movies) during a backup.  The client wanted to know what had happened, and the other engineer (as it was on his system) told them that an operator had pressed the unload button.  Apparently that wasn't a good enough answer and they wanted to know what the engineer was going to do to stop this in future.  His answer was brilliant  -  'cut the operator's fingers off'.  There was nothing the client could say, as the answer was a viable one and it was their own staff that made the mistake.

MVP

Tools are only as good as the quality of the data they are pointed to.

Case in point: if they are not looking at the right information (a different log file), they may never see the triggering root cause event or symptom.  Sometimes it is a correlated event that may not be obvious.

X happens, and Y usually follows within 30 minutes...but if Y doesn't happen within 30 minutes, then you are on the path to doom.

Those sorts of event correlations are learned over time.  Not all tools can understand the relationship, so you must have some sort of correlation engine involved to see and watch for such events.
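
As a rough sketch of that X-then-Y rule (not any particular correlation engine), something like the following Python would catch the case where Y never arrives within its window; the event names and the 30-minute window are made up for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)   # illustrative: Y is expected within 30 minutes of X

def check_correlation(events, trigger="backup started", expected="backup completed"):
    """Scan a time-ordered list of (timestamp, message) pairs and return the
    timestamps of every trigger that was not followed by the expected event
    inside the window."""
    violations = []
    for i, (when, message) in enumerate(events):
        if message != trigger:
            continue
        followed = any(
            msg == expected and when < ts <= when + WINDOW
            for ts, msg in events[i + 1:]
        )
        if not followed:
            violations.append(when)   # Y never showed up -- the path to doom
    return violations

# Example: the second backup never completes inside its window
log = [
    (datetime(2019, 5, 1, 1, 0), "backup started"),
    (datetime(2019, 5, 1, 1, 20), "backup completed"),
    (datetime(2019, 5, 1, 2, 0), "backup started"),
    (datetime(2019, 5, 1, 3, 5), "backup completed"),   # too late
]
print(check_correlation(log))   # [datetime.datetime(2019, 5, 1, 2, 0)]
```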

In some cases, a little-known log file for an app could have contained the needed data, but the logging level is set so high that only the most catastrophic events get through...the root cause, or the indicator for it, never got logged.
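
A quick Python illustration of that logging-level trap (the file name and messages are hypothetical): with the level set this high, the breadcrumb that would have explained the failure never reaches the log.

```python
import logging

# Only CRITICAL and above will ever be written to app.log
logging.basicConfig(filename="app.log", level=logging.CRITICAL)
log = logging.getLogger("app")

log.info("retrying database connection (attempt 3 of 3)")   # dropped -- the real clue
log.critical("service unavailable")                          # the only line you will ever see
```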

In the end, hopefully the RCA tool will help you to build the proper correlations.  Granted, its learning analysis should help do that.  The challenge is that there can be so much data that the noise floor is very deep.

This!  And so much more!  More is required to filter out the chaff.  More is required to build Solarwinds products to automatically correct or prevent problems.

Notification of changes in conditions is no longer enough.  It's no longer sufficient to receive data and have to interpret it and then decide how to respond and act to correct the issues. 

Our teams are falling behind the reporting influx, and we need automation that can take the burden of interpretation and troubleshooting out of our hands.

Our customers can't afford to leave network monitoring--and reactions to changes--in the hands of people anymore.  When a one-second outage to services impacts our customers, only machines can react fast enough to detect the problem, determine the correct cause, and implement the right correction to restore service.  And if the machines can't do that for us, at a minimum they must reach out to the right people and provide accurate and timely data with recommendations about how to resolve the problem.

Level 8

Very good article and fantastic observations in the comments. Thanks all.

MVP

Nice write up

MVP

Nice article. Finding the root cause is critical to understanding, diagnosing and preventing future issues.

Level 20

The hard part to me is the correlation of all of the events against each other to gain real knowledge.  It's not a trivial problem.

Level 13

Good article and excellent points.  To me this is one of the hardest things to get people to grok.  They either don't bother to learn the big picture or get so in the weeds they lose track of what they are trying to accomplish and end up wasting a lot of time doing things that aren't likely to address the issue (or they have one or two solutions to every problem and when they don't work they walk away).   Thanks for posting.