When it comes to IT, things go wrong from time to time. Servers crash, memory goes bad, power supplies die, files get corrupted, backups get corrupted...there are so many things that can go wrong. When things do go wrong, you work to troubleshoot the issue and end up bringing it all back online as quickly as humanly possible. It feels good, you might even high five or fist bump your co-worker, for the admin, this is a win. However, for the higher-ups, this is where the finger pointing begins. Have you ever had a manager ask you “So what was the root cause?” or say “Let’s drill down and find the root cause.”
I have nightmares of having to write after action reports (AARs) on what happened and what the root cause was. In my imagination, the root cause is a nasty monster that wreaks havoc in your data center, the kind of monster that lived under your bed when you were 8 years old, only now it lives in your data center. This monster barely leaves a trace of evidence as to what he did to bring your systems down or corrupt them. This is where a good systems monitoring tool steps in to save the day and help sniff out the root cause.
Three Things to Look for in a Good Root Cause Analysis Tool
A good root cause analysis (RCA) tool can accomplish three things for you, which can provide you with the best track on what the root cause most likely is and how to prevent it in the future.
- A good RCA tool will…be both reactive and predictive. You don’t want a tool that simply points to logs or directories where there might be issues. You want a tool that can describe what happened in detail and point to the location of the issue. You can't begin to track down the issue if you don’t understand what happened and have a clear timeline of events. Second, the tool can learn patterns of activity within the data center that allow it to become predictive in the future if it sees things going downhill.
- A good RCA tool will…build a baseline and continue to update that baseline as time goes by. The idea here is for the RCA tool to really understand what looks “normal” to you, what is a normal set of activities and events that take place within your systems. When a consistent and accurate baseline is learned, the RCA tool can get much more accurate as to what a root cause might be when things happen outside of what’s normal.
- A good RCA tool will…sort out what matters, and what doesn’t matter. The last thing you want is a false positive when it comes to root cause analysis. The best tools can accurately measure false positives against real events that can do serious damage to your systems.
Use More Than One Method if Necessary
Letting your RCA tool become a crutch to your team can be problematic. There will be times that an issue is so severe and confusing that it’s sometimes necessary to reach out for help. The best monitoring tools do a good job of bundling log files for export should you need to bring in a vendor support technician. Use the info gathered from logs, plus the RCA tool output and vendor support for those times when critical systems are down hard, and your business is losing money every minute that it’s down.