Root Cause Analysis for a System Admin


    System admins face challenges every day trying to figure out why a server or application is performing poorly or why a system crashed during the night. Finding the root cause becomes even more difficult when it involves multiple systems that work together. Below are some tips and best practices that I have learned over the years. Please comment and pass along your experiences as well!

     

    It’s usually the simple things! As IT pros, we tend to look for a complicated mix of events as the cause of a problem on a server or application. Sometimes, issues are that complex. However, most of the issues I helped resolve came down to something simple, such as a Windows® update, a hung process or service, or someone shutting a system down improperly. To that end, I learned that stepping back and thinking of the simplest reason a problem can occur often reveals the root of the issue in a lot less time. While I still occasionally find myself running down the rabbit hole, I try to live by what I call the 15-minute rule. If I haven’t figured out the root cause or made any progress within 15 minutes, it’s time to STOP, step back, and think of the simplest thing that can go wrong.

     

    Centralize operating system and application logs. Determining root cause can be extremely frustrating when you have to remote in to each server or application separately, then attempt to correlate or compare data across multiple sources. This may be a shameless plug, but the truth is that using a product that lets you store all your log data in one location saves a lot of time, effort, and guesswork. Many log management tools come with built-in analytics and processes that help speed up log centralization. Even something as simple as a syslog server where you can store logs can save you time.
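
    As a rough illustration of the "even a syslog server helps" point, here is a minimal sketch of a Python script forwarding its own log records to a central syslog server with the standard library's SysLogHandler. The host name in the usage comment, the port, the LOCAL0 facility, and the function name build_central_logger are all assumptions made for the example, not recommendations for any particular product.

```python
import logging
import logging.handlers


def build_central_logger(name, host, port=514):
    """Return a logger that forwards records to a syslog server over UDP.

    `host` and `port` are placeholders for wherever your central
    syslog server listens; 514/UDP is only the conventional default.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        # LOCAL0 is an arbitrary choice for this sketch.
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    # Put the application name first so the server can filter on it.
    handler.setFormatter(
        logging.Formatter("%(name)s: %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# Usage (hypothetical host name):
#   log = build_central_logger("backup-job", "logs.example.internal")
#   log.warning("nightly backup finished with 3 skipped files")
```

    UDP delivery is fire-and-forget, so a sketch like this will not tell you if the server is down. Whether that trade-off is acceptable depends on how much you rely on those logs.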

     

    Visualize your data. Data visualization is an effective way to narrow down root cause. Something as simple as a histogram view of logs over a period of time can reveal anomalies. For example, when you look at the data for a particular block of time, you may see either a large spike or a sudden drop in log events. At that point, you have already narrowed your analysis time frame and significantly reduced the amount of data you have to analyze. Additionally, you get a system-wide view of the log data because both application and server logs are presented at the same time.
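
    The histogram idea can be sketched in a few lines of Python, assuming you have already parsed the event timestamps out of your logs. The function names, the 5-minute bucket width, and the "two standard deviations from the mean" rule are illustrative choices, not a standard method.

```python
from collections import Counter
from datetime import timedelta
from statistics import mean, pstdev


def bucket_counts(timestamps, bucket_minutes=5):
    """Count events per fixed-width bucket, including empty buckets.

    Assumes `bucket_minutes` divides 60 evenly so buckets line up
    across hour boundaries.
    """
    counts = Counter()
    for ts in timestamps:
        minute = ts.minute - ts.minute % bucket_minutes
        counts[ts.replace(minute=minute, second=0, microsecond=0)] += 1
    if not counts:
        return {}
    # Fill gaps with zeros so a sudden drop is visible, not just spikes.
    filled = {}
    bucket, last = min(counts), max(counts)
    step = timedelta(minutes=bucket_minutes)
    while bucket <= last:
        filled[bucket] = counts[bucket]
        bucket += step
    return filled


def find_anomalies(counts, threshold=2.0):
    """Return buckets more than `threshold` std deviations from the mean."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    avg, spread = mean(values), pstdev(values)
    if spread == 0:
        return []
    return [b for b, n in counts.items() if abs(n - avg) > threshold * spread]


def render(counts, width=40):
    """Draw a plain-text histogram, one line per bucket."""
    if not counts:
        return ""
    peak = max(counts.values()) or 1
    return "\n".join(
        f"{b:%H:%M} {'#' * round(n / peak * width)} {n}"
        for b, n in counts.items()
    )
```

    Printing render(bucket_counts(timestamps)) gives a quick text histogram, and find_anomalies points at the buckets worth reading first. A real log management tool does this at scale, but the principle is the same.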

     

    Understand log levels and types. Every server and application has a different method of logging. Windows® audit logging is configured through the local or domain security policy, while Linux® systems use an audit daemon or syslog. If your primary purpose for collecting logs is root cause analysis, then it is critical to understand how devices log so that you collect the right data. Most administrator guides provide details about the types of logs that are produced. However, sometimes you have to do some digging or ask the vendor directly for information on logging, especially to get the exceptions or errors that provide real information beyond just up/down status.
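
    To make the idea of log levels concrete for syslog specifically: each raw syslog message starts with a priority value in angle brackets, which is the facility number multiplied by 8 plus the severity (0 is the most severe, 7 is debug). The sketch below decodes that value and keeps only messages at "err" or worse. The function names are made up for this example, and the sample messages in the usage comment are invented.

```python
import re

# Syslog severities in numeric order, 0 (most severe) through 7.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]

PRI_RE = re.compile(r"^<(\d{1,3})>")


def decode_pri(line):
    """Return (facility_number, severity_name), or None if no valid PRI."""
    match = PRI_RE.match(line)
    if not match:
        return None
    pri = int(match.group(1))
    if pri > 191:  # 23 * 8 + 7 is the largest valid priority value
        return None
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


def is_actionable(line, worst="err"):
    """True when the message is at least as severe as `worst`."""
    decoded = decode_pri(line)
    if decoded is None:
        return False
    return SEVERITIES.index(decoded[1]) <= SEVERITIES.index(worst)


# Usage with invented sample messages:
#   is_actionable("<11>app: database connection refused")   # user.err
#   is_actionable("<134>app: heartbeat ok")                 # local0.info
```

    Other platforms and applications encode levels differently, so treat this as one example of what "understanding how a device logs" looks like in practice.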

     

    Optimize root cause analysis with saved searches and procedures. A quick and easy way to optimize RCA is to save the searches that successfully led you to an answer. Every time I figure out the root cause of an event, I try to save every search I used, either within the tool I am using or in a document. Building a repository of these searches and procedures saves time and speeds up RCA in the future.
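
    If your tool has no saved-search feature, even a small JSON file of named patterns can serve as the repository. The following is a minimal sketch under that assumption. The file layout, the function names save_search and run_search, and the example pattern in the usage comment are all hypothetical.

```python
import json
import re


def save_search(path, name, pattern, note=""):
    """Add or update a named regex search in a JSON file at `path`."""
    re.compile(pattern)  # fail early if the pattern is not a valid regex
    try:
        with open(path, encoding="utf-8") as fh:
            searches = json.load(fh)
    except FileNotFoundError:
        searches = {}
    searches[name] = {"pattern": pattern, "note": note}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(searches, fh, indent=2)


def run_search(path, name, lines):
    """Return the log lines that match the saved search called `name`."""
    with open(path, encoding="utf-8") as fh:
        entry = json.load(fh)[name]
    pattern = re.compile(entry["pattern"])
    return [line for line in lines if pattern.search(line)]


# Usage (hypothetical file name and pattern):
#   save_search("rca_searches.json", "improper-shutdown",
#               r"unexpected shutdown", note="check this first")
#   hits = run_search("rca_searches.json", "improper-shutdown", log_lines)
```

    The note field is where the "procedure" part lives: a sentence on what the search found last time is often as valuable as the pattern itself.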

     

    Hopefully, you find some of these suggestions helpful. Sometimes, in the heat of the moment, it is best to stop and go back to the basics. This isn’t a comprehensive list, so please pass along any tools, tips, tricks, and experiences that have helped you become more efficient in responding to events.


    This post is part of our Best Practices for Root Cause Analysis series. For more best practices, see the index post: Best Practices for Root Cause Analysis.