Root Cause Analysis for a Network Admin
Root cause analysis (RCA) for Network Admins typically centers on network performance issues. Because the network is the backbone of any business, you may also be tasked with providing data to other teams in an effort to troubleshoot issues with applications or a security threat. Root cause analysis can be a daunting task, so I would like to pass along some best practices that got me through those stressful root cause analysis sessions.
Define the problem first. Does this remind you of a lesson from your Critical Thinking class in college? I find that this approach actually works well. Working with your team to define the problem helps you determine where to start your investigation.
Centralize your network device logs. Once you define the problem, the digging begins. This can be tough when you have to investigate data from each device separately. Centralizing your data can make determining root cause less difficult. Even a simple syslog server can save a ton of time by letting you see information across multiple systems in one dashboard. Budget permitting, there are more sophisticated tools that help you correlate information and determine root cause more efficiently.
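To make the idea of one view across devices concrete, here is a minimal sketch in Python. It assumes you have already collected each device's logs as a time-sorted list of (timestamp, message) pairs; the device names and messages are made up for illustration, and a real syslog server or log management tool would do this for you.

```python
import heapq
from datetime import datetime, timedelta


def merge_device_logs(logs_by_device):
    """Merge per-device logs into one time-ordered list.

    logs_by_device maps a device name to a list of (timestamp, message)
    tuples. Each list is assumed to be sorted by timestamp already.
    Returns a list of (timestamp, device, message) tuples.
    """
    streams = (
        [(ts, device, msg) for ts, msg in entries]
        for device, entries in logs_by_device.items()
    )
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    return list(heapq.merge(*streams))


# Hypothetical sample data: two devices logging around the same incident.
t0 = datetime(2024, 1, 1, 12, 0, 0)
sample = {
    "core-sw1": [
        (t0, "Interface Gi0/1 down"),
        (t0 + timedelta(seconds=30), "Interface Gi0/1 up"),
    ],
    "edge-rtr1": [
        (t0 + timedelta(seconds=10), "Routing neighbor lost"),
    ],
}

for ts, device, msg in merge_device_logs(sample):
    print(ts.isoformat(), device, msg)
```

Reading the merged timeline top to bottom shows the order in which devices reacted, which is hard to see when each log is opened separately.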
Visualize your data. Data visualization is an effective way to narrow down root cause. Something as simple as a histogram view of logs over a period of time can reveal anomalies. For example, if you look at data for a particular block of time, you may see either a large spike or a sudden drop in log events. Either one points you to a narrower time frame and significantly reduces the amount of data you need to analyze.
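If your tool does not draw a histogram for you, the idea is simple enough to sketch in Python. The example below counts events in fixed-width time buckets and flags buckets that sit far above or below the median. The 3x threshold and the function names are my own assumptions for illustration, not a standard; tune them to your environment.

```python
from datetime import datetime, timedelta
from statistics import median


def bucket_counts(timestamps, start, end, width):
    """Count events per fixed-width bucket between start and end.

    Empty buckets are kept as zeros so that sudden drops stay visible.
    """
    n_buckets = (end - start) // width
    counts = [0] * n_buckets
    for ts in timestamps:
        if start <= ts < end:
            index = (ts - start) // width
            if index < n_buckets:
                counts[index] += 1
    return counts


def flag_anomalies(counts, factor=3.0):
    """Return indexes of buckets far above or far below the median count."""
    if not counts:
        return []
    base = median(counts)
    return [
        i for i, c in enumerate(counts)
        if c > base * factor or c * factor < base
    ]


def print_histogram(counts, start, width):
    """Print a rough text histogram, one row per bucket."""
    for i, c in enumerate(counts):
        label = (start + i * width).strftime("%H:%M")
        print(label, "#" * c, c)
```

Calling bucket_counts with five-minute buckets over the hour around an incident, then flag_anomalies on the result, gives you a short list of time windows worth reading in detail.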
Understand log levels and types. Every device has a different method of logging. If your primary reason for collecting logs is root cause analysis, then it is critical to understand how your devices log. Most network devices assign a syslog severity level to each message, from 0 (emergency, the most severe) to 7 (debug, the least severe). Understanding what information is provided at each level helps you select the right logs to collect and limits the amount of data you have to work through.
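As a small illustration, standard syslog messages start with a priority value in angle brackets, and the severity is that value modulo 8. The Python sketch below decodes it and keeps only messages at or above a chosen severity. The function names are hypothetical, and what a given vendor actually logs at each level varies, so check your device documentation.

```python
import re

# Severity names as defined by the syslog standard, indexed 0-7.
SEVERITY_NAMES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
]

PRI_RE = re.compile(r"^<(\d{1,3})>")


def decode_pri(line):
    """Return (facility, severity, severity_name) or None if no valid PRI."""
    match = PRI_RE.match(line)
    if not match:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        return None
    facility, severity = divmod(pri, 8)
    return facility, severity, SEVERITY_NAMES[severity]


def filter_by_severity(lines, max_severity=4):
    """Keep lines whose severity number is <= max_severity.

    Lower numbers are more severe, so max_severity=4 keeps
    Emergency through Warning and drops Notice, Informational, Debug.
    """
    kept = []
    for line in lines:
        decoded = decode_pri(line)
        if decoded is not None and decoded[1] <= max_severity:
            kept.append(line)
    return kept
```

For example, a message beginning with <189> decodes to facility 23 and severity 5 (Notice), so it would be dropped by the default filter above.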
Optimize RCA with saved searches and procedures. A quick and easy way to speed up RCA is to save searches. Depending on the tool you deploy to manage logs, you should have the ability to save an analysis you have performed. Anytime you use a specific method to identify the cause of an issue, save that search or process. Creating a repository of these searches or processes can speed up RCA in the future.
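Most log management tools have a built-in way to save searches, so use that where you can. If yours does not, even a small file of named patterns works. Here is a minimal Python sketch that stores regular expressions with a note in a JSON file; the file layout and function names are assumptions of mine, not any tool's format.

```python
import json
import os
import re


def load_searches(path):
    """Load the saved-search repository, or an empty one if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_search(path, name, pattern, note=""):
    """Store a named regex plus a note on when it was useful."""
    re.compile(pattern)  # raises re.error early if the pattern is invalid
    searches = load_searches(path)
    searches[name] = {"pattern": pattern, "note": note}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(searches, f, indent=2, sort_keys=True)


def run_search(path, name, lines):
    """Return the log lines that match a previously saved search."""
    pattern = re.compile(load_searches(path)[name]["pattern"])
    return [line for line in lines if pattern.search(line)]
```

The note field matters as much as the pattern: recording which incident a search solved makes it much easier to find again under pressure.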
Hopefully, you find some of these suggestions helpful. Sometimes in the heat of the moment it is best to stop and go back to the basics. This isn’t a comprehensive list, so please pass along any tools, tips, tricks, and experiences that have helped you become more efficient in responding to events.
This post is part of our Best Practices for Root Cause Analysis Series. For more best practices, check out the index post here: Best Practices for Root Cause Analysis