Best Practices for Root Cause Analysis

    Root cause analysis (RCA) is arguably the most critical function in security after maintaining a secure and compliant network. Any time a security event is detected, or even suspected, you are tasked with discovering its origin or cause.

     

    I spent a couple of years managing a Security Operations Center (SOC), so I’m well acquainted with the frustrations of root cause analysis, especially when you are under the gun to provide answers ASAP.

     

    Based on that experience, I would like to pass along some best practices that got me through those stressful root cause analysis sessions. At the same time, I’m interested in learning about the tools, tips, and tricks that helped you become more efficient in responding to security events. Feel free to respond with your experiences and ideas.

     

    Sometimes less is more. When you look into your log management or SIEM technology, it is often better to start with a single keyword, IP address, or username for a specific timeframe rather than a highly detailed search. This kind of simple search may surface activity from other devices, which can help you backtrack and find out where the event originated. Pay attention to large numbers of certain events, such as errors, access failures, file activity, and change events, that contain the same username, IP address, or both. Also look for changes by the same user or source IP across multiple systems.
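
    As a rough illustration of the "start simple" idea, here is a minimal Python sketch. It assumes your logs have already been exported and parsed into dictionaries; the field names (time, host, user, src_ip, type) and the sample records are hypothetical and will differ depending on your log source or SIEM.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, already-parsed log records. Field names are assumptions.
EVENTS = [
    {"time": "2024-01-10T02:14:00", "host": "fw01", "user": "jsmith", "src_ip": "10.0.0.5", "type": "access_failure"},
    {"time": "2024-01-10T02:15:00", "host": "fw01", "user": "jsmith", "src_ip": "10.0.0.5", "type": "access_failure"},
    {"time": "2024-01-10T02:20:00", "host": "web01", "user": "jsmith", "src_ip": "10.0.0.5", "type": "change"},
    {"time": "2024-01-10T02:25:00", "host": "db01", "user": "adavis", "src_ip": "10.0.0.9", "type": "file_activity"},
    {"time": "2024-01-10T09:00:00", "host": "web01", "user": "jsmith", "src_ip": "10.0.0.5", "type": "access_failure"},
]


def broad_search(events, term, start, end):
    """Return events inside [start, end] where any field contains the term."""
    hits = []
    for event in events:
        ts = datetime.fromisoformat(event["time"])
        if start <= ts <= end and any(term in str(v) for v in event.values()):
            hits.append(event)
    return hits


def summarize(hits):
    """Count repeated (user, source IP, event type) combinations and list the hosts involved."""
    counts = Counter((e["user"], e["src_ip"], e["type"]) for e in hits)
    hosts = sorted({e["host"] for e in hits})
    return counts, hosts


hits = broad_search(
    EVENTS,
    "10.0.0.5",
    datetime(2024, 1, 10, 2, 0),
    datetime(2024, 1, 10, 3, 0),
)
counts, hosts = summarize(hits)
for key, n in counts.most_common():
    print(n, key)
print("hosts touched:", hosts)
```

    The search is a plain substring match on one term, so it will also pick up longer values that contain it. The point is only to show how one term plus a time window can expose the same user or IP showing up on several systems.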

     

    Visualize your data. Visualizing data is a way to identify trends in the information flow. A large spike in a single event type or a sustained number of events over a period of time usually signifies an anomaly. Visualization can also help you quickly identify a timeframe to start your investigation. Finally, if you are using a SIEM or log management product, you can typically build dashboards based on various criteria such as network traffic, authentication, file, and change events.
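
    If you do not have a dashboard handy, even a crude text chart can show where to start looking. The sketch below is one possible approach, not a feature of any particular SIEM: it buckets event timestamps by hour and flags hours whose count exceeds a multiple of the median. The sample timestamps and the factor of 3 are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime
from statistics import median

# Hypothetical event timestamps: a quiet night with a burst in the 02:00 hour.
SAMPLE = (
    ["2024-01-10T00:%02d:00" % m for m in range(2)]
    + ["2024-01-10T01:%02d:00" % m for m in range(3)]
    + ["2024-01-10T02:%02d:00" % m for m in range(12)]
    + ["2024-01-10T03:%02d:00" % m for m in range(2)]
)


def hourly_counts(timestamps):
    """Bucket ISO 8601 timestamps into per-hour counts."""
    buckets = Counter()
    for ts in timestamps:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets[hour] += 1
    return buckets


def find_spikes(buckets, factor=3.0):
    """Return hours whose count exceeds `factor` times the median hourly count.

    Hours with no events at all are not in `buckets`, so they do not
    pull the median down in this simplified version.
    """
    baseline = median(buckets.values())
    return sorted(h for h, n in buckets.items() if n > factor * baseline)


def text_chart(buckets):
    """Render one line per hour, with one '#' per event."""
    return "\n".join(
        f"{hour:%Y-%m-%d %H}:00 {'#' * buckets[hour]}" for hour in sorted(buckets)
    )


buckets = hourly_counts(SAMPLE)
print(text_chart(buckets))
print("spikes:", find_spikes(buckets))
```

    A flagged hour is only a starting point for the investigation; a sensible threshold depends on what normal volume looks like in your environment.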

     

    Correlate data from different devices to identify security events. Correlation of log data across different devices, systems, and applications adds another layer of security monitoring and may reveal security issues that slipped under the radar. For example, a spike in outbound email traffic that does not originate from your internal email server is a good indication of malware. Advanced Persistent Threats (APTs) can be hard to detect. However, if you’re investigating logs, you can look for unexpected software installs correlated with outbound FTP traffic logs from your firewall within the same timeframe as an attack.
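
    The two examples above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the log records, field names, mail server address, and 15-minute window are all hypothetical, and the email check flags any outbound SMTP source other than the mail server instead of measuring a spike. Ports 25 and 21 are the standard SMTP and FTP control ports.

```python
from datetime import datetime, timedelta

MAIL_SERVER = "10.0.0.25"  # hypothetical internal mail server address

# Hypothetical outbound firewall records and software install records.
firewall_logs = [
    {"time": "2024-01-10T02:30:00", "src_ip": "10.0.0.25", "dst_port": 25},
    {"time": "2024-01-10T02:31:00", "src_ip": "10.0.0.77", "dst_port": 25},
    {"time": "2024-01-10T02:40:00", "src_ip": "10.0.0.77", "dst_port": 21},
    {"time": "2024-01-10T08:00:00", "src_ip": "10.0.0.80", "dst_port": 21},
]
install_logs = [
    {"time": "2024-01-10T02:35:00", "host_ip": "10.0.0.77", "package": "unknown-tool"},
    {"time": "2024-01-10T05:00:00", "host_ip": "10.0.0.80", "package": "editor"},
]


def rogue_smtp_sources(fw_logs, mail_server):
    """Return hosts sending outbound SMTP (port 25) that are not the mail server."""
    return sorted(
        {e["src_ip"] for e in fw_logs if e["dst_port"] == 25 and e["src_ip"] != mail_server}
    )


def installs_near_ftp(installs, fw_logs, window=timedelta(minutes=15)):
    """Pair each software install with outbound FTP (port 21) from the same host within `window`."""
    matches = []
    for inst in installs:
        t_install = datetime.fromisoformat(inst["time"])
        for fw in fw_logs:
            if fw["dst_port"] != 21 or fw["src_ip"] != inst["host_ip"]:
                continue
            if abs(datetime.fromisoformat(fw["time"]) - t_install) <= window:
                matches.append((inst["host_ip"], inst["package"], fw["time"]))
    return matches


print("unexpected SMTP sources:", rogue_smtp_sources(firewall_logs, MAIL_SERVER))
print("installs near FTP traffic:", installs_near_ftp(install_logs, firewall_logs))
```

    In this sample, the same host appears in both results, which is the kind of overlap that neither log source would reveal on its own.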

     

    Establish templates to support incident response. RCA and incident response are different functions, but they rely on each other. Determine what a response team, legal, or your leadership will require from the data (e.g., IP address, port, username), then create Standard Operating Procedures (SOPs) and templates for general and specific situations. Regardless of who is present when an event occurs, everyone needs to be clear about what information is critical and who to contact.
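
    A template can be as simple as a fixed set of fields plus a check for what is still missing. The following Python sketch shows one way to express that; the field list is an assumption based on the examples above (IP address, port, username) plus a few hypothetical additions, and your own SOPs should drive the real list.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentReport:
    """Hypothetical incident template; every field starts out unfilled."""
    summary: Optional[str] = None
    src_ip: Optional[str] = None
    port: Optional[int] = None
    username: Optional[str] = None
    first_seen: Optional[str] = None
    contact: Optional[str] = None  # who to notify, per your SOP


def missing_fields(report):
    """List template fields that have not been filled in yet."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


report = IncidentReport(
    summary="Outbound FTP after unexpected install",
    src_ip="10.0.0.77",
    username="jsmith",
)
print("still needed:", missing_fields(report))
```

    Whoever picks up the event can see at a glance which details the response team is still waiting on.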

     

    Hopefully you find these suggestions helpful. Sometimes, in the heat of the moment, it is best to stop and go back to the basics. This isn’t a comprehensive list, so please pass along any tools, tips, tricks, and experiences that have helped you become more efficient in responding to events.

     

    Want more specific information? Check out these additional links:

    Root Cause Analysis for a System Admin

    Root Cause Analysis for a Network Admin

    Root Cause Analysis for a Security Pro