Most of the time, IT pros gain troubleshooting experience via operational pains. In other words, something bad happens and we, as IT professionals, have to clean it up. Therefore, it is important for you to have a troubleshooting protocol in place that is specific to dependent services, applications, and a given environment. Within those parameters, the basic troubleshooting flow should look like this:
- Define the problem.
- Gather and analyze relevant information.
- Construct a hypothesis on the probable cause for the failure or incident.
- Devise a plan to resolve the problem based on that hypothesis.
- Implement the plan.
- Observe the results of the implementation.
- Repeat steps 2-6.
- Document the solution.
Steps 1 and 2 usually lead to a world of pain. First of all, you have to define the troubleshooting radius, the surface area of systems in the stack that you have to analyze to find the cause of the issue. Then, you must narrow that scope as quickly as possible to remediate the issue. Unfortunately, remediating in haste may not actually lead to uncovering the actual root cause of the issue. And if it doesn’t, you are going to wind up back at square one.
You want to get to the single point of truth with respect to the root cause as quickly as possible. To do so, it is helpful to combine a troubleshooting workflow with insights gleaned from tools that allow you to focus on a granular level. For example, start with the construct that touches everything, the network, since it connects all the subsystems. In other words, blame the network. Next, factor in the application stack metrics to further shrink the troubleshooting area. This includes infrastructure services, storage, virtualization, cloud service providers, web, etc. Finally, leverage a collaboration of time-series data and subject matter expertise to reduce the troubleshooting radius to zero and root cause the issue.
If you think of the troubleshooting area as a circle, as the troubleshooting radius approaches zero, one gets closer to the root cause of the issue. If the radius is exactly zero, you’ll be left with a single point. And that point should be the single point of truth about the root cause of the incident.
Share examples of your troubleshooting experiences across stacks in the comments below.