There are many books, websites, and probably self-help videos devoted to teaching or explaining the art of troubleshooting. Most are specific to an industry, and further to a problem domain within that industry. Within each problem domain, within each industry, within each methodology, there are tools of the trade designed to help you solve whatever problem is vexing you at that moment. All of this specificity, however, can be abstracted away from these insular, domain-specific modalities to reach a broader understanding of the role of troubleshooting in general.
It goes without saying that you cannot find something that you do not know you are looking for, and yet this is what a lot of neophyte engineers instinctively try. “The phones are down” may seem like the problem you need to fix, but counterintuitively, that is only a symptom of the real problem. The real problem, the one causing the phones to be down, lies elsewhere. Rather than running around trying to figure out what’s up with the phones, you should be asking, “Why are the phones ‘down’?” and moving on from there. For example, are all the phones down? Some? Are there other symptoms? And what has changed recently, if anything? Once you’ve worked through some of this, which may take only seconds or minutes for a seasoned engineer, you’re better prepared to move on to the next steps.
Analyzing the problem(s), or problem statements, will help you form a hypothesis as to where the problem is likely to lie. Now, how can you begin testing your ideas to see if you are on the right track? Well, in the IT world that we all live in (I know, I said abstracted…), you’re going to need information. Information gathering can be a manual process, and in many cases must be, but having good tools at your disposal can certainly help the process along, especially when you are shooting in the dark, so to speak. Again, if you don’t know what you don’t know, an automated and impartial tool can help.
Tool impartiality is an often-overlooked asset in the discovery phase of troubleshooting any problem. Plumbers have scopes to look inside pipes they could not otherwise see into; electricians have multimeters to help them test continuity, resistance, etc.; and you as an IT professional have tools like PerfStack. A tool like this happily gathers information from all of your systems, jumping to no conclusions, and can call out abnormalities in the steady state of a system. Where many engineers skip straight to the “trying to fix anything they suspect is the problem” phase, PerfStack simply presents what it sees in an impartial and authoritative manner. From its dashboards, an engineer can begin their search from a position of knowledge. Combine that with the wisdom that comes from experience, and you have a very strong team.
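To make “calling out abnormalities in the steady state” a little more concrete, here is a minimal Python sketch of the general idea: compare recent samples against a known-good baseline and flag whatever strays too far. This is not how PerfStack works internally; the function name `flag_abnormal`, the three-standard-deviation threshold, and the latency numbers are all hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev


def flag_abnormal(baseline, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that sit more than
    `threshold` standard deviations away from the baseline mean.

    `baseline` is a list of samples taken while the system was healthy
    (it needs at least two values); `recent` is what we are checking now.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [(i, v) for i, v in enumerate(recent) if v != mu]
    return [
        (i, v)
        for i, v in enumerate(recent)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical latency samples in milliseconds (illustrative data only).
baseline_latency_ms = [20, 22, 19, 21, 20, 23, 21, 20]
recent_latency_ms = [21, 22, 95, 20]

print(flag_abnormal(baseline_latency_ms, recent_latency_ms))  # [(2, 95)]
```

The point of the sketch is the impartiality: the check has no theory about what is wrong, it only reports that the third recent sample looks nothing like the steady state, which is exactly the kind of lead you want before forming a hypothesis.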
Mean time to innocence (MTTI) is a somewhat tongue-in-cheek metric in IT shops these days, referring to the amount of time it takes an engineer to prove that the domain for which they have responsibility is not, in fact, the cause of whatever problem is being investigated. To back up a claim of innocence you need information: documentation that the problem is not yours, even if you cannot say with any certainty who does own it. To do this, you need a tool that can generate impersonal, authoritative proof you can stand on, and which other engineers will respect. It certainly helps if a system-wide tool, trusted by all parties, is a major contributor to this documentation.
A tool like PerfStack will certainly help in getting buy-in from the pointy-haired bosses as to what needs to happen to fix whatever needs fixing. Most organizations have a change control process (though likely an amended one during any kind of outage), and documentation is always a part of that. And all of this stuff, this paper trail from beginning to end, flows together nicely right into the final package that many organizations require for a post-mortem. With a clean and robust set of documents, engineers and management can get through an after-the-fact incident meeting much more quickly, and likely with consensus.
At the end of the day, troubleshooting is an art no matter what you do, where you do it, or in what industry you work. The methodologies are largely the same at a macro level, as is the need for quality tools. Can a great engineer find the root cause of a problem without a comprehensive tool like PerfStack? Sure. A cobbled-together band of point tools has always been a part of the engineer’s toolkit and likely always will be, at least until our new sentient robotic overlords obviate the need for it. But a full-scale, system-wide solution like PerfStack should also be a part of any well-stocked engineering team’s process. After all, it can help find those things you do not yet know you are looking for.