I've worked in IT for a long time (I stopped counting at twenty years, and that was quite a while ago). This experience means that I generally do well at troubleshooting in data-related areas. In other areas, like networking, I'm pretty much done at "do I have an IP address?" and "is it plugged in?"
What Can Go Wrong?
One of the things I've noticed is that while people can be experts in deploying solutions, this doesn't mean they are great at diagnosing issues. You've worked with that guy. He's great at getting things installed and working. But when things go wrong, he just starts pulling out cables and grumbling about other people's incompetence. He keeps making changes and does several at the same time. He's a nightmare. And when you try to step in to help him get back on a path, he starts laying blame before he starts diagnosing the issue. You don't have to be that guy, though, to have challenges in troubleshooting.
Some of the effects that can contribute to troubleshooting challenges:
The availability heuristic comes first. If you have recently solved a series of NIC issues, the next time someone reports slow response times, you're naturally going to first consider a NIC issue. And many times, this will work out just fine. But if it constrains your thinking, you may be slow to get to the actual cause. The best way to fight this cognitive issue is to gather data first, then assess the situation based on your entire troubleshooting experience.
Confirmation bias goes hand in hand with the availability heuristic. Once you have narrowed down what you think is causing those slow response times, your brain will want you to go look for evidence that the problem is indeed the network cards. The best way to fight this is to recognize when you are looking for proof instead of looking for data. Another way to overcome confirmation bias is to collaborate with others on what they are seeing. While groupthink can be an issue, it's less likely for a group to share the same confirmation bias equally.
So to get here, you have limited your guesses to recent issues, you have searched out data to prove the correctness of your diagnosis and now you are anchored there. You want to believe. You may start rejecting and ignoring data that contradicts your assumptions. In a team environment, this can be one of the most frustrating group troubleshooting challenges. You definitely don't want to be that gal. The one who won't look at all the data. Trust me on this.
I use intuition a lot when I diagnose issues. It's a good thing, in general. Intuition helps professionals take a huge amount of data and narrow it down to a manageable set of causes. It's usually based on having dealt with similar issues hundreds or thousands of times over the course of your career. But intuition without follow-up data analysis can be a huge issue. This often happens due to ego or lack of experience. The Dunning-Kruger effect (not knowing what you don't know) can also be a factor here.
Improving Troubleshooting Skills
- Be Aware.
The first thing you can do to improve the speed and accuracy of your troubleshooting is to recognize these behaviours when you are doing them. Being self-aware, especially when you are under pressure to bring systems back online or have a boss pacing behind your desk asking "when will this be fixed?", will help you focus on the right things. In a truly collaborative, high-trust environment, team members can help each other check whether any of the biases above are getting in the way of a diagnosis.
- Get feedback.
We are generally lucky in IT that, unlike other professions, we can almost always see the impact of our fixes immediately and tell whether they actually fixed the problem. We have tools that report metrics and users who will let us know if we were wrong. But even so, post-event analyses that document what we got right and what we got wrong can help us improve our methods.
Yes, every day we troubleshoot issues. That counts as practice. But we don't always test ourselves like other professions do. Disaster Recovery exercises are a great way to do this, but I've always thought we needed troubleshooting code camps/hackathons to help us hone our skills.
- Bring Data.
Data is imperative to punching through the cognitive challenges listed above. Imagine diagnosing a data-center-wide outage and having to start by polling each resource to see how it's doing. We must have data for both intuitive and analytical responses.
I love my data. But it's only an input into a diagnostic process. Considering metrics in a holistic, cross-platform, cross-team view is the next step. A shared analysis platform that makes it easy to combine and overlay data to get to the real answers makes all this smoother and faster.
- Log What Happened.
This sounds like a lot of overhead when you are under pressure (is your boss still there?), but keeping a quick list of what was done, what your thought process was, and what others did can be an important part of professional practice. Teams can even share the load of writing stuff down. This sort of knowledge base is also important for when you run into the rare things that have a simple solution but you can't remember exactly what to do (or even what not to do).
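If you want something a little more structured than a text file, here is a minimal sketch in Python of what that quick list could look like. To be clear, this is just an illustration: the names `LogEntry`, `IncidentLog`, `record` and `to_text` are made up for this example and don't come from any particular tool. The idea is simply to capture who did what, why they thought it would help, and what actually happened.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    """One step taken during troubleshooting (hypothetical structure)."""
    who: str
    action: str                # what was done
    reasoning: str             # the thought process behind it
    outcome: str = "unknown"   # filled in once the effect is visible
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class IncidentLog:
    """A running list of steps for a single incident."""
    title: str
    entries: list[LogEntry] = field(default_factory=list)

    def record(self, who, action, reasoning, outcome="unknown"):
        """Append one step and return it so the outcome can be updated later."""
        entry = LogEntry(who, action, reasoning, outcome)
        self.entries.append(entry)
        return entry

    def to_text(self):
        """Render the log as plain text for a wiki page or ticket."""
        lines = [f"Incident: {self.title}"]
        for e in self.entries:
            lines.append(
                f"{e.at:%Y-%m-%d %H:%M:%S}Z | {e.who} | {e.action} "
                f"| why: {e.reasoning} | outcome: {e.outcome}"
            )
        return "\n".join(lines)


if __name__ == "__main__":
    log = IncidentLog("Slow response times")
    step = log.record("me", "Checked NIC error counters",
                      "We had a run of NIC issues recently")
    step.outcome = "No errors found, so looking elsewhere"
    print(log.to_text())
```

Recording the reasoning next to the action is the part I'd keep even if you throw the rest away. It makes it much easier to spot afterwards where one of the biases above crept in.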
A person with experience can be an experienced non-expert. But with data, analysis and awareness of our biases and challenges in troubleshooting, we can get problems solved faster and with better accuracy. The future of IT troubleshooting will be based more and more on analytical approaches.
Do you have other tips for improving your troubleshooting and diagnostic skills? Do you think we should get formal training in troubleshooting?