Improving your Diagnostic and Troubleshooting Skills

Level 12

TroubleShootingShooing.jpg

I've worked in IT for a long time (I stopped counting at twenty years.  Quite a while ago.)  This experience means that I generally do well at troubleshooting in data-related areas.  In other areas, like networking, I'm pretty much done at "do I have an IP address?" and "is it plugged in?"

This is why team collaboration on IT issues, as I posted before, is so important.

What Can Go Wrong?

One of the things I've noticed is that while people can be experts in deploying solutions, this doesn't mean they are great at diagnosing issues. You've worked with that guy.  He's great at getting things installed and working.  But when things go wrong, he just starts pulling out cables and grumbling about other people's incompetence.  He keeps making changes and does several at the same time.  He's a nightmare.  And when you try to step in to help him get back on a path, he starts laying blame before he starts diagnosing the issue. You don't have to be that guy, though, to have challenges in troubleshooting.

Some of the effects that can contribute to troubleshooting challenges:

Availability Heuristic

If you have recently solved a series of NIC issues, the next time someone reports slow response times, you're naturally going to first consider a NIC issue.  And many times, this will work out just fine.  But if it constrains your thinking, you may be slow to get to the actual cause.  The best way to fight this cognitive issue is to gather data first, then assess the situation based on your entire troubleshooting experience.

Confirmation Bias

Confirmation bias goes hand in hand with the availability heuristic. Once you have narrowed down what you think is causing the slow response times, your brain will want you to go look for evidence that the problem is indeed the network cards.  The best way to fight this is to recognize when you are looking for proof instead of looking for data.  Another way to overcome confirmation bias is to collaborate with others on what they are seeing.  While groupthink can be an issue, it's less likely for a group to share the same confirmation bias equally.

Anchoring Heuristic

So to get here, you have limited your guesses to recent issues, you have searched out data to prove the correctness of your diagnosis and now you are anchored there.  You want to believe.  You may start rejecting and ignoring data that contradicts your assumptions. In a team environment, this can be one of the most frustrating group troubleshooting challenges. You definitely don't want to be that gal.  The one who won't look at all the data. Trust me on this.

Intuition

I use intuition a lot when I diagnose issues.  It's a good thing, in general.  Intuition helps professionals take a huge amount of data and narrow it down to a manageable set of causes. It's usually based on having dealt with similar issues hundreds or thousands of times over the course of your career.  But intuition without follow-up data analysis can be a huge issue.  This often happens due to ego or lack of experience.  The Dunning-Kruger effect (not knowing what you don't know) can also be a factor here.

There are other challenges in diagnosing causes and effects of IT issues. I highly recommend reading up on them so you can spot these behaviours in others and yourself.

Improving Troubleshooting Skills

  1. Be Aware.
    The first thing you can do to improve the speed and accuracy of your troubleshooting is to recognize these behaviours when you are doing them.  Being self-aware, especially when you are under pressure to bring systems back online or have a boss pacing behind your desk asking "when will this be fixed?" will help you focus on the right things.  In a truly collaborative, high trust environment, team members can help others check whether they are having challenges in diagnosing based on the biases above.
  2. Get feedback.
    We are generally lucky in IT that, unlike other professions, we can almost always see the impact of our fixes immediately and check whether they actually fixed the problem.  We have tools that report metrics and users who will let us know if we were wrong.  But even so, post-event analyses that document what we got right and what we got wrong can help us improve our methods.
  3. Practice.
    Yes, every day we troubleshoot issues.  That counts as practice.  But we don't always test ourselves like other professions do.  Disaster Recovery exercises are a great way to do this, but I've always thought we needed troubleshooting code camps/hackathons to help us hone our skills. 
  4. Bring Data.
    Data is imperative to punching through the cognitive challenges listed above.  Imagine diagnosing a data-center wide outage and having to start by polling each resource to see how it's doing.  We must have data for both intuitive and analytical responses.
  5. Analyze.
    I love my data.  But it's only an input into a diagnostic process.  Considering metrics in a holistic, cross-platform, cross-team view is the next step.  A shared analysis platform for combining and overlaying data makes getting to the real answers smoother and faster.
  6. Log What Happened. 
    This sounds like a lot of overhead when you are under pressure (is your boss still there?), but keeping a quick list of what was done, what your thought process was, and what others did can be an important part of professional practice.  Teams can even share the load of writing stuff down.  This sort of knowledge base is also important for when you run into the rare things that have a simple solution but you can't remember exactly what to do (or even what not to do).
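
As an illustration of step 6, here is a minimal sketch of what a quick incident log could look like in Python.  The names (LogEntry, IncidentLog, record, summary) are made up for this example and don't come from any particular tool.  The point is only that each entry captures who did what, why they tried it, and what came of it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    """One thing that was tried during an incident (hypothetical structure)."""
    who: str
    action: str        # what was done
    reasoning: str     # the thought process behind it
    outcome: str = "pending"
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class IncidentLog:
    """A running list of entries that the whole team can append to."""
    title: str
    entries: list = field(default_factory=list)

    def record(self, who, action, reasoning):
        # Append a new entry and hand it back so the outcome can be filled in later.
        entry = LogEntry(who=who, action=action, reasoning=reasoning)
        self.entries.append(entry)
        return entry

    def summary(self):
        # Plain-text recap suitable for pasting into a post-event review.
        lines = [f"Incident: {self.title}"]
        for n, e in enumerate(self.entries, start=1):
            lines.append(
                f"{n}. [{e.at:%H:%M:%S}] {e.who}: {e.action} "
                f"(why: {e.reasoning}) -> {e.outcome}"
            )
        return "\n".join(lines)


log = IncidentLog("slow response times")
entry = log.record("alex", "checked NIC counters", "we had NIC failures last week")
entry.outcome = "no errors found"
print(log.summary())
```

Recording the reasoning next to the action is a deliberate choice here: it makes it easier to spot afterwards when a step was driven by the most recent incident rather than by the data.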

A person with experience can be an experienced non-expert. But with data, analysis and awareness of our biases and challenges in troubleshooting, we can get problems solved faster and with better accuracy. The future of IT troubleshooting will be based more and more on analytical approaches.

Do you have other tips for improving your troubleshooting and diagnostic skills?  Do you think we should get formal training in troubleshooting?

40 Comments
vinay.by
Level 16

Wonderful article, I just love thwack for this very reason (superminds in a single forum). Thanks datachick

sqlrockstar
Level 17

Wonderful post! Thanks for sharing!

jm_sysadmin
Level 13

Bring Data.

I love data, but I am always amazed at how many people get stuck on what happened last time, or what they think is happening, but never think to check if the data you have matches that guess.

datachick
Level 12

Thanks.  I'm glad it resonated with you.

datachick
Level 12

Happy you liked it.

datachick
Level 12

IMG_5486.JPG

designerfx
Level 16

I like this article a lot. It's a solid set of guidelines that everyone can benefit from. To point out the last question: "Do you think we should get formal training in troubleshooting?": This is a nice idea but this is a bit of something that tends to be hands on. I'm not sure there's really an alternative, half the time when people do a DR test it isn't even the real thing. Plus, whatever happens to be taught isn't going to be applicable all the time. It certainly couldn't hurt, though.

shuckyshark
Level 13

Love it!

Practice Practice Practice...when I was going to school many moons ago (think Netware 3.12), I had the best instructor. After a few weeks, I complained about how easy everything seemed to be and asked if there was anything he could do. I came into school the next day, and my lab setup wasn't working...I went to him and asked if he did anything. He responded "yes, you said everything was easy, so now, every morning, something won't work." For the next 72 weeks, my mornings were spent troubleshooting my lab before I could start my day...best education experience ever.

gfsutherland
Level 14

Great article!!! Practice makes perfect.... an old adage but applies here!!!

In any major event, someone on my team plays "scribe".... It helps in after action reports and as a teaching tool.

ScottRich
Level 12

Good stuff datachick! I have spent much of my 30+ year career as a one-person IT department so having a team is a new concept for me. I have to work hard to remember that they are here to help, not judge. And just because someone is 'systems' and not 'network' does not mean they cannot have valuable insight to my network problem. Thanks for the great article!

ScottRich
Level 12

I had a similar situation with a Cobol instructor in the early 80's. He had a very unique way of messing with our code and then making us find the problem.

rschroeder
Level 21

Thanks for saying what we MUST understand.  Reading it in differing blogs, reading it presented in different ways, helps remind us that it's easy to go down the wrong path when troubleshooting, and that wastes everyone's time.

I'm guilty of this myself.  If I'm troubleshooting OTV or BGP issues for a few hours and an experienced technician reports a specific device has begun experiencing an unusual communications problem, chances are I might still be wearing my Advanced Network Troubleshooter hat.  And in that realm, I've got some big hammers in my hand, and that oddball communication issue is looking an awful lot like a nail.

Instead, I remind myself that troubleshooting MUST start at Layer 1 and then move up the stack after each layer is eliminated.

Part of the problem can be the expectation that an experienced Tech already performed the basics:

  • What changes have been made to the device and its environment since the time it last worked properly?
  • Is there power to the device?
  • Is there link light on its NIC to the network?
  • Did you reboot the device?
  • Did you POPO the device?
  • Does it have an IP address?
  • Can it ping the gateway?
  • Can it ping outside the gateway?
  • Are other devices experiencing the same problems?
  • Is it wireless or wired?
  • What conditions are required for successful communication (including firewall rules, ACLs, VPN configuration, certificates, etc.)?

If these kinds of questions haven't been considered, investigated, and answered, my coming at the problem thinking of BGP or OTV is likely a waste of everyone's time.

Once I know L1, L2, and L3 are good, I can forward the issue to the right departments for additional analysis.  Or if it's security-related (VPN or Firewall), I can advance the troubleshooting with log audits and tcpdumps.

Back to basics, always!
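
The bottom-up order of the checklist in this comment can be sketched in a few lines of Python.  This is only an illustration: the function name first_failure and the observed dictionary are invented for the example, and the probes are stubs standing in for real checks (looking at the link light, running a ping, and so on).  The idea is to walk the checks in order and stop at the first one that fails, rather than jumping straight to BGP or OTV.

```python
from typing import Callable, List, Optional, Tuple

# A check is a question plus a probe that answers it with True (pass) or False (fail).
Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the first question that fails, or None."""
    for question, probe in checks:
        if not probe():
            return question
    return None


# Stubbed observations for illustration only; a real probe would inspect the device.
observed = {"power": True, "link": True, "ip": True, "gateway": False, "outside": False}

checks: List[Check] = [
    ("Is there power to the device?", lambda: observed["power"]),
    ("Is there link light on its NIC?", lambda: observed["link"]),
    ("Does it have an IP address?", lambda: observed["ip"]),
    ("Can it ping the gateway?", lambda: observed["gateway"]),
    ("Can it ping outside the gateway?", lambda: observed["outside"]),
]

print(first_failure(checks))  # prints: Can it ping the gateway?
```

Because the loop stops at the first failure, the later checks are never run, which mirrors the point that there is little value in looking higher up the stack until the lower layers have been eliminated.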

tinmann0715
Level 16

My Network/Systems team follows this methodology more and more. But many times I find that the royal "We" are in the dark because we are not properly monitoring a server, SAN, VMware host, network device, application, etc. fully.  There is still that one critical piece of data that we need to go looking for to help us determine root cause. But we are getting better.

tallyrich
Level 15

Nice article, thanks for posting. Your materials are always very thorough.

joepoutre
Level 12

I like the term troubleshooing. Sounds rather less violent in concept.

And then there is troubleshoeing, solving a problem by giving it a swift kick.

Experience is always the best background. The more you've done, the more you've faced, the more likely it is you've seen it before, or something similar.

Of course, it is when it looks like something you've seen before but turns out to be something completely different that causes hair loss and gnashing of teeth.

Jfrazier
Level 18

Well put, datachick!  The cross-platform/cross-team mentions are very important in many of today's complex environments.

datachick
Level 12

Wow.  Thanks.  I struggle with finding the right balance.  I hope you dig in with more reading.  Studying people and teams is one of my favourite things.

datachick
Level 12

I think that's why eventually we all start looking alike.

datachick
Level 12

And data will set us free.  Or something like that.

datachick
Level 12

Great list of questions there.

datachick
Level 12

It's difficult to settle on that right balance of data collection and data hoarding. That's why I like the ability to configure and tweak the metrics in SolarWinds tools.

datachick
Level 12

I look serious enough there, right?

sparda963
Level 12

I agree with most people that troubleshooting is something that has to be learned, but I am not sure a normal sit-down formal class is the way to go about it. I do not think that troubleshooting is a skill, I think it is more of a mindset. I have seen people who claim to be excellent troubleshooters, only to find out they just look at the surface of the problem and immediately jump to the most recent conclusion that fixed a similar problem last time for them. Sometimes you get lucky and this works, but more often than not this does not work out well.

I find it is an extremely fine line between an excellent troubleshooter and a bad troubleshooter. A bad one will just jump to a conclusion based on their most recent similar issue. An excellent troubleshooter will use past data and experiences to take educated shortcuts based on data. A good example of this is when someone calls to complain they can't get to a web page. A bad troubleshooter may say it's a web filter blocking it or the browser cache and settings need to be reset. A good troubleshooter will ask if the person has ever been able to get to the website in the past, and if they could, when was the last time it worked.

This is something that cannot really be taught, as you just have to have a mindset for it. You have to be able to see the evidence in front of you, understand what questions to ask to help get the information you need, and how to process that information and move forward on it.

datachick
Level 12

That's great! I had a database prof who would do things that made even the men cry.  He'd assign a 2-hour deep dive presentation on database internals, but when you walked in he'd say "sorry, today you are doing a 10-minute executive overview with a 15-minute Q&A and you can't use your notes or slides."  And then the opposite.  Or he'd remove the projector from the room, saying you forgot your slides on the plane.  GO!

It was hell.  But I learned a lot.

datachick
Level 12

That's a great way of doing it.  And so often overlooked.

datachick
Level 12

I have to work alone a lot, too.  But I prefer teams. I know others don't.  I often have to step back and see what's going on when there is team dysfunction.  Then I try to motivate others to see it and take steps to correct it.

datachick
Level 12

I do believe some people have a great knack for troubleshooting. And some are just terrible at it.  But I do believe that we can all improve our skills by doing the things I listed above.

ecklerwr1
Level 19

Some skills in troubleshooting only come from years of working on different kinds of problems with multiple technologies.  This is why everyone wants years of real experience but few pay for it.

sparda963
Level 12

ecklerwr1 "This is why everyone wants years of real experience but few pay for it."

This right here!!!! This basically says it all.

Random job posting:

Looking for a senior level programmer with lots of experience. Must have 10 years of Node.js programming experience.

Node.js came out in 2009. Math is hard.

Radioteacher
Level 14

My 23 years of experience has taught me how to quickly determine one of two paths. 

  1. Can we do this?  Is this something the team can tackle?
  2. Wow, this is bigger than the team, who do we need to call?  Where can we get help and how soon?

Never be afraid to say the following.

  • I don't know
  • I am willing to be wrong
  • Please help me find the problem

RT

Radioteacher
Level 14

You only get 1 year of experience every 365.25 days of being alert and engaged at your work.

Solarwinds alerts that HBA's, Network and System of a redundant system dropped off the network.  The failover went to the alternate site as planned.  Thirty seconds of outage for services to come back up.

Four team members go to the datacenter.

No keyboard scan (Caps Lock) and the LED on the mouse did not light.

  • Mention that the 5 Volt bus in the server died

We had the same model of hardware in a stack of systems recently pulled from the racks.  We pulled the dead server and laid it down next to the other server not in use.

  • Labeled and pulled hard drives from the dead system and moved them to other server.
  • Pulled card cage from the dead one and moved it to the other server.
  • The HBA missed on the first pass (my bad....but I am willing to be wrong) was pulled and installed in the other server. (not in card cage)
  • Power up server and re-license OS.

Time to repair 32 minutes.

System was back up and replicating the data again, ready for another failure.

RT

rschroeder
Level 21

I've observed I troubleshoot better when it's someone else's problem.  Ironic, that.  And yes, others seem to troubleshoot better when it's my problem.

The main thing is working with a team, and not letting the pressure get to you.  And not minding someone watching over your shoulder and offering advice--or working on the same problem simultaneously, just over the cube wall, helping out with a real-time set of reports of what's being done, what's been looked at, what to look at next.

And, of course, working with folks who want my help when an issue arises in their realm, not minding if I backseat drive over their shoulder, guiding them through leveraging Solarwinds' modules & features to help discover the source of a problem.

rschroeder
Level 21

I tend to grin at that and say to myself "They're looking for either a time-traveler or a wizard, or else they're testing to see which applicants are detail-oriented."

datachick
Level 12

Great points about being able to say "I don't know" and to realize when help is...helpful.

datachick
Level 12

Love real-life stories.

datachick
Level 12

Glad you liked it.

I think people can get better at troubleshooting.  First, knowing more about the components they support is always helpful, especially deep dives about the underlying designs and internals.  Second, being trained on where to find the data that helps them diagnose issues is always helpful. Finally, people can learn how to rule out causes and narrow down to the relevant information.  I've been part of these sorts of trainings and they all helped.

byrona
Level 21

I have found that troubleshooting takes a specific mindset that not everybody has.  The first and possibly most important step is to not approach the problem assuming you know the answer because this obscures your ability to see other possibilities.

superfly99
Level 17

My 28 years of experience help a lot with troubleshooting, even to the extent of pointing someone in the right direction without knowing their product or device. Common sense goes a long way. I try to diagnose things by myself but will involve others if I'm faced with a stumbling block. More minds think better than one.

thebbert
Level 8

I just wanted to thank you for the great article about what can break-down in troubleshooting and some areas we can all improve in

datachick
Level 12

I'm happy you found it helpful

About the Author
Data Evangelist, Sr. Project Manager and Architect at InfoAdvisors. I'm a consultant, frequent speaker, trainer, blogger. I love all things data. I'm a Microsoft MVP. I work with all kinds of databases in the relational and post-relational world. I'm a NASA 2016 Datanaut! I want you to love your data, too.