This is the second post in a series that started in Part the First.

Randomness is a part of life.  Its areas of effect are widespread, from human interactions to chaos theory.  To make Dungeons & Dragons more lifelike, the game needs to inherit this randomness from real life.  This is where the dice come in.  The one used most often is the 20-sided die (referred to as a d20).

On any roll of a d20, you have a 5% chance of getting any given number.  The best and worst rolls are a 20 and a 1, respectively, and each has that same 5% chance of coming up (thanks, math!).  In D&D these rolls get special meaning: they are called Critical Rolls.  So, a 20 is a Critical Hit and a 1 is a Critical Fail.
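If you want to check the math yourself, here is a minimal sketch in Python that assumes a fair d20.  The names (`roll_d20`, `critical_rates`) and the fixed seed are my own choices for illustration, not anything from the game rules.  It computes the exact 1-in-20 chance and then simulates a pile of rolls to see how often a 20 or a 1 actually turns up.

```python
import random
from collections import Counter
from fractions import Fraction

SIDES = 20  # a d20 has twenty equally likely faces (assuming a fair die)


def roll_d20(rng):
    """Return one roll of a fair d20: an integer from 1 to 20 inclusive."""
    return rng.randint(1, SIDES)


def critical_rates(num_rolls, seed=0):
    """Simulate num_rolls rolls and return (critical hit rate, critical fail rate)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    counts = Counter(roll_d20(rng) for _ in range(num_rolls))
    return counts[20] / num_rolls, counts[1] / num_rolls


# Exact chance of any single face, including a 20 or a 1.
exact = Fraction(1, SIDES)
print(f"Exact chance per face: {float(exact):.0%}")

# Simulated rates should land close to 5% each, but rarely exactly on it.
hit_rate, fail_rate = critical_rates(100_000)
print(f"Simulated critical hits:  {hit_rate:.2%}")
print(f"Simulated critical fails: {fail_rate:.2%}")
```

With only a handful of rolls the simulated rates can wander well away from 5%, which is part of why a single game session can feel luckier or unluckier than the math says it should.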

Critical Hits in IT

Sometimes you or your team just hit something out of the park.  Yes, I’m mixing metaphors here, but you get the point.  You didn’t plan for this task to go as well as it did.  You’ve moved 1,000 mailboxes during the night with no downtime.  Your team executed a zero-downtime upgrade of a SQL cluster.  Your coworker applied configurations to 200 WAN routers with no blips.  This is a Critical Hit!

On the off chance that you’ve encountered one of these rare events, be sure to celebrate.  Do a little dance, take everyone out for lunch, let your management team know about it, or do whatever you usually do to mark accomplishments.

Then again, there’s the other side of the coin (er, die).

Critical Fails in IT

I hate to say it, but in Information Technology, it always feels like there’s a higher probability of rolling a 1 on a d20.  I stated earlier that in the game you have a 5% chance of a critical failure, but real-life IT probability feels skewed towards failure instead of success.

In the game, when you roll a critical failure, whatever you are trying to do fails… epically.  You go to push a troll off a bridge and instead you lightly caress his shoulder.  You both feel awkward.  This is an epic failure, and its scope is left to the DM’s discretion.

In IT, epic failures take different forms.  They can be something as simple as turning off the wrong port on a switch or as serious as crashing a mail server.  An IT department doesn’t have a DM to choose what happens when things go wrong.  Instead we rely on our own knowledge and the experience of others to help guide us on how to proceed.  Quick thinking and decisive action are key parts of following up after a failure, but the best thing you can do is communicate.

Communicate with the department and the affected parties.  Clean, clear communication of the issue and your plans for recovery is the first, best thing you can do after an epic failure.  Here’s where every member of IT gets to be their own DM.  It’s up to you to decide the next move.  Make it a good one, with transparency.