
After It Broke: Executing Good Postmortems

Level 11

No matter how much automation, redundancy, and protection you build into your systems, things are always going to break. It might be a change breaking an API to another system. It might be a change in a metric. Perhaps you just experienced a massive hardware failure. Many IT organizations have traditionally had a postmortem, or root cause analysis, process to try to improve the overall quality of their processes. The major problem with most postmortem processes is that they devolve into circular finger-pointing matches. The database team blames the storage team, who in turn blames the network team, and everyone walks out of the meeting angry.

As I’m writing this article, I’m working on a system where someone restarted a database server in the middle of a large operation, causing database corruption. This is a classic example of an event that might trigger a postmortem. In this scenario, we moved to new hardware and no one tested the restore times of the largest databases. That is currently problematic, as the database restore is still running a few hours after I started this article. Other postmortem-worthy scenarios include unexpected data loss, on-call pages, or a monitoring failure that didn’t capture a major system fault.
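One lesson from that incident is to measure restore times before you need them. Here is a minimal sketch of the idea in Python: time a test restore of your largest database on the new hardware so the real recovery window is known in advance. The server name, database, backup path, and use of the sqlcmd client are all assumptions for illustration; adapt it to whatever tooling you actually run.

```python
# Rough sketch: time a test restore so the recovery window is known in advance.
# All names and paths below are placeholders, not the system described above.
import subprocess
import time

RESTORE_SQL = (
    "RESTORE DATABASE [SalesDW] "
    "FROM DISK = N'D:\\backups\\SalesDW.bak' "
    "WITH REPLACE, RECOVERY"
)

start = time.monotonic()
subprocess.run(
    ["sqlcmd", "-S", "test-sql01", "-E", "-Q", RESTORE_SQL],  # -E = Windows auth
    check=True,
)
elapsed_minutes = (time.monotonic() - start) / 60
print(f"Restore of SalesDW completed in {elapsed_minutes:.1f} minutes")
```

Run something like this on a schedule and keep the numbers; the trend matters as much as any single measurement.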

How can we do a better postmortem? The first thing to do is execute blameless postmortems. This process assumes that everyone involved in an incident had good intentions and made reasonable decisions based on the information available at the time. The technique originates in medicine and aviation, where human lives are at stake. Instead of assigning blame to any one person or team, the situation is analyzed with an eye toward figuring out what actually happened. Writing a blameless postmortem can be hard, but the outcome is more openness in your organization. You don’t want engineers trying to hide outages to avoid an ugly, blame-filled process.

Some common talking points for your postmortems include:

  • Was enough data collected to determine the root cause of the incident?
  • Would more monitoring data help with the process analysis?
  • Is the impact of the incident clearly defined?
  • Was the outcome shared with stakeholders?
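To make those questions easier to answer consistently, it can help to capture every postmortem in the same structure. Below is a minimal sketch, in Python, of one possible record format; the field names are illustrative assumptions, not a standard schema.

```python
# Illustrative sketch of a structured postmortem record; field names are
# assumptions, not a standard schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Postmortem:
    title: str
    impact: str                                                # who/what was affected, and for how long
    root_cause: str
    data_collected: List[str] = field(default_factory=list)    # logs, metrics, timelines gathered
    monitoring_gaps: List[str] = field(default_factory=list)   # data we wish we had
    action_items: List[str] = field(default_factory=list)      # follow-ups with owners
    shared_with: List[str] = field(default_factory=list)       # stakeholders who received it

pm = Postmortem(
    title="Database corruption after mid-operation restart",
    impact="Largest database unavailable for several hours during restore",
    root_cause="Server restarted during a large operation; restore times untested on new hardware",
)
print(pm)
```

Whether you store these in a wiki, a ticket system, or actual code, the point is that every incident gets the same set of questions answered.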

In the past, many organizations did not share postmortems outside of the core engineering team. That has changed in recent years. Many organizations, like Microsoft and Amazon, have made postmortems public because of the nature of their hosting businesses. By sharing with the widest possible audience, especially within your IT organization, you can garner more comments and deeper insights into a given problem.

One scenario referenced in Site Reliability Engineering by Google is the notion of integrating postmortems into disaster recovery activities. By incorporating these real-world failures, you make your disaster recovery testing as real as possible.

If your organization isn’t currently conducting postmortems, or only conducts them for major outages, consider introducing them more frequently for smaller problems. As mentioned above, paged incidents are a good starting point. They get you thinking about how to automate responses to common problems, and they give the team practice with the process, so that when a major issue occurs you’re focused on finding the real root cause rather than on how to conduct the postmortem.

12 Comments
Level 13

Good article - thought provoking. I used to work in an organization that claimed to have a blameless culture, but that appears to have gone by the wayside recently.

MVP

I think there is more to the postmortem than is listed:

- clearly define the problem that occurred, what/who was affected, and to what extent

- provide the root cause

- how did we determine there was a problem?

- how was the problem resolved?

- can this problem occur again?

- what can we do to prevent this from occurring again?

- what signs and/or symptoms were there that indicated there was a problem?

- when you start to see the signs, is there a way to resolve the issue before there is an outage?

- did we encounter any problems in trying to determine the problem?

- did we encounter any problems in trying to resolve/fix the problem?

There is more but this is a good start...

Being professional and non-accusatory, staying positive, and keeping the attitude of "what did we learn about this outage that we can use to reduce downtime or eliminate future outages?" are among the most important things that can be done.

Obviously, sharing the learned information AND documenting it for future similar issues will bring the most value to the process.

Some outages are out of our control, and we seek to remove their causes.

Some causes of outages are human errors and mistakes in judgment, and we strive to reduce them and improve our actions so we always (believe we) know what will happen when we hit that ENTER key.

I work in a 7x24 critical care hospital organization, and we've seen outages decrease more and more over the fifteen years I've been here.  A great portion of that improvement comes from having friendly, efficient PIR's (Post Incident Reviews) and RCA's (Root Cause Analyses) that identify the details, identify what went wrong (without pointing fingers and without threatening someone with retribution, censure, loss of job, etc.).

I used to think of admitting mistakes as "falling on my own sword", and guiltily feeling I was "taking one for the team."  I don't feel that way any longer.  Instead of holding on to that adolescent / immature attitude, my team has matured into adults who readily admit their mistakes and culpability. Our Manager does not act emotionally or irrationally, and moves forward with the understanding that everyone learns from the process.

Once everyone understands they're not going to lose their job for isolating a data center or a hospital, we can move forward with a work environment that's professional AND friendly.  We all do our jobs better when we don't have to deal with that additional stressor.

I don't think of my environment as "blameless"; I think that term isn't realistic.  If I made the mistake, I own the problem, and the blame is on my shoulders.  But there's no retribution for my incorrect action if I caused an outage.  Instead, the change comes in the training, in the notifications to the customers and support teams, and in our understanding of the additional steps we may need to take before making changes.  These can include, but are not limited to:

  • Thoroughly understanding what's going to happen as the result of any command entered--BEFORE the ENTER key is pushed
  • Purchasing (AND USING) a test lab / sandbox environment to test changes before they go into production
  • Identifying every customer and support team that a change may impact, and notifying them and scheduling the change to meet their needs
  • Using Change Management EVERY TIME.  This results in the rest of the organization trusting my Network Analysts aren't trying to slide one under the rug, aren't doing "Cowboy Networking" (like Captain James T. Kirk used to use "Cowboy Diplomacy" and violate the Prime Directive), and can be trusted to know what we're doing.
  • Following up with the customers immediately after the change, to ensure they're working as expected
  • Keeping the Help Desk notified at major steps:  immediately prior to the change being made, and after the change is complete
  • Someone from the right team (mine, or other I.T. Support teams) being available 7x24 during the changes

We appreciate that our network is the lifeline in a critical health care environment.  And because we follow the rules, dot the "i's" and cross the "t's", we are given the budget we request every year.  We're trusted to build that Info Highway with the right growth capability and alternate routes, so that we can make major changes and shut down major routes on that highway during business hours, and our customers know their traffic will still flow as needed.

Level 15

Another step is to add monitoring - if it was missing - so that you get early warning that trouble may be on the way next time. How long do those database backups take?

At my previous gig we simply added a notification step into the scheduler system that sent a syslog message at the beginning and end of each job. Then we made some SAM templates to watch the Orion syslog tables for those messages to show up.
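For anyone who wants to try the same idea, here is a minimal sketch in Python of a job wrapper that emits a syslog message at the start and end of each job. The syslog host and job command are placeholders, and the monitoring side (a SAM template, a log query, or anything else) is left to whatever tooling you run.

```python
# Rough sketch: wrap each scheduled job so it logs start/end to syslog.
# The syslog host and job command below are placeholders.
import logging
import logging.handlers
import subprocess

logger = logging.getLogger("job-runner")
logger.setLevel(logging.INFO)
logger.addHandler(
    logging.handlers.SysLogHandler(address=("syslog.example.com", 514))
)

def run_job(name, command):
    """Run a scheduled job and emit start/end syslog messages around it."""
    logger.info("JOB START: %s", name)
    try:
        subprocess.run(command, check=True)
        logger.info("JOB END: %s (success)", name)
    except subprocess.CalledProcessError:
        logger.error("JOB END: %s (failed)", name)
        raise

run_job("nightly-backup", ["/usr/local/bin/backup.sh"])
```

An alert on a missing "JOB END" message within the expected window gives you the early warning before anyone notices the backup never finished.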

Level 14

Good points by all here. Great topic, jdanton!

For me, the best sessions I have participated in are the ones where it is said up front that this is not a blame session and that all egos are left outside the room. Those sessions usually are the most honest and generally provide a foundation for lasting teamwork moving forward. As a pre-school/kindergarten teacher friend of mine is fond of saying, "we call them mistakes... not purposes". We learn from our mistakes; it is human nature. The true test is "what did we miss, and how do we not miss that or something worse next time?" Lastly, publish the results across the organization; it builds credibility, understanding, and good will.

MVP

Nice write up

Level 14

Unfortunately management here just want to point the finger and blame someone, as long as it isn't them.  They fail to see that they are actually the problem.  Not enough staff, too much work, and unreasonable deadlines all lead to fatigue and mistakes caused by too many interruptions to critical tasks.  Root Cause documents just get filed in the round filing cabinet and no lessons are learned.  Then they come up with the great idea of getting in some contractors to sort things out.  Guess who they expect to train the contractors.  Yep, no work gets done while we train the contractors, then they leave because it is an insane place to work, and we have to make up for the lost time as well as doing the normal workload.  The spiral downwards continues.

MVP

The postmortem is so valuable - if the information is then used. I've seen times where the team would look and ask "what can we do better?" and then build a plan to execute. But I've also seen times where the "what do we need?" question is reflected upon and then forgotten.

Level 13

Don't you just hate it when someone reboots a server without telling people?  Learn from mistakes.

Level 14

sadly... guilty on more than one occurrence... 🙂

Level 20

Sometimes it's hard to follow up since you're already moving on to the next problem.

Seasoned and mature IT professionals salivate over the post mortem. Never let a good crisis go to waste. 😉

It's your time to bend the ear of the executives to get what you've been clamoring for the past x months to prevent the catastrophe that finally just occurred.

About the Author
Joseph D'Antoni is a Senior Architect and SQL Server MVP with over a decade of experience working in both Fortune 500 and smaller firms. He is currently Solutions Architect for SQL Server and Big Data for Anexinet in Blue Bell, PA. He is a frequent speaker at major tech events and blogs about all things technology. He believes that no single platform is the answer to all technology problems. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University.