No matter how much automation, redundancy, and protection you build into your systems, things are always going to break. It might be a change that breaks an API used by another system. It might be a change in a metric. Perhaps you just experienced a massive hardware failure. Many IT organizations have traditionally had a postmortem, or root cause analysis, process to try to improve the overall quality of their processes. The major problem with most postmortem processes is that they devolve into circular finger-pointing matches. The database team blames the storage team, who in turn blames the network team, and everyone walks out of the meeting angry.
As I’m writing this article, I’m working on a system where someone restarted a database server in the middle of a large operation, causing database corruption. This is a classic example of an event that might trigger a postmortem. In this scenario, we had moved to new hardware and no one had tested the restore times of the largest databases. That is proving problematic, as the database restore is still running a few hours after I started this article. Other triggers include unexpected data loss, on-call pages, or a monitoring failure that missed a major system fault.
How can we do a better postmortem? The first thing to do is hold blameless postmortems. This process assumes that everyone involved in an incident had good intentions and acted reasonably based on the information available at the time. This technique originates in medicine and aviation, where human lives are at stake. Instead of assigning blame to any one person or team, the situation is analyzed with an eye toward figuring out what happened. Writing a blameless postmortem can be hard, but the outcome is more openness in your organization. You don’t want engineers trying to hide outages to avoid an ugly, blame-filled process.
Some common talking points for your postmortems include:
- Was enough data collected to determine the root cause of the incident?
- Would more monitoring data help with the process analysis?
- Is the impact of the incident clearly defined?
- Was the outcome shared with stakeholders?
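The talking points above could be tracked as a simple checklist per incident. The following is a minimal Python sketch, not a prescribed tool: the `PostmortemReview` class, its field names, and the wording of the follow-up items are all hypothetical and only illustrate turning the questions into trackable follow-ups.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PostmortemReview:
    """Hypothetical record of the talking points for one incident."""

    incident: str
    root_cause_data_sufficient: bool = False
    more_monitoring_needed: bool = False
    impact_clearly_defined: bool = False
    shared_with_stakeholders: bool = False

    def open_items(self) -> List[str]:
        """Return the talking points that still need follow-up."""
        items = []
        if not self.root_cause_data_sufficient:
            items.append("Collect enough data to determine the root cause")
        if self.more_monitoring_needed:
            items.append("Add monitoring to support the analysis")
        if not self.impact_clearly_defined:
            items.append("Define the impact of the incident")
        if not self.shared_with_stakeholders:
            items.append("Share the outcome with stakeholders")
        return items


# Example: the root cause and impact are understood, but monitoring was
# lacking and the outcome has not yet been shared.
review = PostmortemReview(
    incident="Database corruption after a restart during a large operation",
    root_cause_data_sufficient=True,
    more_monitoring_needed=True,
    impact_clearly_defined=True,
)
for item in review.open_items():
    print("-", item)
```

Note that the items describe work to be done, not who is at fault, which keeps the record consistent with a blameless process.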
In the past, many organizations did not share a postmortem outside of the core engineering team. That practice has changed in recent years. Many organizations, such as Microsoft and Amazon, have made postmortems public because of the nature of their hosting businesses. By sharing with the widest possible audience, especially in your IT organization, you can garner more comments and deeper insights into a given problem.
One idea described in Site Reliability Engineering by Google is integrating postmortems into disaster recovery activities. By incorporating these real-world failures, you make your disaster recovery testing as realistic as possible.
If your organization isn’t currently conducting postmortems, or only conducts them for major outages, consider introducing them more frequently for smaller problems. As mentioned above, paged incidents are a good place to start. Doing so gets you thinking about how to automate responses to common problems, and it makes the process familiar, so that when a major issue occurs you’re focused on finding the real root cause of the problem rather than on how to conduct the postmortem.