After an incident is resolved and the postmortem document written, it is time to conduct a review meeting. This is the last step in an Incident Management process and the part that is most easily neglected despite being the most important in the effort to squash future incidents.
Why do a review meeting?
Teams feel the pain of an incident long after the issue itself is mitigated. The incident gets resolved via efficient incident command and then the team puts in a bunch of hard work to write a detailed postmortem document … so why add even more overhead by requiring an in-person (or video call) review meeting? Review meetings can be time-expensive endeavors with sometimes as many as several dozen attendees for Critical SEV-1 incidents that impact many stakeholders and leadership. A reasonable question would be to ask if the document can just be published and reviewed offline, thus saving many hours of people’s time.
Quite simply, if a review is not held in person, you waste much of the potential for improvement: the bystander effect sets in and everyone assumes that someone else will tend to the issues at hand. In the spirit of “never waste an incident”, each incident is an opportunity to make the platform more antifragile, getting stronger as a result of failures. This means that incidents must inject work, in the form of postmortem action items, into the work stream of potentially several teams. This work is disruptive, and the appropriate investment depends on the nature of the incident and the owning team. A feature team may be disrupted for a sprint or two as it works through the action items resulting from an incident, whereas a lower-level infrastructure team may uncover key vulnerabilities that take several quarters to build appropriate safeguards around.
Incidents are expensive events that directly impact customers and engineers. Spending the time to meet with stakeholders and review the impact is necessary each and every time.
Common pitfall: Scheduling too short (<1 hr)
First, granular technical details take time to digest and discuss, so always book a full one-hour slot. Reviews of particularly well-crafted postmortem documents can conclude early, though these are the exception: they require a skilled writer and a particularly clean incident with a clear root cause and clear action items. Even if you think this is the case, still schedule the full hour. It is much easier to end early than to overrun a half-hour allocation.
Even “simple” incidents should generate an hour’s worth of material to discuss and review. Running short on occasion is fine, but if reviews are regularly short, that is an indication that they are not going sufficiently deep.
Why managers should run reviews
Now that we have a one-hour meeting scheduled, the first item to address is “who runs the meeting?” People will often suggest that a tech lead on the team should run these reviews, or perhaps the writer of the document (who, unfortunately, is often the person most directly involved with triggering the incident). While both of these options are reasonable, it is better for the manager of the team to run the review.
There are three key reasons that the manager should conduct the review. First, it helps the engineers feel supported. If every part of responding to an incident falls on the shoulders of a single engineer, that engineer can feel punished, so sharing the burden across others better reflects the team’s collective responsibility for high quality. Second, the manager can better incorporate the resulting action items into the team’s roadmap. It is the manager’s duty to balance post-incident actions against external pressures from partner teams, product managers, etc. Lastly, these reviews are a reflection of the team in aggregate, and the team is the manager’s responsibility.
Incidents impact customers and then also impact the team’s roadmap. Ultimately it is the manager of the team who is responsible for balancing work that reduces incidents and pays down tech debt against building net new features. Therefore the team’s incidents (in aggregate) are ultimately the manager’s responsibility. While no single incident is directly attributable to the manager’s leadership of the team, the trend over time definitely is.
Agenda for a review meeting
Here is a rough agenda outline for these reviews:
- ~15 min reading and commenting on the postmortem document, do not assume people have read the doc before the meeting
- ~15–30 min walk through of the overall doc by the meeting runner with interjected questions / discussion making notes of the comments
- ~15–30 min discussion specifically on the Whys and Action Items
While the specific balance of time will vary from review to review, this overall structure is fairly constant. Give people some time to review the document’s contents, walk through the document top-to-bottom, and then collect ideas and feedback on what actions can be taken to either prevent future incidents altogether or respond better when they do occur.
Control the conversation
The meeting runner needs to balance a few conflicting priorities when reviewing incidents:
- Control the conversation (just like during Incident Command): You need to spur conversation to collect ideas and feedback from a wide audience, but be careful not to let any single topic consume too much of the discussion. When things are too quiet you need to solicit interaction, and when things are too verbose you need to rein in the conversation and earmark offline followups.
- Blame process, not people (just like in the Postmortem doc itself): You need to follow blameless principles and avoid singling out any specific person by name as having “caused” the problem. Whenever possible, refer to people as a “team member” without further personal identification. If including a direct reference, only use initials both verbally and in the document (typically found in the timeline section).
- Take responsibility and learn: The principle of blamelessness applies to individual people, but there is still an imperative to treat incidents seriously and take responsibility for their impact. Blameless is not a pass to neglect the seriousness of the negative effect on the business. So, without assigning blame, you need to get people to own up to the situation and take collective responsibility.
Collect written feedback
An online collaborative document is very useful for collecting distributed feedback and comments. So use Google Docs, Office 365, or any other equivalent system that allows for interactive collection of in-doc comments from the group. Other tools like email, JIRA, Slack, etc. are not suitable for this particular task. To be useful, feedback needs to be pinpointed to specific parts of the document and it must be written down. Verbal commentary lacks longevity and specificity.
Focus on actions
Throughout all steps of the post-incident process it’s useful to reinforce that there are only two goals:
- Reducing the chance of future incidents.
- Reducing the magnitude of future incidents.
These goals should be pursued with effort proportional to the risk of future impact (that is, the chance weighted by the magnitude). The bigger or more frequent the incident, the more effort should be expended on reducing the future risk. If, however, this was a particularly rare event with minimal impact, it may be prudent to spend less effort on postmortem actions and preserve bandwidth for other business goals.
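As a rough illustration of "the chance weighted by the magnitude", the sketch below ranks a few made-up incident types by expected cost. Everything in it is an assumption for illustration only: the `IncidentRisk` type, the field names, and the probabilities and dollar figures are hypothetical, and real estimates will be far less precise than these numbers suggest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentRisk:
    """Hypothetical record of how likely and how costly a repeat incident is."""
    name: str
    yearly_chance: float  # assumed probability of recurrence within a year (0.0 to 1.0)
    impact_usd: float     # assumed cost to the business if it recurs

    @property
    def expected_cost(self) -> float:
        # Risk = chance weighted by magnitude.
        return self.yearly_chance * self.impact_usd


def rank_by_risk(risks):
    """Return risks sorted from highest to lowest expected cost."""
    return sorted(risks, key=lambda r: r.expected_cost, reverse=True)


# Illustrative numbers only, not taken from any real incident.
risks = [
    IncidentRisk("rare-minor-glitch", yearly_chance=0.05, impact_usd=10_000),
    IncidentRisk("recurring-major-outage", yearly_chance=0.5, impact_usd=2_000_000),
    IncidentRisk("frequent-small-degradation", yearly_chance=0.9, impact_usd=50_000),
]

ranked = rank_by_risk(risks)
for risk in ranked:
    print(f"{risk.name}: expected yearly cost ~${risk.expected_cost:,.0f}")
```

A ranking like this is only a starting point for the discussion: it suggests where postmortem action items deserve the most effort, and which rare, low-impact events may not justify much follow-up work.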
Leverage SRE teams (if you have them)
While thinking about proportional actions, consider not only the depth but also the breadth of the response. When an incident showcases a persistent risk inherent to your technology stack, it is worth asking whether a larger program should be put on the roadmap. For example, the mock SEV-1 incident in How to write an Incident Postmortem showcases an incident where it is identified that additional JVM parameters should be set on the service that failed. A key insight is that all Java services in that environment are susceptible to the same issue. Given that this incident had a multi-million dollar impact, it is prudent to spread the lesson learned to many more applications than just the single app that caused the incident.
Organizations that have SRE teams should have a mechanism to evaluate how best to spread lessons learned widely across a platform and adopt ownership of that program. Even if you don’t have an SRE team to pick this up, teams should still ask “how widely can we spread these lessons?” The team should consider fixing all of the applications within its own purview and publicly posting a message about the lessons in Slack or email.
Remember that it is possible to repeat a review
Not every post-incident review meeting will go smoothly. Sometimes there are too many open questions, or maybe the root cause hasn’t been identified, or folks aren’t sure which action items would help the situation. All too often teams treat these meetings as things to “just get done with” and never want to look back on them. These review meetings are a tool for improving quality and they can only be as good as the content under review. It is perfectly okay to go back to do some more research and come back to repeat a review. This will lead to a much stronger review and should be seen as a successful result.
A complete guide to Incident management
This article is part of a series on Incident management, which includes:
- Incident severity levels for online platforms
- A guide to running Incident Command
- How to write an Incident Postmortem (with example and template)
- How to conduct a postmortem review meeting (this article)
- The importance of SEV-1 call leaders