The importance of SEV-1 call leaders

Jonathan Word
5 min read · Feb 1, 2024

Incidents come in different shapes and sizes. The most severe incidents require special handling that is unlike their less-critical variants. These SEV-1 (aka CRITICAL) incidents can have material financial impact for a company and create a challenging environment for any incident commander, which creates a need for specially designated SEV-1 call leaders.

High stakes

In many ways a SEV-1 call leader is not that different from a regular incident commander; it's just that the situation leaves little to no room for mistakes during the incident resolution process. These sorts of incidents can have significant financial impact: the stakes don’t get any higher than the entire platform going down.

SEV-1 incidents put a bright spotlight on the situation. It is not unusual for senior leadership to be engaged and involved during the active call, and widespread customer communication is often warranted. Under this pressure “regular” incident commanders will often freeze up, stalling resolution efforts. A typical oncall engineer isn’t equipped to simultaneously update senior leadership, navigate customer comms, and get the entire system operational post-haste.

The solution is to designate a “special forces” rotation of select incident commanders who have the experience and specific training to operate in these intense situations. These are the SEV-1 call leaders: the people with the most skill and experience to get the system fixed under pressure.

Some fires are bigger than others — photo by Jay Heike on Unsplash

Senior leadership needs to be on the rotation

Analogous to why team managers should run incident postmortem reviews, senior leadership should be on the SEV-1 call rotation. Which layer “senior leadership” translates to will vary depending on the specific size and business of the company.

The key is that the rotation needs to include people that have:

  • Many years’ worth of experience in managing incidents, so they are reasonably likely to have seen a similar incident in the past.
  • Strong authority on the call to direct the resolution effort irrespective of the levels of anyone else on the call.

Getting senior leaders to accept the responsibility of holding a pager and responding to incidents sends a signal to engineers that those leaders have direct accountability for the reliability of the systems they oversee. Every engineer understands the burden of an oncall rotation and the impact on personal time. Putting a pager in the hands of Directors and VPs for critical incidents shows that they are also personally responsible for the health of the systems their teams build.

Splitting of tech calls and comms calls

SEV-1 incidents involve meaningfully more parallel conversation than regular incidents, mostly because of the typical need to arrange customer-facing comms. To make sure that discussion about how and what to write to customers does not distract from the efforts to solve the problem, one of the first steps is to split the call in two:

  • A tech call to focus on incident resolution
  • A comms call to focus on customer communication

Note that not every SEV-1 needs to get split: some are quiet affairs dependent on a third-party system, so there essentially isn’t a tech call to be had, and others are just naturally less conversation-heavy for one reason or another. Particularly challenging incidents may even get more than one tech call (for example, separate calls to investigate different theories for fixing the system), but the vast majority of SEV-1s will only need to get split into these two call types.

Hierarchy of duties

If the calls get split, the duties of the Incident Commander need to shift immediately, since a single person can only attend one call at a time.

The SEV-1 call leader will therefore need to switch between the two calls, collecting updates from each side and relaying status to the other, while also ensuring senior leadership is kept up to date. Generally the tech call is more important, since that is where resolution efforts are kept on track, but both need regular check-ins.

Before leaving either of the two primary calls, the SEV-1 call leader should appoint a sub-commander to manage that call in their absence. Upon re-joining the call, the leader should get a recap from the sub-commander to get back up to speed.
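The handoff rule can be made concrete with a small model. This is only an illustrative sketch: the `Call` and `CallLeader` classes, the `switch_to` method, and the names used in the example are all hypothetical and do not come from any incident-management tool. It shows the two obligations described above: a sub-commander must be named before the leader leaves, and the leader receives a recap when joining the other call.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Call:
    """One of the split calls, e.g. the tech call or the comms call (hypothetical model)."""
    name: str
    sub_commander: Optional[str] = None  # who runs the call while the leader is away
    notes: List[str] = field(default_factory=list)  # updates logged during the absence


class CallLeader:
    """Tracks which call the SEV-1 call leader is currently attending."""

    def __init__(self, name: str, current: Call) -> None:
        self.name = name
        self.current = current

    def switch_to(self, target: Call, sub_commander: str) -> List[str]:
        """Leave the current call and join `target`, returning the recap for `target`."""
        if not sub_commander:
            # The leader may not leave a call without someone running it.
            raise ValueError("appoint a sub-commander before leaving the call")
        self.current.sub_commander = sub_commander

        # On joining, collect the recap and take the call back over.
        recap = list(target.notes)
        target.notes.clear()
        target.sub_commander = None
        self.current = target
        return recap


tech = Call("tech")
comms = Call("comms")
leader = CallLeader("alex", current=tech)

# Someone on the comms call logged an update while the leader was on the tech call.
comms.notes.append("Status page draft ready for review")

recap = leader.switch_to(comms, sub_commander="sam")
print(recap)
```

In practice this bookkeeping lives in people's heads or in a chat channel; the point of the sketch is only that leaving a call without a named sub-commander should be treated as an error.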

Dry run the postmortem review meeting

The importance of the postmortem review meeting depends on the severity of the incident; SEV-1 incidents should draw a large, senior audience. To ensure that the team responsible for the postmortem is fully prepared for the review meeting, a dry-run review should be done with the SEV-1 call leader.

The dry run isn’t the same meeting as the actual review. Its purpose is to ensure that the preparations are complete, not to resolve all of the open questions or settle debates about the action items.

The objectives of the dry run are to:

  • Make sure that the postmortem document is complete and well written
  • Ensure that the team has attempted to answer any open questions
  • Make sure the whole team presenting the document is familiar with its content
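The three objectives above can be treated as a simple readiness checklist. The following is a minimal sketch under that assumption; the `DryRunStatus` class, its field names, and the `blockers` function are hypothetical and are not part of any postmortem template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DryRunStatus:
    """Outcome of a dry run, one flag per objective (hypothetical field names)."""
    document_complete: bool         # postmortem document is complete and well written
    open_questions_attempted: bool  # the team has attempted to answer open questions
    presenters_familiar: bool       # everyone presenting knows the document's content


def blockers(status: DryRunStatus) -> List[str]:
    """Return the objectives that still block the real review meeting."""
    checks = [
        (status.document_complete, "Postmortem document is incomplete"),
        (status.open_questions_attempted, "Open questions have not been attempted"),
        (status.presenters_familiar, "Presenters are not familiar with the document"),
    ]
    return [message for passed, message in checks if not passed]


status = DryRunStatus(
    document_complete=True,
    open_questions_attempted=False,
    presenters_familiar=True,
)
remaining = blockers(status)
print(remaining)
```

An empty list means the team is ready for the review. Note that the check is about whether questions were attempted, not whether they were resolved, which matches the purpose of the dry run.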

Widespread impact of action items

SEV-1 incidents usually have deep implications for action items, as they reveal underlying risks of a platform. The instructions in my recommended postmortem document template include a section to link to related SRE programs. This is a mechanism to track efforts that take lessons learned from one incident and apply them as widely as possible across an ecosystem.

Each lesson learned from a SEV-1 that doesn’t get widely applied is a missed opportunity. It will only be a matter of time until other applications built on the same pattern suffer from the same issues, potentially causing more incidents in the future.

A complete guide to Incident management

This article is part of a series on Incident management.
