Incident severity levels for online platforms

Jonathan Word
6 min read · Oct 18, 2023


Defining clear Incident Severity levels is a key component of an efficient Incident Management process that helps Engineering teams respond quickly to outages and mitigate customer impact.

Operating an online platform with multiple nines of availability is a complex task. Modern websites depend on thousands to hundreds of thousands of individual moving parts, all of which need to work correctly to provide a satisfactory user experience. When any one of those parts fails, an Incident Management process is needed to coordinate mitigation efforts and resolve the impact as quickly as possible, getting the system back into working order.

Incident Management starts by answering a seemingly simple question: How bad is the problem? Knowing how big the issue is that we are dealing with is crucial to understanding the appropriate response to that particular problem. When the problem is small we probably don’t need to notify our C-suite level leadership or send a notice to all customers … but if it is really big then we absolutely need a communication strategy that notifies relevant stakeholders.

Incidents happen, so be prepared to respond — photo by Obi - @pixel8propix on Unsplash

So: How bad is the problem?

The fastest way to answer this question is with a simple number in the range from 1 to 4. Each severity level maps to a distinct response and communication strategy that matches the magnitude of the incident's impact. Saying that this is a SEV-2 incident is a quick, reliable way to communicate internally about the importance of a given issue, and it provides an instant understanding of what should be done about it.

Incident severity levels are similar to medical triage tags, such as those defined by the START triage protocol. In medical triage, color codes sort patients into basic categories that are used to prioritize the response. Have a life-threatening injury? That is RED and requires immediate attention. Have a serious injury that is not immediately life-threatening? That is YELLOW and can get a delayed response. Have a minor injury and can still walk? That is GREEN and is classified as minor. Colors work well for physical tags on patients, where a quick glance from a few feet away can indicate the urgency needed for that patient. Numbers work best for software incidents that require engineering intervention, because they support quick categorization and sorting by monitoring systems.
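As a minimal sketch of why numeric levels are easy for tooling to sort, the Python snippet below models the four levels as an `IntEnum` and orders a queue of open incidents by severity. The `Severity` enum, the `triage_order` helper, and the incident titles are hypothetical and used only for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding of the four levels; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def triage_order(incidents):
    """Return (severity, title) pairs with the most severe incidents first."""
    return sorted(incidents, key=lambda item: item[0])


# Illustrative incidents only, not taken from a real system.
open_incidents = [
    (Severity.SEV3, "checkout page slow"),
    (Severity.SEV1, "platform down"),
    (Severity.SEV4, "disk usage trending up"),
]
ordered = triage_order(open_incidents)
```

Because the levels are plain integers, any monitoring or ticketing system can sort and filter on them without a custom comparison rule.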

4-level Severity matrix

Incident severity level matrix

In this 4-level system, each severity level has a unique purpose, is communicated to customers differently, and triggers unambiguous levels of response urgency.

SEV-1 (Critical impact)

SEV-1 Incidents represent the most critical level of impact. Typically this covers cases where the entire platform is down or the user experience is significantly degraded. Essentially, this level is used for any incident that causes a significant loss of revenue or puts a significant amount of revenue at risk.

SEV-1 incidents typically have their own specific SEV-1 process in which incident command is split into a resolution team and a communication team, with incident calls having several dozen to sometimes hundreds of people on them.

Response must be as immediate as possible, 24/7. SEV-1 incidents are reported to top leadership (e.g., the CTO or Head of Engineering) and often result in public communications to all customers.

Postmortems are required and leadership should be invited to optionally attend.

SEV-2 (High impact)

SEV-2 Incidents have a high level of impact but do not rise to the level where widespread involvement is necessary. This level may have some revenue impact and needs quick resolution, yet it is localized to the engineering team that owns the failing component.

Response needs to be quick, 24/7 (within a 15-minute SLA). SEV-2 incidents are reported to the leader of the relevant engineering team (e.g., a VP or Director) and will sometimes result in directed communications to the specific customers affected.

Postmortems are required with the relevant engineering team.

SEV-3 (Medium impact)

SEV-3 Incidents have medium impact, causing some friction for customers yet having little or no revenue impact. This level needs prioritized resolution ahead of any other day-to-day activities but does not need a continuous 24/7 response. These issues can be solved during regular working hours and should be worked on until resolved.

The SEV-3 level is essentially a “Daytime SEV-2”: during working hours it should receive the same level of attention and priority as a SEV-2. The only difference is that for a SEV-3 incident, effort should stop at the end of the work day to avoid burning out on-call engineers.

Accordingly, response times should be similar to SEV-2 during working hours (within 15 minutes). During off-hours, responses are optional at the discretion of the owning engineering team.
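The difference between a SEV-2 and a SEV-3 response can be sketched as a small paging rule. The Python below is an illustration under stated assumptions: the working hours (Monday to Friday, 09:00 to 17:00 local time) and the function names are hypothetical, and the off-hours SEV-3 case is treated as "no immediate response required" even though a team may still choose to respond.

```python
from datetime import datetime, time

# Assumed working hours; the article does not define them.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def is_working_hours(now: datetime) -> bool:
    """True on Monday to Friday between WORK_START and WORK_END (local time)."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def needs_immediate_response(severity: int, now: datetime) -> bool:
    """Decide whether an incident needs a response right away.

    SEV-1 and SEV-2 get a 24/7 response, SEV-3 only during working hours,
    and SEV-4 is never urgent.
    """
    if severity in (1, 2):
        return True
    if severity == 3:
        return is_working_hours(now)
    return False
```

For example, `needs_immediate_response(3, datetime(2023, 10, 18, 23, 0))` evaluates to `False`, while the same call with severity 2 evaluates to `True`.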

Postmortems are optional. Many SEV-3 incidents do not warrant the investment of a full postmortem; however, an engineering team may elect to write up a postmortem doc in order to capture action items. This usually arises from SEV-3 incidents that are particularly tricky to solve (sometimes taking multiple days to mitigate) or from repeated SEV-3 incidents that share a common cause that needs fixing.

SEV-4 (Low or no impact)

SEV-4 Incidents have low or no impact, with typical users unable to notice any degradation of the user experience. This level is not prioritized ahead of other work and may receive delayed attention depending on the team’s bandwidth.

While some component has been flagged by an alarm as being in a degraded state, the degradation may not be visible to users, or it may be present but too small to be particularly noticeable.

Responses are required, but only within a 1 working-day SLA, and mitigation activity may be delayed or scheduled at the team’s convenience. No postmortems are needed.
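The four levels described above can be summarized as a lookup table that tooling could consult. The sketch below is one possible encoding in Python; the `SeverityPolicy` class, its field names, and the `postmortem_required` helper are hypothetical, and the values simply restate the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response: str            # how quickly someone must respond
    around_the_clock: bool   # whether a response is expected 24/7
    postmortem: str          # "required", "optional", or "none"


# Values restate the level descriptions; the structure itself is an assumption.
SEVERITY_MATRIX = {
    1: SeverityPolicy("Critical impact", "immediate", True, "required"),
    2: SeverityPolicy("High impact", "within 15 minutes", True, "required"),
    3: SeverityPolicy("Medium impact", "within 15 minutes, working hours",
                      False, "optional"),
    4: SeverityPolicy("Low or no impact", "within 1 working day",
                      False, "none"),
}


def postmortem_required(severity: int) -> bool:
    """True when the level mandates a postmortem (SEV-1 and SEV-2)."""
    return SEVERITY_MATRIX[severity].postmortem == "required"
```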

Common pitfall: Too few levels

A common question is why not have 3 levels? SEV-4 at first glance appears to offer little value, so why include it in the leveling system?

The usefulness of SEV-4 lies in its ability to be an early-warning system. Alarms are typically configured with varying thresholds of importance.

For example, database storage is often consumed at a predictable rate and has a maximum configured size. This means that alarms can be constructed along these lines:

  • Cut a SEV-1 if a critical database completely runs out of storage
  • Cut a SEV-2 if a critical database is likely to run out of storage in the next 24 hours
  • Cut a SEV-3 if a critical database is likely to run out of storage in the next week
  • Cut a SEV-4 if a critical database is likely to run out of storage in the next 2 months

The usefulness is that with this alarm spread an engineering team gets an early warning well in advance of the system entering an emergency state. When the SEV-4 alarm triggers there is still enough time that the team can schedule their activity and respond to the issue without dropping everything else. It is like a smoke alarm beeping before there are flames everywhere — a very important early warning that enables a calm and orderly response to a situation that if left unattended would grow and magnify.
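The alarm spread above could be sketched as follows, assuming a simple linear forecast of storage growth. The function names and the growth model are hypothetical, and "2 months" is approximated as 60 days; a real monitoring system would supply its own forecast and thresholds.

```python
from typing import Optional

HOURS_PER_DAY = 24


def hours_until_full(used_gb: float, capacity_gb: float,
                     growth_gb_per_hour: float) -> float:
    """Linear forecast of the hours left before storage is exhausted."""
    if used_gb >= capacity_gb:
        return 0.0
    if growth_gb_per_hour <= 0:
        return float("inf")  # not growing, so never fills under this model
    return (capacity_gb - used_gb) / growth_gb_per_hour


def storage_severity(hours_left: float) -> Optional[int]:
    """Map time-to-full onto a severity level, or None when no alarm is needed."""
    if hours_left <= 0:
        return 1  # already out of storage
    if hours_left <= 24:
        return 2  # full within the next 24 hours
    if hours_left <= 7 * HOURS_PER_DAY:
        return 3  # full within the next week
    if hours_left <= 60 * HOURS_PER_DAY:
        return 4  # full within roughly 2 months (assumed to be 60 days)
    return None
```

Checking the thresholds from most to least severe means each forecast produces exactly one alarm level, and the level escalates on its own as the remaining time shrinks.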

Common pitfall: Too many levels

There is also a common misunderstanding of incident severity in which companies design a 5-level severity system and assign the SEV-5 level to bugs or support issues. This conflates Bugs with Incidents. Just because Bugs can be created and tracked in the same system that handles Incidents does not mean that they should be. Bugs should have a separate bug management process that tracks the Priority of each bug, rather than overloading Incident Severity to cover both use cases.

Bugs may on the surface appear similar to Incidents because both are injected, unplanned work, but the similarity ends there. Bugs are typically mistakes in the development process that were missed in QA testing. Many bugs are small annoyances with no revenue impact that sometimes go unfixed for weeks, months, or even years, while other bugs represent critical breakdowns in user workflows and need prioritized fixing.

Incidents, on the other hand, are emergency events that require attention due to the Severity of their impact on users. If these two processes are blended together, both suffer.
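One way to keep the two processes apart in tooling is to give Bugs and Incidents separate record types with separate scales. The Python below is only a sketch: the `Incident`, `Bug`, and `BugPriority` names and the three-step priority scale are hypothetical and not prescribed by this article.

```python
from dataclasses import dataclass
from enum import Enum


class BugPriority(Enum):
    """Hypothetical priority scale for the bug backlog."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class Incident:
    title: str
    severity: int  # 1 to 4 only; there is no SEV-5

    def __post_init__(self):
        # Reject anything outside the 4-level system at creation time.
        if self.severity not in (1, 2, 3, 4):
            raise ValueError("incident severity must be between 1 and 4")


@dataclass
class Bug:
    title: str
    priority: BugPriority  # tracked on its own scale, not as a severity
```

With this split, an attempt to file a bug as `Incident("typo on settings page", 5)` fails immediately instead of silently adding a fifth severity level.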

A complete guide to Incident management

This article is part of a series on Incident management, which includes:
