A crucial piece of quickly resolving incidents is having a good Incident Commander who controls the flow of resolution activities. Here is a guide to being an effective Incident Commander and getting things fixed as quickly as possible as part of an efficient incident management process.
First spin up an Incident call
Incidents get created in various ways — sometimes via an alerting system (like Atlassian OpsGenie, PagerDuty, a custom-built solution, or any other incident management system), via internal chat where someone points out that something is wrong, or through customers reaching out to customer support. In each case the very first action is to spin up an incident call and get people on the line. There may be a temptation to look at a dashboard first, read a quick log line, or check a deployment timestamp — however, those are all things that can be done after you and others are on a video call.
Getting people on a call is the very first step of any incident response activity. Without the relevant people on the line, you'll get nowhere in resolving the issue. It is just like a house fire: it wouldn't matter how many electricians or plumbers were standing outside talking about the fire, because they wouldn't be able to take any concrete steps towards putting it out. What is needed are firefighters — a team of people specifically suited to fixing that specific problem. And similarly, if your lights won't turn on, then it is the electricians you need and not the firefighters.
This means that you need to be comfortable paging people in. Being on-call is part of software engineering; it comes with the job, and that sometimes means working at odd hours to resolve issues. The right people need to be on the call for it to be effective, so the very first priority is ensuring that the right quorum is online.
Identify the Incident Commander
Once people are on a call, the need for an Incident Commander quickly arises. This is a designated person who manages the incident call, pages needed people, works with a customer communications team, distributes status updates to stakeholders and leadership, etc. Most importantly, the Incident Commander MUST NOT personally perform actions attempting to fix the problem. The role of the commander is to efficiently coordinate the activities of others; if they also started writing code to patch a critical defect or walking through a rollback runbook, then all of the coordination of the incident would break down.
So as soon as there is quorum on the call (and sometimes while still getting people online), an Incident Commander needs to be chosen to organize the activity. Typically this will be the most seasoned person on the call with experience running incidents. For SEV-1 incidents this may even be a dedicated SEV-1 call leader reserved for particularly high-profile and impactful incidents.
The first job for the commander is to assess customer impact and ensure that the incident is categorized with an appropriate Severity level. Do people on the call have an understanding of the customer impact?
Sometimes it isn’t clear precisely how user experience is degraded — can people reproduce the error? does it happen every time or only sometimes randomly? do user retries eventually succeed? is this affecting all users or just a specific subset?
Depending on the severity level the incident was created with, it may need to be upgraded to a higher severity; or, if the impact turns out to be less than originally thought, it can be prudent to downgrade it to a lower severity.
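The impact questions above can be boiled down into a rough triage rule of thumb. A minimal sketch, assuming a hypothetical SEV-1/2/3 scale and made-up thresholds — every organization defines its own severity matrix:

```python
# Hypothetical severity triage helper. The level names and thresholds are
# illustrative assumptions, not from any standard.

def classify_severity(percent_users_affected: float, core_flow_broken: bool) -> str:
    """Suggest a severity level from rough impact estimates."""
    if core_flow_broken and percent_users_affected >= 50:
        return "SEV-1"  # widespread outage of a core user flow
    if core_flow_broken or percent_users_affected >= 10:
        return "SEV-2"  # significant degradation, needs continuous attention
    return "SEV-3"      # limited impact, can wait for business hours
```

The point is less the exact numbers than having an agreed-upon rule, so upgrades and downgrades are a quick comparison rather than a debate.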
Types of activity
Once a severity level has been agreed upon, it is time to start organizing actions. These fall into a few categories:
- Paging people: Are all the people on the call that are needed? Has the group identified that it would be useful for someone else to get on the call? Maybe additional people need to be paged.
- Controlling the flow of conversation: Are people talking over each other? Is someone saying things that are not directly related to mitigating the issue? The commander should not be shy about interrupting and keeping the group on target.
- Information gathering: Is there additional information that would be useful? Could logs or metrics help diagnose the problem? Maybe reading some lines of code or checking deployment timestamps?
- Writing internal notes: Do internal stakeholders need to be kept updated on the status of the incident?
- Performing a mitigating action: Doing the “actual work” that fixes things, for example rolling back a deployment, scaling some infrastructure, making a configuration change, a code change, etc.
- Drafting customer communications: Do external stakeholders need to be notified that there is a problem? Is there a public status page or other customer notification needed?
With so many different things that need to get done the commander should delegate as much as possible and perform almost no action personally. The only action that is generally accepted for the commander to perform is writing stakeholder updates (as time allows), otherwise every needed action should be delegated to someone else. The job of the commander lies in the coordination, not in the action.
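Keeping track of who owns what is most of the commander's coordination work, even if the "tracker" is just a chat thread or scratchpad. A minimal sketch of that bookkeeping — the class and field names are illustrative, not from any real incident tool:

```python
# Illustrative delegation tracker for an Incident Commander's scratchpad.
from dataclasses import dataclass, field

@dataclass
class DelegatedAction:
    description: str
    owner: str
    acknowledged: bool = False  # owner has confirmed the assignment by name
    done: bool = False

@dataclass
class IncidentLog:
    actions: list = field(default_factory=list)

    def delegate(self, description: str, owner: str) -> DelegatedAction:
        """Record an action and its single named owner."""
        action = DelegatedAction(description, owner)
        self.actions.append(action)
        return action

    def open_actions(self) -> list:
        """Actions still awaiting completion — what to read out each check-in."""
        return [a for a in self.actions if not a.done]
```

Every action gets exactly one named owner; the commander's job reduces to walking `open_actions()` on each check-in.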
One mitigating action at a time
Most actions can be pursued in parallel by delegating to different people, and this is usually good practice to keep things flowing as quickly as possible. However, one thing should be serialized whenever reasonable (some SEV-1 cases aside): only one action attempting to mitigate the issue should be conducted at a time.
There may be cases where the group has multiple ideas for getting things working again, and if those ideas are enacted at nearly the same time, it can become difficult if not impossible to determine which action actually made the difference and fixed the problem. In the short term this seems like a good thing: the incident got resolved faster by not waiting and just doing. However, it significantly impedes the postmortem process, whose purpose is to determine the root cause and identify action items that provide long-term stability for the platform. If the mitigating action that fixed the problem is unclear, it is harder to determine the root cause, and therefore harder to implement protective actions to prevent future incidents.
This means that actions taken to mitigate the incident's impact should be spaced far enough apart in time to distinguish the distinct results of each action. The spacing does not need to be long; just enough seconds or minutes that the outcome of each action can be separately evaluated.
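The serialization rule above can be sketched as a small guard that refuses to start a new mitigation until the previous one has had time to show its effect. The five-minute spacing is an illustrative default, not a standard:

```python
# Sketch of serializing mitigation attempts; spacing value is illustrative.
import time

class MitigationLog:
    def __init__(self, min_spacing_seconds: float = 300, clock=time.monotonic):
        self.min_spacing = min_spacing_seconds
        self.clock = clock            # injectable for testing
        self.attempts = []            # list of (timestamp, description)

    def try_start(self, description: str) -> bool:
        """Record a mitigation attempt unless the last one is too recent."""
        now = self.clock()
        if self.attempts and now - self.attempts[-1][0] < self.min_spacing:
            # Previous attempt's effect is not yet distinguishable: hold off.
            return False
        self.attempts.append((now, description))
        return True
```

The log doubles as postmortem raw material: a timestamped record of exactly which mitigation ran when.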
Always be doing something
There should always be an action item that assigned people are working on. An Incident Commander needs to avoid a deadlock in the conversation where everyone is waiting on someone else to take an action.
To counter this, call people out individually by name and seek positive confirmation that they understand their delegated action; until they acknowledge the assignment, it is not yet fully delegated to them.
Regularly check for updates
Longer calls may have periods of downtime where no actions are being taken because time is needed for something to complete. For example, if a code patch is necessary, it may take time for CI tests to pass and a build to complete; it may take time for folks to gather information; sometimes no one has an idea for any mitigating action to take; etc.
During these lulls in the flow it is important not to forget to regularly check for updates. A regular cycle of reiterating the impact, the in-progress actions, and their expected times of completion keeps folks on track and moving.
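The recurring check-in described above has a fixed shape: impact, in-progress actions, ETAs. A minimal sketch of a formatter for it — the field names and layout are illustrative:

```python
# Illustrative formatter for the commander's recurring status check-in.

def status_update(impact: str, actions: list) -> str:
    """Format one check-in cycle: impact, in-progress actions, and ETAs.

    actions: list of (owner, task, eta) tuples.
    """
    lines = [f"Impact: {impact}", "In progress:"]
    for owner, task, eta in actions:
        lines.append(f"  - {owner}: {task} (ETA {eta})")
    return "\n".join(lines)
```

Posting the same shape every cycle makes it obvious when an ETA has slipped or an action has stalled.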
Rotate commanders for long incidents
Incidents are 24/7/365 kinds of events and require continuous attention until resolved, with some incidents running long into the night. In these cases the commander should assess whether the user impact truly is SEV-2 worthy, or whether it can be downgraded to SEV-3 so the team can stop for the night, get some rest, and return in the morning to keep working on the issue. Sometimes, though, an incident will run many hours and still demand continuous attention that cannot be set aside until morning.
In these cases a rotation should be arranged so that tired team members can drop from the call and get some sleep while alternates pick up the effort. People who push too long while exhausted risk making mistakes that make the situation worse instead of better, so keep in mind that team members may need to be relieved for rest.
Don’t forget to monitor post-resolution
Incidents don’t come to an instant close as soon as a mitigating action succeeds. While most folks can be released from the call quickly, care should be taken to monitor for a period of time sufficient to build confidence that the incident is truly fixed and will not immediately repeat itself.
This is usually about 15 minutes or so, but can sometimes be longer depending on the nature of the incident. It is also a good time to write down notes that will be useful during the postmortem: ensure that all actions taken are memorialized, screenshots of dashboards captured, log lines copied, etc. These are quick and easy things to collect right after the incident is fixed, instead of hours or days later when the postmortem document needs to be written, by which time some of these records may be hard to find or lost due to data retention limits.
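The "monitor until confident" step can be made concrete as a check that a key metric has stayed healthy for a full trailing window. A sketch, assuming a hypothetical error-rate metric sampled over time; the 15-minute window matches the rough guideline above:

```python
# Sketch of a post-resolution stability check. The metric, threshold, and
# 15-minute (900 s) window are illustrative assumptions.

def incident_stable(samples, threshold: float, window_seconds: float = 900) -> bool:
    """Return True if the metric stayed below threshold for a full window.

    samples: chronological list of (timestamp_seconds, error_rate) pairs.
    Requires data covering the entire trailing window; otherwise False.
    """
    if not samples:
        return False
    end = samples[-1][0]
    # Only pass if we have data from before the window started.
    covers_window = samples[0][0] <= end - window_seconds
    in_window = [v for t, v in samples if t >= end - window_seconds]
    return covers_window and all(v < threshold for v in in_window)
```

A check like this turns "are we confident it's fixed?" into a yes/no the commander can read out before closing the call.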