5 Stages of Incident Management: And How to Improve Them
5 Stages of Incident Management
1. Preparation
2. Detection & Alerting
3. Containment
4. Remediation
5. Analysis
1. Preparation
Even the most experienced IT professionals will say that Preparation is an essential,
yet often overlooked, part of incident management. It’s the stage where teams
explore “what if” scenarios and then define processes to address them.
2. Detection & Alerting:
Incident Detection is not only focused on knowing that something is wrong, but also on
how teams are notified about it. While these two may seem like separate processes,
they are in fact closely connected. The challenge is that while the proliferation of available
IT monitoring tools has greatly improved teams’ ability to detect abnormalities
and incidents, those same tools can also create “alert storms” or false positives that
complicate the response process.
Top IT teams add a layer onto the monitoring process to ensure alerts are managed
properly. This layer acts to centralize the alerting process, while also building in
additional intelligence to the way alerts are delivered.
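One simple form of that added intelligence is suppressing repeats of an alert that has already been delivered recently. The sketch below is only an illustration under stated assumptions: the `AlertRouter` class, the five-minute window, and the host and check names are all hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRouter:
    """Hypothetical central layer that suppresses repeat alerts."""

    # Assumed policy: at most one notification per (source, check)
    # within this many seconds of the last notification sent.
    window_seconds: float = 300.0
    _last_sent: dict = field(default_factory=dict)

    def should_notify(self, source: str, check: str, timestamp: float) -> bool:
        key = (source, check)
        last = self._last_sent.get(key)
        if last is not None and timestamp - last < self.window_seconds:
            return False  # already notified recently: suppress the repeat
        self._last_sent[key] = timestamp
        return True


router = AlertRouter(window_seconds=300)

# A small "alert storm": the same check fires repeatedly on one host.
events = [
    ("web-01", "cpu_high", 0.0),
    ("web-01", "cpu_high", 30.0),
    ("web-01", "cpu_high", 60.0),
    ("db-01", "disk_full", 65.0),
    ("web-01", "cpu_high", 400.0),
]

notified = [event for event in events if router.should_notify(*event)]
```

Here only the first `cpu_high` alert, the `disk_full` alert, and the `cpu_high` alert that arrives after the window has passed would reach the on-call engineer. A real alerting layer would typically add routing, escalation, and schedules on top of this.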
“Detection should lead to the appropriate response... This primarily call for the
need to clearly identify and communicate the roles, responsibilities as well as
the initial approach for incident handling. It should include determination of
who shall identify the incident and determine its severity as a means to
handle the incident effectively within the organisational context.”
MITA
• Knowledge = power.
A basic alert conveys that something is wrong, but it doesn’t always express what.
This causes unnecessary delays as teams must investigate to determine the
cause. By coupling alerts with the technical details of why they were triggered, the
remediation process can begin faster.
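As a rough illustration of coupling an alert with its technical details, the sketch below attaches context to a bare alert before it is delivered. The `Alert` class, the `enrich` and `render` helpers, and every field name and value are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    summary: str
    details: dict = field(default_factory=dict)


def enrich(alert: Alert, **context) -> Alert:
    """Return a copy of the alert with extra technical context attached."""
    return Alert(alert.summary, {**alert.details, **context})


def render(alert: Alert) -> str:
    """Format the alert as a plain-text notification, details sorted by name."""
    lines = [alert.summary]
    for key in sorted(alert.details):
        lines.append(f"  {key}: {alert.details[key]}")
    return "\n".join(lines)


# A basic alert says that something is wrong, but not what.
basic = Alert("Checkout API latency is high")

# The enriched version carries the reason it was triggered (values are made up).
detailed = enrich(
    basic,
    host="api-03",
    metric="p95_latency_ms",
    value=2300,
    threshold=800,
)

message = render(detailed)
```

The responder who receives `message` can see which host and which threshold were involved without opening a monitoring dashboard first.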
3. Containment:
The triage process for an IT incident is similar to processes deployed in medical fields.
The first step is to identify the extent of the incident. Next, the incident needs to be
contained in order to prevent the situation from getting worse. All actions taken in this
phase should be focused on limiting and preventing any further damage from occurring.
• Don’t go it alone.
Hero culture in IT teams is a dying philosophy. It is no longer fashionable to be the
lone engineer who works endless evening and weekend hours because they are
the only person who can bring systems back online. Instead, teams are working
as just that: teams. They collaborate on issues because they understand that incidents
can be resolved faster through shared knowledge. Conference lines, chat tools,
and live video feeds therefore become essential elements of the incident
management toolbox. These can quickly bring teams together so they can
collaborate in real time. It’s also common for teams to integrate chat tools
with incident management tools so incidents can be triggered, acknowledged,
and resolved from a single platform.
• Be transparent.
The digital age makes seemingly endless amounts of information available at any
time. In the midst of an IT meltdown, this can be an advantage - or disadvantage.
If users are met with a service disruption, it’s common for the incident to be
made public in short order. To stay ahead of this, teams should have an incident
communication plan in place. The goal is to build trust with customers by publicly
acknowledging that a disruption is taking place, and to assure them that steps are
being taken to resolve it. Tools like Twitter, StatusPage, and user forums are great
places to share this information. Importantly, this process should be designed to
continue through the remediation and analysis phases to further grow trust with
users who might otherwise abandon a system.
4. Remediation:
“Prior to full system recovery, remediation efforts should be performed to fix the
source of the problem. The final stage of recovery is to not just restore the
system to where it was, but rather to make it better and more secure.
The system should have the same operational capabilities, but it also
should protect against what caused the incident in the first place.”
• Cynefin.
A decision-making framework, Cynefin (pronounced “kuh-NEV-in”) provides a structured
way to approach problems that helps incident responders determine the best course of
action based on the nature of the problem itself. Depending on the type of incident
(simple, complicated, complex, or chaotic), an approach to solving it can be defined.
Responders can start with questions such as:
Does the incident have a known cause and solution?
Do I need to involve additional people to help address an incident?
Is there time to probe the problem to identify the best response, or does the
situation require immediate action?
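The questions above can be sketched as a small decision helper. This is a deliberate simplification and an assumption on our part, not an official rendering of Cynefin: the `classify` function, its parameter names, and the way the three answers map onto the four domains are hypothetical. The response pattern attached to each domain follows the framework's usual wording.

```python
from enum import Enum


class Domain(Enum):
    # Each value is the response pattern usually associated with the domain.
    SIMPLE = "sense, categorize, respond: apply the documented fix"
    COMPLICATED = "sense, analyze, respond: bring in people with expertise"
    COMPLEX = "probe, sense, respond: run small, safe experiments"
    CHAOTIC = "act, sense, respond: stabilize first, investigate later"


def classify(known_cause_and_solution: bool,
             experts_can_analyze: bool,
             time_to_probe: bool) -> Domain:
    """Map answers to the three questions onto a domain (assumed mapping)."""
    if known_cause_and_solution:
        return Domain.SIMPLE
    if not time_to_probe:
        return Domain.CHAOTIC  # the situation requires immediate action
    if experts_can_analyze:
        return Domain.COMPLICATED
    return Domain.COMPLEX


# Example: unknown cause, no time to investigate.
domain = classify(known_cause_and_solution=False,
                  experts_can_analyze=True,
                  time_to_probe=False)
```

In this example `domain` is `Domain.CHAOTIC`, so the suggested approach is to act to stabilize the system before trying to understand the cause.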
• Automate much?
Chat tools have become a de facto tool for organizations to improve
communication and collaboration. Yet chat tools have also evolved far past
simply enabling teams to send messages. The software development team at
GitHub pioneered the evolution of chat tools when they released the open-source
tool Hubot. Hubot allows users to trigger actions and scripts directly
from a chat environment. This allows teams to simplify operations by creating
bots that automate processes (initiating a server restart, deploying a snippet of
code, etc.).
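To make the idea of triggering actions from chat concrete, here is a minimal sketch of a command dispatcher. It is not Hubot and does not use Hubot's API (Hubot scripts are written in JavaScript or CoffeeScript); the `ChatBot` class, the bot name `opsbot`, and the `restart` command are hypothetical, and the restart itself is only a placeholder.

```python
import re
from typing import Optional


class ChatBot:
    """Hypothetical bot that maps chat messages addressed to it onto actions."""

    def __init__(self, name: str):
        self.name = name
        self._handlers = []

    def respond(self, pattern: str):
        """Register a handler for commands matching the given regex."""
        compiled = re.compile(pattern, re.IGNORECASE)

        def register(func):
            self._handlers.append((compiled, func))
            return func

        return register

    def handle(self, message: str) -> Optional[str]:
        prefix = self.name + " "
        if not message.lower().startswith(prefix.lower()):
            return None  # not addressed to the bot: ignore
        command = message[len(prefix):].strip()
        for pattern, func in self._handlers:
            match = pattern.fullmatch(command)
            if match:
                return func(*match.groups())
        return f"Sorry, I don't know how to '{command}'."


bot = ChatBot("opsbot")


@bot.respond(r"restart (\S+)")
def restart(server: str) -> str:
    # Placeholder: a real bot would call an orchestration tool here.
    return f"Restart requested for {server}"


reply = bot.handle("opsbot restart web-01")
```

Typing `opsbot restart web-01` in the channel would produce the reply "Restart requested for web-01", while ordinary conversation is ignored. Keeping the action behind a named command also leaves a visible record in the chat history of who triggered what.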
5. Analysis:
Incident management workflows don’t end once the dust has settled and systems
have been restored. Now begins one of the most important phases of the incident
management lifecycle: Analysis. The intent of a “postmortem” analysis is to clearly
understand both the systemic causes of an incident and the steps taken to
respond to it.
SANS Institute
• Be blameless.
The goal of every postmortem should be to understand what went wrong and
what can be done to avoid similar incidents in the future. Importantly, this process
should not be used to assign blame. Teams that focus on the “who”
and not the “what” let emotions pull the analysis away from truly understanding
what happened.
In summary:
In modern IT environments, change is the only constant. This means
systems will continually be stressed in new and different ways. Teams that
understand this also understand that it’s not a matter of if - but when -
systems will fail. Taking steps to prepare for these failures should be
recognized as a critical element of ongoing success, and integrated into
the DNA of engineering teams.
About OpsGenie:
OpsGenie provides advanced alerting and on-call management solutions
for dev & ops teams. Offering a rich feature set for designing actionable
alerts, managing on-call schedules, and defining escalation policies,
OpsGenie minimizes the impact of IT incidents by ensuring that the right
people are notified - at the right time. Founded in 2012, OpsGenie helps
thousands of IT teams across the globe maintain high
availability and uptime.