Incident management: ITIL 4 Practice Guide
This document provides practical guidance for the incident management practice.
Table of Contents
1. About this guide
2. General information
9. Acknowledgements
the practice’s processes and activities and their roles in the service value chain
2. General information
2.1 Purpose and description
Key message
The purpose of the incident management practice is to minimize the negative impact
of incidents by restoring normal service operation as quickly as possible.
Tips
A simple flow to decide if there is an incident:
- If users perceive the situation as abnormal, it is recommended to register an incident and
work on making users happy as quickly as possible, regardless of whether there is a breach of SLA.
- If users have not reported anything, but a service level agreement is breached, register an
incident and work to restore the agreed level of service before it affects users.
- If a service or configuration item is not working as defined in a technical specification,
register an incident and work to restore normal performance before it affects the SLA and users.
- If there are no formal specifications of normal operation for the service or component, or if
the service works within the specifications but a specialist thinks that it is not operating
normally, register an incident and restore normal operation as quickly as reasonably possible.
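The checks above can be sketched as an ordered decision function. This is a minimal illustration only: the `Observation` fields and the wording of the returned goals are hypothetical names chosen for this example and are not part of the guide.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical snapshot of what is currently known about a situation."""
    users_perceive_abnormal: bool = False
    sla_breached: bool = False
    has_technical_spec: bool = True
    within_technical_spec: bool = True
    specialist_suspects_abnormal: bool = False


def should_register_incident(obs: Observation) -> tuple[bool, str]:
    """Apply the four checks in order and return (register?, goal)."""
    if obs.users_perceive_abnormal:
        return True, "restore user satisfaction, regardless of SLA status"
    if obs.sla_breached:
        return True, "restore the agreed service level before users are affected"
    if obs.has_technical_spec and not obs.within_technical_spec:
        return True, "restore normal performance before the SLA and users are affected"
    # Remaining cases: no formal specification exists, or the service is within it.
    if obs.specialist_suspects_abnormal:
        return True, "restore normal operation as quickly as reasonably possible"
    return False, "no incident: continue monitoring"
```

For example, `should_register_incident(Observation(sla_breached=True))` returns a decision to register together with the goal of restoring the agreed service level.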
Definition: Incident
The quick detection and resolution of incidents is made possible with effective and
efficient processes, automation, and supplier relationships alongside skilled and
motivated specialist teams. Resources from the four dimensions of service management
are combined to form the incident management practice.
A significant business impact is not the only characteristic of a major incident. Major
incidents are often associated with a higher level of complexity. Many systems and
services are designed for high availability, and single failures are unlikely to cause a
significant business impact. Failures in these systems are quickly, and often
automatically, detected and fixed. However, if multiple seemingly trivial events coincide,
they may lead to a major disruption of multiple services and have a high impact on
service consumers. Complex incidents such as these require a special approach to
management and resolution.
clear criteria to distinguish major incidents from disasters and other incidents
other dedicated resources (including budget); for example, for urgent consultations
with third-party experts or procurement of components
stakeholders
2.2.3 Workarounds
Definition: Workaround
A solution that reduces or eliminates the impact of an incident or problem for which a
full resolution is not yet available. Some workarounds reduce the likelihood of incidents.
There are a number of activities and areas of responsibility that are not included in the
incident management practice, although they are closely related to it. These activities are
listed in Table 2.1, along with references to the practice guides in which they can be
found. Management practices should be combined to form service value streams, as
described in section 3.2.
A complex functional component of a practice that is required for the practice to fulfil
its purpose.
A practice success factor (PSF) is more than a task or activity; it includes components
from all four dimensions of service management. The nature of the activities and
resources of PSFs within a practice may differ, but together they ensure that the practice
is effective.
The incident management practice includes the following PSFs:
The higher quality of the initially collected data supports the correct response to and
resolution of incidents, including automated resolution, also known as self-healing.
Some incidents remain invisible to users, improving user satisfaction and customer
satisfaction.
Some incidents may be resolved before they affect the service quality agreed with
customers, improving the perceived service and the reported service quality.
In complicated situations, where the exact nature of the incident is unknown but the
systems and components are familiar to the support teams and the organization has
access to expert knowledge, incidents are usually routed to a specialist group or
groups for diagnosis and resolution. Sometimes this can assist in identifying patterns
and lead to a model and/or a solution which can be applied to similar incidents in the
future.
Definition: Swarming
A technique for solving various complex tasks. In swarming, multiple people with
different areas of expertise work together on a task until it becomes clear which
competencies are the most relevant and needed.
Usually, swarming assists in decreasing the level of complexity and makes it possible to
switch to the techniques used in complicated or clear situations. One example where
swarming is particularly relevant is major incidents of an unknown nature. In these
situations, pulling together numerous specialized resources is cost-effective compared to
the losses resulting from the incident remaining unsolved.
Physical meetings are not required when swarming. When a plan is established, experts
may work alone to run experiments, perform analysis, and use other tools to discover
what is happening. To engage with the incident, swarming utilizes the correct people
rather than a large number of people. It is usual to involve people from different teams in
swarming; this requires organizational solutions which allow team members to be involved
at very short notice.
Other techniques can be used in complex situations. For example, expert analysis may be
replaced or combined with a series of safe-to-fail experiments which aim to improve the
understanding of the nature of the incident. Adopting and utilizing a complexity-based
framework for decision-making is useful for dealing with incidents in situations of high
and changing complexity.
As mentioned in section 2.2.1, some incidents recur and can be handled in a well-known,
repeatable way. Ideally, such recurrences should be analysed and further repetition
prevented (this usually involves the problem management practice). However, problem
management may take significant time, and some incidents, even if well understood,
cannot be effectively prevented. Their occurrence and nature are clear, and their
handling can often follow a well-defined incident model. To optimize the time and
resources for resolution of such incidents, the shift-left approach can be used.
An approach to managing work that focuses on moving activities closer to the source
of the work, in order to avoid potentially expensive delays or escalations. In a software
development context, a shift-left approach might be characterized by moving testing
activities closer to (or integrated with) development activities. In a support context, a
shift-left approach might be characterized by providing self-help tools to end-users.
In incident management, shift-left can be used to delegate more activities to users: not
only reporting an incident, but also self-help using chat bots, FAQ pages, and other
resources. Another form of shift-left is training service desk agents to diagnose and
solve a wider range of incident types. Any opportunity to solve incidents without
transferring them to other teams should be used, especially as the transfer is likely to
take extra time and cost extra money. This should not, however, create unacceptable
delays; the speed of incident resolution remains the most important requirement. The
shift-left approach works best in clear, well-known situations, where less experienced
people can successfully follow well-tested and safe instructions.
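The relationship between the level of complexity and the handling technique described in this section can be summarized in a short sketch. The `Complexity` enum and the returned descriptions are illustrative names assumed for this example; the guide does not prescribe an implementation.

```python
from enum import Enum


class Complexity(Enum):
    CLEAR = "clear"
    COMPLICATED = "complicated"
    COMPLEX = "complex"


def choose_technique(complexity: Complexity) -> str:
    """Map the perceived complexity of an incident to a handling technique."""
    if complexity is Complexity.CLEAR:
        # Well-known, recurring incidents: follow a tested incident model.
        return "shift left: self-help or service desk follows an incident model"
    if complexity is Complexity.COMPLICATED:
        # Unknown cause, but familiar systems and available expertise.
        return "route to specialist group(s) for diagnosis and resolution"
    # Complex: it is not yet clear which competencies are needed.
    return "swarm: experts with different areas of expertise work together"
```

As swarming reduces the level of complexity, the same incident can be reassessed and handled with the techniques for complicated or clear situations.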
Regardless of the complexity, it is important to review and confirm the high quality of the
incident data from the first steps of incident handling. This has a strong influence on the:
Definitions
Prioritization
Task priority
The importance of a task relative to other tasks. Tasks with a higher priority should
be worked on first. Priority is defined in the context of all the tasks in a backlog.
There are a number of simple guidelines for prioritization which apply to all types of
tasks, including incidents:
Prioritization is needed only when there is a resource conflict. Where there are
sufficient resources to process every task within the time constraints, prioritization is
unnecessary.
In each team, all types of tasks (including incidents) should await prioritization and
assignment in a single backlog, together with other tasks (planned and unplanned).
Visualization tools, such as Kanban, and Lean principles, such as the limiting of work
in progress, are useful for effective prioritization.
These rules apply to all types of work, whether planned or unplanned, performed by the
service provider’s specialist teams. It is important that they are agreed and followed by
everyone involved in the organization’s service management activities, across all
practices. Specific to incident management, the following additional recommendations
should be considered:
Resource availability and estimated processing time are defined by each team. For
well-known repeating operations, the processing time may be standardized. The
target resolution time may be defined by SLAs and/or the internal service
specifications of the service provider. The impact assessment and completion
(resolution) time may change as support teams discover new information.
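A minimal sketch of these guidelines follows. It assumes a single team backlog, an impact scale of 1 to 3, and an ordering rule (highest impact first, then least remaining slack before the target resolution time); the scale, the ordering rule, and all names are assumptions made for illustration, not rules defined by the guide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Task:
    title: str
    kind: str                        # "incident", "change", "request", ...
    impact: int                      # assumed scale: 1 (low) to 3 (high)
    target_resolution: datetime      # from the SLA or internal specification
    estimated_processing: timedelta  # defined by the team


def slack(task: Task, now: datetime) -> timedelta:
    """Time left before work must start to meet the target resolution time."""
    return task.target_resolution - now - task.estimated_processing


def next_tasks(backlog: list, now: datetime, free_slots: int) -> list:
    """Pick tasks for the free work-in-progress slots from the single backlog."""
    if free_slots >= len(backlog):
        # No resource conflict, so prioritization is unnecessary.
        return list(backlog)
    ordered = sorted(backlog, key=lambda t: (-t.impact, slack(t, now)))
    return ordered[:free_slots]
```

Because the impact assessment and the estimated resolution time may change as new information is discovered, the ordering would be recalculated whenever a record is updated.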
Periodic reviews provide an opportunity to analyse the stakeholders’ satisfaction with the
incident management practice. Periodic incident review is also key for the continual
improvement of the practice and the organization’s products and services.
Key message
Effective reviews will always need data; therefore, it is important to agree the
requirements for documenting it. Data should be:
- Concurrent: It is useful to know exactly what was done when, to assist in continual
improvement. This requires stakeholders to update incident records during, not after,
the event. Also, an accurate timeline may be useful for investigating the problem.
Definition: Process
- Incident handling and resolution: This process is focused on the handling and
resolution of individual incidents, from detection to closure.
- Periodic incident review: This process ensures that the lessons from incident
handling and resolution are learned and that approaches to incident management
are continually improved.
Problem records
Knowledge base
Throughout the process, ownership of each incident should be ensured. The
ownership may be transferred during the handling and resolution process, but each incident
should have a person responsible for it at all times. Also, stakeholder communications
should be updated whenever there are changes in the status of the incident.
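One way to enforce these two rules in a tool is sketched below. The `IncidentRecord` class, its fields, and the message format are hypothetical; the sketch only shows that ownership can change but never be empty, and that each status change produces a timeline entry and a stakeholder update.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    incident_id: str
    owner: str
    status: str = "new"
    timeline: list = field(default_factory=list)       # (timestamp, entry) pairs
    notifications: list = field(default_factory=list)  # stakeholder updates

    def __post_init__(self):
        if not self.owner:
            raise ValueError("an incident must have an owner from the start")
        self._log(f"registered, owner={self.owner}")

    def _log(self, entry: str) -> None:
        # Entries are written at the time of the action, not afterwards.
        self.timeline.append((datetime.now(timezone.utc), entry))

    def transfer(self, new_owner: str) -> None:
        if not new_owner:
            raise ValueError("ownership can be transferred but never dropped")
        self._log(f"owner {self.owner} -> {new_owner}")
        self.owner = new_owner

    def set_status(self, status: str) -> None:
        if status != self.status:
            self.status = status
            self._log(f"status={status}")
            self.notifications.append(f"{self.incident_id}: status is now {status}")
```

Recording each action as it happens also produces the accurate timeline that periodic incident reviews rely on.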
The process may vary significantly, depending on the incident model. Table 3.2 provides
descriptions of the activities in two incident models (manual and automatic), which are
just two of many options. They are meant to illustrate the difference between incident
models.
Activity: Incident classification
Manual model: The service desk agent performs initial classification of the incident;
this helps to qualify incident impact, identify the team responsible for the failed CIs
and/or services, and to link the incident to other past and ongoing events, incidents,
and/or problems. In some cases, classification helps to reveal a previously defined
solution for this type of content.
Automatic model: Based on pre-defined rules, the following is automatically discovered:
- the incident's impact on services and users
- the solutions available
- the technical team(s) responsible for the incident resolution if automated solutions
are ineffective or unavailable.
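Automatic classification based on pre-defined rules could look like the following sketch. The rule structure, the configuration item naming prefixes, and the team and solution names are all invented for illustration; a real implementation would draw on configuration data and known solutions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    ci_prefix: str                      # matches the start of a CI name
    affected_service: str               # impact on services and users
    responsible_team: str               # used if automation is unavailable
    automated_solution: Optional[str] = None


# Hypothetical rule set; the first matching rule wins.
RULES = [
    Rule("db-", "order processing", "database team", "restart_replica"),
    Rule("web-", "customer portal", "web platform team"),
]


def classify(failed_ci: str) -> dict:
    """Return impact, available solution, and responsible team for a failed CI."""
    for rule in RULES:
        if failed_ci.startswith(rule.ci_prefix):
            return {
                "service": rule.affected_service,
                "solution": rule.automated_solution,
                "team": rule.responsible_team,
            }
    # No rule matched: fall back to manual classification.
    return {"service": None, "solution": None, "team": "service desk"}
```

When no rule matches, the sketch hands the incident to the service desk, which corresponds to the manual model.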
This process includes the activities listed in Table 3.3 and transforms the inputs into
outputs.
Incident records
Incident reports
Capacity and performance information
Continuity policies and plans
Activity: Incident review and incident records analysis
Description: The incident manager, together with service owners and other relevant
stakeholders, performs a review of selected incidents such as major incidents, those not
resolved in time, or all incidents over a certain period. They identify opportunities for
incident model and incident handling procedures optimization, including the automation
of incident processing and resolution.
In practice, however, many organizations come to use the value stream concept after
having worked for a while (sometimes for years) without the value streams being
managed, mapped, or understood. This means that when the importance of the concept
becomes clear, the first step is to understand and map the ‘As Is’ situation, the de-facto
flows of work, and to analyse them in order to identify and eliminate the non-value-
adding activities and other forms of waste.
Combined, organizations’ value streams form an operating model which can be used to
understand and improve how the organization creates value for the stakeholders.
Many organizations have been following best practice recommendations for various
service management practices, such as incident management, change enablement,
software development, and many others. Incident management is one of the most
adopted and mature practices; organizations often start their ITSM journey with incident
management.
However, the practices have often been adopted and organized in a siloed, isolated
manner, just as they were presented in the service management bodies of knowledge. In
reality, a flow of work required to create or restore value, for a customer or another
stakeholder, is almost never limited to one practice.
Activity Practice
Incident detection Service desk (for user-reported incidents)
or
Monitoring and event management
Incident management
The incident management practice is core for this value stream, but it is not enough to
complete the value stream and restore value co-creation.
ITIL 4 recommends that organizations examine how they perform work and map all the
value streams they can identify. This will enable them to analyse their current state and
identify any barriers to workflow and non-value-adding activities (waste). Wasteful
activities should be eliminated to increase productivity.
Opportunities to increase value-adding activities can be found across the service value
chain. These may be new activities or modifications to existing ones, which can make the
organization more productive. Value stream optimization may include process
automation or adoption of emerging technologies and ways of working to gain
efficiencies or enhance user experience.
Value streams should be defined by organizations for all their products and services.
Depending on the organization’s strategy, value streams can be redefined to react to
changing demand and other circumstances, or remain stable for a significant amount of
time. In any case, they should be continually reviewed and improved to ensure that the
organization achieves its objectives in an optimal way.
Separating the restoration value streams for incidents detected by users and incidents
detected by monitoring. The former value stream would be initiated by users contacting
the service desk and focused on restoring the services to an agreed level and to the users’
expectations. The latter value stream would be triggered by events captured by the
monitoring systems and focused on restoring the components and services to an agreed
technical specification, preventing any negative impact on the live services and their
users.
There is no single operating model fitting all organizations. Different solutions work for
different organizations, involving different value streams which in turn involve different
management practices.
1. Identify the scope of the value stream analysis: It can be mapped to a particular
product or service or applied to most or all of them. Similarly, service value streams
may differ for different consumers; for example, incidents can be solved and
communicated differently for internal and external customers, or for B2B and B2C
products, or for services based on products developed in-house or sourced externally.
2. Define the purpose of the value stream from the business standpoint: Make sure
the stakeholders’ concerns are clearly understood, since they are the ones defining
value. In the case of incident management, it is usually the user who needs to return to
normal work as soon as possible; however, there are usually other interested parties.
For example, internal users may be unable to provide normal service to a business
customer because of the incident, and the value of the value stream should be
considered from the business perspective, not solely from the user perspective.
3. Do the service value stream walk: Walk through or directly experience the steps
and information flow as they go in practice (consider the Lean technique of Gemba
walk):
c. Evaluate the workflow steps: Typically, the criteria for evaluation are:
value for the stakeholder (does the step add value for the business stakeholder?)
d. Map the activities and the information flows: In an ideal situation, the flow goes
smoothly without delays and pauses, there are no disconnections between the steps,
and the work is level with minimal (and agreed) variation.
e. Create and review the timeline and resource level: Map out process times and lead
times for resources and workload through the workflow steps.
4. Reflect on the value stream map (VSM): Identify factors that might not have been
entirely apparent at first. The information collected is used at this step to find the
waste.
5. Create a ‘to be’ VSM: This informs and drives improvement. The value stream should
be considered holistically to ensure end-to-end efficiency and value creation, not just
local improvements.
6. Using the ‘to be’ VSM, plan improvements: Refer to the continual improvement
practice guide for a practical improvement model.
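When the timeline is mapped, process times and lead times can be compared step by step to locate waste. The sketch below uses invented figures (in minutes) and the common Lean ratio of total process time to total lead time, often called flow efficiency; neither the figures nor the metric name come from this guide.

```python
# Each step: (name, process time, lead time), in minutes. Figures are invented.
STEPS = [
    ("detection", 5, 5),
    ("classification", 10, 40),
    ("diagnosis", 30, 120),
    ("resolution", 15, 35),
]


def summarize(steps):
    """Return total process time, total lead time, their ratio, and the
    step with the longest waiting time (lead time minus process time)."""
    process_total = sum(process for _, process, _ in steps)
    lead_total = sum(lead for _, _, lead in steps)
    waiting = {name: lead - process for name, process, lead in steps}
    longest_wait = max(waiting, key=waiting.get)
    return process_total, lead_total, process_total / lead_total, longest_wait


process_total, lead_total, efficiency, longest_wait = summarize(STEPS)
print(f"process {process_total} min, lead {lead_total} min, "
      f"ratio {efficiency:.0%}, longest wait at: {longest_wait}")
```

With these figures, value-adding work accounts for 60 of 200 minutes, and most of the waiting occurs at diagnosis, which would be the first candidate for the ‘to be’ VSM.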
At the scoping step (1), identify the IT and business services related to the value
stream and the involved business stakeholders. For example, when an IT service
provider delivers IT services consumed by business users who in turn provide
services to the business clients, should the incident-related service value stream
involve restoration of normal business services for the clients, or should it be limited
to the restoration of normal IT services for the business users?
Make sure the value stream is understood (step 2) from the standpoint of the
business, not only of the service provider.
During the service value stream walk (3a), identify other practices involved in dealing
with incidents at every step. Which practices provide required information
(configuration data, asset data, previously identified solutions, agreed timeline for
the service restoration…)? What if the incident resolution requires changes? What if
incident diagnosis and/or resolution involves third parties?
During the workflow steps evaluation (3c), evaluate the step’s impact on the value
restoration. Special attention should be paid to steps with low business value, low
performance, and availability or capacity issues. It is not unusual to find steps which
serve some internal control or bureaucratic purposes but delay the incident
resolution.
At the reflection and planning steps (4-5), ensure that the incident management
flow is optimized for business value throughout the stream, not only at the incident
management practice activities.
Include creation or update of incident models (see sections 2.2.1 and 3.1.2) in the
value stream improvement plans (step 6).
Remember, roles are not job titles. One person can take on multiple roles and one role
can be assigned to multiple people.
Roles are described in the context of processes and activities. Each role is characterized
by a competency profile based on the model shown in Table 4.1.
coordinating manual work with incidents, especially those involving multiple teams
monitoring and reviewing the work of teams that handle and resolve incidents
ensuring sufficient awareness of the incidents and their status across the
organization
In some cases, organizations may introduce an additional role of the major incident
manager (MIM). This role has similar responsibilities to the incident manager but focuses
exclusively on major incidents. This role becomes the main point of contact and
coordination during major incidents. The MIM usually has wider authority and may have
dedicated resources for major incident management.
The competency profile for these roles is CMAT, though the importance of each of these
competencies varies from activity to activity.
Incident handling and resolution process
Activity: Incident detection
Roles: Technical specialist; User
Competency profile: TC
Skills: Understanding of the service design, resource configuration, and business impact
of events and symptoms
technical domain
product/service
territory
consumer types.
The method of organization will vary, depending on the organization’s needs and
resources. The incident management practice should take a flexible approach to its
organization, involving resources from various internal and external teams as necessary.
Either way, it is crucial to ensure effective cooperation between members of different
teams involved in handling and resolution of incidents.
team members experience a lack of autonomy and report being blocked by others
a culture prevails where lone ‘heroes’ are rewarded when incidents are solved.
a decrease in morale
a lack of motivation
Furthermore, trust between team members breaks down. Approaches such as DevOps
and techniques such as swarming show some of the characteristics needed to
encourage a positive culture, although it is not necessary to follow these approaches to
achieve the correct team dynamic. The following three main areas need to be addressed.
partners and suppliers, including contract and SLA information on the services they
provide
This information may take various forms, depending on the incident models in use. The
key inputs and outputs of the practice are listed in chapter 3.
Details of incidents are the most important pieces of information. These usually include:
sources of information
the last known time of correct operation before the symptoms began
similar systems which might be affected by the poor performance and are currently
operating normally
Additional information that will be exchanged and recorded during the incident
management practice should include details of:
the investigation
Detailed descriptions of how these tools support the practice’s activities are outlined in
Table 5.2.
In some cases, all activities after a particular activity in the incident handling and
resolution process can be fully automated using pre-defined scripts and scenarios for
specific types of incidents.
Note that automation tools used in the incident management practice could include not
only organization-wide tools, which are valid for all incidents, but also some local custom
tools and scripts created as a result of a periodic incident review process for specific
incident models. Both should be used to drive automation efforts.
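A minimal sketch of combining organization-wide and local automation is shown below. The incident types, handler functions, and the choice to let local handlers take precedence are assumptions made for this example.

```python
from typing import Callable, Dict

# A handler receives the incident record and returns True when it is resolved.
Handler = Callable[[dict], bool]


def restart_service(incident: dict) -> bool:
    incident["actions"].append("restart service")
    return True


def clear_disk_cache(incident: dict) -> bool:
    incident["actions"].append("clear disk cache")
    return True


# Organization-wide tools, and local scripts added after periodic incident reviews.
ORG_WIDE_HANDLERS: Dict[str, Handler] = {"service_unresponsive": restart_service}
LOCAL_HANDLERS: Dict[str, Handler] = {"disk_full": clear_disk_cache}


def resolve(incident: dict) -> str:
    """Try a local handler first, then an organization-wide one, else escalate."""
    incident.setdefault("actions", [])
    handler = LOCAL_HANDLERS.get(incident["type"]) or ORG_WIDE_HANDLERS.get(incident["type"])
    if handler is not None and handler(incident):
        return "resolved automatically"
    return "escalated for manual handling"
```

Incident types without a pre-defined script fall through to manual handling, so automation coverage can grow gradually as new incident models are defined.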
Incident handling and resolution process
Periodic incident review process
- Automate the value stream: Although incident management is often one of the first
practices to be developed by a service provider, the implementation of ITSM
automation systems also often starts with the incident management processes.
Even if other practices may not be mature at this stage, it is important to define
requirements and design workflows that will support the full value stream, from
detection to resolution of incidents. For incident resolution that requires changes,
the automation tool should allow for a simple change tracking workflow; for
recurring incidents, it should be possible to capture and reuse proven solutions.
Think and work holistically.
- Allow different workflows for user- and event-initiated incidents: Detection,
classification, communications, and conditions for closing a record are all handled
differently for user-initiated and event-initiated incidents, even if the latter are
handled manually. Attempts to fit both types of incidents in one workflow with the
same forms and business logic are unlikely to be successful. The handling of
event-generated incidents can and should be automated.
- Do not overcomplicate the workflows and business rules: Forms filled in manually
should be user-friendly and should not take much time to fill in. When designing
user journeys and interfaces, treat IT support teams as you would treat external users
whose expectations are based on their experience with mobile apps and modern
web sites.
- Allow for swarming and other forms of cross-team collaboration: Some incident
management tools are designed for a linear flow and transfer of incident records
between the teams. When a joint action is required, it is often unsupported;
specialists meet and work together, but the incident records do not reflect it. Design
the tool for collaborative and non-linear workflows.
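As an illustration of keeping user-initiated and event-initiated workflows separate, the sketch below applies a different closure condition depending on how the incident was detected. The field names and the specific conditions (user confirmation versus a return to the technical specification) are assumptions for this example.

```python
def can_close(incident: dict) -> bool:
    """Apply the closure condition that matches the incident's source."""
    source = incident["source"]
    if source == "user":
        # User-initiated: service restored and the user has confirmed it.
        return bool(incident.get("service_restored")) and bool(incident.get("user_confirmed"))
    if source == "event":
        # Event-initiated: monitoring shows the component is back within
        # its technical specification; no user confirmation is involved.
        return bool(incident.get("within_technical_spec"))
    raise ValueError(f"unknown incident source: {source!r}")
```

Keeping the conditions in separate branches, rather than one shared form with optional fields, reflects the recommendation not to force both types of incidents into the same business logic.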
Very few services are delivered using only an organization’s own resources. Most, if not all,
depend on other services, often provided by third parties outside the organization (see
section 2.4 of ITIL Foundation: ITIL 4 Edition for a model of a service relationship).
Relationships and dependencies introduced by supporting services are described in the
practice guides for service design, architecture management, and supplier
management.
Partners and suppliers may support the development, management, and execution of
the incident management practice. The forms of support include the following:
Level 1: The practice is not well organized; it’s performed as initial or intuitive. It may
occasionally or partially achieve its purpose through an incomplete set of activities.
Level 2: The practice systematically achieves its purpose through a basic set of activities
supported by specialized resources.
Level 3: The practice is well defined and achieves its purpose in an organized way, using
dedicated resources and relying on inputs from other practices that are integrated into a
service management system.
Level 4: The practice achieves its purpose in a highly organized way, and its performance
is continually measured and assessed in the context of the service management system.
For each practice, the ITIL maturity model defines criteria for every capability level from
level two to level five. These criteria can be used to assess the practice’s ability to fulfil its
purpose and to contribute to the organization’s service value system.
Each criterion is mapped to one of the four dimensions of service management and to
the supported capability level. The higher the capability level, the more comprehensive
realization of the practice is expected. For example, criteria related to the practice
automation are typically defined at level 3 or higher because effective automation is
only possible if the practice is well defined and organized.
This approach results in every practice having up to 30 capability criteria based on the
practice PSFs and mapped to the four dimensions of service management. The number
of criteria at each level differs; the four dimensions are comprehensively covered starting
from level 3, so this level typically has more criteria than others.
Table 7.1 outlines the capability criteria that are defined in the ITIL maturity model for the
incident management practice.
To perform a quick self-assessment using the capability criteria, the following rules
should be followed.
Figure 7.2 and Table 7.2 show the capability development model, which can be applied to
every management practice. The structure of this publication is aligned with the
development steps.
Scope 2.3
Tools and procedures 5
focus on value
More information on the guiding principles and their application can be found in section
4.3 of ITIL Foundation: ITIL 4 Edition.
Table 8.1 outlines recommendations for the success of the incident management
practice, linked to the relevant guiding principles.
Recommendation: Look at the incidents from the service consumer perspective
Comments: For user-reported incidents, do not hide behind SLAs, aim to restore level of
service which satisfies the users. For monitoring-based incidents, assess business impact
even if there are no directly affected users yet. Prioritize incidents according to their
business impact.
Guiding principles: Focus on value; Collaborate and promote visibility

Recommendation: Gather and reuse data
Comments: Many incidents recur. Significant time and resources can be saved by developing
incident models and reusing known resolutions. Do not rely on individuals' experience,
motivate team members to document and share their knowledge. Leverage automation tools
to manage knowledge and automate solutions, where possible.
Guiding principles: Collaborate and promote visibility; Optimize and automate

Recommendation: Understand, manage, and improve the incident resolution value stream,
not only the incident management practice
Comments: Incident lifecycle spans beyond one practice. Ensure effective integration with
service desk, change enablement, problem management, and other relevant practices.
Guiding principles: Think and work holistically; Focus on value

Recommendation: Develop the practice continually but don't overcomplicate it
Comments: Start with the most critical products and services and with basic workflow from
detection to resolution. Gradually increase both the scope and the capability level based
on the business requirement and stakeholder feedback. Use the capability criteria and
continual improvement model as a guidance.
Guiding principles: Start where you are; Progress iteratively with feedback; Keep it simple
and practical

Recommendation: Adjust for complexity
Comments: Shift left and automate handling and resolution of repeating clear incidents.
Use swarming to optimize resolution of unusual, complex, and major incidents.
Guiding principles: Optimize and automate; Collaborate and promote visibility
9. Acknowledgements
Authors
Barry Corless, Roman Jouravlev, Andrew Vermes
Reviewers
Akshay Anand, Sofi Fahlberg, Michael G. Hall, Steve Harrop, Piia Karvonen, Anton Lykov,
Paula Määttänen, Christian F. Nissen, Mark O’Loughlin, Tatiana Orlova, Elina Pirjanti,
Stuart Rance
2023 Revision
David Cannon, Antonina Douannes, Peter Farenden, Adam Griffith, Roman Jouravlev,
Kaimar Karu, Barclay Rae, Stuart Rance, Nicola Reeves