Empowering Practical RCA using LLM
necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an innovative on-call system empowered by the Large Language Model for automating RCA of cloud incidents. RCACopilot matches incoming incidents to corresponding handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident’s root cause category, and provides an explanatory narrative. We evaluate RCACopilot using a real-world dataset consisting of a year’s worth of incidents from Transport service in Microsoft. Our evaluation demonstrates that RCACopilot achieves RCA accuracy up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been successfully in use at Microsoft for over four years.

dent is assigned to the appropriate engineering team after an initial assessment. (3) Diagnosis [28]: Assigned on-call engineers (OCEs) inspect different aspects of the incident and have several rounds of back-and-forth communication to identify the root cause. (4) Mitigation [1, 17]: Several actions are taken by OCEs to mitigate the incident and to restore service health.
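The flow summarized in the abstract (match an incident to a handler by alert type, aggregate diagnostic information, then predict a root cause category) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Handler` structure, the two placeholder actions, the `ProcessCrash` alert type, and the keyword-based predictor that stands in for the LLM step. It is not RCACopilot's actual implementation, and the explanatory-narrative step is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An action takes the incident and returns one piece of diagnostic text.
Action = Callable[[dict], str]


@dataclass
class Handler:
    """A workflow of reusable actions bound to one alert type (hypothetical structure)."""
    alert_type: str
    actions: List[Action] = field(default_factory=list)

    def collect(self, incident: dict) -> List[str]:
        # Run every action in order and aggregate its diagnostic output.
        return [action(incident) for action in self.actions]


def check_disk_space(incident: dict) -> str:
    # Placeholder action: a real one would query the affected machine.
    return f"disk usage on {incident['machine']}: {incident.get('disk_pct', 'unknown')}%"


def tail_error_log(incident: dict) -> str:
    # Placeholder action: a real one would read recent log lines.
    return f"last error: {incident.get('last_error', 'none recorded')}"


# Hypothetical registry mapping alert types to their handlers.
HANDLERS: Dict[str, Handler] = {
    "ProcessCrash": Handler("ProcessCrash", [check_disk_space, tail_error_log]),
}


def diagnose(incident: dict, predict: Callable[[str], str]) -> dict:
    """Match the incident to a handler, aggregate diagnostics, then predict a category."""
    handler = HANDLERS.get(incident["alert_type"])
    if handler is None:
        # No handler is registered for this alert type.
        return {"category": None, "diagnostics": []}
    diagnostics = handler.collect(incident)
    category = predict("\n".join(diagnostics))
    return {"category": category, "diagnostics": diagnostics}


def keyword_predictor(text: str) -> str:
    # Stand-in for the LLM prediction step; purely illustrative.
    return "FullDisk" if "100%" in text else "Unknown"
```

Passing the predictor as a parameter keeps the collection stage independent of the prediction stage, which mirrors the abstract's remark that the diagnostic collection component has been in use on its own. A call such as `diagnose({"alert_type": "ProcessCrash", "machine": "srv-01", "disk_pct": 100}, keyword_predictor)` runs both placeholder actions before predicting.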
to keep pace with the ever-evolving nature of cloud systems, thus often falling short when new incident types emerge. Even when a relevant TSG is located, it may not cover all the intricacies of the specific incident. This could be due to variations in system configurations, the presence of multiple interacting root causes, or previously unknown issues.

At the heart of RCA lies the fundamental challenge of efficiently collecting and interpreting comprehensive, incident-specific data within a limited time frame. OCEs must quickly discern the relevance of various data types to the incident at hand and interpret them correctly. However, the complexity and sheer volume of data generated by cloud systems often impede rapid decision-making. Furthermore, the expertise required to analyze various data types, along with the diverse range of possible incident causes, exacerbates the difficulty of the task. As a result, OCEs may spend an inordinate amount of time analyzing data and formulating hypotheses, detracting from time that could be better spent resolving the incident and restoring system functionality.

summary information might miss useful signals needed to reach conclusive diagnosis details; (2) Finetuning is costly and requires a huge volume of training samples, whereas we only have access to a few hundred samples with high-quality, manually labeled category information; (3) It is challenging to continuously update a finetuned GPT model as the nature and scope of incidents evolve; such models are therefore prone to generating more hallucinated results over time.

In this paper, we introduce RCACopilot, a novel approach to cloud incident root cause analysis that shifts away from the traditional reliance on TSGs. RCACopilot operates as an on-call system, empowering OCEs to construct ‘handlers’: automated workflows tailored to each alert type defined by monitors, made up of reusable actions that encode the OCEs' expertise. These predefined handlers automatically streamline the collection of incident-specific diagnostic information from multiple sources, thus ensuring a more focused and relevant data accumulation process that avoids issues on either end of the information spectrum. Subsequently, the large language
...
Figure 1. A TSG for a poisoned message incident.

examples of troubleshooting guides and illustrate their inherent limitations. Lastly, we discuss the potential advantages of integrating a large language model into the RCA process, which motivates our work.

proaches such as TSGs, though useful, may fail to exploit the full wealth of multi-source data due to inherent limitations.

2.2.1 Opportunities of Multi-Source Data. Different data sources provide different perspectives on the system state. For instance, logs can offer detailed event sequences, metrics can reflect system performance over time, and traces
second step in Figure 1, was not immediately incorporated into the TSG upon its release, causing potential inefficiencies in incident resolution.
• Insufficient details and coverage: High-level instructions often appear in TSGs, lacking in detail and specific guidance, which forces OCEs into additional research and prolongs incident resolution. In the TSG example from Figure 1, the third step instructs OCEs to check the Poison Message Logs, leaving out crucial details and causing confusion for OCEs unfamiliar with this incident type. Additionally, TSGs may overlook common checks, such as disk space checks, leading to partial or inadequate incident resolutions.

2.3 The Promise of Large Language Models for Incident Management
The rapid advancements in natural language processing and machine learning have led to the development of powerful LLMs, which are reported to be effective at various downstream tasks with zero-shot and few-shot training [5, 11]. These models have shown exceptional performance in translation, summarization, and question-answering. Leveraging their potential for incident management in cloud computing systems could revolutionize the way OCEs identify and resolve incidents. By automating the interpretation aspect of incident management, LLMs can help alleviate the stress and cognitive load associated with complex on-call tasks for OCEs, enabling them to focus more on higher-level jobs and decision-making.

3 Insights from Incidents
We conducted a comprehensive study of one year of incidents from an email service at Microsoft, employing rigorous qualitative analysis methods. Specifically, each incident was carefully reviewed by our experienced OCEs and categorized based on the characteristics of the problem, the source of the issue, and the impact on the system. We paid particular attention to the root causes of the incidents, the effectiveness of the response, and the recurrence of similar issues. While our insights were indeed intuitively derived, they were firmly grounded in empirical data and analysis. Our study not only yielded valuable insights into incident patterns and challenges but also informed the development and refinement of our approach.

Insight 1: determining the root cause based on a single data source can be challenging. As an illustration, consider Incident 2 in Table 1, where a single server failed to perform DNS resolution for incoming packets due to the
ponent’s availability dropped.
5 | 2 | Forest | CertForBogusTenants | 11 | The number of concurrent server connections exceeded a limit. | Spammers abused the system by creating a lot of bogus tenants with connectors using a certificate domain.
6 | 1 | Forest | MaliciousAttack | 2 | Forest-wide processes crashed over threshold. | Active exploit was launched in remote PowerShell by serializing malicious binary blob.
7 | ? | Forest | UseRouteResolution | ? | Poisoned messages sent to the forest made the system unhealthy. | A configuration service was unable to update the settings leading to the crash.
8 | ? | Forest | FullDisk | ? | Many processes crashed and threw IO exceptions. | A specific disk was full.
9 | ? | Forest | InvalidJournaling | ? | Messages stuck in submission queue for a long time. | The customer set an invalid value for the Transport config and caused TenantSettingsNotFoundException.
10 | ? | Forest | DispatcherTaskCancelled | ? | Normal priority messages across a forest had been queued in submission queues for a long time. | Network problem caused the authentication service to be unreachable.
[Rows 7 to 10: the second and fifth columns survive only as the unordered values 2, 3 and 9, 11, 22.]
Table 1. Examples of cloud incidents in different root cause categories.
led to an accumulation of unprocessed messages in the queue,
thereby significantly undermining its availability. Intrigu-
ingly, incidents of this category recurred 11 times in a span
of merely 15 days. Likewise, the DispatcherTaskCancelled