
Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents


Yinfang Chen⋄§ , Huaibing Xie⋄¶ , Minghua Ma∗ , Yu Kang∗ , Xin Gao∗ , Liu Shi∗ , Yunjie Cao∗
Xuedong Gao∗ , Hao Fan∗ , Ming Wen† , Jun Zeng‡ , Supriyo Ghosh∗ , Xuchao Zhang∗
Chaoyun Zhang∗ , Qingwei Lin∗ , Saravan Rajmohan∗ , Dongmei Zhang∗ , Tianyin Xu§
Microsoft∗ , UIUC§ , PKU ¶ , HUST† , NUS‡
⋄ This research was primarily conducted during an internship at Microsoft Research Asia.
2023, ACM, arXiv

Abstract
Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigations of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an innovative on-call system empowered by the Large Language Model for automating RCA of cloud incidents. RCACopilot matches incoming incidents to corresponding handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from the Transport service in Microsoft. Our evaluation demonstrates that RCACopilot achieves an RCA accuracy of up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been successfully in use at Microsoft for over four years.

CCS Concepts: • Computer systems organization → Cloud computing; • Software and its engineering → Maintaining software.

Keywords: Root Cause Analysis, Large Language Models, Cloud Systems

1 Introduction
Cloud computing serves as an indispensable infrastructure for numerous applications and services upon which people rely daily. As the adoption of cloud services continues to grow, ensuring their reliability, availability, and security becomes increasingly vital [12, 26, 30]. However, the complexity of cloud systems makes them vulnerable to a variety of incidents that could pose significant challenges to these crucial properties [43]. A typical incident life-cycle consists of four stages: (1) Detection [31, 41, 42]: When an anomalous system behavior is observed, an alert is raised by monitors or users of the service (internal engineers or external customers). (2) Triaging [4, 8, 9]: After the detection, the incident is assigned to the appropriate engineering team after an initial assessment. (3) Diagnosis [28]: Assigned on-call engineers (OCEs) inspect different aspects of the incident and have several rounds of back-and-forth communication to identify the root cause. (4) Mitigation [1, 17]: Several actions are taken by OCEs to mitigate the incident and to restore service health.

Root cause analysis (RCA) is pivotal in promptly and effectively addressing these incidents. By accurately diagnosing the underlying problem and preventing its recurrence, RCA not only restores service availability swiftly but also fortifies the overall reliability of cloud services. However, identifying the root causes of these incidents often represents a daunting and time-consuming task that requires significant human expertise and intervention [30].

Traditional approaches to cloud incident RCA typically involve the manual collection and analysis of various types of data, such as logs [16, 22, 25, 46, 47], metrics [32], traces [45], and incident tickets [17, 36]. This manual process is not only laborious and error-prone, but can also be challenging due to varying levels of available information, what we term the 'information spectrum'. The 'information spectrum' describes a continuum of information availability, ranging from situations with too little information to those inundated with an excess. At either end of this spectrum, root cause analysis can become particularly challenging. The relevant information for RCA might be buried within voluminous data, leading to an information overload for OCEs, who may find it challenging to quickly pinpoint the relevant information amidst the sea of data, hindering efficient incident resolution. Conversely, OCEs could also encounter situations where they lack the necessary information to understand and address the root causes of incidents accurately. Beyond these challenges, the collected data itself is often noisy, incomplete, and inconsistent, further complicating the RCA process.

Specifically, the engineering team documents the frequent troubleshooting steps in the form of troubleshooting guides (TSGs) to facilitate the handling of future incidents. However, the volume of TSGs is overwhelming for OCEs, making the search for the most relevant guide a time-consuming task that might prolong system downtime. Moreover, TSGs struggle

to keep pace with the ever-evolving nature of cloud systems, thus often falling short when new incident types emerge. Even when a relevant TSG is located, it may not cover all the intricacies of the specific incident. This could be due to variations in system configurations, the presence of multiple interacting root causes, or previously unknown issues.

At the heart of RCA lies the fundamental challenge of efficiently collecting and interpreting comprehensive, incident-specific data within a limited time frame. OCEs must quickly discern the relevance of various data types to the incident at hand and interpret them correctly. However, the complexity and sheer volume of data generated by cloud systems often impede rapid decision-making. Furthermore, the expertise required to analyze various data types, along with the diverse range of possible incident causes, exacerbates the difficulty of the task. As a result, OCEs may spend an inordinate amount of time analyzing data and formulating hypotheses, detracting from time that could be better spent resolving the incident and restoring system functionality.

Data-driven and Artificial Intelligence (AI) techniques have been leveraged for automating incident management [9, 10]. While there are existing techniques that recommend relevant TSGs [17] and automate the workflows of TSGs [36], their utility is limited by the inherent challenges associated with TSGs. Despite these automated processes, OCEs still find themselves investing significant manual effort in sifting through the vast amounts of information, interpreting the data, and identifying the root causes of incidents.

The recent advent and success of Generative Pretrained Transformer (GPT) models in performing complex tasks [5, 38] suggests a promising avenue for enhancing RCA. Specifically, GPT models can be used to parse through high-volume data, discern relevant information, and produce succinct, insightful outputs. This significantly alleviates the burden on OCEs to manually sift through vast amounts of data, helping them focus on resolving the incident more quickly and effectively. Additionally, GPT models can adapt to new and evolving types of incidents, learning from previous data to improve future predictions. While GPT models can process and generate text efficiently, they lack intrinsic domain-specific knowledge, especially in specialized areas such as cloud incident management. This lack of understanding of specific contexts, such as cloud incidents, can limit their accuracy in predicting incident root causes and generating appropriate explanations.

Recently, Ahmed et al. [1] proposed to finetune a pretrained GPT model with a domain-specific dataset for generating root causes of an incident by leveraging only the title and summary information available at the time of incident creation. While they have demonstrated the promise of GPT models in incident root causing, finetuning poses several limitations: (1) As accurate root cause analysis requires various sources of complex unstructured data (e.g., logs, telemetry, traces), using only the generic title and initial summary information might miss useful signals needed to reach conclusive diagnosis details; (2) Finetuning is costly and requires a huge volume of training samples, whereas we only have access to a few hundred high-quality manually labeled categories; (3) It is challenging to continuously update a finetuned GPT model with the evolving nature and scope of incidents; therefore such models are prone to generate more hallucinated results over time.

In this paper, we introduce RCACopilot, a novel approach to cloud incident root cause analysis that shifts away from the traditional reliance on TSGs. RCACopilot operates as an on-call system, empowering OCEs to construct 'handlers' - automated workflows tailored to each alert type defined by monitors, made up of reusable actions defined by their expertise. These predefined handlers automatically streamline the collection of incident-specific diagnostic information from multiple sources, thus ensuring a more focused and relevant data accumulation process that avoids issues on either end of the information spectrum. Subsequently, the large language model (LLM) component of RCACopilot processes this diagnostic data, autonomously identifying the categories and providing explanations of incident root causes. The combination of bespoke handlers and the analytical capabilities of the LLM allows RCACopilot to significantly enhance adaptability and scalability in incident response. As a result, RCACopilot can effectively handle a diverse array of incident types while reducing the need for extensive human intervention.

The diagnostic information collection component of RCACopilot has been in use at Microsoft for over four years. Recently, the root cause prediction component has been prototyped and tested by some incident teams at Microsoft before its final rollout in production.

Summary. This paper makes the following contributions:
• We propose RCACopilot, an automated tool for cloud incident RCA that enables on-call engineers to construct incident-specific automatic workflows for efficient data collection from multiple sources.
• We introduce the integration of a large language model within RCACopilot that autonomously analyzes the collected diagnostic data to predict incident root cause categories and generate explanations, demonstrating the potential of the large language model in enhancing RCA.
• We showcase the real-world applicability of RCACopilot by presenting its successful adoption within Microsoft. This illustrates its practical effectiveness in enhancing RCA efficiency, demonstrating the feasibility and benefits of our approach in real-world cloud computing scenarios.

2 Background and Motivation
In this section, we first introduce the concept and importance of incident root cause analysis. We then present real-world
2
Empowering Practical Root Cause Analysis by
Large Language Models for Cloud Incidents 2023, ACM, arXiv

examples of troubleshooting guides and illustrate their inherent limitations. Lastly, we discuss the potential advantages of integrating a large language model into the RCA process, which motivates our work.

Troubleshooting Guide for Poisoned Messages
1. Go to the Poisoned Message Dashboard. This page gives a real-time, high-level view of the Poison Message feature. The charts should indicate whether the problem has resolved itself or is ongoing, as well as some sense of where it is occurring . . .
2. The Dashboard newly implements an Exception Table that has poisoned messages within a time frame. In most cases, whatever exception is causing an alert will rise to the top of the table . . .
3. You may also check the Poison Message Logs . . .
. . .

Figure 1. A TSG for a poisoned message incident.

2.1 Incident Root Cause Analysis
In the realm of cloud services, an incident refers to any event that disrupts normal service operations or causes degradation in the quality of services. When such incidents occur, root cause analysis is performed to identify the underlying issue causing the disruption.
RCA in cloud services is a multi-faceted process:
• Data Collection: Gathering relevant data from various sources such as logs, metrics, traces, or alerts is the first step in RCA.
• Data Analysis: The collected data is then analyzed to identify patterns, anomalies, or correlations that can possibly provide clues about the root cause of the incident.
• Hypothesis Verification: Based on the data analysis, hypotheses about the possible root cause are formulated and then verified by OCEs.
Given the complexity and dynamic nature of cloud systems, along with the immense volume of data involved, conducting RCA is a challenging task that requires substantial expertise and time. Take the scale of our corporation's email service as an example, which delivers over 150 billion messages daily. Ensuring the smooth operation of such a large-scale service demands an efficient and effective RCA approach. This is pivotal in maintaining a reliable and high-performing communication infrastructure, particularly for organizations that rely heavily on Microsoft's email server for their email communication.

2.2 The Opportunities and Challenges of Multi-Source Data in Incident Management
Managing incidents in the complex ecosystem of cloud services necessitates a comprehensive understanding of system states. This comprehension often stems from the consolidation of multi-source data, which includes traces, logs, and metrics. Traces represent tree-structured data detailing the flow of user requests; logs are semi-structured text recording hardware and software events; and metrics monitor service status or user-perceived indicators, forming time-series data. While these individual data sources yield valuable insights, capitalizing on their potential has challenges. Traditional approaches such as TSGs, though useful, may fail to exploit the full wealth of multi-source data due to inherent limitations.
2.2.1 Opportunities of Multi-Source Data. Different data sources provide different perspectives on the system state. For instance, logs can offer detailed event sequences, metrics can reflect system performance over time, and traces can reveal the propagation of requests across services. Integrating these data sources can provide a more comprehensive view of the system, enabling more accurate and efficient incident diagnosis and resolution. Furthermore, multi-source data can facilitate correlation and causality analysis, which is crucial for root cause analysis. By analyzing the relationships between different data sources, we can identify patterns and anomalies that may indicate the root cause of an incident.
2.2.2 Challenges of Multi-Source Data. Despite its potential, effectively leveraging multi-source data in incident management is challenging. The sheer volume and complexity of data from various sources can be overwhelming, making it difficult to extract meaningful insights. Worse still, different data sources may provide inconsistent or conflicting information. Moreover, real-world data is often noisy, which can complicate analysis and lead to false conclusions.
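The three data shapes described above (tree-structured traces, semi-structured log events, and time-series metrics) can be pictured with a minimal sketch. This is purely illustrative; the class and field names are invented here and are not RCACopilot's actual schema.

```python
# Illustrative sketch (not RCACopilot's actual schema) of the three
# multi-source data shapes: tree-structured traces, semi-structured
# log events, and time-series metrics.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Span:
    """One node in a trace tree: a request hop with child calls."""
    service: str
    operation: str
    duration_ms: float
    children: List["Span"] = field(default_factory=list)

@dataclass
class LogEvent:
    """Semi-structured log record: timestamp, level, free text."""
    timestamp: float
    level: str
    message: str

@dataclass
class MetricSeries:
    """Time series of (timestamp, value) samples for one metric."""
    name: str
    samples: List[Tuple[float, float]]

def slowest_path(span: Span) -> List[str]:
    """Walk the trace tree, following the slowest child at each hop --
    a toy version of how traces reveal request propagation."""
    path = [f"{span.service}/{span.operation}"]
    while span.children:
        span = max(span.children, key=lambda s: s.duration_ms)
        path.append(f"{span.service}/{span.operation}")
    return path
```

A correlation query over these shapes, e.g. matching a latency spike in a `MetricSeries` to error-level `LogEvent`s and the slowest trace path in the same window, is the kind of cross-source analysis the section above argues for.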
2.2.3 Limitations of TSGs. Traditional TSGs represent an early attempt to leverage multi-source data for incident management. They guide OCEs to gather and analyze data from various sources to diagnose and resolve incidents. However, TSGs face several inherent limitations:
• Manual data integration: TSGs typically require OCEs to gather data from different sources manually. This process can be time-consuming and error-prone. Notwithstanding the existence of diverse troubleshooting guides and TSG recommendation techniques [17], dependence on TSGs remains a significant source of stress and burnout for OCEs due to the inherent limitations of the manual process.
• Outdated information: TSGs, as static documents, often struggle to stay up-to-date with evolving system changes and new insights about incident root causes. This lag can lead OCEs to follow outdated or suboptimal troubleshooting steps. For example, a new feature ("Exception Table") to check Poison Message exceptions, mentioned as the

second step in Figure 1, was not immediately incorporated into the TSG upon its release, causing potential inefficiencies in incident resolution.
• Insufficient details and coverage: High-level instructions often appear in TSGs, lacking detail and specific guidance, which forces OCEs into additional research and prolongs incident resolution. In the TSG example from Figure 1, the third step instructs OCEs to check the Poison Message Logs, leaving out crucial details and causing confusion for OCEs unfamiliar with this incident type. Additionally, TSGs may overlook common checks, such as disk space checks, leading to partial or inadequate incident resolutions.

2.3 The Promise of Large Language Models for Incident Management
The rapid advancements in natural language processing and machine learning have led to the development of powerful LLMs, which are reported to be effective at various downstream tasks with zero-shot and few-shot training [5, 11]. These models have shown exceptional performance in translation, summarization, and question-answering. Leveraging their potential for incident management in cloud computing systems could revolutionize the way OCEs identify and resolve incidents. By automating the interpretation aspect of incident management, LLMs can help alleviate the stress and cognitive load associated with complex on-call tasks, enabling OCEs to focus more on higher-level work and decision-making.

2.4 Our Motivation
The motivation for our work is rooted in the challenges faced when using manual TSGs to diagnose incidents and identify the underlying root causes. Recognizing the limitations of manual TSGs, our goal is to develop an automated diagnostic process that harnesses the capabilities of LLMs to address various cloud incidents more effectively.
Different from previous work [36], which employs AI techniques to generate automated workflows from existing TSGs, our goal is to enable experienced OCEs to construct an automated pipeline for incident diagnosis. This approach allows OCEs to be directly assisted in identifying the root cause without the need to investigate intermediate diagnostic information, though they still have the option to do so.
We envision a future in which root cause analysis is predominantly automated, requiring minimal manual verification only when necessary. Our approach seeks to provide OCEs with timely, relevant, and accurate information for specific incidents, leading to more efficient RCA. By leveraging LLMs, our research aims to alleviate the stress and cognitive load associated with incident management, ultimately enhancing the efficiency and effectiveness of OCEs in addressing incidents.

3 Insights from Incidents
We conducted a comprehensive study of one year of incidents from an email service at Microsoft, employing rigorous qualitative analysis methods. Specifically, each incident was carefully reviewed and categorized by our experienced OCEs based on the characteristics of the problem, the source of the issue, and the impact on the system. We paid particular attention to the root causes of the incidents, the effectiveness of the response, and the recurrence of similar issues. While our insights were intuitively derived, they were firmly grounded in empirical data and analysis. Our study not only yielded valuable insights into incident patterns and challenges but also informed the development and refinement of our approach.
Insight 1: determining the root cause based on a single data source can be challenging. As an illustration, consider Incident 2 in Table 1, where a single server failed to perform DNS resolution for incoming packets due to the exhaustion of UDP hub ports on a front door machine. This example highlights the difficulties of relying solely on a single source (monitor alert) to diagnose complex issues.
When a mailbox server sends mail to external email recipients, it uses specific front-door servers (proxies). However, each front-door server has a limited number of available SMTP outbound proxy connections. If a mailbox server's proxy connection request fails, it will be unable to send messages to external recipients. In this incident, the monitor first raises an alert indicating detected failures when connecting to the front door server. However, this alert only signifies a connection issue between the mail server and the front door server, without even suggesting a DNS resolution problem. Consequently, the root cause remains unclear.

Figure 2. Recurring incidents proportion vs. time interval.
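The recurrence statistic behind Figure 2 can be reproduced with a sketch like the following: group incidents by root cause category, measure the gap since the previous incident of the same category, and ask what fraction of gaps fall within a given window. The field names and data layout are illustrative; the incident dataset itself is internal.

```python
# Sketch of the recurrence analysis behind Figure 2. For each incident,
# compute the gap (in days) since the previous incident of the same
# root cause category, then report the fraction of gaps within `days`.
# Field names are illustrative, not the paper's internal schema.
from collections import defaultdict
from typing import Dict, List

def recurrence_within(incidents: List[dict], days: float) -> float:
    """incidents: [{'category': str, 'time': float (days)}, ...].
    Returns the fraction of recurrences whose gap is <= `days`."""
    by_cat: Dict[str, List[float]] = defaultdict(list)
    for inc in incidents:
        by_cat[inc["category"]].append(inc["time"])
    gaps: List[float] = []
    for times in by_cat.values():
        times.sort()
        gaps.extend(b - a for a, b in zip(times, times[1:]))
    if not gaps:
        return 0.0
    return sum(g <= days for g in gaps) / len(gaps)
```

Evaluated at a 20-day window over the studied dataset, this kind of computation yields the 93.80% figure cited below.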
Insight 2: incidents stemming from similar or identical root causes often recur within a short period. We found that most recurring incidents (93.80%) tend to reappear within a brief span of 20 days, as shown in Figure 2. For instance, consider the category of Incident 9 from Table 1. This type of incident, triggered by invalid customer configuration,

No. | Sev. | Scope   | Category                | Occur. | Symptom | Cause
1   | 1    | Forest  | AuthCertIssue           | 3  | Tokens for requesting services could not be created. Several services reported users experiencing outages. | A previous invalid certificate overrode the existing one due to misconfiguration.
2   | 2    | Machine | HubPortExhaustion       | 27 | A single server failed to do DNS resolution for the incoming packages. | The UDP hub ports on the machine had run out.
3   | 2    | Forest  | DeliveryHang            | 6  | Mailbox delivery service hung for a long time. | Number of messages queued for mailbox delivery exceeded the limit.
4   | 2    | Forest  | CodeRegression          | 15 | An SMTP authentication component's availability dropped. | Bug in the code.
5   | 2    | Forest  | CertForBogusTenants     | 11 | The number of concurrent server connections exceeded a limit. | Spammers abused the system by creating a lot of bogus tenants with connectors using a certificate domain.
6   | 1    | Forest  | MaliciousAttack         | 2  | Forest-wide processes crashed over threshold. | Active exploit was launched in remote PowerShell by serializing a malicious binary blob.
7   | 2    | Forest  | UseRouteResolution      | 9  | Poisoned messages sent to the forest made the system unhealthy. | A configuration service was unable to update the settings, leading to the crash.
8   | 3    | Forest  | FullDisk                | 10 | Many processes crashed and threw IO exceptions. | A specific disk was full.
9   | 2    | Forest  | InvalidJournaling       | 11 | Messages stuck in the submission queue for a long time. | The customer set an invalid value for the Transport config and caused TenantSettingsNotFoundException.
10  | 3    | Forest  | DispatcherTaskCancelled | 22 | Normal priority messages across a forest had been queued in submission queues for a long time. | Network problem caused the authentication service to be unreachable.

Table 1. Examples of cloud incidents in different root cause categories.
led to an accumulation of unprocessed messages in the queue, thereby significantly undermining its availability. Intriguingly, incidents of this category recurred 11 times in a span of merely 15 days. Likewise, the DispatcherTaskCancelled incidents (No. 10 in Table 1) and the DeliveryHang incidents (No. 3) reappeared 22 times and 6 times within a week and a single month, respectively. These recurrences can be attributed to several factors. First, unresolved root causes from the initial response may lead to the same issue re-emerging, especially if the problem is complex or not fully understood. Secondly, systemic vulnerabilities, if not addressed, can be repeatedly exploited, causing similar incidents. Thirdly, external dependencies, such as reliance on a service that frequently experiences outages, can also lead to recurring incidents. These patterns suggest that by leveraging insights from previous incidents, we could swiftly identify the root cause of new occurrences with the same root cause.

Figure 3. Distribution of incident category frequency.
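The distribution in Figure 3, together with the share of incidents whose root cause category had never been seen before, can be tallied with a short sketch. The data layout is illustrative; the incident records themselves are internal.

```python
# Sketch: tally how often each root cause category occurs (the
# distribution in Figure 3) and what fraction of incidents were the
# first occurrence of their category, i.e. a "new" root cause.
# Data layout is illustrative.
from collections import Counter
from typing import List, Tuple

def category_stats(categories: List[str]) -> Tuple[Counter, float]:
    """categories: incident categories in chronological order.
    Returns (occurrence counts, fraction of incidents that were the
    first occurrence of their category)."""
    counts = Counter(categories)
    seen = set()
    first = 0
    for cat in categories:
        if cat not in seen:
            first += 1
            seen.add(cat)
    frac = first / len(categories) if categories else 0.0
    return counts, frac
```

Applied to the studied dataset, this kind of first-occurrence count is what yields the 24.96% (163 among 653) figure cited in Insight 3.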
Insight 3: incidents with new root causes occur frequently and pose a greater challenge to analyze. TSGs can help OCEs diagnose issues by providing clear investigation guidance. However, when incidents arise from new,

previously unencountered root causes, OCEs face a set of challenges. For such incidents, no TSG exists, and OCEs may struggle to identify the underlying issues. For instance, Incident 1 is a high-severity (severity 1) incident caused by a misconfiguration that blocked authentication token generation and led to severe outages. Similarly, Incident 6 is a malicious attack caused by an attacker launching an exploit with a malicious blob. This type of attack had never been encountered before, leaving OCEs without an existing TSG to reference. Lower-severity (severity 2) incidents, such as Incident 5, are also susceptible to this challenge when a spammer first abuses the system. As Figure 3 shows, incidents with a new root cause category account for 24.96% (163 among 653) of all incidents. If OCEs spend their time searching for nonexistent TSGs, the incident's impact could escalate further. Recognizing this challenge, it is necessary to propose a new approach that can effectively infer, categorize, and explain the root causes of such unseen incidents, thereby reducing the time OCEs take to identify and address these unique incidents.

4 RCACopilot
RCACopilot has two stages: the diagnostic information collection stage and the root cause prediction stage, as shown in Figure 4.
Diagnostic information collection stage: This is the initial stage, where the incident is parsed and matched to a pre-defined incident handler. Each handler is tailored to a specific alert type. Upon matching the incident with the appropriate handler, RCACopilot proceeds to collect relevant diagnostic data from a variety of sources.
Root cause prediction stage: Once the diagnostic information is collected, RCACopilot transitions into the root cause prediction stage. In this phase, RCACopilot applies its predictive module to determine the likely root cause category of the incident. This prediction is not a mere categorization; it is also supplemented with an explanation detailing how RCACopilot arrived at the given prediction. Subsequently, the predicted category label is presented to experienced OCEs for review.
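The prediction stage described above retrieves similar past incidents by embedding the summarized diagnostic information and searching a store of labeled embeddings, which are then fed to an LLM prompt. The sketch below illustrates that shape only: `embed` is a stand-in for any text-embedding model, and the store layout and prompt wording are assumptions, not RCACopilot's actual implementation.

```python
# Sketch of the prediction stage: embed the summarized diagnostics,
# find the K most similar labeled past incidents by cosine similarity,
# and assemble an in-context prompt for an LLM. The store layout and
# prompt are illustrative, not RCACopilot's actual vector database.
import math
from typing import Callable, List, Tuple

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_incidents(summary: str,
                      store: List[Tuple[Vector, str, str]],  # (embedding, category, summary)
                      embed: Callable[[str], Vector],
                      k: int = 3) -> List[Tuple[str, str]]:
    """Return (category, summary) of the K most similar past incidents."""
    q = embed(summary)
    ranked = sorted(store, key=lambda rec: cosine(q, rec[0]), reverse=True)
    return [(cat, s) for _, cat, s in ranked[:k]]

def build_prompt(summary: str, neighbors: List[Tuple[str, str]]) -> str:
    """Assemble an in-context prompt asking for a category + explanation."""
    examples = "\n".join(f"- [{cat}] {s}" for cat, s in neighbors)
    return (f"Past incidents:\n{examples}\n\n"
            f"New incident diagnostics:\n{summary}\n"
            "Predict the root cause category and explain your reasoning.")
```

The design choice this illustrates is retrieval-augmented in-context prediction rather than finetuning: the labeled store can grow as new categories appear, sidestepping the retraining limitations discussed in Section 1.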
4.1 Diagnostic Information Collection Stage
Driven by Insight 1 in Section 3, RCACopilot aims to collect multi-source data for RCA. Specifically, for each alert type, an incident handler is constructed, comprising a series of actions to collect diagnostic information. Alert types are used to categorize alerts based on specific monitors and thresholds. Incidents sharing the same alert type exhibit similar symptoms, though they may stem from different root causes.
The RCACopilot incident handler is a workflow that consists of a series of actions. Each action is a function that can be executed to collect specific diagnostic information from a target data source. OCEs can build and modify these handlers based on their expertise. The handler includes three distinct actions: the scope switching action, the query action, and the mitigation action, which will be explained in Section 4.1.2. Each action generates an output, guiding the control flow of the incident handler. We use the RCACopilot handler that diagnoses Incident 7 in Table 1 as an example to illustrate handler usage.
4.1.1 Incident handler. The decision-making process that OCEs employ when handling an incident resembles a decision tree's control flow. The root node in the incident handler is the incident alert type, which is gathered from the system monitor. We distilled OCE operations into three actions when constructing the incident handler. As OCE operations can be similar across different incident types (e.g., conducting a common disk check or a query to a database), we designed RCACopilot handler actions to be reusable across all handlers. We also maintain the versions of the handlers in the database, which can be used to track their historical changes. RCACopilot's incident handlers can be updated and modified dynamically by OCEs, allowing them to stay abreast of the most recent system changes and newly discovered root causes. For instance, when a new metric is introduced into the system, OCEs only need to construct a new action to collect the relevant data and incorporate it into the corresponding incident handler, which ensures timely adaptation.
4.1.2 Handler action. RCACopilot leverages the synergy of multi-source data. The system uses predefined actions in the incident handler to automatically collect relevant diagnostic information from diverse sources. This automated integration of data not only saves time but also reduces the likelihood of human error. It also provides a more comprehensive view of the system state, facilitating efficient and accurate incident resolution. This significantly lightens the workload of OCEs, reducing stress and burnout, and enhances the effectiveness of the incident resolution process. The action in the handler can be one of the following:
Scope switching action: This action facilitates precision in RCA by allowing adjustments to the data collection scope based on the specific needs of each incident. For instance, as depicted in Figure 5, if an alert originates at the 'forest' level, signifying an issue within a specific forest, and the problem type is identified as 'Busy Hub', the scope switching action can adjust the scope to the 'machine' level. This modification allows for a more fine-grained investigation, specifically assessing whether a single hub server is overly taxed.
[Figure 4 residue. The architecture diagram shows two stages: in the collection stage, an incoming incident (ID, Title, OwningTenant, OwningTeam) is parsed and matched to an incident handler loaded from a database of handlers, which collects diagnostic information; in the prediction stage, an LLM summarizes the diagnostic information, the summary is embedded and stored in a vector database, the K nearest historical incidents (Incident 1 ... Incident K) are retrieved, and the LLM predicts the root cause category with an explanation for the OCEs.]

Figure 4. RCACopilot architecture.
to a more balanced and effective diagnostic data collection process.

Query action: A query action can query data from different sources and output the query result as a key-value table. This type of action can also be hooked to the execution of a specific script with pre-defined parameters. Usually, such scripts are internal automatic investigation tools for a service, and only the service team has access to them. For instance, in Figure 5, the "Known issue?" action node queries the database to see whether the current incident is a known one based on its alert messages. If it is a known issue, the execution flow enters the "True" branch to give mitigation actions directly. Otherwise, a query script that aggregates threads with the same stack traces is executed. It obtains an instantaneous list of the stacks on all the managed threads in the target process and then groups common stacks together in order to identify potential deadlocks or blocking code paths in the process. The query action can also output an enum value that decides the next action node to execute; e.g., after the "Get top error msg" node retrieves the top error message on the exception stack traces, the next action node to run depends on the exception type. Based on the error messages, a specific team is then notified and engaged, as shown in Figure 5.

Mitigation action: This action refers to the strategic steps suggested to alleviate an incident, such as "restart service" or "engage other teams", as depicted in Figure 5. It is important to note that handlers do not always provide exact mitigation strategies for every incident, due to their pre-defined nature, which may not cover all possible situations. For instance, Incident 4 in Table 1, categorized under code regression, presents a case where identifying and rectifying such code issues can be challenging. In cases where the incident handler is uncertain, it only offers intermediate diagnostic information to the OCEs, without mitigation.

4.1.3 Multi-source diagnostic information. RCACopilot's diagnostic information collection stage serves as a valuable tool for OCEs by aggregating data from a myriad of sources. OCEs only need to customize the actions in the handler to acquire the diagnostic information from a target source. For instance, as illustrated in Figure 6, RCACopilot can assimilate diverse data such as error logs, exception stack traces, and socket metrics related to a specific incident. The error log and exception stack trace alone do not provide sufficient insight to identify the root cause of the incident. However, when supplemented with the socket metrics, a more comprehensive picture emerges: in this example, it is clear that the UDP socket pool is exhausted, which is the root cause.

In the case of new incidents, RCACopilot can perform a range of common checks, such as evaluating the provisioning status or analyzing thread stacks. This assists OCEs in gaining a holistic understanding of the situation. Note that the information collected is pre-defined in the actions of the RCACopilot handler, ensuring that only relevant data is gathered and that OCEs are not overwhelmed with unnecessary information. By providing this comprehensive diagnostic information, RCACopilot empowers OCE teams to troubleshoot issues efficiently, using the gathered information as guidance to address incidents more effectively.

4.2 LLMs for Incident Explanation
Upon thorough investigation, each incident within our service is manually assigned a root cause category by our seasoned OCEs. OCEs use these categories to classify historical incidents and to guide the RCA of new incoming incidents. However, reasoning about incidents and inferring their categories is time-consuming and potentially overwhelming for OCEs, who have a tight time budget. Given this, we have identified the categorization of incident root causes as our primary downstream task.

Recently, LLMs have demonstrated remarkable capabilities in understanding the context of downstream tasks and generating relevant information from demonstrations, making them a possible choice for incident RCA. However, reasoning about an incident's root cause is not a simple task, and LLMs may not achieve optimal results on long-tail or domain-specific tasks without any guidance [6, 18]. Chain-of-Thought (CoT) prompting is a gradient-free technique that elicits LLMs to generate intermediate reasoning steps that lead to the final answer.
2023, ACM, arXiv Yinfang et al.
[Figure 5 residue. The handler flowchart includes nodes such as "Determine issue type" (branching on Busy Hub, Busy Delivery/Recipient, Others, Default), "Switch scope to single server", "Analyze single busy server", "Get-ThreadStackGrouping.ps1", "Known issue?" (True/False), "Check delivery health" and "Delivery is restarted recently?", "Get top error msg" (branching on messages such as *Deliver.Exception:MailboxOfflineException.* and *recipient mailbox location information is not available*), and mitigation actions such as "Restart service", "Collect diagnose logs", "Report to a specific team", and "Engage other teams".]

Figure 5. A RCACopilot handler for the "too many messages stuck in the delivery queue" alert.
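The handler structure above (a graph of actions whose outputs select the next node) can be sketched as follows. This is a hypothetical Python illustration, not RCACopilot's actual C# implementation; all node, field, and function names are invented for the example.

```python
# Hypothetical sketch of an incident handler as a branching workflow of
# actions (cf. Figure 5). Each action collects information into the incident
# record and returns a label that selects the next node.
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class ActionNode:
    run: Callable[[dict], str]                              # collects info, returns a branch label
    branches: Dict[str, str] = field(default_factory=dict)  # label -> next node name

def execute_handler(nodes: Dict[str, ActionNode], start: str, incident: dict) -> dict:
    """Walk the handler graph, accumulating diagnostic info on the incident."""
    current: Optional[str] = start
    while current is not None:
        node = nodes[current]
        label = node.run(incident)
        current = node.branches.get(label)                  # stop when no branch matches
    return incident

# Toy nodes: determine the issue type, then either switch scope or check
# whether the incident is a known issue (names are illustrative).
def determine_issue_type(incident: dict) -> str:
    return "busy_hub" if "busy" in incident["alert"] else "others"

def switch_scope_to_machine(incident: dict) -> str:
    incident["scope"] = "machine"                           # narrow from 'forest' to one server
    return "done"

def query_known_issues(incident: dict) -> str:
    incident["known_issue"] = False                         # pretend the DB lookup found nothing
    return "false"

nodes = {
    "determine": ActionNode(determine_issue_type,
                            {"busy_hub": "switch", "others": "known"}),
    "switch": ActionNode(switch_scope_to_machine),
    "known": ActionNode(query_known_issues),
}

result = execute_handler(nodes, "determine",
                         {"alert": "busy hub server", "scope": "forest"})
```

A real handler would attach query scripts and mitigation suggestions to such nodes; the point here is only the branching control flow.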
In few-shot CoT prompting, a few manual demonstrations are provided, each composed of a question and a reasoning chain that leads to its answer. Inspired by these ideas, the diagnostic information provided by RCACopilot handlers can be used as ingredients for the reasoning process over incidents.

DatacenterHubOutboundProxyProbe probe log result from [MachineID].
Total Probes: 2, Failed Probes: 2
Id  Level  Created                Description
2   Error  11/21/2022 2:04:20 AM  Probe result
2   Error  11/21/2022 1:49:20 AM  Probe result
Count: 2
...
Exceptions:
Failed probe error:
Name: No such host is known.
A WinSock error: 11001 encountered when connecting to host: [HOST NAME]
InformativeSocketException: No such host is known.
A WinSock error: 11001 encountered when connecting to host: [HOST NAME]
at TcpClientFactory.Create(...)
at SimpleSmtpClient.Connect(...)
...
Total UDP socket count: 15276
Total UDP socket count by process and processId (top 5 only):
14923: Transport.exe, 203736
15: w3wp.exe, 102296
8: svchost.exe, 4748
7: Microsoft.Transport.Store.Worker.exe, 74060
7: Microsoft.Transport.Store.Worker.exe, 87724

Figure 6. Diagnostic information for hub port exhaustion.

4.2.1 Embedding model. Our observation is that the semantics of an incident can be revealed from the context in which its diagnostic information is described. A common approach to extracting such contextual semantics is the use of embedding models. The objective is to map the diagnostic information into an embedding space (i.e., a numeric vector space), where the distances between vectors represent the semantic similarity of incidents. Choosing a computationally efficient embedding model allows us to preserve accuracy while handling a large number of incidents.

We employ FastText as our embedding model: it is efficient, insensitive to input length, and generates dense vectors, making it easy to calculate the Euclidean distance between similar vectors. Furthermore, since our downstream task is specific to incident root cause reasoning and the incident-related information is internal to our company, we opt to train a FastText model on our historical incidents rather than using a pre-trained large language model as the embedding model, which would be costly and inefficient. Additionally, we provide users with the flexibility to customize the embedding model if desired.
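To make the embedding idea concrete, here is a toy, self-contained sketch, not the trained FastText model the paper uses: it mimics FastText's subword idea by averaging hashed character n-gram vectors and compares incidents by Euclidean distance. The bucket vectors are random stand-ins for learned parameters.

```python
# Toy FastText-like embedding: average hashed character n-gram vectors
# (random stand-ins for learned subword vectors), then compare texts by
# Euclidean distance. Illustrative only.
import numpy as np

DIM, BUCKETS = 64, 4096
rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(BUCKETS, DIM))

def embed(text: str, n: int = 3) -> np.ndarray:
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    idx = [hash(g) % BUCKETS for g in grams]
    return bucket_vectors[idx].mean(axis=0)

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

log_a = "WinSock error 11001: no such host is known"
log_b = "WinSock error 11001: no such host is known."   # near-duplicate
log_c = "62 managed threads in process TransportDelivery"
d_near = euclidean(embed(log_a), embed(log_b))
d_far = euclidean(embed(log_a), embed(log_c))
# Near-duplicate logs share almost all n-grams, so they land close together,
# while an unrelated log lands far away in the embedding space.
```

A trained FastText model replaces the random bucket vectors with learned subword vectors, but the distance computation is the same.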
4.2.2 Diagnostic information summary. LLMs have shown potential for automatic summarization [34]. Nonetheless, the diagnostic information collected by RCACopilot handlers is often too extensive: as shown in Figure 6, the diagnostic information of an incident can exceed 2,000 tokens, with low readability of the log messages. The considerable number of tokens in the incident description can pose challenges for the LLM to process effectively and may introduce noise. Therefore, feeding the diagnostic information of an incident directly into the LLM to make a prediction is not an ideal choice, let alone using the information from multiple sources. In this regard, we add another layer that leverages the LLM's summarization ability to condense the diagnostic information before the diagnosis reasoning. We construct the prompt as shown in Figure 7, asking the LLM to summarize the diagnostic information into 120-140 words without outputting any unrelated information. This summarization makes the diagnostic information more concise and informative, which forms the basis for the later CoT prompting. Figure 8 illustrates the more readable and concise text generated by
RCACopilot, which is a summary (113 tokens) of the previous diagnostic information example in Figure 6, highlighting key details such as the number of UDP ports in use and the process using the most. Specifically, we employ the tiktoken [35] tokenizer to count text tokens.

"Please summarize the above input. Please note that the above input is incident diagnostic information. The summary results should be about 120 words, no more than 140 words, and should cover important information as much as possible. Just return the summary without any additional output."

Figure 7. Prompt to summarize diagnostic information.

"The DatacenterHubOutboundProxyProbe has failed twice on the backend machine, with both failures due to a WinSock error 11001 indicating that the host is unknown. This error was encountered while attempting to connect to the host. The error is associated with the EOP service and has not been notified yet. The failure context suggests the same issue. The total UDP socket count is 15276, with the majority being used by the Transport.exe process. The issue seems to be related to the SMTP connection and requires further investigation."

Figure 8. The summarized diagnostic information.

4.2.3 Nearest neighbor search. Incidents are heterogeneous, making it impractical to include all past incidents' information in the prompt, even after summarization, due to prompt length limitations. To selectively choose past cases as samples in the prompt, we design a new similarity formula to calculate the similarity between two incidents:

    Distance(a, b) = ||a − b||₂

    Similarity(a, b) = (1 / (1 + Distance(a, b))) · e^(−α·|T(a) − T(b)|)

The formula first computes the Euclidean distance for every pair of incident embedding vectors. Importantly, it also takes into account the temporal distance between incidents, reflecting Insight-2 in Section 3: here, T(x) stands for the date of incident x. This consideration of temporal distance is crucial, as it influences the relevance of past incidents to the current one. After calculating similarities, we select the top K incidents as demonstrations for the LLM. This approach ensures a diverse and representative set of incidents for effective LLM reasoning. The values of α and K have been determined as 0.3 and 5, respectively, through the empirical evaluation presented in Section 5.4.
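The similarity formula and top-K selection above can be sketched in a few lines of Python. This is a simplified, assumed representation: the embeddings are placeholder vectors and integer "days" stand in for incident dates.

```python
# Sketch of the similarity formula: embedding distance combined with a
# temporal-decay term; the top-K most similar incidents become CoT
# demonstrations. alpha = 0.3 and K = 5 follow Section 5.4.
import math
import numpy as np

ALPHA, K = 0.3, 5

def similarity(vec_a, vec_b, day_a: int, day_b: int) -> float:
    distance = float(np.linalg.norm(np.asarray(vec_a) - np.asarray(vec_b)))
    return (1.0 / (1.0 + distance)) * math.exp(-ALPHA * abs(day_a - day_b))

def top_k_neighbors(query_vec, query_day, history, k: int = K):
    """history: list of (incident_id, embedding, day); returns the k best ids."""
    scored = sorted(((similarity(query_vec, vec, query_day, day), iid)
                     for iid, vec, day in history), reverse=True)
    return [iid for _, iid in scored[:k]]

history = [("inc1", [0.0, 0.0], 10),   # close in space and time
           ("inc2", [0.1, 0.0], 9),    # close in space, one day older
           ("inc3", [5.0, 5.0], 10)]   # far away in embedding space
neighbors = top_k_neighbors([0.1, 0.1], 10, history, k=2)
# inc1 and inc2 outrank inc3, whose large embedding distance dominates.
```

Note how the exponential term down-weights older incidents even when their embeddings are close, which is exactly the role of α in the formula.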
4.2.4 Prediction prompt construction. CoT prompting is a gradient-free technique that guides LLMs to produce intermediate reasoning steps leading to the final answer. In few-shot CoT prompting, several demonstrations each include a question and a reasoning chain that directs the answer. Without hinging on hand-crafted demonstrations, Auto-CoT [50] has shown the power of automatically constructing the prompt to form the reasoning chains. Drawing inspiration from this concept, we can view the summarized diagnostic information and the labeled root cause categories as questions and reasoning, so finding the nearest incident neighbors is an automatic reasoning chain construction, aligning well with the CoT prompting context. We construct the prompt as in Figure 9 to ask the LLM to choose the incident that most likely has the same root cause as the current incident, and we explicitly push the LLM to reason by using "give your explanation" indications in the prompt.

Context: The following description shows the error log information of an incident. Please select the incident information that is most likely to have the same root cause and give your explanation (just give one answer). If not, please select the first item "Unseen incident".
Input: The DatacenterHubOutboundProxyProbe probe result from [BackEndMachine] is a failure ...
Options:
A: Unseen incident.
B: The DatacenterHubOutboundProxyProbe has failed twice ... category: HubPortExhaustion.
C: There are 62 managed threads in process TransportDelivery ... category: AuthCertIssue.

Figure 9. The prompt to predict incident category.

4.3 Implementation
We have developed and deployed RCACopilot in a combined total of 58,286 lines of code: 56,129 lines of C# and 2,157 lines of Python.

To facilitate the building of RCACopilot incident handlers, we have implemented RCACopilot's handler construction as a web application. To support a new type of alert in RCACopilot, OCEs only need to add a new handler in the handler construction GUI according to their expertise (see Appendix A). After the new handler has been constructed, it is stored in the database, and OCEs can modify it by creating new action nodes or deleting old ones.

5 Evaluation
We aim to answer the following questions in our evaluation:
(1) How effective and efficient is RCACopilot as an on-call system when predicting root cause categories and assisting OCEs? RCACopilot achieves 0.766 and 0.533 for
Micro-F1 and Macro-F1, respectively, when predicting the root cause category of cloud incidents, outperforming all our baselines with low runtime overhead (4.205 seconds). RCACopilot is also able to generate new root cause category labels for unseen incidents, with explanations.
(2) How do the different components of RCACopilot facilitate its diagnosis and prediction? The diagnostic information collection component, GPT summarization, and chain-of-thought prompting all contribute to RCACopilot's prediction effectiveness.
(3) Is RCACopilot suitable for deployment in real production services, and are RCACopilot's results trustworthy? RCACopilot's diagnostic information collection module has been deployed across 30 teams within Microsoft for over four years. To evaluate the trustworthiness of RCACopilot, each experiment was conducted over three rounds, and RCACopilot consistently achieves a Micro-F1 score over 0.70 and a Macro-F1 score exceeding 0.50.

All experiments are performed on a server with an Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz, 32.0 GB of physical memory, and Intel UHD Graphics 630, running Windows 11 Enterprise.

5.1 Target System and Dataset
We evaluate RCACopilot on a global email service system named Transport within Microsoft. The Transport team focuses on developing and maintaining the components responsible for mail flow, routing, and delivery. The system interacts with various other services to ensure seamless integration with a multitude of products and services, including serviceA, serviceB, and serviceC. Hence, it is representative of complex, real-world systems that interact with multiple components. With around 150 billion messages delivered daily, Transport operates at a colossal scale and caters to customers worldwide, adding another layer of diversity and complexity. The system ensures the secure and effective transmission of emails between users, utilizing protocols such as SMTP, IMAP, and POP3. Given its crucial role in communications infrastructure, effective and efficient incident management is essential.

We collect a one-year dataset of 653 incidents from Microsoft's Transport service to investigate RCACopilot's efficacy in practice. Each of these incidents represents a complex issue in a large-scale, globally distributed system, and thus each provides valuable insights. The dataset is manually labeled with root cause categories by experienced OCEs, which serves as our ground truth. We divide the incident cases into training (75%) and testing (25%) sets.

We conduct experiments with two large language models in RCACopilot, i.e., GPT-3.5-turbo and GPT-4 (8K tokens), which are the latest models from OpenAI. We choose GPT-4 as the default model in RCACopilot because it has the best performance.

5.2 Compared Approaches
We select XGBoost, FastText, and a fine-tuned LLM as baselines to compare with RCACopilot. We also build two variants, GPT-4 Prompt and GPT-4 Embed., to evaluate the design of RCACopilot.
• XGBoost provides parallel tree boosting and has been commonly used in networking system diagnosis.
• FastText is a popular lightweight textual embedding approach, which has been adopted in testbed studies with fault injections for root cause diagnosis tasks.
• Fine-tune GPT fine-tunes a pre-trained GPT-3.5 model on our training dataset and evaluates its performance on our testing dataset with the temperature parameter set to 0. Note that GPT-4 is currently not available for fine-tuning.
• GPT-4 Prompt is a variant of RCACopilot that directly predicts the category from RCACopilot's diagnostic information summaries.
• GPT-4 Embed. is a variant of RCACopilot that changes the embedding model from FastText to GPT embeddings.

Method                 Micro F1   Macro F1   Avg. Train. (s)   Avg. Infer. (s)
FastText [45]          0.076      0.004      10.592            0.524
XGBoost [3]            0.022      0.009      11.581            1.211
Fine-tune GPT [1]      0.103      0.144      3192              4.262
GPT-4 Prompt           0.026      0.004      –                 3.251
GPT-4 Embed.           0.257      0.122      1925              3.522
RCACopilot (GPT-3.5)   0.761      0.505      10.562            4.221
RCACopilot (GPT-4)     0.766      0.533      10.562            4.205

Table 2. Effectiveness of different methods.

5.3 Effectiveness and Efficiency
We evaluate RCACopilot's effectiveness by predicting the root cause category of an incident from the summarized diagnostic information, using the micro and macro F1-score metrics. These metrics calculate the harmonic mean of precision and recall: the micro F1-score aggregates performance over all classes, taking into account the contribution of each sample, while the macro F1-score focuses on the performance of each individual class. RCACopilot achieves a micro F1-score of 0.766 and a macro F1-score of 0.533 on our testing dataset.

As shown in Table 2, RCACopilot outperforms the other approaches while incurring an acceptable, somewhat higher runtime overhead.
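For reference, the micro- and macro-F1 metrics used throughout this evaluation can be computed as follows. This is the standard definition, not code from the paper; the sample labels are illustrative.

```python
# Minimal reference computation of micro- and macro-F1 for multi-class
# single-label prediction (standard definition, illustrative data).
from collections import Counter

def f1_scores(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1          # predicted p, but it was not p
            fn[t] += 1          # true label t was missed
    def f1(t, f_p, f_n):
        return 2 * t / (2 * t + f_p + f_n) if (2 * t + f_p + f_n) else 0.0
    # micro: pool counts over all classes; macro: average per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

y_true = ["HubPortExhaustion", "AuthCertIssue", "HubPortExhaustion", "DiskFull"]
y_pred = ["HubPortExhaustion", "AuthCertIssue", "AuthCertIssue", "DiskFull"]
micro, macro = f1_scores(y_true, y_pred)
```

Micro-F1 weights every sample equally (for single-label prediction it coincides with accuracy), while macro-F1 gives rare root cause categories the same weight as frequent ones, which is why the two scores diverge on the long-tail incident distribution.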
The performance of the baseline approaches is poor, since multiple root cause categories exhibit a long-tail (imbalanced) distribution, as shown in Figure 3, and traditional machine learning models (FastText and XGBoost) as well as the fine-tuned GPT model need a large amount of training data to produce accurate predictions. Directly employing the GPT-4 prompt or GPT-4 embedding approach without our design lacks the domain-specific knowledge GPT-4 needs to make decisions. On the contrary, RCACopilot leverages the powerful LLM to learn domain-specific knowledge from minimal cases, so it achieves the best performance. The results indicate that RCACopilot not only provides higher accuracy but also maintains a reasonable level of efficiency, making it a suitable choice for incident root cause analysis.

When facing incidents that it has never seen before, RCACopilot is capable of generating a new category keyword to describe the new incident case. For example, Incident 8 in Table 1 is a new incident case that RCACopilot has never encountered; RCACopilot's prediction component predicts it as a new category, "I/O Bottleneck". Although OCEs subsequently categorized it as "DiskFull" in post-investigation, the fundamental aspects of the problem identified by RCACopilot align closely with the human-derived label. RCACopilot's explanation of how it arrived at the "I/O Bottleneck" categorization is provided in Figure 10.

"The prediction of 'I/O Bottleneck' was made based on the occurrence of System.IO.IOExceptions within crucial functions handling input/output operations, suggesting an issue with data processing. The nested exception within the DiagnosticsLog module reinforces this notion. These errors, combined with crashes on different backend machines, point to a system struggling with handling data flow."

Figure 10. RCACopilot's explanation of an incident.

AlertInfo   DiagnosticInfo   ActionOutput   Micro   Macro
            ✓                               0.689   0.510
            ✓sum.                           0.766   0.533
                             ✓              0.379   0.245
✓           ✓                               0.525   0.511
✓                            ✓              0.431   0.247
            ✓                ✓              0.501   0.449
✓           ✓                ✓              0.440   0.349

Table 3. Effectiveness of different prompt contexts for RCACopilot. ✓sum. stands for the summarized diagnostic information.

5.4 Comparison Analysis
To understand how the different components of RCACopilot facilitate root cause analysis, we conduct an ablation study on RCACopilot's components.

Evaluation of diagnostic information. First, we evaluate the impact of diagnostic information on effectiveness. In particular, we compare the diagnostic information collected in the collection stage against other incident-related information, namely incident alert information (AlertInfo) and RCACopilot handler action output (ActionOutput). AlertInfo includes the alert type and alert scope: the alert type is a pre-defined anomaly description from a monitor, which reflects only a symptom of the incident rather than the root cause, e.g., an exception type from external monitors, and the alert scope is the scope of the incident, e.g., a single machine. ActionOutput is the output of a series of executed RCACopilot actions, hashed as key-value pairs. As shown in Table 3, using diagnostic information alone outperforms the alternatives in both Micro-F1 (0.689) and Macro-F1 (0.510). The interesting observation here is that mixing the diagnostic information with the other sources does not enhance RCACopilot's predictive capabilities, demonstrating that an excess of information can negatively impact the LLM's prediction performance.

Evaluation of GPT summarization. We evaluate the role of GPT summarization in enhancing RCACopilot's effectiveness. As depicted in Table 3, using summarized diagnostic information yields the highest Micro-F1 and Macro-F1 scores, improvements of 0.077 and 0.023, respectively, over the non-summarized diagnostic information. The results demonstrate that the summarization step effectively condenses the information, allowing more efficient and accurate processing of incident data.

Evaluation of few-shot CoT reasoning. We assess how few-shot CoT reasoning contributes to improving effectiveness. The GPT-4 Prompt approach in Table 2, which directly predicts the category without any samples, achieves only 0.026 Micro-F1 and 0.004 Macro-F1.

[Figure 11 residue: two plots showing Micro-F1 (a) and Macro-F1 (b) for different values of K and alpha.]

Figure 11. Effectiveness of using different K and alpha.
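The parameter sweep behind Figure 11 can be sketched as a simple grid search over alpha and K. This is a hedged illustration, not the paper's tooling: `evaluate_f1` is a placeholder the reader would back with an actual validation-set evaluation, and the toy stand-in below merely has its optimum at the paper's reported values.

```python
# Sketch of a grid search over (alpha, K), keeping the pair with the best
# micro-F1 on a validation set. evaluate_f1 is a caller-supplied placeholder.
import itertools

def grid_search(evaluate_f1, alphas, ks):
    """evaluate_f1(alpha, k) -> micro-F1 on a validation set."""
    return max(itertools.product(alphas, ks),
               key=lambda cfg: evaluate_f1(*cfg))

# Toy stand-in whose optimum is alpha=0.3, K=5, mirroring the paper's finding.
def fake_eval(alpha, k):
    return 1.0 - abs(alpha - 0.3) - 0.01 * abs(k - 5)

best = grid_search(fake_eval, [0.1, 0.3, 0.5], [1, 3, 5, 10])
```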
As shown in Figures 11a and 11b, we compare the performance of RCACopilot with different numbers of samples in the chain-of-thought reasoning. Our analysis reveals that the best combination of the number of samples and the alpha value is 5 and 0.3, which achieves the highest F1 scores. Note that more samples in the CoT reasoning do not always yield an improvement for RCACopilot, and the value of alpha plays an important role in deciding the effectiveness. When alpha is appropriate, it allows RCACopilot to better capture the temporal relationships between different incidents, leading to more accurate predictions.

5.5 Deployment Status and Scale
We have successfully deployed RCACopilot's diagnostic information collection module across over 30 teams within Microsoft, where it has been in active use for over four years. The system is tailored to each team's specific requirements, with custom handlers built for each unique setting. Not all handlers are currently enabled in the production environment, as some are still under development and rigorous testing. We observe that the average running time for each incident ranges from 15 seconds to 841 seconds (see Appendix A); the highest running times are attributable to teams' large-scale and complex system infrastructure. As part of our commitment to continuous improvement and a quality user experience, we have incorporated a feedback mechanism in emails to gather the perspectives of OCEs. According to the collected feedback, most OCEs expressed satisfaction with the diagnostic information provided by RCACopilot.

5.6 Trustworthiness
While GPT has shown great potential and impressive results in various tasks, it is known to exhibit some instability in certain complex tasks such as question answering, as noted by Tan et al. [37]. These instabilities could potentially lead to variable results. To ensure the trustworthiness and stability of GPT's predictive capabilities in RCACopilot, each experiment was conducted for three rounds. In each round, RCACopilot was able to maintain a high level of performance, with the Micro-F1 consistently above 0.70 and the Macro-F1 remaining above 0.50.

6 Discussion
RCACopilot's effectiveness depends on the ability of the underlying LLM. Currently, RCACopilot is only integrated with OpenAI's GPT models, and we have not yet explored the potential effectiveness of other available LLMs. As such, performance may vary depending on the strengths and weaknesses of the specific LLM employed.

We conducted our evaluation of RCACopilot's prediction module using the incident dataset from Transport. The dataset was prepared with the assistance of experts on the Transport team, given their extensive experience and established practice of incident labeling. Note that the effectiveness of RCACopilot is also influenced by the quality of the root cause categories; currently, all root cause categories are manually labeled by our experienced OCEs. RCACopilot's diagnostic information collection has been deployed in over 30 teams. Consequently, valuable future work would be to evaluate RCACopilot across different services to gain a more comprehensive understanding of its generalizability and adaptability.

RCACopilot's handler is designed to respond to alerts generated by monitors. This implies that for incidents that the monitors do not detect, RCACopilot will not be able to match a handler, limiting its applicability.

We conducted three rounds of experiments to evaluate RCACopilot's effectiveness. However, the occasional instability of LLMs can influence their effectiveness, causing variations across rounds. Another potential threat to internal validity lies in the implementation of our approach and of those we compared against. To mitigate this risk, two authors have carefully checked the code; in particular, we implemented the baselines on top of mature frameworks.

7 Related Work
Root cause analysis. Root cause analysis in large cloud services has become a popular research topic in the systems and software engineering communities [2, 7, 14, 15, 19, 24, 27, 30, 40, 49]. It aims to identify the root causes of failures and performance issues based on various data sources, such as metrics, logs, and traces. Previous studies have proposed different approaches for root cause analysis using one of these data sources. For example, some methods rely on metrics to extract failure patterns [30, 48] or to construct service dependency graphs [20, 29]. Others use logs to analyze a subset of log messages [1, 47] or to examine the details within each log message [22, 46]. Moreover, some techniques utilize traces to locate the faulty service [21, 23, 39, 43]. Unlike prior work, we build a system that automatically integrates metrics, logs, and traces for root cause analysis with state-of-the-art large language models.

Large Language Models. In recent years, the rise of LLMs has brought new opportunities to the field of software systems by enabling tasks such as code generation, summarization, repair, testing, and root cause analysis [1, 13, 33, 34, 44]. For example, Mastropaolo et al. [34] studied the ability of a fine-tuned T5 on the following tasks: automatic bug fixing, generation of assert statements, code summarization, and injection of code mutants. LANCE [33] uses a fine-tuned T5 to automatically generate logging statements for Java methods. VulRepair [13] also fine-tunes T5 on vulnerability repair datasets to automatically propose vulnerability fixes. Zhang et al. [44] propose prompting LLMs to improve code version control. Ahmed et al. [1] fine-tune GPT-x
models to recommend root causes and mitigation steps to 2019 34th IEEE/ACM International Conference on Automated Software
facilitate cloud incident management. In contrast to previous Engineering (ASE). IEEE, 364–375.
studies, RCACopilot employs advanced LLMs to summarize [10] Junjie Chen, Shu Zhang, Xiaoting He, Qingwei Lin, Hongyu Zhang,
Dan Hao, Yu Kang, Feng Gao, Zhangwei Xu, Yingnong Dang, et al.
diagnosis data and leverage the chain-of-thoughts ability to 2020. How incidental are the incidents? characterizing and prioritiz-
predict and explain root causes. ing incidents for large-scale online service systems. In Proceedings of
the 35th IEEE/ACM International Conference on Automated Software
8 Conclusion Engineering. 373–384.
[11] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde
RCACopilot represents a pioneering tool in the realm of
cloud incident management, facilitating efficient root cause
analysis for OCEs. It introduces a unique approach to multi-source
data collection through its diagnostic information collection
stage, utilizing predefined incident handlers. These handlers,
constructed by OCEs, systematically gather multi-source
diagnostic information, which sets the foundation for the
subsequent analysis. Furthermore, RCACopilot integrates a
large language model in its root cause prediction stage. This
model autonomously processes the collected diagnostic data,
predicting and explaining the root cause category. This
integration of AI techniques into cloud incident management
demonstrates the potential of RCACopilot in enhancing the
efficiency and accuracy of root cause analysis.
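The two-stage flow summarized above (handlers collect multi-source diagnostics, then an LLM predicts and explains a root cause category) can be sketched as follows. This is a minimal illustration only: every name, field, and the keyword-based stand-in for the LLM call are assumptions, not RCACopilot's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Handler:
    alert_type: str
    steps: list  # callables, each returning one labeled diagnostic snippet

def run_handler(handler, incident):
    # Stage 1: collect multi-source diagnostic information for the incident.
    return [step(incident) for step in handler.steps]

def build_prompt(incident, diagnostics):
    # Stage 2 input: the incident description plus the collected evidence,
    # phrased to elicit chain-of-thought reasoning.
    evidence = "\n".join(diagnostics)
    return (f"Incident: {incident['title']}\n"
            f"Diagnostics:\n{evidence}\n"
            "Think step by step, then name the root cause category.")

def predict_category(prompt):
    # Stand-in for the LLM call; the real system queries a hosted model.
    if "OOM" in prompt:
        return "memory-pressure"
    return "unknown"

handler = Handler(
    alert_type="ServiceLatencyHigh",  # hypothetical alert type
    steps=[
        lambda inc: f"log scan: {inc['log_hint']}",
        lambda inc: "metric scan: worker RSS near cgroup limit",
    ],
)
incident = {"title": "Latency spike in region-X",
            "log_hint": "OOM killer invoked"}
prompt = build_prompt(incident, run_handler(handler, incident))
print(predict_category(prompt))  # -> memory-pressure
```

The design point this illustrates is the separation of concerns: handlers encode team-specific collection logic, while the prediction stage consumes their output uniformly.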

Empowering Practical Root Cause Analysis by
Large Language Models for Cloud Incidents 2023, ACM, arXiv

A Appendix

We have successfully deployed RCACopilot's diagnostic
information collection module across 30 teams within
CompanyX. From these, we have selected the top 10 teams
that utilize the most RCACopilot incident handlers. Table 4
provides details about the average execution time and the
count of active handlers for these selected teams.

Team      Avg. exec. time (s)   Enabled handlers
Team1     841                   213
Team2     378                   204
Team3     106                    88
Team4     449                    42
Team5     136                    41
Team6      91                    34
Team7     449                    32
Team8     255                    32
Team9     323                    31
Team10     22                    18

Table 4. Teams using RCACopilot to automatically collect
diagnostic information.

To facilitate the building of RCACopilot incident handlers,
we have implemented RCACopilot's handler construction as
a web application, as shown in Figure 12.

Figure 12. Web-based UI of RCACopilot for handler
construction.
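As a rough illustration of what the web-based construction UI might produce, a handler could serialize to a small declarative spec that is sanity-checked before being enabled for a team. All field names below are assumptions for illustration, not RCACopilot's real schema.

```python
# Hypothetical serialized form of a handler built in the web UI.
handler_spec = {
    "team": "Team3",
    "trigger": {"alert_type": "SmtpDeliveryFailure"},  # illustrative alert type
    "actions": [
        {"kind": "query_logs", "scope": "last_30_min"},
        {"kind": "query_metrics", "names": ["cpu", "queue_depth"]},
    ],
}

def validate(spec):
    # Minimal checks a construction UI could run before enabling a handler:
    # it must match some alert type and collect at least one data source.
    if not spec.get("trigger", {}).get("alert_type"):
        raise ValueError("handler must match an alert type")
    if not spec.get("actions"):
        raise ValueError("handler must collect at least one data source")
    return True

print(validate(handler_spec))  # -> True
```

A declarative spec like this would also explain the per-team handler counts in Table 4: each enabled handler is one such trigger-plus-actions unit.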
