
Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4

Xuchao Zhang, Microsoft, Redmond, United States
Supriyo Ghosh, Microsoft, Bangalore, India
Chetan Bansal, Microsoft, Redmond, United States
Rujia Wang, Microsoft, Redmond, United States
Minghua Ma, Microsoft, Redmond, United States
Yu Kang, Microsoft, Shanghai, China
Saravan Rajmohan, Microsoft, Redmond, United States
ABSTRACT
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services, requiring on-call engineers to identify the primary issues and implement corrective actions to prevent future recurrences. Improving the incident RCA process is vital for minimizing service downtime, customer impact and manual toil. Recent advances in artificial intelligence have introduced state-of-the-art Large Language Models (LLMs) like GPT-4, which have proven effective in tackling various AIOps problems, ranging from code authoring to incident management. Nonetheless, the GPT-4 model's immense size presents challenges when trying to fine-tune it on user data because of the significant GPU resource demand and the necessity for continuous model fine-tuning as new data emerges. To address the high cost of fine-tuning LLMs, we propose an in-context learning approach for automated root causing, which eliminates the need for fine-tuning. We conduct an extensive study over 100,000 production incidents from CompanyX, comparing several large language models using multiple metrics. The results reveal that our in-context learning approach outperforms previously fine-tuned large language models such as GPT-3 by an average of 24.8% across all metrics, with an impressive 49.7% improvement over the zero-shot model. Moreover, human evaluation involving actual incident owners demonstrates its superiority over the fine-tuned model, achieving a 43.5% improvement in correctness and an 8.7% enhancement in readability. These results demonstrate the viability of utilizing a vanilla GPT model for the RCA task, thereby avoiding the high computational and maintenance costs associated with a fine-tuned model.

CCS CONCEPTS
• Computing methodologies → Natural language generation; • Software and its engineering → Software maintenance tools.

KEYWORDS
Incident Diagnosis, Root Cause Analysis, Large Language Model, In-context Learning

ACM Reference Format:
Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, and Saravan Rajmohan. 2024. Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion '24), July 15–19, 2024, Porto de Galinhas, Brazil. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3663529.3663846

This work is licensed under a Creative Commons Attribution 4.0 International License.
FSE Companion '24, July 15–19, 2024, Porto de Galinhas, Brazil
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0658-5/24/07
https://doi.org/10.1145/3663529.3663846

1 INTRODUCTION
Over the last decade, the IT industry has transitioned away from the traditional practice of distributing software in shrink-wrapped packages. Instead, companies are increasingly embracing cloud platforms as their favored approach for deploying applications and services. Within these extensive cloud services, incidents such as unplanned interruptions or performance degradation can have a significant negative impact on customer satisfaction, resulting in revenue loss and a decline in customer trust. At present, the process of diagnosing such incidents still heavily relies on manual investigation or the use of specialized service tools. Nevertheless, due to the escalating scale and complexity of modern cloud systems, relying solely on human capabilities is inadequate for effectively and efficiently handling incidents, leading to extended Time-to-Mitigate (TTM).
Root cause analysis, as a pivotal task in the incident management lifecycle [10], plays a vital role in identifying the underlying cause behind the occurrence of an incident. By conducting a root cause analysis, the on-call engineers can uncover the primary issues that caused the incident and take appropriate corrective actions to prevent similar incidents from recurring in the future. This task is crucial for effective incident resolution, enhancing system reliability, and improving overall incident response processes.
Although LLMs have demonstrated impressive performance in incident diagnosis tasks [1] through model fine-tuning on incident data, they still face several challenges when applied to the root cause analysis task. Firstly, the current fine-tuned model operates under the assumption that it can learn all the intricate details of past incidents. However, it is widely recognized that large language models are susceptible to hallucination (producing distorted or exaggerated facts), as they cannot accurately recall details from the training data. Secondly, fine-tuning a large language model is associated with high costs and may even be infeasible for certain cutting-edge models with an extremely large number of parameters, such as GPT-4. Lastly, the fine-tuned model struggles to address the issue of staleness, where emerging knowledge makes previous information obsolete. It is challenging to update the LLM with recent knowledge unless it is continuously fine-tuned with the latest data. Consequently, this limitation hinders the model's capacity to seamlessly ingest new knowledge.
To address the aforementioned challenges, we propose an in-context learning based method for the root cause analysis task. Rather than relying on fine-tuning with incident management data to acquire domain-specific knowledge, we directly include relevant historical incidents as in-context examples to equip the LLM with knowledge from the incident management domain. When applying in-context learning to our task, several considerations and decisions need to be made. Specifically: given the high cost associated with fine-tuning an LLM, is it possible to achieve comparable performance in the RCA task using a vanilla LLM without fine-tuning? (RQ1) Is it feasible to employ a traditional retrieval augmented approach to enhance the performance of a vanilla GPT model in the RCA task? (RQ2) How does the in-context learning method help the vanilla LLM in the root cause analysis task? (RQ3) Does having more in-context examples result in better performance? (RQ4) How does the performance vary with the relevance of the in-context examples? (RQ5) How does the ordering of in-context examples affect the performance? (RQ6)
To answer these questions thoroughly, we conducted an extensive evaluation involving a large-scale dataset of 101,308 incidents across over a thousand services from CompanyX, one of the largest cloud service providers. In addition to the commonly reported lexical and semantic evaluation metrics for such experiments, we also present the results from a human validation, where we sought the input of incident owners to assess the correctness and readability of suggested root causes. Since the original incident owners possess the highest level of expertise regarding their incidents, their evaluation provides valuable insights into the performance of the models.
Our contributions can be summarized as follows: (i) This work represents a pioneering effort in showcasing the practicality of cutting-edge Large Language Models (LLMs) like GPT-4 for accomplishing root cause analysis tasks without the need for fine-tuning, achieved through an innovative in-context learning approach. (Section 3) (ii) We present a rigorous and large-scale study in CompanyX on over 100,000 incidents from 1000+ cloud services with multiple evaluation metrics. The proposed in-context learning approach outperforms the fine-tuned GPT-3 model by an average of 24.8% across all metrics. (Section 4.2) (iii) Our human study with the actual incident owners of production incidents serves as compelling evidence for the effectiveness of the proposed approach, showcasing notable improvements of 43.5% in correctness and 8.7% in readability. (Section 4.3)
The key takeaways of our work are: (i) Our proposed in-context learning RCA approach not only circumvents the high cost of fine-tuning on incident data but also achieves even better performance compared to the existing fine-tuned LLMs. (ii) In comparison to the traditional retrieval augmentation approach, in-context examples can serve not only as task exemplars but also facilitate the integration of domain knowledge into vanilla LLMs, resulting in a substantial performance improvement.
To reproduce our proposed approach, we will make the source code publicly available here1. However, due to privacy concerns with customer data in our dataset, we cannot release the full dataset. Instead, we include a detailed guide in our code repo on how to create a similar dataset step-by-step, along with some sample data for reference.

1 https://aka.ms/ICL-RCA

2 BACKGROUND
In this section, we begin with an introduction to the root cause analysis task and the recent developments in LLM models. Following that, we delve into a thorough discussion of the research questions and the human evaluation conducted in this study.

2.1 Incident Root Cause Analysis
In large-scale cloud services, it is inevitable to encounter production incidents that can significantly impact the customer experience and incur substantial engineering resources for troubleshooting. The incident life-cycle typically consists of four stages: detection, triaging, diagnosis and mitigation. In the incident diagnosis stage, root cause analysis plays a critical role in identifying the underlying cause of the reported incident. This process is complex and demands a significant amount of manual effort, as well as domain knowledge about the services involved. Incidents can stem from various issues, such as code bugs, dependency failures and infrastructure problems. The abundance of possibilities makes it challenging for On-Call Engineers (OCEs) to pinpoint the exact cause of the incidents. Errors made during root cause analysis not only result in additional effort and labor but also have a direct impact on customers and revenue. It is essential to avoid such human errors to minimize disruptions and provide a better experience to customers. Figure 1 illustrates a real incident from a service, displaying the customer-provided title and summary, along with the actual root cause.

Title: Completion Mismatches Between Service-A and Service-B
Summary: Recent availability issues in Learner Service have resulted in lost data for Service-A completions for many users. This ticket will track those incidents.
Reference root cause: Service-A sync job was not able to handle dependent service unavailability.

Figure 1: A sample production incident.
2.2 Research Questions
We investigated several OpenAI GPT-3.x and GPT-4 models to generate root causes for the incidents without model fine-tuning. This leads to several research questions.

RQ1: Given the high cost associated with fine-tuning an LLM, is it possible to achieve comparable performance in the RCA task using a vanilla LLM model without fine-tuning?
Vanilla GPT models, which lack training with incident management data and domain knowledge, are not expected to perform well in zero-shot settings. On the other hand, even though the fine-tuned model can acquire incident domain knowledge from the training data, it still faces the burden of high training and maintenance costs. To tackle these issues, we explore the in-context learning approach, which integrates LLMs with in-context examples as task exemplars and augmented domain knowledge, eliminating the need for time-consuming fine-tuning. To demonstrate the effectiveness of the in-context learning approach, we compare its performance with the fine-tuned model in the root cause analysis task.

RQ2: Is it feasible to employ a traditional retrieval augmented approach to enhance the performance of a vanilla GPT model in the RCA task?
Retrieval-augmented approaches [16, 25, 31] have emerged as a powerful technique to enhance the performance of LLMs by incorporating external documents. This integration allows language models to leverage external domain knowledge, leading to improved contextual understanding. However, some approaches [6, 41] require fine-tuning a specific decoder to leverage the retrieval-augmented knowledge, which contradicts our motivation to avoid fine-tuning the model. On the contrary, there are other methods that directly integrate the documents into the model input, wherein the augmented document can only furnish domain knowledge but lacks the functionality of a task exemplar. Furthermore, using chunked retrieval documents may reduce the effectiveness of retrieval compared to the in-context example format, and it remains a question whether chunked retrieval documents can surpass the performance of the in-context example format. To assess the significance of in-context examples, we conducted a comparison between our in-context learning approach and the traditional retrieval augmentation method. In the latter, we divided incident details, comprising the incident title, summary and root cause, into chunks, disregarding the original format of the incident fields. The resulting document was presented as a sequence of chunks rather than as in-context examples. This comparison allowed us to illustrate the importance of the in-context example format while maintaining content consistency between the two approaches.

RQ3: How does the in-context learning method help the vanilla LLM in the root cause analysis task?
In-context learning methods have proven their capability to bridge the domain knowledge gap for LLMs by utilizing demonstrated examples in many domains. In this research, we aim to explore how this method can aid the vanilla LLM model (without fine-tuning) in the root cause analysis task. We utilize four different GPT models with varying capacities and observe their performance with in-context examples to compare with their zero-shot versions for the root cause analysis task.

RQ4: Does having more in-context examples result in better performance?
When considering the significance of in-context examples for an LLM, especially one without fine-tuning on domain-specific data, the question arises of whether more in-context examples lead to better results. Ideally, we expect the LLM to be capable of analyzing the integrated in-context examples and distinguishing the useful ones. Our aim is to find a balance between the quantity and quality of these examples. To achieve this, we conduct experiments using various numbers of in-context examples and observe how this affects the root cause analysis task. Moreover, LLMs like the GPT-4 model have been developed to accommodate an exceptionally large number of input tokens. For example, the GPT-4-32K model can handle up to 32 thousand tokens in its prompt. This substantial increase in capacity significantly enhances the LLMs' ability to incorporate more background information, thereby further improving their contextual understanding for specific tasks. In our case, we utilized the GPT-4 model in our experiments, testing both the 8K and 32K prompt limits. This allowed us to integrate a greater amount of historical incidents as reference data, serving as background knowledge for the LLM.

RQ5: How does the performance differ between highly relevant in-context examples and irrelevant (random) examples?
In-context examples typically serve two main functions. Firstly, they can be utilized as task demonstrations. By providing input and output through in-context examples, LLMs can learn how to perform the task based on these examples. Secondly, the content of in-context examples can also provide LLMs with domain-specific knowledge to tackle new incidents. In this paper, we aim to investigate whether the relevance between in-context examples and new incidents plays a crucial role in determining the performance. We compare the in-context examples most relevant to the new incident with randomly selected examples.

RQ6: How does the ordering of in-context examples affect the performance?
Previous research has revealed that LLMs are influenced by the arrangement of in-context examples. The main question we aim to address is whether we should place our in-context examples closest to the task description (at the beginning of the prompt) or to the new incident (at the bottom of the prompt). To examine the impact of the order of in-context examples, we compare the performance of three different settings. Firstly, we present the examples in descending order of relevance, with the most relevant example coming first and the least relevant example last, relative to the new incident. In the second setting, we arrange the examples in ascending order, meaning that the most relevant example is positioned closest to the new incident. Lastly, we select the top k most relevant examples and then randomize their positions to study the effect of this arrangement.

3 METHODOLOGY
We present our in-context RCA approach, which uses in-context examples to enhance the performance of the LLM. First, we provide an overview of our approach in Section 3.1. Then we delve into the details of the data preparation and in-context example extraction in Section 3.2 and Section 3.3. Last, the root cause generation step is described in Section 3.4.
Figure 2: Overview of our In-context Learning RCA Framework. [The diagram shows historical incidents flowing through Data Collection & Cleaning and Incident Summarization into a Retrieval Corpus; a New Incident is used as a query to the Incident Retriever, which returns the top-K similar incidents (Incident-1 … Incident-K); these are assembled via In-context Examples Prompting (task description, example titles, summarized summaries and root causes, followed by the new incident) and passed to the GPT-based RCA Model, which produces the Generated Root Cause.]

3.1 Overall Architecture
Figure 2 illustrates the overall architecture of our proposed approach. Initially, we gather incident data created between January 1, 2021, and September 30, 2022, from our incident database. The data is then cleaned by removing lengthy stack traces and embedded images. To avoid overwhelming amounts of incident detail, we utilize GPT-35-turbo2 to summarize the incident summary and root cause for constructing the retrieval corpus and in-context examples. After summarization, we employ a sentence transformer model [39] to generate embedding vectors for each incident's summarized summary. Subsequently, we construct a retrieval index using the Faiss library [20], enabling efficient similarity search based on these embeddings. When a new incident arises, we use its description as a query to find relevant incidents based on the retrieval index. The extracted incidents are then integrated into the prompt of the Large Language Model (LLM) in the form of in-context examples. Finally, we utilize the LLM, such as GPT-4, to generate the root cause based on the new incident description and all the provided in-context examples.

2 https://platform.openai.com/docs/models/gpt-3-5

3.2 Data Preparation
Our data preparation process involves two steps. Initially, we extract the data from the incident database using specific criteria. Next, we proceed to clean the incident samples.

3.2.1 Data Collection. Numerous incidents from different services and severities are detected and created at CompanyX through both automated systems and human monitoring. A team of dedicated on-call engineers (OCEs) is always ready to address these incidents promptly, ensuring seamless service for our valued customers. To effectively manage this high volume of incidents, CompanyX has developed a specialized platform tailored for reporting and handling such occurrences. This platform includes a comprehensive database that tracks all activities related to incident reporting, from insertion and modification to deletion, starting from the moment an incident is created until it is successfully resolved.
We gathered 101,308 incidents3 created between January 1, 2021, and September 30, 2022 that had both a non-empty summary and root cause field. These incidents were part of the "Resolved" or "Mitigated" incidents, with severity levels ranging from 0 to 4, where level 0 represented the most severe incidents. Furthermore, we applied filters to exclude incidents whose titles contained specific keywords like "ignore," "test," or "dummy." Once the filtering process was completed, we sorted the remaining data based on the creation date. We then selected the first 98,308 incidents for the retrieval set, 2,000 for the validation set, and 1,000 for the testing set. To enable a comparison with the fine-tuned model, we further refined the retrieval set by choosing the most recent 20,000 incidents to be used as the training set for the fine-tuned baseline models.

3 Unfortunately, we are unable to share the incident dataset as it contains confidential information about the cloud services and infrastructure.
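To make the selection criteria above concrete, the following minimal sketch expresses them as a filtering-and-splitting step over a hypothetical pandas DataFrame. The column names (title, summary, root_cause, status, severity, created_at) and the parquet file are illustrative assumptions, not the schema of the actual incident platform.

import pandas as pd

# Hypothetical export of the incident database; one row per incident.
incidents = pd.read_parquet("incidents.parquet")

mask = (
    incidents["summary"].fillna("").str.strip().ne("")
    & incidents["root_cause"].fillna("").str.strip().ne("")
    & incidents["status"].isin(["Resolved", "Mitigated"])
    & incidents["severity"].between(0, 4)
    # Drop non-production tickets such as test or dummy incidents.
    & ~incidents["title"].str.contains("ignore|test|dummy", case=False, na=False)
)
filtered = incidents[mask].sort_values("created_at")

retrieval_set = filtered.iloc[:98_308]
validation_set = filtered.iloc[98_308:100_308]
test_set = filtered.iloc[100_308:101_308]
# The most recent 20,000 retrieval incidents also serve as the training set
# for the fine-tuned baseline models.
finetune_train = retrieval_set.tail(20_000)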
3.2.2 Data Cleaning. Because of the urgency in resolving incidents, the OCEs typically don't adhere to a strict template when providing incident summaries and root causes. This makes the content of these incidents challenging to parse and understand using rule-based models. Moreover, both the summary and root cause often contain information presented in various formats, including screenshots, tables, stack traces and code snippets. However, since GPT-3.x and GPT-44 only support textual data, we had to exclude the images from the incident summaries and root causes. Therefore, we conduct the following steps for incident data cleaning: (i) Eliminate lengthy stack traces. Based on our observations, we have encountered stack traces that exceed 10,000 tokens. Such lengthy stack traces obviously cannot be accommodated within the LLM's prompt and tend to distract the LLM with unnecessary details. Our solution involves using a regular expression pattern to search for function calls within the incident summary. This pattern allows us to identify lines that are part of a stack trace, specifically by recognizing all lines in the incident summary that contain at least one function call. (ii) Remove Base64 images. Incidents can contain images encoded in Base64 format, an encoding scheme that converts images into readable strings so they can be saved or transmitted over a network without data loss. However, as the LLM cannot comprehend Base64 images, we address this issue by utilizing the BeautifulSoup library to parse the <img> tags and subsequently remove these images from the incident.

4 The GPT-4 model does have support for image input, but it is currently not available for public usage.
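A minimal sketch of the two cleaning steps is shown below, using Python's re module and the BeautifulSoup library mentioned above. The function-call pattern is our own illustrative heuristic; the paper does not publish the exact regular expression it uses.

import re
from bs4 import BeautifulSoup

# Heuristic: treat any line containing at least one function call, e.g.
# "Namespace.Class.Method(arg1, arg2)", as part of a stack trace.
FUNC_CALL = re.compile(r"[A-Za-z_][\w.$<>]*\s*\([^)]*\)")

def drop_stack_trace_lines(text: str) -> str:
    return "\n".join(line for line in text.splitlines() if not FUNC_CALL.search(line))

def drop_embedded_images(html_text: str) -> str:
    # Remove <img> tags (including Base64-encoded payloads); the rest of the
    # incident text is left untouched.
    soup = BeautifulSoup(html_text, "html.parser")
    for img in soup.find_all("img"):
        img.decompose()
    return str(soup)

def clean_incident_summary(summary: str) -> str:
    return drop_stack_trace_lines(drop_embedded_images(summary))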
3.3 In-context Example Extraction
In this stage, we construct the retrieval corpus to find relevant incident examples that share similarities with the new incident. The process involves two key steps. First, we use incident summarization to condense the length of the original incident, making it more manageable. Next, we create a retrieval index that facilitates efficient semantic search based on the incident descriptions. This allows us to effectively find and retrieve incidents that closely match the semantics of the new incident.

3.3.1 Incident Summarization. Before we delve into the approach for summarizing the incident summary and root cause, let us first explain why such summarization is necessary. Incidents often contain an overwhelming amount of detail that may not be helpful in reflecting the core information of the incident. For instance, Figure 3 illustrates how incidents can include detailed logs which, if included as is, would occupy most of the space in the in-context examples and limit the number of examples that can be used. In extreme cases, if the length of the summary exceeds 8K tokens, not even a single example can fit into the prompt for the GPT-4 model. Furthermore, incorporating excessive unnecessary information can diminish the retrieval module's effectiveness. Existing retrieval models are typically pre-trained on smaller queries, which poses significant challenges when searching through lengthy documents with extended queries.
Based on our observations, we employ the prompt shown in Figure 4 to summarize the incident summary and root cause using the GPT-3.5-turbo model5. The objective of this prompt is to make the GPT model function as a software engineer and incorporate specific information into the summarized version, including: incident symptoms, references to external services, distinguishing features such as error codes, and contextual details like the service name. Additionally, we ensure that the length of the summarized version does not exceed 5-6 sentences and that it is written in the third person. To apply the prompt to new incidents, we substitute the placeholder "description" with the incident summary and root cause, respectively. An illustrated example of a summarized incident can be seen in Figure 3. In this example, the summarized incident includes the key symptom "ABC6 job is stuck during execution and not creating new records," along with the associated error code and the initial troubleshooting steps provided by customers. The summarization process helps reduce the original summary from 864 tokens to just 89 tokens, which can be highly beneficial in saving space in the LLM prompt.

Original Incident Summary (864 tokens):
Customer XXX/Severity A ICM Ticket Template: Project ID: XXX - PREMIER SUPPORT CASE Project Name: XXX Environment Name: XXX Application/Platform Versions: XXX (10.0.XXX.XXX) : PU 36 (X.X.XXXX.XXXXX) Case Info: XXXXXX - Open Unify Environment Urgency Reason: Job is critical to the customer, issue affects prod Potential Downtime window: NA Issue Description: <core description> ... ... ...
- Export job failed due to error(s) in ABCStagingWriter.execute(): ...

Summarized Version (89 tokens):
This customer is experiencing an issue with an ABC job that is stuck during execution and not creating new records. The customer has attempted to delete the job and recreate it, but is still having the same issue. The urgent reason is that the job is critical to the customer and affects production. The activity ID is ExecutionID: XXX and the engineer has requested assistance in getting the job to complete. Kusto URLs have been captured for this incident.

Figure 3: Example of Original and Summarized Incident

5 https://platform.openai.com/docs/models/gpt-3-5
6 To ensure the confidentiality of sensitive information, we employ the placeholder "ABC" instead of the actual service name and "XXX" for error codes and IDs.

3.3.2 Retrieval Index Building. Once the incident has been summarized in the previous step, we proceed with encoding the incident summary using the Sentence Transformer (ST) model [22], specifically all-mpnet-base-v27. This model has been fine-tuned on 32 datasets comprising 1 billion sentence pairs with a contrastive learning objective [11]. The ST model utilizes a sentence transformer encoder to convert sentences or paragraphs into a 768-dimensional dense vector space, which helps encode the incident description into vector space. In our approach, we utilize this model to encode the concatenation of the title and summarized incident summary for historical incidents in the retrieval corpus. Likewise, for new incidents, we use the same setting to generate the query embedding.
After obtaining the embedding vectors from the sentence transformer model, we utilize the FAISS library [20] for efficient similarity search and clustering of the dense vectors derived from historical incidents. Since the incidents are represented as vectors, we can compare them using L2 (Euclidean) distances. The vectors with the lowest L2 distance from the query vector are considered similar to the query vector. The FAISS library employs a compressed representation of the vectors, eliminating the need to store the original vectors. Although this may lead to a slightly less precise search, the benefit lies in its ability to scale to billions of vectors in main memory on a single server. This scalability allows for efficient searches over our thousands of historical incidents, making it a feasible and practical approach for our requirements. The index generated by FAISS can be stored on disk and loaded into memory for efficient search.

7 https://huggingface.co/sentence-transformers/all-mpnet-base-v2
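As a concrete illustration of this step, the sketch below encodes the concatenation of each incident's title and summarized summary with all-mpnet-base-v2 and adds the vectors to a FAISS index. It assumes the sentence-transformers and faiss packages and uses an exact L2 index for clarity; the compressed index variants mentioned above would be a drop-in replacement. The incident dictionary fields are hypothetical.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # 768-dim embeddings

def build_retrieval_index(incidents, index_path="incident_index.faiss"):
    # Each incident is assumed to carry "title" and "summary" (already summarized) fields.
    texts = [f"{inc['title']} {inc['summary']}" for inc in incidents]
    embeddings = encoder.encode(texts, convert_to_numpy=True).astype(np.float32)
    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 (Euclidean) search
    index.add(embeddings)
    faiss.write_index(index, index_path)            # persist to disk for later reuse
    return index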
3.3.3 In-Context Examples Retrieval. In this stage, our goal is to search for relevant in-context examples using the retrieval index that was generated in the previous step. Given an incident description d, the main objective of the retriever is to select the top-k most relevant incidents Dr = {d1, . . . , dk} from a large retrieval corpus D, where Dr ⊆ D. To achieve this, we adopt the approach used in prior research [25]. We utilize the sentence transformer model to encode the concatenation of the new incident title and summary, which produces an embedding representation known as the incident query vector. Afterward, we make use of the FAISS library to retrieve the top-k most relevant incidents based on the retrieval index created in the previous step. Once we have retrieved the relevant incidents, we combine their title, summary, and root cause to serve them as in-context examples.
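Under the same assumptions as the previous sketch (the same encoder and the index file it wrote), the retrieval step can be sketched as follows; the corpus is a list of summarized historical incidents aligned with the index order.

import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
index = faiss.read_index("incident_index.faiss")

def retrieve_in_context_examples(title, summary, corpus, k=20):
    # Encode the new incident exactly like the corpus entries and take the k nearest neighbours.
    query = encoder.encode([f"{title} {summary}"], convert_to_numpy=True).astype("float32")
    _, ids = index.search(query, k)
    examples = []
    for i in ids[0]:
        inc = corpus[i]
        examples.append(f"Title: {inc['title']}\nSummary: {inc['summary']}\nRoot Cause: {inc['root_cause']}")
    return examples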
Incident Summary Prompt:
I want you to act as an expert software engineer. Consider the following incident report was submitted on the IcM portal.
Incident Description: {description}
Your task is to summarize this incident report. Focus on the following aspects of the incident:
· The symptoms of the incident that lead to this incident report
· References to external services or tools that contain relevant information.
· Distinguishing features of the incident such as precise error codes, specifics from logs etc.
· Context of the incident such as the name of the service, region, etc.
Your summary should be at most 5-6 sentences and should be in third person. You must end your summary with <|endoftext|>.
Concise Summary:

Incident Root Cause Summary Prompt:
I want you to act as an expert software engineer. Your task is to summarize the following root cause of an incident report. Your summary must clearly state what the root cause of the incident was.
Incident Root Cause: {description}
Concise Summary:

Figure 4: Summarization Prompt for Incident Summary and Root Cause

I want you to act as a software engineer to figure out the root cause of incidents. I will provide some examples to start with.

Title: {incident title}
Summary: {summarization of incident summary}
Root Cause: {summarization of incident root cause}
... ...
Title: {incident title}
Summary: {incident summary}
Root Cause:

Figure 5: In-context Examples Prompting

3.4 Root Cause Generation
To generate the root cause, our initial step involves constructing the prompt for the LLM using the retrieved in-context examples. As illustrated in Figure 5, the prompt consists of the task definition, the in-context examples and the description of the new incident. Initially, we define the root cause analysis task and prompt the LLM to act as a software engineer. Next, we present the retrieved in-context examples, organized with their titles, summaries, and root causes, with double new lines used to separate multiple incidents. It is worth noting that we utilize the summarized incident summary and root cause in the examples to prevent excessive prompt space occupation while retaining the core information of the incident as reference knowledge for addressing new incidents. Following the list of in-context examples, we add the description of the new incident, including its title and summary. Notably, the incident summary used here is the original version, allowing for the inclusion of more details about the new incident. Finally, we conclude the prompt with the phrase "Root Cause:" to prompt the LLM to generate the root cause. Once the prompt is prepared, we utilize the OpenAI API8 to call upon GPT models for generating the root cause. In particular, we set the temperature to zero to ensure a more deterministic output from the model. Additionally, we configure the completion length to 200 tokens for the root cause generation.

8 https://azure.microsoft.com/en-us/products/ai-services/openai-service
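The prompt assembly and model call described above can be sketched as follows. We use the openai Python SDK's chat-completions interface with a generic model name; the paper accesses the GPT models through the Azure OpenAI service, so the actual client construction and deployment names will differ.

from openai import OpenAI

client = OpenAI()  # Azure OpenAI would use an AzureOpenAI client and a deployment name instead

TASK_DESCRIPTION = ("I want you to act as a software engineer to figure out the root cause "
                    "of incidents. I will provide some examples to start with.")

def generate_root_cause(examples, title, summary, model="gpt-4"):
    # In-context examples are separated by blank lines; the new incident ends with
    # "Root Cause:" so that the model continues with its prediction.
    prompt = (TASK_DESCRIPTION + "\n\n" + "\n\n".join(examples)
              + f"\n\nTitle: {title}\nSummary: {summary}\nRoot Cause:")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic output, as in the paper
        max_tokens=200,  # completion budget for the generated root cause
    )
    return response.choices[0].message.content.strip()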
4 EXPERIMENT
4.1 Experiment Setup
4.1.1 Datasets and Labels. As outlined in Section 3.2, our dataset comprises incident data collected from various services and severities at CompanyX, totaling 101,308 incidents. To construct our dataset, we select the first 98,308 incidents for the retrieval set, 2,000 for the validation set, and the remaining 1,000 for the testing set. Regarding the GPT-3 fine-tuned model, we designate the last 20,000 incidents from the retrieval set as the training set for fine-tuning the model. As for the labels, we use the extracted root cause from each incident data sample.

4.1.2 Evaluation Metrics. We choose two types of quantitative metrics for evaluating our model: lexical and semantic metrics. For lexical metrics, we opt for four classic metrics. Firstly, we employ the ROUGE metric (Recall-Oriented Understudy for Gisting Evaluation) [28] to compare a candidate document against a set of reference texts. Specifically, we choose ROUGE-L [28], which considers sentence-level structural similarity and identifies the longest co-occurring n-grams using Longest Common Subsequence (LCS) statistics [17]. We also utilize ROUGE-1 [28] to consider 1-gram matching. Moreover, we include METEOR (Metric for Evaluation of Translation with Explicit Ordering) [4], which is based on the harmonic mean of unigram precision and recall and also incorporates additional features like stemming and synonymy matching to enhance its accuracy. The last lexical metric we have chosen is GLEU [42], a derivative of BLEU (Bilingual Evaluation Understudy) [29] that was proposed to overcome some undesirable properties of BLEU when it is used for single sentences.
To assess our results based on the semantic meanings of words, we additionally use two semantic metrics, as lexical metrics only consider exact word matches without taking word meaning into consideration. The first semantic metric we utilize is BERTScore [45]. It leverages pre-trained contextual embeddings from the BERT model [12] to compare candidate and reference sentence words using cosine similarity. This method enables a more nuanced evaluation of semantic similarity. Next, we incorporate NUBIA (NeUral Based Interchangeability Assessor) [21], a recently developed neural-based measure. NUBIA integrates various aspects of semantic evaluation, including semantic similarity, logical inference, and sentence legibility. It achieves this by exposing layers of pre-trained language models like RoBERTa STS [32], RoBERTa MNLI, and GPT-2 [38]. This comprehensive approach allows us to obtain a more accurate and comprehensive evaluation of the semantic quality of our results.
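For reference, the lexical metrics and BERTScore can be computed per incident with common Python packages, as in the hedged sketch below (NUBIA is omitted here and would be computed with its reference implementation). The specific packages — rouge-score, nltk and bert-score — are our assumptions about tooling, not a statement of what the authors used.

from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score   # requires nltk's wordnet data
from nltk.translate.gleu_score import sentence_gleu
from bert_score import score as bert_score

def score_root_cause(candidate, reference):
    # Lexical metrics operate on the raw strings / whitespace tokens; BERTScore
    # compares contextual embeddings of the two sentences.
    rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True).score(reference, candidate)
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "meteor": meteor_score([reference.split()], candidate.split()),
        "gleu": sentence_gleu([reference.split()], candidate.split()),
        "bertscore_f1": float(f1.mean()),
    }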
4.1.3 Baseline Methods. We chose CodeGen [36], a medium-sized language model, and GPT-3, a large language model, as the baselines for our fine-tuned models. CodeGen is an autoregressive language model designed for program synthesis, trained sequentially on the Pile, BigQuery, and BigPython datasets. For the large language model, we selected GPT-3 because it is the only model we have the resources to fine-tune using a large amount of incident data. However, even for the GPT-3 model, the fine-tuning process demands 16 V100-32GB GPUs, and during inference, 4 V100-32GB GPUs are required. Moreover, we choose four GPT models, namely GPT-35-turbo9, Text-davinci-003, GPT-4 [37], and GPT-4-32K [37], as our baseline models. We compare their zero-shot model output with the results obtained from our in-context learning approach. Lastly, we adopt the traditional retrieval augmentation method [25], wherein we split the incident description into chunks and use these chunks as retrieval documents, which are then compared to our in-context examples.

9 https://platform.openai.com/docs/models/gpt-3-5
Table 1: Comparison between fine-tuned models and in-context learning models w/o fine-tuning

                          ROUGE-L  ROUGE-1  METEOR  GLEU  BERTScore  Nubia
CodeGen (Fine-tuned)      10.32    14.26    11.78   3.39  81.45      35.13
GPT3 (Fine-tuned)         14.39    17.18    16.16   5.83  82.27      40.91
GPT-35-turbo w/ ICL       12.33    17.47    17.38   4.49  81.84      37.28
Text-Davinci-003 w/ ICL   18.01    23.72    19.51   5.70  84.50      43.13
GPT-4 w/ ICL              19.89    26.08    22.40   6.37  84.91      43.98
GPT-4-32K w/ ICL          19.86    26.05    22.41   6.39  84.96      44.19

4.2 Experimental Results
4.2.1 Given the high cost associated with fine-tuning an LLM, is it possible to achieve comparable performance in the RCA task using a vanilla LLM model without fine-tuning? (RQ1). Table 1 presents the effectiveness of our in-context learning approach using vanilla GPT models. We conducted a comparison based on four GPT backbone models against fine-tuned CodeGen and GPT-3 models trained on a set of 20,000 examples. Due to the high demand for GPU resources in fine-tuning GPT models, we could only fine-tune the GPT-3 model with our dataset. We employed an in-context learning model with a few-shot examples setting, allowing a maximum of 20 instances, as it demonstrated superior performance on our development dataset compared to other configurations. From the results, we can conclude the following: 1) Our GPT-4 model outperforms the CodeGen and GPT-3 fine-tuned models by 63.85% and 24.77% on average across all six metrics. 2) The performance of GPT-35-turbo still falls short of the fine-tuned model, indicating that the in-context learning approach still relies on the potency of the LLM to fully leverage the in-context examples. 3) GPT-4-32K achieves results similar to the GPT-4 model, which can handle a maximum of 8K input tokens. This is primarily because we utilize 20 in-context examples, which do not exhaust the 8K input limit of the GPT-4 model. We carried out significance testing for our proposed model as well. In analyzing the ROUGE-L scores and comparing the GPT-4 model with the baseline models, we were able to confidently reject the null hypothesis (H0) because the p-value is significantly lower than 0.05. For instance, the p-value when comparing our GPT-4 model to the fine-tuned GPT-3 model stands at 1.64e-11.

4.2.2 Is it feasible to employ a traditional retrieval augmented approach to enhance the performance of a vanilla GPT model in the RCA task? (RQ2). To demonstrate the superiority of in-context learning, we conducted a comparison with the traditional retrieval augmentation method, which involves chunking historical incident details. In this process, we combined the incident details, including their title, summary, and root cause. Subsequently, we split these incidents into chunks, each containing 128 tokens, and constructed a retrieval index using the same sentence transformer model. For each new incident, we retrieved the top-k most relevant chunks from the retrieval corpus, and we experimented with four different chunk number settings, varying from 10 to 40 chunks. The performance of the chunked retrieval method was then compared to that of our in-context learning model, and the results are shown in Table 2. Our findings revealed that our in-context learning model outperformed the chunked retrieval model; even against its best setting (30 shots), our model achieved an average improvement of approximately 22.37% across all six metrics10. Additionally, we observed that the performance of the chunked retrieval model consistently increased until the number of chunks reached 30, but it started to decline beyond that.

Table 2: Comparison between chunked incidents and in-context examples

                      ROUGE-L  METEOR  GLEU  Nubia
Chunked (10 shots)    13.87    17.92   4.90  38.48
Chunked (20 shots)    14.01    17.31   4.77  39.88
Chunked (30 shots)    14.22    17.50   4.93  39.73
Chunked (40 shots)    14.01    17.11   4.73  40.18
In-context Examples   19.89    22.4    6.37  43.98

10 Due to space limitations, we only present four metrics in Table 2.
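For completeness, the chunked baseline can be reproduced with a sketch like the one below: incident fields are concatenated, split into 128-token windows, and the resulting chunks are then embedded and indexed exactly as in the earlier retrieval sketch, with the top-k chunks pasted into the prompt as plain reference text. The tiktoken tokenizer choice is an assumption; the paper does not name the tokenizer used for chunking.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer; any consistent tokenizer works

def chunk_incident(incident, chunk_tokens=128):
    # Deliberately discard the field structure: title, summary and root cause are
    # fused into one string before splitting, which is what distinguishes this
    # baseline from the structured in-context example format.
    text = f"{incident['title']} {incident['summary']} {incident['root_cause']}"
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + chunk_tokens]) for i in range(0, len(ids), chunk_tokens)]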
4.2.3 How do the in-context examples help the vanilla LLM in the root cause analysis task? (RQ3). In Table 3, we conducted a comparison between the results of the zero-shot models and the in-context learning models using 20 in-context examples, based on the four GPT baseline models. The findings reveal a substantial performance improvement with the in-context examples, showing significant gains of 49.69% and 51.31% for the GPT-4 and GPT-4-32K models, respectively. When comparing the zero-shot model with the GPT-3 fine-tuned model, we observe a performance drop of approximately 18.89%. However, with 20 in-context examples, we achieve an improvement of 24.77% over the fine-tuned model. Similar trends are seen for the GPT-35-turbo and Text-Davinci-003 models, with performance improvements of 25.53% and 35.56%, respectively, compared to the zero-shot model. Additionally, we noticed that the performance improvement for the two semantic metrics, BERTScore and Nubia, is notably smaller compared to the lexical metrics, particularly for the two GPT-3.5 models. Nonetheless, the GPT-4 models manage to achieve a noteworthy 30% or higher performance gain in the Nubia metric, whereas the GPT-3.5 models show two to three times smaller gains. This underscores the superior ability of the GPT-4 models to enhance the overall root cause analysis task through the utilization of in-context examples.

Table 3: Comparison between zero-shot models and in-context learning models with 20-shot examples

                             ROUGE-L  ROUGE-1  METEOR   GLEU     BERTScore  Nubia
GPT-35-turbo Zero-shot       8.25     13.64    14.03    3.21     80.60      33.78
GPT-35-turbo w/ ICL          12.33    17.47    17.38    4.49     81.84      37.28
%gain for GPT-35-turbo       +49.45%  +28.08%  +23.88%  +39.88%  +1.54%     +10.36%
Text-Davinci-003 Zero-shot   11.08    16.77    14.39    3.62     82.63      37.81
Text-Davinci-003 w/ ICL      18.01    23.72    19.51    5.70     84.50      43.13
%gain for Text-Davinci-003   +62.55%  +41.44%  +35.58%  +57.46%  +2.26%     +14.07%
GPT-4 Zero-shot              10.27    16.40    16.21    3.71     81.95      33.33
GPT-4 w/ ICL                 19.89    26.08    22.40    6.37     84.91      43.98
%gain for GPT-4              +93.67%  +59.02%  +38.19%  +71.70%  +3.61%     +31.95%
GPT-4-32K Zero-shot          10.13    16.15    16.10    3.68     81.93      32.99
GPT-4-32K w/ ICL             19.86    26.05    22.41    6.39     84.96      44.19
%gain for GPT-4-32K          +96.05%  +61.30%  +39.19%  +73.64%  +3.70%     +33.95%
4.2.4 Does having more in-context examples result in better performance? (RQ4). To address this research question, we initially employed the GPT-4 model with various few-shot settings, ranging from 0 to 40 shots. The results are presented in Figure 6, with lexical metrics displayed in Figure 6a and semantic metrics depicted in Figure 6b. We discovered that both lexical and semantic metrics achieved optimal performance when the number of shots reached 20. Moving from 0 shots to 5 shots, we observed a significant improvement of 46.12% on average across all the metrics. However, increasing the shots from 5 to 10 only resulted in a marginal 2.12% improvement, and further increasing the shots to 20 showed an even smaller improvement of 0.18%. Beyond 20 shots, we noticed some performance degradation, likely due to the inclusion of more irrelevant examples when using too many in-context examples. Additionally, Table 4 presents a comparison between fixed 20-shot examples and full prompts that fill up to the token limit of the LLMs. It is evident that all the models exhibited worse performance when too many examples were included in the prompt. For instance, the GPT-4-32K model had an average of approximately 160 in-context examples, which proved to be sufficient to include both relevant and irrelevant examples.

Figure 6: Performance for different few-shot examples. (a) Lexical Metrics; (b) Semantic Metrics.

Table 4: Comparison between in-context examples with 20 shots and full-prompt examples

                   Setting                       ROUGE-L  Nubia
GPT-35-turbo       20 shots                      12.33    37.28
                   Full prompt (≈ 17.0 shots)    11.03    36.68
Text-Davinci-003   20 shots                      18.01    43.13
                   Full prompt (≈ 17.0 shots)    17.73    42.41
GPT-4              20 shots                      19.89    43.98
                   Full prompt (≈ 37.6 shots)    17.18    41.43
GPT-4-32K          20 shots                      19.86    44.19
                   Full prompt (≈ 160.0 shots)   17.13    41.48

4.2.5 How does the performance differ between highly relevant in-context examples and irrelevant (random) examples? (RQ5). Figure 7 presents the comparison between the most relevant incidents and random incidents, both consisting of 20 in-context incident examples. It is evident that using the most relevant examples leads to a considerable performance improvement of approximately 41.2% when compared to randomly selected incidents that lack any semantic relevance to the current incident. Notably, the ROUGE-L metric exhibits the most significant improvement, showing a remarkable 64.93% boost, while BERTScore, a metric with minimal variation between different methods, demonstrates a more modest 1.9% difference. Additionally, Nubia shows an improvement of around 17.6%, although the improvement for lexical metrics appears to be more significant than for semantic metrics. This observation can be attributed to two factors. Firstly, lexical metrics tend to have relatively lower values, which can amplify the percentage change. Secondly, lexical metrics heavily rely on word-level matching, favoring incidents that share more similar expressions, thereby providing a greater advantage in performance improvement. Moreover, when comparing the random examples to the zero-shot model in Table 3, we observe a 5.9% performance improvement, which is significantly lower than the 49.7% improvement achieved with relevant examples. These results indicate that the relevance of the in-context examples contributes more than their functionality as task exemplars in the RCA task.

Figure 7: Comparison between relevant and random in-context examples

4.2.6 How does the ordering of in-context examples affect the performance? (RQ6). Once the relevant incidents have been retrieved, a remaining research question is how to best arrange the order of these examples to achieve the best performance. To assess the impact of different ordering methods for the in-context examples, we conducted three experiments. Initially, we sorted the examples in descending order based on their relevance scores. Next, we tried the ascending order, and finally, we experimented with a randomized order.
The results of these three ordering settings for all six metrics are depicted in Figure 8, with a chosen in-context example size of 20. We observe that the order had minimal impact on performance. On average, the standard variance among the three settings was 0.12 across the six metrics. Notably, we observed a slightly larger variance for the Nubia metric, reaching up to 0.55, while all other metrics had a standard variance of less than 0.1.

Figure 8: Comparing different orders of the in-context examples

4.3 Human Evaluation
To showcase the human evaluation results, we conducted a random selection of 28 incidents from our incident pool. These incidents are spread across 11 different owning services, covering 8 different countries, and involving 18 distinct individual incident owners. Table 5 showcases the human evaluation results, which were carried out by incident owners to ensure the accuracy of the assessments. The evaluations were based on two metrics: correctness and readability. The correctness metric aimed to determine whether the model offered helpful and relevant suggestions compared to the actual root cause of the incidents. On the other hand, the readability metric assessed how easily readers could understand the generated text. The scoring system ranged from 1 to 5, with 5 being the highest rating and 1 being the lowest. We utilized three in-context learning models (Text-davinci-003, GPT-4, and GPT-4-32K) incorporating 20-shot examples for comparison with the fine-tuned GPT-3 model. The results revealed that the GPT-4 model, enhanced with in-context examples, significantly outperformed the fine-tuned GPT-3 model, scoring 43.5% higher in terms of correctness. Moreover, the GPT-4 model exhibited an 8.7% improvement in readability. The comparison also indicated that the Text-davinci-003 model slightly underperformed compared to the fine-tuned model, by 1.9%. This suggests that the in-context learning approach benefits from a more powerful model. Additionally, the use of only 20-shot examples hindered the GPT-4-32K model from leveraging its advantage of accommodating large prompt inputs, resulting in even poorer performance than the GPT-4 model.

Table 5: Correctness and readability scores assigned by the incident owners

              GPT3 Fine-tuned     Text-Davinci-003        GPT-4                   GPT-4-32K
Criteria      Mean     Median     Mean           Median   Mean           Median   Mean           Median
Correctness   1.72     1          1.69 (-1.9%)   1        2.47 (+43.5%)  2        2.13 (+23.6%)  2
Readability   4.34     5          4.47 (+3.0%)   5        4.72 (+8.7%)   5        4.59 (+5.8%)   5

4.4 Case Study
Table 6 presents a case study showcasing the application of our proposed method on real-world incident data. The incident involves a problem referred to as the "Cable reseat is blocked" issue. The root cause of this problem was traced back to incorrect data inheritance, which occurred due to a bug related to spares and sourcing factors. Our model's prediction closely aligned with the ground truth, though the wording varies slightly. By examining examples from our retrieval corpus, we identified similar incidents like the one displayed in Table 6. The example shares the same issue but differs in its parent task and description. Nevertheless, our model effectively leveraged the root cause from these in-context examples to accurately predict the correct root cause for the given incident. This case serves as an illustration of the significance of using similar in-context examples. In contrast to fine-tuning the LLM, our approach relies entirely on extracted incidents, without concern for the generation of false information or reliance on outdated facts.

Table 6: Case study of our in-context learning approach

Title: BreakFix | Networking | Device: Cable Reseat is blocked and needs assistance.
Summary: Parent GDCO Task (Sev 3): xxx .... Blocking Description: Report Incorrect Ops Model for Servicing Vendor Dell for ... Block Reason: xxx Logs Attached? False KB Article Number: Did Server show POST? Alias of Senior Tech
Ground Truth: Data is inherited incorrectly due to a bug based on other factors from spares and sourcing or other attributes about the Rack itself
Model Prediction: Data was inherited incorrectly due to a software bug that was based on other factors from spares and sourcing or other attributes about the Rack itself.
In-context Examples: BreakFix | Networking | Cable Reseat is blocked and needs assistance. The incident report refers to a Sev 3 issue with the Parent xxx Task with a blocking task, related to the Operations Model being wrong for the xxx. The Block Reason is provided, but no xxx logs were attached. The incident report also lists the Device Name, Device Type, Rack, Slot and other details for reference. The incident was caused by the incorrect inheritance of data due to a bug. This bug was influenced by other factors such as spares, sourcing, or attributes of the Rack itself.

5 RELATED WORK
5.1 LLMs in Software Engineering
In recent years, the emergence of LLMs has opened up new prospects in the software systems field, enabling various tasks such as program synthesis [18, 36], log analysis [34], vulnerability repair [13], software testing [40], and incident management [1, 9]. For example, Jain et al. [18] propose an approach that enhances large language models with post-processing steps based on program analysis and synthesis techniques, resulting in improved performance of program synthesis.
FSE Companion ’24, July 15–19, 2024, Porto de Galinhas, Brazil X Zhang, S Ghosh, C Bansal, R Wang, M Ma, Y Kang, S Rajmohan

Table 5: Correctness and readability scores assigned by the incident owners

GPT3 Fine-tuned Text-Davinci-003 GPT-4 GPT-4-32K


Criteria
Mean Median Mean Median Mean Median Mean Median
Correctness 1.72 1 1.69 (-1.9%) 1 2.47 (+43.5%) 2 2.13 (+23.6%) 2
Readability 4.34 5 4.47 (+3.0%) 5 4.72 (+8.7%) 5 4.59 (+5.8%) 5

Table 6: Case study of our in-context learning approach

Title BreakFix | Networking | Device: Cable Reseat is blocked and needs assistance.
Summary Parent GDCO Task (Sev 3): xxx .... Blocking Description: Report Incorrect Ops Model for Servicing Vendor Dell for ... Block Reason: xxx Logs
Attached? False KB Article Number: Did Server show POST? Alias of Senior Tech
Groud Data is inherited incorrectly due to a bug based on other factors from spares and sourcing or other attributes about the Rack itself
Truth
Model Pre- Data was inherited incorrectly due to a software bug that was based on other factors from spares and sourcing or other attributes about the
diction Rack itself.
In-context BreakFix | Networking | Cable Reseat is blocked and needs assistance. The incident report refers to a Sev 3 issue with the Parent xxx Task with
Examples a blocking task, related to the Operations Model being wrong for the xxx. The Block Reason is provided, but no xxx logs were attached. The
incident report also lists the Device Name, Device Type, Rack, Slot and other details for reference. The incident was caused by the incorrect
inheritance of data due to a bug. This bug was influenced by other factors such as spares, sourcing, or attributes of the Rack itself.

synthesis techniques, resulting in improved performance of program synthesis. Mastropaolo et al. design the LANCE system [34], which utilizes a fine-tuned T5 model to automatically generate logging statements for Java methods. Similarly, the VulRepair [13] tool automatically suggests vulnerability fixes with a T5 model fine-tuned on their vulnerability repair datasets. In contrast to previous studies, our approach harnesses cutting-edge LLMs to generate root causes without requiring model fine-tuning, relying instead on in-context learning.

5.2 Incident Management
Incident management within large cloud services has emerged as a popular research topic in the systems and software engineering communities. Several empirical studies have analyzed incidents and outages in production systems, specifically delving into incidents caused by particular types of issues [2, 14, 24, 46] or issues arising from specific services and systems [15, 30, 43]. Moreover, researchers have explored the use of machine learning and data-driven techniques to automate various aspects of the incident life-cycle, including triaging [3, 7, 8], diagnosis [5, 33, 35], and mitigation [19]. For root-cause analysis tasks, several research studies (e.g., TraceRCA [27], CIRCA [26], DiagFusion [44], Eadro [23]) have been proposed for anomaly detection and root cause positioning. While these methods are useful for locating either the categories of root causes [23, 44] or the potentially problematic microservice to investigate [26, 27], they do not provide a detailed description of the root cause. In contrast, we propose to generate the actual descriptive root cause information to guide On-Call Engineers (OCEs) in the right direction by leveraging the power of LLMs. More specifically, our approach is designed as a generative task, which sets it apart from traditional RCA methodologies that treat the problem as a classification task. These traditional methods typically rely on predefined features to predict the root cause label from a fixed set of predefined labels. Recently, Ahmed et al. [1] proposed a method to fine-tune GPT models for the root cause analysis task using historical incident data. However, fine-tuning state-of-the-art language models like GPT-4 poses significant challenges, such as the substantial GPU resource demand and the high maintenance cost of customizing the LLM for future use. More recently, Chen et al. [9] developed a retrieval-augmented LLM model for root cause analysis but limited its application to specific service data, demanding domain-specific knowledge. In contrast, our approach leverages in-context learning with a substantial dataset comprising over 100,000 incidents. This allows us to support OCEs in resolving incidents in a broad context, without requiring fine-tuning or specific domain expertise.

6 CONCLUSION
In this paper, we present the effectiveness of utilizing cutting-edge language models like GPT-4 in the root cause analysis task. We propose an in-context learning method that integrates historical incident knowledge into vanilla language models without the need for fine-tuning. Through extensive experiments on a large-scale incident dataset consisting of over 100,000 production incidents, we demonstrate that our in-context learning approach outperforms the fine-tuned GPT-3 model by an average of 24.8% across six metrics. Additionally, the incorporation of in-context examples results in an impressive 49.7% improvement over the zero-shot model. Human evaluation involving incident owners also indicates promising enhancements compared to the fine-tuned model, achieving a 43.5% improvement in correctness. Considering the challenges of fine-tuning LLMs on such massive incident data, our work provides valuable insights into utilizing cutting-edge language models effectively in the incident management domain without the necessity for fine-tuning.

REFERENCES
[1] Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. arXiv preprint arXiv:2301.03797 (2023).
[2] Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, and Samer Al-Kiswany. 2018. An Analysis of Network-Partitioning Failures in Cloud Systems. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 51–68.
[3] Amar Prakash Azad, Supriyo Ghosh, Ajay Gupta, Harshit Kumar, Prateeti Mohapatra, Lena Eckstein, Leonard Posner, and Robert Kern. 2022. Picking Pearl From Seabed: Extracting Artefacts from Noisy Issue Triaging Collaborative Conversations for Hybrid Cloud Services. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 12440–12446.
[4] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
[5] Chetan Bansal, Sundararajan Renganathan, Ashima Asudani, Olivier Midy, and Mathru Janakiraman. 2020. DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services. In 2020 IEEE/ACM 42nd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
[6] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning. PMLR, 2206–2240.
[7] J. Chen, X. He, Q. Lin, Y. Xu, H. Zhang, D. Hao, F. Gao, Z. Xu, Y. Dang, and D. Zhang. 2019. An Empirical Investigation of Incident Triage for Online Service Systems. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 111–120.
[8] J. Chen, X. He, Q. Lin, H. Zhang, D. Hao, F. Gao, Z. Xu, Y. Dang, and D. Zhang. 2019. Continuous Incident Triage for Large-Scale Online Service Systems. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 364–375.
[9] Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, et al. 2023. Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents. arXiv preprint arXiv:2305.15778 (2023).
[10] Zhuangbin Chen, Yu Kang, Liqun Li, Xu Zhang, Hongyu Zhang, Hui Xu, Yangfan Zhou, Li Yang, Jeffrey Sun, Zhangwei Xu, et al. 2020. Towards intelligent incident management: why we need it and how we make it. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1487–1497.
[11] Ching-Yao Chuang, Joshua Robinson, Yen-Chen Lin, Antonio Torralba, and Stefanie Jegelka. 2020. Debiased contrastive learning. Advances in Neural Information Processing Systems 33 (2020), 8765–8775.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[13] Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung. 2022. VulRepair: a T5-based automated software vulnerability repair. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 935–947.
[14] Yu Gao, Wensheng Dou, Feng Qin, Chushu Gao, Dong Wang, Jun Wei, Ruirui Huang, Li Zhou, and Yongming Wu. 2018. An empirical study on crash recovery bugs in large-scale distributed systems. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 539–550.
[15] Supriyo Ghosh, Manish Shetty, Chetan Bansal, and Suman Nath. 2022. How to fight production incidents? An empirical study on a large-scale cloud service. In Proceedings of the 13th Symposium on Cloud Computing. 126–141.
[16] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning. PMLR, 3929–3938.
[17] Daniel S. Hirschberg. 1977. Algorithms for the longest common subsequence problem. Journal of the ACM (JACM) 24, 4 (1977), 664–675.
[18] Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. 2022. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering. 1219–1231.
[19] Jiajun Jiang, Weihai Lu, Junjie Chen, Qingwei Lin, Pu Zhao, Yu Kang, Hongyu Zhang, Yingfei Xiong, Feng Gao, Zhangwei Xu, et al. 2020. How to mitigate the incident? An effective troubleshooting guide recommendation technique for online service systems. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1410–1420.
[20] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
[21] Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral Based Interchangeability Assessor for Text Generation. arXiv:2004.14667 [cs.CL]
[22] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
[23] Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, and Michael R. Lyu. 2023. Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data. arXiv preprint arXiv:2302.05092 (2023).
[24] Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, and Haryadi S. Gunawi. 2016. TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. 517–530.
[25] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
[26] Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, and Dan Pei. 2022. Causal inference-based root cause analysis for online service systems with intervention recognition. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3230–3240.
[27] Zeyan Li, Junjie Chen, Rui Jiao, Nengwen Zhao, Zhijun Wang, Shuwei Zhang, Yanjun Wu, Long Jiang, Leiqin Yan, Zikai Wang, et al. 2021. Practical root cause localization for microservice systems via trace analysis. In 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). IEEE, 1–10.
[28] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[29] Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: A method for evaluating automatic evaluation metrics for machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics. 501–507.
[30] Haopeng Liu, Shan Lu, Madan Musuvathi, and Suman Nath. 2019. What bugs cause production cloud incidents? In Proceedings of the Workshop on Hot Topics in Operating Systems. 155–162.
[31] Ruibo Liu, Guoqing Zheng, Shashank Gupta, Radhika Gaonkar, Chongyang Gao, Soroush Vosoughi, Milad Shokouhi, and Ahmed Hassan Awadallah. 2022. Knowledge infused decoding. arXiv preprint arXiv:2204.03084 (2022).
[32] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[33] Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, Dongmei Zhang, and Zhe Wang. 2014. Correlating events with time series for incident diagnosis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1583–1592.
[34] Antonio Mastropaolo, Luca Pascarella, and Gabriele Bavota. 2022. Using Deep Learning to Generate Complete Log Statements. In Proceedings of the 44th International Conference on Software Engineering (ICSE '22). 2279–2290.
[35] Vinod Nair, Ameya Raul, Shwetabh Khanduja, Vikas Bahirwani, Qihong Shao, Sundararajan Sellamanickam, Sathiya Keerthi, Steve Herbert, and Sudheer Dhulipalla. 2015. Learning a hierarchical monitoring system for detecting and diagnosing service issues. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2029–2038.
[36] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
[37] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[39] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084
[40] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2023. Software Testing with Large Language Model: Survey, Landscape, and Vision. arXiv preprint arXiv:2307.07221 (2023).
[41] Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. arXiv preprint arXiv:2203.08913 (2022).
[42] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[43] Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. 2014. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 249–265.
[44] Shenglin Zhang, Pengxiang Jin, Zihan Lin, Yongqian Sun, Bicheng Zhang, Sibo Xia, Zhengdan Li, Zhenyu Zhong, Minghua Ma, Wa Jin, et al. 2023. Robust Failure Diagnosis of Microservice System through Multimodal Data. arXiv preprint arXiv:2302.10512 (2023).
[45] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019).
[46] Yongle Zhang, Junwen Yang, Zhuqi Jin, Utsav Sethi, Kirk Rodrigues, Shan Lu, and Ding Yuan. 2021. Understanding and detecting software upgrade failures in distributed systems. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 116–131.

Received 2024-02-08; accepted 2024-04-18
