Exploring LLM-Based Agents For Root Cause Analysis
Saravan Rajmohan
[email protected]
Microsoft
Redmond, Washington, USA
ABSTRACT
Root cause analysis (RCA), a critical part of the incident management process, is a demanding task for on-call engineers, requiring deep domain knowledge and extensive experience with a team's specific services. Automation of RCA can result in significant savings of time, and ease the burden of incident management on on-call engineers. Recently, researchers have utilized Large Language Models (LLMs) to perform RCA, and have demonstrated promising results. However, these approaches are not able to dynamically collect additional diagnostic information such as incident related logs, metrics or databases, severely restricting their ability to diagnose root causes. In this work, we explore the use of LLM based agents for RCA to address this limitation. We present a thorough empirical evaluation of a ReAct agent equipped with retrieval tools, on an out-of-distribution dataset of production incidents collected at Microsoft. Results show that ReAct performs competitively with strong retrieval and reasoning baselines, but with highly increased factual accuracy. We also conduct a case study with a team at Microsoft to allow the ReAct agent to interface with diagnostic services that are used by the team for manual RCA. Our results show how agents can overcome the limitations of prior work, and considerations for implementing such a system in practice.

CCS CONCEPTS
• Computer systems organization → Cloud computing; • Software and its engineering → Maintaining software.

KEYWORDS
Incident Management, Cloud Computing, Root Cause Analysis, AIOps

ACM Reference Format:
Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. 2024. Exploring LLM-Based Agents for Root Cause Analysis. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion '24), July 15–19, 2024, Porto de Galinhas, Brazil. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3663529.3663841

∗ This work was done during an internship at Microsoft.

This work is licensed under a Creative Commons Attribution 4.0 International License.
FSE Companion '24, July 15–19, 2024, Porto de Galinhas, Brazil
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0658-5/24/07
https://doi.org/10.1145/3663529.3663841

1 INTRODUCTION
For the last several decades, large scale enterprises have been transforming their software into cloud services. With the rise of Artificial Intelligence (AI) in recent years, there has been even greater movement of computation from consumer devices to the cloud. This shift in paradigm has brought with it complex software systems that are characterized by multi-tiered architectures, microservices and distributed applications. The increased complexity of these systems makes them highly susceptible to production incidents, which can incur substantial costs and disrupt critical services. Therefore, prompt mitigation and resolution of these incidents is crucial to maintaining service availability and reliability [41]. However, cloud incident management [15, 7] is extremely labor-intensive. On-call engineers (OCEs) require extensive experience with a team's services and deep domain knowledge to be effective at incident management. Even for experienced OCEs, incident management represents a time-intensive endeavor. As software systems continue scaling in size and complexity, the demands placed on OCEs and incident management systems are only bound to increase in the future. To address these challenges, the field of AIOps (Artificial Intelligence for IT Operations) has proposed numerous techniques to ease incident management. Despite these developments, several parts of the incident management lifecycle still largely rely on human intervention.

One of the most challenging aspects of cloud incident management is root cause analysis (RCA). Before an incident can be resolved, OCEs must identify the root cause of the incident to ensure
that any resolution actions comprehensively and correctly fix the incident. RCA represents one of the most labor- and skill-intensive components of the incident management lifecycle [17]. Even a veteran software engineer might need to spend several years on a team before they are able to effectively perform RCA on a team's services. Therefore, it comes as no surprise that researchers have tried to automate parts of this process. Numerous techniques have been proposed to assist OCEs with RCA, such as incident prioritization and retrieval of similar historical incidents. Recently, Ahmed et al. [1] proposed the use of fine-tuned LLMs for incident root cause analysis and mitigation. They showed that LLMs can find root causes of incidents even when working with a very limited set of information about an incident. Chen et al. [8] propose RCACopilot, which expands upon this work and adds retrieval augmentation and diagnostic collection procedures to the LLM-based root cause analysis pipeline.

While these approaches have shown promising results on the ability of LLMs to perform RCA, neither equips the LLM to dynamically query real time diagnostic information about the service(s) affected by an incident. RCACopilot [8] relies on predefined handlers that must be engineered by hand, and predicts root cause categories rather than specific root causes, while Ahmed et al. [1] rely only on the incident title and description for predicting the root cause. What is missing here is a critical step that is taken by OCEs in real world RCA scenarios: for any incident, one of the first steps performed by OCEs is collection of novel diagnostic data that is not present in the incident report. In prior work, LLMs do not have the ability to interact with the outside environment to be able to collect this data. In this work, we propose the use of LLM-based agents – systems that can reason, plan and interact with the external environment to collect new information – to address this limitation and help with root cause analysis.

Despite the remarkable capabilities demonstrated by LLM-based agents across diverse domains and tasks, adapting them for the purposes of RCA represents a significant challenge. Production incident data is highly confidential, and likely out of distribution for LLMs without fine-tuning, which can be costly and impractical for large models [8]. In-context examples can serve as an alternative to fine-tuning for domain adaptation, but for agent based RCA, crafting entire reasoning trajectories can be challenging. This is exacerbated by the fact that agents require sophisticated prompting and typically also require fine-tuning [40] or in-context examples [33]. Lastly, RCA poses some unique characteristics that differentiate it from standard NLP tasks. For most NLP tasks, relevant external tools such as web search engines and document retrieval are easy to use in a single step process, and do not require much prior knowledge from the LLM. For RCA, crafting a query for search or retrieval requires much more specialized domain knowledge; many sources of information such as logs, traces, and monitoring services involve querying and processing of tabular data using specialized query languages, as well as knowledge of ancillary information (e.g. which database to query). Therefore, while LLM agents offer exceptional abilities that go far beyond prior approaches, it is unclear whether they can be effectively adapted to the RCA task.

In this work, we present an empirical evaluation of an LLM-based agent, ReAct, for root cause analysis for cloud incident management. Our goal is to answer two important questions in this regard: 1) Can LLM agents be effective at RCA in the absence of fine-tuning? and 2) What are the practical considerations of using LLM agents in real world scenarios? To answer these questions, we first conduct an evaluation of the ReAct agent equipped with retrieval tools on a static dataset, mirroring the evaluation setting by Ahmed et al. [1]. In this setting, the agent does not have access to specialized, team specific, diagnostic services, thereby restricting its abilities. This establishes a lower bound for its performance, and also reflects a practical scenario where agents are incrementally adopted across an organization or company, gradually gaining access to diagnostic services over time. Next, we investigate the use of discussion comments from historical incident reports to augment our retrieval corpus. This serves two purposes; not only do discussion comments add additional context to the incident report, but they also contain records of the diagnostic steps followed by OCEs for past incidents. The latter can potentially be used in lieu of few-shot examples to guide the agent. Lastly, to explore the full potential of agents, we present a case study of a practical implementation of an LLM agent for RCA, fully equipped with team specific diagnostic resources, in collaboration with another team at Microsoft. Concretely, we make the following contributions:
• We present the first empirical study on the use of ReAct [40], an LLM agent, for RCA in an out of domain setting on a static dataset of real world production incidents.
• We conduct a qualitative analysis of the different success and failure modes of the ReAct agent in RCA.
• We evaluate the use of discussion comments from historical incidents and its impact on the agent's performance.
• We present a case study of a real world implementation of an LLM-based agent for RCA with a team at a large scale enterprise.
• We highlight both the potential of LLM-based agents and the challenges involved in implementing real world systems capable of fully autonomous RCA.

2 BACKGROUND AND RELATED WORK

2.1 Cloud Incident Management and Root Cause Analysis
Production incidents are unplanned events or disruptions in service that adversely affect customers. Outages in service due to production incidents can be extremely costly for enterprises. The complexity of modern software systems renders production incidents inevitable, and incident management a key component of the software development life cycle. The life cycle of an incident involves incident detection, triaging, diagnosis and mitigation [1]. While incidents may be reported by customers or automatically detected and triaged using monitoring services, the remaining steps are traditionally conducted by one or more on-call engineers (OCEs). The goal of incident management is to minimize the time between the occurrence of the incident and its resolution.

2.2 Root Cause Analysis (RCA)
Root Cause Analysis constitutes one of the most time-consuming aspects of the incident management life cycle. When OCEs receive an incident, they systematically perform a series of troubleshooting
steps to identify the root cause. Each troubleshooting step yields previously unknown information, helping the OCE narrow down the set of plausible root causes. This highlights a key aspect of root cause analysis: the process of collecting additional diagnostic information related to the incident. The incident report describes the symptoms leading to the reporting of the incident, but similar symptoms can emerge from distinct root causes, which might span a diverse set of domains, such as hardware failures, network issues or software bugs. Therefore, OCEs must start the diagnosis process by collecting supplementary data from relevant logs, metrics and other monitoring and diagnostic services. For example, the incident shown in Figure 1 was resolved by checking logs collected from the affected service to identify the sequence of events that led to the failure encountered by the customer. Another implicit requirement in this process is that OCEs know 1) what additional information needs to be collected, and 2) how to collect this information. This is why even experienced engineers need to have experience with a team's services before they can effectively perform RCA. In all, successful RCA requires the following pieces of information: 1) symptoms reported in the incident report, 2) additional diagnostic information, and 3) domain expertise, i.e. what diagnostic information should be collected based on the available information, how to collect it, and general knowledge about the application domain.

Title: SD#1234123412341234 | PRE | SEV A | Specified blob does not exist. | Cloud Services LLC
Description: Customer mentioned that after stopping stream analytics on 09/23 they are getting errors on streaming into <database product> [...] It was throwing an error "Specified blob does not exist" and "Invalid connection string format. [SessionID: <uuid>]" Found another error message "Error while Ingesting data to <database product>"

Figure 1: Example Incident

The root cause analysis pipeline demonstrates many of the challenges posed for OCEs as well as efforts to automate this procedure. OCEs must have sufficient domain knowledge and familiarity with the affected service to know 1) which supplementary data to collect, 2) how this data must be collected and 3) how to analyze all the available information (including the incident report). Depending on the scale and complexity of the underlying service, this might require OCEs to have several years of experience with the team's services to develop the requisite skill set for effective root cause analysis. Even when OCEs are sufficiently trained, the data collected can be multi-faceted, spanning from structured tabular data to unstructured logs and customer reports. This further complicates data analysis and subsequent hypothesis generation for OCEs. While OCEs can overcome these challenges by leveraging domain expertise and experience, this poses a significant challenge for prior automated approaches, which are unable to collect this supplementary data, let alone analyze it to produce a root cause.

2.3 Automated RCA
Numerous studies have proposed various techniques for automating root cause analysis, such as using machine learning models and deep learning models [32] to identify patterns in event data and determine the underlying causes of incidents. Another important area of research in RCA is the use of anomaly detection models [34]; statistical, machine learning and deep learning models [10] have been proposed to identify anomalies in system behavior and alert operators in real time. Studies have proposed various techniques for RCA and triage such as learning a hierarchical monitoring system [21], diagnosing and triaging performance issues [3], and correlating events with time series [16]. In addition, there have been studies exploring the use of structured knowledge mining from various artifacts, such as incident reports and root cause documentation, to mine structured knowledge in software engineering such as troubleshooting guides (TSGs) [11], and there have been efforts to improve TSG quality [28] and make them more effective for incident resolution.

Large Language Models (LLMs) have shown remarkable ability to work with a wide variety of data modalities, including unstructured natural language, tabular data and even images. Recently, Ahmed et al. [1] proposed the use of fine-tuned pretrained LLMs for RCA of cloud incidents. Since incident data is highly confidential, and unlikely to have been observed by pretrained LLMs, fine-tuning is necessary for domain adaptation of vanilla LLMs. In this work, we adopt the RCA task as framed in Ahmed et al. [1]; given an incident report, we want our model to predict a specific root cause. However, unlike the original setting, we exclude the use of fine-tuning or other training approaches for domain adaptation. As pointed out by Chen et al. [8], while fine-tuning can be effective, it is also costly and time-consuming, and must be repeated every time the base model gets updated, or services evolve. To address these limitations, Chen et al. [8] introduce RCACopilot, which uses predefined handlers to automatically collect multi-modal diagnostic data relevant to the incident, and an LLM to analyze the collected data and predict a root cause category for the incident that serves to assist OCEs with RCA, without the need for finetuning. Unlike RCACopilot, the ReAct agent presented in our case study can dynamically collect related diagnostic data autonomously, without the need for predefined handlers.

2.4 Augmented LLMs and LLM-Based Agents
A recent development in LM research has been the rise of LMs augmented with the ability to reason and use tools, or Augmented Language Models (ALMs) [13, 20, 27]. Augmenting LLMs extends their ability beyond what is possible in a purely language modelling regime. Primarily, these augmentations are either external components that allow the LLM to interact dynamically with its environment for a given problem setting, or prompting techniques that endow the LLM with sophisticated reasoning abilities for complex analytical tasks [37]. For example, LLMs have been augmented with external retrieval databases that can factually ground their predictions, as well as allow them to use information that was not seen in training. Retrieval can also narrow the gap between smaller models and their larger counterparts. LLMs can also be augmented with external components beyond retrieval, such as code interpreters [9] and web search engines. More recently, LLM-based agents combine the external augmentation components with reasoning and planning abilities to allow the LLM to autonomously
solve complex tasks such as sequential decision-making problems [29], knowledge-intensive question answering [35] and self-debugging [6].

3 LLM-BASED AGENTS FOR RCA
An LLM agent is an ALM that has the ability to both reason and use tools. In recent years, several different formulations of LLM agents have been proposed [40, 33]. For this work, we base the RCA Agent on the ReAct framework [40]. This framework interleaves reasoning and tool usage steps, combining principles from reasoning-based approaches such as Chain of Thought [38] with tool usage models like Toolformer [12]. ReAct is a natural fit for the RCA task for many reasons: 1) the real-world RCA task has elements of both sequential decision-making (deciding which troubleshooting steps to take) and knowledge-intensive question answering (assessing available diagnostic information to produce a candidate root cause), both of which are supported by ReAct; 2) in an out of distribution setting such as the one we consider, ReAct can quickly adapt to new information since it interleaves reasoning, planning and environment feedback rather than creating a long-horizon plan upfront; and 3) it can easily be augmented with additional components such as reflection [30] and external memory mechanisms [43], which would benefit RCA for incidents requiring a longer diagnostic process.
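To make the interleaved reasoning-and-acting control flow concrete, the following is a minimal sketch of a ReAct-style loop in Python. It is illustrative only: the prompt wording, action syntax, and the complete and tools helpers are assumptions, not the exact prompts or interfaces used in our experiments.

# Minimal sketch of a ReAct-style loop (illustrative; not our exact prompts or tools).
# Assumes `complete(prompt) -> str` wraps the underlying LLM, and `tools` maps tool
# names (e.g. "historical_incidents") to callables that return a textual observation.
def react_loop(incident_report, tools, complete, max_steps=20):
    trajectory = f"Incident:\n{incident_report}\n"
    for _ in range(max_steps):
        # Reasoning step: the planner thinks about what to do next.
        thought = complete(trajectory + "Thought:")
        trajectory += f"Thought: {thought}\n"

        # Acting step: the planner emits an action of the form tool[input] or finish[answer].
        action = complete(trajectory + "Action:")
        trajectory += f"Action: {action}\n"

        tool_name, _, tool_input = action.strip().partition("[")
        tool_input = tool_input.rstrip("]")
        if tool_name == "finish":
            return tool_input  # candidate root cause

        # Observation step: the tool's output is fed back and the loop repeats.
        observation = tools.get(tool_name, lambda x: f"Unknown tool: {tool_name}")(tool_input)
        trajectory += f"Observation: {observation}\n"
    return "No root cause determined within the step budget."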
3.1 Overview
The agent's planner issues queries to a retrieval tool over historical incidents, and we consider two variants of this tool. The first variant, ReAct BR, simply returns the retrieved documents as an observation without further processing. Since the query here is a passage, we exclusively use the SentenceTransformer retriever for retrieval. The second variant, ReAct S+Q, first retrieves a set of historical incidents, and then uses another LLM to answer the planner's query. This two-step process allows the planner to disentangle the retrieval query from the target incident report, and also mitigates instances where the size of retrieved historical incidents might extend beyond the context length of the underlying LLM. We restrict the retrieval tool to retrieve k = 3 documents per query, to give the agent the opportunity to create a diverse set of queries while still maintaining an overall budget of 10 retrieved documents for parity with other baselines.

3.2 Zero-Shot Prompting
While LLM-based agent approaches typically benefit from few-shot examples [40, 33], we use ReAct in a much more challenging setup with a zero-shot prompt. Originally, we set out to craft few-shot examples based on examples from the evaluation set. However, for the setting in RQ1 and RQ2, where we only utilize the incident title and description, we found it extremely challenging to come up with reasoning traces grounded in the available information that would arrive at the correct root cause.

3.3 Agent Evaluation
One of the primary benefits of agent-based RCA is the ability to collect external diagnostic information via tools. This is difficult to evaluate without the existence of a simulated environment such as WebArena [44], AlfWorld [31] or WebShop [39]. The main challenge in constructing such an environment is that it is difficult to determine what diagnostic services were used to diagnose a particular incident, since OCEs are not required to report each and every diagnostic step taken. Moreover, the type of diagnostic services used by different teams can vary greatly. Another challenge is that the environment needs to support not only the most optimal troubleshooting trajectory that the agent can take, but a reasonably large subset of other plausible trajectories, i.e. even if we know what diagnostic data is needed to resolve an incident, it does not suffice to only capture this specific data for the environment. Given such an evaluation environment does not currently exist, we evaluate the agent in a restricted setting where we do not assume access to any specialized services, similar to [1]. While this evaluation does not reflect the benefits of the agent's ability to perform autonomous diagnostic steps, it provides us with a lower bound for performance of the agent when specialized tools are unavailable, and allows us to fairly compare it to other ALMs that do not have the ability to query additional diagnostic data. In addition, to demonstrate the agent's ability to interact with diagnostic services, we also present a case study of a prototype implementation of an agent in collaboration with a team at our company. This presents a more realistic evaluation of the agent, but at a much smaller scale. The goal of the case study is to examine the benefits and limitations of the agent in a practical environment, and to identify practical considerations for real world adoption.

4 RESEARCH QUESTIONS
To evaluate the efficacy of LLM-based agents in RCA, we ask the following research questions:
RQ1: How effective are LLM-based agents at finding incident root causes when given access to a generalized toolkit? In this setting, we test the efficacy of LLM-based agents at root cause analysis in an out of distribution setting when they are given access only to tools that are independent of specific teams. We equip the agent with a generalized retrieval tool over historical incidents, and a question-answering tool over the raw incident description. We consider various strong ALM baselines that, unlike the agent, are unable to use tools.
RQ2: Do discussion comments help improve LLM based approaches to root cause analysis? Discussion comments on incident reports contain records of the diagnostic steps taken by OCEs to resolve the incident, and can guide models in performing RCA on future incidents. Here, we aim to investigate whether incorporating these discussion comments into our retrieval corpus of historical incidents impacts the performance of the agent as well as selected baselines from RQ1. To perform this evaluation, we augment incidents in our retrieval corpus with associated discussion comments post-retrieval, to ensure that the presence of the comments does not affect retrieval.
RQ3: How effective are LLM agents at RCA when given access to team specific diagnostic tools? In this research question, we evaluate a real world scenario where an LLM based agent has access to a team specific knowledge base and monitoring service. To conduct this evaluation, we perform a case study with another team's on-call engineers. We package the ReAct agent with these resources into a chat interface, and conduct an in person experiment to see if this agent is able to effectively assist the on-call engineer in finding the root cause of a small set of incidents.

5 METHODOLOGY
We describe the methodology used to answer RQ1 and RQ2 in this section. The methodology for RQ3 is described in Section 7.

5.1 Dataset
We collect incident data from our internal incident portal, from 01/01/2020 to 09/30/2021. Our data collection process yielded a total of 107,000 unique incidents, which we split into train (102,000), evaluation (2,000) and test (3,000) sets. For this work, we randomly sample 100 incidents from the evaluation set and 500 incidents from the test set to reduce costs, in line with work in NLP [36]. The training set is primarily used as the retrieval corpus for our experiments. Like Ahmed et al. [1], we use the incident title and description as the primary sources of information about the incident. For RQ2, we also include discussion comments in the historical corpus. Incident descriptions and root causes do not follow a standard format, and can be quite long. This imposes limitations on the number of historical incidents that can fit in context when using any kind of retrieval augmented generation. Hence, we use gpt-3.5-turbo to summarize descriptions and root causes. For RQ2, we also summarize discussion comments. Since discussion comments are much longer, we split them into chunks, summarize each individual chunk and recombine them, utilizing the LLM for each step. Note that the summarization process is difficult to evaluate due to a lack of reference summaries, and hence we rely on qualitative analysis and end-to-end evaluation on RCA to iterate on the summarization process.
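As a rough illustration of this chunk-then-recombine summarization, the sketch below uses a map-reduce pattern over the discussion thread. The chunk size, prompts, and client calls are assumptions; our pipeline may differ in its details.

# Sketch of map-reduce style summarization of long discussion threads with
# gpt-3.5-turbo; prompts and chunk size are illustrative, not our exact settings.
from openai import OpenAI

client = OpenAI()

def _llm_summarize(text, instruction):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": instruction},
                  {"role": "user", "content": text}],
    )
    return resp.choices[0].message.content

def summarize_discussion(comments, chunk_chars=6000):
    # Map: summarize each chunk of the concatenated discussion thread.
    thread = "\n".join(comments)
    chunks = [thread[i:i + chunk_chars] for i in range(0, len(thread), chunk_chars)]
    partial = [_llm_summarize(c, "Summarize the diagnostic steps discussed in this incident thread.")
               for c in chunks]
    # Reduce: recombine the partial summaries into a single summary.
    return _llm_summarize("\n".join(partial),
                          "Combine these partial summaries into one concise summary.")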
5.2 Base LLMs
For all of our experiments, we use OpenAI GPT-4 (8k) [22] as the primary language model. GPT-4 is the most powerful model in OpenAI's repository of models, and is one of the few models that can be used to reliably drive an agent in a zero-shot setting. The large context size (8,000 tokens) also enables us to use a larger number of retrieved incidents for our models. For summarization of incidents and discussion comments, we use gpt-3.5-turbo to lower costs.

5.3 Retrievers
We construct a retrieval corpus of historical incidents that encompasses the entire training split of our collected dataset. We consider one dense retriever and one sparse retriever.
Dense Retriever (ST): We use a pretrained Sentence-BERT [24] based encoder (all-mpnet-base-v2) from the associated SentenceTransformers library as our dense retriever, and Max Marginal Relevance (MMR) [4] for search.
Sparse Retriever (BM-25): While some models perform a single retrieval step, other models such as IR-CoT and the ReAct agent perform multiple retrieval steps with different queries, and can benefit from term based search [36]. We use BM-25 [25] as our sparse retriever.

5.4 Baseline Models
Here, we describe the baselines used for our evaluation in RQ1 and RQ2. We restrict ourselves to ALMs that do not require any fine-tuning.
Retrieval Baseline (RB): Retrieval Augmented Generation (RAG) is an effective strategy for providing domain adaptation for language models without additional training. For our experiments, we create a retrieval database of historical incident reports with known root causes, and use the incoming incident's title and description to retrieve the top-k relevant historical incidents. These incidents are then put into the LLM's context as few-shot examples.
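A minimal sketch of this baseline follows: dense encoding with all-mpnet-base-v2, a simple MMR selection step, and few-shot prompt assembly. The corpus fields and prompt format shown here are illustrative assumptions rather than our exact implementation.

# Sketch of the retrieval baseline: encode summarized historical incidents,
# select top-k with Max Marginal Relevance, and format them as few-shot examples.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def mmr_select(query_emb, doc_embs, k=10, lam=0.5):
    # Max Marginal Relevance: balance relevance to the query against redundancy
    # with documents that have already been selected.
    query_sims = util.cos_sim(query_emb, doc_embs)[0]
    selected, candidates = [], list(range(len(doc_embs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((float(util.cos_sim(doc_embs[i], doc_embs[j])[0][0])
                              for j in selected), default=0.0)
            return lam * float(query_sims[i]) - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

def build_rag_prompt(incident_text, corpus, k=10):
    # corpus: list of dicts with (hypothetical) "summary" and "root_cause" fields.
    doc_embs = encoder.encode([c["summary"] for c in corpus], convert_to_tensor=True)
    query_emb = encoder.encode(incident_text, convert_to_tensor=True)
    shots = [corpus[i] for i in mmr_select(query_emb, doc_embs, k=k)]
    examples = "\n\n".join(f"Incident: {s['summary']}\nRoot cause: {s['root_cause']}" for s in shots)
    return f"{examples}\n\nIncident: {incident_text}\nRoot cause:"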
Chain of Thought (CoT): Chain of Thought is one of the earlier prompting methodologies developed to enhance the reasoning abilities of LLMs [37]. The idea is to encourage the model to break the input problem into smaller parts by thinking step by step. For our experiments, we use CoT in a zero-shot setting, by appending a prefix ("Let's think step by step") to the answer prompt.
Interleaving Retrieval - Chain of Thought (IR-CoT): Trivedi et al. [35] show that interleaving vanilla CoT prompting with retrieval improves model performance on complex, multistep reasoning tasks. After every reasoning step the LLM takes, the reasoning step is used to retrieve relevant documents from the retrieval corpus. This is shown to improve performance over single step retrieval for knowledge intensive question answering tasks.

5.5 Automatic Evaluation Metrics
For evaluating models in the general setting, we use three evaluation metrics based on lexical similarity (BLEU, METEOR, ROUGE) and one based on semantic similarity (BertS). BLEU [23] is a precision based lexical similarity metric that computes the n-gram overlap between model predictions and ground truth references. We use both corpus (C-BLEU) and segment (S-BLEU) level variants. METEOR [2] considers both precision and recall, and uses more sophisticated text processing and scoring systems. rougeL [14] is commonly used to evaluate summarization and is recall based. BERTScore (BertS) [42] measures semantic similarity rather than lexical overlap, using pretrained BERT models.
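For concreteness, the snippet below sketches how such metrics can be computed between predicted and reference root causes with common open-source packages (nltk, rouge-score, bert-score); we do not claim these are the exact implementations or settings used in our evaluation.

# Sketch: lexical (BLEU, METEOR, rougeL) and semantic (BERTScore) similarity between
# predicted and reference root causes, using widely available packages.
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu
from nltk.translate.meteor_score import meteor_score  # requires nltk wordnet data
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(predictions, references):
    pred_tok = [p.split() for p in predictions]
    ref_tok = [r.split() for r in references]

    c_bleu = corpus_bleu([[r] for r in ref_tok], pred_tok)                     # corpus-level
    s_bleu = sum(sentence_bleu([r], p) for r, p in zip(ref_tok, pred_tok)) / len(pred_tok)
    meteor = sum(meteor_score([r], p) for r, p in zip(ref_tok, pred_tok)) / len(pred_tok)

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(scorer.score(r, p)["rougeL"].fmeasure
                  for r, p in zip(references, predictions)) / len(predictions)

    _, _, f1 = bert_score(predictions, references, lang="en")                  # semantic
    return {"C-BLEU": c_bleu, "S-BLEU": s_bleu, "METEOR": meteor,
            "rougeL": rouge_l, "BertS": float(f1.mean())}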
In addition to the automatic metrics, we manually annotate model predictions using the criteria shown in Table 1. For correct predictions, we differentiate predictions that unambiguously match the reference (Precise), match the reference semantically but exclude some specifics present in the reference (Imprecise), and those that match the root cause semantically but also contain unrelated factual inaccuracies (Hallucinations). The last case commonly manifests as predictions that suggest the execution of post-hoc resolution actions (e.g. the incident was resolved by restarting the affected cluster) that did not take place. Imprecise predictions can be useful for OCEs, whereas factual errors can mislead OCEs. For predictions that do not match the reference root cause, we add two new categories to the ones from [40]. The first, Insufficient Evidence, refers to an incorrect prediction that indicates that there is not enough evidence available to determine the root cause for the incident. The second, Other, refers to instances of incorrect predictions that do not have a clearly identifiable cause of error. This is an extension of the label ambiguity category from [40], and now includes other failure cases where the model predicts a plausible specific root cause (unlike Insufficient Evidence, which is only applied to cases where no specific root cause is indicated), but does not contain obvious reasoning, retrieval, or factual errors. This is often due to the information sparsity of incident reports, especially in cases where the incident report provides details as external links that are inaccessible to the models.

Table 1: Manual Annotation Criteria

Outcome                               Description
Correct    Precise                    Precisely matches reference root cause
           Imprecise                  Matches reference but misses some details
           Hallucination              Matches reference but contains unrelated factual errors
Incorrect  Hallucination              Contains factual errors in reasoning or prediction
           Insufficient Evidence      Refrains from making a prediction
           Other                      Cause of error unknown
           Reasoning Error            Reasoning contains errors
           Retrieval Error            Unable to retrieve relevant historical incidents

6 RQ1 AND RQ2 RESULTS

Table 2: Test set results

Model            C-BLEU   S-BLEU   rougeL   METEOR   BertS
RB (k=3)         4.73     4.64     18.48    21.62    0.863
RB (k=6)         5.66     5.56     19.78    23.25    0.865
RB (k=10)        5.97     5.74     20.30    24.11    0.866
CoT              6.31     5.60     19.91    22.02    0.865
IR-CoT ST        3.91     3.67     16.97    18.50    0.859
IR-CoT BM25      4.61     4.02     17.56    19.94    0.860
ReAct BR         5.53     4.90     17.45    19.23    0.858
ReAct S+Q BM25   5.59     4.73     17.43    18.72    0.857
ReAct S+Q ST     5.27     4.58     17.35    18.60    0.857
6.1 RQ1: How effective are LLM-based agents at finding incident root causes when given access to a generalized toolkit?
[...] more at a similar level as the Retrieval Baseline (k=3), despite retrieving a larger number of historical incidents. For the remaining lexical metrics, RB (k=10) offers the highest levels of performance, closely followed by CoT and the Retrieval Baseline (k=6), with the largest difference between these models on METEOR (2.09). These are followed by the ReAct variants and the Retrieval Baseline (k=3). When investigating reasoning logs, we discover that the ReAct variants retrieve a mean of 4 unique historical incidents, with an average of 2 lookups per incident. While on each retrieval step they retrieve 3 unique incidents, the historical_incidents tool is stateless, and does not take into consideration documents that have already been retrieved in prior steps, resulting in some duplication. This is a consequence of the limited information available in the incident title and description, making it difficult for the model to craft separate queries that yield distinct sets of historical incidents. This is likely also why the IR-CoT variants perform poorly on lexical metrics. When we consider semantic similarity, we observe a performance envelope of < 1 across all models. Therefore, neither reasoning nor additional historical incidents drastically change the semantic content of predictions made by these models.

Qualitative Analysis: Table 3 shows the results of our qualitative assessment for RB (k=10), CoT and one ReAct variant (ReAct S+Q BM25). CoT and RB (k=10) have an accuracy of 39%, followed by ReAct S+Q BM25 at 35%. There are 28/97 examples that are solved correctly by all three models. ReAct S+Q BM25 correctly predicts 4 examples that are incorrectly predicted by the other two. When we examine these instances, we discover that in all of them, ReAct was able to correctly filter out (in its reasoning steps) historical incidents that share some lexical similarity with the target incident report but ultimately are semantically quite different, whereas the other two models incorrectly include them in consideration for their final prediction. CoT and RB (k=10) correctly predict 8 and 9 examples respectively for which ReAct is incorrect. 2 of these instances resulted from reasoning errors by ReAct, and the rest were primarily instances where it indicated a lack of evidence (Insufficient Evidence) for RCA. Looking more closely, 26% (10/38) of the correct predictions made by RB (k=10) contain hallucinations, while it is < 1% for CoT and ReAct S+Q BM25. Similarly, 49% (29/59) of RB (k=10)'s incorrect predictions are hallucinations, dropping to 18% (11/59) for CoT and 6% (4/63) for ReAct. Overall, ReAct has the highest precision among the three, but this comes at the cost of lower overall accuracy.

Table 3: Manual Labelling of Success and Failure Cases

Type                                  RB (k=10)   CoT   ReAct-BM25
Correct     Precise                   26          30    29
            Imprecise                 2           7     5
            Hallucination             10          1     -
            All                       38          38    34
Incorrect   Hallucination             29          11    4
            Insufficient Evidence     11          19    39
            Other                     19          27    8
            Reasoning Error           -           2     10
            Retrieval Error           -           -     2
            All                       59          59    63

CoT makes fewer reasoning errors than ReAct. This is likely in part due to the more sophisticated prompting involved with ReAct and the zero-shot setting. We observed some instances of longer reasoning trajectories wherein ReAct would have difficulty maintaining the prompt format. 66% of ReAct's incorrect predictions indicate a lack of information to make a root cause prediction (Insufficient Evidence), while this is much less frequent for CoT (32%) and RB (k=10) (18%). Lastly, a notable portion of errors for RB (k=10) (32%) and CoT (45%) do not have a clear cause for the error (Other), while this happens much less for ReAct (12%). Many of these uncategorized errors are predictions that are too generic (e.g. suggesting a non-specific configuration issue), while others are plausible based on historical incidents but incorrect.

Overall, the qualitative analysis indicates that the higher correctness rates of RB (k=10) come at the cost of factual accuracy, despite the grounding offered by retrieval. CoT offers the same correctness rate with lower rates of hallucination. This clearly demonstrates the benefits of introducing explicit reasoning into an LLM. The ReAct agent also benefits from reasoning, offering the lowest rates of hallucinations for both correct and incorrect predictions, albeit at a slightly lower overall accuracy rate.

RQ1 Takeaways: ReAct agents perform competitively with retrieval and CoT baselines on semantic similarity, while underperforming on lexical metrics. Manual labelling reveals that they achieve competitive correctness rates, while providing a substantially lower rate of hallucinations.

6.2 RQ2: Do discussion comments help improve LLM based approaches to root cause analysis?
Table 4 shows the performance of the considered models after incorporating discussions into retrieved historical incidents. In general, incorporating discussions provides mixed results on model performance for lexical metrics across different models. Discussions improve performance on C-BLEU, S-BLEU and rougeL for RB (k=10), but these improvements are modest. On the other hand, it experiences a modest drop in performance for METEOR (< 1). CoT experiences performance degradation for all lexical metrics: C-BLEU (-0.13), S-BLEU (-0.39), METEOR (-0.7) and rougeL (-1). Unlike CoT, both ReAct variants are mostly positively impacted by the inclusion of discussions. ReAct BR shows improvements in performance for rougeL (+0.1) and METEOR (+0.8), but obtains lower C-BLEU (-0.1), and does not show any difference in S-BLEU. Similarly, ReAct S+Q BM25 does not display differences in C-BLEU, S-BLEU or rougeL, but gets a small improvement in METEOR (+0.24). It is important to note that these small differences in metric scores (≤ 1) are likely not perceivable to human annotators, as has been empirically observed in NLP [19], as well as SE [26]. Lastly, semantic metrics reveal that the incorporation of discussions does not significantly impact the performance of the 4 models in Table 4.

Our qualitative observations of a small set of model predictions (20) in the presence vs absence of discussions are in line with these findings. Notably, we do not observe any meaningful differences in the semantic content of the produced root cause between the two scenarios among the predictions that we analyzed.
Table 4: Test set results after incorporating discussions

Model            C-BLEU    S-BLEU    rougeL    METEOR     BertS
RB (k=10)        6.65 ↑    6.01 ↑    20.8 ↑    23.81 ↓    0.867
CoT              6.18 ↓    5.21 ↓    18.8 ↓    21.32 ↓    0.861
ReAct BR         5.44 ↓    4.91      17.8 ↑    20.04 ↑    0.854
ReAct S+Q BM25   5.52      4.68      17.4      18.96 ↑    0.858

We conjecture that the small observed effect of discussions on RCA performance is due to a combination of three factors. Firstly, comments reporting the end result of a diagnostic step constitute a large portion of RCA relevant discussions. While these comments shed light on the troubleshooting steps that lead to incident RCA and resolution, these steps cannot be replicated by models in the general setting; models do not have access to the same diagnostic services and resources that were utilized by OCEs to arrive at the conclusions indicated by these discussion comments. Secondly, the sparsity of information present in incident titles and descriptions negatively impacts the ability of models to connect information arising from discussions to the target incident. For example, a discussion comment might signal that the presence of a certain symptom indicates a particular root cause, but this symptom might not be reported in the target incident. This is especially true for symptoms that must be elicited using troubleshooting steps. Lastly, discussion threads on incident reports can themselves be quite data sparse, containing lots of administrative content that is not directly useful for RCA (e.g. incident acknowledgement, status updates). While we use length heuristics to remove clearly uninformative comments, comprehensively improving the quality of the comment dataset will require more sophisticated techniques (such as LLM based filtering).

RQ2 Takeaways: Incorporating discussion comments into the historical corpus does not clearly improve models' performance on RCA. Depending on the metric considered, it can both improve or degrade performance on lexical metrics.

7 PRACTICAL IMPLEMENTATION OF RCA AGENT: A CASE STUDY
Our evaluation of the ReAct agent in RQ1 and RQ2 does not fully capture the capability of the agent to dynamically plan and collect additional diagnostic data from team specific diagnostic services. Here, we explore these abilities of the agent by conducting a case study with the Azure Fundamental Team.

7.1 Approach
We work with the Azure Fundamental Team over a period of 4 weeks, primarily using unstructured discussions. We start by understanding their needs and the challenges they face with regard to RCA, followed by presenting them with the potential benefits and limitations associated with integrating an LLM based agent into their workflow. Next, we identify key diagnostic services used by the team in practice, how these services are used, and iteratively develop tools that allow the agent to interface with these services. Lastly, we conduct demonstrations of the agent with a small set of incidents in a simple chat interface with the team to collect their feedback.

7.2 Knowledge Base Articles (KBAs)
A common practice in large IT companies is to encode domain knowledge in internal knowledge base articles. In the context of incident management, these articles contain guidelines for how certain types of incidents must be diagnosed and mitigated, as well as key information about how to conduct these operations, such as example database queries. At Microsoft, engineers maintain a large number of KBAs for incident management. They help in standardizing operational procedures, facilitating sharing of knowledge across various teams, and onboarding new engineers. Many types of incidents, especially ones triggered by monitoring services, are tagged with relevant KBAs either automatically, or manually during triage. For these incidents, OCEs will have access to relevant KBAs the moment they start the RCA. Incidents that do not have associated KBAs typically require OCEs to spend time searching for and locating relevant KBAs before they can start RCA. We consider both of these scenarios in our case study.

7.3 Agent Development
We replace the generalized tools in the ReAct agent from RQ1 and RQ2 with the following specialized tools that can access team specific diagnostic data.

Database Query Tool: We design and implement a tool that can be used by the agent to query databases and then analyze query results. The database framework used by the team utilizes a custom query language, which is somewhat similar to SQL. The tool design was informed by discussions with the team as well as analysis of several historical incidents experienced by the team. Based on our investigation, we settled on a design that uses two distinct components for this tool: the Query Execution Engine and the Pandas DataFrame Query Engine. The Query Execution Engine can be used by the agent to query the database. This requires not only the construction of the actual database query, but also knowledge of the cluster on which the database is deployed and the name of the database. This generic design gives the agent flexibility in making queries and also increases re-usability of this tool for other teams that are also using the same database platform. Once a query is successfully executed, the results returned by the database are transformed into a Pandas DataFrame and sent to the Pandas DataFrame Query Engine. The agent can then perform question-answering over the returned table using natural language queries. The Pandas DataFrame Query Engine itself consists of an LLM which, based on the agent's queries, performs transformations on the DataFrame using the Python interpreter and then generates a final answer.
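A simplified sketch of this two-component design is shown below. The database client (run_query), cluster and database names, and prompts are placeholders: the team's platform uses its own custom query language and endpoints, and the actual DataFrame Query Engine lets the LLM run pandas transformations rather than only reading a table preview.

# Sketch of the Database Query Tool: a Query Execution Engine that runs a query on a
# (hypothetical) cluster/database via an injected client, and a Pandas DataFrame Query
# Engine in which an LLM answers natural-language questions over the returned table.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def query_execution_engine(run_query, cluster, database, query):
    # `run_query` stands in for the team's database client and is assumed to return
    # rows as a list of dicts; errors should be surfaced back to the agent as text.
    rows = run_query(cluster=cluster, database=database, query=query)
    return pd.DataFrame(rows)

def dataframe_query_engine(df: pd.DataFrame, question: str) -> str:
    # Simplified: give the LLM a preview of the table and ask it to answer.
    preview = df.head(20).to_csv(index=False)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system",
                   "content": "Answer the question using only the table provided."},
                  {"role": "user", "content": f"{preview}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content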
KBA Q/A Tool: KBAs often contain critical information that is required to perform RCA, and are one of the most widely used resources for incident management at Microsoft. For example, one of the key pieces of information required to use the Database Query Tool is the cluster address. This information is typically only available to OCEs via KBAs. To incorporate this information into the agent, we expose a question-answering tool over a set of KBAs (14 documents) provided by the team. The tool consists of a vectorstore containing chunks of KBAs, and an LLM which, given a query from the agent, uses knowledge from the retrieved KBA chunks to answer the query. If the incident in question has an associated KBA, we do not use the vectorstore and instead use the KBA directly to answer questions posed by the planner.

KBA Planning Tool: During preliminary experiments with the team, we noticed that the eager interleaving of thoughts and actions in ReAct can be detrimental to high level planning, i.e. it can sometimes start unsuccessfully carrying out troubleshooting tasks without constructing a high level plan to guide RCA. To mitigate this phenomenon, we introduce a variant of the KBA Q/A tool which is designed to be used specifically for planning. Structurally, it is identical to the Q/A tool, but introducing it explicitly into the action space of the agent encourages it to consistently construct high level plans before taking concrete diagnostic steps.

Human Interaction Tool: Our discussions with the team revealed several scenarios where a human-in-the-loop style workflow is necessary for RCA. For example, diagnosing certain types of incidents requires reproducing the error reported in the incident, or manually logging into a cloud device and extracting diagnostic information, which would be difficult for the agent to do. Therefore, it is desirable for the OCE to be able to collect such information and provide it as an observation to the agent. Moreover, our preliminary experiments revealed that the agent struggles to make progress when key information is missing from the KBAs (such as a missing cluster address for DB queries), but this information can often easily be provided by the OCE to the agent. Therefore, we add a Human Interaction Tool to allow the agent to request diagnostic information from OCEs, and also add UI enhancements to allow OCEs to interject in the agent's action steps, manually verify tool executions and provide explicit feedback to the agent when desired.

7.4 RQ3 Results: How effective are LLM agents at RCA when given access to team specific diagnostic tools?
7.4.1 Challenges faced by OCEs in the team for RCA. The Azure Fundamental Team develops and maintains core services within the company's cloud platform, which hosts both internal and external customers that host cloud applications on the platform. While many of the incidents they receive are human reported, they also maintain several systems that automatically detect and report error states. They maintain a large number of troubleshooting guides (KBAs) to mitigate the diversity of incident types and associated diagnostic steps. When a KBA exists for a certain type of incident, and the incident is relatively simple, new engineers with limited experience with the team's services are able to effectively perform RCA. However, identifying the right KBA for an incident can take time, and when incidents get more complex, a significant amount (> 1.5 years) of experience is required for RCA.

7.4.2 Real world RCA using ReAct. We started by investigating simple incidents which have a clear KBA article available, and require a straightforward sequence of diagnostic steps with minimal branching. We use the incident shown in Figure 3 as an illustrative example of incidents of this type. This incident reports that there has been a setting drift in a cluster, i.e., a setting is out of sync with the central orchestrating server. This is a type of incident that is automatically reported by monitoring services, which is why the description is empty. Diagnosing this incident can lead to exactly two outcomes: 1) if there are no tenants in the affected cluster, the incident is marked as a false positive and no mitigation is required, and 2) if the cluster is hosting tenants, then the OCE must identify the affected clusters (this information is not present in the incident report) and manually instantiate a job that will rectify the setting drift to mitigate the incident. Identifying the correct outcome requires querying a database to identify affected clusters and analyzing the returned table to determine whether the incident is a false positive, or requires mitigation. Lastly, the incident report includes an associated KBA describing the necessary troubleshooting steps, example database queries, as well as key pieces of information such as the database address.

Title: [SettingDrift] Enable AppliancePathCreation is drifted
Description: <empty>

Figure 3: Sample Incident

Even though this incident is relatively straightforward for OCEs, it is not possible to identify whether it is a false alarm or not based only on the incident report. This underlines the importance of having access to diagnostic APIs for any automated RCA mechanism. In particular, it is worthwhile to note that even if an automated approach is able to correctly predict the outcome without carrying out the proper diagnostic steps, the OCE would still have to carry them out to verify the prediction. When we tested the ReAct agent on this incident, it was able to correctly identify case 1 consistently. This involved using the KBA Planning Tool to gather the required troubleshooting steps, adapting and executing the sample query from the KBA, and correctly assessing the resulting table. While this series of actions is not challenging for OCEs to execute, we stress the fact that the ReAct agent has no prior knowledge of the domain, the incident, or the syntax of the database query language. Yet, it is able to leverage the KBA to autonomously complete the RCA process. We observed that the agent would sometimes fail to execute the database query in its first attempt. However, since we surface appropriate error messages to the agent as observations, it was consistently able to rectify these mistakes and complete the troubleshooting process. One engineer expressed that they were "amazed by the tool's capability to automatically discern the right parameters and even rectify mistakes when the parameters are initially incorrect by querying the documents". On the other hand, the second outcome of this incident (case 2) requires an additional filtering step to remove some rows from the table returned by the database query. In our demonstrations with the team, the agent was unable to resolve this error consistently, but engineers were able to use the human-in-the-loop features of the prototype to intervene and fix the error encountered in the filtering step.
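To make the decision logic of this incident type explicit, the following hypothetical sketch mirrors the two outcomes described above; the table, column names, and query are invented for illustration, and the real steps and query come from the team's KBA.

# Hypothetical sketch of the setting-drift check: find affected clusters and decide
# between "false positive" and "mitigation required" (names and query are invented).
def diagnose_setting_drift(execute_query, drifted_setting):
    # Placeholder query in a SQL-like syntax; the real query (and the cluster and
    # database to run it on) would be taken from the associated KBA.
    affected = execute_query(
        f"SELECT Cluster, TenantCount FROM SettingDrift WHERE Setting = '{drifted_setting}'"
    )  # assumed to return a pandas DataFrame

    with_tenants = affected[affected["TenantCount"] > 0]
    if with_tenants.empty:
        return "False positive: no tenants on the drifted clusters; no mitigation required."
    clusters = ", ".join(with_tenants["Cluster"].astype(str))
    return f"Mitigation required: instantiate the setting-sync job for clusters: {clusters}."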
We also examined complex incidents from the team that did not for learning based using natural language as the medium, and
have a clear set of troubleshooting steps in a single KBA, i.e. it re- present opportunities for building a system where learnings from a
quired combining information from multiple KBAs. The diagnosis specific team’s incidents can be stored in a database, and retrieved
steps typically involved a series of database queries. Engineers on for performing RCA on future incidents for the team.
Azure Fundamental Teamindicated that these incidents require at
Human intervention is necessary to build trust and provide some
least a year of experience with the team’s services to effectively
guardrails for LLMs for critical operations. . There are many diag-
diagnose. Here, we observed that while the agent initially produces
nostic steps which can be easily carried out by OCEs, but are not
a plausible high level plan, it was only ever able to successfully
accessible as consumable services for the agent. Moreover, when
execute one or two diagnostic steps before reaching the iteration
agents struggle with certain parts of the diagnostic process, such as
limit (20). This is primarily due to the difficulty in producing data-
in our study, engineers can easily intervene and correct the agent’s
base queries for these incidents, as information is distributed over
trajectory. Therefore, we recommend that agents used in practical
multiple KBAs, e.g. sample queries and cluster address are not in the
incident management scenarios be endowed with capabilities to
same KBA, requiring the agent to query the KBA Q/A tool multiple
allow for human interaction, using a combination of explicit tools
times before being able to execute a query. While the iteration limit
in the agent’s action space, and UI features for the application ex-
can be extended, it will eventually fill the context. This signals the
posing the agent to users. These capabilities can also be incredibly
need for a scalable multi-trial framework, where experience from
useful when combined with experiential learning; one can imagine
past trials can be used to guide future trials (e.g. [29, 43, 18].
a scenario where an engineer supervises an agent for a small set of
team specific incidents, while it builds its repository of experiences,
7.5 Learnings and Practical Considerations to enable quick domain adaptation for team specific knowledge, and
In this section, we distill some key considerations for the imple- avoid the burden of building a fine-tuning dataset for adaptation
mentation of practical LLM agents for RCA based on the case study. purposes.
KBAs are critical to real world RCA. . As seen from our findings,
KBAs, are critical to performing real world RCA. They contain 8 THREATS TO VALIDITY
both specialized domain knowledge and auxiliary facts about the The evaluation of the agent and other baselines for RCA are con-
agent’s environment (e.g. database addresses, API information) ducted on an internal dataset collected at Microsoft, and might not
that are required both for OCEs and LLM agents to effectively apply to datasets constructed from other organizations. We use a
carry out diagnostic steps. Even experienced OCEs must either smaller sample (n=500) of our test set to satisfy budget constraints,
refer to KBAs in real-time or have internalized the information which might not reflect performance on the larger test set. However,
present in these KBAs to some degree to perform RCA. While some we minimize this threat by using random sampling, and a sample
of this information can also be gleaned from historical incidents, size that has been employed in prior studies. Lastly, the manual
incident reports typically only contain the outcome of diagnosis annotations used to qualitatively characterize model predictions
steps (commonly in discussion comments) rather than operational can potentially introduce bias. We mitigate this by adapting la-
knowledge of how these steps must be performed. This is also why belling criterion from prior work, and engage in multiple rounds of
engineers across the company invest significant time and effort discussion to converge on particularly ambiguous examples.
into the construction and maintenance of KBAs.
Tool usage in RCA is non-trivial. While LLMs such as GPT-4 have shown remarkable ability to use tools prevalent in NLP, such as retrieval and search, querying diagnostic services with specialized query languages requires some trial and error. For this reason, we found it critical to surface error messages to the agent, providing feedback in instances of tool failures. One optimization in this regard is to replace the LLMs used in tools with smaller models fine-tuned for a particular type of tool (e.g. a particular database service). These models can then be injected with team-specific information (such as database addresses) in context for usage within a specific team.
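As an illustration of this feedback loop, the sketch below wraps an arbitrary query function so that failures are returned to the agent as observations rather than raised as exceptions. The wrapper and its error format are illustrative; query_fn stands in for whatever diagnostic client a team actually exposes.

```python
# Sketch: wrap a diagnostic-query function so that failures come back to the
# agent as observations instead of unhandled exceptions.
from typing import Callable

def make_safe_tool(query_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Return a tool that surfaces errors from query_fn as text the agent can read."""
    def tool(query: str) -> str:
        try:
            return query_fn(query)
        except Exception as exc:
            # Surface the service's error text to the agent; with this feedback
            # the agent can often repair a malformed query on a later step.
            return f"QUERY FAILED: {exc}. Revise the query and try again."
    return tool
```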
For complicated workflows, experiential learning and multi-trial workflows are necessary. Incidents that require a long and complex sequence of diagnostic steps for RCA typically have a large space of possible action trajectories. This poses significant challenges for the agent. For these incidents, single-trial RCA, where we restrict the agent trajectory to 20 steps, is not sufficient. Extending the agent to a multi-trial setting necessitates a reflection [30, 18] or long-term memory component that preserves progress across trials and thereby allows for experiential learning.
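The sketch below outlines one way such a multi-trial loop could look, in the spirit of Reflexion-style reflection [30]. The run_trial and reflect functions are placeholders rather than our implementation: run_trial executes a single bounded ReAct trajectory and reports whether it reached a confident diagnosis, and reflect distills a failed attempt into a lesson carried into the next trial.

```python
# Sketch of a multi-trial RCA loop with a reflection-style long-term memory.
from typing import Callable, List, Tuple

def multi_trial_rca(
    incident: str,
    run_trial: Callable[[str, List[str]], Tuple[str, bool]],
    reflect: Callable[[str, str], str],
    max_trials: int = 3,
) -> str:
    """Run bounded RCA trials, carrying forward reflections on failed attempts."""
    memory: List[str] = []          # lessons shared across trials
    diagnosis = ""
    for _ in range(max_trials):
        diagnosis, confident = run_trial(incident, memory)   # e.g. one 20-step trajectory
        if confident:
            break
        memory.append(reflect(incident, diagnosis))          # lesson for the next trial
    return diagnosis
```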
These mechanisms can also allow agents in incident management scenarios to be endowed with capabilities for human interaction, using a combination of explicit tools in the agent's action space and UI features in the application exposing the agent to users. These capabilities can also be incredibly useful when combined with experiential learning; one can imagine a scenario where an engineer supervises an agent for a small set of team-specific incidents while it builds its repository of experiences, enabling quick domain adaptation to team-specific knowledge and avoiding the burden of building a fine-tuning dataset for adaptation purposes.
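One lightweight way to place such an interaction channel in the agent's action space is an explicit ask-engineer tool. The console-based sketch below is an illustrative stand-in for the richer UI integration described above; the tool name and prompt format are assumptions for the example.

```python
# Sketch: an explicit human-interaction tool in the agent's action space.
def ask_engineer(question: str) -> str:
    """Let the agent request team-specific guidance or approval from an on-call engineer.
    In a real deployment this would route to the application's UI or a chat channel;
    here a console prompt stands in for that integration."""
    print(f"[agent question] {question}")
    return input("[engineer reply] > ")
```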
8 THREATS TO VALIDITY
The evaluation of the agent and the other baselines for RCA is conducted on an internal dataset collected at Microsoft, and might not apply to datasets constructed from other organizations. We use a smaller sample (n=500) of our test set to satisfy budget constraints, which might not reflect performance on the larger test set. However, we minimize this threat by using random sampling and a sample size that has been employed in prior studies. Lastly, the manual annotations used to qualitatively characterize model predictions can potentially introduce bias. We mitigate this by adapting labelling criteria from prior work and engaging in multiple rounds of discussion to converge on particularly ambiguous examples.

9 CONCLUSION & FUTURE WORK
This work provides the first empirical evaluation of an LLM-based agent, ReAct, for root cause analysis in cloud incident management. We have shown that in an out-of-domain, zero-shot setting, ReAct can perform competitively with strong baselines such as retrieval-augmented generation and CoT, while offering substantially lower rates of factual inaccuracies. Surprisingly, the use of discussion comments from incident reports does not have a significant impact on the agent's performance, revealing the limitations of performing RCA on a static dataset. Lastly, through our case study, we demonstrate the potential of LLM agents to autonomously perform RCA in a real-world setting when given access to the right tools.

The work presented here is a first step in the development of LLM-based agents for practical RCA. A promising direction for future work is the construction of a simulated RCA environment, which would rapidly enhance the development of agent-based approaches for RCA. As we continue to explore these avenues of future work, we anticipate that ReAct and similar agents will play a pivotal role in advancing incident management practices and automating complex decision-making processes in the software engineering domain.
REFERENCES
[1] Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending root-cause and mitigation steps for cloud incidents using large language models, (Jan. 2023). http://arxiv.org/abs/2301.03797 arXiv: 2301.03797 [cs.SE].
[2] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, (June 2005), 65–72. https://aclanthology.org/W05-0909.
[3] Chetan Bansal, Sundararajan Renganathan, Ashima Asudani, Olivier Midy, and Mathru Janakiraman. 2019. DeCaf: diagnosing and triaging performance issues in large-scale cloud services. CoRR, abs/1910.05339. http://arxiv.org/abs/1910.05339 arXiv: 1910.05339.
[4] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98). Association for Computing Machinery, Melbourne, Australia, (Aug. 1998), 335–336. https://doi.org/10.1145/290941.291025.
[5] Harrison Chase. 2022. LangChain.
[6] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128. http://arxiv.org/abs/2304.05128.
[7] Yinfang Chen, Xudong Sun, Suman Nath, Ze Yang, and Tianyin Xu. 2023. Push-button reliability testing for cloud-backed applications with Rainmaker. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 1701–1716. https://www.usenix.org/conference/nsdi23/presentation/chen-yinfang.
[8] Yinfang Chen et al. 2023. Empowering practical root cause analysis by large language models for cloud incidents, (May 2023). http://arxiv.org/abs/2305.15778 arXiv: 2305.15778 [cs.SE].
[9] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: program-aided language models. Proceedings of Machine Learning Research 202. Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, (Eds.), 10764–10799. https://proceedings.mlr.press/v202/gao23f.html.
[10] Tanja Hagemann and Katerina Katsarou. 2021. A systematic review on anomaly detection for cloud computing environments. In Proceedings of the 2020 3rd Artificial Intelligence and Cloud Computing Conference (AICCC '20). Association for Computing Machinery, Kyoto, Japan, 83–96. isbn: 9781450388832. doi: 10.1145/3442536.3442550.
[11] Jiajun Jiang et al. 2020. How to mitigate the incident? An effective troubleshooting guide recommendation technique for online service systems. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). Association for Computing Machinery, Virtual Event, USA, 1410–1420. isbn: 9781450370431. doi: 10.1145/3368089.3417054.
[12] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, (Oct. 2019). http://arxiv.org/abs/1910.13461 arXiv: 1910.13461 [cs.CL].
[13] Patrick Lewis et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst., 33, 9459–9474. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
[14] Chin-Yew Lin. 2004. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, (July 2004), 74–81. https://aclanthology.org/W04-1013.
[15] Chang Lou, Cong Chen, Peng Huang, Yingnong Dang, Si Qin, Xinsheng Yang, Xukun Li, Qingwei Lin, and Murali Chintalapati. 2022. RESIN: a holistic service for dealing with memory leaks in production cloud infrastructure. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 109–125. https://www.usenix.org/conference/osdi22/presentation/lou-resin.
[16] Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, Dongmei Zhang, and Zhe Wang. 2014. Correlating events with time series for incident diagnosis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). Association for Computing Machinery, New York, New York, USA, 1583–1592. isbn: 9781450329569. doi: 10.1145/2623330.2623374.
[17] Minghua Ma et al. 2020. Diagnosing root causes of intermittent slow queries in cloud databases. Proceedings of the VLDB Endowment, 13, 8, (Apr. 2020), 1176–1189. https://doi.org/10.14778/3389133.3389136.
[18] Aman Madaan et al. 2023. Self-Refine: iterative refinement with self-feedback, (Mar. 2023). http://arxiv.org/abs/2303.17651 arXiv: 2303.17651 [cs.CL].
[19] Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020. Tangled up in BLEU: reevaluating the evaluation of automatic machine translation evaluation metrics, (June 2020). http://arxiv.org/abs/2006.06264 arXiv: 2006.06264 [cs.CL].
[20] Grégoire Mialon et al. 2023. Augmented language models: a survey, (Feb. 2023). http://arxiv.org/abs/2302.07842 arXiv: 2302.07842 [cs.CL].
[21] Vinod Nair, Ameya Raul, Shwetabh Khanduja, Vikas Bahirwani, Qihong Shao, Sundararajan Sellamanickam, Sathiya Keerthi, Steve Herbert, and Sudheer Dhulipalla. 2015. Learning a hierarchical monitoring system for detecting and diagnosing service issues. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). Association for Computing Machinery, Sydney, NSW, Australia, 2029–2038. isbn: 9781450336642. doi: 10.1145/2783258.2788624.
[22] OpenAI. 2023. GPT-4 technical report, (Mar. 2023). http://arxiv.org/abs/2303.08774 arXiv: 2303.08774 [cs.CL].
[23] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. (2002). https://aclanthology.org/P02-1040.pdf. Accessed: 2023-9-27.
[24] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks, (Aug. 2019). http://arxiv.org/abs/1908.10084 arXiv: 1908.10084 [cs.CL].
[25] Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3, 4, 333–389. http://dx.doi.org/10.1561/1500000019.
[26] Devjeet Roy, Sarah Fakhoury, and Venera Arnaoudova. 2021. Reassessing automatic evaluation metrics for code summarization tasks. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021). Association for Computing Machinery, Athens, Greece, (Aug. 2021), 1105–1116. https://doi.org/10.1145/3468264.3468588.
[27] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: language models can teach themselves to use tools, (Feb. 2023). http://arxiv.org/abs/2302.04761 arXiv: 2302.04761 [cs.CL].
[28] Manish Shetty, Chetan Bansal, Sai Pramod Upadhyayula, Arjun Radhakrishna, and Anurag Gupta. 2022. AutoTSG: learning and synthesis for incident troubleshooting. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). Association for Computing Machinery, Singapore, Singapore, 1477–1488. isbn: 9781450394130. doi: 10.1145/3540250.3558958.
[29] Noah Shinn. [n. d.]. Reflexion: language agents with verbal reinforcement learning. https://github.com/noahshinn024/reflexion.
[30] Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection, (Mar. 2023). http://arxiv.org/abs/2303.11366 arXiv: 2303.11366 [cs.AI].
[31] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. ALFWorld: aligning text and embodied environments for interactive learning, (Oct. 2020). http://arxiv.org/abs/2010.03768 arXiv: 2010.03768 [cs.CL].
[32] Jacopo Soldani and Antonio Brogi. 2022. Anomaly detection and failure root cause analysis in (micro)service-based cloud applications: a survey. ACM Comput. Surv., 55, 3, Article 59, (Feb. 2022), 39 pages. doi: 10.1145/3501297.
[33] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. 2022. LLM-Planner: few-shot grounded planning for embodied agents with large language models, (Dec. 2022). http://arxiv.org/abs/2212.04088 arXiv: 2212.04088 [cs.AI].
[34] Mbarka Soualhia and Fetahi Wuhib. 2022. Automated traces-based anomaly detection and root cause analysis in cloud platforms. In 2022 IEEE International Conference on Cloud Engineering (IC2E), 253–260. doi: 10.1109/IC2E55432.2022.00034.
[35] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions, (Dec. 2022). http://arxiv.org/abs/2212.10509 arXiv: 2212.10509 [cs.CL].
[36] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions, (Dec. 2022). http://arxiv.org/abs/2212.10509 arXiv: 2212.10509 [cs.CL].
[37] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models, (Jan. 2022). http://arxiv.org/abs/2201.11903 arXiv: 2201.11903 [cs.CL].
[38] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models, (Jan. 2022), 24824–24837. S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, (Eds.) https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf arXiv: 2201.11903 [cs.CL].
[39] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: towards scalable real-world web interaction with grounded language agents, (July 2022), 20744–20757. S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, (Eds.) https://proceedings.neurips.cc/paper_files/paper/2022/file/82ad13ec01f9fe44c01cb91814fd7b8c-Paper-Conference.pdf arXiv: 2207.01206 [cs.CL].
[40] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: synergizing reasoning and acting in language models, (Oct. 2022). http://arxiv.org/abs/2210.03629 arXiv: 2210.03629 [cs.CL].
[41] Zhengran Zeng et al. 2023. TraceArk: towards actionable performance anomaly alerting for online service systems. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). (May 2023), 258–269. http://dx.doi.org/10.1109/ICSE-SEIP58684.2023.00029.
[42] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, et al. 2019. BERTScore: evaluating text generation with BERT. arXiv preprint arXiv:1904.09675. https://arxiv.org/abs/1904.09675.
[43] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2023. ExpeL: LLM agents are experiential learners, (Aug. 2023). http://arxiv.org/abs/2308.10144 arXiv: 2308.10144 [cs.LG].
[44] Shuyan Zhou et al. 2023. WebArena: a realistic web environment for building autonomous agents, (July 2023). http://arxiv.org/abs/2307.13854 arXiv: 2307.13854 [cs.AI].

Received 2024-02-08; accepted 2024-04-18