
IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 16, NO. 6, NOVEMBER/DECEMBER 2023

Robust Failure Diagnosis of Microservice System Through Multimodal Data

Shenglin Zhang, Member, IEEE, Pengxiang Jin, Zihan Lin, Yongqian Sun, Member, IEEE, Bicheng Zhang, Sibo Xia, Zhengdan Li, Zhenyu Zhong, Minghua Ma, Member, IEEE, Wa Jin, Dai Zhang, Zhenyu Zhu, and Dan Pei, Senior Member, IEEE

Abstract—Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure diagnosis methods rely solely on single-modal data (i.e., using either metrics, logs, or traces). In this study, we conduct an empirical study using real-world failure cases to show that combining these sources of data (multimodal data) leads to a more accurate diagnosis. However, effectively representing these data and addressing imbalanced failures remain challenging. To tackle these issues, we propose DiagFusion, a robust failure diagnosis approach that uses multimodal data. It leverages embedding techniques and data augmentation to represent the multimodal data of service instances, combines deployment data and traces to build a dependency graph, and uses a graph neural network to localize the root cause instance and determine the failure type. Our evaluations using real-world datasets show that DiagFusion outperforms existing methods in terms of root cause instance localization (improving by 20.9% to 368%) and failure type determination (improving by 11.0% to 169%).

Index Terms—Microservice systems, failure diagnosis, multimodal data, graph neural network.

Manuscript received 19 February 2023; revised 22 May 2023; accepted 14 June 2023. Date of publication 27 June 2023; date of current version 13 December 2023. This work was supported in part by the Advanced Research Project of China under Grant 31511010501, in part by the National Natural Science Foundation of China under Grants 62272249 and 62072264, and in part by the Natural Science Foundation of Tianjin under Grant 21JCQNJC00180. Recommended for acceptance by T. Batista. (Corresponding author: Yongqian Sun.)

Shenglin Zhang is with the College of Software, Nankai University, Tianjin 300071, China, also with the Key Laboratory of Data and Intelligent System Security, Ministry of Education, Tianjin 300071, China, and also with the Haihe Laboratory of Information Technology Application Innovation (HL-IT), Tianjin 300350, China.

Pengxiang Jin, Zihan Lin, Yongqian Sun, Sibo Xia, Zhengdan Li, Zhenyu Zhong, and Wa Jin are with the College of Software, Nankai University, Tianjin 300071, China.

Bicheng Zhang is with the School of Computer Science, Fudan University, Shanghai 200437, China.

Minghua Ma is with Microsoft, Beijing 100080, China.

Dai Zhang and Zhenyu Zhu are with the Zhejiang E-Commerce Bank Co., Ltd., Hangzhou, Zhejiang 310013, China.

Dan Pei is with the Department of Computer Science, Tsinghua University, Beijing 100190, China, and also with the Beijing National Research Center for Information Science and Technology, Beijing 100084, China.

Digital Object Identifier 10.1109/TSC.2023.3290018

I. INTRODUCTION

MICROSERVICES architecture is becoming increasingly popular for its reliability and scalability [1]. Typically, it is a large-scale distributed system with dozens to thousands of service instances running on various environments (e.g., physical machines, VMs, or containers) [2]. Due to the complex and dynamic nature of microservice systems, the failure of one service instance can propagate to other service instances, resulting in user dissatisfaction and financial losses for the service provider. For example, Amazon Web Services (AWS) suffered a failure in December 2021 that impacted the whole networking system and took nearly seven hours to diagnose and mitigate [3]. Therefore, it is crucial to timely and accurately diagnose failures in microservice systems.

To effectively diagnose failures, microservice system operators typically collect three types of monitoring data: traces, logs, and metrics. Traces are tree-structured data that record the detailed invocation flow of user requests. Logs are semi-structured text that records hardware and software events of a service instance, including business events, state changes, hardware errors, etc. Metrics are time series indicating service status, including system metrics (e.g., CPU utilization, memory utilization) and user-perceived metrics (e.g., average response time, error rate). From now on, we use the term modality to describe a particular data type. Fig. 1 shows an example of the three modalities of a microservice system.

Automatic failure diagnosis of microservice systems has been a topic of great interest over the years, particularly when identifying the root cause instance and determining the failure type. Most approaches rely on single-modal data, such as traces [1], [4], [5], [6], logs [7], [8], [9], [10], or metrics [11], [12], [13], [14], to capture failure patterns. However, relying solely on single-modal data for diagnosing failures is not effective enough for two reasons. First, a failure can impact multiple aspects of a service instance, causing more than one modality to exhibit abnormal patterns. Using just one data source cannot fully capture these patterns and accurately distinguish between different types of failures. Second, some types of failures may not be reflected in certain modalities, making it difficult for methods relying on that modality to identify these failures.

Moreover, we conduct an empirical study on an open-source dataset to verify the necessity of combining multimodal data for robust failure diagnosis.
TABLE I
DETAILED INFORMATION OF THE FAILURES IN THE EMPIRICAL STUDY

As listed in Table I, the dataset contains failures caused by various reasons: high memory usage, incorrect deallocation, code bug, misconfiguration, network interruption, etc. We examine hundreds of service instance failures and conclude that combining traces, logs, and metrics (multimodal data) is crucial for accurate diagnosis. For example, the microservice shown in Fig. 1 is experiencing a failure due to missing files. It generated error messages in logs and a significant increase in status code 500 in related traces. Additionally, one of its metrics, network out bytes, dropped dramatically during this failure.

Fig. 1. Multimodal data of a microservice system. S1–S7 are different microservices.

These observations highlight the importance of incorporating multimodal data for robust failure diagnosis. However, combining multimodal data for diagnosing failures in microservice systems faces two major challenges:
1) Representation of multimodal data: The formats of metrics, logs, and traces are significantly different from each other. Service instance metrics are often in the form of time series (the bottom of Fig. 1), while logs are usually semi-structured text (the middle of Fig. 1) and traces often take the form of tree structures with spans as nodes (the top of Fig. 1). It is challenging to find a unified representation of all this multimodal data that fully utilizes the complementary information from each data type.
2) Imbalanced failure types: Fault tolerance mechanisms in microservice systems often result in a high ratio of normal data to failure-related data. Some types of failures are much rarer than others, leading to an imbalance in the ratio of different types of failures (Table I).

To tackle the above challenges, we present DiagFusion, an automated failure diagnosis approach that integrates trace, log, and metric data. To form a unified representation of the three modalities with different formats and natures, DiagFusion combines lightweight preprocessing and representation learning, which maps data from different modalities into the same vector space. Since the labeled failures are usually inadequate to train the representation model effectively, we propose a data augmentation mechanism, which helps DiagFusion learn the correlation between the three modalities and failures effectively. To further enhance the accuracy of our diagnosis, DiagFusion uses historical failure patterns to train a Graph Neural Network (GNN), capturing both spatial features and possible failure propagation paths, which allows DiagFusion to conduct root cause instance localization and failure type determination.

Our contributions are summarized as follows:
• We propose DiagFusion, a multimodal data-based approach for failure diagnosis (Section IV). DiagFusion builds a dependency graph from trace and deployment data to capture possible failure propagation paths. Then it applies a GNN to achieve a two-fold failure diagnosis, i.e., root cause instance localization and failure type determination. To the best of our knowledge, we are among the first to learn a unified representation of the three modalities (i.e., trace, log, and metric) for the failure diagnosis of microservice systems.
• We leverage data augmentation to improve the quality of the learned representation, which allows DiagFusion to work with limited labeled failures and imbalanced failure types.
• We conduct extensive experiments on two datasets, one from an open-source platform and another from a real-world microservice system (Section V). The results show that when DiagFusion is trained on 160 and 80 cases, it achieves Avg@5 of 0.75 and 0.76 on the two datasets, respectively, improving the accuracy of root cause instance localization by 20.9% to 368%. Moreover, DiagFusion achieves F1-scores of 0.84 and 0.80, improving the accuracy of failure type determination by 11.0% to 169%.

Our implementation of DiagFusion is publicly available.¹

¹https://anonymous.4open.science/r/DiagFusion-378D

The rest of the paper is organized as follows: Section II introduces the necessary background. Section III presents the results of an empirical study of failures in microservice systems. Section IV describes the overview and detailed implementation of DiagFusion in failure diagnosis. In Section V, we evaluate the performance and time efficiency of DiagFusion using two datasets. Section VI discusses the technical rationale, robustness, and threats to validity. Section VII presents the related work in failure diagnosis. Section VIII concludes the paper.
II. BACKGROUND

A. Microservice Systems and Multimodal Data

Microservice systems allow developers to independently develop and deploy functional software units (microservices). For example, when a user tries to buy an item on an online shopping website, the user will experience item searching, item displaying, order generation, payment, etc. Each of these functions is served by a specific microservice. A failure at a specific service instance can propagate to other service instances in many ways, bringing cascading failures. However, diagnosing online failures in microservice systems is difficult due to these systems' highly complex orchestration and dynamic interaction. To accurately find the cause of a failure, operators must carefully monitor the system and record traces, logs, and metrics. These three modalities of monitoring data stand as the three pillars of the observability of microservice systems. The collection and storage of instances' monitoring data are not in the scope of this paper. The three modalities (trace, log, and metric) and their roles in failure diagnosis are described below.

Trace: Traces record the execution paths of users' requests. Fig. 1 shows an example of a trace at the top. Google formally proposed the concept of traces in Dapper [15], which defined the whole lifecycle of a request as a trace and the invocation and answering of a component as a span. By examining traces, operators may identify microservices that have possibly gone wrong [4], [6], [16], [17], [18], [19], [20], [21]. Traces can be viewed as trees, with microservices as nodes and invocations as edges. Each subtree corresponds to a span. Typically, traces carry information about invocations, e.g., start time, caller, callee, response time, and status code.
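Concretely, a span can be thought of as a small record carrying these invocation fields, linked into a tree by its parent span. The following is a minimal illustrative sketch; the field names are our own, not a schema prescribed by the paper:

```python
from dataclasses import dataclass

# A minimal sketch of one span record, assuming the fields named above.
@dataclass
class Span:
    trace_id: str         # groups spans belonging to one user request
    span_id: str          # identifies this invocation (a subtree of the trace)
    parent_span_id: str   # links the span into the trace tree ("" for the root)
    start_time: float     # epoch seconds
    caller: str           # invoking service instance
    callee: str           # invoked service instance
    response_time: float  # seconds
    status_code: int      # e.g., 200 or 500

# A trace is the tree formed by parent_span_id links, e.g.:
spans = [
    Span("t1", "s1", "", 10.0, "client", "frontend", 0.30, 200),
    Span("t1", "s2", "s1", 10.1, "frontend", "billing", 0.25, 500),
]
```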
Log: Logs record comprehensive events of a service instance. Some examples of logs are shown in the middle of Fig. 1. Logs are generated by developers using commands like printf, logging.debug, and logging.error. They provide an internal picture of a service instance. By examining logs, operators may discover the actual cause of why an instance is not performing well. Typically, logs consist of three fields: timestamp, verbosity level, and raw message [22]. Four commonly used verbosity levels, i.e., INFO, WARN, DEBUG, and ERROR, indicate the severity of a log message. The raw message of a log conveys detailed information about the event. To utilize logs more effectively, researchers have proposed various parsing techniques to extract templates and parameters, e.g., FT-Tree [23], Drain [22], POP [24], MoLFI [25], Spell [26], and Logram [27].

Metric: Various system-level metrics (e.g., CPU utilization, memory utilization) and user-perceived metrics (e.g., average response time) are configured for monitoring system instances. Each metric is collected at a predefined interval, forming a time series, as shown at the bottom of Fig. 1. These metrics track various aspects of performance issues. By examining metrics, operators can determine which physical resource is anomalous or is the bottleneck [28], [29], [30], [31], [32], [33].

In addition to trace, log, and metric, deployment data is also important to failure diagnosis. A microservice system comprises many hardware and software assets that form complicated interrelationships. Operators must carefully record these relationships (a.k.a. deployment data) to keep the system highly maintainable. Leveraging deployment data enables the understanding of failure propagation paths and characteristics.

B. Preliminaries

Representation Learning: Representation learning has been widely used in natural language processing tasks, usually in the form of word embedding. Popular techniques of representation learning include static representations like word2vec [34], GloVe [35], and fastText [36], and dynamic representations like ELMo [37], BERT [38], and GPT [39]. Given the similarities between logs and natural languages, representation learning can be applied to extract log features [40]. Compared to word2vec and GloVe, fastText can utilize more information [36]. We therefore employ fastText to learn a unified representation of events from the multimodal data.

In essence, fastText is a neural network model that processes words as input and takes the output from the hidden layer (a vector of real numbers) as its representation. It can be trained in both supervised and unsupervised modes, but the supervised mode generally yields more accurate results due to its incorporation of label information. In the supervised training mode, the neural network is optimized by predicting the class of the document. Once the training is completed, fastText can be used to provide vectorized representations (i.e., embeddings) for any given input.

Graph Neural Network: GNNs can effectively model data from non-euclidean space, thereby being popular among fields with graph structures, e.g., social networks, biology, and recommendation systems. Popular GNN architectures include the Graph Convolution Network (GCN) [41], GraphSAGE [42], and the Graph Attention Network (GAT) [43]. GNNs apply graph convolutions, allowing nodes to utilize their own information and learn from their neighbors through message passing. There are numerous components in microservice systems that interconnect with each other. Thus the graph structure is suitable for modeling microservice systems, and we employ a GNN to learn the propagation patterns of historical failure cases.

C. Problem Statement

When a failure occurs, operators need to localize the root cause instance and determine what has happened to it to achieve timely failure mitigation. For large-scale microservice systems, the first task is a ranking problem: to rank the root cause instance higher than other instances. We use the term root cause instance localization to name this task (Task #1). The second task is a classification problem: to classify the failure into a predefined set of failure types. We use the term failure type determination to name this task (Task #2).

After each failure, operators will carefully conduct a post-failure analysis, labeling its root cause instance and its failure type. Additionally, chaos engineering can generate a large number of failure cases [44]. It can enlarge the number of failure cases and enrich the types of failures. We train DiagFusion based on these failure cases.
III. EMPIRICAL STUDY

Most existing failure diagnosis methods are based on single-modal data. However, these methods cannot fully capture the patterns of failed instances, leading to ineffective failure diagnosis. We conduct an empirical study on the Generic AIOps Atlas (GAIA)² dataset to show the ineffectiveness of these methods. The dataset is collected from a simulation environment consisting of 10 microservices, two database services (MySQL and Redis), and five host machines. The system serves mobile users and PC users. Operators injected five types of failures, including system failures (System stuck and Process crash) and service failures (Login failure, File missing, and Access denied). The failure injection record is provided along with the data.

²https://github.com/CloudWise-OpenSource/GAIA-DataSet

Table I lists some typical symptoms of failures. We can see that no modality alone can distinguish the patterns of these five types of failures. It also shows that traces, logs, and metrics may display different anomalous patterns when a failure occurs. Mining the correlation between multimodal data can provide operators with a more comprehensive understanding of failures. Besides, Table I shows that some failures occur much more frequently than others. For example, the total number of occurrences of Process crash, File missing, and Access denied (67) equals only 12% of the occurrences of Login failure (527).

To further understand the distribution of failure types in the production environment, we investigated N failures in a microservice system of Microsoft. Due to company policy, we have to hide some details of these failures. The failures of the studied system are recorded in the Incident Management System (IcM) of Microsoft, where a failure is handled centrally, including the detection, discussion, mitigation, and post-failure analysis of failures. The IcM data of failures are persistently stored in a database. We query the failure records from the database within the time range from August 2021 to August 2022. We only keep the failures with the status of "completed", for their post-failure analyses have been reviewed. In the root cause field of the post-failure analysis, operators categorize the failures into the following types: code, data, network, hardware, and external. We can see from Fig. 2 that different failure types are imbalanced regarding the number of failure cases. The imbalanced data poses a significant challenge because most machine learning methods perform poorly on failure types with fewer occurrences.

Fig. 2. The distribution of failure types at a large-scale real-world microservice system.

IV. APPROACH

A. Design Overview

In this article, we propose DiagFusion, which combines the modalities of trace, log, and metric for accurate failure diagnosis. The training framework of DiagFusion is summarized in Fig. 3. First, DiagFusion extracts events from raw trace, log, and metric data and serializes them by their timestamps. Then, we train a neural network to learn the distributed representation of events by encoding events into vectors. The challenge of data imbalance is overcome through data augmentation during model training. We unify the three modalities with different natures by turning unstructured raw data into structured events and vectors. Then we combine traces with deployment data to build a dependency graph (DG) of the microservice system. After that, the representations of events and the DG are glued together by a GNN. We train the GNN using historical failures to learn the propagation pattern of system failures.

Fig. 3. The training framework of DiagFusion.

After the training stage, we save the event embedding model and the GNN. Fig. 5 depicts the real-time failure diagnosis framework of DiagFusion. The trigger of DiagFusion can be alerts generated through predefined rules. When a new failure is alerted, DiagFusion will perform a real-time diagnosis and give the results back to operators.

B. Unified Event Representation

DiagFusion unifies the three modalities by extracting events from the raw data and encoding them into vectors. Specifically, it collects failure-indicative events by leveraging effective and lightweight methods, including anomaly detection techniques for metrics and traces and template parsing techniques for logs. Then, it trains a fastText [36] model on the event sequences to generate embedding vectors of events.
First, we introduce the instances in a microservice system. Microservice systems have the advantage of dynamic deployment by utilizing the container technique. In this paper, we use the term instance to describe a running container and the term service group to describe the logical component that an instance belongs to. For example, Billing is a service group in a microservice system, and Billing_cff19b denotes an instance, where cff19b is the container id. Below we describe the event extraction from the different modalities.

Trace Event Extraction: Traces record calling relationships between services. We group trace data by its caller and callee services. DiagFusion examines multiple fields inside a trace group. Under different implementations of trace recording, trace data can carry different fields, e.g., response time and status code, which reflect different aspects of operators' interests. We apply an anomaly detection algorithm (i.e., 3-sigma) to numerical fields like response time to detect anomalous behaviors. For categorical fields like status code, we count the number of occurrences of each value. If the count of some value increases dramatically, we determine that this field is anomalous. We determine that a group of caller and callee is anomalous if one of its fields becomes anomalous. The extracted trace events are in the form of the tuple <timestamp, caller-instance-id, callee-instance-id>.

Log Event Extraction: Logs record detailed activities of an instance (service or machine). We perform log parsing for log event extraction using Drain [22], which has been proven to be effective in practice. Drain uses a fixed-depth parse tree to distinguish the template part and the variable part of log messages. For example, in the log message "uuid: 8fef9f0 information has expired, mobile phone login is invalid", "uuid: ****** information has expired, mobile phone login is invalid" is the template part, and "8fef9f0" is the variable part. After we get the template part of a log message, we hash the string of the template part to obtain an event template id. The extracted log events are in the form of the tuple <timestamp, instance-id, event-template-id>.
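For illustration, the open-source drain3 package provides one implementation of Drain; the sketch below assumes its default configuration, which is not necessarily the parser configuration used by the authors:

```python
import hashlib
from drain3 import TemplateMiner  # assumes the drain3 package is installed

# A minimal sketch of log event extraction with drain3's Drain parser.
miner = TemplateMiner()

def extract_log_event(timestamp, instance_id, raw_message):
    result = miner.add_log_message(raw_message)
    template = result["template_mined"]  # variable parts are masked
    # Hash the template string to obtain an event template id.
    template_id = hashlib.md5(template.encode()).hexdigest()[:8]
    return (timestamp, instance_id, template_id)

event = extract_log_event(
    1651000000, "dbservice1",
    "uuid: 8fef9f0 information has expired, mobile phone login is invalid")
```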
Metric Event Extraction: Metrics are also recorded at the instance level. We perform 3-sigma detection to find anomalous metrics. When the value of a metric exceeds the upper bound of 3-sigma, the anomaly direction is up. Similarly, the anomaly direction is down if the value is below the lower bound. The extracted metric events are in the form of the tuple <timestamp, instance-id, metric-name, anomaly-direction>.

Fig. 4. The event extraction and serialization process using traces, logs, and metrics.

The above extraction provides events from different modalities. Despite the differences in raw data, all extracted events share two fields, namely timestamp and instance-id. These are the keys to unifying the different modalities. We group events by instance-id and serialize the events in the same group by timestamp. Fig. 4 shows the event extraction and serialization process for one instance. The event sequence of instance i is denoted by E_i.
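A minimal sketch of the 3-sigma metric event extraction and the serialization step follows; the helper names and the assumption that `history` holds recent values of one metric for one instance are ours:

```python
import numpy as np

# 3-sigma detection: emit an event whenever a value leaves the 3-sigma band.
def metric_events(timestamps, values, history, instance_id, metric_name):
    mu, sigma = np.mean(history), np.std(history)
    events = []
    for ts, v in zip(timestamps, values):
        if v > mu + 3 * sigma:
            events.append((ts, instance_id, metric_name, "up"))
        elif v < mu - 3 * sigma:
            events.append((ts, instance_id, metric_name, "down"))
    return events

# Serialization: group events of all modalities by instance-id (element 1 of
# each tuple) and sort by timestamp (element 0) to obtain each sequence E_i.
def serialize(events):
    sequences = {}
    for ev in events:
        sequences.setdefault(ev[1], []).append(ev)
    return {inst: sorted(evs) for inst, evs in sequences.items()}
```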
After getting the event sequence of every instance, we further assign labels to every event sequence according to operators' post-failure analysis. Original failure labels are often in the form of the tuple <root cause instance-id, failure type>. To fully utilize the label information, we relabel event sequences in an instance-wise manner. Specifically, the root cause instance's event sequence is labeled by the actual failure type, while other instances' event sequences are labeled as "non-root-cause". A microservice system with p historical failures and q instances results in N = p × q event sequences after relabeling. Then, we learn unified representations from these relabeled historical event sequences using the event embedding model.

With event sequences and instance labeling, we can transform events into vectors. We use the term event embedding to describe the mapping of events to real-number vectors. Specifically, we train a fastText model on the event sequences to obtain the vectorized representation for events from all three modalities. FastText is a neural network originally proposed for text classification. For a document with word sequences, fastText extracts n-grams from it and predicts its label. In our scenario, we replace word sequences with event sequences and replace document labels with failure types. The training of fastText minimizes the negative log-likelihood over classes:

$$\min_{f} \; -\frac{1}{N} \sum_{n=1}^{N} y_n \log\left(f(x_n)\right) \qquad (1)$$

where $x_n$ is the normalized bag of features of the n-th event sequence, $y_n$ denotes the relabeled information, and $f$ is the neural network. We treat fastText's output as the vectorized representation of events. The training details of the event embedding model are described in Section IV-D.
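To make this step concrete, below is a minimal sketch using the fastText Python package (listed as a dependency in Section V-A). Each relabeled event sequence becomes one training line, with events written as opaque tokens under fastText's __label__ convention; the token names and file name are illustrative assumptions:

```python
import fasttext  # assumes the fasttext Python package

# Each line: the (re)label of an event sequence followed by its event tokens.
with open("event_sequences.txt", "w") as f:
    # e.g., a root cause instance's sequence labeled with its failure type...
    f.write("__label__access_denied log_3f2a trace_B_B1 metric_net_out_down\n")
    # ...and another instance's sequence relabeled as non-root-cause.
    f.write("__label__non_root_cause metric_mem_up\n")

# Supervised training minimizes the negative log-likelihood of (1).
model = fasttext.train_supervised(input="event_sequences.txt", dim=100)

# The hidden-layer output serves as the unified event representation.
vec = model.get_word_vector("metric_net_out_down")  # 100-dimensional vector
```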
C. Graph Neural Network

In the event representation process, DiagFusion captures the local features of instances. However, failures can propagate between instances, so we need a global picture of the system, i.e., how a failure will affect the system. To this end, we employ a GNN to learn the failure propagation between service instances and integrate all the information of the whole system.

To leverage a GNN, it is essential to consider both the nodes and the edges within a graph. The nodes in the GNN correspond to the instances in a microservice system. An instance is characterized by its anomalous events in DiagFusion. We represent an instance i by averaging all of its events:

$$h_i^{(0)} = \frac{1}{|E_i|} \sum_{\forall e \in E_i} V_1(e) \qquad (2)$$

where $E_i$ is the extracted event sequence, and $V_1(e)$ is the vectorized representation of event $e$ learned by the event embedding model.

The edges in the GNN correspond to the dependency graph of the microservice system. There are two dominant ways of propagating failures between services: function calling and resource contention [45]. So we combine traces and deployment data to capture probable failure propagation paths. Specifically, we aggregate traces to get a call graph. Then we add two directed edges for each pair of caller and callee, with one pointing from the caller to the callee and the other in the reverse direction. From deployment data, we add edges between two instances if they are co-deployed, i.e., sharing resources.

After obtaining the dependency graph and instance representations, we employ the GNN to learn the failure propagation pattern through its message-passing mechanism. At the K-th layer of the GNN, we apply topology adaptive graph convolution [46] and update the internal data of instances according to:

$$H^{K} = \sum_{k=0}^{K} \left( D^{-1/2} A D^{-1/2} \right)^{k} X \Theta_{k} \qquad (3)$$

where $A$ denotes the adjacency matrix, $D_{ii} = \sum_{j} A_{ij}$ is a diagonal degree matrix, and $\Theta_k$ denotes the linear weights that sum the results of different hops together.

Finally, we add a MaxPooling layer as the readout layer to integrate the information of the whole microservice system. Following the MaxPooling layer, there is a fully connected layer where each neuron corresponds to either a service group with possible root cause instances for task #1 or a failure type for task #2.

D. Training of DiagFusion

DiagFusion applies a two-phase training strategy to learn the failure pattern of a microservice system. First, it trains the event embedding model with data augmentation. Then it trains the GNN with a joint learning technique.

1) Training of Event Embedding Model: DiagFusion employs a data augmentation strategy to enrich the training dataset and reduce the model's bias towards the majority class. First, we train our event embedding model on the original data. The trained neural network, denoted by $f_0$, maps events to the vector space $V_0$. To increase the number of failure cases, we add new event sequences for each failure type (including "non-root-cause") by randomly taking an event sequence of that type and replacing one of the events with its closest neighbor (determined by Euclidean distance) in $V_0$. After all failure types are expanded to a relatively large size, e.g., 1000, we can obtain a more balanced training set. Further details on the choice of the expansion size can be found in Section V-E. Then we train the event embedding model again ($f_1$) on the expanded data and regard the representations generated in this round ($V_1$) as the final unified event representations.

2) Training of Graph Neural Network: We train the GNN in a joint learning fashion to fully utilize the shared information between tasks #1 and #2. Then we combine the trained GNN with a ranking strategy to better fit the nature of microservice systems.

Ranking Strategy: One of the advantages of microservice systems is that the architecture allows dynamic deployment of service instances. Thus, service instances are constantly being created and destroyed. However, when it comes to failure diagnosis, this kind of flexibility raises a challenge for learning-based methods. The failure diagnosis model will have to be retrained frequently if the output layer directly outputs the probability of being the root cause instance for each instance, since many instances can be created or destroyed after the model training is finished. We add an extra step in DiagFusion to overcome this challenge. Instead of directly determining the root cause instance, DiagFusion is trained on service groups, the logical aggregation of service instances, for task #1. Then DiagFusion ranks the instances inside a candidate service group by the length of their event sequences. The instance with more anomalous events will be ranked higher and is more likely to be the root cause instance.

Joint Learning: Intuitively, the two tasks of failure diagnosis, i.e., root cause instance localization and failure type determination, share some knowledge in common. For a given failure, the only difference between task #1 and task #2 lies in their labels. So DiagFusion integrates a joint learning mechanism to utilize the shared knowledge and reduce the training time. (Training two models separately would otherwise require twice the time.) Specifically, the joint loss function is:

$$-\frac{1}{F} \sum_{i=1}^{F} \left( \sum_{j=1}^{S} y(s)_{i,j} \log p(s)_{i,j} + \sum_{k=1}^{T} y(t)_{i,k} \log p(t)_{i,k} \right) \qquad (4)$$

where $F$ is the number of historical failures, $S$ is the number of service groups, $T$ is the number of failure types, $y(s)$ is the root cause service group labeled by operators, $y(t)$ is the failure type, $p(s)$ is the predicted service group, and $p(t)$ is the predicted failure type.
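To make Sections IV-C and IV-D concrete, below is a minimal sketch of the dependency graph construction, a TAGConv-based network with the two output heads, the joint loss of (4), and the ranking step, using DGL and PyTorch (both listed as dependencies in Section V-A). The toy inputs, layer sizes, and names are illustrative assumptions, not the authors' exact code:

```python
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import TAGConv, MaxPooling

# Dependency graph: bidirectional call edges plus co-deployment edges.
instances = ["A1", "B1", "B2", "C1"]                 # illustrative
idx = {name: i for i, name in enumerate(instances)}
call_pairs = [("A1", "B1"), ("B1", "C1")]            # from aggregated traces
co_deployed = [("B1", "B2")]                         # from deployment data
src, dst = [], []
for u, v in call_pairs + co_deployed:
    src += [idx[u], idx[v]]                          # one edge per direction
    dst += [idx[v], idx[u]]
graph = dgl.graph((torch.tensor(src), torch.tensor(dst)),
                  num_nodes=len(instances))

# GNN: topology adaptive graph convolution as in (3), MaxPooling readout,
# and two heads for service groups (task #1) and failure types (task #2).
class DiagNet(nn.Module):
    def __init__(self, in_dim, hid_dim, n_groups, n_types, hops=2):
        super().__init__()
        self.conv = TAGConv(in_dim, hid_dim, k=hops)
        self.readout = MaxPooling()
        self.group_head = nn.Linear(hid_dim, n_groups)
        self.type_head = nn.Linear(hid_dim, n_types)

    def forward(self, g, x):
        # x: one row per instance, the average of its event vectors as in (2)
        h = torch.relu(self.conv(g, x))
        hg = self.readout(g, h)                      # whole-system vector
        return self.group_head(hg), self.type_head(hg)

model = DiagNet(in_dim=100, hid_dim=64, n_groups=3, n_types=5)
x = torch.randn(len(instances), 100)                 # stand-in for (2)
group_logits, type_logits = model(graph, x)

# Joint loss of (4): the sum of the two cross-entropy terms.
y_group, y_type = torch.tensor([1]), torch.tensor([4])   # operator labels
loss = (F.cross_entropy(group_logits, y_group)
        + F.cross_entropy(type_logits, y_type))

# Ranking: within the predicted group, longer event sequences rank higher.
def rank_instances(group_members, sequences):
    return sorted(group_members, key=lambda m: len(sequences.get(m, [])),
                  reverse=True)
```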
E. Real-Time Failure Diagnosis

After the training stage, we save the trained event embedding model and the GNN. When a new failure is alerted, DiagFusion performs a real-time diagnosis process as shown in Fig. 5.

Fig. 5. Real-time failure diagnosis.

1) Running Example: Fig. 6 shows how DiagFusion can be integrated with microservice systems. To better explain how DiagFusion diagnoses a failure, we demonstrate the workflow of DiagFusion using one real-world failure from D1. At 10:46, service instance B1 encounters a failure of access denied. Fig. 7 shows the original data, the event sequence, and the DG. From Fig. 7(a), we can see that failure-indicative events from different modalities are temporally intertwined. Then the GNN predicts service group "B" and failure type "access denied". Further ranking within the service group "B" gives "B1" as the Top1 instance. The overall process takes less than 10 seconds. Thus, DiagFusion effectively addresses tasks #1 and #2.

Fig. 6. Integration of DiagFusion with a microservice system.

V. EVALUATION

In this section, we evaluate the performance of DiagFusion using two real-world datasets. We aim to answer the following research questions (RQs):
RQ1: How effective is DiagFusion in failure diagnosis?
RQ2: Does each component of DiagFusion have significant contributions to DiagFusion's performance?
RQ3: Is the computational efficiency of DiagFusion sufficient for failure diagnosis in the real world?
RQ4: What is the impact of different hyperparameters?

A. Experimental Setup

1) Dataset: To evaluate the performance of DiagFusion, we conduct extensive experiments on two datasets collected from two microservice systems under different business backgrounds and architectures, D1 and D2. To prevent data leakage, we split the data of D1 and D2 into training and testing sets according to their start time, i.e., we use data from the earlier time as the training set and data from the later time as the test set. Detailed information is listed in Table II. The systems that produce D1 and D2 are as follows:

TABLE II
DETAILED INFORMATION OF DATASETS

1) D1. The details of D1 are elaborated in Section III.
2) D2. The second dataset is collected from the management system of a top-tier commercial bank. The studied system consists of 14 instances, including microservices, web servers, application servers, databases, and dockers. Due to a non-disclosure agreement, we cannot make this dataset publicly available. Two experienced operators examined the failure records from January 2021 to June 2021. They classified the failures into five types, i.e., CPU-related failures, memory-related failures, JVM-CPU-related failures, JVM-memory-related failures, and IO-related failures. The classification was done separately, and they checked the labeling with each other to reach a consensus.

2) Baseline Methods: We select six advanced single-modal-based methods (two for trace (i.e., MicroHECL [5], MicroRank [6]), two for log (i.e., Cloud19 [8], LogCluster [7]), and two for metric (i.e., AutoMAP [13], MS-Rank [12])) and two multimodal-based methods (i.e., PDiagnose [47], CloudRCA [48]) as the baseline methods. More details can be found in Section VII. Among the baseline methods, Cloud19, LogCluster, and CloudRCA cannot address Task #1 (root cause instance localization), while MicroHECL, MicroRank, AutoMAP, MS-Rank, and PDiagnose cannot address Task #2 (failure type determination). Therefore, we divide the baseline methods into two groups to evaluate the performance on Task #1 and Task #2, respectively: MicroHECL, MicroRank, AutoMAP, MS-Rank, and PDiagnose for Task #1; Cloud19, LogCluster, and CloudRCA for Task #2.

We configure the parameters of all these methods according to their papers. Specifically, we use the same configuration for parameter settings explicitly mentioned in the papers and not limited to a particular dataset (e.g., significance level, feature dimension). For parameter settings that apply to a particular dataset (e.g., window length, period), we adapt them according to the range the papers provide or to our data.

3) Evaluation Metrics: As stated in Section II-C, DiagFusion aims to localize the root cause instance and determine the failure type. We carefully select different evaluation metrics for both tasks to better reflect the real-world performance of all selected methods.
Fig. 7. A running example of DiagFusion. (a) the serialized multimodal event sequence of the root cause instance (B1); (b) the original data corresponding to the event sequence; (c) part of the dependency graph in this failure.

For Task #1, we use Top-k accuracy (A@k) and Top-5 average accuracy (Avg@5) as the evaluation metrics. A@k is a well-adopted metric that quantifies the probability that the top-k instances output by each method indeed contain the root cause instance [5]. Formally, given $A$ as the test set of failures, $RC_i^a$ as the ground-truth root cause instance of failure $a$, and $RCs^a[k]$ as the top-k root cause instance set generated by a method, A@k is defined as:

$$A@k = \frac{1}{|A|} \sum_{a \in A} \begin{cases} 1, & \text{if } RC_i^a \in RCs^a[k] \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

Avg@5 is another popular metric that evaluates a method's overall capability of localizing the root cause instance [49]. In practice, operators often examine the top 5 results. Avg@5 is calculated by:

$$Avg@5 = \frac{1}{5} \sum_{1 \leq k \leq 5} A@k \qquad (6)$$
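A minimal sketch of (5) and (6) follows; `results` maps each test failure to its ranked instance list and `truth` maps it to the ground-truth root cause instance (both names are ours):

```python
# Top-k accuracy (5): the fraction of failures whose true root cause
# instance appears among the top-k ranked instances.
def a_at_k(results, truth, k):
    hits = sum(1 for a in results if truth[a] in results[a][:k])
    return hits / len(results)

# Top-5 average accuracy (6): the mean of A@1 through A@5.
def avg_at_5(results, truth):
    return sum(a_at_k(results, truth, k) for k in range(1, 6)) / 5
```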
For Task #2, which is a multi-class classification problem, we use the weighted average precision, recall, and F1-score to test the performance. These metrics have been selected based on a previous study [50] as a reliable way to assess the model's effectiveness in this specific context. With True Positives (TP), False Positives (FP), and False Negatives (FN), the calculation is given by $\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$, where $\text{precision} = \frac{TP}{TP + FP}$ and $\text{recall} = \frac{TP}{TP + FN}$.
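These weighted-average metrics can be computed directly with scikit-learn (listed as a dependency below); the labels here are illustrative:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = ["cpu", "memory", "io", "cpu"]   # illustrative failure-type labels
y_pred = ["cpu", "memory", "cpu", "cpu"]
# average="weighted" weights each class by its support, as described above.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
```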

4) Implementation: We implement DiagFusion and the baselines with Python 3.7.13, PyTorch 1.10.0, scikit-learn 1.0.2, fastText 0.9.2, and DGL 0.9.0. We run the experiments on a server with 12 × Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20 GHz and 128 GB RAM (without GPUs). We repeat every experiment five times and take the average result to reduce the effect of randomness.

B. Overall Performance (RQ1)

To demonstrate the effectiveness of DiagFusion, we compare it with the baseline methods on Task #1 and Task #2.

The comparison result of Task #1 is shown in Fig. 8. DiagFusion achieves the best performance. Specifically, A@1 to A@5 of DiagFusion are almost the best on D1 and D2. More specifically, the Avg@5 of DiagFusion exceeds 0.75 on both D1 and D2. It is at least 0.13 higher on both datasets than the baselines using single-modal data, due to the advantage of using multimodal data. Compared with PDiagnose, which also uses multimodal data, the Avg@5 of DiagFusion is higher by at least 0.18. This indicates that learning from historical failures improves the accuracy of diagnosis significantly.

TABLE III
EFFECTIVENESS OF FAILURE TYPE DETERMINATION (TASK #2)

The result of Task #2 is shown in Table III. For this task, DiagFusion is better than almost all baselines. On D1, the precision, recall, and F1-score of DiagFusion are over 0.80. On D2, DiagFusion manages to maintain an F1-score of 0.80, which is at least 0.195 higher than the baselines. Considering both systems and tasks, DiagFusion consistently demonstrates superior performance, thereby substantiating its effectiveness.

C. Ablation Study (RQ2)

To evaluate the effects of the three key technical contributions of DiagFusion, i.e., 1) data augmentation, 2) fastText embedding, and 3) the DG and GNN, we create five variants of DiagFusion. C1: Remove the data augmentation. C2: Use word2vec embedding instead of fastText. C3: Use GloVe embedding instead of fastText. C4: Replace the GNN output layer with a decision tree. C5: Replace the GNN output layer with a kNN model.

Table IV shows that DiagFusion outperforms all the variants on D1 and D2, demonstrating each component's significance.
Fig. 8. Effectiveness of root cause instance localization (Task #1).

TABLE IV
CONTRIBUTIONS OF COMPONENTS

When removing the data augmentation (C1), the performance drops across the board, as models trained on imbalanced data are more likely to bias predictions toward classes with more samples. Data augmentation can alleviate this problem. The performance becomes worse when replacing the fastText embedding strategy (C2 & C3). The reason is that fastText can learn from operators' failure labeling as well as co-occurrence relationships, while word2vec and GloVe can only learn from the co-occurrence relationships between events. Replacing the GNN output layer with classifiers such as decision trees and kNN (C4 & C5) degrades performance because the GNN can capture the interaction patterns and fault propagation among instances in microservice systems, but traditional classifiers cannot understand the graph structure information.

D. Efficiency (RQ3)

We record the running time of all methods and compare them in Table V. The offline training time of DiagFusion is acceptable, particularly when considering its infrequent need for retraining. The table shows that DiagFusion can diagnose one failure within 12 seconds on average online, which means it can achieve quasi-real-time diagnosis, because the interval of data collection in D1 and D2 is at least 30 seconds. Although DiagFusion may not possess apparent advantages among the methods in Table V, it can meet the needs of online diagnosis.

TABLE V
THE COMPARISON OF TRAINING TIME (OFFLINE) AND DIAGNOSIS TIME (ONLINE) PER CASE ("–" MEANS NO TRAINING NEEDED)

E. Hyperparameter Sensitivity (RQ4)

We discuss the effect of four hyperparameters of DiagFusion. Fig. 9 shows how Avg@5 (Task #1) and the F1-score (Task #2) change with different hyperparameters.

Embedding Dimension: The performance of DiagFusion reacts differently on different datasets in terms of sensitivity to dimensionality (D1 remains stable while D2 fluctuates more), and the optimal dimensionality is inconsistent across datasets and tasks. We choose 100 dimensions in our experiments because it has the best overall performance.
Fig. 9. The effectiveness of DiagFusion under different hyperparameters.

The Number of Augmented Samples: The experiments in Section V-B show that data augmentation improves the model's performance to some extent. However, once the number of samples increases beyond a certain amount, the information in the training set has already been fully utilized. Instead, the performance may degrade due to the excessive introduction of noise. Generally speaking, DiagFusion does not need an excessive number of augmented samples as long as the samples are balanced.

The Number of Layers in the GNN: As the layer number of the GNN varies from 1 to 5, the performance of DiagFusion shows a decreasing trend. The model performs best when the layer number is lower than 3. We do not recommend setting the layer number too large, since training a deep GNN requires extra training samples, which is hard to meet in real-world microservice systems.

Time Window: The length of the time window has little impact on performance because the moments when failures occur are sparse, and the anomaly events reported in a time window are only relevant to the current failure. With accurate anomaly detection, the performance of DiagFusion is stable.

VI. DISCUSSION

A. Why Learning-Based Methods?

The DiagFusion approach incorporates several learning-based techniques, such as fastText in the unified event representation (Section IV-B) and the GNN (Section IV-C). By doing so, DiagFusion significantly outperforms baseline approaches. We chose to build DiagFusion using learning-based methods for the following reasons: 1) Accuracy: learning-based methods provide high accuracy (Section V) and are therefore ideal for diagnosing failures. 2) Generalization ability: the failure cases used to train DiagFusion contain different patterns of failure propagation for different systems. A strong generalization ability allows DiagFusion to perform robust diagnosis for each system. 3) Ability to handle complicated data: as microservice systems become increasingly complex and monitoring data more high-dimensional, manually setting up rules for failure diagnosis becomes time-consuming and error-prone. Learning-based methods, on the other hand, take this data as input and learn their relationships, making them well-suited to handle complicated data.

Why FastText? FastText was chosen because trace, log, and metric data have very different formats. However, they all share timestamps, meaning they can be sequenced according to their temporal order. FastText provides superior performance over other static embeddings like word2vec and GloVe, as demonstrated in Section V-C. Although deep dynamic embeddings like ELMo, BERT, and GPT are popular in natural language processing, they are not suitable for microservice settings, as the number of failure cases is insufficient to train these large models.

Why GNN? The GNN was chosen because the structure of microservice systems involves many instances and their relationships, which form a graph. Various approaches incorporating Random Walk [12], [13] exist to accomplish failure diagnosis on such graph structures. However, their ability to generalize is limited, since domain knowledge can vary greatly between different systems. The domain knowledge contained in graph data can be effectively learned by GNNs [51], giving them a stronger generalization ability than approaches based on Random Walk.

Concerns About Learning-Based Methods: While learning-based methods offer several advantages, they do require labeled samples for training. This can be addressed by 1) utilizing the well-established failure management system in microservice systems as a natural source of failure labeling, 2) noting that DiagFusion does not require many training samples to achieve good performance (the sizes of the training sets of D1 and D2 are 160 and 80, respectively), and 3) the increasing adoption of chaos engineering, which enables operators to quickly obtain sufficient failure cases. Several successful practices with the help of chaos engineering have been reported [2], [6], [16], [18].

B. Robustness

In practice, some modalities can be absent, hindering a successful failure diagnosis system to some extent. The causes of missing modalities can generally be classified into three categories. The first category refers to missing modalities caused by data collection problems.
ZHANG et al.: ROBUST FAILURE DIAGNOSIS OF MICROSERVICE SYSTEM THROUGH MULTIMODAL DATA 3861

TABLE VI modality. Besides, the GNN module deals with feature vectors
ROBUSTNESS COMPARED TO PDIAGNOSE (TASK #1)
rather than original monitor data. DiagFusion can work given
that any two of the three modalities are available.
There are two main threats to the validity of the study. The
first one lies in the limited sizes of the two datasets used in the
study. D1 and D2 are relatively smaller than complex industrial
microservice systems. The second one lies in the limitation
of the failure cases used in the study. Some failure cases of
D1 are simpler than industrial failures and represent only a
by data collection problems. Modern microservice systems are
limited part of different types of failures. However, according to
developing rapidly; the same truth applies to their monitoring
our experiments, DiagFusion is effective and robust. It is very
agents. Therefore, it is hard to guarantee that all monitoring data
promising that DiagFusion can also be effectively applied to
are ideally collected and transmitted. As a result, missing data
much larger industrial microservice systems and more complex
is inevitable, which can give rise to missing modalities when
failure cases.
specific modalities of the monitoring data are having collection
problems. The second category refers to missing modalities
caused by data availability problems. In some large corporations, VII. RELATED WORK
monitoring data is individually collected by many different
Metric-Based Failure Diagnosis Methods: Monitoring met-
divisions. Sometimes, specific modalities can be exclusively
rics are one of the most important observable data in microser-
governed by a division that does not want to disclose its service
vice systems. Many works try to build a dependency graph to de-
maintenance data. Thus, these modalities are collected but not
pict the interaction between system components during failure,
available to general operators. The third category stands for
such as Microscope [11], MS-Rank [12], and AutoMAP [13].
missing modalities caused by data retrieval problems. In prac-
However, the correctness of the above works heavily depends
tice, we often encounter situations where it is very inconvenient
on the parameter settings, which degrades their applicability.
to retrieve monitoring data from the data pool. Multimodal
Besides, many methods extract features from system failures,
failure diagnosis requires much more data to be collected than
such as Graph-RCA [52] and iSQUAD [50]. Nonetheless, failure
single-modal-based methods and may face missing modality
cases are few in microservice systems because operators try to
problems. However, an excellent multimodal-based approach
run the system as robustly as possible, severely affecting the
should perform well even when some modalities are missing. We
performance of these feature-based methods.
discover that 62 failure cases of D1 lack metric data. DiagFusion
Trace-Based Failure Diagnosis Methods: Trace can be used
is compared with PDiagnose in these cases. As PDiagnose
to localize the culprit service, for example, TraceRCA [4],
cannot address Task #2, we only present the results of Task #1.
MEPFL [18], MicroHECL [5], and MicroRank [6]. However,
As shown in Table VI, the performance of PDiagnose drops
these trace-based methods often focus on the global feature of
dramatically in these cases, while DiagFusion presents salient
the systems and do not deal with the local features of a service
robustness. Although DiagFusion also witnesses a performance
instance.
degradation, it is still better than PDiagnose and other Task
Log-Based Failure Diagnosis Methods: LogCluster [7] per-
#1 baselines. DiagFusion has seen complete data modalities
forms hierarchical clustering on log sequences and matches
during training and learned a unified representation, allowing
online log sequences to the most similar cluster. Cloud19 [8] ap-
it to capture anomalous patterns’ correlation to failures better
plies word2vec to construct the vectorized representation of a log
than single-modal-based methods. On the other hand, PDiagnose
item and trains classifiers to identify the failure type. Onion [9]
treats each modality independently, making it ineffective when
performs contrast analysis on agglomerated log cliques to find
facing missing modalities. To sum up, DiagFusion demonstrates
incident-indicating logs. DeepLog [10] and LogFlash [53] inte-
robustness since it achieves satisfactory performance even when
grate anomaly detection and failure diagnosis. They calculate
working with data with incomplete modalities.
C. Concerns About Deployment and Validity

There are some concerns about deploying DiagFusion to real-world microservice systems: 1) DiagFusion needs to adapt to the highly dynamic nature of microservice architecture. The stored model of DiagFusion can still be effective when service instances are created or destroyed, for DiagFusion utilizes the concept of service group as a middle layer. The only situation in which DiagFusion needs to be retrained is when new service groups are created. However, the creation of service groups is very rare in practice. 2) Some production systems do not monitor all three modalities at the same time. The workflow of DiagFusion is general because the event embedding model is trained on event sequences and does not rely on any specific modality.
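To make concern 1) concrete, the snippet below sketches the service-group indirection; the instance-naming rule and both helper functions are hypothetical, introduced here only for illustration.

from typing import Set

def service_group(instance_id: str) -> str:
    # Assume instances are named "<group>-<replica>", e.g., "cart-7f9c2".
    # The model reasons at the group level, so replica churn is invisible.
    return instance_id.rsplit("-", 1)[0]

def needs_retraining(instance_id: str, known_groups: Set[str]) -> bool:
    # Retraining is needed only when a previously unseen group appears.
    return service_group(instance_id) not in known_groups

known = {"cart", "payment", "shipping"}
print(needs_retraining("cart-9d41f", known))     # False: new replica of an old group
print(needs_retraining("recommend-001", known))  # True: a brand-new service group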
VII. RELATED WORK

Failure cases are few in microservice systems because operators try to run the system as robustly as possible, severely affecting the performance of these feature-based methods.

Trace-Based Failure Diagnosis Methods: Traces can be used to localize the culprit service, for example, TraceRCA [4], MEPFL [18], MicroHECL [5], and MicroRank [6]. However, these trace-based methods often focus on the global features of the system and do not deal with the local features of a service instance.

Log-Based Failure Diagnosis Methods: LogCluster [7] performs hierarchical clustering on log sequences and matches online log sequences to the most similar cluster. Cloud19 [8] applies word2vec to construct the vectorized representation of a log item and trains classifiers to identify the failure type. Onion [9] performs contrast analysis on agglomerated log cliques to find incident-indicating logs. DeepLog [10] and LogFlash [53] integrate anomaly detection and failure diagnosis: they calculate the deviation from normal status and suggest the root cause accordingly. Log-based methods often ignore the topological features of microservice systems.

Multimodal Data-Based Failure Diagnosis Methods: Recently, combining multimodal data to conduct failure diagnosis has drawn increasing attention. CloudRCA [48] uses both metrics and logs. It uses the PC algorithm to learn the causal relationships between anomaly patterns of metrics, anomaly patterns of logs, and types of failures, and then constructs a hierarchical Bayesian network to infer the failure type. PDiagnose [47] combines metrics, logs, and traces. It uses lightweight anomaly detection on the three modalities to detect anomaly patterns; then its vote-based strategy selects the most severe component as the root cause. However, these two methods ignore the topological features of microservice systems. Groot [54] integrates metrics, status logs, and developer activity. It needs numerous predefined rules to conduct accurate failure diagnosis, which degrades its applicability to most scenarios.
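For intuition, the toy snippet below paraphrases the vote-based strategy described above; it is our reconstruction of the idea, not the PDiagnose implementation, and it also exposes the weakness discussed earlier: modalities are treated independently, so a missing modality silently loses its votes.

from collections import Counter
from typing import Dict, List

def vote_root_cause(votes_by_modality: Dict[str, List[str]]) -> str:
    # Each modality's anomaly detector nominates suspicious components;
    # the component with the most votes is reported as the root cause.
    tally = Counter()
    for suspects in votes_by_modality.values():
        tally.update(suspects)
    return tally.most_common(1)[0][0]

votes = {
    "metric": ["payment", "cart"],
    "log": ["payment"],
    "trace": ["payment", "shipping"],
}
print(vote_root_cause(votes))  # "payment"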
TABLE VII
COMPARISON OF DIAGFUSION AND EXISTING REPRESENTATIVE APPROACHES

We compare DiagFusion and existing representative approaches in Table VII. In conclusion, compared to single-modal-based methods, DiagFusion takes all three important modalities into account. Compared to existing multimodal-based methods, DiagFusion is among the first to represent different modalities in a unified manner, thus performing more robustly and accurately.
VIII. CONCLUSION

Failure diagnosis is of great importance for microservice systems. In this paper, we first conduct an empirical study to illustrate the importance of using multimodal data (i.e., traces, metrics, and logs) for failure diagnosis of microservice systems. We then propose DiagFusion, an automatic failure diagnosis method, which first extracts events from the three modalities of data and applies fastText embedding to unify the events from different modalities. During training, DiagFusion leverages data augmentation to tackle the challenge of data imbalance. It then constructs a dependency graph by combining trace and deployment data. Moreover, DiagFusion integrates the event embeddings and the dependency graph through a GNN. Finally, the GNN reports the root cause instance and the failure type of an online failure. We evaluate DiagFusion using two real-world datasets. The evaluation results confirm the effectiveness and efficiency of DiagFusion.
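To recap how the pieces fit together, the self-contained toy below propagates per-instance event vectors over a small dependency graph with one round of mean aggregation, in the spirit of GraphSAGE [42], and then ranks instances with a trivial score. It illustrates the integration step only; it is not the trained GNN or classification head used by DiagFusion.

from typing import Dict, List

def aggregate(graph: Dict[str, List[str]], feats: Dict[str, List[float]]) -> Dict[str, List[float]]:
    # One message-passing round: each node's new feature vector is the
    # mean of its own and its neighbors' feature vectors.
    new_feats = {}
    for node, vec in feats.items():
        neighborhood = [vec] + [feats[n] for n in graph.get(node, [])]
        new_feats[node] = [sum(vals) / len(neighborhood) for vals in zip(*neighborhood)]
    return new_feats

# Toy dependency graph (edges would come from traces plus deployment
# data) and per-instance event embeddings.
graph = {"web": ["cart"], "cart": ["db"], "db": []}
feats = {"web": [0.1, 0.0], "cart": [0.9, 0.2], "db": [0.8, 0.7]}
hidden = aggregate(graph, feats)
# Rank instances by L1 norm as a stand-in for the learned head that
# reports the root cause instance.
print(max(hidden, key=lambda n: sum(abs(v) for v in hidden[n])))  # "db"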
REFERENCES

[1] X. Guo et al., “Graph-based trace analysis for microservice architecture understanding and problem diagnosis,” in Proc. 28th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2020, pp. 1387–1397.
[2] X. Zhou et al., “Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study,” IEEE Trans. Softw. Eng., vol. 47, no. 2, pp. 243–260, Feb. 2021.
[3] AWS, “Summary of the AWS service event in the Northern Virginia (US-EAST-1) region,” 2021. [Online]. Available: https://aws.amazon.com/cn/message/12721/
[4] Z. Li et al., “Practical root cause localization for microservice systems via trace analysis,” in Proc. IEEE/ACM 29th Int. Symp. Qual. Serv., 2021, pp. 1–10.
[5] M. Jin et al., “An anomaly detection algorithm for microservice architecture based on robust principal component analysis,” IEEE Access, vol. 8, pp. 226397–226408, 2020.
[6] G. Yu et al., “MicroRank: End-to-end latency issue localization with extended spectrum analysis in microservice environments,” in Proc. Web Conf., 2021, pp. 3087–3098.
[7] Q. Lin et al., “Log clustering based problem identification for online service systems,” in Proc. 38th Int. Conf. Softw. Eng. Companion, 2016, pp. 102–111.
[8] Y. Yuan, W. Shi, B. Liang, and B. Qin, “An approach to cloud execution failure diagnosis based on exception logs in OpenStack,” in Proc. IEEE 12th Int. Conf. Cloud Comput., 2019, pp. 124–131.
[9] X. Zhang et al., “Onion: Identifying incident-indicating logs for cloud systems,” in Proc. 29th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2021, pp. 1253–1263.
[10] M. Du et al., “DeepLog: Anomaly detection and diagnosis from system logs through deep learning,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2017, pp. 1285–1298.
[11] J. Lin, P. Chen, and Z. Zheng, “Microscope: Pinpoint performance issues with causal graphs in micro-service environments,” in Proc. 16th Int. Conf. Serv.-Oriented Comput., Springer, Hangzhou, China, Nov. 12–15, 2018, pp. 3–20.
[12] M. Ma, W. Lin, D. Pan, and P. Wang, “Self-adaptive root cause diagnosis for large-scale microservice architecture,” IEEE Trans. Services Comput., vol. 15, no. 3, pp. 1399–1410, May/Jun. 2022.
[13] M. Ma et al., “AutoMAP: Diagnose your microservice-based web applications automatically,” in Proc. Web Conf., Y. Huang et al., Eds., Taipei, Taiwan, Apr. 20–24, 2020, pp. 246–258.
[14] Y. Pan et al., “Faster, deeper, easier: Crowdsourcing diagnosis of microservice kernel failure from user space,” in Proc. 30th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2021, pp. 646–657.
[15] B. H. Sigelman et al., “Dapper, a large-scale distributed systems tracing infrastructure,” 2010. [Online]. Available: http://research.google.com/archive/papers/dapper-2010-1.pdf
[16] T. Yang et al., “AID: Efficient prediction of aggregated intensity of dependency in large-scale cloud systems,” in Proc. IEEE/ACM 36th Int. Conf. Automated Softw. Eng., 2021, pp. 653–665.
[17] J. Kaldor et al., “Canopy: An end-to-end performance tracing and analysis system,” in Proc. 26th Symp. Operating Syst. Princ., Shanghai, China, Oct. 28–31, 2017, pp. 34–50.
[18] X. Zhou et al., “Latent error prediction and fault localization for microservice applications by learning from system trace logs,” in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 683–694.
[19] C. Zhang et al., “DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning,” in Proc. 44th Int. Conf. Softw. Eng., 2022, pp. 623–634.
[20] B. Li et al., “Enjoy your observability: An industrial survey of microservice tracing and analysis,” Empirical Softw. Eng., vol. 27, no. 1, 2022, Art. no. 25.
[21] P. Liu et al., “Unsupervised detection of microservice trace anomalies through service-level deep Bayesian networks,” in Proc. IEEE 31st Int. Symp. Softw. Rel. Eng., 2020, pp. 48–58.
[22] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in Proc. IEEE Int. Conf. Web Serv., I. Altintas and S. Chen, Eds., Honolulu, HI, USA, Jun. 25–30, 2017, pp. 33–40.
[23] S. Zhang et al., “Syslog processing for switch failure diagnosis and prediction in datacenter networks,” in Proc. 25th IEEE/ACM Int. Symp. Qual. Serv., Vilanova i la Geltrú, Spain, Jun. 14–16, 2017, pp. 1–10.
[24] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, “Towards automated log parsing for large-scale log data analysis,” IEEE Trans. Dependable Secure Comput., vol. 15, no. 6, pp. 931–944, Nov./Dec. 2018.
[25] S. Messaoudi et al., “A search-based approach for accurate identification of log message formats,” in Proc. 26th Conf. Prog. Comprehension, F. Khomh, C. K. Roy, and J. Siegmund, Eds., Gothenburg, Sweden, May 27/28, 2018, pp. 167–177.
[26] M. Du and F. Li, “Spell: Online streaming parsing of large unstructured system logs,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 11, pp. 2213–2227, Nov. 2019.
[27] H. Dai, H. Li, C. Chen, W. Shang, and T.-H. Chen, “Logram: Efficient log parsing using n-gram dictionaries,” IEEE Trans. Softw. Eng., vol. 48, no. 3, pp. 879–892, Mar. 2022.
[28] M. Sun et al., “CTF: Anomaly detection in high-dimensional time series with coarse-to-fine model transfer,” in Proc. IEEE 40th Conf. Comput. Commun., Vancouver, BC, Canada, May 10–13, 2021, pp. 1–10.
[29] Y. Su et al., “Detecting outlier machine instances through Gaussian mixture variational autoencoder with one dimensional CNN,” IEEE Trans. Comput., vol. 71, no. 4, pp. 892–905, Apr. 2022.
[30] L. Shen et al., “Time series anomaly detection with multiresolution ensemble decoding,” in Proc. 35th AAAI Conf. Artif. Intell. 33rd Conf. Innov. Appl. Artif. Intell. 11th Symp. Educ. Adv. Artif. Intell., 2021, pp. 9567–9575.
[31] M. Ma et al., “Jump-starting multivariate time series anomaly detection for online service systems,” in Proc. USENIX Annu. Tech. Conf., I. Calciu and G. Kuenning, Eds., USENIX Assoc., 2021, pp. 413–426.
[32] Z. Li et al., “Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding,” in Proc. 27th ACM SIGKDD Conf. Knowl. Discov. Data Mining, F. Zhu, B. C. Ooi, and C. Miao, Eds., 2021, pp. 3220–3230.
[33] L. Dai et al., “SDFVAE: Static and dynamic factorized VAE for anomaly detection of multivariate CDN KPIs,” in Proc. Web Conf., J. Leskovec et al., Eds., 2021, pp. 3076–3086.
[34] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013, arXiv:1301.3781.
[35] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proc. Conf. Empirical Methods Natural Lang. Process., A. Moschitti, B. Pang, and W. Daelemans, Eds., 2014, pp. 1532–1543.
[36] P. Bojanowski et al., “Enriching word vectors with subword information,” Trans. Assoc. Comput. Linguistics, vol. 5, pp. 135–146, 2017.
[37] M. E. Peters et al., “Deep contextualized word representations,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., M. A. Walker, H. Ji, and A. Stent, Eds., New Orleans, LA, USA, Jun. 1–6, 2018, pp. 2227–2237.
[38] J. Devlin et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805.
[39] T. B. Brown et al., “Language models are few-shot learners,” in Proc. Int. Conf. Neural Inf. Process. Syst., H. Larochelle et al., Eds., 2020, Art. no. 159.
[40] W. Meng et al., “LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs,” in Proc. 28th Int. Joint Conf. Artif. Intell., S. Kraus, Ed., Macao, China, Aug. 10–16, 2019, pp. 4739–4745.
[41] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” 2016, arXiv:1609.02907.
[42] W. L. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proc. Int. Conf. Neural Inf. Process. Syst., I. Guyon et al., Eds., Long Beach, CA, USA, 2017, pp. 1024–1034.
[43] L. Zhou, Q. Zeng, and B. Li, “Hybrid anomaly detection via multihead dynamic graph attention networks for multivariate time series,” IEEE Access, vol. 10, pp. 40967–40978, 2022.
[44] L. Zhang, B. Morin, P. Haller, B. Baudry, and M. Monperrus, “A Chaos engineering system for live analysis and falsification of exception-handling in the JVM,” IEEE Trans. Softw. Eng., vol. 47, no. 11, pp. 2534–2548, Nov. 2021.
[45] Y. Wang et al., “Fast outage analysis of large-scale production clouds with service correlation mining,” in Proc. IEEE/ACM 43rd Int. Conf. Softw. Eng., Madrid, Spain, May 22–30, 2021, pp. 885–896.
[46] Y. Zhou et al., “Graph neural networks: Taxonomy, advances, and trends,” ACM Trans. Intell. Syst. Technol., vol. 13, no. 1, pp. 15:1–15:54, 2022.
[47] C. Hou, T. Jia, Y. Wu, Y. Li, and J. Han, “Diagnosing performance issues in microservices with heterogeneous data source,” in Proc. IEEE Int. Conf. Parallel Distrib. Process. Appl. Big Data Cloud Comput. Sustain. Comput. Commun. Social Comput. Netw., New York, NY, USA, 2021, pp. 493–500.
[48] Y. Zhang et al., “CloudRCA: A root cause analysis framework for cloud computing platforms,” in Proc. 30th ACM Int. Conf. Inf. Knowl. Manage., G. Demartini et al., Eds., 2021, pp. 4373–4382.
[49] Y. Meng et al., “Localizing failure root causes in a microservice through causality inference,” in Proc. IEEE/ACM 28th Int. Symp. Qual. Serv., Hangzhou, China, Jun. 15–17, 2020, pp. 1–10. [Online]. Available: https://doi.org/10.1109/IWQoS49365.2020.9213058
[50] M. Ma et al., “Diagnosing root causes of intermittent slow queries in large-scale cloud databases,” Proc. VLDB Endowment, vol. 13, no. 8, pp. 1176–1189, 2020.
[51] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” IEEE Trans. Knowl. Data Eng., vol. 34, no. 1, pp. 249–270, Jan. 2022.
[52] Á. Brandón et al., “Graph-based root cause analysis for service-oriented and microservice architectures,” J. Syst. Softw., vol. 159, 2020, Art. no. 110432.
[53] T. Jia, Y. Wu, C. Hou, and Y. Li, “LogFlash: Real-time streaming anomaly detection and diagnosis from system logs for large-scale software systems,” in Proc. IEEE 32nd Int. Symp. Softw. Rel. Eng., Z. Jin et al., Eds., Wuhan, China, Oct. 25–28, 2021, pp. 80–90.
[54] H. Wang et al., “Groot: An event-graph-based approach for root cause analysis in industrial settings,” in Proc. IEEE/ACM 36th Int. Conf. Automated Softw. Eng., Melbourne, Australia, Nov. 15–19, 2021, pp. 419–429.
Shenglin Zhang (Member, IEEE) received the BS degree in network engineering from the School of Computer Science and Technology, Xidian University, Xi'an, China, in 2012, and the PhD degree in computer science from Tsinghua University, Beijing, China, in 2017. He is currently an associate professor with the College of Software, Nankai University, Tianjin, China. His current research interests include failure detection, diagnosis, and prediction for service management.

Pengxiang Jin received the bachelor's degree in software engineering from Nankai University, Tianjin, China, in 2020. He is currently working toward the master's degree with the College of Software, Nankai University. His research interests include anomaly detection and anomaly localization.

Zihan Lin received the bachelor's degree in software engineering from Nankai University, Tianjin, China, in 2021. He is currently working toward the master's degree with the College of Software, Nankai University. His research interests include failure localization and anomaly detection.

Yongqian Sun (Member, IEEE) received the BS degree in statistical specialty from Northwestern Polytechnical University, Xi'an, China, in 2012, and the PhD degree in computer science from Tsinghua University, Beijing, China, in 2018. He is currently an assistant professor with the College of Software, Nankai University, Tianjin, China. His research focuses on anomaly detection, root cause analysis, and failure diagnosis in service management.

Bicheng Zhang received the bachelor's degree from Nankai University. He is currently working toward the master's degree with Fudan University. His research interests include cloud native and AIOps.

Sibo Xia is currently working toward the master's degree. His main research interests include knowledge graphs, failure detection, and diagnosis.

Zhengdan Li is an experimentalist with Nankai University, Tianjin, China. Her research interests include artificial intelligence and software engineering.

Zhenyu Zhong received the BS degree in software engineering from Nankai University, Tianjin, China, in 2020. He is currently working toward the PhD degree with the College of Software, Nankai University, Tianjin, China. His current research interests include anomaly detection, deep learning, and NLP.

Minghua Ma (Member, IEEE) received the PhD degree from Tsinghua University, in 2021. He is a researcher with Microsoft. His current research interests include cloud intelligence/AIOps.

Wa Jin is currently working toward the bachelor's degree. Her main research interests include anomaly detection and failure diagnosis.

Dai Zhang is employed with Zhejiang E-Commerce Bank Co., Ltd., launched by Ant Group. As a technical expert, he mainly focuses on financial basic technical architecture and cloud-native system stability.

Zhenyu Zhu is employed with Zhejiang E-Commerce Bank Co., Ltd., launched by Ant Group. As a technical expert, he mainly focuses on financial basic technical architecture and cloud-native system stability.

Dan Pei (Senior Member, IEEE) received the BE and MS degrees in computer science from the Department of Computer Science and Technology, Tsinghua University, in 1997 and 2000, respectively, and the PhD degree in computer science from the Computer Science Department, University of California, Los Angeles (UCLA), in 2005. He is currently an associate professor with the Department of Computer Science and Technology, Tsinghua University. His research interests include network and service management in general. He is a senior member of the ACM.