Robust Failure Diagnosis of Microservice System Through Multimodal Data
1939-1374 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: York University. Downloaded on March 04,2025 at 18:27:02 UTC from IEEE Xplore. Restrictions apply.
3852 IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 16, NO. 6, NOVEMBER/DECEMBER 2023
TABLE I
DETAILED INFORMATION OF THE FAILURES IN THE EMPIRICAL STUDY
ZHANG et al.: ROBUST FAILURE DIAGNOSIS OF MICROSERVICE SYSTEM THROUGH MULTIMODAL DATA 3853
datasets. Section VI discusses the technical rationale, robustness, and threats to validity. Section VII presents the related work in failure diagnosis. Section VIII concludes the paper.
II. BACKGROUND
A. Microservice Systems and Multimodal Data
Microservice systems allow developers to independently develop and deploy functional software units (microservices). For example, when a user tries to buy an item on an online shopping website, the user will experience item searching, item displaying, order generation, payment, etc. Each of these functions is served by a specific microservice. A failure at a specific service instance can propagate to other service instances in many ways, bringing cascading failures. However, diagnosing online failures in microservice systems is difficult due to these systems’ highly complex orchestration and dynamic interaction. To accurately find the cause of a failure, operators must carefully monitor the system and record traces, logs, and metrics. These three modalities of monitoring data stand as the three pillars of the observability of microservice systems. The collection and storage of instances’ monitoring data are not in the scope of this paper. The three modalities (trace, log, and metric) and their roles in failure diagnosis are described below.
Trace: Traces record the execution paths of users’ requests. Fig. 1 shows an example of a trace at the top. Google formally proposed the concept of traces in Dapper [15], in which it defined the whole lifecycle of a request as a trace and the invocation and answering of a component as a span. By examining traces, operators may identify microservices that have possibly gone wrong [4], [6], [16], [17], [18], [19], [20], [21]. Traces can be viewed as trees, with microservices as nodes and invocations as edges. Each subtree corresponds to a span. Typically, traces carry information about invocations, e.g., start time, caller, callee, response time, and status code.
Log: Logs record comprehensive events of a service instance. Some examples of logs are shown in the middle of Fig. 1. Logs are generated by developers using commands like printf, logging.debug, and logging.error. They provide an internal picture of a service instance. By examining logs, operators may discover the actual cause of why an instance performs poorly. Typically, logs consist of three fields: timestamp, verbosity level, and raw message [22]. Four commonly used verbosity levels, i.e., INFO, WARN, DEBUG, and ERROR, indicate the severity of a log message. The raw message of a log conveys detailed information about the event. To utilize logs more effectively, researchers have proposed various parsing techniques to extract templates and parameters, e.g., FT-Tree [23], Drain [22], POP [24], MoLFI [25], Spell [26], and Logram [27].
Metric: Various system-level metrics (e.g., CPU utilization, memory utilization) and user-perceived metrics (e.g., average response time) are configured for monitoring system instances. Each metric is collected at a predefined interval, forming a time series, as shown at the bottom of Fig. 1. These metrics track various aspects of performance issues. By examining metrics, operators can determine which physical resource is anomalous or is the bottleneck [28], [29], [30], [31], [32], [33].
In addition to trace, log, and metric, deployment data is also important to failure diagnosis. A microservice system comprises many hardware and software assets that form complicated interrelationships. Operators must carefully record these relationships (a.k.a. deployment data) to keep the system highly maintainable. Leveraging deployment data enables the understanding of failure propagation paths and characteristics.
B. Preliminaries
Representation Learning: Representation learning has been widely used in natural language processing tasks, usually in the form of word embedding. Popular techniques of representation learning include static representations like word2vec [34], GloVe [35], and fastText [36], and dynamic representations like ELMo [37], BERT [38], and GPT [39]. Given the similarities between logs and natural languages, representation learning can be applied to extract log features [40]. We employ fastText to learn a unified representation of events from multimodal data. Compared to word2vec and GloVe, fastText can utilize more information [36].
In essence, fastText is a neural network model that processes words as input and takes the output from the hidden layer (a vector of real numbers) as its representation. It can be trained in both supervised and unsupervised modes, but the supervised mode generally yields more accurate results due to its incorporation of label information. In the supervised training mode, the neural network is optimized by predicting the class of the document. Once the training is completed, fastText can be used to provide vectorized representations (i.e., embeddings) for any given input.
Graph Neural Network: GNNs can effectively model data from non-Euclidean space and are therefore popular in fields with graph structures, e.g., social networks, biology, and recommendation systems. Popular GNN architectures include Graph Convolution Network (GCN) [41], GraphSAGE [42], Graph Attention Network (GAT) [43], etc. GNNs apply graph convolutions, allowing nodes to utilize their information and learn from their neighbors through message passing. There are numerous components in microservice systems that interconnect with each other. Thus a graph structure is suitable to model microservice systems, and we employ GNN to learn the propagation patterns of historical failure cases.
C. Problem Statement
When a failure occurs, operators need to localize the root cause instance and determine what has happened to it to achieve timely failure mitigation. For large-scale microservice systems, the first task is a ranking problem: to rank the root cause instance higher than other instances. We use the term root cause instance localization to name this task (Task #1). The second task is a classification problem: to classify the failure into a predefined set of failure types. We use the term failure type determination to name this task (Task #2).
After each failure, operators will carefully conduct a post-failure analysis: labeling its root cause instance and its failure
A. Design Overview
In this article, we propose DiagFusion, which combines the modalities of trace, log, and metric for accurate failure diagnosis. The training framework of DiagFusion is summarized in Fig. 3. First, DiagFusion extracts events from raw traces, logs,
2 https://github.com/CloudWise-OpenSource/GAIA-DataSet
B. Unified Event Representation
DiagFusion unifies the three modalities by extracting events from the raw data and encoding them into vectors. Specifically, it collects failure-indicative events by leveraging effective and lightweight methods, including anomaly detection techniques for metrics and traces and template parsing techniques for logs.
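As a rough illustration of the event-extraction idea described above, the following Python sketch flags metric anomalies with a simple 3-sigma rule and merges them with already-parsed log templates into one time-ordered event sequence. The 3-sigma rule, the function names `metric_events` and `serialize`, and the token format (e.g., `metric:cpu:up`) are assumptions made for this example only; the text states just that lightweight anomaly detection and template parsing are used.

```python
from statistics import mean, stdev


def metric_events(instance, metric_name, history, recent, n_sigma=3.0):
    """Turn anomalous metric samples into (timestamp, instance, token) events.

    `history` is a list of normal values used to estimate mean and spread;
    `recent` is a list of (timestamp, value) pairs to be checked.
    The 3-sigma threshold is an assumed, illustrative detector.
    """
    mu, sigma = mean(history), stdev(history)
    events = []
    for ts, value in recent:
        if sigma > 0 and abs(value - mu) > n_sigma * sigma:
            direction = "up" if value > mu else "down"
            events.append((ts, instance, f"metric:{metric_name}:{direction}"))
    return events


def serialize(*event_lists):
    """Merge events from several modalities and order them by timestamp."""
    merged = [event for events in event_lists for event in events]
    merged.sort(key=lambda event: event[0])
    return [token for _, _, token in merged]


# Hypothetical data: a CPU spike on instance B1 plus one parsed log template.
cpu_events = metric_events(
    "B1", "cpu", [10, 11, 9, 10, 10, 11, 9, 10], [(100, 10.5), (105, 42.0)]
)
log_events = [(103, "B1", "log:template_7")]
print(serialize(cpu_events, log_events))
```

The resulting token sequence plays the role of a "sentence" that an embedding model can consume, which is why events from different modalities only need a shared timestamp to be combined.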
service instances and integrate all the information of the whole system.
To leverage a GNN, it is essential to consider both nodes and edges within a graph. The nodes in a GNN correspond to the instances in a microservice system. An instance is characterized by its anomalous events in DiagFusion. We represent an instance i by averaging all of its events:
h_i^{(0)} = \frac{1}{|E_i|} \sum_{\forall e \in E_i} V_1(e)    (2)
where E_i is the extracted event sequence of instance i, and V_1(e) is the vectorized representation of event e learned by the event embedding model.
The edges in a GNN correspond to the dependency graph in a microservice system. There are two dominant ways of failure propagation between services: function calling or resource contention [45]. So we combine traces and deployment data to capture probable failure propagation paths. Specifically, we aggregate traces to get a call graph. Then we add two directed edges for each pair of caller and callee, with one pointing from the caller to the callee and the other in the reverse direction. From deployment data, we add edges between two instances if they are co-deployed, i.e., sharing resources.
After obtaining the dependency graph and instance representations, we employ GNN to learn the failure propagation pattern by its message-passing mechanism. At the K-th layer of GNN, we apply topology adaptive graph convolution [46] and update the internal data of instances according to:
H^K = \sum_{k=0}^{K} \left( D^{-1/2} A D^{-1/2} \right)^k X \Theta_k    (3)
where A denotes the adjacency matrix, D_{ii} = \sum_{j=0} A_{ij} is a diagonal degree matrix, and \Theta_k denotes the linear weights to sum the results of different hops together.
Finally, we add a MaxPooling layer as the readout layer to integrate the information of the whole microservice system. Following the MaxPooling layer, there is a fully connected layer where each neuron corresponds to either a service group with possible root cause instances for task #1 or a failure type for task #2.
D. Training of DiagFusion
DiagFusion applies a two-phase training strategy to learn the failure pattern of a microservice system. First, it trains the event embedding model with data augmentation. Then it trains the GNN with a joint learning technique.
1) Training of Event Embedding Model: DiagFusion employs a data augmentation strategy to enrich the training dataset and reduce the model’s bias towards the majority class. First, we train our event embedding model on the original data. The trained neural network, denoted by f_0, maps events to the vector space V_0. To increase the number of failure cases, we add new event sequences for each failure type (including “non-root-cause”) by randomly taking an event sequence of that type and replacing one of the events with its closest neighbor (determined by Euclidean distance) in V_0. After all failure types are expanded to a relatively large size, e.g., 1000, we can obtain a more balanced training set. Further details on the choice of the expanding size can be found in Section V-E. Then we train the event embedding model again (f_1) on the expanded data and regard the representations generated in this round (V_1) as the final unified event representations.
2) Training of Graph Neural Network: We train the GNN in a joint learning fashion to fully utilize the shared information between tasks #1 and #2. Then we combine the trained GNN with a ranking strategy to better fit the nature of microservice systems.
Ranking Strategy: One of the advantages of microservice systems is that the architecture allows dynamic deployment of service instances. Thus, service instances are constantly being created and destroyed. However, when it comes to failure diagnosis, this kind of flexibility raises a challenge for learning-based methods. The failure diagnosis model will have to be retrained frequently if the output layer directly outputs the probability of being the root cause instance for each instance, since many instances can be created or destroyed after the model training is finished. We add an extra step in DiagFusion to overcome this challenge. Instead of directly determining the root cause instance, DiagFusion is trained on service groups, the logical aggregation of service instances, for task #1. Then DiagFusion ranks the instances inside a candidate service group by the length of their event sequences. The instance with more anomaly events will be ranked higher and is more likely to be the root cause instance.
Joint Learning: Intuitively, the two tasks of failure diagnosis, i.e., root cause instance localization and failure type determination, share some knowledge in common. For a given failure, the only difference between task #1 and task #2 lies in their labels. So DiagFusion integrates a joint learning mechanism to utilize the shared knowledge and reduce the training time. (Training two models separately requires twice the time otherwise.) Specifically, the joint loss function is:
-\frac{1}{F} \sum_{i=1}^{F} \left( \sum_{j=1}^{S} y(s)_{i,j} \log p(s)_{i,j} + \sum_{k=1}^{T} y(t)_{i,k} \log p(t)_{i,k} \right)    (4)
where F is the number of historical failures, S is the number of service groups, T is the number of failure types, y(s) is the root cause service group labeled by operators, y(t) is the failure type, p(s) is the predicted service group, and p(t) is the predicted failure type.
E. Real-Time Failure Diagnosis
After the training stage, we save the trained event embedding model and the GNN. When a new failure is alerted, DiagFusion performs a real-time diagnosis process as shown in Fig. 5.
1) Running Example: Fig. 6 shows how DiagFusion can be integrated with microservice systems. To better explain how DiagFusion diagnoses failure, we demonstrate the workflow of DiagFusion using one real-world failure from D1. At 10:46, service instance B1 encounters a failure of access denied. Fig. 7 shows the original data, event sequence, and the dependency graph (DG).
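To make two pieces of Section IV-D more concrete, the Python sketch below evaluates the joint loss in (4) for one-hot labels and then applies the ranking strategy that orders instances inside a service group by the length of their event sequences. It is a minimal sketch under stated assumptions: labels are passed as class indices, the predicted probabilities are assumed to come from the GNN's output layer, and ordering the candidate groups by their predicted score is an assumption of this example (the text only specifies the ranking inside a candidate group). The names `joint_loss` and `rank_instances` are hypothetical.

```python
import math


def joint_loss(group_probs, group_labels, type_probs, type_labels):
    """Joint cross-entropy of Eq. (4), averaged over F historical failures.

    With one-hot labels, each inner sum in Eq. (4) keeps only the
    log-probability of the labeled class, so labels are given as indices.
    """
    total = 0.0
    for p_s, y_s, p_t, y_t in zip(group_probs, group_labels, type_probs, type_labels):
        total += math.log(p_s[y_s]) + math.log(p_t[y_t])
    return -total / len(group_labels)


def rank_instances(group_scores, group_members, event_sequences):
    """Rank instances: groups by predicted score (assumed), then instances
    inside each group by the number of extracted anomaly events."""
    ranked = []
    for group in sorted(group_scores, key=group_scores.get, reverse=True):
        members = sorted(
            group_members[group],
            key=lambda inst: len(event_sequences.get(inst, [])),
            reverse=True,
        )
        ranked.extend(members)
    return ranked


# Hypothetical example: group B is predicted as the most likely root cause group.
scores = {"A": 0.2, "B": 0.7, "C": 0.1}
members = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1"]}
events = {"B1": ["e1", "e2", "e3"], "B2": ["e4"], "A1": [], "A2": ["e5"]}
print(rank_instances(scores, members, events))
print(joint_loss([[0.5, 0.5]], [0], [[0.25, 0.75]], [1]))
```

Because the model only predicts service groups, instances created after training (for example, a new replica inside group B) can still be ranked without retraining, which is the motivation given for the ranking strategy.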
TABLE II
DETAILED INFORMATION OF DATASETS
Fig. 7. A running example of DiagFusion. (a) the serialized multimodal event sequence of the root cause instance (B1); (b) the original data corresponding to
the event sequence; (c) part of the dependency graph in this failure.
both tasks to better reflect the real-world performance of all selected methods.
For Task #1, we use Top-k accuracy (A@k) and Top-5 average accuracy (Avg@5) as the evaluation metrics. A@k is a well-adopted metric that quantifies the probability that the top-k instances output by each method indeed contain the root cause instance [5]. Formally, given A as the test set of failures, RC_i^a as the ground truth root cause instance of failure a, and RC_s^a[k] as the top-k root cause instance set generated by a method for failure a, A@k is defined as:
A@k = \frac{1}{|A|} \sum_{a \in A} \begin{cases} 1, & \text{if } RC_i^a \in RC_s^a[k] \\ 0, & \text{otherwise} \end{cases}    (5)
Avg@5 is another popular metric that evaluates a method’s overall capability of localizing the root cause instance [49]. In practice, operators often examine the top 5 results. Avg@5 is calculated by:
Avg@5 = \frac{1}{5} \sum_{1 \le k \le 5} A@k    (6)
For Task #2, which is a multi-class classification problem, we use the weighted average precision, recall, and F1-score to test the performance. These metrics have been selected based on a previous study [50] as a reliable way to assess the model’s effectiveness in this specific context. With True Positives (TP), False Positives (FP), and False Negatives (FN), the calculation is given by F1-score = 2 \times \frac{precision \times recall}{precision + recall}, where precision = \frac{TP}{TP + FP} and recall = \frac{TP}{TP + FN}.
TABLE III
EFFECTIVENESS OF FAILURE TYPE DETERMINATION (TASK #2)
The comparison result of Task #1 is shown in Fig. 8. DiagFusion achieves the best performance. Specifically, the A@1 to A@5 of DiagFusion are almost the best on D1 and D2. More specifically, the Avg@5 of DiagFusion exceeds 0.75 on both D1 and D2. It is at least 0.13 higher on both datasets than baselines using single-modal data due to the advantage of using multimodal data. Compared with PDiagnose, which also uses multimodal data, the Avg@5 of DiagFusion is higher by at least 0.18. This indicates that learning from historical failures improves the accuracy of diagnosis significantly.
The result of Task #2 is shown in Table III. For this task, DiagFusion is better than almost all baselines. On D1, the precision, recall, and F1-score of DiagFusion are over 0.80. On D2, DiagFusion manages to maintain an F1-score of 0.80, which is at least 0.195 higher than the baselines. Considering both systems and tasks, DiagFusion consistently demonstrates superior performance, thereby substantiating its effectiveness.
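The Task #1 metrics in (5) and (6) are straightforward to compute from ranked outputs. The following Python sketch is one possible implementation, assuming each method returns a ranked list of instance names per failure; the function names `top_k_accuracy` and `avg_at_5` and the toy data are illustrative and not taken from the evaluation itself.

```python
def top_k_accuracy(ranked_lists, truths, k):
    """A@k of Eq. (5): fraction of failures whose ground-truth root cause
    instance appears among the top-k ranked instances."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k])
    return hits / len(truths)


def avg_at_5(ranked_lists, truths):
    """Avg@5 of Eq. (6): mean of A@1 through A@5."""
    return sum(top_k_accuracy(ranked_lists, truths, k) for k in range(1, 6)) / 5


# Toy example with three failures whose true root cause is instance "a".
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "a", "a"]
print(top_k_accuracy(ranked, truth, 1))
print(avg_at_5(ranked, truth))
```

In this toy case the root cause is ranked first, second, and third, respectively, so A@1 is 1/3, A@2 is 2/3, and A@3 through A@5 are 1, giving an Avg@5 of 0.8.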
TABLE IV
CONTRIBUTIONS OF COMPONENTS
The Number of Augmented Samples: The experiments in Section V-B show that data augmentation yields some improvement in the model’s performance. However, when the number of samples increases to a certain amount, the information in the training set has already been fully utilized. Instead, the performance may be degraded due to the excessive introduction of noise. Generally speaking, DiagFusion does not need an excessive number of augmented samples as long as the samples are balanced.
The Number of Layers in GNN: As the layer number of GNN varies from 1 to 5, the performance of DiagFusion in three tasks shows a decreasing trend. The model performs best when the layer number is lower than 3. We do not recommend setting the layer number too large since training a deep GNN requires extra training samples, a requirement that is hard to meet in real-world microservice systems.
Time Window: The length of the time window has little impact on performance because the moments when failures occur are sparse, and the anomaly events reported in a time window are only relevant to the current failure. With accurate anomaly detection, the performance of DiagFusion is stable.
VI. DISCUSSION
A. Why Learning-Based Methods?
The DiagFusion approach incorporates several learning-based techniques, such as fastText in the Unified Event Representation (Section IV-B) and GNN (Section IV-C). By doing so, DiagFusion significantly outperforms baseline approaches. We chose to build DiagFusion using learning-based methods for the following reasons: 1) Accuracy: learning-based methods provide high accuracy (Section V) and are therefore ideal for diagnosing failures. 2) Generalization ability: failure cases used to train DiagFusion contain different patterns of failure propagation for different systems. A strong generalization ability allows DiagFusion to perform robust diagnosis for each system. 3) Ability to handle complicated data: as microservice systems become increasingly complex and monitoring data more high-dimensional, manually setting up rules for failure diagnosis becomes time-consuming and error-prone. Learning-based methods, on the other hand, take this data as input and learn their relationships, making them well-suited to handle complicated data.
Why FastText? FastText was chosen because trace, log, and metric data have very different formats. However, they all share timestamps, meaning they can be sequenced according to their temporal order. FastText provides superior performance over other static embeddings like word2vec and GloVe, which was demonstrated in Section V-C. Although deep dynamic embeddings like ELMo, BERT, and GPT are popular in Natural Language Processing, they are not suitable for microservice settings as the number of failure cases is insufficient to train these large models.
Why GNN? GNN was chosen because the structure of microservice systems involves many instances and their relationships, which form the structure of a graph. Various approaches incorporating Random Walk [12], [13] exist to accomplish failure diagnosis on such graph structures. However, their ability to generalize is limited since domain knowledge can vary greatly between different systems. The domain knowledge contained in graph data can be effectively learned by GNNs [51], giving them a stronger generalization ability than approaches based on Random Walk.
Concerns About Learning-Based Methods: While learning-based methods offer several advantages, they do require labeled samples for training. Three factors mitigate this concern: 1) the well-established failure management system in microservice systems is a natural source of failure labels; 2) DiagFusion does not require many training samples to achieve good performance (the training sets of D1 and D2 contain 160 and 80 cases, respectively); and 3) the increasing adoption of chaos engineering enables operators to quickly obtain sufficient failure cases. Several successful practices with the help of chaos engineering have been reported [2], [6], [16], [18].
B. Robustness
In practice, some modalities can be absent, hindering a successful failure diagnosis system to some extent. The causes of missing modalities can be generally classified into three categories. The first category refers to missing modalities caused
by data collection problems. Modern microservice systems are developing rapidly; the same holds for their monitoring agents. Therefore, it is hard to guarantee that all monitoring data are ideally collected and transmitted. As a result, missing data is inevitable, which can give rise to missing modalities when specific modalities of the monitoring data suffer from collection problems. The second category refers to missing modalities caused by data availability problems. In some large corporations, monitoring data is individually collected by many different divisions. Sometimes, specific modalities can be exclusively governed by a division that does not want to disclose its service maintenance data. Thus, these modalities are collected but not available to general operators. The third category stands for missing modalities caused by data retrieval problems. In practice, we often encounter situations where it is very inconvenient to retrieve monitoring data from the data pool. Multimodal failure diagnosis requires much more data to be collected than single-modal-based methods and may face missing modality problems. However, an excellent multimodal-based approach should perform well even when some modalities are missing. We discover that 62 failure cases of D1 lack metric data. DiagFusion is compared with PDiagnose in these cases. As PDiagnose cannot address Task #2, we only present the results of Task #1. As shown in Table VI, the performance of PDiagnose drops dramatically in these cases, while DiagFusion presents salient robustness. Although DiagFusion also witnesses a performance degradation, it is still better than PDiagnose and other Task #1 baselines. DiagFusion has seen complete data modalities during training and learned a unified representation, allowing it to capture anomalous patterns’ correlation to failures better than single-modal-based methods. On the other hand, PDiagnose treats each modality independently, making it ineffective when facing missing modalities. To sum up, DiagFusion demonstrates robustness since it achieves satisfactory performance even when working with data with incomplete modalities.
TABLE VI
ROBUSTNESS COMPARED TO PDIAGNOSE (TASK #1)
C. Concerns About Deployment and Validity
There are some concerns about deploying DiagFusion to real-world microservice systems: 1) DiagFusion needs to adapt to the highly dynamic nature of microservice architecture. The stored model of DiagFusion can still be effective when service instances are created or destroyed, for DiagFusion utilizes the concept of service group as a middle layer. The only situation in which DiagFusion needs to be retrained is when new service groups are created. However, the creation of service groups is very rare in practice. 2) Some production systems do not monitor all three modalities at the same time. The workflow of DiagFusion is general because the event embedding model is trained on event sequences and does not rely on any specific modality. Besides, the GNN module deals with feature vectors rather than original monitoring data. DiagFusion can work given that any two of the three modalities are available.
There are two main threats to the validity of the study. The first one lies in the limited sizes of the two datasets used in the study. D1 and D2 are smaller than complex industrial microservice systems. The second one lies in the limitation of the failure cases used in the study. Some failure cases of D1 are simpler than industrial failures and represent only a limited part of different types of failures. However, according to our experiments, DiagFusion is effective and robust. It is very promising that DiagFusion can also be effectively applied to much larger industrial microservice systems and more complex failure cases.
VII. RELATED WORK
Metric-Based Failure Diagnosis Methods: Monitoring metrics are among the most important observable data in microservice systems. Many works try to build a dependency graph to depict the interaction between system components during failure, such as Microscope [11], MS-Rank [12], and AutoMAP [13]. However, the correctness of the above works heavily depends on the parameter settings, which degrades their applicability. Besides, many methods extract features from system failures, such as Graph-RCA [52] and iSQUAD [50]. Nonetheless, failure cases are few in microservice systems because operators try to run the system as robustly as possible, severely affecting the performance of these feature-based methods.
Trace-Based Failure Diagnosis Methods: Trace can be used to localize the culprit service, for example, TraceRCA [4], MEPFL [18], MicroHECL [5], and MicroRank [6]. However, these trace-based methods often focus on the global feature of the systems and do not deal with the local features of a service instance.
Log-Based Failure Diagnosis Methods: LogCluster [7] performs hierarchical clustering on log sequences and matches online log sequences to the most similar cluster. Cloud19 [8] applies word2vec to construct the vectorized representation of a log item and trains classifiers to identify the failure type. Onion [9] performs contrast analysis on agglomerated log cliques to find incident-indicating logs. DeepLog [10] and LogFlash [53] integrate anomaly detection and failure diagnosis. They calculate the deviation from normal status and suggest the root cause accordingly. Log-based methods often ignore the topological feature of microservice systems.
Multimodal Data-Based Failure Diagnosis Methods: Recently, combining multimodal data to conduct failure diagnosis has drawn increasing attention. CloudRCA [48] uses both metric and log. It uses the PC algorithm to learn the causal relationship between anomaly patterns of metrics, anomaly patterns of logs, and types of failure. Then it constructs a hierarchical Bayesian Network to infer the failure type. PDiagnose [47] combines metric, log, and trace. It uses lightweight anomaly detection of the three modalities to detect anomaly patterns. Then its vote-based strategy selects the most severe component as the root cause. However, these two methods ignore the topology feature of microservice systems. Groot [54] integrates metrics,
TABLE VII
COMPARISON OF DIAGFUSION AND EXISTING REPRESENTATIVE APPROACHES
status logs, and developer activity. It needs numerous predefined [8] Y. Yuan, W. Shi, B. Liang, and B. Qin, “An approach to cloud execution
rules to conduct accurate failure diagnosis, which degrades its failure diagnosis based on exception logs in OpenStack,” in Proc. IEEE
12th Int. Conf. Cloud Comput., 2019, pp. 124–131.
applicability to most scenarios. [9] X. Zhang et al., “Onion: Identifying incident-indicating logs for cloud
We compare DiagFusion and existing representative ap- systems,” in Proc. 29th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp.
proaches in Table VII. In conclusion, compared to single-modal- Found. Softw. Eng., 2021, pp. 1253–1263.
[10] M. Du et al., “DeepLog: Anomaly detection and diagnosis from system
based methods, DiagFusion takes the three important modalities
into account. Compared to existing multimodal-based methods,
DiagFusion is among the first to represent different modalities in
a unified manner, thus performing more robustly and accurately.

VIII. CONCLUSION

Failure diagnosis is of great importance for microservice
systems. In this paper, we first conduct an empirical study to
illustrate the importance of using multimodal data (i.e., trace,
metric, log) for failure diagnosis of microservice systems. Then
we propose DiagFusion, an automatic failure diagnosis method,
which first extracts events from three modalities of data and
applies fastText embedding to unify the events from different
modalities. During training, DiagFusion leverages data
augmentation to tackle the challenge of data imbalance. Then it
constructs a dependency graph by combining trace and deployment
data. Moreover, DiagFusion integrates event embedding
and the dependency graph through GNN. Finally, the GNN
reports the root cause instance and the failure type of an online
failure. We evaluate DiagFusion using two real-world datasets.
The evaluation results confirm the effectiveness and efficiency
of DiagFusion.
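As an illustration of the graph-related steps summarized above, the following minimal sketch combines call edges taken from traces with co-location edges taken from deployment data, and then runs one mean-aggregation step over per-instance feature vectors. It is not DiagFusion's implementation: the function names, the undirected edge rule, the toy instance and host names, and the unweighted mean aggregation (standing in for a trained GNN layer) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_dependency_graph(trace_calls, deployment):
    """Return sorted neighbor lists built from call and co-location edges.

    trace_calls: iterable of (caller, callee) instance pairs (hypothetical format).
    deployment: dict mapping instance name -> host name (hypothetical format).
    Edges are treated as undirected, which is an assumption of this sketch.
    """
    graph = defaultdict(set)
    for caller, callee in trace_calls:
        graph[caller].add(callee)
        graph[callee].add(caller)
    by_host = defaultdict(list)
    for instance, host in deployment.items():
        by_host[host].append(instance)
        graph[instance]  # register instances that never appear in a trace
    for instances in by_host.values():
        for a in instances:
            for b in instances:
                if a != b:
                    graph[a].add(b)
    return {node: sorted(neighbors) for node, neighbors in graph.items()}


def aggregate_once(graph, features):
    """Average each node's vector with its neighbors' vectors (one step)."""
    out = {}
    for node, neighbors in graph.items():
        group = [features[node]] + [features[n] for n in neighbors]
        dim = len(features[node])
        out[node] = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
    return out


# Toy data: three instances linked by calls, four instances on two hosts.
trace_calls = [("frontend", "cart"), ("cart", "db")]
deployment = {"frontend": "host-1", "cart": "host-1", "db": "host-2", "email": "host-2"}
graph = build_dependency_graph(trace_calls, deployment)

# Stand-ins for per-instance event embeddings (2-dimensional for readability).
features = {
    "frontend": [1.0, 0.0],
    "cart": [0.0, 1.0],
    "db": [2.0, 2.0],
    "email": [0.0, 0.0],
}
smoothed = aggregate_once(graph, features)
```

In this toy example, `email` never appears in a trace, yet it is linked to `db` through the shared host, which shows how deployment data can add propagation paths that traces alone would miss.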
Softw. Eng., 2022, pp. 623–634.
[20] B. Li et al., “Enjoy your observability: An industrial survey of microser-
REFERENCES vice tracing and analysis,” Empirical Softw. Eng., vol. 27, no. 1, 2022,
[1] X. Guo et al., “Graph-based trace analysis for microservice architecture Art. no. 25.
understanding and problem diagnosis,” in Proc. 28th ACM Joint Meeting [21] P. Liu et al., “Unsupervised detection of microservice trace anomalies
Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2020, pp. 1387–1397. through service-level deep Bayesian networks,” in Proc. IEEE 31st Int.
[2] X. Zhou et al., “Fault analysis and debugging of microservice systems: Symp. Softw. Rel. Eng., 2020, pp. 48–58.
Industrial survey, benchmark system, and empirical study,” IEEE Trans. [22] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing
Softw. Eng., vol. 47, no. 2, pp. 243–260, Feb. 2021. approach with fixed depth tree,” in Proc. IEEE Int. Conf. Web Serv.,
[3] AWS, “Summary of the AWS service event in the Northern Virginia (US- I. Altintas and S. Chen, Eds., Honolulu, HI, USA, Jun. 25–30, 2017,
EAST-1) region,” 2021. [Online]. Available: https://aws.amazon.com/cn/ pp. 33–40.
message/12721/ [23] S. Zhang et al., “Syslog processing for switch failure diagnosis and
[4] Z. Li et al., “Practical root cause localization for microservice systems prediction in datacenter networks,” in Proc. 25th IEEE/ACM Int. Symp.
via trace analysis,” in Proc. IEEE/ACM 29th Int. Symp. Qual. Serv., 2021, Qual. Serv., Vilanova i la Geltrú, Spain, Jun. 14–16, 2017, pp. 1–10.
pp. 1–10. [24] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, “Towards automated log
[5] M. Jin et al., “An anomaly detection algorithm for microservice architec- parsing for large-scale log data analysis,” IEEE Trans. Dependable Secure
ture based on robust principal component analysis,” IEEE Access, vol. 8, Comput., vol. 15, no. 6, pp. 931–944, Nov./Dec. 2018.
pp. 226 397–226 408, 2020. [25] S. Messaoudi et al., “A search-based approach for accurate identification
[6] G. Yu et al., “MicroRank: End-to-end latency issue localization with of log message formats,” in Proc. 26th Conf. Prog. Comprehension, F.
extended spectrum analysis in microservice environments,” in Proc. Web Khomh, C. K. Roy, and J. Siegmund, Eds., Gothenburg, Sweden, May
Conf., 2021, pp. 3087–3098. 27/28, 2018, pp. 167–177.
[7] Q. Lin et al., “Log clustering based problem identification for online [26] M. Du and F. Li, “Spell: Online streaming parsing of large unstruc-
service systems,” in Proc. 38th Int. Conf. Softw. Eng. Companion, 2016, tured system logs,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 11,
pp. 102–111. pp. 2213–2227, Nov. 2019.
Shenglin Zhang (Member, IEEE) received the BS degree in network engineering from the School of Computer Science and Technology, Xidian University, Xi’an, China, in 2012, and the PhD degree in computer science from Tsinghua University, Beijing, China, in 2017. He is currently an associate professor with the College of Software, Nankai University, Tianjin, China. His current research interests include failure detection, diagnosis, and prediction for service management.

Pengxiang Jin received the bachelor’s degree in software engineering from Nankai University, Tianjin, China, in 2020. He is currently working toward the master’s degree with the College of Software, Nankai University. His research interests include anomaly detection and anomaly localization.

Zihan Lin received the bachelor’s degree in software engineering from Nankai University, Tianjin, China, in 2021. He is currently working toward the master’s degree with the College of Software, Nankai University. His research interests include failure localization and anomaly detection.

Yongqian Sun (Member, IEEE) received the BS degree in statistical specialty from Northwestern Polytechnical University, Xi’an, China, in 2012, and the PhD degree in computer science from Tsinghua University, Beijing, China, in 2018. He is currently an assistant professor with the College of Software, Nankai University, Tianjin, China. His research focuses on anomaly detection, root cause analysis, and failure diagnosis in service management.
Bicheng Zhang received the bachelor’s degree from Nankai University. He is currently working toward the master’s degree with Fudan University. His research interests include cloud native and AIOps.

Wa Jin is currently working toward the bachelor’s degree. Her main research interests include anomaly detection and failure diagnosis.