
Hybrid Anomaly Detection and Prioritization for Network Logs at Cloud Scale

David Ohana, IBM Research, Haifa, Israel
Bruno Wassermann, IBM Research, Haifa, Israel
Nicolas Dupuis, IBM Research, Yorktown Heights, NY, USA
Elliot Kolodner, IBM Research, Haifa, Israel
Eran Raichstein, IBM Research, Haifa, Israel
Michal Malka, IBM Research, Haifa, Israel
Abstract
Monitoring the health of large-scale systems requires significant manual effort, usually through the continuous curation of alerting rules based on keywords, thresholds and regular expressions, which might generate a flood of mostly irrelevant alerts and obscure the actual information operators would like to see. Existing approaches try to improve the observability of systems by intelligently detecting anomalous situations. Such solutions surface anomalies that are statistically significant, but may not represent events that reliability engineers consider relevant. We propose ADEPTUS, a practical approach for detection of relevant health issues in an established system. ADEPTUS combines statistics and unsupervised learning to detect anomalies with supervised learning and heuristics to determine which of the detected anomalies are likely to be relevant to the Site Reliability Engineers (SREs). ADEPTUS overcomes the labor-intensive prerequisite of obtaining anomaly labels for supervised learning by automatically extracting information from historic alerts and incident tickets. We leverage ADEPTUS for observability in the network infrastructure of IBM Cloud. We perform an extensive real-world evaluation on 10 months of logs generated by tens of thousands of network devices across 11 data centers and demonstrate that ADEPTUS achieves higher alerting accuracy than the rule-based log alerting solution, curated by domain experts, that SREs use daily.

CCS Concepts: • Computer systems organization → Reliability; • Computing methodologies → Anomaly detection; Supervised learning by classification.

Keywords: Anomaly Detection, Log Analysis, Reliability, Machine Learning, Deep Learning, Cloud Computing, AIOps

ACM Reference Format:
David Ohana, Bruno Wassermann, Nicolas Dupuis, Elliot Kolodner, Eran Raichstein, and Michal Malka. 2022. Hybrid Anomaly Detection and Prioritization for Network Logs at Cloud Scale. In Seventeenth European Conference on Computer Systems (EuroSys '22), April 5–8, 2022, RENNES, France. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3492321.3519566

1 Introduction
As the scale of computing systems grows, it becomes increasingly challenging to ensure that they function reliably. A single cloud-scale deployment may consist of tens of thousands of software or hardware nodes in a single data center (DC) [13]. A small SRE team is often responsible for managing the availability of deployments distributed globally across multiple data centers. In addition, large systems are comprised of heterogeneous nodes: the components in use are often from different vendors and versions, and components can assume a variety of roles.
Monitoring of IT systems, and communication networks in particular, can be performed in an active or a passive approach [14]. The active approach, also known as synthetic, is performed by simulating workloads on the system, while the passive approach observes data (commonly metrics and logs) that was generated through workloads from real users. Active monitoring can detect issues in critical business processes before they are visible to end-users. However, it cannot cover all usage scenarios, as it is scripted in advance. For this reason, active and passive monitoring are commonly used together.
In passive monitoring, alerts are commonly triggered using rules. Such rules may be based on thresholds (for numeric metrics) or regular expressions / keywords (for textual logs).


A rule-based approach offers a number of benefits that have led to its wide use for the detection of health and performance issues. First, it offers determinism [27], so that we can trust that given certain conditions, a matching alert will be raised. The second benefit is explainability: the reason for each alert is easy to understand [16, 26, 42]. Finally, there is an element of simplicity, as manual alerting rules are relatively simple to implement, and computational requirements as well as detection latency can be kept low.
However, a rule-based approach also suffers from a number of shortcomings that become more significant as the scale of systems grows [26, 42], making it an inadequate solution for reliable failure detection in large systems. Commonly, it cannot detect novel failures that have not been encountered previously and encoded as a rule, such as a new type of error message that has been emitted for the first time after a software upgrade, or a failure expressed through a metric that was not previously available. Inaccurate rules and changes in workload over time and geography can lead to a high rate of false alarms. Maintenance of a large database of alert rules requires ongoing investment in detecting overlapping rules, updating rules, and deleting obsolete ones. Finally, human mistakes in rule definitions are difficult to avoid.
In this paper, we present ADEPTUS (an acronym for Anomaly Detection and Prioritization Through hybrid Unsupervised and Supervised learning). This approach combines unsupervised and supervised learning to produce highly sensitive and relevant alerts that can be used instead of, or in addition to, manual alerting rules. The first step in ADEPTUS is unsupervised: it uses raw logs as input, and its output is a set of log anomalies with anomaly scores above a certain threshold, which are considered alert candidates. The second step is supervised: it uses the alert candidates as input and trains a classification model to produce a relevance score that indicates how important an alert candidate is likely to be for the SRE team. We evaluated three classification models: Random Forest, CatBoost, and a Convolutional Neural Network. The third step is heuristics-driven: it performs inference and then temporal and spatial hierarchical grouping of alert candidates, to produce a single system-level alert score. Only system-level scores above a configurable threshold are promoted to alerts and brought to the attention of the reliability team.
In our evaluation, we use syslog [12] messages generated by the network devices in several data centers of IBM Cloud as the raw input data for anomaly detection. The unstructured data provided by log messages needs to be properly parsed and quantized in order to obtain suitable features for machine learning. We mine log templates from the free-text portion of a log entry and then compute time-windowed message-count vectors for each unique combination of network device and event type. In addition, we extract the log severity level, the network device role (function), and the rarity level of the events. It is possible to obtain additional features from logs, for example, state-ratio vectors based on message parameter values [44] and weight coefficients per word [28]. Each feature has the potential to increase the coverage and accuracy of an anomaly detector. In this work, using count vectors and metadata (i.e., severity, device role, etc.) was sufficient to achieve accuracy higher than rule-based approaches.
ADEPTUS needs labeled data for its supervised part. IBM Cloud uses a help-desk software solution that records how the SREs reacted to alerts and tracks incidents that occurred in the monitored system. We used temporal and spatial correlation on this information source to obtain labels programmatically, without any additional manual labeling effort.
Our approach can be generalized and applied also to numeric metrics (CPU, IOPS, request rates, etc.), and in other domains beyond networking. It is also possible to apply only the supervised prioritization step (cf. 3.2) of ADEPTUS by accepting alert candidates provided by other solutions (e.g., rule-based) as input. ADEPTUS can then prioritize the alert candidates and promote only the most relevant ones.
We also applied the supervised part of ADEPTUS for prioritization of alert candidates generated by the IT systems of 4 different partners (in the telecommunications, finance and transportation domains), and were able to produce relevant alerts with significantly higher accuracy than the current state of practice. This may serve as preliminary evidence that our approach is generally applicable beyond the network infrastructure domain.
In this paper, we present a highly scalable end-to-end solution that is able to process over 5 billion real-world logs on a single machine with low overhead and latency, and with better-than-human accuracy. The main contributions of this paper are as follows:
• We propose a practical, generic and cost-effective approach for supervised anomaly detection based on incident tickets as ground truth.
• We present an efficient way to apply an existing machine learning technique to improve anomaly detection: applying supervised learning only on anomalies that were generated beforehand by unsupervised learning.
• To the best of our knowledge, we are the first to use log metadata as features for a classification model.
• We introduce and evaluate innovative techniques for improving detection accuracy: a Gaussian Tail rule with exponentially weighted moving average (EWMA); rarity features; alert consolidation using a norm function; re-use of the optimal alert threshold from the latest evaluation cycle.
• We introduce novel ideas in the evaluation strategy: a hybrid population for the $F_\beta$-score and cross-validation based on data center.
• We evaluate our approach on a large real-world dataset and compare its accuracy to human accuracy based on manually curated rules.


• We compare the accuracy and training times of three popular classification models (Random Forest, CatBoost, Convolutional Neural Networks) for the same classification problem, demonstrating that a state-of-the-art gradient boosting model such as CatBoost may achieve an accuracy similar to that of a neural network model, with significantly lower training time and complexity.

2 Background and Related Work
Over recent years, various approaches have been proposed for AIOps (Artificial Intelligence for IT Operations), or more specifically, for automated, real-time monitoring and detection of health and performance issues in IT systems. The goal of AIOps is to make SRE teams more efficient.
AIOps solutions may be divided according to the type of input data they use for anomaly detection. As discussed previously, some use numeric metrics or KPIs [2, 4, 11, 21, 24, 30, 36] and others use textual logs [9, 28, 44, 45, 47]. Logs are an invaluable resource for getting insights into a system, especially when deployed in a production environment [22]. It is relatively easy for software developers to add new types of log messages, and logs contain more diverse information than what can be conveyed in numeric metrics. In spite of this, monitoring and alerting is often based on metrics, whereas logs are more commonly used for postmortems and root cause analysis of an issue. Anomaly detection on logs allows early detection of many types of issues that are only manifested in logs.
Many of these solutions use unsupervised learning, which does not require a labeled dataset to learn from. The curation of labels for normal and anomalous data points is often considered impractical, as it is labor-intensive and requires domain knowledge, usually provided by an SRE team that is already heavily loaded. Gabel et al. detect latent failures by comparing many machines with similar hardware and workload [11]. However, the homogeneity prerequisite is often not applicable: in our use case we would also like to detect problems in aggregating network devices. There are few aggregation devices in each data center, typically 2 or 4 of each functionality. Nevertheless, they are highly prioritized for alerting, since an issue in such a device might affect many customers. In addition, devices from multiple vendors or models are often used for the same role. Xu et al. extract log message count vectors grouped by an identifier (and not by time window as in ADEPTUS), and ratios of state (categorical) parameters found in log messages. Then, they apply PCA to the extracted features to detect unusual log segments [44]. Leveraging information found in log message parameters is an advantage that our solution does not currently have. However, the prerequisite of source-code availability is often not practical, as many logs come from third-party components or pre-compiled libraries.
Unsupervised techniques, especially those that claim to be generic and require little or no configuration, might suffer from a high false-positive rate. Often, those false alarms are indeed statistically significant anomalous events; however, they are not interesting from the perspective of an SRE [21]. Examples of significant yet irrelevant anomalies are human activities that are performed as a part of maintenance or upgrades of the system, issues in components that are not being used for production workloads, etc. Unsupervised solutions are often able to achieve a high alert recall rate [34], but at the cost of low precision [22, 38]. DECORUS [43], our previous work, attempts to distinguish relevant anomalies from the rest by allowing the incorporation of domain knowledge, such as system topology, weights, and anomaly directions, in the prioritization of the anomalies detected. Nair et al. also use domain knowledge about system topology to form a hierarchy of anomaly detectors [30]. However, an attempt to capture the knowledge of Subject-Matter Experts (SMEs) and formalize it mathematically can lead to a large number of domain-knowledge entries, thus creating a maintainability problem similar to that of curating a large database of alerting rules [26]. Furthermore, the accuracy of such an approach is also reduced, because SMEs might have gaps and biases in their knowledge [26]. For example, we observed in our evaluation that anomalies related to syslogs with severity level 'Error' (3) have a probability of 12.7% of being related to an incident. For the 'Warning' (4) level, this probability is reduced to 5.2%, as expected, but for the 'Notice' (5) level, the probability rises to 9.5%, which is counter-intuitive. The aforementioned limitations of DECORUS served as our motivation for creating the more accurate ADEPTUS model.
Another option for improving the relevancy of alarms produced by unsupervised solutions is to add a feedback loop to the workflow, in which SREs can encourage or discourage the detection of some alerts [9, 21]. Feedback requires ramp-up time and many samples to learn from in order to be effective. Feedback also suffers from being subjective and prone to human mistakes, which results in contradicting or misleading feedback. For example, in one case, an SRE provided negative feedback ('not relevant') to an alert because a similar alert was already produced by another monitoring tool. We clearly want to keep such an alert, as we aim for an independent alerting tool. In addition, having an option to provide feedback only on false positives [9] cannot improve the detection of false negatives.
Supervised learning is also being proposed as an approach for anomaly detection in AIOps scenarios, even though it is less common. Usually, supervised learning is preceded by an unsupervised learning step that extracts statistical, temporal or forecasting features from raw data [21, 24, 45, 46]. Extracted features are transformed into samples and fed to a classifier (usually a Random Forest classifier) to decide whether to produce an alert.


Table 1. Syslog Parsing

Raw Log Message:
  Mar 31 19:52:11 dc01_mss017 : 2017 Apr 1 00:52:11 UTC: %ETHPORT-5-SPEED: Interface Ethernet1/8, operational speed changed to 10 Gbps
Headers:
  Time=Mar 31 19:52:11; Host=dc01_mss017; EventCode=%ETHPORT-5-SPEED; Level=5
Event Type:
  %ETHPORT-5-SPEED

Raw Log Message:
  Apr 6 11:20:07 dc02_console006 inetd[1074]: pid 3677: exit status 255
Headers:
  Time=Apr 6 11:20:07; Host=dc02_console006
Event Type:
  exit status <NUM>

Raw Log Message:
  Apr 9 00:05:21 dc09_xcr002 Apr 1 05:05:21 dc09_xcr002 sshd[46507]: Accepted password for davidoh from 18.1.31.27 port 55912 ssh2
Headers:
  Time=Apr 9 00:05:21; Host=dc09_xcr002
Event Type:
  Accepted password for <*> from <IP> port <NUM> ssh2

A few different formats of network device syslog messages. The first row contains an explicit event code, while the next two require masking and template extraction.
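The last two rows of Table 1 have no explicit event code and require masking and template mining, the step described in Section 3.1.1 below. As a rough illustration, here is a minimal sketch using the open-source Drain3 miner mentioned there; the header regex and masking patterns are simplified stand-ins for the production ones:

```python
import re
from drain3 import TemplateMiner  # pip install drain3

# Simplified syslog header: "<Mon> <day> <hh:mm:ss> <host> <free text>"
HEADER_RE = re.compile(
    r"^(?P<time>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<text>.*)$")
MASKS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),  # IP addresses
    (re.compile(r"\b\d+\b"), "<NUM>"),                     # bare numbers
]

miner = TemplateMiner()

def event_type(raw_line: str):
    m = HEADER_RE.match(raw_line)
    if not m:
        return None
    text = m.group("text")
    for pattern, token in MASKS:  # mask common entities before mining
        text = pattern.sub(token, text)
    # Drain3 clusters the masked text; the mined template serves as the
    # event type when no explicit event code exists.
    return miner.add_log_message(text)["template_mined"]

print(event_type("Apr 6 11:20:07 dc02_console006 inetd[1074]: pid 3677: exit status 255"))
```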

Figure 1. ADEPTUS Complete Workflow

Integrated workflow for both offline training route (solid blue) and online inference route (dashed red).

Those solutions, however, usually require some degree of manual labeling of anomalies by SMEs on individual time series or logs, which is not always feasible. They also do not take advantage of time series metadata (e.g., log severity, role of the component that emitted the log, actual text of the log) as an input to the classifier. Such metadata might be helpful for determining the relevancy of an alert.
Meng et al. propose Positive and Unlabeled Learning (PU Learning), which requires only a portion of positive logs to be labeled [28]. Log messages are vectorized by textual content and then a Random Forest classifier is trained. Applying a classifier on each log message might not be scalable for big systems. Prefix [45] converts the logs of network switches into time-binned template sequences, extracts four types of features for each bin, and then trains a Random Forest classifier using failure tickets as ground truth to predict failures in near-future time bins.
Similar to ADEPTUS, some solutions are also able to localize the detected issues to a specific component in the monitored system [11, 30, 45]. This enables an SRE to consider fewer alerts and focus on the right set of components sooner. Other solutions [2, 4, 9, 21, 24, 28, 44, 46, 47] operate on the level of a single metric (time series) or log message only.

3 ADEPTUS Approach
ADEPTUS is composed of three main steps: (3.1) generate alert candidates with an unsupervised step; (3.2) prioritize alert candidates with a supervised step; (3.3) consolidate alert candidates with a heuristics-driven step that groups related alert candidates together and produces a combined alert score.
While the first step is also used in our previous paper, DECORUS, it is elaborated below in a more formal manner. The complete ADEPTUS workflow is presented in Figure 1 and detailed below.
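As a structural sketch of these three steps (the class and function names are ours, not the authors' code; the consolidation here is a toy maximum standing in for the hierarchical scheme of Section 3.3):

```python
from dataclasses import dataclass, field

@dataclass
class AlertCandidate:
    timestamp: int          # epoch seconds, rounded for temporal grouping
    host: str
    data_center: str
    features: dict = field(default_factory=dict)   # filled by step 3.1
    relevance: float = 0.0                          # filled by step 3.2

def consolidate(candidates):
    """Toy stand-in for step 3.3: max relevance per (timestamp, DC)."""
    dc_scores = {}
    for c in candidates:
        key = (c.timestamp, c.data_center)
        dc_scores[key] = max(dc_scores.get(key, 0.0), c.relevance)
    return dc_scores

def alerts(candidates, threshold):
    """Only DC-level scores above the threshold are promoted to alerts."""
    return [k for k, score in consolidate(candidates).items() if score >= threshold]
```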


3.1 Alert Candidate Generation
3.1.1 Log Parsing and Template Mining. ADEPTUS uses raw syslog messages produced by network devices as input. Logs are unstructured text messages. In a typical DC, devices of multiple vendors co-exist, and, unfortunately, logging conventions and formats are not consistent (see Table 1).
A simple solution would be to avoid any header extraction and directly apply template mining on the raw logs. However, log lines contain important information we would like to use explicitly: timestamp, hostname of the device emitting the logs, severity level of the log (optional), event type (optional), and free-form text content. For this reason, we apply a regular-expression-based header extractor to the fields mentioned above. Timestamp and severity fields are normalized among different vendors to a standard format [12]. In case an event type is included by the network device's vendor, we can use it directly for counting. Otherwise, we apply template mining, which attempts to recover the original format string (printf(), String.format(), etc.) that was used to emit the log message. The extracted log template serves as the event type. We use the Drain3 [31] log template miner, which is a production-ready variant that we created for the Drain log parser [15]. The Drain algorithm is applied only on the free-text content part of the log, after regular-expression-based masking of common entities like numbers, IP addresses, and URLs to improve its accuracy.

3.1.2 Log Aggregation. We generate message count vectors which serve as signals for the anomaly detection. Logs are grouped by hostname, event type and timestamp into 5-minute bins. The outcome is a multitude of time series, where each time series counts how many logs of a certain event type were emitted by a certain host in each 5-minute time window. In Figure 2, a single log event count time series is plotted. The metadata consisting of log severity, hostname and event type is associated with the time series for future prioritization.
We define $L_{h,e}(t)$ as the count of logs of event type $e$ for host $h$ at timestamp $t$. We define the log count vector as a time series $A_{h,e}(w)$, with $w$ the time window index and $d$ the duration of the window in seconds, as:

$$A_{h,e}(w) = \sum_{t=d \cdot w}^{d \cdot (w+1)-1} L_{h,e}(t). \quad (1)$$

3.1.3 Unsupervised Anomaly Detection. We detect anomalies on each log count vector independently using a Gaussian Tail rule [1, 39]. Essentially, we compare the prediction error rate of a short-term window and a long-term window. Compared to using the prediction error directly, this method is better at dealing with noisy time series and also tends to adapt to change points in time series faster, allowing the anomaly score to recover from the effects of the change point sooner than the raw prediction error.
The predicted value $P_{h,e}(w)$ of a time series at window $w$ is simply the mean of all historic data points, defined as:

$$P_{h,e}(w) = \mu(A_{h,e}^{w_0..w}), \quad (2)$$

in which the notation $A^{i_1..i_2}$ refers to a slice of vector $A$ defined by the indices $i_1$ and $i_2$. Note that the data points for the mean are counted starting from the first window $w_0$ since we started monitoring the system. This is done in order to produce a high anomaly score at the first occurrence of a rare event.
The raw prediction error $E_{h,e}(w)$ of a time series at window $w$ measures the absolute distance between the last sample and the historic mean in standard deviation units (also called Standard Score or Z-score [41]). It can be computed incrementally and efficiently when a new data point is added:

$$E_{h,e}(w) = \frac{A_{h,e}(w) - P_{h,e}(w)}{\sigma(A_{h,e}^{w_0..w})}. \quad (3)$$

The Gaussian Tail score $G_{h,e}(w)$ of a time series at window $w$ is the difference between two moving averages (MA) of the raw prediction error, measured in standard deviation units:

$$G_{h,e}(w) = \frac{\mu(W_1) - \mu(W_2)}{\sigma(W_2)}, \quad (4)$$

with $W_1 = E_{h,e}^{w-d_1..w}$ a short-term MA window of size $d_1$, and $W_2 = E_{h,e}^{w-d_2..w}$ a long-term window of size $d_2$, where $d_1 \ll d_2$. In practice, we use an Exponentially Weighted Moving Average and Standard Deviation [10] with two $\alpha$ settings for calculating the short- and long-term moving window values incrementally. Intuitively, the Gaussian Tail score measures how well the model is able to predict recent values as opposed to older values.
We only care about situations where the prediction ability of the model gets worse; therefore, we do not consider negative Gaussian Tail scores as anomalies. Next, we take the minimum of the raw prediction error (Z-score) and the clipped Gaussian Tail score to obtain a final anomaly score $S_{h,e}(w)$ for the log count vector defined by host $h$ and event type $e$ at time window $w$:

$$S_{h,e}(w) = \min(E_{h,e}(w), \max(G_{h,e}(w), 0)). \quad (5)$$

We use the three-sigma rule [35] to apply a threshold to the anomaly score $S$ and produce alert candidates. In practice, we found that using a slightly lower threshold of 2.999 produced better results than 3.0. Our evaluation shows that using Gaussian Tail instead of the raw Z-score for producing alert candidates improves the accuracy of the supervised model significantly, as will be discussed in the evaluation section (Section 4.2.4). Figure 2 shows alert candidates produced by raw Z-scores compared to those produced by Gaussian Tail scores. Table 2 shows a single alert candidate.
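To make Sections 3.1.2–3.1.3 concrete, first a compact sketch of the count-vector construction of Eq. (1); the input shape is illustrative:

```python
from collections import Counter

WINDOW_SEC = 300  # d: window duration (5 minutes)

def count_vectors(parsed_logs):
    """parsed_logs: iterable of (epoch_ts, host, event_type) tuples."""
    counts = Counter()
    for ts, host, event in parsed_logs:
        w = int(ts) // WINDOW_SEC          # window index w
        counts[(host, event, w)] += 1      # A_{h,e}(w)
    return counts
```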

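And a sketch of the scoring of Eqs. (2)–(5), one instance per (host, event type) series, using exponentially weighted statistics for the two error windows. The $\alpha$ values and the handling of the zero-variance edge case are our assumptions; the paper does not specify them:

```python
import math

class GaussianTailScorer:
    """Per-(host, event type) anomaly score, sketching Eqs. (2)-(5)."""
    def __init__(self, alpha_fast=0.3, alpha_slow=0.02):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0     # Welford accumulators
        self.a1, self.a2 = alpha_fast, alpha_slow
        self.ew_fast = self.ew_slow = self.ew_var = 0.0

    def update(self, count):
        # Raw prediction error E (Eq. 3): Z-score vs. all history so far.
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        if std > 0:
            error = abs(count - self.mean) / std
        else:
            # Zero-variance history: a first occurrence of a rare event gets
            # a high score (10.0 is an arbitrary cap we chose for the sketch).
            error = 0.0 if count == self.mean else 10.0
        # Fold the new point into the historic mean/std (Welford update).
        self.n += 1
        delta = count - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (count - self.mean)
        # Gaussian Tail G (Eq. 4): fast vs. slow EWMA of the error, in units
        # of the exponentially weighted std of the slow window.
        d = error - self.ew_slow
        self.ew_fast += self.a1 * (error - self.ew_fast)
        self.ew_slow += self.a2 * d
        self.ew_var = (1 - self.a2) * (self.ew_var + self.a2 * d * d)
        ew_std = math.sqrt(self.ew_var)
        g = (self.ew_fast - self.ew_slow) / ew_std if ew_std > 0 else 0.0
        # Final score S (Eq. 5): min of raw error and clipped Gaussian Tail.
        return min(error, max(g, 0.0))
```

A window whose score exceeds the (slightly relaxed) three-sigma threshold of 2.999 becomes an alert candidate.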

Figure 2. Anomaly Detection on a Log Count Vector. (a) Gaussian Tail; (b) Standard Z-score. [Plots omitted: event count versus time over the 10-month period, with inset excerpts of an 8-hour window.] Log count vector for network device dc01_dev272_i1 and event type %LDP-5-NBRCHG. The X-axis is the start time of the 5-minute aggregation window and the Y-axis is the number of messages. The smaller plots are an enlarged excerpt of an 8-hour period. In (a), the alert candidates (red dots) are detected using the Gaussian Tail model, while in (b) they are detected using the raw prediction error (Z-score). Gray dashed lines in the insets mark the outage start and end time of a related incident observed in the same data center (dc01) and same device (dev272_i1). Comparing (a) and (b), we clearly see that the Gaussian Tail anomaly score relaxes much faster. Also note that an anomaly is observed already at the first non-zero value (2020-09), due to zero-padding of time windows before the first occurrence of this event.

Table 2. An Alert Candidate
Timestamp            2021-02-13 13:55:00
Host                 dev272_i1
Data Center          DC01
Event Type           %LDP-5-NBRCHG
Severity             5
Role                 Backend Router
Window Event Count   9
Anomaly Score        3.07
Seen Times (Global)  614
Seen Times (DC)      258
Seen Times (Host)    21
Seen Days (Global)   43
Seen Days (DC)       16
Seen Days (Host)     3
Label                Strong
A sample alert candidate corresponding to Figure 2. The 'Seen' fields are added later in the preprocessing step (cf. 3.2.2, Rarity Features). The label is added later in the Automatic Label Acquisition step (cf. 3.2.1).

3.2 Alert Candidate Prioritization
Prioritization of alert candidates is achieved by using a supervised machine learning model. The model is trained to predict whether a single alert candidate is relevant by learning from labels of relevant and not-relevant alert candidates. The relevance score of an individual alert candidate is the prediction probability of the model, in the range 0..1. We can see this approach as the model 'recommending' an alert candidate to the SRE team, given its features, such as event type, device type, severity, rarity, event count, and so on.

3.2.1 Automatic Label Acquisition. Before we can train a supervised model, we have to acquire labels. Commonly, this is a pain point, and sometimes even a showstopper for the application of supervised models to our problem domain. There might be millions of alert candidates, which makes it effectively impossible for SREs to label a significant subset manually. Therefore, we rely on the help-desk solution of IBM Cloud instead, as an objective ground-truth data source. Our assumption is that a relevant alert candidate will have a corresponding incident ticket, such as the one in Table 3. The correlation of alert candidates and incident tickets is based on both time and location. In this way, we are able to obtain ground truth for supervised models in a programmatic manner.

Table 3. An Incident Ticket
Priority          1
Start Time        2021-02-13 02:01:00
End Time          2021-02-13 06:52:00
Title             Backend Network disruption behind dc01_dev272 in the DC01 data center
Data Center       DC01
Affected Devices  [dev272;dev272_i1;dev272_i2]
A sample ticket of an incident (partial). The set of affected devices is extracted from free-text fields. This ticket has strong spatial correlation and temporal correlation with a Log Count Vector anomaly in Figure 2 (dashed gray line).

Temporal Correlation: We assign a timestamp $c_t$ to each alert candidate. The timestamp is calculated as the time of the first log message in the time window of the corresponding host and event type. In case of zero-count anomalies, we use the middle of the time window as the timestamp. Conveniently, the incident tickets have start-time and end-time fields which are updated by the SREs when a ticket is created or while a problem is being investigated. We use only the start time as the incident timestamp $i_t$, assuming related log messages will usually appear near the incident start time, but will not always continue while the incident is ongoing. Even if they do continue, the anomaly score relaxes gradually after the first appearance. Let $I$ be the set of all historic incident tickets. We define a positive temporal correlation for an alert candidate $c$ if an incident ticket $i \in I$ exists such that $|c_t - i_t| \le cw$, where $cw$ is the size of the correlation window (i.e., five minutes).
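A minimal sketch of this label acquisition; the ticket and candidate field names are hypothetical. The spatial tests and the resulting Strong/Weak/None labels are the ones formalized next (Table 4), checked in row order with the first match winning:

```python
CW = 300  # cw: correlation window, five minutes in seconds

def relevance_label(cand, incidents):
    # Keep only incidents passing the temporal test: |ct - it| <= cw.
    near = [i for i in incidents if abs(cand["ts"] - i["start_ts"]) <= CW]
    if any(cand["host"] in i["affected_hosts"] for i in near):
        return "strong"   # temporal match + host match
    if any(cand["dc"] == i["dc"] for i in near):
        return "weak"     # temporal match + data-center match only
    return "none"
```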


Spatial Correlation: The location of an alert candidate is inherent in the time series it originated from. Both the host name $c_h$ and the data center name $c_{dc}$ of an alert candidate can be extracted from the metadata associated with its time series. The set of affected data centers is structured in each incident ticket. We can also extract a set of host names of affected network devices from the incident ticket and related records. This information is not structured, but can be extracted rather easily by applying simple regular expressions to the ticket description and free-text fields. Let $i_h$ and $i_{dc}$ be the affected host and data center of an incident ticket, respectively. We define a strong spatial correlation for an alert candidate $c$ if an incident ticket $i \in I$ exists such that $c_h = i_h$ (host match). A match on only the data center, $c_{dc} = i_{dc}$, is defined as a weak spatial correlation, since a log anomaly in a non-affected device is possible (e.g., a message about connection loss to the affected device), but we have less confidence about it.
Finally, we assign a relevance label to each alert candidate $c$ if an incident ticket $i \in I$ exists, as seen in Table 4.

Table 4. Correlation Conditions for Relevance Label
Label   Temporal  Spatial  Condition
Strong  Yes       Strong   $|c_t - i_t| \le cw$ and $c_h = i_h$
Weak    Yes       Weak     $|c_t - i_t| \le cw$ and $c_{dc} = i_{dc}$
None    otherwise
A Strong/Weak label means that we have high/low confidence that an alert candidate is related to an incident. A None label means that we have high confidence that the alert candidate is not related to an incident. Conditions are evaluated according to the order of rows in the table; the first match wins.

Note that alert candidates that were produced due to maintenance or user activity will not be correlated to an incident ticket; therefore, the model will learn to avoid those. Nevertheless, we could upgrade our relevant / not-relevant binary classifier to a multi-class classifier that would learn from maintenance tickets in order to predict the classes 'incident', 'maintenance', and 'not-relevant'. We plan to investigate this option as part of future work.

3.2.2 Preprocessing and Model Training. Once the labels are available for all alert candidates, we can train a supervised model. Our design decision of using a single alert candidate as model input keeps the dimensionality of the model rather low and, more importantly, allows us to train a single global model instead of one model per data center, thus significantly increasing the amount of data available for model training and enabling use of the pre-trained model on small data centers for which only few labels exist. However, it requires an additional step of alert candidate consolidation, which is performed in the next step (3.3).
We exclude alert candidates with a weak relevance label from training (cf. Table 4). We have less confidence that those candidates are related to an incident, and prefer not to use them as input samples. We map candidates with a strong/none relevance label to a binary target variable of 1 or 0, respectively.
Rarity Features. Our experiments show that an alert candidate with a novel event type has a 19.7 times higher probability of being related to an actual incident ticket. Novel event types are event types that were not present in the training set, which renders them out-of-vocabulary categories during inference. Also, hypothetically, the rarer an event type is, the greater its potential to become an incident. For example, an event of type E1 might be observed for the first time on host H1, but we may have encountered it many times on other hosts, which makes it less likely to be an actual issue. In order to capture the rarity information and allow the model to learn from it, we computed additional features for each alert candidate: the count of earlier occurrences of the event type in the same network device, in the same DC, and globally. We also counted on how many days each event type was seen in each of those spatial scopes, as some events are emitted in bursts, rendering absolute occurrence counts misleading. We excluded the most common alert candidates from training: event types that were seen in the scope of a host on more than 50% of days. This reduced the size of the training set to 41% without any decrease in accuracy.
Sample imbalance. On average, only one alert candidate out of every 1111 is positively labeled in the evaluation dataset (0.09%). We perform down-sampling of the majority samples: let $msf$ be the Majority Size Factor (a model hyperparameter) and $pos_{DC}$ be the count of positively labeled alert candidates in each DC. We randomly select up to $msf \cdot pos_{DC}$ negatively-labeled samples from each DC. Majority down-sampling obviously allows reducing model training time, but we have found that model accuracy might also decrease when the proportion of negative samples is above a certain sweet spot, as we discuss in more detail in our evaluation (Section 4.2.5).
The input features we use for each sample are as follows: event type, device role, log severity level (if it exists) and rarity counters. The event type is the explicit event code specified in the log (e.g., %ENVMON_6_FAN_SPEED_UNSTABLE) or the mined log template, if no explicit event code exists (cf. Table 1). The event type is treated both as a categorical feature and as a text feature. This allows the model to also learn text-based abstractions, for example "alert candidates related to FAN in FCS devices are not important". It is also more robust against template mining errors, which might occur, for example, due to small changes in logging code. Consider the following two logging statements:
log(f"Emergency shutdown started at {time}")
log(f"Emergency shutdown was initiated at {time}")
In case we change our code from the former statement to the latter, by incorporating the actual template and treating it as a text feature, the model can leverage knowledge learned prior to such a change (e.g., weights of the bi-gram "emergency shutdown").
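A sketch of how one labeled candidate might be flattened into model inputs, following the feature list above; the column names are ours, the 'seen' counters are the rarity features (cf. Table 2), and the tokenization rule for the text feature is an assumption:

```python
def to_features(cand):
    """Flatten one alert candidate (a dict) into a flat feature mapping."""
    feats = {
        "severity": cand.get("severity"),      # syslog level, if present
        "role": cand["role"],                  # device role (categorical)
        "event_type": cand["event_type"],      # full event type (categorical)
        "window_count": cand["window_count"],  # count in the 5-minute window
    }
    # Rarity counters: occurrences and distinct days per spatial scope.
    for scope in ("host", "dc", "global"):
        feats[f"seen_times_{scope}"] = cand[f"seen_times_{scope}"]
        feats[f"seen_days_{scope}"] = cand[f"seen_days_{scope}"]
    # Event type also as text: first 10 tokens, each encoded independently.
    tokens = cand["event_type"].replace("-", " ").replace("_", " ").split()
    for i in range(10):
        feats[f"word_{i}"] = tokens[i] if i < len(tokens) else ""
    return feats
```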


We evaluated the following supervised ML models:
Random Forest with CatBoost Category Encoder. Random Forest [6] is an ensemble learning technique that constructs multiple decision trees during training and considers the vote ratio for a certain class as the prediction probability for that class. Random Forest (RF) is robust to noisy features and class imbalance, and does not require careful hyperparameter tuning [24, 28, 45]. It works relatively well in our use case. We encoded categorical features (device role) using CatBoost encoding [32], a supervised encoding method that encodes categorical data based on the target label and also includes an ordering principle in order to overcome the problem of target leakage. For encoding of the event type text feature, we tokenized each event type into words and then encoded each of the first 10 words independently, also using the CatBoost encoder. The CatBoost encoder transforms each categorical value into a single floating-point number. This avoids having a large input dimensionality for categorical data with high cardinality, as opposed to one-hot encoding, for example. We used a popular Random Forest implementation provided by the Scikit-learn library [33].
CatBoost [32]. A high-performance machine learning algorithm based on the gradient boosting technique. Unlike Random Forest, which creates independent decision trees, in gradient boosting trees are created one after the other. The CatBoost toolkit handles some drawbacks of gradient boosting by reducing the importance of hyperparameter tuning, reducing overfitting, and offering faster training times thanks to the ability to use GPU training out-of-the-box.
Convolutional Neural Network (CNN). In addition to the tree-based techniques described above, we also investigated neural-network-based approaches. While the input features are exactly the same as for RF and CatBoost, some additional pre-processing is required for the CNN. We chose to use word embedding in order to encode the alert messages. Each input message is built by concatenating the event type, the rarity features, and the severity level of each log message. While the event type is a text-based feature, the rarity and severity are categorical features, and we had to map each level to a distinct word (e.g., severity level "L0" maps to "emergency", and level "L7" maps to "debugging"). We therefore obtain for each input a sequence of words whose embeddings we pre-train using word2vec [29]. Once the embeddings are pre-trained, each input can be fed to our classifier. The first layer of the classifier is an embedding layer which maps the words into a sequence of vectors, and is initialized with the pre-trained embeddings. The sequences are padded using the <PAD> token if they contain fewer tokens than the maximum sequence size, which we set to 60 words. The embedding layer is dynamic, meaning that its parameters can be adjusted during the training of the classifier. Following the embedding layer is a 3-channel convolutional network, a max pooling layer, and a fully connected layer with dropout and softmax output [20]. We used the Adam optimizer with a learning rate of $3 \times 10^{-4}$, and for all experiments we fixed the batch size to 1024 and the embedding dimension to 128. The model was implemented using TensorFlow 2.3.
Note that no hyperparameter optimization was performed for any of the supervised models; we used the suggested defaults for all settings.
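A minimal training sketch for the first variant, assuming the flat features above in a pandas DataFrame with binary labels. The CatBoost encoder here is the one from the category_encoders package; swapping in catboost.CatBoostClassifier (or the CNN) would give the other two variants:

```python
import pandas as pd
from category_encoders import CatBoostEncoder      # pip install category_encoders
from sklearn.ensemble import RandomForestClassifier

CAT_COLS = ["role", "event_type"] + [f"word_{i}" for i in range(10)]

def train(df: pd.DataFrame, y: pd.Series):
    encoder = CatBoostEncoder(cols=CAT_COLS)   # one float per categorical value
    X = encoder.fit_transform(df, y)           # target-based, ordered to limit leakage
    model = RandomForestClassifier()           # library defaults, as in the paper
    model.fit(X, y)
    return encoder, model

def relevance_scores(encoder, model, df: pd.DataFrame):
    # Prediction probability of the positive class = per-candidate relevance score.
    return model.predict_proba(encoder.transform(df))[:, 1]
```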


3.3 Inference and Consolidation
This step improves the detection accuracy by applying hierarchical grouping. It is usually performed in an online fashion, on a batch of recent, near real-time alert candidates. Online alert candidates are generated by applying the log parsing, log aggregation and unsupervised anomaly detection steps to fresh logs (cf. the red dashed lines in Figure 1 and Section 3.1). The output is a single score in the range 0..1 which specifies the probability that a relevant issue recently occurred in the data center.

3.3.1 Inference. The pre-trained model is loaded and applied to all recent alert candidates one by one, to infer a relevance score per alert candidate.

3.3.2 Consolidation. ADEPTUS groups related alert candidates in order to reduce the alert count. Instead of producing an alert for every alert candidate with a relevancy score over a threshold, we produce at most a single alert per timestamp, with an aggregated score of all the alert candidates with the same timestamp $t$. The timestamp resolution we selected for this temporal grouping is 1 second. A trivial way to aggregate the alert relevancy scores of a temporal group would be to use the maximum relevancy score in that group. However, we propose to improve on that by leveraging domain knowledge and performing hierarchical aggregation. First, we compute an alert relevancy score $r(t, h)$ per network device $h$, by calculating the $L_4$ norm [17] of the relevancy scores of all alert candidates for the same network device and temporal group:

$$r(t, h) = \left( \sum_{e \in E} S_{h,e}(t)^4 \right)^{\frac{1}{4}}. \quad (6)$$

For each data center, we select the network device with the maximum alert relevance score as the data-center relevance score $r_{dc}(t)$ for the temporal group. Scores larger than 1.0 are capped:

$$r_{dc}(t) = \min(1, \max(r(t, h) : h \in DC)). \quad (7)$$

The intuition for using the $L_4$ norm as our anomaly score aggregation function is similar to using the Root Mean Square Error (RMSE) [18] for measuring cumulative error. Using a power of 4 instead of 2 assigns even greater weight to large errors (anomalies), and computing a root aligns the scale of the result with that of the input. However, in the norm function, the sum is not divided by the total count of aggregated elements. This prevents network devices with a smaller number of log count vectors from getting an undue advantage, as they would if we averaged their anomaly scores.

3.3.3 Alerting. An alert is created for each timestamp $t$ when the data center relevance score $r_{dc}(t)$ is larger than a defined constant threshold in the range 0..1. The threshold selection affects the balance between recall and precision [34]. A low threshold will produce high recall at the expense of precision; a high threshold will give high precision at the expense of recall. A close-to-optimal threshold can be found automatically by evaluating the latest trained model on a labeled test set of alert candidates, as described in the evaluation section below (cf. 4.1.4).
To assist in fault localization for the alerts produced, the top-scored alert candidates that composed each hierarchical group are presented to the operator along with each alert, allowing quick focus on the most anomalous components (such as network devices).
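A sketch tying Eqs. (6)–(7) and the threshold test together, over (timestamp, DC, host, relevance) tuples; the input shape and helper names are illustrative:

```python
from collections import defaultdict

def dc_scores(scored_candidates):
    """scored_candidates: iterable of (t, dc, host, relevance), t rounded to 1s."""
    sum4 = defaultdict(float)                  # (t, dc, host) -> sum of relevance^4
    for t, dc, host, s in scored_candidates:
        sum4[(t, dc, host)] += s ** 4
    out = defaultdict(float)                   # (t, dc) -> max over hosts of L4 norm
    for (t, dc, host), v in sum4.items():
        out[(t, dc)] = max(out[(t, dc)], v ** 0.25)   # Eq. (6), then max over devices
    return {k: min(1.0, v) for k, v in out.items()}   # Eq. (7): cap at 1.0

def dc_alerts(scored_candidates, threshold=0.32):     # 0.32: optimum found for CatBoost
    return [k for k, v in dc_scores(scored_candidates).items() if v >= threshold]
```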

4 Evaluation
We evaluated ADEPTUS on a large real-world dataset consisting of the syslog messages of the network devices comprising the production infrastructure of 11 data centers of IBM Cloud. Overall, we obtained 5,056,305,714 raw log messages (over 4 TB of data) from 22,476 network routers and switches belonging to 58 different device types with various versions and vendors. Data center size is diverse, ranging from 22 to 8122 network devices in each. The logs were retrieved for the same period of 10 contiguous months in all data centers. They were translated to 444,568 individual time series (message count vectors) of 6120 event types.

Figure 3. Incident Distribution in Evaluation Dataset. [Plot omitted.] Distribution of the disruption start time for 2094 incident tickets in the 11 data centers we evaluated over a 10-month period. Each incident is denoted by a circle; overlapping incidents are indicated by the opaqueness of the circle.

For the aforementioned period, a total of 2094 network-related incident tickets were created in the help-desk database for the data centers under evaluation (cf. Figure 3). All incident tickets were intentionally created by an SRE as a response to an actual reliability issue. The tickets cover a wide range of issues: hard failures such as power loss and network switch reloads, and soft (intermittent) failures such as flaps of communication links and high packet error rates. Intermittent errors are tougher to detect [5], and commonly they are not handled by fault-tolerance mechanisms. Few of the incidents were classified by the SREs as Customer Impacting Events (CIE). The rest were transparent to clients thanks to the cloud infrastructure's fail-over and redundancy mechanisms. Nevertheless, we would like to detect non-CIE incidents as well, so that the SRE team may apply a manual remediation action before an issue escalates into a CIE.

4.1 Evaluation Strategy
4.1.1 Metrics. A popular way of measuring the performance of a binary classification task is using the precision, recall and F-score metrics [34]:

$$Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \quad (8)$$

$$Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \quad (9)$$

$$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall} \quad (10)$$

We consider a selected object to be an object that was marked as Positive by the classifier. Therefore, precision (cf. Eq. 8) measures how many relevant objects were selected out of all the selected objects, and recall (cf. Eq. 9) measures how many relevant objects were selected out of all relevant objects.
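For reference, the $F_\beta$ of Eq. (10) as a small helper; $\beta = 2$ is the setting used throughout this evaluation:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score (Eq. 10); returns 0 when both inputs are 0."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0

# F1 weights precision and recall evenly; F2 weights recall twice as heavily.
assert f_beta(0.5, 0.5, beta=1.0) == 0.5
```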


Table 5. ADEPTUS Accuracy, Compared to Baselines

                        |           Naive             |      Production     |        ADEPTUS
Metric                  | Random   Always-Pos. Perfect| Rule-Based  DECORUS | RF      CatBoost  CNN
Total Incidents #       | 1030     1030        1030   | 1030        1030    | 1030    1030      1030
Detected Incidents #    | 428      943         943    | 506         175     | 452     441       433
Produced Alerts #       | 147,401  4,116,724   25,383 | 18,396      2636    | 9435    7330      6920
True Alerts #           | 1212     25,383      25,383 | 1645        182     | 1440    1345      1392
Recall %                | 41.56    91.56       91.56  | 49.13       16.99   | 43.93   42.8      42.05
Precision %             | 1.56     0.915       100    | 13.78       10.88   | 18.61   19.84     20.61
Optimal F2 %            | 6.13     4.3         93.07  | 28.8        12      | 32.97   33.87     33.89
F2-SEM                  | 0.037    -           -      | -           -       | 0.065   0.027     0.042
Repeats                 | 3        1           1      | 1           1       | 51      51        48
Optimal Alert Threshold | 1.0      1.0         1.0    | 0.67        0.98    | 0.36    0.32      0.18

Model quality metrics of ADEPTUS with different classifiers, in comparison with naive baselines, domain experts (the rule-based approach) and our own prior-art unsupervised model (DECORUS). Metrics were measured at the optimal alert threshold, specified in the last row.

Precision and recall usually vary in opposite directions, depending on the selected classification confidence threshold (cf. 3.3.3). The balance between those two metrics can be expressed as the $F_\beta$-score (cf. Eq. 10), which is the harmonic mean of precision and recall. For $\beta = 1$, the F-score weights recall and precision evenly. However, we opted for using the $F_2$-score ($\beta = 2$) instead, since it considers recall twice as important as precision, and this reflects the preference of SREs in our use case to detect as many incidents as possible, even at the expense of incurring more false alarms.

4.1.2 Population. A possible approach is to use alert candidates as the object population for the accuracy metrics of 4.1.1, therefore counting as a true positive an alert candidate that we have strong confidence is related to an incident (cf. Table 4). However, alert candidates may not be evenly distributed among incidents: some incidents may produce many anomalies and some only a few, if any. Consider, for example, an extreme situation where all alert candidates relate to a single incident. A classifier which is able to correctly classify all alert candidates will gain a perfect F-score, even though only a single incident out of many was detected.
Therefore, we opted for a novel, hybrid population approach. We measure the accuracy of ADEPTUS as the $F_2$-score of the relevant alert rate and the detected incident rate. Precision is redefined to measure the rate of relevant alerts, and recall is redefined to measure the rate of detected incidents:

$$Precision = \frac{\text{true\_alert\_count}}{\text{total\_alert\_count}}, \quad (11)$$

$$Recall = \frac{\text{detected\_incident\_count}}{\text{total\_incident\_count}}. \quad (12)$$

An incident is counted as detected if an alert was produced up to five minutes before or after the disruption start time specified in the incident ticket, for the same data center. Note that we do not require the spatial element of the alert to contain a host name which is specified as an affected device in the incident ticket, because sometimes an incident can be detected by its influence on a peer device, e.g., connectivity loss in case of a power outage. We count an alert as relevant in a similar fashion: if the timestamp of the alert is within five minutes of the disruption start time of any incident ticket in the same data center. In order to minimize a possible distortion of the measured precision due to many alerts related to the same incident (as in the example above), we also perform temporal grouping of alerts by taking the maximum score of all alert candidates with the same timestamp, after rounding to the closest second.

4.1.3 Cross-Validation. Instead of using a single arbitrary train-test split and risking overfitting, we used Population-Informed Cross-Validation [8] for the model quality estimation. Since we are dealing with temporal data, traditional random selection of alert candidates for k-fold cross-validation is not possible due to the risk of future data leakage [40]. For each data center, we withhold the second half (five months) of the data and use it as a test set. The train set consists of all labeled alert candidates from the other data centers and the first half of alert candidates from the DC under test. Eventually, we are left with 11 folds for cross-validation. We then compute a single mean $F_2$-score across all data centers, weighted by the proportion of test incident tickets in a data center.

4.1.4 Optimal Alert Threshold Selection. Since we aim to deploy a single global model rather than a DC-specific model, we search for a single confidence value in the range 0..1 which will produce the optimal cross-DC mean $F_2$. We iterate over the unit range in steps of 0.01, considering as alerts only temporal groups with a relevance score greater than or equal to the attempted threshold (cf. 3.3.2). The threshold that produces the maximum cross-DC $F_2$-score is selected.

4.1.5 Repetitions. In a further effort to avoid bias, we repeat the evaluation 51 times per tested data center and alert threshold. The random seed value for any non-deterministic operation (data shuffling, majority down-sampling, classifier model training) uses the system time to avoid reproducible results. The mean $F_2$-score across all repetitions is selected. We use the Standard Error of the Mean (SEM) [3] to measure how close this mean is to the true optimal $F_2$-score of the model.
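A sketch of the sweep of Section 4.1.4; evaluate(t) stands in for the full cross-DC weighted-mean $F_2$ computation of Sections 4.1.1–4.1.3:

```python
def optimal_alert_threshold(evaluate):
    """Sweep thresholds 0.00..1.00 in 0.01 steps; keep the best cross-DC mean F2."""
    best = max(range(101), key=lambda step: evaluate(step / 100.0))
    return best / 100.0

# Toy usage with a made-up evaluation curve peaking at 0.32:
print(optimal_alert_threshold(lambda t: -(t - 0.32) ** 2))
```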


how close this mean is to the true optimal 𝐹 2 -score of the 4.2 Evaluation Results
model. 4.2.1 Overall Accuracy Evaluation. Table 5 summarizes
accuracy results for ADEPTUS with three different classifi-
cation models, and all compared baselines. All models had to
detect 1030 incidents which occurred in 11 data centers over
the last five months of the evaluation period, with a minimal
4.1.6 Baselines. We generated the following five baselines number of alerts.
for ADEPTUS model quality comparison: We start with a review of the naive baselines. The best pos-
Random baseline. The classifier assigns a uniformly ran- sible recall, 91.56%, is achieved by the Perfect and the Always
dom value between 0..1 as the relevance score of each alert Positive baselines, which were able to detect 943 incidents.
candidate. The Always Positive baseline suffers from very low precision
Always-Positive baseline. The classifier assigns value since it raises an alert for every alert candidate, to a total of
of 1 as the relevance score of each alert candidate, meaning more than four million alerts. It achieves the lowest 𝐹 2 -score
that every alert candidate is relevant, and no prioritization in the benchmark (4.3%). The Random baseline is marginally
is performed. better due to a better-balanced alert/recall trade-off. The Per-
Perfect Classifier baseline. The classifier assigns the fect baseline shows that up to 25,383 (0.62%) of alerts could be
ground truth value as the relevance score of each alert can- counted as true positives, as they are correlated temporally
didate: either 1 or 0 depending on the incident correlation and spatially to an incident. Note however that the Random
label of either strong or none. This reflects the best achievable and Always-Positive baselines are not entirely naive because
score. Note however that the recall is still less than 100%, the alert candidate generation, the unsupervised part of the
which can be due to two situations. Either the given incident model, provides some ability to detect relevant incidents.
type does not manifest itself in network device syslog mes- Otherwise, we would expect to see worse accuracy results.
sages, or we did not successfully detect it as an anomaly in DECORUS, our prior art model, is better than the naive
our Alert Candidate Generation step (cf. 3.1). baselines. It produces relatively few alerts at its optimal
Rule-based baseline. The actual alerts that were pro- alert threshold of 0.98, but is able to detect only 175 of the
duced by the syslog-based monitoring tool being used in the incidents to a recall of 16.99%, and 𝐹 2 -score of 12.0.
cloud network underlay. The tool produces alerts by inspect-
ing each syslog message and matching it to a list of over 500 4.2.2 ADEPTUS v.s. Rule-Based. The Rule-Base baseline
manually created and actively curated regular-expression is our real competitor, as it is the de facto, real-world syslog
rules. Rules can trigger an alert after a single match, or after X monitoring tool in this use case. It achieves a relatively high
occurrences in Y seconds. A sample rule: vendor == "Cisco"
AND os == "IOSXR" AND facility == "PKT_INFRA-LINEPROTO"
AND mnemonic == "UPDOWN" AND hostname !~ /tgr/ AND Figure 4. ADEPTUS versus Rule-Based
submessage !~ /tunnel-ip/.
Being the operational tool for syslog monitoring, this base-
line represents the best achievable score by a subject-matter
expert. Note that IBM Cloud employs additional, non-syslog-
based monitoring tools, but they are not included in this
comparison as they make use of data which is not available
to ADEPTUS. Each rule-based alert carries a priority level
through an integer in the range 1..7 which we normalize to
unit range. We take the maximal alert score per data center in
one-second resolution and apply the same scoring function
and metrics we used for our model evaluation.
Unsupervised Model DECORUS. Alerts that were produced by our prior-art model [43], which is fully unsupervised and uses manually tuned weights and additional domain knowledge. This model is being used in production for the cloud network underlay, as a second-net alerting solution in conjunction with the rule-based solution. In [43], we demonstrated that DECORUS is both more accurate and more resource-efficient than five other unsupervised anomaly detection approaches: LOF [7], PCA [19], Isolation Forest [25], LogCluster [23], OC-SVM [37].

The highest possible recall, 91.56%, is achieved by the Perfect and the Always-Positive baselines, which were able to detect 943 incidents. The Always-Positive baseline suffers from very low precision since it raises an alert for every alert candidate, to a total of more than four million alerts. It achieves the lowest 𝐹2-score in the benchmark (4.3%). The Random baseline is marginally better due to a better-balanced alert/recall trade-off. The Perfect baseline shows that up to 25,383 (0.62%) of alerts could be counted as true positives, as they are correlated temporally and spatially to an incident. Note however that the Random and Always-Positive baselines are not entirely naive, because the alert candidate generation, the unsupervised part of the model, provides some ability to detect relevant incidents. Otherwise, we would expect to see worse accuracy results.

DECORUS, our prior-art model, is better than the naive baselines. It produces relatively few alerts at its optimal alert threshold of 0.98, but is able to detect only 175 of the incidents, for a recall of 16.99% and an 𝐹2-score of 12.0%.
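Throughout this evaluation, accuracy is reported as the 𝐹2-score at each model's optimal alert threshold. A minimal sketch of that selection procedure, assuming scikit-learn, NumPy arrays of ground-truth labels and relevance scores, and an illustrative threshold grid of our choosing:

```python
import numpy as np
from sklearn.metrics import fbeta_score

def optimal_alert_threshold(y_true, scores, grid=np.linspace(0.0, 1.0, 101)):
    """Return the alert threshold in [0, 1] that maximizes the F2-score.

    y_true: binary ground-truth relevance labels (array-like).
    scores: relevance scores produced by a classifier (np.ndarray).
    """
    best_t, best_f2 = 0.0, -1.0
    for t in grid:
        f2 = fbeta_score(y_true, scores >= t, beta=2, zero_division=0)
        if f2 > best_f2:
            best_t, best_f2 = t, f2
    return best_t, best_f2
```

The 𝐹2-score weighs recall twice as heavily as precision, matching the operational preference for catching incidents over suppressing every spurious alert.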


4.2.2 ADEPTUS vs. Rule-Based. The Rule-Based baseline is our real competitor, as it is the de facto, real-world syslog monitoring tool in this use case. It achieves a relatively high recall of 49.1% by detecting 506 incidents at the optimal alert threshold of five out of the 1..7 integer range. However, it suffers from low precision, with only 13.78% of the alerts it produces being relevant, scoring a final 𝐹2-score of 28.8%.

ADEPTUS with the CatBoost classifier is able to detect 441 relevant incidents (42.8% recall), which, while less than the Rule-Based approach, still produces significantly fewer alerts, with a higher proportion of them being relevant (19.84% precision). This leads to an 𝐹2-score of 33.87%. The optimal alert threshold was found to be 0.32, which is less than the default classification threshold of 0.5.

It is worth noting that ADEPTUS alerts could not be inspected in real time by SREs, unlike rule-based alerts. We believe that at least a subset of the ADEPTUS alerts that are currently considered false positives would result in the creation of new incident tickets (due to an actual reliability issue that would not be detected without ADEPTUS). This would make those alerts true positives and further improve the 𝐹2-score.

Figure 4 shows the impact of the alert threshold on recall, precision and 𝐹2-score for both ADEPTUS and Rule-Based. We observe that at almost any given recall level, ADEPTUS achieves better precision than Rule-Based. In addition, ADEPTUS achieves higher 𝐹2-scores than Rule-Based at most alert threshold levels. With a recall of 49.58%, which is close to the optimal recall for Rule-Based, ADEPTUS achieves a precision of 12.83%, slightly less than the precision of Rule-Based.

Figure 4. ADEPTUS versus Rule-Based. Precision versus Recall and 𝐹2-score versus Alert Threshold plots of ADEPTUS-CatBoost and Rule-Based alerts. The highlighted circle marker denotes the optimal alert threshold.
4.2.3 Classification Models Comparison. The highest accuracy scores measured in terms of 𝐹2-score for the three classification models (RF, CatBoost, CNN) were quite close to each other. CNN achieved an 𝐹2 of 33.89%, CatBoost had an 𝐹2 of 33.87%, and RF lagged behind with an 𝐹2 of 32.97% (cf. Table 5). When considering training time and assuming the availability of a GPU, CatBoost is the most cost-effective model, with a much lower training time than CNN for the same count of samples (cf. Table 8). Nevertheless, the schema-less, unstructured text form of the input to the CNN model (cf. 3.2.2) has the benefit of being able to integrate alert candidates from multiple sources in a single model. For example, log-based alert candidates have an event type and a severity level (cf. Table 2), while metric-based alert candidates may carry a metric type and no severity information.
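As an illustration of the two tabular models compared here, the sketch below trains an RF on a single CPU core and CatBoost on a GPU, mirroring the setups in Table 8; all hyperparameters are illustrative rather than the paper's exact settings.

```python
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier

def train_models(X_train, y_train):
    rf = RandomForestClassifier(n_estimators=100, n_jobs=1)  # single CPU core
    rf.fit(X_train, y_train)

    cb = CatBoostClassifier(task_type="GPU", verbose=False)  # GPU training
    cb.fit(X_train, y_train)
    return rf, cb
```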
4.2.4 Impact of Gaussian Tail. ADEPTUS detects anomalies using the Gaussian Tail rule (cf. Eq. 5) instead of a vanilla Z-score. This has a significant positive impact on the results. In Table 6, we can observe that a vanilla Z-score produces vastly more alert candidates (anomalies) than Gaussian Tail for the same anomaly score threshold ≥ 3.0. The optimal 𝐹2-scores for the vanilla Z-score are generally lower, even when the count of alert candidates is about the same. We also note that the ADEPTUS classifier with Gaussian Tail benefits from having more alert candidates to choose from, scoring higher on 𝐹2 at lower alert thresholds.

Table 6. Gaussian Tail versus Vanilla Z-score

                 Vanilla Z-score           Gaussian Tail
Threshold        3.0   6.0   15.0  50.0    2.9   2.999  3.0
Anomalies (M)    57.6  31.9  13.4  1.8     21.4  15.6   1.7
𝐹2-score %       21.6  20.7  22.0  22.1    32.9  32.8   26.2

Accuracy (expressed as 𝐹2-score) of ADEPTUS with the RF classifier when the vanilla Z-score and the Gaussian Tail rule are used for producing alert candidates. Any anomaly with a score above the threshold level is considered an alert candidate that should be classified. The number of anomalies / alert candidates is in millions.
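Since Eq. 5 is not reproduced in this section, the sketch below contrasts a vanilla Z-score with a generic Gaussian-tail-style score (the negative log of the upper-tail probability of the fitted normal). It conveys the general idea only and is a stand-in, not the paper's exact rule.

```python
import math

def z_score(x, mean, std):
    """Vanilla Z-score: distance from the mean in standard deviations."""
    return (x - mean) / std if std > 0 else 0.0

def gaussian_tail_score(x, mean, std):
    """Surprise of x under N(mean, std^2): negative log of the upper-tail
    probability P(X >= x). Reacts more sharply to extreme values than z."""
    z = z_score(x, mean, std)
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300))
```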

The significant difference between Gaussian Tail and the vanilla Z-score can be explained by the tendency of Gaussian Tail to relax anomaly scores faster after a change point in the time series, as can be observed by comparing plots (a) and (b) in Figure 2. Recall that when training, we have positive labels only for alert candidates near the disruption start time of incident tickets. Having additional alert candidates long after the disruption start time, with similar features but a negative label, as the vanilla Z-score produces, prevents effective learning.

4.2.5 Impact of Majority Size Factor. The Majority Size Factor (MSF) hyperparameter determines the ratio of positive to negative labels, as described in Section 3.2.2. It is required for reducing the training time of the classification model (a sketch of the down-sampling step is given after Figure 5). Hypothetically, the more training data is available to the model, the better accuracy it will achieve. However, in Figure 5 we can observe that the 𝐹2-score reaches a maximum around an MSF of 90-110 and then starts declining, meaning additional negative labels do not contribute to accuracy. Also note that adding negative labels causes the classifier to be less confident and produce lower prediction probabilities in general. Therefore, the optimal alert threshold (cf. 3.3.3) declines as the MSF is increased.

Figure 5. Impact of Majority Size Factor. Metrics of ADEPTUS-RF at various MSF settings. 𝐹2-score in solid blue. Average count of samples used for training the classifier, after majority down-sampling, in dashed orange on the left axis. Optimal alert threshold in dashed green on the right axis. At the last MSF value of 2000, down-sampling is effectively disabled.
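A minimal sketch of majority down-sampling under an MSF budget, assuming lists of positive and negative training samples; the exact sampling procedure is the one described in Section 3.2.2 and may differ from this illustration.

```python
import random

def downsample_majority(positives, negatives, msf=100, seed=0):
    """Keep all positives and at most msf negatives per positive."""
    budget = msf * len(positives)
    if len(negatives) <= budget:
        return positives + negatives
    return positives + random.Random(seed).sample(negatives, budget)
```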

4.2.6 Impact of other heuristics. We have evaluated the impact of additional techniques introduced in this paper on the quality of the model. Calculation of rarity features with host-scoped rarity-based filtering was discussed in Section 3.2.2. Consolidation of alert candidates by hierarchical grouping was discussed in Section 3.3.2. In Table 7, we can observe the impact of these techniques on the 𝐹2-score of the ADEPTUS-CatBoost model.

Table 7. Impact of Heuristics

Heuristic        𝐹2 Before   𝐹2 After   Improvement
Rarity           32.46%      33.87%     4.43%
Consolidation    33.33%      33.87%     1.63%

Relative improvement in the 𝐹2-score of ADEPTUS-CatBoost when the Rarity and Consolidation techniques are disabled / enabled.
Table 8. Training Time Comparison

Classifier      Training Time   Processor
Random Forest   3m 13s          CPU - Intel Xeon (1 core)
CatBoost        1m 41s          GPU - Nvidia Tesla K40
CNN             30m             GPU - Nvidia Tesla V100
4.2.7 Training volume and time. The raw dataset that was used for the evaluation consists of 5,056,305,714 raw syslog messages. Vectorization and unsupervised anomaly detection reduced the data to 18,112,812 alert candidates. Applying rarity-based filtering reduced it further to 7,538,422 alert candidates. Majority down-sampling reduced the data volume even further, to an average training-set size of 820,502 samples for a single cross-validation fold. Overall, the various techniques described in this paper were able to reduce the data scale by a ratio of 1:6162.

Table 8 lists the time it took to train each of the classification models. For all models, the training set is a single cross-validation fold containing 807,086 samples (alert candidates). We can observe that by training on a GPU, CatBoost offers the best training time, in addition to the best accuracy, as demonstrated in 4.2.3.
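Restating the reduction funnel as arithmetic confirms the reported ratio; this is a small illustrative snippet, not part of the pipeline:

```python
stages = {
    "raw syslog messages":          5_056_305_714,
    "alert candidates":                18_112_812,
    "after rarity-based filtering":     7_538_422,
    "avg. training set per fold":         820_502,
}
ratio = stages["raw syslog messages"] / stages["avg. training set per fold"]
print(f"overall reduction ratio: 1:{ratio:.0f}")  # prints 1:6162
```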
5 Conclusion

In this paper, we introduced ADEPTUS as a hybrid unsupervised and supervised learning approach for the detection of relevant anomalies in textual logs. ADEPTUS makes use of existing databases of incident tickets to prepare ground truth in support of supervised learning. This avoids a time-consuming and often infeasible labeling effort by human experts. In a real-world evaluation on a large volume of syslog messages from network devices, ADEPTUS was able to perform better than a set of hundreds of regular-expression rules that were carefully curated by a team of subject-matter experts and are actually relied on for alerting in production.

The combination of unsupervised and supervised learning also makes ADEPTUS resource-efficient and suitable for processing data at a large scale, as is common in modern clouds. The unsupervised algorithm uses simple and fast statistical techniques to process all log messages, and only a small fraction of the logs are promoted to alert candidates for training and inference by the computationally more demanding supervised classification model.

In order to facilitate programmatic label acquisition similar to that presented in this paper, we recommend that SRE teams incorporate structured fields for "disruption start time" and "set of affected components" in the ticket schema of the help-desk solution being used by the organization. These fields can be populated during the root-cause analysis performed for incidents. This allows a rather precise correlation between incidents and alerts, which makes the "Unsupervised Detection, Supervised Prioritization" strategy feasible without incurring further manual labeling effort. Another simple option is to track acknowledgement actions on alerts, as well as other types of human interaction with alerts, and to use this information to prioritize alerts through a supervised model that predicts the probability of acknowledgement for new alerts. However, such explicit correlation is less robust against future changes in the facilities that generate alerts.
5.1 Future Work

We would like to incorporate advanced techniques in the unsupervised part of our approach: anomaly detection on the variable parts of log messages [9, 44], and additional statistical tests [21, 46]. We expect this will allow us to achieve better recall by detecting more classes of anomalies, with at least similar precision. In addition, we plan to evaluate our approach on additional sources of raw data, such as metric signals. In this case, metadata features such as component name or metric type can enable prioritization of anomalous measured values. Logs and metrics can be incorporated as an ensemble of independent models, or as an integrated model, as discussed in 4.2.3.

Another avenue to improve alert detection sensitivity would be to set up an alert feedback system that could be used, for example, by the SREs. The feedback system can be as simple as a "relevant" vs. "not-relevant" toggle switch, or more complex, including a text field for providing details on the alert's relevancy. The latter can be handled by our word-embedding approach, as described in 3.2.2. We did prototype a feedback system using an ensemble of two models: the first model is the ADEPTUS classifier presented in this paper, and the second model is another classifier trained on the alert feedback data. The ensemble then combines the predictions from the two models, giving the final alert score. We did observe a slight improvement in the 𝐹2-score, which is promising, but additional feedback data would be necessary to fully demonstrate the merit of the approach.


References

[1] Subutai Ahmad, Alexander Lavin, S. Purdy, and Zuha Agha. 2017. Unsupervised real-time anomaly detection for streaming data. Neurocomputing 262 (2017), 134-147.
[2] Nusaybah Alghanmi, Reem Alotaibi, and S. Buhari. 2019. HLMCC: A Hybrid Learning Anomaly Detection Model for Unlabeled Data in Internet of Things. IEEE Access 7 (2019), 179492-179504.
[3] D. Altman and J. Bland. 2005. Standard deviations and standard errors. BMJ: British Medical Journal 331 (2005), 903.
[4] S. Baek, Donghwoon Kwon, Jinoh Kim, S. Suh, Hyunjoo Kim, and Ikkyun Kim. 2017. Unsupervised Labeling for Supervised Anomaly Detection in Enterprise and Cloud Networks. 2017 IEEE 4th International Conference on Cyber Security and Cloud Computing (CSCloud) (2017), 205-210.
[5] Roozbeh Bakhshi, Surya Tej Kunche, and Michael G. Pecht. 2014. Intermittent Failures in Hardware and Software. Journal of Electronic Packaging 136 (2014), 011014.
[6] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5-32. https://doi.org/10.1023/A:1010933404324
[7] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: Identifying Density-Based Local Outliers. SIGMOD Rec. 29, 2 (May 2000), 93-104. https://doi.org/10.1145/335191.335388
[8] Courtney Cochrane. 2018. Time Series Nested Cross-Validation. https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
[9] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017).
[10] Tony Finch. 2009. Incremental calculation of weighted mean and variance. (01 2009).
[11] Moshe Gabel, A. Schuster, Ran Gilad-Bachrach, and N. Bjørner. 2012. Latent fault detection in large scale services. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012) (2012), 1-12.
[12] R. Gerhards. 2009. The Syslog Protocol, RFC 5424. DOI 10.17487/RFC5424 (2009), 9-10.
[13] Chuanxiong Guo, L. Yuan, Dong Xiang, Yingnong Dang, Ray Huang, D. Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, H. Chen, Zhi Lin, and Varugis Kurien. 2015. Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. In SIGCOMM.
[14] Maheen Hasib and John A. Schormans. 2003. Limitations of Passive & Active Measurement Methods in Packet Networks.
[15] Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. 2017. Drain: An Online Log Parsing Approach with Fixed Depth Tree. 2017 IEEE International Conference on Web Services (ICWS) (2017), 33-40.
[16] J. Higgins. 1993. Classification and approximation with rule-based networks.
[17] Roger A. Horn and Charles R. Johnson. 1985. Norms for vectors and matrices.
[18] R. Hyndman and A. Koehler. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22 (2006), 679-688.
[19] Ian Jolliffe. 2002. Principal Component Analysis. Springer Verlag, New York.
[20] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882 (2014). arXiv:1408.5882 http://arxiv.org/abs/1408.5882
[21] N. Laptev, S. Amizadeh, and Ian Flint. 2015. Generic and Scalable Framework for Automated Time-series Anomaly Detection. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015).
[22] Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Y. Zhang, and Xuewei Chen. 2016. Log Clustering Based Problem Identification for Online Service Systems. 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C) (2016), 102-111.
[23] Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log Clustering Based Problem Identification for Online Service Systems. In Proceedings of the 38th International Conference on Software Engineering Companion (Austin, Texas) (ICSE '16). Association for Computing Machinery, New York, NY, USA, 102-111. https://doi.org/10.1145/2889160.2889232
[24] Dapeng Liu, Y. Zhao, Haowen Xu, Yongqian Sun, Dan Pei, Jiao Luo, Xiaowei Jing, and Mei Feng. 2015. Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning. Proceedings of the 2015 Internet Measurement Conference (2015).
[25] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining. 413-422. https://doi.org/10.1109/ICDM.2008.17
[26] Han Liu. 2015. Rule based systems for classification in machine learning context.
[27] Han Liu and Alexander E. Gegov. 2016. Rule based systems and networks: Deterministic and fuzzy approaches. 2016 IEEE 8th International Conference on Intelligent Systems (IS) (2016), 316-321.
[28] Weibin Meng, Ying Liu, Shenglin Zhang, Federico Zaiter, Yuzhe Zhang, Yuheng Huang, Zhaoyang Yu, Yuzhi Zhang, Lei Song, Ming Zhang, and Dan Pei. 2021. LogClass: Anomalous Log Identification and Classification With Partial Labels. IEEE Transactions on Network and Service Management 18 (2021), 1870-1884.
[29] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Neural Information Processing Systems (NIPS).
[30] Vinod Nair, A. Raul, Shwetabh Khanduja, Vikas Bahirwani, Sundararajan Sellamanickam, S. Keerthi, Steve Herbert, and Sudheer Dhulipalla. 2015. Learning a Hierarchical Monitoring System for Detecting and Diagnosing Service Issues. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015).
[31] David Ohana and Moshik Hershcovitch. 2020. IBM/Drain3: Drain log template miner in Python3. https://github.com/IBM/Drain3
[32] L. Ostroumova, Gleb Gusev, A. Vorobev, Anna Veronika Dorogush, and A. Gulin. 2018. CatBoost: unbiased boosting with categorical features. In NeurIPS.
[33] Fabian Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, Mathieu Blondel, Gilles Louppe, P. Prettenhofer, Ron Weiss, Ron J. Weiss, J. Vanderplas, Alexandre Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12 (2011), 2825-2830.
[34] D. Powers. 2020. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. ArXiv abs/2010.16061 (2020).
[35] Friedrich Pukelsheim. 1994. The Three Sigma Rule. The American Statistician 48, 2 (1994), 88-91. http://www.jstor.org/stable/2684253
[36] Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, and Q. Zhang. 2019. Time-Series Anomaly Detection Service at Microsoft. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019).
[37] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Computation 13, 7 (2001), 1443-1471. https://doi.org/10.1162/089976601750264965
[38] Weiyi Shang, Zhen Ming Jiang, H. Hemmati, B. Adams, A. Hassan, and Patrick Martin. 2013. Assisting developers of Big Data Analytics Applications when deploying on Hadoop clouds. 2013 35th International Conference on Software Engineering (ICSE) (2013), 402-411.
[39] Dominique T. Shipmon, Jason M. Gurevitch, Paolo Piselli, and Stephen T. Edwards. 2017. Time Series Anomaly Detection: Detection of anomalous drops with limited features and sparse examples in noisy highly periodic data. ArXiv abs/1708.03665 (2017).
[40] L. Tashman. 2000. Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecasting 16 (2000), 437-450.
[41] G. Upton, R. Larsen, and M. L. Marx. 1987. An introduction to mathematical statistics and its applications (2nd edition), by R. J. Larsen and M. L. Marx. The Mathematical Gazette 71 (1987), 251-252.
[42] Zhuo Wang, Wei Zhang, Ning Liu, and Jianyong Wang. 2021. Scalable Rule-Based Representation Learning for Interpretable Classification. ArXiv abs/2109.15103 (2021).
[43] Bruno Wassermann, David Ohana, Ronen Schaffer, Robert Shahla, Elliot K. Kolodner, Eran Raichstein, and Michal Malka. 2022. DeCorus: Hierarchical Multivariate Anomaly Detection at Cloud-Scale. arXiv:2202.06892 [cs.LG]
[44] W. Xu, Ling Huang, A. Fox, D. Patterson, and Michael I. Jordan. 2009. Detecting large-scale system problems by mining console logs. In SOSP '09.
[45] Shenglin Zhang, Y. Liu, Weibin Meng, Zhiling Luo, Jiahao Bu, Sen Yang, Peixian Liang, Dan Pei, Jun Xu, Yuzhi Zhang, Y. Chen, Hui Dong, Xianping Qu, and Lei Song. 2019. PreFix: Switch Failure Prediction in Datacenter Networks. In PERV.
[46] Xu Zhang, Qingwei Lin, Yong Xu, Si Qin, Hongyu Zhang, Bo Qiao, Yingnong Dang, Xinsheng Yang, Qian Cheng, Murali Chintalapati, Youjiang Wu, Ken Hsieh, Kaixin Sui, Xin Meng, Yaohai Xu, Wenchi Zhang, S. Furao, and Dongmei Zhang. 2019. Cross-dataset Time Series Anomaly Detection for Cloud Systems. In USENIX Annual Technical Conference.
[47] Deqing Zou, Hao Qin, Hai Jin, Weizhong Qiang, Zongfen Han, and X. Chen. 2014. Improving Log-Based Fault Diagnosis by Log Classification. In NPC.
