Hybrid Anomaly Detection and Prioritization for Network Logs at Cloud Scale
led to its wide use for the detection of health and performance issues. First, it offers determinism [27], so that we can trust that given certain conditions, a matching alert will be raised. The second benefit is explainability: the reason for each alert is easy to understand [16, 26, 42]. Finally, there is an element of simplicity, as manual alerting rules are relatively simple to implement, and computational requirements as well as detection latency can be kept low.

However, a rule-based approach also suffers from a number of shortcomings that become more significant as the scale of systems grows [26, 42], making it an inadequate solution for reliable failure detection in large systems. Commonly, it cannot detect novel failures that have not been encountered previously and encoded as a rule, such as a new type of error message that is emitted for the first time after a software upgrade, or a failure expressed through a metric that was not previously available. Inaccurate rules and changes in workload over time and geography can lead to a high rate of false alarms. Maintaining a large database of alert rules requires ongoing investment in detecting overlapping rules, updating rules, and deleting obsolete ones. Finally, human mistakes in rule definitions are difficult to avoid.

In this paper, we present ADEPTUS (an acronym for Anomaly Detection and Prioritization Through hybrid Unsupervised and Supervised learning). This approach combines unsupervised and supervised learning to produce highly sensitive and relevant alerts that can be used instead of, or in addition to, manual alerting rules. The first step in ADEPTUS is unsupervised: it uses raw logs as input, and its output is a set of log anomalies with anomaly scores above a certain threshold, which are considered alert candidates. The second step is supervised. It uses the alert candidates as input and trains a classification model to produce a relevance score that indicates how important an alert candidate is likely to be for the SRE team. We evaluated three classification models: Random Forest, CatBoost, and a Convolutional Neural Network. The third step is heuristics-driven. It performs inference and then temporal and spatial hierarchical grouping of alert candidates to produce a single system-level alert score. Only system-level scores above a configurable threshold are promoted to alerts and brought to the attention of the reliability team.

In our evaluation, we use syslog [12] messages generated by the network devices in several data centers of IBM Cloud as the raw input data for anomaly detection. The unstructured data provided by log messages needs to be properly parsed and quantized in order to obtain suitable features for machine learning. We mine log templates from the free-text portion of a log entry and then compute time-windowed message-count vectors for each unique combination of network device and event type. In addition, we extract the log severity level, the network device role (function), and the rarity level of the events. It is possible to obtain additional features from logs, for example, state-ratio vectors based on message parameter values [44] and weight coefficients per word [28]. Each feature has the potential to increase the coverage and accuracy of an anomaly detector. In this work, using count vectors and metadata (i.e., severity, device role, etc.) was sufficient to achieve accuracy higher than rule-based approaches.

ADEPTUS needs labeled data for its supervised part. IBM Cloud uses a help-desk software solution that records how the SREs reacted to alerts and tracks incidents that occurred in the monitored system. We used temporal and spatial correlation on this information source to obtain labels programmatically, without any additional manual labeling effort.

Our approach can be generalized and applied also to numeric metrics (CPU, IOPS, request rates, etc.), and in other domains beyond networking. It is also possible to apply only the supervised prioritization step (cf. 3.2) of ADEPTUS by accepting alert candidates provided by other solutions (e.g., rule-based) as input. ADEPTUS can then prioritize the alert candidates and promote only the most relevant ones.

We also applied the supervised part of ADEPTUS for prioritization of alert candidates generated by the IT systems of 4 different partners (in the telecommunications, finance, and transportation domains), and were able to produce relevant alerts with significantly higher accuracy than the current state. This may serve as preliminary evidence that our approach is generally applicable beyond the network infrastructure domain.

In this paper, we present a highly scalable end-to-end solution that is able to process over 5 billion real-world logs on a single machine with low overhead and latency, and with better-than-human accuracy. The main contributions of this paper are as follows:

• We propose a practical, generic, and cost-effective approach for supervised anomaly detection based on incident tickets as ground truth.
• We present an efficient way to apply an existing machine learning technique to improve anomaly detection: applying supervised learning only on anomalies that were generated beforehand by unsupervised learning.
• To the best of our knowledge, we are the first to use log metadata as features for a classification model.
• We introduce and evaluate innovative techniques for improving detection accuracy: a Gaussian Tail rule with exponentially weighted moving average (EWMA); rarity features; alert consolidation using a norm function; re-use of the optimal alert threshold from the latest evaluation cycle.
• We introduce novel ideas in the evaluation strategy: a hybrid population for the $F_\beta$-score and cross-validation based on data center.
• We evaluate our approach on a large real-world dataset and compare its accuracy to human accuracy based on manually curated rules.
• We compare the accuracy and training times of three popular classification models (Random Forest, CatBoost, Convolutional Neural Networks) for the same classification problem, demonstrating that a state-of-the-art gradient boosting model such as CatBoost may achieve an accuracy similar to that of a Neural Network model, with significantly lower training time and complexity.

2 Background and Related Work
Over recent years, various approaches have been proposed for AIOps (Artificial Intelligence for IT Operations), or more specifically, for automated, real-time monitoring and detection of health and performance issues in IT systems. The goal of AIOps is to make SRE teams more efficient.

AIOps solutions may be divided according to the type of input data they use for anomaly detection. As discussed previously, some use numeric metrics or KPIs [2, 4, 11, 21, 24, 30, 36] and others use textual logs [9, 28, 44, 45, 47]. Logs are an invaluable resource for getting insights into a system, especially when deployed in a production environment [22]. It is relatively easy for software developers to add new types of log messages. Logs contain more diverse information than what can be conveyed in numeric metrics. In spite of this, monitoring and alerting is often based on metrics, whereas logs are more commonly used for postmortems and root cause analysis of an issue. Anomaly detection on logs allows early detection of many types of issues that are only manifested in logs.

Many of these solutions use unsupervised learning, which does not require a labeled dataset to learn from. The curation of labels for normal and anomalous data points is often considered impractical, as it is labor-intensive and requires domain knowledge, usually provided by an SRE team that is already heavily loaded. Gabel et al. detect latent failures by comparing many machines with similar hardware and workload [11]. However, the homogeneity prerequisite is often not applicable: in our use case we would also like to detect problems in aggregating network devices. There are few aggregation devices in each data center, typically 2 or 4 of each functionality. Nevertheless, they are highly prioritized for alerting, since an issue in such a device might affect many customers. In addition, devices from multiple vendors or models are often used for the same role. Xu et al. extract log message count vectors grouped by an identifier (and not by time window as in ADEPTUS), and ratios of state (categorical) parameters found in log messages. Then, they apply PCA to the extracted features to detect unusual log segments [44]. Leveraging information found in log message parameters is an advantage that our solution does not currently have. However, the prerequisite of source-code availability is often not practical, as many logs come from third-party components or pre-compiled libraries.

Unsupervised techniques, especially those that claim to be generic and require little or no configuration, might suffer from a high false-positive rate. Often, those false alarms are indeed statistically significant anomalous events; however, they are not interesting from the perspective of an SRE [21]. Examples of significant yet irrelevant anomalies are human activities that are performed as a part of maintenance or upgrades of the system, issues in components that are not being used for production workloads, etc. Unsupervised solutions are often able to achieve a high alert recall rate [34], but at the cost of low precision [22, 38]. DECORUS [43], our previous work, attempts to distinguish relevant anomalies from the rest by allowing the incorporation of domain knowledge, such as system topology, weights, and anomaly directions, in the prioritization of the anomalies detected. Nair et al. also use domain knowledge about system topology to form a hierarchy of anomaly detectors [30]. However, an attempt to capture the knowledge of Subject-Matter Experts (SMEs) and formalize it mathematically can lead to a large count of domain-knowledge entries, thus creating a maintainability problem similar to the approach of curating a large database of alerting rules [26]. Furthermore, the accuracy of such an approach is also reduced, because SMEs might have gaps and biases in their knowledge [26]. For example, we observed in our evaluation that anomalies related to syslogs with severity level 'Error' (3) have a probability of 12.7% of being related to an incident. For the 'Warning' (4) level, this probability is reduced to 5.2%, as expected; but for the 'Notice' (5) level, the probability is raised to 9.5%, which is counter-intuitive. The aforementioned limitations of DECORUS served as our motivation for creating the more accurate ADEPTUS model.

Another option for improving the relevancy of alarms produced by unsupervised solutions is to add a feedback loop to the workflow, in which SREs can encourage or discourage the detection of some alerts [9, 21]. Feedback requires ramp-up time and many samples to learn from in order to be effective. Feedback also suffers from being subjective and prone to human mistakes, which results in contradicting or misleading feedback. For example, in one case, an SRE provided negative feedback ('not relevant') to an alert because a similar alert was already produced by another monitoring tool. We clearly want to keep such an alert, as we aim for an independent alerting tool. In addition, having an option to provide feedback only on false positives [9] cannot improve the detection of false negatives.

Supervised learning is also being proposed as an approach for anomaly detection in AIOps scenarios, even though it is less common. Usually, supervised learning is preceded by an unsupervised learning step that extracts statistical, temporal, or forecasting features from raw data [21, 24, 45, 46]. Extracted features are transformed into samples and fed to a classifier (usually a Random Forest classifier) to decide whether to produce an alert. Those solutions, however, usually require some degree of manual labeling of anomalies by
SMEs on individual time series or logs, which is not always feasible. They also do not take advantage of time series metadata (e.g., log severity, role of the component that emitted the log, actual text of the log) as an input to the classifier. Such metadata might be helpful for determining the relevancy of an alert.

Meng et al. propose Positive and Unlabeled Learning (PU Learning), which requires only a portion of positive logs to be labeled [28]. Log messages are vectorized by textual content and then a Random Forest classifier is trained. Applying a classifier on each log message might not be scalable for big systems. PreFix [45] converts the logs of network switches into time-binned template sequences, extracts four types of features for each bin, and then trains a Random Forest classifier using failure tickets as ground truth to predict failures in near-future time bins.

Similar to ADEPTUS, some solutions are also able to localize the detected issues to a specific component in the monitored system [11, 30, 45]. This enables an SRE to consider fewer alerts and focus on the right set of components sooner. Other solutions [2, 4, 9, 21, 24, 28, 44, 46, 47] operate on the level of a single metric (time series) or log message only.

Figure 1. Integrated workflow for both the offline training route (solid blue) and the online inference route (dashed red).

3 ADEPTUS Approach
ADEPTUS is composed of three main steps: (3.1) generate alert candidates with an unsupervised step; (3.2) prioritize alert candidates with a supervised step; (3.3) consolidate alert candidates with a heuristics-driven step that groups related alert candidates together and produces a combined alert score.

While the first step is also used in our previous paper, DECORUS, it is elaborated below in a more formal manner. The complete ADEPTUS workflow is presented in Figure 1 and detailed below.
3.1 Alert Candidate Generation
3.1.1 Log Parsing and Template Mining. ADEPTUS uses raw syslog messages produced by network devices as input. Logs are unstructured text messages. In a typical DC, devices of multiple vendors co-exist, and, unfortunately, logging conventions and formats are not consistent (see Table 1).

A simple solution would be to avoid any header extraction and directly apply template mining on the raw logs. However, log lines contain important information we would like to use explicitly: timestamp, hostname of the device emitting the logs, severity level of the log (optional), event type (optional), and free-form text content. For this reason, we apply a regular-expression-based header extractor to the fields mentioned above. Timestamp and severity fields are normalized among different vendors to a standard format [12]. In case the event type is included by the network device's vendor, we can use it directly for counting. Otherwise, we apply template mining, which attempts to recover the original format string (printf(), String.format(), etc.) that was used to emit the log message. The extracted log template serves as the event type. We use the Drain3 [31] log template miner, which is a production-ready variant that we created for the Drain log parser [15]. The Drain algorithm is applied only on the free-text content part of the log, after regular-expression-based masking of common entities like numbers, IP addresses, and URLs to improve its accuracy.
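As an illustration, a minimal Drain3 invocation might look like the sketch below. The configuration file name and the masking rules it would contain are assumptions; the paper does not show the production masking configuration.

```python
# pip install drain3
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
# Assumed config file holding masking rules for numbers, IPs, and URLs.
config.load("drain3.ini")
miner = TemplateMiner(config=config)

# Free-text content only, after the regex-based header extractor has
# already stripped the timestamp, hostname, and severity fields.
content = "Neighbor 10.0.0.7 is DOWN, interface xe-0/0/3"
result = miner.add_log_message(content)
print(result["cluster_id"])      # stable id of the mined template
print(result["template_mined"])  # e.g. "Neighbor <IP> is DOWN, interface <*>"
```

The mined template (or its cluster id) then serves as the event type for the aggregation step that follows.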
3.1.2 Log Aggregation. We generate message count vectors, which serve as signals for the anomaly detection. Logs are grouped by hostname, event type, and timestamp into 5-minute bins. The outcome is a multitude of time series, where each time series counts how many logs of a certain event type were emitted by a certain host in each 5-minute time window. In Figure 2, a single log event count time series is plotted. The metadata consisting of log severity, hostname, and event type is associated with the time series for future prioritization.

We define $L_{h,e}(t)$ as the count of logs of event type $e$ for host $h$ at timestamp $t$. We define the log count vector as a time series $A_{h,e}(w)$, with $w$ the time window index and $d$ the duration of the window in seconds, as:

$$A_{h,e}(w) = \sum_{t = d \cdot w}^{d \cdot (w+1) - 1} L_{h,e}(t). \tag{1}$$
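A compact sketch of this binning, assuming parsed log records with epoch timestamps; pandas is used here for brevity, although any streaming counter would do, and the field names are illustrative.

```python
import pandas as pd

WINDOW_SECONDS = 300  # d: 5-minute bins

# Hypothetical parsed records: output of header extraction + template mining.
records = pd.DataFrame([
    {"ts": 1613181300, "host": "dev272_i1", "event_type": "%LDP-5-NBRCHG"},
    {"ts": 1613181310, "host": "dev272_i1", "event_type": "%LDP-5-NBRCHG"},
    {"ts": 1613181920, "host": "dev272_i1", "event_type": "%LDP-5-NBRCHG"},
])

# Window index w = floor(t / d); counting logs per (host, event type, window)
# yields the count vector A_{h,e}(w) of Eq. (1).
records["w"] = records["ts"] // WINDOW_SECONDS
counts = (records.groupby(["host", "event_type", "w"])
                 .size()
                 .rename("count"))
print(counts)
```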
3.1.3 Unsupervised Anomaly Detection. We detect anomalies on each log count vector independently using a Gaussian Tail rule [1, 39]. Essentially, we compare the prediction error rate of a short-term window and a long-term window. Compared to using the prediction error directly, this method is better at dealing with noisy time series and also tends to adapt to change points in time series faster, allowing the anomaly score to recover from the effects of a change point sooner than the raw prediction error.

The predicted value $P_{h,e}(w)$ of a time series at window $w$ is simply the mean of all historic data points, defined as:

$$P_{h,e}(w) = \mu\left(A_{h,e}^{w_0..w}\right), \tag{2}$$

in which the notation $A^{i_1..i_2}$ refers to a slice of vector $A$ defined by the indices $i_1$ and $i_2$. Note that the data points for the mean are counted starting from the first window $w_0$, when we started monitoring the system. This is done in order to produce a high anomaly score at the first occurrence of a rare event.

The raw prediction error $E_{h,e}(w)$ of a time series at window $w$ measures the absolute distance between the last sample and the historic mean in standard deviation units (also called Standard Score or Z-score [41]). It can be computed incrementally and efficiently when a new data point is added:

$$E_{h,e}(w) = \frac{A_{h,e}(w) - P_{h,e}(w)}{\sigma\left(A_{h,e}^{w_0..w}\right)}. \tag{3}$$

The Gaussian Tail score $G_{h,e}(w)$ of a time series at window $w$ is the difference between two moving averages (MA) of the raw prediction error, measured in standard deviation units:

$$G_{h,e}(w) = \frac{\mu(W_1) - \mu(W_2)}{\sigma(W_2)}, \tag{4}$$

with $W_1 = E_{h,e}^{w-d_1..w}$ a short-term MA window of size $d_1$, and $W_2 = E_{h,e}^{w-d_2..w}$ a long-term window of size $d_2$, where $d_1 \ll d_2$. In practice, we use an Exponentially Weighted Moving Average and Standard Deviation [10] with two $\alpha$ settings for calculating the short- and long-term moving window values incrementally. Intuitively, the Gaussian Tail score measures how well the model is able to predict recent values as opposed to older values.

We only care about situations where the prediction ability of the model gets worse; therefore, we do not consider negative Gaussian Tail scores as anomalies. Next, we take the minimum of the raw prediction error (Z-score) and the clipped Gaussian Tail score to obtain a final anomaly score $S_{h,e}(w)$ for the log count vector defined by host $h$ and event type $e$ at time window $w$:

$$S_{h,e}(w) = \min\left(E_{h,e}(w),\ \max(G_{h,e}(w), 0)\right). \tag{5}$$

We use the three-sigma rule [35] to apply a threshold to the anomaly score $S$ and produce alert candidates. In practice, we found that using a slightly lower threshold of 2.999 produced better results than 3.0. Our evaluation shows that using the Gaussian Tail score instead of the raw Z-score for producing alert candidates improves the accuracy of the supervised model significantly, as will be discussed in the evaluation section (Section 4.2.4). Figure 2 shows alert candidates produced by raw Z-scores compared to those produced by Gaussian Tail scores. Table 2 shows a single alert candidate.
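The following is a minimal sketch of an incremental scorer implementing Eqs. (2)-(5), assuming EWMA statistics for both windows; the $\alpha$ values and the $\varepsilon$ guard are illustrative placeholders, not the settings used in the paper.

```python
import math

class GaussianTailScorer:
    """Sketch of the anomaly score of Eqs. (2)-(5) for a single
    log count vector (one host / event type pair)."""

    def __init__(self, alpha_short=0.3, alpha_long=0.03, eps=1e-9):
        self.n = 0              # windows seen since w0
        self.mean = 0.0         # running mean of A (Eq. 2)
        self.m2 = 0.0           # running sum of squared deviations
        self.ew_short = 0.0     # short-term EWMA of E   (mu(W1))
        self.ew_long = 0.0      # long-term EWMA of E    (mu(W2))
        self.ew_long_var = 0.0  # EW variance of E       (sigma(W2)^2)
        self.a1, self.a2, self.eps = alpha_short, alpha_long, eps

    def update(self, count):
        """Feed the count of one 5-minute window; return S_{h,e}(w)."""
        # Raw prediction error E (Eq. 3): distance of the new sample from
        # the historic mean in standard-deviation units (Z-score).
        std = math.sqrt(self.m2 / self.n) if self.n else 0.0
        error = abs(count - self.mean) / (std + self.eps) if self.n else 0.0

        # Welford update of the historic mean and variance (Eq. 2).
        self.n += 1
        delta = count - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (count - self.mean)

        # Gaussian Tail score G (Eq. 4): short- vs. long-term EWMA of E,
        # in units of the long window's (EW) standard deviation.
        self.ew_short += self.a1 * (error - self.ew_short)
        dev = error - self.ew_long
        self.ew_long += self.a2 * dev
        self.ew_long_var = (1 - self.a2) * (self.ew_long_var + self.a2 * dev * dev)
        g = (self.ew_short - self.ew_long) / (math.sqrt(self.ew_long_var) + self.eps)

        # Final score S (Eq. 5): clip negative G, cap by the raw error.
        return min(error, max(g, 0.0))

# An alert candidate is emitted whenever update() returns a score >= 2.999.
```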
Figure 2. Log count vector for network device dc01_dev272_i1 and event type %LDP-5-NBRCHG. The X-axis is the start time of the 5-minute aggregation window and the Y-axis is the number of messages. Smaller plots are an enlarged excerpt of an 8-hour period. In (a), the alert candidates (red dots) are detected using the Gaussian Tail model, while in (b) they are detected using the raw prediction error (Z-score). Gray dashed lines in the insets mark the outage start and end time of a related incident observed in the same data center (dc01) and same device (dev272_i1). Comparing (a) and (b), we clearly see that the Gaussian Tail anomaly score relaxes much faster. Also note that an anomaly is observed already at the first non-zero value (2020-09), due to zero-padding of time windows before the first occurrence of this event.

Table 2. An Alert Candidate
  Timestamp             2021-02-13 13:55:00
  Host                  dev272_i1
  Data Center           DC01
  Event Type            %LDP-5-NBRCHG
  Severity              5
  Role                  Backend Router
  Window Event Count    9
  Anomaly Score         3.07
  Seen Times (Global)   614
  Seen Times (DC)       258
  Seen Times (Host)     21
  Seen Days (Global)    43
  Seen Days (DC)        16
  Seen Days (Host)      3
  Label                 Strong

A sample alert candidate corresponding to Figure 2. The 'Seen' fields are added later in the preprocessing step (cf. 3.2.2, Rarity Features). The label is added later in the Automatic Label Acquisition step (cf. 3.2.1).

3.2 Alert Candidate Prioritization
Prioritization of alert candidates is achieved by using a supervised machine learning model. The model is trained to predict the relevance of an alert candidate to the SRE team, given its features, such as event type, device type, severity, rarity, event count, and so on.

3.2.1 Automatic Label Acquisition. Before we can train a supervised model, we have to acquire labels. Commonly, this is a pain point, and sometimes even a showstopper, for the application of supervised models to our problem domain. There might be millions of alert candidates, which makes it effectively impossible for SREs to label a significant subset manually. Therefore, we rely on the help-desk solution of IBM Cloud instead, as an objective ground-truth data source. Our assumption is that a relevant alert candidate will have a corresponding incident ticket, such as the one in Table 3. The correlation of alert candidates and incident tickets is based on both time and location. In this way, we are able to obtain ground truth for supervised models in a programmatic manner.

Table 3. An Incident Ticket
  Priority          1
  Start Time        2021-02-13 02:01:00
  End Time          2021-02-13 06:52:00
  Title             Backend Network disruption behind dc01_dev272 in the DC01 data center
  Data Center       DC01
  Affected Devices  [dev272; dev272_i1; dev272_i2]

A sample ticket of an incident (partial). The set of affected devices is extracted from free-text fields. This ticket has strong spatial correlation and temporal correlation with a Log Count Vector anomaly in Figure 2 (dashed gray line).

Temporal Correlation: We assign a timestamp $c_t$ to each alert candidate. The timestamp is calculated as the time of the first log message in the time window of the corresponding host and event type. In case of zero-count anomalies, we use the middle of the time window as the timestamp. Conveniently, the incident tickets have start-time and end-time fields, which are updated by the SREs when a ticket is created or while a problem is being investigated. We use only the start time as the incident timestamp $i_t$, assuming related log messages will usually appear near the incident start time, but will not always continue while the incident is ongoing. Even if they do continue, the anomaly score is relaxed gradually after the first appearance. Let $I$ be the set of all historic incident tickets. We define a positive temporal correlation for an alert candidate $c$ if an incident ticket $i \in I$ exists such that $|c_t - i_t| \leq cw$, where $cw$ is the size of the correlation window (i.e., five minutes).

Spatial Correlation: The location of an alert candidate is inherent in the time series it originated from. Both host
name $c_h$ and data center name $c_{dc}$ of an alert candidate can be extracted from the metadata associated with its time series. The set of affected data centers is structured in each incident ticket. We can also extract a set of host names of affected network devices from the incident ticket and related records. This information is not structured, but can be extracted rather easily by applying simple regular expressions to the ticket description and free-text fields. Let $i_h$ and $i_{dc}$ be the affected host and data center of an incident ticket, respectively. We define a strong spatial correlation for an alert candidate $c$ if an incident ticket $i \in I$ exists such that $c_h = i_h$ (host match). A match on only the data center, $c_{dc} = i_{dc}$, is defined as a weak spatial correlation, since a log anomaly in a non-affected device is possible (e.g., a message about connection loss to the affected device), but we have less confidence about it.

Finally, we assign a relevance label to each alert candidate $c$ if an incident ticket $i \in I$ exists, as seen in Table 4.

Table 4. Correlation Conditions for Relevance Label
  Label    Temporal   Spatial   Condition
  Strong   Yes        Strong    $|c_t - i_t| \leq cw$ & $c_h = i_h$
  Weak     Yes        Weak      $|c_t - i_t| \leq cw$ & $c_{dc} = i_{dc}$
  None     otherwise

A Strong/Weak label means that we have high/low confidence that an alert candidate is related to an incident. The None label means that we have high confidence that the alert candidate is not related to an incident. Conditions are evaluated according to the order of rows in the table; the first match wins.

Note that alert candidates that were produced due to maintenance or user activity will not be correlated to an incident ticket; therefore, the model will learn to avoid those. Nevertheless, we could upgrade our relevant / not-relevant binary classifier to a multi-class classifier that would learn from maintenance tickets in order to predict the classes 'incident', 'maintenance', and 'not-relevant'. We plan to investigate this option as part of future work.

3.2.2 Preprocessing and Model Training. Once the labels are available for all alert candidates, we can train a supervised model. Our design decision of using a single alert candidate as model input keeps the dimensionality of the model rather low and, more importantly, allows us to train a single global model instead of one model per data center, thus significantly increasing the amount of data available for model training and enabling use of the pre-trained model on small data centers for which only few labels exist. However, it requires an additional step of alert candidate consolidation, which is performed in the next step (3.3).

We exclude alert candidates with a weak relevance label from training (cf. Table 4). We have less confidence that those candidates are related to an incident, and prefer not to use them as input samples. We map candidates with a strong/none relevance label to a binary target variable of 1 or 0, respectively.

Rarity Features. Our experiments show that an alert candidate with a novel event type has a 19.7 times higher probability of being related to an actual incident ticket. Novel event types are event types that were not present in the training set, which renders them out-of-vocabulary categories during inference. Also, hypothetically, the rarer an event type is, the greater its potential to become an incident. For example, an event of type E1 might be observed for the first time on host H1, but we may have encountered it many times on other hosts, which makes it less likely to be an actual issue. In order to capture the rarity information and allow the model to learn from it, we computed additional features for each alert candidate: the count of earlier occurrences of the event type on the same network device, in the same DC, and globally. We also counted how many days each event type was seen in each of those spatial scopes, as some events are emitted in bursts, rendering absolute occurrence counts misleading. We excluded the most common alert candidates from training: any event type that was seen in the scope of a host on more than 50% of days. This reduced the size of the training set to 41%, without any decrease in accuracy.

Sample Imbalance. On average, only one alert candidate out of every 1111 is positively labeled in the evaluation dataset (0.09%). We perform down-sampling of the majority samples: let $msf$ be the Majority Size Factor (a model hyperparameter) and $pos_{DC}$ be the count of positively labeled alert candidates in each DC. We randomly select up to $msf \cdot pos_{DC}$ negatively-labeled samples from each DC. Majority down-sampling obviously allows reducing model training time, but we have found that model accuracy might also decrease when the proportion of negative samples is above a certain sweet spot, as we discuss in more detail in our evaluation (Section 4.2.5).

The input features we use for each sample are as follows: event type, device role, log severity level (if it exists), and rarity counters. The event type is the explicit event code specified in the log (e.g., %ENVMON_6_FAN_SPEED_UNSTABLE), or the mined log template if no explicit event code exists (cf. Table 1). The event type is treated both as a categorical feature and as a text feature. This allows the model to also learn text-based abstractions, for example "alert candidates related to FAN in FCS devices are not important". It is also more robust against template mining errors, which might occur, for example, due to small changes in logging code. Consider the following two logging statements:

log(f"Emergency shutdown started at {time}")
log(f"Emergency shutdown was initiated at {time}")

In case we change our code from the former statement to the latter, by incorporating the actual template and treating it as a text feature, the model can leverage knowledge learned prior to such a change (e.g., weights of the bi-gram "emergency shutdown").
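A minimal sketch of the automatic label acquisition of Section 3.2.1, assuming candidates and tickets carry the fields described above; the field names are illustrative.

```python
from datetime import timedelta

CORRELATION_WINDOW = timedelta(minutes=5)  # cw

def relevance_label(candidate, tickets):
    """Return 'strong', 'weak', or 'none' per Table 4.
    `candidate` needs .ts, .host, .dc; each ticket needs
    .start_time, .affected_hosts, .dc (illustrative field names)."""
    label = "none"
    for t in tickets:
        if abs(candidate.ts - t.start_time) > CORRELATION_WINDOW:
            continue                # no temporal correlation
        if candidate.host in t.affected_hosts:
            return "strong"         # temporal + strong spatial (host match)
        if candidate.dc == t.dc:
            label = "weak"          # temporal + weak spatial (DC match only)
    return label

# Training then maps strong -> 1 and none -> 0; weak candidates are excluded.
```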
242
EuroSys ’22, April 5–8, 2022, RENNES, France Ohana, Wassermann and Dupuis, et al.
We evaluated the following supervised ML models:

Random Forest with CatBoost Category Encoder. Random Forest [6] is an ensemble learning technique that constructs multiple decision trees during training, and considers the vote ratio for a certain class as the prediction probability of that class. Random Forest (RF) is robust to noisy features and class imbalance, and does not require careful hyperparameter tuning [24, 28, 45]. It works relatively well in our use case. We encoded categorical features (device role) using CatBoost encoding [32], a supervised encoding method that encodes categorical data based on the target label, and also includes an ordering principle in order to overcome the problem of target leakage. For encoding the event type text feature, we tokenized each event type into words and then encoded each of the first 10 words independently, also using the CatBoost encoder. The CatBoost encoder transforms each categorical value into a single floating-point number. This avoids a large input dimensionality for categorical data with high cardinality, as opposed to one-hot encoding, for example. We used the popular Random Forest implementation provided by the Scikit-learn library [33].
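A condensed sketch of this encoding-plus-RF pipeline, assuming a pandas DataFrame of alert candidates with the features of Table 2 and a binary label column; the `category_encoders` package provides a CatBoost encoder, and all column names here are illustrative.

```python
import pandas as pd
from category_encoders import CatBoostEncoder   # pip install category_encoders
from sklearn.ensemble import RandomForestClassifier

RARITY_COLS = ["seen_times_host", "seen_times_dc", "seen_times_global",
               "seen_days_host", "seen_days_dc", "seen_days_global"]

def add_event_type_words(df, max_words=10):
    # Each of the first 10 words of the event type becomes its own
    # categorical column, encoded independently (illustrative tokenizer).
    words = df["event_type"].str.strip("%").str.replace("_", "-").str.split("-")
    for i in range(max_words):
        df[f"et_word_{i}"] = words.str[i].fillna("")
    return df

def train_rf(candidates: pd.DataFrame):
    """candidates: one row per alert candidate with the features of
    Table 2 plus a binary 'label' column (strong = 1 / none = 0)."""
    df = add_event_type_words(candidates.copy())
    cat_cols = ["role", "severity"] + [f"et_word_{i}" for i in range(10)]
    # CatBoost target encoding: one float per categorical value, with an
    # ordering principle that limits target leakage.
    enc = CatBoostEncoder(cols=cat_cols)
    X = enc.fit_transform(df[cat_cols + RARITY_COLS], df["label"])
    clf = RandomForestClassifier().fit(X, df["label"])
    return enc, clf, cat_cols

def relevance_scores(enc, clf, cat_cols, candidates):
    df = add_event_type_words(candidates.copy())
    X = enc.transform(df[cat_cols + RARITY_COLS])
    return clf.predict_proba(X)[:, 1]   # probability of relevance
```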
CatBoost [32]. A high-performance machine learning algorithm based on the gradient boosting technique. Unlike Random Forest, which creates independent decision trees, in gradient boosting trees are created one after the other. The CatBoost toolkit handles some drawbacks of gradient boosting by reducing the importance of hyperparameter tuning, reducing overfitting, and offering faster training times thanks to the ability to use GPU training out-of-the-box.

Convolutional Neural Network (CNN). In addition to the tree-based techniques described above, we also investigated neural-network-based approaches. While the input features are exactly the same as for RF and CatBoost, some additional pre-processing is required for the CNN. We chose to use word embedding in order to encode the alert messages. Each input message is built by concatenating the event type, the rarity features, and the severity level of each log message. While the event type is a text-based feature, the rarity and severity are categorical features, and we had to map each level to a distinct word (e.g., severity level "L0" maps to "emergency", and level "L7" maps to "debugging"). We therefore obtain for each input a sequence of words whose embeddings we pre-train using word2vec [29]. Once the embeddings are pre-trained, each input can be fed to our classifier. The first layer of the classifier is an embedding layer, which maps the words into a sequence of vectors and is initialized with the pre-trained embeddings. The sequences are padded using the <PAD> token if they contain fewer tokens than the maximum sequence size, which we set to 60 words. The embedding layer is dynamic, meaning that its parameters can be adjusted during the training of the classifier. Following the embedding layer is a 3-channel convolutional network, a max pooling layer, and a fully connected layer with dropout and softmax output [20]. We used the Adam optimizer with a learning rate of $3 \times 10^{-4}$, and for all experiments we fixed the batch size to 1024 and the embedding dimension to 128. The model was implemented using TensorFlow 2.3.

Note that no hyperparameter optimization was performed for any of the supervised models; we used the suggested defaults for all settings.

3.3 Inference and Consolidation
This step improves the detection accuracy by applying hierarchical grouping. It is usually performed in an online fashion, on a batch of recent, near real-time alert candidates. Online alert candidates are generated by applying the log parsing, log aggregation, and unsupervised anomaly detection steps on fresh logs (cf. red dashed lines in Figure 1 and Section 3.1). The output is a single score in the range 0..1, which specifies the probability that a relevant issue recently occurred in the data center.

3.3.1 Inference. The pre-trained model is loaded and applied to all recent alert candidates one by one, to infer a relevance score per alert candidate.

3.3.2 Consolidation. ADEPTUS groups related alert candidates in order to reduce the alert count. Instead of producing an alert for every alert candidate with a relevancy score over a threshold, we produce at most a single alert per timestamp, with an aggregated score of all the alert candidates sharing the same timestamp $t$. The resolution we selected for this temporal grouping is 1 second. A trivial way to aggregate alert relevancy scores within each temporal group would be to use the maximum relevancy score in that group. However, we propose to improve on that by leveraging domain knowledge and performing hierarchical aggregation. First we compute an alert relevancy score $r(t, h)$ per network device $h$, by calculating the $L_4$ norm [17] of the relevancy scores of all
alert candidates for the same network device and temporal group:

$$r(t, h) = \left(\sum_{e \in E} S_{h,e}(t)^4\right)^{\frac{1}{4}}. \tag{6}$$

For each data center, we select the network device with the maximum alert relevance score as the data-center relevance score $r_{dc}(t)$ for the temporal group. Scores larger than 1.0 are capped:

$$r_{dc}(t) = \min\left(1,\ \max(r(t, h) : h \in DC)\right). \tag{7}$$

The intuition for using the $L_4$ norm as our anomaly score aggregation function is similar to using the Root Mean Square Error (RMSE) [18] for measuring cumulative error. Using a power of 4 instead of 2 assigns even greater weight to large errors (anomalies), and computing a root aligns the scale of the result with that of the input. However, in the norm function the sum is not divided by the total count of aggregated elements. This prevents network devices with a smaller number of log count vectors from gaining an undue advantage when we aggregate their anomaly scores.

3.3.3 Alerting. An alert is created for each timestamp $t$ when the data-center relevance score $r_{dc}(t)$ is larger than a defined constant threshold in the range 0..1. The threshold selection affects the balance between Recall and Precision [34]. A low threshold will produce high recall at the expense of precision. A high threshold will give high precision at the expense of recall. A close-to-optimal threshold can be found automatically by evaluating the latest trained model on a labeled test set of alert candidates, as described in the evaluation section below (cf. 4.1.4).

To assist in fault localization for the alerts produced, the top-scored alert candidates that composed each hierarchical group are presented to the operator along with each alert, allowing quick focus on the most anomalous components (such as network devices).

4 Evaluation
We evaluated ADEPTUS on a large real-world dataset consisting of the syslog messages of the network devices comprising the production infrastructure of 11 data centers of IBM Cloud. Overall, we obtained 5,056,305,714 raw log messages (over 4 TB of data) from 22,476 network routers and switches belonging to 58 different device types with various versions and vendors. Data center size is diverse, ranging from 22 to 8122 network devices in each. The logs were retrieved for the same period of 10 contiguous months in all data centers. They were translated into 444,568 individual time series (message count vectors) of 6120 event types.

For the aforementioned period, a total of 2094 network-related incident tickets were created in the help-desk database for the data centers under evaluation (cf. Figure 3). All incident tickets were intentionally created by an SRE as a response to an actual reliability issue. The tickets cover a wide range of issues: hard failures such as power loss and network switch reloads; soft (intermittent) failures such as flaps of communication links and high packet error rates. Intermittent errors are tougher to detect [5], and commonly they are not handled by fault-tolerance mechanisms. Few of the incidents were classified by the SREs as Customer Impacting Events (CIE). The rest were transparent to clients thanks to the cloud infrastructure's fail-over and redundancy mechanisms. Nevertheless, we would like to detect non-CIE incidents as well, so that the SRE team may apply a manual remediation action before an incident escalates into a CIE.
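The consolidation of Eqs. (6)-(7) reduces to a few lines; below is a minimal sketch assuming scored candidates carry a (timestamp, host) key. Note that, following Eq. (6), the per-device score is the unnormalized $L_4$ norm, not an average.

```python
from collections import defaultdict

def consolidate(scored_candidates):
    """scored_candidates: iterable of (timestamp, host, relevance_score)
    for one data center, timestamps already rounded to 1 s resolution.
    Returns {timestamp: r_dc(t)} per Eqs. (6) and (7)."""
    per_device = defaultdict(float)         # (t, h) -> sum of score^4
    for t, h, score in scored_candidates:
        per_device[(t, h)] += score ** 4

    r_dc = defaultdict(float)
    for (t, h), s4 in per_device.items():
        r = s4 ** 0.25                      # L4 norm (Eq. 6), no division
        r_dc[t] = max(r_dc[t], r)           # best device in the DC (Eq. 7)
    return {t: min(1.0, r) for t, r in r_dc.items()}

# Usage: promote a temporal group to an alert when its score clears the
# threshold (0.32 is the optimal value reported later in Section 4.2.2).
alerts = {t: r for t, r in consolidate(
    [(100, "dev272_i1", 0.9), (100, "dev272_i1", 0.7), (100, "dev3", 0.4)]
).items() if r >= 0.32}
print(alerts)
```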
(cf. 3.3.3). The balance between those two metrics can be expressed as the $F_\beta$-score (cf. Eq. 10), which is the harmonic mean of precision and recall. For $\beta = 1$, the F-score weights recall and precision evenly. However, we opted for the $F_2$-score ($\beta = 2$) instead, since it considers recall twice as important as precision, and this reflects the preference of the SREs in our use case to detect as many incidents as possible, even at the expense of incurring more false alarms.

4.1.2 Population. A possible approach is to use alert candidates as the object population for the accuracy metrics of 4.1.1, therefore counting as a true positive an alert candidate that we have strong confidence is related to an incident (cf. Table 4). However, alert candidates may not be evenly distributed among incidents: some incidents may produce many anomalies and some only a few, if any. Consider, for example, an extreme situation where all alert candidates relate to a single incident. A classifier which is able to correctly classify all alert candidates will gain a perfect F-score, even though only a single incident out of many was detected.

Therefore, we opted for a novel, hybrid population approach. We measure the accuracy of ADEPTUS as the $F_2$-score of the relevant alert rate and the detected incident rate. Precision is redefined to measure the rate of relevant alerts, and Recall is redefined to measure the rate of detected incidents:

$$Precision = \frac{true\_alert\_count}{total\_alert\_count}, \tag{11}$$

$$Recall = \frac{detected\_incident\_count}{total\_incident\_count}. \tag{12}$$

An incident is counted as detected if an alert was produced up to five minutes before or after the disruption start time specified in the incident ticket, for the same data center. Note that we do not require the spatial element of the alert to contain a host name which is specified as an affected device in the incident ticket, because sometimes an incident can be detected by its influence on a peer device, e.g., connectivity loss in case of a power outage. We count an alert as relevant in a similar fashion: if the timestamp of the alert is within five minutes of the disruption start time of any incident ticket in the same data center. In order to minimize a possible distortion of the measured precision due to many alerts related to the same incident (as in the example above), we also perform temporal grouping of alerts by taking the maximum score of all alert candidates with the same timestamp, after rounding to the closest second.

4.1.3 Cross-Validation. Instead of using a single arbitrary train-test split and risking overfitting, we used Population-Informed Cross-Validation [8] for the model quality estimation. Since we are dealing with temporal data, traditional random selection of alert candidates for k-fold cross-validation is not possible due to a risk of future data leakage [40]. For each data center, we withhold the second half (five months) of the data and use it as a test set. The train set consists of all labeled alert candidates from the other data centers and the first half of the alert candidates from the DC under test. Eventually we are left with 11 folds for cross-validation. We then compute a single mean $F_2$-score across all data centers, weighted by the proportion of test incident tickets in a data center.

4.1.4 Optimal Alert Threshold Selection. Since we aim to deploy a single global model rather than a DC-specific model, we search for a single confidence value between 0..1 which will produce the optimal cross-DC mean $F_2$. We iterate over the unit range in steps of 0.01, considering as alerts only temporal groups with a relevance score ≥ the attempted threshold (cf. 3.3.2). The threshold that produced the maximum cross-DC $F_2$-score is selected.

4.1.5 Repetitions. In a further effort to avoid bias, we repeat the evaluation 51 times per tested data center and alert threshold. The random seed value for any non-deterministic operation (data shuffling, majority down-sampling, classifier model training) uses the system time to avoid reproducible results. The mean $F_2$-score across all repetitions is selected. We use the Standard Error of the Mean (SEM) [3] to measure how close this mean is to the true optimal $F_2$-score of the model.
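A sketch of the hybrid-population metric and the threshold scan of 4.1.4, assuming per-DC lists of alert temporal groups with scores and incident start times; the five-minute matching window follows Eqs. (11)-(12).

```python
MATCH_WINDOW = 300  # seconds: +/- five minutes around disruption start

def f2_at_threshold(alert_groups, incidents, threshold):
    """alert_groups: list of (timestamp, score) temporal groups for one DC;
    incidents: list of disruption start timestamps for the same DC."""
    alerts = [t for t, s in alert_groups if s >= threshold]
    true_alerts = sum(any(abs(t - i) <= MATCH_WINDOW for i in incidents)
                      for t in alerts)                      # Eq. (11) numerator
    detected = sum(any(abs(t - i) <= MATCH_WINDOW for t in alerts)
                   for i in incidents)                      # Eq. (12) numerator
    precision = true_alerts / len(alerts) if alerts else 0.0
    recall = detected / len(incidents) if incidents else 0.0
    if precision == 0 and recall == 0:
        return 0.0
    beta2 = 2 ** 2  # F2: recall weighted twice as heavily as precision
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def best_threshold(alert_groups, incidents):
    # Scan the unit range in steps of 0.01 (cf. 4.1.4).
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda th: f2_at_threshold(alert_groups, incidents, th))
```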
4.1.6 Baselines. We generated the following five baselines for ADEPTUS model quality comparison:

Random baseline. The classifier assigns a uniformly random value between 0..1 as the relevance score of each alert candidate.

Always-Positive baseline. The classifier assigns a value of 1 as the relevance score of each alert candidate, meaning that every alert candidate is relevant and no prioritization is performed.

Perfect Classifier baseline. The classifier assigns the ground-truth value as the relevance score of each alert candidate: either 1 or 0, depending on whether the incident correlation label is strong or none. This reflects the best achievable score. Note however that the recall is still less than 100%, which can be due to two situations: either the given incident type does not manifest itself in network device syslog messages, or we did not successfully detect it as an anomaly in our Alert Candidate Generation step (cf. 3.1).

Rule-based baseline. The actual alerts that were produced by the syslog-based monitoring tool being used in the cloud network underlay. The tool produces alerts by inspecting each syslog message and matching it against a list of over 500 manually created and actively curated regular-expression rules. Rules can trigger an alert after a single match, or after X occurrences in Y seconds. A sample rule: vendor == "Cisco" AND os == "IOSXR" AND facility == "PKT_INFRA-LINEPROTO" AND mnemonic == "UPDOWN" AND hostname !~ /tgr/ AND submessage !~ /tunnel-ip/.

Being the operational tool for syslog monitoring, this baseline represents the best achievable score by a subject-matter expert. Note that IBM Cloud employs additional, non-syslog-based monitoring tools, but they are not included in this comparison, as they make use of data which is not available to ADEPTUS. Each rule-based alert carries a priority level expressed as an integer in the range 1..7, which we normalize to the unit range. We take the maximal alert score per data center at one-second resolution and apply the same scoring function and metrics we used for our model evaluation.

Unsupervised Model DECORUS. Alerts that were produced by our prior-art model [43], which is fully unsupervised and uses manually-tuned weights and additional domain knowledge. This model is being used in production for the cloud network underlay, as a second-net alerting solution in conjunction with the rule-based solution. In [43], we demonstrated that DECORUS is both more accurate and more resource-efficient than five other unsupervised anomaly detection approaches: LOF [7], PCA [19], Isolation Forest [25], LogCluster [23], OC-SVM [37].

4.2 Evaluation Results
4.2.1 Overall Accuracy Evaluation. Table 5 summarizes the accuracy results for ADEPTUS with three different classification models, and for all compared baselines. All models had to detect 1030 incidents which occurred in 11 data centers over the last five months of the evaluation period, with a minimal number of alerts.

We start with a review of the naive baselines. The best possible recall, 91.56%, is achieved by the Perfect and the Always-Positive baselines, which were able to detect 943 incidents. The Always-Positive baseline suffers from very low precision, since it raises an alert for every alert candidate, to a total of more than four million alerts. It achieves the lowest $F_2$-score in the benchmark (4.3%). The Random baseline is marginally better due to a better-balanced alert/recall trade-off. The Perfect baseline shows that up to 25,383 (0.62%) of alerts could be counted as true positives, as they are correlated temporally and spatially with an incident. Note however that the Random and Always-Positive baselines are not entirely naive, because the alert candidate generation, the unsupervised part of the model, provides some ability to detect relevant incidents. Otherwise, we would expect to see worse accuracy results.

DECORUS, our prior-art model, is better than the naive baselines. It produces relatively few alerts at its optimal alert threshold of 0.98, but is able to detect only 175 of the incidents, for a recall of 16.99% and an $F_2$-score of 12.0%.

Figure 4. ADEPTUS versus Rule-Based. Precision versus Recall and $F_2$-score versus Alert Threshold plots of ADEPTUS-CatBoost and Rule-Based alerts. The highlighted circle marker denotes the optimal alert threshold.

4.2.2 ADEPTUS vs. Rule-Based. The Rule-Based baseline is our real competitor, as it is the de facto, real-world syslog monitoring tool in this use case. It achieves a relatively high
recall of 49.1% by being able to detect 506 incidents at the optimal alert threshold of five out of the 1..7 integer range. However, it suffers from a low precision, with only 13.78% of the alerts it produces being relevant, scoring a final $F_2$-score of 28.8%.

ADEPTUS with the CatBoost classifier is able to detect 441 relevant incidents (42.8% recall), which, while less than the Rule-Based approach, still produces significantly fewer alerts, with a higher proportion of them being relevant (19.84% precision). This leads to an $F_2$-score of 33.87%. The optimal alert threshold was found to be 0.32, which is less than the default classification threshold of 0.5.

It is worth noting that ADEPTUS alerts could not be inspected in real time by SREs, unlike rule-based alerts. We believe that at least a subset of the ADEPTUS alerts that are currently considered false positives would result in the creation of new incident tickets (due to an actual reliability issue that would not be detected without ADEPTUS). This would make those alerts true positives and further improve the $F_2$-score.

Figure 4 shows the impact of the alert threshold on recall, precision, and $F_2$-score for both ADEPTUS and Rule-Based. We observe that at almost any given recall level, ADEPTUS achieves better precision than Rule-Based. In addition, ADEPTUS achieves higher $F_2$-scores than Rule-Based at most alert threshold levels. At a recall of 49.58%, which is close to the optimal recall for Rule-Based, ADEPTUS achieves a precision of 12.83%, slightly less than the precision of Rule-Based.

4.2.3 Classification Models Comparison. The highest accuracy scores measured in terms of $F_2$-score for the three classification models (RF, CatBoost, CNN) were quite close to each other. CNN achieved an $F_2$ of 33.89%, CatBoost had an $F_2$ of 33.87%, and RF lagged behind with an $F_2$ of 32.97% (cf. Table 5). When considering training time and assuming the availability of a GPU, CatBoost is the most cost-effective model, with a much lower training time than CNN for the same count of samples (cf. Table 8). Nevertheless, the schema-less, unstructured text form of the input to the CNN model (cf. 3.2.2) has the benefit of being able to integrate alert candidates from multiple sources in a single model. For example, log-based alert candidates have an event type and a severity level (cf. Table 2), whereas metric-based alert candidates may carry a metric type and no severity information.

4.2.4 Impact of Gaussian Tail. ADEPTUS detects anomalies using the Gaussian Tail rule (cf. Eq. 5) instead of a vanilla Z-score. This has a significant positive impact on the results. In Table 6, we can observe that a vanilla Z-score produces vastly more alert candidates (anomalies) than the Gaussian Tail for the same anomaly score threshold ≥ 3.0. The optimal $F_2$-scores for the vanilla Z-score are generally lower, even when the count of alert candidates is about the same. We also note that the ADEPTUS classifier with Gaussian Tail benefits from having more alert candidates to choose from, scoring higher on $F_2$ at lower alert thresholds.

Table 6. Gaussian Tail versus Vanilla Z-score
  Model           Vanilla Z-score               Gaussian Tail
  Threshold       3.0   6.0   15.0  50.0        2.9   2.999  3.0
  Anomalies (M)   57.6  31.9  13.4  1.8         21.4  15.6   1.7
  F2-score (%)    21.6  20.7  22.0  22.1        32.9  32.8   26.2

Accuracy (expressed as $F_2$-score) of ADEPTUS with the RF classifier when the vanilla Z-score and the Gaussian Tail rule are used for producing alert candidates. Any anomaly with a score above the threshold level is considered an alert candidate that should be classified. The number of anomalies / alert candidates is in millions.

The significant difference between the Gaussian Tail and the vanilla Z-score can be explained by the tendency of the Gaussian Tail to relax anomaly scores faster after a change point in the time series, as can be observed by comparing plots (a) and (b) in Figure 2. Recall that when training, we have positive labels only for alert candidates near the disruption start time of incident tickets. Having additional alert candidates long after the disruption start time, with similar features but a negative label, as the vanilla Z-score produces, prevents effective learning.

4.2.5 Impact of Majority Size Factor. The Majority Size Factor (MSF) hyperparameter determines the ratio of positive to negative labels, as described in Section 3.2.2. It is required for reducing the training time of the classification model. Hypothetically, the more training data is available to the model, the better accuracy it will achieve. However, in Figure 5 we can observe that the $F_2$-score reaches a maximum around an MSF of 90-110 and then starts declining, meaning additional negative labels do not contribute to accuracy. Also note that adding negative labels causes the classifier to be less confident and produce lower prediction probabilities in general. Therefore, the optimal alert threshold (cf. 3.3.3) declines as the MSF is increased.

Figure 5. Impact of Majority Size Factor. Metrics of ADEPTUS-RF at various MSF settings. $F_2$-score in solid blue. Average count of samples used for training the classifier, after majority down-sampling, in dashed orange on the left. Optimal alert threshold in dashed green on the right. At the last MSF value of 2000, down-sampling is effectively disabled.
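A minimal sketch of the per-DC majority down-sampling controlled by MSF; the DataFrame column names are illustrative.

```python
import pandas as pd

def downsample_majority(candidates: pd.DataFrame, msf: int = 100,
                        seed=None) -> pd.DataFrame:
    """Keep all positives; per data center, keep at most
    msf * (positives in that DC) randomly chosen negatives."""
    parts = []
    for dc, group in candidates.groupby("data_center"):
        pos = group[group["label"] == 1]
        neg = group[group["label"] == 0]
        n_keep = min(len(neg), msf * len(pos))
        parts.append(pos)
        parts.append(neg.sample(n=n_keep, random_state=seed))
    return pd.concat(parts, ignore_index=True)
```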
4.2.6 Impact of Other Heuristics. We have evaluated the impact of additional techniques introduced in this paper on the quality of the model. The calculation of rarity features with host-scoped rarity-based filtering was discussed in Section 3.2.2. The consolidation of alert candidates by hierarchical grouping was discussed in Section 3.3.2. In Table 7, we can observe the impact of these techniques on the $F_2$-score of the ADEPTUS-CatBoost model.

Table 7. Impact of Heuristics
  Heuristic       F2 Before   F2 After   Improvement
  Rarity          32.46%      33.87%     4.43%
  Consolidation   33.33%      33.87%     1.63%

Relative improvement in the $F_2$-score of ADEPTUS-CatBoost when the Rarity and Consolidation techniques are disabled / enabled.

Table 8. Training Time Comparison
  Classifier      Training Time   Processor
  Random Forest   3m 13s          CPU - Intel Xeon (1 core)
  CatBoost        1m 41s          GPU - Nvidia Tesla K40
  CNN             30m             GPU - Nvidia Tesla V100

4.2.7 Training Volume and Time. The raw dataset that was used for the evaluation consists of 5,056,305,714 raw syslog messages. Vectorization and unsupervised anomaly detection reduced the data to 18,112,812 alert candidates. Applying rarity-based filtering reduced it further, to 7,538,422 alert candidates. Majority down-sampling reduced the data volume even further, to an average training-set size of 820,502 samples for a single cross-validation fold. Overall, the various techniques described in this paper were able to reduce the data scale by a ratio of 1:6162.

Table 8 lists the time it took to train each of the classification models. For all models, the train set is that of a single cross-validation fold, containing 807,086 samples (alert candidates). We can observe that by training on a GPU, CatBoost offers the best training time, in addition to the best accuracy, as demonstrated in 4.2.3.

5 Conclusion
In this paper, we introduced ADEPTUS, a hybrid unsupervised and supervised learning approach for the detection of relevant anomalies in textual logs. ADEPTUS makes use of existing databases of incident tickets to prepare ground truth in support of supervised learning. This avoids a time-consuming and often infeasible labeling effort by human experts. In a real-world evaluation on a large volume of syslog messages from network devices, ADEPTUS was able to perform better than a set of hundreds of regular-expression rules that were carefully curated by a team of subject-matter experts and are actually relied on for alerting in production.

The combination of unsupervised and supervised learning also makes ADEPTUS resource-efficient and suitable for processing data at a large scale, as is common in modern clouds. The unsupervised algorithm uses simple and fast statistical techniques to process all log messages, and only a small fraction of the logs are promoted to alert candidates for training and inference by the computationally more demanding supervised classification model.

In order to facilitate programmatic label acquisition similar to that presented in this paper, we recommend that SRE teams incorporate structured fields for "disruption start time" and "set of affected components" in the ticket schema of the help-desk solution being used by the organization. These fields can be populated during the root-cause analysis performed for incidents. This allows a rather precise correlation between incidents and alerts, which makes the "Unsupervised Detection, Supervised Prioritization" strategy feasible without incurring further manual labeling effort. Another simple option is to track acknowledgement actions on alerts, as well as other types of human interactions with alerts, and to use this information to prioritize alerts through a supervised model that predicts the probability of acknowledgement for new alerts. However, such explicit correlation is less robust against future changes in the facilities that generate alerts.

5.1 Future Work
We would like to incorporate advanced techniques in the unsupervised part of our approach: anomaly detection on the variable parts of log messages [9, 44], and additional statistical tests [21, 46]. We expect this will allow us to achieve better recall by detecting more classes of anomalies, with at least similar precision. In addition, we plan to evaluate our approach on additional sources of raw data, such as metric signals. In this case, metadata features such as component name or metric type can enable prioritization of anomalous measured values. Logs and metrics can be incorporated as an ensemble of independent models, or as an integrated model, as discussed in 4.2.3.

Another avenue to improve the alert detection sensitivity would be to set up an alert feedback system that could be used, for example, by the SREs. The feedback system can be as simple as a "relevant" vs. "not-relevant" toggle switch, or more complex, including a text field to provide details on the alert relevancy. The latter can be handled by our word-embedding approach, as described in 3.2.2. We did prototype a feedback system using an ensemble of two models. The first model consists of the ADEPTUS classifier presented in this paper, and the second model is another classifier trained on the alert feedback data. The ensemble then combines the predictions from each model, giving the final alert score. We did observe a slight improvement in the $F_2$-score, which is promising, but additional feedback data would be necessary to fully demonstrate the merit of the approach.
[39] Dominique T. Shipmon, Jason M. Gurevitch, Paolo Piselli, and Stephen T. Edwards. 2017. Time Series Anomaly Detection: Detection of anomalous drops with limited features and sparse examples in noisy highly periodic data. ArXiv abs/1708.03665 (2017).
[40] L. Tashman. 2000. Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecasting 16 (2000), 437–450.
[41] G. Upton, R. Larsen, and M. L. Marx. 1987. An introduction to mathematical statistics and its applications (2nd edition), by R. J. Larsen and M. L. Marx. Pp 630. £17.95. 1987. ISBN 13-487166-9 (Prentice-Hall). The Mathematical Gazette 71 (1987), 251–252.
[42] Zhuo Wang, Wei Zhang, Ning Liu, and Jianyong Wang. 2021. Scalable Rule-Based Representation Learning for Interpretable Classification. ArXiv abs/2109.15103 (2021).
[43] Bruno Wassermann, David Ohana, Ronen Schaffer, Robert Shahla, Elliot K. Kolodner, Eran Raichstein, and Michal Malka. 2022. DeCorus: Hierarchical Multivariate Anomaly Detection at Cloud-Scale. arXiv:2202.06892 [cs.LG]
[44] W. Xu, Ling Huang, A. Fox, D. Patterson, and Michael I. Jordan. 2009. Detecting large-scale system problems by mining console logs. In SOSP '09.
[45] Shenglin Zhang, Y. Liu, Weibin Meng, Zhiling Luo, Jiahao Bu, Sen Yang, Peixian Liang, Dan Pei, Jun Xu, Yuzhi Zhang, Y. Chen, Hui Dong, Xianping Qu, and Lei Song. 2019. PreFix: Switch Failure Prediction in Datacenter Networks. In PERV.
[46] Xu Zhang, Qingwei Lin, Yong Xu, Si Qin, Hongyu Zhang, Bo Qiao, Yingnong Dang, Xinsheng Yang, Qian Cheng, Murali Chintalapati, Youjiang Wu, Ken Hsieh, Kaixin Sui, Xin Meng, Yaohai Xu, Wenchi Zhang, S. Furao, and Dongmei Zhang. 2019. Cross-dataset Time Series Anomaly Detection for Cloud Systems. In USENIX Annual Technical Conference.
[47] Deqing Zou, Hao Qin, Hai Jin, Weizhong Qiang, Zongfen Han, and X. Chen. 2014. Improving Log-Based Fault Diagnosis by Log Classification. In NPC.