Performance Evaluation of Unsupervised Techniques in Cyber-Attack Anomaly Detection
https://doi.org/10.1007/s12652-019-01417-9
ORIGINAL RESEARCH
Received: 28 November 2018 / Accepted: 1 August 2019 / Published online: 7 August 2019
© Springer-Verlag GmbH Germany, part of Springer Nature 2019
Abstract
Cyber security is a critical area in computer systems, especially when dealing with sensitive data. At present, it is becoming increasingly important to ensure that computer systems are secured from attacks, given modern society's dependence on those systems. To prevent these attacks, most organizations nowadays make use of anomaly-based intrusion detection systems (IDS). Usually, IDS contain machine learning algorithms which aid in predicting or detecting anomalous patterns in computer systems. Most of these algorithms are supervised techniques, which leave gaps in the detection of unknown patterns or zero-day exploits, since these are not present in the algorithm's learning phase. To address this problem, we present in this paper an empirical study of several unsupervised learning algorithms used in the detection of unknown attacks. In this study we evaluated and compared the performance of different types of anomaly detection techniques on two publicly available datasets: the NSL-KDD and the ISCX. This evaluation allows us to understand the behavior of these techniques and how they could be fitted into an IDS to fill the mentioned flaw. The present evaluation can also be used in the future as a baseline for comparison with other unsupervised algorithms applied in the cybersecurity field. The results obtained show that the techniques used are capable of carrying out anomaly detection with an acceptable performance, thus making them suitable candidates for future integration in intrusion detection tools.
approaches: (a) signature based, or (b) anomaly based. Signature-based detection requires prior knowledge of an attack before being able to identify it; on the other hand, techniques based on anomaly detection work by acquiring knowledge of the patterns that represent “normal” or “attack” data and then classify new data according to their resemblance to those patterns. This latter approach gives the IDS the possibility of detecting attacks even if the attack is not currently known (a zero-day attack, that is, an attack that is unknown or unaddressed yet, and thus can be exploited to adversely affect the computer or network), because these new attacks may present more similarities to other previous attacks than to “normal” data.
Within anomaly-based IDSs, different algorithms may be used. Supervised learning algorithms are suitable for problems in which a set of already existing and previously classified samples can be used as a training dataset. On the other hand, when novel vulnerabilities and attacks are involved, there are no classified examples for a supervised algorithm to learn from. One possibility to deal with this problem is the use of unsupervised learning algorithms. Unsupervised learning techniques can learn what is normal for a given set of data and are then capable of finding deviations in new unclassified data, which in this scenario would indicate a possible attack that until now was unknown.
The motivation for this study comes from the SASSI—“Sistema de Apoio à decisão de Segurança em Sistemas Informáticos” (Decision Support System for Security in Computer Systems) project, whose objective is the development of an Intelligent Decision Support System that centralizes, structures and allows the visualization of information regarding the activity of computer networks and the individual machines in given networks, allowing the automatic detection, prediction and prevention of anomalies, cyber-attacks and possible security risks. This platform aims to support computer network administrators, who are increasingly faced with critical decision-making tasks regarding security problems that cannot be detected by typical anti-malware protection systems. This paper focuses on cyber-attack and anomaly detection using unsupervised learning algorithms, and explores six of these algorithms: Autoencoder, One-Class Nearest Neighbor, Isolation Forest, One-Class K-Means, One-Class Scaled Convex Hull and One-Class Support Vector Machines, over two different public datasets: the NSL-KDD (Tavallaee et al. 2009) and the ISCX (Shiravi et al. 2012).
Our results show that the techniques used are capable of achieving high-performance results in the classification tasks tested in our case study, and are consequently candidates for future implementation in an IDS.
This paper has the following structure: Sect. 2 presents some related work on this topic; Sect. 3 describes the workflow used, including a description of the datasets and pre-processing techniques applied in our approach, and also describes all of the unsupervised algorithms tested, how they work and which parameters were used in our study; Sect. 4 presents a comparative evaluation of the results; and finally Sect. 5 draws the conclusions and ideas for future work.
In the intrusion detection field, Goldstein and Uchida (2016) presented a comparative evaluation of unsupervised algorithms used in the context of anomaly detection. The algorithms were applied to a group of different datasets, one of which was the KDD 99, described in Sect. 3.1; however, the analysis only used the part of the dataset regarding HTTP traffic. It is important to note that an improved version of this dataset, called NSL-KDD, is presented and used in this paper.

Aleroud and Karabatis (2013) explored the detection of zero-day attacks with an approach that combines already existing methods with linear data transformation techniques, such as discriminant functions that separate normal patterns from attack patterns, and anomaly detection techniques using the One-Class Nearest Neighbor algorithm (1-NN) to identify the zero-day attacks. Their approach consisted of a system of several static components and processes. The first component was the network data repository, where they used the NSL-KDD dataset. The second component represented the pre-processing methods applied to the NSL-KDD dataset, where they converted numeric features into bins. The third module, misuse detection, consisted in identifying attacks that are relevant to a particular context and also identifying normal activities in the network in order to reduce false positive alerts. This module used conditional entropy to create context profiles of known attacks using patterns from historical data. Finally, the last component represented the anomaly detection module, which used the 1-NN algorithm to detect deviations from normal activity and also used the Singular Value Decomposition (SVD) technique to reduce the data dimensionality. Their approach showed good performance, detecting zero-day attacks with a low false positive rate.

Casas and Mazel (2012) presented the concept of an Unsupervised Network Intrusion Detection System (UNIDS), using Sub-Space Clustering and Multiple Evidence Accumulation techniques for outlier detection. Their unsupervised security system consisted in analyzing packets captured in continuous time slots of fixed length, running three consecutive steps. In the first step, clustering analysis was performed to detect anomalous time slots. The second step used a multi-clustering algorithm based on a combination of several techniques (Parsons et al. 2004; Ester et al. 1996; Fred and Jain 2005) to rank the degree of abnormality of all the identified outlying flows. The third step used a simple threshold detection technique to flag the top-ranked outlying flows as anomalies. Their evaluation of this system included its application to the KDD 99 dataset.

Noto et al. (2012) studied anomaly detection using an approach called FRaC, feature regression and classification. The FRaC technique built a model of normal data and the distances of its features and used the learnt model to detect when an anomaly occurred. They also compared their approach with other commonly used techniques, such as Local Outlier Factor (LOF), one-class support vector machines (one-class SVM) and cross-feature analysis (CFA).

Our work intends to show and compare the behavior of several one-class classification algorithms (some of them already mentioned in this section) and apply them to two recent intrusion datasets, with the purpose of identifying whether these techniques could be integrated in an IDS within the SASSI project.

3 Anomaly detection methodology

In our work, we study the behavior of several unsupervised algorithms based on one-class classification, in order to verify whether these techniques are a viable solution to discover and detect unknown attacks. In this section, we describe the network anomaly detection methodology, as shown in Fig. 1. We describe the datasets used and the pre-processing techniques applied to them before feeding the algorithms, as well as the unsupervised techniques employed. In the next section (Sect. 4) we evaluate how these anomaly detection algorithms perform.

In our exploration, we analyzed the NSL-KDD (Tavallaee et al. 2009) and the ISCX (Shiravi et al. 2012) datasets. These datasets contain samples from normal activity and from simulated attacks in computer systems and are commonly used in the literature. Before feeding the learning algorithms, we employed some pre-processing methods to prepare the data.

3.1 NSL‑KDD

In 1999, at the third international competition held at the Knowledge Discovery and Data Mining (KDD) conference, the KDD 99 dataset (UCI Machine Learning Repository 2015)¹ was presented to the scientific community. This dataset is frequently used in the literature on IDS evaluation, and contains simulated network activity samples, corresponding to normal and abnormal activity divided in five categories:

• Denial of Service (DoS) An intruder tries to make a service unavailable (contains 9 types of DoS attacks);
• Remote to Local (R2L) An intruder tries to obtain remote access to the victim’s machine (contains 15 types of R2L attacks);
• User to Root (U2R) An intruder with physical access to the victim’s machine tries to gain super-user privileges (contains 8 types of U2R attacks);

¹ https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data.
• Probe An intruder tries to get information about a victim’s machine (contains 6 types of Probe attacks);
• Normal It constitutes the normal operations or activities in the network.

Fig. 1 Anomaly detection methodology—the datasets were split, normalized and discretized through some pre-processing techniques before being applied in the algorithms’ learning and testing phases

Tavallaee et al. (2009) made some improvements on the KDD 99 dataset and the result was the NSL-KDD dataset. This dataset is already organized in two subsets: one to train the algorithms, and another one to test them. Each data sample contains 43 features, where four of them are of nominal type, six are binary and the rest are of numerical type. As we are testing one-class classification algorithms, we selected a portion of normal data from the training set and a portion of both normal and attack data from the test set, where the attack data contains all four attack categories and represents 10% of the test set.

Some pre-treatment techniques were applied to the dataset before performing the discretization and normalization operations, as shown in Fig. 1. Some features were removed, namely ‘Wrong_fragment’, ‘Num_outbound_cmds’, ‘Is_hot_login’, ‘Land’ and ‘level_difficulty’, because they have redundant values in at least one of the subsets. In the case of the ‘level_difficulty’ feature, it represents the level of difficulty of detecting the attacks by learning algorithms. This feature was removed because its information is not relevant in a real-world anomaly detection problem. Another pre-treatment operation was the conversion of nominal features to numerical features, since the algorithms to be employed afterwards cannot handle non-numerical data.

After performing the cleaning of the subsets, two different pre-processing techniques were applied to the data. First, the data with continuous features was discretized with the equal-frequency technique. With this technique, the values of each feature were divided into k bins in such a way that each bin contains approximately the same number of samples; thus, each bin holds n/k adjacent values. The value of k is a user-defined parameter, and to obtain this value we used the heuristic k = √n, where n is the number of samples. This discretization technique can provide better accuracy and faster learning in certain anomaly detection algorithms, since the range of values is smaller (Liu et al. 2002).

The second pre-processing technique was data normalization, to have all the features within the same scale. This operation prevents some classification algorithms from giving more importance to features with large numeric values. Once the features are all on the same scale, the classifiers assign the same weight to each attribute. Z-score and MinMax were the normalization techniques applied to the data. The Z-score technique transforms the input so that the mean is zero and the standard deviation is one. On the other hand, MinMax transforms the original input data to a new specific range, where the values lie between 0 and 1. We tested the algorithms with each pre-processing technique and with both combined, to evaluate which techniques improve the performance of the algorithms. We then ran five experiments with each algorithm using the best pre-processing techniques and computed the mean of all the performance metrics to compare their results.
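As an illustration of this pre-processing stage, the sketch below reproduces the equal-frequency discretization (with k = √n bins computed from the training split) and the Z-score/MinMax scaling using pandas and scikit-learn. It is a minimal sketch, not the authors' original code; the column names and the choice of scaler are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def equal_frequency_bins(train_col, test_col, k):
    """Equal-frequency discretization: each of the k bins holds roughly n/k samples."""
    binned_train, edges = pd.qcut(train_col, q=k, labels=False,
                                  retbins=True, duplicates="drop")
    # Reuse the training bin edges on the test split (values outside them become NaN).
    binned_test = pd.cut(test_col, bins=edges, labels=False, include_lowest=True)
    return binned_train, binned_test


def preprocess(train, test, continuous_cols, scaler_cls=MinMaxScaler):
    k = int(np.sqrt(len(train)))  # heuristic: k = sqrt(n), n = number of training samples
    for col in continuous_cols:
        train[col], test[col] = equal_frequency_bins(train[col], test[col], k)
    # MinMaxScaler rescales to [0, 1]; StandardScaler gives the Z-score (zero mean, unit variance).
    scaler = scaler_cls()
    train[continuous_cols] = scaler.fit_transform(train[continuous_cols])
    test[continuous_cols] = scaler.transform(test[continuous_cols])
    return train, test
```

Fitting the scaler on the training split and reusing it on the test split keeps the two sets on the same scale, which matches the rationale given above for the normalization step.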
3.2 ISCX

ISCX is a dataset developed by Shiravi et al. (2012) at the Canadian Institute for Cybersecurity. This dataset is based on the concept of profiles that contain detailed descriptions of abstract intrusions and distributions for applications, protocols, services, and low-level network entities.
To create this dataset, real network communications were analyzed to create profiles for agents that generate real traffic for the HTTP, SMTP, SSH, IMAP, POP3 and FTP protocols. In this regard, a set of guidelines was established to delineate a valid dataset that establishes the basis for profiling. These guidelines are vital to the effectiveness of the dataset in terms of realism, total capture, integrity, and malicious activity (Shiravi et al. 2012). Each data sample in the ISCX dataset contains 21 attributes. There is a total of 7 days of network traffic captured, with four different attack types, shown in Table 1.

Table 1 ISCX captured activity

Capturing date | Network activity
11/06/2010     | Normal
12/06/2010     | Normal
13/06/2010     | Normal + internal infiltration into the network
14/06/2010     | Normal + HTTP denial of service
15/06/2010     | Normal + distributed denial of service using an IRC botnet
16/06/2010     | Normal
17/06/2010     | Normal + brute force SSH

The attacks were captured along with normal network activity. To distinguish between a normal observation and an abnormal one, the ISCX dataset provides an attribute called “label”, where value 1 represents an attack and value 0 represents normal activity.

For this dataset, we made the following changes before applying the pre-processing techniques shown in Fig. 1 and described for the NSL-KDD dataset:

• All nominal features were converted to numerical ones—the algorithms used cannot handle non-numeric features;
• All “Payload” features were removed—these are string features, so it is not possible to train and test the algorithms with these features;
• The source and destination IP address features were removed—there is no interest in training the algorithms with these features, since the IP addresses are constantly changing;
• A new feature was created to represent the time interval of an operation on the network, defined as the difference between the features “stop date time” and “start date time”.
3.3 Unsupervised learning algorithms

Unsupervised learning algorithms are suitable for scenarios where the objective is to perform outlier detection on a dataset. Some of these algorithms follow the basic idea of learning from a training dataset that only contains normal samples; in the classification phase, the output is either “normal”, if the sample resembles the learnt set, or “outlier”, if it does not. These algorithms are named one-class classification methods and appear to be good candidates for the problem of discovering unknown attacks, since every attack can be considered an outlier. In this work, we applied a set of six different one-class algorithms, namely Autoencoder, One-Class Nearest Neighbor, One-Class K-Means, Isolation Forest, One-Class Scaled Convex Hull and One-Class Support Vector Machines, whose performance is evaluated over the NSL-KDD and ISCX datasets.

3.3.1 Autoencoder

An Autoencoder is a neural network that is part of a sub-area of machine learning called deep learning [sets of algorithms with several layers of processing which are used to model high-level abstractions of data (Deng et al. 2014)]. Neural networks are interconnected processing units organized in one or more layers, which can be used to implement a complex functional mapping between input and output variables (Bishop 1995). They can perform linear or non-linear transformations through the processing of the units in the different layers (Mazhelis 2016).

The autoencoder, also known as autoassociator, is a kind of neural network that is trained to make the input features the same as, or very similar to, the output features (Japkowicz 1999). In the classification task, the autoencoder can accurately reproduce only the vectors whose structure is similar to the structure learned by the neural network.

As a neural network, autoencoders are sensitive to outliers, since these contribute to the minimization of the error function. The disadvantage of this method is the need to specify a number of parameters defined by the user (Tax 2001). These parameters consist of selecting the number of hidden layers, the number of hidden units in each layer, the type of transformation function, the learning rate and the stopping rule. In addition to these parameters, it is necessary to estimate a number of weights (usually equal to the number of hidden units and input units) for the training set. A large amount of data is essential for an accurate estimation of the weights. The computational resources for this algorithm are considerably high, since the learning process is iterative, being repeated several times throughout the training set until the stopping rule is satisfied.

For the application of this algorithm, the H2O² package was used. This package is an open source mathematical engine for big data processing with machine learning algorithms, such as generalized linear models, gradient boosting, Random Forests and neural networks (deep learning), in several cluster environments (Stadler 2011). After loading the datasets (NSL-KDD and ISCX), already pre-processed, into this engine, the h2o.deeplearning function was used to train the autoencoder.

² https://cran.r-project.org/web/packages/h2o/index.html.
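As a rough sketch of how such an autoencoder can act as a one-class detector, the example below trains a small network to reconstruct normal traffic and flags test samples whose reconstruction error is unusually high. It uses scikit-learn's MLPRegressor instead of the authors' H2O setup, and the hidden-layer sizes and the 95th-percentile threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor


def fit_autoencoder(X_normal, hidden=(32, 8, 32)):
    # The network is trained to reproduce its input (X -> X) using normal samples only.
    ae = MLPRegressor(hidden_layer_sizes=hidden, activation="relu",
                      max_iter=500, random_state=0)
    ae.fit(X_normal, X_normal)
    return ae


def reconstruction_error(ae, X):
    # Large error means the sample's structure differs from the learnt (normal) structure.
    return np.mean((X - ae.predict(X)) ** 2, axis=1)


def detect(ae, X_normal, X_test, quantile=0.95):
    threshold = np.quantile(reconstruction_error(ae, X_normal), quantile)
    return reconstruction_error(ae, X_test) > threshold   # True = flagged as anomaly
```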
3.3.3 One‑class K‑means

Then, subsequently, the centroids are recomputed. This is done by taking the mean of all objects assigned to that centroid’s cluster. In Eq. (3), the set of data point assignments for each ith cluster centroid is S_i:

c_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i \quad (3)

To apply this algorithm as one-class classification, the cluster-building process should only use normal data examples. In the classification process, the algorithm calculates the distance of each test data point to the closest cluster. A threshold is then defined, and if the calculated distance for an object is higher than the threshold value, the sample is classified as an anomaly. We used silhouette analysis, which measures how close each point in one cluster is to the points in the neighboring clusters; this measure gives us information about the best parameter (number of clusters) to apply. In both the NSL-KDD and ISCX datasets the ideal number of clusters was set to 4.
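A compact sketch of this one-class use of K-means with scikit-learn is shown below. The number of clusters k = 4 follows the silhouette analysis reported above, while the 95th-percentile distance threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_one_class_kmeans(X_normal, k=4, quantile=0.95):
    # Clusters are built from normal data only.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_normal)
    # Distance of every training point to its closest centroid defines the threshold.
    train_dist = km.transform(X_normal).min(axis=1)
    threshold = np.quantile(train_dist, quantile)
    return km, threshold


def predict_anomaly(km, threshold, X):
    # A sample farther from every centroid than the threshold is classified as an anomaly.
    return km.transform(X).min(axis=1) > threshold
```

The value of k itself can be selected by computing sklearn.metrics.silhouette_score over a range of candidate cluster counts, which is the silhouette analysis described above.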
3.3.4 Isolation Forest

Isolation Forest (Liu et al. 2012) is a method for outlier detection that uses data structures called trees, such as binary trees. Each tree is created by partitioning the instances recursively, by randomly selecting an attribute and a split value between the maximum and minimum values of the selected attribute (Liu et al. 2012). Let T be an external node of a tree with no children, or an internal node designated by a test and with exactly two daughter nodes (T_l, T_r). A test is an attribute q with a split value p, where q < p, meaning the data points will be divided into T_l and T_r. To build an isolation tree, the data X = {x_1, …, x_n} is divided recursively by randomly selecting q and p until the tree reaches a height limit, |X| = 1, or all the data in X have the same value. The path length h(x) of an observation x is the number of edges traversed from the root node to the external node that isolates x.

The observations in the dataset are represented by n, and H(i) is a harmonic number, which can be estimated using Euler’s constant. The parameter c(n) is used to normalize h(x), since it represents the average path length for a given n. Eq. (5) gives the anomaly score s of an observation x:

s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \quad (5)

E(h(x)) is the average of h(x) over a collection of isolation trees. Using the anomaly score, Liu et al. (2012) verified that observations with an s value much smaller than 0.5 are quite safe to regard as normal observations, observations with an s value very close to 1 are definitely anomalies, and observations that return s ≈ 0.5 do not indicate any distinct anomaly.

In our tests, we used the default algorithm parameter of 100 trees in both datasets, since experimentally the variation of this parameter did not show any substantial impact on performance.
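The same setup can be reproduced with scikit-learn's IsolationForest, as in the minimal sketch below; beyond the 100 trees mentioned above, the remaining settings are the library defaults and are assumptions of this example.

```python
from sklearn.ensemble import IsolationForest


def fit_isolation_forest(X_normal, n_trees=100):
    # 100 trees, as used in the experiments; no scaling is required since the trees
    # split directly on raw attribute values.
    return IsolationForest(n_estimators=n_trees, random_state=0).fit(X_normal)


def anomaly_score(forest, X):
    # scikit-learn's score_samples is the opposite of the score s(x, n) above:
    # values close to -1 correspond to s close to 1 (clear anomalies).
    return -forest.score_samples(X)


def predict_anomaly(forest, X):
    return forest.predict(X) == -1   # -1 = anomaly, +1 = normal
```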
3.3.5 One‑class scaled convex hull

The Scaled Convex Hull (SCH) is an algorithm based on a method previously proposed by Casale et al. (2011) that uses the geometrical structure of the Convex Hull (CH) to define the class in one-class classification problems. This algorithm uses random projections and an ensemble of CH models in two dimensions, and thus the method can be suitable for larger dimensions in an acceptable execution time (Fernández-Francos et al. 2017), as we can see in Eq. (6):

CH(S) = \left\{ \sum_{i=1}^{|S|} \theta_i x_i \;\middle|\; (\forall i: \theta_i \ge 0) \wedge \sum_{i=1}^{|S|} \theta_i = 1, \; x_i \in S \right\} \quad (6)
Each type of center leads to different decision regions (whether a point belongs or not to the target class), giving more flexibility to this method.

In our experiments, we found that the best parameters for this algorithm were:

• A value of λ = 1.22 in the NSL-KDD and λ = 1.11 in the ISCX dataset;
• Around 2000 projections;
• A center type that uses the average of the CH vertices in the projected space.
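A simplified sketch of this projection-ensemble idea is given below: each of a number of random 2-D projections of the normal data yields a convex hull, scaled by λ around the average of its vertices, and a test point is accepted as normal only if it falls inside the scaled hull in every projection. This is an interpretation of the method under the parameters listed above, not the authors' implementation; the number of projections is reduced here for speed.

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay


def fit_sch(X_normal, n_projections=200, lam=1.22, seed=0):
    rng = np.random.default_rng(seed)
    d = X_normal.shape[1]
    models = []
    for _ in range(n_projections):
        P = rng.normal(size=(d, 2))                 # random projection to two dimensions
        Z = X_normal @ P
        hull = ConvexHull(Z)
        centre = Z[hull.vertices].mean(axis=0)      # centre: average of the CH vertices
        scaled = centre + lam * (Z[hull.vertices] - centre)   # expanded hull (lam > 1)
        models.append((P, Delaunay(scaled)))        # Delaunay gives a point-in-hull test
    return models


def predict_normal(models, X):
    # A sample is "normal" only if it lies inside the scaled hull in every projection.
    inside_all = np.ones(len(X), dtype=bool)
    for P, tri in models:
        inside_all &= tri.find_simplex(X @ P) >= 0
    return inside_all
```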
3.3.6 One‑class support vector machines

Support Vector Machines (SVM) have the capability to solve classification and regression problems. This algorithm focuses on the search for a hyperplane (a generalization of a plane to different dimensions; in a three-dimensional space, for example, it corresponds to an ordinary plane) that separates the classes with the largest possible margin. In the soft-margin formulation, slack variables ξ_i tolerate some training errors, and the problem is

\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i

subject to:

y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i \quad \text{for all } i = 1, \dots, n
\xi_i \ge 0 \quad \text{for all } i = 1, \dots, n

Therefore, for anomaly detection problems the One-Class Support Vector Machine (OCSVM or ν-SVM) only trains with data from one class, in this case the class that represents normal activity in the network. Basically, it separates all the data points from the origin and maximizes the distance from the hyperplane to the origin. This results in a binary function that captures the regions of the input space where the probability density of the data lives. The minimization function is given by Eq. (9) (Schölkopf et al. 2000):

\min_{w,\,\xi_i,\,\rho} \; \frac{1}{2}\|w\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho \quad (9)
Table 2 Comparative results using mean AUC (× 100) for each algorithm, using the best combination of pre-processing techniques, on the NSL-KDD and ISCX datasets

Best pre-processing techniques | One-class algorithms    | NSL-KDD (AUC) | ISCX (AUC)
No pre-processing              | Isolation forest        | 81.71         | 90.70
Z-score                        | K-means                 | 84.76         | 77.06
Z-score                        | 1-Nearest neighbor      | 84.85         | 95.20
Equal frequency + MinMax       | Autoencoder             | 83.65         | 80.44
Equal frequency + MinMax       | Scaled convex hull      | 85.30         | 85.95
Equal frequency + MinMax       | Support vector machines | 83.14         | 91.63
In our experiments, the best parameters found for this algorithm were:

• The radial basis kernel function was used;
• γ = 0.3 in the NSL-KDD and γ = 4.2 in the ISCX (parameter used for the radial basis kernel);
• ν = 0.01 in the NSL-KDD and ν = 0.005 in the ISCX.
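With scikit-learn, the corresponding one-class SVM can be set up as in the sketch below; the ν and γ values mirror the NSL-KDD settings above, and everything else is a minimal illustrative assumption rather than the authors' configuration.

```python
from sklearn.svm import OneClassSVM


def fit_ocsvm(X_normal, nu=0.01, gamma=0.3):
    # nu upper-bounds the fraction of training (normal) samples treated as outliers;
    # gamma is the width of the radial basis (RBF) kernel.
    return OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_normal)


def predict_anomaly(model, X):
    return model.predict(X) == -1   # -1 = outside the learnt "normal" region
```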
4 Performance evaluation

All combinations of the pre-processing techniques with the unsupervised learning algorithms were tested, and we present the results of the best techniques applied to each algorithm for the NSL-KDD and ISCX datasets in Table 2. To evaluate the performance of the classifiers we used several metrics, as described below.

In classification problems, metrics such as recall, precision and F1 score are commonly used to understand the amount of mispredicted observations in a specific class. Each of these metrics gives us information about a different type of error generated by the classifier. With Recall, also known as TPR or sensitivity and presented in the AUC section, we can learn the percentage of instances from the anomaly class that are actually predicted correctly. This means that the higher the Recall, the smaller the false negatives (type 2 error). On the other hand, Precision is very similar to Recall, but instead of computing the false negatives it computes the
false positives (TP/(TP + FP)). This represents the portion of the elements predicted as belonging to the anomaly class that were correctly predicted. So, as with Recall, we can say that the higher the Precision, the smaller the false positives (type 1 error). When using these two metrics there is often a tradeoff between them, so it is important to evaluate them together using another metric, the F1 score, shown in Eq. (10):

F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (10)

As shown in Eq. (10), the F1 score represents the harmonic mean, which combines the Recall and Precision metrics with equal weight.
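These metrics can be computed directly with scikit-learn, as in the short sketch below; labels are assumed to be 1 for attack (anomaly) and 0 for normal, and the anomaly score is whatever continuous output the chosen detector produces.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score


def evaluate(y_true, y_pred, y_score):
    """y_true/y_pred: 1 = attack (anomaly), 0 = normal; y_score: continuous anomaly score."""
    return {
        "recall": recall_score(y_true, y_pred),        # TP / (TP + FN): few false negatives -> high recall
        "precision": precision_score(y_true, y_pred),  # TP / (TP + FP): few false positives -> high precision
        "f1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
        "auc": roc_auc_score(y_true, y_score),
    }
```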
As we can see in Table 2, the One-Class K-means and 1-Nearest Neighbor algorithms had their best performance when applying the Z-score technique. The Isolation Forest algorithm had its best results without any kind of data transformation, as it uses binary trees in the process of recursive data partitioning. The Autoencoder, SCH and ν-SVM had their best performance in detecting anomalies when applying the MinMax and Equal Frequency (EF) techniques in the pre-processing phase.

Looking at the NSL-KDD results, the SCH classifier had the best performance, with an AUC value around 85. The other algorithms obtained very close results, ranging between 81 and 84 AUC, where the 1-Nearest Neighbor was the second-best classifier with an AUC close to 85. Regarding the ISCX dataset, analyzing the table we can observe that the 1-Nearest Neighbor algorithm obtained the highest AUC result, followed by ν-SVM. In this dataset the AUC results were higher compared to the NSL-KDD. One of the reasons for this is the fact that the NSL-KDD has 38 different types of attacks, compared to the ISCX with only 4 different types.

To verify whether there is a significant difference between the classifiers’ performance in both datasets, we applied the Nemenyi post hoc statistical test and present a critical difference diagram (Demšar 2006), as shown in Fig. 5. As we can see, all the algorithms are connected to each other (thickest horizontal line underneath the critical difference scale), meaning that they are not significantly different (at level α = 0.10).

The non-significant difference between the algorithms can be explained by the fact that we only used two datasets to test the classifiers, due to the lack of good datasets in the cybersecurity field. Even though the classifiers are not significantly different from each other, we can see that on average Nearest Neighbor, SCH and ν-SVM have a high score compared to the other three algorithms.

Since the test set has unbalanced classes, we plotted the performance of the algorithms using other metrics that can measure the errors in more detail. These metrics are Recall, Precision and F1 score.

Starting with the NSL-KDD dataset and observing Fig. 6, looking at the F1 score metric (as it represents the harmonic mean combining the two other metrics), we can see that all algorithms showed similar results: the Isolation Forest and K-Means with 53% and 55% respectively, and the others ranging between 60 and 66%, with SCH being the algorithm with the highest F1 score. We can also look at the precision and recall metrics to have a better perception of the false positive and false negative costs. Few false negatives mean a higher value of recall and vice versa, and we can say the same regarding precision with respect to the false positives. Observing the graphic in Fig. 6, all algorithms except SCH and ν-SVM had a recall value much higher than precision, so the false positives were much higher than the false negatives in these cases. In cybersecurity it is important to have a low false negative rate, since false negatives represent the worst-case scenario, where data is predicted as normal activity while in fact it represents malicious or abnormal activity. Regarding SCH and ν-SVM, they both had the highest F1 scores compared to the other anomaly detection techniques, but at the same time they had more misclassified observations representing false negatives than misclassified observations representing false positives.

Analyzing Fig. 7, concerning the ISCX dataset, we observe that Nearest Neighbor, SCH and ν-SVM have much better performance results than those obtained for the NSL-KDD. On the other hand, the Isolation Forest and K-means algorithms remained with approximately the same results as in NSL-KDD. Another fact that can be observed is that the SCH algorithm generates fewer false negatives and more false positives when trying to detect the four different types of attacks contained in the ISCX dataset.
5 Conclusion and future work

Threats in information systems have become increasingly intelligent and they can deceive basic security solutions such as firewalls and antivirus software. Anomaly-based IDSs allow the classification of monitored network traffic or computer system calls into normal or malicious activity. The efficiency of intrusion detection depends on the techniques used in these systems. As mentioned, the work carried out in this paper was motivated by the SASSI project. The goal was to verify whether any of the unsupervised techniques presented in this paper could be implemented in an IDS to support systems administrators in decision
Liu H, Hussain F, Tan CL, Dash M (2002) Discretization: an enabling technique, pp 393–423. https://pdfs.semanticscholar.org/2d18/73800b294a104a836168ac5bba11edeadc7f.pdf
Liu Z, Liu JG, Pan C, Wang G (2009) A novel geometric approach to binary classification base. IEEE Trans Neural Networks 20(7):1215–1220. https://doi.org/10.1109/TNN.2009.2022399
Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):3:1–3:39. https://doi.org/10.1145/2133360.2133363
Manevitz LM, Yousef M (2001) One-class SVMs for document classification. J Mach Learn Res 2:139–154
Mazhelis O (2016) One-class classifiers: a review and analysis of suitability in the context of mobile-masquerader detection. HAL Id: hal-01262354. https://hal.inria.fr/hal-01262354/document
Niyaz Q, Sun W, Javaid AY, Alam M (2015) A deep learning approach for network intrusion detection system. In: Proceedings of the 9th EAI international conference on bio-inspired information and communications technologies. https://doi.org/10.4108/eai.3-12-2015.2262516
Noto K, Brodley C, Slonim D (2012) FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Min Knowl Discov 25(1):109–133. https://doi.org/10.1007/s10618-011-0234-x
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1):90–105. https://doi.org/10.1145/1007730.1007731
Schölkopf B, Williamson R, Smola A, Shawe-Taylor J, Platt J (2000) Support vector method for novelty detection. Adv Neural Inf Process Syst 12:582–588
Shin HJ, Eom D-H, Kim S-S (2005) One-class support vector machines—an application in machine fault detection and classification. Comput Ind Eng 48(2):395–408. https://doi.org/10.1016/j.cie.2005.01.009
Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA (2012) Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur 31(3):357–374. https://doi.org/10.1016/j.cose.2011.12.012
Stadler T (2011) R topics documented. Package ‘TreePar’, 2. https://doi.org/10.2307/2533043
Tavallaee M, Bagheri E, Lu W, Ghorbani AA (2009) A detailed analysis of the KDD CUP 99 data set. In: IEEE symposium on computational intelligence for security and defense applications, CISDA 2009, pp 1–6. https://doi.org/10.1109/CISDA.2009.5356528
Tax DMJ (2001) One-class classification: concept learning in the absence of counter-examples. http://homepage.tudelft.nl/n9d04/thesis.pdf
Tieleman T, Hinton G (2012) Lecture 6.5-RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn 4(2):26–31
Tsai CF, Hsu YF, Lin CY, Lin WY (2009) Intrusion detection by machine learning: a review. Expert Syst Appl 36(10):11994. https://doi.org/10.1016/j.eswa.2009.05.029

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.