
Journal of Ambient Intelligence and Humanized Computing (2020) 11:4477–4489

https://doi.org/10.1007/s12652-019-01417-9

ORIGINAL RESEARCH

Performance evaluation of unsupervised techniques in cyber‑attack anomaly detection
Jorge Meira2 · Rui Andrade1 · Isabel Praça1 · João Carneiro1 · Verónica Bolón‑Canedo2 · Amparo Alonso‑Betanzos2 ·
Goreti Marreiros1

Received: 28 November 2018 / Accepted: 1 August 2019 / Published online: 7 August 2019
© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
Cyber security is a critical area in computer systems, especially when dealing with sensitive data. At present, it is becoming increasingly important to ensure that computer systems are secured from attacks, due to modern society's dependence on those systems. To prevent these attacks, most organizations nowadays make use of anomaly-based intrusion detection systems (IDS). Usually, IDS contain machine learning algorithms which aid in predicting or detecting anomalous patterns in computer systems. Most of these algorithms are supervised techniques, which have gaps in the detection of unknown patterns or zero-day exploits, since these are not present in the algorithm's learning phase. To address this problem, we present in this paper an empirical study of several unsupervised learning algorithms used in the detection of unknown attacks. In this study we evaluated and compared the performance of different types of anomaly detection techniques on two publicly available datasets: the NSL-KDD and the ISCX. This evaluation allows us to understand the behavior of these techniques and how they could be fitted into an IDS to fill the mentioned gap. The present evaluation could also be used in the future as a comparison baseline for other unsupervised algorithms applied in the cybersecurity field. The results obtained show that the techniques used are capable of carrying out anomaly detection with an acceptable performance, thus making them suitable candidates for future integration in intrusion detection tools.

Keywords Anomaly detection · One-class classification · Intrusion detection · Unsupervised learning

* Jorge Meira [email protected]
Rui Andrade [email protected]
Isabel Praça [email protected]
João Carneiro [email protected]
Verónica Bolón‑Canedo [email protected]
Amparo Alonso‑Betanzos [email protected]
Goreti Marreiros [email protected]

1 GECAD‑Research Group on Intelligent Engineering and Computing for Advanced Innovation and Development, Institute of Engineering, Polytechnic of Porto (ISEP/IPP), Porto, Portugal
2 LIDIA‑Laboratory for Research and Development in Artificial Intelligence, Department of Computer Science, University of A Coruña, A Coruña, Spain

1 Introduction

Computer systems play a major role in modern everyday life. Almost everything from personal calendars to financial records and e-commerce operations is done with recourse to a computing device with a network connection. Important information is stored and sent on all sorts of devices, from small low-power smartwatches to huge datacenters. This creates an extensive attack vector that ill-intentioned individuals and/or organizations may try to exploit. Attackers use a variety of different techniques to try to exploit security flaws in systems. This may result in sensitive data breaches, stolen user accounts or taking control over the system.

To combat these attacks, system administrators and security experts often need to use safety measures to eliminate these attacks or at least mitigate their effects. One of these safety measures is intrusion detection systems (IDS). These systems perform cyber-attack detection, using a variety of techniques to discover failures and malicious activity in computer systems.

IDS tend to follow one of two different approaches: (a) signature based, or (b) anomaly based. Signature-based detection requires prior knowledge of an attack before being able to identify it; on the other hand, techniques based on anomaly detection work by acquiring knowledge of the patterns that represent "normal" or "attack" data and then classify new data according to their resemblance to those patterns. This latter approach gives the IDS the possibility of detecting attacks even if the attack is not currently known (a zero-day attack, that is, an attack that is unknown or unaddressed yet, and thus can be exploited to adversely affect the computer or network), because these new attacks may present more similarities to previous attacks than to "normal" data.

Within anomaly-based IDSs, different algorithms may be used. Supervised learning algorithms are suitable for problems in which a set of already existing and previously classified samples can be used as a training dataset. On the other hand, when novel vulnerabilities and attacks are involved, there are no classified examples for a supervised algorithm to learn from. One possibility to deal with this problem is the use of unsupervised learning algorithms. Unsupervised learning techniques can learn what is normal for a given set of data and then are capable of finding deviations in new unclassified data, which in this scenario would indicate a possible attack that until now was unknown.

The motivation for this study comes from the SASSI—"Sistema de Apoio à decisão de Segurança em Sistemas Informáticos" (Decision Support System for Security in Computer Systems) project, whose objective is the development of an Intelligent Decision Support System that centralizes, structures and allows the visualization of information regarding the activity of computer networks and the individual machines in given networks, allowing the automatic detection, prediction and prevention of anomalies, cyber-attacks and possible security risks. This platform aims to support computer network administrators who are increasingly faced with critical decision-making tasks regarding security problems that cannot be detected by typical anti-malware protection systems. This paper focuses on cyber-attack and anomaly detection using unsupervised learning algorithms, and explores six of these algorithms: Autoencoder, One-Class Nearest Neighbor, Isolation Forest, One-Class K-Means, One-Class Scaled Convex Hull and One-Class Support Vector Machines, over two different public datasets, the NSL-KDD (Tavallaee et al. 2009) and the ISCX (Shiravi et al. 2012) datasets.

Our results show that the techniques used are capable of achieving high-performance results in the classification tasks tested in our case study, and consequently are candidates for future implementation in an IDS.

This paper has the following structure: Sect. 2 presents some related work on this topic; Sect. 3 describes the workflow used, including a description of the datasets and pre-processing techniques applied in our approach, and also describes all of the unsupervised algorithms tested, how they work and which parameters were used in our study; Sect. 4 presents a comparative evaluation of the results; and finally Sect. 5 draws the conclusions and ideas for future work.

2 Related work

As IDS classification problems are a frequent topic of study in the literature, many authors have proposed and studied interesting techniques to deal with the problem of unknown attacks. The task of identifying whether a new instance belongs to the class of the data that has been used for training the classifier, or whether it is an outlier, is known as one-class classification. This means that the classifier only learns the data patterns of one class (the target class) in the training phase. Other names given to this field are novelty or outlier detection, and concept learning (Khan and Madden 2014). One-class algorithms have proven to be an important tool in several domains, such as disease detection (Gardner et al. 2006), intrusion detection (Giacinto et al. 2008), text/document classification (Manevitz and Yousef 2001), or predictive maintenance (Shin et al. 2005).

Fernández-Francos et al. (2017) presented a novel one-class classification algorithm aimed at distributed environments, called the One-Class Convex Hull-Based Algorithm. Their results showed that this method was accurate in one-class classification problems and efficient in big data scenarios due to the distributed nature of the approach. Castillo et al. (2015) proposed a Distributed One-Class Support Vector Machine (DOC-SVM) method for classification problems. They experimented with different datasets and their results demonstrated that the proposed DOC-SVM was able to achieve accurate results with a reduction in the necessary training time when compared to other classifiers known in the literature. Chen et al. (2017) introduced autoencoder ensembles for unsupervised outlier detection. They presented the random edge sampling technique, which randomly drops connections in a neural network while retaining a certain level of control on the connection density between layers, so that various models with different densities can be created. This method was used in conjunction with an adaptive data sampling approach, where the authors applied the RMSprop (Tieleman and Hinton 2012) optimization method to speed up the learning process. Their method, named RandNet, which stands for Randomized Neural Network for Outlier Detection, showed robustness in avoiding the overfitting problem, and it was competitive with respect to other neural network techniques.


In the intrusion detection field, Goldstein and Uchida (2016) presented a comparative evaluation of unsupervised algorithms used in the context of anomaly detection. The algorithms were applied to a group of different datasets, one of which was the KDD 99, described in Sect. 3.1; however, the analyses only used the part of the dataset regarding HTTP traffic. It is important to note that an improved version of this dataset, called NSL-KDD, is presented and used in this paper.

Aleroud and Karabatis (2013) explored the detection of zero-day attacks with an approach that combines already existing methods with linear data transformation techniques, such as discriminant functions that separate normal patterns from attack patterns, and anomaly detection techniques using the One-Class Nearest Neighbor algorithm (1-NN) to identify the zero-day attacks. Their approach consisted of a system of several static components and processes. The first component was the network data repository, where they used the NSL-KDD dataset. The second component represented the pre-processing methods applied to the NSL-KDD dataset, where they converted numeric features into bins. The third module, misuse detection, consisted in identifying attacks that are relevant to a particular context and also identifying normal activities in the network to reduce false positive alerts. This module used conditional entropy to create context profiles of known attacks using patterns from historical data. Finally, the last component represented the anomaly detection module, where the 1-NN algorithm was used to detect deviations from normal activity and the Singular Value Decomposition (SVD) technique was used to reduce the data dimensionality. Their approach showed good performance, detecting zero-day attacks with a low false positive rate.

Casas et al. (2012) presented the concept of an Unsupervised Network Intrusion Detection System (UNIDS), using Sub-Space Clustering and Multiple Evidence Accumulation techniques for outlier detection. Their unsupervised security system consisted in analyzing packets captured in continuous time slots of fixed length, running three consecutive steps. In the first step, clustering analysis was performed to detect anomalous time slots. The second step used a multi-clustering algorithm based on a combination of several techniques (Parsons et al. 2004; Ester et al. 1996; Fred and Jain 2005) to rank the degree of abnormality of all the identified outlying flows. The third step used a simple threshold detection technique to flag the top-ranked outlying flows as anomalies. Their evaluation of this system included its application to the KDD 99 dataset.

Noto et al. (2012) studied anomaly detection using an approach called FRaC (feature regression and classification). The FRaC technique built a model of normal data and the distances of its features, and used the learnt model to detect when an anomaly occurred. They also compared their approach with other commonly used techniques, such as Local Outlier Factor (LOF), one-class support vector machines (one-class SVM) and cross-feature analysis (CFA).

Our work intends to show and compare the behavior of several one-class classification algorithms (some of them already mentioned in this section) and apply them to two recent intrusion datasets, with the purpose of identifying whether these techniques could be integrated in an IDS inside the SASSI project.

3 Anomaly detection methodology

In our work, we study the behavior of several unsupervised algorithms based on one-class classification, in order to verify if these techniques are a viable solution to discover and detect unknown attacks. In this section, we describe the network anomaly detection methodology, as shown in Fig. 1. We describe the datasets used and the pre-processing techniques applied to them before feeding the algorithms, as well as the unsupervised techniques employed. In the next section (Sect. 4) we explain how the anomaly detection algorithms performed.

In our exploration, we analyzed the NSL-KDD (Tavallaee et al. 2009) and ISCX (Shiravi et al. 2012) datasets. These datasets contain samples from normal activity and from simulated attacks in computer systems, and are commonly used in the literature. Before using the learning algorithms, we employed some pre-processing methods to prepare the data.

3.1 NSL‑KDD

In 1999, in the third international competition of the Knowledge Discovery and Data Mining (KDD) conference, the KDD 99 dataset (UCI Machine Learning Repository 2015)1 was presented to the scientific community. This dataset is frequently used in the literature on IDS evaluation, and contains simulated network activity samples corresponding to normal and abnormal activity, divided into five categories:

• Denial of Service (DoS) An intruder tries to make a service unavailable (contains 9 types of DoS attacks);
• Remote to Local (R2L) An intruder tries to obtain remote access to the victim's machine (contains 15 types of R2L attacks);
• User to Root (U2R) An intruder with physical access to the victim's machine tries to gain super-user privileges (contains 8 types of U2R attacks);

1 https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data


Fig. 1  Anomaly detection methodology—the datasets were split, normalized and discretized through some pre-processing techniques before being applied in the algorithms' learning and testing phases

• Probe An intruder tries to get information about a victim's machine (contains 6 types of Probe attacks);
• Normal It constitutes the normal operations or activities in the network.

Tavallaee et al. (2009) made some improvements on the KDD 99 dataset and the result was the NSL-KDD dataset. This dataset is already organized in two subsets: one to train the algorithms, and another one to test them. Each data sample contains 43 features, where four of them are of nominal type, six are binary and the rest are numerical. As we are testing one-class classification algorithms, we selected a portion of normal data from the training set and a portion of both normal and attack data from the test set, where the attack data contains all four attack categories and represents 10% of the test set.

Some pre-treatment techniques were applied to the dataset before performing the discretization and normalization operations, as shown in Fig. 1. Some features were removed, namely 'Wrong_fragment', 'Num_outbound_cmds', 'Is_hot_login', 'Land' and 'level_difficulty', because they have redundant values in at least one of the subsets. In the case of the 'level_difficulty' feature, it represents the level of difficulty of detecting the attacks by learning algorithms. This feature was removed because its information is not relevant in a real-world anomaly detection problem. Another pre-treatment operation on the data was the conversion of nominal features to numerical features, since the algorithms to be employed afterwards cannot handle non-numerical data.

After performing the cleaning of the subsets, two different pre-processing techniques were applied to the data. First, the data with continuous features was discretized with the equal frequency technique. With this technique, the values of each feature are divided into k bins in such a way that each bin contains approximately the same number of samples. Thus, each bin has approximately n/k adjacent values. The value of k is a user-defined parameter, and to obtain this value we used the heuristic k = √n, where n is the number of samples. This discretization technique can provide better accuracy and fast learning in certain anomaly detection algorithms, since the range of values is smaller (Liu et al. 2002).

The second pre-processing technique was data normalization, to have all the features within the same scale. This operation prevents some classification algorithms from giving more importance to features with large numeric values. Once the features are all on the same scale, the classifiers assign the same weight to each attribute. Z-score and MinMax were the normalization techniques applied to the data. The Z-score technique transforms the input so that the mean is zero and the standard deviation is one. On the other hand, MinMax transforms the original input data to a new specific range where the values are between 0 and 1. We tested the algorithms with each pre-processing technique and with both combined, to evaluate which techniques improve the performance of the algorithms. We then ran five experiments with each algorithm using the best pre-processing techniques and calculated the mean of all the performance metrics to compare their results.
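To make these steps concrete, the following is a minimal sketch of this pre-processing pipeline using scikit-learn (the paper does not name an implementation, so the library choice, function names and variable names are assumptions): equal-frequency (quantile) binning with k = √n bins, followed by Z-score and MinMax scaling, all fitted on the normal training data only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler, MinMaxScaler

def preprocess(X_train, X_test):
    """Equal-frequency discretization + Z-score + MinMax, fitted on normal training data only."""
    k = int(np.sqrt(len(X_train)))  # heuristic k = sqrt(n) bins
    steps = [
        KBinsDiscretizer(n_bins=k, encode="ordinal", strategy="quantile"),  # equal-frequency bins
        StandardScaler(),   # Z-score: zero mean, unit standard deviation
        MinMaxScaler(),     # rescale to the [0, 1] range
    ]
    for step in steps:
        X_train = step.fit_transform(X_train)  # fit only on the (normal) training subset
        X_test = step.transform(X_test)        # apply the same transformation to the test subset
    return X_train, X_test
```

In the experiments, the discretization and normalization steps are enabled or disabled per algorithm, according to which combination gave the best results (see Table 2).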
3.2 ISCX

ISCX is a dataset developed by Shiravi et al. (2012) at the Canadian Institute for Cybersecurity. This dataset is based on the concept of profiles that contain detailed descriptions of abstract intrusions and distributions for applications, protocols, services, and low-level network entities.


To create this dataset, real network communications were analyzed to create profiles for agents that generate real traffic for the HTTP, SMTP, SSH, IMAP, POP3 and FTP protocols. In this regard, a set of guidelines was established to delineate a valid dataset that establishes the basis for profiling. These guidelines are vital to the effectiveness of the dataset in terms of realism, total capture, integrity, and malicious activity (Shiravi et al. 2012). Each data sample in the ISCX dataset contains 21 attributes. There is a total of 7 days of network traffic captured, with four different attack types, shown in Table 1.

Table 1  ISCX captured activity

Capturing date    Network activity
11/06/2010        Normal
12/06/2010        Normal
13/06/2010        Normal + internal infiltration into the network
14/06/2010        Normal + HTTP denial of service
15/06/2010        Normal + distributed denial of service using an IRC Botnet
16/06/2010        Normal
17/06/2010        Normal + brute force SSH

The attacks were captured along with normal network activity. To distinguish between a normal observation and an abnormal one, the ISCX dataset provides an attribute called "label", where the value 1 represents an attack and the value 0 represents normal activity.

For this dataset, we made the following changes before applying the pre-processing techniques shown in Fig. 1 and described for the NSL-KDD dataset (a sketch of these steps is given after the list):

• All nominal features were converted to numerical—the algorithms used cannot handle non-numeric features;
• All "Payload" features were removed—these are string features, so it is not possible to train and test the algorithms with these features;
• The source and destination IP address features were removed—there is no interest in training the algorithms with these features, since the IP addresses are constantly changing;
• A new feature was created to represent the time interval of an operation on the network, defined as the difference between the "stop date time" and "start date time" features.
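The pandas sketch below illustrates these changes; the column names used here (payload, IP address and timestamp fields) are hypothetical placeholders, not the exact ISCX attribute names.

```python
import pandas as pd

def prepare_iscx(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative clean-up of an ISCX-like flow table (column names are assumptions)."""
    # Drop string payload features
    df = df.drop(columns=[c for c in df.columns if "Payload" in c])
    # Drop source and destination IP addresses
    df = df.drop(columns=["source", "destination"], errors="ignore")
    # New duration feature = stop date time - start date time (in seconds)
    df["duration"] = (pd.to_datetime(df["stopDateTime"])
                      - pd.to_datetime(df["startDateTime"])).dt.total_seconds()
    df = df.drop(columns=["startDateTime", "stopDateTime"])
    # Encode the remaining nominal features as integer codes
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes
    return df
```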
engine for big data processing machine learning algorithms,
3.3 Unsupervised learning algorithms such as generalized linear models, gradient boosting, Ran-
dom Forests and neural networks (deep learning) in several
Unsupervised learning algorithms are suitable for scenarios cluster environments (Stadler 2011).
where the objective is to perform outlier detection to a data- After loading the datasets (NSL-KDD and ISCX)
set. Some of these algorithms follow the basic idea of learn- already pre-processed to this engine, the h2o.deeplearning
ing from a training dataset that only contains normal sam-
ples, and in the classification the output is either “normal”
if it resembles the learnt set or “outlier” if it does not. These 2
https​://cran.r-proje​ct.org/web/packa​ges/h2o/index​.html.


After loading the datasets (NSL-KDD and ISCX), already pre-processed, into this engine, the h2o.deeplearning function was used to train the autoencoder algorithm. Regarding the parameters to be used, we tested several, but we found that Glander's3 approach, applying a bottleneck architecture used for fraud detection in credit card transactions, presented better results. When tuning the autoencoder, we used a separate validation set, and we found that using a bottleneck architecture, where the number of neurons in the middle layer is lower than in the first and last layers, presented better classification results. The area under the curve metric was employed to compare the performance of the algorithm with different hyperparameter values. The hyperparameter values used were:

3 https://shiring.github.io/machine_learning/2017/05/02/fraud_2

• Hidden = c(50, 5, 50)—defines the number of hidden layers and units of the neural network; in this case the vector c(50, 5, 50) contains three values and each value corresponds to the number of neurons per layer;
• Activation: tanh—we define the hyperbolic tangent activation function;
• Epochs = 20—specifies the number of times to iterate over the dataset.

We used the h2o.anomaly function after the model finalized its training process. This function is intended to detect anomalies in a dataset: it reconstructs the original dataset using the trained model and calculates the MSE for each point in the test set. We then created a graphic that represents the reconstruction mean square error, as shown in Fig. 2. This graphic represents an example of a test made on a test sample of the ISCX dataset. It turns out that at a certain point the MSE increases. This means that the model could not correctly reconstruct these records, which could be considered anomalies. So, we drew a threshold, in this case equal to 0.002, where all records above this threshold are treated as anomalies.

Fig. 2  Reconstruction of the mean square error
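The experiments above use the R h2o package (h2o.deeplearning and h2o.anomaly). Purely as an illustration of the same idea—a 50–5–50 tanh bottleneck trained for 20 epochs on normal data, with records whose reconstruction MSE exceeds 0.002 flagged as anomalies—the sketch below uses scikit-learn's MLPRegressor; it is an assumption-based stand-in, not the authors' implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Bottleneck autoencoder sketch: the network is trained to reproduce its own input.
autoencoder = MLPRegressor(hidden_layer_sizes=(50, 5, 50), activation="tanh",
                           solver="adam", max_iter=20, random_state=0)

def fit_autoencoder(X_normal):
    # Train only on normal traffic, so anomalous records reconstruct poorly.
    autoencoder.fit(X_normal, X_normal)
    return autoencoder

def reconstruction_mse(model, X):
    X = np.asarray(X)
    recon = model.predict(X)
    return np.mean((X - recon) ** 2, axis=1)   # per-record reconstruction MSE

def detect_anomalies(model, X_test, threshold=0.002):
    # Records whose reconstruction MSE exceeds the threshold are flagged as anomalies.
    return reconstruction_mse(model, X_test) > threshold
```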
3.3.2 One‑class k nearest neighbor

The One-Class Nearest Neighbor (OCNN) is an adaptation of the original K-Nearest Neighbor supervised algorithm that uses distances between neighbors to classify the data. In the training phase the OCNN algorithm starts by memorizing the observations. The observations are objects, each one representing a point in the space defined by the features. In the training process all observations come only from the class that represents normal activity in the network. After memorizing all the objects, in the test phase the algorithm uses a metric to calculate distances between points. As distance metric, the Euclidean distance given by Eq. (1) was used, where x_i and x_j are two objects represented by vectors in the space R^d:

d(x_i, x_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2}    (1)

In this phase, the distance between the test object and its first nearest neighbor (if k is equal to 1) from the training set is calculated. The distance values are then used to identify a maximum distance and define a threshold. If the distance is higher than the maximum distance threshold, the sample in question is classified as an outlier.

For anomaly detection in the NSL-KDD and ISCX datasets, the value of k for the OCNN algorithm was chosen to be equal to 1 because, after testing with different values, this value obtained the highest performance in the data classification.
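A compact sketch of the OCNN rule described above, using scikit-learn's NearestNeighbors; the specific threshold rule (the maximum nearest-neighbor distance observed within the normal training set) is one plausible reading of the description, not a confirmed detail of the authors' setup.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class OneClassNN:
    """1-NN novelty detector: fit on normal data, flag points far from their nearest neighbor."""
    def __init__(self, k=1):
        self.k = k
        self.nn = NearestNeighbors(n_neighbors=k)

    def fit(self, X_normal):
        self.nn.fit(X_normal)
        # Threshold: maximum nearest-neighbor distance among training points
        # (k + 1 neighbors are queried so the self-match can be discarded).
        dist, _ = self.nn.kneighbors(X_normal, n_neighbors=self.k + 1)
        self.threshold_ = dist[:, 1:].max()
        return self

    def predict(self, X):
        dist, _ = self.nn.kneighbors(X, n_neighbors=self.k)
        return (dist[:, -1] > self.threshold_).astype(int)   # 1 = outlier/attack, 0 = normal
```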
3.3.3 One‑class K‑means

K-Means is a clustering algorithm, clustering being the process of partitioning a set of data (or objects) into smaller subclasses with common characteristics. The number of clusters is defined initially and remains fixed throughout the process. The goal of this algorithm is to find the different groups in the data, defined by the variable k. Based on the data features, this algorithm works iteratively to assign each observation or object to one of the k groups. The algorithm starts by estimating the k centroids, which are the centers of each cluster. In this step, each object is assigned to its nearest centroid, based on the squared Euclidean distance. As shown in Eq. (2), where c_i ranges over the collection of centroids in the set C, each object x is assigned to the cluster given by

\arg\min_{c_i \in C} \operatorname{dist}(c_i, x)^2    (2)

where dist(·) is the standard Euclidean distance.


The centroids are then subsequently recomputed. This is done by taking the mean of all objects assigned to each centroid's cluster. In Eq. (3), the set of data points assigned to the i-th cluster centroid is S_i:

c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j    (3)

To apply this algorithm as one-class classification, the cluster building process should only use normal data examples. In the classification process, the algorithm calculates the distance of each test data point to the closest cluster. A threshold is then defined and, if the calculated distance for an object is higher than the threshold value, the sample is classified as an anomaly. Silhouette analysis, which measures how close each point in one cluster is to the points in the neighboring clusters, was used; this measure gives us information about the best parameter (number of clusters) to apply. In both the NSL-KDD and ISCX datasets the ideal number of clusters was set to 4.
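A sketch of the one-class K-Means detector with the 4 clusters selected above; since the paper does not state how the distance threshold was chosen, the quantile-based rule below is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

class OneClassKMeans:
    """Fit K-Means on normal data; flag points far from every centroid as anomalies."""
    def __init__(self, n_clusters=4, quantile=0.99):
        self.km = KMeans(n_clusters=n_clusters, random_state=0)
        self.quantile = quantile                # threshold rule is our assumption

    def fit(self, X_normal):
        self.km.fit(X_normal)
        d = self.km.transform(X_normal).min(axis=1)   # distance to the closest centroid
        self.threshold_ = np.quantile(d, self.quantile)
        return self

    def predict(self, X):
        d = self.km.transform(X).min(axis=1)
        return (d > self.threshold_).astype(int)      # 1 = anomaly, 0 = normal
```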
3.3.4 Isolation Forest

Isolation Forest (Liu et al. 2012) is a method for outlier detection that uses data structures called trees, such as binary trees. Each tree is created by partitioning the instances recursively, by randomly selecting an attribute and a split value between the maximum and minimum values of the selected attribute (Liu et al. 2012). Let T be either an external node of a tree with no children, or an internal node designated by a test with exactly two daughter nodes (T_l, T_r). A test consists of an attribute q and a split value p, where the condition q < p determines whether a data point is sent to T_l or T_r.

To build an isolation tree, the data X = {x_1, …, x_n} are recursively divided by randomly selecting an attribute q and a split value p, until one of three conditions is reached (Liu et al. 2012):

• The tree reaches a height limit;
• |X| = 1;
• All data in X have the same values.

To detect anomalies, the observations are sorted according to their path lengths or anomaly scores. The path length h(x) represents the number of edges x traverses in an isolation tree from the root node until the traversal is terminated at an external node. To calculate the anomaly score, Liu et al. (2012) used the analysis from Binary Search Trees (BST) to estimate the average path length of an isolation tree, represented in Eq. (4):

c(n) = 2H(n - 1) - \frac{2(n - 1)}{n}    (4)

The number of observations in the dataset is represented by n, and H(i) is the harmonic number, which can be estimated by ln(i) plus Euler's constant. The parameter c(n) is used to normalize h(x), since it represents the average path length for a given n. Eq. (5) represents the anomaly score s of an observation x:

s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}    (5)

E(h(x)) is the average of h(x) over a collection of isolation trees. Using the anomaly score, Liu et al. (2012) verified that observations with an s value much smaller than 0.5 are quite safe to be regarded as normal observations, while observations with an s value very close to 1 are definitely anomalies, and observations that return s ≈ 0.5 do not really indicate any distinct anomaly.

In our tests, we used the default algorithm parameter of 100 trees in both datasets, since experimentally the variation of this parameter did not show any substantial impact on the performance.
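The same configuration (100 trees) can be reproduced with scikit-learn's IsolationForest; the library choice and variable names are assumptions, since the paper does not name its implementation.

```python
from sklearn.ensemble import IsolationForest

# 100 trees, as in the experiments above.
iso = IsolationForest(n_estimators=100, random_state=0)
iso.fit(X_train_normal)               # X_train_normal: normal-only training matrix (assumed name)

scores = -iso.score_samples(X_test)   # higher value = more anomalous
pred = (iso.predict(X_test) == -1)    # True where the sample is flagged as an anomaly
```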
3.3.5 One‑class scaled convex hull

The Scaled Convex Hull (SCH) is an algorithm based on a method previously proposed by Casale et al. (2011) that uses the geometrical structure of the Convex Hull (CH) to define the class in one-class classification problems. This algorithm uses random projections and an ensemble of CH models in two dimensions, and thus the method can be suitable for larger dimensions within an acceptable execution time (Fernández-Francos et al. 2017), as we can see in Eq. (6):

CH(S) = \left\{ \sum_{i=1}^{|S|} \theta_i x_i \;\middle|\; (\forall i: \theta_i \geq 0) \wedge \sum_{i=1}^{|S|} \theta_i = 1,\; x_i \in S \right\}    (6)

The CH of a finite set of points S ⊂ R^d provides a tight approximation among several convex forms, this approximation being prone to over-fitting. Fernández-Francos et al. (2017) used reduced/enlarged versions of the original CH to avoid the over-fitting problem, since in the training phase an outlier can lead to shapes that do not represent the target class accurately. To resolve this problem, they applied the formula presented by Liu et al. (2009) to calculate the expanded polytope, where the vertices are defined with respect to the center point c = \frac{1}{|S|} \sum_{i} x_i, \forall x_i \in S, and the expansion parameter \lambda \in [0, +\infty), as in Eq. (7):

\nu_\lambda : \{\lambda \nu + (1 - \lambda)c \mid \nu \in CH(S)\}    (7)

The parameter λ represents a constant extension (λ > 1) or constant contraction (0 < λ < 1) of the CH with regard to c.


An approximation of the decision made by the expanded CH in the original d-dimensional space, by means of an ensemble of τ randomly projected decisions on 2-D spaces, was proposed. In this way, the authors defined a decision rule stating that a point does not belong to the modeled class if and only if there exists at least one projection in which the point lies outside the projected CH.

To have a better understanding of this method, Fig. 3 graphically represents an example where a 3-D convex figure is approximated by three random projections in 2-D. We can observe in Fig. 3 that a point can be inside one or more projections while in fact it lies outside the original geometric form.

Fig. 3  Ensemble of projected decisions on 2-D, based on Fernández-Francos et al. (2017)

In the SCH algorithm, Fernández-Francos et al. (2017) proposed three different definitions of center:

1. The average of all points in the projected space;
2. The average of the CH vertices in the projected space;
3. The average position of all the points in the projected polytope.

Each type of center leads to different decision regions (whether a point belongs or not to the target class), giving more flexibility to this method.

In our experiment we found that the best parameters for this algorithm were the following (a sketch of the resulting detector is given after the list):

• A value of λ = 1.22 in the NSL-KDD and λ = 1.11 in the ISCX dataset;
• Around 2000 projections;
• A center type that uses the average of the CH vertices in the projected space.
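As a rough illustration of the method just described (and of the parameter choices listed above), the sketch below builds an ensemble of λ-scaled 2-D convex hulls over random projections and flags a point as an outlier if it falls outside at least one scaled hull. It is a simplified reading of Fernández-Francos et al. (2017), not their implementation; the projection scheme, class interface and default values are assumptions.

```python
import numpy as np
from scipy.spatial import ConvexHull
from matplotlib.path import Path

class ScaledConvexHull:
    """Ensemble of lambda-scaled 2-D convex hulls over random projections (illustrative sketch)."""
    def __init__(self, n_projections=2000, lam=1.22, random_state=0):
        self.n_projections, self.lam = n_projections, lam
        self.rng = np.random.default_rng(random_state)

    def fit(self, X_normal):
        d = X_normal.shape[1]
        self.proj_, self.paths_ = [], []
        for _ in range(self.n_projections):
            P = self.rng.normal(size=(d, 2))              # random projection to 2-D
            Z = X_normal @ P
            hull = ConvexHull(Z)
            V = Z[hull.vertices]                          # hull vertices in the projected space
            c = V.mean(axis=0)                            # center = average of the CH vertices
            V_scaled = self.lam * V + (1 - self.lam) * c  # expand/contract the hull around c
            self.proj_.append(P)
            self.paths_.append(Path(V_scaled))
        return self

    def predict(self, X):
        outlier = np.zeros(len(X), dtype=bool)
        for P, path in zip(self.proj_, self.paths_):
            inside = path.contains_points(X @ P)
            outlier |= ~inside            # outside at least one projection => outlier
        return outlier.astype(int)        # 1 = outlier, 0 = normal
```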
3.3.6 One‑class support vector machines

Support Vector Machines (SVM) have the capability to solve classification and regression problems. This algorithm focuses on the search for a hyperplane (the generalization of a plane to different dimensions; in a two-dimensional space, for example, it is a line that separates and classifies data) that best divides a dataset into two classes.

SVMs are effective in classifying data that are linearly separable or have an approximately linear distribution. However, there are many cases where it is not possible to properly divide the training data using a hyperplane. To solve this problem, SVMs can create a non-linear decision boundary by projecting the data through a non-linear function φ into a space with a higher dimension. This means that SVMs can project the data from their original space I to a new space of greater dimension, denominated the feature space F (Hearst et al. 1998).

To calculate the scalar products between objects mapped into the new space, functions called kernels, K(x_i, x_j) = φ(x_i)^T φ(x_j), are used. The usefulness of kernels lies therefore in the simplicity of their calculation and in their capacity to represent abstract spaces. The most used kernels are the polynomial, the radial basis function and the sigmoidal kernels. Each of them has parameters that must be determined by the user (Gama et al. 2015).

The hyperplane equation is represented by w^T x + b = 0, with w ∈ F and b ∈ R. The constructed hyperplane, as mentioned, determines the margin between the classes. The use of slack variables ξ_i allows some data points to lie within the margin, where the constant C > 0 determines the trade-off between maximizing the margin and the number of training data points within the margin (Schölkopf et al. 2000). In this way the slack variables prevent the SVM from over-fitting with noisy data. The objective function of the SVM classifier is represented in Eq. (8):

\min_{w, b, \xi_i} \; \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i    (8)

subject to:
y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i \quad \text{for all } i = 1, \ldots, n
\xi_i \geq 0 \quad \text{for all } i = 1, \ldots, n

Therefore, for anomaly detection problems the One-Class Support Vector Machine (OCSVM or ν-SVM) will only train with data from one class, in this case the class that represents normal activity in the network. Basically, it separates all the data points from the origin and maximizes the distance from the hyperplane to the origin. This results in a binary function that captures the regions of the input space where the probability density of the data lives. The minimization function is given by Eq. (9) (Schölkopf et al. 2000):

\min_{w, \xi_i, \rho} \; \frac{\|w\|^2}{2} + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho    (9)


subject to:
(w \cdot \phi(x_i)) \geq \rho - \xi_i \quad \text{for all } i = 1, \ldots, n
\xi_i \geq 0 \quad \text{for all } i = 1, \ldots, n

As we can see in Eq. (9), the parameter ν characterizes the solution, instead of the C parameter that decided the smoothness in Eq. (8). The parameter ν sets an upper bound on the fraction of outliers and a lower bound on the number of training examples used as support vectors. Due to the importance of this parameter, this approach is also known as ν-SVM.

The tests performed allowed us to obtain the following best parameters in anomaly detection for the NSL-KDD and ISCX datasets (a sketch of the corresponding setup is given after the list):

• The radial basis kernel function was used;
• γ = 0.3 in the NSL-KDD and γ = 4.2 in the ISCX (parameter used for the radial basis kernel);
• ν = 0.01 in the NSL-KDD and ν = 0.005 in the ISCX.
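These settings map directly onto scikit-learn's OneClassSVM (an assumed implementation; variable names are placeholders):

```python
from sklearn.svm import OneClassSVM

# RBF kernel with the parameters reported above for NSL-KDD;
# for ISCX the corresponding values are gamma=4.2 and nu=0.005.
ocsvm = OneClassSVM(kernel="rbf", gamma=0.3, nu=0.01)
ocsvm.fit(X_train_normal)             # train on normal traffic only (assumed variable name)

pred = (ocsvm.predict(X_test) == -1)  # True where the sample is flagged as an anomaly
```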
4 Performance evaluation

All combinations of the pre-processing techniques with the unsupervised learning algorithms were tested, and we present the results of the best techniques applied to each algorithm for the NSL-KDD and ISCX datasets in Table 2. To evaluate the performance of the classifiers we used several metrics, as described below.

Table 2  Comparative results using mean AUC (× 100) for each algorithm using the best combination of pre-processing techniques, in the NSL-KDD and ISCX datasets

Best pre-processing techniques    One-class algorithms        NSL-KDD (AUC)    ISCX (AUC)
No pre-processing                 Isolation forest            81.71            90.70
Z-score                           K-means                     84.76            77.06
Z-score                           1-Nearest neighbor          84.85            95.20
Equal frequency + minmax          Autoencoder                 83.65            80.44
Equal frequency + minmax          Scaled convex hull          85.30            85.95
Equal frequency + minmax          Support vector machines     83.14            91.63

4.1 Area under the curve

A well-known way of comparing classifier performance is using the area under the curve (AUC) metric. The AUC calculates the area under a receiver operating characteristic (ROC) curve, which is a graph showing the classification performance at various threshold settings, drawn from two parameters: the True Positive Rate, TPR = TP / (TP + FN), where TP are the true positives and FN the false negatives, and the False Positive Rate, FPR = FP / (FP + TN), where FP are the false positives and TN the true negatives. Each point on the ROC represents a TPR/FPR pair corresponding to a particular decision threshold.

Figure 4 represents an example of a ROC curve, where we can see the trade-off between TPR and FPR. Classification algorithms whose curve is close to the top-left corner indicate a better performance (as seen in the blue line). The closer the curve is to the diagonal line (marked in orange), where TPR = FPR, the less accurate the classifier is.

Fig. 4  ROC curve example

The AUC is a metric that can be useful to summarize the performance of each classifier, providing an aggregate measure of performance across all possible classification thresholds. The AUC can be seen as the probability of the model distinguishing between the positive class (anomaly) and the negative class (normal activity).

4.2 Recall, precision, F1 score

In classification problems, metrics such as recall, precision and F1 score are mostly used to understand the number of mispredicted observations in a specific class. Each of these metrics gives us information about the different types of errors generated by the classifier. With Recall, also known as TPR or sensitivity, presented in the AUC section, we can learn the percentage of instances from the anomaly class that are actually predicted correctly. This means that the higher the Recall, the smaller the number of false negatives (type 2 errors).
13
4486 J. Meira et al.

On the other hand, Precision is very similar to Recall, but instead of accounting for the false negatives it accounts for the false positives, Precision = TP / (TP + FP). This represents the portion of elements predicted as the anomaly class that were correctly predicted. So, as with Recall, we can say that the higher the Precision, the smaller the number of false positives (type 1 errors). When using these two metrics, there is often a trade-off between them, so it is important to evaluate them together using another metric, called the F1 score, shown in Eq. (10):

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}    (10)

As shown in Eq. (10), the F1 score represents the harmonic mean, which combines the Recall and Precision metrics with equal weight.
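Given ground-truth labels, continuous anomaly scores and thresholded predictions (variable names assumed), these metrics can be computed with scikit-learn as follows:

```python
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

# y_true: 1 = attack, 0 = normal; scores: continuous anomaly scores; y_pred: thresholded labels
auc = roc_auc_score(y_true, scores)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"AUC={auc:.3f}  precision={precision:.3f}  recall={recall:.3f}  F1={f1:.3f}")
```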
As we can see in Table 2, the One-class K-means and 1-Nearest Neighbor algorithms had their best performance applying the Z-score technique. The Isolation Forest algorithm had its best results without any kind of data transformation, as it uses binary trees in the process of recursive data partitioning. The Autoencoder, SCH and ν-SVM had their best performance in detecting anomalies applying the MinMax and Equal Frequency (EF) techniques in the pre-processing phase.

Looking at the NSL-KDD results, the SCH classifier had the best performance, with an AUC value around 85. The other algorithms obtained very close results, ranging between 81 and 84 AUC, where 1-Nearest Neighbor was the second-best classifier with an AUC close to 85. Regarding the ISCX dataset, analyzing the table, we can observe that the 1-Nearest Neighbor algorithm obtained the highest AUC result, followed by ν-SVM. In this dataset the AUC results were higher compared to the NSL-KDD. One of the reasons for this is the fact that the NSL-KDD has 38 different types of attacks, compared to the ISCX with only 4 different types.

To verify whether there is a significant difference between the classifiers' performance in both datasets, we applied the Nemenyi post hoc statistical test and present a critical difference diagram (Demšar 2006), as shown in Fig. 5. As we can see, all the algorithms are connected to each other (thickest horizontal line underneath the critical difference scale), meaning that they are not significantly different (at level α = 0.10).

Fig. 5  Critical difference diagram, Nemenyi post hoc test

The non-significant difference between the algorithms can be explained by the fact that we only used two datasets to test the classifiers, due to the lack of good datasets in the cybersecurity field. Even though the classifiers are not significantly different from each other, we can see that on average Nearest Neighbor, SCH and ν-SVM have a high score compared to the other three algorithms.
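For reference, a critical-difference analysis like the one in Fig. 5 can be sketched from the Table 2 values with a Friedman test followed by a Nemenyi post hoc test. The use of scipy and of the third-party scikit-posthocs package here is an assumption (the paper only cites Demšar 2006), and, as noted above, with only two datasets the test has very little power.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp   # third-party package; its use here is an assumption

# Rows = datasets (NSL-KDD, ISCX), columns = algorithms, values = AUC x 100 from Table 2.
auc = np.array([
    [81.71, 84.76, 84.85, 83.65, 85.30, 83.14],   # NSL-KDD
    [90.70, 77.06, 95.20, 80.44, 85.95, 91.63],   # ISCX
])

stat, p = friedmanchisquare(*auc.T)          # Friedman test across the six algorithms
nemenyi = sp.posthoc_nemenyi_friedman(auc)   # pairwise Nemenyi post hoc p-values
print(p, nemenyi, sep="\n")
```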
Since the test set has unbalanced classes, we also plotted the performance of the algorithms using other metrics that can measure the errors in more detail. These metrics are Recall, Precision and F1 score.

Starting with the NSL-KDD dataset and observing Fig. 6, looking at the F1 score metric, as it represents the harmonic mean combining the two other metrics, we can see that all algorithms showed similar results: the Isolation Forest and K-Means with 53% and 55% respectively, and the others ranging between 60 and 66%, with SCH being the algorithm with the highest F1 score. We can also look at the precision and recall metrics to have a better perception of the costs of false positives and false negatives. Few false negatives translate into a higher value of recall, and vice versa, and the same can be said regarding precision with respect to the false positives. Observing the graphic in Fig. 6, all algorithms except SCH and ν-SVM had a recall value much higher than precision, so the false positives were much more numerous than the false negatives in these cases. In cybersecurity it is important to have a low false negative rate, since it represents the worst-case scenario, where data is predicted as normal activity while in fact it represents malicious or abnormal activity. Regarding SCH and ν-SVM, they both had the highest F1 scores compared to the other anomaly detection techniques, but at the same time they had more misclassified observations representing false negatives than misclassified observations representing false positives.

Analyzing Fig. 7, concerning the ISCX dataset, we observe that Nearest Neighbor, SCH and ν-SVM have much better performance results than those obtained for the NSL-KDD. On the other hand, the Isolation Forest and K-means algorithms remained with approximately the same results as in NSL-KDD. Another fact that can be observed is that the SCH algorithm generates fewer false negatives and more false positives when trying to detect the four different types of attacks contained in the ISCX dataset.


Fig. 6  Anomaly detection results in NSL-KDD

Fig. 7  Anomaly detection results in ISCX

5 Conclusion and future work

Threats to information systems have become increasingly intelligent and they can deceive basic security solutions such as firewalls and antivirus software. Anomaly-based IDSs allow the classification of monitored network traffic or computer system calls into normal activity or malicious activity. The efficiency of intrusion detection depends on the techniques used in these systems. As mentioned, the work carried out in this paper was motivated by the SASSI project. The goal was to verify if any of the unsupervised techniques presented in this paper could be implemented in an IDS to support system administrators in the decision-making process of the anomaly and novelty detection task.


We can conclude that all algorithms could detect most of the anomalies, and the results also showed that they managed to separate the data adequately between classes even though the classes were unbalanced (to represent a more realistic environment). To choose the best method, we need to focus not only on the overall performance but also on the type of errors generated. Analyzing the performance metrics, we conclude that 1-Nearest Neighbor, SCH and ν-SVM presented the highest results in both datasets, but SCH and ν-SVM generated more false negative than false positive errors in the NSL-KDD dataset. Since this type of error is an undesirable scenario in cybersecurity, we suggest the implementation of the 1-Nearest Neighbor, since it is capable of detecting most of the anomalies and, moreover, it was also one of the fastest unsupervised techniques in the anomaly detection computing process.

Although unsupervised learning methods are good at generalizing, detecting unknown patterns and handling the unlabeled data problem, they also have some constraints. These methods cannot be too specific about the definition of the data, leading to less accuracy (generating a high number of false positives for this specific problem) compared to the supervised techniques presented in the literature (Niyaz et al. 2015; Dhanabal and Shantharajah 2015; Tsai et al. 2009). Therefore, as future work, other architectures will be studied with the aim of optimizing the performance of the predictive model, such as the development of a hybrid model containing unsupervised and supervised techniques to reduce the false positive rate and classify the attacks by type. The predictive model will then be tested on a real dataset developed by the SASSI sensors.4 After creating a consistent predictive model, the aim will be the study of action rules to aid the system administrator in the prevention of the detected attacks.

4 Monitoring system that collects data from network communications in real time through network sensors.

Acknowledgements This work was supported by SASSI Project (ANI|P2020 17775) and has received funding from FEDER Funds through the P2020 program and from National Funds through FCT-Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) under the project UID/EEA/00760/2019. This work has also received financial support from MINECO (Grant TIN2015-65069), the Xunta de Galicia (Grants ED431C 2018/34, and Centro Singular de Investigación de Galicia, accreditation 2016–2019, Ref. ED431G/01) and the European Union (European Regional Development Fund—ERDF).

References

Aleroud A, Karabatis G (2013) Toward zero-day attack identification using linear data transformation techniques. In: Proceedings of the 7th international conference on software security and reliability, SERE 2013, pp 159–168. https://doi.org/10.1109/SERE.2013.16
Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Oxford. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1104&rep=rep1&type=pdf
Casale P, Pujol O, Radeva P (2011) Approximate convex hulls family for one-class classification. In: International workshop on multiple classifier systems, pp 106–115. https://doi.org/10.1007/978-3-642-21557-5_13
Casas P, Mazel J, Owezarski P (2012) Unsupervised network intrusion detection systems: detecting the unknown without knowledge. Comput Commun 35(7):772–783. https://doi.org/10.1016/j.comcom.2012.01.016
Castillo E, Peteiro-Barral D, Berdiñas BG, Fontenla-Romero O (2015) Distributed one-class support vector machine. Int J Neural Syst 25(07):1550029. https://doi.org/10.1142/S012906571550029X
Chen J, Sathe S, Aggarwal C, Turaga D (2017) Outlier detection with autoencoder ensembles. In: Proceedings of the 2017 SIAM international conference on data mining, pp 90–98. https://doi.org/10.1137/1.9781611974973.11
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Deng L, Yu D et al (2014) Deep learning: methods and applications. Found Trends Signal Process 7(3–4):197–387. https://doi.org/10.1561/2000000039
Dhanabal L, Shantharajah SP (2015) A study on NSL-KDD dataset for intrusion detection system based on classification algorithms. Int J Adv Res Comput Commun Eng. https://doi.org/10.17148/IJARCCE.2015.4696
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96:226–231
Fernández-Francos D, Fontenla-Romero Ó, Alonso-Betanzos A (2017) One-class convex hull-based algorithm for classification in distributed environments. IEEE Trans Syst Man Cybern Syst. https://doi.org/10.1109/TSMC.2017.2771341
Fred ALN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6):835–850
Gama J, de Leon Carvalho AP, Faceli K, Lorena AC, Oliveira M (2015) Extração de Conhecimento de Dados. http://www.silabo.pt/Conteudos/8117_PDF.pdf
Gardner AB, Krieger AM, Vachtsevanos G, Litt B (2006) One-class novelty detection for seizure analysis from intracranial EEG. J Mach Learn Res 7:1025–1044
Giacinto G, Perdisci R, Del Rio M, Roli F (2008) Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Inf Fusion 9(1):69–82. https://doi.org/10.1016/j.inffus.2006.10.002
Goldstein M, Uchida S (2016) A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS One. https://doi.org/10.7910/DVN/OPQMVF
Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intell Syst Appl 13(4):18–28. https://doi.org/10.1109/5254.708428
Japkowicz N (1999) Concept-learning in the absence of counter-examples: an autoassociation-based approach to classification. Rutgers University, New Brunswick. https://pdfs.semanticscholar.org/03ed/0a73d1f7a7b16505d6cb9c8bfbeeef7b19bb.pdf
Khan SS, Madden MG (2014) One-class classification: taxonomy of study and review of techniques. Knowl Eng Rev 29(3):345–374


Liu H, Hussain F, Tan CL, Dash M (2002) Discretization: an enabling technique, pp 393–423. https://pdfs.semanticscholar.org/2d18/73800b294a104a836168ac5bba11edeadc7f.pdf
Liu Z, Liu JG, Pan C, Wang G (2009) A novel geometric approach to binary classification based on scaled convex hulls. IEEE Trans Neural Networks 20(7):1215–1220. https://doi.org/10.1109/TNN.2009.2022399
Liu FT, Ting KM, Zhou Z-H (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data 6(1):3:1–3:39. https://doi.org/10.1145/2133360.2133363
Manevitz LM, Yousef M (2001) One-class SVMs for document classification. J Mach Learn Res 2:139–154
Mazhelis O (2016) One-class classifiers: a review and analysis of suitability in the context of mobile-masquerader detection. https://hal.inria.fr/hal-01262354/document
Niyaz Q, Sun W, Javaid AY, Alam M (2015) A deep learning approach for network intrusion detection system. In: Proceedings of the 9th EAI international conference on bio-inspired information and communications technologies. https://doi.org/10.4108/eai.3-12-2015.2262516
Noto K, Brodley C, Slonim D (2012) FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Min Knowl Discov 25(1):109–133. https://doi.org/10.1007/s10618-011-0234-x
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1):90–105. https://doi.org/10.1145/1007730.1007731
Schölkopf B, Williamson R, Smola A, Shawe-Taylor J, Platt J (2000) Support vector method for novelty detection. Adv Neural Inf Process Syst 12:582–588
Shin HJ, Eom D-H, Kim S-S (2005) One-class support vector machines—an application in machine fault detection and classification. Comput Ind Eng 48(2):395–408. https://doi.org/10.1016/j.cie.2005.01.009
Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA (2012) Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur 31(3):357–374. https://doi.org/10.1016/j.cose.2011.12.012
Stadler T (2011) R topics documented. Package 'TreePar', 2. https://doi.org/10.2307/2533043
Tavallaee M, Bagheri E, Lu W, Ghorbani AA (2009) A detailed analysis of the KDD CUP 99 data set. In: IEEE symposium on computational intelligence for security and defense applications, CISDA 2009, pp 1–6. https://doi.org/10.1109/CISDA.2009.5356528
Tax DMJ (2001) One-class classification: concept learning in the absence of counter-examples. http://homepage.tudelft.nl/n9d04/thesis.pdf
Tieleman T, Hinton G (2012) Lecture 6.5-RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn 4(2):26–31
Tsai CF, Hsu YF, Lin CY, Lin WY (2009) Intrusion detection by machine learning: a review. Expert Syst Appl 36(10):11994. https://doi.org/10.1016/j.eswa.2009.05.029

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
