Applying Machine Learning To Cyber Security
University of Bologna
SCHOOL OF SCIENCE
Second cycle degree in Computer Science
Co-supervisors:
Prof.
Valentina Presutti
Mehwish Alam
Session II
Academic Year 2017-2018
"Be able to fly, learning to fall"
Salmo
Introduzione (Italian Introduction)
In parallel with the exponential growth of the Web and of the services associated with it, cyber attacks are also growing. For this reason the use of security measures has never been so important. Nowadays, Intrusion Detection (IDS) and Vulnerability Detection techniques are two of the most widely used security measures. IDSs have the purpose of analyzing a system or a network, looking for security threats such as suspicious activity and unauthorized access, while Vulnerability Detection consists in the analysis of a system, looking for vulnerabilities, weaknesses that can be exploited by a malicious user. Many techniques have been proposed to reach these goals, and among them Machine Learning (ML) and Data Mining (DM) are now emerging. More details on IDSs (1.1) and on Vulnerability Detection (1.2) are given in the following chapter.
This work has two main goals:
Introduction
In parallel with the exponential growth of the web and of web-based services, cyber attacks are also growing. For this reason the use of security measures has never been so important. Intrusion Detection Systems (IDS) and vulnerability detection techniques are two of the most widely used security measures nowadays. IDSs have the purpose of analyzing a system or a network looking for security threats, such as suspicious activity and unauthorized access, while vulnerability detection consists in the analysis of a system looking for vulnerabilities, weaknesses that can be exploited by an attacker. Many techniques have been proposed to achieve these goals, and now Machine Learning (ML) and Data Mining (DM) ones are also emerging. For more details about IDSs (1.1) and vulnerability detection (1.2) refer to the next chapter.
This work has two main goals:
– Make a survey.
As can be seen, the first step does not simply consist in the making of a survey, because the assessment of the state of the art is very important.
But to really gain advantage and comprehension of what has already been done, we need to reproduce and compare the proposed methods. This leads to another important aspect of Machine Learning and, more in general, of Computer Science: the adherence to the FAIR principles [9]. These principles have been proposed in order to give a set of guidelines to make data findable, accessible, interoperable and reusable. This does not apply only to the data, but also to code, which should be open source and thus reproducible, at least in the research area, allowing a continuous growth in the development of new solutions. (For a discussion about these aspects refer to section 3.8.)
The papers examined have been chosen for their impact, considering the citation count and the rank of the conference/journal in which they have been published. Moreover, the publication date has also been considered as an important aspect.
For reasons of time and for coherence with the real-world scenario, even if the survey has been done for both IDSs and vulnerability detection, the experimental comparison has been done only for IDSs, selecting the three methods that are best documented by their authors. The comparison in chapter 5 has been done with the same machine and the same data, in order to compare the performance of the proposed methods in the same scenario. Finally, the knowledge acquired from this first study has been used to find a solution to a real-world scenario (chapter 6). The task was to build an IDS using only web server logs in a completely unsupervised way. The proposed solution consists of two parts: (i) the data preprocessing and (ii) the proposal of an unsupervised model using outlier detection.
Chapter 1
Background: Vulnerability,
Intrusion detection and
machine learning
This chapter explains the basic concepts needed to understand this whole work. First, Intrusion Detection Systems and software anomaly detection are introduced. Then an overview of Machine Learning is given.
Active and Passive IDS: Active IDSs are also known as Intrusion Detection and Prevention Systems (IDPS). They automatically block the suspected intrusions without the intervention of an operator. On the other hand, a passive IDS only monitors and analyses the traffic and alerts an operator in case of an attack.
Network-Based IDS: These kinds of systems usually monitor all the passing traffic at strategic points of the network.
Moreover, IDSs can be grouped in three other categories, based on the method used to detect attacks.
usage pattern will be customized for every system, increasing the difficulty for attackers to find activities that can be carried out undetected. Hybrid techniques are a combination of misuse-based and anomaly-based ones. They have been proposed to reduce the number of false alarms, while maintaining the capability to detect new attacks.
Brute Force Attack: The most basic kind of attack. It simply consists of a complete search over the credentials space, trying to discover the password or other information.
Denial of Service (DoS): This attack has the purpose of exhausting the system's capabilities, causing the interruption of the supplied services. DDoS refers to the distributed variant: in this kind of attack a huge number of hosts (generally controlled by some malware) is used to generate thousands of requests to a single target (typically a web server).
Rootkit: malicious software with the aim of gaining root access to a system. Sometimes it can also lead to remote access to, and control of, the attacked system.
Phishing: This term indicates the attempt to obtain the credentials of a user in order to steal their identity. The most typical phishing attack is made using an e-mail which looks legitimate and brings the user to a malicious web page.
Supervised Algorithms
In this family of approaches the algorithm learns from past knowledge. This means that the training data already contains knowledge, and the algorithm needs to learn from it to predict future events. In this case we are talking about labeled data: the data also comes with its explanation. For example, imagine a dataset containing pictures, labeled with 1 if they contain a cat or 0 if not. If the aim of the program is to say whether a picture contains a cat or not, this is an example of learning from labeled data. In this kind of algorithm the aim becomes to learn a model to identify new occurrences; this model will basically be a function that explains the training data.
Unsupervised Algorithms
Here the data does not contain additional information about its meaning, so the purpose becomes to extract some pattern from unlabeled data. Consider the previous example: if the dataset only contains pictures with or without cats, but they are not labeled, we can still try to learn some pattern from the data, which will probably identify all the pictures containing cats.
Semi-supervised Algorithms
This family of algorithms falls in-between the previous ones. Semi-supervised methods use both labeled and unlabeled data. Typically a small quantity of labeled data is used to improve the accuracy of the model. These solutions are widely used when acquiring labeled data, or learning on it, is resource-expensive, while obtaining unlabeled data is not.
1.4.1 ML applications
Classification
In the following paragraphs the different classification types and the related
evaluation approaches will be introduced.
evaluated negative. False Positive (FP) and False Negative (FN) respectively indicate that the classification was wrongly positive and wrongly negative.
The derived metrics are the following:

\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ Precision = \frac{TP}{TP + FP} \]

• Recall (or detection rate in the following chapters) is the ratio of items correctly classified as belonging to the target class over all the items that actually belong to that class.

\[ Recall = \frac{TP}{TP + FN} \]

• False Alarm Rate (FAR), or false positive rate, is the ratio of items incorrectly classified as members of the target class over the total number of items not belonging to it.

\[ FAR = \frac{FP}{TN + FP} \]
In figure 1.1 an image intuitively showing the precision and recall metrics can be found (the figure was taken from https://en.wikipedia.org/wiki/Precision_and_recall).
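As a quick illustration of how these four measures relate to the confusion matrix, the following minimal Python sketch computes them from made-up TP, TN, FP and FN counts (the numbers are invented, not taken from any experiment in this work):

# Hypothetical confusion-matrix counts (made-up values for illustration).
TP, TN, FP, FN = 80, 90, 10, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of correct decisions
precision = TP / (TP + FP)                    # fraction of flagged items that are real positives
recall    = TP / (TP + FN)                    # detection rate
far       = FP / (TN + FP)                    # false alarm rate

print(accuracy, precision, recall, far)       # 0.85 0.888... 0.8 0.1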
• Average Accuracy

\[ Accuracy_{average} = \frac{\sum_{i=1}^{l} \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}}{l} \]

• Macro Precision

\[ Precision_{macro} = \frac{\sum_{i=1}^{l} \frac{TP_i}{TP_i + FP_i}}{l} \]

• Macro Recall

\[ Recall_{macro} = \frac{\sum_{i=1}^{l} \frac{TP_i}{TP_i + FN_i}}{l} \]

• Micro Precision

\[ Precision_{micro} = \frac{\sum_{i=1}^{l} TP_i}{\sum_{i=1}^{l} (TP_i + FP_i)} \]

• Micro Recall

\[ Recall_{micro} = \frac{\sum_{i=1}^{l} TP_i}{\sum_{i=1}^{l} (TP_i + FN_i)} \]
• Class False Positive Rate (Class FAR) indicates the ratio of exemplars incorrectly classified as belonging to a given class over all exemplars not from that given class.
The terms macro and micro here indicate two different averaging strategies. The first one averages the respective measures calculated for each class, while the second one consists of calculating the sum of the numerators and the sum of the denominators of the respective measures calculated for each class, and then dividing the first sum by the second sum.
\[ Precision_{multi\text{-}label} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Z_i|} \]

\[ Recall_{multi\text{-}label} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Y_i|} \]

where, for the i-th item of the dataset D, Y_i denotes the set of true labels and Z_i the set of predicted labels.
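The following sketch (with invented per-class counts and label sets, used only to illustrate the formulas above) shows how the macro, micro and multi-label variants could be computed in plain Python:

# Per-class counts for l = 3 classes (made-up numbers).
tp = [30, 10, 5]; fp = [5, 2, 3]; fn = [4, 6, 1]
l = len(tp)

precision_macro = sum(tp[i] / (tp[i] + fp[i]) for i in range(l)) / l
recall_macro    = sum(tp[i] / (tp[i] + fn[i]) for i in range(l)) / l
precision_micro = sum(tp) / (sum(tp) + sum(fp))
recall_micro    = sum(tp) / (sum(tp) + sum(fn))

# Multi-label case: Y[i] are the true labels of item i, Z[i] the predicted ones.
Y = [{"dos"}, {"probe", "r2l"}, {"normal"}]
Z = [{"dos"}, {"probe"}, {"normal", "u2r"}]
precision_ml = sum(len(y & z) / len(z) for y, z in zip(Y, Z)) / len(Y)
recall_ml    = sum(len(y & z) / len(y) for y, z in zip(Y, Z)) / len(Y)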
Clustering
• Density models group the data points as dense and connected regions (Density-Based Spatial Clustering of Applications with Noise, DBSCAN).
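As a hedged illustration of a density model, the sketch below runs scikit-learn's DBSCAN on a few toy two-dimensional points; eps and min_samples are arbitrary values chosen for the example, not tuned for any dataset used in this work.

import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D points: two dense regions plus an isolated point.
X = np.array([[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [5.1, 5], [9, 9]])

# eps and min_samples are illustrative values.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # points in dense regions share a cluster id, noise is labelled -1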
Regression
1.4.2 ML approaches
To solve Machine Learning tasks a lot of possible approaches are available. In this work an approach is intended as a family of algorithms grouped by similarity in the way they work.
Regression Algorithms
3 https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/ (accessed May 2018)
Instance-based Algorithms
Decision Tree Algorithms
This family of algorithms exploits a tree structure for its prediction process. The tree is used to go from some observations about an object (the branches of the tree) to conclusions about the item's target value (represented in its leaves). Decision trees can be used both in classification problems (when the target variable can assume a discrete set of values) and in regression problems (when the target variable can assume continuous values). More details about decision trees used in classification problems can be found in sec. 3.4. This family of algorithms is known to be fast and accurate, which is why it is a widely used approach to many learning problems.
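A minimal scikit-learn sketch of a decision tree classifier on synthetic data (the dataset and the max_depth value are illustrative assumptions, not part of any surveyed method) could look as follows:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for labelled connections.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0)  # depth is an arbitrary choice
tree.fit(X_tr, y_tr)
print("test accuracy:", tree.score(X_te, y_te))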
Some algorithms belonging to this family are:
Bayesian Algorithms
The methods in this family exploit Bayes' theorem to solve classification and regression problems. Briefly, Bayesian inference derives the posterior
probability as a consequence of two antecedents: a prior probability and a
likelihood function derived from a statistical model for the observed data.
The posterior probability is computed according to Bayes’ theorem:
\[ P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)} \]
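A small worked instance of the theorem, with invented probabilities chosen only to illustrate the computation in an intrusion detection flavour, is the following:

# Worked example of Bayes' theorem with invented probabilities:
# H = "the connection is an attack", E = "the IDS raised an alert".
p_h = 0.01              # prior probability of an attack
p_e_given_h = 0.95      # likelihood: alert given an attack (detection rate)
p_e_given_not_h = 0.05  # alert given normal traffic (false alarm rate)

p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)  # total probability of an alert
p_h_given_e = p_e_given_h * p_h / p_e                  # posterior P(H | E)
print(round(p_h_given_e, 3))  # ~0.161: most alerts would still be false alarms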
Clustering Algorithms
Association Rule Learning Algorithms
Association rule learning methods extract rules that best explain observed
relationships between variables in large databases. These rules can discover
important associations in large multidimensional datasets using some mea-
sures of interest. A more detailed explanation of these algorithms can be
found in sec. 3.2.
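As an illustrative sketch (not used in this work), association rules could be extracted from a toy one-hot encoded event table with the Apriori implementation of the mlxtend library; the event names and thresholds below are made up:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot encoded "transactions" (e.g. events observed in the same session).
df = pd.DataFrame({
    "failed_login": [1, 1, 1, 0, 0],
    "root_access":  [1, 1, 0, 0, 0],
    "large_upload": [0, 1, 0, 1, 0],
}).astype(bool)

frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])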
Two popular algorithms in this family are:
Artificial Neural Networks Algorithms
Artificial Neural Networks (ANNs) are structured in layers, and each neural network is composed of at least three layers: one input layer, one (or more) hidden layer and one output layer. Based on the number of hidden layers, the ANN can be classified as normal or deep; in this second case the network is made of several hidden layers. For more details about ANNs look at sec. 3.1.
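A minimal Keras sketch of such a network with a single hidden layer is shown below; the data is random and the layer sizes are arbitrary choices, shown only to illustrate the layered structure described above.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Random data standing in for 41-feature connection records with binary labels.
X = np.random.rand(1000, 41)
y = np.random.randint(0, 2, size=1000)

model = Sequential([
    Dense(32, activation="relu", input_shape=(41,)),  # single hidden layer
    Dense(1, activation="sigmoid"),                   # output layer for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=64, verbose=0)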
Some well known algorithms in this family are:
Dimensionality Reduction Algorithms
• Sammon Mapping
• Projection Pursuit
Ensemble Algorithms
Chapter 2
Data sources for intrusion detection
Similarly to what has been done in [6], in this chapter some available data sources designed for evaluating IDSs are described, including both synthetic and real data. Moreover, some simulators are introduced; these can be used to generate attacks for testing the algorithms and for analysis purposes.
• Flow exporter aggregates packets into flows and exports flow records
towards one or more flow collectors.
Figure 2.1: The relation between DARPA, KDD-99 and NSL-KDD datasets.
and the Data attacks. Another one of the most used datasets is KDD-99 [31]. This data set was used for the Third International Knowledge Discovery and Data Mining Tools Competition and consists of the feature-extracted version of the DARPA dataset. The main goal of this competition was to build a system for detecting network intrusions. KDD-99 consists of 4,900,000 single connection vectors, each of which contains 41 features and is labeled as either normal or an attack, with exactly one specific attack type among DoS, U2R, R2L and Probing. An evolution of this dataset is NSL-KDD [31]. It solves some of the problems of KDD-99, like the record redundancy in the training and in the test set. In fig. 2.1 the relation between DARPA, KDD-99 and NSL-KDD can be seen. Moreover, it is also possible to find other public datasets such as SecRepo5, a repository in which heterogeneous data sources can be found, like network logs, Snort logs and pcap files. Finally, the Common Vulnerabilities and Exposures6 (CVE) is a dictionary of known vulnerabilities, like the recently discovered Heartbleed7. This dataset is maintained by the MITRE corporation and used in numerous cybersecurity products and services from around the world, including the U.S. National Vulnerability Database.
5 http://www.secrepo.com (accessed May 2018)
6 https://cve.mitre.org/ (accessed May 2018)
7 http://heartbleed.com/ (accessed May 2018)
8 http://www.porcupine.or/ (accessed May 2018)
9 http://www.flame.ee.ethz.ch/ (accessed May 2018)
Chapter 3
Survey on ML approaches for Intrusion Detection
This chapter discusses the state-of-the-art techniques for Intrusion Detection, mainly focusing on anomaly-based approaches and hybrid solutions due to their capability to recognize unknown attacks. The division of the IDSs is done based on the Data Mining and Machine Learning methodologies adopted, which are first briefly described. For a complete discussion of IDSs, including also misuse-based techniques, see [6].
itation has been targeted and ANNs are gaining more popularity.
Lippmann et al. [39] process network-sniffed data from the DARPA dataset, containing the bytes transferred to and from the victim through telnet sessions, to count the number of occurrences of a keyword, e.g. "password", "permission denied", etc. Then these statistics are given as input to two ANNs. The first ANN computes the probability of an attack, while the second ANN, a Multi-Layer Perceptron (MLP), classifies the network activity into already known attacks, thus providing the name of an attack as output. The software used for this purpose is LNKnet [44], a pattern classification software. The authors obtained results with an 80% detection rate and a low false positive rate (1/day) while passing from simple keyword selection to keyword selection with dynamically added keywords and discriminative training.
In [8] Kim et al. apply Long Short-Term Memory (LSTM) to Recurrent Neural Networks (RNN). They train the model on the KDD-99 dataset, developing two experiments: in the first phase they try to find the optimal hyper-parameters, while in the second they evaluate the model with the previously obtained hyper-parameters. This method results in an average detection rate of 98.8%, while the false positive rate is around 10%.
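This is not the authors' code, but a hedged Keras sketch of an LSTM-based binary classifier of the kind described; the hyper-parameters are arbitrary and the data is a random stand-in for KDD records reshaped as length-one sequences.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Random stand-in data: 1000 "sequences" of length 1 with 41 features each.
X = np.random.rand(1000, 1, 41)
y = np.random.randint(0, 2, size=1000)

model = Sequential([
    LSTM(64, input_shape=(1, 41)),        # hidden size chosen arbitrarily
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=128, verbose=0)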
3.2 Association Rule Mining
IF X is A THEN Y is B
where, roughly, A and B indicate some possible values that X and Y can assume.
An example in the context of cyber security could be:
the misuse part and association rules to develop anomaly detection. The normal behavior is constructed using Frequent Episode Rules, assuming that the frequent events come mostly from normal interaction with a system. From the anomaly detection part they obtain new attack patterns, which are then converted into Snort rules for a more efficient and real-time detection. The resulting detection rate was between 92.2% and 95.7%.
3.3 Clustering
As seen in 1.4.1, clustering is an unsupervised method for finding patterns in high-dimensional unlabeled data. The basic idea is to group together data based on some similarity measure.
The idea in the context of intrusion detection systems is that ideally a clustering method should divide the data in two clusters, one with all the normal connections and the other one with all the attacks.
Almalawi et al. [4] proposed an IDS for SCADA systems. Their approach is based on the outlier concept. They cluster the SCADA system data into dense clusters using DBSCAN, and the resulting n-dimensional space presents some noise data (the outliers) which represent the critical states. The "outlier-ness" of a state is evaluated through a cut-off function. Moreover, they adopt a technique of automatic proximity-based detection rule extraction which enables the monitoring of the criticality degree of the system. Finally, Almalawi et al. proposed a method to reduce the high false positive rate, consisting in the re-labeling of the identified critical states by computing the Euclidean distance from each critical state to each normal micro-cluster centroid. If the critical state is located within any normal micro-cluster, it is re-labeled as normal and assigned to that normal micro-cluster. This complex method results in an average accuracy of 98%, but the false positive rate remains quite high, with an average of 16%.
3.4 Decision Trees
Decision trees are often used in classification problems. They are characterized by a tree structure where leaves represent classes, internal nodes represent some test, the related outcomes are represented by the outgoing branches, and paths from the root to a leaf represent classification rules. An exemplar is classified by testing its feature values against the nodes of the decision tree. A simple example of a decision tree is represented in figure 3.1.
Elhag et al. [11] proposed the usage of a Genetic Fuzzy System (GFS), a
combination of Fuzzy Association Rules and Evolutionary Algorithms clas-
sifiers. This classification scheme is based on a divide-and-conquer strategy,
in which the original multi-class problem is divided into binary subproblems,
Kim et al. [18] proposed a hybrid IDS using a decision tree for misuse detection and a one-class SVM for the anomaly part. The misuse model is used
2 The figure was taken from https://en.wikipedia.org/wiki/Support_vector_machine
to divide the normal data into subsets, and each one of them is used to train a one-class SVM classifier for anomaly detection. The idea is that each area of the decomposed normal data set does not contain known attacks and includes less variety of connection patterns than the entire normal data set. An anomaly detection model for each normal training data subset can profile more innocent and concentrated data, so that this decomposition method can improve the profiling performance of the normal traffic behaviors. The approach has been tested on NSL-KDD, distinguishing between known and unknown attacks. The ROC curves show that in the best case the detection rate reaches 90% for previously unknown attacks when the false positive rate is slightly smaller than 10%. For already known attacks the detection rate can reach 100% when the false positive rate remains below 5%.
3.8 Discussion on State of the Art Algorithms
In most of the cases the documentation also lacks some information; for this reason in chapter 5 only three of the analyzed methods have been reproduced. Most of the approaches use the KDD-99 dataset but, even if it is accessible, they use a random subset of it; this means that we cannot reproduce the experiments in exactly the same setting. So the first result of this work is to stress the fact that there is a need to improve the accessibility of the works in this field.
Moreover, as shown in table 3.1, most of the methods proposed in the state of the art are supervised. This can be a big problem in real-world scenarios, as in most cases people don't have labelled data to train their models. The last fact that emerges from this survey is that KDD became a sort of well known "battle ground" used to compare IDSs. This is nice, as it gives a common environment to test and compare the methods. But it is also a problem, considering the fact that people use only random subsets of the data, and the fact that it is quite dated.
Chapter 4
Survey on ML approaches for Vulnerability Detection
• In the first phase, called API symbols extraction, the source code is tokenized and parsed into individual functions.
4.2 Clustering
In [29] Makanju et al. proposed IPLoM (Iterative Partitioning Log Mining), a log data clustering algorithm. The proposed approach works by iteratively partitioning a set of logs. At each step the partitioning factor is different; in particular the steps are the following: partition by token count, partition by token position and partition by search for bijection. At each step of the partitioning process the resultant partitions come closer to containing only log messages produced by the same line format. When the partitioning phase ends, IPLoM produces a line format description for each cluster. This approach is not directly applied to security, but its clustering capabilities showed some nice results, with a recall of 81%, a precision of 73% and an F-measure of 76%.
in test cases. Starting from the analysis of 1039 test cases taken from the Debian Bug Tracker, the dataset has been preprocessed in two different ways, with word2vec and with bag-of-words, while three ML approaches have been tested: Logistic Regression, Multilayer Perceptron (MLP) and Random Forest. In this experiment the method which performed best in terms of accuracy was Random Forest, trained using dynamic features.
Younis et al. [10] also examine different machine learning approaches, but their purpose is quite different. They try to identify the attributes of the code containing a vulnerability that make the code more likely to be exploited. So their purpose is not only to identify vulnerabilities, but to check whether they are exploitable. The authors examined 183 vulnerabilities from the National Vulnerability Database for the Linux Kernel and the Apache HTTP Server, finding 82 to have an exploit. After that, the authors characterized the vulnerable functions with and without an exploit using eight selected software metrics: Source Lines of Code, Cyclomatic Complexity, CountPath, Nesting Degree, Information Flow, Calling Functions, Called-by Functions, and Number of Invocations. Only a combination of these metrics has been used in the experiments; in fact the authors used three different feature selection methods: correlation-based, wrapper and principal component analysis. The classification methods used are Logistic Regression, Naïve Bayes, Random Forest, and Support Vector Machine. The experimental results (table 4.2) show that the best approach is Random Forest.
Table 4.2: Results reported in [10] for classification without feature extraction. For more detailed results refer to the paper.
k-Nearest Neighbor, Naïve Bayes, Random Forest and Support Vector Machine (SVM). Among these, Random Forest and Naïve Bayes resulted the most effective. Both of them have been tested using Weka, with default settings for Naïve Bayes, while the number of random trees for Random Forest has been increased to 100. The experiments showed that between these two approaches Random Forest is the one performing better, with 82% recall and 59% precision, against 73% precision and 55% recall for Naïve Bayes.
In [13] Perl et al. apply SVMs to find potentially dangerous code. The authors conducted a large-scale evaluation of 66 GitHub projects with 170,860 commits, gathering both metadata about the commits as well as mapping CVEs to commits, to create a database of vulnerability-contributing commits. The resulting approach outperformed FlawFinder2, resulting in a precision of 56% in the best case.
2 https://www.dwheeler.com/flawfinder/ (accessed June 2018)
Chapter 5
Comparative evaluation of intrusion detection systems
methods, therefore the next step is to address this issue by re-implementing existing methods and comparing them in the same experimental setting. It has to be noticed that in some cases the implementation of the methods may not be exactly the same as in the original proposal, for many reasons, e.g. the lack of implementation details in the paper or the lack of access to the original code.
• Machine:
• Software versions:
– CUDA: 9.0.176
– CUDNN: 7.1.4
– Python: 2.7.12
– Tensorflow: 1.8.0
– Keras: 2.2.0
– Scikit Learn: 0.19.1
– Weka: 3.8.2
5.1.1 Dataset
The chosen dataset is NSL-KDD. As previously explained in chapter 2, it derives from KDD, a well-known and widely used dataset for IDSs. The dataset is composed of a training and a testing set, in which each item consists of 41 features (table 5.2) extracted from DARPA. For more details about the features refer to [31]. Of course this dataset is quite dated, but it can still be applied as an effective benchmark dataset to compare different intrusion detection methods.
NSL-KDD solves many of the problems of the KDD-99 dataset:
• It does not include redundant items, neither in the training dataset nor in the testing one.
• The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set, without the need to randomly select a small portion of it.
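A hedged pandas sketch of how the NSL-KDD files could be loaded and reduced to a binary normal/attack task is shown below; the file names (KDDTrain+.txt, KDDTest+.txt), the generic column names and the presence of a difficulty column are assumptions about the distributed files.

import pandas as pd

# Assumed local copies of the NSL-KDD files: 41 features plus a label
# (and, in the "+" files, a difficulty score) per record.
cols = [f"f{i}" for i in range(41)] + ["label", "difficulty"]
train = pd.read_csv("KDDTrain+.txt", names=cols)
test = pd.read_csv("KDDTest+.txt", names=cols)

# Collapse the specific attack types into a binary normal/attack target.
train["attack"] = (train["label"] != "normal").astype(int)
test["attack"] = (test["label"] != "normal").astype(int)
print(train["attack"].value_counts())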
5.2 Methods Reproduction
The reproductions are not always strictly faithful, because the code of the described methods is not available and, as often happens, their description is not detailed enough to understand all their aspects. Moreover, in some cases the authors perform a different kind of classification, for example multi-class instead of binary, and so on. As explained before, the reproduction has to be done in a common environment to give sense to a comparison; this includes also the task, which in this work is binary classification between normal and malign connections.
For these reasons the results shown in the following paragraphs will be different from those reported in the original papers. Moreover, all the experiments made for this work have been done using the NSL-KDD train dataset for training and the test one for testing. This can seem obvious, but in some cases the authors used a partition of the training dataset for the evaluation of their methods. This leads to higher performance, because the testing data set contains zero-day attacks, which are harder to recognize.
Misuse detection: This step is the simplest one and consists in the application of a random forest classifier. The classifier has been configured to use mtry 15 and 100 trees. Moreover, the features 6, 20 and 21 (tab. 5.2) have been excluded. The misuse classifier showed the following performance:
• Accuracy: 77.64%
• Precision: 94.51%
• FAR: 4.94%
As we can expect from a misuse method tested with zero-day attacks, the detection rate is not very high, while the false alarm rate is quite low.
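A minimal scikit-learn sketch of such a misuse classifier is shown below; mapping mtry to max_features and the placeholder data are assumptions, not the exact reproduction code.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder arrays standing in for the NSL-KDD features (features 6, 20 and 21
# already removed, leaving 38 columns) and the binary normal/attack labels.
X_train = np.random.rand(1000, 38)
y_train = np.random.randint(0, 2, size=1000)

rf = RandomForestClassifier(
    n_estimators=100,   # 100 trees, as in the reproduced configuration
    max_features=15,    # scikit-learn's closest equivalent of mtry
    n_jobs=-1,
    random_state=0,
)
rf.fit(X_train, y_train)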
\[ P(n) = \sum_{class(k)=j} prox^2(n, k) \]
where prox(n, k) is the proximity between items n and k computed from the random forest, and the sum runs over the items k belonging to the same class j as n. Denoting with N the number of cases in the dataset, the raw outlier-ness of item n has been defined as
\[ \frac{N}{P(n)} \]
For each class, the median and the absolute deviation of all raw outlier-ness values are calculated. The median is subtracted from each raw outlier-ness. The result of the subtraction is divided by the absolute deviation to get the final outlier-ness.
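A hedged numpy sketch of this outlier-ness computation, assuming a proximity matrix prox (e.g. derived from the random forest) and the class label of each item are already available, could be the following:

import numpy as np

def outlierness(prox, labels):
    # prox: (N, N) proximity matrix with prox(n, n) > 0; labels: class label of each item.
    labels = np.asarray(labels)
    n_items = len(labels)
    raw = np.empty(n_items)
    for n in range(n_items):
        same_class = labels == labels[n]
        raw[n] = n_items / np.sum(prox[n, same_class] ** 2)   # N / P(n)
    # Normalize per class with the median and the median absolute deviation.
    scores = np.empty(n_items)
    for c in np.unique(labels):
        idx = labels == c
        med = np.median(raw[idx])
        dev = np.median(np.abs(raw[idx] - med))
        scores[idx] = (raw[idx] - med) / (dev if dev > 0 else 1.0)
    return scores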
The only difference from the method described in the paper is that, instead of constructing the normal patterns for the network services, here they have been constructed for the protocol types. The reason is very simple: in this scenario the results were better using the protocol types. Moreover, regarding the threshold over which an outlier-ness score is considered to indicate an anomaly, the experiments have been repeated with many different thresholds. The best results are as follows:
• Accuracy: 84.08%
• Precision: 81.86%
• FAR: 27.11%
As expected the detection rate increased with respect to the misuse detection, but the false alarm rate is worse.
Hybrid detection: This last classifier combines the two previous ones. It first applies the misuse classifier, and then the anomaly one on all the items that are classified as normal by the misuse detection. For this step the authors set the misuse random forest to use 15 trees, mtry 34, and to ignore the features 2, 7, 8, 9, 15, 20 and 21. For the anomaly detection the number of trees is 35 and the value of mtry is 34. The authors proposed to use 1% as threshold; this means that only the 1% of the items with the highest outlier-ness will be marked as anomalies. In this experimental evaluation the used threshold has been reduced to 0.12%, which gave the best results in terms of FAR. Moreover, the method has been tested both with the original misuse detection parameters and with the newly proposed ones. The best results have been obtained with the original settings for the misuse detector. Increasing the threshold, some better results in terms of detection rate can be obtained, even if the FAR increases.
In table 5.4 it can be seen how increasing the threshold increases the detection rate (recall) but also the FAR; this happens because increasing the threshold means considering as attacks also connections which are more similar to the normal ones. In this case probably the best result is the one
Table 5.4: Performances of the Hybrid Random Forest IDS when the threshold changes.
with the threshold set to 20%, which has a nice detection rate while the false alarm rate remains not too high.
• the distance to the nearest neighbor of the item in the same cluster it belongs to.
These two distances are then summed and used as a new single feature for a k-NN classifier. The reason for using this new one-dimensional dataset is easy to understand: working on a one-dimensional dataset instead of one with 41 dimensions is less expensive.
Considering the item Di in figure 5.3, its new feature is computed as
\[ Dis(D_i) = \sum_{j=1}^{5} Dis(D_i, C_j) + Dis(D_i, N_1) \]
where C_j is the cluster center of the j-th cluster and N_k is the k-th nearest neighbor of the item D_i. As we are considering only the nearest neighbor, k is 1, but it could be increased by using a sum also in the second part of the equation, as follows:
\[ Dis(D_i) = \sum_{j=1}^{5} Dis(D_i, C_j) + \sum_{k=1}^{K} Dis(D_i, N_k) \]
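A hedged numpy/scikit-learn sketch of this combined distance feature, using a fitted K-means model and the nearest neighbour within the same cluster, could look like this (the brute-force neighbour search is only for illustration):

import numpy as np
from sklearn.cluster import KMeans

def cann_feature(X, n_clusters=2):
    # Collapse each row of X into a one-dimensional feature: sum of distances to
    # all cluster centers plus distance to the nearest same-cluster neighbour.
    X = np.asarray(X)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist_to_centers = km.transform(X).sum(axis=1)
    feature = np.empty(len(X))
    for i in range(len(X)):
        same = np.where(km.labels_ == km.labels_[i])[0]
        same = same[same != i]
        nn = np.min(np.linalg.norm(X[same] - X[i], axis=1)) if len(same) else 0.0
        feature[i] = dist_to_centers[i] + nn
    return feature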
In the paper Lin et al. use 5 clusters, as they are classifying over the attack families, while in this work the task is to discriminate between normal and malign connections, so the number of clusters is only two. Moreover, the authors used KDD as dataset, considering only a subset of its features; in particular they use two different datasets, one composed of 6 features and one composed of 19 features. Referring to table 5.2, the used features are:
• 19-dimensional dataset: 2, 4, 6, 12, 23, 25, 26, 27, 28, 29, 31, 32, 33,
34, 35, 36, 37, 38 and 39.
Testing this approach, the best results have been obtained using the 19-dimensional dataset and k=1 and 2 for the k-NN classifier over the new one-dimensional data, while working with 10-fold cross validation and on the test set respectively. As we can expect, using 10-fold cross validation on the training data the results are strongly better than when testing on the test data, because this is not an anomaly detection method, so it is not able to find zero-day attacks. This is proved by the results reported in table 5.5. These results may look very bad but, as already said, we should consider that the new feature has not been used for anomaly detection, so they are to be expected. The new one-dimensional dataset should be used to train an anomaly-detection model, instead of a simple k-NN, to understand if it could improve its performance.
Table 5.5: Results of CANN using 10-fold cross validation and the NSL-KDD test set.
1. Prepare a training data set consisting of normal data and known attack data.
3. Decompose the normal training data into subsets according to the decision tree structure; data in the same leaf belong to the same subset.
4. For each normal leaf of the decision tree, build an anomaly detection model using the one-class SVM algorithm based on the normal data subset for that leaf.
This means that if the decision tree marks an item as an attack, it is actually considered to be an attack, while if it is marked as normal it will also be subject to anomaly detection (as can be seen in figure 5.4). The one-class SVMs are used as anomaly detectors as they model a class pattern and can then be used to test whether new items belong to this class (in our case normal connections).
Figure 5.4: Diagram of the hybrid IDS based on decision trees and one-class SVM. Source: [18]
Reproducing the method, three different values of γ have been tested: 0.01, 0.1 and 1, while for ν the values 0.01 and 0.5 have been chosen.
Regarding the decision tree, the minimum number of instances per leaf has been set to 0.1% of the dataset size.
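A minimal scikit-learn sketch of the anomaly-detection part, a one-class SVM fitted on normal data only with γ and ν taken from the tested ranges, is shown below (the data is a random placeholder):

import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder data: rows standing in for normal connections only.
X_normal = np.random.rand(500, 38)
X_new = np.random.rand(100, 38)

# gamma and nu chosen from the tested ranges (0.01/0.1/1 and 0.01/0.5).
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.01)
ocsvm.fit(X_normal)

pred = ocsvm.predict(X_new)   # +1 for items considered normal, -1 for anomalies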
The results obtained applying only misuse detection (the simple decision
tree) are the following:
• Accuracy 82.12%
• Precision 86.53%
• Recall 81.24%
• FAR 16.71%
These results are not bad, but the false alarm rate is quite high for a misuse detector. This means that a possible way to improve the method is trying to understand how to decrease this high FAR.
Moreover, we can expect a FAR greater than this one for the hybrid detection, because it usually suffers from high false positives, and in this method the items marked as attacks by the decision tree are not even evaluated by the anomaly detection, so the FAR certainly won't decrease. In table 5.6 the results obtained by the hybrid detection as the parameters γ and ν change are reported. It is possible to observe that when ν is 0.5 the detection rate (recall) increases, but this is too expensive in terms of FAR.
Table 5.6: Results of the hybrid IDS based on decision tree and one-class SVM as γ and ν vary.
Setting ν to 0.01, the detection rate increases with respect to the misuse model, while the FAR increases only by a small percentage.
Chapter 6
A real world scenario and a novel unsupervised approach to intrusion detection
Until this point only favorable scenarios have been treated. In the real world, most of the time data is raw and not labelled. In this work we will take as real-world scenario the HAPS (Holistic Attack Prevention System) project. The aim of this project is to take many different logs (web server, application and so on) and apply Machine Learning techniques to detect cyber security threats. Working on raw log files is a different and harder task than starting from a dataset designed to be used for Machine Learning like the KDD one. In fact these logs are completely unlabelled and they need to be preprocessed in order to apply ML methods. Moreover, for different log types a different preprocessing needs to be applied, and the same holds for the machine learning technique. For this reason this work focuses only on web server logs.
6.1.1 Structure
The web logs analyzed in this work are produced by an Apache Web
Server and they are expressed in Combined Log Format (fig. 6.1). In the
following lines the different fields of this format are explained.
• Src ip: It indicates the IP address of the client which generated the request.
• User Name: The user id of the person who requested the web resource, as established by the HTTP authentication.
• Time stamp: Date and hour of the moment in which the request has been received by the server. It is expressed as [day/Month/year:hour:minutes:seconds].
• Time zone: It represents the time zone related to the previous field. It can begin with + or -, depending on the zone.
• Status code: The status code that the server returns to the client.
• Return size: The size (in bytes) of the object returned by the server, excluding the headers.
• Referrer: The web page the client was on when it made the request. For example, following a link from the web site of your company, the resulting request will have the URL of your company web site as referrer field. This field can be empty ( - ) if the request is generated by directly writing a URL in the URL bar, or when the field is intentionally left empty, as often happens when the request is generated by a bot.
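As a hedged illustration, a single line in Combined Log Format can be split into the fields described above with a regular expression like the following (the log line itself is made up):

import re

# One line in Apache Combined Log Format (made-up example).
line = ('203.0.113.7 - frank [10/Oct/2018:13:55:36 +0200] '
        '"GET /r1.html?att1=x HTTP/1.1" 200 2326 '
        '"http://example.com/start.html" "Mozilla/5.0"')

pattern = re.compile(
    r'(?P<src_ip>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)
fields = pattern.match(line).groupdict()
print(fields["src_ip"], fields["status"], fields["user_agent"])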
The first analysis has been carried out with GoAccess, a terminal-based log analyzer. This analysis shows that most of the connections come from bots, as can be derived from many different aspects. First, 8.83% of the static requests are directed to robots.txt, a particular file used by bots to discover which parts of the web site are forbidden to them. Second, almost 83% of the requests come from North America, which is quite strange behavior considering that the website in question is Italian. Moreover, more than 83% of the requests come from declared web crawlers and another 8% from unknown browsers (which are generally related to bots). Finally, the traffic distribution along the day is almost constant, while it becomes more varied when excluding the web crawlers (as shown in figure 6.2).
It has also to be considered that GoAccess cannot detect all the bots, because in many cases they try to hide their identity; this means that the percentages shown before can be even worse.
Another important aspect of the analyzed logs is that the remote log name and the user name fields have been obfuscated by the web site owner for privacy reasons.
Figure 6.2: Traffic distribution with and without web crawlers. The data field indicates the number of hours considered.
• Apply rules to the data to find labels and then use some classification technique.
• Resource, e.g. requests containing one or more ../ are often used by attackers to access files that are outside the website itself.
• Referrer and user agent, e.g. SSI tags <!– –> are often inserted by an attacker who tries to execute some code.
• Status code, e.g. even if it is not a strict rule, a 5xx status code can be the consequence of an attack on the web service.
As an attacker can exploit a very large number of different techniques, the number of fingerprints to search for is also very large; for this reason they are not listed here, but you can refer to [45], [46]. As explained in these works, not all the described fingerprints are enough on their own to consider the connection dangerous, but in some cases they become significant when related to other ones, so this aspect is also considered when the data is pre-filtered. Obviously this set of heuristics does not cover all the possible attacks, but it is only a simple way to pre-filter the data. Moreover, these heuristics can be updated at any time to obtain a more effective filter.
Besides this first filtering phase, the data still needs to be preprocessed in such a way that it can be managed by a ML algorithm.
Analyzing the possible attacks, it appeared that a single model would probably not be enough to catch all the possible attack types. For example, there are attacks which are more related to the structure of the single connection, like requests to root.exe, while others are related to a set of connections, usually sessions, like DoS attacks.
For this reason there is the need to evaluate more models, and so more datasets, in such a way as to analyze both the single connections and the sessions.
as features in the dataset, and the value associated to each feature will be the parameter value.
Since not all the requests use all the parameters, the unused ones will be left empty in the dataset. Another useful piece of information can be the number of parameters passed to the request.
Let's imagine having to represent a log file containing only two requests, both pointing to the same resource /r1.html. This resource is related to three different parameters: att1, att2 and att3.
In this way each dataset will have fewer features, representing only the parameters associated to the considered resource (also reducing the computation time), and it will be possible to create a different model for each resource, thus having a model that better fits the actual usage pattern of that resource.
6.2.2 Sessions
To reason at this level, the first thing that has to be done is to reconstruct the sessions. Also in this case this purpose is achieved using a heuristic. Similarly to what has been done in [37], [15], the chosen heuristic consists in grouping connections into a session if they have a common IP and User Agent. Moreover, all the connections inside a session should have a timestamp belonging to a 30-minute interval.
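A hedged pandas sketch of this session reconstruction heuristic, assuming a dataframe with src_ip, user_agent and a datetime timestamp column, could be the following:

import pandas as pd

def assign_sessions(df, window_minutes=30):
    # df must have 'src_ip', 'user_agent' and a datetime 'timestamp' column.
    df = df.sort_values("timestamp").copy()
    session_ids = pd.Series(0, index=df.index)
    next_id = 0
    for _, group in df.groupby(["src_ip", "user_agent"]):
        start = None
        for idx, ts in group["timestamp"].items():
            # Open a new session when the 30-minute window of the current one is exceeded.
            if start is None or (ts - start) > pd.Timedelta(minutes=window_minutes):
                start = ts
                next_id += 1
            session_ids.loc[idx] = next_id
    df["session_id"] = session_ids
    return df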
Once the sessions have been identified, they have to be represented. Extending what has been done in [37] and [15], the considered features are described in table 6.2.
One possible additional feature is the membership of the session IP in a DNS Black List (DNSBL). This information can be very meaningful as it can tell us if the IP belongs to one of the following categories:
• Tor;
• Proxy;
• Spammer;
• Zombie;
• Dial-up.
However, finding this information for each session can be time-expensive, even using an IP cache, because the IP has to be searched in very long lists of IPs.
Name: Description
Total hits: total number of HTTP requests
% Img: percentage of images requested
% HTML: percentage of HTML files requested
% Binary doc: percentage of binary document files
% Binary exe: percentage of binary executable files
% ASCII: percentage of ASCII files
% Zip: percentage of compressed files
% Multimedia: percentage of multimedia files
% Other files: percentage of other files requested
Bandwidth: total bytes requested
Robots.txt: true if robots.txt has been visited
Session time: time passed between the first and the last connection
Avg interval: average time passed between two requests
Is interval constant: true if the time between all couples of two following connections in the session is constant
Night requests: number of requests between 12 p.m. and 7 a.m.
Repeated requests: number of repeated requests
% Errors: percentage of requests resulting in a code >= 400
% GET: percentage of GETs
% POST: percentage of POSTs
% HEAD: percentage of HEADs
% Other Method: percentage of other methods
% Unassigned Referrer: percentage of requests with unassigned referrer
nMisbehavior: the number of requests signaled by the heuristics used to pre-filter the connections
nAloneMisbehavior: the number of requests with a fingerprint which signals an attack even alone
nOtherMisbehavior: the number of requests with a fingerprint which signals an attack only if related to other ones
geoIp: the origin of the request
dnsbl: true if the src ip is present in some DNS black list
Table 6.2: Session features. The ones with a yellow background are optional and can be added or not, depending on the kind of model that is going to be made.
requires the number of clusters as a parameter, so the best value for this parameter needs to be found. In our example it appeared to be 4, as K-means projects 4 clear clusters in the space (figure 6.3). The image shows that the connections marked as attacks by the heuristics appear quite far from the cluster centers; moreover, it shows that near the big clusters we can find some outliers (isolated points). Another important aspect is that three of the four clusters contain only bot connections, while the fourth one contains all the human connections, plus some other bot connections. This means that using clustering we can solve another of the big problems of the data under analysis, namely the fact that most of the connections come from bots. This is a problem because it can lead to a model describing only bot connections, making all the connections coming from humans result as outliers. Clustering in some sense solves the problem, as it splits the connections between groups of bot ones and human ones. Each group then represents similar connections, so similar patterns, that can be used separately for outlier detection. Moreover, using clustering to split the connections related to a single resource can be useful as it creates groups of similar items, thus improving the precision of the normal patterns generated. Starting from this point there are two alternatives to detect outliers:
The first method is easier, because it does not need any additional computation or model training. The items' distances from the cluster centers are already computed when the K-means model is fitted. These distances will be compared against a threshold, and whenever an item's distance is greater than the threshold it will be considered an outlier. The only difficulty is to understand how to fix the threshold over which an element is considered to be an outlier. If we want to keep the number of false positives under a predefined percentage, n%, we can order the distances and take as threshold the distance value that is exceeded only by n% of the distances. As an alternative we can try other values, for example derived from the average distance and the mean deviation. In any case, to improve the detection performance the threshold needs to be adjusted, but without labelled data this is not possible.
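A hedged scikit-learn sketch of this first, distance-based alternative (placeholder data; the number of clusters and the percentage n are arbitrary) could be the following:

import numpy as np
from sklearn.cluster import KMeans

# Placeholder feature matrix standing in for the per-resource connection dataset.
X = np.random.rand(2000, 12)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
# Distance of every item to the center of the cluster it belongs to.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

n = 1.0  # desired upper bound, in percent, on flagged items
threshold = np.percentile(dist, 100 - n)   # only ~n% of distances exceed it
outliers = dist > threshold
print(outliers.sum(), "connections flagged as outliers")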
The second method is more expensive in terms of computation time, as it requires training other models, in particular one One-Class SVM for each cluster. On the other hand, as seen in the state of the art, this kind of classifier can be very precise in detecting outliers, and it finds them without simply fixing a distance threshold. Unfortunately, also in this case the model requires some parameters to be calibrated. The first is ν, which in some sense regulates the number of elements that can be considered outliers in the training data, so in our case it actually fixes an upper bound on the percentage of outliers we will find in the data. The second one is γ, which is the coefficient of the SVM kernel function. Unfortunately, these parameters also need to be calibrated to improve the performance, but again, without labelled data this cannot be done.
Without the possibility of evaluating the output of the models to find better parameters, what has been done is to fix quite conservative thresholds, meaning thresholds that try to keep the number of detected outliers low, and so also the number of false alarms. This choice has been made both for the distance-based and for the One-Class SVM based outlier detection models, setting the distance threshold to 1.0 (around double the average distance to the cluster centers) and ν to 0.01, respectively.
As a first "evaluation", the percentage of connections marked as outliers that were also marked as attacks by the heuristics has been counted. Moreover, the number of outliers found among the connections marked as normal by the heuristics has been counted. These preliminary results are reported in table 6.3, which shows that most of the connections marked as attacks by the heuristics are also marked as outliers by the models. This means that both models are able to detect the attacks found by the heuristics. Moreover, we should consider that the heuristics may also have some false positives, and that we can change the models' thresholds to increase the percentages reported in the table. The problem, again, is that without labelled data we cannot know whether the connections not detected by the models are false positives of the heuristics (so the models are working fine) or whether we need to improve their detection performance.
The same results are graphically expressed in figure 6.4, where the other outliers found by the models can be seen. These may look like the majority of the connections, but this is only a graphical issue due to the overlapping of the points. To understand the cardinality of the outliers refer to the previous table.
data. The evaluation subset includes 1% of the original dataset, obtained by selecting 1% of the connections related to each one of the resources, thus reproducing the original connection distribution. Moreover, for the three most used resources, half of the connections included in the subset had been marked as outliers by the proposed models. In addition to this 1% of the dataset (7262 items), a small dataset containing only attacks has been added (136 additional connections). The resulting evaluation dataset is unbalanced, as it contains only 325 attacks: 189 coming from the 1% of the original dataset plus 136 coming from the additional connections (fig. 6.5).
Moreover, their thresholds have been changed to see how their performance changes.
This evaluation process has been repeated two times, once considering only the most requested resource (which covers more than 60% of all the connections) and once considering all the resources, without splitting by resource.
The first evaluation has been done following the idea of splitting the connections by resource, focusing on the most used one. Doing this, a problem emerged: among the connections marked by the expert there are no attacks on this resource, while the additional log file containing only attacks contains 89 attacks requesting this resource. Having these problems in mind, in figure 6.6 and table 6.4 the results of both models can be seen. The figure represents how the models behave as their parameters change, while the table shows which one performs better according to each evaluation metric. As opposed to what has been observed in the previous chapter, we can observe that the distance-based model can detect more attacks. Moreover, its false alarm rate is lower compared to the One-Class SVM based model. In both cases the precision is very low, but we have to consider that it is given by \( \frac{TP}{TP + FP} \). Considering that we are applying an anomaly detection method, the false positives will always be in some sense high, while we have a really low number of attacks, so the true positives won't ever be enough to make up for the false positives. Although the results obtained from the proposed models are not optimal, it is possible to observe that they are still better
than the heuristics, whose performance on this evaluation set is the following:
• Accuracy 97.93%
• Precision -
• Recall 0
• FAR 0
The reason why the accuracy is so high is that the heuristics mark all the new connections as non-attacks, including the 89 new ones, but with respect to the total amount of connections this does not count much. The new attacks are not detected by the heuristics because they do not contain any of the most known attack fingerprints, and so they can pass undetected. Of course these connections are, even if not obviously, different from the "normal" ones, and for this reason they can be detected by the outlier detection models if these are sensitive enough. But increasing their sensitivity also increases the number of false positives. In table 6.5 the performances of the heuristics and of the best outlier detection model are compared with random guessing, which is often used as a simple baseline.
As previously said, the dataset used for the evaluation is strongly unbalanced, causing many problems. One example is the precision, which always results very low. To solve this problem the evaluation has been repeated on a more balanced dataset, downsampling the number of normal (non-attack) connections to double the number of attacks. Table 6.6 shows the results of this second evaluation, from which it is possible to observe how the distance-based outlier detection outperforms the One-Class SVM based one. Moreover, it is possible to observe how, using a balanced dataset, the general performances increase, and in particular the precision, which was low using the unbalanced one. The same results are also reported in graphical form in figure 6.7.
Table 6.4: Outlier detection results over the most used resource. For every evaluation metric the best and the worst models have been highlighted.
Figure 6.7: Proposed model performances on the most used resource when
using a balanced Dataset.
Table 6.5: Evaluation over the most used resource using the unbalanced dataset. Comparison of the best performing outlier detection model (in terms of recall) with the heuristics and random guessing labels.
This time the proposed outlier detection models have been evaluated without considering the resources separately, but all together. The reason for this additional evaluation is that, as previously said, it may happen that some resource does not have enough connections to build a model. As the previous section showed that using an unbalanced dataset for the evaluation creates some problems, this time only the balanced one has been used. As previously done, the balanced evaluation dataset has been obtained by downsampling the non-attack connections to double the attack ones.
The results of this evaluation are described in table 6.7 and figure 6.8. In the table, the performances of the heuristics and of the random guessing labels are also reported as a baseline. As expected, the heuristics show few false positives, but also a low recall (detection rate). The reason for this behavior is that heuristics belong to the misuse intrusion detection family, which is not able to detect attacks that have not been previously described.
Moreover, we can observe how the performances are in general worse than the ones obtained when splitting by resource. Also this behavior was expected, because splitting by resource we obtain clusters of more homogeneous connections compared to the ones obtained here. This causes poorer normal pattern representations and so worse detection performance. This phenomenon highly
Table 6.6: Outlier detection results over the most used resource using a balanced dataset. For every evaluation metric the best and the worst models have been highlighted.
Table 6.7: Outlier detection results without splitting by resource and using a balanced dataset. The heuristics and random guessing performances are also reported as baselines. For every evaluation metric the best and the worst models have been highlighted.
Chapter 7
Conclusion and Future Works
This work firstly analyzed the state of the art of two important fields of cyber security: vulnerability and intrusion detection, focusing on the second one. Two important problems in this field emerged:
• Most of the proposed approaches are supervised, while in real-world scenarios we rarely come across labeled data.
• The state-of-the-art methods are not reproducible, as the datasets used for testing these approaches, along with their source code, are not freely available.
• The distance-based model outperforms the One-Class SVM based one and also the heuristics, showing how outlier detection can be used to detect new attacks, even if the number of false alarms tends to grow.
• When the resources are considered one at a time, the performances are significantly better than when they are all processed together. This demonstrates that defining techniques to group the data into similar clusters improves the outlier detection performance.
1 https://en.wikipedia.org/wiki/Elbow_method_(clustering) (accessed November 2018)
Bibliography
[1] Yisroel Mirsky, Tomer Doitshman, Yuval Elovici, and Asaf Shabtai.
“Kitsune: An Ensemble of Autoencoders for Online Network Intrusion
Detection.” In: (2018).
[2] Seyed Mohammad Ghaffarian and Hamid Reza Shahriari. “Software
Vulnerability Analysis and Discovery Using Machine-Learning and Data-
Mining Techniques: A Survey.” In: ACM Comput. Surv. 50.4 (Aug.
2017), 56:1–56:36. doi: 10.1145/3092566.
[3] Yisroel Mirsky, Tal Halpern, Rishabh Upadhyay, Sivan Toledo, and
Yuval Elovici. “Enhanced situation space mining for data streams.” en.
In: Proceedings of the Symposium on Applied Computing, SAC 2017,
Marrakech, Morocco, April 3-7, 2017. Ed. by Ahmed Seffah, Birgit
Penzenstadler, Carina Alves, and Xin Peng. ACM Press, 2017, pp. 842–
849. doi: 10.1145/3019612.3019671.
[4] Abdulmohsen Almalawi, Adil Fahad, Zahir Tari, Abdullah Alamri,
Rayed AlGhamdi, and Albert Y. Zomaya. “An Efficient Data-Driven
Clustering Technique to Detect Attacks in SCADA Systems.” In: IEEE
Transactions on Information Forensics and Security 11.5 (May 2016),
pp. 893–906. doi: 10.1109/TIFS.2015.2512522.
[5] Elisa Bertino, Ravi Sandhu, and Alexander Pretschner, eds. Proceedings
of the Sixth ACM on Conference on Data and Application Security and
Privacy, CODASPY 2016, New Orleans, LA, USA, March 9-11, 2016.
ACM, 2016.
[6] Anna L. Buczak and Erhan Guven. “A Survey of Data Mining and
Machine Learning Methods for Cyber Security Intrusion Detection.”
In: IEEE Communications Surveys Tutorials 18.2 (2016), pp. 1153–
1176. doi: 10.1109/COMST.2015.2494502.
[7] Gustavo Grieco, Guillermo Luis Grinblat, Lucas C. Uzal, Sanjay Rawat,
Josselin Feist, and Laurent Mounier. “Toward Large-Scale Vulnerabil-
ity Discovery using Machine Learning.” In: Proceedings of the Sixth
ACM on Conference on Data and Application Security and Privacy,
CODASPY 2016, New Orleans, LA, USA, March 9-11, 2016. Ed. by
Elisa Bertino, Ravi Sandhu, and Alexander Pretschner. ACM, 2016,
pp. 85–96. doi: 10.1145/2857705.2857720.
[8] Jihyun Kim, Jaehyun Kim, Huong Le Thi Thu, and Howon Kim. “Long
Short Term Memory Recurrent Neural Network Classifier for Intrusion
Detection.” In: 2016 International Conference on Platform Technology
and Service (PlatCon). Feb. 2016, pp. 1–5. doi: 10.1109/PlatCon.
2016.7456805.
[9] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle
Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten,
Luiz Bonino da Silva Santos, Philip E Bourne, et al. “The FAIR Guid-
ing Principles for scientific data management and stewardship.” In:
Scientific data 3 (2016).
[10] Awad A. Younis, Yashwant K. Malaiya, Charles Anderson, and Indrajit
Ray. “To Fear or Not to Fear That is the Question: Code Characteris-
tics of a Vulnerable Functionwith an Existing Exploit.” In: Proceedings
of the Sixth ACM on Conference on Data and Application Security and
Privacy, CODASPY 2016, New Orleans, LA, USA, March 9-11, 2016.
Ed. by Elisa Bertino, Ravi Sandhu, and Alexander Pretschner. ACM,
2016, pp. 97–104. doi: 10.1145/2857705.2857750.
[11] Salma Elhag, Alberto Fernández, Abdullah Bawakid, Saleh Alshom-
rani, and Francisco Herrera. “On the combination of genetic fuzzy
[18] Gisung Kim, Seungmin Lee, and Sehun Kim. “A novel hybrid intru-
sion detection method integrating anomaly detection with misuse de-
tection.” In: Expert Systems with Applications 41.4 (2014), pp. 1690–
1700. doi: 10.1016/j.eswa.2013.08.066.
[19] Riccardo Scandariato, James Walden, Aram Hovsepyan, and Wouter
Joosen. “Predicting vulnerable software components via text mining.”
In: IEEE Transactions on Software Engineering 40.10 (2014), pp. 993–
1006. doi: 10.1109/TSE.2014.2340398.
[20] Aram Hovsepyan, Riccardo Scandariato, Wouter Joosen, and James
Walden. “Software vulnerability prediction using text analysis tech-
niques.” In: Proceedings of the 4th international workshop on Security
measurements and metrics. ACM. 2012, pp. 7–10.
[21] Jens Müller, Jörg Schwenk, and Ing Mario Heiderich. “Web Application
Forensics.” In: (2012).
[22] Hossain Shahriar and Mohammad Zulkernine. “Mitigating program
security vulnerabilities: Approaches and challenges.” In: ACM Com-
puting Surveys (CSUR) 44.3 (2012), p. 11. doi: 10.1145/2187671.
2187673.
[23] Leyla Bilge, Engin Kirda, Christopher Kruegel, and Marco Balduzzi.
“EXPOSURE: Finding Malicious Domains Using Passive DNS Anal-
ysis.” In: Proceedings of the Network and Distributed System Security
Symposium, NDSS 2011, San Diego, California, USA, 6th February -
9th February 2011. The Internet Society, 2011.
[24] Istehad Chowdhury and Mohammad Zulkernine. “Using complexity,
coupling, and cohesion metrics as early indicators of vulnerabilities.”
In: Journal of Systems Architecture 57.3 (2011), pp. 294–313. doi: 10.
1016/j.sysarc.2010.06.003.
[25] Cynthia Wagner, Jérôme François, Radu State, and Thomas Engel.
“Machine Learning Approach for IP-Flow Record Anomaly Detection.”
en. In: NETWORKING 2011. Ed. by David Hutchison et al. Vol. 6640.
Acknowledgments
I would like to thank all those who helped me in the realization of this
Thesis:
the supervisor Paolo Ciancarini;
the co-supervisor Valentina Presutti;
and the co-supervisor Mehwish Alam.