Ghafir Future Generation Computer Systems
Ghafir Future Generation Computer Systems
Authors Ghafir, Ibrahim; Hammoudeh, M.; Prenosil, V.; Han, L.; Hegarty,
R.; Rabie, K.; Aparicio-Navarro, F.J.
This work is made available online in accordance with publisher policies. Please refer to the
repository record for this item and our Policy Document available from the repository home
page for further information.
To see the final version of this work please visit the publisher’s website. Access to the
published online version may require a subscription.
Citation: Ghafir I, Hammoudeh M, Prenosil V (et al) Detection of advanced persistent threat using
machine-learning correlation analysis. Future Generation Computer Systems. 89: 349-359.
Copyright statement: © 2018 Elsevier B.V. All rights reserved. Reproduced in accordance with
the publisher's self-archiving policy. This manuscript version is made available under the CC-BY-
NC-ND 4.0 license.
Detection of Advanced Persistent Threat Using
Machine-Learning Correlation Analysis
Abstract
1. Introduction
The volume, complexity and variety of Cyber attacks are continually increas-
ing. This trend is currently being driven by cyber warfare and the emergence of
the Internet of Things [1–3]. The annual cost of cyber attacks was $3 trillion
5 in 2015 and it is expected to increase to more than $6 trillion per annum by
2021 [4]. This high cost has brought much interest in research and investment
towards developing new cyber attacks defence methods and techniques [5–8].
Although virus scanners, firewalls and intrusion detection and prevention sys-
tems (IDPSs) have been able to detect and prevent many of cyber attacks,
10 cyber-criminals in turn have developed more advanced methods and techniques
to intrude into the target’s network and exploit their resources, targeting both
wired and wireless communications [9, 10]. In addition, many of the defence
approaches against cyber attacks consider those attacks are targeting random
networks, so they assume that if the company’s network is well protected, the
15 attacker can surrender and move onto an easier target. Nonetheless, according
to a technical report by Trend Micro [11], this assumption is no longer valid with
the rise of targeted attacks, Advanced Persistent Threats (APTs), in which both
cyber-criminals and hackers are targeting selected organizations and persisting
until they achieve their goals.
20 The APT attack is a persistent, targeted attack on a specific organisation
and is performed through several steps [12]. The main aim of APT is espi-
onage and then data exfiltration. Therefore, APT is considered as a new and
more complex version of multi-step attack. These APTs present a challenge for
current detection methods as they use advanced techniques and make use of
25 unknown vulnerabilities. Moreover, the economic damage due to a successful
APT attack significant. The potential cost of attacks is the major motivation
2
for the investments in intrusion detection and prevention systems [13]. APTs
are currently one of the most serious threats to companies and governments [14].
Most of the research in the area of APT detection, has focused on analysing
30 already identified APTs [15–21], or detecting a particular APT that uses a
specific piece of malware [22]. Some works have attempted to detect novel
APT attacks. However, they face serious shortcomings in achieving real time
detection [23], detecting all APT attack steps [23], balance between false positive
and false negative rates [22], and correlating of events spanning over a long
35 period of time [24, 25]. The existing work is encouraging. However, the accurate
and timely detection of APT remains a challenge.
In this work, we have developed a novel machine learning-based system,
called MLAPT, which can accurately, and quickly detect and predict APT at-
tacks in a holistic way, making a significant contribution to the field of intrusion
40 detection systems (IDS). MLAPT runs through three main phases: threat de-
tection, alert correlation and attack prediction, the major contributions of this
work include:
∙ Threat detection: the aim of this first phase is to detect threats during
the multi-step APT attack. We have developed eight methods/modules to
45 detect various attacks used in one of the APT attack steps. These include
disguised exe file detection (DeFD), malicious file hash detection (MFHD),
malicious domain name detection (MDND), malicious IP address detection
(MIPD), malicious SSL certificate detection (MSSLD), domain flux detec-
tion (DFD), scan detection (SD), and Tor connection detection (TorCD).
50 The output of this phase is alerts, also known as events, triggered by
the individual modules. All the methods have been evaluated using real
network traffic.
3
this phase undergoes three main steps: alerts filter (AF), to identify re-
dundant or repeated alerts; clustering of alerts (AC), which most likely
belong to the same APT attack scenario; and correlation indexing (CI),
60 to evaluate the degree of correlation between alerts of each cluster.
∙ The proposed MLAPT system is able to process and analyse the network
traffic in real time without needing to store data, and make possible the
early prediction of APT attacks so that an appropriate and timely response
can take place before the attack completes its life cycle.
2. Related Work
The APT detection has been a challenge for the current Intrusion Detection
Systems (IDSs), and much research has been conducted to address this type
of multi-stage attack. Table 1 describes current APT detection systems and
80 mesmerises their limitations.
TerminAPTor, an APT detector, is described in [26]. This detector uses
information flow tracking to find the links between the elementary attacks, which
are triggered within the APT life cycle. TerminAPTor depends on an agent,
which can be a standard intrusion detection system, to detect those elementary
4
Table 1: Current APT detection systems: description and limitations
APT Detector Description Limitations
TerminAPTor [26] Uses information flow tracking to find High false positives
the links between the elementary alerts
C&C-based [27] Considers the access to the C&C do- Can be easily evaded when the infected
mains independent while the access to hosts connect to the c&C domains while
the legal domain is correlated users are surfing the Internet
Spear-phishing- Uses "Tokens" and utilises mathemati- The spear phishing email may not con-
based [28] cal and computational analysis to filter tain any of the Tokens - Detects only
spam emails one step of APT life cycle
Statistical APT de- The generated events in each APT step Requires significant expert knowledge
tector [29] are correlated in a statistical manner to set up and maintain
Active-learning- Detects malicious PDFs based on white Detects only one step of APT life cycle
based [30] lists and their compatibility as viable
PDF files
Data Leakage Pre- Utilises DLP algorithm to detect the Detects only one step of APT life cycle
vention [31] step of data exfiltration - Cannot achieve real time detection
SPuNge [23] Gathers the data on the hosts’ side Detects only one step of APT life cycle
- Cannot achieve real time detection
Context-based [32] Models APT as a pyramid in which the Requires significant expert knowledge
top of the pyramids represent the at- to set up and maintain
tack goal, and the lateral planes indi-
cates the environments involved in the
APT life cycle
5
detecting only one step of the APT life cycle.
An approach for APT detection based on spear phishing detection is explored
in [28]. This approach depends on mathematical and computational analysis
to filter spam emails. Tokens, which are considered as a group of words and
100 characters such as (click here, free, Viagra, replica), should be defined for the
detection algorithm to separate legitimate and spam emails. However, the spear
phishing email might not include any of the tokens which are necessary for the
algorithm process. Additionally, depending on one step for APT detection leads
the system to fail when missing that step.
105 A statistical APT detector, similar to TerminAPTor detector, is developed
in [29]. This system considers that APT undergoes five states which are delivery,
exploit, installation, C&C and actions; and several activities are taken in each
state. The generated events in each state are correlated in a statistical manner.
This system requires significant expert knowledge to set up and maintain.
110 An active-learning-based framework for malicious PDFs detection is sug-
gested in [30]. These malicious PDFs can be used in the early steps of APT
to get the point of entry. The system collects all PDFs transferred over the
network, then all known benign and malicious files are filtered by the "known
files module" which depends on white lists, reputation systems and antivirus
115 signature repository. Following this, the remaining files "unknown files" are
checked for their compatibility as viable PDF files. This approach detects only
one step of the APT life cycle.
An approach based on Data Leakage Prevention (DLP) is proposed in [31].
This approach focuses on detecting the last step of APT which is the data
120 exfiltration. A DLP algorithm is used to process the data traffic to detect
data leaks and generate "fingerprints" according to the features of the leak.
The proposed system utilises external cyber counterintelligence (CCI) sensors
in order to track the location or path of the leaked data. This approach is limited
to detect only one step of APT which is the data exfiltration. In addition, it
125 cannot achieve the real time detection as the CCI analysis unit should wait
for the information from the sensors. Moreover, it is not guaranteed that the
6
CCI sensors can provide the required information regarding the leaked data
fingerprints. This approach also introduces privacy issues, whereby actors in
the CCI have access to the data stored and transfered by all users of the sytems.
130 A working prototype, SPuNge, is presented in [23]. The proposed approach
depends on the gathered data on the hosts’ side and aims to detect possible APT
attacks. SPuNge undergoes two main phases, in the first one, the detected
malicious URLs are analysed. Those URLs can be connected by the hosts’
computers over HTTP(S) with an Internet browser or by malware installed on
135 the infected machines. The computers which show a similar activity are then
determined. This system depends on detecting one activity of the APT attack,
which is malicious URL connection, and does not consider the other activities
of APT. Meaning, if the detection system misses the malicious URL connection,
the whole APT scenario will not be detected. Additionally, the system cannot
140 achieve real time detection.
A context-based framework for APT detection is explained in [32]. This
framework is based on modelling APT as an attack pyramid in which the top
of the pyramid represents the attack goal, and the lateral planes indicates the
environments involved in the APT life cycle. This detection framework requires
145 significant expert knowledge to set up and maintain.
Finally, the existing APT detection systems face serious shortcomings in
achieving real time detection, balance between false positive and false negative
rates and correlating of events spanning over a long period of time. To address
those weaknesses, this paper presents a new approach for APT detection and
150 prediction.
7
Detection modules should be developed to detect the most common techniques
used in the APT attack steps.
However, detecting a single stage of an APT technique itself does not mean
detecting an APT attack. Even though an individual module alert indicates a
160 technique which can possibly be used in an APT attack, this technique can be
used for other types of attacks or it can be even a benign one. For example,
domain flux, port scanning and malicious C&C communications, used in the
APT attack, can be also used for botnet attacks [33]. Moreover, Tor network
connection, used for data exfiltration in the APT attack, can also be used legally
165 to protect the confidentiality of a user traffic [34]. Thus, individually these
detection modules are ineffective and their information should be fused to build
a complete picture regarding an APT attack. For this reason, a correlation
framework should be developed to link the outputs of the detection modules
and reduce the false positive rate of the detection system.
170 Predicting the APT attack in its early steps would minimise the damage and
prevent the attacker from achieving the goal of data exfiltration. With a histori-
cal record of the correlation framework output, machine learning algorithms can
be used to train a prediction model. As the purpose of the prediction model is
to classify the early alerts of the correlation framework, classification algorithms
175 should be selected to train the model.
8
Network security team Prediction module
Model
Alerts SVM
filter
KNN
Alerts Correlation Ensemble
clustering indexing
Events
Machine learning
Detection modules dataset
Labels
Input
Normalization
Network traffic
of this phase are alerts, also known as events, triggered by individual modules.
The alerts raised by individual detection modules are then fed to the corre-
lation framework. The aim of the correlation framework is to find alerts could
be related and belong to one APT attack scenario. The process in this phase
190 undergoes three main steps: alerts filter to identify redundant or repeated alerts;
clustering of alerts which most likely belong to the same APT attack scenario;
and correlation indexing to evaluate the degree of correlation between alerts of
each cluster.
In the final phase, a machine-learning-based prediction module is used by
195 the network security team to determine the probability of the early alerts to
develop a complete APT attack. This allows the network security team to
predict the APT attack in its early steps and apply the required procedure to
stop it before completion and minimize the damage. The detection of APT is
9
different from the prediction. The detection can be when two or more steps of
200 APT are correlated. However, the prediction can be achieved after the first two
steps of APT are linked.
The detection modules have been presented in [35–42], this paper focuses on
the correlation framework and prediction module.
205 Taking into consideration the APT steps, Table 2 shows the MLAPT detec-
tion modules for each APT step. These modules are:
10
Disguised exe File Detection (DeFD): This module detects disguised exe files
over the connections. In other words, it detects if the content of the file is exe
while the extension is not exe. The network traffic is processed, all connections
210 are analysed and all exe files identified when transferring over the connections
are filtered. This filtering is based on the file content. Following this, the file
name extension should be checked to decide about raising an alert on disguised
exe file detection [35].
Malicious File Hash Detection (MFHD): This module detects any malicious
215 file downloaded by one of the network hosts. It is based on a blacklist of mali-
cious file hashes. The network traffic is processed, all connections are analysed
and MD5, SHA1 and SHA256 hashes are calculated for each new file identified
when transferring over a connection. The calculated hashes are then matched
with the blacklist [36].
220 Malicious Domain Name Detection (MDND): This module is used to detect
any connection to a malicious domain name. It is based on a blacklist of mali-
cious domain names. DNS traffic is filtered, all DNS requests are analysed and
the queries are matched with the blacklist [37].
Malicious IP Address Detection (MIPD): This module detects any connec-
225 tion between an infected host and a C&C server. The detection is based on a
blacklist of malicious IPs of C&C servers. MIPD processes the network traf-
fic to search for a match in the source and destination IP addresses for each
connection with the IP blacklist [38].
Malicious SSL Certificate Detection (MSSLD): This module aims at detect-
230 ing C&C communications based on a blacklist of malicious SSL certificates. This
blacklist consists of two forms of SSL certificates, the SHA1 fingerprints and
the serial & subject, which are associated with malware and malicious activities.
The network traffic is processed and all secure connections are filtered. The SSL
certificate of each secure connection is then matched with the SSL certificate
235 blacklist [39].
Domain Flux Detection (DFD): This module detects algorithmically gener-
ated domain flux, where the infected host queries for the existence of a large
11
number of domains, whilst the owner has to register only one. This leads to
the failure of many of DNS queries. DFD utilizes DNS query failures to detect
240 domain flux attacks. The network traffic is processed, particularly DNS traffic.
All DNS query failures are analysed and a threshold for DNS query failures
from the same IP address is imposed to detect domain flux attacks and identify
infected hosts [40].
Scan Detection (SD): The SD module detects port scanning attacks which
245 aims to identify the noteworthy servers and services for future data exploitation.
SD is based on tracking all failed connection attempts, and a threshold for those
failed attempts is imposed over a specific time interval to detect scanning attacks
and identify infected hosts [41].
Tor Connection Detection (TorCD): This module detects any connection to
250 a Tor network. It is based on a list of Tor servers which is publicly published.
The network traffic is processed and the source and destination IP addresses for
each connection are matched with Tor servers list [42].
This phase of MLAPT takes the output of each of the detection modules (the
255 generated alerts) as an input, and aims to find alerts could be correlated and
belong to a single APT attack scenario. FCI (Filter, Cluster, and Index) runs
through three main steps: (1) Alerts filter, which filters redundant or repeated
alerts; (2) Alerts clustering, which clusters alerts which potentially belong to
the same APT attack scenario; and (3) Correlation indexing, which evaluates
260 the correlations between alerts of each cluster.
In Section 3.3, eight attack detection modules are presented, each module
detects one possible technique used in one of the APT steps. The output of each
module is an alert which is generated when an attack is detected. Each alert has
seven attributes (alert_type, timestamp, src_ip, src_port, dest_ip, dest_port,
265 infected_host). Table 3 summarizes the steps of the APT attack that can be
detected by MLAPT and the alerts which can be generated for each step.
All alerts generated by the detection modules are fed to the correlation
12
Table 3: The APT attack detectable steps and alerts.
(a1) disguised_exe_alert
(A) Step 2
(a2) hash_alert
Point of entry
(a3) domain_alert
(b1) ip_alert
(B) Step 3
(b2) ssl_alert
C&C communication
(b3) domain_flux_alert
(C) Step 5
(c1) scan_alert
Asset/Data discovery
(D) Step 6
(d1) tor_alert
Data exfiltration
framework. However, those alerts are not the only ones detected by the the
modules. When an APT technique is detected, and before an alert is gener-
270 ated, the module checks whether the same alert has been generated during the
previous day, if so, the alert is ignored. This alerts suppression reduces the com-
putational cost of the FCI correlation framework. The FCI process steps will
be explained in this section. As an output of the FCI correlation framework,
two main alerts can be generated:
13
possible full APT scenarios can be expressed as:
𝐴𝑃 𝑇𝑓 𝑢𝑙𝑙 = 𝐴 ∧ 𝐵 ∧ 𝐶 ∧ 𝐷 (1)
14
stored by AC for a correlation time. For each new alert, the AC module checks
all stored alerts for the clustering possibility. The clustering algorithm in this
module is scenario-based, which utilizes three main rules:
∙ Alert step: Alerts for the same APT attack step cannot be in one cluster.
310 ∙ Alert type: Alerts of the same type cannot be in one cluster.
∙ Alert time: Cluster’s alerts should be all triggered within the correlation
time, and alerts order should be corresponded with the APT life cycle.
Meaning, if t(d), t(c), t(b) and t(a) are the times when the alerts from the
APT steps six, five, three and two, respectively, have been triggered, the
315 clustering algorithm can classify those alerts into one cluster only if they
meet the following two conditions:
The AC module has four processing engines, explained later in Section 4.2.2
320 on page 20, each engine processes all alerts which belong to one of the APT
detectable steps. Based on the incoming alert step, a corresponded engine runs.
As a result of AC process, the new incoming alert can be classified into an
existing APT cluster, a new APT cluster can be created, or the new alert is
ignored as it does not meet the rules and cannot be clustered at all. The output
325 of AC is APT clusters. Each cluster contains a maximum of four alerts, which
potentially belong to one APT full or sub-scenario. The produced cluster alerts
are evaluated using the correlation index algorithm, presented in the following
Section 3.4.3 on page 15, to decide if they are correlated or not.
15
process is to filters clusters having uncorrelated alerts. The second function
335 calculates the correlation index of each cluster by the end of the correlation
window. The latter function is essential to build a historical record of the mon-
itored network to be used in the next module of the FCI correlation framework,
namely the prediction module.
The correlation indexing (CI) algorithm makes use of the attributes of each
340 alert in the cluster to calculate the cluster’s correlation index 𝐶𝑜𝑟𝑟𝑖𝑑 . To find
the 𝐶𝑜𝑟𝑟𝑖𝑑 for each cluster, the CI algorithm calculates the correlation between
each two alerts (steps) in the cluster. Therefore, three values are calculated
within each cluster: 𝐶𝑜𝑟𝑟𝑎𝑏 , the correlation between the second step (𝑎𝑙𝑒𝑟𝑡1 )
and the third step (𝑎𝑙𝑒𝑟𝑡2 ) of APT; 𝐶𝑜𝑟𝑟𝑏𝑐 , the correlation between the third
345 step (𝑎𝑙𝑒𝑟𝑡2 ) and the fifth step (𝑎𝑙𝑒𝑟𝑡3 ) of APT; and 𝐶𝑜𝑟𝑟𝑐𝑑 , the correlation
between the fifth step (𝑎𝑙𝑒𝑟𝑡3 ) and the sixth step (𝑎𝑙𝑒𝑟𝑡4 ) of APT.
The clustering algorithm is based on alert_type and timestamp attributes
of each alert. However, the correlation indexing algorithm is based on in-
fected_host and scanned_host attributes. To calculate 𝐶𝑜𝑟𝑟𝑎𝑏 , 𝐶𝑜𝑟𝑟𝑏𝑐 and
350 𝐶𝑜𝑟𝑟𝑐𝑑 , taking into consideration the APT attack life cycle and the attributes
of each alert in the cluster, the CI algorithm utilizes the following rules:
⎧
⎨1, if [𝑎𝑙𝑒𝑟𝑡2 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡2 ] = [𝑎𝑙𝑒𝑟𝑡1 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡1 ]
⎪
𝐶𝑜𝑟𝑟𝑎𝑏 =
⎩0,
⎪
𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
⎧
⎪
⎪
⎪ 1, if [𝑎𝑙𝑒𝑟𝑡3 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡3 ] = [𝑎𝑙𝑒𝑟𝑡2 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡2 ]
⎪
⎨
𝐶𝑜𝑟𝑟𝑏𝑐 = or [𝑎𝑙𝑒𝑟𝑡3 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡3 ] = [𝑎𝑙𝑒𝑟𝑡1 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡1 ]
⎪
⎪
⎪
⎪
⎩0, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
16
⎧
1, if [𝑎𝑙𝑒𝑟𝑡4 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡4 ] = [𝑎𝑙𝑒𝑟𝑡3 , 𝑠𝑐𝑎𝑛𝑛𝑒𝑑_ℎ𝑜𝑠𝑡]
⎪
⎪
⎪
⎪
⎪
⎪
⎪
⎪
⎪
⎪
⎪ or [𝑎𝑙𝑒𝑟𝑡4 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡4 ] = [𝑎𝑙𝑒𝑟𝑡3 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡3 ]
⎪
⎨
𝐶𝑜𝑟𝑟𝑐𝑑 = or [𝑎𝑙𝑒𝑟𝑡4 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡4 ] = [𝑎𝑙𝑒𝑟𝑡2 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡2 ]
⎪
⎪
⎪
⎪
⎪
⎪
⎪
⎪ or [𝑎𝑙𝑒𝑟𝑡4 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡4 ] = [𝑎𝑙𝑒𝑟𝑡1 , 𝑖𝑛𝑓 𝑒𝑐𝑡𝑒𝑑_ℎ𝑜𝑠𝑡1 ]
⎪
⎪
⎪
⎩0, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
⎪
Since 𝐶𝑜𝑟𝑟𝑎𝑏 , 𝐶𝑜𝑟𝑟𝑏𝑐 and 𝐶𝑜𝑟𝑟𝑐𝑑 values can be only 1 or 0, the cluster’s
correlation index 𝐶𝑜𝑟𝑟𝑖𝑑 is always positive and can take one of the following
360 values:
17
All the clusters and their correlation index values are recorded into a specific
dataset, the correlation_dataset, to be used in the Prediction module.
375 This module is used by the network security team to estimate the probability
of an apt_sub_scenario_two_steps_alert, generated by the FCI correlation
framework, to develop a complete APT attack. In practical terms, it predicts
if FCI will generate an apt_full_scenario_alert in the future based on the
attributes of the current apt_sub_scenario_alert. This prediction gives the
380 network security team a sign to perform more forensics on the corresponding
two suspicious connections and deny the attacker to complete the APT life cycle.
The prediction module uses a historical record of the monitored network and
applies machine learning techniques to achieve its functionality.
PM takes the correlation dataset, built by FCI over six months or more, as
385 an input. The required period of time to build the correlation dataset depends
on the number of correlated clusters generated by FCI. This number affects the
number of samples used to train the prediction model. The correlation dataset
contains the correlated clusters, both full and sub APT scenarios, and the cor-
relation index 𝐶𝑜𝑟𝑟𝑖𝑑 for each cluster. The process in this module undergoes
390 three main steps: (1) Preparing the dataset, to be available to be consumed by
machine learning algorithms; (2) Training the prediction model, different ma-
chine learning algorithms are applied and the best model which has the highest
accuracy is chosen; and (3) Using the model for prediction, the security team
apply the model on FCI real time alerts. The output of this module is a predic-
395 tion model used by the network security team for live traffic monitor and APT
prediction.
4. MLAPT Implementation
In this section, the implementation of MLAPT are introduced and the used
frameworks, tools and programming languages are mentioned. As MLAPT con-
18
400 sists of three main phases: threat detection, alert correlation and attack predic-
tion; the implementation algorithms of each phase are presented separately.
All detection modules are implemented on top of Bro [43]. The implemen-
tation and evaluation of the detection modules have been published in [35–42].
405 Therefore, this paper presents only the implementation and evaluation of the
correlation framework and prediction module.
As an output of each detection module, in case of an APT technique is de-
tected, a corresponding event (alert) is generated. This event is to be used in
the FCI correlation framework as explained later in Section 4.2 on page 19. Ad-
410 ditionally, an alert email is sent to RT (Request tracker) [44] where the network
security team can perform additional forensics and respond to the triggered
alert. Along with generating a new alert, information regarding the alert and
the malicious connection (alert_type, timestamp, src_ip, src_port, dest_ip,
dest_port, infected_host, malicious_item) is written into a specific log (indi-
415 vidual log for each APT technique detection) to keep a historical record of the
monitored network.
In case of cryptographically embedded payloads for APTs paradigms, even
the connections are encrypted, the detection modules (except DeFD and MFHD)
are still effective as they depend on investigating the packets’ headers and not
420 the payload.
The FCI framework is implemented in two versions. The first one is im-
plemented on top of Bro to be used on live traffic for real time detection; it
can be also used offline on PCAP (Packet Capture) files. The second version
425 is implemented in Python to be used offline on saved alerts’ logs. Using FCI
offline-version is useful when having a PCAP file for a network which is not
monitored by Bro.
19
4.2.1. Implementation of the Alerts Filter (AF) Module
When generating a new alert by one of the detection modules,
430 the AF algorithm checks t_detection_modules_alerts table to determine
if the same alert has been generated within the last correlation_time.
t_detection_modules_alerts table contains all alerts which have been gener-
ated by the detection modules and sent to AC within the last correlation_time.
Thus, AF either (1) ignores the new alert, if it is a repeated one; or (1) sends
435 the new alert to AC, to be processed and clustered, and (2) writes the new alert
into t_detection_modules_alerts table where it is saved for the next correla-
tion_time. The AF algorithm pseudo-code is provided in the supplementary
material of this paper.
3. alert_3 ∈ {𝑠𝑐𝑎𝑛_𝑎𝑙𝑒𝑟𝑡}.
4. alert_4 ∈ {𝑡𝑜𝑟_𝑎𝑙𝑒𝑟𝑡}.
Alert clustering can affect the performance of the correlation indexing and
450 the prediction module as well. For this reason, the first function of CI, evaluat-
ing the correlations between the cluster’s alerts, mentioned in Section 3.4.3 on
page 15, is also implemented within the AC algorithm. Implementing the first
function of CI within AC reduces the computational cost of the FCI correlation
framework, since AC does not classify any new alert into a cluster unless it is
455 correlated with the cluster alerts, as explained later in this section. The AC
algorithm pseudo-code is provided in the supplementary material of this paper.
20
First, the AC module determines to which one of the APT steps the new
alert, coming from the AF module, belongs. MLAPT can detect four steps of
the APT life cycle, mentioned in Section 3.4, Table 3 on page 13. Based on the
460 new alert step, AC has four processing engines, each engine processes all alerts
which belong to one APT step.
For alert_1 processing engine, the second step of APT is the first detectable
step, therefore, as soon as an alert of the second APT step is triggered, AC
starts a new cluster and writes the new alert into alert_1.
465 For alert_2 processing engine, when a new alert for the third step of APT is
triggered, the AC module checks all the clusters in the clustered_dataset. The
cluster of interest is the one that has alert_1 and the other alerts (alert_2,
alert_3, alert_4 ) are still missed. For that cluster of interest, the algorithm
checks time attributes: time, the time when the current processed alert is trig-
470 gered; and time_1, the time when the alert_1 is triggered. For the new alert to
be considered, those time attributes should meet two conditions: time > time_1
and time - time_1 <= TW ; whereas TW stands for the time window "corre-
lation time". Following this, the first function of the CI module checks the in-
fected_host attributes: infected, the infected host of the current processed alert;
475 and infected_1, the infected host of alert_1. If both infected host attributes
are matched, the current processed alert is added into the current cluster of in-
terest as alert_2. In addition, an event apt_sub_scenario_two_steps_alert is
generated and an alert email is sent to RT informing the network security team
regarding this APT sub scenario detection. When one of the previous checks
480 fails, AC checks if the current cluster is the last one in the clustered_dataset:
if true, a new cluster is started and the current processed alerts is added as
alert_2 ; if false, the process is to be repeated again for the next cluster.
For alert_3 processing engine, when a new alert for the fifth step of APT is
triggered, AC checks all the clusters in the clustered_dataset. There are three
485 cases for the cluster of interest: (1) when the cluster has alert_1 and alert_2
and the other alerts "alert_3 and alert_4 " are missed; (2) when the cluster has
alert_1 and the other alerts "alert_2, alert_3, alert_4 " are missed; (3) and
21
when the cluster has alert_2 and the other alerts "alert_1, alert_3, alert_4 "
are missed.
490 For the first case of cluster of interest, AC checks all time attributes which
should meet two conditions: time > time_2 and time - time_1 <= TW. Fol-
lowing this, CI checks all infected host attributes that should meet the condition
infected == infected_2, as alert_1 and alert_2 are already in the cluster so it is
guaranteed that infected_1 == infected_2 and there is no need for the first func-
495 tion of CI to check it. The current processed alert is then added into the current
cluster of interest as alert_3, an event apt_sub_scenario_three_steps_alert is
generated, and an alert email is sent to RT informing the network security team
regarding this APT sub-scenario detection. If one of the previous checks is
failed, it is checked if the current cluster is the last one in clustered_dataset:
500 if true, a new cluster is started and the current processed alerts is added as
alert_3 ; if false, the process is to be repeated again for the next cluster.
For the second and third case of cluster of interest, the process is similar to
the first case, taking into consideration the corresponded time and infected host
attributes.
505 For alert_4 processing engine, the first step is to find the cluster of interest in
the clustered_dataset. When a new alert for the sixth step of APT is triggered,
AC checks all the clusters in the clustered_dataset. There are seven cases for
the cluster of interest: (1) when the cluster has alert_1, alert_2, and alert_3,
and the last alert "alert_4 " is missed; (2) when the cluster has alert_1 and
510 alert_2 and the other alerts "alert_3 and alert_4 " are missed; (3) when the
cluster has alert_1 and alert_3 and the other alerts "alert_2 and alert_4 "
are missed; (4) when the cluster has alert_2 and alert_3 and the other alerts
"alert_1 and alert_4 " are missed; (5) when the cluster has alert_1 and the
other alerts "alert_2, alert_3, and alert_4 " are missed; (6) when the cluster
515 has alert_2 and the other alerts "alert_1, alert_3, and alert_4 " are missed;
(7) and when the cluster has alert_3 and the other alerts "alert_1, alert_2,
and alert_4 " are missed.
The process of all cases of cluster of interest in alert_4 processing engine
22
is similar to the process in alert_3 processing engine explained above. The
520 AC algorithm checks all time attributes of the cluster; after that, the CI algo-
rithm checks all infected host and scanned host attributes; to decide whether
the current processed alert is to be added into the current cluster of inter-
est as alert_4. Based on the cluster of interest, three events can be gen-
erated as an output of alert_4 processing engine: apt_full_scenario_alert
525 for case 1; apt_sub_scenario_three_steps_alert for cases 2, 3, and 4; and
apt_sub_scenario_two_steps_alert for cases 5, 6, and 7. In addition, an alert
email is sent to RT informing the network security team regarding this APT
full or sub-scenario detection. If one of the algorithms’ conditions fails, the pro-
cess moves to the next cluster in clustered_dataset or it is ended if the current
530 cluster is the last one.
23
4.3.1. Preparing the Machine Learning Dataset
Building the machine_learning_dataset is based on the correlation_dataset,
which is the output of the FCI correlation framework over a period of six months
or more. The correlation_dataset contains the correlated clusters, both full and
550 sub APT scenarios, and the correlation index 𝐶𝑜𝑟𝑟𝑖𝑑 for each cluster. To prepare
the machine_learning_dataset, PM makes the following modifications on the
correlation_dataset:
∙ The class prediction is based on the first two detectable steps of APT,
therefore, all the columns related to the third and fourth detectable
570 alerts, i.e. alert_3 and alert_4 attributes, are filtered out of the cor-
relation_dataset.
∙ Since the chosen machine learning classifiers work with numeric values,
columns which are not numeric in the correlation_dataset are represented
in a numerical format for the machine_learning_dataset. The alert_type
24
575 values are mapped to numbers from 1 to 6, and the columns which
contain IPs values (src_ip_1, dest_ip_1, infected_host_1, src_ip_2,
dest_ip_2, infected_host_2 ) are mapped to numeric values using
socket [45] and struct.unpack [46] functions built in Python.
In this section, the evaluation of MLAPT is introduced and the achieved re-
sults are presented. As MLAPT consists of three main phases: threat detection,
600 alert correlation and attack prediction; the evaluation of MLAPT undergoes the
evaluation of the three phases respectively. Additionally, a comparison between
the developed system MLAPT and other existing systems is provided.
25
5.1. Evaluation of the Detection Modules
Two main methods were used to evaluate the detection modules. In the first
605 one, the detection modules were applied on pcap files which contain malicious
traffic. Each pcap file was provided by a different third party, pcap file size and
data source are mentioned in the evaluation section of each detection module.
In the second evaluation method, Bro was installed on an experimental server
(2x 4-core Intel Xeon CPU E5530 @ 2.40 GHz, 12 GB RAM) with passive access
610 to part of the university campus live traffic (200 Mbps, 200 users, 550 nodes)
via an optical TAP (Test Access Port). The detection modules were run on the
experimental server and the network was monitored for one month.
In the absence of any publicly available data which contains APT attack
615 traffic, which can be used in the evaluation of the FCI framework. We had
to build a new dataset which contains APT attack traffic. Using the campus
network to gather attack data does not guarantee capturing any APT attack
traffic against the monitored network.
The aim of the correlation framework is to identify different alerts raised
620 by the various detection modules, which could be correlated and belong to one
APT attack scenario. To effectively evaluate the FCI correlation framework, a
dataset containing many of the detection modules alerts, in which some of those
alerts belong to APT attack scenarios, has been built. The data is generated
to appear as APT attack scenarios were simulated on the campus network, the
625 techniques used in the APT life cycle were identified by the detection modules,
and all generated alerts were written into the simulation dataset. That dataset
also contains many of the generated alerts which do not belong to APT attack
scenarios. All the detection modules have been evaluated on pcap files and on
the real live traffic as well. The aim of this experiment is to test if the FCI
630 correlation framework is able to detect those APT scenarios in the simulation
dataset.
26
5.2.1. Data Generation
A script is written, using Python. This script generates two types of alerts:
(1) Random alerts which do not relate or belong to one APT attack scenario; and
635 (2) Related alerts which belong to a full or sub-APT attack. Each alert has seven
attributes: alert_type, timestamp, src_ip, src_port, dest_ip, dest_port and the
infected_host; only the scan_alert has the extra scanned_host attribute.
To generate a random alert, the alert_type is selected randomly from the set
of all 8 detectable alerts, i.e. disguised_exe_alert, hash_alert, domain_alert,
640 ip_alert, ssl_alert, domain_flux_alert, scan_alert and tor_alert. The times-
tamp is assigned a random value between Fri, 01 Jan 2016 00:00:01 GMT and
Thu, 30 Jun 2016 23:59:59 GMT. The src_ip is randomly assigned an IP ad-
dress on the campus network. The src_port is selected randomly from the
49152, 65535 range of ports, which are usually assigned dynamically to client
645 applications when initiating a connection. The dest_ip value is assigned based
on the selected alert_type: If the alert_type is disguised_exe_alert, hash_alert
or ssl_alert, then the dest_ip can be any valid IP address which is not on
the campus network; if the alert_type is domain_alert or domain_flux_alert,
then the dest_ip can use an IP address which is on the campus network; if the
650 alert_type is assigned ip_alert, then the dest_ip can select a random IP ad-
dress from the ip_blacklist; if alert_type is scan_alert, the dest_ip is selected
randomly from campus network IP addresses; and if the alert_type is tor_alert,
the dest_ip is selected randomly from tor_server_list. The dest_port is se-
lected based on the selected alert_type: if the alert_type is disguise_exe_alert
655 or hash_alert, the dest_port should be 80; if the alert_type is domain_alert or
domain_flux_alert, the dest_port should be 53; if the alert_type is ip_alert,
ssl_alert or tor_alert, the dest_port should be 443; and if the alert_type is
scan_alert, the dest_port is selected randomly from the 1, 1024 range of ports.
The infected_host should be the same src_ip of the connection. Finally, the
660 scanned_host (only if alert_type is scan_alert) should be the same dest_ip of
the connection.
27
To generate an APT full-scenario (consisting of 4 correlated alerts) or sub-
scenario (consisting of 2 or 3 correlated alerts), the APT life cycle should be
taken into consideration. Meaning, the generated alerts’ attributes of each sce-
665 nario are selected to appear as an APT attack which is simulated through the
campus network.
28
5.2.3. Results and Discussion
690 The FCI correlation framework was applied on the simulation_dataset. Ta-
ble 4 shows the FCI correlation framework detection results. This table indicates
the True Positive Rate (TPR) and the False Positive Rate (FPR) [47] for each
studied APT attack, both full and partial attacks. Among all studied APT
attacks, the best TPR results were for the APT sub-attack two steps scenario,
695 followed by the APT sub-attack three steps scenario and APT full attack, re-
spectively. The results show that the higher the number of related alerts, the
lower the TPR and the higher FPR. This is due to the higher possibility of the
random alerts to be incorrectly clustered when more alerts are to be correlated
for APT. By manual analysis for the results, the incorrect alerts clustering was
700 the main reason of the false alarms. Some APT attacks were not detected due
to some of the random alerts which were incorrectly clustered and correlated.
This can happen if those random alerts, by chance, meet the clustered and cor-
relation rules, so one random alert can interfere with a running APT scenario (if
the random alert is triggered for the missed scenario step, for the same infected
705 host, and within the correlation time) and cause the false positive detection of
APT and false negative detection of the random alert. Besides, a very rare case
can cause the wrong detection is when two, three or four random alerts can
meet the correlation rules, by chance, and are correlated incorrectly.
APT full scenario 90*4 78*4 12*4 88 4452 400 4500 78% 1%
(4 steps)
APT sub-scenario 65*3 42*3 23*3 24 4681 150 4750 84% 1.4%
(3 steps)
APT sub-scenario 85*2 47*2 38*2 6 4724 100 4800 94% 1.6%
(2 steps)
APT full and 725 532 193 118 4132 650 4250 81.8% 4.5%
sub-scenario
29
5.3. Evaluation of the APT Prediction Module (PM)
710 To evaluate the PM module, three main steps were followed: (1) Preparing
the machine_learning_dataset; (2) Training the prediction model; and (3) Sav-
ing the model for prediction.
Using the correlation_dataset, which is the output of the FCI correlation
framework over a period of six months, the machine_learning_dataset is pre-
715 pared as explained in Section 4.3.1 on page 24.
As there is no machine learning algorithm which can be regarded as the
best or the optimal one, various experiments should be performed on the ma-
chine_learning_dataset using several machine learning algorithms, and then a
comparison between the trained models is made.
720 The Matlab’s Classification Learner application is used to train models to
classify the machine_learning_dataset. Automated training is performed to
search for the best classification model type, including decision trees, support
vector machines, nearest neighbours, and ensemble classification; the character-
istics of each classifier type can be found in [48]. Cross-validation is used as a
725 validation scheme to examine the prediction accuracy of each trained model.
Cross-validation is a model assessment technique used to evaluate a machine
learning algorithm’s performance in making predictions on new datasets which
has not been trained on . This is done by partitioning a dataset and using a
subset to train the algorithm and the remaining data for testing. Each round of
730 cross-validation involves randomly partitioning the original dataset into a train-
ing set and a testing set. The training set is then used to train a supervised
learning algorithm and the testing set is used to evaluate its performance. This
process is repeated several times and the average accuracy is used as a per-
formance indicator. Table 5 shows the prediction accuracy for all investigated
735 classification algorithms used to train the classification models.
Experimental results show that the best classification algorithm is the Linear
SVM, with a prediction accuracy of 84.8%. This trained model can be saved
by the network security team to be applied on real time traffic when a new real
30
Table 5: Classification algorithms and the prediction accuracy of the trained models.
Prediction
Classification algorithms
accuracy
31
Table 6: A comparison between MLAPT and other existing systems.
APT detection Auton- APT speed TPR FPR Prediction
system omy steps accuracy
The most effective system in terms of true positive rate is TerminAPTor [26]
with a TPR of 100%, previously mentioned in Section 2 on page 4. However,
the developers mentioned that TerminAPTor has a high rate of false positives
(although they did not mention the figure of FPR) and needs to be improved
750 by filtering the false positives. Moreover, this detector requires the alerts to be
provided by other systems (agent-based) and cannot work autonomously. De-
spite having the lowest false positive rate of 0%, the C&C-based system [27],
presented previously in Section 2 on page 5, does not achieve the real time de-
tection. Furthermore, the authors stated that the detection can be easily evaded
755 when the infected hosts connect to the C&C domains while users are surfing the
Internet. Additionally, missing the detection of C&C domains leads to failure
in APT detection since this system depends on detecting only one step of the
APT life cycle. Whilst the spear phishing based system [28], explored earlier
in Section 2 on page 6, has a TPR of 97.2%, the FPR of 14.2% is considerably
760 high. In addition, depending on one step for APT detection leads the system to
fail when missing the spear phishing email detection. This missing can happen
when the spear phishing email does not include any of the tokens which are nec-
essary for the algorithm process. The context-based system [32], already stated
in Section 2 on page 7, has a significantly high FPR of 27.88% while the TPR
32
765 was not provided by the authors. Besides, this framework requires significant
expert knowledge to set up and maintain; and similar to TerminAPTor, it is an
agent-based system and cannot work autonomously.
Having a high rate of true positives is significant. Nevertheless, increasing the
amount of true positives means that the false positive rate also increases. Thus,
770 the balance between TPR and FPR is an essential requirement for any detection
system. The developed system MLAPT has a suitable balance between the two
values of TPR and FPR with 81.8% and 4.5% respectively. MLAPT can also
work autonomously and generate the required events based on its own detection
modules. The generated events covers four detectable steps of the APT life cycle
775 which reduces the false positives and gives more possibility of APT detection
in case one of the steps is missed. Furthermore, this system can achieve the
real time detection, so it can be much easier to trace back to the attacker,
minimise the damage and prevent further break-ins. Moreover, to the author’s
knowledge, MLAPT is the only system which can predict APT in its early steps
780 with a prediction accuracy of 84.8%, which prevents the attacker from achieving
the goal of data exfiltration.
33
∙ In the final phase, a machine-learning-based prediction module (PM) is
designed and implemented based on a historical record of the monitored
795 network.
References
34
[2] A. MacDermott, T. Baker, Q. Shi, Iot forensics: Challenges for the ioa
era, in: New Technologies, Mobility and Security (NTMS), 2018 9th IFIP
International Conference on, IEEE, 2018, pp. 1–5.
35
[11] T. M. technical report, Targeted attacks and how to de-
fend against them, http://www.trendmicro.co.uk/media/misc/
850 targeted-attacks-and-how-to-defend-against-them-en.pdf, ac-
cessed: 05-12-2017.
855 [13] T. R. Rakes, J. K. Deane, L. Paul Rees, It security planning under uncer-
tainty for high-impact events, Omega 40 (1) (2012) 79–88.
860 [15] M. I. Center, Apt1: Exposing one of china’s cyber espionage units, Tech.
rep., Mandiant, Tech. Rep (2013).
865 [17] C. Tankard, Advanced persistent threats and how to monitor and deter
them, Network security 2011 (8) (2011) 16–19.
[19] S.-T. Liu, Y.-M. Chen, S.-J. Lin, A novel search engine to uncover poten-
870 tial victims for apt investigations, in: Network and Parallel Computing,
Springer, 2013, pp. 405–416.
36
[21] M. Lee, D. Lewis, Clustering disparate attacks: Mapping the activities of
the advanced persistent threat., in: Proceedings of the 21st Virus Bulletin
International Conference.(October 2011) pp, pp. 122–127.
895 [27] X. Wang, K. Zheng, X. Niu, B. Wu, C. Wu, Detection of command and
control in advanced persistent threat based on independent access, in: Com-
munications (ICC), 2016 IEEE International Conference on, IEEE, 2016,
pp. 1–6.
37
[29] J. Sexton, C. Storlie, J. Neil, Attack chain detection, Statistical Analysis
and Data Mining: The ASA Data Science Journal 8 (5-6) (2015) 353–363.
905 [30] N. Nissim, A. Cohen, C. Glezer, Y. Elovici, Detection of malicious pdf files
and directions for enhancements: a state-of-the art survey, Computers &
Security 48 (2015) 246–266.
[36] I. Ghafir, V. Prenosil, Malicious file hash detection and drive-by down-
load attacks, in: Proceedings of the Second International Conference on
930 Computer and Communication Technologies, Springer, 2016, pp. 661–669.
38
[37] I. Ghafir, V. Prenosil, Dns traffic analysis for malicious domains detec-
tion, in: 2nd International Conference on Signal Processing and Integrated
Networks (SPIN), IEEE Xplore Digital Library, 2015, pp. 613–918.
39
[46] Python-Software-Foundation, struct, interpret strings as packed binary
data, https://docs.python.org/2/library/struct.html, accessed: 15-
960 03-2018.
40