Detecting APT Attacks Based On Network Traffic Using Machine Learning
Cho Do Xuan
Abstract
Advanced Persistent Threat (APT) attacks are malicious, deliberate, and
clearly targeted attacks that use many sophisticated methods and
technologies to obtain confidential and sensitive information from their
targets. To detect APT attacks, detection systems often need to apply many
techniques in parallel and in series in order to make the most of the
advantages and minimize the disadvantages of each technique. Therefore, in
this paper, we propose a method of detecting APT attacks based on abnormal
behaviors of network traffic using machine learning. In our research, the
abnormal behavior of APT attacks in network traffic is defined on two
components: domain and IP. These behaviors are then evaluated and
classified with the Random Forest classification algorithm to decide
whether they indicate an APT attack. Details of the definition of abnormal
behaviors of the domain and IP are presented in Section 3.2 of the paper.
The synchronous APT attack detection method proposed in this paper is a
novel approach that helps information security systems quickly and
accurately detect signs of an APT attack campaign in the organization. The
experimental results presented in Section 4 demonstrate the effectiveness
of our proposed method.
1 Introduction
1.1 Introduction to APT Attack
The APT attack is an advanced and targeted attack technique [1], as shown
by its persistence and its ability to conceal itself [1, 2]. The studies
[1–4] presented the definitions and concepts of the terms Advanced,
Persistent, and Threat in this attack technique. Moreover, the studies
[1, 2] pointed out four characteristics that distinguish the APT attack
from other network attack techniques: Targeted, Persistent, Evasive, and
Complex. These differences make the APT attack much more dangerous than
other attack techniques. The study [2] identified the phases of an APT
attack campaign: Reconnaissance, Preparation, Targeting, Further access,
Data gathering, and Maintenance. Currently, the APT attack is considered
the most dangerous cyber-attack technique and causes many difficulties and
much damage to organizations and state agencies of countries around the
world. Therefore, early detection of and warning about this attack
technique is essential.
In this paper, we propose an APT attack detection method that combines
two methods of analyzing the signs and abnormal behavior of domains and
IPs in network traffic. The proposed method consists of two steps:
• Step 1: Extracting the behavior and features of APT attacks in network
traffic. At this step, we propose features and behavior of the domain and
IP of APT attacks in network traffic;
• Step 2: Detecting APT attacks based on network traffic using the
abnormal behavior analysis technique and the Random Forest (RF) machine
learning algorithm. After obtaining features that distinguish APT domains
and APT IPs from clean domains and IPs, we use the Random Forest
algorithm to evaluate and classify them in order to detect suspect
domains and IPs of APT attacks. In addition, to improve the efficiency of
the APT IP detection process, we use the output of the malicious domain
detection process as a feature of the behavior of IP-based APT attacks.
Combining the features extracted from the IP with the label of the domain
makes the detection process more efficient because of the correlation
between the domain and the IP.
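The two-stage idea in Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the `Classifier` type, and the two threshold-based stand-in classifiers are hypothetical, and in practice each stand-in would be replaced by a trained Random Forest model.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: any callable mapping a feature vector to 0 (clean) or 1 (APT).
Classifier = Callable[[Sequence[float]], int]


def build_ip_feature_vector(ip_features: Sequence[float], domain_label: int) -> List[float]:
    """Append the domain classifier's output (0 or 1) to the IP features."""
    return list(ip_features) + [float(domain_label)]


def detect_apt_ip(
    domain_features: Sequence[float],
    ip_features: Sequence[float],
    domain_clf: Classifier,
    ip_clf: Classifier,
) -> Dict[str, int]:
    """Stage 1 labels the domain; stage 2 classifies the IP using that label as a feature."""
    domain_label = domain_clf(domain_features)
    ip_vector = build_ip_feature_vector(ip_features, domain_label)
    return {"domain_label": domain_label, "ip_label": ip_clf(ip_vector)}


# Stand-in classifiers (simple threshold rules, NOT trained Random Forests).
def toy_domain_clf(features: Sequence[float]) -> int:
    return 1 if sum(features) > 1.0 else 0


def toy_ip_clf(features: Sequence[float]) -> int:
    # The last element is the domain label appended in stage 1.
    return 1 if features[-1] == 1.0 and features[0] > 0.5 else 0


result = detect_apt_ip([0.9, 0.4], [0.8, 0.1], toy_domain_clf, toy_ip_clf)
```

The only design point carried over from the text is that the domain label becomes one more column of the IP feature vector, so the second classifier can exploit the correlation between the domain and the IP.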
2 Related Works
2.1 Detecting APT Domain
In [6], the authors detected APT attacks based on two sources: the DNS
log and network traffic. For the DNS log, the authors used five feature
groups: Domain-based features, Time-based features, Whois-based features,
DNS answer-based features, and Active Probing features. These five groups
contain a total of 14 features for detecting malicious DNS. The
classification algorithm used in the paper is the J48 Decision Tree
algorithm. For network traffic, the authors presented six main features.
After detecting an APT attack in both the DNS log and network traffic,
the authors used a correlation analysis technique to determine which
computer addresses in the system were infected with APT malware. However,
the authors did not present details of this correlation calculation
method.
In [7], the authors combined the J48 Decision Tree algorithm with four
main feature groups (DNS request and answer-based features, Domain-based
features, Time-based features, and Whois-based features) to detect APT
malware command and control domains (C&C domains). The Global Abnormal
Forest and KNN machine learning algorithms are used in this study. The
authors use statistical correlation analysis to find some new abnormal
features of APT attacks. However, the authors did not present the data
sources from which these abnormal features would be extracted.
In [8], the authors used three main groups of features to detect APT
domains (Domain name lexical features, Ranking features, and DNS query
features) together with the Random Forest algorithm.
In [9], the authors used correlation analysis between the DNS log and
network traffic, along with machine learning algorithms such as KNN and
SVM, to detect APT attacks.
Yan et al. [10] proposed using the CNN deep learning algorithm to detect
APT attacks based on DNS activities. The authors extracted three main
groups of features: Domain Name-based Features; Feature of the
Relationship between DNS Request Behavior and Response Behavior; Feature
of the Relationship between DNS Request Behavior and Response Behavior.
These features were extracted from a dataset of 4,907,147,146 initial DNS
request records collected over 47 days from the Jilin University
Education Network and combined with the CNN algorithm to detect APT
attack behavior. There are also other approaches to detecting malicious
domains that support APT attack detection: Vinayakumara et al. [11] used
deep learning algorithms, and Nguyen [12] proposed using neutrosophic
sets.
[Figure 1 diagram. Blocks: Analyzing and extracting behavior of Network
traffic; Detecting malicious IP; Detecting APT IP. Outputs: APT attack;
Another attack.]
Figure 1 Proposed model of detecting APT attack based on Network Traffic using machine
learning.
Table 1 Continued
No. | Group | Feature | Data Type | Description
16 | (continued) | Rank in Domcop | Integer | Rank of the domain name in the list of ten million common domain names from Domcop
17 | DNS query features (D) | Resolved IP count | Integer | Number of IP addresses returned in the DNS query
18 | DNS query features (D) | Distinct country count | Integer | Number of countries from IP addresses
19 | DNS query features (D) | Silent IP ratio | Real | Ratio of inappropriate domains
20 | DNS query features (D) | HTTP response status | Integer | Status of the returned HTTP response
21 | DNS query features (D) | Name server count | Integer | Number of the name servers returned in the DNS query
22 | DNS query features (D) | Name server IP count | Integer | Number of IP addresses of the name server in the DNS query
23 | DNS query features (D) | Name server Country count | Integer | Number of countries in which the DNS servers are located
24 | DNS query features (D) | The number of domains sharing the same IP* | Integer | This feature is described in detail below
25 | DNS query features (D) | The IP address in the same class B range of known C&C servers* | Binary | This feature is described in detail below
26 | DNS query features (D) | Mail exchange server count | Integer | Number of mail exchange servers returned in the DNS query
27 | DNS query features (D) | Time to live (TTL) | Integer | Time to live of the cached records for the domain name in the DNS server
28 | Time value-based features (T) | The daily similarity* | Binary | This feature is described in detail below
29 | Time value-based features (T) | Same query numbers in the same time window* | Binary | This feature is described in detail below
30 | Time value-based features (T) | Very low frequency query* | Binary | This feature is described in detail below
• Very low frequency query: Some advanced APT malware queries the
domain used to locate the C&C server at a very low frequency in order
to avoid detection. The time between queries can be up to several days,
or even several weeks or months. This could be the behavior of
sophisticated malware designed to evade detection. According to the
experiment, these domains usually show other common signs as well, such
as a web server with complete content and stable IP addresses and TTL
values.
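As a rough illustration of how the binary "Very low frequency query" feature could be computed from DNS query timestamps, consider the sketch below. The text only says that the gap between queries can reach days, weeks, or months, so the function name, the use of the median gap, and the three-day default threshold are all assumptions made for this example.

```python
from statistics import median
from typing import Sequence

SECONDS_PER_DAY = 86_400


def is_very_low_frequency(query_epochs: Sequence[float], min_gap_days: float = 3.0) -> int:
    """Return 1 if the median gap between consecutive queries is at least
    min_gap_days (an assumed threshold), else 0."""
    if len(query_epochs) < 2:
        return 0  # not enough queries to measure a gap
    times = sorted(query_epochs)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return 1 if median(gaps) >= min_gap_days * SECONDS_PER_DAY else 0


# Queries 5 and then 7 days apart are flagged; queries one minute apart are not.
sparse = is_very_low_frequency([0, 5 * SECONDS_PER_DAY, 12 * SECONDS_PER_DAY])
dense = is_very_low_frequency([0, 60, 120, 180])
```

The median is used here so that a single unusually short or long gap does not decide the result; other summary statistics would be equally defensible.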
Then, the server opens a random port (not a service port, and usually
a high-numbered port). Thus, for the problem of detecting APT attacks on
a server, when the server opens a port that is not running its expected
service, the server is likely infected with malware that is
communicating with the C&C server. An example is HTTP traffic in the
network capture that does not go through port 80 or port 8080. To find
abnormal ports that are not running the correct service according to
Internet standards, the first step is to find the strange IPs that the
server queried in DNS packets. Next, we look for the requests that the
server sent to those IPs and, from those records, extract the protocols
and service ports through which the server sends data out. We then check
whether each protocol matches its port; if not, they are abnormal
protocols and ports. The reason for this principle is that when a server
provides a service on a specified port, it always listens for requests
and returns responses through that port. For example, a web service
using HTTP has default port 80, so the web server always listens for
requests and returns responses over port 80.
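The port check described above can be sketched as a filter over already-parsed traffic records. This is only an illustration under assumptions: the record layout `(destination IP, protocol, port)`, the function name, and the `EXPECTED_PORTS` table are hypothetical, and only the HTTP entry (ports 80 and 8080) is taken from the text.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Assumed record layout: (destination IP, protocol name, destination port).
Record = Tuple[str, str, int]

# Protocol -> ports considered normal. Only the HTTP entry comes from the text;
# a real deployment would extend this table.
EXPECTED_PORTS: Dict[str, Set[int]] = {"http": {80, 8080}}


def abnormal_port_records(records: Iterable[Record], strange_ips: Set[str]) -> List[Record]:
    """Return records sent to a strange IP whose protocol runs on an unexpected port."""
    flagged: List[Record] = []
    for dst_ip, protocol, port in records:
        if dst_ip not in strange_ips:
            continue  # only IPs previously extracted from DNS records are examined
        expected = EXPECTED_PORTS.get(protocol.lower())
        if expected is not None and port not in expected:
            flagged.append((dst_ip, protocol, port))
    return flagged


records = [
    ("203.0.113.5", "HTTP", 80),
    ("203.0.113.5", "HTTP", 49152),
    ("198.51.100.7", "HTTP", 4444),
]
flagged = abnormal_port_records(records, strange_ips={"203.0.113.5"})
```

Protocols missing from the table are skipped rather than flagged, which keeps the sketch conservative when the expected port is unknown.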
• A large discrepancy between the transferred and received data.
Normally, the data that the server transfers to the client is greater
than or at least equal to the data sent by the client. But when the
server is infected with malware, the data transferred to the client will
be much larger than the data transferred from the client to the server,
because the data sent to the client contains files, while the data the
client sends to the server is only a search or download command. To find
data transferred from the C&C server, we need to extract strange IPs
from DNS records. Then, any connection in which the size of the packets
sent from that IP (taken from the TCP len field) is larger than the size
of the packets transferred from the server to that IP address is flagged
under this case.
• Abnormal TCP connection. When hackers send attack commands to a
server infected with malware, the TCP connection between the infected
server and the C&C server lasts a very long time, because the commands
are commands to search for or download files, and the server takes a
long time to find the files and return the response to the C&C server.
Here, we consider connections lasting one minute or longer as abnormal
connections. To determine the connection time of host A to host B, we
specify that if host A has the FIN flag = 1 and the ACK flag = 1, and
host B has the SYN flag = 0 and the ACK flag = 1, the connection time is
equal to the difference between the epoch times of the two records. If
the connection time is greater than or equal to 60 seconds, it is an
abnormal TCP connection.
• Heartbeat traffic. After the malware has successfully infected the
victim’s computer, it connects to the C&C server and sends packets to
inform the C&C server that the infection succeeded and that the malware
is still online. These packets are sent at fixed intervals and always
have the same size. This is called heartbeat traffic. To detect
heartbeat traffic, we take the values of the TCP len and time epoch
fields to determine the size and time of each packet. We then check the
sizes of the packets over periods of time such as days, months, or years
(depending on the requirement) to identify the heartbeat.
• Abnormal data fluctuation. The data exchanged with C&C servers is
usually small and steady, but when hackers start sending data from the
infected server to the C&C server, the volume increases dramatically. As
with the previous signs, the first step is to identify strange IPs in
DNS records. Then, we get the value of the TCP len field of all records
whose destination IP is a strange IP. From these values, we find the
maximum packet size and calculate the average packet size. If the size
of the largest packet is greater than 60% of the average size of
packets, that data is abnormal data.
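Three of the checks in the list above (long TCP connection, heartbeat traffic, and data fluctuation) can be sketched over already-extracted time epoch and TCP len values. This is a minimal sketch, not the paper's implementation. The function names, the `jitter_s` tolerance, and the `min_packets` minimum are assumptions. For the fluctuation rule, a maximum is never below the mean, so "greater than 60% of the average" would hold for any data if read literally; the sketch therefore assumes the intended reading is that the largest packet exceeds the average by more than 60%.

```python
from statistics import mean
from typing import Sequence, Tuple

Packet = Tuple[float, int]  # (time epoch in seconds, TCP len in bytes)


def is_long_connection(start_epoch: float, end_epoch: float, threshold_s: float = 60.0) -> bool:
    """Abnormal TCP connection: duration of at least 60 seconds (threshold from the text)."""
    return (end_epoch - start_epoch) >= threshold_s


def is_heartbeat(packets: Sequence[Packet], jitter_s: float = 1.0, min_packets: int = 3) -> bool:
    """Heartbeat traffic: identical packet sizes sent at a (nearly) fixed interval.
    jitter_s and min_packets are assumed tolerances, not values from the paper."""
    if len(packets) < min_packets:
        return False
    ordered = sorted(packets)  # sort by epoch time
    sizes = {size for _, size in ordered}
    gaps = [later[0] - earlier[0] for earlier, later in zip(ordered, ordered[1:])]
    return len(sizes) == 1 and (max(gaps) - min(gaps)) <= jitter_s


def has_abnormal_fluctuation(sizes: Sequence[int], excess: float = 0.60) -> bool:
    """Data fluctuation, under the assumed reading: the largest packet exceeds
    the mean packet size by more than `excess` (60%)."""
    if not sizes:
        return False
    return max(sizes) > (1.0 + excess) * mean(sizes)


beacon = [(0.0, 64), (30.0, 64), (60.0, 64), (90.5, 64)]
burst = [100, 100, 100, 1500]
```

Each function takes plain numbers so that it stays independent of the capture tool used to extract the TCP len and time epoch fields.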
that this algorithm has a low rate of falsely predicting benign domains
as malicious. Besides, when comparing the results of detecting malicious
domains, we notice that our approach is better than some other
approaches. This shows that the selected and extracted features clearly
distinguish the malicious domains from the clean domains. Next, we use
this malicious domain detection model to test the dataset of actual APT
attack domains.
The results presented in Table 4 show that the accuracy of the method
of detecting APT attacks by DNS behavior is relatively good (92.54%).
The false positive rate (FPR) is low (0.61%). It can be seen that
testing the malicious domain detection model gave good results even
though the experimental dataset is imbalanced between normal domains and
APT domains. This result shows that the malicious domain features
selected during the training process define the behavior of APT domains
relatively accurately and fully. With this result, the paper provides
systems for detecting malicious domains in general, and APT domains in
particular, with a novel method to detect and classify domains.
trees. Table 5 below shows the results of APT IP detection using the Random
Forest algorithm with the features defined in Table 2.
From the experimental results in Table 5, we notice that when the
domain feature is not used, the Random Forest algorithm with 50 trees
gives the highest accuracy, 93.12%. Judged on accuracy alone, the system
classifies very well. However, the precision (the proportion of
correctly predicted APT IPs among those classified as APT IPs) is low in
all cases and the FPR is quite high. The cause of this problem is that
the test dataset is heavily imbalanced between clean data and malicious
data. Besides, the selected and extracted features capture many
characteristics of APT attacks, but these characteristics do not appear
much in the test dataset.
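The effect of class imbalance on these measures can be illustrated with a small worked example. The confusion-matrix counts below are hypothetical and chosen only to show how accuracy can stay high while precision is low; they are not taken from Table 5.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, and false positive rate from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical imbalanced test set: 1000 IPs, of which only 20 are APT IPs.
metrics = classification_metrics(tp=10, fp=40, tn=940, fn=10)
# accuracy = 950/1000, precision = 10/50, fpr = 40/980
```

With so few APT IPs in the set, the 940 correctly classified clean IPs dominate the accuracy, while the 40 false alarms outnumber the 10 true detections and pull the precision down.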
References
[1] Quintero Bonilla, Santiago & Rey, Ángel. A New Proposal on the
Advanced Persistent Threat: A Survey. Applied Sciences. 2020, 10(11),
pp. 38–74.
[2] Adel, A., Ankur, C., Sowmya, M., Dijiang Huang, H.: A Survey on
Advanced Persistent Threats: Techniques, Solutions, Challenges, and
Research Opportunities. IEEE Commu. Sur. & Tu. PP99(1–1), 1–29
(2019).
[3] Zimba, Aaron, Chen, Hongsong, Wang, Zhaoshun, Chishimba, Mumbi.
Modeling and detection of the multi-stages of Advanced Persistent
Threats attacks based on semi-supervised learning and complex
networks characteristics. Future Generation Computer Systems. Volume
106, 2020, pp. 501–517.
[4] Sadegh, M.M., Rigel, G.J., Birhanu, E., Ramachandran, S., HOLMES:
Real-time APT Detection through Correlation of Suspicious Information
Flows. In: 2019 IEEE Symposium on Security and Privacy, pp. 1137–
1152, San Francisco, CA, USA, 19–23 May 2019.
[5] Lajevardi, Amir, Amini, Morteza. A semantic-based correlation
approach for detecting hybrid and low-level APTs. Future Generation
Computer Systems. Vol. 96, 2019, pp. 64–88.
[6] Weina, N., Xiaosong, Z., GuoWu, Y., Jianan, Z., Zhongwei, R.,
Identifying APT Malware Domain Based on Mobile DNS Logging. Mat. Pro.
in. Eng. 2, 1–9 (2017).
[7] Zhao, G., Xu, K., Xu, L., Wu, B., Detecting APT malware infections
based on malicious DNS and traffic analysis. IEEE Access. 3, 1132–
1142 (2015).
[8] Do Xuan Cho, Ha Hai Nam. A Method of Monitoring and Detecting
APT Attacks Based on Unknown Domains. Pro. Com. Sci. 150, 316–
323 (2019).
[9] Jiazhong Lu, Kai Chen, Zhongliu Zhuo, XiaoSong Zhang. A temporal
correlation and traffic analysis approach for APT attacks detection.
Cluster Computing (2017). pp. 1–12.
[10] Guanghua Yan, Qiang Li, Dong Guo, Xiangyu Meng. Discovering
Suspicious APT Behaviors by Analyzing DNS Activities. Sensors 2020, 20,
731; doi:10.3390/s20030731.
[11] R. Vinayakumara, K.P. Somana, P. Poornachandranb. Detecting
malicious domain names using deep learning approaches at scale. Journal of
Intelligent and Fuzzy Systems. 2018, 34, 1355–1367.
[12] Van Can, Nguyen et al. A New Method to Classify Malicious Domain
Name Using Neutrosophic Sets in DGA Botnet Detection. Journal of
Intelligent and Fuzzy Systems. 2020, 36, 4223–4236.
[13] Cho Do Xuan, Hoang Mai Dao, Hoa Dinh Nguyen. APT attack detection
based on flow network analysis techniques using deep learning. Journal
of Intelligent & Fuzzy Systems, vol. 39, no. 3, pp. 4785–4801, 2020.
[14] Cho Do Xuan, Lai Van Duong and Tisenko Victor Nikolaevich,
“Detecting C&C Server in the APT Attack based on Network Traffic using
Machine Learning”, International Journal of Advanced Computer Science
and Applications (IJACSA), 11(5), 2020.
http://dx.doi.org/10.14569/IJACSA.2020.0110504.
[15] Shai, S.S., Shai B.D., Understanding Machine Learning: From Theory
to Algorithms. Cambridge University Press (2014).
[16] Leo, B., Random Forests. Ma. Lear. 45(1), 5–32 (2001).
[17] Xuan, Cho. Malicious domain detection based on DNS query using
Machine Learning. International Journal of Emerging Trends in
Engineering Research. No. 8, 2020, pp. 1809–1814.7.
[18] OpenDNS public domain lists of domain names for training/testing
classifier. https://github.com/opendns/public-domain-lists [access
date 1/4/3018].
[19] Malware Domain List. http://www.malwaredomainlist.com/ [access
date 1/4/2020].
[20] Join the fight against phishing. https://www.phishtank.com/ [access date
1/4/2020].
[21] Alexa – Top Sites for Countries. https://www.alexa.com/topsites/countries
[access date 1/4/2020].
[22] Public-domain-lists. https://github.com/opendns/public-domain-lists
[access date 3/4/2020].
[23] APTNotes – Github Repo. https://github.com/kbandla/APTnotes
[access date 3/4/2020].
[24] APTNotes – Website. https://aptnotes.malwareconfig.com/Targeted
[access date 3/4/2020].
[25] Cyber Attacks Logbook (Kaspersky). https://apt.securelist.com/ [access
date 3/4/2020].
[26] DeepEnd Research: List of malware pcaps, samples, and indicators for
the Library of Malware Traffic Patterns.
https://contagiodump.blogspot.com/2013/08/deepend-research-list-of-malware-pcaps.html
[access date 3/4/2020].
Biography