

Deep Reinforcement Learning for Cyber Security


Thanh Thi Nguyen and Vijay Janapa Reddi, Member, IEEE

Manuscript received 21 July 2020; revised 16 January 2021 and 16 April 2021; accepted 18 October 2021. Date of publication 1 November 2021; date of current version 4 August 2023. (Corresponding author: Thanh Thi Nguyen.) Thanh Thi Nguyen is with the School of Information Technology, Deakin University, Melbourne Burwood Campus, Burwood, VIC 3125, Australia (e-mail: [email protected]). Vijay Janapa Reddi is with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138 USA (e-mail: [email protected]). Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3121870. Digital Object Identifier 10.1109/TNNLS.2021.3121870

Abstract— The scale of Internet-connected systems has increased considerably, and these systems are being exposed to cyberattacks more than ever. The complexity and dynamics of cyberattacks require protecting mechanisms to be responsive, adaptive, and scalable. Machine learning, or more specifically deep reinforcement learning (DRL), methods have been proposed widely to address these issues. By incorporating deep learning into traditional RL, DRL is highly capable of solving complex, dynamic, and especially high-dimensional cyber defense problems. This article presents a survey of DRL approaches developed for cyber security. We touch on different vital aspects, including DRL-based security methods for cyber–physical systems, autonomous intrusion detection techniques, and multiagent DRL-based game theory simulations for defense strategies against cyberattacks. Extensive discussions and future research directions on DRL-based cyber security are also given. We expect that this comprehensive review provides the foundations for and facilitates future studies on exploring the potential of emerging DRL to cope with increasingly complex cyber security problems.

Index Terms— Cyber defense, cyber security, cyberattacks, deep learning, deep reinforcement learning (DRL), Internet of Things (IoT), review, survey.

I. INTRODUCTION

INTERNET of Things (IoT) technologies have been employed broadly in many sectors, such as telecommunications, transportation, manufacturing, water and power management, healthcare, education, finance, government, and even entertainment. The convergence of various information and communication technology (ICT) tools in the IoT has boosted its functionalities and services to users to new levels. ICT has witnessed a remarkable development in terms of system design, network architecture, and intelligent devices in the last decade. For example, ICT has been advanced with the innovations of cognitive radio network (CRN) and 5G cellular network [1], [2], software-defined network (SDN) [3], cloud computing [4], (mobile) edge caching [5], [6], and fog computing [7]. Accompanying these developments is the increasing vulnerability to cyberattacks, which are defined as any type of offensive maneuver exercised by one or multiple computers to target computer information systems, network infrastructures, or personal computer devices. Cyberattacks may be instigated by economic competitors or state-sponsored attackers. There has been, thus, a critical need for the development of cyber security technologies to mitigate and eliminate impacts of these attacks [8].

Artificial intelligence (AI), especially machine learning (ML), has been applied to both attacking and defending in cyberspace. On the attacker side, ML is utilized to compromise defense strategies. On the cyber security side, ML is employed to put up robust resistance against security threats in order to adaptively prevent and minimize the impacts or damages that occurred. Among these ML applications, unsupervised and supervised learning methods have been used widely for intrusion detection [9]–[11], malware detection [12]–[14], cyber–physical attacks [15]–[17], and data privacy protection [18]. In principle, unsupervised methods explore the structure and patterns of data without using their labels, while supervised methods learn by examples based on data's labels. These methods, however, cannot provide dynamic and sequential responses against cyberattacks, especially new or constantly evolving threats. In addition, the detection and defending responses often take place after the attacks when traces of attacks become available for collecting and analyzing, and thus, proactive defense solutions are hindered. A statistical study shows that 62% of the attacks were recognized after they have caused significant damages to the cyber systems [19].

Reinforcement learning (RL), a branch of ML, is the closest form of human learning because it can learn by its own experience through exploring and exploiting the unknown environment. RL can model an autonomous agent to take sequential actions optimally without or with limited prior knowledge of the environment, and thus, it is particularly adaptable and useful in real-time and adversarial environments. With the power of function approximation and representation learning, deep learning has been incorporated into RL methods and enabled them to solve many complex problems [20]–[24]. The combination of deep learning and RL, therefore, indicates excellent suitability for cyber security applications where cyberattacks are increasingly sophisticated, rapid, and ubiquitous [25]–[28].

The emergence of DRL has actually witnessed great success in different fields, from video game domain, e.g., Atari [29], [30], game of Go [31], [32], real-time strategy game StarCraft II [33]–[36], 3-D multiplayer game Quake III Arena Capture the Flag [37], and teamwork game Dota 2 [38] to real-world applications, such as robotics [39], autonomous vehicles (AVs) [40], autonomous surgery [41], [42], natural language processing [43], biological data mining [44], and drug design [45]. DRL methods have also recently been applied to solve various problems in the IoT area.


For example, a DRL-based resource allocation framework that integrates networking, caching, and computing capabilities for smart city applications is proposed in [46]. The DRL algorithm, i.e., double dueling deep Q-network (DQN) [47], [48], is used to solve this problem because it involves a large-state space, which consists of the dynamic changing status of base stations, mobile edge caching (MEC) servers and caches. The framework is developed based on the programmable control principle of SDN and the caching capability of information-centric networking. Alternatively, Zhu et al. [49] explored MEC policies by using the context awareness concept that represents the user's context information and traffic pattern statistics. The use of AI technologies at the mobile network edges is advocated to intelligently exploit the operating environment and make the right decisions regarding what, where, and how to cache appropriate contents. To increase the caching performance, a DRL approach, i.e., the asynchronous advantage actor-critic (A3C) algorithm [50], is used to find an optimal policy aiming to maximize the offloading traffic.

Findings from our current survey show that applications of DRL in cyber environments are generally categorized under two perspectives: optimizing and enhancing the communications and networking capabilities of the IoT applications, e.g., [51]–[59], and defending against cyberattacks. This article focuses on the latter where DRL methods are used to solve cyber security problems with the presence of cyberattacks or threats. Section II provides a background of DRL methods, followed by a detailed survey of DRL applications in cyber security in Section III. We group these applications into three major categories, including DRL-based security solutions for cyber–physical systems (CPSs), autonomous intrusion detection techniques, and DRL-based game theory for cyber security. Section IV concludes this article with extensive discussions and future research directions on DRL for cyber security.

II. DEEP REINFORCEMENT LEARNING PRELIMINARY

Different from the other popular branch of ML, i.e., supervised methods learning by examples, RL characterizes an agent by creating its own learning experiences through interacting directly with the environment. RL is described by concepts of state, action, and reward (Fig. 1). It is a trial and error approach in which the agent takes action at each time step that causes two changes: the current state of the environment is changed to a new state, and the agent receives a reward or penalty from the environment. Given a state, the reward is a function that can tell the agent how good or bad an action is. Based on received rewards, the agent learns to take more good actions and gradually filter out bad actions.

Fig. 1. Interactions between the agent and its environment in RL, characterized by state, action, and reward. Based on the current state s and reward r, the agent will take an optimal action, leading to changes of state and reward. The agent then receives the next state s′ and reward r′ from the environment to determine the next action, making an iterative process of agent-environment interactions.

A popular RL method is Q-learning whose goal is to maximize the discounted cumulative reward based on the Bellman equation [60]

Q(st, at) = E[rt+1 + γ rt+2 + γ² rt+3 + · · · | st, at].  (1)

The discount factor γ ∈ [0, 1] manages the importance levels of future rewards. It is applied as a mathematical trick to analyzing the learning convergence. In practice, the discount is necessary because of partial observability or uncertainty of the stochastic environment.
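To make the update implied by (1) concrete, the following minimal sketch shows one-step tabular Q-learning with an epsilon-greedy policy. The toy environment, the state and action space sizes, and the reward structure are hypothetical placeholders used only for illustration.

```python
import numpy as np

# Minimal tabular Q-learning sketch illustrating the one-step update implied by (1).
# The environment, space sizes, and reward are hypothetical placeholders.
n_states, n_actions = 16, 4
gamma, alpha, epsilon = 0.9, 0.1, 0.1          # discount, learning rate, exploration rate
Q = np.zeros((n_states, n_actions))            # Q-table with |S| x |A| entries

def step(state, action):
    """Toy environment dynamics: random next state, reward 1 only for action 0."""
    next_state = np.random.randint(n_states)
    reward = 1.0 if action == 0 else 0.0
    return next_state, reward

state = 0
for t in range(10_000):
    # Epsilon-greedy action selection.
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # One-step temporal-difference target r + gamma * max_a' Q(s', a').
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    state = next_state
```

The Q-table in this sketch has n_states × n_actions entries, which is exactly the memory cost that becomes prohibitive as the state and action spaces grow, motivating the deep Q-network approach discussed next.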
Q-learning needs to use a lookup table or Q-table to store expected rewards (Q-values) of actions given a set of states. This requires a large memory when the state and action spaces increase. Real-world problems often involve continuous state or action space, and therefore, Q-learning is inefficient to solve these problems. Fortunately, deep learning has emerged as a powerful tool that is a great complement to traditional RL techniques. Deep learning methods have two typical capabilities, i.e., function approximation and representation learning, which help them to learn a compact low-dimensional representation of raw high-dimensional data effectively [61]. The combination of deep learning and RL was the research direction that Google DeepMind has initiated and pioneered. They proposed DQN with the use of a deep neural network (DNN) to enable Q-learning to deal with high-dimensional sensory inputs [29], [62].

Using DNNs to approximate the Q-function, however, is unstable due to the correlations among the sequences of observations and the correlations between the Q-values Q(s, a) and the target values Q(s′, a′). Mnih et al. [29] proposed the use of two novel techniques, i.e., experience replay memory and target network, to address this problem (Fig. 2). On the one hand, experience memory stores an extensive list of learning experience tuples (s, a, r, s′), which are obtained from the agent's interactions with the environment. The agent's learning process retrieves these experiences randomly to avoid the correlations of consecutive experiences. On the other hand, the target network is technically a copy of the estimation network, but its parameters are frozen and only updated after a period. For instance, the target network is updated after 10 000 updates of the estimation network, as demonstrated in [29]. DQN has made a breakthrough as it is the first time an RL agent can provide a human-level performance in playing 49 Atari games by using just raw image pixels of the game board.

As a value-based method, DQN takes a long training time and has limitations in solving problems with continuous action spaces. Value-based methods, in general, evaluate the goodness of an action given a state using the Q-value function. When the number of states or actions is large or infinite, they show inefficiency or even impracticality. Another type of RL, i.e., policy gradient methods, has solved this problem effectively.


Fig. 2. DQN architecture with the loss function described by L(β) = E[(r + γ maxa′ Q(s′, a′|β′) − Q(s, a|β))²], where β and β′ are parameters of the estimation and target DNNs, respectively. Each action taken by the agent will generate an experience, which consists of the current state s, action a, reward r, and next state s′. These learning experiences (samples) are stored in the experience replay memory, which are then retrieved randomly for a stable learning process.
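A minimal sketch of the loss in the Fig. 2 caption is given below, assuming a small fully connected network; the layer sizes, optimizer, and hyperparameters are illustrative choices rather than those of [29].

```python
import random
from collections import deque
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99

def make_net():
    # Small fully connected Q-network for illustration; DQN in [29] used a CNN over image frames.
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_net()                               # estimation network (parameters beta)
target_net = make_net()                          # target network (parameters beta')
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                   # experience replay memory of (s, a, r, s') tuples

def dqn_update(batch_size=32):
    s, a, r, s2 = zip(*random.sample(replay, batch_size))    # random sampling breaks correlations
    s, s2 = torch.tensor(s, dtype=torch.float32), torch.tensor(s2, dtype=torch.float32)
    a, r = torch.tensor(a, dtype=torch.int64), torch.tensor(r, dtype=torch.float32)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a; beta)
    with torch.no_grad():                                     # targets come from the frozen copy
        target = r + gamma * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)               # (r + gamma max_a' Q(s',a';beta') - Q(s,a;beta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Copy beta into beta' periodically, e.g., after every few thousand updates.
    target_net.load_state_dict(q_net.state_dict())
```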

These methods aim to derive actions directly by learning a policy π(s, a) that is a probability distribution over all possible actions. REINFORCE [64], vanilla policy gradient [65], trust region policy optimization (TRPO) [66] and proximal policy optimization (PPO) [67] are notable policy gradient methods. The gradient estimation, however, often suffers from a large fluctuation [68]. The combination of value-based and policy-gradient methods has been developed to aggregate the advantages and eradicate the disadvantages of these two methods. This kind of combination has constituted another type of RL, i.e., actor-critic methods. This structure comprises two components: an actor and a critic that can be both characterized by DNNs. The actor attempts to learn a policy by receiving feedback from the critic. This iterative process helps the actor improve its strategy and converge to an optimal policy. Deep deterministic policy gradient (DDPG) [69], distributed distributional DDPG (D4PG) [70], A3C [50], and unsupervised reinforcement and auxiliary learning (UNREAL) [71] are methods that utilize the actor-critic framework. An illustrative architecture of the popular algorithm A3C is presented in Fig. 3. A3C's structure consists of a hierarchy of a master learning agent (global) and individual learners (workers). Both the master agent and individual learners are modeled by DNNs with each having two outputs: one for the critic and another for the actor. The first output is a scalar value representing the expected reward of a given state V(s), while the second output is a vector of values representing a probability distribution over all possible actions π(s, a).

Fig. 3. Learning architecture of A3C, consisting of a global network and a number of worker agents. Each worker initially resets its parameters to those of the global network and interacts with its copy of the environment for learning. Gradients obtained from these individual learning processes will be used to update the global network asynchronously. This increases the learning speed and diversifies the experience learned by the global network as the experiences obtained by individual worker agents are independent.

The value loss function of the critic is specified by

L1 = Σ (R − V(s))²  (2)

where R = r + γ V(s′) is the discounted future reward. In addition, the actor is pursuing minimization of the following policy loss function:

L2 = −log(π(a|s)) ∗ A(s) − ϑ H(π)  (3)

where A(s) = R − V(s) is the estimated advantage function, and H(π) is the entropy term, which handles the exploration capability of the agent with the hyperparameter ϑ controlling the strength of the entropy regularization. The advantage function A(s) shows how advantageous the agent is when it is in a particular state. The learning process of A3C is asynchronous because each learner interacts with its separate environment and updates the master network independently. This process is iterated, and the master network is the one to use when the learning is finished.
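For a single transition, the per-sample terms of the losses in (2) and (3) can be computed as follows; the placeholder numbers stand in for the critic and actor outputs of a worker's network, and the entropy weight is an illustrative value.

```python
import numpy as np

# Sketch of the per-transition terms of the A3C losses in (2) and (3).
# V(s), V(s'), and the policy vector pi(s, .) would come from the two heads
# of a worker's DNN; here they are placeholder numbers for illustration.
gamma, entropy_coef = 0.99, 0.01        # discount and entropy weight (the hyperparameter in (3))

r = 1.0                                 # reward received
V_s, V_s_next = 0.5, 0.6                # critic outputs for s and s'
pi = np.array([0.7, 0.2, 0.1])          # actor output: distribution over actions
a = 0                                   # index of the action actually taken

R = r + gamma * V_s_next                # discounted future reward (bootstrapped)
advantage = R - V_s                     # A(s) = R - V(s)

value_loss = (R - V_s) ** 2                                          # term of Eq. (2), critic
entropy = -np.sum(pi * np.log(pi))                                   # H(pi), encourages exploration
policy_loss = -np.log(pi[a]) * advantage - entropy_coef * entropy    # Eq. (3), actor

print(value_loss, policy_loss)
```

In practice, each worker accumulates these terms over a short rollout and the resulting gradients are applied asynchronously to the global network, as shown in Fig. 3.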
Table I summarizes comparable features of value-based, policy-gradient, and actor-critic methods, and their typical example algorithms. Value-based methods are more sample efficient than policy-gradient methods because they are able to exploit data from other sources such as experts [72]. In DRL, a value function or a policy function is normally approximated by a universal function approximator, such as a (deep) neural network, which can take either discrete or continuous states as inputs. Therefore, modeling state spaces is more straightforward than dealing with action spaces in DRL. Value-based methods are suitable for problems with discrete action spaces as they evaluate every action explicitly and choose an action at each time step based on these evaluations. On the other hand, policy-gradient and actor-critic methods are more suitable for continuous action spaces because they describe the policy (a mapping between states and actions) as a probability distribution over actions. The continuity characteristic is the main difference between discrete and continuous action spaces. In a discrete action space, actions are characterized as a mutually exclusive set of options, while in a continuous action space, an action has a value from a certain range or boundary [73].

III. DRL IN CYBER SECURITY: SURVEY

A large number of applications of RL to various aspects of cyber security have been proposed in the literature, ranging from data privacy to critical infrastructure protection.


TABLE I
SUMMARY OF FEATURES OF DRL TYPES AND THEIR NOTABLE METHODS

Fig. 4. Different (sub)sections of the survey on DRL in cyber security.

Fig. 5. Dynamics of attack and defense in a CPS. The physical layer is often uncertain with disturbances w, while cyberattack a directly affects the cyber layer where a defense strategy d needs to be implemented. The dynamics of attack-defense characterized by θ(t, a, d) is injected into the conventional physical system to develop a cyber–physical co-modeling as presented in (4).
However, drawbacks of traditional RL have restricted its capability in solving complex and large-scale cyber security problems. The increasing number of connected IoT devices in recent years has led to a significant increase in the number of cyberattack instances and their complexity. The emergence of deep learning and its integration with RL has created a class of DRL methods that are able to detect and fight against sophisticated types of cyberattacks, such as falsified data injection to CPSs [86], deception attack to autonomous systems [93], distributed denial-of-service (DDoS) attacks [114], intrusions to host computers or networks [125], jamming [142], spoofing [157], malware [161], attacks in adversarial networking environments [168], and so on. This section provides a comprehensive survey of state-of-the-art DRL-powered solutions for cyber security, ranging from defense methods for CPSs to autonomous intrusion detection approaches, and game theory-based solutions.

The structure of the survey is presented in Fig. 4. We limit the survey to existing applications of DRL to cyber security. There are other topics in cyber security where DRL has not been applied to and they are therefore discussed in Section IV (Discussions and Future Research Directions). Those potential topics include a multiagent DRL approach to cyber security, combining host-based and network-based intrusion detection systems (IDSs), model-based DRL and combining model-free and model-based DRL methods for cyber security applications, investigating methods that can deal with continuous action spaces in cyber environments, offensive AI, deepfakes, ML poisoning, adversarial ML, human–machine teaming within a human-on-the-loop architecture, bit-and-piece DDoS attacks, and potential attacks by quantum physics-based powerful computers to crack encryption algorithms.

A. DRL-Based Security Methods for Cyber–Physical Systems

Investigations of defense methods for CPS against cyberattacks have received considerable attention and interests from the cyber security research community. CPS is a mechanism controlled by computer-based algorithms facilitated by Internet integration. This mechanism provides efficient management of distributed physical systems via the shared network. With the rapid development of the Internet and control technologies, CPSs have been used extensively in many areas, including manufacturing [76], health monitoring [77], [78], smart grid [79]–[81], and transportation [82], [83]. Being exposed widely to the Internet, these systems are increasingly vulnerable to cyberattacks [84]. In 2015, hackers attacked the control system of a steel mill in Germany by obtaining login credentials via the use of phishing emails. This attack caused a partial plant shutdown and resulted in damage of millions of dollars. Likewise, there was a costly cyberattack on a power grid in Ukraine in late December 2015 that disrupted electricity supply to a few hundred thousand end consumers [85].

In an effort to study cyberattacks on CPS, Feng et al. [85] characterized the cyber state dynamics by a mathematical model

ẋ(t) = f(t, x, u, w; θ(t, a, d)); x(t0) = x0  (4)

where x, u, and w represent the physical state, control inputs, and disturbances correspondingly (see Fig. 5). In addition, θ(t, a, d) describes the cyber state at time t with a and d referring to cyberattack and defense, respectively. The CPS defense problem is then modeled as a two-player zero-sum game by which utilities of players are summed up to zero at each time step.
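The structure of (4) and the zero-sum attack-defense interaction can be illustrated with a toy simulation; the dynamics f, the cyber-effect term θ, the action sets, and the reward below are invented placeholders, not the model of Feng et al. [85].

```python
import numpy as np

# Toy instantiation of the cyber-physical co-model in (4), integrated with Euler steps.
# The dynamics f, the attack/defense effect theta, and the reward are placeholders
# used only to illustrate the structure of the two-player zero-sum formulation.
dt, horizon = 0.1, 100

def theta(a, d):
    """Net effect of attack intensity a and defense intensity d on the cyber layer (placeholder)."""
    return a - d

def f(x, u, w, th):
    """Placeholder physical dynamics: stabilizing control plus cyber-layer effect."""
    return -x + u + w + th

x = 1.0                                   # physical state x(t0) = x0
for t in range(horizon):
    u = -0.5 * x                          # control input
    w = 0.01 * np.random.randn()          # disturbance
    a = np.random.choice([0.0, 0.5])      # attacker's action (attack intensity)
    d = np.random.choice([0.0, 0.5])      # defender's action (defense intensity)
    x = x + dt * f(x, u, w, theta(a, d))  # Euler step of (4)
    defender_reward = -abs(x)             # defender wants the state well regulated
    attacker_reward = -defender_reward    # zero-sum: utilities sum to zero each step
```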


The defender is represented by an actor-critic DRL algorithm. Simulation results demonstrate that the proposed method in [85] can learn an optimal strategy to timely and accurately defend the CPS from unknown cyberattacks.

Applications of CPS in critical safety domains such as autonomous automotive, chemical process, automatic pilot avionics, and smart grid require a certain correctness level. Akazaki et al. [86] proposed the use of DRL, i.e., double DQN (DDQN) and A3C algorithms, to find falsified inputs (counterexamples) for CPS models. This allows for effective yet automatic detection of CPS defects. Due to the infinite state space of CPS models, conventional methods, such as simulated annealing [87] and cross entropy [88], were found inefficient. Experimental results show the superiority of the use of DRL algorithms against those methods in terms of the smaller number of simulation runs. This leads to a more practical detection process for CPS models' defects despite the great complexity of CPS's software and physical systems.

AVs operating in the future smart cities require a robust processing unit of intravehicle sensors, including camera, radar, roadside smart sensors, and intervehicle beaconing. Such reliance is vulnerable to cyber–physical attacks aiming to get control of AVs by manipulating the sensory data and affecting the reliability of the system, e.g., increasing accident risks or reducing the vehicular flow. Ferdowsi et al. [89] examined the scenarios, where the attackers manage to interject faulty data to the AV's sensor readings while the AV (the defender) needs to deal with that problem to control AV robustly. Specifically, the car following model [90] is considered in which the focus is on autonomous control of a car that follows closely another car. The defender aims to learn the leading vehicle's speed based on sensor readings. The attacker's objective is to mislead the following vehicle to a deviation from the optimal safe spacing. The interactions between the attacker and defender are characterized by a game-theoretic problem. The interactive game structure and its DRL solution are diagramed in Fig. 6. Instead of directly deriving a solution based on the mixed-strategy Nash equilibrium analytics, the authors proposed the use of DRL to solve this dynamic game. Long short-term memory (LSTM) [91] is used to approximate the Q-function for both defending and attacking agents as it can capture the temporal dynamics of the environment.

Fig. 6. Architecture of the adversarial DRL algorithm for robust AV control. A DNN consisting of an LSTM, a fully connected layer (FCL), and regression is used to learn long-term dependencies within large datasets, which contain the outcomes of the players' past interactions. The DNN can approximate the Q functions to find optimal actions for players, i.e., AV (defender) and especially attacker, who seeks to inject faulty data to AV sensor readings.

Likewise, Rasheed et al. [92] introduced an adversarial DRL method that integrates LSTM and generative adversarial network (GAN) model to cope with data infusion attacks in AVs equipped with 5G communication links. The adversary attempts to inject faulty data to impact safe distance spacing between AVs, while the AVs manage to minimize this deviation. LSTM is used as a generator, while a convolutional neural network (CNN) is used as a discriminator to resemble a GAN structure, which is able to capture previous temporal actions of the AV and attacker as well as previous distance deviations. A DRL algorithm is proposed based on these observations to select optimal actions (suitable velocity) for the AV to avoid collisions and accidents.

Autonomous systems can be vulnerable to inefficiency from various sources, such as noises in communication channels, sensor failures, errors in sensor measurement readings, packet errors, and especially cyberattacks. Deception attack to autonomous systems is widespread as it is initiated by an adversary whose effort is to inject noises to the communication channels between sensors and the command center. This kind of attack leads to corrupted information being sent to the command center and eventually degrades the system's performance. Gupta and Yang [93] studied the ways to increase the robustness of autonomous systems by allowing the system to learn using adversarial examples. The problem is formulated as a zero-sum game with the players to be the command center (observer) and the adversary. The inverted pendulum problem from Roboschool [67] is used as a simulation environment. The TRPO algorithm is employed to design an observer that can reliably detect adversarial attacks in terms of measurement corruption and automatically mitigate their effects.

B. DRL-Based Intrusion Detection Systems

To detect intrusions, security experts conventionally need to observe and examine audit data, e.g., application traces, network traffic flow, and user command data, to differentiate between normal and abnormal behaviors. However, the volume of audit data surges rapidly when the network size is enlarged. This makes manual detection difficult or even impossible. An IDS is a software or hardware platform installed on host computers or network equipment to detect and report to the administrator abnormal or malicious activities by analyzing the audit data. An intrusion detection and prevention system may be able to take appropriate actions immediately to reduce the impacts of the malevolent activities.

Depending on different types of audit data, IDSs are grouped into two categories: host-based or network-based IDS. Host-based IDS typically observes and analyses the host computer's log files or settings to discover anomalous behaviors. Network-based IDS relies on a sniffer to collect transmitting packets in the network and examines the traffic data for intrusion detection. Host-based systems normally lack cross-platform support and implementing them requires knowledge of the host operating systems and configurations. Network-based systems aim to monitor traffic over specific network segments, and they are independent of the operating systems and more portable than host-based systems. Implementing network-based systems is, thus, easier and they offer more monitoring power than host-based systems.

However, a network-based IDS may have difficulty in handling heavy traffic and high-speed networks because it must examine every packet passing through its network segment.

Regardless of the IDS type, two common detection methods are used: signature-based and anomaly-based detection. Signature detection involves the storage of patterns of known attacks and comparing characteristics of possible attacks to those in the database. Anomaly detection observes the normal behaviors of the system and alerts the administrator if any activities are found deviated from normality, for instance, the unexpected increase of traffic rate, i.e., number of IP packets per second. ML techniques, including unsupervised clustering and supervised classification methods, have been used widely to build adaptive IDSs [94]–[98]. These methods, e.g., neural networks [99], k-nearest neighbors [99], [100], support vector machine (SVM) [99], [101], random forest [102], and recently deep learning [103]–[105], however, normally rely on fixed features of existing cyberattacks so that they are deficient in detecting new or deformed attacks. The lack of prompt responses to dynamic intrusions also leads to ineffective solutions produced by unsupervised or supervised techniques. In this regard, RL methods have been demonstrated effectively in various IDS applications [106]. Sections III-B.1 and III-B.2 review the use of DRL methods in both host-based and network-based IDSs.

1) Host-Based IDS: As the volume of audit data and complexity of intrusion behaviors increase, adaptive intrusion detection models demonstrate limited effectiveness because they can only handle temporally isolated labeled or unlabeled data. In practice, many complex intrusions comprise temporal sequences of dynamic behaviors. Xu and Xie [107] proposed an RL-based IDS that can handle this problem. System call trace data are fed into a Markov reward process whose state value can be used to detect abnormal temporal behaviors of host processes. The intrusion detection problem is, thus, converted to a state value prediction task of the Markov chains. The linear temporal difference (TD) RL algorithm [108] is used as the state value prediction model where its outcomes are compared with a predetermined threshold to distinguish normal traces and attack traces. Instead of using the errors between real values and estimated ones, the TD learning algorithm uses the differences between successive approximations to update the state value function. Experimental results obtained from using system call trace data show the dominance of the proposed RL-based IDS in terms of higher accuracy and lower computational costs compared with SVM, hidden Markov model, and other ML or data mining methods. The proposed method based on the linear basis functions, however, has a shortcoming when sequential intrusion behaviors are highly nonlinear. Therefore, a kernel-based RL approach using least-squares TD [109] was suggested for intrusion detection in [110] and [111]. Relying on the kernel methods, the generalization capability of TD RL is enhanced, especially in high-dimensional and nonlinear feature spaces. The kernel least-squares TD algorithm is, therefore, able to predict anomaly probabilities accurately, which contributes to improving the detection performance of IDS, especially when dealing with multistage cyberattacks.
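In the spirit of the host-based approach of Xu and Xie [107] described above, the following sketch learns state values over a discretized system-call state space from normal traces with TD(0) and flags traces whose average value falls below a threshold. The trace encoding, reward scheme, and threshold are hypothetical assumptions, not the exact design of [107].

```python
import numpy as np

# Sketch of TD-based host intrusion detection: learn state values from normal
# system-call traces, then score new traces against a threshold.
n_states, gamma, alpha = 1024, 0.9, 0.05
V = np.zeros(n_states)

def encode(call):
    """Map a system-call identifier to a small discrete state (placeholder encoding)."""
    return hash(call) % n_states

def td_train(normal_traces):
    for trace in normal_traces:                    # each trace is a list of system calls
        states = [encode(c) for c in trace]
        for s, s_next in zip(states, states[1:]):
            reward = 1.0                           # transitions seen in normal traces get positive reward
            # TD(0) uses successive approximations rather than full returns.
            V[s] += alpha * (reward + gamma * V[s_next] - V[s])

def mean_value(trace):
    states = [encode(c) for c in trace]
    return float(np.mean(V[states]))

def is_attack(trace, threshold=0.5):
    """Traces whose learned value stays below the threshold are flagged as anomalous."""
    return mean_value(trace) < threshold
```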
2) Network-Based IDS: Deokar and Hazarnis [112] pointed out the drawbacks of both anomaly-based and signature-based detection methods. On the one hand, anomaly detection has a high false alarm rate because it may categorize activities, which users rarely perform, as an anomaly. On the other hand, signature detection cannot discover new types of attacks as it uses a database of patterns of well-known attacks. The authors, therefore, proposed an IDS that can identify known and unknown attacks effectively by combining features of both anomaly and signature detection through the use of log files. The proposed IDS is based on a collaboration of RL methods, association rule learning, and log correlation techniques. RL gives a reward (or penalty) to the system when it selects log files that contain (or do not contain) anomalies or any signs of attack. This procedure enables the system to choose more appropriate log files in searching for traces of the attack.

One of the most difficult challenges in the current Internet is dealing with the DDoS threat, which is a DoS attack but has a distributed nature, occurring with a large traffic volume and compromising a large number of hosts. Malialis and Kudenko [113] and Malialis [114] initially introduced the multiagent router throttling method based on the SARSA algorithm [115] to address the DDoS attacks by learning multiple agents to rate-limit or throttle traffic toward a victim server. That method, however, has a limited capability in terms of scalability. They, therefore, further proposed the coordinated team learning design to the original multiagent router throttling based on the divide-and-conquer paradigm to eliminate the mentioned drawback. The proposed approach integrates three mechanisms, namely, task decomposition, hierarchical team-based communication, and team rewards, involving multiple defensive nodes across different locations to coordinately stop or reduce the flood of DDoS attacks. A network emulator is developed based on the work of Yau et al. [116] to evaluate throttling approaches. Simulation results show that the resilience and adaptability of the proposed method are superior to its competing methods, i.e., baseline router throttling and additive-increase/multiplicative-decrease throttling algorithms [116], in various scenarios with different attack dynamics. The scalability of the proposed method is successfully experimented with up to 100 RL agents, which has a great potential to be deployed in a large Internet service provider network.

Alternatively, Bhosale et al. [117] proposed a multiagent intelligent system [118] using RL and influence diagram [119] to enable quick responses against the complex attacks. Each agent learns its policy based on the local database and information received from other agents, i.e., decisions and events. Shamshirband et al. [120], on the other hand, introduced an intrusion detection and prevention system for wireless sensor networks (WSNs) based on a game theory approach and employed a fuzzy Q-learning algorithm [121], [122] to obtain optimal policies for the players. Sink nodes, a base station, and an attacker constitute a three-player game where sink nodes and base station are coordinated to derive a defense strategy against the DDoS attacks, particularly in the application layer.


The IDS detects future attacks based on the fuzzy Q-learning algorithm that takes part in two phases: detection and defense (Fig. 7). The game commences when the attacker sends an overwhelming volume of flooding packets beyond a specific threshold as a DDoS attack to a victim node in the WSN. Using the low energy adaptive clustering hierarchy (LEACH), which is a prominent WSN protocol [123], the performance of the proposed method is evaluated and compared with that of existing soft computing methods. The results show the efficacy and viability of the proposed method in terms of detection accuracy, energy consumption, and network lifetime.

Fig. 7. Two-phase intrusion detection and prevention system based on a game theory approach and fuzzy Q-learning. In Phase 1, the sink node uses fuzzy Q-learning to detect anomalies caused by the attacker to victim nodes. The malicious information is preprocessed and checked against a threshold by the sink node before passing to Phase 2, where the base station also employs fuzzy Q-learning to select optimal defense actions.

In another approach, Caminero et al. [124] proposed a model, namely, adversarial environment using RL, to incorporate the RL theory in implementing a classifier for network intrusion detection. A simulated environment is created, where random samples drawn from a labeled network intrusion dataset are treated as RL states. The adversarial strategy is employed to deal with unbalanced datasets as it helps to avoid training bias via an oversampling mechanism, and thus, decreases classification errors for underrepresented classes. Likewise, a study in [125] applied DRL methods, such as DQN, DDQN, policy gradient, and actor-critic models, for network intrusion detection. With several adjustments and adaptations, DRL algorithms can be used as a supervised approach to classifying labeled intrusion data. DRL policy network is simple and fast, which is suitable for online learning and rapid responses in modern data networks with evolving environments. Results obtained on two intrusion detection datasets show DDQN is the best algorithm among the four employed DRL algorithms. DDQN's performance is equivalent and even better than many traditional ML methods in some cases. Recently, Saeed et al. [126] examined the existing multiagent IDS architectures, including several approaches that utilized RL algorithms. The adaptation capability of RL methods can help IDS to respond effectively to changes in the environments. However, the optimal solution is not guaranteed because the convergence of a multiagent system is hard to obtain.
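The idea of using DRL as a classifier over labeled intrusion data, as in [124] and [125], can be sketched as follows: each labeled sample is treated as a state, the predicted class is the action, and the reward is positive for a correct prediction. The network, feature dimension, class count, and training details below are assumptions for illustration, not the designs of those studies.

```python
import torch
import torch.nn as nn

# Sketch of DRL-as-classifier for labeled intrusion data: a sample's feature
# vector is the state, the predicted class is the action, and the reward is
# +1 for a correct label and -1 otherwise.
n_features, n_classes = 41, 5          # e.g., NSL-KDD-like feature/class counts (illustrative)
policy = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_classes))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def train_batch(x, y):
    """x: (batch, n_features) float tensor; y: (batch,) long tensor of true labels."""
    probs = torch.softmax(policy(x), dim=1)
    dist = torch.distributions.Categorical(probs)
    actions = dist.sample()                          # predicted classes (the agent's actions)
    rewards = (actions == y).float() * 2 - 1         # +1 correct, -1 incorrect
    loss = -(dist.log_prob(actions) * rewards).mean()  # REINFORCE-style update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def predict(x):
    return policy(x).argmax(dim=1)
```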
C. DRL-Based Game Theory for Cyber Security

Traditional cyber security methods, such as firewall, antivirus software, or intrusion detection are normally passive, unilateral, and lagging behind dynamic attacks. Cyberspace involves various cyber components, and thus, reliable cyber security requires the consideration of interactions among these components. Specifically, the security policy applied to a component has a certain impact on the decisions taken by other components. Therefore, the decision space increases considerably, with many what-if scenarios when the system is large. Game theory has been demonstrated effectively in solving such large-scale problems because it can examine many scenarios to derive the best policy for each player [127]–[131]. The utility or payoff of a game player depends not only on its actions but also on other players' activities. In other words, the efficacy of cyber defending strategies must take into account the attacker's strategies and other network users' behaviors. Game theory can model the conflict and cooperation between intelligent decision-makers, resembling activities of cyber security problems, which involve attackers and defenders. This resemblance has enabled game theory to mathematically describe and analyze the complex behaviors of multiple competitive mechanisms. In the following, we present game-theoretic models involving multiple DRL agents that characterize cyber security problems in different attacking scenarios, including jamming, spoofing, malware, and attacks in adversarial environments.

1) Jamming Attacks: Jamming attacks can be considered as a special case of DoS attacks, which are defined as any event that diminishes or eradicates a network's capacity to execute its expected function [132]–[134]. Jamming is a serious attack in networking and has attracted a great interest of researchers who used ML or especially RL to address this problem, e.g., [135]–[141]. The recent development of deep learning has facilitated the use of DRL for jamming handling or mitigation. Xiao et al. [142] studied security challenges of the MEC systems and proposed an RL-based solution to provide secure offloading to the edge nodes against jamming attacks. MEC is a technique that allows cloud computing functions to take place at the edge nodes of a cellular network or generally of any network. This technology helps to decrease network traffic and reduce overhead and latency when users request to access contents that have been cached in the edges closer to the cellular customer. MEC systems, however, are vulnerable to cyberattacks because they are physically located closer to users and attackers with less secure protocols compared with cloud servers or database centers. In [142], the RL methodology is used to select the defense levels and important parameters, such as offloading rate and time, transmit channel, and power. As the network state space is large, the authors proposed the use of DQN to handle high-dimensional data, as shown in Fig. 8. DQN uses a CNN to approximate the Q-function that requires high computational complexity and memory. To mitigate this disadvantage a transfer learning method named hotbooting technique is used. The hotbooting method helps to initialize weights of CNN more efficiently by using experiences that have been learned in similar circumstances. This reduces the learning time by avoiding random explorations at the start of each episode. Simulation results demonstrate that the proposed method is effective in terms of enhancing the security and user privacy of MEC systems and it can protect the systems in confronting with different types of smart attacks with low overhead.
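Following the description of Fig. 8, a sketch of how such an agent might pick a composite action (offloading rate, channel, and power level) epsilon-greedily is given below. The discretizations, the placeholder Q-function, and the suggested state and reward are assumptions for illustration, not the exact design of [142].

```python
import itertools
import numpy as np

# Sketch of composite action selection for secure offloading: the agent jointly
# picks an offloading rate, a transmit channel, and a power level.
offload_rates = [0.0, 0.25, 0.5, 0.75, 1.0]
channels = list(range(8))
power_levels = [0.1, 0.5, 1.0]
actions = list(itertools.product(offload_rates, channels, power_levels))

def q_values(state):
    """Placeholder for the trained CNN of the DQN; returns one Q-value per composite action."""
    return np.random.normal(size=len(actions))

def select_action(state, epsilon=0.1):
    # Epsilon-greedy selection over the joint (rate, channel, power) space.
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]
    return actions[int(np.argmax(q_values(state)))]

# The state could hold, e.g., observed jamming power, channel gains, and battery
# level; the reward could favor secrecy and SINR while penalizing energy and latency.
state = [0.3, 0.8, 0.6]
rate, channel, power = select_action(state)
```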

Fig. 8. Secure offloading method in MEC based on DQN with hotbooting technique. The DQN agent's actions are to find optimal parameters, such as offloading rate, power, and channel for the mobile device to offload the traces to the edge node accordingly. The attackers may deploy jamming, spoofing, DoS, or smart attacks to disrupt this process. By interacting with the edge caching systems, the agent can evaluate the reward of the previous action and obtain a new state, enabling it to select the next optimal action.

On the other hand, Aref et al. [143] introduced a multiagent RL method to deal with antijamming communications in wideband autonomous cognitive radios (WACRs). WACRs are advanced radios that can sense the states of the radio frequency spectrum and network, and autonomously optimize their operating mode corresponding to the perceived states. Cognitive communication protocols, however, may struggle when there are unintentional interferences or malicious users who attempt to interrupt reliable communications by deliberate jamming. Each radio's effort is to occupy the available common wideband spectrum as much as possible and avoid the sweeping signal of a jammer that affects the entire spectrum band. The multiagent RL approach proposed in [143] learns an optimal policy for each radio to select appropriate subband, aiming to avoid jamming signals and interruptions from other radios. Comparative studies show the significant dominance of the proposed method against a random policy. A drawback of the proposed method is the assumption that the jammer uses a fixed strategy in responding to the WACRs strategies, although the jammer may be able to perform adaptive jamming with the cognitive radio technology. In [144], when the current spectrum subband is interfered with by a jammer, Q-learning is used to optimally select a new subband that allows uninterrupted transmission as long as possible. The reward structure of the Q-learning agent is defined as the amount of time that the jammer or interferer takes to interfere with the WACR transmission. Experimental results using the hardware-in-the-loop prototype simulation show that the agent can detect the jamming patterns and successfully learns an optimal subband selection policy for jamming avoidance. The obvious drawback of this method is the use of a Q-table with a limited number of environment states.

The access right to spectrum (or more generally resources) is the main difference between CRNs and traditional wireless technologies. RL in general or Q-learning has been investigated to produce optimal policy for cognitive radio nodes to interact with their radio frequency environment [145]. Attar et al. [146] examined RL solutions against attacks on both CRN architectures, i.e., infrastructure-based, e.g., the IEEE 802.22 standard, and infrastructure-less, e.g., ad hoc CRN. The adversaries may attempt to manipulate the spectrum sensing process and cause the main sources of security threats in infrastructure-less CRNs. The external adversary node is not part of the CRN, but such attackers can affect the operation of an ad hoc CRN via jamming attacks. In an infrastructure-based CRN, an exogenous attacker can mount incumbent emulation or perform sensor-jamming attacks. The attacker can increase the local false-alarm probability to affect the decision of the IEEE 802.22 base station about the availability of a given band. A jamming attack can have both short-term and long-term effects. Wang et al. [147] developed a game-theoretic framework to battle against jamming in CRNs where each radio observes the status and quality of available channels and the strategy of jammers to make decisions accordingly. The CRN can learn optimal channel utilization strategy using minimax-Q learning policy [148], solving the problems of how many channels to use for data and to control packets along with the channel switching strategy. The performance of minimax-Q learning represented via spectrum-efficient throughput is superior to the myopic learning method, which gives high priority to the immediate payoff and ignores the environment dynamics and the attackers' cognitive capability.

In CRNs, secondary users (SUs) are obliged to avoid disruptions to communications of primary users (PUs) and can only gain access to the licensed spectrum when it is not occupied by PUs. Jamming attacks are emergent in CRNs due to the opportunistic access of SUs as well as the appearance of smart jammers, which can detect the transmission strategy of SUs. Xiao et al. [149] studied the scenarios where a smart jammer aims to disrupt the SUs, rather than PUs. The SUs and jammer, therefore, must sense the channel to check the presence of PUs before making their decisions. The constructed scenarios consist of a secondary source node supported by relay nodes to transmit data packets to secondary receiving nodes. The smart jammer can learn quickly the frequency and transmission power of SUs, while SUs do not have full knowledge of the underlying dynamic environment. The interactions between SUs and jammer are modeled as a cooperative transmission power control game, and the optimal strategy for SUs is derived based on the Stackelberg equilibrium [150]. The aim of SU players is to select appropriate transmission powers to efficiently send data messages in the presence of jamming attacks. The jammer's utility gain is the SUs' loss and vice versa. RL methods, i.e., Q-learning [60] and WoLF-PHC [151], are used to model SUs as intelligent agents for coping with the smart jammer. WoLF-PHC stands for the combination of the win or learn fast algorithm and the policy hill-climbing method. It uses a varying learning rate to foster convergence to the game equilibrium by adjusting the learning speed [151]. Simulation results show the improvement in the antijamming performance of the proposed method in terms of the signal-to-interference-plus-noise ratio (SINR). The optimal strategy achieved from the Stackelberg game can minimize the damage created by the jammer in the worst case scenario.
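The minimax-Q formulation used by Wang et al. [147] keeps a Q-function indexed by both players' actions. The exact algorithm computes a mixed defender strategy with a small linear program at every update; the sketch below substitutes the simpler pure-strategy maximin value, which is a stated simplification, and all sizes and rewards are illustrative.

```python
import numpy as np

# Simplified minimax-Q sketch for a channel-selection game against a jammer.
# Q is indexed by (state, own action, jammer action). True minimax-Q [148]
# solves a linear program for a mixed strategy; here the pure-strategy
# maximin value is used instead as a simplification.
n_states, n_actions, n_jammer = 4, 6, 6        # e.g., six channels per player (illustrative)
gamma, alpha = 0.95, 0.1
Q = np.zeros((n_states, n_actions, n_jammer))

def state_value(s):
    """Maximin over pure strategies: max_a min_o Q(s, a, o)."""
    return np.max(np.min(Q[s], axis=1))

def update(s, a, o, r, s_next):
    Q[s, a, o] += alpha * (r + gamma * state_value(s_next) - Q[s, a, o])

# One illustrative interaction: reward 1 if the chosen channel avoids the jammer.
s, a, o = 0, 2, 3
r = 1.0 if a != o else -1.0
update(s, a, o, r, s_next=1)
```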


Recently, Han et al. [152] introduced an antijamming system for CRNs using the DQN algorithm based on a frequency-spatial antijamming communication game. The game simulates an environment of numerous jammers that inject jamming signals to disturb the ongoing transmissions of SUs. The SU should not interfere with the communications of PUs and must defeat smart jammers. This communication system is 2-D that utilizes both frequency hopping and user mobility. The RL state is the radio environment consisting of PUs, SUs, jammers, and serving base stations/access points. The DQN is used to derive an optimal frequency hopping policy that determines whether the SU should leave an area of heavy jamming or choose a channel to send signals. Experimental results show the superiority of the DQN-based method against the Q-learning-based strategy in terms of faster convergence rate, increased SINR, lower cost of defense, and improved utility of the SU. DQN with the core component CNN helps to speed the learning process of the system, which has a large number of frequency channels, compared with the benchmark Q-learning method.

To improve the work of Han et al. [152], Liu et al. [153] also proposed an antijamming communication system using a DRL method but having different and more extensive contributions. Specifically, Liu et al. [153] used the raw spectrum information with temporal features, known as spectrum waterfall [154] to characterize the environment state, rather than using the SINR and PU occupancy as in [152]. Because of this, Liu et al.'s [153] model does not necessitate prior knowledge about the jamming patterns and parameters of the jammer but rather uses the local observation data. This prevents the model from the loss of information and facilitates its adaptability to a dynamic environment. Furthermore, Liu et al.'s [153] work does not assume that the jammer needs to take the same channel-slot transmission structure with the users as in [152]. The recursive CNN is utilized to deal with a complex infinite environment state represented by spectrum waterfall, which has a recursion characteristic. The model is tested using several jamming scenarios, which include sweeping jamming, comb jamming, dynamic jamming, and intelligent comb jamming. A disadvantage of both Han et al.'s and Liu et al.'s methods is that they can only derive an optimal policy for one user, which inspires a future research direction focusing on multiple users' scenarios.

2) Spoofing Attacks: Spoofing attacks are popular in wireless networks where the attacker claims to be another node using the faked identity, such as media access control, to gain access to the network illegitimately. This illegal penetration may lead to man-in-the-middle, or DoS attacks [155]. Xiao et al. [156], [157] modeled the interactions between the legitimate receiver and spoofers as a zero-sum authentication game and utilized Q-learning and Dyna-Q [158] algorithms to address the spoofing detection problem. The utility of the receiver or spoofer is computed based on the Bayesian risk, which is the expected payoff in the spoofing detection. The receiver aims to select the optimal test threshold in PHY-layer spoofing detection while the spoofer needs to select an optimal attacking frequency. To prevent collisions, spoofers are cooperative to attack the receiver. Simulation and experimental results show the improved performance of the proposed methods against the benchmark method with a fixed test threshold. A disadvantage of the proposed approaches is that both action and state spaces are quantized into discrete levels, bounded within a specified interval, which may lead to locally optimal solutions.
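Dyna-Q [158], used above for spoofing detection, augments each real Q-learning update with planning updates replayed from a learned model of past transitions. The sketch below is a generic tabular version with hypothetical states and actions, not the authentication game of [156], [157].

```python
import random
from collections import defaultdict

# Minimal tabular Dyna-Q sketch: each real interaction is followed by a number
# of planning updates replayed from a learned model of observed transitions.
gamma, alpha, planning_steps = 0.9, 0.1, 10
Q = defaultdict(float)                 # Q[(state, action)]
model = {}                             # model[(state, action)] = (reward, next_state)

def q_update(s, a, r, s2, actions):
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s2, actions):
    q_update(s, a, r, s2, actions)     # direct RL update from real experience
    model[(s, a)] = (r, s2)            # learn the one-step model
    for _ in range(planning_steps):    # planning: replay simulated experience
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps2, actions)

# Example usage with two hypothetical test-threshold actions for a detector.
actions = [0, 1]
dyna_q_step(s="low_risk", a=0, r=1.0, s2="low_risk", actions=actions)
```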
3) Malware Attacks: One of the most challenging malware of mobile devices is the zero-day attacks, which exploit publicly unknown security vulnerabilities, and until they are contained or mitigated, hackers might have already caused adverse effects on computer programs, data, or networks [159], [160]. To avoid such attacks, the traces or log data produced by the applications need to be processed in real time. With limited computational power, battery life, and radio bandwidth, mobile devices often offload specific malware detection tasks to security servers at the cloud for processing. The security server with powerful computational resources and a more updated malware database can process the tasks quicker, more accurately, and then send a detection report back to mobile devices with less delay. The offloading process is, therefore, a key factor affecting the cloud-based malware detection performance. For example, if too many tasks are offloaded to the cloud server, there would be radio network congestion that can lead to long detection delays. Wan et al. [161] enhanced the mobile offloading performance by improving the previously proposed game model in [162]. The Q-learning approach used in [162] to select optimal offloading rate suffers the curse of high dimensionality when the network size is increased, or a large number of feasible offloading rates is available for selection. Wan et al. [161], thus, advocated the use of hotbooting Q-learning and DQN and showed the performance improvement in terms of malware detection accuracy and speed compared with the standard Q-learning. The cloud-based malware detection approach using DQN for selecting the optimal offloading rate is shown in Fig. 9.

Fig. 9. Cloud-based malware detection using DQN, where the stochastic gradient descent (SGD) method is used to update weights of the CNN. Malicious detection is performed in the cloud server with more powerful computational resources than mobile devices. The DQN agent helps to select optimal task offloading rates for mobile devices to avoid network congestion and detection delay. By observing the network status and evaluating utility based on malware detection reports from the server, the agent can formulate states and rewards, which are used to generate a sequence of optimal actions, i.e., dynamic offloading rates.

4) Attacks in Adversarial Environments: Traditional networks facilitate the direct communications between client application and server where each network has its switch control that makes the network reconfiguration task time-consuming and inefficient. This method is also disadvantageous because the requested data may need to be retrieved from more than one database involving multiple servers. SDN is a next-generation networking technology as it can reconfigure the network adaptively. With the control being programmable with a global view of the network architecture, SDN can manage and optimize network resources effectively.

RL has been demonstrated broadly in the literature as a robust method for SDN controlling, e.g., [163]–[167].

Although RL's success in SDN controlling is abundant, the attacker may be able to falsify the defender's training process if it is aware of the network control algorithm in an adversarial environment. To deal with this problem, Han et al. [168] proposed the use of adversarial RL to build an autonomous defense system for SDN. The attacker selects important nodes in the network to compromise, for example, nodes in the backbone network or the target subnet. By propagating through the network, the attacker attempts to eventually compromise the critical server, while the defender prevents the server from being compromised and preserve as many unaffected nodes as possible. To achieve those goals, the RL defender takes four possible actions, consisting of "isolating," "patching," "reconnecting," and "migrating." Two types of DRL agents are trained to model defenders, i.e., DDQN and A3C, to select appropriate actions given different network states. The reward is characterized based on the status of the critical server, the number of preserved nodes, migration costs, and the validity of the actions taken. That study considered the scenarios where attackers can penetrate the learning process of RL defenders by flipping reward signs or manipulating states. These causative attacks poison the defender's training process and cause it to perform suboptimal actions. The adversarial training approach is applied to reduce the impact of poisoning attacks with its eminent performance demonstrated via several experiments using the popular network emulator Mininet [169].
adversarial learning capability, and thus they are outperformed
In an adversarial environment, the defender may not know
by Q-learning and Monte Carlo learning techniques. This
the private details of the attacker, such as the type of
simulation has a disadvantage that simplifies the real-world
attack, attacking target, frequency, and location. Therefore,
cyber security problem into a game of only two players with
the defender, for example, may allocate substantial resources
only one asset. In real practice, there can be multiple hackers
to protect an asset that is not a target of the attacker. The
simultaneously penetrating a server that holds valuable data.
defender needs to dynamically reconfigure defense strate-
In addition, a network may contain useful data in different
gies to increase the complexity and cost for the intruder.
locations instead of in a single location as simulated.
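The defense setting described for [168] can be pictured with a small Gym-style environment sketch. The topology, the attacker propagation model, and the reward weights below are illustrative assumptions, not the environment used in that study; a DDQN or A3C agent would map such observations to defense actions.

```python
# Illustrative Gym-style sketch of an SDN autonomous-defense environment in the
# spirit of [168]. Topology, attack propagation, and reward weights are assumed
# for illustration only.
import numpy as np

ACTIONS = ["isolate", "patch", "reconnect", "migrate"]

class SDNDefenseEnv:
    def __init__(self, n_nodes=12, critical=0, seed=0):
        self.n = n_nodes
        self.critical = critical            # index of the critical server
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.compromised = np.zeros(self.n, dtype=bool)
        self.isolated = np.zeros(self.n, dtype=bool)
        self.compromised[self.rng.integers(1, self.n)] = True  # attacker's entry node
        return self._obs()

    def _obs(self):
        # The defender observes which nodes appear compromised or isolated.
        return np.concatenate([self.compromised, self.isolated]).astype(np.float32)

    def step(self, action, target):
        name = ACTIONS[action]
        migration_cost = 0.0
        if name == "isolate":
            self.isolated[target] = True
        elif name == "patch":
            self.compromised[target] = False
        elif name == "reconnect":
            self.isolated[target] = False
        elif name == "migrate":
            migration_cost = 1.0            # moving the critical service is expensive
            self.compromised[self.critical] = False
        # Simple attacker model: the infection spreads to one random non-isolated node.
        if self.compromised.any():
            cand = self.rng.integers(0, self.n)
            if not self.isolated[cand]:
                self.compromised[cand] = True
        done = bool(self.compromised[self.critical])
        preserved = int((~self.compromised & ~self.isolated).sum())
        reward = preserved - migration_cost - (50.0 if done else 0.0)
        return self._obs(), reward, done, {}

env = SDNDefenseEnv()
obs = env.reset()
obs, r, done, _ = env.step(action=0, target=3)   # e.g., isolate node 3
print("reward:", r, "episode over:", done)
```

The reward above only mirrors the ingredients reported for [168] (critical-server status, preserved nodes, migration cost) qualitatively; flipping the sign of this reward during training is one way to emulate the causative attacks mentioned above.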
In an adversarial environment, the defender may not know the private details of the attacker, such as the type of attack, attacking target, frequency, and location. Therefore, the defender, for example, may allocate substantial resources to protect an asset that is not a target of the attacker. The defender needs to dynamically reconfigure defense strategies to increase the complexity and cost for the intruder. Zhu et al. [170] introduced a model where the defender and attacker can repeatedly change their defense and attack strategies. The defender has no prior knowledge about the attacker, such as launched attacks and attacking policies. However, it is aware of the attacker classes and can access the system utilities, which are jointly contributed by the defense and attack activities. Two interactive RL methods are proposed for cyber defense in [170], namely, adaptive RL and robust RL. The adaptive RL handles attacks with a diminishing exploration rate (a nonpersistent attacker), while the robust RL deals with intruders who have a constant exploration rate (a persistent attacker). The interactions between defender and attacker are illustrated via the attack and defense cycles, as in Fig. 10. The attackers and defenders do not take actions simultaneously but asynchronously. On the attack cycle, the attacker evaluates previous attacks before launching a new attack if necessary. On the defense cycle, after receiving an alert, the defender carries out a meta-analysis of the latest attacks and calculates the corresponding utility before deploying a new defense if needed. An advantage of this system model is that it does not assume any underlying model for the attacker but instead treats attack strategies as black boxes.

Fig. 10. Defender and attacker interact via the IDS in an adversarial environment, involving defense and attack cycles. Using these two cycles, a defender and an attacker can repeatedly change their defense and attack strategies. This model can be used to study defense strategies for different classes of attacks, such as buffer over-read attacks [171] and code reuse attacks [172].

Alternatively, Elderman et al. [173] simulated cyber security problems in networking as a stochastic Markov game with two agents, one attacker and one defender, with incomplete information and partial observability. The attacker does not know the network topology but attempts to reach and gain access to the location that contains a valuable asset. The defender knows the internal network but does not see the attack types or positions of intruders. This is a challenging cyber security game because a player needs to adapt its strategy to defeat unobservable opponents [174]. Different algorithms, e.g., Monte Carlo learning, Q-learning, and neural networks, are used to learn both defender and attacker. Simulation results show that Monte Carlo learning with softmax exploration is the best method for learning both attacking and defending strategies. Neural network algorithms have a limited adversarial learning capability, and thus they are outperformed by Q-learning and Monte Carlo learning techniques. A disadvantage of this simulation is that it reduces the real-world cyber security problem to a game of only two players with a single asset. In practice, there can be multiple hackers simultaneously penetrating a server that holds valuable data. In addition, a network may contain useful data in different locations instead of a single location, as simulated.

IV. DISCUSSIONS AND FUTURE RESEARCH DIRECTIONS

DRL has emerged over recent years as one of the most successful methods of designing and creating human or even superhuman AI agents. Many of these successes have relied on the incorporation of DNNs into a framework of traditional RL to address complex and high-dimensional sequential decision-making problems. Applications of DRL algorithms, therefore, have been found in various fields, including IoT and cyber security. Computers and the Internet today play crucial roles in many areas of our lives, e.g., entertainment, communication, transportation, medicine, and even shopping. Much of our personal information and important data is stored online. Even financial institutions, e.g., banks, mortgage companies, and brokerage firms, run their business online. Therefore, it is essential to have a security plan in place to prevent hackers from accessing our computer systems. This article has presented a comprehensive survey of DRL methods and their applications to cyber security problems, with notable examples summarized in Table II.
The adversarial environment of cyber systems has instigated various proposals of game theory models involving multiple DRL agents. We found that this kind of application occupies a major proportion of the papers in the literature relating to DRL for cyber security problems.

TABLE II
SUMMARY OF TYPICAL DRL APPLICATIONS IN CYBER SECURITY

A. Challenges and Future Work on Applying DRL for CPS Security Solutions

An emerging area is the use of DRL for security solutions for CPSs [175], [176]. The large-scale and complex nature of CPSs, e.g., in environmental monitoring networks, electrical smart grid systems, transportation management networks, and cyber manufacturing management systems, requires security solutions to be responsive and accurate. This has been addressed by various DRL approaches, e.g., the TRPO algorithm [93], LSTM-Q-learning [89], DDQN, and A3C [86].

One of the great challenges in implementing DRL algorithms for CPS security solutions is the lack of realistic CPS simulations. For example, the work in [86] had to use MATLAB/Simulink CPS modeling and embed it into the OpenAI Gym environment. This implementation is expensive in terms of computational time due to the overhead caused by integrating the MATLAB/Simulink CPS simulation into the OpenAI Gym library. More suitable simulations of CPS models, embedded directly in DRL-enabled environments, are thus encouraged in future work.
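As a sketch of what such an embedding can look like, the snippet below exposes a generic external CPS simulator through the classic (pre-0.26) OpenAI Gym interface so that off-the-shelf DRL agents can interact with it. The simulator object, its reset/advance methods, and the sensor and actuator dimensions are hypothetical placeholders, not the MATLAB/Simulink setup of [86].

```python
# Illustrative sketch: wrapping an external CPS simulator behind the classic
# OpenAI Gym interface. `ExternalCPSSimulator` and its reset()/advance() methods
# are hypothetical placeholders for whatever co-simulation backend is available.
import numpy as np
import gym
from gym import spaces

class ExternalCPSSimulator:
    """Stand-in for a plant/grid/vehicle co-simulation backend (assumed API)."""
    def reset(self):
        self.state = np.zeros(6, dtype=np.float32)
        return self.state
    def advance(self, control, dt=0.1):
        self.state = (self.state + dt * (control.sum() + np.random.randn(6) * 0.01)).astype(np.float32)
        return self.state

class CPSSecurityEnv(gym.Env):
    """Gym wrapper: observation = sensor readings, action = continuous control/defense signal."""
    def __init__(self, horizon=200):
        super().__init__()
        self.sim = ExternalCPSSimulator()
        self.horizon = horizon
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(6,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)

    def reset(self):
        self.t = 0
        return self.sim.reset()

    def step(self, action):
        obs = self.sim.advance(np.asarray(action, dtype=np.float32))
        self.t += 1
        # Example safety-style reward: penalize deviation of sensor readings from nominal.
        reward = -float(np.abs(obs).mean())
        done = self.t >= self.horizon or bool(np.abs(obs).max() > 10.0)
        return obs, reward, done, {}

env = CPSSecurityEnv()
obs = env.reset()
obs, r, done, _ = env.step(env.action_space.sample())
```

Keeping the simulation step inside the wrapper, rather than bridging to an external tool at every call, is what removes the overhead discussed above.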
Another common challenge in applying DRL algorithms is transferring trained policies from simulations to real-world environments. While simulations are cheap and safe for training DRL agents, the reality gap caused by modeling impreciseness and errors makes the transfer challenging. This is even more critical for CPS modeling because of the complexity, dynamics, and large scale of CPSs. Research in this direction, i.e., sim-to-real transfer for DRL-based security solutions for CPSs, is worth investigating, as it can help to reduce time and costs, increase safety during the training process of DRL agents, and eventually reduce costly mistakes when executing in real-world environments.

B. Challenges and Future Work on Applying DRL for IDS

Although there have been a large number of applications of traditional RL methods to IDSs, there has been only a small amount of work on DRL algorithms for this kind of application. This is probably because the integration of deep learning and RL methods has only recently matured. The complexity and dynamics of intrusion detection problems are expected to be handled effectively by DRL methods, which combine the powerful representation learning and function approximation capabilities of deep learning with the optimal sequential decision-making capability of traditional RL. Applying DRL for IDS requires simulated or real intrusion environments for training agents interactively. This is a great challenge because using real environments for training is costly, while simulated environments may be far from reality. Most of the existing studies on DRL for intrusion detection relied on game-based settings (e.g., Fig. 7) or labeled intrusion datasets. For example, the work in [125] used two datasets of labeled intrusion samples and adjusted the DRL machinery to work on these datasets in a supervised learning manner; a minimal sketch of this dataset-as-environment idea appears at the end of this subsection. This kind of application lacks a live environment and proper interactions between the DRL agent and the environment. There is, thus, a gap for future research on creating more realistic environments that respond in real time to the actions of DRL agents and facilitate the full exploitation of DRL's capabilities for solving complex and sophisticated cyber intrusion detection problems. Furthermore, as host-based and network-based IDSs have both advantages and disadvantages, combining these systems could be a logical approach. DRL-based solutions for this kind of integrated system would be another interesting future study.
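The following hedged sketch illustrates the dataset-as-environment idea in general terms; it is not the exact setup of [125]. Each step presents one labeled sample, the agent's action is its predicted class, and the reward is +1 or -1 depending on correctness, so a value-based DRL agent can be trained on it as if it were a sequential task. The feature dimension and the data below are synthetic assumptions.

```python
# Illustrative sketch of turning a labeled intrusion dataset into an episodic
# environment, in the spirit of supervised DRL approaches such as [125].
# The synthetic features/labels below are placeholders for a real dataset.
import numpy as np

class DatasetIntrusionEnv:
    """Each step shows one sample; action 0 = benign, 1 = attack; reward = +/-1."""
    def __init__(self, features, labels, shuffle=True, seed=0):
        self.x, self.y = np.asarray(features, np.float32), np.asarray(labels, np.int64)
        self.shuffle, self.rng = shuffle, np.random.default_rng(seed)

    def reset(self):
        self.order = self.rng.permutation(len(self.y)) if self.shuffle else np.arange(len(self.y))
        self.i = 0
        return self.x[self.order[self.i]]

    def step(self, action):
        reward = 1.0 if action == self.y[self.order[self.i]] else -1.0
        self.i += 1
        done = self.i >= len(self.y)
        next_obs = None if done else self.x[self.order[self.i]]
        return next_obs, reward, done, {}

# Synthetic stand-in data: 100 samples, 10 features, binary labels.
rng = np.random.default_rng(1)
env = DatasetIntrusionEnv(rng.normal(size=(100, 10)), rng.integers(0, 2, size=100))
obs, total, done = env.reset(), 0.0, False
while not done:
    obs, r, done, _ = env.step(action=rng.integers(0, 2))  # random policy baseline
    total += r
print("return of a random policy over one pass:", total)
```

An agent trained on such an environment is effectively a classifier; as noted above, this does not exercise DRL's sequential decision-making strengths, which is precisely why live, responsive environments are encouraged.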
C. Exploring Capabilities of Model-Based DRL Methods

Most DRL algorithms used for cyber defense so far are model-free methods, which are sample inefficient because they require a large quantity of training data. Such data are difficult to obtain in real cyber security practice. Researchers generally utilize simulators to validate their proposed approaches, but these simulators often do not fully characterize the complexity and dynamics of the real cyberspace of IoT systems. Model-based DRL methods are more appropriate than model-free methods when training data are limited because, with model-based DRL, data can be collected in a scalable way. Exploration of model-based DRL methods, or the integration of model-based and model-free methods, for cyber defense is thus an interesting future study. For example, function approximators can be used to learn a proxy model of the actual high-dimensional and possibly partially observable environment [177]–[179], which can then be employed to deploy planning algorithms, e.g., Monte Carlo tree search techniques [180], to derive optimal actions. Alternatively, combinations of model-based and model-free approaches, such as a model-free policy with planning capabilities [181], [182] or model-based lookahead search [31], can be used, as they aggregate the advantages of both methods. On the other hand, the current literature on applications of DRL to cyber security is often limited to discretized action spaces, which restricts the full capability of DRL solutions for real-world problems. An example is the application of DRL for selecting optimal mobile offloading rates in [161] and [162], where the action space has been discretized although a small change of the rate can substantially affect the performance of the cloud-based malware detection system. Investigation of methods that can deal with continuous action spaces in cyber environments, e.g., policy gradient and actor-critic algorithms, is another encouraging research direction.
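As a contrast to the discretized DQN selector sketched earlier, the snippet below is a minimal deterministic actor-critic (DDPG-flavored) sketch in which the offloading rate is a single continuous action in [0, 1]. The state features and hyperparameters are illustrative assumptions, and exploration noise, replay buffers, and target networks are omitted to keep the sketch short.

```python
# Illustrative deterministic actor-critic sketch for a continuous offloading rate
# in [0, 1]. State features and hyperparameters are assumptions; replay, target
# networks, and exploration noise are omitted for brevity.
import torch
import torch.nn as nn

STATE_DIM = 4

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
critic = nn.Sequential(nn.Linear(STATE_DIM + 1, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
GAMMA = 0.95

def update(state, action, reward, next_state, done):
    """One actor-critic update on a batch of transitions (tensors of shape [B, .])."""
    with torch.no_grad():
        next_a = actor(next_state)
        target_q = reward + GAMMA * (1.0 - done) * critic(torch.cat([next_state, next_a], dim=1))
    q = critic(torch.cat([state, action], dim=1))
    critic_loss = nn.functional.mse_loss(q, target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient: push the actor toward actions the critic rates highly.
    actor_loss = -critic(torch.cat([state, actor(state)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Toy usage with a batch of two synthetic transitions.
s = torch.rand(2, STATE_DIM)
a = actor(s).detach()
update(s, a, reward=torch.tensor([[1.0], [0.2]]), next_state=torch.rand(2, STATE_DIM),
       done=torch.tensor([[0.0], [0.0]]))
print("offloading rate for the first state:", float(actor(s)[0]))
```

Because the actor outputs the rate directly, small adjustments of the offloading rate are representable without enlarging an action grid, which is the limitation discussed above for [161] and [162].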
D. Training DRL in Adversarial Cyber Environments

AI can not only help defend against cyberattacks but can also facilitate dangerous attacks, i.e., offensive AI. Hackers can take advantage of AI to make attacks smarter and more sophisticated in order to bypass detection methods and penetrate computer systems or networks. For example, hackers may employ algorithms to observe the normal behaviors of users and use those patterns to develop untraceable attacking strategies. ML-based systems can mimic humans to craft convincing fake messages that are used to conduct large-scale phishing attacks. Likewise, by creating highly realistic fake video or audio messages based on AI advances (i.e., deepfakes [183]), hackers can spread false news during elections or manipulate financial markets [184]. Alternatively, attackers can poison the data pool used for training deep learning methods (i.e., ML poisoning), or they can manipulate the states or policies and falsify part of the reward signals in RL to trick the agent into taking suboptimal actions, resulting in the agent being compromised [185]. These kinds of attacks are difficult to prevent, detect, and fight against because they are part of a battle between AI systems. Adversarial ML, especially with supervised methods, has been used extensively in cyber security [186], but very few studies have been found on adversarial RL [187]. Adversarial DRL, i.e., DRL algorithms trained in various adversarial cyber environments, is worth comprehensive investigation as it can be a solution for battling increasingly complex offensive AI systems [188]–[190].
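One simple way to expose a DRL agent to such manipulation during training is to wrap its environment so that reward signs are occasionally flipped or observations perturbed; training against the wrapper is a crude form of the adversarial training discussed above. The sketch below is illustrative only, not a method from the cited works, and the perturbation probabilities and noise scale are arbitrary.

```python
# Illustrative sketch: an environment wrapper that simulates reward-sign flipping
# and small state perturbations during training. The flip probability and noise
# scale below are arbitrary choices, not values from the cited studies.
import numpy as np

class PoisonedEnv:
    def __init__(self, env, flip_prob=0.1, obs_noise=0.05, seed=0):
        self.env = env
        self.flip_prob = flip_prob
        self.obs_noise = obs_noise
        self.rng = np.random.default_rng(seed)

    def reset(self):
        return self._perturb(self.env.reset())

    def step(self, *args, **kwargs):
        obs, reward, done, info = self.env.step(*args, **kwargs)
        if self.rng.random() < self.flip_prob:
            reward = -reward                 # causative attack: flipped reward sign
        return self._perturb(obs), reward, done, info

    def _perturb(self, obs):
        if obs is None:
            return obs
        obs = np.asarray(obs, dtype=np.float32)
        return obs + self.rng.normal(scale=self.obs_noise, size=obs.shape).astype(np.float32)

# Usage idea: train the same defender both on the clean environment and on
# PoisonedEnv(clean_env), then compare how far the learned policies degrade.
```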
E. Human–Machine Teaming With Human-on-the-Loop Models

With the support of AI systems, cyber security experts no longer need to examine a huge volume of attack data manually to detect and defend against cyberattacks. This has many advantages because security teams alone cannot sustain that volume. AI-enabled defense strategies can be automated and deployed rapidly and efficiently, but these systems alone cannot issue creative responses when new threats are introduced.
Moreover, human adversaries are always behind cybercrime or cyberwarfare. Therefore, there is a critical need for human intellect teamed with machines for cyber defense. The traditional human-in-the-loop model for human–machine integration struggles to adapt quickly to cyber defense systems because the autonomous agent carries out only part of the task and needs to halt and wait for the human's response before completing it. The modern human-on-the-loop model would be a solution for a future human–machine teaming cyber security system. This model allows agents to perform the task autonomously, while humans monitor and intervene in the agents' operations only when necessary. How to integrate human knowledge into DRL algorithms [191] under the human-on-the-loop model for cyber defense is an interesting research question.

F. Exploring Capabilities of Multiagent DRL Methods

As hackers utilize more and more sophisticated and large-scale approaches to attack computer systems and networks, the defense strategies need to become more intelligent and large-scale as well. Multiagent DRL is a research direction that can be explored to tackle this problem. The game theory models for cyber security reviewed in this article have involved multiple agents, but they are restricted to a couple of attackers and defenders with limited communication, cooperation, and coordination among the agents. These aspects of multiagent DRL need to be investigated thoroughly in cyber security problems to enable effective large-scale defense plans. Challenges of multiagent DRL itself then need to be addressed, such as nonstationarity, partial observability, and efficient multiagent training schemes [192]. On the other hand, the RL methodology has been applied to deal with various cyberattacks, e.g., jamming, spoofing, false data injection, malware, DoS, DDoS, brute force, Heartbleed, botnet, Web attack, and infiltration attacks [193]–[198]. However, recently emerged or new types of attacks have been largely unaddressed. One of these new types is the bit-and-piece DDoS attack, which injects small amounts of junk into the legitimate traffic of a large number of IP addresses so that it can bypass many detection methods, as there is so little of it per address. Another emerging attack, for instance, is attacking from the computing cloud to breach the systems of companies that manage IT systems for other firms or host other firms' data on their servers. Alternatively, hackers can use powerful quantum computers to crack the encryption algorithms that are currently used to protect various types of invaluable data [184]. Consequently, a future study on addressing these new types of attacks is encouraged.

REFERENCES

[1] I. Kakalou, K. E. Psannis, P. Krawiec, and R. Badea, “Cognitive radio network and network service chaining toward 5G: Challenges and requirements,” IEEE Commun. Mag., vol. 55, no. 11, pp. 145–151, Nov. 2017.
[2] Y. Huang, S. Li, C. Li, Y. T. Hou, and W. Lou, “A deep-reinforcement-learning-based approach to dynamic eMBB/URLLC multiplexing in 5G NR,” IEEE Internet Things J., vol. 7, no. 7, pp. 6439–6456, Jul. 2020.
[3] P. Wang, L. T. Yang, X. Nie, Z. Ren, J. Li, and L. Kuang, “Data-driven software defined network attack detection: State-of-the-art and perspectives,” Inf. Sci., vol. 513, pp. 65–83, Mar. 2020.
[4] A. Botta, W. Donato, V. Persico, and A. Pescapé, “Integration of cloud computing and Internet of Things: A survey,” Future Generat. Comput. Syst., vol. 56, pp. 684–700, Mar. 2016.
[5] O. Krestinskaya, A. P. James, and L. O. Chua, “Neuromemristive circuits for edge computing: A review,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 1, pp. 4–23, Jan. 2020.
[6] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, “Mobile edge computing: A survey,” IEEE Internet Things J., vol. 5, no. 1, pp. 450–465, Feb. 2018.
[7] A. V. Dastjerdi and R. Buyya, “Fog computing: Helping the Internet of Things realize its potential,” Computer, vol. 49, no. 8, pp. 112–116, Aug. 2016.
[8] B. Geluvaraj, P. M. Satwik, and T. A. Kumar, “The future of cybersecurity: Major role of artificial intelligence, machine learning, and deep learning in cyberspace,” in Proc. Int. Conf. Comput. Netw. Commun. Technol., 2019, pp. 739–747.
[9] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1153–1176, 2nd Quart., 2016.
[10] G. Apruzzese, M. Colajanni, L. Ferretti, A. Guido, and M. Marchetti, “On the effectiveness of machine and deep learning for cyber security,” in Proc. 10th Int. Conf. Cyber Conflict (CyCon), May 2018, pp. 371–390.
[11] Y. Xin et al., “Machine learning and deep learning methods for cybersecurity,” IEEE Access, vol. 6, pp. 35365–35381, 2018.
[12] N. Milosevic, A. Dehghantanha, and K.-K. R. Choo, “Machine learning aided Android malware classification,” Comput. Elect. Eng., vol. 61, pp. 266–274, Jul. 2017.
[13] R. M. H. Babu, R. Vinayakumar, and K. P. Soman, “A short review on applications of deep learning for cyber security,” 2018, arXiv:1812.06292. [Online]. Available: http://arxiv.org/abs/1812.06292
[14] D. Berman, A. Buczak, J. Chavis, and C. Corbett, “A survey of deep learning methods for cyber security,” Information, vol. 10, no. 4, p. 122, Apr. 2019.
[15] S. Paul, Z. Ni, and C. Mu, “A learning-based solution for an adversarial repeated game in cyber–physical power systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4512–4523, Nov. 2020, doi: 10.1109/TNNLS.2019.2955857.
[16] D. Ding, Q.-L. Han, Y. Xiang, C. Ge, and X.-M. Zhang, “A survey on security control and attack detection for industrial cyber-physical systems,” Neurocomputing, vol. 275, pp. 1674–1683, Jan. 2018.
[17] M. Wu, Z. Song, and Y. B. Moon, “Detecting cyber-physical attacks in CyberManufacturing systems with machine learning methods,” J. Intell. Manuf., vol. 30, no. 3, pp. 1111–1123, Mar. 2019.
[18] L. Xiao, X. Wan, X. Lu, Y. Zhang, and D. Wu, “IoT security techniques based on machine learning,” 2018, arXiv:1801.06275. [Online]. Available: http://arxiv.org/abs/1801.06275
[19] A. Sharma, Z. Kalbarczyk, J. Barlow, and R. Iyer, “Analysis of security data from a large computing organization,” in Proc. IEEE/IFIP 41st Int. Conf. Dependable Syst. Netw. (DSN), Jun. 2011, pp. 506–517.
[20] N. D. Nguyen, T. Nguyen, and S. Nahavandi, “System design perspective for human-level agents using deep reinforcement learning: A survey,” IEEE Access, vol. 5, pp. 27091–27102, 2017.
[21] Z. Sui, Z. Pu, J. Yi, and S. Wu, “Formation control with collision avoidance through deep reinforcement learning using model-guided demonstration,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 6, pp. 2358–2372, Jun. 2021, doi: 10.1109/TNNLS.2020.3004893.
[22] A. Tsantekidis, N. Passalis, A.-S. Toufa, K. Saitas-Zarkias, S. Chairistanidis, and A. Tefas, “Price trailing for financial trading using deep reinforcement learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 7, pp. 2837–2846, Jul. 2021, doi: 10.1109/TNNLS.2020.2997523.
[23] T. T. Nguyen, N. D. Nguyen, P. Vamplew, S. Nahavandi, R. Dazeley, and C. P. Lim, “A multi-objective deep reinforcement learning framework,” 2018, arXiv:1803.02965. [Online]. Available: http://arxiv.org/abs/1803.02965
[24] X. Wang, Y. Gu, Y. Cheng, A. Liu, and C. L. P. Chen, “Approximate policy-based accelerated deep reinforcement learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 6, pp. 1820–1830, Jun. 2020.
[25] M. H. Ling, K.-L.-A. Yau, J. Qadir, G. S. Poh, and Q. Ni, “Application of reinforcement learning for security enhancement in cognitive radio networks,” Appl. Soft Comput., vol. 37, pp. 809–829, Dec. 2015.
[26] Y. Wang, Z. Ye, P. Wan, and J. Zhao, “A survey of dynamic spectrum allocation based on reinforcement learning algorithms in cognitive radio networks,” Artif. Intell. Rev., vol. 51, no. 3, pp. 493–506, Mar. 2019.
[27] X. Lu, L. Xiao, T. Xu, Y. Zhao, Y. Tang, and W. Zhuang, “Reinforcement learning based PHY authentication for VANETs,” IEEE Trans. Veh. Technol., vol. 69, no. 3, pp. 3068–3079, Mar. 2020.
[28] M. Alauthman, N. Aslam, M. Al-Kasassbeh, S. Khan, A. Al-Qerem, [53] R. Shafin et al., “Self-tuning sectorization: Deep reinforcement learning
and K.-K. R. Choo, “An efficient reinforcement learning-based botnet meets broadcast beam optimization,” IEEE Trans. Wireless Commun.,
detection approach,” J. Netw. Comput. Appl., vol. 150, Jan. 2020, vol. 19, no. 6, pp. 4038–4053, Jun. 2020.
Art. no. 102479. [54] D. Zhang, X. Han, and C. Deng, “Review on the research and practice
[29] V. Mnih et al., “Human-level control through deep reinforcement of deep learning and reinforcement learning in smart grids,” CSEE
learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. J. Power Energy Syst., vol. 4, no. 3, pp. 362–370, Sep. 2018.
[30] N. D. Nguyen, S. Nahavandi, and T. Nguyen, “A human mixed strategy [55] X. He, K. Wang, H. Huang, T. Miyazaki, Y. Wang, and S. Guo,
approach to deep reinforcement learning,” in Proc. IEEE Int. Conf. “Green resource allocation based on deep reinforcement learning in
Syst., Man, Cybern. (SMC), Oct. 2018, pp. 4023–4028. content-centric IoT,” IEEE Trans. Emerg. Topics Comput., vol. 8, no. 3,
[31] D. Silver et al., “Mastering the game of Go with deep neural networks pp. 781–796, Jul. 2020, doi: 10.1109/TETC.2018.2805718.
and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016. [56] Y. He, C. Liang, F. R. Yu, and Z. Han, “Trust-based social networks
[32] D. Silver et al., “Mastering the game of go without human knowledge,” with computing, caching and communications: A deep reinforcement
Nature, vol. 550, no. 7676, pp. 354–359, 2017. learning approach,” IEEE Trans. Netw. Sci. Eng., vol. 7, no. 1,
[33] O. Vinyals et al., “StarCraft II: A new challenge for rein- pp. 66–79, Jan. 2020, doi: 10.1109/TNSE.2018.2865183.
forcement learning,” 2017, arXiv:1708.04782. [Online]. Available: [57] N. C. Luong et al., “Applications of deep reinforcement learning in
http://arxiv.org/abs/1708.04782 communications and networking: A survey,” IEEE Commun. Surveys
[34] P. Sun et al., “TStarBots: Defeating the cheating level builtin AI Tuts., vol. 21, no. 4, pp. 3133–3174, 4th Quart., 2019, doi: 10.1109/
in StarCraft II in the full game,” 2018, arXiv:1809.07193. [Online]. COMST.2019.2916583.
Available: http://arxiv.org/abs/1809.07193 [58] Y. Dai, D. Xu, S. Maharjan, Z. Chen, Q. He, and Y. Zhang, “Blockchain
[35] Z.-J. Pang, R.-Z. Liu, Z.-Y. Meng, Y. Zhang, Y. Yu, and T. Lu, and deep reinforcement learning empowered intelligent 5G beyond,”
“On reinforcement learning for full-length game of StarCraft,” 2018, IEEE Netw., vol. 33, no. 3, pp. 10–17, May/Jun. 2019.
arXiv:1809.09095. [Online]. Available: http://arxiv.org/abs/1809.09095 [59] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi,
[36] V. Zambaldi et al., “Relational deep reinforcement learning,” 2018, “Deep reinforcement learning for wireless sensor scheduling in cyber–
arXiv:1806.01830. [Online]. Available: http://arxiv.org/abs/1806.01830 physical systems,” Automatica, vol. 113, Mar. 2020, Art. no. 108759.
[37] M. Jaderberg et al., “Human-level performance in first-person mul- [60] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8,
tiplayer games with population-based deep reinforcement learn- nos. 3–4, pp. 279–292, 1992.
ing,” 2018, arXiv:1807.01281. [Online]. Available: http://arxiv. [61] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath,
org/abs/1807.01281 “Deep reinforcement learning: A brief survey,” IEEE Signal Process.
[38] OpenAI. (Mar.1, 2019)OpenAI Five. [Online]. Available: https:// Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
openai.com/five/ [62] V. Mnih et al., “Playing Atari with deep reinforcement learning,” 2013,
[39] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforce- arXiv:1312.5602. [Online]. Available: http://arxiv.org/abs/1312.5602
ment learning for robotic manipulation with asynchronous off-policy [63] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience
updates,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, replay,” 2015, arXiv:1511.05952. [Online]. Available: http://arxiv.org/
pp. 3389–3396. abs/1511.05952
[40] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, [64] R. J. Williams, “Simple statistical gradient-following algorithms for
“Navigating occluded intersections with autonomous vehicles using connectionist reinforcement learning,” Mach. Learn., vol. 8, nos. 3–4,
deep reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Autom. pp. 229–256, 1992.
(ICRA), May 2018, pp. 2034–2039. [65] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy
[41] T. Nguyen, N. D. Nguyen, F. Bello, and S. Nahavandi, “A new ten- gradient methods for reinforcement learning with function approxima-
sioning method using deep reinforcement learning for surgical pattern tion,” in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 1057–1063.
cutting,” in Proc. IEEE Int. Conf. Ind. Technol. (ICIT), Feb. 2019, [66] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust
pp. 1339–1344, doi: 10.1109/ICIT.2019.8755235. region policy optimization,” in Proc. Int. Conf. Mach. Learn., 2015,
[42] N. D. Nguyen, T. Nguyen, S. Nahavandi, A. Bhatti, and G. Guest, pp. 1889–1897.
“Manipulating soft tissues by deep reinforcement learning for [67] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
autonomous robotic surgery,” in Proc. IEEE Int. Syst. Conf. (SysCon), “Proximal policy optimization algorithms,” 2017, arXiv:1707.06347.
Apr. 2019, pp. 1–7, doi: 10.1109/SYSCON.2019.8836924. [Online]. Available: http://arxiv.org/abs/1707.06347
[43] Y. Keneshloo, T. Shi, N. Ramakrishnan, and C. K. Reddy, “Deep [68] C. Wu et al., “Variance reduction for policy gradient with action-
reinforcement learning for sequence-to-sequence models,” IEEE Trans. dependent factorized baselines,” 2018, arXiv:1803.07246. [Online].
Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2469–2489, Jul. 2019. Available: http://arxiv.org/abs/1803.07246
[44] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications [69] T. P. Lillicrap et al., “Continuous control with deep rein-
of deep learning and reinforcement learning to biological data,” IEEE forcement learning,” 2015, arXiv:1509.02971. [Online]. Available:
Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2063–2079, http://arxiv.org/abs/1509.02971
Jun. 2018. [70] G. Barth-Maron et al., “Distributed distributional deterministic policy
[45] M. Popova, O. Isayev, and A. Tropsha, “Deep reinforcement learn- gradients,” 2018, arXiv:1804.08617. [Online]. Available: http://arxiv.
ing for de novo drug design,” Sci. Adv., vol. 4, no. 7, Jul. 2018, org/abs/1804.08617
Art. no. eaap7885. [71] M. Jaderberg et al., “Reinforcement learning with unsuper-
[46] Y. He, F. R. Yu, N. Zhao, V. C. M. Leung, and H. Yin, “Software- vised auxiliary tasks,” 2016, arXiv:1611.05397. [Online]. Available:
defined networks with mobile edge computing and caching for smart http://arxiv.org/abs/1611.05397
cities: A big data deep reinforcement learning approach,” IEEE Com- [72] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the
mun. Mag., vol. 55, no. 12, pp. 31–37, Dec. 2017. gap between value and policy based reinforcement learning,” in Proc.
[47] H. V. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning Adv. Neural Inf. Process. Syst., 2017, pp. 2775–2785.
with double Q-learning,” in Proc. 13th AAAI Conf. Artif. Intell., 2016, [73] M. Lapan, Deep Reinforcement Learning Hands-on: Apply Modern RL
pp. 2094–2100. Methods, With Deep Q-Networks, Value Iteration, Policy Gradients,
[48] Z. Wang et al., “Dueling network architectures for deep reinforcement TRPO, AlphaGo Zero and More. Birmingham, U.K.: Packt Publishing,
learning,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1995–2003. 2018.
[49] H. Zhu, Y. Cao, W. Wang, T. Jiang, and S. Jin, “Deep reinforcement [74] (Dec. 14, 220). OpenAI Gym Toolkit Documentation. Classic Control:
learning for mobile edge caching: Review, new features, and open Control Theory Problems From the Classic RL Literature. [Online].
issues,” IEEE Netw., vol. 32, no. 6, pp. 50–57, Nov. 2018. Available: https://gym.openai.com/envs/#classic_control
[50] V. Mnih et al., “Asynchronous methods for deep reinforcement learn- [75] (Dec. 14, 220). OpenAI Gym Toolkit Documentation. Box2D: Con-
ing,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937. tinuous Control Tasks in the Box2D Simulator. [Online]. Available:
[51] Y. Zhang, J. Yao, and H. Guan, “Intelligent cloud resource management https://gym.openai.com/envs/#box2d
with deep reinforcement learning,” IEEE Cloud Comput., vol. 4, no. 6, [76] L. Wang, M. Törngren, and M. Onori, “Current status and advancement
pp. 60–69, Nov./Dec. 2017. of cyber-physical systems in manufacturing,” J. Manuf. Syst., vol. 37,
[52] J. Zhu, Y. Song, D. Jiang, and H. Song, “A new deep-Q-learning- pp. 517–527, Oct. 2015.
based transmission scheduling mechanism for the cognitive Internet [77] Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri, “Health-
of Things,” IEEE Internet Things J., vol. 5, no. 4, pp. 2375–2385, CPS: Healthcare cyber-physical system assisted by cloud and big data,”
Aug. 2017. IEEE Syst. J., vol. 11, no. 1, pp. 88–95, Mar. 2017.
[78] P. M. Shakeel, S. Baskar, V. R. S. Dhulipala, S. Mishra, and [100] P. Deshpande, S. C. Sharma, S. K. Peddoju, and S. Junaid, “HIDS:
M. M. Jaber, “Maintaining security and privacy in health care system A host based intrusion detection system for cloud computing environ-
using learning based deep-Q-networks,” J. Med. Syst., vol. 42, no. 10, ment,” Int. J. Syst. Assurance Eng. Manage., vol. 9, no. 3, pp. 567–576,
p. 186, 2018. Jun. 2018.
[79] M. H. Cintuglu, O. A. Mohammed, K. Akkaya, and A. S. Uluagac, [101] M. Nobakht, V. Sivaraman, and R. Boreli, “A host-based intrusion
“A survey on smart grid cyber-physical system testbeds,” IEEE Com- detection and mitigation framework for smart home IoT using Open-
mun. Surveys Tuts., vol. 19, no. 1, pp. 446–464, 1st Quart., 2017. Flow,” in Proc. 11th Int. Conf. Availability, Rel. Secur. (ARES),
[80] Y. Chen, S. Huang, F. Liu, Z. Wang, and X. Sun, “Evaluation of rein- Aug. 2016, pp. 147–156.
forcement learning-based false data injection attack to automatic volt- [102] P. A. A. Resende and A. C. Drummond, “A survey of random forest
age control,” IEEE Trans. Smart Grid, vol. 10, no. 2, pp. 2158–2169, based methods for intrusion detection systems,” ACM Comput. Surv.,
Mar. 2018. vol. 51, no. 3, p. 48, 2018.
[81] Z. Ni and S. Paul, “A multistage game in smart grid security: A [103] G. Kim, H. Yi, J. Lee, Y. Paek, and S. Yoon, “LSTM-based system-call
reinforcement learning solution,” IEEE Trans. Neural Netw. Learn. language modeling and robust ensemble method for designing host-
Syst., vol. 30, no. 9, pp. 2684–2695, Sep. 2019, doi: 10.1109/TNNLS. based intrusion detection systems,” 2016, arXiv:1611.01726. [Online].
2018.2885530. Available: http://arxiv.org/abs/1611.01726
[82] A. Ferdowsi, A. Eldosouky, and W. Saad, “Interdependence-aware [104] A. Chawla, B. Lee, S. Fallon, and P. Jacob, “Host based intrusion
game-theoretic framework for secure intelligent transportation sys- detection system with combined CNN/RNN model,” in Proc. Joint Eur.
tems,” IEEE Internet Things J., early access, Sep. 1, 2020, doi: Conf. Mach. Learn. Knowl. Discovery Databases, 2018, pp. 149–158.
10.1109/JIOT.2020.3020899. [105] M. M. Hassan, A. Gumaei, A. Alsanad, M. Alrubaian, and G. Fortino,
[83] Y. Li et al., “Nonlane-discipline-based car-following model for electric “A hybrid deep learning model for efficient intrusion detection in big
vehicles in transportation-cyber-physical systems,” IEEE Trans. Intell. data environment,” Inf. Sci., vol. 513, pp. 386–396, Mar. 2020.
Transp. Syst., vol. 19, no. 1, pp. 38–47, Jan. 2018. [106] A. Janagam and S. Hossen, “Analysis of network intrusion detection
[84] C. Li and M. Qiu, Reinforcement Learning for Cyber-Physical Systems: system with machine learning algorithms (deep reinforcement learning
With Cybersecurity Case Studies. Boca Raton, FL, USA: CRC Press, algorithm),” M.S. thesis, Dept. Comput., Blekinge Inst. Technol.,
2019. Stockholm, Sweden, 2018.
[85] M. Feng and H. Xu, “Deep reinforecement learning based optimal [107] X. Xu and T. Xie, “A reinforcement learning approach for host-based
defense for cyber-physical system in presence of unknown cyber- intrusion detection using sequences of system calls,” in Proc. Int. Conf.
attack,” in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Nov. 2017, Intell. Comput., 2005, pp. 995–1003.
pp. 1–8. [108] R. S. Sutton, “Learning to predict by the methods of temporal differ-
[86] T. Akazaki, S. Liu, Y. Yamagata, Y. Duan, and J. Hao, “Falsification ences,” Mach. Learn., vol. 3, no. 1, pp. 9–44, 1988.
of cyber-physical systems using deep reinforcement learning,” in Proc. [109] X. Xu, “A sparse kernel-based least-squares temporal difference algo-
Int. Symp. Formal Methods, 2018, pp. 456–465. rithm for reinforcement learning,” in Proc. Int. Conf. Natural Comput.,
2006, pp. 47–56.
[87] H. Abbas and G. Fainekos, “Convergence proofs for simulated anneal-
ing falsification of safety properties,” in Proc. 50th Annu. Allerton Conf. [110] X. Xu and Y. Luo, “A kernel-based reinforcement learning approach to
Commun., Control, Comput. (Allerton), Oct. 2012, pp. 1594–1601. dynamic behavior modeling of intrusion detection,” in Proc. Int. Symp.
Neural Netw., 2007, pp. 455–464.
[88] S. Sankaranarayanan and G. Fainekos, “Falsification of temporal prop-
[111] X. Xu, “Sequential anomaly detection based on temporal-difference
erties of hybrid systems using the cross-entropy method,” in Proc.
learning: Principles, models and case studies,” Appl. Soft Comput.,
15th ACM Int. Conf. Hybrid Syst., Comput. Control (HSCC), 2012,
vol. 10, no. 3, pp. 859–867, Jun. 2010.
pp. 125–134.
[112] B. Deokar and A. Hazarnis, “Intrusion detection system using log files
[89] A. Ferdowsi, U. Challita, W. Saad, and N. B. Mandayam, “Robust
and reinforcement learning,” Int. J. Comput. Appl., vol. 45, no. 19,
deep reinforcement learning for security and safety in autonomous
pp. 28–35, 2012.
vehicle systems,” in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC),
[113] K. Malialis, “Distributed reinforcement learning for network intrusion
Nov. 2018, pp. 307–312.
response,” Doctoral Dissertation, Dept. Comput. Sci., Univ. York, York,
[90] X. Wang, R. Jiang, L. Li, Y.-L. Lin, and F.-Y. Wang, “Long memory is U.K., 2014.
important: A test study on deep-learning based car-following model,”
[114] K. Malialis and D. Kudenko, “Distributed response to network intru-
Phys. A, Stat. Mech. Appl., vol. 514, pp. 786–795, Jan. 2019.
sions using multiagent reinforcement learning,” Eng. Appl. Artif. Intell.,
[91] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural vol. 41, pp. 270–284, May 2015.
Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [115] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning.
[92] I. Rasheed, F. Hu, and L. Zhang, “Deep reinforcement learning Cambridge, MA, USA: MIT Press, 1998.
approach for autonomous vehicle systems for maintaining security [116] D. K. Y. Yau, J. C. S. Lui, F. Liang, and Y. Yam, “Defending against
and safety using LSTM-GAN,” Veh. Commun., vol. 26, Dec. 2020, distributed denial-of-service attacks with max-min fair server-centric
Art. no. 100266. router throttles,” IEEE/ACM Trans. Netw., vol. 13, no. 1, pp. 29–42,
[93] A. Gupta and Z. Yang, “Adversarial reinforcement learning for Feb. 2005.
observer design in autonomous systems under cyber attacks,” 2018, [117] R. Bhosale, S. Mahajan, and P. Kulkarni, “Cooperative machine learn-
arXiv:1809.06784. [Online]. Available: http://arxiv.org/abs/1809.06784 ing for intrusion detection system,” Int. J. Sci. Eng. Res., vol. 5, no. 1,
[94] A. Abubakar and B. Pranggono, “Machine learning based intrusion pp. 1780–1785, 2014.
detection system for software defined networks,” in Proc. 7th Int. Conf. [118] A. Herrero and E. Corchado, “Multiagent systems for network intrusion
Emerg. Secur. Technol. (EST), Sep. 2017, pp. 138–143. detection: A review,” in Computational Intelligence in Security for
[95] S. Jose, D. Malathi, B. Reddy, and D. Jayaseeli, “A survey on anomaly Information Systems. Berlin, Germany: Springer, 2009, pp. 143–154.
based host intrusion detection system,” J. Phys., Conf. Ser., vol. 1000, [119] A. Detwarasiti and R. D. Shachter, “Influence diagrams for team
Apr. 2018, Art. no. 012049. decision analysis,” Decis. Anal., vol. 2, no. 4, pp. 207–228, Dec. 2005.
[96] S. Roshan, Y. Miche, A. Akusok, and A. Lendasse, “Adaptive and [120] S. Shamshirband, A. Patel, N. B. Anuar, M. L. M. Kiah, and A. Abra-
online network intrusion detection system using clustering and extreme ham, “Cooperative game theoretic approach using fuzzy Q-learning for
learning machines,” J. Franklin Inst., vol. 355, no. 4, pp. 1752–1779, detecting and preventing intrusions in wireless sensor networks,” Eng.
Mar. 2018. Appl. Artif. Intell., vol. 32, pp. 228–241, Jun. 2014.
[97] S. Dey, Q. Ye, and S. Sampalli, “A machine learning based intrusion [121] P. Muñoz, R. Barco, and I. de la Bandera, “Optimization of load bal-
detection scheme for data fusion in mobile clouds involving heteroge- ancing using fuzzy Q-Learning for next generation wireless networks,”
neous client networks,” Inf. Fusion, vol. 49, pp. 205–215, Sep. 2019. Expert Syst. Appl., vol. 40, no. 4, pp. 984–994, Mar. 2013.
[98] D. Papamartzivanos, F. G. Mármol, and G. Kambourakis, “Introducing [122] S. Shamshirband, N. B. Anuar, M. L. M. Kiah, and A. Patel,
deep learning self-adaptive misuse network intrusion detection sys- “An appraisal and design of a multi-agent system based cooperative
tems,” IEEE Access, vol. 7, pp. 13546–13560, 2019. wireless intrusion detection computational intelligence technique,” Eng.
[99] W. Haider, G. Creech, Y. Xie, and J. Hu, “Windows based data sets Appl. Artif. Intell., vol. 26, no. 9, pp. 2105–2127, Oct. 2013.
for evaluation of robustness of host based intrusion detection systems [123] S. Varshney and R. Kuma, “Variants of LEACH routing protocol in
(IDS) to zero-day and stealth attacks,” Future Internet, vol. 8, no. 4, WSN: A comparative analysis,” in Proc. 8th Int. Conf. Cloud Comput.,
p. 29, Jul. 2016. Data Sci. Eng. (Confluence), Jan. 2018, pp. 199–204.
[124] G. Caminero, M. Lopez-Martin, and B. Carro, “Adversarial envi- [147] B. Wang, Y. Wu, K. J. R. Liu, and T. C. Clancy, “An anti-jamming
ronment reinforcement learning algorithm for intrusion detection,” stochastic game for cognitive radio networks,” IEEE J. Sel. Areas
Comput. Netw., vol. 159, pp. 96–109, Aug. 2019. Commun., vol. 29, no. 4, pp. 877–889, Apr. 2011.
[125] M. Lopez-Martin, B. Carro, and A. Sanchez-Esguevillas, “Application [148] M. L. Littman, “Markov games as a framework for multiagent rein-
of deep reinforcement learning to intrusion detection for supervised forcement learning,” in Proc. 11th Int. Conf. Mach. Learn., 1994,
problems,” Expert Syst. Appl., vol. 141, Mar. 2020, Art. no. 112963. pp. 157–163.
[126] I. A. Saeed, A. Selamat, M. F. Rohani, O. Krejcar, and J. A. Chaudhry, [149] L. Xiao, Y. Li, J. Liu, and Y. Zhao, “Power control with reinforcement
“A systematic state-of-the-art analysis of multi-agent intrusion detec- learning in cooperative cognitive radio networks against jamming,”
tion,” IEEE Access, vol. 8, pp. 180184–180209, 2020. J. Supercomput., vol. 71, no. 9, pp. 3237–3257, Sep. 2015.
[127] S. Roy, C. Ellis, S. Shiva, D. Dasgupta, V. Shandilya, and Q. Wu, [150] D. Yang, G. Xue, J. Zhang, A. Richa, and X. Fang, “Coping with
“A survey of game theory as applied to network security,” in Proc. a smart jammer in wireless networks: A Stackelberg game approach,”
43rd Hawaii Int. Conf. Syst. Sci., 2010, pp. 1–10. IEEE Trans. Wireless Commun., vol. 12, no. 8, pp. 4038–4047,
[128] S. Shiva, S. Roy, and D. Dasgupta, “Game theory for cyber security,” Aug. 2013.
in Proc. 6th Annu. Workshop Cyber Secur. Inf. Intell. Res. (CSIIRW), [151] M. Bowling and M. Veloso, “Multiagent learning using a variable
2010, p. 34. learning rate,” Artif. Intell., vol. 136, no. 2, pp. 215–250, Apr. 2002.
[129] K. Ramachandran and Z. Stefanova, “Dynamic game theories in [152] G. Han, L. Xiao, and H. V. Poor, “Two-dimensional anti-jamming
cyber security,” in Proc. Int. Conf. Dyn. Syst. Appl., vol. 7, 2016, communication based on deep reinforcement learning,” in Proc. IEEE
pp. 303–310. Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017,
[130] Y. Wang, Y. Wang, J. Liu, Z. Huang, and P. Xie, “A survey of game pp. 2087–2091.
theoretic methods for cyber security,” in Proc. IEEE 1st Int. Conf. Data [153] X. Liu, Y. Xu, L. Jia, Q. Wu, and A. Anpalagan, “Anti-jamming
Sci. Cyberspace (DSC), Jun. 2016, pp. 631–636. communications using spectrum waterfall: A deep reinforcement learn-
[131] Q. Zhu and S. Rass, “Game theory meets network security: A tutorial,” ing approach,” IEEE Commun. Lett., vol. 22, no. 5, pp. 998–1001,
in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2018, Mar. 2018.
pp. 2163–2165. [154] W. Chen and X. Wen, “Perceptual spectrum waterfall of pattern shape
[132] A. Mpitziopoulos, D. Gavalas, C. Konstantopoulos, and G. Pantziou, recognition algorithm,” in Proc. 18th Int. Conf. Adv. Commun. Technol.
“A survey on jamming attacks and countermeasures in WSNs,” IEEE (ICACT), Jan. 2016, pp. 382–389.
Commun. Surveys Tuts., vol. 11, no. 4, pp. 42–56, 4th Quart., 2009. [155] K. Zeng, K. Govindan, and P. Mohapatra, “Non-cryptographic authen-
tication and identification in wireless networks,” IEEE Wireless Com-
[133] S. Hu, D. Yue, X. Xie, X. Chen, and X. Yin, “Resilient event-
mun., vol. 17, no. 5, pp. 56–62, Oct. 2010.
triggered controller synthesis of networked control systems under
periodic DoS jamming attacks,” IEEE Trans. Cybern., vol. 49, no. 12, [156] L. Xiao, Y. Li, G. Liu, Q. Li, and W. Zhuang, “Spoofing detection with
pp. 4271–4281, Dec. 2019, doi: 10.1109/TCYB.2018.2861834. reinforcement learning in wireless networks,” in Proc. IEEE Global
Commun. Conf. (GLOBECOM), Dec. 2015, pp. 1–5.
[134] H. Boche and C. Deppe, “Secure identification under passive eaves-
[157] L. Xiao, Y. Li, G. Han, G. Liu, and W. Zhuang, “PHY-layer spoofing
droppers and active jamming attacks,” IEEE Trans. Inf. Forensics
detection with reinforcement learning in wireless networks,” IEEE
Security, vol. 14, no. 2, pp. 472–485, Feb. 2019.
Trans. Veh. Technol., vol. 65, no. 12, pp. 10037–10047, Dec. 2016.
[135] Y. Wu, B. Wang, K. J. R. Liu, and T. C. Clancy, “Anti-jamming
[158] R. S. Sutton, “Integrated architecture for learning, planning, and
games in multi-channel cognitive radio networks,” IEEE J. Sel. Areas
reacting based on approximating dynamic programming,” in Proc. 7th
Commun., vol. 30, no. 1, pp. 4–15, Jan. 2012.
Int. Conf. Mach. Learn., 1990, pp. 216–224.
[136] S. Singh and A. Trivedi, “Anti-jamming in cognitive radio networks [159] X. Sun, J. Dai, P. Liu, A. Singhal, and J. Yen, “Using Bayesian
using reinforcement learning algorithms,” in Proc. 9th Int. Conf. networks for probabilistic identification of zero-day attack paths,”
Wireless Opt. Commun. Netw. (WOCN), Sep. 2012, pp. 1–5. IEEE Trans. Inf. Forensics Security, vol. 13, no. 10, pp. 2506–2521,
[137] Y. Gwon, S. Dastangoo, C. Fossa, and H. T. Kung, “Competing mobile Oct. 2018.
network game: Embracing antijamming and jamming strategies with [160] Y. Afek, A. Bremler-Barr, and S. L. Feibish, “Zero-day signature
reinforcement learning,” in Proc. IEEE Conf. Commun. Netw. Secur. extraction for high-volume attacks,” IEEE/ACM Trans. Netw., vol. 27,
(CNS), Oct. 2013, pp. 28–36. no. 2, pp. 691–706, Apr. 2019, doi: 10.1109/TNET.2019.2899124.
[138] W. G. Conley and A. J. Miller, “Cognitive jamming game for dynam- [161] X. Wan, G. Sheng, Y. Li, L. Xiao, and X. Du, “Reinforcement learning
ically countering ad hoc cognitive radio networks,” in Proc. MILCOM based mobile offloading for cloud-based malware detection,” in Proc.
IEEE Mil. Commun. Conf., Nov. 2013, pp. 1176–1182. GLOBECOM IEEE Global Commun. Conf., Dec. 2017, pp. 1–6.
[139] K. Dabcevic, A. Betancourt, L. Marcenaro, and C. S. Regazzoni, “A fic- [162] Y. Li, J. Liu, Q. Li, and L. Xiao, “Mobile cloud offloading for malware
titious play-based game-theoretical approach to alleviating jamming detections with learning,” in Proc. IEEE Conf. Comput. Commun.
attacks for cognitive radios,” in Proc. IEEE Int. Conf. Acoust., Speech Workshops (INFOCOM WKSHPS), Apr. 2015, pp. 197–201.
Signal Process. (ICASSP), May 2014, pp. 8158–8162. [163] M. A. Salahuddin, A. Al-Fuqaha, and M. Guizani, “Software-defined
[140] F. Slimeni, B. Scheers, Z. Chtourou, and V. Le Nir, “Jamming networking for RSU clouds in support of the internet of vehicles,”
mitigation in cognitive radio networks using a modified Q-learning IEEE Internet Things J., vol. 2, no. 2, pp. 133–144, Apr. 2015.
algorithm,” in Proc. Int. Conf. Mil. Commun. Inf. Syst. (ICMCIS), [164] R. Huang, X. Chu, J. Zhang, and Y. H. Hu, “Energy-efficient mon-
May 2015, pp. 1–7. itoring in software defined wireless sensor networks using reinforce-
[141] L. Xiao, X. Lu, D. Xu, Y. Tang, L. Wang, and W. Zhuang, “UAV relay ment learning: A prototype,” Int. J. Distrib. Sensor Netw., vol. 2015,
in VANETs against smart jamming with reinforcement learning,” IEEE pp. 1–12, Oct. 2015.
Trans. Veh. Technol., vol. 67, no. 5, pp. 4087–4097, May 2018. [165] S. Kim, J. Son, A. Talukder, and C. S. Hong, “Congestion prevention
[142] L. Xiao, X. Wan, C. Dai, X. Du, X. Chen, and M. Guizani, “Security mechanism based on Q-leaning for efficient routing in SDN,” in Proc.
in mobile edge caching with reinforcement learning,” IEEE Wireless Int. Conf. Inf. Netw. (ICOIN), Jan. 2016, pp. 124–128.
Commun., vol. 25, no. 3, pp. 116–122, Jun. 2018. [166] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, “QoS-aware adap-
[143] M. A. Aref, S. K. Jayaweera, and S. Machuzak, “Multi-agent reinforce- tive routing in multi-layer hierarchical software defined networks: A
ment learning based cognitive anti-jamming,” in Proc. IEEE Wireless reinforcement learning approach,” in Proc. IEEE Int. Conf. Services
Commun. Netw. Conf. (WCNC), Mar. 2017, pp. 1–6. Comput. (SCC), Jun. 2016, pp. 25–33.
[144] S. Machuzak and S. K. Jayaweera, “Reinforcement learning based [167] A. Mestres et al., “Knowledge-defined networking,” ACM SIGCOMM
anti-jamming with wideband autonomous cognitive radios,” in Proc. Comput. Commun. Rev., vol. 47, no. 3, pp. 2–10, Sep. 2017.
IEEE/CIC Int. Conf. Commun. China (ICCC), Jul. 2016, pp. 1–5. [168] Y. Han et al., “Reinforcement learning for autonomous defence in
[145] M. D. Felice, L. Bedogni, and L. Bononi, “Reinforcement learning- software-defined networking,” in Proc. Int. Conf. Decis. Game Theory
based spectrum management for cognitive radio networks: A literature Secur., 2018, pp. 145–165.
review and case study,” in Handbook of Cognitive Radio. Singapore: [169] B. Lantz and B. O’Connor, “A mininet-based virtual testbed for
Springer, 2019, pp. 1849–1886. distributed SDN development,” ACM SIGCOMM Comput. Commun.
[146] A. Attar, H. Tang, A. V. Vasilakos, F. R. Yu, and V. C. M. Leung, Rev., vol. 45, no. 4, pp. 365–366, 2015.
“A survey of security challenges in cognitive radio networks: Solu- [170] M. Zhu, Z. Hu, and P. Liu, “Reinforcement learning algorithms for
tions and future research directions,” Proc. IEEE, vol. 100, no. 12, adaptive cyber defense against heartbleed,” in Proc. 1st ACM Workshop
pp. 3172–3186, Dec. 2012. Moving Target Defense (MTD), 2014, pp. 51–58.
[171] J. Wang, M. Zhao, Q. Zeng, D. Wu, and P. Liu, “Risk assessment [193] L. Xiao, Y. Li, G. Han, H. Dai, and H. V. Poor, “A secure mobile
of buffer ‘heartbleed’ over-read vulnerabilities,” in Proc. 45th Annu. crowdsensing game with deep reinforcement learning,” IEEE Trans.
IEEE/IFIP Int. Conf. Dependable Syst. Netw., Jun. 2015, pp. 555–562. Inf. Forensics Security, vol. 13, no. 1, pp. 35–47, Jan. 2018.
[172] B. Luo, Y. Yang, C. Zhang, Y. Wang, and B. Zhang, “A survey of code [194] Y. Liu, M. Dong, K. Ota, J. Li, and J. Wu, “Deep reinforcement
reuse attack and defense,” in Proc. Int. Conf. Intell. Interact. Syst. Appl., learning based smart mitigation of DDoS flooding in software-defined
2018, pp. 782–788. networks,” in Proc. IEEE 23rd Int. Workshop Comput. Aided Modeling
[173] R. Elderman, L. J. J. Pater, A. S. Thie, M. M. Drugan, and Design Commun. Links Netw. (CAMAD), Sep. 2018, pp. 1–6.
M. M. Wiering, “Adversarial reinforcement learning in a cyber secu- [195] Y. Xu, G. Ren, J. Chen, X. Zhang, L. Jia, and L. Kong, “Interference-
rity simulation,” in Proc. 9th Int. Conf. Agents Artif. Intell., 2017, aware cooperative anti-jamming distributed channel selection in
pp. 559–566. UAV communication networks,” Appl. Sci., vol. 8, no. 10, p. 1911,
[174] K. Chung, C. A. Kamhoua, K. A. Kwiat, Z. T. Kalbarczyk, and Oct. 2018.
R. K. Iyer, “Game theory with learning for cyber security monitoring,” [196] F. Yao and L. Jia, “A collaborative multi-agent reinforcement learn-
in Proc. IEEE 17th Int. Symp. High Assurance Syst. Eng. (HASE), ing anti-jamming algorithm in wireless networks,” IEEE Wireless
Jan. 2016, pp. 1–8. Commun. Lett., vol. 8, no. 4, pp. 1024–1027, Aug. 2019, doi:
[175] F. Wei, Z. Wan, and H. He, “Cyber-attack recovery strategy for smart 10.1109/LWC.2019.2904486.
grid based on deep reinforcement learning,” IEEE Trans. Smart Grid, [197] Y. Li et al., “On the performance of deep reinforcement learning-based
vol. 11, no. 3, pp. 2476–2486, May 2020. anti-jamming method confronting intelligent jammer,” Appl. Sci., vol. 9,
[176] X. Liu, J. Ospina, and C. Konstantinou, “Deep reinforcement learning no. 7, p. 1361, 2019.
for cybersecurity assessment of wind integrated power systems,” 2020, [198] M. Chatterjee and A.-S. Namin, “Detecting phishing websites through
arXiv:2007.03025. [Online]. Available: http://arxiv.org/abs/2007.03025 deep reinforcement learning,” in Proc. IEEE 43rd Annu. Comput. Softw.
[177] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional Appl. Conf. (COMPSAC), Jul. 2019, pp. 227–232.
video prediction using deep networks in Atari games,” in Proc. Adv.
Neural Inf. Process. Syst., 2015, pp. 2863–2871.
[178] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video pre-
diction beyond mean square error,” 2015, arXiv:1511.05440. [Online].
Available: http://arxiv.org/abs/1511.05440
[179] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network
dynamics for model-based deep reinforcement learning with model-free
fine-tuning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, Thanh Thi Nguyen received the Ph.D. degree in
pp. 7559–7566. mathematics and statistics from Monash University,
[180] C. B. Browne et al., “A survey of Monte Carlo tree search methods,” Clayton, VIC, Australia, in 2013.
IEEE Trans. Comput. Intell. AI Games, vol. 4, no. 1, pp. 1–43, He was a Visiting Scholar with the Depart-
Mar. 2012. ment of Computer Science, Stanford University,
[181] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value iteration Stanford, CA, USA, in 2015, and the Edge Com-
networks,” in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, puting Laboratory, John A. Paulson School of Engi-
pp. 2154–2162. neering and Applied Sciences, Harvard University,
[182] R. Pascanu et al., “Learning model-based planning from scratch,” 2017, Massachusetts, USA, in 2019. He is currently a
arXiv:1707.06170. [Online]. Available: http://arxiv.org/abs/1707.06170 Senior Lecturer with the School of Information
[183] T. T. Nguyen, Q. V. H. Nguyen, C. M. Nguyen, D. Nguyen, D. Thanh Technology, Deakin University, Burwood, VIC, Aus-
Nguyen, and S. Nahavandi, “Deep learning for deepfakes creation tralia. He received an Alfred Deakin Post-Doctoral Research Fellowship
and detection: A survey,” 2019, arXiv:1909.11573. [Online]. Available: in 2016, a European–Pacific Partnership for the ICT Expert Exchange Pro-
http://arxiv.org/abs/1909.11573 gram Award from the European Commission in 2018, and an Australia–India
[184] M. Giles. (Jan. 4, 2019). Five Emerging Cyber-Threats to Worry Strategic Research Fund Early- and Mid-Career Fellowship from the Aus-
About in 2019. MIT Technology Review. [Online]. Available: tralian Academy of Science in 2020. He has expertise in various areas,
https://www.technologyreview.com/2019/01/04/66232/five-emerging- including artificial intelligence, deep learning, deep reinforcement learning,
cyber-threats-2019/ cyber security, IoT, and data science.
[185] V. Behzadan and A. Munir, “Vulnerability of deep reinforcement
learning to policy induction attacks,” in Proc. Int. Conf. Mach. Learn.
Data Mining Pattern Recognit., 2017, pp. 262–275.
[186] V. Duddu, “A survey of adversarial machine learning in cyber warfare,”
Defence Sci. J., vol. 68, no. 4, pp. 356–366, 2018.
[187] T. Chen, J. Liu, Y. Xiang, W. Niu, E. Tong, and Z. Han, “Adversarial
attack and defense in reinforcement learning-from AI security view,”
Cybersecurity, vol. 2, no. 1, p. 11, Dec. 2019. Vijay Janapa Reddi (Member, IEEE) received the
[188] I. Ilahi et al., “Challenges and countermeasures for adversarial attacks Ph.D. degree in computer science from Harvard
on deep reinforcement learning,” 2020, arXiv:2001.09684. [Online]. University, Cambridge, MA, USA, in 2010.
Available: http://arxiv.org/abs/2001.09684 He is currently an Associate Professor with the
[189] L. Tong, A. Laszka, C. Yan, N. Zhang, and Y. Vorobeychik, “Finding John A. Paulson School of Engineering and Applied
needles in a moving haystack: Prioritizing alerts with adversarial Sciences, Harvard University, where he is the Direc-
reinforcement learning,” in Proc. AAAI Conf. Artif. Intell., vol. 34, tor of the Edge Computing Laboratory. His research
no. 1, 2020, pp. 946–953. interests include computer architecture and system-
[190] J. Sun et al., “Stealthy and efficient adversarial attacks against deep software design, with an emphasis on the context
reinforcement learning,” 2020, arXiv:2005.07099. [Online]. Available: of mobile and edge computing platforms based on
http://arxiv.org/abs/2005.07099 machine learning.
[191] T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Multi-agent deep Dr. Reddi was a recipient of multiple awards, including the Best Paper
reinforcement learning with human strategies,” in Proc. IEEE Award at the 2005 International Symposium on Microarchitecture and the
Int. Conf. Ind. Technol. (ICIT), Feb. 2019, pp. 1357–1362, doi: 2009 International Symposium on High-Performance Computer Architecture,
10.1109/ICIT.2019.8755032. the IEEE’s Top Picks in Computer Architecture Awards in 2006, 2010, 2011,
[192] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement 2016, and 2017, the Google Faculty Research Awards in 2012, 2013, 2015,
learning for multiagent systems: A review of challenges, solutions, and 2017, the Intel Early Career Award in 2013, the National Academy of
and applications,” IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826–3839, Engineering Gilbreth Lecturer Honor in 2016, and the IEEE TCCA Young
Sep. 2020. Computer Architect Award in 2016.