Deep Reinforcement Learning for Cyber Security
Abstract— The scale of Internet-connected systems has increased considerably, and these systems are being exposed to cyberattacks more than ever. The complexity and dynamics of cyberattacks require protecting mechanisms to be responsive, adaptive, and scalable. Machine learning, or more specifically deep reinforcement learning (DRL), methods have been proposed widely to address these issues. By incorporating deep learning into traditional RL, DRL is highly capable of solving complex, dynamic, and especially high-dimensional cyber defense problems. This article presents a survey of DRL approaches developed for cyber security. We touch on different vital aspects, including DRL-based security methods for cyber–physical systems, autonomous intrusion detection techniques, and multiagent DRL-based game theory simulations for defense strategies against cyberattacks. Extensive discussions and future research directions on DRL-based cyber security are also given. We expect that this comprehensive review provides the foundations for and facilitates future studies on exploring the potential of emerging DRL to cope with increasingly complex cyber security problems.

Index Terms— Cyber defense, cyber security, cyberattacks, deep learning, deep reinforcement learning (DRL), Internet of Things (IoT), review, survey.

Manuscript received 21 July 2020; revised 16 January 2021 and 16 April 2021; accepted 18 October 2021. Date of publication 1 November 2021; date of current version 4 August 2023. (Corresponding author: Thanh Thi Nguyen.) Thanh Thi Nguyen is with the School of Information Technology, Deakin University, Melbourne Burwood Campus, Burwood, VIC 3125, Australia (e-mail: [email protected]). Vijay Janapa Reddi is with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138 USA (e-mail: [email protected]). Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3121870. Digital Object Identifier 10.1109/TNNLS.2021.3121870

I. INTRODUCTION

INTERNET of Things (IoT) technologies have been employed broadly in many sectors, such as telecommunications, transportation, manufacturing, water and power management, healthcare, education, finance, government, and even entertainment. The convergence of various information and communication technology (ICT) tools in the IoT has boosted its functionalities and services to users to new levels. ICT has witnessed a remarkable development in terms of system design, network architecture, and intelligent devices in the last decade. For example, ICT has been advanced with the innovations of cognitive radio network (CRN) and 5G cellular network [1], [2], software-defined network (SDN) [3], cloud computing [4], (mobile) edge caching [5], [6], and fog computing [7]. Accompanying these developments is the increasing vulnerability to cyberattacks, which are defined as any type of offensive maneuver exercised by one or multiple computers to target computer information systems, network infrastructures, or personal computer devices. Cyberattacks may be instigated by economic competitors or state-sponsored attackers. There has been, thus, a critical need for the development of cyber security technologies to mitigate and eliminate impacts of these attacks [8].

Artificial intelligence (AI), especially machine learning (ML), has been applied to both attacking and defending in cyberspace. On the attacker side, ML is utilized to compromise defense strategies. On the cyber security side, ML is employed to put up robust resistance against security threats in order to adaptively prevent and minimize the impacts or damages that occurred. Among these ML applications, unsupervised and supervised learning methods have been used widely for intrusion detection [9]–[11], malware detection [12]–[14], cyber–physical attacks [15]–[17], and data privacy protection [18]. In principle, unsupervised methods explore the structure and patterns of data without using their labels, while supervised methods learn by examples based on data's labels. These methods, however, cannot provide dynamic and sequential responses against cyberattacks, especially new or constantly evolving threats. In addition, the detection and defending responses often take place after the attacks, when traces of attacks become available for collecting and analyzing, and thus, proactive defense solutions are hindered. A statistical study shows that 62% of the attacks were recognized only after they had caused significant damage to the cyber systems [19].

Reinforcement learning (RL), a branch of ML, is the closest form of human learning because it can learn by its own experience through exploring and exploiting the unknown environment. RL can model an autonomous agent to take sequential actions optimally without or with limited prior knowledge of the environment, and thus, it is particularly adaptable and useful in real-time and adversarial environments. With the power of function approximation and representation learning, deep learning has been incorporated into RL methods and enabled them to solve many complex problems [20]–[24]. The combination of deep learning and RL, therefore, indicates excellent suitability for cyber security applications where cyberattacks are increasingly sophisticated, rapid, and ubiquitous [25]–[28].

The emergence of DRL has actually witnessed great success in different fields, from the video game domain, e.g., Atari [29], [30], the game of Go [31], [32], the real-time strategy game StarCraft II [33]–[36], the 3-D multiplayer game Quake III Arena Capture the Flag [37], and the teamwork game Dota 2 [38], to real-world applications, such as robotics [39], autonomous vehicles (AVs) [40], autonomous surgery [41], [42], natural language processing [43], biological data mining [44], and drug design [45]. DRL methods have also recently been applied to solve various problems in the IoT area. For example, a DRL-based resource allocation framework that integrates networking, caching, and computing capabilities for smart city
Fig. 2. DQN architecture with the loss function described by L(β) = E[(r + γ max_a′ Q(s′, a′|β′) − Q(s, a|β))²], where β and β′ are parameters of the estimation and target DNNs, respectively. Each action taken by the agent will generate an experience, which consists of the current state s, action a, reward r, and next state s′. These learning experiences (samples) are stored in the
experience replay memory, which are then retrieved randomly for a stable
learning process.
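For concreteness, the following is a minimal PyTorch-style sketch of the loss L(β) and the replay sampling described in the caption of Fig. 2. The network sizes, the buffer, and the names (q_net, target_net, replay) are illustrative assumptions rather than details taken from any of the surveyed works.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

# Estimation network Q(s, a | beta) and target network Q(s, a | beta');
# the 4-D state and 2 actions are placeholder sizes for illustration only.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())

replay = deque(maxlen=10_000)          # experience replay memory
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def store(s, a, r, s_next, done):
    """Each interaction produces an experience (s, a, r, s') kept for replay."""
    replay.append((s, a, r, s_next, done))

def dqn_update(batch_size=32):
    """One gradient step on L(beta) = E[(r + gamma*max_a' Q(s',a'|beta') - Q(s,a|beta))^2]."""
    batch = random.sample(replay, batch_size)      # random retrieval stabilizes learning
    s, a, r, s_next, done = map(torch.as_tensor, zip(*batch))
    s, s_next = s.float(), s_next.float()
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # the target network is held fixed
        target = r.float() + gamma * target_net(s_next).max(1).values * (1 - done.float())
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The parameters β′ of the target network are copied from β only periodically, which keeps the regression target quasi-stationary during learning.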
Fig. 3. Learning architecture of A3C, consisting of a global network and a number of worker agents. Each worker initially resets its parameters to those of the global network and interacts with its copy of the environment for learning. Gradients obtained from these individual learning processes will be used to update the global network asynchronously. This increases the learning speed and diversifies the experience learned by the global network as the experiences obtained by individual worker agents are independent.

over all possible actions. REINFORCE [64], vanilla policy gradient [65], trust region policy optimization (TRPO) [66], and proximal policy optimization (PPO) [67] are notable policy gradient methods. The gradient estimation, however, often suffers from a large fluctuation [68]. The combination of value-based and policy-gradient methods has been developed to aggregate the advantages and eradicate the disadvantages of these two methods. This kind of combination has constituted another type of RL, i.e., actor-critic methods. This structure comprises two components: an actor and a critic that can both be characterized by DNNs. The actor attempts to learn a policy by receiving feedback from the critic. This iterative process helps the actor improve its strategy and converge to an optimal policy. Deep deterministic policy gradient (DDPG) [69], distributed distributional DDPG (D4PG) [70], A3C [50], and unsupervised reinforcement and auxiliary learning (UNREAL) [71] are methods that utilize the actor-critic framework. An illustrative architecture of the popular algorithm A3C is presented in Fig. 3. A3C's structure consists of a hierarchy of a master learning agent (global) and individual learners (workers). Both the master agent and individual learners are modeled by DNNs with each having two outputs: one for the critic and another for the actor. The first output is a scalar value representing the expected reward of a given state V(s), while the second output is a vector of values representing a probability distribution over all possible actions π(s, a).

The value loss function of the critic is specified by

L1 = (R − V(s))²    (2)

where R = r + γV(s′) is the discounted future reward. In addition, the actor pursues minimization of the following policy loss function:

L2 = −log(π(a|s)) · A(s) − ϑH(π)    (3)

where A(s) = R − V(s) is the estimated advantage function, and H(π) is the entropy term, which handles the exploration capability of the agent with the hyperparameter ϑ controlling the strength of the entropy regularization. The advantage function A(s) shows how advantageous the agent is when it is in a particular state. The learning process of A3C is asynchronous because each learner interacts with its separate environment and updates the master network independently. This process is iterated, and the master network is the one to use when the learning is finished.
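As an illustration of how the two losses in (2) and (3) are typically combined in code, here is a minimal PyTorch-style sketch of a single worker's update on one (s, a, r, s′) transition. The two-headed network and the coefficient name vartheta (standing in for ϑ) are assumptions made for the example, not details of A3C's reference implementation.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with two heads: a scalar V(s) (critic) and logits for pi(s, a) (actor)."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value_head = nn.Linear(64, 1)
        self.policy_head = nn.Linear(64, n_actions)

    def forward(self, s):
        h = self.trunk(s)
        return self.value_head(h).squeeze(-1), self.policy_head(h)

def worker_loss(net, s, a, r, s_next, gamma=0.99, vartheta=0.01):
    """Compute L1 = (R - V(s))^2 and L2 = -log pi(a|s) * A(s) - vartheta * H(pi)."""
    v_s, logits = net(s)
    with torch.no_grad():
        v_next, _ = net(s_next)
    R = r + gamma * v_next                     # one-step discounted return
    advantage = (R - v_s).detach()             # A(s) = R - V(s), kept out of the actor's gradient
    dist = torch.distributions.Categorical(logits=logits)
    value_loss = (R - v_s).pow(2)              # critic loss (2)
    policy_loss = -dist.log_prob(a) * advantage - vartheta * dist.entropy()   # actor loss (3)
    return (value_loss + policy_loss).mean()
```

In A3C, the gradients of this combined loss are computed locally by each worker and applied asynchronously to the global (master) network, as sketched in Fig. 3.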
Table I summarizes comparable features of value-based, policy-gradient, and actor-critic methods, and their typical example algorithms. Value-based methods are more sample efficient than policy-gradient methods because they are able to exploit data from other sources such as experts [72]. In DRL, a value function or a policy function is normally approximated by a universal function approximator, such as a (deep) neural network, which can take either discrete or continuous states as inputs. Therefore, modeling state spaces is more straightforward than dealing with action spaces in DRL. Value-based methods are suitable for problems with discrete action spaces as they evaluate every action explicitly and choose an action at each time step based on these evaluations. On the other hand, policy-gradient and actor-critic methods are more suitable for continuous action spaces because they describe the policy (a mapping between states and actions) as a probability distribution over actions. The continuity characteristic is the main difference between discrete and continuous action spaces. In a discrete action space, actions are characterized as a mutually exclusive set of options, while in a continuous action space, an action takes a value from a certain range or boundary [73].
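The distinction can be made concrete with the OpenAI Gym toolkit cited in [74] and [75]: a discrete action space enumerates mutually exclusive options, while a continuous one is a bounded box of real values. The environment names below are standard Gym examples used purely for illustration.

```python
import gym

# Discrete action space: a finite set of mutually exclusive options,
# the natural setting for value-based methods such as DQN.
cartpole = gym.make("CartPole-v1")             # classic control task [74]
print(cartpole.action_space)                   # Discrete(2): push left or push right

# Continuous action space: each action is a real-valued vector inside a box,
# the setting where policy-gradient/actor-critic methods (e.g., DDPG, PPO) are preferred.
lander = gym.make("LunarLanderContinuous-v2")  # Box2D task [75]
print(lander.action_space)                     # Box(-1.0, 1.0, (2,), float32): engine throttles
```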
III. DRL IN CYBER SECURITY: SURVEY

A large number of applications of RL to various aspects of cyber security have been proposed in the literature, ranging from data privacy to critical infrastructure protection.
TABLE I
Summary of Features of DRL Types and Their Notable Methods

Fig. 4. Different (sub)sections of the survey on DRL in cyber security.

Fig. 5. Dynamics of attack and defense in a CPS. The physical layer is often uncertain with disturbances w, while cyberattack a directly affects the cyber layer where a defense strategy d needs to be implemented. The dynamics of attack-defense characterized by θ(t, a, d) is injected into the conventional physical system to develop a cyber–physical co-modeling as presented in (4).
However, drawbacks of traditional RL have restricted its capability in solving complex and large-scale cyber security problems. The increasing number of connected IoT devices in recent years has led to a significant increase in the number of cyberattack instances and their complexity. The emergence of deep learning and its integration with RL has created a class of DRL methods that are able to detect and fight against sophisticated types of cyberattacks, such as falsified data injection to CPSs [86], deception attacks on autonomous systems [93], distributed denial-of-service (DDoS) attacks [114], intrusions into host computers or networks [125], jamming [142], spoofing [157], malware [161], attacks in adversarial networking environments [168], and so on. This section provides a comprehensive survey of state-of-the-art DRL-powered solutions for cyber security, ranging from defense methods for CPSs to autonomous intrusion detection approaches and game theory-based solutions.

The structure of the survey is presented in Fig. 4. We limit the survey to existing applications of DRL to cyber security. There are other topics in cyber security to which DRL has not yet been applied, and they are therefore discussed in Section IV (Discussions and Future Research Directions). Those potential topics include a multiagent DRL approach to cyber security, combining host-based and network-based intrusion detection systems (IDSs), model-based DRL and combining model-free and model-based DRL methods for cyber security applications, investigating methods that can deal with continuous action spaces in cyber environments, offensive AI, deepfakes, ML poisoning, adversarial ML, human–machine teaming within a human-on-the-loop architecture, bit-and-piece DDoS attacks, and potential attacks by quantum physics-based powerful computers to crack encryption algorithms.

A. DRL-Based Security Methods for Cyber–Physical Systems

Investigations of defense methods for CPS against cyberattacks have received considerable attention and interest from the cyber security research community. A CPS is a mechanism controlled by computer-based algorithms facilitated by Internet integration. This mechanism provides efficient management of distributed physical systems via the shared network. With the rapid development of the Internet and control technologies, CPSs have been used extensively in many areas, including manufacturing [76], health monitoring [77], [78], smart grid [79]–[81], and transportation [82], [83]. Being exposed widely to the Internet, these systems are increasingly vulnerable to cyberattacks [84]. In 2015, hackers attacked the control system of a steel mill in Germany by obtaining login credentials via the use of phishing emails. This attack caused a partial plant shutdown and resulted in damage of millions of dollars. Likewise, there was a costly cyberattack on a power grid in Ukraine in late December 2015 that disrupted the electricity supply to a few hundred thousand end consumers [85].

In an effort to study cyberattacks on CPS, Feng and Xu [85] characterized the cyber state dynamics by a mathematical model

ẋ(t) = f(t, x, u, w; θ(t, a, d));  x(t₀) = x₀    (4)

where x, u, and w represent the physical state, control inputs, and disturbances, respectively (see Fig. 5). In addition, θ(t, a, d) describes the cyber state at time t with a and d referring to cyberattack and defense, respectively.

The CPS defense problem is then modeled as a two-player zero-sum game in which the utilities of the players sum to zero.
in handling heavy traffic and high-speed networks because it must examine every packet passing through its network segment.

Regardless of the IDS type, two common detection methods are used: signature-based and anomaly-based detection. Signature detection involves the storage of patterns of known attacks and comparing characteristics of possible attacks to those in the database. Anomaly detection observes the normal behaviors of the system and alerts the administrator if any activities are found to deviate from normality, for instance, an unexpected increase of the traffic rate, i.e., the number of IP packets per second. ML techniques, including unsupervised clustering and supervised classification methods, have been used widely to build adaptive IDSs [94]–[98]. These methods, e.g., neural networks [99], k-nearest neighbors [99], [100], support vector machine (SVM) [99], [101], random forest [102], and recently deep learning [103]–[105], however, normally rely on fixed features of existing cyberattacks so that they are deficient in detecting new or deformed attacks. The lack of prompt responses to dynamic intrusions also leads to ineffective solutions produced by unsupervised or supervised techniques. In this regard, RL methods have been demonstrated to be effective in various IDS applications [106]. Sections III-B.1 and III-B.2 review the use of DRL methods in both host-based and network-based IDSs.

1) Host-Based IDS: As the volume of audit data and the complexity of intrusion behaviors increase, adaptive intrusion detection models demonstrate limited effectiveness because they can only handle temporally isolated labeled or unlabeled data. In practice, many complex intrusions comprise temporal sequences of dynamic behaviors. Xu and Xie [107] proposed an RL-based IDS that can handle this problem. System call trace data are fed into a Markov reward process whose state value can be used to detect abnormal temporal behaviors of host processes. The intrusion detection problem is, thus, converted to a state value prediction task of the Markov chains. The linear temporal difference (TD) RL algorithm [108] is used as the state value prediction model, where its outcomes are compared with a predetermined threshold to distinguish normal traces from attack traces. Instead of using the errors between real values and estimated ones, the TD learning algorithm uses the differences between successive approximations to update the state value function. Experimental results obtained from using system call trace data show the dominance of the proposed RL-based IDS in terms of higher accuracy and lower computational costs compared with SVM, hidden Markov model, and other ML or data mining methods. The proposed method based on linear basis functions, however, has a shortcoming when sequential intrusion behaviors are highly nonlinear. Therefore, a kernel-based RL approach using least-squares TD [109] was suggested for intrusion detection in [110] and [111]. Relying on the kernel methods, the generalization capability of TD RL is enhanced, especially in high-dimensional and nonlinear feature spaces. The kernel least-squares TD algorithm is, therefore, able to predict anomaly probabilities accurately, which contributes to improving the detection performance of IDS, especially when dealing with multistage cyberattacks.
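A minimal sketch of the idea behind [107] and [108] follows: states are short windows of system-call IDs, TD(0) learns a state-value estimate from traces of normal behavior, and at detection time a low predicted value flags an abnormal trace. The feature encoding, reward, and scoring rule below are illustrative assumptions, not the exact design of the cited works.

```python
import numpy as np

N_CALLS, WINDOW = 50, 3          # size of the system-call alphabet and of the sliding window

def encode(window):
    """Map a window of system-call IDs to one index of a tabular state-value function."""
    idx = 0
    for c in window:
        idx = idx * N_CALLS + c
    return idx

def td0_train(normal_traces, alpha=0.1, gamma=0.9, r_normal=1.0):
    """TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)), trained on normal traces only."""
    V = np.zeros(N_CALLS ** WINDOW)
    for trace in normal_traces:
        windows = [trace[i:i + WINDOW] for i in range(len(trace) - WINDOW)]
        for s, s_next in zip(windows[:-1], windows[1:]):
            i, j = encode(s), encode(s_next)
            V[i] += alpha * (r_normal + gamma * V[j] - V[i])
    return V

def anomaly_score(V, trace):
    """Traces visiting low-value (rarely seen) states are reported as suspicious."""
    windows = [trace[i:i + WINDOW] for i in range(len(trace) - WINDOW)]
    return -np.mean([V[encode(w)] for w in windows])

rng = np.random.default_rng(0)
normal = [list(rng.integers(0, 10, size=200)) for _ in range(50)]   # benign traces use calls 0-9
V = td0_train(normal)
attack = list(rng.integers(40, 50, size=200))                        # an attack uses unseen calls
print(anomaly_score(V, normal[0]), anomaly_score(V, attack))         # the attack trace scores higher
```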
2) Network-Based IDS: Deokar and Hazarnis [112] pointed out the drawbacks of both anomaly-based and signature-based detection methods. On the one hand, anomaly detection has a high false alarm rate because it may categorize activities that users rarely perform as an anomaly. On the other hand, signature detection cannot discover new types of attacks as it uses a database of patterns of well-known attacks. The authors, therefore, proposed an IDS that can identify known and unknown attacks effectively by combining features of both anomaly and signature detection through the use of log files. The proposed IDS is based on a collaboration of RL methods, association rule learning, and log correlation techniques. RL gives a reward (or penalty) to the system when it selects log files that contain (or do not contain) anomalies or any signs of attack. This procedure enables the system to choose more appropriate log files in searching for traces of the attack.

One of the most difficult challenges in the current Internet is dealing with the DDoS threat, which is a DoS attack with a distributed nature, occurring with a large traffic volume and compromising a large number of hosts. Malialis [113] and Malialis and Kudenko [114] initially introduced the multiagent router throttling method based on the SARSA algorithm [115] to address DDoS attacks by training multiple agents to rate-limit or throttle traffic toward a victim server. That method, however, has a limited capability in terms of scalability. They, therefore, further proposed the coordinated team learning design to the original multiagent router throttling based on the divide-and-conquer paradigm to eliminate the mentioned drawback. The proposed approach integrates three mechanisms, namely, task decomposition, hierarchical team-based communication, and team rewards, involving multiple defensive nodes across different locations to coordinately stop or reduce the flood of DDoS attacks. A network emulator was developed based on the work of Yau et al. [116] to evaluate throttling approaches. Simulation results show that the resilience and adaptability of the proposed method are superior to its competing methods, i.e., the baseline router throttling and additive-increase/multiplicative-decrease throttling algorithms [116], in various scenarios with different attack dynamics. The scalability of the proposed method was successfully experimented with up to 100 RL agents, which shows great potential for deployment in a large Internet service provider network.

Alternatively, Bhosale et al. [117] proposed a multiagent intelligent system [118] using RL and influence diagrams [119] to enable quick responses against complex attacks. Each agent learns its policy based on the local database and information received from other agents, i.e., decisions and events. Shamshirband et al. [120], on the other hand, introduced an intrusion detection and prevention system for wireless sensor networks (WSNs) based on a game theory approach and employed a fuzzy Q-learning algorithm [121], [122] to obtain optimal policies for the players. Sink nodes, a base station, and an attacker constitute a three-player game where the sink nodes and the base station are coordinated to derive a defense strategy against the DDoS attacks, particularly in the application layer. The IDS detects future attacks based on the fuzzy Q-learning algorithm that takes part in two phases: detection and defense.
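The following is a highly simplified, single-router sketch of the SARSA-based throttling idea in [113]–[115]: the state is the discretized load seen at the victim, the action is a throttle rate applied to upstream traffic, and the reward rewards preserved legitimate traffic while penalizing overload. All environment numbers are invented for illustration; the cited works instead use a multiagent network emulator.

```python
import numpy as np

rng = np.random.default_rng(1)
RATES = np.array([0.2, 0.4, 0.6, 0.8, 1.0])    # fraction of upstream traffic allowed through
N_STATES, CAPACITY = 10, 1.0                   # discretized victim-load levels and server capacity

def step(action_idx):
    """Toy environment: random legitimate + attack traffic, throttled by the chosen rate."""
    legit, attack = rng.uniform(0.2, 0.5), rng.uniform(0.5, 1.5)
    allowed = RATES[action_idx] * (legit + attack)
    load = min(allowed / CAPACITY, 0.999)
    overload_penalty = 1.0 if allowed > CAPACITY else 0.0
    reward = RATES[action_idx] * legit - overload_penalty   # keep legit traffic, avoid overload
    return int(load * N_STATES), reward

def epsilon_greedy(Q, s, eps=0.1):
    return rng.integers(len(RATES)) if rng.random() < eps else int(np.argmax(Q[s]))

Q = np.zeros((N_STATES, len(RATES)))
alpha, gamma = 0.1, 0.9
s, a = 0, epsilon_greedy(Q, 0)
for _ in range(20_000):      # SARSA update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))
    s_next, r = step(a)
    a_next = epsilon_greedy(Q, s_next)
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    s, a = s_next, a_next

print("preferred throttle per load level:", RATES[np.argmax(Q, axis=1)])
```

In the cited multiagent design, one such learner sits at each upstream router and the agents share a team reward, which is what allows the throttling to scale to many defensive nodes.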
TABLE II
Summary of Typical DRL Applications in Cyber Security

theory models involving multiple DRL agents. We found that this kind of application occupies a major proportion of papers in the literature relating to DRL for cyber security problems.

A. Challenges and Future Work on Applying DRL for CPS Security Solutions

An emerging area is the use of DRL for security solutions for CPSs [175], [176]. The large-scale and complex nature of CPSs, e.g., in environmental monitoring networks, electrical smart grid systems, transportation management networks, and cyber manufacturing management systems, requires security solutions to be responsive and accurate. This has been addressed by various DRL approaches, e.g., the TRPO algorithm [93], LSTM-Q-learning [89], DDQN, and A3C [86]. One of the great challenges in implementing DRL algorithms for CPS security solutions is the lack of realistic CPS simulations. For example, the work in [86] had to use the MATLAB/Simulink CPS modeling and embed it into the OpenAI Gym environment. This implementation is expensive in terms of computational time due to the overhead caused by integrating the MATLAB/Simulink CPS simulation in the OpenAI Gym library. More proper simulations of CPS models embedded directly in DRL-enabled environments are, thus, encouraged in future work.
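One lightweight way to follow that suggestion is to wrap a native Python CPS simulation directly in the classic Gym interface, so that no MATLAB/Simulink bridge is needed. The sketch below reuses the toy co-model from the example after (4); the class, bounds, and reward are illustrative assumptions only.

```python
import gym
import numpy as np
from gym import spaces

class ToyCPSDefenseEnv(gym.Env):
    """Classic Gym interface (reset/step) around a toy CPS under intermittent attack.
    Observation: plant state x; action: continuous defense effort d in [0, 1]."""
    observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(1,), dtype=np.float32)
    action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

    def __init__(self, dt=0.01, episode_len=500):
        self.dt, self.episode_len = dt, episode_len
        self.rng = np.random.default_rng()

    def reset(self):
        self.x, self.t = 1.0, 0
        return np.array([self.x], dtype=np.float32)

    def step(self, action):
        d = float(np.clip(action[0], 0.0, 1.0))
        a = 1.0 if self.rng.random() < 0.3 else 0.0        # intermittent cyberattack
        theta = 1.0 - 0.8 * a * (1.0 - d)                  # attack-defense dynamics term
        u, w = -2.0 * self.x, 0.05 * self.rng.standard_normal()
        self.x += self.dt * (-0.5 * self.x + theta * u + w)
        self.t += 1
        reward = -(self.x ** 2) - 0.1 * d                  # regulation error plus defense cost
        done = self.t >= self.episode_len
        return np.array([self.x], dtype=np.float32), reward, done, {}

# Any standard DRL library that speaks the Gym API can train on this environment directly.
```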
Another common challenge in applying DRL algorithms is to transfer the trained policies from simulations to real-world environments. While simulations are cheap and safe for training DRL agents, the reality
gap due to modeling impreciseness and errors makes the transfer challenging. This is more critical for CPS modeling because of the complexity, dynamics, and large scale of CPS. Research in this direction, i.e., sim-to-real transfer for DRL-based security solutions for CPS, is worth investigating as it can help to reduce time and costs, and increase safety during the training process of DRL agents, and eventually reduce costly mistakes when executing in real-world environments.

B. Challenges and Future Work on Applying DRL for IDS

Although there have been a large number of applications of traditional RL methods for IDSs, there has been only a small amount of work on DRL algorithms for this kind of application. This is probably because the integration of deep learning and RL methods has emerged only very recently. The complexity and dynamics of intrusion detection problems are expected to be solved effectively by DRL methods, which combine the powerful representation learning and function approximation capabilities of deep learning and the optimal sequential decision-making capability of traditional RL. Applying DRL for IDS requires simulated or real intrusion environments for training agents interactively. This is a great challenge because using real environments for training is costly, while simulated environments may be far from reality. Most of the existing studies on DRL for intrusion detection relied on game-based settings (e.g., Fig. 7) or labeled intrusion datasets. For example, the work in [125] used two datasets of labeled intrusion samples and adjusted the DRL machinery for it to work on these datasets in a supervised learning manner. This kind of application lacks a live environment and proper interactions between DRL agents and the environment. There is, thus, a gap for future research on creating more realistic environments, which are able to respond in real time to actions of the DRL agents and facilitate the full exploitation of the DRL's capabilities to solve complex and sophisticated cyber intrusion detection problems. Furthermore, as host-based and network-based IDSs have both advantages and disadvantages, combining these systems could be a logical approach. DRL-based solutions for this kind of integrated system would be another interesting future study.
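To illustrate the dataset-driven setting attributed to [125] above, the sketch below turns a labeled intrusion dataset into a simple episodic "environment": each step presents one sample, the action is a predicted class, and the reward is +1/-1 for a correct/incorrect prediction. This is a schematic reconstruction of that style of training loop, not the actual implementation of [125], and the feature dimensions are placeholders.

```python
import numpy as np

class LabeledDatasetEnv:
    """Treat a labeled intrusion dataset as a sequential decision problem:
    observation = feature vector of one flow, action = predicted class, reward = +/-1."""
    def __init__(self, features, labels, rng=None):
        self.X, self.y = features, labels
        self.rng = rng or np.random.default_rng()

    def reset(self):
        self.order = self.rng.permutation(len(self.X))
        self.i = 0
        return self.X[self.order[self.i]]

    def step(self, action):
        reward = 1.0 if action == self.y[self.order[self.i]] else -1.0
        self.i += 1
        done = self.i >= len(self.order)
        obs = None if done else self.X[self.order[self.i]]
        return obs, reward, done, {}

# Placeholder data standing in for an intrusion dataset (flows x features, binary label).
X = np.random.default_rng(0).normal(size=(1000, 20)).astype(np.float32)
y = (X[:, 0] > 0).astype(int)
env = LabeledDatasetEnv(X, y)
obs, total = env.reset(), 0.0
while obs is not None:
    action = int(obs[0] > 0)            # stand-in for the DRL agent's prediction
    obs, r, done, _ = env.step(action)
    total += r
print("episode return:", total)         # a live IDS environment would instead react to defenses
```

As the surrounding discussion notes, such a wrapper never reacts to the defender's decisions, which is exactly the limitation that a live, responsive intrusion environment would remove.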
C. Exploring Capabilities of Model-Based DRL Methods

Most DRL algorithms used for cyber defense so far are model-free methods, which are sample inefficient as they require a large quantity of training data. These data are difficult to obtain in real cyber security practice. Researchers generally utilize simulators to validate their proposed approaches, but these simulators often do not fully characterize the complexity and dynamics of the real cyber space of IoT systems. Model-based DRL methods are more appropriate than model-free methods when training data are limited because, with model-based DRL, it can be easy to collect data in a scalable way. Exploration of model-based DRL methods or the integration of model-based and model-free methods for cyber defense is, thus, an interesting future study. For example, function approximators can be used to learn a proxy model of the actual high-dimensional and possibly partially observable environment [177]–[179], which can then be employed to deploy planning algorithms, e.g., Monte-Carlo tree search techniques [180], to derive optimal actions. Alternatively, model-based and model-free combination approaches, such as a model-free policy with planning capabilities [181], [182] or model-based lookahead search [31], can be used as they aggregate the advantages of both methods. On the other hand, the current literature on applications of DRL to cyber security is often limited to discretizing the action space, which restricts the full capability of the DRL solutions in real-world problems. An example is the application of DRL for selecting optimal mobile offloading rates in [161] and [162], where the action space has been discretized although a small change of the rate would primarily affect the performance of the cloud-based malware detection system. Investigation of methods that can deal with continuous action spaces in cyber environments, e.g., policy gradient and actor-critic algorithms, is another encouraging research direction.
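As a small illustration of the model-based idea sketched above, the code below fits a linear proxy model of unknown dynamics from logged transitions and then plans with random shooting, a much simpler stand-in for the tree-search planners cited in [180]. Every quantity in it is synthetic and chosen only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true, B_true = np.array([[0.9, 0.1], [0.0, 0.95]]), np.array([[0.0], [0.1]])

def true_step(x, u):                       # unknown environment (used only to generate data)
    return A_true @ x + B_true @ u + 0.01 * rng.standard_normal(2)

# 1) Learn a proxy dynamics model x' ~ M @ [x; u] from randomly collected transitions.
X, Y = [], []
x = rng.standard_normal(2)
for _ in range(2000):
    u = rng.uniform(-1, 1, size=1)
    x_next = true_step(x, u)
    X.append(np.concatenate([x, u]))
    Y.append(x_next)
    x = x_next
M, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)   # least-squares model fit

def model_step(x, u):
    return np.concatenate([x, u]) @ M

# 2) Plan with the learned model: random shooting over action sequences, keep the best first action.
def plan(x0, horizon=10, n_candidates=200):
    best_u, best_cost = None, np.inf
    for _ in range(n_candidates):
        us = rng.uniform(-1, 1, size=(horizon, 1))
        x, cost = x0, 0.0
        for u in us:
            x = model_step(x, u)
            cost += float(x @ x)           # drive the (proxy) state toward the origin
        if cost < best_cost:
            best_u, best_cost = us[0], cost
    return best_u

print("planned first action:", plan(np.array([1.0, -1.0])))
```

Because the planner only queries the learned model, new "experience" can be generated cheaply and at scale, which is the sample-efficiency argument made in the paragraph above.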
D. Training DRL in Adversarial Cyber Environments

AI can not only help defend against cyberattacks but can also facilitate dangerous attacks, i.e., offensive AI. Hackers can take advantage of AI to make attacks smarter and more sophisticated so that they bypass detection methods and penetrate computer systems or networks. For example, hackers may employ algorithms to observe normal behaviors of users and employ the users' patterns to develop untraceable attacking strategies. ML-based systems can mimic humans to craft convincing fake messages that are utilized to conduct large-scale phishing attacks. Likewise, by creating highly realistic fake video or audio messages based on AI advances (i.e., deepfakes [183]), hackers can spread false news in elections or manipulate financial markets [184]. Alternatively, attackers can poison the data pool used for training deep learning methods (i.e., ML poisoning), or attackers can manipulate the states or policies and falsify part of the reward signals in RL to trick the agent into taking suboptimal actions, resulting in the agent being compromised [185]. These kinds of attacks are difficult to prevent, detect, and fight against as they are part of a battle between AI systems. Adversarial ML, especially with supervised methods, has been used extensively in cyber security [186], but very few studies have been found on using adversarial RL [187]. Adversarial DRL, or DRL algorithms trained in various adversarial cyber environments, is worth comprehensive investigation as it can be a solution to battle against the increasingly complex offensive AI systems [188]–[190].

E. Human–Machine Teaming With Human-on-the-Loop Models

With the support of AI systems, cyber security experts no longer examine a huge volume of attack data manually to detect and defend against cyberattacks. This has many advantages because the security teams alone cannot sustain the volume. AI-enabled defense strategies can be automated and deployed rapidly and efficiently, but these systems alone cannot issue creative responses when new threats are introduced. Moreover, human adversaries are always behind cybercrime
or cyberwarfare. Therefore, there is a critical need for human intellect teamed with machines for cyber defenses. The traditional human-in-the-loop model for human–machine integration struggles to adapt quickly to cyber defense systems because the autonomous agent carries out part of the task and needs to halt and wait for the human's responses before completing the task. The modern human-on-the-loop model would be a solution for a future human–machine teaming cyber security system. This model allows agents to autonomously perform the task whilst humans can monitor and intervene in the operations of agents only when necessary. How to integrate human knowledge into DRL algorithms [191] under the human-on-the-loop model for cyber defense is an interesting research question.

F. Exploring Capabilities of Multiagent DRL Methods

As hackers utilize more and more sophisticated and large-scale approaches to attack computer systems and networks, the defense strategies need to be more intelligent and large-scale as well. Multiagent DRL is a research direction that can be explored to tackle this problem. Game theory models for cyber security reviewed in this article have involved multiple agents, but they are restricted to a couple of attackers and defenders with limited communication, cooperation, and coordination among the agents. These aspects of multiagent DRL need to be investigated thoroughly in cyber security problems to enable an effective large-scale defense plan. Challenges of multiagent DRL itself then need to be addressed, such as nonstationarity, partial observability, and efficient multiagent training schemes [192]. On the other hand, the RL methodology has been applied to deal with various cyberattacks, e.g., jamming, spoofing, false data injection, malware, DoS, DDoS, brute force, Heartbleed, botnet, Web attack, and infiltration attack [193]–[198]. However, recently emerged or new types of attacks have been largely unaddressed. One of these new types is the bit-and-piece DDoS attack. This attack injects small amounts of junk into legitimate traffic across a large number of IP addresses so that it can bypass many detection methods, as there is so little of it per address. Another emerging attack, for instance, is attacking from the computing cloud to breach systems of companies who manage IT systems for other firms or host other firms' data on their servers. Alternatively, hackers can use quantum physics-based powerful computers to crack encryption algorithms that are currently used to protect various types of invaluable data [184]. Consequently, a future study on addressing these new types of attacks is encouraged.

REFERENCES

[1] I. Kakalou, K. E. Psannis, P. Krawiec, and R. Badea, “Cognitive radio network and network service chaining toward 5G: Challenges and requirements,” IEEE Commun. Mag., vol. 55, no. 11, pp. 145–151, Nov. 2017.
[2] Y. Huang, S. Li, C. Li, Y. T. Hou, and W. Lou, “A deep-reinforcement-learning-based approach to dynamic eMBB/URLLC multiplexing in 5G NR,” IEEE Internet Things J., vol. 7, no. 7, pp. 6439–6456, Jul. 2020.
[3] P. Wang, L. T. Yang, X. Nie, Z. Ren, J. Li, and L. Kuang, “Data-driven software defined network attack detection: State-of-the-art and perspectives,” Inf. Sci., vol. 513, pp. 65–83, Mar. 2020.
[4] A. Botta, W. Donato, V. Persico, and A. Pescapé, “Integration of cloud computing and Internet of Things: A survey,” Future Generat. Comput. Syst., vol. 56, pp. 684–700, Mar. 2016.
[5] O. Krestinskaya, A. P. James, and L. O. Chua, “Neuromemristive circuits for edge computing: A review,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 1, pp. 4–23, Jan. 2020.
[6] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, “Mobile edge computing: A survey,” IEEE Internet Things J., vol. 5, no. 1, pp. 450–465, Feb. 2018.
[7] A. V. Dastjerdi and R. Buyya, “Fog computing: Helping the Internet of Things realize its potential,” Computer, vol. 49, no. 8, pp. 112–116, Aug. 2016.
[8] B. Geluvaraj, P. M. Satwik, and T. A. Kumar, “The future of cybersecurity: Major role of artificial intelligence, machine learning, and deep learning in cyberspace,” in Proc. Int. Conf. Comput. Netw. Commun. Technol., 2019, pp. 739–747.
[9] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Commun. Surveys Tuts., vol. 18, no. 2, pp. 1153–1176, 2nd Quart., 2016.
[10] G. Apruzzese, M. Colajanni, L. Ferretti, A. Guido, and M. Marchetti, “On the effectiveness of machine and deep learning for cyber security,” in Proc. 10th Int. Conf. Cyber Conflict (CyCon), May 2018, pp. 371–390.
[11] Y. Xin et al., “Machine learning and deep learning methods for cybersecurity,” IEEE Access, vol. 6, pp. 35365–35381, 2018.
[12] N. Milosevic, A. Dehghantanha, and K.-K. R. Choo, “Machine learning aided Android malware classification,” Comput. Elect. Eng., vol. 61, pp. 266–274, Jul. 2017.
[13] R. M. H. Babu, R. Vinayakumar, and K. P. Soman, “A short review on applications of deep learning for cyber security,” 2018, arXiv:1812.06292. [Online]. Available: http://arxiv.org/abs/1812.06292
[14] D. Berman, A. Buczak, J. Chavis, and C. Corbett, “A survey of deep learning methods for cyber security,” Information, vol. 10, no. 4, p. 122, Apr. 2019.
[15] S. Paul, Z. Ni, and C. Mu, “A learning-based solution for an adversarial repeated game in cyber–physical power systems,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4512–4523, Nov. 2020, doi: 10.1109/TNNLS.2019.2955857.
[16] D. Ding, Q.-L. Han, Y. Xiang, C. Ge, and X.-M. Zhang, “A survey on security control and attack detection for industrial cyber-physical systems,” Neurocomputing, vol. 275, pp. 1674–1683, Jan. 2018.
[17] M. Wu, Z. Song, and Y. B. Moon, “Detecting cyber-physical attacks in CyberManufacturing systems with machine learning methods,” J. Intell. Manuf., vol. 30, no. 3, pp. 1111–1123, Mar. 2019.
[18] L. Xiao, X. Wan, X. Lu, Y. Zhang, and D. Wu, “IoT security techniques based on machine learning,” 2018, arXiv:1801.06275. [Online]. Available: http://arxiv.org/abs/1801.06275
[19] A. Sharma, Z. Kalbarczyk, J. Barlow, and R. Iyer, “Analysis of security data from a large computing organization,” in Proc. IEEE/IFIP 41st Int. Conf. Dependable Syst. Netw. (DSN), Jun. 2011, pp. 506–517.
[20] N. D. Nguyen, T. Nguyen, and S. Nahavandi, “System design perspective for human-level agents using deep reinforcement learning: A survey,” IEEE Access, vol. 5, pp. 27091–27102, 2017.
[21] Z. Sui, Z. Pu, J. Yi, and S. Wu, “Formation control with collision avoidance through deep reinforcement learning using model-guided demonstration,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 6, pp. 2358–2372, Jun. 2021, doi: 10.1109/TNNLS.2020.3004893.
[22] A. Tsantekidis, N. Passalis, A.-S. Toufa, K. Saitas-Zarkias, S. Chairistanidis, and A. Tefas, “Price trailing for financial trading using deep reinforcement learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 7, pp. 2837–2846, Jul. 2021, doi: 10.1109/TNNLS.2020.2997523.
[23] T. T. Nguyen, N. D. Nguyen, P. Vamplew, S. Nahavandi, R. Dazeley, and C. P. Lim, “A multi-objective deep reinforcement learning framework,” 2018, arXiv:1803.02965. [Online]. Available: http://arxiv.org/abs/1803.02965
[24] X. Wang, Y. Gu, Y. Cheng, A. Liu, and C. L. P. Chen, “Approximate policy-based accelerated deep reinforcement learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 6, pp. 1820–1830, Jun. 2020.
[25] M. H. Ling, K.-L.-A. Yau, J. Qadir, G. S. Poh, and Q. Ni, “Application of reinforcement learning for security enhancement in cognitive radio networks,” Appl. Soft Comput., vol. 37, pp. 809–829, Dec. 2015.
[26] Y. Wang, Z. Ye, P. Wan, and J. Zhao, “A survey of dynamic spectrum allocation based on reinforcement learning algorithms in cognitive radio networks,” Artif. Intell. Rev., vol. 51, no. 3, pp. 493–506, Mar. 2019.
[27] X. Lu, L. Xiao, T. Xu, Y. Zhao, Y. Tang, and W. Zhuang, “Reinforcement learning based PHY authentication for VANETs,” IEEE Trans. Veh. Technol., vol. 69, no. 3, pp. 3068–3079, Mar. 2020.
[28] M. Alauthman, N. Aslam, M. Al-Kasassbeh, S. Khan, A. Al-Qerem, and K.-K. R. Choo, “An efficient reinforcement learning-based botnet detection approach,” J. Netw. Comput. Appl., vol. 150, Jan. 2020, Art. no. 102479.
[29] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[30] N. D. Nguyen, S. Nahavandi, and T. Nguyen, “A human mixed strategy approach to deep reinforcement learning,” in Proc. IEEE Int. Conf. Syst., Man, Cybern. (SMC), Oct. 2018, pp. 4023–4028.
[31] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[32] D. Silver et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[33] O. Vinyals et al., “StarCraft II: A new challenge for reinforcement learning,” 2017, arXiv:1708.04782. [Online]. Available: http://arxiv.org/abs/1708.04782
[34] P. Sun et al., “TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game,” 2018, arXiv:1809.07193. [Online]. Available: http://arxiv.org/abs/1809.07193
[35] Z.-J. Pang, R.-Z. Liu, Z.-Y. Meng, Y. Zhang, Y. Yu, and T. Lu, “On reinforcement learning for full-length game of StarCraft,” 2018, arXiv:1809.09095. [Online]. Available: http://arxiv.org/abs/1809.09095
[36] V. Zambaldi et al., “Relational deep reinforcement learning,” 2018, arXiv:1806.01830. [Online]. Available: http://arxiv.org/abs/1806.01830
[37] M. Jaderberg et al., “Human-level performance in first-person multiplayer games with population-based deep reinforcement learning,” 2018, arXiv:1807.01281. [Online]. Available: http://arxiv.org/abs/1807.01281
[38] OpenAI. (Mar. 1, 2019). OpenAI Five. [Online]. Available: https://openai.com/five/
[39] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, pp. 3389–3396.
[40] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura, “Navigating occluded intersections with autonomous vehicles using deep reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, pp. 2034–2039.
[41] T. Nguyen, N. D. Nguyen, F. Bello, and S. Nahavandi, “A new tensioning method using deep reinforcement learning for surgical pattern cutting,” in Proc. IEEE Int. Conf. Ind. Technol. (ICIT), Feb. 2019, pp. 1339–1344, doi: 10.1109/ICIT.2019.8755235.
[42] N. D. Nguyen, T. Nguyen, S. Nahavandi, A. Bhatti, and G. Guest, “Manipulating soft tissues by deep reinforcement learning for autonomous robotic surgery,” in Proc. IEEE Int. Syst. Conf. (SysCon), Apr. 2019, pp. 1–7, doi: 10.1109/SYSCON.2019.8836924.
[43] Y. Keneshloo, T. Shi, N. Ramakrishnan, and C. K. Reddy, “Deep reinforcement learning for sequence-to-sequence models,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2469–2489, Jul. 2019.
[44] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications of deep learning and reinforcement learning to biological data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2063–2079, Jun. 2018.
[45] M. Popova, O. Isayev, and A. Tropsha, “Deep reinforcement learning for de novo drug design,” Sci. Adv., vol. 4, no. 7, Jul. 2018, Art. no. eaap7885.
[46] Y. He, F. R. Yu, N. Zhao, V. C. M. Leung, and H. Yin, “Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach,” IEEE Commun. Mag., vol. 55, no. 12, pp. 31–37, Dec. 2017.
[47] H. V. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.
[48] Z. Wang et al., “Dueling network architectures for deep reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1995–2003.
[49] H. Zhu, Y. Cao, W. Wang, T. Jiang, and S. Jin, “Deep reinforcement learning for mobile edge caching: Review, new features, and open issues,” IEEE Netw., vol. 32, no. 6, pp. 50–57, Nov. 2018.
[50] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[51] Y. Zhang, J. Yao, and H. Guan, “Intelligent cloud resource management with deep reinforcement learning,” IEEE Cloud Comput., vol. 4, no. 6, pp. 60–69, Nov./Dec. 2017.
[52] J. Zhu, Y. Song, D. Jiang, and H. Song, “A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things,” IEEE Internet Things J., vol. 5, no. 4, pp. 2375–2385, Aug. 2017.
[53] R. Shafin et al., “Self-tuning sectorization: Deep reinforcement learning meets broadcast beam optimization,” IEEE Trans. Wireless Commun., vol. 19, no. 6, pp. 4038–4053, Jun. 2020.
[54] D. Zhang, X. Han, and C. Deng, “Review on the research and practice of deep learning and reinforcement learning in smart grids,” CSEE J. Power Energy Syst., vol. 4, no. 3, pp. 362–370, Sep. 2018.
[55] X. He, K. Wang, H. Huang, T. Miyazaki, Y. Wang, and S. Guo, “Green resource allocation based on deep reinforcement learning in content-centric IoT,” IEEE Trans. Emerg. Topics Comput., vol. 8, no. 3, pp. 781–796, Jul. 2020, doi: 10.1109/TETC.2018.2805718.
[56] Y. He, C. Liang, F. R. Yu, and Z. Han, “Trust-based social networks with computing, caching and communications: A deep reinforcement learning approach,” IEEE Trans. Netw. Sci. Eng., vol. 7, no. 1, pp. 66–79, Jan. 2020, doi: 10.1109/TNSE.2018.2865183.
[57] N. C. Luong et al., “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3133–3174, 4th Quart., 2019, doi: 10.1109/COMST.2019.2916583.
[58] Y. Dai, D. Xu, S. Maharjan, Z. Chen, Q. He, and Y. Zhang, “Blockchain and deep reinforcement learning empowered intelligent 5G beyond,” IEEE Netw., vol. 33, no. 3, pp. 10–17, May/Jun. 2019.
[59] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems,” Automatica, vol. 113, Mar. 2020, Art. no. 108759.
[60] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992.
[61] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017.
[62] V. Mnih et al., “Playing Atari with deep reinforcement learning,” 2013, arXiv:1312.5602. [Online]. Available: http://arxiv.org/abs/1312.5602
[63] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” 2015, arXiv:1511.05952. [Online]. Available: http://arxiv.org/abs/1511.05952
[64] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, nos. 3–4, pp. 229–256, 1992.
[65] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 1057–1063.
[66] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[67] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017, arXiv:1707.06347. [Online]. Available: http://arxiv.org/abs/1707.06347
[68] C. Wu et al., “Variance reduction for policy gradient with action-dependent factorized baselines,” 2018, arXiv:1803.07246. [Online]. Available: http://arxiv.org/abs/1803.07246
[69] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” 2015, arXiv:1509.02971. [Online]. Available: http://arxiv.org/abs/1509.02971
[70] G. Barth-Maron et al., “Distributed distributional deterministic policy gradients,” 2018, arXiv:1804.08617. [Online]. Available: http://arxiv.org/abs/1804.08617
[71] M. Jaderberg et al., “Reinforcement learning with unsupervised auxiliary tasks,” 2016, arXiv:1611.05397. [Online]. Available: http://arxiv.org/abs/1611.05397
[72] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the gap between value and policy based reinforcement learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 2775–2785.
[73] M. Lapan, Deep Reinforcement Learning Hands-on: Apply Modern RL Methods, With Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Birmingham, U.K.: Packt Publishing, 2018.
[74] (Dec. 14, 2020). OpenAI Gym Toolkit Documentation. Classic Control: Control Theory Problems From the Classic RL Literature. [Online]. Available: https://gym.openai.com/envs/#classic_control
[75] (Dec. 14, 2020). OpenAI Gym Toolkit Documentation. Box2D: Continuous Control Tasks in the Box2D Simulator. [Online]. Available: https://gym.openai.com/envs/#box2d
[76] L. Wang, M. Törngren, and M. Onori, “Current status and advancement of cyber-physical systems in manufacturing,” J. Manuf. Syst., vol. 37, pp. 517–527, Oct. 2015.
[77] Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri, “Health-CPS: Healthcare cyber-physical system assisted by cloud and big data,” IEEE Syst. J., vol. 11, no. 1, pp. 88–95, Mar. 2017.
[78] P. M. Shakeel, S. Baskar, V. R. S. Dhulipala, S. Mishra, and M. M. Jaber, “Maintaining security and privacy in health care system using learning based deep-Q-networks,” J. Med. Syst., vol. 42, no. 10, p. 186, 2018.
[79] M. H. Cintuglu, O. A. Mohammed, K. Akkaya, and A. S. Uluagac, “A survey on smart grid cyber-physical system testbeds,” IEEE Commun. Surveys Tuts., vol. 19, no. 1, pp. 446–464, 1st Quart., 2017.
[80] Y. Chen, S. Huang, F. Liu, Z. Wang, and X. Sun, “Evaluation of reinforcement learning-based false data injection attack to automatic voltage control,” IEEE Trans. Smart Grid, vol. 10, no. 2, pp. 2158–2169, Mar. 2018.
[81] Z. Ni and S. Paul, “A multistage game in smart grid security: A reinforcement learning solution,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2684–2695, Sep. 2019, doi: 10.1109/TNNLS.2018.2885530.
[82] A. Ferdowsi, A. Eldosouky, and W. Saad, “Interdependence-aware game-theoretic framework for secure intelligent transportation systems,” IEEE Internet Things J., early access, Sep. 1, 2020, doi: 10.1109/JIOT.2020.3020899.
[83] Y. Li et al., “Nonlane-discipline-based car-following model for electric vehicles in transportation-cyber-physical systems,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 1, pp. 38–47, Jan. 2018.
[84] C. Li and M. Qiu, Reinforcement Learning for Cyber-Physical Systems: With Cybersecurity Case Studies. Boca Raton, FL, USA: CRC Press, 2019.
[85] M. Feng and H. Xu, “Deep reinforcement learning based optimal defense for cyber-physical system in presence of unknown cyber-attack,” in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Nov. 2017, pp. 1–8.
[86] T. Akazaki, S. Liu, Y. Yamagata, Y. Duan, and J. Hao, “Falsification of cyber-physical systems using deep reinforcement learning,” in Proc. Int. Symp. Formal Methods, 2018, pp. 456–465.
[87] H. Abbas and G. Fainekos, “Convergence proofs for simulated annealing falsification of safety properties,” in Proc. 50th Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Oct. 2012, pp. 1594–1601.
[88] S. Sankaranarayanan and G. Fainekos, “Falsification of temporal properties of hybrid systems using the cross-entropy method,” in Proc. 15th ACM Int. Conf. Hybrid Syst., Comput. Control (HSCC), 2012, pp. 125–134.
[89] A. Ferdowsi, U. Challita, W. Saad, and N. B. Mandayam, “Robust deep reinforcement learning for security and safety in autonomous vehicle systems,” in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018, pp. 307–312.
[90] X. Wang, R. Jiang, L. Li, Y.-L. Lin, and F.-Y. Wang, “Long memory is important: A test study on deep-learning based car-following model,” Phys. A, Stat. Mech. Appl., vol. 514, pp. 786–795, Jan. 2019.
[91] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[92] I. Rasheed, F. Hu, and L. Zhang, “Deep reinforcement learning approach for autonomous vehicle systems for maintaining security and safety using LSTM-GAN,” Veh. Commun., vol. 26, Dec. 2020, Art. no. 100266.
[93] A. Gupta and Z. Yang, “Adversarial reinforcement learning for observer design in autonomous systems under cyber attacks,” 2018, arXiv:1809.06784. [Online]. Available: http://arxiv.org/abs/1809.06784
[94] A. Abubakar and B. Pranggono, “Machine learning based intrusion detection system for software defined networks,” in Proc. 7th Int. Conf. Emerg. Secur. Technol. (EST), Sep. 2017, pp. 138–143.
[95] S. Jose, D. Malathi, B. Reddy, and D. Jayaseeli, “A survey on anomaly based host intrusion detection system,” J. Phys., Conf. Ser., vol. 1000, Apr. 2018, Art. no. 012049.
[96] S. Roshan, Y. Miche, A. Akusok, and A. Lendasse, “Adaptive and online network intrusion detection system using clustering and extreme learning machines,” J. Franklin Inst., vol. 355, no. 4, pp. 1752–1779, Mar. 2018.
[97] S. Dey, Q. Ye, and S. Sampalli, “A machine learning based intrusion detection scheme for data fusion in mobile clouds involving heterogeneous client networks,” Inf. Fusion, vol. 49, pp. 205–215, Sep. 2019.
[98] D. Papamartzivanos, F. G. Mármol, and G. Kambourakis, “Introducing deep learning self-adaptive misuse network intrusion detection systems,” IEEE Access, vol. 7, pp. 13546–13560, 2019.
[99] W. Haider, G. Creech, Y. Xie, and J. Hu, “Windows based data sets for evaluation of robustness of host based intrusion detection systems (IDS) to zero-day and stealth attacks,” Future Internet, vol. 8, no. 4, p. 29, Jul. 2016.
[100] P. Deshpande, S. C. Sharma, S. K. Peddoju, and S. Junaid, “HIDS: A host based intrusion detection system for cloud computing environment,” Int. J. Syst. Assurance Eng. Manage., vol. 9, no. 3, pp. 567–576, Jun. 2018.
[101] M. Nobakht, V. Sivaraman, and R. Boreli, “A host-based intrusion detection and mitigation framework for smart home IoT using OpenFlow,” in Proc. 11th Int. Conf. Availability, Rel. Secur. (ARES), Aug. 2016, pp. 147–156.
[102] P. A. A. Resende and A. C. Drummond, “A survey of random forest based methods for intrusion detection systems,” ACM Comput. Surv., vol. 51, no. 3, p. 48, 2018.
[103] G. Kim, H. Yi, J. Lee, Y. Paek, and S. Yoon, “LSTM-based system-call language modeling and robust ensemble method for designing host-based intrusion detection systems,” 2016, arXiv:1611.01726. [Online]. Available: http://arxiv.org/abs/1611.01726
[104] A. Chawla, B. Lee, S. Fallon, and P. Jacob, “Host based intrusion detection system with combined CNN/RNN model,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases, 2018, pp. 149–158.
[105] M. M. Hassan, A. Gumaei, A. Alsanad, M. Alrubaian, and G. Fortino, “A hybrid deep learning model for efficient intrusion detection in big data environment,” Inf. Sci., vol. 513, pp. 386–396, Mar. 2020.
[106] A. Janagam and S. Hossen, “Analysis of network intrusion detection system with machine learning algorithms (deep reinforcement learning algorithm),” M.S. thesis, Dept. Comput., Blekinge Inst. Technol., Stockholm, Sweden, 2018.
[107] X. Xu and T. Xie, “A reinforcement learning approach for host-based intrusion detection using sequences of system calls,” in Proc. Int. Conf. Intell. Comput., 2005, pp. 995–1003.
[108] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Mach. Learn., vol. 3, no. 1, pp. 9–44, 1988.
[109] X. Xu, “A sparse kernel-based least-squares temporal difference algorithm for reinforcement learning,” in Proc. Int. Conf. Natural Comput., 2006, pp. 47–56.
[110] X. Xu and Y. Luo, “A kernel-based reinforcement learning approach to dynamic behavior modeling of intrusion detection,” in Proc. Int. Symp. Neural Netw., 2007, pp. 455–464.
[111] X. Xu, “Sequential anomaly detection based on temporal-difference learning: Principles, models and case studies,” Appl. Soft Comput., vol. 10, no. 3, pp. 859–867, Jun. 2010.
[112] B. Deokar and A. Hazarnis, “Intrusion detection system using log files and reinforcement learning,” Int. J. Comput. Appl., vol. 45, no. 19, pp. 28–35, 2012.
[113] K. Malialis, “Distributed reinforcement learning for network intrusion response,” Doctoral dissertation, Dept. Comput. Sci., Univ. York, York, U.K., 2014.
[114] K. Malialis and D. Kudenko, “Distributed response to network intrusions using multiagent reinforcement learning,” Eng. Appl. Artif. Intell., vol. 41, pp. 270–284, May 2015.
[115] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1998.
[116] D. K. Y. Yau, J. C. S. Lui, F. Liang, and Y. Yam, “Defending against distributed denial-of-service attacks with max-min fair server-centric router throttles,” IEEE/ACM Trans. Netw., vol. 13, no. 1, pp. 29–42, Feb. 2005.
[117] R. Bhosale, S. Mahajan, and P. Kulkarni, “Cooperative machine learning for intrusion detection system,” Int. J. Sci. Eng. Res., vol. 5, no. 1, pp. 1780–1785, 2014.
[118] A. Herrero and E. Corchado, “Multiagent systems for network intrusion detection: A review,” in Computational Intelligence in Security for Information Systems. Berlin, Germany: Springer, 2009, pp. 143–154.
[119] A. Detwarasiti and R. D. Shachter, “Influence diagrams for team decision analysis,” Decis. Anal., vol. 2, no. 4, pp. 207–228, Dec. 2005.
[120] S. Shamshirband, A. Patel, N. B. Anuar, M. L. M. Kiah, and A. Abraham, “Cooperative game theoretic approach using fuzzy Q-learning for detecting and preventing intrusions in wireless sensor networks,” Eng. Appl. Artif. Intell., vol. 32, pp. 228–241, Jun. 2014.
[121] P. Muñoz, R. Barco, and I. de la Bandera, “Optimization of load balancing using fuzzy Q-learning for next generation wireless networks,” Expert Syst. Appl., vol. 40, no. 4, pp. 984–994, Mar. 2013.
[122] S. Shamshirband, N. B. Anuar, M. L. M. Kiah, and A. Patel, “An appraisal and design of a multi-agent system based cooperative wireless intrusion detection computational intelligence technique,” Eng. Appl. Artif. Intell., vol. 26, no. 9, pp. 2105–2127, Oct. 2013.
[123] S. Varshney and R. Kuma, “Variants of LEACH routing protocol in WSN: A comparative analysis,” in Proc. 8th Int. Conf. Cloud Comput., Data Sci. Eng. (Confluence), Jan. 2018, pp. 199–204.
Thanh Thi Nguyen received the Ph.D. degree in mathematics and statistics from Monash University, Clayton, VIC, Australia, in 2013.

He was a Visiting Scholar with the Department of Computer Science, Stanford University, Stanford, CA, USA, in 2015, and the Edge Computing Laboratory, John A. Paulson School of Engineering and Applied Sciences, Harvard University, Massachusetts, USA, in 2019. He is currently a Senior Lecturer with the School of Information Technology, Deakin University, Burwood, VIC, Australia. He received an Alfred Deakin Post-Doctoral Research Fellowship in 2016, a European–Pacific Partnership for the ICT Expert Exchange Program Award from the European Commission in 2018, and an Australia–India Strategic Research Fund Early- and Mid-Career Fellowship from the Australian Academy of Science in 2020. He has expertise in various areas, including artificial intelligence, deep learning, deep reinforcement learning, cyber security, IoT, and data science.
Vijay Janapa Reddi (Member, IEEE) received the Ph.D. degree in computer science from Harvard University, Cambridge, MA, USA, in 2010.

He is currently an Associate Professor with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, where he is the Director of the Edge Computing Laboratory. His research interests include computer architecture and system-software design, with an emphasis on the context of mobile and edge computing platforms based on machine learning.

Dr. Reddi was a recipient of multiple awards, including the Best Paper Award at the 2005 International Symposium on Microarchitecture and the 2009 International Symposium on High-Performance Computer Architecture, the IEEE’s Top Picks in Computer Architecture Awards in 2006, 2010, 2011, 2016, and 2017, the Google Faculty Research Awards in 2012, 2013, 2015, and 2017, the Intel Early Career Award in 2013, the National Academy of Engineering Gilbreth Lecturer Honor in 2016, and the IEEE TCCA Young Computer Architect Award in 2016.