Deep Reinforcement Learning-Based Intelligent Reflecting Surface for Secure Wireless Communications
Abstract— In this paper, we study an intelligent reflecting surface (IRS)-aided wireless secure communication system, where an IRS is deployed to adjust its reflecting elements to secure the communication of multiple legitimate users in the presence of multiple eavesdroppers. Aiming to improve the system secrecy rate, a design problem for jointly optimizing the base station (BS)'s beamforming and the IRS's reflecting beamforming is formulated considering different quality of service (QoS) requirements and time-varying channel conditions. As the system is highly dynamic and complex and the resulting optimization problem is non-convex, a novel deep reinforcement learning (DRL)-based secure beamforming approach is first proposed to achieve the optimal beamforming policy against eavesdroppers in dynamic environments. Furthermore, post-decision state (PDS) and prioritized experience replay (PER) schemes are utilized to enhance the learning efficiency and secrecy performance. Specifically, a modified PDS scheme is presented to trace the channel dynamics and adjust the beamforming policy against channel uncertainty accordingly. Simulation results demonstrate that the proposed deep PDS-PER learning based secure beamforming approach can significantly improve the system secrecy rate and QoS satisfaction probability in IRS-aided secure communication systems.

Index Terms— Secure communication, intelligent reflecting surface, beamforming, secrecy rate, deep reinforcement learning.

I. INTRODUCTION

PHYSICAL layer security (PLS) has attracted increasing attention as an alternative to cryptography-based techniques for wireless communications [1], where PLS exploits the wireless channel characteristics through signal processing designs and channel coding to support secure communication services without relying on a shared secret key [1], [2]. So far, a variety of approaches have been reported to improve PLS in wireless communication systems, e.g., cooperative relaying strategies [3], [4], artificial noise-assisted beamforming [5], [6], and cooperative jamming [7], [8]. However, employing a large number of active antennas and relays in PLS systems incurs excessive hardware cost and system complexity. Moreover, cooperative jamming and transmitting artificial noise require extra transmit power for security guarantees.

To tackle these shortcomings of the existing approaches [3]–[8], a new paradigm, called intelligent reflecting surface (IRS) [9]–[13], has been proposed as a promising technique to achieve high spectrum efficiency and energy efficiency, and to enhance the secrecy rate in fifth generation (5G) and beyond wireless communication systems. In particular, an IRS is a uniform planar array comprised of a number of low-cost passive reflecting elements, each of which adaptively adjusts its reflection amplitude and/or phase to control the strength and direction of the electromagnetic wave; hence, the IRS is capable of enhancing and/or weakening the reflected signals at different users [9].
As a result, the signal reflected by the IRS can increase the received signal at legitimate users while suppressing the signal at the eavesdroppers [9]–[13]. Hence, from the PLS perspective, some innovative studies have recently been devoted to performance optimization for IRS-aided secure communications [14]–[25].

A. Related Works

Initial studies on IRS-aided secure communication systems have been reported in [14]–[17], where a simple system model with only a single-antenna legitimate user and a single-antenna eavesdropper was considered. The authors in [14] and [15] applied the alternating optimization (AO) algorithm to jointly optimize the transmit beamforming vector at the base station (BS) and the phase elements at the IRS to maximize the secrecy rate, but they did not extend their models to multi-user IRS-assisted secure communication systems. To minimize the transmit power at the BS subject to the secrecy rate constraint, the authors in [18] utilized an AO solution and semidefinite programming (SDP) relaxation to jointly optimize the power allocation and the IRS reflecting beamforming. In addition, Feng et al. [19] studied a secure transmission framework with an IRS to minimize the system transmit power in the cases of rank-one and full-rank BS-IRS links, and derived a closed-form expression of the beamforming matrix. Different from these studies [14]–[19], which considered only a single eavesdropper, secure communication systems comprising multiple eavesdroppers were investigated in [20]–[22]. Chen et al. [20] presented a minimum-secrecy-rate maximization design to provide secure communication services for multiple legitimate users while keeping them secret from multiple eavesdroppers in an IRS-aided multi-user multiple-input single-output (MISO) system, but the simplification of the optimization problem may cause a performance loss. The authors in [23] and [24] studied an IRS-aided multiple-input multiple-output (MIMO) channel, where a multi-antenna BS transmits a data stream to a multi-antenna legitimate user in the presence of an eavesdropper configured with multiple antennas, and a suboptimal secrecy rate maximization approach was presented to optimize the beamforming policy. In addition to the use of AO or SDP in system performance optimization, the minorization-maximization (MM) algorithm was recently utilized to optimize the joint transmit beamforming at the BS and the phase shift coefficients at the IRS [16], [23].

Moreover, the authors in [22] and [25] employed artificial noise-aided beamforming for IRS-aided MISO secure communication systems to improve the system secrecy rate, and an AO-based solution was applied to jointly optimize the BS's beamforming, the artificial noise interference vector, and the IRS's reflecting beamforming with the goal of maximizing the secrecy rate. All these existing studies [14]–[20], [22]–[25] assumed that perfect channel state information (CSI) of legitimate users or eavesdroppers is available at the BS, which is not a practical assumption. The reason is that acquiring perfect CSI at the BS is challenging, since the corresponding CSI may be outdated when the channel is time-varying due to the transmission delay, processing delay, and high mobility of users. Hence, Yu et al. [21] investigated an optimization problem considering the impacts of outdated CSI of the eavesdropping channels in an IRS-aided secure communication system, and a robust algorithm was proposed to address the optimization problem in the presence of multiple eavesdroppers.

The above mentioned studies [14]–[25] mainly applied traditional optimization techniques, e.g., AO, SDP, or MM algorithms, to jointly optimize the BS's beamforming and the IRS's reflecting beamforming in IRS-aided secure communication systems, which are less efficient for large-scale systems. Inspired by the recent advances of artificial intelligence (AI), several works attempted to utilize AI algorithms to optimize the IRS's reflecting beamforming [26]–[29]. Deep learning (DL) was exploited in [26] and [27] to search for the optimal IRS reflection matrices that maximize the achievable system rate in an IRS-aided communication system, and the simulations demonstrated that DL significantly outperforms conventional algorithms. Moreover, the authors in [28] and [29] proposed deep reinforcement learning (DRL)-based approaches to address the non-convex optimization problem, where the phase shifts at the IRS are optimized effectively. However, the works [26]–[29] merely considered maximizing the achievable rate of a single user without considering the scenarios of multiple users, secure communication, and imperfect CSI in their models. The authors in [30] and [31] applied reinforcement learning (RL) to achieve smart beamforming at the BS against an eavesdropper in complex environments, but an IRS-aided secure communication system needs to optimize the IRS's reflect beamforming in addition to the BS's transmit beamforming. To the best of our knowledge, RL or DRL has not been explored in prior works to optimize both the BS's transmit beamforming and the IRS's reflect beamforming in dynamic IRS-aided secure communication systems under the condition of multiple eavesdroppers and imperfect CSI, which thus motivates this work.

B. Contributions

In this paper, we investigate an IRS-aided secure communication system with the objective of maximizing the system secrecy rate of multiple legitimate users in the presence of multiple eavesdroppers under realistic time-varying channels, while guaranteeing the quality of service (QoS) requirements of the legitimate users. A novel DRL-based secure beamforming approach is first proposed to jointly optimize the beamforming matrix at the BS and the reflecting beamforming matrix (reflection phases) at the IRS in dynamic environments. The major contributions of this paper are summarized as follows:
• The physical secure communication based on IRS with multiple eavesdroppers is investigated under the condition of time-varying channel coefficients. In addition, we formulate a joint BS transmit beamforming and IRS reflect beamforming optimization problem with the goal of maximizing the system secrecy rate while considering the QoS requirements of legitimate users.
• An RL-based intelligent beamforming framework is presented to achieve the optimal BS's beamforming and the IRS's reflecting beamforming, where the central
where $n_k$ denotes the additive complex Gaussian noise (AWGN) with zero mean and variance $\delta_k^2$ at the $k$-th MU. In (2), we observe that in addition to the received desired signal, each MU also suffers inter-user interference (IUI) in the system. In addition, the received signal at eavesdropper $m$ is expressed by
$$ y_m = \left( \mathbf{h}_{re,m}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{be,m}^H \right) \sum_{k \in \mathcal{K}} \mathbf{v}_k s_k + n_m \qquad (3) $$
where $n_m$ is the AWGN of eavesdropper $m$ with variance $\delta_m^2$.

In practical systems, it is not easy for the BS and the IRS to obtain perfect CSI [9], [21]. This is due to the fact that both transmission delay and processing delay exist, as well as the mobility of the users. Therefore, the CSI is outdated at the time when the BS and the IRS transmit the data stream to the MUs [21]. Once this outdated CSI is employed for beamforming, it will have a negative effect on the demodulation at the MUs, thereby leading to substantial performance loss [21]. Therefore, it is necessary to consider outdated CSI in the IRS-aided secure communication system. Let $T_{delay}$ denote the delay between the outdated CSI and the real-time CSI. In other words, when the BS receives the pilot sequences sent from the MUs at time slot $t$, it completes the channel estimation process and begins to transmit the data stream to the MUs at time slot $t + T_{delay}$. Hence, the relation between the outdated channel vector $\mathbf{h}(t)$ and the real-time channel vector $\mathbf{h}(t + T_{delay})$ can be expressed by
$$ \mathbf{h}(t + T_{delay}) = \rho \mathbf{h}(t) + \sqrt{1 - \rho^2}\, \hat{\mathbf{h}}(t + T_{delay}). \qquad (4) $$

The CSI estimation errors can be bounded with respect to the Euclidean norm by using the norm-bounded error model, i.e.,
$$ \|\Delta \mathbf{h}_{bu}\|^2 \le (\varsigma_{bu})^2, \quad \|\Delta \mathbf{h}_{ru}\|^2 \le (\varsigma_{ru})^2, \quad \|\Delta \mathbf{h}_{be}\|^2 \le (\varsigma_{be})^2, \quad \|\Delta \mathbf{h}_{re}\|^2 \le (\varsigma_{re})^2, \qquad (7) $$
where $\varsigma_{bu}$, $\varsigma_{ru}$, $\varsigma_{be}$, and $\varsigma_{re}$ refer to the radii of the deterministically bounded error regions.

Under the channel uncertainty model, the achievable rate of the $k$-th MU is given by
$$ R_k^u = \log_2\!\left( 1 + \frac{\left| (\mathbf{h}_{ru,k}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{bu,k}^H) \mathbf{v}_k \right|^2}{\sum_{i \in \mathcal{K}, i \ne k} \left| (\mathbf{h}_{ru,k}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{bu,k}^H) \mathbf{v}_i \right|^2 + \delta_k^2} \right). \qquad (8) $$

If the $m$-th eavesdropper attempts to eavesdrop on the signal of the $k$-th MU, its achievable rate can be expressed by
$$ R_{m,k}^e = \log_2\!\left( 1 + \frac{\left| (\mathbf{h}_{re,m}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{be,m}^H) \mathbf{v}_k \right|^2}{\sum_{i \in \mathcal{K}, i \ne k} \left| (\mathbf{h}_{re,m}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{be,m}^H) \mathbf{v}_i \right|^2 + \delta_m^2} \right). \qquad (9) $$

Since each eavesdropper can eavesdrop on any of the $K$ MUs' signals, according to [14]–[25], the achievable individual secrecy rate from the BS to the $k$-th MU can be expressed by
$$ R_k^{sec} = \left[ R_k^u - \max_{\forall m} R_{m,k}^e \right]^+ \qquad (10) $$
where $[z]^+ = \max(0, z)$.
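To make the channel and rate model concrete, the following minimal NumPy sketch implements the Gauss-Markov aging relation in (4) and the rate expressions (8)–(10); the array dimensions, variable names, and random channel draws are illustrative assumptions rather than the paper's simulation settings.

```python
# Minimal NumPy sketch of the outdated-CSI model (4) and the rate
# expressions (8)-(10); shapes and values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def age_channel(h_old, rho):
    """Gauss-Markov channel aging, Eq. (4): h(t+T) = rho*h(t) + sqrt(1-rho^2)*e."""
    e = (rng.standard_normal(h_old.shape) + 1j * rng.standard_normal(h_old.shape)) / np.sqrt(2)
    return rho * h_old + np.sqrt(1.0 - rho**2) * e

def user_rate(h_ru, H_br, h_bu, Psi, V, k, noise_var):
    """Achievable rate of MU k, Eq. (8): effective channel = h_ru^H Psi H_br + h_bu^H."""
    g = h_ru.conj().T @ Psi @ H_br + h_bu.conj().T   # 1 x N effective channel
    powers = np.abs(g @ V) ** 2                      # received power from each beam
    sig = powers[0, k]
    interference = powers.sum() - sig
    return np.log2(1.0 + sig / (interference + noise_var))

def secrecy_rate(rate_user_k, rates_eve_k):
    """Eq. (10): [R_k^u - max_m R_{m,k}^e]^+."""
    return max(0.0, rate_user_k - max(rates_eve_k))

# Toy dimensions: N BS antennas, L IRS elements, K users.
N, L, K = 4, 8, 2
H_br = rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))
h_ru = rng.standard_normal((L, 1)) + 1j * rng.standard_normal((L, 1))
h_bu = rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))
Psi = np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, L)))   # unit-modulus phase shifts
V = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))

print(user_rate(h_ru, H_br, h_bu, Psi, V, k=0, noise_var=1.0))
print(age_channel(h_ru, rho=0.95).shape)
```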
the coupling of the optimization variables ($\mathbf{V}$ and $\mathbf{\Psi}$) and the unit-norm constraints in (11d) are non-convex. In addition, we consider a robust beamforming design to maximize the worst-case achievable secrecy rate of the system while guaranteeing the worst-case constraints.

III. PROBLEM TRANSFORMATION BASED ON RL

The optimization problem given in (11) is difficult to address as it is a non-convex problem. In addition, in realistic IRS-aided secure communication systems, the capabilities of MUs, the channel quality, and the service applications change dynamically. Moreover, the problem in (11) is a single-time-slot optimization problem, which may converge to a suboptimal solution and yield greedy-search-like performance because it ignores the historical system states and the long-term benefit. Hence, it is generally infeasible to apply the traditional optimization techniques (AO, SDP, and MM) to achieve an effective secure beamforming policy in uncertain dynamic environments.

Model-free RL is a dynamic programming tool which can be adopted to solve the decision-making problem by learning the optimal solution in dynamic environments [32]. Hence, we model the secure beamforming optimization problem as an RL problem. In RL, the IRS-aided secure communication system is treated as the environment, and the central controller at the BS is regarded as the learning agent. The key elements of RL are defined as follows.

State space: Let $\mathcal{S}$ denote the system state space. The current system state $s \in \mathcal{S}$ includes the channel information of all users, the secrecy rate, the transmission data rate of the last time slot, and the QoS satisfaction level, which is defined as
$$ s = \left\{ \{\mathbf{h}_k\}_{k \in \mathcal{K}}, \{\mathbf{h}_m\}_{m \in \mathcal{M}}, \{R_k^{sec}\}_{k \in \mathcal{K}}, \{R_k\}_{k \in \mathcal{K}}, \{QoS_k\}_{k \in \mathcal{K}} \right\} \qquad (12) $$
where $\mathbf{h}_k$ and $\mathbf{h}_m$ are the channel coefficients of the $k$-th MU and the $m$-th eavesdropper, respectively. $QoS_k$ is the feedback QoS satisfaction level of the $k$-th MU, which consists of both the minimum secrecy rate satisfaction level in (11a) and the minimum data rate satisfaction level in (11b). The other parameters in (12) are already defined in Section II.

Action space: Let $\mathcal{A}$ denote the system action space. According to the observed system state $s$, the central controller chooses the beamforming vectors $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$ at the BS and the reflecting beamforming coefficients (phase shifts) $\{\theta_l\}_{l \in \mathcal{L}}$ at the IRS. Hence, the action $a \in \mathcal{A}$ can be defined by
$$ a = \left\{ \{\mathbf{v}_k\}_{k \in \mathcal{K}}, \{\theta_l\}_{l \in \mathcal{L}} \right\}. \qquad (13) $$

Transition probability: Let $\mathcal{T}(s'|s, a)$ represent the transition probability, which is the probability of transitioning to a new state $s' \in \mathcal{S}$ given the action $a$ executed in the state $s$.

Reward function: In RL, the reward acts as a signal to evaluate how good the secure beamforming policy is when the agent executes an action at the current state. The system performance will be enhanced when the reward function at each learning step correlates with the desired objective. Thus, it is important to design an efficient reward function to improve the MUs' QoS satisfaction levels.

In this paper, the reward function represents the optimization objective, which is to maximize the system secrecy rate of all MUs while guaranteeing their QoS requirements. Thus, the presented QoS-aware reward function is expressed as
$$ r = \underbrace{\sum_{k \in \mathcal{K}} R_k^{sec}}_{\text{part 1}} - \underbrace{\mu_1 \sum_{k \in \mathcal{K}} p_k^{sec}}_{\text{part 2}} - \underbrace{\mu_2 \sum_{k \in \mathcal{K}} p_k^{u}}_{\text{part 3}} \qquad (14) $$
where
$$ p_k^{sec} = \begin{cases} 1, & \text{if } R_k^{sec} < R_k^{sec,min}, \ \forall k \in \mathcal{K}, \\ 0, & \text{otherwise}, \end{cases} \qquad (15) $$
$$ p_k^{u} = \begin{cases} 1, & \text{if } R_k < R_k^{min}, \ \forall k \in \mathcal{K}, \\ 0, & \text{otherwise}. \end{cases} \qquad (16) $$

In (14), part 1 represents the immediate utility (system secrecy rate), while part 2 and part 3 are cost functions defined as the unsatisfied secrecy rate requirement and the unsatisfied minimum rate requirement, respectively. The coefficients $\mu_1$ and $\mu_2$ are positive constants of part 2 and part 3 in (14), respectively, and they are used to balance the utility and the cost [33]–[35].

The goals of (15) and (16) are to impose the QoS satisfaction levels of both the secrecy rate and the minimum data rate requirements, respectively. If the QoS requirement is satisfied in the current time slot, then $p_k^{sec} = 0$ or $p_k^{u} = 0$, indicating that there is no punishment in the reward function owing to the successful QoS guarantee.

The goal of the learning agent is to search for an optimal policy $\pi^*$ ($\pi$ is a mapping from states in $\mathcal{S}$ to the probabilities of choosing an action in $\mathcal{A}$: $\pi(s): \mathcal{S} \rightarrow \mathcal{A}$) that maximizes the long-term expected discounted reward, where the cumulative discounted reward function can be defined as
$$ U_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau+1} \qquad (17) $$
where $\gamma \in (0, 1]$ denotes the discount factor. Under a certain policy $\pi$, the state-action function of the agent with a state-action pair $(s, a)$ is given by
$$ Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[ U_t \,|\, s_t = s, a_t = a \right]. \qquad (18) $$

The conventional Q-learning algorithm can be adopted to learn the optimal policy. The key objective of Q-learning is to update the Q-table by using the Bellman equation as follows:
$$ Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[ r_t + \gamma \sum_{s_{t+1} \in \mathcal{S}} \mathcal{T}(s_{t+1}|s_t, a_t) \sum_{a_{t+1} \in \mathcal{A}} \pi(s_{t+1}, a_{t+1}) Q^{\pi}(s_{t+1}, a_{t+1}) \right]. \qquad (19) $$

The optimal action-value function in (18) satisfies the Bellman optimality equation, which is expressed by
$$ Q^{*}(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}). \qquad (20) $$
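As a rough illustration of how the QoS-aware reward (14)–(16) and a tabular update toward the Bellman optimality target (20) could be computed, consider the following sketch; the penalty coefficients, learning rate, discount factor, and discretized state/action sets are placeholder assumptions, not the paper's settings.

```python
# Hedged sketch of the QoS-aware reward (14)-(16) and a tabular Q-learning
# step toward the Bellman optimality target (20).
import numpy as np

def qos_reward(secrecy_rates, user_rates, sec_min, rate_min, mu1=1.0, mu2=1.0):
    """r = sum_k R_k^sec - mu1 * sum_k p_k^sec - mu2 * sum_k p_k^u, Eq. (14)."""
    p_sec = [1.0 if r < rmin else 0.0 for r, rmin in zip(secrecy_rates, sec_min)]  # Eq. (15)
    p_u = [1.0 if r < rmin else 0.0 for r, rmin in zip(user_rates, rate_min)]      # Eq. (16)
    return sum(secrecy_rates) - mu1 * sum(p_sec) - mu2 * sum(p_u)

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular update: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage with 4 discretized states and 3 actions.
Q = np.zeros((4, 3))
r = qos_reward([1.2, 0.4], [3.0, 1.1], sec_min=[0.5, 0.5], rate_min=[1.0, 1.0])
Q = q_learning_step(Q, s=0, a=2, r=r, s_next=1)
print(r, Q[0, 2])
```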
At time slot $t$, the PDS action-value function $\tilde{Q}(\tilde{s}_t, a_t)$ of the current PDS state-action pair $(\tilde{s}_t, a_t)$ is defined as
$$ \tilde{Q}(\tilde{s}_t, a_t) = r^u(\tilde{s}_t, a_t) + \gamma \sum_{s_{t+1}} \mathcal{T}^u(s_{t+1}|\tilde{s}_t, a_t) V(s_{t+1}). \qquad (26) $$

By employing the extra information (the known transition probability $\mathcal{T}^k(\tilde{s}_t|s_t, a_t)$ and the known reward $r^k(s_t, a_t)$), the Q-function $\hat{Q}(s_t, a_t)$ in PDS-learning can be further expanded over all state-action pairs $(s, a)$, which is expressed by
$$ \hat{Q}(s_t, a_t) = r^k(s_t, a_t) + \sum_{\tilde{s}_t} \mathcal{T}^k(\tilde{s}_t|s_t, a_t) \tilde{Q}(\tilde{s}_t, a_t). \qquad (27) $$

The state-value function in PDS-learning is defined by
$$ \hat{V}_t(s_t) = \sum_{s_{t+1}} \mathcal{T}^k(s_{t+1}|s_t, a_t) \tilde{V}(s_{t+1}) \qquad (28) $$
where $\tilde{V}_t(s_{t+1}) = \max_{a_t \in \mathcal{A}} \tilde{Q}_t(\tilde{s}_{t+1}, a_t)$. At each time slot, the PDS action-value function $\tilde{Q}(\tilde{s}_t, a_t)$ is updated by
$$ \tilde{Q}_{t+1}(\tilde{s}_t, a_t) = (1 - \alpha_t) \tilde{Q}_t(\tilde{s}_t, a_t) + \alpha_t \left[ r^u(\tilde{s}_t, a_t) + \gamma \hat{V}_t(s_{t+1}) \right]. \qquad (29) $$
After updating $\tilde{Q}_{t+1}(\tilde{s}_t, a_t)$, the action-value function $\hat{Q}_{t+1}(s_t, a_t)$ can be updated by plugging $\tilde{Q}_{t+1}(\tilde{s}_t, a_t)$ into (27).

Based on the above modified PDS-learning, a deep PDS learning algorithm is presented. In the presented learning algorithm, the traditional DQN is adopted to estimate the action-value function $Q(s, a)$ by $Q(s, a; \theta)$, where $\theta$ denotes the DNN parameters. The objective of DQN is to minimize the following loss function at each time slot:
$$ L(\theta_t) = \left\{ \hat{V}_t(s_t; \theta_t) - \hat{Q}(s_t, a_t; \theta_t) \right\}^2 = \left\{ r(s_t, a_t) + \gamma \max_{a_{t+1} \in \mathcal{A}} \hat{Q}_t(s_{t+1}, a_{t+1}; \theta_t) - \hat{Q}(s_t, a_t; \theta_t) \right\}^2 \qquad (30) $$
where $\hat{V}_t(s_t; \theta_t) = r(s_t, a_t) + \gamma \max_{a_{t+1} \in \mathcal{A}} \hat{Q}_t(s_{t+1}, a_{t+1}; \theta_t)$ is the target value. The error between $\hat{V}_t(s_t; \theta_t)$ and the estimated value $\hat{Q}(s_t, a_t; \theta_t)$ is usually called the temporal-difference (TD) error, which is expressed by
$$ \delta_t = \hat{V}_t(s_t; \theta_t) - \hat{Q}(s_t, a_t; \theta_t). \qquad (31) $$

The DNN parameter $\theta$ is obtained by taking the partial derivative of the objective function (30) with respect to $\theta$, which gives
$$ \theta_{t+1} = \theta_t + \beta \nabla L(\theta_t) \qquad (32) $$
where $\beta$ is the learning rate of $\theta$, and $\nabla(\cdot)$ denotes the first-order partial derivative.

Accordingly, the policy $\hat{\pi}_t(s)$ of the modified deep PDS-learning algorithm is given by
$$ \hat{\pi}_t(s) = \arg\max_{a_t \in \mathcal{A}} \hat{Q}(s_t, a_t; \theta_t). \qquad (33) $$

Although DQN is capable of performing well in policy learning with continuous and high-dimensional state spaces, the DNN may learn ineffectively and diverge owing to nonstationary targets and correlations between samples. Experience replay is utilized to avoid the divergence of the RL algorithm. However, classical DQN uniformly samples each transition $e_t = \langle s_t, a_t, r_t, \tilde{s}_t, s_{t+1} \rangle$ from the experience replay buffer, which may have an uncertain or negative effect on learning a better policy. The reason is that different transitions (experience information) in the replay buffer have different importance for the learning policy, and sampling every transition equally may result in inefficient usage of meaningful transitions. Therefore, a prioritized experience replay (PER) scheme has been presented to address this issue and enhance the sampling efficiency [36], [37], where the priority of a transition is determined by the value of its TD error. In PER, a transition with a higher absolute TD error has higher priority in the sense that it provides a more aggressive correction for the action-value function.

In the deep PDS-PER learning algorithm, similar to classical DQN, the agent collects and stores each experience $e_t = \langle s_t, a_t, r_t, \tilde{s}_t, s_{t+1} \rangle$ in its experience replay buffer, and the DNN updates its parameters by sampling a mini-batch of tuples from the replay buffer. So far, PER has been adopted only for DRL and Q-learning, and has never been employed with the PDS-learning algorithm to learn the dynamic information. In this paper, we further extend this PER scheme to enable prioritized experience replay in the proposed deep PDS-PER learning framework, in order to improve the learning convergence rate.

The probability of sampling transition $i$ (experience $i$) based on the absolute TD error is defined by
$$ p(i) = \frac{|\delta(i)|^{\eta_1}}{\sum_j |\delta(j)|^{\eta_1}} \qquad (34) $$
where the exponent $\eta_1$ weights how much prioritization is used, with $\eta_1 = 0$ corresponding to uniform sampling. A transition with higher $p(i)$ is more likely to be replayed from the replay buffer, which is associated with very successful attempts while preventing the DNN from overfitting. With the help of PER, the proposed deep PDS-PER learning algorithm tends to replay valuable experience and hence learns more effectively to find the best policy.

It is worth noting that experiences with high absolute TD error are replayed more frequently, which alters the visitation frequency of some experiences and hence makes the training process of the DNN prone to divergence. To address this problem, importance-sampling (IS) weights are adopted in the calculation of the weight changes:
$$ W(i) = (D \cdot p(i))^{-\eta_2} \qquad (35) $$
where $D$ is the size of the experience replay buffer, and the parameter $\eta_2$ is used to adjust the amount of correction used.

Accordingly, by incorporating the PER scheme into the deep PDS-PER learning, the DNN loss function (30) and the corresponding parameter update are rewritten, respectively, as follows:
$$ L(\theta_t) = \frac{1}{H} \sum_{i=1}^{H} W_i L_i(\theta_t) \qquad (36) $$
$$ \theta_{t+1} = \theta_t + \beta \delta_t \nabla_{\theta} L(\theta_t). \qquad (37) $$
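The following sketch illustrates, under assumed values of $\eta_1$ and $\eta_2$, how the prioritized sampling probability (34), the importance-sampling weights (35), and the PDS value update (29) could be implemented; it is an illustration, not the authors' implementation.

```python
# Illustrative sketch of PER sampling (34), IS weights (35), and the
# PDS action-value update (29); eta1, eta2, and the toy buffer are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def per_probabilities(td_errors, eta1=0.6):
    """Eq. (34): p(i) = |delta(i)|^eta1 / sum_j |delta(j)|^eta1."""
    pri = np.abs(td_errors) ** eta1
    return pri / pri.sum()

def is_weights(probs, eta2=0.4):
    """Eq. (35): W(i) = (D * p(i))^(-eta2), normalized by the max for stability."""
    D = len(probs)
    w = (D * probs) ** (-eta2)
    return w / w.max()

def pds_update(q_pds, r_u, v_next, alpha=0.1, gamma=0.95):
    """Eq. (29): Q~(s~,a) <- (1-alpha)*Q~(s~,a) + alpha*(r^u + gamma*V^(s'))."""
    return (1.0 - alpha) * q_pds + alpha * (r_u + gamma * v_next)

# Sample a mini-batch of 4 transitions out of a buffer of 10 by priority.
td = rng.standard_normal(10)
p = per_probabilities(td)
batch_idx = rng.choice(len(td), size=4, replace=False, p=p)
print(batch_idx, is_weights(p)[batch_idx], pds_update(0.5, r_u=1.0, v_next=0.8))
```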
Theorem 1: The presented deep PDS-PER learning converges to the optimal $\hat{Q}(s_t, a_t)$ of the MDP with probability 1 when the learning rate sequence $\alpha_t$ satisfies $\alpha_t \in [0, 1)$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. These requirements appear in most RL algorithms [32] and are not specific to the proposed deep PDS-PER learning algorithm.

Proof: If each action can be executed with an infinite number of learning steps at each system state, or in other words, the learning policy is greedy with infinite exploration, the Q-function $\hat{Q}(s_t, a_t)$ in PDS-learning and its corresponding policy $\pi(s)$ will converge to the optimal points, respectively, with probability 1 [33]–[35]. The existing references [34] and [35] have provided the proof.

B. Secure Beamforming Based on the Proposed Deep PDS-PER Learning

Similar to most DRL algorithms, our proposed deep PDS-PER learning based secure beamforming approach consists of two stages, i.e., the training stage and the implementation stage. The training process of the proposed approach is shown in Algorithm 1. A central controller at the BS is responsible for collecting environment information and making decisions for secure beamforming.

Algorithm 1 Deep PDS-PER Learning Based Secure Beamforming
1: Input: IRS-aided secure communication simulator and QoS requirements of all MUs (e.g., minimum secrecy rate and transmission rate).
2: Initialize: DQN with initial Q-function $Q(s, a; \theta)$, parameters $\theta$, learning rates $\alpha$ and $\beta$.
3: Initialize: experience replay buffer $\mathcal{D}$ with size $D$, and mini-batch size $H$.
4: for each episode = 1, 2, …, $N^{epi}$ do
5:   Observe an initial system state $s$;
6:   for each time step $t$ = 0, 1, 2, …, $T$ do
7:     Select an action based on the ε-greedy policy at the current state $s_t$: choose a random action $a_t$ with probability ε;
8:     Otherwise, $a_t = \arg\max_{a_t \in \mathcal{A}} Q(s_t, a_t; \theta_t)$;
9:     Execute action $a_t$, receive an immediate reward $r^k(s_t, a_t)$ and observe the state transition from $s_t$ to the PDS state $\tilde{s}_t$ and then to the next state $s_{t+1}$;
10:    Update the reward function $r(s_t, a_t)$ under PDS-learning using (25);
11:    Update the PDS action-value function $\tilde{Q}(\tilde{s}_t, a_t; \theta_t)$ using (29);
12:    Update the Q-function $\hat{Q}(s_t, a_t; \theta_t)$ using (27);
13:    Store the PDS experience $e_t = \langle s_t, a_t, r_t, \tilde{s}_t, s_{t+1} \rangle$ in the experience replay buffer $\mathcal{D}$; if $\mathcal{D}$ is full, remove the least used experience from $\mathcal{D}$;
14:    for $i$ = 1, 2, …, $H$ do
15:      Sample transition $i$ with probability $p(i)$ using (34);
16:      Calculate the absolute TD error $|\delta(i)|$ in (31);
17:      Update the corresponding IS weight $W_i$ using (35);
18:      Update the priority of transition $i$ based on $|\delta(i)|$;
19:    end for
20:    Update the loss function $L(\theta)$ and parameter $\theta$ of the DQN using (36) and (37), respectively;
21:  end for
22: end for
23: Output: Return the deep PDS-PER learning model.

In the training stage, similar to RL-based policy control, the controller initializes the network parameters and observes the current system state, including the CSI of all users, the previously predicted secrecy rate, and the transmission data rate. Then, the state vector is input into the DQN to train the learning model. The ε-greedy scheme is leveraged to balance exploration and exploitation: the action with the maximum reward is selected with probability 1−ε according to the current information (exploitation of known knowledge), while a random action is chosen with probability ε (exploration of unknown knowledge, i.e., keep trying new actions in the hope of obtaining an even higher reward). After executing the selected action, the agent receives a reward from the environment and observes the state transition from $s_t$ to the PDS state $\tilde{s}_t$ and then to the next state $s_{t+1}$. Then, PDS-learning is used to update the PDS action-value function $\tilde{Q}(\tilde{s}_t, a_t; \theta_t)$ and the Q-function $\hat{Q}(s_t, a_t; \theta_t)$, before collecting and storing the transition tuple (also called experience) $e_t = \langle s_t, a_t, r_t, \tilde{s}_t, s_{t+1} \rangle$ in the experience replay buffer $\mathcal{D}$, which includes the current system state, the selected action, the instantaneous reward, and the PDS state along with the next state. The experience in the replay buffer is selected by the PER scheme to generate mini-batches, which are used to train the DQN. In detail, the priority $p(i)$ of each transition is calculated by using (34), and its IS weight $W(i)$ is then obtained from (35), where the priorities ensure that transitions with high TD error $\delta(i)$ are replayed more frequently. The weight $W(i)$ is integrated into the deep PDS learning to update both the loss function $L(\theta)$ and the DNN parameter $\theta$. Once the DQN converges, the deep PDS-PER learning model is obtained.

After adequate training in Algorithm 1, the learning model is loaded for the implementation stage. During the implementation stage, the controller uses the trained learning model to output its selected action $a$ by feeding the observed state $s$ from the IRS-aided secure communication system through the DNN with parameters $\theta$. Specifically, it chooses the action $a$ with the maximum value based on the trained deep PDS-PER learning model. Afterwards, the environment feeds back an instantaneous reward and a new system state to the agent. Finally, the beamforming matrix $\mathbf{V}^*$ at the BS and the phase shift matrix $\mathbf{\Psi}^*$ (reflecting beamforming) at the IRS are obtained according to the selected action.

We would like to point out that the training stage needs a powerful computation server and can be performed offline at the BS, while the implementation stage can be completed online. The trained learning model needs to be updated only when the environment (the IRS-aided secure communication system) has experienced significant changes, mainly depending on the environment dynamics and service requirements.
indicating that the complexity of the proposed algorithm is slightly higher than that of the classical DQN learning algorithm. However, our proposed algorithm achieves better performance than the classical DQN algorithm, as will be shown in the next section.

D. Implementation Details of DRL

This subsection provides details regarding the generation of the training, validation, and testing datasets.

Generation of training: As shown in Fig. 3, $K$ single-antenna MUs and $M$ single-antenna eavesdroppers are randomly located in the 100 m × 100 m right-hand half of Fig. 3 (light blue area) in a two-dimensional x-y rectangular grid plane. The BS and the IRS are located at (0, 0) and (150, 100) in meters (m), respectively. The x-y grid

reflecting beamforming matrix. The DQN construction is used for training stability. The network parameters will be provided in the next section.

Training loss function: The objective of the DRL model is to find the best beamforming matrices, i.e., $\mathbf{V}$ and $\mathbf{\Psi}$, from the beamforming codebook with the highest achievable reward from the environment. In this case, to obtain the highest achievable reward estimation, a regression loss function is adopted to train the learning model, where the DNN is trained to make its output, $\hat{r}$, as close as possible to the desired normalized reward, $\bar{r}$. Formally, the training is driven by minimizing the loss function, $L(\theta)$, defined as
$$ L(\theta) = \mathrm{MSE}(\hat{r}, \bar{r}), \qquad (38) $$
where $\theta$ is the set of all DNN parameters and $\mathrm{MSE}(\cdot)$ denotes the mean-squared error between $\hat{r}$ and $\bar{r}$. Note that the outputs
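A minimal sketch of the regression loss in (38) is given below, assuming the DNN output and the desired normalized reward are available as plain arrays; the toy values are illustrative.

```python
# Minimal sketch of Eq. (38): mean-squared error between the DNN output
# r_hat and the desired normalized reward r_bar.
import numpy as np

def mse_loss(r_hat, r_bar):
    """L(theta) = MSE(r_hat, r_bar)."""
    r_hat = np.asarray(r_hat, dtype=float)
    r_bar = np.asarray(r_bar, dtype=float)
    return np.mean((r_hat - r_bar) ** 2)

print(mse_loss([0.8, 0.4], [1.0, 0.5]))   # toy values
```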
Fig. 6. Performance comparisons versus the maximum transmit power at the BS.

Fig. 7. Performance comparisons versus the number of IRS elements.

similar to the suboptimal solution [14] (denoted as Baseline 1 [14]).
• The optimal BS transmit beamforming approach without IRS assistance (denoted as optimal BS without IRS). Without the IRS, the optimization problem (11) is transformed into
$$ \max_{\mathbf{V}} \min_{\{\Delta \mathbf{h}\}} \sum_{k \in \mathcal{K}} R_k^{sec} $$
$$ \text{s.t. } (a): R_k^{sec} \ge R_k^{sec,min}, \ \forall k \in \mathcal{K}, $$
$$ (b): \min_{\|\Delta \mathbf{h}_{bu}\|^2 \le (\varsigma_{bu})^2,\ \|\Delta \mathbf{h}_{ru}\|^2 \le (\varsigma_{ru})^2} (R_k^u) \ge R_k^{min}, \ \forall k \in \mathcal{K}, $$
$$ (c): \mathrm{Tr}(\mathbf{V}\mathbf{V}^H) \le P_{max}, $$
$$ (d): |\chi e^{j\theta_l}| = 1, \ 0 \le \theta_l \le 2\pi, \ \forall l \in \mathcal{L}. \qquad (39) $$
From the optimization problem (39), the system only needs to optimize the BS transmit beamforming matrix. Problem (39) is non-convex due to the rate constraints, and hence we consider semidefinite programming (SDP) relaxation to solve it. After transforming problem (39) into a convex optimization problem, we can use CVX to obtain the solution [12]–[16].

Fig. 6 shows the average secrecy rate and QoS satisfaction probability (consisting of the secrecy rate satisfaction probability in (11a) and the minimum rate satisfaction probability in (11b)) versus the maximum transmit power $P_{max}$, when $L = 40$ and $\rho = 0.95$. As expected, both the secrecy rate and the QoS satisfaction probability of all the approaches increase monotonically with increasing $P_{max}$. The reason is that when $P_{max}$ increases, the received SINR at the MUs improves, leading to the performance improvement. In addition, we find that our proposed learning approach outperforms the Baseline 1 approach. In fact, our approach jointly optimizes the beamforming matrices $\mathbf{V}$ and $\mathbf{\Psi}$, which simultaneously facilitates more favorable channel propagation for the MUs and impairs the eavesdroppers, while the Baseline 1 approach optimizes the beamforming matrices in an iterative way. Moreover, our proposed approach achieves higher performance than DQN in terms of both secrecy rate and QoS satisfaction probability, owing to its efficient learning capacity obtained by utilizing the PDS-learning and PER schemes in the dynamic environment. From Fig. 6, we also find that the three IRS-assisted secure beamforming approaches provide significantly higher secrecy rate and QoS satisfaction probability than the traditional system without an IRS. This indicates that the IRS can effectively guarantee secure communication and QoS requirements via reflecting beamforming, where the reflecting elements (IRS-induced phases) at the IRS can be adjusted to maximize the received SINR at the MUs and suppress the wiretapped rate at the eavesdroppers.

In Fig. 7, the achievable secrecy rate and QoS satisfaction level performance of all approaches are evaluated through
changing the number of IRS elements, i.e., from $L = 10$ to $60$, when $P_{max} = 30$ dBm and $\rho = 0.95$. For the secure beamforming approaches assisted by the IRS, the achievable secrecy rates and QoS satisfaction levels significantly increase with the number of IRS elements. The improvement results from the fact that with more IRS elements, more signal paths and more signal power can be reflected by the IRS to improve the received SINR at the MUs and to decrease the received SINR at the eavesdroppers. In addition, the performance of the approach without an IRS remains constant under the different numbers of IRS elements.

From Fig. 7(a), it is found that the secrecy rate of the proposed learning approach is higher than those of the Baseline 1 and DQN approaches; in particular, the performance gap increases with $L$. This is because, with more reflecting elements at the IRS, the proposed deep PDS-PER learning based secure communication approach becomes more flexible for optimal phase shift (reflecting beamforming) design and hence achieves higher gains. In addition, from Fig. 7(b), compared with the Baseline 1 and DQN approaches, as the number of reflecting elements at the IRS increases, we observe that the proposed learning approach is the first one to attain a 100% QoS satisfaction level. This superior achievement is based on the particular design of the QoS-aware reward function shown in (14) for secure communication.

Fig. 8. Performance comparisons versus outdated CSI coefficient ρ.

VI. CONCLUSION

In this work, we have investigated the joint BS's beamforming and IRS's reflect beamforming optimization problem under time-varying channel conditions. As the system is highly dynamic and complex, we have exploited the recent advances of machine learning and formulated the secure beamforming optimization problem as an RL problem. A deep PDS-PER learning based secure beamforming approach has been proposed to jointly optimize both the BS's beamforming and the IRS's reflect beamforming in the dynamic IRS-aided secure communication system, where PDS and PER schemes have been utilized to improve the learning convergence rate and efficiency. Simulation results have verified that the proposed learning approach outperforms other existing approaches in terms of enhancing the system secrecy rate and the QoS satisfaction probability.

REFERENCES

[1] N. Yang, L. Wang, G. Geraci, M. Elkashlan, J. Yuan, and M. D. Renzo, "Safeguarding 5G wireless communication networks using physical layer security," IEEE Commun. Mag., vol. 53, no. 4, pp. 20–27, Apr. 2015.
[2] A. D. Wyner, "The wiretap channel," Bell Syst. Tech. J., vol. 54, no. 8, pp. 1355–1387, Oct. 1975.
[3] Q. Li and L. Yang, "Beamforming for cooperative secure transmission in cognitive two-way relay networks," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 130–143, Jan. 2020.
[4] L. Xiao, X. Lu, D. Xu, Y. Tang, L. Wang, and W. Zhuang, "UAV relay in VANETs against smart jamming with reinforcement learning," IEEE Trans. Veh. Technol., vol. 67, no. 5, pp. 4087–4097, May 2018.
[5] W. Wang, K. C. Teh, and K. H. Li, "Artificial noise aided physical layer security in multi-antenna small-cell networks," IEEE Trans. Inf. Forensics Security, vol. 12, no. 6, pp. 1470–1482, Jun. 2017.
[6] H.-M. Wang, T. Zheng, and X.-G. Xia, "Secure MISO wiretap channels with multiantenna passive eavesdropper: Artificial noise vs. artificial fast fading," IEEE Trans. Wireless Commun., vol. 14, no. 1, pp. 94–106, Jan. 2015.
[7] R. Nakai and S. Sugiura, "Physical layer security in buffer-state-based max-ratio relay selection exploiting broadcasting with cooperative beamforming and jamming," IEEE Trans. Inf. Forensics Security, vol. 14, no. 2, pp. 431–444, Feb. 2019.
[8] Z. Mobini, M. Mohammadi, and C. Tellambura, "Wireless-powered full-duplex relay and friendly jamming for secure cooperative communications," IEEE Trans. Inf. Forensics Security, vol. 14, no. 3, pp. 621–634, Mar. 2019.
[9] Q. Wu and R. Zhang, "Towards smart and reconfigurable environment: Intelligent reflecting surface aided wireless network," IEEE Commun. Mag., vol. 58, no. 1, pp. 106–112, Jan. 2020.
[10] J. Zhao, "A survey of intelligent reflecting surfaces (IRSs): Towards 6G wireless communication networks," 2019, arXiv:1907.04789. [Online]. Available: http://arxiv.org/abs/1907.04789
[11] H. Han et al., "Intelligent reflecting surface aided power control for physical-layer broadcasting," 2019, arXiv:1912.03468. [Online]. Available: http://arxiv.org/abs/1912.03468
[12] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, "Reconfigurable intelligent surfaces for energy efficiency in wireless communication," IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 4157–4170, Aug. 2019.
[13] Q. Wu and R. Zhang, "Intelligent reflecting surface enhanced wireless network via joint active and passive beamforming," IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5394–5409, Nov. 2019.
[14] M. Cui, G. Zhang, and R. Zhang, "Secure wireless communication via intelligent reflecting surface," IEEE Wireless Commun. Lett., vol. 8, no. 5, pp. 1410–1414, Oct. 2019.
[15] H. Shen, W. Xu, S. Gong, Z. He, and C. Zhao, "Secrecy rate maximization for intelligent reflecting surface assisted multi-antenna communications," IEEE Commun. Lett., vol. 23, no. 9, pp. 1488–1492, Sep. 2019.
[16] X. Yu, D. Xu, and R. Schober, "Enabling secure wireless communications via intelligent reflecting surfaces," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Waikoloa, HI, USA, Dec. 2019, pp. 1–6.
[17] Q. Wu and R. Zhang, "Beamforming optimization for wireless network aided by intelligent reflecting surface with discrete phase shifts," IEEE Trans. Commun., vol. 68, no. 3, pp. 1838–1851, Mar. 2020.
[18] Z. Chu, W. Hao, P. Xiao, and J. Shi, "Intelligent reflecting surface aided multi-antenna secure transmission," IEEE Wireless Commun. Lett., vol. 9, no. 1, pp. 108–112, Jan. 2020.
[19] B. Feng, Y. Wu, and M. Zheng, "Secure transmission strategy for intelligent reflecting surface enhanced wireless system," 2019, arXiv:1909.00629. [Online]. Available: http://arxiv.org/abs/1909.00629
[20] J. Chen, Y.-C. Liang, Y. Pei, and H. Guo, "Intelligent reflecting surface: A programmable wireless environment for physical layer security," IEEE Access, vol. 7, pp. 82599–82612, 2019.
[21] X. Yu, D. Xu, Y. Sun, D. W. K. Ng, and R. Schober, "Robust and secure wireless communications via intelligent reflecting surfaces," 2019, arXiv:1912.01497. [Online]. Available: http://arxiv.org/abs/1912.01497
[22] X. Guan, Q. Wu, and R. Zhang, "Intelligent reflecting surface assisted secrecy communication: Is artificial noise helpful or not?" IEEE Wireless Commun. Lett., vol. 9, no. 6, pp. 778–782, Jun. 2020.
[23] L. Dong and H.-M. Wang, "Secure MIMO transmission via intelligent reflecting surface," IEEE Wireless Commun. Lett., vol. 9, no. 6, pp. 787–790, Jun. 2020.
[24] W. Jiang, Y. Zhang, J. Wu, W. Feng, and Y. Jin, "Intelligent reflecting surface assisted secure wireless communications with multiple-transmit and multiple-receive antennas," 2020, arXiv:2001.08963. [Online]. Available: http://arxiv.org/abs/2001.08963
[25] D. Xu, X. Yu, Y. Sun, D. W. K. Ng, and R. Schober, "Resource allocation for secure IRS-assisted multiuser MISO systems," 2019, arXiv:1907.03085. [Online]. Available: http://arxiv.org/abs/1907.03085
[26] C. Huang, G. C. Alexandropoulos, C. Yuen, and M. Debbah, "Indoor signal focusing with deep learning designed reconfigurable intelligent surfaces," 2019, arXiv:1905.07726. [Online]. Available: http://arxiv.org/abs/1905.07726
[27] A. Taha, M. Alrabeiah, and A. Alkhateeb, "Enabling large intelligent surfaces with compressive sensing and deep learning," 2019, arXiv:1904.10136. [Online]. Available: http://arxiv.org/abs/1904.10136
[28] K. Feng, Q. Wang, X. Li, and C.-K. Wen, "Deep reinforcement learning based intelligent reflecting surface optimization for MISO communication systems," IEEE Wireless Commun. Lett., vol. 9, no. 5, pp. 745–749, May 2020.
[29] C. Huang, R. Mo, and C. Yuen, "Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning," 2020, arXiv:2002.10072. [Online]. Available: http://arxiv.org/abs/2002.10072
[30] C. Li, W. Zhou, K. Yu, L. Fan, and J. Xia, "Enhanced secure transmission against intelligent attacks," IEEE Access, vol. 7, pp. 53596–53602, Aug. 2019.
[31] L. Xiao, G. Sheng, S. Liu, H. Dai, M. Peng, and J. Song, "Deep reinforcement learning-enabled secure visible light communication against eavesdropping," IEEE Trans. Commun., vol. 67, no. 10, pp. 6994–7005, Oct. 2019.
[32] M. Wiering and M. Otterlo, Reinforcement Learning: State of the Art (Computational Intelligence and Complexity). Springer, 2012.
[33] H. Yang, A. Alphones, W.-D. Zhong, C. Chen, and X. Xie, "Learning-based energy-efficient resource management by heterogeneous RF/VLC for ultra-reliable low-latency industrial IoT networks," IEEE Trans. Ind. Informat., vol. 16, no. 8, pp. 5565–5576, Aug. 2020.
[34] X. He, R. Jin, and H. Dai, "Deep PDS-learning for privacy-aware offloading in MEC-enabled IoT," IEEE Internet Things J., vol. 6, no. 3, pp. 4547–4555, Jun. 2019.
[35] N. Mastronarde and M. van der Schaar, "Joint physical-layer and system-level power management for delay-sensitive wireless communications," IEEE Trans. Mobile Comput., vol. 12, no. 4, pp. 694–709, Apr. 2013.
[36] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," in Proc. 4th Int. Conf. Learn. Represent. (ICLR), San Juan, PR, USA, May 2016, pp. 1–21.
[37] H. Gacanin and M. Di Renzo, "Wireless 2.0: Towards an intelligent radio environment empowered by reconfigurable meta-surfaces and artificial intelligence," 2020, arXiv:2002.11040. [Online]. Available: http://arxiv.org/abs/2002.11040
[38] C. Huang et al., "Holographic MIMO surfaces for 6G wireless networks: Opportunities, challenges, and trends," IEEE Wireless Commun., early access, Jul. 8, 2020, doi: 10.1109/MWC.001.1900534.
[39] F. B. Mismar, B. L. Evans, and A. Alkhateeb, "Deep reinforcement learning for 5G networks: Joint beamforming, power control, and interference coordination," IEEE Trans. Commun., vol. 68, no. 3, pp. 1581–1592, Mar. 2020.
[40] H. Yang, Z. Xiong, J. Zhao, D. Niyato, and L. Xiao, "Deep reinforcement learning based intelligent reflecting surface for secure wireless communications," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Taipei, Taiwan, Dec. 2020, pp. 1–30.

Helin Yang (Student Member, IEEE) received the B.S. and M.S. degrees from the School of Telecommunications Information Engineering, Chongqing University of Posts and Telecommunications, in 2013 and 2016, respectively, and the Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2020. His current research interests include wireless communication, visible light communication, the Internet of Things, and resource management.

Zehui Xiong (Student Member, IEEE) received the B.Eng. degree (Hons.) from the Huazhong University of Science and Technology, Wuhan, China, and the Ph.D. degree from Nanyang Technological University, Singapore. He is currently a Researcher with the Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University. He is also a Visiting Scholar with Princeton University and the University of Waterloo. His research interests include network economics, wireless communications, blockchain, and edge intelligence. He has published more than 60 peer-reviewed research papers in leading journals and flagship conferences, and three of them are ESI Highly Cited Papers. He has won several Best Paper awards. He is an Editor of Computer Networks (COMNET) (Elsevier) and Physical Communication (PHYCOM) (Elsevier), and an Associate Editor of IET Communications. He was a recipient of the Chinese Government Award for Outstanding Students Abroad in 2019 and the NTU SCSE Outstanding Ph.D. Thesis Runner-Up Award in 2020.
Jun Zhao (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from Carnegie Mellon University (CMU), USA (advisors: Virgil Gligor, Osman Yagan; collaborator: Adrian Perrig), affiliating with CMU's renowned CyLab Security and Privacy Institute, and a bachelor's degree from Shanghai Jiao Tong University in China. He is currently an Assistant Professor with the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore. Before joining NTU, first as a Post-Doctoral Researcher with Xiaokui Xiao and then as a Faculty Member, he was a Post-Doctoral Researcher with Arizona State University as an Arizona Computing Post-Doctoral Best Practices Fellow (advisors: Junshan Zhang, Vincent Poor). His research interests include communications, networks, security, and AI.

Liang Xiao (Senior Member, IEEE) received the B.S. degree in communication engineering from the Nanjing University of Posts and Telecommunications, China, in 2000, the M.S. degree in electrical engineering from Tsinghua University, China, in 2003, and the Ph.D. degree in electrical engineering from Rutgers University, New Brunswick, NJ, USA, in 2009. She was a Visiting Professor with Princeton University, Virginia Tech, and the University of Maryland, College Park, MD, USA. She is currently a Professor with the Department of Information and Communication Engineering, Xiamen University, Xiamen, China. She has served as an Associate Editor for the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY and a Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING. She was a recipient of the Best Paper Award for the 2016 INFOCOM Big Security WS and 2017 ICC.