Deep Reinforcement Learning-Based Intelligent Reflecting Surface for Secure Wireless Communications
Abstract— In this paper, we study an intelligent reflecting surface (IRS)-aided wireless secure communication system, where an IRS is deployed to adjust its reflecting elements to secure the communication of multiple legitimate users in the presence of multiple eavesdroppers. Aiming to improve the system secrecy rate, a design problem for jointly optimizing the base station (BS)'s beamforming and the IRS's reflecting beamforming is formulated considering different quality of service (QoS) requirements and time-varying channel conditions. As the system is highly dynamic and complex and the resulting optimization problem is non-convex, a novel deep reinforcement learning (DRL)-based secure beamforming approach is first proposed to achieve the optimal beamforming policy against eavesdroppers in dynamic environments. Furthermore, post-decision state (PDS) and prioritized experience replay (PER) schemes are utilized to enhance the learning efficiency and secrecy performance. Specifically, a modified PDS scheme is presented to trace the channel dynamics and adjust the beamforming policy against channel uncertainty accordingly. Simulation results demonstrate that the proposed deep PDS-PER learning based secure beamforming approach can significantly improve the system secrecy rate and QoS satisfaction probability in IRS-aided secure communication systems.

Index Terms— Secure communication, intelligent reflecting surface, beamforming, secrecy rate, deep reinforcement learning.

I. INTRODUCTION

PHYSICAL layer security (PLS) has attracted increasing attention as an alternative to cryptography-based techniques for wireless communications [1], where PLS exploits the wireless channel characteristics through signal processing designs and channel coding to support secure communication services without relying on a shared secret key [1], [2]. So far, a variety of approaches have been reported to improve PLS in wireless communication systems, e.g., cooperative relaying strategies [3], [4], artificial noise-assisted beamforming [5], [6], and cooperative jamming [7], [8]. However, employing a large number of active antennas and relays in PLS systems incurs excessive hardware cost and system complexity. Moreover, cooperative jamming and transmitting artificial noise require extra transmit power for security guarantees.

To tackle these shortcomings of the existing approaches [3]–[8], a new paradigm, called intelligent reflecting surface (IRS) [9]–[13], has been proposed as a promising technique to achieve high spectrum efficiency and energy efficiency, and to enhance the secrecy rate in fifth generation (5G) and beyond wireless communication systems. In particular, an IRS is a uniform planar array comprised of a number of low-cost passive reflecting elements, each of which adaptively adjusts its reflection amplitude and/or phase to control the strength and direction of the electromagnetic wave; hence, the IRS is capable of enhancing and/or weakening the reflected signals at different users [9].
As a result, the signal reflected by the IRS can increase the received signal at legitimate users while suppressing the signal at the eavesdroppers [9]–[13]. Hence, from the PLS perspective, some innovative studies have recently been devoted to performance optimization for IRS-aided secure communications [14]–[25].

A. Related Works

Initial studies on IRS-aided secure communication systems have been reported in [14]–[17], where a simple system model with only a single-antenna legitimate user and a single-antenna eavesdropper was considered. The authors in [14] and [15] applied the alternating optimization (AO) algorithm to jointly optimize the transmit beamforming vector at the base station (BS) and the phase elements at the IRS to maximize the secrecy rate, but they did not extend their models to multi-user IRS-assisted secure communication systems. To minimize the transmit power at the BS subject to the secrecy rate constraint, the authors in [18] utilized an AO solution and semidefinite programming (SDP) relaxation to jointly optimize the power allocation and the IRS reflecting beamforming. In addition, Feng et al. [19] studied a secure transmission framework with an IRS to minimize the system transmit power in the cases of rank-one and full-rank BS-IRS links, and derived a closed-form expression of the beamforming matrix. Different from these studies [14]–[19], which considered only a single eavesdropper, secure communication systems comprising multiple eavesdroppers were investigated in [20]–[22]. Chen et al. [20] presented a minimum-secrecy-rate maximization design to provide secure communication services for multiple legitimate users while keeping them secret from multiple eavesdroppers in an IRS-aided multi-user multiple-input single-output (MISO) system, but the simplification of the optimization problem may cause a performance loss. The authors in [23] and [24] studied an IRS-aided multiple-input multiple-output (MIMO) channel, where a multi-antenna BS transmits a data stream to a multi-antenna legitimate user in the presence of an eavesdropper configured with multiple antennas, and a suboptimal secrecy rate maximization approach was presented to optimize the beamforming policy. In addition to the use of AO or SDP in system performance optimization, the minorization-maximization (MM) algorithm was recently utilized to optimize the joint transmit beamforming at the BS and the phase shift coefficients at the IRS [16], [23].

Moreover, the authors in [22] and [25] employed artificial noise-aided beamforming for IRS-aided MISO secure communication systems to improve the system secrecy rate, and an AO-based solution was applied to jointly optimize the BS's beamforming, the artificial noise interference vector, and the IRS's reflecting beamforming with the goal of maximizing the secrecy rate. All these existing studies [14]–[20], [22]–[25] assumed that perfect channel state information (CSI) of legitimate users or eavesdroppers is available at the BS, which is not a practical assumption. The reason is that acquiring perfect CSI at the BS is challenging, since the corresponding CSI may be outdated when the channel is time-varying due to the transmission delay, processing delay, and high mobility of users. Hence, Yu et al. [21] investigated an optimization problem considering the impacts of outdated CSI of the eavesdropping channels in an IRS-aided secure communication system, and a robust algorithm was proposed to address the optimization problem in the presence of multiple eavesdroppers.

The above mentioned studies [14]–[25] mainly applied traditional optimization techniques, e.g., AO, SDP, or MM algorithms, to jointly optimize the BS's beamforming and the IRS's reflecting beamforming in IRS-aided secure communication systems, which are less efficient for large-scale systems. Inspired by the recent advances of artificial intelligence (AI), several works attempted to utilize AI algorithms to optimize the IRS's reflecting beamforming [26]–[29]. Deep learning (DL) was exploited in [26] and [27] to search for the optimal IRS reflection matrices that maximize the achievable system rate in an IRS-aided communication system, and the simulations demonstrated that DL significantly outperforms conventional algorithms. Moreover, the authors in [28] and [29] proposed deep reinforcement learning (DRL)-based approaches to address the non-convex optimization problem, where the phase shifts at the IRS are optimized effectively. However, the works [26]–[29] merely considered maximizing the achievable rate of a single user without considering the scenarios of multiple users, secure communication, and imperfect CSI in their models. The authors in [30] and [31] applied reinforcement learning (RL) to achieve smart beamforming at the BS against an eavesdropper in complex environments, but an IRS-aided secure communication system needs to optimize the IRS's reflect beamforming in addition to the BS's transmit beamforming. To the best of our knowledge, RL or DRL has not been explored in prior works to optimize both the BS's transmit beamforming and the IRS's reflect beamforming in dynamic IRS-aided secure communication systems under the condition of multiple eavesdroppers and imperfect CSI, which thus motivates this work.

B. Contributions

In this paper, we investigate an IRS-aided secure communication system with the objective of maximizing the system secrecy rate of multiple legitimate users in the presence of multiple eavesdroppers under realistic time-varying channels, while guaranteeing the quality of service (QoS) requirements of the legitimate users. A novel DRL-based secure beamforming approach is first proposed to jointly optimize the beamforming matrix at the BS and the reflecting beamforming matrix (reflection phases) at the IRS in dynamic environments. The major contributions of this paper are summarized as follows:
• The physical secure communication based on IRS with multiple eavesdroppers is investigated under the condition of time-varying channel coefficients. In addition, we formulate a joint BS transmit beamforming and IRS reflect beamforming optimization problem with the goal of maximizing the system secrecy rate while considering the QoS requirements of legitimate users.
• An RL-based intelligent beamforming framework is presented to achieve the optimal BS's beamforming and the IRS's reflecting beamforming, where the central
where $n_k$ denotes the additive complex Gaussian noise (AWGN) with zero mean and variance $\delta_k^2$ at the $k$-th MU. In (2), we observe that in addition to the received desired signal, each MU also suffers inter-user interference (IUI) in the system. In addition, the received signal at eavesdropper $m$ is expressed by
$$ y_m = \left( \mathbf{h}_{re,m}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{be,m}^H \right) \sum_{k \in \mathcal{K}} \mathbf{v}_k s_k + n_m \qquad (3) $$
where $n_m$ is the AWGN of eavesdropper $m$ with variance $\delta_m^2$.

In practical systems, it is not easy for the BS and the IRS to obtain perfect CSI [9], [21]. This is due to the fact that both transmission delay and processing delay exist, as well as the mobility of the users. Therefore, the CSI is outdated at the time when the BS and the IRS transmit the data stream to the MUs [21]. Once this outdated CSI is employed for beamforming, it will have a negative effect on the demodulation at the MUs, thereby leading to substantial performance loss [21]. Therefore, it is necessary to consider outdated CSI in the IRS-aided secure communication system. Let $T_{delay}$ denote the delay between the outdated CSI and the real-time CSI. In other words, when the BS receives the pilot sequences sent from the MUs at time slot $t$, it completes the channel estimation process and begins to transmit the data stream to the MUs at time slot $t + T_{delay}$. Hence, the relation between the outdated channel vector $\mathbf{h}(t)$ and the real-time channel vector $\mathbf{h}(t + T_{delay})$ can be expressed by
$$ \mathbf{h}(t + T_{delay}) = \rho \mathbf{h}(t) + \sqrt{1 - \rho^2}\, \hat{\mathbf{h}}(t + T_{delay}). \qquad (4) $$

The CSI estimation errors can be bounded with respect to the Euclidean norm by using the norm-bounded error model, i.e.,
$$ \|\Delta \mathbf{h}_{bu}\|^2 \le (\varsigma_{bu})^2, \quad \|\Delta \mathbf{h}_{ru}\|^2 \le (\varsigma_{ru})^2, \quad \|\Delta \mathbf{h}_{be}\|^2 \le (\varsigma_{be})^2, \quad \|\Delta \mathbf{h}_{re}\|^2 \le (\varsigma_{re})^2, \qquad (7) $$
where $\varsigma_{bu}$, $\varsigma_{ru}$, $\varsigma_{be}$, and $\varsigma_{re}$ refer to the radii of the deterministically bounded error regions.

Under the channel uncertainty model, the achievable rate of the $k$-th MU is given by
$$ R_k^u = \log_2\!\left( 1 + \frac{\left| (\mathbf{h}_{ru,k}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{bu,k}^H) \mathbf{v}_k \right|^2}{\sum_{i \in \mathcal{K}, i \ne k} \left| (\mathbf{h}_{ru,k}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{bu,k}^H) \mathbf{v}_i \right|^2 + \delta_k^2} \right). \qquad (8) $$

If the $m$-th eavesdropper attempts to eavesdrop on the signal of the $k$-th MU, its achievable rate can be expressed by
$$ R_{m,k}^e = \log_2\!\left( 1 + \frac{\left| (\mathbf{h}_{re,m}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{be,m}^H) \mathbf{v}_k \right|^2}{\sum_{i \in \mathcal{K}, i \ne k} \left| (\mathbf{h}_{re,m}^H \mathbf{\Psi} \mathbf{H}_{br} + \mathbf{h}_{be,m}^H) \mathbf{v}_i \right|^2 + \delta_m^2} \right). \qquad (9) $$

Since each eavesdropper can eavesdrop on any of the $K$ MUs' signals, according to [14]–[25], the achievable individual secrecy rate from the BS to the $k$-th MU can be expressed by
$$ R_k^{sec} = \left[ R_k^u - \max_{\forall m} R_{m,k}^e \right]^+ \qquad (10) $$
where $[z]^+ = \max(0, z)$.
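To make the channel and rate model concrete, the following minimal NumPy sketch implements the Gauss-Markov aging relation in (4) and the rate expressions (8)–(10); the array dimensions, variable names, and random channel draws are illustrative assumptions rather than the paper's simulation settings.

```python
# Minimal NumPy sketch of the outdated-CSI model (4) and the rate
# expressions (8)-(10); shapes and values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def age_channel(h_old, rho):
    """Gauss-Markov channel aging, Eq. (4): h(t+T) = rho*h(t) + sqrt(1-rho^2)*e."""
    e = (rng.standard_normal(h_old.shape) + 1j * rng.standard_normal(h_old.shape)) / np.sqrt(2)
    return rho * h_old + np.sqrt(1.0 - rho**2) * e

def user_rate(h_ru, H_br, h_bu, Psi, V, k, noise_var):
    """Achievable rate of MU k, Eq. (8): effective channel = h_ru^H Psi H_br + h_bu^H."""
    g = h_ru.conj().T @ Psi @ H_br + h_bu.conj().T   # 1 x N effective channel
    powers = np.abs(g @ V) ** 2                      # received power from each beam
    sig = powers[0, k]
    interference = powers.sum() - sig
    return np.log2(1.0 + sig / (interference + noise_var))

def secrecy_rate(rate_user_k, rates_eve_k):
    """Eq. (10): [R_k^u - max_m R_{m,k}^e]^+."""
    return max(0.0, rate_user_k - max(rates_eve_k))

# Toy dimensions: N BS antennas, L IRS elements, K users.
N, L, K = 4, 8, 2
H_br = rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))
h_ru = rng.standard_normal((L, 1)) + 1j * rng.standard_normal((L, 1))
h_bu = rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))
Psi = np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, L)))   # unit-modulus phase shifts
V = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))

print(user_rate(h_ru, H_br, h_bu, Psi, V, k=0, noise_var=1.0))
print(age_channel(h_ru, rho=0.95).shape)
```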
the coupling of the optimization variables ($\mathbf{V}$ and $\mathbf{\Psi}$) and the unit-norm constraints in (11d) are non-convex. In addition, we consider a robust beamforming design to maximize the worst-case achievable secrecy rate of the system while guaranteeing the worst-case constraints.

III. PROBLEM TRANSFORMATION BASED ON RL

The optimization problem given in (11) is difficult to address as it is a non-convex problem. In addition, in realistic IRS-aided secure communication systems, the capabilities of MUs, the channel quality, and the service applications change dynamically. Moreover, the problem in (11) is a single-time-slot optimization problem, which may converge to a suboptimal solution and yield greedy-search-like performance because it ignores the historical system states and the long-term benefit. Hence, it is generally infeasible to apply the traditional optimization techniques (AO, SDP, and MM) to achieve an effective secure beamforming policy in uncertain dynamic environments.

Model-free RL is a dynamic programming tool which can be adopted to solve the decision-making problem by learning the optimal solution in dynamic environments [32]. Hence, we model the secure beamforming optimization problem as an RL problem. In RL, the IRS-aided secure communication system is treated as the environment, and the central controller at the BS is regarded as the learning agent. The key elements of RL are defined as follows.

State space: Let $\mathcal{S}$ denote the system state space. The current system state $s \in \mathcal{S}$ includes the channel information of all users, the secrecy rate, the transmission data rate of the last time slot, and the QoS satisfaction level, which is defined as
$$ s = \left\{ \{\mathbf{h}_k\}_{k \in \mathcal{K}}, \{\mathbf{h}_m\}_{m \in \mathcal{M}}, \{R_k^{sec}\}_{k \in \mathcal{K}}, \{R_k\}_{k \in \mathcal{K}}, \{QoS_k\}_{k \in \mathcal{K}} \right\} \qquad (12) $$
where $\mathbf{h}_k$ and $\mathbf{h}_m$ are the channel coefficients of the $k$-th MU and the $m$-th eavesdropper, respectively. $QoS_k$ is the feedback QoS satisfaction level of the $k$-th MU, which consists of both the minimum secrecy rate satisfaction level in (11a) and the minimum data rate satisfaction level in (11b). The other parameters in (12) are already defined in Section II.

Action space: Let $\mathcal{A}$ denote the system action space. According to the observed system state $s$, the central controller chooses the beamforming vectors $\{\mathbf{v}_k\}_{k \in \mathcal{K}}$ at the BS and the reflecting beamforming coefficients (phase shifts) $\{\theta_l\}_{l \in \mathcal{L}}$ at the IRS. Hence, the action $a \in \mathcal{A}$ can be defined by
$$ a = \left\{ \{\mathbf{v}_k\}_{k \in \mathcal{K}}, \{\theta_l\}_{l \in \mathcal{L}} \right\}. \qquad (13) $$

Transition probability: Let $\mathcal{T}(s'|s, a)$ represent the transition probability, which is the probability of transitioning to a new state $s' \in \mathcal{S}$ given the action $a$ executed in the state $s$.

Reward function: In RL, the reward acts as a signal to evaluate how good the secure beamforming policy is when the agent executes an action at the current state. The system performance will be enhanced when the reward function at each learning step correlates with the desired objective. Thus, it is important to design an efficient reward function to improve the MUs' QoS satisfaction levels.

In this paper, the reward function represents the optimization objective, which is to maximize the system secrecy rate of all MUs while guaranteeing their QoS requirements. Thus, the presented QoS-aware reward function is expressed as
$$ r = \underbrace{\sum_{k \in \mathcal{K}} R_k^{sec}}_{\text{part 1}} - \underbrace{\mu_1 \sum_{k \in \mathcal{K}} p_k^{sec}}_{\text{part 2}} - \underbrace{\mu_2 \sum_{k \in \mathcal{K}} p_k^{u}}_{\text{part 3}} \qquad (14) $$
where
$$ p_k^{sec} = \begin{cases} 1, & \text{if } R_k^{sec} < R_k^{sec,min}, \ \forall k \in \mathcal{K}, \\ 0, & \text{otherwise}, \end{cases} \qquad (15) $$
$$ p_k^{u} = \begin{cases} 1, & \text{if } R_k < R_k^{min}, \ \forall k \in \mathcal{K}, \\ 0, & \text{otherwise}. \end{cases} \qquad (16) $$

In (14), part 1 represents the immediate utility (system secrecy rate), while part 2 and part 3 are cost functions defined as the unsatisfied secrecy rate requirement and the unsatisfied minimum rate requirement, respectively. The coefficients $\mu_1$ and $\mu_2$ are positive constants of part 2 and part 3 in (14), respectively, and they are used to balance the utility and the cost [33]–[35].

The goals of (15) and (16) are to impose the QoS satisfaction levels of both the secrecy rate and the minimum data rate requirements, respectively. If the QoS requirement is satisfied in the current time slot, then $p_k^{sec} = 0$ or $p_k^{u} = 0$, indicating that there is no punishment in the reward function owing to the successful QoS guarantee.

The goal of the learning agent is to search for an optimal policy $\pi^*$ ($\pi$ is a mapping from states in $\mathcal{S}$ to the probabilities of choosing an action in $\mathcal{A}$: $\pi(s): \mathcal{S} \rightarrow \mathcal{A}$) that maximizes the long-term expected discounted reward, where the cumulative discounted reward function can be defined as
$$ U_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau+1} \qquad (17) $$
where $\gamma \in (0, 1]$ denotes the discount factor. Under a certain policy $\pi$, the state-action function of the agent with a state-action pair $(s, a)$ is given by
$$ Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[ U_t \,|\, s_t = s, a_t = a \right]. \qquad (18) $$

The conventional Q-learning algorithm can be adopted to learn the optimal policy. The key objective of Q-learning is to update the Q-table by using the Bellman equation as follows:
$$ Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[ r_t + \gamma \sum_{s_{t+1} \in \mathcal{S}} \mathcal{T}(s_{t+1}|s_t, a_t) \sum_{a_{t+1} \in \mathcal{A}} \pi(s_{t+1}, a_{t+1}) Q^{\pi}(s_{t+1}, a_{t+1}) \right]. \qquad (19) $$

The optimal action-value function in (18) satisfies the Bellman optimality equation, which is expressed by
$$ Q^{*}(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}). \qquad (20) $$
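As a rough illustration of how the QoS-aware reward (14)–(16) and a tabular update toward the Bellman optimality target (20) could be computed, consider the following sketch; the penalty coefficients, learning rate, discount factor, and discretized state/action sets are placeholder assumptions, not the paper's settings.

```python
# Hedged sketch of the QoS-aware reward (14)-(16) and a tabular Q-learning
# step toward the Bellman optimality target (20).
import numpy as np

def qos_reward(secrecy_rates, user_rates, sec_min, rate_min, mu1=1.0, mu2=1.0):
    """r = sum_k R_k^sec - mu1 * sum_k p_k^sec - mu2 * sum_k p_k^u, Eq. (14)."""
    p_sec = [1.0 if r < rmin else 0.0 for r, rmin in zip(secrecy_rates, sec_min)]  # Eq. (15)
    p_u = [1.0 if r < rmin else 0.0 for r, rmin in zip(user_rates, rate_min)]      # Eq. (16)
    return sum(secrecy_rates) - mu1 * sum(p_sec) - mu2 * sum(p_u)

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular update: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage with 4 discretized states and 3 actions.
Q = np.zeros((4, 3))
r = qos_reward([1.2, 0.4], [3.0, 1.1], sec_min=[0.5, 0.5], rate_min=[1.0, 1.0])
Q = q_learning_step(Q, s=0, a=2, r=r, s_next=1)
print(r, Q[0, 2])
```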
At time slot $t$, the PDS action-value function $\tilde{Q}(\tilde{s}_t, a_t)$ of the current PDS state-action pair $(\tilde{s}_t, a_t)$ is defined as
$$ \tilde{Q}(\tilde{s}_t, a_t) = r^u(\tilde{s}_t, a_t) + \gamma \sum_{s_{t+1}} \mathcal{T}^u(s_{t+1}|\tilde{s}_t, a_t) V(s_{t+1}). \qquad (26) $$

By employing the extra information (the known transition probability $\mathcal{T}^k(\tilde{s}_t|s_t, a_t)$ and the known reward $r^k(s_t, a_t)$), the Q-function $\hat{Q}(s_t, a_t)$ in PDS-learning can be further expanded over all state-action pairs $(s, a)$, which is expressed by
$$ \hat{Q}(s_t, a_t) = r^k(s_t, a_t) + \sum_{\tilde{s}_t} \mathcal{T}^k(\tilde{s}_t|s_t, a_t) \tilde{Q}(\tilde{s}_t, a_t). \qquad (27) $$

The state-value function in PDS-learning is defined by
$$ \hat{V}_t(s_t) = \sum_{s_{t+1}} \mathcal{T}^k(s_{t+1}|s_t, a_t) \tilde{V}(s_{t+1}) \qquad (28) $$
where $\tilde{V}_t(s_{t+1}) = \max_{a_t \in \mathcal{A}} \tilde{Q}_t(\tilde{s}_{t+1}, a_t)$. At each time slot, the PDS action-value function $\tilde{Q}(\tilde{s}_t, a_t)$ is updated by
$$ \tilde{Q}_{t+1}(\tilde{s}_t, a_t) = (1 - \alpha_t) \tilde{Q}_t(\tilde{s}_t, a_t) + \alpha_t \left[ r^u(\tilde{s}_t, a_t) + \gamma \hat{V}_t(s_{t+1}) \right]. \qquad (29) $$
After updating $\tilde{Q}_{t+1}(\tilde{s}_t, a_t)$, the action-value function $\hat{Q}_{t+1}(s_t, a_t)$ can be updated by plugging $\tilde{Q}_{t+1}(\tilde{s}_t, a_t)$ into (27).

Based on the above modified PDS-learning, a deep PDS learning algorithm is presented. In the presented learning algorithm, the traditional DQN is adopted to estimate the action-value function $Q(s, a)$ by $Q(s, a; \theta)$, where $\theta$ denotes the DNN parameters. The objective of DQN is to minimize the following loss function at each time slot:
$$ L(\theta_t) = \left\{ \hat{V}_t(s_t; \theta_t) - \hat{Q}(s_t, a_t; \theta_t) \right\}^2 = \left\{ r(s_t, a_t) + \gamma \max_{a_{t+1} \in \mathcal{A}} \hat{Q}_t(s_{t+1}, a_{t+1}; \theta_t) - \hat{Q}(s_t, a_t; \theta_t) \right\}^2 \qquad (30) $$
where $\hat{V}_t(s_t; \theta_t) = r(s_t, a_t) + \gamma \max_{a_{t+1} \in \mathcal{A}} \hat{Q}_t(s_{t+1}, a_{t+1}; \theta_t)$ is the target value. The error between $\hat{V}_t(s_t; \theta_t)$ and the estimated value $\hat{Q}(s_t, a_t; \theta_t)$ is usually called the temporal-difference (TD) error, which is expressed by
$$ \delta_t = \hat{V}_t(s_t; \theta_t) - \hat{Q}(s_t, a_t; \theta_t). \qquad (31) $$

The DNN parameter $\theta$ is obtained by taking the partial derivative of the objective function (30) with respect to $\theta$, which gives
$$ \theta_{t+1} = \theta_t + \beta \nabla L(\theta_t) \qquad (32) $$
where $\beta$ is the learning rate of $\theta$, and $\nabla(\cdot)$ denotes the first-order partial derivative.

Accordingly, the policy $\hat{\pi}_t(s)$ of the modified deep PDS-learning algorithm is given by
$$ \hat{\pi}_t(s) = \arg\max_{a_t \in \mathcal{A}} \hat{Q}(s_t, a_t; \theta_t). \qquad (33) $$

Although DQN is capable of performing well in policy learning with continuous and high-dimensional state spaces, the DNN may learn ineffectively and diverge owing to nonstationary targets and correlations between samples. Experience replay is utilized to avoid the divergence of the RL algorithm. However, classical DQN uniformly samples each transition $e_t = \langle s_t, a_t, r_t, \tilde{s}_t, s_{t+1} \rangle$ from the experience replay buffer, which may have an uncertain or negative effect on learning a better policy. The reason is that different transitions (experience information) in the replay buffer have different importance for the learning policy, and sampling every transition equally may result in inefficient usage of meaningful transitions. Therefore, a prioritized experience replay (PER) scheme has been presented to address this issue and enhance the sampling efficiency [36], [37], where the priority of a transition is determined by the value of its TD error. In PER, a transition with a higher absolute TD error has higher priority in the sense that it provides a more aggressive correction for the action-value function.

In the deep PDS-PER learning algorithm, similar to classical DQN, the agent collects and stores each experience $e_t = \langle s_t, a_t, r_t, \tilde{s}_t, s_{t+1} \rangle$ in its experience replay buffer, and the DNN updates its parameters by sampling a mini-batch of tuples from the replay buffer. So far, PER has been adopted only for DRL and Q-learning, and has never been employed with the PDS-learning algorithm to learn the dynamic information. In this paper, we further extend this PER scheme to enable prioritized experience replay in the proposed deep PDS-PER learning framework, in order to improve the learning convergence rate.

The probability of sampling transition $i$ (experience $i$) based on the absolute TD error is defined by
$$ p(i) = \frac{|\delta(i)|^{\eta_1}}{\sum_j |\delta(j)|^{\eta_1}} \qquad (34) $$
where the exponent $\eta_1$ weights how much prioritization is used, with $\eta_1 = 0$ corresponding to uniform sampling. A transition with higher $p(i)$ is more likely to be replayed from the replay buffer, which is associated with very successful attempts while preventing the DNN from overfitting. With the help of PER, the proposed deep PDS-PER learning algorithm tends to replay valuable experience and hence learns more effectively to find the best policy.

It is worth noting that experiences with high absolute TD error are replayed more frequently, which alters the visitation frequency of some experiences and hence makes the training process of the DNN prone to divergence. To address this problem, importance-sampling (IS) weights are adopted in the calculation of the weight changes:
$$ W(i) = (D \cdot p(i))^{-\eta_2} \qquad (35) $$
where $D$ is the size of the experience replay buffer, and the parameter $\eta_2$ is used to adjust the amount of correction used.

Accordingly, by incorporating the PER scheme into the deep PDS-PER learning, the DNN loss function (30) and the corresponding parameter update are rewritten, respectively, as follows:
$$ L(\theta_t) = \frac{1}{H} \sum_{i=1}^{H} W_i L_i(\theta_t) \qquad (36) $$
$$ \theta_{t+1} = \theta_t + \beta \delta_t \nabla_{\theta} L(\theta_t). \qquad (37) $$
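The following sketch illustrates, under assumed values of $\eta_1$ and $\eta_2$, how the prioritized sampling probability (34), the importance-sampling weights (35), and the PDS value update (29) could be implemented; it is an illustration, not the authors' implementation.

```python
# Illustrative sketch of PER sampling (34), IS weights (35), and the
# PDS action-value update (29); eta1, eta2, and the toy buffer are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def per_probabilities(td_errors, eta1=0.6):
    """Eq. (34): p(i) = |delta(i)|^eta1 / sum_j |delta(j)|^eta1."""
    pri = np.abs(td_errors) ** eta1
    return pri / pri.sum()

def is_weights(probs, eta2=0.4):
    """Eq. (35): W(i) = (D * p(i))^(-eta2), normalized by the max for stability."""
    D = len(probs)
    w = (D * probs) ** (-eta2)
    return w / w.max()

def pds_update(q_pds, r_u, v_next, alpha=0.1, gamma=0.95):
    """Eq. (29): Q~(s~,a) <- (1-alpha)*Q~(s~,a) + alpha*(r^u + gamma*V^(s'))."""
    return (1.0 - alpha) * q_pds + alpha * (r_u + gamma * v_next)

# Sample a mini-batch of 4 transitions out of a buffer of 10 by priority.
td = rng.standard_normal(10)
p = per_probabilities(td)
batch_idx = rng.choice(len(td), size=4, replace=False, p=p)
print(batch_idx, is_weights(p)[batch_idx], pds_update(0.5, r_u=1.0, v_next=0.8))
```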
Theorem 1: The presented deep PDS-PER learning converges to the optimal $\hat{Q}(s_t, a_t)$ of the MDP with probability 1 when the learning rate sequence $\alpha_t$ satisfies $\alpha_t \in [0, 1)$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. These requirements appear in most RL algorithms [32] and are not specific to the proposed deep PDS-PER learning algorithm.

Proof: If each action can be executed with an infinite number of learning steps at each system state, or in other words, the learning policy is greedy with infinite exploration, the Q-function $\hat{Q}(s_t, a_t)$ in PDS-learning and its corresponding policy $\pi(s)$ will converge to the optimal points, respectively, with probability 1 [33]–[35]. The existing references [34] and [35] have provided the proof.

B. Secure Beamforming Based on the Proposed Deep PDS-PER Learning

Similar to most DRL algorithms, our proposed deep PDS-PER learning based secure beamforming approach consists of two stages, i.e., the training stage and the implementation stage. The training process of the proposed approach is shown in Algorithm 1. A central controller at the BS is responsible for collecting environment information and making decisions for secure beamforming.

Algorithm 1 Deep PDS-PER Learning Based Secure Beamforming
1: Input: IRS-aided secure communication simulator and QoS requirements of all MUs (e.g., minimum secrecy rate and transmission rate).
2: Initialize: DQN with initial Q-function $Q(s, a; \theta)$, parameters $\theta$, learning rates $\alpha$ and $\beta$.
3: Initialize: experience replay buffer $\mathcal{D}$ with size $D$, and mini-batch size $H$.
4: for each episode = 1, 2, …, $N^{epi}$ do
5:   Observe an initial system state $s$;
6:   for each time step $t$ = 0, 1, 2, …, $T$ do
7:     Select an action based on the ε-greedy policy at the current state $s_t$: choose a random action $a_t$ with probability ε;
8:     Otherwise, $a_t = \arg\max_{a_t \in \mathcal{A}} Q(s_t, a_t; \theta_t)$;
9:     Execute action $a_t$, receive an immediate reward $r^k(s_t, a_t)$ and observe the state transition from $s_t$ to the PDS state $\tilde{s}_t$ and then to the next state $s_{t+1}$;
10:    Update the reward function $r(s_t, a_t)$ under PDS-learning using (25);
11:    Update the PDS action-value function $\tilde{Q}(\tilde{s}_t, a_t; \theta_t)$ using (29);
12:    Update the Q-function $\hat{Q}(s_t, a_t; \theta_t)$ using (27);
13:    Store the PDS experience $e_t = \langle s_t, a_t, r_t, \tilde{s}_t, s_{t+1} \rangle$ in the experience replay buffer $\mathcal{D}$; if $\mathcal{D}$ is full, remove the least used experience from $\mathcal{D}$;
14:    for $i$ = 1, 2, …, $H$ do
15:      Sample transition $i$ with probability $p(i)$ using (34);
16:      Calculate the absolute TD error $|\delta(i)|$ in (31);
17:      Update the corresponding IS weight $W_i$ using (35);
18:      Update the priority of transition $i$ based on $|\delta(i)|$;
19:    end for
20:    Update the loss function $L(\theta)$ and parameter $\theta$ of the DQN using (36) and (37), respectively;
21:  end for
22: end for
23: Output: Return the deep PDS-PER learning model.

In the training stage, similar to RL-based policy control, the controller initializes the network parameters and observes the current system state, including the CSI of all users, the previously predicted secrecy rate, and the transmission data rate. Then, the state vector is input into the DQN to train the learning model. The ε-greedy scheme is leveraged to balance exploration and exploitation: the action with the maximum reward is selected with probability 1−ε according to the current information (exploitation of known knowledge), while a random action is chosen with probability ε (exploration of unknown knowledge, i.e., keep trying new actions in the hope of obtaining an even higher reward). After executing the selected action, the agent receives a reward from the environment and observes the state transition from $s_t$ to the PDS state $\tilde{s}_t$ and then to the next state $s_{t+1}$. Then, PDS-learning is used to update the PDS action-value function $\tilde{Q}(\tilde{s}_t, a_t; \theta_t)$ and the Q-function $\hat{Q}(s_t, a_t; \theta_t)$, before collecting and storing the transition tuple (also called experience) $e_t = \langle s_t, a_t, r_t, \tilde{s}_t, s_{t+1} \rangle$ in the experience replay buffer $\mathcal{D}$, which includes the current system state, the selected action, the instantaneous reward, and the PDS state along with the next state. The experience in the replay buffer is selected by the PER scheme to generate mini-batches, which are used to train the DQN. In detail, the priority $p(i)$ of each transition is calculated by using (34), and its IS weight $W(i)$ is then obtained from (35), where the priorities ensure that transitions with high TD error $\delta(i)$ are replayed more frequently. The weight $W(i)$ is integrated into the deep PDS learning to update both the loss function $L(\theta)$ and the DNN parameter $\theta$. Once the DQN converges, the deep PDS-PER learning model is obtained.

After adequate training in Algorithm 1, the learning model is loaded for the implementation stage. During the implementation stage, the controller uses the trained learning model to output its selected action $a$ by feeding the observed state $s$ from the IRS-aided secure communication system through the DNN with parameters $\theta$. Specifically, it chooses the action $a$ with the maximum value based on the trained deep PDS-PER learning model. Afterwards, the environment feeds back an instantaneous reward and a new system state to the agent. Finally, the beamforming matrix $\mathbf{V}^*$ at the BS and the phase shift matrix $\mathbf{\Psi}^*$ (reflecting beamforming) at the IRS are obtained according to the selected action.

We would like to point out that the training stage needs a powerful computation server and can be performed offline at the BS, while the implementation stage can be completed online. The trained learning model needs to be updated only when the environment (the IRS-aided secure communication system) has experienced significant changes, mainly depending on the environment dynamics and service requirements.
indicating that the complexity of the proposed algorithm is slightly higher than that of the classical DQN learning algorithm. However, our proposed algorithm achieves better performance than the classical DQN algorithm, as will be shown in the next section.

D. Implementation Details of DRL

This subsection provides details regarding the generation of the training, validation, and testing datasets.

Generation of training: As shown in Fig. 3, $K$ single-antenna MUs and $M$ single-antenna eavesdroppers are randomly located in the 100 m × 100 m right-hand half of Fig. 3 (light blue area) in a two-dimensional x-y rectangular grid plane. The BS and the IRS are located at (0, 0) and (150, 100) in meters (m), respectively. The x-y grid

reflecting beamforming matrix. The DQN construction is used for training stability. The network parameters will be provided in the next section.

Training loss function: The objective of the DRL model is to find the best beamforming matrices, i.e., $\mathbf{V}$ and $\mathbf{\Psi}$, from the beamforming codebook with the highest achievable reward from the environment. In this case, to obtain the highest achievable reward estimation, a regression loss function is adopted to train the learning model, where the DNN is trained to make its output, $\hat{r}$, as close as possible to the desired normalized reward, $\bar{r}$. Formally, the training is driven by minimizing the loss function, $L(\theta)$, defined as
$$ L(\theta) = \mathrm{MSE}(\hat{r}, \bar{r}), \qquad (38) $$
where $\theta$ is the set of all DNN parameters and $\mathrm{MSE}(\cdot)$ denotes the mean-squared error between $\hat{r}$ and $\bar{r}$. Note that the outputs
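A minimal sketch of the regression loss in (38) is given below, assuming the DNN output and the desired normalized reward are available as plain arrays; the toy values are illustrative.

```python
# Minimal sketch of Eq. (38): mean-squared error between the DNN output
# r_hat and the desired normalized reward r_bar.
import numpy as np

def mse_loss(r_hat, r_bar):
    """L(theta) = MSE(r_hat, r_bar)."""
    r_hat = np.asarray(r_hat, dtype=float)
    r_bar = np.asarray(r_bar, dtype=float)
    return np.mean((r_hat - r_bar) ** 2)

print(mse_loss([0.8, 0.4], [1.0, 0.5]))   # toy values
```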
Fig. 6. Performance comparisons versus the maximum transmit power at the BS.

Fig. 7. Performance comparisons versus the number of IRS elements.

similar to the suboptimal solution [14] (denoted as Baseline 1 [14]).
• The optimal BS transmit beamforming approach without IRS assistance (denoted as optimal BS without IRS). Without the IRS, the optimization problem (11) is transformed into
$$ \max_{\mathbf{V}} \min_{\{\Delta \mathbf{h}\}} \sum_{k \in \mathcal{K}} R_k^{sec} $$
$$ \text{s.t. } (a): R_k^{sec} \ge R_k^{sec,min}, \ \forall k \in \mathcal{K}, $$
$$ (b): \min_{\|\Delta \mathbf{h}_{bu}\|^2 \le (\varsigma_{bu})^2,\ \|\Delta \mathbf{h}_{ru}\|^2 \le (\varsigma_{ru})^2} (R_k^u) \ge R_k^{min}, \ \forall k \in \mathcal{K}, $$
$$ (c): \mathrm{Tr}(\mathbf{V}\mathbf{V}^H) \le P_{max}, $$
$$ (d): |\chi e^{j\theta_l}| = 1, \ 0 \le \theta_l \le 2\pi, \ \forall l \in \mathcal{L}. \qquad (39) $$
From the optimization problem (39), the system only needs to optimize the BS transmit beamforming matrix. Problem (39) is non-convex due to the rate constraints, and hence we consider semidefinite programming (SDP) relaxation to solve it. After transforming problem (39) into a convex optimization problem, we can use CVX to obtain the solution [12]–[16].

Fig. 6 shows the average secrecy rate and QoS satisfaction probability (consisting of the secrecy rate satisfaction probability in (11a) and the minimum rate satisfaction probability in (11b)) versus the maximum transmit power $P_{max}$, when $L = 40$ and $\rho = 0.95$. As expected, both the secrecy rate and the QoS satisfaction probability of all the approaches increase monotonically with increasing $P_{max}$. The reason is that when $P_{max}$ increases, the received SINR at the MUs improves, leading to the performance improvement. In addition, we find that our proposed learning approach outperforms the Baseline 1 approach. In fact, our approach jointly optimizes the beamforming matrices $\mathbf{V}$ and $\mathbf{\Psi}$, which simultaneously facilitates more favorable channel propagation for the MUs and impairs the eavesdroppers, while the Baseline 1 approach optimizes the beamforming matrices in an iterative way. Moreover, our proposed approach achieves higher performance than DQN in terms of both secrecy rate and QoS satisfaction probability, owing to its efficient learning capacity obtained by utilizing the PDS-learning and PER schemes in the dynamic environment. From Fig. 6, we also find that the three IRS-assisted secure beamforming approaches provide significantly higher secrecy rate and QoS satisfaction probability than the traditional system without an IRS. This indicates that the IRS can effectively guarantee secure communication and QoS requirements via reflecting beamforming, where the reflecting elements (IRS-induced phases) at the IRS can be adjusted to maximize the received SINR at the MUs and suppress the wiretapped rate at the eavesdroppers.

In Fig. 7, the achievable secrecy rate and QoS satisfaction level performance of all approaches are evaluated through
changing the number of IRS elements, i.e., from $L = 10$ to $60$, when $P_{max} = 30$ dBm and $\rho = 0.95$. For the secure beamforming approaches assisted by the IRS, the achievable secrecy rates and QoS satisfaction levels significantly increase with the number of IRS elements. The improvement results from the fact that with more IRS elements, more signal paths and more signal power can be reflected by the IRS to improve the received SINR at the MUs and to decrease the received SINR at the eavesdroppers. In addition, the performance of the approach without an IRS remains constant under the different numbers of IRS elements.

From Fig. 7(a), it is found that the secrecy rate of the proposed learning approach is higher than those of the Baseline 1 and DQN approaches; in particular, the performance gap increases with $L$. This is because, with more reflecting elements at the IRS, the proposed deep PDS-PER learning based secure communication approach becomes more flexible for optimal phase shift (reflecting beamforming) design and hence achieves higher gains. In addition, from Fig. 7(b), compared with the Baseline 1 and DQN approaches, as the number of reflecting elements at the IRS increases, we observe that the proposed learning approach is the first one to attain a 100% QoS satisfaction level. This superior achievement is based on the particular design of the QoS-aware reward function shown in (14) for secure communication.

Fig. 8. Performance comparisons versus outdated CSI coefficient ρ.

VI. CONCLUSION

In this work, we have investigated the joint BS's beamforming and IRS's reflect beamforming optimization problem under time-varying channel conditions. As the system is highly dynamic and complex, we have exploited the recent advances of machine learning and formulated the secure beamforming optimization problem as an RL problem. A deep PDS-PER learning based secure beamforming approach has been proposed to jointly optimize both the BS's beamforming and the IRS's reflect beamforming in the dynamic IRS-aided secure communication system, where PDS and PER schemes have been utilized to improve the learning convergence rate and efficiency. Simulation results have verified that the proposed learning approach outperforms other existing approaches in terms of enhancing the system secrecy rate and the QoS satisfaction probability.

REFERENCES

[1] N. Yang, L. Wang, G. Geraci, M. Elkashlan, J. Yuan, and M. D. Renzo, "Safeguarding 5G wireless communication networks using physical layer security," IEEE Commun. Mag., vol. 53, no. 4, pp. 20–27, Apr. 2015.
[2] A. D. Wyner, "The wiretap channel," Bell Syst. Tech. J., vol. 54, no. 8, pp. 1355–1387, Oct. 1975.
[3] Q. Li and L. Yang, "Beamforming for cooperative secure transmission in cognitive two-way relay networks," IEEE Trans. Inf. Forensics Security, vol. 15, pp. 130–143, Jan. 2020.
[4] L. Xiao, X. Lu, D. Xu, Y. Tang, L. Wang, and W. Zhuang, "UAV relay in VANETs against smart jamming with reinforcement learning," IEEE Trans. Veh. Technol., vol. 67, no. 5, pp. 4087–4097, May 2018.
[5] W. Wang, K. C. Teh, and K. H. Li, "Artificial noise aided physical layer security in multi-antenna small-cell networks," IEEE Trans. Inf. Forensics Security, vol. 12, no. 6, pp. 1470–1482, Jun. 2017.
[6] H.-M. Wang, T. Zheng, and X.-G. Xia, "Secure MISO wiretap channels with multiantenna passive eavesdropper: Artificial noise vs. artificial fast fading," IEEE Trans. Wireless Commun., vol. 14, no. 1, pp. 94–106, Jan. 2015.
[7] R. Nakai and S. Sugiura, "Physical layer security in buffer-state-based max-ratio relay selection exploiting broadcasting with cooperative beamforming and jamming," IEEE Trans. Inf. Forensics Security, vol. 14, no. 2, pp. 431–444, Feb. 2019.
[8] Z. Mobini, M. Mohammadi, and C. Tellambura, "Wireless-powered full-duplex relay and friendly jamming for secure cooperative communications," IEEE Trans. Inf. Forensics Security, vol. 14, no. 3, pp. 621–634, Mar. 2019.
[9] Q. Wu and R. Zhang, "Towards smart and reconfigurable environment: Intelligent reflecting surface aided wireless network," IEEE Commun. Mag., vol. 58, no. 1, pp. 106–112, Jan. 2020.
[10] J. Zhao, "A survey of intelligent reflecting surfaces (IRSs): Towards 6G wireless communication networks," 2019, arXiv:1907.04789. [Online]. Available: http://arxiv.org/abs/1907.04789
[11] H. Han et al., "Intelligent reflecting surface aided power control for physical-layer broadcasting," 2019, arXiv:1912.03468. [Online]. Available: http://arxiv.org/abs/1912.03468
[12] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, "Reconfigurable intelligent surfaces for energy efficiency in wireless communication," IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 4157–4170, Aug. 2019.
[13] Q. Wu and R. Zhang, "Intelligent reflecting surface enhanced wireless network via joint active and passive beamforming," IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5394–5409, Nov. 2019.
[14] M. Cui, G. Zhang, and R. Zhang, "Secure wireless communication via intelligent reflecting surface," IEEE Wireless Commun. Lett., vol. 8, no. 5, pp. 1410–1414, Oct. 2019.
[15] H. Shen, W. Xu, S. Gong, Z. He, and C. Zhao, "Secrecy rate maximization for intelligent reflecting surface assisted multi-antenna communications," IEEE Commun. Lett., vol. 23, no. 9, pp. 1488–1492, Sep. 2019.
[16] X. Yu, D. Xu, and R. Schober, "Enabling secure wireless communications via intelligent reflecting surfaces," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Waikoloa, HI, USA, Dec. 2019, pp. 1–6.
[17] Q. Wu and R. Zhang, "Beamforming optimization for wireless network aided by intelligent reflecting surface with discrete phase shifts," IEEE Trans. Commun., vol. 68, no. 3, pp. 1838–1851, Mar. 2020.
[18] Z. Chu, W. Hao, P. Xiao, and J. Shi, "Intelligent reflecting surface aided multi-antenna secure transmission," IEEE Wireless Commun. Lett., vol. 9, no. 1, pp. 108–112, Jan. 2020.
[19] B. Feng, Y. Wu, and M. Zheng, "Secure transmission strategy for intelligent reflecting surface enhanced wireless system," 2019, arXiv:1909.00629. [Online]. Available: http://arxiv.org/abs/1909.00629
[20] J. Chen, Y.-C. Liang, Y. Pei, and H. Guo, "Intelligent reflecting surface: A programmable wireless environment for physical layer security," IEEE Access, vol. 7, pp. 82599–82612, 2019.
[21] X. Yu, D. Xu, Y. Sun, D. W. K. Ng, and R. Schober, "Robust and secure wireless communications via intelligent reflecting surfaces," 2019, arXiv:1912.01497. [Online]. Available: http://arxiv.org/abs/1912.01497
[22] X. Guan, Q. Wu, and R. Zhang, "Intelligent reflecting surface assisted secrecy communication: Is artificial noise helpful or not?" IEEE Wireless Commun. Lett., vol. 9, no. 6, pp. 778–782, Jun. 2020.
[23] L. Dong and H.-M. Wang, "Secure MIMO transmission via intelligent reflecting surface," IEEE Wireless Commun. Lett., vol. 9, no. 6, pp. 787–790, Jun. 2020.
[24] W. Jiang, Y. Zhang, J. Wu, W. Feng, and Y. Jin, "Intelligent reflecting surface assisted secure wireless communications with multiple-transmit and multiple-receive antennas," 2020, arXiv:2001.08963. [Online]. Available: http://arxiv.org/abs/2001.08963
[25] D. Xu, X. Yu, Y. Sun, D. W. K. Ng, and R. Schober, "Resource allocation for secure IRS-assisted multiuser MISO systems," 2019, arXiv:1907.03085. [Online]. Available: http://arxiv.org/abs/1907.03085
[26] C. Huang, G. C. Alexandropoulos, C. Yuen, and M. Debbah, "Indoor signal focusing with deep learning designed reconfigurable intelligent surfaces," 2019, arXiv:1905.07726. [Online]. Available: http://arxiv.org/abs/1905.07726
[27] A. Taha, M. Alrabeiah, and A. Alkhateeb, "Enabling large intelligent surfaces with compressive sensing and deep learning," 2019, arXiv:1904.10136. [Online]. Available: http://arxiv.org/abs/1904.10136
[28] K. Feng, Q. Wang, X. Li, and C.-K. Wen, "Deep reinforcement learning based intelligent reflecting surface optimization for MISO communication systems," IEEE Wireless Commun. Lett., vol. 9, no. 5, pp. 745–749, May 2020.
[29] C. Huang, R. Mo, and C. Yuen, "Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning," 2020, arXiv:2002.10072. [Online]. Available: http://arxiv.org/abs/2002.10072
[30] C. Li, W. Zhou, K. Yu, L. Fan, and J. Xia, "Enhanced secure transmission against intelligent attacks," IEEE Access, vol. 7, pp. 53596–53602, Aug. 2019.
[31] L. Xiao, G. Sheng, S. Liu, H. Dai, M. Peng, and J. Song, "Deep reinforcement learning-enabled secure visible light communication against eavesdropping," IEEE Trans. Commun., vol. 67, no. 10, pp. 6994–7005, Oct. 2019.
[32] M. Wiering and M. Otterlo, Reinforcement Learning: State of the Art (Computational Intelligence and Complexity). Springer, 2012.
[33] H. Yang, A. Alphones, W.-D. Zhong, C. Chen, and X. Xie, "Learning-based energy-efficient resource management by heterogeneous RF/VLC for ultra-reliable low-latency industrial IoT networks," IEEE Trans. Ind. Informat., vol. 16, no. 8, pp. 5565–5576, Aug. 2020.
[34] X. He, R. Jin, and H. Dai, "Deep PDS-learning for privacy-aware offloading in MEC-enabled IoT," IEEE Internet Things J., vol. 6, no. 3, pp. 4547–4555, Jun. 2019.
[35] N. Mastronarde and M. van der Schaar, "Joint physical-layer and system-level power management for delay-sensitive wireless communications," IEEE Trans. Mobile Comput., vol. 12, no. 4, pp. 694–709, Apr. 2013.
[36] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," in Proc. 4th Int. Conf. Learn. Represent. (ICLR), San Juan, PR, USA, May 2016, pp. 1–21.
[37] H. Gacanin and M. Di Renzo, "Wireless 2.0: Towards an intelligent radio environment empowered by reconfigurable meta-surfaces and artificial intelligence," 2020, arXiv:2002.11040. [Online]. Available: http://arxiv.org/abs/2002.11040
[38] C. Huang et al., "Holographic MIMO surfaces for 6G wireless networks: Opportunities, challenges, and trends," IEEE Wireless Commun., early access, Jul. 8, 2020, doi: 10.1109/MWC.001.1900534.
[39] F. B. Mismar, B. L. Evans, and A. Alkhateeb, "Deep reinforcement learning for 5G networks: Joint beamforming, power control, and interference coordination," IEEE Trans. Commun., vol. 68, no. 3, pp. 1581–1592, Mar. 2020.
[40] H. Yang, Z. Xiong, J. Zhao, D. Niyato, and L. Xiao, "Deep reinforcement learning based intelligent reflecting surface for secure wireless communications," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Taipei, Taiwan, Dec. 2020, pp. 1–30.

Helin Yang (Student Member, IEEE) received the B.S. and M.S. degrees from the School of Telecommunications Information Engineering, Chongqing University of Posts and Telecommunications, in 2013 and 2016, respectively, and the Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2020. His current research interests include wireless communication, visible light communication, the Internet of Things, and resource management.

Zehui Xiong (Student Member, IEEE) received the B.Eng. degree (Hons.) from the Huazhong University of Science and Technology, Wuhan, China, and the Ph.D. degree from Nanyang Technological University, Singapore. He is currently a Researcher with the Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University. He is also a Visiting Scholar with Princeton University and the University of Waterloo. His research interests include network economics, wireless communications, blockchain, and edge intelligence. He has published more than 60 peer-reviewed research papers in leading journals and flagship conferences, and three of them are ESI Highly Cited Papers. He has won several Best Paper awards. He is an Editor of Computer Networks (COMNET) (Elsevier) and Physical Communication (PHYCOM) (Elsevier), and an Associate Editor of IET Communications. He was a recipient of the Chinese Government Award for Outstanding Students Abroad in 2019 and the NTU SCSE Outstanding Ph.D. Thesis Runner-Up Award in 2020.
Jun Zhao (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from Carnegie Mellon University (CMU), USA (advisors: Virgil Gligor, Osman Yagan; collaborator: Adrian Perrig), affiliating with CMU's renowned CyLab Security and Privacy Institute, and a bachelor's degree from Shanghai Jiao Tong University in China. He is currently an Assistant Professor with the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore. Before joining NTU, first as a Post-Doctoral Researcher with Xiaokui Xiao and then as a Faculty Member, he was a Post-Doctoral Researcher with Arizona State University as an Arizona Computing Post-Doctoral Best Practices Fellow (advisors: Junshan Zhang, Vincent Poor). His research interests include communications, networks, security, and AI.

Liang Xiao (Senior Member, IEEE) received the B.S. degree in communication engineering from the Nanjing University of Posts and Telecommunications, China, in 2000, the M.S. degree in electrical engineering from Tsinghua University, China, in 2003, and the Ph.D. degree in electrical engineering from Rutgers University, New Brunswick, NJ, USA, in 2009. She was a Visiting Professor with Princeton University, Virginia Tech, and the University of Maryland, College Park, MD, USA. She is currently a Professor with the Department of Information and Communication Engineering, Xiamen University, Xiamen, China. She has served as an Associate Editor for the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY and a Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING. She was a recipient of the Best Paper Award for the 2016 INFOCOM Big Security WS and 2017 ICC.