IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 39, NO. 7, JULY 2021

ML Empowered Trajectory and Passive Beamforming Design in UAV-RIS Wireless Networks

Abstract— A novel framework is proposed for integrating reconfigurable intelligent surfaces (RISs) in unmanned aerial vehicle (UAV) enabled wireless networks, where an RIS is deployed for enhancing the service quality of the UAV. The non-orthogonal multiple access (NOMA) technique is invoked to further improve the spectrum efficiency of the network, while mobile users (MUs) are considered to be roaming continuously. The energy consumption minimization problem is formulated by jointly designing the movement of the UAV, the phase shifts of the RIS, and the power allocation policy from the UAV to the MUs, as well as by determining the dynamic decoding order. A decaying deep Q-network (D-DQN) based algorithm is proposed for tackling this pertinent problem. In the proposed D-DQN based algorithm, the central controller is selected as an agent that periodically observes the state of the UAV-enabled wireless network and carries out actions to adapt to the dynamic environment. In contrast to the conventional DQN algorithm, a decaying learning rate is leveraged in the proposed D-DQN based algorithm for attaining a tradeoff between accelerating the training speed and converging to the local optimum. Numerical results demonstrate that: 1) in contrast to the conventional Q-learning algorithm, which cannot converge when adopted for solving the formulated problem, the proposed D-DQN based algorithm is capable of converging under minor constraints; 2) the energy dissipation of the UAV can be significantly reduced by integrating RISs in UAV-enabled wireless networks; and 3) by designing the dynamic decoding order and power allocation policy, the RIS-NOMA case consumes 11.7% less energy than the RIS-OMA case.

Index Terms— Non-orthogonal multiple access, reconfigurable intelligent surfaces, reinforcement learning, trajectory design, unmanned aerial vehicle.

I. INTRODUCTION

THE fifth generation of wireless networks (5G) is being developed to provide uninterrupted and ubiquitous connectivity to everyone and everything, everywhere, which imposes enormous challenges on conventional terrestrial cellular networks. The massive multiple-input-multiple-output (MIMO) technology, which equips base stations (BSs) with an array of active antennas, has been proved to improve the spectrum efficiency of next-generation systems. A related concept, namely, reconfigurable intelligent surfaces (RISs) [1], also referred to as intelligent reflecting surfaces (IRSs) [2] or large intelligent surfaces (LISs) [3]¹, comprises an array of reflecting elements for reconfiguring the incident signals. Thus, RISs can be regarded as the pathway towards massive MIMO 2.0. RISs have received significant attention for their potential to enhance the capacity and coverage of wireless networks.

Unmanned aerial vehicle (UAV) enabled wireless networks have been proved to be beneficial for mitigating challenges emerging in conventional terrestrial cellular networks [4]. UAVs are capable of being flexibly deployed as aerial BSs in temporary tele-traffic hotspots created by political rallies, sporting events or natural disasters, providing uninterrupted and ubiquitous connectivity [5], [6]. However, before the benefits of integrating UAVs into wireless networks can be fully reaped, some of the weaknesses of UAVs, such as their limited coverage area and meagre energy supply, have to be mitigated. By integrating RISs in UAV-enabled wireless networks, concatenated virtual LoS links² between UAVs and mobile users (MUs) can be formed via passively reflecting the incident signals, which leads to an extended coverage area as well as less movement of the UAVs.

A. State-of-the-Art

1) Reinforcement Learning in UAV-Enabled Wireless Networks: It has been proved that integrating UAVs in wireless networks is a promising candidate technique for enhancing the quality of wireless connectivity [7], [8]. In an effort to overcome the highly dynamic stochastic environment, machine learning (ML) approaches have been invoked in UAV-enabled wireless networks. The authors of [9] leveraged a deep deterministic policy gradient (DDPG) based algorithm for maximizing a defined energy efficiency (EE) function, which considers communication coverage, fairness, energy consumption and connectivity. In [10], a multi-agent reinforcement learning (RL) algorithm was proposed for tackling the dynamic resource allocation problem in multi-UAV aided wireless networks. The long-term resource allocation problem was formulated as a stochastic game and solved by the proposed RL algorithm. The authors of [11] presented a decentralized deep reinforcement learning (DRL) based

Manuscript received July 17, 2020; revised October 12, 2020; accepted October 27, 2020. Date of publication December 2, 2020; date of current version June 17, 2021. (Corresponding author: Yuanwei Liu.)
The authors are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, U.K. (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSAC.2020.3041401.
Digital Object Identifier 10.1109/JSAC.2020.3041401

¹ Without loss of generality, we use the name RIS in the remainder of this paper.
² A 'concatenated link' is constituted by the LoS link between the UAV and the RIS and the link between the RIS and the MUs, which is also a LoS link.

0733-8716 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - KARADENIZ TECHNICAL UNIVERSITY. Downloaded on July 14,2022 at 20:02:54 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: ML EMPOWERED TRAJECTORY AND PASSIVE BEAMFORMING DESIGN IN UAV-RIS WIRELESS NETWORKS 2043
framework for designing the trajectory of a swarm of UAVs. Thus, the geographical fairness of all considered points of interest was maximized while the total energy dissipation was minimized. In [4], the trajectory and transmit power of multiple UAVs were designed to maximize the total throughput of all MUs by considering user movement. The formulated problem was solved by a multi-agent RL algorithm. It was demonstrated in [4] that the proposed algorithms outperformed the benchmarks when taking user mobility into account.

2) Reinforcement Learning in RIS-Enhanced Wireless Networks: In an effort to effectively exploit RISs for optimizing wireless systems, preliminary research has tackled some fundamental problems, including channel estimation/modeling, active beamforming at the BS, passive beamforming design for RISs, and resource allocation from the BS to the MUs. Efficient optimization approaches, such as convex optimization [12], iterative algorithms [13], gradient descent approaches [14], and alternating algorithms [15], have been adopted for tackling the aforementioned challenges. In order to reap the benefits of integrating RISs in wireless systems, the joint active beamforming and passive phase shift design of RIS-enhanced systems has been investigated in multiple-input-single-output (MISO) systems [16], [17], OFDM-based systems [18], wireless security systems [19] and millimeter wave systems [20] with the aid of RL algorithms. In contrast to alternating optimization techniques, which alternately optimize the active beamforming at the BSs and the passive beamforming at the RISs, the RL-based solution is capable of jointly obtaining the beamforming matrix and the phase shift matrix. More explicitly, the authors of [16] invoked a DDPG based algorithm for maximizing the throughput by utilizing the sum rate as the instant reward for training the DDPG model. In the proposed model, the continuous transmit beamforming and phase shifts were jointly optimized with low complexity. The authors of [18] proposed a DRL based algorithm for maximizing the achievable transmit rate by directly optimizing interaction matrices from the sampled channel knowledge. In the proposed DRL model, only one beam was utilized for each training episode. Thus, the training overhead was avoided, while a dataset collection phase was not required. The authors of [20] proposed a DRL based algorithm for maximizing the throughput with both perfect channel state information (CSI) and imperfect CSI. A quantile regression method was applied for modeling a return distribution for each state-action pair, which captured the intrinsic randomness in the Markov decision process (MDP) interaction between the IR and the communication environment. The authors of [19] considered the application of RISs to physical layer security. The system secrecy rate was maximized with the aid of a DRL model by designing both the active and passive beamforming matrices under the users' different quality-of-service (QoS) requirements and time-varying channel conditions. Additionally, post-decision state and prioritized experience replay schemes were adopted for enhancing the training performance and the secrecy performance.

3) NOMA in UAV-Enabled/RIS-Enhanced Wireless Networks: Sparked by the concept of superimposing the signals of multiple associated users at different power levels to exploit the spectrum more efficiently by opportunistically exploring the users' different channel conditions [21], power-domain non-orthogonal multiple access (NOMA) has been invoked for improving the spectrum efficiency and massive connectivity of UAV-enabled/RIS-enhanced wireless networks. In terms of integrating the NOMA technique in UAV-enabled wireless networks, the authors of [22] considered the resource allocation problem in NOMA-UAV networks to maximize the EE. A successive convex approximation (SCA) algorithm was leveraged to convert the non-convex challenging problem into a convex one. In [23], a UAV-enabled uplink NOMA network was studied for overcoming the inherent latency. The sum rate of all users was maximized by jointly designing the trajectory and power control of the UAV under QoS constraints. As for integrating the NOMA technique in RIS-enhanced wireless systems, the authors of [24] considered the deployment and passive beamforming design problem of the RIS in a MISO-NOMA network. A D3QN algorithm was proposed for tackling the formulated energy efficiency maximization problem. The simulation results in [24] showed that the EE of the system can be improved with the aid of the RIS, while the NOMA-RIS case outperformed the OMA-RIS case in terms of EE. In [25], a downlink NOMA-MISO-RIS network was investigated by jointly designing both the active and passive beamforming. The objective function was formulated as maximizing the sum rate of all MUs. An SCA approach was invoked to design the active and passive beamforming in an alternating manner.

B. Motivations

Due to the fact that UAVs are battery powered, energy consumption is one of the most important challenges in commercial and civilian applications. The limited endurance of UAVs (usually under 30 minutes) hampers their practical implementation. Before the advantages of UAV-enabled wireless systems can be fully reaped, the energy-hungry issue has to be addressed. The total energy dissipated by the UAV consists of two components, namely the communication-related and the propulsion-related energy consumption. The first component is dissipated for radiation, signal processing and hardware circuitry, while the other is dissipated for supporting the hovering and mobility of the UAV. It is worth noting that the propulsion-related energy consumption accounts for the vast majority of the total energy consumption (usually more than 95%), which emphasizes the importance of adopting the propulsion-related energy consumption model of UAVs when aiming to design environment-friendly wireless networks. In this paper, when calculating the energy consumption of the UAV, we only consider the energy dissipated for supporting the hovering and mobility of the UAV.

Since all MUs are roaming continuously, the UAV has to be periodically repositioned according to the mobility of the users for establishing LoS wireless links. By deploying an RIS, one can adjust the phase shift matrix of the RIS instead of controlling the movement of the UAV for forming concatenated virtual LoS links between the UAV and the MUs. Therefore, the UAV only leaves its hovering status when the concatenated virtual LoS links
cannot be formed even with the aid of the RIS. By invoking the aforementioned protocol, the total energy dissipation of the UAV is minimized, which, in turn, maximizes the endurance of the UAV.

In this paper, both the UAV and the MUs are considered to be roaming instead of static. Thus, the considered RIS-UAV network is naturally a dynamic scenario, which is challenging for conventional optimization approaches. Additionally, the goal of deploying and designing the UAV-RIS is to maximize the long-term benefits instead of the current benefits, which falls into the field of RL algorithms, since these algorithms can incorporate farsighted system evolution instead of myopically optimizing current benefits. Moreover, by integrating the NOMA technique in the dynamic RIS-UAV scenario, the decoding order cannot be determined directly by the order of the MUs' channel gains. In an effort to guarantee successful successive interference cancelation (SIC), an additional decoding order constraint based on the decoding rate needs to be satisfied, while the dynamic decoding order needs to be re-determined at each timeslot. Therefore, the joint trajectory design of the UAV, the passive beamforming design at the RIS, and the decoding order and power allocation determination are expected to be optimized. The search space expands as the number of parameters increases, which makes conventional gradient-based optimization techniques unsuitable. Against the aforementioned background, RL, a powerful AI paradigm that is capable of empowering agents by learning from the dynamic environment, is selected as the methodology for tackling problems in RIS-UAV networks. Since the RIS employs discrete phase shifts, the state space and the action space are also discrete. Thus, policy-based algorithms, which aim at solving problems with continuous state and action spaces, are not suitable for controlling RISs. The deep Q-network (DQN) algorithm can be used for supporting the UAV/RIS (agents) in their interactions with the environment (states), whilst finding the optimal behavior (actions) of the UAV/RIS. Thus, it is applied for solving challenging problems in the UAV-RIS enhanced wireless networks.

C. Contributions

The contributions of this paper are as follows:
• We propose a novel UAV-RIS framework for attaining the long-term benefits of the network by controlling the RIS and designing the trajectory of the UAV in RIS-enhanced UAV-enabled wireless networks, where an RIS is deployed for enhancing the wireless connectivity and reducing the movement of the UAV. Additionally, we formulate the energy consumption minimization problem by jointly deciding the movement of the UAV and the passive beamforming at the RIS.
• We investigate both the OMA case and the NOMA case in RIS-enhanced UAV-enabled wireless networks. A zero-forcing based linear precoding method is employed for eliminating the multiuser interference in the OMA case and the inter-cluster interference in the NOMA case, respectively. Additionally, in contrast to conventional MISO-NOMA networks, we consider a dynamic decoding order in the NOMA-RIS-UAV case due to the mobility of both the UAV and the MUs.
• We adopt a decaying learning rate deep Q-network (D-DQN) based algorithm to tackle the formulated joint trajectory and phase shift design problem. In contrast to the conventional DQN algorithm, a decaying learning rate is leveraged in the proposed D-DQN based algorithm for attaining a tradeoff between accelerating the training speed and converging to the local optimum, as well as for avoiding oscillation. The central controller is selected as an agent for determining the movement of the UAV, as well as the passive beamforming design. Additionally, we adopt a three-dimensional (3D) radio map for verifying the performance of the proposed D-DQN based algorithm.
• We demonstrate that the proposed D-DQN based algorithm can converge under minor constraints. With the aid of the RIS, the energy dissipation of the UAV is capable of being reduced by roughly 23.3%, while the RIS-NOMA case consumes 11.7% less energy than the RIS-OMA case.

D. Organization and Notations

The remainder of this paper is structured as follows. Section II focuses on the system model and the problem formulation of energy consumption minimization for the UAV. Section III elaborates on the proposed D-DQN based algorithms for solving the formulated problem. The simulation results are illustrated in Section V. Finally, Section VI concludes the main concepts, insights and results of this paper. The list of notations is illustrated in Table I.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model

We consider the downlink MISO communications in a particular area, where the terrestrial infrastructure has been destroyed by a natural disaster or has not been installed. As illustrated in Fig. 1, a UAV equipped with M antennas is employed for providing wireless service to a total number of K single-antenna users. In this paper, all MUs are considered to be roaming around in this area. In an effort to enhance the quality of the wireless service by forming concatenated virtual LoS propagation between the UAV and the users via passively reconfiguring their incident signals, it is assumed that an RIS equipped with N reflecting elements is employed on the facade of a particular high-rise building [2], [13], [26].

Remark 1: Since all users are roaming continuously, the UAV has to be periodically repositioned according to the mobility of the users for establishing LoS wireless links. With the aid of the RIS, one can adjust the phase shifts of the RIS instead of controlling the movement of the UAV for forming concatenated virtual LoS propagation between the UAV and the users, which, in turn, leads to a reduction of the UAV's energy dissipation.

In contrast to conventional MISO networks, the received signal consists of two parts, namely the signal of the direct link (UAV-MU link) and that of the reflecting link (UAV-RIS-MU link). Denote H_{U,S} ∈ C^{N×M} and h^H_{S,k} ∈ C^{1×N} as the channels of
TABLE I: LIST OF NOTATIONS
L_k = \begin{cases} 30.9 + (22.25 - 0.5\log_{10}h)\log_{10}d_k + 20\log_{10}f_c, & \text{if LoS link},\\[2pt] \max\left\{L_k^{\mathrm{LoS}},\ 32.4 + (43.2 - 7.6\log_{10}h)\log_{10}d_k + 20\log_{10}f_c\right\}, & \text{if NLoS link}, \end{cases} \quad (1)

P_{\mathrm{LoS}} = \begin{cases} 1, & \text{if } \sqrt{d_k^2-h^2}\le d_0,\\[2pt] \dfrac{d_0}{\sqrt{d_k^2-h^2}} + \exp\left(-\dfrac{\sqrt{d_k^2-h^2}}{p_1}\right)\left(1-\dfrac{d_0}{\sqrt{d_k^2-h^2}}\right), & \text{if } \sqrt{d_k^2-h^2} > d_0, \end{cases} \quad (2)
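As a sanity check, the LoS-probability and path-loss model in (1)–(2) can be sketched in Python. The constants d0 and p1 are environment-dependent; the values below are illustrative assumptions rather than the paper's settings (fc in GHz, distances and altitude in meters):

```python
import math

def plos(d, h, d0=18.0, p1=63.0):
    """LoS probability of eq. (2); d is the UAV-user distance, h the UAV
    altitude. d0 and p1 are environment constants (illustrative values)."""
    r = math.sqrt(d**2 - h**2)  # horizontal UAV-user distance
    if r <= d0:
        return 1.0
    return d0 / r + math.exp(-r / p1) * (1.0 - d0 / r)

def path_loss_db(d, h, fc=2.0, los=True):
    """Path loss of eq. (1) in dB; the NLoS branch takes the max of the
    LoS expression and the NLoS expression, as in the equation."""
    los_pl = 30.9 + (22.25 - 0.5 * math.log10(h)) * math.log10(d) \
        + 20.0 * math.log10(fc)
    if los:
        return los_pl
    nlos_pl = 32.4 + (43.2 - 7.6 * math.log10(h)) * math.log10(d) \
        + 20.0 * math.log10(fc)
    return max(los_pl, nlos_pl)
```

As expected, the LoS probability is 1 within the horizontal breakpoint distance d0 and decays beyond it, and the NLoS path loss never falls below the LoS one.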
by the Twitter API and becomes available to the general public [4]. Indeed, the prediction accuracy of the users' locations is essential for the design of RIS-UAV enabled wireless networks. Collecting and processing data gleaned from online social networks is of lower complexity than

where s_k represents the transmitted data symbol for the k-th MU, with k ∈ K, |K| = K; g_k ∈ C^{M×1} denotes the corresponding transmit beamforming vector; and p_k represents the allocated transmit power from the UAV to the k-th MU.

\bar{E}(t) = \frac{1}{T_r}\sum_{t=nT_r}^{(n+1)T_r} E_b\left(1+\frac{3v^2(t)}{U_{tip}^2}\right) + \frac{1}{T_r}\sum_{t=nT_r}^{(n+1)T_r} E_i\left(\sqrt{1+\frac{v^4(t)}{4v_0^4}}-\frac{v^2(t)}{2v_0^2}\right)^{1/2} + \frac{1}{T_r}\sum_{t=nT_r}^{(n+1)T_r} \frac{1}{2} f_d \rho s A v^3(t), \quad (5)
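The per-timeslot summand of (5) is the standard rotary-wing propulsion power model: a blade-profile term, an induced term and a parasite term. A minimal Python sketch follows; the constant values are borrowed from the common rotary-wing UAV model as illustrative assumptions, not the paper's exact settings:

```python
import math

def propulsion_power(v, Eb=79.86, Ei=88.63, Utip=120.0, v0=4.03,
                     fd=0.6, rho=1.225, s=0.05, A=0.503):
    """Per-timeslot propulsion power: the summand of eq. (5).
    v is the UAV speed in m/s; all constants are illustrative."""
    blade = Eb * (1.0 + 3.0 * v**2 / Utip**2)            # blade profile
    induced = Ei * math.sqrt(
        math.sqrt(1.0 + v**4 / (4.0 * v0**4)) - v**2 / (2.0 * v0**2))
    parasite = 0.5 * fd * rho * s * A * v**3             # fuselage drag
    return blade + induced + parasite

def avg_energy(speeds):
    """Frame average of eq. (5) over one frame of T_r timeslot speeds."""
    return sum(propulsion_power(v) for v in speeds) / len(speeds)
```

One property worth noting: hovering (v = 0) is not the cheapest state, since the induced term is largest at zero speed; a moderate forward speed dissipates less power than hovering.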
The received signal of MU k is given by

y_k = \left(\mathbf{h}^H_{U,k} + \mathbf{h}^H_{S,k}\Theta\mathbf{H}_{U,S}\right)\sum_{j=1}^{K}\sqrt{p_j}\,\mathbf{g}_j s_j + n_k, \quad (7)

where n_k ∼ CN(0, σ_k²) represents the additive white Gaussian noise (AWGN). Based on (7), the received signal-to-interference-plus-noise ratio (SINR) of MU k can be calculated as follows:

\gamma_k = \frac{p_k\left|\left(\mathbf{h}^H_{U,k}+\mathbf{h}^H_{S,k}\Theta\mathbf{H}_{U,S}\right)\mathbf{g}_k\right|^2}{\sum_{j\neq k} p_j\left|\left(\mathbf{h}^H_{U,j}+\mathbf{h}^H_{S,j}\Theta\mathbf{H}_{U,S}\right)\mathbf{g}_j\right|^2+\sigma_k^2}. \quad (8)

To eliminate the multiuser interference, a zero-forcing (ZF) based linear precoding method is invoked at the UAV [14]³. Denoting h^H_j, j ∈ K, as the combined channel of the j-th MU, the precoding constraints can be expressed as

\begin{cases}\left(\mathbf{h}^H_{U,j}+\mathbf{h}^H_{S,j}\Theta\mathbf{H}_{U,S}\right)\mathbf{g}_k = 0, & \forall j\neq k,\ j\in\mathcal{K},\\ \left(\mathbf{h}^H_{U,k}+\mathbf{h}^H_{S,k}\Theta\mathbf{H}_{U,S}\right)\mathbf{g}_k = 1, & j=k.\end{cases} \quad (9)

Denote H^H = H^H_{UAV-MU} + H^H_{RIS-MU} Θ H_{UAV-RIS}, where H^H_{UAV-MU} = [h_{U,1}, ..., h_{U,K}]^H and H^H_{RIS-MU} = [h_{S,1}, ..., h_{S,K}]^H. Thus, the transmit precoding matrix G = [g_1, ..., g_K] can be derived as the pseudo-inverse of the combined channel H^H, which can be expressed as

\mathbf{G} = \mathbf{H}\left(\mathbf{H}^H\mathbf{H}\right)^{-1}. \quad (10)

Thus, the instantaneous transmit rate of the k-th MU at timeslot t in the OMA case can be expressed as

R_k = B_k\log_2(1+\gamma_k) = B_k\log_2\left(1+\frac{p_k}{\sigma^2}\right), \quad (11)

where B_k represents the bandwidth allocated by the UAV to the k-th MU.

F. Signal Model for NOMA Scheme

By invoking the NOMA technique instead of the OMA technique, the spectrum efficiency can be further improved. The transmit signal from the UAV to the l-th cluster is given by

x_l(t) = \alpha_{l,a}(t)s_{l,a}(t) + \alpha_{l,b}(t)s_{l,b}(t), \quad (12)

where s_{l,a} and s_{l,b} represent the signals for MU a and MU b in the same cluster, respectively. Since the concept of the NOMA technique is to superimpose the signals of two associated users at different power levels, we denote α_{l,a} and α_{l,b} as the power allocation factors for MU a and MU b, respectively. Naturally, the power allocation factors satisfy α_{l,a} + α_{l,b} = 1. Hence, the received signal of a particular MU i in cluster l can be calculated as

y_{l,i} = \left(\mathbf{h}^H_{U,l,i}+\mathbf{h}^H_{S,l,i}\Theta_S\mathbf{H}_{U,S}\right)\sum_{l=1}^{L}\mathbf{w}_l x_l + n_{l,i}, \quad (13)

where w_l represents the precoding vector of cluster l.

In the NOMA case, each MU in the same cluster adopts the SIC method for removing the intra-cluster interference [30], [31]. The strong MU in the cluster is capable of removing the intra-cluster interference caused by the weak MU with the aid of the SIC method. On the other hand, the weak MU is designed to decode the received signal directly without invoking SIC [21], [32]. Denote MU a as the strong user in cluster l. Thus, the received signal of MU a can be expressed as

y_{l,a} = \left(\mathbf{h}^H_{U,l,a}+\mathbf{h}^H_{S,l,a}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_l\left(\sqrt{\alpha_{l,a}}s_{l,a}+\sqrt{\alpha_{l,b}}s_{l,b}\right) + \sum_{j=1,j\neq l}^{L}\left(\mathbf{h}^H_{U,l,a}(t)+\mathbf{h}^H_{S,l,a}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_j x_j + n_{l,a}, \quad (14)

where \sum_{j=1,j\neq l}^{L}\left(\mathbf{h}^H_{U,l,a}+\mathbf{h}^H_{S,l,a}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_j x_j denotes the inter-cluster interference in multi-cell NOMA networks, and \left(\mathbf{h}^H_{U,l,a}+\mathbf{h}^H_{S,l,a}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_l\sqrt{\alpha_{l,b}}s_{l,b} represents the intra-cluster interference.

In the same manner as in the OMA case, the ZF precoding approach is leveraged for eliminating the inter-cluster interference for the strong user [14], [33], [34]. Although dirty paper coding (DPC) is proved to achieve the maximum capacity in multi-user MIMO-NOMA systems, it is non-trivial to implement in practice, for the reason that it adopts brute-force searching. Thus, the ZF-based linear precoding method, which is of low complexity, is employed. The corresponding ZF precoding constraints can be expressed as

\begin{cases}\left(\mathbf{h}^H_{U,j}+\mathbf{h}^H_{S,j}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_l = 0, & \forall j\neq l,\ j\in\mathcal{L},\\ \left(\mathbf{h}^H_{U,l}+\mathbf{h}^H_{S,l}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_l = 1, & j=l.\end{cases} \quad (15)

As in the OMA case, the optimal transmit precoding matrix W = [w_1, ..., w_L] can also be calculated via the pseudo-inverse of the combined channel H^H, which can be expressed as

\mathbf{W} = \mathbf{H}\left(\mathbf{H}^H\mathbf{H}\right)^{-1}. \quad (16)

With the aid of the ZF precoding method, the inter-cluster interference suffered by the strong users can be removed, while the intra-cluster interference can also be removed with the aid of successful SIC. However, the weak user still suffers from the inter-cluster interference. Therefore, the received signal of the strong user can be calculated as

y_{l,a} = \left(\mathbf{h}^H_{U,l,a}+\mathbf{h}^H_{S,l,a}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_l\sqrt{\alpha_{l,a}}s_{l,a} + n_{l,a}, \quad (17)

while the received signal of the weak user can be calculated as

y_{l,b} = \left(\mathbf{h}^H_{U,l,b}+\mathbf{h}^H_{S,l,b}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_l\left(\sqrt{\alpha_{l,a}}s_{l,a}+\sqrt{\alpha_{l,b}}s_{l,b}\right) + \sum_{j=1,j\neq l}^{L}\left(\mathbf{h}^H_{U,l,b}+\mathbf{h}^H_{S,l,b}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_j x_j + n_{l,b}. \quad (18)
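The ZF constructions in (9)–(10) (and identically (15)–(16)) can be checked numerically: with a pseudo-inverse precoder, the combined channel matrix times the precoding matrix equals the identity, so every cross term in (9) vanishes. A minimal NumPy sketch with toy random channels; all dimensions, the seed and the channel statistics are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 3, 4, 16  # MUs, UAV antennas, RIS elements (toy sizes)

# Toy combined channel H^H = H^H_UAV-MU + H^H_RIS-MU @ Theta @ H_UAV-RIS
H_uav_mu = rng.normal(size=(K, M)) + 1j * rng.normal(size=(K, M))
H_ris_mu = rng.normal(size=(K, N)) + 1j * rng.normal(size=(K, N))
H_uav_ris = rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))
# Unit-modulus diagonal phase-shift matrix, per constraint (21c)
Theta = np.diag(np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, N)))
H = H_uav_mu + H_ris_mu @ Theta @ H_uav_ris  # K x M combined channel H^H

# ZF precoder of eq. (10): the Moore-Penrose pseudo-inverse of H^H
G = np.linalg.pinv(H)  # M x K, columns are the g_k vectors

# Eq. (9): h_j^H g_k = 0 for j != k and h_k^H g_k = 1
print(np.allclose(H @ G, np.eye(K)))  # True
```

Since K ≤ M and the random channel is full row rank almost surely, the pseudo-inverse acts as an exact right inverse, which is precisely the pair of conditions in (9).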
Therefore, the received SINRs of the strong user and the weak user can be expressed as

\gamma_{l,a} = \frac{\left|\left(\mathbf{h}^H_{U,l,a}+\mathbf{h}^H_{S,l,a}\Theta_S\mathbf{H}_{U,S}\right)\mathbf{w}_l\right|^2\alpha_{l,a}P_l}{\sigma_l^2} = \frac{\alpha_{l,a}P_l}{\sigma_l^2}, \quad (19)

and

\gamma_{l,b} = \frac{\left|\mathbf{h}_{l,b}\mathbf{w}_l\right|^2\alpha_{l,b}P_l}{\left|\mathbf{h}_{l,b}\mathbf{w}_l\right|^2\alpha_{l,a}P_l + \left|\sum_{j=1,j\neq l}^{L}\mathbf{h}_{l,b}\mathbf{w}_j x_j\right|^2 + \sigma_l^2}, \quad (20)

where h_{l,b} = h^H_{U,l,b} + h^H_{S,l,b}Θ_S H_{U,S}.

Remark 3: In contrast to conventional NOMA systems, for MISO-NOMA networks the decoding order cannot be decided directly by the order of the MUs' channel gains; an additional decoding rate constraint needs to be satisfied for guaranteeing successful SIC. Additionally, by integrating RISs in the MISO-NOMA system, the channel response is also modified by the RISs. Thus, the decoding order constraint in this paper has to be satisfied at each timeslot.

The decoding rate condition is given by γ_{l,b→l,a} ≥ γ_{l,b→l,b} when π_l(a) ≥ π_l(b) [35], where γ_{l,b→l,a} denotes the SINR at user a when decoding the signal of user b.

Since both the UAV and the MUs are considered to be roaming continuously, the decoding order has to be re-determined at each timeslot to adapt to the dynamic environment.

G. Problem Formulation

We are interested in minimizing the energy consumption of the UAV, while guaranteeing the wireless service quality from the UAV to the users at each timeslot, by jointly optimizing the phase shift matrix of the RIS, the movement of the UAV, the power allocation from the UAV to the MUs, and the dynamic decoding order. Denote θ = [θ_1(t), ..., θ_n(t), ..., θ_N(t)], P = [p_1(t), ..., p_k(t), ..., p_K(t)] and Q = [x_UAV(t), y_UAV(t)]^T. Thus, the optimization problem can be formulated as

\min_{\theta,P,Q} E_{UAV} = \sum_{t=0}^{T}\bar{E}(t) \quad (21a)
\text{s.t.}\ R_k(t) \ge R_k^{min}(t), \forall k, \forall t, \quad (21b)
|\phi_n(t)| = 1, \forall n, \forall t, \quad (21c)
x_{min} \le x_{UAV}(t) \le x_{max},\ y_{min} \le y_{UAV}(t) \le y_{max}, \forall t, \quad (21d)
R_{l,b\to l,a}(t) \ge R_{l,b\to l,b}(t),\ \pi_l(a) \ge \pi_l(b), \forall l, \quad (21e)
\mathrm{tr}\left(\mathbf{P}\left(\mathbf{H}^H\mathbf{H}\right)^{-1}\right) \le P_{max}, \forall k, \forall t, \quad (21f)

where (21b) represents that the data demand of all MUs has to be satisfied at each timeslot, with R_k^min(t) denoting the minimal data rate requirement of any MU. (21c) denotes that the passive beamforming constraint of the RIS needs to be satisfied when controlling its phase shifts. (21d) formulates the position bounds of the UAV, which indicate that the UAV can only move in this particular area. (21e) is the decoding order constraint of the NOMA technique for guaranteeing successful SIC. (21f) ensures that the total required transmit power cannot exceed the maximal power constraint of the UAV.

The constraints (21b) and (21e) are not convex with respect to θ, P or Q. All these constraints yield a mixed-integer non-convex problem and, in general, there is no standard method for solving this kind of problem efficiently. Additionally, the dynamic movement of the UAV and the MUs is considered, which indicates that the scenario is dynamic and challenging for conventional optimization algorithms, owing to their inability to cope with the dynamics of the environment. Moreover, the joint trajectory design of the UAV, the phase shift control of the RIS, and the determination of the decoding order and the power allocation policy are to be optimized jointly. The search space expands as the number of parameters increases, which also makes conventional gradient-based optimization techniques unsuitable. Therefore, an RL algorithm, which empowers the agent to make decisions by learning from the environment, is invoked to solve the formulated problem.

III. PROPOSED SOLUTIONS

In this section, we first formulate the joint phase shift control and trajectory design problem as an MDP. Afterward, we propose a D-DQN based algorithm for tackling the formulated problem. In addition, the state space, action space and reward function design of the proposed D-DQN based algorithm are specified.

A. Markov Decision Process Formulation

Before invoking RL algorithms, the formulated problem needs to be shown to constitute an MDP. It has been proved in [36] that, since the central controller makes sequential decisions that influence the observed state (the UAV's position, the RIS's phase shifts, and the power allocated to each MU) at the next timeslot, the trajectory design and phase shift control problem for the RIS-enhanced UAV-enabled wireless network can be formulated as an MDP. As illustrated in Fig. 4, the MDP is defined by the environment, the set of states S, the set of actions A, the reward function r and the state transition function τ, which together form the MDP cycle. After one MDP cycle, the process transitions into a new state according to the previous state and the carried-out actions.

B. Proposed DRL Based Phase Shift Control and Trajectory Design Algorithm

In this subsection, a D-DQN based algorithm is introduced to determine the trajectory of the UAV and the phase shifts of the RIS, while guaranteeing that all the users' data demands are satisfied. In the D-DQN based model, the central controller, which controls both the UAV and the RIS, acts as the agent. At each timeslot t, the agent observes a state, s_t, from the state space, S, which consists of the coordinates of both the
At each time slot t, the agent observes a state s_t from the state space S, which consists of the coordinates of both the UAV and of all the users, as well as of the phase shifts of the RIS. According to the current state and the decision policy J, the agent takes an action a_t from the action space A, which consists of the moving directions of the UAV and the variable quantity of each reflecting element's phase shift. After carrying out an action, the agent obtains a reward/penalty r_t based on the energy consumption of the UAV and the connectivity condition. At each timeslot, a Q-value is calculated based on the current state and the previously taken actions. Thus, the state, action, and Q-value are stored in a Q-function Q(s_t, a_t), which determines the decision policy J. The aim of the D-DQN model is to enable the agent to carry out the optimal actions to maximize the long-term sum reward, instead of maximizing the reward at a particular timeslot. Thus, in the D-DQN model, the selected action may not be the optimal choice for the current timeslot, but the optimal choice for pursuing long-term benefits. In this paper, the phase shift of the RIS is considered to be discrete, so a value-based RL algorithm is invoked. When considering continuous phase shifts, policy-based or actor-critic algorithms, such as the DDPG algorithm, can be adopted.

Algorithm 1 D-DQN Based Phase Shift Control and Trajectory Design Algorithm
Input: replay memory D, minibatch size n, and initial learning rate α.
Initialize the replay memory D, the Q-network weights θ, the target network weights θ* = θ, and Q(s, a). The UAV is deployed at a random point, and the phase shift matrix of the RIS is initialized randomly.
repeat
    For each episode do:
        The central controller chooses a_t uniformly at random with probability ε, and chooses a_t such that Q_θ(s_t, a_t) = max_{a∈A} Q_θ(s_t, a) with probability (1 − ε).
        The central controller observes the reward r_t; the D-DQN model transfers to a new state s_{t+1}.
        Store the transition (s_t, a_t, r_t, s_{t+1}) and sample a random minibatch of transitions (s_i, a_i, r_i, s'_i), i ∈ n, from D.
        For each i, obtain y_i = r_i + γ · max_{a∈A} Q_{θ*}(s'_i, a).
        Perform a gradient descent step θ ← θ − α_t · (1/I) Σ_{i∈n} [y_i − Q_θ(s_i, a_i)] · ∇_θ Q_θ(s_i, a_i).
        Update the target network: θ* ← θ.
        Calculate the learning rate α(n_e) = α_0/(1 + η n_e).
    end
until state s terminates
Return: the Q-function Q_θ and the policy J.

At each timeslot, the Q-value and the Q-function are updated based on the current state, the previously taken actions, and the received reward by following the principle in (22) at the bottom of the page, where α denotes the learning rate and γ is the discount factor. In (22), the reward r_t is drawn from a fixed reward distribution R : S × A → ℝ, where E{r_t | (s, a, s') = (s_t, a_t, s_{t+1})} = R^a_{ss'}. By solving the following equation, the optimal value function can be obtained as

Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right], \quad (23)

where Q^*(s, a) is the optimal value function and Q → Q^* when i → ∞.

1) States in the D-DQN Model: In terms of the state space of the proposed D-DQN model, it contains four parts: 1) θ_n(t) ∈ [0, 2π], n ∈ N, the phase shift of each reflecting element at timeslot t; 2) [x_UAV(t), y_UAV(t)]^T, the 2D coordinate of the UAV at timeslot t^4; 3) c^U_k(t) = [x^U_k(t), y^U_k(t)]^T, k ∈ K, the 2D coordinate of each MU at timeslot t; 4) p_k(t), k ∈ K, the power allocated from the UAV to each MU at timeslot t.

2) Actions in the D-DQN Model: As for the action space of the proposed D-DQN model, it contains three parts: 1) Δθ_n(t) ∈ {−π/10, 0, π/10}, the variable quantity of the phase shift value of each reflecting element; 2) Δc_UAV(t) ∈ {(−1, 0), (1, 0), (0, 0), (0, −1), (0, 1)}, the traveling direction and distance of the UAV; 3) Δp_k(t) ∈ {−Δp, 0, Δp}, the variable quantity of the transmit power from the UAV to each MU.

3) Reward Function in the D-DQN Model: The reward/penalty function is determined by the transmit rate of each MU and the energy consumption of the UAV. Thus, the reward/penalty is a function of the UAV's coordinate, the RIS's phase shift matrix, and the power allocation coefficients from the UAV to the mobile users, which can be calculated as r(t) = f[x_UAV(t), y_UAV(t), θ_n(t), p_k(t)]. When the D-DQN model carries out an action that reduces the energy consumption while guaranteeing the data transmit rate of each MU, a positive reward is given to the agent. By taking any other action, which results in an increase of the energy dissipation or an unsatisfied data transmit rate, the D-DQN model receives a penalty. Before designing the reward/penalty function of the D-DQN model, we invoke ξ to measure the satisfaction of the users. Thus, we have

\xi_k(t) = \begin{cases} 1, & R_k(t) \ge R_k^{\min}, \\ 0, & R_k(t) < R_k^{\min}, \end{cases} \quad \forall k \in \{1, \cdots, K\}, \quad (24)

^4 In this paper, the 2D trajectory design of the UAV is investigated due to the lack of an energy consumption model for the UAV's 3D movement. The results derived from the proposed algorithm can be extended to the 3D trajectory of the UAV.
2050 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 39, NO. 7, JULY 2021
Fig. 4. MDP and DRL model for RIS-aided UAV-enabled wireless network.
TABLE II
CONNECTION BETWEEN PARAMETERS USED IN THE SYSTEM MODEL, PROBLEM FORMULATION AND THE PROPOSED D-DQN MODEL
where R_k^min denotes the minimal achievable rate of the k-th MU. Based on (24), the reward/penalty function of the D-DQN model can be designed as (25) at the bottom of the page, where C is a constant whose value guarantees that the penalty for dissatisfaction of the transmit rate is high, so that actions leading to this phenomenon can be avoided.

It can be observed from (25) that the UAV tends to carry out actions by controlling the phase shift of the RIS instead of changing its own position, unless the data rate of the users cannot be satisfied. Additionally, maximizing the long-term sum reward in the proposed D-DQN model tends to minimize the long-term energy consumption of the UAV.

Table II maps the connection between the parameters invoked in the system model, the problem formulation, and the proposed D-DQN model. Since the state space is huge, overflow happens when storing the Q-values in a Q-table. In an effort to solve this problem, function approximation by neural networks is adopted for approximating the Q-table. A convolutional neural network (CNN) with weights {θ} is invoked to output the Q-table. In an effort to reduce the correlation of sampling, memory replay is invoked in the proposed D-DQN model. At the early stage of training, the agent carries out random actions and stores its experiences in a memory bank. The experiences, which contain the states, actions, and rewards, can be leveraged as training samples. The aim of the CNN is to minimize the following loss function at each episode,

\mathrm{Loss}(\theta) = \left[ y - Q(s_t, a_t, \theta) \right]^2, \quad (26)

where y = r_t + \gamma \cdot \max_{a \in A} Q_{\mathrm{old}}(s_t, a_t, \theta).

In terms of the number of neurons in the hidden layers, this number has to be larger than the dimension of the input
r(t) = \begin{cases}
C \sum_{k=1}^{K} \xi_k(t) - K \, E(t+1), & E(t+1) > E(t) \;\&\; \sum_{k=1}^{K} \xi_k(t) < K, \\
\sum_{k=1}^{K} \xi_k(t) - K \, E(t+1), & E(t+1) = E(t) \;\&\; \sum_{k=1}^{K} \xi_k(t) < K, \\
-E(t+1), & E(t+1) > E(t) \;\&\; \sum_{k=1}^{K} \xi_k(t) = K, \\
E(t+1), & E(t+1) = E(t) \;\&\; \sum_{k=1}^{K} \xi_k(t) = K.
\end{cases} \quad (25)
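The four cases of (25) translate directly into a small scoring function. The sketch below follows (25) as written above; the default penalty constant C and the assumption that the cumulative energy satisfies E(t+1) ≥ E(t) are illustrative choices, not values taken from the paper.

```python
import numpy as np

def reward_25(xi, e_next, e_now, C=100.0):
    """Piecewise reward/penalty of (25).

    xi     : vector of satisfaction indicators xi_k(t) from (24)
    e_next : E(t+1), energy consumption after the action
    e_now  : E(t), energy consumption before the action
    C      : large constant of (25); the default value is assumed
    """
    K = len(xi)
    s = int(np.sum(xi))
    if e_next > e_now and s < K:      # energy grew, some MUs unsatisfied
        return C * s - K * e_next
    if e_next == e_now and s < K:     # energy unchanged, some MUs unsatisfied
        return s - K * e_next
    if e_next > e_now and s == K:     # energy grew, all MUs satisfied
        return -e_next
    # energy unchanged, all MUs satisfied (E(t+1) >= E(t) is assumed)
    return e_next
```

As (25) suggests, an action that leaves the energy consumption unchanged while keeping all K users satisfied earns the largest reward, which is why the agent prefers phase-shift adjustments over UAV movement.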
and output matrices. Thus, it depends on the number of antennas at the UAV and the number of reflecting elements of the RIS, as well as the number of clusters.

In order to strike a balance between exploration and exploitation in the proposed D-DQN algorithm, the ε-greedy exploration method is leveraged by satisfying the following principle

\Pr(\hat{J} = J) = \begin{cases} 1 - \epsilon, & a = \arg\max_a Q(s, a), \\ \epsilon / (|A| - 1), & \text{otherwise}. \end{cases} \quad (27)

In terms of the learning rate, we invoke the decaying learning rate concept for attaining a tradeoff between accelerating the training speed and converging to the local optimum, as well as for avoiding oscillation. The decaying learning rate is given by

\alpha(n_e) = \frac{\alpha_0}{1 + \eta n_e}, \quad (28)

where α_0 represents the learning rate at the initial episode, η is a constant parameter determining the decaying rate, and n_e denotes the training episode.

Remark 4: By invoking the decaying learning rate, the initial training episodes use a large learning rate, which is helpful for accelerating the training. As the training episodes increase, the learning rate decays, which helps the D-DQN model converge to a local optimum.

C. Analysis of the Proposed Algorithm

1) Convergence Analysis: Before analyzing the convergence of the proposed D-DQN algorithm, the convergence of the conventional Q-learning algorithm and of the DQN algorithm has to be discussed first. Afterward, by discussing the influence of the decaying learning rate on the convergence of the DQN algorithm, the convergence of the proposed D-DQN algorithm can be proved.

It has been proved in [37] that the conventional Q-learning algorithm converges to the optimal Q-function when satisfying 0 \le \alpha_t \le 1, \sum_t \alpha_t = \infty and \sum_t \alpha_t^2 < \infty. Additionally, it has also been proved that the DQN algorithm, which is an extended Q-learning algorithm, is capable of converging to the optimal state once the neural network is large enough [38]. The DQN based algorithm is a sub-optimal solution because the optimality of an RL model cannot be guaranteed. In terms of the influence of the decaying learning rate on the optimality and convergence of the DQN algorithm, the only function of the decaying learning rate concept is to attain a tradeoff between accelerating the training speed and converging to the local optimum, as well as to avoid oscillation. Thus, the decaying learning rate concept does not affect the converging ability and optimality of the DQN algorithm, but the convergence rate is influenced. Overall, the convergence of the proposed D-DQN algorithm can be guaranteed, while its convergence rate may differ from that of the conventional DQN algorithm.

2) Complexity Analysis: The computational complexity of the proposed D-DQN based algorithm consists of two aspects, namely, the computational complexity related to the CNN model and the computational complexity related to the training process. The computational complexity related to the CNN model can be calculated as O\big(f_1 n_2^2 (n_1 - n_2 + 1)^2 + f_2 n_3^2 (n_1 - n_2 - n_3 + 2)^2\big), where the subscript i refers to the i-th Conv layer [39]. The parameters in this equation are related to the number and size of the filters in each Conv layer. Since in the training stage all states and actions are observed by the agent, the computational complexity related to the training process is given by O(|S| \cdot |A|), where |S| and |A| are the total number of states and actions, respectively.

IV. SIMULATION RESULTS

In this section, we aim to verify the validity of the proposed D-DQN based algorithm by illustrating its convergence and to validate the enhancement of the network performance achieved with the assistance of the RIS. Additionally, we test the performance of both the NOMA-RIS case and the OMA-RIS case. In the simulation, the UAV is initially placed at a random position and the phase shift matrix of the RIS is also given randomly at the initial timeslot. We invoke a 3-layer CNN with 50 nodes in the hidden layer. The learning rate decays from 0.1 to 0.001. The simulation parameters are shown in Table III.

TABLE III
SIMULATION PARAMETERS

A. Convergence Rate of the Proposed D-DQN Algorithm

Fig. 5 characterizes the average energy consumption of the UAV over the episodes. It can be observed that the conventional Q-learning algorithm fails to converge, which is due to the huge state space of the UAV-aided wireless network. In contrast to the conventional Q-learning model, the D-DQN algorithm is capable of converging with the aid of function approximation via a neural network. It can also be observed that when the learning rate is 0.005, the proposed DQN algorithm converges after about 50000 episodes, which indicates that the performance of the DQN model (learning rate 0.005) outperforms the cases with a larger learning rate in terms of both convergence rate and average energy consumption. Although the convergence rate of the proposed D-DQN based algorithm is slower than that of the DQN algorithm (learning rate 0.005), the average energy consumption derived from the
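The ε-greedy rule of (27), the decaying learning rate of (28), and the target value entering the loss (26) each fit in a few lines. The fragment below is a hedged numerical sketch with assumed default parameters, not the authors' implementation; in particular the random seed and the default γ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

def decayed_lr(alpha0, eta, episode):
    """Decaying learning rate of (28): alpha(n_e) = alpha0 / (1 + eta * n_e)."""
    return alpha0 / (1.0 + eta * episode)

def epsilon_greedy(q_values, epsilon):
    """epsilon-greedy selection of (27): the greedy action with probability
    1 - epsilon, each non-greedy action with probability epsilon / (|A| - 1)."""
    greedy = int(np.argmax(q_values))
    if rng.random() < epsilon and len(q_values) > 1:
        others = [a for a in range(len(q_values)) if a != greedy]
        return int(rng.choice(others))
    return greedy

def td_target(r, q_next, gamma=0.9):
    """Target y = r + gamma * max_a Q_old(s', a) entering the loss (26)."""
    return r + gamma * float(np.max(q_next))
```

With α_0 = 0.1 and η chosen so that the rate reaches 0.001 at the final episode, this schedule matches the decay from 0.1 to 0.001 used in the simulations.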
Fig. 5. Convergence rate of the proposed DQN algorithm.

Fig. 7. Average energy consumption over transmit power. (The number of reflecting elements is 24.)
Fig. 8. Average energy consumption over number of reflecting elements. (The transmit power is 12 dBm.)

Fig. 10. Impact of UAV's altitude on energy consumption. (The number of reflecting elements is 24.)
[3] M. Najafi, V. Jamali, R. Schober, and H. V. Poor, "Physics-based modeling and scalable optimization of large intelligent reflecting surfaces," 2020, arXiv:2004.12957. [Online]. Available: http://arxiv.org/abs/2004.12957
[4] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, "Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7957–7969, Aug. 2019.
[5] C. Yan, L. Fu, J. Zhang, and J. Wang, "A comprehensive survey on UAV communication channel modeling," IEEE Access, vol. 7, pp. 107769–107792, 2019.
[6] M. Mozaffari, W. Saad, M. Bennis, Y.-H. Nam, and M. Debbah, "A tutorial on UAVs for wireless networks: Applications, challenges, and open problems," IEEE Commun. Surveys Tuts., vol. 21, no. 3, pp. 2334–2360, 3rd Quart., 2019.
[7] Q. Wu, Y. Zeng, and R. Zhang, "Joint trajectory and communication design for multi-UAV enabled wireless networks," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 2109–2121, Mar. 2018.
[8] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, "Mobile unmanned aerial vehicles (UAVs) for energy-efficient Internet of Things communications," IEEE Trans. Wireless Commun., vol. 16, no. 11, pp. 7574–7589, Nov. 2017.
[9] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, "Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach," IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 2059–2070, Sep. 2018.
[10] J. Cui, Y. Liu, and A. Nallanathan, "Multi-agent reinforcement learning-based resource allocation for UAV networks," IEEE Trans. Wireless Commun., vol. 19, no. 2, pp. 729–743, Feb. 2020.
[11] C. H. Liu, X. Ma, X. Gao, and J. Tang, "Distributed energy-efficient multi-UAV navigation for long-term communication coverage by deep reinforcement learning," IEEE Trans. Mobile Comput., vol. 19, no. 6, pp. 1274–1285, Jun. 2020.
[12] H. Guo, Y.-C. Liang, J. Chen, and E. G. Larsson, "Weighted sum-rate optimization for intelligent reflecting surface enhanced wireless networks," 2019, arXiv:1905.07920. [Online]. Available: http://arxiv.org/abs/1905.07920
[13] Q. Wu and R. Zhang, "Intelligent reflecting surface enhanced wireless network via joint active and passive beamforming," IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5394–5409, Nov. 2019.
[14] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and C. Yuen, "Reconfigurable intelligent surfaces for energy efficiency in wireless communication," IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 4157–4170, Aug. 2019.
[15] H. Shen, W. Xu, S. Gong, Z. He, and C. Zhao, "Secrecy rate maximization for intelligent reflecting surface assisted multi-antenna communications," IEEE Commun. Lett., vol. 23, no. 9, pp. 1488–1492, Sep. 2019.
[16] C. Huang, R. Mo, and C. Yuen, "Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning," 2020, arXiv:2002.10072. [Online]. Available: http://arxiv.org/abs/2002.10072
[17] K. Feng, Q. Wang, X. Li, and C.-K. Wen, "Deep reinforcement learning based intelligent reflecting surface optimization for MISO communication systems," IEEE Wireless Commun. Lett., vol. 9, no. 5, pp. 745–749, May 2020.
[18] A. Taha, Y. Zhang, F. B. Mismar, and A. Alkhateeb, "Deep reinforcement learning for intelligent reflecting surfaces: Towards standalone operation," 2020, arXiv:2002.11101. [Online]. Available: http://arxiv.org/abs/2002.11101
[19] H. Yang, Z. Xiong, J. Zhao, D. Niyato, L. Xiao, and Q. Wu, "Deep reinforcement learning based intelligent reflecting surface for secure wireless communications," 2020, arXiv:2002.12271. [Online]. Available: http://arxiv.org/abs/2002.12271
[20] Q. Zhang, W. Saad, and M. Bennis, "Millimeter wave communications with an intelligent reflector: Performance optimization and distributional reinforcement learning," 2020, arXiv:2002.10572. [Online]. Available: http://arxiv.org/abs/2002.10572
[21] Y. Liu, Z. Qin, M. Elkashlan, Z. Ding, A. Nallanathan, and L. Hanzo, "Non-orthogonal multiple access for 5G and beyond," Proc. IEEE, vol. 105, no. 12, pp. 2347–2381, Dec. 2017.
[22] H. Zhang, J. Zhang, and K. Long, "Energy efficiency optimization for NOMA UAV network with imperfect CSI," 2020, arXiv:2005.02046. [Online]. Available: http://arxiv.org/abs/2005.02046
[23] J. Lu et al., "UAV-enabled uplink non-orthogonal multiple access system: Joint deployment and power control," IEEE Trans. Veh. Technol., vol. 69, no. 9, pp. 10090–10102, Sep. 2020.
[24] X. Liu, Y. Liu, Y. Chen, and H. V. Poor, "RIS enhanced massive non-orthogonal multiple access networks: Deployment and passive beamforming design," 2020, arXiv:2001.10363. [Online]. Available: http://arxiv.org/abs/2001.10363
[25] X. Mu, Y. Liu, L. Guo, J. Lin, and N. Al-Dhahir, "Exploiting intelligent reflecting surfaces in NOMA networks: Joint beamforming optimization," 2019, arXiv:1910.13636. [Online]. Available: http://arxiv.org/abs/1910.13636
[26] M. Di Renzo et al., "Smart radio environments empowered by AI reconfigurable meta-surfaces: An idea whose time has come," 2019, arXiv:1903.08925. [Online]. Available: http://arxiv.org/abs/1903.08925
[27] S. Abeywickrama, R. Zhang, Q. Wu, and C. Yuen, "Intelligent reflecting surface: Practical phase shift model and beamforming optimization," IEEE Trans. Commun., vol. 68, no. 9, pp. 5849–5863, Sep. 2020.
[28] S. D. Muruganathan et al., "An overview of 3GPP release-15 study on enhanced LTE support for connected drones," 2018, arXiv:1805.00826. [Online]. Available: http://arxiv.org/abs/1805.00826
[29] Y. Zeng, J. Xu, and R. Zhang, "Energy minimization for wireless communication with rotary-wing UAV," IEEE Trans. Wireless Commun., vol. 18, no. 4, pp. 2329–2345, Apr. 2019.
[30] Y. Liu, Z. Ding, M. Elkashlan, and H. V. Poor, "Cooperative non-orthogonal multiple access with simultaneous wireless information and power transfer," IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 938–953, Apr. 2016.
[31] Z. Ding, X. Lei, G. K. Karagiannidis, R. Schober, J. Yuan, and V. K. Bhargava, "A survey on non-orthogonal multiple access for 5G networks: Research challenges and future trends," IEEE J. Sel. Areas Commun., vol. 35, no. 10, pp. 2181–2195, Oct. 2017.
[32] Y. Liu, H. Xing, C. Pan, A. Nallanathan, M. Elkashlan, and L. Hanzo, "Multiple-antenna-assisted non-orthogonal multiple access," IEEE Wireless Commun., vol. 25, no. 2, pp. 17–23, Apr. 2018.
[33] E. Basar, "Reconfigurable intelligent surface-based index modulation: A new beyond MIMO paradigm for 6G," 2019, arXiv:1904.06704. [Online]. Available: http://arxiv.org/abs/1904.06704
[34] C. Huang, G. C. Alexandropoulos, A. Zappone, M. Debbah, and C. Yuen, "Energy efficient multi-user MISO communication using low resolution large intelligent surfaces," in Proc. IEEE Globecom Workshops (GC Wkshps), Dec. 2018, pp. 1–6.
[35] J. Cui, Y. Liu, Z. Ding, P. Fan, and A. Nallanathan, "Optimal user scheduling and power allocation for millimeter wave NOMA systems," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1502–1517, Mar. 2018.
[36] Y. Zeng, X. Xu, S. Jin, and R. Zhang, "Simultaneous navigation and radio mapping for cellular-connected UAV with deep reinforcement learning," 2020, arXiv:2003.07574. [Online]. Available: http://arxiv.org/abs/2003.07574
[37] F. S. Melo, "Convergence of Q-learning: A simple proof," Inst. Syst. Robot., Lisbon, Portugal, Tech. Rep., 2001, pp. 1–4.
[38] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, "Enhancing the fuel-economy of V2I-assisted autonomous driving: A reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 69, no. 8, pp. 8329–8342, Aug. 2020.
[39] K. He and J. Sun, "Convolutional neural networks at constrained time cost," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5353–5360.
[40] S. Zhang and R. Zhang, "Capacity characterization for intelligent reflecting surface aided MIMO communication," IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1823–1838, Aug. 2020.

Xiao Liu (Student Member, IEEE) received the B.S. and M.S. degrees in 2013 and 2016, respectively. He is currently pursuing the Ph.D. degree with the Communication Systems Research Group, School of Electronic Engineering and Computer Science, Queen Mary University of London. His research interests include unmanned aerial vehicle (UAV) aided networks, machine learning, non-orthogonal multiple access (NOMA) techniques, and reconfigurable intelligent surface (RIS).
Yuanwei Liu (Senior Member, IEEE) received the B.S. and M.S. degrees from the Beijing University of Posts and Telecommunications in 2011 and 2014, respectively, and the Ph.D. degree in electrical engineering from the Queen Mary University of London, U.K., in 2016. He was with the Department of Informatics, King's College London, from 2016 to 2017, where he was a Post-Doctoral Research Fellow. He has been a Lecturer (an Assistant Professor) with the School of Electronic Engineering and Computer Science, Queen Mary University of London, since 2017. His research interests include 5G and beyond wireless networks, the Internet of Things, machine learning, and stochastic geometry. He has served as a TPC Member for many IEEE conferences, such as GLOBECOM and ICC. He received the Exemplary Reviewer Certificate of the IEEE WIRELESS COMMUNICATIONS LETTERS in 2015, the IEEE TRANSACTIONS ON COMMUNICATIONS in 2016 and 2017, and the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS in 2017. He served as the Publicity Co-Chair for VTC2019-Fall. He is currently serving as an Editor on the Editorial Board of the IEEE TRANSACTIONS ON COMMUNICATIONS, the IEEE COMMUNICATIONS LETTERS, and IEEE ACCESS. He also serves as a Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING (JSTSP) Special Issue on Signal Processing Advances for Non-Orthogonal Multiple Access in Next Generation Wireless Networks.

Yue Chen (Senior Member, IEEE) received the bachelor's and master's degrees from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 1997 and 2000, respectively, and the Ph.D. degree from QMUL, London, U.K., in 2003. She is currently a Professor of Telecommunications Engineering with the School of Electronic Engineering and Computer Science, Queen Mary University of London (QMUL), U.K. Her current research interests include intelligent radio resource management (RRM) for wireless networks, cognitive and cooperative wireless networking, mobile edge computing, HetNets, smart energy systems, and the Internet of Things.