
Machine Learning for User Partitioning and Phase Shifters Design in RIS-Aided NOMA Networks
Zhong Yang, Student Member, IEEE, Yuanwei Liu, Senior Member, IEEE,
Yue Chen, Senior Member, IEEE, and Naofal Al-Dhahir, Fellow, IEEE

Abstract

A novel reconfigurable intelligent surface (RIS) aided non-orthogonal multiple access (NOMA) downlink transmission framework is proposed. We formulate a long-term stochastic optimization problem that involves a joint optimization of NOMA user partitioning and RIS phase shifting, aiming at maximizing the sum data rate of the mobile users (MUs) in NOMA downlink networks. To solve the challenging joint optimization problem, we invoke a modified object migration automation (MOMA) algorithm to partition the users into equal-size clusters. To optimize the RIS phase shifting matrix, we propose a deep deterministic policy gradient (DDPG) algorithm to collaboratively control multiple reflecting elements (REs) of the RIS. Different from conventional training-then-testing processing, we consider a long-term self-adjusting learning model where the intelligent agent is capable of learning the optimal action for every given state through exploration and exploitation. Extensive numerical results demonstrate that: 1) The proposed RIS-aided NOMA downlink framework achieves an enhanced sum data rate compared with the conventional orthogonal multiple access (OMA) framework. 2) The proposed DDPG algorithm is capable of learning a dynamic resource allocation policy in a long-term manner. 3) The performance of the proposed RIS-aided NOMA framework can be improved by reducing the granularity of the RIS phase shifts. The numerical results also show that reducing the granularity of the RIS phase shifts and increasing the number of REs are two efficient methods to improve the sum data rate of the MUs.

Part of this paper was presented at the IEEE Global Communications Conference (GLOBECOM) 2020 [1].
Z. Yang, Y. Liu and Y. Chen are with the School of Electronic Engineering and Computer Science, Queen Mary University
of London, London E1 4NS, UK. (email:{zhong.yang, yuanwei.liu, yue.chen}@qmul.ac.uk)
N. Al-Dhahir is with the Department of Electrical and Computer Engineering, University of Texas at Dallas, Richardson, TX
75080. (email: [email protected] )

I. INTRODUCTION

Due to the explosive increase of mobile devices, mobile data traffic has been growing dramat-
ically in wireless networks. According to a white paper from Cisco [2], there will be 5.7 billion
(71% of global population) mobile users by 2022, up from 5.0 billion in 2017, a compound
annual growth rate (CAGR) of 2.8%, which will generate 77.5 exabytes of mobile data traffic
per month by 2022, up from 11.5 exabytes per month in 2017. In order to satisfy the higher
requirements in data rates, lower latency, and massive connectivity in future generation wireless
networks, promising technologies have been introduced and actively investigated, such as
reconfigurable intelligent surfaces (RIS), non-orthogonal multiple access (NOMA), and deep
reinforcement learning (DRL).
An RIS reconfigures the wireless propagation environment via adjusting the passive beam-
forming adaptively. As an appealing complementary solution to enhance wireless transmissions,
RIS are composed of massive low-cost and nearly passive reflective elements (REs) that can be
flexibly deployed in the current wireless networks [3–8]. NOMA is also a promising technology
for massive user connectivity in future wireless communication networks. The integration of
RIS with NOMA has already attracted significant attention both from academia and industry.
RIS provides a new approach to enhance the NOMA performance by manipulating the wireless
environment for mobile users who are blocked by obstacles, which motivates us to integrate
RIS with NOMA downlink networks for further performance enhancement. Another promising
recent technology is DRL, thanks to the rapid progression of fast and massively parallel graphics processing units (GPUs). Reinforcement learning (RL) has been proven effective in the ATARI games
of Google DeepMind. The objective of RL algorithms is, for the defined intelligent agent,
to intelligently take actions in an unknown environment, so as to maximize some notion of
cumulative reward. Different from deep neural networks (DNN), which need large amounts of
training data to model the complex environment, RL algorithms focus on finding the balance
between exploration (of unknown environment) and exploitation (of known environment). In [9],
DRL is adopted for joint mode selection and resource management in green fog radio access
networks. The authors in [10] proposed a deep multi-user RL based distributed dynamic spectrum
access algorithm to maximize the formulated objective function. A DRL-based Online Offloading
(DROO) framework is proposed in [11] which implements a deep neural network as a scalable
solution that learns the binary offloading decisions from experience. In this paper, we adopt
DRL for RIS-aided NOMA downlink networks with the goal of answering the following key
questions:
• Question 1 : Do RIS-aided NOMA downlink networks significantly outperform RIS-aided orthogonal multiple access (OMA) downlink networks?


• Question 2 : Does RIS dynamic phase shifting bring performance enhancement compared
with a random phase shifting strategy?
• Question 3 : Which parameter plays a more critical role in improving the performance
of the RIS-aided NOMA downlink framework?

A. Motivations and Related Works

For the NOMA transmission aspect: Different from conventional OMA techniques, such as
orthogonal frequency division multiple access (OFDMA), which assigns subsets of subcarriers
to an individual user, NOMA enables a base station (BS) to simultaneously transmit to several
mobile users (MUs) [12–18]. The key idea behind NOMA is to ensure that multiple users
are served simultaneously within the same given time/frequency resource block (RB), utilizing
superposition coding (SC) techniques at the transmitter and successive interference cancellation
(SIC) at the receiver [14, 15]. On one hand, for the NOMA scheme, user clustering has a signif-
icant impact on the tradeoff between the complexity of the SIC decoding and the performance.
Recent research contributions have investigated user clustering for NOMA downlink from several aspects, such as fairness guarantees [16], sum rate maximization [17, 19], and transmit power
minimization [18]. On the other hand, for NOMA transmission, the channel gains for mobile
users also affect the performance. If the mobile users are blocked, then we cannot apply the
NOMA strategy.
For the RIS transmission aspect: Previous research contributions on RIS mainly focus on
sum transmit power minimization [20], sum-rate maximization [21, 22], spectrum efficiency [23],
and energy efficiency [24]. The authors of [20] investigate the sum transmit power minimization problem under
signal-to-interference-plus-noise ratio (SINR) constraints of the users, which is solved by a dual
method. A joint design of transmit beamforming and phase shifts is investigated in [21] to
maximize the sum rate of multiuser downlink multiple input single output (MISO) systems.
The authors in [22] implement the RIS by deploying programmable phase shifters to establish
a favorable propagation environment for secure communication. In [23], the RIS is applied in
an RIS-assisted full-duplex (FD) cognitive radio system, and the joint beamforming design of the
transmitter, the RIS and the receiver is optimized for maximizing the total spectral efficiency. A
multiple RISs distribution is investigated in [24] to maximize the energy efficiency dynamically.
Different from [20–24], the authors in [25] maximize the secrecy rate for the RIS aided multi-
antenna network. The performance of an RIS-aided large-scale antenna system is evaluated
in [26] by formulating a tight upper bound on the ergodic spectral efficiency. Apart from the

published papers cited above, there are more arXiv papers on RIS [27, 28], intelligent reflecting
surfaces (IRS) [29, 30], and large reconfigurable intelligent surface (LRIS) [31]. In [29] and [30],
the transmission power is minimized by optimizing the beamforming vector and the RIS phase
shift matrix.
Due to RIS performance benefits, researchers also explored their integration with other key
technologies such as millimeter wave communications [32], unmanned aerial vehicles (UAVs) [33],
and OFDM [34]. The afore-mentioned potential benefits of RIS and NOMA motivate us to
improve the system performance in RIS-aided NOMA downlink transmission [35, 36]. In [35],
RIS-NOMA transmission is proposed to ensure that more users are served on each orthogonal
spatial direction, compared to spatial division multiple access (SDMA). In [36], the authors
consider a multi-cluster MISO NOMA RIS-aided downlink communication network. In this
paper, we propose RIS aided NOMA downlink networks as a promising solution for the above
challenge. The application of machine learning algorithms to RIS brings promising advantages for improving the coverage and rate of NOMA aided downlink networks. Another advantage of
NOMA aided downlink networks is enhancing the services for both line-of-sight (LOS: favorable
signal propagation conditions) users and non-line-of-sight (NLOS: unfavorable signal propagation
conditions) users, by smartly tuning the phase shifts, amplitude, and position.
For the RIS-aided NOMA aspect: Recent research contributions have studied the potential
benefits in RIS-aided NOMA networks [37, 38]. In [37], the authors maximize the sum rate of
mobile users by jointly optimizing the active beamforming at the BS and the passive beamforming
at the RIS. The authors in [38] minimize the downlink transmit power for a RIS-empowered
NOMA network by jointly optimizing the transmit beamformers at the BS and the phase shift
matrix at the RIS. The aforementioned RIS-aided NOMA contributions mainly focus on the static
active beamforming at the BS and passive beamforming at RIS. However in practical scenarios,
we often need to adjust the beamforming matrices in a long-term manner, which is complex in
conventional optimization approaches.
For the DRL aspect: Network intelligence is an extremely active area in the field of com-
munication networks to meet the ever-increasing traffic demands. Thanks to the latest machine
learning algorithms in artificial intelligence, especially RL algorithms and DRL algorithms, more
and more RL algorithms are being adopted for communication network intelligence tasks [39],
e.g., resource allocation, power control and demand prediction. In RIS-aided NOMA downlink
networks, one critical challenge is to support ultra-fast wireless data aggregation, which pervades
a wide range of applications in massive communication. RIS are envisioned as an innovative and
promising technology to fulfill the voluminous data transmission in future wireless networks, by

enhancing both the spectral and energy efficiency [4, 40, 41]. RIS is a planar array composed of a
large number of reconfigurable REs (e.g., low-cost printed dipoles), where each of the elements
is able to induce certain phase shifting (by the attached intelligent controller) independently on
the received signal, thus collaboratively adjusting the reflected signal propagation.

B. Contributions and Organization

The beamforming strategy in RIS-aided NOMA networks still needs further investigation,
especially when the network environment is complex and dynamic. For classical static opti-
mization problems, conventional optimization approaches like branch and bound (BB) [42],
alternating optimization (AO) techniques [43] and Lagrange duality (LD) method [44] can be
used. However, in a dynamic network where the beamforming matrix needs to be recalculated periodically, the complexity of conventional approaches makes them less practical. The
above-cited references motivate us to further explore the potential benefits of applying DRL
in RIS-aided NOMA downlink networks. In order to comprehensively analyze the network’s
performance enhancement brought by the RIS, we propose a RIS-aided multiple-input-single-
output (MISO) network to study the sum rate of the mobile users, where some of the users are blocked;
thus RIS is deployed efficiently to provide favorable wireless channels for them. Sparked by the
aforementioned potential benefits of DRL, we explore the potential performance enhancement
brought by DRL for RIS-aided NOMA downlink networks.
In this paper, we consider a NOMA downlink network where half of the mobile users are
blocked. The idea of applying RIS for smart communication is to improve the service of access
points for both line-of-sight users and non-line-of-sight users [45]. The RL is adopted to find the
long-term optimal setting of the REs of the RIS, i.e., their amplitudes and phases. As in Fig. 1,
a number of sensors collect environmental data. The sensors send their collected data to the
controller through a wired link. In our earlier GLOBECOM paper [1], we invoked a DRL approach for the phase shifter design. In this paper, to further improve the performance, a
novel modified object migration automation (MOMA) approach is utilized to partition the users
into equal-size clusters. Our main contributions are summarized as follows
1) We propose a RIS-aided NOMA downlink framework to study the performance enhance-
ment of the RIS. We formulate a long-term sum rate maximization problem subject to the
NOMA protocol, which is a combinatorial optimization problem.
2) We apply a novel MOMA algorithm for user clustering that is easy to implement and fast to converge.

3) We develop a deep deterministic policy gradient (DDPG) algorithm based solution for
designing the phase shifts of the RIS. In the proposed DDPG based solution, we define
the reward function using the sum rate of the mobile users, thus the formulated objective
function finds the optimal trajectory of the intelligent agent.
4) We demonstrate that the proposed RIS-aided NOMA downlink framework outperforms
the conventional OMA framework in sum data rate. Increasing the number of REs is an
efficient method to improve the performance of the proposed RIS-aided NOMA framework.
The rest of this paper is organized as follows. In Section II, the system model for beamforming
in the RIS is presented. In Section III, user partitioning is investigated. DDPG for phase shifts
design is formulated in Section IV. Simulation results are presented in Section V, before we
conclude this work in Section VI. Table I provides a summary of the notations used in this
paper. Table II provides a summary of the acronyms in this paper.
Notation: The expectation of a random variable is denoted as E (·). The absolute value of a
complex scalar is denoted as |·|.

II. SYSTEM MODEL

A. Network Model

We consider a RIS aided NOMA downlink network operating over a bandwidth of B Hz. The
network architecture is illustrated in Fig. 1, including an access point (AP) with K antennas
denoted by K = {1, 2, · · · , K}. There are N REs on the RIS, where each RE is capable of
independently reflecting the incident signal according to the channel state information (CSI),
by controlling the amplitude and/or phase and thereby collaboratively achieve directional signal
enhancement or nulling. We use n ∈ N = {1, 2, · · · , N} to index the REs of the meta-surfaces.
As is illustrated in Fig. 1, there are K users close to the AP (these users are considered as good
users). In addition, there are K users that are far from the AP and are also blocked (these users are considered as poor users). To connect these poor users, we place an RIS near them to reflect the
signal from the AP. To implement NOMA downlink transmission, we select one user from each
group to form K user clusters. The NOMA downlink strategy is adopted for each user cluster.
There are obstacles between the poor users and the AP, so there are no line-of-sight transmissions between the AP and these users. As illustrated in Fig. 1, the received signal from the RIS passes
through three stages:
• AP − RIS transmission : the RIS receives the signal from the AP. This is a multiple-
input multiple-output (MIMO) NOMA downlink transmission.

TABLE I
LIST OF NOTATIONS

B: The bandwidth of the network
K: The number of antennas
x(t): The transmit signal of the AP
P(t): The precoding matrix of the AP
s̃: The transmit signal vector
h^{AU}_{k,1}: The channel vector (AP to good user)
h^{RU}_{k,2} ∈ C^{1×N}: The channel coefficients (RIS to poor user)
n_{k,1}: The additive noise
Q(t): The phase-shifting matrix of the RIS
h^{AR}_{k,2}(t): The channel coefficients from the AP to the RIS
SINR_{k,1}(t): The SINR of the good user
ρ: The transmit power
θ_n(t): The phase shifting of the n-th RE
C: User clustering
z_k(t): The position of the k-th user
A_p: The query of a user
Ω: The optimal user partition
A_1: The reflecting amplitude of the RE
s(t): The state of the DDPG agent
a(t): The action of the DDPG agent
lr: The learning rate of the DDPG agent
V_π(s): The state-value function of the DDPG agent
Q^π(s, a): The action-value function of the DDPG agent
π_θ(a|s): The action selection policy
J(π_θ): The expected reward
A^π(s, a): The advantage function
P_t: The transmit power of the AP
γ: The discount factor
H: The minibatch size
U: The replay memory
L_i(θ_i): The loss function
O: The hidden layer size

TABLE II
LIST OF KEY ACRONYMS

NOMA: Non-orthogonal multiple access
SC: Superposition coding
MU: Mobile user
SIC: Successive interference cancellation
MOMA: Modified object migration automation
AP: Access point
DDPG: Deep deterministic policy gradient
IRS: Intelligent reflecting surface
RIS: Reconfigurable intelligent surface
LRIS: Large reconfigurable intelligent surface
DRL: Deep reinforcement learning
LOS: Line-of-sight
RL: Reinforcement learning
NLOS: Non-line-of-sight
DNN: Deep neural network
RE: Reflecting element
DROO: DRL-based Online Offloading
BB: Branch and bound
OMA: Orthogonal multiple access
AO: Alternating optimization
OFDMA: Orthogonal frequency division multiple access
LD: Lagrange duality
BS: Base station
MISO: Multiple-input single-output
RB: Resource block
CSI: Channel state information

• RIS reflection : The RIS reflecting elements adjust the amplitude and/or phase of the
received signal.
• RIS − user transmission : The adjusted signal is transmitted to the users. This is a MISO NOMA downlink transmission.

Fig. 1. An illustration of the RIS-aided NOMA downlink transmission.

B. Wireless Channel Model

In the RIS-aided MISO NOMA downlink transmission, the transmitted signal of the AP is

x(t) = P(t) s̃,                                                                        (1)

where P ∈ C^{K×K} is the precoding matrix generated by zero-forcing. Assuming that the channel matrix between the users and the AP is H, and H^H denotes the conjugate transpose of the channel matrix, the precoding matrix P = [p_1, · · · , p_K] is given by P = H^H (H H^H)^{−1}. s̃ ∈ C^{K×1} denotes the signal vector, given as follows:

s̃ = [ √α_{1,1} s_{1,1} + √α_{1,2} s_{1,2}, · · · , √α_{K,1} s_{K,1} + √α_{K,2} s_{K,2} ]^T ≜ [ s̃_1, · · · , s̃_K ]^T,        (2)
where sk,1 and αk,1 are defined as the transmitted information and the power allocation coefficient
of the good user in the k-th cluster, respectively.
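For concreteness, the following minimal NumPy sketch illustrates the zero-forcing precoder defined below Eq. (1) and the superposition coding of Eq. (2); the channel realization, symbol alphabet, and power-allocation coefficients are illustrative assumptions rather than values taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
K = 4                                           # number of AP antennas / clusters (Table IV value)
# Rayleigh channel matrix H between the AP and the K good users
H = (rng.standard_normal((K, K)) + 1j * rng.standard_normal((K, K))) / np.sqrt(2)

# Zero-forcing precoder P = H^H (H H^H)^{-1}, as defined below Eq. (1)
P = H.conj().T @ np.linalg.inv(H @ H.conj().T)

# NOMA superposition coding of Eq. (2): one good and one poor user per cluster
alpha = np.tile([0.2, 0.8], (K, 1))             # power allocation coefficients (assumed values)
s = rng.choice([-1.0, 1.0], size=(K, 2)) + 0j   # unit-power symbols per cluster (assumed BPSK)
s_tilde = np.sqrt(alpha[:, 0]) * s[:, 0] + np.sqrt(alpha[:, 1]) * s[:, 1]

x = P @ s_tilde                                 # transmitted signal of Eq. (1)
print(np.round(H @ P, 3))                       # approximately the identity: no inter-cluster interference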
The received signal of the good user in cluster k is given by

y_{k,1}(t) = h^{AU}_{k,1}(t) P s̃ + n_{k,1},                                            (3)

where h^{AU}_{k,1} ∈ C^{1×K} denotes the channel vector between the AP and the good user and n_{k,1} represents the additive noise. We assume a Rayleigh fading channel model between the AP and the good user.

The received complex-valued signal at the poor user is given by

y_{k,2}(t) = h^{RU}_{k,2}(t) Q(t) h^{AR}_{k,2}(t) P(t) s̃ + n_{k,2},                    (4)

where h^{RU}_{k,2} ∈ C^{1×N} represents the channel coefficient vector from the RIS to the poor user and n_{k,2} represents the additive noise. Q(t) = diag[ q_1(t), q_2(t), · · · , q_N(t) ] is the phase-shifting matrix of the reflecting elements of the RIS, where q_n(t) = β e^{jθ_n(t)}. h^{AR}_{k,2} ∈ C^{N×M} is the channel coefficient matrix from the AP to the RIS and n_{k,2} ∼ CN(0, σ²_{k,2}) represents the additive Gaussian noise at the poor user.
As illustrated in Fig. 1, the user from the far area is set as the weaker user because it only
receives the signal from the RIS. Denote the k-th column of P by pk . Then, for the good user,
the signal-to-interference-plus-noise ratio (SINR) is given by

SINR_{k,1}(t) = ρ |h^{AU}_{k,1}(t) p_k(t)|² α_{k,1} / ( ρ Σ_{i=1, i≠k}^{K} |h^{AU}_{k,1}(t) p_i(t)|² + σ²_{k,1} ),          (5)

where ρ is the transmit power. For the poor user, the SINR is given by (6).

C. Problem Formulation

In this paper, our objective is to maximize the sum rate of all the users in a long-term
manner. There are two parameters that need to be optimized. In [46], the authors assume
that the phase shifter is capable of tuning the phase continuously θn (t) ∈ (0, π]. However, in
practical scenarios [47], the phase can only be adjusted in a discrete manner, e.g.: 0, π/2, π, 3π/2.
Therefore, in this paper, we assume that the phase of each RE can only be changed to D − 1
phases. The formulated problem is formally given as:

(P1)  max_{C, Q}  Σ_{t=1}^{T} Σ_{k=1}^{K} Σ_{l=1}^{2} log₂( 1 + SINR_{k,l}(t) ),                       (7a)

      s.t.  C1: c^m_k(t) ∈ {0, 1}, ∀k ∈ [1, K], m ∈ [1, M], t ∈ [1, T],                               (7b)

            C2: θ_n(t) ∈ { (1/D)π, · · · , ((D−1)/D)π }, ∀n ∈ [1, N], t ∈ [1, T],                      (7c)

            C3: R^m_{j→k}(t) ≥ R^m_{j→j}(t), π^m_k(t) > π^m_j(t),                                      (7d)

            C4: Σ_{k=1}^{K} Σ_{l=1}^{2} α_{k,l}(t) ≤ P,                                                (7e)

where C = {c^m_k(t), k ∈ K, m ∈ M, t ∈ T} and θ = {θ_n(t), n ∈ N, t ∈ T}. Constraint (7d) ensures that the SIC is performed successfully. Constraint (7e) is the sum power constraint.
SINR_{k,2}(t) = ρ |h^{RU}_{k,2}(t) Q(t) h^{AR}_{k,2}(t) p_k(t)|² α_{k,2} / ( ρ |h^{RU}_{k,2}(t) Q(t) h^{AR}_{k,2}(t) p_k(t)|² α_{k,1} + ρ Σ_{i=1, i≠k}^{K} |h^{RU}_{k,2}(t) Q(t) h^{AR}_{k,2}(t) p_i(t)|² + σ²_{k,2} ).        (6)
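A minimal sketch of how the SINR expressions (5) and (6) can be evaluated for one cluster is given below. The channel realizations, transmit power, noise level, power split and the discrete phase set (with D = 4) are assumed example values; the precoder follows the zero-forcing definition below Eq. (1).

import numpy as np

rng = np.random.default_rng(1)
K, N = 4, 9                     # AP antennas and REs (Table IV values)
rho, sigma2 = 0.1, 1e-6         # transmit power and noise power (assumed, linear scale)
alpha1, alpha2 = 0.2, 0.8       # power allocation for the good / poor user (assumed)

cplx = lambda *shape: (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
h_AU = cplx(K, K)               # row k: channel from the AP to good user k
H_AR = cplx(N, K)               # channel from the AP to the RIS
h_RU = cplx(K, N)               # row k: channel from the RIS to poor user k
P = h_AU.conj().T @ np.linalg.inv(h_AU @ h_AU.conj().T)       # zero-forcing precoder

theta = rng.choice(np.pi * np.arange(1, 4) / 4, size=N)       # discrete phases, D = 4 assumed
Q = np.diag(np.exp(1j * theta))                               # unit-amplitude REs

k = 0
# Eq. (5): SINR of the good user in cluster k
sig = rho * np.abs(h_AU[k] @ P[:, k]) ** 2 * alpha1
intf = rho * sum(np.abs(h_AU[k] @ P[:, i]) ** 2 for i in range(K) if i != k)
sinr_good = sig / (intf + sigma2)

# Eq. (6): SINR of the poor user through the cascaded channel h_RU Q H_AR
g = h_RU[k] @ Q @ H_AR
sig = rho * np.abs(g @ P[:, k]) ** 2 * alpha2
intra = rho * np.abs(g @ P[:, k]) ** 2 * alpha1               # strong user's signal, not cancelled at the weak user
inter = rho * sum(np.abs(g @ P[:, i]) ** 2 for i in range(K) if i != k)
sinr_poor = sig / (intra + inter + sigma2)
print(sinr_good, sinr_poor)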

It is nontrivial to obtain the optimal solution of P1 because:


• The objective function is non-concave with respect to each parameter.
• The objective function is a long-term metric.
• The constraint (7d) is non-convex.
The formulated maximization problem is a Markov decision process (MDP). On the one hand, the channel power gain h^{AU}_{k,1}(t) between the AP and the user is not known. On the other hand, the positions of the users follow a stochastic process, which makes it very challenging
to solve P1. To solve (P1), we propose a MOMA algorithm for the user partitioning problem
and a DDPG algorithm for the phase shifts design problem. The details of the structure of this
paper are presented in Fig. 2. RL is a machine learning algorithm that maximizes the long-term
sum reward. Note that the objective function in P1 is also a long-term maximization problem.
Therefore, we propose a RL based solution for P1 and we adopt the DDPG algorithm to design
the phase shifts of the RIS.

III. MODIFIED OBJECT MIGRATION AUTOMATION FOR USER PARTITIONING

In this section, we consider the fundamental problem of “user partitioning” in NOMA trans-
mission, which is a critical problem, because large numbers of mobile NOMA users cannot
be processed in a single resource block (RB), but are better serviced when partitioned into
orthogonal clusters. We partition the users into different clusters, where one beam is associated
with each cluster, while the NOMA strategy is implemented in each cluster. We assume that
for each beam, the bandwidth and power are the same, so we cluster the users into equal-
size groups. Different from the conventional object migration automation algorithm in [48], in which the objects are all of the same type, in our scenario there are two groups of users: one group of good users and another group of poor users. Therefore, we need to reject user pairs in which the two users come from the same group, i.e., two good users or two poor users, because the NOMA strategy is implemented for one good user and one poor user. To improve the user partitioning
performance, we cluster the users according to both the positions of the users and the channel
conditions. The position of the k-th user is denoted as z_k(t) = (x_k(t), y_k(t)). As in Section II, the channel of the k-th user is denoted as g_k(t).

Fig. 2. The proposed MOMA and DDPG algorithms for user clustering and phase shifter design of the RIS. The users are partitioned into equal-size clusters by the proposed MOMA algorithm. The REs of the RIS are controlled by the intelligent agent of the DDPG algorithm, which is capable of learning the optimal phase shifts through exploration and exploitation.

Firstly, due to the fact that the user position
and channel condition have different dimensions, we convert the expressions to be dimensionless
using zero-mean normalization.
For the K users, we denote their positions by {z1 (t) , z2 (t) , · · · , zK (t)}, where zk (t) =
(xk (t) , yk (t)). We use zero-mean normalization to normalize the X and Y data as follows.

x′_k(t) = (x_k(t) − u_x(t)) / σ_x(t),   y′_k(t) = (y_k(t) − u_y(t)) / σ_y(t),           (8)

where u_x and u_y denote the means of the X and Y data, respectively, and σ_x and σ_y denote the corresponding standard deviations. For the channel gain g_k(t) of user k, the normalized data is given as

g′_k(t) = (g_k(t) − u_g(t)) / σ_g(t),                                                   (9)

where u_g(t) and σ_g(t) denote the mean and standard deviation of the channel data. After the normalization process, we obtain the data of user k ∈ K as d_k(t) = [x′_k(t), y′_k(t), g′_k(t)]. The distance between users k and p is

dist(k, p) = sqrt( |d_k(t) − d_p(t)|² ) = sqrt( (x′_k(t) − x′_p(t))² + (y′_k(t) − y′_p(t))² + (g′_k(t) − g′_p(t))² ).      (10)
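The zero-mean normalization of (8)-(9) and the distance metric of (10) can be sketched as follows; the user positions and channel gains are randomly generated placeholders.

import numpy as np

rng = np.random.default_rng(2)
num_users = 8
xy = rng.uniform(0, 500, size=(num_users, 2))   # positions in the 500 m square (assumed)
g = rng.rayleigh(size=num_users)                # channel gains (assumed)

def zscore(v):
    # zero-mean normalization, Eqs. (8)-(9)
    return (v - v.mean()) / v.std()

d = np.column_stack([zscore(xy[:, 0]), zscore(xy[:, 1]), zscore(g)])

def dist(k, p):
    # dimensionless distance between users k and p, Eq. (10)
    return np.linalg.norm(d[k] - d[p])

print(dist(0, 1))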
Fig. 3. The MOMA based user clustering (number of users is 8). The users are divided into two groups. (a) presents the optimal user clustering where good user 1 and poor user 1 constitute a NOMA group. (b) shows the initial user clustering, which is not an optimal solution. (c) illustrates the converged user clustering of the proposed MOMA algorithm.

In the considered scenario, there are 2K users and the number of AP antennas is K. Therefore,
we partition the users into K groups. The optimal partition is denoted as Ω∗ , where Ω∗ =
{G∗1 , · · · , G∗K }. The objective of the MOMA algorithm is to divide the 2K users into the K
groups by means of a partition Ω that converges to Ω∗ as the learning process proceeds. The
elements of the proposed MOMA algorithm are given as:
• Actions (A): There are K actions in the proposed MOMA {α1 , · · · , αK }.
• States (S): For each action αk, there are N states in the proposed MOMA, denoted as {φ_{k1}, · · · , φ_{kN}}. Therefore, there are KN states in total, represented as {φ_1, · · · , φ_{KN}}. The automation states are separated into K groups, each with a state depth of N; for group i ∈ {1, 2, · · · , K}, the state indices range from (i − 1)N + 1 to iN.
• Reward (R): The reward is defined by the sum data rate in Eq. (7). When an action is rewarded, the automation moves one step towards the innermost state of that action; if the automation is already in one of the innermost states 1, N + 1, · · · , (K − 1)N + 1, it remains there. On receiving a penalty, the automation moves one step towards the boundary state of the action; if the automation is already in one of the boundary states N, 2N, · · · , KN, it moves to the boundary state of the subsequent action (mod K).

A. Time and Space Complexity of the Proposed MOMA

The time complexity of the proposed MOMA algorithm is equal to the number of queries, which is, in our case, equal to the number of users. Therefore, the time complexity of MOMA is O(2K). The space complexity of the proposed MOMA algorithm contains two parts: the memory locations to store the states of the users and the memory locations to store the sizes of the clusters. Thus, the space complexity of MOMA is O(2K + K) = O(3K). We also list the time and space complexities of the K-means algorithm in Table III as a benchmark. According to Table III, the time and space complexities of the proposed MOMA are lower than those of the K-means algorithm.

TABLE III
COMPLEXITY ANALYSIS OF THE INVESTIGATED ALGORITHMS

Algorithm    Time complexity    Space complexity
MOMA         O(2K)              O(3K)
K-means      O(2K²)             O(4K)

Algorithm 1 Modified object migration automation (MOMA) for User Clustering in NOMA Downlink Networks
Initialization:
1:  A set of users N = {n_1, · · · , n_N}.
2:  A stream of queries {⟨A_p, A_q⟩}.
3:  Initialize the user states {δ_i}.
4:  for a sequence of T queries do
5:      if δ_p div N = δ_q div N then
6:          if δ_p mod N ≠ 1 then δ_p = δ_p − 1 end if
7:          if δ_q mod N ≠ 1 then δ_q = δ_q − 1 end if
8:      else
9:          if δ_p mod N ≠ 0 and δ_q mod N ≠ 0 then
10:             δ_p = δ_p + 1, δ_q = δ_q + 1
11:         else if δ_p mod N ≠ 0 then
12:             δ_p = δ_p + 1
13:         else if δ_q mod N ≠ 0 then
14:             δ_q = δ_q + 1
15:         else
16:             temp = δ_p, δ_p = δ_q, l = index of an unaccessed user in the group of n_q closest to the boundary, δ_l = temp
17:         end if
18:     end if
19:     if δ_p and δ_q are in the same group then
20:         break
21:     end if
22: end for
Return: The K clusters of users; user i belongs to cluster k if its state δ_i ∈ [(k − 1)N + 1, kN].
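A minimal Python sketch of the reward and penalty transitions of the MOMA automation listed in Algorithm 1 is given below. The state depth, the toy query stream, and the simplified swap rule (any other user in the partner's group is swapped, rather than the unaccessed user closest to the boundary) are assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(3)
K, depth = 4, 10                                  # clusters and state depth per cluster (depth assumed)
num_users = 2 * K
# initial states: all users start at the boundary state of some group, with the pairings shifted
delta = np.array([g * depth for g in range(1, K + 1)] +
                 [(g % K + 1) * depth for g in range(1, K + 1)])

group = lambda d: (d - 1) // depth                # group index 0..K-1 of a state
at_inner = lambda d: (d - 1) % depth == 0         # innermost state of its group
at_bound = lambda d: d % depth == 0               # boundary (outermost) state of its group

def query(p, q):
    # process one query: users p and q were observed to belong together
    if group(delta[p]) == group(delta[q]):        # reward: move both towards the innermost state
        for u in (p, q):
            if not at_inner(delta[u]):
                delta[u] -= 1
    else:                                         # penalty: move both towards the boundary
        if not at_bound(delta[p]) and not at_bound(delta[q]):
            delta[p] += 1; delta[q] += 1
        elif not at_bound(delta[p]):
            delta[p] += 1
        elif not at_bound(delta[q]):
            delta[q] += 1
        else:                                     # both at the boundary: migrate p into q's group
            swap = next(u for u in range(num_users)
                        if u != q and group(delta[u]) == group(delta[q]))
            delta[p], delta[swap] = delta[swap], delta[p]

# toy query stream: good user k and poor user K + k should end up in the same cluster
for _ in range(200):
    k = rng.integers(K)
    query(k, K + k)
clusters = {c: [u for u in range(num_users) if group(delta[u]) == c] for c in range(K)}
print(clusters)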

IV. DDPG BASED RIS PHASE SHIFTER DESIGN

After computing the user partitioning, we adopt RL to obtain a long-term phase shifts solution
for the RIS. The “State Space”, “Action Space”, and “Reward Function” for the proposed DDPG
approach are given as follows:
• State Space : The state space is defined by the phase shift matrix of the RIS. For an RIS with N_RIS REs (so that the phase-shift matrix has size N_RIS × N_RIS), the state at time t is defined as

s(t) = diag( A_1 e^{jθ_1}, · · · , A_{N_RIS} e^{jθ_{N_RIS}} ),                          (11)

where A_1 = · · · = A_{N_RIS} = 1 indicates that the received signal is fully reflected by the RIS, and θ_1, · · · , θ_{N_RIS} ∈ { (1/D)π, · · · , ((D−1)/D)π }, the set of all possible discrete phase shift choices.
• Action Space : The action matrix has the same dimension as the state matrix. The function of the action matrix is to change the diagonal elements of the state matrix. For example, the action matrix at time t

a(t) = diag( 1, 0, · · · , 0 )                                                          (12)

changes the first diagonal element of the state matrix by θ_1 = θ_1 + (1/D)π. On the other hand, the action matrix

a(t) = diag( −1, 0, · · · , 0 )                                                         (13)

changes the first diagonal element of the state matrix by θ_1 = θ_1 − (1/D)π.
• Reward Function : The reward function is defined by the sum rate difference between states, i.e.,

r(s, a, s′) = R(s′) − R(s) if R(s′) > R(s), and r(s, a, s′) = 0 otherwise,              (14)

where R(s) = Σ_{k=1}^{K} Σ_{l=1}^{2} log₂( 1 + SINR_{k,l}(t) ) represents the data rate at state s. The state s corresponds to one time slot t, and s′ represents the next state reached after taking an action in state s.
• Exploration and exploitation : During the training phase, the action for each state is selected according to the ε-greedy, softmax and Exponential-weight (Exp3) strategies [49]. The ε-greedy based action selection is given by

a_n(t) = argmax_{ã ∈ {0,1,··· ,K}} Q(ã) with probability (1 − ε), and a random action with probability ε.        (15)

The softmax based action selection is given by

Pr( a_n(t) = a ) = e^{Q(a)/τ} / Σ_{ã ∈ {0,1,··· ,K}} e^{Q(ã)/τ},                        (16)
where τ is the “temperature”. High temperatures cause the actions to be equiprobable.
Low temperatures cause a greater difference in selection probability for actions that differ
in their value estimates.
The Exp3 based action selection is given by

Pr( a_n(t) = a ) = (1 − α) e^{βQ(a)} / Σ_{ã ∈ {0,1,··· ,K}} e^{βQ(ã)} + α/(K + 1),      (17)
where K denotes the number of actions, while α and β are the experience parameters.
The Exp3 strategy for action selection balances between the softmax and ǫ-greedy based
action selection strategies.
• Dynamic learning rate : Reducing the learning rate as the training progresses is an
efficient method for enhancing the RL training progress. In this paper, we compare the
following three learning rate decay strategies: iteration-based decay, step decay and ex-
ponential decay. For iteration-based decay, the learning rate decreases with the iterations
according to the relation

lr = lr0 / (1 + k·t),                                                                   (18)

where lr0 is the initial learning rate, k is a hyper-parameter and t is the iteration number. For the step decay strategy, the learning rate decreases by a factor every few steps according to

lr = lr0 · drop^(step / step_drop),                                                     (19)

where drop is the decay factor, step is the learning step of the agent, and step_drop is the number of steps for each decay. For the exponential decay strategy, the learning rate is given by

lr = lr0 · e^(−k·t).                                                                    (20)
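The three action-selection rules (15)-(17) and the three learning-rate schedules (18)-(20) can be sketched as follows. The Q-value vector and the hyper-parameter values are illustrative assumptions, and the Exp3 probabilities are renormalized so that they sum to one.

import numpy as np

rng = np.random.default_rng(4)
Q = rng.standard_normal(5)                 # Q-value estimates for 5 candidate actions (assumed)

def eps_greedy(Q, eps=0.1):
    # Eq. (15): exploit with probability 1 - eps, otherwise pick a random action
    if rng.random() < eps:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))

def softmax(Q, tau=1.0):
    # Eq. (16): a higher temperature tau makes the actions more nearly equiprobable
    p = np.exp(Q / tau)
    return int(rng.choice(len(Q), p=p / p.sum()))

def exp3(Q, alpha=0.1, beta=1.0):
    # Eq. (17): mixes a softmax-like term with a uniform exploration term
    p = (1 - alpha) * np.exp(beta * Q) / np.exp(beta * Q).sum() + alpha / (len(Q) + 1)
    return int(rng.choice(len(Q), p=p / p.sum()))   # renormalized

def lr_schedule(lr0, t, kind="exp", k=0.01, drop=0.5, step_drop=50):
    # Eqs. (18)-(20): iteration-based, step, and exponential decay
    if kind == "iter":
        return lr0 / (1 + k * t)
    if kind == "step":
        return lr0 * drop ** np.floor(t / step_drop)
    return lr0 * np.exp(-k * t)

print(eps_greedy(Q), softmax(Q), exp3(Q), lr_schedule(0.01, 100))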

In order to apply the RL algorithm for RIS-assisted networks, one key issue is to define the
qualified reward function for the agent, which defines the reward and the cost after taking each
action in the system. The reward and cost measure the quality of taking every action.
Based on the definition of the reward function, we reformulate the proposed problem as the
following long-term expected reward maximization problem:

max V_π(s) = E[ Σ_{t=0}^{+∞} γ^t r(t) | s_0 = s ].                                      (21)

The parameter γ ∈ [0, 1] denotes the discount factor, where γ → 0 means that the short-term
reward has more weight than the long-term reward, while γ → 1 means the opposite.
Value function approaches attempt to find a policy that maximizes the return by maintaining
a set of estimates of expected returns for some policy. Given a state s, an action a and a policy
π, the action-value of the pair (s, a) under π is defined by
Q^π(s, a) = E[ r | s, a, π ].                                                           (22)

Fig. 4. The framework of the proposed DDPG algorithm for RIS-aided NOMA downlink transmission.

Policy gradient (PG) algorithms are widely used in RL algorithms with continuous action
spaces. The basic idea is to represent the policy by a parametric probability πθ (a|s) = P (a|s; θ)
that stochastically selects action a in state s according to parameter vector θ. The goal of PG is
to derive a policy πθ (a|s) and obtain the best θ. We utilize the average reward per time-step to
measure the quality of a policy πθ
J(π_θ) = ∫_S ∫_A ρ^π(s) π_θ(a|s) R(a|s) da ds = E_{s∼ρ^π, a∼π_θ}[ R(a|s) ],             (23)

where E_{s∼ρ^π, a∼π_θ}[ R(a|s) ] represents the expected value with respect to the discounted state distribution ρ^π(s).
DDPG is an effective actor-critic, model-free RL algorithm for Markov decision processes [50] and obtains a long-term solution [51]. As shown in Fig. 4, there are four networks in the deep deterministic policy gradient (DDPG) structure; the function of each network is described below.

• Network 1 : Q-network
Input: mini-batches from the experience memory.
Output: Q( s(t), a(t) | θ^Q ). The Q-value function satisfies the Bellman equation

Q*(s, a) = E_{s′∼ε}[ r + γ max_{a′} Q*(s′, a′) | s, a ].                                (25)

We use neural networks to approximate the optimal Q-value function. During the training of the network, the loss function used is given as

L_i(θ_i) = E_{s,a∼ρ(·)}[ ( y_i − Q(s, a; θ_i) )² ],                                     (26)

where y_i is the Bellman target given as

y_i = E_{s′∼ε}[ r + γ max_{a′} Q*(s′, a′; θ_{i−1}) | s, a ].                            (27)

After calculating the loss function, the gradient update with respect to the parameters θ_i is given by Eq. (24):

∇_{θ_i} L_i(θ_i) = E_{s,a∼ρ(·); s′∼ε}[ ( r + γ max_{a′} Q*(s′, a′; θ_{i−1}) − Q(s, a; θ_i) ) ∇_{θ_i} Q(s, a; θ_i) ].        (24)
• Network 2 : Deterministic policy network
Input: Q values from the Q-network.
Output: µ( s(t) | θ^µ ).
In addition, we adopt the difference between the Q-value function and the value function as a baseline to reduce the variance of the expected reward. The advantage function is given as follows:

A^π(s, a) = Q^π(s, a) − V^π(s).                                                         (28)

Here, the gradient of Eq. (23) is given as

∇_θ J(θ) ≈ Σ_{t≥0} [ Q^π(s, a) − V^π(s) ] ∇_θ log π_θ( a_t | s_t ).                     (29)

• Network 3 : Target Q-network
Input: Q( s(t), a(t) | θ^Q ) and µ′( s(t + 1) | θ^{µ′} ).
Output: Q′( s(t + 1), µ′( s(t + 1) | θ^{µ′} ) | θ^{Q′} ).
In the target deterministic policy network, the agent chooses the action from the output of the network. The parameters of the target Q-network are updated by θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}.
• Network 4 : Target deterministic policy network
Input: µ( s(t) | θ^µ ).
Output: µ′( s(t + 1) | θ^{µ′} ) and the action selection policy. The parameters are updated by θ^{µ′} ← τθ^µ + (1 − τ)θ^{µ′}.
The framework of DDPG for RIS-assisted communication is shown in Fig. 4. The agent (i.e., the central intelligent controller) receives the reward r(t) and the new state s(t + 1) from the environment after taking an action a(t) in state s(t). Then, the transition tuple ⟨s(t), a(t), r(t), s(t + 1)⟩ is stored in the experience memory. A mini-batch of H transitions is sampled from the experience memory to train the Q-network using the chain rule as described above.
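For concreteness, a minimal PyTorch sketch of one DDPG update involving the four networks of Fig. 4 is given below: the critic loss of (26)-(27), a deterministic policy update standing in for (29), and the soft target updates θ′ ← τθ + (1 − τ)θ′. The hidden layer size (48), mini-batch size (256) and learning rate (0.01) follow Table IV, but the network architecture, the tanh action squashing and the random mini-batch are illustrative assumptions; the paper's own implementation was in MATLAB.

import torch
import torch.nn as nn

state_dim, action_dim, tau, gamma = 9, 9, 0.005, 0.99         # sizes and rates (tau assumed)

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 48), nn.ReLU(), nn.Linear(48, out))

actor, actor_targ = mlp(state_dim, action_dim), mlp(state_dim, action_dim)             # mu and mu'
critic, critic_targ = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)   # Q and Q'
actor_targ.load_state_dict(actor.state_dict())
critic_targ.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=0.01)
opt_c = torch.optim.Adam(critic.parameters(), lr=0.01)

def ddpg_step(s, a, r, s2):
    # one update from a sampled mini-batch (s, a, r, s2)
    with torch.no_grad():                                      # target y(t), cf. Eq. (27) and Algorithm 2
        a2 = torch.tanh(actor_targ(s2))
        y = r + gamma * critic_targ(torch.cat([s2, a2], dim=1))
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)                 # Eq. (26)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # policy update: ascend Q(s, mu(s)), the deterministic counterpart of Eq. (29)
    actor_loss = -critic(torch.cat([s, torch.tanh(actor(s))], dim=1)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    for net, targ in ((critic, critic_targ), (actor, actor_targ)):
        for p, pt in zip(net.parameters(), targ.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)           # soft target update

# toy mini-batch of H = 256 transitions (random placeholders)
H = 256
s, a = torch.randn(H, state_dim), torch.randn(H, action_dim)
r, s2 = torch.randn(H, 1), torch.randn(H, state_dim)
ddpg_step(s, a, r, s2)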

A. Gradient Descent for Value Function Approximation

The goal of RL is to find the relationship between all the states and actions, i.e., the optimal
θ in the DDPG algorithm. We adopt the stochastic gradient descent algorithm to approximate
the value function V (s), i.e., to minimise the mean-squared error (MSE) between the approximate value function V̂(s, w) and the true value function V(s), defined as follows:

MSE(w) = E_π[ ( V(s) − V̂(s, w) )² ].                                                   (30)

Gradient descent is utilized to find a local minimum, with the update given by

Δw = −(1/2) α ∇_w MSE(w) = α E_π[ ( V(s) − V̂(s, w) ) ∇_w V̂(s, w) ],                    (31)
where α is a step-size parameter which determines the learning step size of the gradient descent.
Different from the gradient descent, the stochastic gradient descent algorithm samples the
gradient, then represents the value function V (s) with a linear combination of feature vectors.
Hence, we can prove that the stochastic gradient descent algorithm is capable of converging to a global minimum.

Theorem 1. The stochastic gradient descent algorithm for value function approximation is capable of converging to a global optimum.
Proof: See Appendix A.

In Q-learning, the value function approximation evaluates the optimal state-action value function Q∗(s, a), which is given in Eq. (30). The parameter w is recursively estimated using stochastic gradient descent. Therefore, θ is updated as follows:

θ_{i+1} = θ_i − (α_i/2) ∇_{θ_i} ( Q*_i − Q̂_θ(s_i, a_i) )²
        = θ_i + α_i ∇_{θ_i} Q̂_θ(s_i, a_i) ( Q*_i − Q̂_θ(s_i, a_i) ).                    (32)
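A small sketch of the stochastic gradient descent update in (31)-(32) for a linear value-function approximation is given below; the feature map and the target values are synthetic assumptions used only to illustrate the convergence behaviour claimed in Theorem 1.

import numpy as np

rng = np.random.default_rng(6)
dim, alpha = 5, 0.05                      # feature dimension and step size (assumed)
w_true = rng.standard_normal(dim)         # defines a toy "true" linear value function
w = np.zeros(dim)                         # parameters of V_hat(s, w) = w . phi(s)

for i in range(2000):
    phi = rng.standard_normal(dim)        # feature vector phi(s) of a sampled state (assumed)
    v_target = w_true @ phi               # stands in for V(s) (or Q* in Eq. (32))
    v_hat = w @ phi
    # stochastic form of Eq. (31): w <- w + alpha * (V(s) - V_hat(s, w)) * grad_w V_hat(s, w)
    w += alpha * (v_target - v_hat) * phi

print(np.round(w - w_true, 3))            # approaches zero, consistent with Theorem 1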
Algorithm 2 DDPG Algorithm for Phase Shifters Design in RIS-aided NOMA Downlink
Initialization:
1:  The number of time slots T in one episode, the mini-batch size H, the learning rate τ, the replay memory U, the hidden layer size O.
2:  Randomly initialize the switch and phase of the REs in the RIS, and associate the users with the RIS randomly.
3:  Randomly initialize the parameters θ^Q and θ^µ of the Q-network Q( s(t), a(t) | θ^Q ) and the deterministic policy network µ( s(t) | θ^µ ). Then, initialize the parameters θ^{Q′} and θ^{µ′} of the target Q-network Q′( s(t + 1), µ′( s(t + 1) | θ^{µ′} ) | θ^{Q′} ) and the target deterministic policy network µ′( s(t + 1) | θ^{µ′} ).
4:  for each episode do
5:      Generate an initial action sample from a uniform random distribution over all actions.
6:      Execute the sampled action and obtain the initial state.
7:      for each time slot do
8:          Select the action a(t) = µ′( s(t) | θ^{µ′} ) + N(t) according to the current policy and exploration noise.
9:          Execute the selected action a(t) : {a_{k,m}(t), s_{m,n}(t), θ_{m,n}(t)} and observe the reward r(t) and the new state s(t + 1).
10:         Store the transition ( s(t), a(t), r(t), s(t + 1) ) in the experience memory.
11:         Sample a random mini-batch of H transitions ( s(t), a(t), r(t), s(t + 1) ) from the experience memory.
12:         Set y(t) = r(t) + γ Q′( s(t + 1), µ′( s(t + 1) | θ^{µ′} ) | θ^{Q′} ).
13:         Update the parameters θ^Q of the Q-network Q( s(t), a(t) | θ^Q ) by minimizing the loss function L(θ^Q) = (1/H) Σ_{t∈T} ( y(t) − Q( s(t), a(t) | θ^Q ) )².
14:         Update the parameters θ^µ of the deterministic policy network µ( s(t) | θ^µ ) using the policy gradient in Eq. (29).
15:         Update the parameters of the target Q-network: θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}.
16:         Update the parameters of the target policy network: θ^{µ′} ← τθ^µ + (1 − τ)θ^{µ′}.
17:     end for
18: end for

TABLE IV
SIMULATION PARAMETERS

Parameter   Description                 Value
K           Antennas at the AP          4
N           REs of the RIS              9
2K          Number of users             8
B           Bandwidth                   20 MHz
Pt          Transmit power of the AP    20 dBm
σ²          Noise power                 −138 dBm
α           Learning rate               0.01
ε           Exploration                 0.1
γ           Discount factor             0.99
H           Minibatch size              256
U           Replay memory               1000
O           Hidden layer size           48

V. NUMERICAL RESULTS

In this section, we present extensive simulation results to quantify the performance of the
proposed MOMA and DDPG algorithms for user clustering and phase shifter design in RIS-
aided NOMA downlink networks. We adopt small-scale Rayleigh fading between the AP and
users. The simulation parameters settings are as given in Table IV unless otherwise stated. We
consider the situation where the positions of the users and the AP are fixed. In our simulations,
the positions of users are randomly distributed within a square region with a side length of
500m. The number of users is set to 8, the AP location is in the centre of the square, and the
bandwidth is 20MHz. We compare our proposed algorithm with the following three conventional
downlink schemes: (1) “RIS-aided OMA downlink” where the RIS is applied for the OMA
downlink transmission. (2) “Fixed phase shifting RIS-aided NOMA downlink” where the phase
shifts of the RIS are fixed during the learning process. (3) “Random phase shifting RIS-aided
NOMA downlink” where the phase shifts of the RIS are random in each time slot during the
whole period. All the simulations are performed on a desktop with an Intel Core i7 9700K 3.6
GHz CPU and 16 GB memory. We use MATLAB R2019b to implement the proposed MOMA and RL algorithms.

A. Convergence of the Proposed MOMA Algorithm

Fig. 5 shows the number of users that are not correctly placed in the optimal user partitioning as the number of iterations of the learning process increases. According to Fig. 5, in the initial learning stage, a large number of users are partitioned into wrong clusters. Then, with more iterations, the number of users in wrong clusters is reduced.

Fig. 5. The number of users in MOMA clusters different from the optimal partitioning under an increasing number of iterations (the number of users is 8).

B. Performance of the Proposed DDPG Algorithm

Figs. 6 to 10 present the performance of the proposed DDPG algorithm for the RIS phase shifter design. Fig. 6 presents the sum reward versus the number of trials for different values of delta, where delta is the granularity of the action space for changing the phase shift of the RE. From Fig. 6, during the training phase, the sum reward increases with the number of trials until it converges to a stable value. This is because the intelligent agent is capable of learning the phase shifts in each trial and remembers the learning history. We can also see that the performance of the proposed algorithm can be improved by reducing the granularity of the phase shifts.
Fig. 6. The sum reward versus the number of trials for different delta (delta represents the granularity in the action space for changing the phase shifts of the RIS).

Fig. 7 shows that the sum data rate grows when increasing the AP transmit power. We calculate the optimal solution by exhaustive search, and the optimal solution for the 9 REs case is shown as a benchmark. Since the computational complexity of the exhaustive search increases exponentially with the number of REs, we only show the optimal solution up to the 9 REs case. Fig. 7 shows that the proposed algorithm outperforms random phase shifting of the RIS and achieves performance close to the optimal solution. Meanwhile, when we increase the number of REs in the RIS, the sum data rate grows rapidly, which indicates that increasing the number of REs is an efficient method to improve the performance of the proposed framework.

Fig. 7. The sum data rate versus the AP transmit power (optimal, proposed, and random solutions for different numbers of REs).
Fig. 8 shows that the proposed RIS-NOMA downlink transmission significantly outperforms the conventional RIS-OMA downlink transmission. As the transmit power increases, the performance gap between the NOMA and OMA strategies increases, while the sum data rate gap between different numbers of REs reduces.

Fig. 8. The sum data rate versus the AP transmit power (NOMA versus OMA for different numbers of REs).

Fig. 9. The impact of the transmit power on the sum rate (proposed, random, and fixed phase shifting).
Fig. 9 shows the sum rate versus increasing transmit power, where random phase shifting and fixed phase shifting are presented as benchmarks. According to Fig. 9, the proposed phase shifting scheme based on the DDPG algorithm outperforms the random phase shifting and fixed phase shifting schemes. Also, from the subwindow in Fig. 9, we observe that the sum rate increases with higher transmit power.

Fig. 10 shows the time complexity (as measured by the execution time in seconds) of the proposed algorithm versus the number of steps. The simulations are performed on a desktop with an Intel Core i7 9700K 3.6 GHz CPU and 16 GB memory, using MATLAB R2019b. According to Fig. 10, the time complexities of both NOMA and conventional OMA increase almost linearly with the number of steps. In addition, we can see from Fig. 10 that the time complexities of NOMA and OMA are very close, which demonstrates that the transmission strategy does not have a significant influence on the time complexity.

Fig. 10. The impact of different numbers of steps on the time complexity.

VI. CONCLUSION

In this paper, a RIS-aided NOMA downlink design framework was proposed. To maximize the sum data rate of the mobile users, a long-term joint optimization problem was formulated, subject to the decoding order of NOMA transmission. Two parameters, namely, the user partitioning and the RIS phase shifts, were optimized in the maximization problem. For user clustering, a MOMA algorithm was adopted due to its low complexity and fast convergence; the users are partitioned into equal-size clusters by the proposed MOMA algorithm. For RIS phase shifting in NOMA downlink networks, a DDPG algorithm based phase shifting design scheme was proposed, utilizing the strong fitting ability of neural networks. Both Q-networks and deterministic policy networks are applied in the proposed DDPG algorithm. The effectiveness of the proposed framework and algorithms was illustrated by numerical experiments. Numerical results demonstrate that the performance of the proposed framework can be improved by reducing the granularity of the RIS phase shifts and increasing the number of REs of the RIS.

APPENDIX A: PROOF OF THEOREM 1

In order to prove Theorem 1, we formulate the stochastic gradient descent based value function approximation (SGD-VFA) problem as follows:

min_{ω ∈ R} f(ω) = (1/N) Σ_{i=1}^{N} MSE_i(ω),                                          (33)
where N is the total number of iterations. The formulated SGD-VFA problem minimizes the total loss function over the approximation parameter ω of the approximate value function V̂(s, w).
To solve the above optimization problem, the SGD algorithm starts with an initial vector ω0
and generates a sequence {ωk } according to the following equation

ω_{k+1} = ω_k − (α/2) ∇MSE(ω_k),                                                        (34)
where α is the learning rate. Equation (34) can be rewritten as

ω_{k+1} = argmin_{u ∈ R} { MSE(ω_k) + ⟨u − ω_k, ∇MSE(ω_k)⟩ + (1/α) ‖u − ω_k‖² }.        (35)
The value function V(s) is linear; therefore, the objective function f(ω) in Eq. (33) is also linear. Thus, we have

α ( f(ω_{k+1}) − f(ω*) ) = α ⟨∇MSE(ω_k), ω_{k+1} − ω*⟩
                        = ⟨2(ω_k − ω_{k+1}), ω_{k+1} − ω*⟩
                        = 2 ( ⟨ω_k, ω_{k+1}⟩ − ⟨ω_k, ω*⟩ − ⟨ω_{k+1}, ω_{k+1}⟩ + ⟨ω_{k+1}, ω*⟩ )
                        = ‖ω_k − ω*‖² − ‖ω_{k+1} − ω*‖² − ‖ω_{k+1} − ω_k‖².              (36)

To avoid the vanishing of the gradient, the difference between f(ω_{k+1}) and f(ω*) should be smaller than the gradient; therefore, we have

f(ω_{k+1}) ≤ f(ω*) + (1/α) ( ‖ω_k − ω*‖² − ‖ω_{k+1} − ω*‖² − ‖ω_{k+1} − ω_k‖² ).        (37)
From Eq. (37) and Eq. (33) we obtain

MSE(ω_{k+1}) ≤ MSE(ω*) + (1/α) ( ‖ω_k − ω*‖² − ‖ω_{k+1} − ω*‖² − ‖ω_{k+1} − ω_k‖² )
            ≤ MSE(ω*) + (1/α) ( ‖ω_k − ω*‖² − ‖ω_{k+1} − ω*‖² ).                        (38)

In Eq. (38), since MSE(ω_{k+1}) ≥ MSE(ω*), we have ‖ω_k − ω*‖² ≥ ‖ω_{k+1} − ω*‖².
We prove Theorem 1 by contradiction. Consider ∀k ∈ [1, N] and assume that the limit of ω_k is η_k, i.e., ω_k → η_k; thus

‖ω_k − ω*‖² ≥ ‖ω_{k+1} − ω*‖² → ‖η_k − ω*‖².                                            (39)



Assume that η_k ∉ W*, where W* = ∩_{k=1}^{N} W_k* is the common set of global minimizers and W_k* is the set of global minimizers of MSE_k(ω) in Eq. (33). Then η_k ∉ W* implies that the limit of ω_k belongs to W_k*\W*, i.e., η_k ∈ W_k*\W*.
For u ≠ v, we have η_u ∈ W_u*\W* and η_v ∈ W_v*\W*. Therefore,

η_u ≠ η_v.                                                                              (40)

However, during the iterations, if u = v + 1, then

‖ω_v − η_u‖² ≥ ‖ω_u − η_u‖² ≥ ‖ω_{u+1} − η_u‖² ≥ · · · ≥ ‖ω_{u+m} − η_u‖² → 0 as m → ∞.        (41)


According to Eq. (41), we obtain ωv → ηu , therefore

ηu = ηv (42)

Clearly, Eq. (40) and Eq. (42) contradict each other. Thus, the assumption that η_k ∉ W* is not true. Hence, the sequence {ω_k} converges to a global minimizer.
The proof is complete.

REFERENCES

[1] Z. Yang, Y. Liu, Y. Chen, and J. T. Zhou, “Deep reinforcement learning for RIS-aided non-orthogonal multiple access

downlink networks,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), Taipei, 2020.

[2] Cisco, “Cisco Visual Networking Index (VNI): VNI Mobile Forecast Highlights Tool,” in White Paper, San Jose, CA,

USA, 2019.

[3] L. Li, T. J. Cui, W. Ji, and et al., “Electromagnetic reprogrammable coding-metasurface holograms,” Nature Commun.,

vol. 8, no. 1, p. 197, Aug. 2017.

[4] C. Liaskos, S. Nie, A. Tsioliaridou, A. Pitsillides, S. Ioannidis, and I. Akyildiz, “A new wireless communication paradigm

through software-controlled metasurfaces,” IEEE Commun. Mag., vol. 56, no. 9, pp. 162–169, Sep. 2018.

[5] R. J. Williams, E. D. Carvalho, and T. L. Marzetta, “A communication model for large intelligent surfaces,” in IEEE Proc.

of International Commun. Conf. Workshops (ICC Workshops), Dublin, Ireland, 2020.

[6] Z. Wang, L. Liu, and S. Cui, “Channel estimation for intelligent reflecting surface assisted multiuser communications:

Framework, algorithms, and analysis,” IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6607–6620, Oct. 2020.

[7] Y. Cui and H. Yin, “An efficient CSI acquisition method for intelligent reflecting surface-assisted mmWave networks,”

ArXiv, 2019. [Online]. Available: https://arxiv.org/abs/1912.12076

[8] S. Gong, X. Lu, D. T. Hoang, and et al., “Towards smart radio environment for wireless communications via intelligent

reflecting surfaces: A comprehensive survey,” ArXiv, 2019. [Online]. Available: https://arxiv.org/abs/1912.07794

[9] Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning-based mode selection and resource management for green

fog radio access networks,” IEEE Int. of Things, vol. 6, no. 2, pp. 1960–1971, Apr. 2019.

[10] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for distributed dynamic spectrum access,” IEEE

Trans. Wireless Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019.

[11] L. Huang, S. Bi, and Y. J. Zhang, “Deep reinforcement learning for online computation offloading in wireless powered

mobile-edge computing networks,” IEEE Trans. Mob. Comput., pp. 1–1, 2019.

[12] Y. Liu, Z. Qin, M. Elkashlan, Z. Ding, A. Nallanathan, and L. Hanzo, “Nonorthogonal multiple access for 5G and beyond,”

Proc. IEEE, vol. 105, no. 12, pp. 2347–2381, Dec. 2017.

[13] M. Vaezi, G. A. Aruma Baduge, Y. Liu, A. Arafa, F. Fang, and Z. Ding, “Interplay between NOMA and other emerging

technologies: A survey,” IEEE Trans. Cogni. Commun. Netw., vol. 5, no. 4, pp. 900–919, Dec. 2019.

[14] Z. Ding, Y. Liu, J. Choi, Q. Sun, M. Elkashlan, C. I, and H. V. Poor, “Application of non-orthogonal multiple access in

LTE and 5G networks,” IEEE Commun. Mag., vol. 55, no. 2, pp. 185–191, Feb. 2017.

[15] C. Xiao, J. Zeng, W. Ni, X. Su, R. P. Liu, T. Lv, and J. Wang, “Downlink MIMO-NOMA for ultra-reliable low-latency

communications,” IEEE J. Sel. Areas Commun., vol. 37, no. 4, pp. 780–794, Apr. 2019.

[16] Y. Liu, M. Elkashlan, Z. Ding, and G. K. Karagiannidis, “Fairness of user clustering in MIMO non-orthogonal multiple

access systems,” IEEE Commun. Lett., vol. 20, no. 7, pp. 1465–1468, Jul. 2016.

[17] Z. Ding, P. Fan, and H. V. Poor, “Impact of user pairing on 5G nonorthogonal multiple-access downlink transmissions,”

IEEE Trans. Veh. Technol., vol. 65, no. 8, pp. 6010–6023, Aug. 2016.

[18] Z. Wei, D. W. K. Ng, J. Yuan, and H. Wang, “Optimal resource allocation for power-efficient MC-NOMA with imperfect

channel state information,” IEEE Trans. Commun., vol. 65, no. 9, pp. 3944–3961, Sep. 2017.

[19] L. Du, J. Ma, Q. Liang, and Y. Tang, “Multiple antenna multicast transmission assisted by reconfigurable intelligent

surfaces,” ArXiv, 2019. [Online]. Available: https://arxiv.org/abs/1912.07960

[20] Z. Yang, W. Xu, C. Huang, J. Shi, and M. Shikh-Bahaei, “Beamforming design for multiuser transmission through

reconfigurable intelligent surface,” IEEE Trans. Commun., pp. 1–1, 2020.

[21] C. Huang, R. Mo, and C. Yuen, “Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep

reinforcement learning,” IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1839–1850, Aug. 2020.

[22] X. Yu, D. Xu, Y. Sun, D. W. K. Ng, and R. Schober, “Robust and secure wireless communications via intelligent reflecting

surfaces,” IEEE J. Sel. Areas Commun., vol. 38, no. 11, pp. 2637–2652, Nov. 2020.

[23] D. Xu, X. Yu, Y. Sun, D. W. K. Ng, and R. Schober, “Resource allocation for IRS-assisted full-duplex cognitive radio

systems,” IEEE Trans. Commun., vol. 68, no. 12, pp. 7376–7394, Dec. 2020.

[24] Z. Yang, M. Chen, W. Saad, et al., “Energy-efficient wireless communications with distributed reconfigurable

intelligent surfaces,” ArXiv, 2020. [Online]. Available: https://arxiv.org/abs/2005.00269

[25] H. Shen, W. Xu, S. Gong, Z. He, and C. Zhao, “Secrecy rate maximization for intelligent reflecting surface assisted

multi-antenna communications,” IEEE Commun. Lett., vol. 23, no. 9, pp. 1488–1492, Sep. 2019.

[26] Y. Han, W. Tang, S. Jin, C. Wen, and X. Ma, “Large intelligent surface-assisted wireless communication exploiting statistical

CSI,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8238–8242, Aug. 2019.

[27] H. Wang, Z. Zhang, B. Zhu, et al., “Performance of wireless optical communication with reconfigurable intelligent

surfaces and random obstacles,” ArXiv, 2020. [Online]. Available: https://arxiv.org/abs/2001.05715

[28] Y. Cao and T. Lv, “Sum rate maximization for reconfigurable intelligent surface assisted device-to-device communications,”

ArXiv, 2020. [Online]. Available: https://arxiv.org/abs/2001.03344

[29] J. Zhu, Y. Huang, J. Wang, K. Navaie, and Z. Ding, “Power efficient IRS-assisted NOMA,” IEEE Trans. Commun., pp.

1–1, 2020.

[30] J. Yuan, Y. C. Liang, J. Joung, G. Feng, and E. G. Larsson, “Intelligent reflecting surface-assisted cognitive radio system,”

IEEE Trans. Commun., pp. 1–1, 2020.

[31] K. Ying, Z. Gao, S. Lyu, Y. Wu, H. Wang, and M. Alouini, “GMD-based hybrid beamforming for large reconfigurable

intelligent surface assisted millimeter-wave massive MIMO,” IEEE Access, vol. 8, pp. 19530–19539, 2020.

[32] P. Wang, J. Fang, and H. Li, “Joint beamforming for intelligent reflecting surface-assisted millimeter wave

communications,” ArXiv, 2019. [Online]. Available: https://arxiv.org/abs/1910.08541

[33] Y. Cang, M. Chen, Z. Yang, M. Chen, and C. Huang, “Optimal resource allocation for multi-UAV assisted visible light

communication,” ArXiv, 2020. [Online]. Available: https://arxiv.org/abs/2012.13200

[34] B. Zheng and R. Zhang, “Intelligent reflecting surface-enhanced OFDM: Channel estimation and reflection optimization,”

IEEE Wireless Commun. Lett., vol. 9, no. 4, pp. 518–522, Apr. 2020.

[35] Z. Ding and H. Vincent Poor, “A simple design of IRS-NOMA transmission,” IEEE Commun. Lett., vol. 24, no. 5, pp.

1119–1123, May 2020.

[36] Y. Li, M. Jiang, Q. Zhang, and J. Qin, “Joint beamforming design in multi-cluster MISO NOMA intelligent reflecting

surface-aided downlink communication networks,” ArXiv, 2019. [Online]. Available: https://arxiv.org/abs/1909.06972

[37] X. Mu, Y. Liu, L. Guo, J. Lin, and N. Al-Dhahir, “Exploiting intelligent reflecting surfaces in multi-antenna aided

NOMA systems,” ArXiv, 2019. [Online]. Available: https://arxiv.org/abs/1910.13636

[38] M. Fu, Y. Zhou, and Y. Shi, “Intelligent reflecting surface for downlink non-orthogonal multiple access networks,” in Proc.

IEEE Global Commun. Conf. Workshops (GLOBECOM Workshops), Dec. 2019, pp. 1–6.

[39] Z. Yang, Y. Liu, Y. Chen, and G. Tyson, “Deep reinforcement learning in cache-aided MEC networks,” in Proc. IEEE

Int. Conf. Commun. (ICC), 2019, pp. 1–6.

[40] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems: Applications, trends, technologies, and open research

problems,” IEEE Network, pp. 1–9, 2019.

[41] S. Hu, F. Rusek, and O. Edfors, “Beyond massive MIMO: The potential of data transmission with large intelligent surfaces,”

IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2746–2758, May 2018.

[42] P. M. Narendra and K. Fukunaga, “A branch and bound algorithm for feature subset selection,” IEEE Trans. Comput., vol. C-26,

no. 9, pp. 917–922, Sep. 1977.

[43] S. Zhang and R. Zhang, “Capacity characterization for intelligent reflecting surface aided MIMO communication,” IEEE

J. Sel. Areas Commun., vol. 38, no. 8, pp. 1823–1838, Aug. 2020.

[44] X. Mu, Y. Liu, L. Guo, J. Lin, and N. Al-Dhahir, “Capacity and optimal resource allocation for IRS-assisted multi-user

communication systems,” ArXiv, 2020. [Online]. Available: https://arxiv.org/abs/2001.03913

[45] C. Huang, G. C. Alexandropoulos, C. Yuen, and M. Debbah, “Indoor signal focusing with deep learning designed

reconfigurable intelligent surfaces,” in Proc. IEEE 20th Int. Workshop Signal Process. Adv. Wireless Commun.

(SPAWC), Jul. 2019, pp. 1–5.

[46] M. Jung, W. Saad, and G. Kong, “Performance analysis of large intelligent surfaces (LISs): Uplink spectral efficiency

and pilot training,” ArXiv, 2019. [Online]. Available: https://arxiv.org/abs/1904.00453

[47] T. Cui, M. Qi, X. Wan, J. Zhao, and Q. Cheng, “Coding metamaterials, digital metamaterials and programmable

metamaterials,” Light: Science and Applications, vol. 3, no. 10, 2014.

[48] A. Shirvani and B. J. Oommen, “On enhancing the deadlock-preventing object migration automaton using the pursuit

paradigm,” Pattern Analysis and Applications, pp. 1–18, 2019.

[49] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, “Gambling in a rigged casino: The adversarial multi-armed bandit

problem,” in Proc. 36th Annual Symposium on Foundations of Computer Science (FOCS), Oct. 1995, pp. 322–331.

[50] Z. Zhang, Y. Yang, M. Hua, C. Li, Y. Huang, and L. Yang, “Proactive caching for vehicular multi-view 3D video streaming

via deep reinforcement learning,” IEEE Trans. Wireless Commun., vol. 18, no. 5, pp. 2693–2706, 2019.

[51] C. Qiu, Y. Hu, Y. Chen, and B. Zeng, “Deep deterministic policy gradient (DDPG)-based energy harvesting wireless

communications,” IEEE Internet Things J., vol. 6, no. 5, pp. 8577–8588, 2019.
