
Reconfigurable Intelligent Surface Assisted Multiuser MISO Systems Exploiting Deep Reinforcement Learning

Chongwen Huang, Member, IEEE, Ronghong Mo, and Chau Yuen, Senior Member, IEEE

Abstract— Recently, the reconfigurable intelligent surface (RIS), benefiting from the breakthrough in the fabrication of programmable meta-materials, has been speculated to be one of the key enabling technologies for future sixth generation (6G) wireless communication systems scaled up beyond massive multiple input multiple output (Massive-MIMO) technology to achieve smart radio environments. Employed as reflecting arrays, the RIS is able to assist MIMO transmissions without the need of radio frequency chains, resulting in a considerable reduction in power consumption. In this paper, we investigate the joint design of the transmit beamforming matrix at the base station and the phase shift matrix at the RIS, by leveraging recent advances in deep reinforcement learning (DRL). We first develop a DRL based algorithm, in which the joint design is obtained through trial-and-error interactions with the environment by observing predefined rewards, in the context of continuous state and action. Unlike most reported works utilizing alternating optimization techniques to alternately obtain the transmit beamforming and phase shifts, the proposed DRL based algorithm obtains the joint design simultaneously as the output of the DRL neural network. Simulation results show that the proposed algorithm is not only able to learn from the environment and gradually improve its behavior, but also obtains performance comparable to two state-of-the-art benchmarks. It is also observed that appropriate neural network parameter settings will improve significantly the performance and convergence rate of the proposed algorithm.

Index Terms— Reconfigurable intelligent surface, Massive MIMO, 6G, smart radio environment, beamforming matrix, phase shift matrix, deep reinforcement learning.

Manuscript received October 1, 2019; revised January 15, 2020; accepted February 17, 2020. Date of publication June 8, 2020; date of current version August 20, 2020. The work of Chongwen Huang and Chau Yuen was supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Industry Alignment Fund–Pre Positioning (IAF-PP) under Grant A19D6a0053. (Corresponding author: Ronghong Mo.) The authors are with the Engineering Product Development (EPD) Pillar, Singapore University of Technology and Design, 487372 Singapore (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/JSAC.2020.3000835

I. INTRODUCTION

RECENT years have witnessed the successful deployment of massive multiple input multiple output (Massive-MIMO) in fifth generation (5G) wireless communication systems, as a promising approach to support a massive number of users at high data rate, low latency and secure transmission simultaneously and efficiently [1]–[3]. However, implementing a Massive-MIMO base station (BS) is challenging, as high hardware cost, constrained physical size, and power consumption increased by many orders of magnitude over conventional MIMO systems arise when a conventional large-scale antenna array is used at the BS.

On the other hand, the reconfigurable intelligent surface (RIS), benefiting from the breakthrough in the fabrication of programmable meta-materials, has been speculated to be one of the key enabling technologies for future sixth generation (6G) wireless communication systems scaled up beyond Massive-MIMO to achieve smart radio environments [4]–[10]. The meta-material based RIS makes wideband antennas with compact size possible, such that large scale antennas can be easily deployed at both the user devices and the BS, to achieve Massive-MIMO gains but with a significant reduction in power consumption. With the help of varactor diodes or other micro electrical mechanical systems (MEMS) technology, the electromagnetic (EM) properties of the RIS are fully defined by its micro-structure, and can be programmed to vary the phase, amplitude, frequency and even orbital angular momentum of an EM wave, effectively modulating a radio signal without a mixer and radio frequency (RF) chain.

The RIS can be deployed as reconfigurable transmitters, receivers and passive reflecting arrays. As a reflecting array, the RIS is usually placed in between the BS and single antenna receivers, and consists of a vast number of nearly passive, low-cost and low energy consuming reflecting elements, each of which introduces a certain phase shift to the signals impinging on it. By reconfiguring the phase shifts of the elements of the RIS, the reflected signals can be added constructively at the desired receiver to enhance the received signal power, or destructively at non-intended receivers to reduce the co-channel interference. Due to the low power consumption, the reflecting RIS can be fabricated in a very compact size with light weight, leading to easy installation of the RIS on building facades, ceilings, moving trains, lamp poles, road signs, etc., as well as ready integration into existing communication systems with minor modifications on hardware [10]–[14].

Note that passive reflecting surfaces have been used in radar systems for many years. However, the phase shifts of passive radars cannot be adjusted once fabricated, and the signal propagation cannot be programmed through controlling the phase shifts of the antenna elements. The reflecting RIS also differs from relaying systems, in that the RIS reflecting array only alters the signal propagation by reconfiguring the constituent meta-atoms of the meta-surfaces of RISs, without RF chains and without additional thermal noise added during reflections, whereas the latter requires active RF components for signal reception and emission.


Consequently, the beamforming design in relay nodes is classified as active, while it is passive in reflecting RIS assisted systems.

A. Prior Works

Although the RIS has gained considerable attention in recent years, most of the reported works are primarily focused on implementing hardware testbeds, e.g., reflect-arrays and meta-surfaces, and on realizing point-to-point experimental tests [9], [10]. More recently, some works have attempted to optimize the performance of RIS-assisted MIMO systems. The optimal receiver and matched filter (MF) were investigated for uplink RIS assisted MIMO systems in [8], where the RIS is deployed as a MIMO receiver. An index modulation (IM) scheme exploiting the programmable nature of the RIS was proposed in [13], where it was shown that RIS-based IM enables high data rates with remarkably low error rates.

When RISs are utilized as reflecting arrays, the error performance achieved by a reflecting RIS assisted single antenna transmitter/receiver system was derived in [14]. A joint design of locally optimal transmit beamforming at the BS and phase shifts at the reflecting RIS with discrete entries was proposed in [15] for reflecting RIS assisted single-user multiple input single output (MISO) systems, by solving the transmit power minimization problem utilizing an alternating optimization technique. The received signal power maximization problem for MISO systems with a reflecting RIS was formulated and studied in [16] through the design of transmit beamforming and phase shifts employing efficient fixed point iteration and manifold optimization techniques. The authors in [17] derived a closed-form expression for the phase shifts for reflecting RIS assisted MISO systems when only statistical channel state information (CSI) is available. Compressive sensing based channel estimation was studied in [18] for reflecting RIS assisted MISO systems with a single antenna transmitter/receiver, and a deep learning based algorithm was proposed to obtain the phase shifts. In [19], [20], the transmit beamforming and the phase shifts were designed to maximize the secrecy rate for reflecting RIS assisted MIMO systems with only one legitimate receiver and one eavesdropper, employing various optimization techniques.

All the above mentioned works focus on single user MISO systems. As far as multiple users and massive access are concerned, the transmit beamforming and the phase shifts were studied in [21], [22], by solving the sum rate/energy efficiency maximization problem, assuming a zero-forcing (ZF) based algorithm employed at the BS, whereas stochastic gradient descent (SGD) search and sequential fractional programming are utilized to obtain the phase shifter. In [23], through minimizing the total transmit power while guaranteeing each user's signal-to-interference-plus-noise ratio (SINR) constraint, the transmit beamforming and phase shifts were obtained by utilizing semi-definite relaxation and alternating optimization techniques. In [24], the fractional programming method was used to find the transmit beamforming matrix, and three efficient algorithms were developed to optimize the phase shifts. In [25], large system analysis was exploited to derive a closed-form expression of the minimum SINR when only spatial correlation matrices of the RIS elements are available; the authors then targeted maximizing the minimum SINR by optimizing the phase shifts based on the derived expression. In [26], the weighted sum rate of all users in multi-cell MIMO settings was investigated, through jointly optimizing the transmit beamforming and the phase shifts subject to each BS's power constraint and the unit modulus constraint.

Recently, model-free artificial intelligence (AI) has emerged as an extraordinarily remarkable technology to address explosive mass data, mathematically intractable non-linear non-convex problems and high-computation issues [27]–[30]. Overwhelming interest in applying AI to the design and optimization of wireless communication systems has been witnessed recently, and it is a consensus that AI will be at the heart of future wireless communication systems (e.g., 6G and beyond) [31]–[40]. The AI technology is most appealing to large scale MIMO systems with a massive number of array elements, where optimization problems become non-trivial due to the extremely large dimension of the optimization involved. Particularly, deep learning (DL) has been used to obtain the beamforming matrix for MIMO systems by building a mapping relation between the channel information and the precoding design [34]–[37]. DL based approaches are able to significantly reduce the complexity and computation time utilizing offline prediction, but often require an exhaustive sample library for online training. Meanwhile, the deep reinforcement learning (DRL) technique, which embraces the advantage of DL in neural network training as well as improves the learning speed and the performance of reinforcement learning (RL) algorithms, has also been adopted in designing wireless communication systems [29], [32], [38]–[40].

DRL is particularly beneficial to wireless communication systems where radio channels vary over time. DRL allows wireless communication systems to learn and build knowledge about the radio channels without knowing the channel model and mobility pattern, leading to efficient algorithm designs that observe the rewards from the environment and find solutions of sophisticated optimization problems. In [38], the hybrid beamforming matrices at the BS were obtained by applying DRL, where the sum rate and the elements of the beamforming matrices are denoted as states and actions. In [40], the cell vectorization problem is cast as the optimal beamforming matrix selection to optimize network coverage, utilizing DRL to track the user distribution pattern. In [39], the joint design of beamforming, power control, and interference coordination was formulated as a non-convex optimization problem to maximize the SINR, solved by DRL.

B. Contributions

In this paper, we investigate the joint design of transmit beamforming at the BS and phase shifts at the reflecting RIS to maximize the sum rate of multiuser downlink MISO systems utilizing DRL, assuming that direct transmissions between the BS and the users are totally blocked.


This optimization problem is non-convex due to the multiuser interference, and the optimal solution is unknown. We develop a DRL based algorithm to find a feasible solution, without using sophisticated mathematical formulations and numerical optimization techniques. Specifically, we use the policy-based deep deterministic policy gradient (DDPG), derived from the Markov decision process, to address the continuous beamforming matrix and phase shifts [41]. The main contributions of this paper are summarized as follows:

• We propose a new joint design of transmit beamforming and phase shifts based on the recent advance in the DRL technique. This paper is a very early attempt to formulate a framework that incorporates the DRL technique into optimal designs for reflecting RIS assisted MIMO systems to address large-dimension optimization problems.

• The proposed DRL based algorithm has a very standard formulation and low complexity in implementation, without knowledge of an explicit model of the wireless environment and specific mathematical formulations, such that it is very easy to scale to various system settings. Moreover, in contrast to DL based algorithms which rely on sample labels obtained from mathematically formulated algorithms, DRL based algorithms are able to learn knowledge about the environment and adapt to the environment.

• Unlike reported works which utilize alternating optimization techniques to alternately obtain the transmit beamforming and the phase shifter, the proposed algorithm jointly obtains the transmit beamforming matrix and the phase shifts as one of the outputs of the DRL algorithm. Specifically, the sum rate is utilized as the instant reward to train the DRL based algorithm. The transmit beamforming matrix and the phase shifts are jointly obtained by gradually maximizing the sum rate through observing the reward and iteratively adjusting the parameters of the proposed DRL algorithm accordingly. Since the transmit beamforming matrix and the phase shifts are continuous, we resort to DDPG to develop our algorithm, in contrast to designs addressing a discrete action space.

Simulations show that the proposed algorithm is able to learn from the environment through observing the instant rewards and improve its behavior step by step to obtain the optimal transmit beamforming matrix and phase shifts. It is also observed that appropriate neural network parameter settings will increase significantly the performance and convergence rate of the proposed algorithm.

The rest of the paper is organized as follows. The system model is described in Section II. Preliminary knowledge of DRL is given in Section III, and the DRL based algorithm for the joint design of transmit beamforming and phase shifts is presented in Section IV. Simulation results are provided in Section V to verify the performance of the proposed algorithms, whereas conclusions are presented in Section VI.

The notations used in this paper are listed as follows. E denotes the statistical expectation. For any general matrix H, H(i, j) denotes the entry at the ith row and the jth column, and H^T and H^H represent the transpose and conjugate transpose of the matrix H, respectively. H^(t) is the value of H at time t. h_k is the kth column vector of H. Tr{·} is the trace of the enclosed item. For any column vector h (all vectors in this paper are column vectors), h(i) is the ith entry, while h_{k,n} is the nth channel vector for the kth user. ||h|| denotes the magnitude of the vector. |x| denotes the absolute value of a complex number x, and Re(x) and Im(x) denote its real part and imaginary part, respectively.

II. SYSTEM MODEL AND PROBLEM FORMULATION

We consider a MISO system comprised of a BS, one reflecting RIS and multiple users, as shown in Fig. 1. The BS has M antennas and communicates with K single antenna users, where M ≥ K. The reflecting RIS is equipped with N reflecting elements and one micro-controller. A number of K data streams are transmitted simultaneously from the M antennas of the BS, and each data stream is targeted at one of the K users. The signals first arrive at the reflecting RIS and are then reflected by the RIS. The direct signal transmissions between the BS and the users are assumed to be negligible. This is reasonable since in practice the reflecting RIS is generally deployed to overcome situations where severe signal blockage happens in between the BS and the users. The RIS functions as a reflecting array, equivalent to introducing phase shifts to the impinging signals. Being an intelligent surface, the reflecting RIS can be intelligently programmed to vary the phase shifts based on the wireless environment through electronic circuits integrated in the meta-surfaces. We assume that the channel matrix from the BS to the reflecting RIS, H_1 ∈ C^(N×M), and the channel vectors h_{k,2} ∈ C^(N×1) for all k, from the RIS to the K users, are perfectly known at both the BS and the RIS, with the aid of the transmission of pilot signals and feedback channels. It should be noted that obtaining CSI at the RIS is a challenging task, which definitely requires the RIS to have the capability to transmit and receive signals. However, this is indeed contradictory to the claim that the RIS does not need RF chains. One solution is to install RF chains dedicated to channel estimation. To this end, the system should be delicately designed to trade off the system performance and cost, which is beyond the scope of this paper.

Assume frequency flat channel fading. The signal received at the kth user is given as

  y_k = h_{k,2}^T Φ H_1 G x + w_k,   (1)

where y_k denotes the signal received at the kth user, and x is a column vector of dimension K × 1 consisting of the data streams transmitted to all the users, with zero mean unit variance entries, E[|x|²] = 1. G ∈ C^(M×K) is the beamforming matrix applied at the BS, while Φ ≜ diag[φ_1, φ_2, . . . , φ_N] ∈ C^(N×N) is the phase shift matrix applied at the reflecting RIS. w_k is the zero mean additive white Gaussian noise (AWGN) with entries of variance σ_n².

Note that Φ is a diagonal matrix whose entries are given by Φ(n, n) = φ_n = e^{jϕ_n}, where ϕ_n is the phase shift induced by each element of the RIS. Here we assume ideal reflection by the RIS such that the signal power is lossless at each reflection element, or |Φ(n, n)|² = 1.


Then, the reflection results in a phase shift of the impinging signals only. In this paper, we consider continuous phase shifts with ϕ_n ∈ [0, 2π) ∀n for the development of the DRL based algorithm.

Fig. 1. The considered RIS-based multi-user MISO system comprised of an M-antenna BS simultaneously serving K single-antenna users in the downlink. The RIS is equipped with N reflecting elements and one micro-controller, is attached to a building's facade, and the transmit signal propagates to the users via the RIS's assistance.

From (1), it can be seen that, compared to MISO relaying systems, reflecting RIS assisted MISO systems do not introduce AWGN at the RIS. This is because the RIS acts as a passive mirror, simply reflecting the signals incident on it without signal decoding and encoding. The phases of the signals impinging on the RIS are reconfigured through the micro-controller connected to the RIS. It is also clear that the signals arriving at the users experience the composite channel fading h_{k,2}^T Φ H_1. Compared to point-to-point wireless communications, this composite channel fading results in more severe signal loss if there is no signal compensation at the RIS.

To maintain the transmission power at the BS, the following constraint is considered,

  E[ tr{G x (G x)^H} ] ≤ P_t,   (2)

where P_t is the total transmission power allowed at the BS.

The received signal model (1) can be further written as

  y_k = h_{k,2}^T Φ H_1 g_k x_k + Σ_{n≠k} h_{k,2}^T Φ H_1 g_n x_n + w_k,   (3)

where g_k is the kth column vector of the matrix G. Without joint detection of the data streams of all users, the second term of (3) is treated as co-channel interference. The SINR at the kth user is given by

  ρ_k = |h_{k,2}^T Φ H_1 g_k|² / ( Σ_{n≠k} |h_{k,2}^T Φ H_1 g_n|² + σ_n² ).   (4)

In this paper, we adopt the ergodic sum rate, as given in (5), as the metric to evaluate the system performance,

  C(G, Φ, h_{k,2}, H_1) = Σ_{k=1}^{K} R_k,   (5)

where R_k is the data rate of the kth user, given by R_k = log₂(1 + ρ_k). Unlike the traditional beamforming design and phase shift optimization algorithms that require full up-to-date cross-cell channel state information (CSI) for RIS-based systems, our objective is to find the optimal G and Φ that maximize the sum rate C, leveraging the recent advance of the DRL technique under a given CSI. Unlike conventional deep neural networks (DNN), which need two phases, an offline training phase and an online learning phase, in our proposed DRL method each CSI realization is used to construct the state and the algorithm is run to obtain the two matrices continuously. The optimization problem can be formulated as

  max_{G,Φ}  C(G, Φ, h_{k,2}, H_1)
  s.t.  tr{G G^H} ≤ P_t,
        |φ_n| = 1, ∀n = 1, 2, . . . , N.   (6)

It can be seen that (6) is a non-convex, non-trivial optimization problem, due to the non-convex objective function and the constraints. Exhaustive search would have to be used to obtain the optimal solution if classical mathematical tools were utilized, which is impossible, particularly for large scale networks. Instead, in general, algorithms are developed to find suboptimal solutions employing alternating optimization techniques to maximize the objective function, where in each iteration a suboptimal G is solved by first fixing Φ [15]–[20], while a suboptimal Φ is derived by fixing G, until the algorithms converge. In this paper, rather than directly solving the challenging optimization problem mathematically, we formulate the sum rate optimization problem in the context of the advanced DRL method to obtain feasible G and Φ.
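To make the quantities in (1)–(6) concrete, the short NumPy sketch below evaluates the SINR in (4) and the sum rate in (5) for a given beamforming matrix G and phase-shift matrix Φ. It is only an illustrative helper under our own naming conventions (the paper provides no code); the RIS–user channels h_{k,2} are assumed to be stacked as the columns of H2.

```python
import numpy as np

def sum_rate(G, Phi, H1, H2, sigma2=1.0):
    """Sum rate C in (5) for G (M x K), Phi (N x N diagonal),
    H1 (N x M, BS-to-RIS) and H2 (N x K, column k is h_{k,2})."""
    K = G.shape[1]
    rate = 0.0
    for k in range(K):
        h_k = H2[:, k] @ Phi @ H1            # composite channel h_{k,2}^T Phi H1
        desired = np.abs(h_k @ G[:, k]) ** 2
        interference = sum(np.abs(h_k @ G[:, n]) ** 2 for n in range(K) if n != k)
        sinr = desired / (interference + sigma2)     # SINR rho_k in (4)
        rate += np.log2(1.0 + sinr)                  # R_k = log2(1 + rho_k)
    return rate
```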
III. PRELIMINARY KNOWLEDGE OF DRL

In this section, we briefly describe the background of DRL, which builds up the foundation for the proposed joint design of transmit beamforming and phase shifts.

A. Overview of DRL

In a typical RL, the agent gradually derives its best action through trial-and-error interactions with the environment over time, applying actions to the environment and observing the instant rewards and the transitions of the state of the environment, as shown in Fig. 2. There are a few basic elements used to fully characterize the RL learning process: the state, the action, the instant reward, the policy and the value function.

(1) State: a set of observations characterizing the environment. The state s^(t) ∈ S denotes the observation at time step t.

(2) Action: a set of choices. The agent takes one action step by step during the learning process. Once the agent takes an action a^(t) ∈ A at time instant t following a policy π, the state of the environment transits from the current state s^(t) to the next state s^(t+1). As a result, the agent gets a reward r^(t).

(3) Reward: a return. The agent wants to acquire a reward by taking action a given state s. It is also a performance metric r^(t) to evaluate how good the action a^(t) is given a state s^(t) at time instant t.


Fig. 2. (a) An illustration of deep Q-learning, where a double DNN is used to approximate the optimal state-action value and Q function. (b) Illustration of the proposed DNN.

(4) Policy: the policy π(s^(t), a^(t)) denotes the probability of taking action a^(t) conditioned on the state s^(t). Note that the policy function satisfies Σ_{a^(t)∈A} π(s^(t), a^(t)) = 1.

(5) State-action value function: the value of being in a state s and taking action a. The reward measures the immediate return from action a given state s, whereas the value function measures the potential future rewards which the agent may get from taking action a in state s.

(6) Experience: defined as (s^(t), a^(t), r^(t+1), s^(t+1)).

We adopt Q_π(s^(t), a^(t)) as the state-action value function. Given the state s^(t), the action a^(t), and the instant reward r^(t) at time t, the Q value function is given as

  Q_π(s^(t), a^(t)) = E_π[ R^(t) | s^(t) = s, a^(t) = a ],  with  R^(t) = Σ_{τ=0}^{∞} γ^τ r^(t+τ+1),   (7)

where γ ∈ (0, 1] is the discount rate. The Q function is a metric to evaluate the impact of the choice of action on the expected future cumulative discounted reward achieved by the learning process, with the choice of action a^(t) under the policy π.

The Q function satisfies the Bellman equation given by

  Q_π(s^(t), a^(t)) = E_π[ r^(t+1) | s^(t) = s, a^(t) = a ] + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} π(s', a') Q_π(s', a'),   (8)

where P^a_{ss'} = Pr(s^(t+1) = s' | s^(t) = s, a^(t) = a) is the transition probability from state s to state s' when action a is taken.

The Q-learning algorithm searches for the optimal policy π*. From (8), the optimal Q function associated with the optimal policy becomes

  Q*(s^(t), a^(t)) = r^(t+1)(s^(t) = s, a^(t), π = π*) + γ Σ_{s'∈S} P^a_{ss'} max_{a'∈A} Q*(s', a').   (9)

The Bellman equation (9) can be solved recursively to obtain the optimal Q*(s^(t), a^(t)), without knowledge of the exact reward model and the state transition model. The update on the Q function is given as

  Q*(s^(t), a^(t)) ← (1 − α) Q*(s^(t), a^(t)) + α ( r^(t+1) + γ max_{a'} Q_π(s^(t+1), a') ),   (10)

where α is the learning rate for the update of the Q function. If Q(s^(t), a^(t)) is updated at every time instant, it will converge to the optimal state-action value function Q*(s^(t), a^(t)). However, this is not easily achieved, particularly with large dimension state and action spaces. Instead, function approximation is usually used to address problems with enormous state/action spaces. Popular function approximators include feature representations, neural networks and functions directly relating value functions to state variables. Rather than utilizing explicit mathematical modeling, the DNN approximates the state/action value function, the policy function and the system model as a composition of many non-linear functions, as shown in Fig. 2, where both the Q function and the action are approximated by DNNs. However, the neural network based approximation does not give any interpretation, and the resulting DRL based algorithm might also yield a local optimum due to sample correlation and non-stationary targets.

One key issue of using a neural network as the Q function approximation is that the states are highly correlated in the time domain, which reduces the randomness of the states since they are all extracted from the same episode. The experience replay, which is a buffer window consisting of the last few states, can considerably improve the performance of DRL. Instead of updating from the last state, the DNN updates from a batch of states randomly sampled from the experience replay.
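As a toy illustration of the tabular update rule (10), before any function approximation is introduced, the snippet below applies one Q-learning step to a dictionary-based Q table; the integer state/action encoding is purely hypothetical.

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One application of (10): Q(s,a) <- (1-alpha)Q(s,a) + alpha(r + gamma max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# toy usage: states and actions are small integers here
Q = defaultdict(float)
q_learning_step(Q, s=0, a=1, r=0.5, s_next=2, actions=[0, 1])
```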
With DRL, the Q value function is completely determined by a parameter vector θ,

  Q(s^(t), a^(t)) ≜ Q(θ | s^(t), a^(t)),   (11)

where θ is equivalent to the weighting and bias parameters in the neural network.

Rather than updating the Q function directly as in (8), with DRL the optimal Q value function can be approached by updating θ using stochastic optimization algorithms,

  θ^(t+1) = θ^(t) − μ ∇_θ L(θ),   (12)

where μ is the learning rate for the update of θ and ∇_θ L(θ) is the gradient of the loss function L(θ) with respect to θ.

The loss function is generally given as the difference between the value predicted by the neural network and the actual target value. However, since reinforcement learning is a process of learning to approach the optimal Q value function, the actual target value is not known. To address this problem, two neural networks with identical architecture are defined, the training neural network and the target neural network, whose value functions are respectively given by Q(θ^(train) | s^(t), a^(t)) and Q(θ^(target) | s^(t), a^(t)). The target neural network is synchronized to the training neural network at a predetermined frequency. The actual target value is estimated as

  y = r^(t+1) + γ max_{a'} Q(θ^(target) | s^(t+1), a').   (13)

The loss function is thus given by

  L(θ) = ( y − Q(θ^(train) | s^(t), a^(t)) )².   (14)

B. DDPG

As the proposed joint design of transmit beamforming and phase shifts is cast as a DRL optimization problem, the most challenging aspect of it is the continuous state space and action space. To address this issue, we explore the DDPG neural network to solve our optimization problem, as shown in Fig. 2. It can be seen that there are two DNNs in the DDPG neural network, the actor network and the critic network. The actor network takes the state as input and outputs the continuous action, which is in turn input to the critic network together with the state. The actor network is used to approximate the action, thus eliminating the need to find the action maximizing the Q value function given the next state, which involves non-convex optimization.

The updates on the training critic network are given as follows:

  θ_c^(t+1) = θ_c^(t) − μ_c ∇_{θ_c^(train)} L(θ_c^(train)),   (15)
  L(θ_c^(train)) = ( r^(t) + γ q(θ_c^(target) | s^(t+1), a') − q(θ_c^(train) | s^(t), a^(t)) )²,   (16)

where μ_c is the learning rate for the update of the training critic network, a' is the action output from the target actor network, and ∇_{θ_c^(train)} L(θ_c^(train)) denotes the gradient with respect to the training critic network parameters θ_c^(train). Here θ_c^(target) and θ_c^(train) denote the target and the training critic networks, in which the parameters of the target network are updated to those of the training network in certain time slots; the update of the target network is much slower than that of the training network. The update on the training actor network is given as

  θ_a^(t+1) = θ_a^(t) − μ_a ∇_a q(θ_c^(target) | s^(t), a) ∇_{θ_a^(train)} π(θ_a^(train) | s^(t)),   (17)

where μ_a is the learning rate for the update of the training actor network, and π(θ_a^(train) | s^(t)) denotes the training actor network with θ_a^(train) being the DNN parameters and s^(t) the given input. ∇_a q(θ_c^(target) | s^(t), a) is the gradient of the target critic network with respect to the action, whereas ∇_{θ_a^(train)} π(θ_a^(train) | s^(t)) is the gradient of the training actor network with respect to its parameters θ_a^(train). It can be seen from (17) that the update of the training actor network is affected by the target critic network through the gradient of the target critic network with respect to the action, which ensures that the next selection of the action is in the favorable direction of actions to optimize the Q value function.

The updates on the target critic network and the target actor network are given as follows, respectively,

  θ_c^(target) ← τ_c θ_c^(train) + (1 − τ_c) θ_c^(target),
  θ_a^(target) ← τ_a θ_a^(train) + (1 − τ_a) θ_a^(target),   (18)

where τ_c and τ_a are the learning rates for updating the target critic network and the target actor network, respectively.
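The critic regression in (15)–(16), the actor step in (17) and the soft target updates in (18) map naturally onto a few lines of PyTorch. The sketch below is one possible realization under the assumption that actor, critic and their target copies are nn.Module objects with call signatures actor(s) and critic(s, a); all names are ours. Following (17), the actor step is driven by the gradient of the target critic with respect to the action (many generic DDPG implementations use the training critic at this point instead).

```python
import torch
import torch.nn.functional as F

def ddpg_step(actor, critic, target_actor, target_critic,
              actor_opt, critic_opt, batch, gamma=0.99, tau=1e-3):
    """One DDPG update in the spirit of (15)-(18) on a sampled mini-batch."""
    s, a, r, s_next = batch                       # tensors of shape (W, ...)
    # critic target and squared TD error, cf. (16)
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # actor update, cf. (17): ascend Q(s, pi(s)) through the (target) critic
    actor_loss = -target_critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # soft updates of the target networks, cf. (18)
    for tgt, src in ((target_critic, critic), (target_actor, actor)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
    return critic_loss.item(), actor_loss.item()
```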

Fig. 3. Proposed DNN structure of the critic network and the actor network.

IV. DRL BASED JOINT DESIGN OF TRANSMIT BEAMFORMING AND PHASE SHIFTS

In this section, we present the proposed DRL based algorithm for the joint design of transmit beamforming and phase shifts, utilizing the DDPG neural network structure shown in Fig. 3. The DRL algorithm is driven by two DNNs, the state s, the action a and the instant reward r. First we introduce the structure of the proposed DNNs, followed by a detailed description of s, a, r and the algorithm.

A. Construction of DNN

The structures of the DNNs utilized in this paper are shown in Fig. 3. As can be seen, both the proposed critic network and actor network are fully connected deep neural networks.


The critic network and the actor network have the identical structure, comprised of one input layer, one output layer, and two hidden layers. The input and output dimensions of the critic network equal the cardinality of the state set together with the action set, and the Q value function, respectively. The input and output dimensions of the actor network are defined as the cardinality of the state and of the action, respectively. The number of neurons of the hidden layers depends on the number of users, the number of antennas at the BS and the number of elements at the RIS. In general, the number of neurons of the hidden layers must be larger than the input and the output dimensions. The action output from the actor network is input to hidden layer 2, to avoid Python implementation issues in the computation of ∇_a q(θ_c^(target) | s^(t), a).

Note that the correlation between entries of s will degrade the efficiency of using a neural network as a function approximator. To overcome this problem, prior to being input to both the critic and the actor network, the state s goes through a whitening process to remove the correlation between the entries of the state s.

In order to overcome the variation in the distribution of each layer's inputs resulting from the changes in the parameters of the previous layers, batch normalization is utilized at the hidden layers. Batch normalization allows for much higher learning rates and less careful initialization, and in some cases eliminates the need for dropout.

The activation function utilized here is tanh, in order to handle negative inputs. The optimizer used for both the training critic network and the training actor network is Adam with adaptive learning rates μ_c^(t) = λ_c μ_c^(t−1) and μ_a^(t) = λ_a μ_a^(t−1), where λ_c and λ_a are the decaying rates for the training critic network and the training actor network.

Note that G should satisfy the power constraint defined in (6). To implement this, a normalization layer is employed at the output of the actor network, such that tr{G G^H} = P_t. For Φ, |Φ(n, n)|² = 1 is maintained to ensure signal reflection without power consumption.
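A minimal PyTorch sketch of the two networks described above is given next: two hidden layers with batch normalization and tanh activations, with the action injected at the second hidden layer of the critic. The hidden-layer widths are illustrative assumptions only; the paper merely requires them to exceed the input/output dimensions and to scale with M, N and K. The transmit-power and unit-modulus normalization described above is assumed to be applied to the actor output outside of these modules.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """State -> continuous action of dimension D_a = 2MK + 2N."""
    def __init__(self, state_dim, action_dim, hidden=(400, 300)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden[0]), nn.BatchNorm1d(hidden[0]), nn.Tanh(),
            nn.Linear(hidden[0], hidden[1]), nn.BatchNorm1d(hidden[1]), nn.Tanh(),
            nn.Linear(hidden[1], action_dim), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """(State, action) -> scalar Q value; the action enters at hidden layer 2."""
    def __init__(self, state_dim, action_dim, hidden=(400, 300)):
        super().__init__()
        self.h1 = nn.Sequential(nn.Linear(state_dim, hidden[0]),
                                nn.BatchNorm1d(hidden[0]), nn.Tanh())
        self.h2 = nn.Sequential(nn.Linear(hidden[0] + action_dim, hidden[1]),
                                nn.Tanh())
        self.out = nn.Linear(hidden[1], 1)

    def forward(self, s, a):
        x = self.h1(s)
        x = self.h2(torch.cat([x, a], dim=1))
        return self.out(x)
```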


Algorithm 1 Joint Transmit Beamforming and Phase Shifts Design
Input: H_1, h_{k,2}, ∀k
Output: optimal action a = {G, Φ}, Q value function
Initialization: experience replay memory M with size D, training actor network parameter θ_a^(train), target actor network parameter θ_a^(target) = θ_a^(train), training critic network with parameter θ_c^(train), target critic network with parameter θ_c^(target) = θ_c^(train), transmit beamforming matrix G, phase shift matrix Φ
Do:
1:  for episode = 0, 1, 2, . . . , N − 1 do
2:    Collect and preprocess H_1^(n), h_{k,2}^(n), ∀k for the nth episode to obtain the first state s^(0)
3:    for t = 0, 1, 2, . . . , T − 1 do
4:      Obtain the action a^(t) = {G^(t), Φ^(t)} = π(θ_a^(train)) from the actor network
5:      Observe the new state s^(t+1) given action a^(t)
6:      Observe the instant reward r^(t+1)
7:      Store the experience (s^(t), a^(t), r^(t+1), s^(t+1)) in the replay memory
8:      Obtain the Q value function Q = q(θ_c^(train) | s^(t), a^(t)) from the critic network
9:      Sample a random mini-batch of size W of experiences from the replay memory M
10:     Construct the training critic network loss function L(θ_c^(train)) given by (16)
11:     Perform SGD on the training critic network to obtain ∇_{θ_c^(train)} L(θ_c^(train))
12:     Perform SGD on the target critic network to obtain ∇_a q(θ_c^(target) | s^(t), a)
13:     Perform SGD on the training actor network to obtain ∇_{θ_a^(train)} π(θ_a^(train) | s^(t))
14:     Update the training critic network θ_c^(train)
15:     Update the training actor network θ_a^(train)
16:     Every U steps, update the target critic network θ_c^(target)
17:     Every U steps, update the target actor network θ_a^(target)
18:     Set the input to the DNN as s^(t+1)
19:   end for
20: end for

B. Algorithm Description

Assume there exists a central controller, or agent, which is able to instantaneously collect the channel information H_1 and h_{k,2} ∀k. At time step t, given the channel information and the action G^(t−1) and Φ^(t−1) of the previous state, the agent constructs the state s^(t) for time step t following Section IV-B.1.

At the beginning of the algorithm, the experience replay buffer M, the critic network and actor network parameters θ_c^(train) and θ_a^(train), and the action G and Φ need to be initialized. In this paper, we simply adopt the identity matrix to initialize G and Φ.

The algorithm is run over N episodes and each episode iterates T steps. For each episode, the algorithm terminates whenever it converges or reaches the maximum number of allowable steps. The optimal G_opt and Φ_opt are obtained as the action with the best instant reward. Note that the purpose of this algorithm is to obtain the optimal G and Φ utilizing DRL, rather than to train a neural network for online processing. The details of the proposed method are shown in Algorithm 1.

The construction of the state s, the action {G, Φ}, and the instant reward are described in detail as follows.

1) State: The state s^(t) at time step t is determined by the transmission power at the tth time step, the received power of the users at the tth time step, the action from the (t − 1)th time step, and the channel matrices H_1 and h_{k,2} ∀k. Since the neural network can only take real rather than complex numbers as input, in the construction of the state s, if a complex number is involved, its real part and imaginary part are separated as independent input ports. Given transmit symbols with unit variance, the transmission power for the kth user is given by ||G_k||² = |Re{G_k^H G_k}|² + |Im{G_k^H G_k}|².


The first term is the contribution from the real part, whereas the second term is the contribution from the imaginary part, both of which are used as independent input ports to the critic network and the actor network. In total, there are 2K entries of the state s formed by the transmission power. Let h̃_{k,2} = h_{k,2}^T Φ H_1 G. The received power at the kth user contributed by the nth user is given as |h̃_{k,2}(n)|² = |Re{h̃_{k,2}(n)}|² + |Im{h̃_{k,2}(n)}|². Likewise, both the power contributed by the real part and that by the imaginary part are used as independent input ports to the critic network and the actor network. The total number of entries formed here is 2K². The real part and the imaginary part of each entry of H_1 and h_{k,2} ∀k are also used as entries of the state. The total number of entries of the state constructed from the action at the (t − 1)th step is given by 2MK + 2N, while the total number of entries from H_1 and h_{k,2} ∀k is 2NM + 2KN.

In summary, the dimension of the state space is D_s = 2K + 2K² + 2N + 2MK + 2NM + 2KN. The reason we differentiate the transmission power and the receiving power contributed by the real part and by the imaginary part is that both G and Φ are matrices with complex entries, and the transmission and receiving powers alone would result in information loss due to the absolute-value operator.
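The state assembly described above can be sketched in NumPy as follows; the ordering of the entries and the omission of the whitening step are simplifications of ours, but the entry counts match D_s = 2K + 2K² + 2N + 2MK + 2NM + 2KN.

```python
import numpy as np

def build_state(G, Phi, H1, H2):
    """Real-valued state of dimension D_s for beamformer G (M x K),
    phase shifts Phi (N x N diagonal), H1 (N x M) and H2 (N x K)."""
    phi = np.diag(Phi)                                   # N phase-shift entries
    gHg = np.einsum('mk,mk->k', G.conj(), G)             # G_k^H G_k per user
    Ht = H2.T @ Phi @ H1 @ G                             # row k is h~_{k,2}
    feats = [
        np.abs(gHg.real) ** 2, np.abs(gHg.imag) ** 2,            # 2K transmit-power entries
        Ht.real.reshape(-1) ** 2, Ht.imag.reshape(-1) ** 2,      # 2K^2 received-power entries
    ]
    for z in (G, phi, H1, H2):                           # previous action and channels
        feats += [z.real.reshape(-1), z.imag.reshape(-1)]
    return np.concatenate(feats)
```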
2) Action: The action is simply constructed from the transmit beamforming matrix G and the phase shift matrix Φ. Likewise, to tackle the real-input problem, G = Re{G} + Im{G} and Φ = Re{Φ} + Im{Φ} are separated into real part and imaginary part, both being entries of the action. The dimension of the action space is D_a = 2MK + 2N.
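Conversely, the real-valued actor output of dimension D_a = 2MK + 2N has to be mapped back to the complex pair {G, Φ} and projected onto the constraints of (6). A possible sketch (our own helper, mirroring the normalization layer described in Section IV-A):

```python
import numpy as np

def action_to_matrices(a, M, K, N, Pt):
    """Split the action vector into G and Phi and enforce tr{GG^H} <= Pt, |phi_n| = 1."""
    g_re, g_im, p_re, p_im = np.split(a, [M * K, 2 * M * K, 2 * M * K + N])
    G = (g_re + 1j * g_im).reshape(M, K)
    G *= np.sqrt(Pt / np.trace(G @ G.conj().T).real)     # power normalization
    phi = p_re + 1j * p_im
    phi /= np.maximum(np.abs(phi), 1e-12)                # unit-modulus projection
    return G, np.diag(phi)
```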
3) Reward: At the tth step of the DRL, the reward is determined as the sum rate capacity C(G^(t), Φ^(t), h_{k,2}, H_1), given the instantaneous channels H_1, h_{k,2}, ∀k and the action G^(t) and Φ^(t) obtained from the actor network.

V. NUMERICAL RESULTS AND ANALYSIS

In this section, we present the performance evaluation of the proposed DRL based algorithm. In the simulations, we randomly generate the channel matrices H_1 and h_{k,2}, ∀k following the Rayleigh distribution. We assume that the large scale path loss and the shadowing effects have been compensated. This is because the objective of this paper in its current format is to develop a framework for the optimal beamforming design and phase shift matrices by employing the advanced DRL technique. Once the framework is ready, the effects of the path loss, the shadowing effects, the distribution of users and the direct link from the BS to the users can be easily investigated, through scaling the DNNs and reconstructing the state, the action and the reward. All presented illustrations have been averaged over 500 independent realizations.
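For completeness, one realization of such Rayleigh-fading channels can be drawn as below; the unit-variance complex Gaussian normalization is our assumption, since the text does not state it explicitly.

```python
import numpy as np

def rayleigh_channels(M, N, K, rng=None):
    """i.i.d. CN(0, 1) entries: H1 is N x M (BS-to-RIS), H2 is N x K (column k is h_{k,2})."""
    rng = rng if rng is not None else np.random.default_rng()
    H1 = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    H2 = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
    return H1, H2
```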
TABLE I: HYPER-PARAMETERS DESCRIPTIONS

A. Setting and Benchmarks

The hyper-parameters used in the algorithm are shown in Table I. We select two state-of-the-art algorithms as benchmarks. These are the weighted minimum mean square error (WMMSE) algorithm [42], [43] and an iterative algorithm based on fractional programming (FP) with ZF beamforming [4]. In their generic forms, both algorithms require full up-to-date cross-cell CSI, and both are centralized and iterative in their original forms. The iterative FP algorithm with ZF beamforming used in this paper is formulated in [4, Algorithm 3]. Similarly, a detailed explanation and pseudo code of the WMMSE algorithm are given in [43, Algorithm 1]. The performance of the proposed DRL-based algorithm in comparison with these state-of-the-art benchmarks is illustrated in the following.

Fig. 4. Sum rate versus P_t for the proposed DRL-based algorithm in comparison with two benchmarks.

B. Comparisons With Benchmarks

We have evaluated the proposed DRL-based approach described in Algorithm 1 as well as the two benchmarks. Fig. 4 shows the sum rate versus the maximum transmit power P_t.

We consider two sets of system parameters, namely M = 32, N = 32, K = 32, and M = 8, N = 8, K = 8. It can be seen that our proposed DRL-based algorithm obtains sum-rate performance comparable to these state-of-the-art benchmarks (WMMSE and the FP optimization algorithm with ZF), and the sum rates increase with the transmit power P_t under all considered algorithms and scenarios.

To further verify our proposed algorithm in wider application scenarios, we perform another simulation, which compares the sum rate as a function of the number of RIS elements N, shown in Fig. 5 for P_t = 20dB, M = 64, K = 64. It is observed that the average sum rates increase with N, resulting from the increase in the sum power of the reflecting RIS as N increases. This is achieved at the cost of the complexity of implementing the RIS. It further indicates that our proposed algorithm is robust in the considered wider application scenarios, approaching the optimal performance.

Fig. 5. Sum rate as a function of the number of elements N for the proposed DRL-based algorithm as well as the two benchmarks, P_t = 20dB, M = 64, K = 64.

Fig. 6. Rewards as a function of time steps at P_t = 5dB and P_t = 20dB, respectively.

Fig. 7. Average rewards versus time steps under different P_t = {−10dB, 0dB, 10dB, 20dB, 30dB}.

C. Impact of P_t on DRL

To get a better understanding of our proposed DRL-based method, we investigate the impact of P_t on it, shown in Fig. 6, in which we considered two settings, P_t = 5dB and P_t = 20dB, with the rewards (instant rewards and average rewards) as a function of time steps. In the simulations, we use the following method to calculate the average rewards,

  average_reward(K_i) = (1 / K_i) Σ_{k=1}^{K_i} reward(k),  K_i = 1, 2, . . . , K,   (19)

where K is the maximum number of steps. It can be seen that the rewards converge as the time step t increases. Convergence is faster at low SNR (P_t = 5dB) than at high SNR (P_t = 20dB). The reason is that, with higher SNR, the dynamic range of the instant rewards is large, resulting in more fluctuations and worse convergence. These two figures also show that, starting from the identity matrices, the DRL based algorithm is able to learn from the environment and adjust G and Φ to approach optimal solutions. Furthermore, the average rewards versus time steps under different P_t = {−10dB, 0dB, 10dB, 20dB, 30dB} are shown in Fig. 7. It can be seen that the SNR has a significant effect on the convergence rate and performance, especially in the low SNR scenarios, i.e., below 10dB. When P_t ≥ 10dB, the performance gap is far smaller than that between P_t = 0dB and P_t = 10dB. In other words, the proposed DRL method is extremely sensitive to low SNR, although it then takes less time to achieve convergence.
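The running average in (19) can be computed incrementally, for example:

```python
def running_average(rewards):
    """average_reward(K_i) = (1/K_i) * sum_{k<=K_i} reward(k) for every K_i."""
    avg, history = 0.0, []
    for i, r in enumerate(rewards, start=1):
        avg += (r - avg) / i        # incremental form of the mean in (19)
        history.append(avg)
    return history
```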
D. Impact of System Settings

Similarly, we investigate the impact of the number of elements N on the performance of DRL, shown in Fig. 8, in which we considered the system settings N = {4, 10, 20, 30} with the rewards versus time steps.


Fig. 8. Average rewards versus time steps under different system parameter settings.

Compared with the transmit power, DRL is more robust to changes of the system settings. Specifically, with an increasing number of elements N, the average rewards also increase gradually as expected, but this does not increase the convergence time of the DRL method.

Fig. 9. Sum rate as a function of P_t under two scenarios.

Fig. 9 presents the average sum rate as a function of P_t. From this figure, we see that the average sum rate increases with P_t. As more transmit power is allocated to the BS, a higher average sum rate can be achieved by the proposed DRL based algorithm. This observation is aligned with that of conventional multiuser MISO systems. With the joint design of transmit beamforming and phase shifts, the co-channel interference of multiuser MISO systems can be efficiently reduced, resulting in the performance improvement with P_t.

Fig. 10. CDF of sum rate for various system settings.

In Fig. 10, we plot the cumulative distribution function (CDF) of the sum rate over different snapshots for different system settings. It is seen that the CDF curves confirm the observations from Fig. 9, where the average sum rates improve with the transmission power P_t and the number of RIS elements N.

Fig. 11. Average rewards versus steps under different learning rates, i.e., {0.01, 0.001, 0.0001, 0.00001}.

E. Impact of Learning and Decaying Rate

In our proposed DRL algorithm, we use constant learning and decaying rates for the critic and actor neural networks, and investigate their impact on the performance and convergence rate of the DRL-based method. Fig. 11 shows the average rewards versus time steps under different learning rates, i.e., {0.01, 0.001, 0.0001, 0.00001}. It can be seen that different learning rates have a great influence on the performance of the DRL algorithm. Specifically, the DRL with a 0.001 learning rate achieves the best performance, although it takes a longer time to converge compared with the 0.0001 and 0.00001 learning rates, while the large learning rate of 0.01 has the worst performance.


reported works utilizing alternating optimization techniques to


alternatively obtain the transmit beamforming and phase shifts,
the proposed DRL based algorithm obtains the joint design
simultaneously as the output of the DNNs. Simulation results
show that the proposed DRL algorithm is able to learn from
the environment through observing the instant rewards and
improve its behavior step by step to obtain the optimal transmit
beamforming matrix and phase shifts. It is also observed that,
appropriate neural network parameter settings will improve
significantly the performance and convergence rate of the
proposed algorithm.

R EFERENCES

[1] S. Yang and L. Hanzo, “Fifty years of MIMO detection: The road
to large-scale MIMOs,” IEEE Commun. Surveys Tuts., vol. 17, no. 4,
pp. 1941–1988, Sep. 2015.
Fig. 12. Average rewards versus steps under different decaying rates, [2] E. G. Larsson, F. Tufvesson, O. Edfors, and T. L. Marzetta, “Massive
i.e., {0.001, 0.0001, 0.00001, 0.00001}. MIMO for next generation wireless systems,” IEEE Commun. Mag.,
vol. 52, no. 2, pp. 186–195, Feb. 2014.
[3] F. Rusek et al., “Scaling up MIMO: Opportunities and challenges with
very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–46,
time to converge compared with 0.0001 and 0.00001 learning Jan. 2013.
rate, while the large learning rate as 0.01 has the worse [4] C. Huang, A. Zappone, G. C. Alexandropoulos, M. Debbah, and
C. Yuen, “Reconfigurable intelligent surfaces for energy efficiency in
performance. This is because that too large learning rate will wireless communication,” IEEE Trans. Wireless Commun., vol. 18, no. 8,
increases the oscillation that renders the performance drop pp. 4157–4170, Aug. 2019.
dramatically. To sum up, the learning rate should be selected [5] J. Zhao, “A survey of intelligent reflecting surfaces (IRSs): Towards 6G
wireless communication networks,” 2019, arXiv:1907.04789. [Online].
properly, neither too large nor too small. Fig. 12 compares Available: http://arxiv.org/abs/1907.04789
average rewards versus time steps under different decaying [6] C. Huang et al., “Holographic MIMO surfaces for 6G wireless net-
rates, i.e., {0.001, 0.0001, 0.00001, 0.00001}. It shares the works: Opportunities, challenges, and trends,” 2019, arXiv:1911.12296.
[Online]. Available: http://arxiv.org/abs/1911.12296
similar conclusion with the learning rate, but it exerts less [7] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J.-A. Zhang,
influence on the DRL’s performance and convergence rate. “The roadmap to 6G: AI empowered wireless networks,” IEEE Commun.
It can be seen that although 0.00001 decaying rate achieves Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019.
the best performance, the gap between them are narrowed [8] S. Hu, F. Rusek, and O. Edfors, “Beyond massive MIMO: The potential
of data transmission with large intelligent surfaces,” IEEE Trans. Signal
significantly. Process., vol. 66, no. 10, pp. 2746–2758, May 2018.
Finally, we also should point out that, the performance [9] T. J. Cui, M. Q. Qi, X. Wan, J. Zhao, and Q. Cheng, “Coding
of DRL based algorithms is very sensitive to initialization metamaterials, digital metamaterials and programmable metamaterials,”
Light, Sci. Appl., vol. 3, no. 10, p. e218, Oct. 2014.
of the DNN and the other hyper-parameters, i.e., minibatch [10] C. Liaskos, S. Nie, A. Tsioliaridou, A. Pitsillides, S. Ioannidis,
size, etc. The hyper-parameters need to be defined delicately and I. Akyildiz, “A new wireless communication paradigm through
under a given system setting, and the appropriate neural software-controlled metasurfaces,” IEEE Commun. Mag., vol. 56, no. 9,
pp. 162–169, Sep. 2018.
network hyper-parameters setting will improves significantly [11] S. Hu, F. Rusek, and O. Edfors, “Beyond massive MIMO: The potential
the performance of the proposed DRL algorithm as well as its of positioning with large intelligent surfaces,” IEEE Trans. Signal
convergence rate. Process., vol. 66, no. 7, pp. 1761–1774, Apr. 2018.
[12] C. Huang, G. C. Alexandropoulos, C. Yuen, and M. Debbah, “Indoor
signal focusing with deep learning designed reconfigurable intelligent
VI. C ONCLUSION surfaces,” in Proc. IEEE 20th Int. Workshop Signal Process. Adv.
Wireless Commun. (SPAWC), Cannes, France, Jul. 2019, pp. 1–5.
In this paper, a new joint design of transmit beamforming and phase shifts based on recent advances in the DRL technique was proposed, which formulates a framework that incorporates the DRL technique into the optimal design of reflecting RIS-assisted MIMO systems to address large-dimension optimization problems. The proposed DRL-based algorithm has a very standard formulation and low implementation complexity, and it does not require explicit mathematical models of the wireless system. It is therefore easy to scale to accommodate various system settings. Moreover, the proposed DRL-based algorithm is able to learn knowledge about the environment, and is robust to the environment, through trial-and-error interactions with the environment by observing predefined rewards.
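To make the trial-and-error interaction concrete, the following minimal sketch illustrates one observation-action-reward cycle for an RIS-assisted multiuser MISO setting. The random policy, all names, and the dimensions are illustrative assumptions only; they do not correspond to the proposed DNN-based agent or to the simulation parameters used in this paper.

```python
import numpy as np

# Minimal, self-contained sketch of a trial-and-error interaction: the agent
# observes the channel state, outputs a joint action (transmit beamforming plus
# RIS phase shifts), and receives the achieved sum rate as the predefined reward.
M, K, N = 4, 2, 16                      # BS antennas, users, RIS elements (example sizes)
rng = np.random.default_rng(0)

def crandn(*shape):
    """Complex Gaussian samples used as stand-in channel realizations."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def sum_rate(H_eff, G, noise_power=1.0):
    """Predefined reward: achievable sum rate under beamforming matrix G (M x K)."""
    rate = 0.0
    for k in range(K):
        signal = abs(H_eff[k] @ G[:, k]) ** 2
        interference = sum(abs(H_eff[k] @ G[:, j]) ** 2 for j in range(K) if j != k)
        rate += np.log2(1.0 + signal / (interference + noise_power))
    return rate

for episode in range(3):                                    # trial-and-error interactions
    H_d, H_r, H_t = crandn(K, M), crandn(K, N), crandn(N, M)  # observed state (CSI)
    theta = rng.uniform(0.0, 2 * np.pi, N)                  # action part 1: RIS phase shifts
    G = crandn(M, K)
    G /= np.linalg.norm(G)                                  # action part 2: beamforming, unit power
    H_eff = H_d + H_r @ np.diag(np.exp(1j * theta)) @ H_t   # effective channel through the RIS
    reward = sum_rate(H_eff, G)                             # reward observed from the environment
    print(f"episode {episode}: sum rate = {reward:.3f} bit/s/Hz")
```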

Chongwen Huang (Member, IEEE) received the B.Sc. degree from Nankai University in 2010, the M.Sc. degree from the University of Electronic Science and Technology of China (UESTC), Chengdu, in 2013, and the Ph.D. degree from the Singapore University of Technology and Design (SUTD), Singapore. Before this, he joined the Institute of Electronics, Chinese Academy of Sciences (IECAS), Beijing, as a Research Engineer in July 2013. Since September 2015, he started his Ph.D. journey at the Singapore University of Technology and Design (SUTD), Singapore, and CentraleSupélec University, Paris, France, under the supervision of Prof. C. Yuen and Prof. M. Debbah. He has been a Research Fellow at SUTD since October 2019. His main research interests include 5G/6G technologies, deep learning theory for wireless communication, statistics and optimization for wireless communication and intelligent network systems, and millimeter wave communications and massive MIMO-NOMA for 5G and beyond. He was a recipient of the Singapore Government Ph.D. Scholarship. He received the Partenariats Hubert Curien Merlion Ph.D. Grant from 2016 to 2019, for studying at CentraleSupélec, France, and more than ten outstanding scholarships from China and industry, including the "Tang Lixin" Overseas Scholarship, the "Tang Lixin" Scholarship, the National Postgraduate Scholarships, and the National Second Prize for the National Undergraduate Electronic Design.
Ronghong Mo received the B.Sc. and M.Sc. degrees in physics from Zhongshan University and the Ph.D. degree in electrical engineering from the National University of Singapore (NUS). From September 2003 to August 2005, she was a Research Fellow at NUS, investigating novel synchronization algorithms for wireless OFDM systems. She joined Panasonic Singapore Laboratories Pte., Ltd., as a Research Engineer in September 2005, developing transmission technologies for wireless MIMO systems and contributing to LTE standardization. She was previously with the Institute for Infocomm Research, Singapore, and the Singapore University of Technology and Design. Since November 2019, she has been a Senior Research Scientist at the Singapore Institute of Technology. Her research interests include synchronization, channel estimation, MIMO relay network design, the application of game theory to wireless communication systems, and data analysis for cellular networks.
Chau Yuen (Senior Member, IEEE) received the B.Eng. and Ph.D. degrees from Nanyang Technological University (NTU), Singapore, in 2000 and 2004, respectively. He was a Post-Doctoral Fellow at Lucent Technologies Bell Labs, Murray Hill, in 2005, and a Visiting Assistant Professor at The Hong Kong Polytechnic University in 2008. From 2006 to 2010, he worked at the Institute for Infocomm Research (I2R), Singapore, as a Senior Research Engineer, where he was involved in an industrial project on developing an 802.11n wireless LAN system and participated actively in 3GPP Long Term Evolution (LTE) and LTE-Advanced (LTE-A) standardization. He joined the Singapore University of Technology and Design as an Assistant Professor in June 2010. He was a recipient of the Lee Kuan Yew Gold Medal, the Institution of Electrical Engineers Book Prize, the Institute of Engineering of Singapore Gold Medal, the Merck Sharp & Dohme Gold Medal, and the Hewlett Packard Prize twice. He received the IEEE Asia-Pacific Outstanding Young Researcher Award in 2012. He serves as an Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS and the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, and was the Top Associate Editor from 2009 to 2015.
