The Frontiers of Deep Reinforcement Learning for Resource Management in Future Wireless HetNets: Techniques, Challenges, and Research Directions
ABSTRACT Next generation wireless networks are expected to be extremely complex due to their
massive heterogeneity in terms of the types of network architectures they incorporate, the types and
numbers of smart IoT devices they serve, and the types of emerging applications they support. In such
large-scale and heterogeneous networks (HetNets), radio resource allocation and management (RRAM)
becomes one of the major challenges encountered during system design and deployment. In this context,
emerging Deep Reinforcement Learning (DRL) techniques are expected to be one of the main enabling
technologies to address RRAM in future wireless HetNets. In this paper, we conduct a systematic,
in-depth, and comprehensive survey of the applications of DRL techniques in RRAM for next generation
wireless networks. Towards this, we first overview the existing traditional RRAM methods and identify
their limitations that motivate the use of DRL techniques in RRAM. Then, we provide a comprehensive
review of the most widely used DRL algorithms to address RRAM problems, including the value- and
policy-based algorithms. The advantages, limitations, and use-cases for each algorithm are provided. We
then conduct a comprehensive and in-depth literature review and classify existing related works based on
both the radio resources they are addressing and the type of wireless networks they are investigating. To
this end, we carefully identify the types of DRL algorithms utilized in each related work, the elements
of these algorithms, and the main findings of each related work. Finally, we highlight important open
challenges and provide insights into several future research directions in the context of DRL-based RRAM.
This survey is intentionally designed to guide and stimulate more research endeavors towards building
efficient and fine-grained DRL-based RRAM schemes for future wireless networks.
INDEX TERMS Radio resource allocation and management, deep reinforcement learning, next generation
wireless networks, HetNets, power, bandwidth, rate, access control.
an unprecedented increase in their numbers and types of data-hungry applications they require/support [3], [6]. It is expected that by 2023, the number of user networked devices and connections, including smart-phones, tablets, wearable devices, and sensors, will reach 29.3 billion [6], and generate data traffic exceeding 50 trillion GB [1]. All these trends will exacerbate the burdens during system design, planning, deployment, operation, and management. In particular, RRAM will become crucial in such complex and large-scale networks in order to guarantee an enhanced communications experience.

RRAM plays a pivotal role during infrastructure planning, implementation, and resource optimization of modern wireless networks. Efficient RRAM solutions will guarantee enhanced network connectivity, increased system efficiency, and reduced energy consumption. The performance of wireless networks heavily relies on two aspects. First, how network radio resources are being utilized, managed, and orchestrated, including transmit power control, spectrum channel allocations, and user access control. Second, how efficiently the system can react to the rapid changes of network dynamics, including wireless channel statistics, users' mobility patterns, instantaneous radio resource availability, and variability in traffic loads. Efficient RRAM techniques must efficiently and dynamically account for such design aspects in order to ensure high network QoS and enhanced users' Quality of Experience (QoE).

Deep reinforcement learning (DRL) is a branch of artificial intelligence (AI) that enables network entities, such as base stations, user devices, edge servers, gateways, and access points, to continuously interact with the environment to make autonomous control decisions [7]–[14]. DRL techniques have attracted considerable research recently and demonstrated efficient performance in addressing complex wireless optimization problems, including RRAM problems. Therefore, experts expect DRL methods to be one of the main enabling technologies for future wireless networks due to their ability to overcome the limitations of traditional RRAM techniques [2], [15].

A. MOTIVATIONS OF THE PAPER
The main motivations of this work stem from three aspects. First, the paramount importance of allocating radio resources in future wireless networks. Second, the limitations and shortcomings of existing state-of-the-art RRAM techniques. Third, the robustness of deep reinforcement learning techniques in alleviating these limitations and providing efficient performance in the context of RRAM. Here we elaborate more on each aspect.

1) IMPORTANCE OF RRAM IN MODERN WIRELESS NETWORKS
The explosive growth in the number and types of modern smart devices, such as smartphones/tablets and wearable devices, has led to the emergence of disruptive wireless communications and networking technologies, such as 5G NR cellular networks, IoT networks, personal (or wireless body area) networks, device-to-device (D2D) communications, holographic imaging and haptic communications, and vehicular networks [3], [4], [16]–[23]. Such networks are envisaged to meet the stringent requirements of the emerging applications and services via supporting high data rates, coverage, and connectivity with significant enhancements in reliability, reduction in latency, and mitigation of energy consumption.

However, achieving this goal in such large-scale, versatile, and complex wireless networks is quite challenging, as it requires a judicious allocation and management of the networks' limited radio resources [24], [25]. In particular, efficient and more advanced RRAM solutions must be developed to balance the tradeoff between enhancing network performance while guaranteeing an efficient utilization of radio resources. Furthermore, efficient RRAM solutions must also strike an intelligent tradeoff between optimizing network radio resources and satisfying users' QoE. For example, RRAM techniques must jointly enhance network spectral efficiency (SE), energy efficiency (EE), and throughput while mitigating interference, reducing latency, and enhancing rate for user devices.

Efficient and advanced RRAM schemes can considerably enhance the system's SE compared to the traditional techniques by relying on advanced channel and/or source coding methods. RRAM is essential in broadcast wireless networks covering wide geographical areas as well as in modern cellular communication networks comprised of several adjacent and dense access points (APs) that typically share and reuse the same radio frequencies.

From a cost point of view, the deployment of wireless APs and sites, e.g., base stations (BSs), including the real estate costs, planning, maintenance, and energy, is the most critical aspect alongside the frequency license fees. Hence, the goal of RRAM is maximizing the network's SE in terms of bits/sec/Hz/area unit or Erlang/MHz/site, under some constraints related to user fairness. For instance, the service grade must meet a minimum acceptable level of QoS, including the coverage of certain geographical areas while mitigating network outages caused by interference, noise, large-scale fading (due to path losses and shadowing), and small-scale fading (due to multi-path). The service grade also depends on blocking caused by admission control, scheduling errors, or inability to meet certain QoS demands of edge devices (EDs).

2) WHERE DO TRADITIONAL RRAM TECHNIQUES FAIL?
Future wireless communication networks are complex due to their large-scale, versatile, and heterogeneous nature. To optimally allocate and manage radio resources in such networks, we typically formulate RRAM as complex optimization problems. The objective of such problems is to achieve a particular goal, such as maximizing network sum-rate, SE, and EE, given the available radio resources and QoS requirements of user devices. Unfortunately, the massive heterogeneity of modern networks poses tremendous challenges during the process of formulating optimization problems as well as applying conventional techniques to solve them, such as optimization, heuristic, and game theory algorithms.

The large-scale nature of next generation networks makes it quite difficult to formulate RRAM optimization problems, which are often intractable and non-convex. Also, conventional techniques used to solve the RRAM problems require complete or quasi-complete knowledge of the wireless environment, including accurate channel models and real-time channel state information (CSI). However, obtaining such information in a real-time fashion in these large-scale networks is quite difficult or even impossible. Furthermore, conventional techniques are often computationally expensive and incur considerable timing overhead. This renders them inefficient for most emerging time-sensitive applications, such as autonomous vehicles and robotics.

Moreover, game theory-based techniques are unsuitable for future heterogeneous networks (HetNets) as such techniques are devised for homogeneous players. Also, the explosive number of network APs and user devices will create extra burdens on game theory-based techniques. In particular, network players, such as BSs, APs, and user devices, need to exchange a tremendous amount of data and signaling. This will induce unmanageable overhead that largely increases delay, computation, and energy/memory consumption of network elements.

3) HOW CAN DRL OVERCOME THESE CHALLENGES AND PROVIDE EFFICIENT RRAM SOLUTIONS?
Emerging artificial intelligence (AI) techniques, such as deep reinforcement learning (DRL), have shown efficient performance in addressing various issues in modern wireless communication networks, including solving complex RRAM optimization problems [7]–[15]. In the context of RRAM, DRL methods are mainly used as an alternative to overcome the shortcomings and limitations of the conventional RRAM techniques discussed above. In particular, DRL techniques can solve complex network RRAM optimization problems and take judicious control decisions with only limited information about the network statistics. They achieve this by enabling network entities, such as BSs, RAN APs, edge servers (ESs), gateway nodes, and user devices, to make intelligent and autonomous control decisions, such as RRAM, user association, and RAN selection, in order to achieve various network goals such as sum-rate maximization, reliability enhancement, delay reduction, and SE/EE maximization. In addition, DRL techniques are model-free, which enables different network entities to learn optimal policies about the network, such as RRAM and user association, based on their continuous interactions with the wireless environment, without knowing the exact channel models or other network statistics a priori. These appealing features make DRL methods one of the main key enabling technologies to address the RRAM issue in modern wireless communication networks [2], [3].

B. RELATED WORK
There is a limited number of surveys that focus on the role of DRL in RRAM. Existing related surveys are listed in
Table 1. The table also summarizes the topics covered in these surveys along with a mapping to the relevant sections of this paper and a categorical discussion of the improvements and value-added in this paper relative to these surveys. In general, as reported in Table 1, these published surveys still have several research gaps that are addressed in this survey. We summarize them as follows.
• Some of the existing surveys focus on DRL applications in wireless communications and networking in general, without paying much attention to RRAM [10], [15]. For example, existing surveys cover topics related to DRL enabling technologies, use-cases, architectures, security, scheduling, clustering and data aggregation, traffic management, etc.
• Some of the published surveys focus on RRAM for wireless networks using ML and/or DL techniques without paying much attention to DRL techniques [1], [24], [26], [27]. For example, they consider ML techniques such as convolutional neural networks (CNN), recurrent neural networks (RNN), supervised learning, Bayesian learning, K-means clustering, Principal Component Analysis (PCA), etc.
• Even the surveys that address DRL for RRAM in wireless networks focus on specific wireless network types or applications [8], [9], [11], [12], [28], missing some of the recent research, not providing an adequate overview of the most widely used DRL algorithms for RRAM [12], or not covering RRAM in depth, but, rather, just covering a limited number of radio resources.
Hence, the role of this paper is to fill these research gaps and overcome these shortcomings. In particular, we provide a comprehensive survey on the application of DRL techniques in RRAM for next generation wireless communication networks. We have carefully cited up-to-date surveys and related research works. We should emphasize here that the scope of this paper is focused only on radio (or communication) resources, and no computation resources are included during the study and analysis. Fig. 2 shows the radio resources or issues addressed in this survey. However, computation resource aspects such as offloading, storage, task scheduling, caching, etc., can be found in other studies such as [29]–[33] and the references therein.
C. PAPER CONTRIBUTIONS
The main contributions of this paper are summarized as follows.
1) We provide a detailed discussion on the state-of-the-art techniques used for RRAM in wireless networks, including their types, shortcomings, and limitations that led to the adoption of DRL solutions.
2) We identify the most widely used DRL techniques utilized in RRAM of wireless networks and provide a comprehensive overview of them. The advantages, features, and limitations of each technique are discussed. Hence, the reader is provided with an in-depth knowledge of which DRL techniques should be leveraged for each RRAM problem under investigation.
3) We conduct an extensive and up-to-date literature review and classify the papers as reported in the literature based on the type of radio resources they address (as shown in Fig. 2) and the types of wireless networks, applications, and services they consider (as shown in Fig. 3). Specifically, for each paper reviewed, we identify the problem it addresses, the type of wireless network it investigates, the type of DRL model(s) it implements, the main elements of the DRL models (i.e., agent, state space, action space, and reward function), and its main findings. This provides the reader with in-depth technical knowledge of how to efficiently engineer DRL models for RRAM problems in wireless communications.
4) Based on the papers reviewed in this survey, we outline and identify some of the existing challenges and provide deep insights into some promising future research directions in the context of using DRL for RRAM in wireless networks.

FIGURE 5. The review protocol followed in this survey.

Fig. 4 shows the percentage of the related works, classified based on the types of radio resources discussed in each paper, Fig. 4 (a), and based on the types of wireless networks studied in each paper, Fig. 4 (b). This survey is designed by carefully following the review protocol illustrated in Fig. 5. Since this survey mainly focuses on deep reinforcement learning for RRAM in wireless networks, we included the following terms during the search stage along with "AND/OR" combinations of them: "deep reinforcement learning," "DRL," "resource allocation," "resource management," "power," "spectrum," "bandwidth," "access control," "user association," "network selection," "cell selection," "rate control," "joint resources," "wireless networks," "satellite networks," "cellular networks," and "heterogeneous networks." The number of papers found and the databases searched are detailed in Fig. 5. The inclusion criteria are papers that address the use of DRL techniques to manage and allocate the radio resources shown in Fig. 2 for the wireless networks shown in Fig. 3. The exclusion criteria are papers that: 1) address computation resources, e.g., task offloading, storage, scheduling, etc., 2) use conventional RRAM approaches, i.e., not using DRL techniques, 3) use ML/DL techniques, or 4) address non-wireless networks, e.g., wired networks, optical networks, etc. In Fig. 5, the number of papers excluded after a detailed check of the body is 71; these papers are directly related to our survey but are not influential, do not clearly identify the types of DRL algorithms used, the elements of the DRL models (i.e., agents, state space, action space, and reward function), or the type of wireless networks covered, and/or are not well written.

In general, the research questions that this survey aims to address are stated as follows. How can DRL techniques be implemented to address the RRAM problems in modern wireless networks? What are the performance advantages achieved when using DRL tools compared to the state-of-the-art RRAM approaches? What are the most effective and widely used DRL algorithms to address the RRAM problems, and how can they be implemented? What are the most important and influential papers that present DRL-based solutions for RRAM in next generation wireless networks? What are the challenges and possible research directions that stem from the reviewed papers in the context of using DRL for RRAM in wireless networks? The retrieved papers shown in Fig. 5, i.e., the 76 papers, are selected carefully to help with answering these questions, as we will elaborate in the next sections.

It is observed from Fig. 4 (a) that the majority of related works are on the Spectrum and Access Control radio resources, followed by both the Power radio resource and Joint radio resources. Also, as shown in Fig. 4 (b), the related works on the IoT and Other Emerging Wireless Networks have received more attention than the other wireless network types, followed by the Cellular Networks.

The rest of this paper is organized as follows. Table 2 lists the acronyms used in this paper and their definitions. Section II discusses existing RRAM techniques, including conventional methods and DRL-based methods. The definitions, types, and limitations of existing techniques are discussed. Also, the advantages of employing DRL techniques for RRAM are explained. Section III provides an overview of the DRL techniques widely employed for RRAM, including their types and architectures. In-depth classifications of the existing research works are provided in Section IV. Existing papers are classified based on the radio resources and the network types they cover. Section V provides key open challenges, lessons learned, and some insights for future research directions. Finally, Section VI concludes the paper. The organization of the paper is pictorially illustrated in Fig. 6.

II. RADIO RESOURCE ALLOCATION AND MANAGEMENT TECHNIQUES
In this section, we define the main radio resources of interest and provide a summary of the conventional techniques and tools used for RRAM in wireless networks. Also, the limitations of these conventional techniques that motivate the use of DRL solutions will be highlighted. Then we discuss how DRL techniques can be efficient alternatives to these traditional approaches.
A. RADIO RESOURCES: DEFINITIONS AND TYPES (OR ISSUES)
In general, allocation and management of wireless network resources include radio (i.e., communication) and computation resources. This paper focuses only on the RRAM issue. This involves strategies and algorithms used to control and manage wireless network parameters and resources, such as transmit power, spectrum allocation, user association/assignment, rate control, access control, etc. The main goal of wireless networks, in general, is to utilize and manage these available radio resources as efficiently as possible to provide enhanced network QoS, such as enhanced data rate, SE, EE, reliability, connectivity, and coverage, while meeting users' QoS demands.

Efficient RRAM schemes can considerably enhance the system's SE compared to the traditional techniques relying on advanced channel and/or source coding methods. For example, future wireless networks are expected to cover broad geographical areas with ultra-dense network (UDN) deployments. In these UDNs, a massive number of adjacent APs typically require sharing communication resources, such as radio frequencies and channels, to utilize resources and enhance network QoS. RRAM would be essential in such UDN-based network deployments [38].

The most crucial radio resources or issues that play a fundamental role in controlling wireless networks' performance are summarized below.
• Power resource: This is one of the most critical issues in the RRAM of modern HetNets. Transmit power allocation in the downlink/uplink from/to network APs, such as BSs and edge servers (ESs), is essential to guarantee a satisfactory QoS for communication links. Power control is essential from two perspectives: physical limitations and communication links. Practically, the maximum power is limited by the capability of APs' power amplifiers or by government regulation. Hence, it is common to incorporate the limited power resource as a constraint during the design and implementation of HetNets. On the other hand, power control is also needed to guarantee enhanced networks' QoS and user devices' QoE. For example, in large-scale and UDNs such as the mmWave and THz band systems [2], [3], [39], signal attenuation due to path
TABLE 4. List of the model-free DRL algorithms that are widely used in RRAM for modern wireless networks.
$$Q^{\pi}(s, a) = \mathbb{E}\left[\left.\sum_{t=0}^{\infty} \gamma^{t}\, r_t(s_t, a_t, s_{t+1}) \,\right|\, a_t \sim \pi(\cdot \mid s_t),\ s_0 = s,\ a_0 = a\right].$$
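To make the definition above concrete, the following is a minimal, hypothetical Python sketch (not taken from the surveyed papers) that estimates Q^π(s, a) by Monte Carlo rollouts: it averages the discounted return over several sampled trajectories that start from state s and action a. The `env.reset_to`, `env.step`, and `policy` callables are assumed placeholders for an environment and a stochastic policy.

```python
def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t for one sampled trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def mc_q_estimate(env, policy, s, a, gamma=0.99, episodes=100, horizon=200):
    """Monte Carlo estimate of Q^pi(s, a): average discounted return over
    rollouts that start in state s, take action a, then follow policy pi."""
    returns = []
    for _ in range(episodes):
        state = env.reset_to(s)          # hypothetical: start the rollout in state s
        action = a
        rewards = []
        for _ in range(horizon):         # truncate the infinite sum at a finite horizon
            state, reward, done = env.step(action)
            rewards.append(reward)
            if done:
                break
            action = policy(state)       # a_t ~ pi(. | s_t)
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)
```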
it needs more interaction with the environment. Second, the REINFORCE algorithm implicitly performs the exploration process, as it depends on the probabilities returned by the network, which incorporate uniform random agent behavior. Third, no target network is required in the REINFORCE method as the Q values are obtained from the experiences in the environment.

The disadvantage of the REINFORCE algorithm is that it suffers from high variance, meaning that any small shift in the return leads to a different policy. This limitation motivated the actor-critic algorithms.

2) ACTOR-CRITIC ALGORITHM
The actor-critic methods are mainly developed to enhance the convergence speed and stability (i.e., reducing the variance) of the policy-gradient method. Like the policy-based methods, it utilizes the accumulated discounted reward J to obtain the policy gradient, which provides the direction that enhances the policy. This algorithm learns a critic to reduce the variance of gradient estimates since it utilizes various samples, whereas the REINFORCE algorithm utilizes only a single sample trajectory.

To select the best action in any state, the total discounted reward of the action is used, i.e., $Q(s, a)$. The total reward can be decomposed into the state-value function $V(s)$ and the advantage function $A(s, a)$, i.e., as $Q(s, a) = V(s) + A(s, a)$. So, another DNN is utilized to estimate $V(s)$, which is trained based on the Bellman equation. The estimated $V(s)$ is then leveraged to obtain the policy gradient and update the policy network such that the probabilities of actions with good advantage values are increased. Hence, the actor is the policy network $\pi(a|s)$ that takes actions by returning the probability distribution of actions, while the critic network evaluates the quality of the taken actions, $V(s)$. This algorithm is also called the advantage actor-critic method (A2C).

In the A2C algorithm, the weights of the actor network $\theta_\pi$ and critic network $\theta_v$ are updated using the accumulated policy gradients $\partial\theta_\pi$ and value gradients $\partial\theta_v$, respectively, to move in the direction of the policy gradients and the opposite direction of the value gradients.

3) A3C ALGORITHM
The asynchronous advantage actor-critic (A3C) algorithm is an extension of the basic A2C [85]. This algorithm is used to solve the high variance issue in gradients that results in non-optimal policies. The A3C algorithm conducts a parallel implementation of the actor-critic algorithm, where the actor and critic share the network layers. A global NN is trained to output action probabilities and an estimate of the advantage function $A(s_t, a_t|\theta_\pi, \theta_v)$ given by $\sum_{i=0}^{k-1} \gamma^{i} r_{t+i} + \gamma^{k} V(s_{t+k}|\theta_v) - V(s_t|\theta_v)$, where $k$ depends on the state and is upper-bounded by the maximum number of time steps.

Several parallel actor learners are instantiated with copies of both the environment and global NN weights. Each learner independently interacts with its environment and gathers training transitions to derive the gradients with respect to its NN weights. Learners will then propagate their gradients to the global NN to update its weights. This mechanism ensures a periodic update of the global model with diverse transitions from each learner.

4) DEEP DETERMINISTIC POLICY GRADIENT (DDPG) ALGORITHM
DDPG is one of the most widely used DRL techniques in addressing RRAM problems for wireless networks characterized by their high dimensionality and continuous action space [86]. The DDPG algorithm belongs to the actor-critic family, and it combines both the Q-learning and policy gradient algorithms. It consists of actor and critic networks. The actor network takes the state as its input, and it outputs the exact "deterministic" action, not a probability distribution over actions as in the actor-critic algorithm. Whereas the critic is a Q-value network that takes both the state and action as inputs, and it outputs the Q-value as a single output.

The deterministic policy gradient (DPG) algorithm is proposed in [87] to overcome the limitation caused by the max operator in the Q-learning algorithm, i.e., $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$. It simultaneously learns both the Q-function and the policy. In particular, the DPG algorithm has a parameterized actor function $\mu(s|\theta^{\mu})$ with weights $\theta^{\mu}$, which learns the deterministic policy that gives the optimal action corresponding to $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$. The critic $Q(s, a)$ is learned via minimizing the Bellman loss function as in the Q-learning algorithm. The actor policy is updated using gradient ascent with respect to $\theta^{\mu}$ in order to solve the objective given by the following chain rule [87]:

$$J(\theta^{\mu}) = \mathbb{E}_{s \in D}\big[Q(s, \mu(s|\theta^{\mu}))\big],$$
$$\nabla_{\theta^{\mu}} J = \mathbb{E}_{s \in D}\Big[\nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)}\, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s=s_t}\Big].$$

The DDPG algorithm proposed in [86] is built based on the DPG algorithm, where both the policy and critic are DNNs, as shown in Fig. 10. The DDPG algorithm creates a copy of both the actor and critic networks, $Q'(s, a|\theta^{Q'})$ and $\mu'(s|\theta^{\mu'})$, respectively, to compute the target values. The weights of these target networks, $\theta^{Q'}$ and $\theta^{\mu'}$, are then updated to slowly track the weights of the learned networks to provide more stable training using $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ with $\tau \ll 1$. The critic network is updated to minimize the following Bellman loss:

$$L(\theta^{Q}) = \mathbb{E}_{s_t, a_t, r_t, s_{t+1} \in D}\Big[\big(r_t(s_t, a_t) + \gamma\, Q'\big(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'})\,\big|\,\theta^{Q'}\big) - Q(s_t, a_t|\theta^{Q})\big)^{2}\Big].$$

Note that the DDPG algorithm is off-policy, which means that we use a replay buffer $D$ to store training transitions. The exploration-exploitation issue is addressed by adding the Ornstein–Uhlenbeck (OU) process or some Gaussian noise $\mathcal{N}$ to the action selected by the policy, i.e., $\mu(s_t|\theta_t^{\mu}) + \epsilon\mathcal{N}$ [86].
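As an illustration of the two DDPG mechanics highlighted above (soft target-network tracking and noisy exploration of a continuous action), the following is a minimal, hypothetical NumPy sketch; the `actor` callable and the weight lists are placeholders, not an implementation from the cited works.

```python
import numpy as np

def soft_update(target_weights, online_weights, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', applied per weight array."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]

def noisy_action(actor, state, noise_std=0.1, low=0.0, high=1.0):
    """Deterministic action mu(s) plus Gaussian exploration noise,
    clipped to the valid (e.g., transmit-power) range."""
    action = actor(state)                                   # mu(s | theta_mu)
    action = action + np.random.normal(0.0, noise_std, size=np.shape(action))
    return np.clip(action, low, high)
```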
their DQN-based models to provide energy-efficient power allocation for the underlying D2D network.

UAV IoT networks are attracting considerable attention recently due to their ability to provide enhanced QoS communication in harsh and vital environments. However, power management is one of the key challenges in such networks. In this context, the authors in [113] address the problem of downlink power control in ultra-dense UAV networks with the aim of maximizing the network's EE. A multi-agent DQN-based DRL model is proposed in which the agents are the UAVs in the network. The state space is continuous, representing the remaining energy of the UAV and the interference caused by neighboring UAVs. The action space is discrete, representing the set of possible discrete transmit power values, while the reward function is the EE of the UAV network. Simulation results are compared with Q-learning and random algorithms, which show the superiority of their proposed scheme in terms of both the convergence speed and EE.

In the same context for multi-UAV wireless networks, the authors in [114] propose a multi-agent DDPG-based DRL to address the problem of joint trajectory design and power allocation. In their scheme, the agents are the UAVs, whose state space is a discrete binary indicator function representing whether the QoS of the user ends (UEs) is satisfied or not. The action space is also discrete, corresponding to selecting UAVs' trajectory and transmission power. The reward is a continuous function defined by the joint trajectory design and power allocation as well as the number of UEs covered by the UAVs. Simulation results show that the proposed algorithm achieves higher network utility and capacity than the other optimization methods in wireless UAV networks with reduced computational complexity.

Another interesting work [107] proposes a multi-agent DQN-based DRL method to study the problem of transmit power control in wireless networks. The agents are the transmitters, whose state space is continuous, consisting of three main feature groups: local information, interfering neighbors, and interfered neighbors. The action space is discrete, corresponding to discrete power levels, while the reward is a function of the weighted sum-rate of the whole network. Experimental results demonstrate that the proposed distributed algorithm provides comparable and even better performance results to the state-of-the-art optimization-based algorithms available in the literature.

High-speed railway (HSR) systems are one of the emerging IoT applications for next-generation wireless networks. Such systems are characterized by their rapid variations in the wireless environment, which mandate the development of light-weighted RRAM solutions. As a response to this, Xu and Ai [88] propose a multi-agent DDPG-based DRL model to address the problem of sum-rate maximization via power allocation in hybrid beamforming-based mmWave HSR systems. In their approach, each mobile relay (MR) acts as an agent. The action space is continuous, corresponding to the transmit power level of each MR agent. Also, the state space is continuous, defined by each MR's own signal channel, the local observation information of each MR, i.e., the beamforming design, each MR's achievable rate, and each MR's transmit power in the last time step. The reward function is continuous, defined by the achievable sum-rate of the network. Simulation results demonstrate that the SE of their proposed algorithm is comparable to the full digital beamforming scheme, and it outperforms conventional approaches such as maximum power allocation, random power allocation, DQN, and FP.

Federated deep reinforcement learning (FDRL) is an emerging AI paradigm that integrates federated learning (FL) and DRL methods. FDRL can be utilized as an efficient technique to enhance the RRAM solutions in large-scale distributed systems. As an example, an interesting approach is proposed in [115], in which the authors propose a cooperative multi-agent actor-critic-based FDRL framework for distributed wireless networks. The authors particularly address the problem of energy/spectrum efficiency maximization via distributed power allocation for network edge users. In their proposed model, the agents are the edge users, whose action space is continuous, defined as the power allocation strategies. The state space is continuous, defined by the allocated transmit power, the SINR on the assigned RBs, and the reward of the previous training time step. The system is defined in terms of a local continuous cost function expressed in terms of SINR, power, path loss, and environmental noise. Using simulation results, the authors demonstrate that their proposed framework achieves better performance accuracy in terms of power allocation than other approaches such as greedy, non-cooperative power allocation, and traditional FL.

3) IN SATELLITE NETWORKS
In the following paragraphs, we review works that employ DRL techniques to address the power allocation issue in satellite networks as well as emerging satellite IoT systems. Managing downlink transmit power in satellite networks is one of the major persistent challenges. To this end, the authors in [116] extended their work in [117] and present a single-agent Proximal Policy Optimization (PPO)-based DRL model to solve the problem of power allocation in multi-beam satellite systems. In their model, the agent is the processing engine that allocates power within the satellite, whose state space is continuous and comprises the set of demand requirements per beam and the optimal power allocations for the two previous time steps. The action space is continuous, representing the allocation of the power for each beam, while the reward is a function of both the link data rate achieved by the beam and the power set by the agent. Experimental results demonstrate the robustness of their proposed DRL algorithm in dynamic power resource allocation for multi-beam satellite systems.

The NOMA technique has shown efficient results in improving the performance of terrestrial mmWave cellular systems [118]. This has motivated the use of NOMA for
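Several of the works above (e.g., [107], [113]) share the same MDP skeleton: each transmitter agent observes a local state, picks one of a finite set of transmit-power levels, and is rewarded with a network utility such as EE or weighted sum-rate. The following is a minimal, hypothetical sketch of that formulation; the power grid, state builder, and reward are illustrative placeholders, not the exact models of the cited papers.

```python
import numpy as np

P_MAX = 1.0                                   # W, amplifier/regulatory limit
POWER_LEVELS = np.linspace(0.0, P_MAX, 8)     # discrete action set: 8 power levels

def local_state(own_gain, interference, last_power):
    """Continuous local observation of one transmitter agent."""
    return np.array([own_gain, interference, last_power], dtype=np.float32)

def energy_efficiency_reward(rate_bps, power_w, circuit_power_w=0.1):
    """EE-style reward: delivered rate per unit of consumed power (bits/Joule)."""
    return rate_bps / (power_w + circuit_power_w)

def epsilon_greedy_power(q_values, epsilon=0.1):
    """Map a DQN output vector (one Q-value per power level) to a power choice."""
    if np.random.rand() < epsilon:
        action = np.random.randint(len(POWER_LEVELS))
    else:
        action = int(np.argmax(q_values))
    return action, POWER_LEVELS[action]
```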
deliberately engineered as they play a crucial role in the convergence and accuracy of the learned policies. For policy-based power allocation algorithms, it is more convenient to define the reward as a continuous function since the learning process depends on taking its derivative, which is not necessarily the case with the value-based algorithms.

It is also observed that DRL-based power allocation algorithms can be deployed in a centralized or distributed fashion, depending on the deployment scenario. Distributed scenarios provide more accurate and reliable policies than centralized ones at the expense of added complexity and signaling overhead, especially as the number of agents increases. Therefore, the tradeoff between the centralized and distributed policies heavily depends on the scenario under investigation. For example, it is preferable to deploy DRL models in a distributed fashion to address the power allocation problem for time-sensitive applications. However, for ultra-reliable applications, it is preferable to adopt centralized DRL deployment. Moreover, most of the papers consider rate maximization, SE, and EE as key performance metrics (e.g., [88], [102], [110]). However, other KPI metrics must be considered as well during the design of DRL frameworks, such as latency, reliability, and coverage, especially for emerging real-time and time-sensitive IoT applications.

We also observe from Table 5 that both the cellular HomNets and emerging IoT wireless networks gain more attention than satellite and multi-RAT networks, which are still in their early stages and require more in-depth investigation.

B. DRL FOR SPECTRUM ALLOCATION AND ACCESS CONTROL
One of the significant challenges in modern wireless communication networks that still needs more investigation is spectrum management and access control. In this context, DRL techniques have attracted considerable research interest recently due to their robustness in making optimal decisions in dynamic and stochastic environments. This section presents the works related to the applications of DRL algorithms for radio spectrum allocation in modern wireless networks. This includes issues such as dynamic network access, user association or cell selection, spectrum access or channel selection/assignment, and the joint of any of these issues, as shown in Fig. 3.

In modern wireless networks, a massive number of user devices may request to access the wireless channel simultaneously. This may drastically overload and congest the channel, causing communication failure and unreliable QoS. Hence, efficient communication schemes and protocols must be developed to address this issue in channel access via employing various access scheduling and prioritization techniques. RRAM for modern wireless networks requires considering dynamic load balancing and access management methods to support the massive capacity and connectivity requirements of the future wireless networks while utilizing their radio resources efficiently. DRL methods have been used recently to address these issues, and they have demonstrated efficient results in the context of massive channel access.

On the other hand, user devices in cellular networks are required to associate or be assigned to BS(s) or network AP(s) to get a service. The association process could be symmetric, i.e., both uplink and downlink are from the same BS/AP, or it may be asymmetric, in which the uplink and downlink may associate to different BSs/APs. This association or cell selection process must be carefully addressed as it strongly affects the allocation of network radio resources. Unfortunately, such types of problems are typically non-convex and combinatorial [41] and need accurate network
The problem of spectrum management in wireless DSA is addressed in [129] based on distributed multi-agent DQN. In their approach, the agents are the DSA users, whose action space is discrete, corresponding to the transmit power change for each channel. The state space is discrete, defined as the transmit power on wireless channels. The reward is a continuous function defined by the SE and the penalty caused by the interference to PUs. Experimental results show that their proposed model with echo state network-based DQN achieves a higher reward with both achievable data rate and PU protection.

Antenna selection is widely used for physical layer security in multi-antenna-based cellular networks. In this context, the authors in [130] propose a single-agent DQN algorithm to predict the optimal transmit antenna in the MIMO wiretap channel. The state space is discrete, defined in terms of the secrecy capacity and maximum SNR of the MIMO wiretap channel. The action space is discrete, corresponding to selecting the transmit antenna. The reward function is discrete, defined in terms of the achieved SNR at the antenna. Experimental results demonstrate that their proposed algorithm proactively predicts the optimal antenna while reducing the secrecy outage probability of the MIMO wiretap system compared to the support vector machine and conventional approaches.

2) IN IOT AND OTHER EMERGING WIRELESS NETWORKS
In the following paragraphs, we review works that employ DRL algorithms to address the spectrum and access control problem in the IoT and emerging wireless networks illustrated in Fig. 2.

IoT sensor networks are characterized by their high dynamicity, which necessitates efficient channel access for the connecting nodes. In [131], the authors build on their initial work in [132] and propose a single-agent DQN-based DRL scheme to tackle the problem of dynamic channel access for IoT sensor networks. In their scheme, the agent is the sensor, and its action is discrete, corresponding to selecting one channel to transmit its packets at each time slot. The state space is discrete, which is a combination of rewards and actions in the previous time slots. The reward function is also discrete, which is "+1" if the selected channel is in low interference, in which case a successful transmission occurs; otherwise, the reward is "−1", in which case the selected channel is in high interference and a transmission failure occurs. Simulation results show that their proposed scheme achieves an average reward of 4.4 compared to 4.5 obtained using the conventional myopic policy [133], which needs a compact knowledge of the transition matrix of the system.

Energy consumption is considered one of the persistent challenges for emerging wireless sensor networks. In this context, an interesting work is proposed in [134] in which the authors develop a single-agent DQN-based DRL model to address the channel selection in energy harvesting-based IoT sensor networks. In that work, the agent is one BS, which controls the channel assignments for energy harvesting-enabled sensors. The problem of the agent is to predict the battery level of the sensors and to assign channels to sensors such that the total rate is maximized. The DQL model used to solve this problem has two long short-term memory (LSTM) neural network (NN) layers, one for predicting the sensor's battery state and one for obtaining the channel access policy based on the predicted states obtained from the first layer. The agent's action is all the available sensors that require to access the channels. The state contains the history of channel access scheduling, the true and predicted battery information history, and the current sensor's CSI. Simulation results show that the total rates obtained using the DQL scheme are 6.8 kbps compared to 7 kbps obtained from the optimal scheme.

Managing spectrum allocation is one of the main objectives in cognitive radio networks (CRNs). The main idea is to efficiently utilize the available spectrum via enabling SUs to use the spectrum resources when the PUs are inactive. The authors in [135] propose a multi-agent DQN-based model to address the cooperative spectrum sensing issue in CRNs. In their scheme, the agents are the SUs, whose action is discrete, corresponding to sensing the spectrum for possible transmission without interfering with the PUs. The state space is discrete, and it is comprised of four elements representing cases when the spectrum is sensed as occupied, the spectrum is not sensed in a particular time slot, the spectrum is sensed as idle, and one of the other SUs broadcasts the sensing result first. The reward function is a binary indicator, which is "+1" if the spectrum is sensed as idle and "0" otherwise. Simulation results show that their proposed algorithm has a faster convergence speed and better reward performance than the conventional Q-learning algorithm.

For the same network in [135], the authors in [136] extend the work and propose a multi-agent DQN-based DRL scheme to address the problem of dynamic joint spectrum access and mode selection (SAMS) in CRNs. The agents are the secondary users (SUs), whose action space is discrete, corresponding to selecting the access channel and access mode. The state space of each SU agent is discrete, comprised of the action taken by the mth SU agent, the ACKs of all SU agents, and the ACK of the mth SU agent. The reward function is discrete, which is "1" if the action selection process is successful; otherwise, there is a collision, and the agent will receive a "0" reward. Simulation results demonstrate that their proposed DQN algorithm provides comparable results to the Max benchmark after the model's convergence.

Xu et al. [137] propose single-agent DQN- and DDQN-based DRL approaches to address the problem of dynamic spectrum access in wireless networks. In their model, the agent is a wireless node (e.g., a user) whose action is discrete, corresponding to sensing the discrete frequency channel for possible data transmission. The state space is discrete, defining if the channel is occupied or idle at time slot t. The reward function is discrete, which is ranging from 0 to 100
In another work [145], the authors propose both a single-agent and a multi-agent deep actor-critic DRL-based framework for dynamic multi-channel access in wireless networks. In their system, the agents are the users, whose action space is discrete, corresponding to selecting a channel. The observation space is also discrete, which is defined based on the status of the channel and the collision status. The reward function is discrete, which is "+1" if the selected channel is good; otherwise, it is "−1". Using simulation results, the authors show that their proposed actor-critic framework outperforms the DQN-based algorithm, random access, and the optimal policy when there is full knowledge of the channel dynamics.

The problem of DSA for the CRN is studied in [146] based on an uncoordinated and distributed multi-agent DQN model. The agents are CRs, whose action is discrete, representing the possible transmit powers. The state space is discrete, reflecting whether the limits for DSA are being met or not, depending on the relative throughput change at all the primary network links. The reward is also discrete, which is a function of the throughput of the links and the environment's state. Experimental results reveal that their proposed scheme finds policies that yield performance within 3% of an exhaustive search solution, and it finds the optimal policy in nearly 70% of cases.

Industrial IoT (IIoT) has emerged recently as an innovative networking ecosystem that facilitates data collection and exchange in order to improve network efficiency, productivity, and other economic benefits [147]. RRAM in such a sophisticated paradigm is also a challenge that needs more investigation. The work in [148] can be considered to be a pioneer in which the authors propose a solution for spectrum resource management for IIoT networks, with the goal of enabling spectrum sharing between different kinds of UEs. In particular, a single-agent DQN algorithm is proposed in which the agent is the BS. The action space is discrete, which corresponds to the allocation of spectrum resources for all UEs. The observation space is a hybrid (continuous and discrete) consisting of four elements: the current action (i.e., the allocation of spectrum resources), the data priority of type I UEs, the buffer length of type II UEs, and the communication status of the first type of UEs. The reward function is continuous, defined to address their optimization problem. It is divided into four objectives: 1) maximizing the spectrum resource utilization; 2) quickly transmitting the high-priority data; 3) meeting the threshold rate requirement of the first type of UEs; 4) ensuring that the second type of UEs completes the transmission in time. Using simulation results, it is demonstrated that their proposed algorithm achieves better network performance with a faster convergence rate compared with other algorithms.

Most recently in [149], the authors propose a multi-agent Double DQN-based DRL model to address the problem of DSA in distributed wireless networks. In particular, they design a channel access scheme to maximize channel throughput regarding fair channel access. The agents in their scheme are the users. The action space is discrete, which is "0" if the user does not attempt to transmit packets during the current time slot, and it is "1" if it has attempted to transmit. The state space is discrete, consisting of four main elements: each user's action taken on the current time slot, the channel capacity (which could be negative, positive, or zero), a binary acknowledgment signal showing if the user transmits successfully or not, and a parameter that enables each user to estimate other users' situations. The reward is a discrete binary function that takes the value of "1" if the user transmits successfully; otherwise, it is "0", meaning that the user transmitted with collision. It is shown using simulation results that their scheme can maximize the total throughput while trying to make fair resource allocation among users. Also, it is shown that their proposed scheme outperforms the conventional Slotted-Aloha scheme in terms of sum-throughput.

Vehicular ad hoc networks (VANETs) are one of the promising networks for next generation wireless networks, where networks are formed and information is relayed among vehicles. Wang et al. [150] address the problem of DSA in VANETs by proposing an interesting scheme that combines multi-hop forwarding via vehicles and DSA. The optimal DSA policy is defined to be the joint maximization of channel utilization and minimization of the packet loss rate. A multi-agent DRL network structure is presented that combines RNN and DQN for learning the time-varying process of the system. In their scheme, each user acts as an agent whose action space is discrete, corresponding to choosing a channel for transmission at time slot t. The state space is discrete, composed of three components: a binary transmission condition η, which is "1" if the transmission is successful and "0" otherwise, the channel selection action, and the channel status indicator after each dynamic access process. The reward is a discrete binary function, which takes a positive value if η = 1; otherwise it takes the value of "0". Simulation results show that their proposed scheme: 1) reduces the packet loss rate from 0.8 to around 0.1, 2) outperforms Slotted-Aloha and DQN in terms of reducing the collision probability and channel idle probability by about 60%, and 3) enhances the transmission success rate by around 20%.

Due to their ability to improve communication in harsh environments, UAV networks have gained considerable research lately [151]. For example, most recently in [152], the authors propose efficient multi-agent DRL-based schemes to address the problem of joint cooperative spectrum sensing and channel access in clustered cognitive UAV (CUAV) communication networks. Three algorithms are proposed: 1) a time slot multi-round revisit exhaustive search based on a virtual controller (VC-EXH), 2) a Q-learning algorithm based on independent learners (IL-Q), and 3) a DQN based on independent learners (IL-DQN). The agents are the CUAVs in the network. The action space of any CUAV agent is a discrete function defined by the steps that this agent moves clockwise in time slot t relative to the channel selected in time slot t − 1 on the PU channel ring. The state space is a
problem in multi-RAT HetNets. This includes the coexistence of various variants of the wireless networks shown in Fig. 2.

Managing the spectrum bands in unlicensed cellular networks is also another persistent challenge. In this context, the authors in [159] present a multi-agent DQN-based model that jointly tackles the dynamic channel selection and interference management in Small Base Station (SBS) cellular networks that share a set of unlicensed channels in Long Term Evolution (LTE) networks. In the proposed scheme, the SBSs are the agents, which choose one of the available channels for transmitting packets in each time slot. The agent's action is channel access and channel selection probability. The DQL input includes the channels' traffic history of both the SBSs and Wireless Local Area Networks (WLAN), while the output is the agent's predicted action vectors. Simulation results reveal that their proposed DQL strategy enhances the average data rate by up to 28% compared to the conventional Q-learning scheme.

For the same network settings in [159], the authors in [160] propose a single-agent DQN-based model to tackle the dynamic spectrum allocation for multiple users that share a set of K channels. In their scheme, the agent is the user whose action is either choosing a channel with a particular attempt probability or selecting not to transmit. The agent's state includes the history of the actions of the agent and its current observations. The DQL model input is the previous actions along with their observations, while the output is the Q-values corresponding to the actions. Simulation results demonstrate that their proposed DQL strategy achieves double the data rate compared to the state-of-the-art Slotted-Aloha scheme.

The integration of cellular networks and indoor networks has also shown efficient results in enhancing the QoS of wireless communication in terms of coverage and data rate. Towards this end, Wang and Lv [161] propose an efficient single-agent prediction-DDPG-based DRL algorithm to study the problem of dynamic multichannel access (MCA) for the hybrid long-term evolution and wireless local area network (LTE-WLAN) aggregation in dynamic HetNets. The agent is the central BS controller, whose state space is continuous, consisting of both the channels' service rates and the users' requirement rates. The action space, on the other hand, is discrete, representing the users' index. Two reward functions are provided, an online traffic real reward and an online traffic prediction reward, each of which is a function of the users' requirements, the channels' supplies, the degree of system fluctuation, the relative resource utilization, and the quality of user experience. Using simulation results, the authors demonstrate the efficiency of the proposed prediction-DDPG model in solving the dynamic MCA problem compared to conventional methods.

In another interesting work [162], the authors investigate the joint allocation of the spectrum, computing, and storing resources in multi-access edge computing (MEC)-based vehicular networks. In particular, the authors propose multi-agent DDPG-based DRL algorithms to address the problem in a hierarchical fashion considering a network comprised of a macro eNodeB (MeNB) and Wi-Fi APs. The agents are the controllers installed at the MEC servers. The agents' action space is discrete, including the spectrum slicing ratio set, the spectrum allocation fraction sets for the MeNB and for each Wi-Fi AP, the computing resource allocation fraction, and the storing resource allocation fraction. The state space is discrete, representing information of the vehicles within the coverage area of the MEC server, including the vehicles' number, x-y coordinates, moving state, position, and task information. The reward function is discrete, defined in terms of the delay requirement and the requested storing resources required to guarantee the QoS demands of an offloaded task. Provided experimental results reveal that their proposed schemes achieve high QoS satisfaction ratios compared with the random assignment techniques.

The integration of various cellular wireless networks is also one of the main enabling technologies for the next generation wireless networks. Recently in [163], the authors propose an efficient single-agent DQN algorithm based on Monte Carlo Tree Search (MCTS) to address the problem of dynamic spectrum sharing between 4G LTE and 5G NR systems. In particular, the authors used the MuZero algorithm to enable a proactive BW split between 4G LTE and 5G NR. The agent is a controller located at the network core, whose action space is discrete, corresponding to a horizontal line splitting the BW between 4G LTE and 5G NR. The state space is discrete, defined by five elements: 1) an indicator of whether the user is an NR user or not, 2) the number of bits in the user's buffer, 3) an indicator of whether the user is configured with multimedia broadcast single frequency network (MBSFN) or not, 4) the number of bits that can be transmitted for the user in a given subframe, and 5) the number of bits that will arrive for each user in the future subframes. The reward function is a continuous function defined as a summation of the exponential of the delayed packets per user. Experimental results show that their proposed scheme provides comparable performance to the state-of-the-art optimal solutions.

Findings and lessons learned: This section reviews the applications of DRL for dynamic spectrum allocation and access control in modern wireless networks. These types of radio resources are inherently coupled with user association, network/RAT selection, dynamic multi-channel access, and DSA. Table 6 summarizes the reviewed papers in this section. In general, the application of DRL for spectrum allocation and access control problems has received considerable attention lately. We observe that most DRL algorithms, when deployed for non-IoT networks, are implemented in centralized fashions at network controllers, such as BSs, RSUs, and satellites [125], [141], [154]. This is done to utilize the controllers' powerful and advanced hardware capabilities in collecting network information and designing cross-layer policies. Hence, we observe that DRL models are deployed as a single agent at the network controllers [148]. On the contrary, DRL provides a flexible tool in diversified IoT networks and systems, conventionally involving dynamic
system modeling and multi-agent interactions, such as CRNs The exponential increase in smart IoT devices man-
and distributed systems. Also, note that the main motivations dates making autonomous decisions locally, especially for
of using DRL techniques in almost all the papers presented in delay-sensitive IoT applications and services. In this con-
this subsection are the complexity of the formulated spec- text, we anticipate that the research on spectrum allocation
trum allocation and access control problems, the inability and access control using distributed multi-agent DRL algo-
to obtain accurate CSI, and the inadequacy of conventional rithms for future IoT networks will attract more attention as
methods to solve the formulated problems. in [139], [141], [150], [152].
In addition, the management of such types of radio
resources falls in general in the discrete action space. C. DRL FOR RATE CONTROL
Therefore, the value-based algorithms are utilized more This refers to the adaptive rate control in the uplink and
than the policy-based ones, and they have shown efficient downlink of wireless networks. With the explosive increase
results, as we discussed in the surveyed papers. We also in the number of user devices and the emergence of massive
observe that embedding prediction-based DRL algorithms, types of data-hungry applications [3], it becomes essen-
such as RNN, with the conventional DNN models has shown tial to keep high network KPIs in terms of data rates and
efficient results in enabling DRL to perform a proactive users’ QoE. Adaptive rate control schemes must ensure sat-
spectrum prediction. Such integrated models have been seen isfactory QoS in highly dynamic and unpredictable wireless
in [144], [150], [153] and we expect that they will attract environments. In this context, DRL techniques can be effi-
more attention in the future. In addition, it is always prefer- ciently deployed to solve adaptive rate control problems
able to utilize the DQN-based algorithms to the Q-learning instead of conventional approaches that possess high com-
algorithm as they provide better performance in terms of plexity and heavily rely on accurate network information and
convergence speed and accuracy of the learned policies. instantaneous CSI.
Moreover, as is the case with the other DRL models, the In the following paragraphs, we review works that employ
definitions of the state space and reward function are crucial, DRL algorithms to address the rate control issue in cellular
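The preference for DQN-style agents over tabular Q-learning noted above can be made concrete with a minimal sketch. The snippet below is only an illustration under assumed dimensions and hyperparameters (it is not taken from any surveyed work); it shows the two ingredients behind DQN's faster and more stable convergence on large spectrum-occupancy state spaces: experience replay and a periodically synchronized target network.

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        def __init__(self, n_state=8, n_action=4, hidden=64):   # sizes are assumed, not from the survey
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_state, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_action))
        def forward(self, s):
            return self.net(s)

    gamma = 0.95
    online, target = QNet(), QNet()
    target.load_state_dict(online.state_dict())          # target network starts as a copy
    optimizer = torch.optim.Adam(online.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)                         # replay buffer of (s, a, r, s') tensors

    def train_step(batch_size=32):
        if len(replay) < batch_size:
            return
        s, a, r, s2 = map(torch.stack, zip(*random.sample(list(replay), batch_size)))
        q = online(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                             # bootstrap from the frozen target network
            q_target = r + gamma * target(s2).max(dim=1).values
        loss = nn.functional.mse_loss(q, q_target)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        # every few hundred steps: target.load_state_dict(online.state_dict())

A tabular Q-learning agent would instead keep one entry per state-action pair, which quickly becomes infeasible when the state grows with the number of channels and users.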
C. DRL FOR RATE CONTROL
This refers to the adaptive rate control in the uplink and downlink of wireless networks. With the explosive increase in the number of user devices and the emergence of massive types of data-hungry applications [3], it becomes essential to keep high network KPIs in terms of data rates and users' QoE. Adaptive rate control schemes must ensure satisfactory QoS in highly dynamic and unpredictable wireless environments. In this context, DRL techniques can be efficiently deployed to solve adaptive rate control problems instead of conventional approaches that possess high complexity and rely heavily on accurate network information and instantaneous CSI.

In the following paragraphs, we review works that employ DRL algorithms to address the rate control issue in cellular networks.

5G network slicing is a technique based on the network virtualization concept that enables dividing a single network connection into multiple unique virtual connections to provide various radio resources to various types of traffic. Liu et al. [165] conduct a pioneer DRL-based work to address the problem of network resource allocation, in terms of rate, for 5G network slices. The problem is decomposed into a master-slave structure, and a multi-agent DDPG-based DRL scheme is then proposed to solve it. The agents are located in every network slice, whose action space is continuous, defining the resource allocation to users in the network slice. The state space is continuous and has two main parts; the first one shows how much utility the user obtained compared to its minimum utility requirement, while the second part shows the auxiliary and dual variables from the master problem. The reward is a continuous function defined in terms of utility, utility requirements, and auxiliary and dual variables. Simulation results demonstrate that their proposed algorithm outperforms the baseline approaches and gives a near-optimal solution.

High mobility networks are characterized by their rapid variations that render link establishment a major issue. In this context, the authors in [166] propose an interesting work using a single-agent DQN-based DRL algorithm to address the problem of dynamic uplink/downlink radio resource allocation in terms of network capacity in high-mobility 5G HetNets. Their proposed algorithm is based on the Time Division Duplex (TDD) configuration, in which the agent is the BS, whose action space is discrete, corresponding to the configurations of TDD sub-frame allocation at the BS. The state space is discrete, comprised of different kinds of features of the BS, including uplink/downlink occupancy, buffer occupancy, and channel condition of all uplinks/downlinks to the BS. The reward is discrete, defined as a function of the uplink and downlink channel utility, which mainly depends on channel occupancy with the chosen TDD configuration. Using experimental results, the authors show that their proposed algorithm achieves performance improvement in terms of both network throughput and packet loss rate, compared to some conventional TDD resource allocation algorithms.

Findings and lessons learned: This section reviews the use of DRL techniques for adaptive rate control in next generation wireless networks. In general, there is limited research that is solely dedicated to addressing the rate radio resource issue. We consider [165] and [166] as pioneer works in this type of RRAM. Most of the research in the literature is dedicated to video streaming applications, and the paper [15] highlighted some of them. However, as we discussed in the previous sections, the data rate control issue is typically addressed via controlling other radio resources such as power, user association, and spectrum. In addition, adaptive rate control is typically addressed as a joint optimization with other radio resources, as we will elaborate in the next section, e.g., as in [167], [168].

We also observe that DRL-based solutions for cellular networks receive more attention than other wireless networks, and there is a lack of research on adaptive rate control for IoT and satellite networks. This also deserves more in-depth investigation and analysis.
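Across the works reviewed in this survey, each scheme is specified by the same MDP elements: the agent, its state space, its action space, and its reward. A minimal environment wrapper in the style below (a toy single-link rate-control example with assumed arrival statistics and an assumed reward weighting, not drawn from [165] or [166]) is often all that is needed before any of the DRL algorithms discussed in this paper can be plugged in.

    import random

    class RateControlEnv:
        """Toy rate-control MDP: state = (buffer backlog, channel quality), action = rate-level index."""
        def __init__(self, rates=(1.0, 2.0, 4.0, 8.0)):
            self.rates = rates
            self.reset()
        def reset(self):
            self.buffer, self.channel = 10.0, 1.0
            return (self.buffer, self.channel)
        def step(self, action):
            served = min(self.buffer, self.rates[action] * self.channel)    # bits delivered this slot
            self.buffer = self.buffer - served + random.uniform(0.0, 2.0)   # random new arrivals
            self.channel = min(2.0, max(0.1, self.channel + random.gauss(0.0, 0.2)))
            reward = served - 0.1 * self.buffer                              # throughput minus a delay penalty
            return (self.buffer, self.channel), reward, False, {}

The reward weighting (0.1) is an arbitrary illustration of the throughput-delay trade-off that the surveyed rate-control rewards encode.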
D. DRL FOR JOINT RRAM
Due to the massive complexity and large-scale nature of modern wireless networks, it becomes necessary to design efficient schemes that account for the joint radio resources. In many scenarios, the design problem in wireless networks might end up with competing objectives. For example, in UDNs, increasing the power level is beneficial in combating path loss and enhancing the received signal quality. However, this might cause serious interference to the neighboring user devices and BSs. Hence, the joint design of power level control and interference management becomes mandatory. Conventional approaches for solving joint RRAM problems require complete and instantaneous knowledge about network statistics, such as traffic load, channel quality, and radio resource availability. However, obtaining such information is not possible in such large-scale networks. In this context, DRL techniques can be adopted to learn system dynamics and communication context to overcome the limited knowledge of wireless parameters.

This section intensively reviews the most important and influential works that implement DRL algorithms for the problem of joint RRAM in modern wireless networks. Particularly, we present related works that jointly optimize the radio resources shown in Fig. 2, such as power allocation, spectrum resources, user association, dynamic channel access, cell selection, etc.

1) IN CELLULAR NETWORKS
In the following paragraphs, we review works that employ DRL algorithms to address the joint RRAM issue in cellular networks shown in Fig. 2.

Cellular vehicular communication (CV2X) is regarded as one of the main enabling technologies for next generation wireless networks. RRAM in such networks has received significant momentum using conventional methods and is now gaining notable attention using DRL methods. For example, an interesting work is reported in [169], in which the authors study the problem of joint optimization of transmission mode selection and resource allocation for CV2X. They propose single-agent settings in which DQN and federated learning (FL) models are integrated to improve the model's robustness. The agent in their model is each V2V pair. The action space is discrete, representing the resource block (RB) allocation, communication mode selection, and transmit power level of the V2V transmitter. The state space is a hybrid (continuous and discrete) consisting of five main parts; the received interference power at the V2V receiver and the BS on each RB at the previous subframe, the number of selected neighbors on each RB at the previous subframe, the large-scale channel gains from the V2V transmitter to its corresponding V2V receiver and the BS, the current load, and the remaining time to meet the latency threshold. The reward is a continuous function defined in terms of the sum-capacity of vehicular UEs as well as the QoS requirements of both vehicular UEs and V2V pairs. Using experimental results, the authors show that their proposed two-timescale federated DRL algorithm outperforms other decentralized baselines.

RRAM in small cell networks is one of the ongoing challenges for cellular operators. Towards this end,
Jang and Yang [170] propose a multi-agent DQN-based algorithm to address the problem of sum-rate maximization via a joint optimization of resource allocation and power control in small cell wireless networks. The agents in their proposed model are the small cell BSs, whose action space is discrete, corresponding to selecting the resource allocation and power control of a small BS on an RB. The state space is continuous, including all the CSI that the small BS collects on the RB, such as local CSI, local CSI at the transmitter, etc. The reward is a continuous function expressed by the average sum-rate of its own serving users and the other small BSs. Experimental results show that their proposed approach both outperforms the conventional algorithms under the same CSI assumptions and provides a flexible tradeoff between the amount of CSI and the achievable sum-rate.

In the same context, another interesting work is presented in [171], in which the authors propose a model-driven multi-agent Double DQN-based framework for resource allocation in UDNs. In particular, they first develop a DNN-based optimization framework comprised of a series of ADMM iterative procedures that uses the CSI as the learned weights. Then, a channel-information-absent Q-learning resource allocation algorithm is presented to train the DNN-based optimization scheme without massive labeling data, where the EE, SE, and fairness are jointly optimized. The agents are the D2D transmitters, whose action space is discrete, corresponding to selecting a subcarrier and the corresponding transmission power. The state space is a hybrid (continuous and discrete) consisting of two parts; user association information and interference power. The reward function is continuous, comprised of two components; the network EE and the fairness of service quality, which is expressed by the variance of throughput between authorized users. Using experimental results, it is demonstrated that their proposed algorithm has a rapid convergence speed, well characterizes the extent of the optimization objective with partial CSI, and outperforms other existing resource allocation algorithms.

D2D-enabled cellular networks are also one of the key enabling technologies for next generation cellular systems. RRAM in such networks is one major concern, especially for mmWave-based cellular networks, as the D2D links require frequent link re-establishment to combat the high blockage rate. The authors in [172] propose a multi-agent Double DQN-based scheme to address the problem of joint subcarrier assignment and power allocation in D2D underlaying 5G cellular networks. The agents in their model are the D2D pairs, whose action space is discrete, corresponding to determining the transmit power allocation on the available subcarriers. The state space is a hybrid (continuous and discrete), comprised of four components: 1) local information (including the previous transmit power, previous SE achieved by transmitters, channel gain, and SINR), 2) the interference that each agent causes at the BS side, 3) the interference received from the agent's interfering neighbors and the SE achieved by the agent's neighbors, and 4) the interference that each agent causes to its neighbors. The reward is a continuous function comprised of three elements: 1) the SE achieved by each agent, 2) the SE degradation of the agent's interfered neighbors, and 3) the penalty due to the interference generated at the BS. Experimental results show that their proposed algorithm outperforms both the exhaustive and random subcarrier and even power (RSEP) assignment methods in terms of SE of D2D pairs.

Mission-Critical Communications (MCC) is an emerging service in next generation wireless networks. It is envisioned to enable First Responders, such as firefighters and emergency medical personnel, to replace conventional radio with the advanced communication capabilities available to next generation smartphones and IoT devices. Most recently, a pioneer work is conducted by Wang et al. [173], in which the authors propose a multi-agent DQN-based DRL scheme to address the problem of spectrum allocation and power control for MCC in 5G networks. In MCC, multiple D2D users reuse non-orthogonal wireless resources of cellular users without the BS in order to enhance the network's reliability. The paper aims to help the D2D users autonomously select the channel and allocate power to maximize system capacity and SE while minimizing interference to cellular users. The agents are the D2D transmitters, whose action space is discrete, corresponding to channel and power level selection. The state space is discrete, defined in a three-dimensional matrix, which includes information on the channel state of users, the state of the power level, and the number of D2D pairs. The reward function is discrete, defined in terms of the total system capacity and constraints. Simulation results show that their proposed learning approach significantly improves spectrum allocation and power control compared to traditional methods.

RRAM in OFDM-based systems is also one of the main challenging issues. In this context, the authors in [174] propose a multi-agent DQN-based model to address the problem of joint user association and power control in OFDMA-based wireless HetNets. The agents are the UEs, whose action space is discrete, corresponding to jointly associating with the BS and determining the transmit power. The state space is discrete, defined by the situation of all UEs' association with the BS and power control. The reward function is continuous, defined in terms of the sum-EE of all UEs. Using simulation results, it is shown that their proposed method outperforms the Q-learning method in terms of convergence and EE.

Another interesting work is reported in [175], in which the authors propose a single-agent DQN-based DRL model to address the problem of joint optimization of user association, resource allocation, and power allocation in HetNets. The agent is the BS, whose action space is discrete, corresponding to power allocation to users. The state space is discrete, defined by the channel gain matrix and the set of user associations. The reward function is continuous, defined by the utility function of users' achieved data rate. Using simulation results, the authors show that their proposed algorithm outperforms some of the existing methods in terms of SE and convergence speed.

2) IN IOT AND OTHER EMERGING WIRELESS NETWORKS
In the following paragraphs, we review works that employ DRL algorithms to address the joint RRAM issue in IoT and emerging wireless networks depicted in Fig. 2.

For the same system settings in [106], [107], the authors in [67] extended their work and propose a multi-agent DDPG-based DRL framework to address the problem of joint spectrum and power allocation in wireless networks. Two DRL-based algorithms are proposed, which are executed and trained simultaneously in two layers in order to jointly optimize the discrete subband selection and continuous power allocation. The agent in their approach is each transmitter. In the top layer, the action space of all agents is discrete, representing the discrete subband selection, while the bottom layer has a continuous action space corresponding to the transmit power allocation. The state space is a hybrid (continuous and discrete), containing information on achieved SE, transmit power, sub-band selection, rank, and downlink channel gain. The reward is shared by both layers, which is a continuous function defined in terms of the externality of agents to interference and the spectral efficiency. Using experimental results, the authors show that their proposed framework outperforms the conventional fractional programming algorithm.

Based on their initial work in [176], the authors in [177] extended their work and propose a distributed multi-agent DQN-based DRL scheme to address the problem of joint channel selection and power control in D2D networks. The agents in their model are the D2D pairs, whose action space is discrete, corresponding to selecting a channel and a transmit power. The state space of each agent is a hybrid (continuous and discrete) which contains three sets of information; local information, non-local information from the agent's receiver-neighbor set, and non-local information from the agent's transmitter-neighbor set. The reward function of each agent is continuous, which is decomposed into the following elements; its own received signal power, its own total received SINR, its interference caused to transmitter-neighbors, the received signal power, and the total received SINR of transmitter-neighbors. Using simulation results, it is shown that the performance of their scheme closely approaches that of the FP-based algorithm even without knowing the instantaneous global CSI.

In [178], the authors extended their previous works in [179], [180] and present a distributed multi-agent DQN-based model to address the problem of joint sub-band selection and power level control in V2V communication networks. Their proposed model is applicable to both unicast and broadcast scenarios. The agents are the V2V links or vehicles, whose action space is discrete, corresponding to the selection of the frequency band and transmission power level that generate small interference to all V2V and V2I links while ensuring enough resources to meet latency constraints. The state space is continuous, containing the following information; the CSI of the V2I link, the received interference signal strength in the previous time slot, the channel indices selected by neighbors in the previous time slot, the remaining load for transmission, and the time left before violating the latency constraint. The reward function is continuous, consisting of three components; the capacity of V2I links, the capacity of V2V links, and the latency constraint. Using experimental results, it is shown that agents learn to satisfy the latency constraints on V2V links while minimizing the interference to V2I communications.
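The three-component V2V reward just described is representative of how the surveyed joint-RRAM works fold competing objectives into a single scalar. The sketch below is a hedged illustration only; the weights and the latency-term form are assumed, not the exact formulation of [178].

    def v2v_reward(v2i_capacity, v2v_capacity, time_left, deadline,
                   w_i=0.5, w_v=0.4, w_t=0.1):
        """Scalarize V2I capacity, V2V capacity, and a latency term into one reward.
        Capacities in bits/s/Hz; time_left/deadline in slots; the weights are illustrative."""
        latency_term = time_left / deadline if deadline > 0 else 0.0   # shrinks as the deadline nears
        return w_i * v2i_capacity + w_v * v2v_capacity + w_t * latency_term

    # example usage: r = v2v_reward(v2i_capacity=3.2, v2v_capacity=1.1, time_left=40, deadline=100)

Choosing these weights is itself a design decision; poorly balanced weights can make the agent ignore the latency constraint in favor of raw capacity.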
Another pioneering work is reported in [181], in which the authors propose a single-agent Double DQN-based DRL to address the problem of joint channel selection and power allocation with network slicing in CRNs. Their study aims to provide high SE and QoS for cognitive users. The agent is the overall CRN, whose action space is discrete, corresponding to the channel selection and power allocation of SUs. The state space is continuous, defined by the SINR of the PU. The reward function is continuous, which is a function of the system SE, user QoS, interference temperature, and the interference temperature threshold. Experimental results show that their proposed algorithm improves the SE and QoS and provides faster convergence and more stable performance than the Q-learning and DQN algorithms.

NOMA-based systems are characterized by their ability to provide enhanced QoS in cellular networks. However, allocating radio resources in such systems is quite challenging. In this context, the problem of joint subcarrier assignment and power allocation in uplink multi-user 5G-based NOMA systems is addressed in [182]. A multi-agent two-step DRL algorithm is proposed; the first step employs the DQN algorithm to output the optimum subcarrier assignment policy, while the second step employs the DDPG algorithm to dynamically allocate the transmit power for the network's users. The agent is a controller located at the BS, whose action space is a hybrid (discrete and continuous), corresponding to the subcarrier assignment decisions and power allocation decisions. The state space is continuous, which is defined by the users' channel gains at each subcarrier. The reward function is defined as the sum EE of the NOMA system. Experimental results show that their proposed algorithm provides better EE than the fixed and DQN-based power allocation schemes.

Unlike the work in [182], a pioneer work is reported in [183], in which the authors propose a multi-agent DDPG-based model to address the problem of joint power and spectrum allocation in NOMA-based V2X networks. In particular, the authors are looking to maximize the sum-rate of V2I communications. The agents are the V2V communication links. The state space is discrete, defined by a set of actions performed by V2I and V2V communication links. The set includes the transmit power of both V2I and V2V links.
TABLE 7. Summary of the related works that address the joint RRAM.
Another reviewed work addresses the problem of delay minimization via joint spectrum and power resource allocation in a mmWave mobile hybrid access network. The agent is located in the roadside BS, whose action space is discrete, corresponding to allocating spectrum and power resources for data. The state space is discrete, consisting of information about the current power and spectrum of the resource pool, the required spectrum and power, and the number of spectrum and power levels. The reward signal is a continuous function defined in terms of the queueing delay and the resource length required for each data. Using simulation results, it is shown that their proposed model guarantees the URLLC delay constraint when the load does not exceed 130%, which outperforms other conventional methods such as random and greedy algorithms.

Healthcare systems are one of the main services in next generation wireless systems. Unlike the work in [41], a pioneer work is presented in [167] to address the problem of network selection with the aim of optimizing medical data delivery over heterogeneous health systems. In particular, an optimization problem is formulated in which the network selection problem is integrated with adaptive compression to minimize network energy consumption and latency while meeting applications' QoS requirements. A single-agent DDPG-based DRL model is proposed to solve it. The agent is a centralized entity that can access all radio access networks (RANs) information and Patient Edge Node (PEN) data running in the core network. The action space is discrete, corresponding to the joint selection of the data split over the existing RANs and the adequate compression ratio. The state space is a hybrid (continuous and discrete) defined by two elements: the fraction of time that the PENs should use over a particular RAN and the PEN investigated in the current timestamp. The reward is a continuous function, which is defined in terms of: the fraction of data of the PEN that will be transferred through a RAN, the energy consumed by the PEN to transfer bits over a RAN, distortion, the expected latency of RANs, the monetary cost for PENs to use RANs, the resource share, the fraction of time that the PENs should use over a particular RAN, and the data rate. Simulation results demonstrate that their proposed scheme outperforms the greedy techniques in terms of minimizing energy consumption and latency while satisfying different PENs' requirements.

Findings and lessons learned: This section reviews the use of DRL methods for joint radio resources of power, spectrum, access control, user association, and rate. Table 7 summarizes the reviewed papers in this section. We observe that DRL tools can be efficiently deployed to address different types of joint radio resources for diversified network scenarios. The results obtained using DRL models are better than the heuristic methods [168], [183] and comparable to the state-of-the-art optimization approaches [67], [177]. Also, note that the main motivations of using DRL techniques in addressing the joint radio resource problems presented in this subsection are the complexity of these formulated problems, the limited information about system dynamics and CSI, and the difficulty in applying traditional RRAM methods to solve such problems.

We also observe that multi-agent DRL deployment based on value-based algorithms receives more attention than policy-based algorithms. The reason is that users tend to have more control over their channel selection, data control, and transmission mode selection, and hence we find a more popular implementation of DRL agents at local IoT devices. In addition, the integration of value-based and policy-based algorithms for joint RRAM is also an interesting concept that requires more investigation, especially for multi-agent deployment scenarios. In particular, depending on the type of radio resources under investigation, resources with a continuous nature such as power typically implement policy-based algorithms, while resources with a discrete nature such as channel allocation and user association typically implement value-based algorithms. Simultaneously dealing with continuous and discrete types of radio resources may integrate both the policy- and value-based DRL algorithms to learn a global policy, as in [41], [182], or even adopt the value-based algorithms as in, e.g., [172], [174], [185], at the expense of added quantization error.
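One concrete way to realize this value-/policy-based integration for mixed action spaces is a shared encoder with two heads: one producing Q-values over the discrete choice (e.g., sub-band) and one producing a bounded continuous output (e.g., transmit power). The sketch below is only an assumed illustration of this design pattern; the layer sizes and the sigmoid power bound are our own choices, not those of [41] or [182].

    import torch
    import torch.nn as nn

    class HybridActionNet(nn.Module):
        """Shared encoder with a DQN-style head for the discrete sub-band choice
        and a DDPG-style head for the continuous power on the chosen sub-band."""
        def __init__(self, n_state=16, n_subbands=8, p_max=1.0):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU())
            self.q_head = nn.Linear(64, n_subbands)        # Q-value per sub-band
            self.p_head = nn.Linear(64, n_subbands)        # power level per sub-band
            self.p_max = p_max
        def forward(self, state):
            h = self.encoder(state)
            q_values = self.q_head(h)
            subband = torch.argmax(q_values, dim=-1)                  # greedy discrete action
            power = torch.sigmoid(self.p_head(h)) * self.p_max        # bounded continuous action
            chosen_power = power.gather(-1, subband.unsqueeze(-1)).squeeze(-1)
            return subband, chosen_power, q_values

Training would then update the Q-head with a temporal-difference loss and the power head with a deterministic policy gradient, avoiding the quantization error incurred when power is forced onto a discrete grid.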
We also observe that DRL methods for cellular networks as well as IoT wireless networks gain more attention than multi-RAT networks, particularly for D2D and V2V communications. In addition, there is a lack of research on applications of DRL for emerging IoT applications, such as healthcare systems as investigated recently in [167], which is also a promising field that requires more attention. Furthermore, we observe a lack of research on DRL applications for joint RRAM in satellite networks, which also deserves more in-depth investigation.

V. OPEN CHALLENGES AND FUTURE RESEARCH DIRECTIONS
Throughout the previous section, we have demonstrated the superiority of DRL algorithms over traditional methods in addressing complex RRAM problems for modern wireless networks. However, there are still several challenges and open issues that are either not yet explored or need further exploration. This section highlights these open challenges and provides insights on future research directions in the context of DRL-based RRAM for next generation wireless networks. Table 8 summarizes the advantages and disadvantages/shortcomings of DRL methods when applied for RRAM in next generation wireless networks.

A. OPEN CHALLENGES
1) CENTRALIZED VS. DECENTRALIZED RRAM TECHNIQUES
Future wireless networks are characterized mainly by their massive heterogeneity in wireless RANs, the number of user devices, and types of applications. Centralized DRL-based RRAM schemes are efficient in guaranteeing enhanced network QoS and fairness in allocating radio resources. They also ensure that RRAM optimization problems will not get stuck in local minima due to their holistic view of the system. However, formulating and solving RRAM optimization problems become tough tasks in such large-scale HetNets. Hence, centralized DRL-based RRAM solutions are typically unscalable. This motivates distributed multi-agent DRL-based algorithms that enable edge devices to make resource allocation decisions locally. Stochastic Game-based DRL algorithms are one promising research direction in this context [14]. However, the rapid increase in the number of edge devices (players) makes information exchange in such networks unmanageable. Also, the partial observability of agents might lead to suboptimal RRAM policies. Therefore, it is an open challenge to develop DRL-assisted algorithms that optimally balance the centralization and distribution issue in RRAM. A possible solution is to develop hybrid ecosystems that implement some DRL models at the network's edge, e.g., at the ESs or user devices, instead of deploying all DRL algorithms on a centralized network.

2) DIMENSIONALITY OF STATE SPACE IN HETNETS
In modern wireless HetNets, service requirements and network conditions are rapidly changing. Hence, single-agent DRL algorithms must be designed to capture and respond to these fast network changes. To this end, it is required to reduce the state space and action space during the learning process, which inevitably degrades the quality of the learned policies. The existence of multiple agents and their interactions will also complicate the agents' environment and prohibitively increase the dimensionality of the state space, which will slow down the learning algorithms. A possible solution to this issue is to split the large state spaces into smaller ones through state-space decomposition. The idea is to use smaller DNNs to learn the dynamics of the decomposed sub-state spaces, while another DNN considers the relatively less frequent interactions between the different sub-state spaces [189]. This approach enables us to distribute computation and accelerate agents' training.
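A minimal sketch of this decomposition idea is given below; the sub-state dimensions and layer widths are assumed for illustration and are not taken from [189].

    import torch
    import torch.nn as nn

    class DecomposedQNet(nn.Module):
        """One small branch per sub-state space (e.g., channel block, queue block, interference
        block), plus a light fusion layer that models the less frequent interactions between them."""
        def __init__(self, sub_dims=(8, 6, 4), n_actions=10):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Sequential(nn.Linear(d, 16), nn.ReLU()) for d in sub_dims])
            self.fusion = nn.Linear(16 * len(sub_dims), n_actions)
        def forward(self, sub_states):               # sub_states: list of tensors, one per sub-space
            feats = [branch(x) for branch, x in zip(self.branches, sub_states)]
            return self.fusion(torch.cat(feats, dim=-1))

Because each branch only sees its own sub-state, the branches can be updated largely in parallel, which is what distributes the computation across the decomposed sub-spaces.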
3) RELIABILITY OF TRAINING DATASET
Although the DRL-based solutions for RRAM we reviewed previously demonstrate efficient performance results, almost all the models are developed based on simulated training and testing datasets. The simulated dataset is typically produced based on some stochastic models, which provide simplified versions of practical systems and greatly ignore hidden system patterns. This methodology greatly weakens the reliability of the developed policies, as their performance on practical networks would be questionable. Hence, it is imperative to develop more effective and reliable approaches that generate precise simulation datasets and capture practical system settings as much as possible. This ensures high reliability and confidence during the training and testing modes of the developed RRAM policies. Developing such approaches is still a challenge due to the large-scale nature and rapid variations of future wireless environments.

On the other hand, the DRL models are sensitive to any change in the input data. Any minor changes in the input data will cause considerable change in the models' output. This mainly deteriorates the reliability of DRL algorithms, especially when deployed for modern IoT applications that require ultra-reliability, such as remote surgery or any other mission-critical IoT applications. Hence, ensuring high reliability for DRL models is a challenging issue. A possible solution to such issues is to exploit real-field measurement data collected from various cellular and IoT wireless scenarios to train and test the DRL-based RRAM models. This will increase the reliability of the learned policies and also enable DRL model generalization.

4) ENGINEERING OF DRL MODELS FOR RRAM
Since DRL employs DNNs as function approximators for the reward functions, DRL models will inherit some of the challenges that exist in the DNN world. For example, it is still quite challenging to optimize the DNN hyperparameters, such as the type of DNNs used (e.g., convolutional, fully connected, or RNN), the number of hidden layers, the number of neurons per hidden layer, the learning rate, batch size, etc. DRL models suffer from high sensitivity to these hyperparameters. This challenge is even exacerbated in multi-agent settings, as all agents share the same radio resources and must converge simultaneously to some policies. A possible solution is to implement some optimization techniques from the deep learning field, such as grid and random search methods, to find the optimal configuration of these hyperparameters [190].
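As a hedged illustration of the grid/random search mentioned above, the snippet below samples DQN hyperparameter configurations at random; the search ranges and the train_and_evaluate routine are placeholders that the designer must supply, not values from [190].

    import random

    SEARCH_SPACE = {
        "lr": [1e-4, 3e-4, 1e-3],
        "hidden_layers": [1, 2, 3],
        "neurons_per_layer": [32, 64, 128],
        "batch_size": [32, 64, 128],
    }

    def random_search(train_and_evaluate, n_trials=20, seed=0):
        """train_and_evaluate(config) -> average episodic return (user-supplied)."""
        rng = random.Random(seed)
        best_cfg, best_score = None, float("-inf")
        for _ in range(n_trials):
            cfg = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
            score = train_and_evaluate(cfg)
            if score > best_score:
                best_cfg, best_score = cfg, score
        return best_cfg, best_score

In multi-agent settings, each sampled configuration would have to be evaluated with all agents training simultaneously, which is precisely why the search becomes expensive.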
On the other hand, the engineering of DRL parameters such as the state space and reward function is challenging for RRAM. The state space must be engineered to capture useful and representative information about the wireless environment, such as the available radio resources, users' QoS requirements, channel quality, etc. Such information is crucial and heavily defines the learning and convergence behaviors of DRL agents. Again, the presence of multiple agents will make this even more challenging, as discussed in [14]. Also, since DRL models are reward-driven learning algorithms, the design of the reward function is also essential to guide the agent during the policy-learning stage. Formulating reward functions that capture the network objective and account for the available radio resources is also challenging.

5) SYSTEM DEPENDENCY OF DRL MODELS
DRL models are system-dependent as they are trained and tested for specific wireless environments and networks. Therefore, they provide effective results when employed to solve the specific types of problems for which they are trained. However, if there is a significant change in the characteristics of the wireless environment or the nature of the RRAM problem, such as the network topology and available radio resources, the DRL model must be retrained, as the old model no longer reflects the new training experiences. In modern wireless HetNets, such cases are frequently encountered, especially with real-time applications or in highly dynamic environments. In such a case, it becomes quite challenging for DRL agents to update and retrain their DNNs with rapidly changing input information from the HetNet environment [1]. A possible solution is to design DRL-based RRAM models in a manner that supports generalization via transfer learning and meta-learning. Multi-task DRL approaches [191], [192] are efficient frameworks to support these aspects.
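A common, lightweight form of such transfer is to reuse the feature encoder of an already-trained agent and retrain only its output head for the new network. The sketch below assumes a Q-network exposing encoder and q_head attributes with a 64-unit encoder output; these names and sizes are illustrative, not taken from [191] or [192].

    import torch
    import torch.nn as nn

    def fine_tune_for_new_network(pretrained_qnet, n_new_actions, lr=1e-4):
        """Freeze the learned encoder, re-initialize only the action head for the new
        action space, and return an optimizer over the remaining trainable parameters."""
        for p in pretrained_qnet.encoder.parameters():
            p.requires_grad = False                           # keep generic radio features
        pretrained_qnet.q_head = nn.Linear(64, n_new_actions)  # new head for the new problem
        return torch.optim.Adam(pretrained_qnet.q_head.parameters(), lr=lr)

Fine-tuning only the head reduces the amount of fresh data needed after a topology or resource change, at the cost of some loss in achievable performance compared with full retraining.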
On the other hand, if domain knowledge is available or easy to obtain, it becomes hard for DRL algorithms to beat well-designed algorithms based on the full domain knowledge. This fact has been observed and reported in the surveyed papers in Section IV.

6) CONTINUOUS TRAINING OF DRL MODELS
DRL algorithms require big datasets to train their models, which is typically associated with a high cost [15]. The network system pays this cost during the information collection process due to, e.g., the high delays, extra overhead, and energy consumption. The emergence of a large number of real-time applications and services has even increased this training cost. In this context, DRL models need to be continuously retrained with fresh data collected from the wireless environment to be up-to-date and ensure accurate and long-term control decisions. It is not practical to conduct manual retraining of the models in such large-scale HetNet settings. Also, manually monitoring and updating DRL models in multi-agent scenarios becomes an expensive task. Therefore, continuous retraining can solve this issue, in which a dedicated autonomous system is employed to continuously assess and retrain old DRL models.

7) CONTEXT OF RRAM
The implementation of DRL algorithms basically depends on the use-cases. The context and deployment scenarios in which RRAM is required must be considered during the development of DRL models. For example, RRAM in health-sector IoT applications is different from its environmental IoT application counterparts. Due to the high sensitivity of data in health-sector applications, extra data pre-processing must be performed, including data compression and encryption [167]. This will directly affect the amount of radio resources to be allocated for such applications. Hence, DRL models must be aware of the context aspect of applications, which is considered another challenge. A possible solution is to develop context-aware DRL models that are able to learn context variables in an unsupervised manner and adapt the policy to the existing context, e.g., as in [193].

8) COMPETING OBJECTIVE DESIGN OF DRL MODELS
Next generation wireless networks are expected to provide enhanced system QoS in terms of high data rate, high
5) MADRL ALGORITHMS IN SUPPORT OF MASSIVE HETEROGENEITY AND MOBILITY
Load balancing in modern wireless UDNs is another promising research direction. The objective is to balance the wireless networks by moving some users from the heavily congested BSs to uncongested ones, thus improving BS utilization and providing enhanced QoS provisioning. Although the load balancing field has been heavily investigated in the literature using conventional resource management approaches, as in [207]–[209], there is still a research gap in applying DRL to such a field. In this context, DRL can be adopted to realize the self-sustaining (or self-organization) vision of next generation wireless networks [3]. Hence, developing single/multi-agent DRL models to achieve intelligent load balancing in future HetNets is a possible research direction. Such models must be agile to network dynamics, including varying users' mobility patterns and network resource availability.

6) DRL-BASED RRAM WITH GENERATIVE ADVERSARIAL NETWORKS (GANs) FOR RRAM
Ensuring the reliability of DRL algorithms is one of the major challenges and objectives in DRL-based RRAM methods. In many real-life scenarios, we may need to deploy DRL models to allocate resources in vital systems requiring ultra-reliability, such as IoT healthcare applications [167]. In this context, there are proposals on Generative Adversarial Networks (GANs), which have emerged recently as an effective technique to enhance the reliability of DRL algorithms [210].

In practice, the shortage of realistic training datasets required to train DRL models and learn optimal policies is a challenging issue. To overcome this, GANs are utilized, which generate large amounts of realistic datasets synthetically by expanding the available limited amounts of real-time datasets. From a DRL perspective, GAN-generated synthetic data is more effective and reliable than traditional augmentation methods [79]. This is because DRL agents will be exposed to various extreme, challenging, and practical situations by merging the realistic and synthetic data, enabling DRL models to be trained on unpredicted and rare events. Another advantage of GANs over traditional data augmentation methods is that they eliminate dataset biases in the synthetic data, which greatly enhances the quality of the generated data and leads to more reliability in DRL models' training and learning processes.

In general, the research in GAN-based DRL methods for RRAM is still in its early stages, and we believe that it will gather further pace in the future. For example, developing experienced DRL-based algorithms for URLLC using GANs, in which DRL models are pre-trained based on a mix of real and synthetic data, is a promising research direction, as in [211].
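The training-data side of this idea can be sketched very simply: mix measured transitions with GAN-generated ones in each training batch so that the agent also sees rare or extreme situations. The generator interface and the mixing ratio below are assumptions for illustration, not the design of [210] or [211].

    import random

    def build_training_batch(real_buffer, generate_synthetic, batch_size=64, synthetic_ratio=0.3):
        """Mix real and synthetic transitions; generate_synthetic(n) is assumed to return
        n (state, action, reward, next_state) tuples drawn from a pre-trained GAN."""
        n_syn = int(batch_size * synthetic_ratio)
        batch = random.sample(list(real_buffer), batch_size - n_syn) + list(generate_synthetic(n_syn))
        random.shuffle(batch)
        return batch

The synthetic fraction is a tunable knob: too little defeats the purpose of augmentation, while too much lets generator artifacts dominate what the agent learns.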
7) DRL FOR RRAM IN RIS-ASSISTED WIRELESS NETWORKS
Reconfigurable Intelligent Surfaces (RIS) have emerged recently as an innovative technology to enhance the QoS of future wireless networks [212], [213]. RISs can be deployed in cellular networks as passive reflecting elements to provide near line-of-sight communication links to users, hence enhancing communication reliability, increasing throughput, and reducing latency [214], [215]. Deploying RIS to assist cellular communication, however, requires judicious RRAM schemes to optimize network performance. This research field is still nascent, and there is much to do for future research and investigation, especially in the context of DRL-based RRAM techniques. Towards this, it is required to develop end-to-end DRL-based algorithms that jointly optimize the configuration of the RIS system, i.e., the elements' phases and amplitudes, and the radio resources of BSs. For instance, designing DRL models that intelligently and optimally allocate the downlink BSs' transmit power and/or BSs' beamforming configuration on one side and the amplitude and phase shifts of the RIS elements on the other side is a promising research direction, as in [43]. We also believe that the currently ongoing research in RIS-assisted wireless networks, e.g., [43], [216]–[218], will serve as cornerstones.

8) DRL FOR RRAM IN WIRELESS DIGITAL TWIN NETWORKS
Digital twin (DT) has recently emerged as a promising technology for future wireless networks [219]. A DT is a virtual representation of the components and dynamics of a given physical system, which is envisioned to bridge the connection gap between physical systems and digital spaces. The digital replicas of physical systems, such as user devices, BSs, and machines, are constructed at the server based on historical data and real-time running status. DT utilizes tools from ML, data analytics, and multiphysics simulation to study and analyze the dynamics of physical systems. Therefore, DT enables system monitoring, real-time interaction, and reliable communication between physical systems and digital space in order to optimize the operation of physical systems [220]. With these promising features, DT is recently attracting considerable interest for enhancing the performance of wireless communication networks in applications such as computation offloading, content caching, and RRAM. For example, a promising research direction is to develop DRL algorithms to address various problems in wireless DT networks, such as the DT placement and migration problems [221], in capturing the dynamics of UAV-based networks [222], and in blockchain-based networks to enhance network security and users' privacy [223].

Table 9 summarizes the open challenges and future research directions provided in this section.
VI. CONCLUSION [3] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems:
This paper presents a comprehensive survey on the appli- Applications, trends, technologies, and open research problems,”
IEEE Netw., vol. 34, no. 3, pp. 134–142, May/Jun. 2020.
cations of DRL techniques in RRAM for next generation [4] Z. Zhang et al., “6G wireless networks: Vision, requirements, archi-
wireless HetNets. We thoroughly review the conventional tecture, and key technologies,” IEEE Veh. Technol. Mag., vol. 14,
approaches for RRAM, including their types, advantages, no. 3, pp. 28–41, Sep. 2019.
and limitations. We then illustrate how the emerging DRL [5] “6G Summit Connecting the Unconnected.” [Online]. Available:
https://6gsummit.org (accessed Feb. 18, 2022).
approaches can overcome these shortcomings to enable DRL- [6] “Cisco visual networking index: Global mobile data traffic forecast
based RRAM. After that, we illustrate how the RRAM update, 2017–2022,” Cisco, San Jose, CA, USA, White Paper, 2019.
optimization problems can be formulated as an MDP before [7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An
solving them using DRL techniques. Furthermore, we con- Introduction. Cambridge, MA, USA: MIT Press, 2018.
[8] Y. L. Lee and D. Qin, “A survey on applications of deep rein-
duct an extensive overview of the most efficient DRL forcement learning in resource management for 5G heterogeneous
algorithms that are widely leveraged in addressing RRAM networks,” in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu.
problems, including the value- and policy-based algorithms. Summit Conf. (APSIPA ASC), 2019, pp. 1856–1862.
The advantages, limitations, and use-cases for each algo- [9] F. Obite, A. D. Usman, and E. Okafor, “An overview of deep rein-
forcement learning for spectrum sensing in cognitive radio networks,”
rithm are provided. We then conduct a comprehensive and Digit. Signal Process., vol. 113, Jun. 2021, Art. no. 103014.
in-depth literature review and classified the existing related [10] S. Gupta, G. Singal, and D. Garg, “Deep reinforcement learning
works based on both the radio resources they are address- techniques in diversified domains: A survey,” Arch. Comput. Methods
Eng., vol. 28, pp. 4715–4754, Feb. 2021.
ing and the type of wireless networks they are considering.
[11] Z. Du, Y. Deng, W. Guo, A. Nallanathan, and Q. Wu, “Green deep
To this end, the types of DRL models developed in these reinforcement learning for radio resource management: Architecture,
related works and their main elements are carefully iden- algorithm compression, and challenges,” IEEE Veh. Technol. Mag.,
tified. Finally, we outline important open challenges and vol. 16, no. 1, pp. 29–39, Mar. 2021.
[12] Y. Qian, J. Wu, R. Wang, F. Zhu, and W. Zhang, “Survey on reinforce-
provide insights into future research directions in the context ment learning applications in communication networks,” J. Commun.
of DRL-based RRAM. Inf. Netw., vol. 4, no. 2, pp. 30–39, Jun. 2019.
[13] Z. Xiong, Y. Zhang, D. Niyato, R. Deng, P. Wang, and
L.-C. Wang, “Deep reinforcement learning for mobile 5G and
beyond: Fundamentals, applications, and challenges,” IEEE Veh.
REFERENCES Technol. Mag., vol. 14, no. 2, pp. 44–52, Jun. 2019.
[1] F. Hussain, S. A. Hassan, R. Hussain, and E. Hossain, “Machine [14] A. Feriani and E. Hossain, “Single and multi-agent deep rein-
learning for resource management in cellular and IoT networks: forcement learning for ai-enabled wireless networks: A tutorial,”
Potentials, current solutions, and open challenges,” IEEE Commun. IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 1226–1252,
Surveys Tuts., vol. 22, no. 2, pp. 1251–1275, 2nd Quart., 2020. 2nd Quart., 2021.
[2] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The [15] N. C. Luong et al., “Applications of deep reinforcement learning in
roadmap to 6G: AI empowered wireless networks,” IEEE Commun. communications and networking: A survey,” IEEE Commun. Surveys
Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019. Tuts., vol. 21, no. 4, pp. 3133–3174, 4th Quart., 2019.
[16] I. Tomkos, D. Klonidis, E. Pikasis, and S. Theodoridis, “Toward the [37] Z. Zhang, D. Zhang, and R. C. Qiu, “Deep reinforcement learning
6G network era: Opportunities and challenges,” IT Prof., vol. 22, for power system applications: An overview,” CSEE J. Power Energy
no. 1, pp. 34–38, Jan./Feb. 2020. Syst., vol. 6, no. 1, pp. 213–225, Mar. 2019.
[17] P. Yang, Y. Xiao, M. Xiao, and S. Li, “6G wireless communica- [38] Y. Xu, G. Gui, H. Gacanin, and F. Adachi, “A survey on resource
tions: Vision and potential techniques,” IEEE Netw., vol. 33, no. 4, allocation for 5G heterogeneous networks: Current research, future
pp. 70–75, Jul./Aug. 2019. trends, and challenges,” IEEE Commun. Surveys Tuts., vol. 23, no. 2,
[18] K. David and H. Berndt, “6G vision and requirements: Is there any pp. 668–695, 2nd Quart., 2021.
need for beyond 5G?” IEEE Veh. Technol. Mag., vol. 13, no. 3, [39] T. S. Rappaport et al., “Wireless communications and applications
pp. 72–80, Sep. 2018. above 100 GHz: Opportunities and challenges for 6G and beyond,”
[19] S. Elmeadawy and R. M. Shubair, “6G wireless communications: IEEE Access, vol. 7, pp. 78729–78757, 2019.
Future technologies and research challenges,” in Proc. Int. Conf. [40] H. Tataria, M. Shafi, A. F. Molisch, M. Dohler, H. Sjöland, and
Electr. Comput. Technol. Appl. (ICECTA), 2019, pp. 1–5. F. Tufvesson, “6G wireless systems: Vision, requirements, chal-
[20] T. Huang, W. Yang, J. Wu, J. Ma, X. Zhang, and D. Zhang, “A lenges, insights, and opportunities,” Proc. IEEE, vol. 109, no. 7,
survey on green 6G network: Architecture and technologies,” IEEE pp. 1166–1199, Jul. 2021.
Access, vol. 7, pp. 175758–175768, 2019. [41] A. Alwarafy, B. S. Ciftler, M. Abdallah, and M. Hamdi, “DeepRAT:
[21] A. Alwarafy, K. A. Al-Thelaya, M. Abdallah, J. Schneider, and A DRL-based framework for multi-RAT assignment and power allo-
M. Hamdi, “A survey on security and privacy issues in edge- cation in hetnets,” in Proc. IEEE Int. Conf. Commun. Workshops
computing-assisted Internet of Things,” IEEE Internet Things J., (ICC Workshops), 2021, pp. 1–6.
vol. 8, no. 6, pp. 4004–4022, Mar. 2021. [42] J. Kong, Z.-Y. Wu, M. Ismail, E. Serpedin, and K. A. Qaraqe,
[22] A. I. Sulyman, A. Alwarafy, G. R. MacCartney, T. S. Rappaport, “Q-learning based two-timescale power allocation for multi-homing
and A. Alsanie, “Directional radio propagation path loss mod- hybrid RF/VLC networks,” IEEE Wireless Commun. Lett., vol. 9,
els for millimeter-wave wireless networks in the 28-, 60-, and no. 4, pp. 443–447, Apr. 2020.
73-GHz bands,” IEEE Trans. Wireless Commun., vol. 15, no. 10, [43] G. Lee, M. Jung, A. T. Z. Kasgari, W. Saad, and M. Bennis, “Deep
pp. 6939–6947, Oct. 2016. reinforcement learning for energy-efficient networking with recon-
[23] A. Alwarafy, A. Albaseer, B. S. Ciftler, M. Abdallah, and figurable intelligent surfaces,” in Proc. IEEE Int. Conf. Commun.
A. Al-Fuqaha, “AI-based radio resource allocation in support of (ICC), 2020, pp. 1–6.
the massive heterogeneity of 6G networks,” in Proc. IEEE 4th 5G [44] B. S. Ciftler, M. Abdallah, A. Alwarafy, and M. Hamdi, “DQN-based
World Forum (5GWF), Oct. 2021, pp. 464–469. multi-user power allocation for hybrid RF/VLC networks,” in Proc.
[24] L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep-learning-based wireless IEEE Int. Conf. Commun., 2021, pp. 1–6.
resource allocation with application to vehicular networks,” Proc. [45] A. Ahmad, S. Ahmad, M. H. Rehmani, and N. U. Hassan,
IEEE, vol. 108, no. 2, pp. 341–356, Feb. 2020. “A survey on radio resource allocation in cognitive radio sensor
networks,” IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 888–917,
[25] K. K. Nguyen, T. Q. Duong, N. A. Vien, N.-A. Le-Khac, and
2nd Quart., 2015.
M.-N. Nguyen, “Non-cooperative energy efficient power allocation
[46] M. El Tanab and W. Hamouda, “Resource allocation for underlay
game in D2D communication: A multi-agent deep reinforcement
cognitive radio networks: A survey,” IEEE Commun. Surveys Tuts.,
learning approach,” IEEE Access, vol. 7, pp. 100480–100490, 2019.
vol. 19, no. 2, pp. 1249–1276, 2nd Quart., 2017.
[26] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah,
[47] M. Naeem, A. Anpalagan, M. Jaseemuddin, and D. C. Lee, “Resource
“Artificial neural networks-based machine learning for wireless
allocation techniques in cooperative cognitive radio networks,” IEEE
networks: A tutorial,” IEEE Commun. Surveys Tuts., vol. 21, no. 4,
Commun. Surveys Tuts., vol. 16, no. 2, pp. 729–744, 2nd Quart.,
pp. 3039–3071, 4th Quart., 2019.
2014.
[27] A. Zappone, M. D. Renzo, and M. Debbah, “Wireless networks
[48] S. Manap, K. Dimyati, M. N. Hindia, M. S. A. Talip, and
design in the era of deep learning: Model-based, AI-based, or both?”
R. Tafazolli, “Survey of radio resource management in 5G het-
IEEE Trans. Commun., vol. 67, no. 10, pp. 7331–7376, Oct. 2019.
erogeneous networks,” IEEE Access, vol. 8, pp. 131202–131223,
[28] H. Khorasgani, H. Wang, and C. Gupta, “Challenges of apply- 2020.
ing deep reinforcement learning in dynamic dispatching,” 2020, [49] M. Peng, C. Wang, J. Li, H. Xiang, and V. Lau, “Recent advances in
arXiv:2011.05570. underlay heterogeneous networks: Interference control, resource allo-
[29] G. S. Rahman, T. Dang, and M. Ahmed, “Deep reinforcement cation, and self-organization,” IEEE Commun. Surveys Tuts., vol. 17,
learning based computation offloading and resource allocation for no. 2, pp. 700–729, 2nd Quart., 2015.
low-latency fog radio access networks,” Intell. Converged Netw., [50] Y. Teng, M. Liu, F. R. Yu, V. C. M. Leung, M. Song, and
vol. 1, no. 3, pp. 243–257, 2020. Y. Zhang, “Resource allocation for ultra-dense networks: A survey,
[30] A. Mohammed, H. Nahom, A. Tewodros, Y. Habtamu, and some research issues and challenges,” IEEE Commun. Surveys Tuts.,
G. Hayelom, “Deep reinforcement learning for computation offload- vol. 21, no. 3, pp. 2134–2168, 3rd Quart., 2019.
ing and resource allocation in blockchain-based multi-UAV-enabled [51] K. Piamrat, A. Ksentini, J.-M. Bonnin, and C. Viho, “Radio resource
mobile edge computing,” in Proc. 17th Int. Comput. Conf. management in emerging heterogeneous wireless networks,” Comput.
Wavelet Active Media Technol. Inf. Process. (ICCWAMTIP), 2020, Commun., vol. 34, no. 9, pp. 1066–1076, 2011.
pp. 295–299. [52] N. Xia, H. Chen, and C. Yang, “Radio resource management in
[31] S. Sheng, P. Chen, Z. Chen, L. Wu, and Y. Yao, “Deep reinforcement machine-to-machine communications—A survey,” IEEE Commun.
learning-based task scheduling in IoT edge computing,” Sensors, Surveys Tuts., vol. 20, no. 1, pp. 791–828, 1st Quart., 2018.
vol. 21, no. 5, p. 1666, 2021. [53] S. Sadr, A. Anpalagan, and K. Raahemifar, “Radio resource alloca-
[32] X. Chen and G. Liu, “Energy-efficient task offloading and resource tion algorithms for the downlink of multiuser OFDM communication
allocation via deep reinforcement learning for augmented reality in systems,” IEEE Commun. Surveys Tuts., vol. 11, no. 3, pp. 92–106,
mobile edge networks,” IEEE Internet Things J., vol. 8, no. 13, 3rd Quart., 2009.
pp. 10843–10856, Jul. 2021. [54] E. Yaacoub and Z. Dawy, “A survey on uplink resource allocation in
[33] Q. Liu, T. Han, and E. Moges, “Edgeslice: Slicing wireless edge OFDMA wireless networks,” IEEE Commun. Surveys Tuts., vol. 14,
computing network with decentralized deep reinforcement learning,” no. 2, pp. 322–337, 2nd Quart., 2012.
2020, arXiv:2003.12911. [55] R. O. Afolabi, A. Dadlani, and K. Kim, “Multicast scheduling and
[34] M. Lin and Y. Zhao, “Artificial intelligence-empowered resource resource allocation algorithms for OFDMA-based systems: A sur-
management for future wireless communications: A survey,” China vey,” IEEE Commun. Surveys Tuts., vol. 15, no. 1, pp. 240–254,
Commun., vol. 17, no. 3, pp. 58–77, Mar. 2020. 1std Quart., 2013.
[35] Q. T. A. Pham, K. Piamrat, and C. Viho, “Resource management in [56] S. Chieochan and E. Hossain, “Adaptive radio resource allocation
wireless access networks: A layer-based classification-version 1.0,” in OFDMA systems: A survey of the state-of-the-art approaches,”
Rep. PI-2017, 2014, p. 23. Wireless Commun. Mobile Comput., vol. 9, no. 4, pp. 513–527, 2009.
[36] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, [57] D. Niyato and E. Hossain, “Radio resource management in MIMO-
“Deep reinforcement learning: A brief survey,” IEEE Signal Process. OFDM- mesh networks: Issues and approaches,” IEEE Commun.
Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017. Mag., vol. 45, no. 11, pp. 100–107, Nov. 2007.
[102] A. A. Khan and R. S. Adve, "Centralized and distributed deep reinforcement learning methods for downlink sum-rate optimization," IEEE Trans. Wireless Commun., vol. 19, no. 12, pp. 8410–8426, Dec. 2020.
[103] F. Meng, P. Chen, L. Wu, and J. Cheng, "Power allocation in multi-user cellular networks: Deep reinforcement learning approaches," IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6255–6267, Oct. 2020.
[104] F. Meng, P. Chen, and L. Wu, "Power allocation in multi-user cellular networks with deep Q learning approach," in Proc. IEEE Int. Conf. Commun. (ICC), 2019, pp. 1–6.
[105] L. Zhang and Y.-C. Liang, "Deep reinforcement learning for multi-agent power control in heterogeneous networks," IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2551–2564, Apr. 2021.
[106] Y. S. Nasir and D. Guo, "Deep actor-critic learning for distributed power control in wireless mobile networks," in Proc. 54th Asilomar Conf. Signals Syst. Comput., 2020, pp. 398–402.
[107] Y. S. Nasir and D. Guo, "Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2239–2250, Oct. 2019.
[108] Z. Bi and W. Zhou, "Deep reinforcement learning based power allocation for D2D network," in Proc. IEEE 91st Veh. Technol. Conf. (VTC-Spring), 2020, pp. 1–5.
[109] S. Saeidian, S. Tayamon, and E. Ghadimi, "Downlink power control in dense 5G radio access networks through deep reinforcement learning," in Proc. IEEE Int. Conf. Commun. (ICC), 2020, pp. 1–6.
[110] Z. Zhang, H. Qu, J. Zhao, and W. Wang, "Deep reinforcement learning method for energy efficient resource allocation in next generation wireless networks," in Proc. Int. Conf. Comput. Netw. Internet Things, 2020, pp. 18–24.
[111] Q. Wang, K. Feng, X. Li, and S. Jin, "PrecoderNet: Hybrid beamforming for millimeter wave systems with deep reinforcement learning," IEEE Wireless Commun. Lett., vol. 9, no. 10, pp. 1677–1681, Oct. 2020.
[112] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, "Intelligent power control for spectrum sharing in cognitive radios: A deep reinforcement learning approach," IEEE Access, vol. 6, pp. 25463–25473, 2018.
[113] L. Li, Q. Cheng, K. Xue, C. Yang, and Z. Han, "Downlink transmit power control in ultra-dense UAV network based on mean field game and deep reinforcement learning," IEEE Trans. Veh. Technol., vol. 69, no. 12, pp. 15594–15605, Dec. 2020.
[114] N. Zhao, Z. Liu, and Y. Cheng, "Multi-agent deep reinforcement learning for trajectory design and power allocation in multi-UAV networks," IEEE Access, vol. 8, pp. 139670–139679, 2020.
[115] M. Yan, B. Chen, G. Feng, and S. Qin, "Federated cooperation and augmentation for power allocation in decentralized wireless networks," IEEE Access, vol. 8, pp. 48088–48100, 2020.
[116] J. G. Luis, M. Guerster, I. del Portillo, E. Crawley, and B. Cameron, "Deep reinforcement learning architecture for continuous power allocation in high throughput satellites," 2019, arXiv:1906.00571.
[117] J. J. G. Luis, M. Guerster, I. del Portillo, E. Crawley, and B. Cameron, "Deep reinforcement learning for continuous power allocation in flexible high throughput satellites," in Proc. IEEE Cogn. Commun. Aerosp. Appl. Workshop (CCAAW), 2019, pp. 1–4.
[118] O. Maraqa, A. S. Rajasekaran, S. Al-Ahmadi, H. Yanikomeroglu, and S. M. Sait, "A survey of rate-optimal power domain NOMA with enabling technologies of future wireless networks," IEEE Commun. Surveys Tuts., vol. 22, no. 4, pp. 2192–2235, 4th Quart., 2020.
[119] X. Yan, K. An, Q. Zhang, G. Zheng, S. Chatzinotas, and J. Han, "Delay constrained resource allocation for NOMA enabled satellite Internet of Things with deep reinforcement learning," IEEE Internet Things J., early access, 2020. [Online]. Available: https://orbilu.uni.lu/bitstream/10993/45468/1/Delay%20Constrained%20Resource%20Allocation%20for%20NOMA.pdf
[120] A. Alwarafy, M. Alresheedi, A. F. Abas, and A. Alsanie, "Performance evaluation of space time coding techniques for indoor visible light communication systems," in Proc. Int. Conf. Opt. Netw. Design Model. (ONDM), 2018, pp. 88–93.
[121] A. Memedi and F. Dressler, "Vehicular visible light communications: A survey," IEEE Commun. Surveys Tuts., vol. 23, no. 1, pp. 161–181, 1st Quart., 2021.
[122] M. Chen, W. Saad, and C. Yin, "Liquid state machine learning for resource allocation in a network of cache-enabled LTE-U UAVs," in Proc. IEEE Global Commun. Conf., 2017, pp. 1–6.
[123] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, and Y. Jiang, "Deep reinforcement learning for user association and resource allocation in heterogeneous networks," in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2018, pp. 1–6.
[124] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, "Deep reinforcement learning for user association and resource allocation in heterogeneous cellular networks," IEEE Trans. Wireless Commun., vol. 18, no. 11, pp. 5141–5152, Nov. 2019.
[125] Q. Zhang, Y.-C. Liang, and H. V. Poor, "Intelligent user association for symbiotic radio networks using deep reinforcement learning," IEEE Trans. Wireless Commun., vol. 19, no. 7, pp. 4535–4548, Jul. 2020.
[126] W. Lei, Y. Ye, and M. Xiao, "Deep reinforcement learning-based spectrum allocation in integrated access and backhaul networks," IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 3, pp. 970–979, Sep. 2020.
[127] Z. Li, C. Wang, and C.-J. Jiang, "User association for load balancing in vehicular networks: An online reinforcement learning approach," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 8, pp. 2217–2228, Aug. 2017.
[128] J. Zheng, X. Tang, X. Wei, H. Shen, and L. Zhao, "Channel assignment for hybrid NOMA systems with deep reinforcement learning," IEEE Wireless Commun. Lett., vol. 10, no. 7, pp. 1370–1374, Jul. 2021.
[129] H. Song, L. Liu, J. Ashdown, and Y. Yi, "A deep reinforcement learning framework for spectrum management in dynamic spectrum access," IEEE Internet Things J., vol. 8, no. 14, pp. 11208–11218, Jul. 2021.
[130] Y. Hu et al., "Optimal transmit antenna selection strategy for MIMO wiretap channel based on deep reinforcement learning," in Proc. IEEE/CIC Int. Conf. Commun. China (ICCC), 2018, pp. 803–807.
[131] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, "Deep reinforcement learning for dynamic multichannel access in wireless networks," IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[132] S. Wang, H. Liu, P. Gomes, and B. Krishnamachari, "Deep reinforcement learning for dynamic multichannel access," in Proc. Int. Conf. Comput. Netw. Commun. (ICNC), 2017, pp. 257–265.
[133] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, "Optimality of myopic sensing in multichannel opportunistic access," IEEE Trans. Inf. Theory, vol. 55, no. 9, pp. 4040–4050, Sep. 2009.
[134] M. Chu, H. Li, X. Liao, and S. Cui, "Reinforcement learning-based multiaccess control and battery prediction with energy harvesting in IoT systems," IEEE Internet Things J., vol. 6, no. 2, pp. 2009–2020, Apr. 2019.
[135] Y. Zhang, P. Cai, C. Pan, and S. Zhang, "Multi-agent deep reinforcement learning-based cooperative spectrum sensing with upper confidence bound exploration," IEEE Access, vol. 7, pp. 118898–118906, 2019.
[136] N. Yang, H. Zhang, and R. Berry, "Partially observable multi-agent deep reinforcement learning for cognitive resource management," in Proc. IEEE Global Commun. Conf., 2020, pp. 1–6.
[137] Y. Xu, J. Yu, W. C. Headley, and R. M. Buehrer, "Deep reinforcement learning for dynamic spectrum access in wireless networks," in Proc. IEEE Military Commun. Conf. (MILCOM), 2018, pp. 207–212.
[138] L. Liang, H. Ye, and G. Y. Li, "Multi-agent reinforcement learning for spectrum sharing in vehicular networks," in Proc. IEEE 20th Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), 2019, pp. 1–5.
[139] L. Liang, H. Ye, and G. Y. Li, "Spectrum sharing in vehicular networks based on multi-agent reinforcement learning," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2282–2292, Oct. 2019.
[140] J. Zhu, Y. Song, D. Jiang, and H. Song, "A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things," IEEE Internet Things J., vol. 5, no. 4, pp. 2375–2385, Aug. 2018.
[141] H. Khan, A. Elgabli, S. Samarakoon, M. Bennis, and C. S. Hong, "Reinforcement learning-based vehicle-cell association algorithm for highly mobile millimeter wave communication," IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 4, pp. 1073–1085, Dec. 2019.
[142] P. Yang et al., "Dynamic spectrum access in cognitive radio networks using deep reinforcement learning and evolutionary game," in Proc. IEEE/CIC Int. Conf. Commun. China (ICCC), 2018, pp. 405–409.
[185] S. Shrivastava, B. Chen, C. Chen, H. Wang, and M. Dai, "Deep Q-network learning based downlink resource allocation for hybrid RF/VLC systems," IEEE Access, vol. 8, pp. 149412–149434, 2020.
[186] Q. Huang, X. Xie, and M. Cheriet, "Reinforcement learning-based hybrid spectrum resource allocation scheme for the high load of URLLC services," EURASIP J. Wireless Commun. Netw., vol. 2020, no. 1, pp. 1–21, 2020.
[187] H. Yang et al., "Intelligent reflecting surface assisted anti-jamming communications based on reinforcement learning," in Proc. IEEE Global Commun. Conf., 2020, pp. 1–6.
[188] H. Yang et al., "Intelligent reflecting surface assisted anti-jamming communications: A fast reinforcement learning approach," IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1963–1974, Mar. 2021.
[189] E. Wong, K. Leung, and T. Field, "State-space decomposition for reinforcement learning," Dept. Comput., Imperial College London, London, U.K., Rep., 2021.
[190] L. Zahedi, F. G. Mohammadi, S. Rezapour, M. W. Ohland, and M. H. Amini, "Search algorithms for automated hyper-parameter tuning," 2021, arXiv:2104.14677.
[191] N. V. Varghese and Q. H. Mahmoud, "A survey of multi-task deep reinforcement learning," Electronics, vol. 9, no. 9, p. 1363, 2020.
[192] K. Lei, Y. Liang, and W. Li, "Congestion control in SDN-based networks via multi-task deep reinforcement learning," IEEE Netw., vol. 34, no. 4, pp. 28–34, Jul./Aug. 2020.
[193] H. Eghbal-Zadeh, F. Henkel, and G. Widmer, "Context-adaptive reinforcement learning using unsupervised learning of context variables," in Proc. Workshop Pre-Registration Mach. Learn., 2021, pp. 236–254.
[194] T. T. Nguyen, N. D. Nguyen, P. Vamplew, S. Nahavandi, R. Dazeley, and C. P. Lim, "A multi-objective deep reinforcement learning framework," Eng. Appl. Artif. Intell., vol. 96, Nov. 2020, Art. no. 103915.
[195] A. Adadi and M. Berrada, "Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)," IEEE Access, vol. 6, pp. 52138–52160, 2018.
[196] A. Heuillet, F. Couthouis, and N. Díaz-Rodríguez, "Explainability in deep reinforcement learning," Knowl. Based Syst., vol. 214, Feb. 2021, Art. no. 106685.
[197] Y. He, Y. Wang, C. Qiu, Q. Lin, J. Li, and Z. Ming, "Blockchain-based edge computing resource allocation in IoT: A deep reinforcement learning approach," IEEE Internet Things J., vol. 8, no. 4, pp. 2226–2237, Feb. 2021.
[198] F. Guo, F. R. Yu, H. Zhang, H. Ji, M. Liu, and V. C. M. Leung, "Adaptive resource allocation in future wireless networks with blockchain and mobile edge computing," IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 1689–1703, Mar. 2020.
[199] S. Hu, Y.-C. Liang, Z. Xiong, and D. Niyato, "Blockchain and artificial intelligence for dynamic resource sharing in 6G and beyond," IEEE Wireless Commun., vol. 28, no. 4, pp. 145–151, Aug. 2021.
[200] L. Yang, M. Li, P. Si, R. Yang, E. Sun, and Y. Zhang, "Energy-efficient resource allocation for blockchain-enabled Industrial Internet of Things with deep reinforcement learning," IEEE Internet Things J., vol. 8, no. 4, pp. 2318–2329, Feb. 2021.
[201] O. A. Wahab, A. Mourad, H. Otrok, and T. Taleb, "Federated machine learning: Survey, multi-level classification, desirable criteria and future directions in communication and networking systems," IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 1342–1397, 2nd Quart., 2021.
[202] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in Proc. IEEE Conf. Comput. Commun., 2019, pp. 1387–1395.
[203] A. M. Albaseer, M. Abdallah, A. Al-Fuqaha, and A. Erbad, "Fine-grained data selection for improved energy efficiency of federated edge learning," IEEE Trans. Netw. Sci. Eng., early access, Jul. 29, 2021, doi: 10.1109/TNSE.2021.3100805.
[204] S. Wang et al., "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
[205] X. Wang, C. Wang, X. Li, V. C. Leung, and T. Taleb, "Federated deep reinforcement learning for Internet of Things with decentralized cooperative edge caching," IEEE Internet Things J., vol. 7, no. 10, pp. 9441–9455, Oct. 2020.
[206] H. H. Zhuo, W. Feng, Q. Xu, Q. Yang, and Y. Lin, "Federated reinforcement learning," 2019, arXiv:1901.08277.
[207] B. Das and S. Roy, "Load balancing techniques for wireless mesh networks: A survey," in Proc. Int. Symp. Comput. Bus. Intell., 2013, pp. 247–253.
[208] L. Zhu, W. Shen, S. Pan, R. Li, and Z. Li, "A dynamic load balancing method for spatial data network service," in Proc. 5th Int. Conf. Wireless Commun. Netw. Mobile Comput., 2009, pp. 1–3.
[209] H. Desai and R. Oza, "A study of dynamic load balancing in grid environment," in Proc. Int. Conf. Wireless Commun. Signal Process. Netw. (WiSPNET), 2016, pp. 128–132.
[210] F. Khayatian, Z. Nagy, and A. Bollinger, "Using generative adversarial networks to evaluate robustness of reinforcement learning agents against uncertainties," Energy Build., vol. 251, Nov. 2021, Art. no. 111334.
[211] A. T. Z. Kasgari, W. Saad, M. Mozaffari, and H. V. Poor, "Experienced deep reinforcement learning with generative adversarial networks (GANs) for model-free ultra reliable low latency communication," IEEE Trans. Commun., vol. 69, no. 2, pp. 884–899, Feb. 2021.
[212] R. Alghamdi et al., "Intelligent surfaces for 6G wireless networks: A survey of optimization and performance analysis techniques," IEEE Access, vol. 8, pp. 202795–202818, 2020.
[213] C. Huang et al., "Holographic MIMO surfaces for 6G wireless networks: Opportunities, challenges, and trends," IEEE Wireless Commun., vol. 27, no. 5, pp. 118–125, Oct. 2020.
[214] Q. Wu and R. Zhang, "Towards smart and reconfigurable environment: Intelligent reflecting surface aided wireless network," IEEE Commun. Mag., vol. 58, no. 1, pp. 106–112, Jan. 2020.
[215] C. Pan et al., "Reconfigurable intelligent surface for 6G and beyond: Motivations, principles, applications, and research directions," 2020, arXiv:2011.04300.
[216] A. Taha, Y. Zhang, F. B. Mismar, and A. Alkhateeb, "Deep reinforcement learning for intelligent reflecting surfaces: Towards standalone operation," in Proc. IEEE 21st Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), 2020, pp. 1–5.
[217] C. Huang, R. Mo, and C. Yuen, "Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning," IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1839–1850, Aug. 2020.
[218] Z. Yang, Y. Liu, Y. Chen, and J. T. Zhou, "Deep reinforcement learning for RIS-aided non-orthogonal multiple access downlink networks," in Proc. IEEE Global Commun. Conf., 2020, pp. 1–6.
[219] L. U. Khan, W. Saad, D. Niyato, Z. Han, and C. S. Hong, "Digital-twin-enabled 6G: Vision, architectural trends, and future directions," 2021, arXiv:2102.12169.
[220] W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn, "Digital twin in manufacturing: A categorical literature review and classification," IFAC-PapersOnLine, vol. 51, no. 11, pp. 1016–1022, 2018.
[221] Y. Lu, S. Maharjan, and Y. Zhang, "Adaptive edge association for wireless digital twin networks in 6G," IEEE Internet Things J., vol. 8, no. 22, pp. 16219–16230, Nov. 2021.
[222] W. Sun, N. Xu, L. Wang, H. Zhang, and Y. Zhang, "Dynamic digital twin and federated learning with incentives for air-ground networks," IEEE Trans. Netw. Sci. Eng., vol. 9, no. 1, pp. 321–333, Jan./Feb. 2022.
[223] Y. Lu, X. Huang, K. Zhang, S. Maharjan, and Y. Zhang, "Communication-efficient federated learning and permissioned blockchain for digital twin edge networks," IEEE Internet Things J., vol. 8, no. 4, pp. 2276–2288, Feb. 2021.

ABDULMALIK ALWARAFY received the B.S. degree in electrical engineering with a minor in communication from IBB University, Yemen, and the M.Sc. degree in electrical engineering with a minor in communication from King Saud University, Saudi Arabia. He is currently pursuing the Ph.D. degree with the College of Science and Engineering, Hamad Bin Khalifa University, Qatar. His current research interests include radio resource allocation and management for mobile networks, and deep reinforcement learning techniques for 6G and beyond wireless communication networks.