
Received 6 January 2022; revised 31 January 2022; accepted 18 February 2022.

Date of publication 23 February 2022; date of current version 3 March 2022.


Digital Object Identifier 10.1109/OJCOMS.2022.3153226

The Frontiers of Deep Reinforcement Learning for Resource Management in Future Wireless HetNets: Techniques, Challenges, and Research Directions
ABDULMALIK ALWARAFY , MOHAMED ABDALLAH (Senior Member, IEEE),
BEKIR SAIT ÇIFTLER (Member, IEEE), ALA AL-FUQAHA (Senior Member, IEEE),
AND MOUNIR HAMDI (Fellow, IEEE)
Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
CORRESPONDING AUTHOR: A. ALWARAFY (e-mail: [email protected])
This publication was made possible by NPRP-Standard (NPRP-S) Thirteen (13th) Cycle grant # NPRP13S-0201-200219 from the Qatar National Research Fund
(a member of Qatar Foundation). The findings herein reflect the work, and are solely the responsibility, of the authors.

ABSTRACT Next generation wireless networks are expected to be extremely complex due to their
massive heterogeneity in terms of the types of network architectures they incorporate, the types and
numbers of smart IoT devices they serve, and the types of emerging applications they support. In such
large-scale and heterogeneous networks (HetNets), radio resource allocation and management (RRAM)
becomes one of the major challenges encountered during system design and deployment. In this context,
emerging Deep Reinforcement Learning (DRL) techniques are expected to be one of the main enabling
technologies to address the RRAM in future wireless HetNets. In this paper, we conduct a systematic,
in-depth, and comprehensive survey of the applications of DRL techniques in RRAM for next generation
wireless networks. Towards this, we first overview the existing traditional RRAM methods and identify
their limitations that motivate the use of DRL techniques in RRAM. Then, we provide a comprehensive
review of the most widely used DRL algorithms to address RRAM problems, including the value- and
policy-based algorithms. The advantages, limitations, and use-cases for each algorithm are provided. We
then conduct a comprehensive and in-depth literature review and classify existing related works based on
both the radio resources they are addressing and the type of wireless networks they are investigating. To
this end, we carefully identify the types of DRL algorithms utilized in each related work, the elements
of these algorithms, and the main findings of each related work. Finally, we highlight important open
challenges and provide insights into several future research directions in the context of DRL-based RRAM.
This survey is intentionally designed to guide and stimulate more research endeavors towards building
efficient and fine-grained DRL-based RRAM schemes for future wireless networks.

INDEX TERMS Radio resource allocation and management, deep reinforcement learning, next generation
wireless networks, HetNets, power, bandwidth, rate, access control.

I. INTRODUCTION

RADIO resource allocation and management (RRAM) is regarded as one of the essential challenges encountered in modern wireless communication networks [1]. Nowadays, modern wireless networks are becoming more heterogeneous and complex in terms of the types of emerging radio access networks (RANs) they integrate, the explosive number and types of smart devices they serve, and the types of disruptive applications and services they support [2], [3]. It is envisaged that future networks will integrate land, air, space, and deep-sea wireless networks into a single network to meet the stringent requirements of a fully-connected world vision [4], [5], as shown in Fig. 1. This will ensure ubiquitous connectivity for user devices with enhanced quality of service (QoS) in terms of coverage, reliability, and throughput. In addition, future user devices will also witness an unprecedented increase in their numbers and in the types of data-hungry applications they require/support [3], [6]. It is expected that by 2023, the number of user networked devices and connections, including smart-phones, tablets, wearable devices, and sensors, will reach 29.3 billion [6], and generate a data rate exceeding 50 trillion GB [1]. All these trends will exacerbate the burdens during system design, planning, deployment, operation, and management. In particular, RRAM will become crucial in such complex and large-scale networks in order to guarantee an enhanced communications experience.

FIGURE 1. A pictorial illustration of next generation wireless networks characterized by their massive heterogeneity in terms of RAN infrastructures, types and numbers of user devices served, and types of applications and services supported.

RRAM plays a pivotal role during infrastructure planning, implementation, and resource optimization of modern wireless networks. Efficient RRAM solutions will guarantee enhanced network connectivity, increased system efficiency, and reduced energy consumption. The performance of wireless networks heavily relies on two aspects. First, how network radio resources are utilized, managed, and orchestrated, including transmit power control, spectrum channel allocation, and user access control. Second, how efficiently the system can react to the rapid changes of network dynamics, including wireless channel statistics, user mobility patterns, instantaneous radio resource availability, and variability in traffic loads. Efficient RRAM techniques must efficiently and dynamically account for such design aspects in order to ensure high network QoS and enhanced users' Quality of Experience (QoE).

Deep reinforcement learning (DRL) is a branch of artificial intelligence (AI) that enables network entities, such as base stations, user devices, edge servers, gateways, and access points, to continuously interact with the environment and make autonomous control decisions [7]–[14]. DRL techniques have attracted considerable research attention recently and have demonstrated efficient performance in addressing complex wireless optimization problems, including RRAM problems. Therefore, experts expect DRL methods to be one of the main enabling technologies for future wireless networks due to their ability to overcome the limitations of traditional RRAM techniques [2], [15].

A. MOTIVATIONS OF THE PAPER
The main motivations of this work stem from three aspects. First, the paramount importance of allocating radio resources in future wireless networks. Second, the limitations and shortcomings of existing state-of-the-art RRAM techniques. Third, the robustness of deep reinforcement learning techniques in alleviating these limitations and providing efficient performance in the context of RRAM. Here we elaborate more on each aspect.

1) IMPORTANCE OF RRAM IN MODERN WIRELESS NETWORKS
The explosive growth in the number and types of modern smart devices, such as smartphones/tablets and wearable devices, has led to the emergence of disruptive wireless communications and networking technologies, such as 5G NR cellular networks, IoT networks, personal (or wireless body area) networks, device-to-device (D2D) communications, holographic imaging and haptic communications, and vehicular networks [3], [4], [16]–[23]. Such networks are
envisaged to meet the stringent requirements of the emerging applications and services via supporting high data rates, coverage, and connectivity with significant enhancements in reliability, reduction in latency, and mitigation of energy consumption.

However, achieving this goal in such large-scale, versatile, and complex wireless networks is quite challenging, as it requires a judicious allocation and management of the networks' limited radio resources [24], [25]. In particular, efficient and more advanced RRAM solutions must be developed to balance the tradeoff between enhancing network performance and guaranteeing an efficient utilization of radio resources. Furthermore, efficient RRAM solutions must also strike an intelligent tradeoff between optimizing network radio resources and satisfying users' QoE. For example, RRAM techniques must jointly enhance network spectral efficiency (SE), energy efficiency (EE), and throughput while mitigating interference, reducing latency, and enhancing rates for user devices.

Efficient and advanced RRAM schemes can considerably enhance the system's SE compared to traditional techniques by relying on advanced channel and/or source coding methods. RRAM is essential in broadcast wireless networks covering wide geographical areas as well as in modern cellular communication networks comprised of several adjacent and dense access points (APs) that typically share and reuse the same radio frequencies.

From a cost point of view, the deployment of wireless APs and sites, e.g., base stations (BSs), including the real estate costs, planning, maintenance, and energy, is the most critical aspect alongside the frequency license fees. Hence, the goal of RRAM is maximizing the network's SE in terms of bits/sec/Hz/area unit or Erlang/MHz/site, under some constraints related to user fairness. For instance, the service grade must meet a minimum acceptable level of QoS, including the coverage of certain geographical areas while mitigating network outages caused by interference, noise, large-scale fading (due to path losses and shadowing), and small-scale fading (due to multi-path). The service grade also depends on blocking caused by admission control, scheduling errors, or the inability to meet certain QoS demands of edge devices (EDs).

2) WHERE DO TRADITIONAL RRAM TECHNIQUES FAIL?
Future wireless communication networks are complex due to their large-scale, versatile, and heterogeneous nature. To optimally allocate and manage radio resources in such networks, we typically formulate RRAM as complex optimization problems. The objective of such problems is to achieve a particular goal, such as maximizing network sum-rate, SE, and EE, given the available radio resources and QoS requirements of user devices. Unfortunately, the massively heterogeneous nature of modern networks poses tremendous challenges during the process of formulating optimization problems as well as applying conventional techniques to solve them, such as optimization, heuristic, and game theory algorithms.

The large-scale nature of next generation networks makes it quite difficult to formulate RRAM optimization problems, which are often intractable and non-convex. Also, conventional techniques used to solve RRAM problems require complete or quasi-complete knowledge of the wireless environment, including accurate channel models and real-time channel state information (CSI). However, obtaining such information in a real-time fashion in these large-scale networks is quite difficult or even impossible. Furthermore, conventional techniques are often computationally expensive and incur considerable timing overhead. This renders them inefficient for most emerging time-sensitive applications, such as autonomous vehicles and robotics.

Moreover, game theory-based techniques are unsuitable for future heterogeneous networks (HetNets), as such techniques are devised for homogeneous players. Also, the explosive number of network APs and user devices will create extra burdens on game theory-based techniques. In particular, network players, such as BSs, APs, and user devices, need to exchange a tremendous amount of data and signaling. This will induce unmanageable overhead that largely increases the delay, computation, and energy/memory consumption of network elements.

3) HOW CAN DRL OVERCOME THESE CHALLENGES AND PROVIDE EFFICIENT RRAM SOLUTIONS?
Emerging artificial intelligence (AI) techniques, such as deep reinforcement learning (DRL), have shown efficient performance in addressing various issues in modern wireless communication networks, including solving complex RRAM optimization problems [7]–[15]. In the context of RRAM, DRL methods are mainly used as an alternative to overcome the shortcomings and limitations of the conventional RRAM techniques discussed above. In particular, DRL techniques can solve complex network RRAM optimization problems and make judicious control decisions with only limited information about the network statistics. They achieve this by enabling network entities, such as BSs, RAN APs, edge servers (ESs), gateway nodes, and user devices, to make intelligent and autonomous control decisions, such as RRAM, user association, and RAN selection, in order to achieve various network goals such as sum-rate maximization, reliability enhancement, delay reduction, and SE/EE maximization. In addition, DRL techniques are model-free, enabling different network entities to learn optimal policies about the network, such as RRAM and user association, based on their continuous interactions with the wireless environment, without knowing the exact channel models or other network statistics a priori. These appealing features make DRL methods one of the main key enabling technologies to address the RRAM issue in modern wireless communication networks [2], [3].

B. RELATED WORK
There is a limited number of surveys that focus on the role of DRL in RRAM. Existing related surveys are listed in Table 1.


TABLE 1. Relationship between this survey and existing surveys on DRL-based RRAM for wireless networks.

The table also summarizes the topics covered in these surveys, along with a mapping to the relevant sections of this paper and a categorical discussion of the improvements and value added in this paper relative to these surveys. In general, as reported in Table 1, these published surveys still have several research gaps that are addressed in this survey. We summarize them as follows.
• Some of the existing surveys focus on DRL applications in wireless communications and networking in general, without paying much attention to RRAM [10], [15]. For example, existing surveys cover topics related to DRL enabling technologies, use-cases, architectures, security, scheduling, clustering and data aggregation, traffic management, etc.
• Some of the published surveys focus on RRAM for wireless networks using ML and/or DL techniques without paying much attention to DRL techniques [1], [24], [26], [27]. For example, they consider ML techniques such as convolutional neural networks (CNN), recurrent neural networks (RNN), supervised learning, Bayesian learning, K-means clustering, Principal Component Analysis (PCA), etc.
• Even the surveys that address DRL for RRAM in wireless networks focus on specific wireless network types or applications [8], [9], [11], [12], [28], miss some of the recent research, do not provide an adequate overview of the most widely used DRL algorithms for RRAM [12], or do not cover RRAM in depth but, rather, cover only a limited number of radio resources.
Hence, the role of this paper is to fill these research gaps and overcome these shortcomings. In particular, we provide a comprehensive survey on the application of DRL techniques in RRAM for next generation wireless communication networks. We have carefully cited up-to-date surveys and related research works. We should emphasize here that the scope of this paper is focused only on radio (or communication) resources; no computation resources are included in the study and analysis. Fig. 2 shows the radio resources or issues addressed in this survey. Computation resource aspects such as offloading, storage, task scheduling, caching, etc., can be found in other studies such as [29]–[33] and the references therein.


FIGURE 2. Classification based on the radio resources (or issues) addressed in the papers.

FIGURE 3. Classification based on the network types covered in the papers.

FIGURE 4. Percentages of related works based on (a) types of radio resources covered and (b) types of networks and applications investigated. RA: resource allocation, WNs: wireless networks.

FIGURE 5. The review protocol followed in this survey.

C. PAPER CONTRIBUTIONS
The main contributions of this paper are summarized as follows.
1) We provide a detailed discussion on the state-of-the-art techniques used for RRAM in wireless networks, including their types, shortcomings, and limitations that led to the adoption of DRL solutions.
2) We identify the most widely used DRL techniques utilized in RRAM of wireless networks and provide a comprehensive overview of them. The advantages, features, and limitations of each technique are discussed. Hence, the reader is provided with in-depth knowledge of which DRL techniques should be leveraged for each RRAM problem under investigation.
3) We conduct an extensive and up-to-date literature review and classify the papers reported in the literature based on the type of radio resources they address (as shown in Fig. 2) and the types of wireless networks, applications, and services they consider (as shown in Fig. 3). Specifically, for each paper reviewed, we identify the problem it addresses, the type of wireless network it investigates, the type of DRL model(s) it implements, the main elements of the DRL models (i.e., agent, state space, action space, and reward function), and its main findings. This provides the reader with in-depth technical knowledge of how to efficiently engineer DRL models for RRAM problems in wireless communications.
4) Based on the papers reviewed in this survey, we outline and identify some of the existing challenges and provide deep insights into some promising future research directions in the context of using DRL for RRAM in wireless networks.
Fig. 4 shows the percentage of the related works, classified based on the types of radio resources discussed in each paper (Fig. 4 (a)) and based on the types of wireless networks studied in each paper (Fig. 4 (b)). This survey is designed by carefully following the review protocol illustrated in Fig. 5. Since this survey mainly focuses on deep reinforcement learning for RRAM in wireless networks, we included the following terms during the search stage, along with "AND/OR" combinations of them: "deep reinforcement learning," "DRL," "resource allocation," "resource management," "power," "spectrum," "bandwidth," "access control," "user association," "network selection," "cell selection,"
"rate control," "joint resources," "wireless networks," "satellite networks," "cellular networks," and "Heterogeneous networks." The number of papers found and the databases searched are detailed in Fig. 5. The inclusion criteria are papers that address the use of DRL techniques to manage and allocate the radio resources shown in Fig. 2 for the wireless networks shown in Fig. 3. The exclusion criteria are papers that: 1) address computation resources, e.g., task offloading, storage, scheduling, etc., 2) use conventional RRAM approaches, i.e., not using DRL techniques, 3) use ML/DL techniques, or 4) address non-wireless networks, e.g., wired networks, optical networks, etc. In Fig. 5, the number of papers excluded after a detailed check of the body is 71; these papers are directly related to our survey but are not influential, do not clearly identify the types of DRL algorithms used, the elements of the DRL models (i.e., agents, state space, action space, and reward function), or the types of wireless networks covered, and/or are not well written.

In general, the research questions that this survey aims to address are stated as follows. How can DRL techniques be implemented to address the RRAM problems in modern wireless networks? What are the performance advantages achieved when using DRL tools compared to the state-of-the-art RRAM approaches? What are the most effective and widely used DRL algorithms to address the RRAM problems, and how can they be implemented? What are the most important and influential papers that present DRL-based solutions for RRAM in next generation wireless networks? What are the challenges and possible research directions that stem from the reviewed papers in the context of using DRL for RRAM in wireless networks? The retrieved papers shown in Fig. 5, i.e., the 76 papers, are selected carefully to help with answering these questions, as we will elaborate in the next sections.

It is observed from Fig. 4 (a) that the majority of related works are on the Spectrum and Access Control radio resources, followed by both the Power radio resource and Joint radio resources. Also, as shown in Fig. 4 (b), the related works on the IoT and Other Emerging Wireless Networks have received more attention than the other wireless network types, followed by the Cellular Networks.

TABLE 2. List of acronyms used and their definitions.

The rest of this paper is organized as follows. Table 2 lists the acronyms used in this paper and their definitions. Section II discusses existing RRAM techniques, including conventional methods and DRL-based methods. The definitions, types, and limitations of existing techniques are discussed. Also, the advantages of employing DRL techniques for RRAM are explained. Section III provides an overview of the DRL techniques widely employed for RRAM, including their types and architectures. In-depth classifications of the existing research works are provided in Section IV. Existing papers are classified based on the radio resources and the network types they cover. Section V provides key open challenges, lessons learned, and some insights into future research directions. Finally, Section VI concludes the paper. The organization of the paper is pictorially illustrated in Fig. 6.

II. RADIO RESOURCE ALLOCATION AND MANAGEMENT TECHNIQUES
In this section, we define the main radio resources of interest and provide a summary of the conventional techniques and tools used for RRAM in wireless networks. Also, the limitations of these conventional techniques that motivate the use of DRL solutions will be highlighted. Then we discuss how DRL techniques can be efficient alternatives to these traditional approaches.


FIGURE 6. Organization of the paper.

A. RADIO RESOURCES: DEFINITIONS AND TYPES (OR ISSUES)
In general, the allocation and management of wireless network resources include radio (i.e., communication) and computation resources. This paper focuses only on the RRAM issue. This involves the strategies and algorithms used to control and manage wireless network parameters and resources, such as transmit power, spectrum allocation, user association/assignment, rate control, access control, etc. The main goal of wireless networks, in general, is to utilize and manage these available radio resources as efficiently as possible to provide enhanced network QoS, such as enhanced data rate, SE, EE, reliability, connectivity, and coverage, while meeting users' QoS demands.

Efficient RRAM schemes can considerably enhance the system's SE compared to the traditional techniques relying on advanced channel and/or source coding methods. For example, future wireless networks are expected to cover broad geographical areas with ultra-dense network (UDN) deployments. In these UDNs, a massive number of adjacent APs typically require sharing communication resources, such as radio frequencies and channels, to utilize resources and enhance network QoS. RRAM would be essential in such UDN-based network deployments [38].

The most crucial radio resources or issues that play a fundamental role in controlling wireless networks' performance are summarized below.
• Power resource: This is one of the most critical issues in the RRAM of modern HetNets. Transmit power allocation in the downlink/uplink from/to network APs, such as BSs and edge servers (ESs), is essential to guarantee a satisfactory QoS for communication links. Power control is essential from two perspectives: physical limitations and communication links. Practically, the maximum power is limited by the capability of APs' power amplifiers or by government regulation. Hence, it is common to incorporate the limited power resource as a constraint during the design and implementation of HetNets. On the other hand, power control is also needed to guarantee enhanced networks' QoS and user devices' QoE. For example, in large-scale networks and UDNs such as the mmWave and THz band systems [2], [3], [39], signal attenuation due to path
losses must be accounted for during power budget analysis. Also, the coverage of BSs' cells and the inter- and intra-cell interference issues become crucial, and these are mainly determined by the transmit power level. Hence, developing adaptive and fine-grained power allocation and interference management strategies is essential to address such challenges.
• Spectrum resource and access control: This is also another main issue in the RRAM of modern HetNets. User devices must be allocated frequency channels to start transmitting/receiving data with acceptable SNR. Existing wireless networks, such as the sub-6 GHz, suffer from a severe bandwidth shortage, which is even exacerbated with the explosive increase in the number of user devices [6]. Fortunately, the mmWave and emerging THz bands can considerably overcome this shortcoming by providing an extra 3.25 GHz and 10-100 GHz of bandwidth, respectively [40]. It is also expected that future user devices will be equipped with advanced capabilities that enable them to aggregate all three frequency bands, i.e., the sub-6 GHz, mmWave, and THz, to support future technologies and services [41]. However, allocating and managing the radio channels of these frequency bands across multiple RANs to a massive number of user devices mandates developing advanced signal processing techniques. Unfortunately, such techniques require perfect knowledge of network statistics and CSI, which is quite difficult or even impossible due to the large-scale and massive heterogeneity of modern HetNets. Hence, it is expected that future HetNets will integrate DRL methods with signal processing techniques to overcome this issue.
• User association: With the ever-increasing number of IoT smart devices and the varying QoS demands of emerging applications, it becomes necessary to ensure reliable network hyper-connectivity to these devices [2], [39]. User association defines which BS(s), RAN AP(s), or edge server(s) each user device must connect/associate to/with in order to guarantee its QoS demands. Taking into consideration the multi-RAN and multi-connectivity nature of modern HetNets [3], it is expected that future devices will be equipped with SDR capabilities that enable them to support multi-association/assignment to multiple RANs simultaneously [41]. Based on users' QoS demands, devices can operate in a multi-mode or multi-homing fashion. In the multi-mode fashion, each device is associated with a single RAN AP at a time [41], [42] in a traditional fashion, whereas in the multi-homing fashion, devices can be associated with multiple RAN APs simultaneously to aggregate RANs' radio resources. Achieving such a goal, however, is also another challenging issue. Obtaining real-time information on the network statistics, such as CSI, traffic load, RANs' occupancy, and user devices' QoS demands, requires unmanageable and intolerable overhead. Hence, DRL techniques can be adopted in such a scenario to dynamically learn the channel and perform autonomous user association/assignment decisions.
• Rate control: Often, the main objective of RRAM is to maximize the QoS of HetNets in terms of network sum-rate or SE. This is typically achieved by formulating complex wireless network optimization problems and deriving their solutions subject to the available network radio resources while respecting the data rate demands of user devices. However, accurate solutions for such problems require full knowledge of the wireless channel gains, including the large-scale and small-scale fading [43]. Obtaining such knowledge in real-time is quite difficult, especially in modern HetNets, due to the rapid increase in the underlying RANs/user devices and the types of applications. Moreover, multi-RAN data rate aggregation has also been proposed recently [41], [44] to support the multi-Gbps data rate requirements of the emerging applications. Hence, it becomes imperative to develop efficient schemes that enable rate aggregation while having limited knowledge of the channels. DRL methods can be employed to achieve this goal [41], [42], [44].

B. CONVENTIONAL RRAM TECHNIQUES
In this subsection, we overview the state-of-the-art approaches and tools used for RRAM in modern HetNets. RRAM techniques can be classified into two broad categories based on their adaptivity to the wireless environment: static and dynamic approaches. Each of these can be further classified based on various criteria, such as centralized or distributed, instantaneous or ergodic, optimal or sub-optimal, single-cell or multi-cell, and cooperative or non-cooperative, in addition to different combinations of these variants. In this paper, we discuss the general features of the static and dynamic techniques along with their types.

RRAM has been one of the major research interests in wireless networks using conventional approaches. It has been extensively surveyed for various wireless networks and systems. Table 3 lists some of the existing surveys on resource allocation and management using conventional methods along with the types of wireless networks and systems they study.

1) STATIC TECHNIQUES
Static approaches are designed based on a priori statistical information and cannot adapt to wireless network parameters, such as traffic load, users' mobility patterns, channel conditions/quality, network spectrum occupancy, and users' QoS demands. These techniques are simple; however, they suffer from several shortcomings, such as severe under-utilization of radio resources, increased network outage, reduced network throughput, and poor network QoS.

Static RRAM techniques are employed in several traditional networks, such as cellular networks and WLANs.


TABLE 3. Existing surveys on resource allocation and management for wireless networks and systems using conventional approaches.

Examples of static RRAM techniques include circuit-mode communication using frequency division multiple access (FDMA) and time division multiple access (TDMA) schemes and fixed radio resource allocation, such as fixed power and channel allocation.

2) DYNAMIC TECHNIQUES
On the contrary, dynamic or adaptive RRAM approaches are more efficient, as they can dynamically adjust the network radio resources to accurately track variations in propagation conditions and user QoS requirements.

Dynamic RRAM schemes are widely utilized in designing modern HetNets. They have shown efficient results in reducing the expensive manual network planning and achieving tighter radio resource utilization, which leads to enhanced network efficiency. Some RRAM schemes are centralized, where several BSs, ESs, APs, and network gateways are controlled by a central Radio Network Controller (RNC). Others are distributed, either as autonomous algorithms implemented in user devices, BSs, and ESs, or coordinated by exchanging information among these network entities. Examples of dynamic RRAM schemes include power control algorithms, spectrum/channel allocation algorithms, multi-access control schemes, traffic/link adaptation algorithms, channel-dependent scheduling schemes, and cognitive radio approaches.

In dynamic RRAM, we typically formulate the RRAM as complex optimization problems. The main objective of such problems is maximizing/minimizing some utility/cost functions, e.g., network sum-rate, EE, and SE, while constraining the available network's radio resources. The state-of-the-art approaches to solve these RRAM optimization problems are heuristic-based, optimization-based, and game theory-based approaches. Such approaches employ advanced algorithms to solve the RRAM problem either optimally or sub-optimally.
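To make the preceding discussion concrete, the following is an illustrative (and deliberately simplified) example of such a formulation; the symbols p_k, h_{j,k}, sigma^2, and P_max are generic placeholders introduced here for illustration rather than notation used elsewhere in this survey. It casts downlink power allocation as a sum-rate maximization under a total power budget:

\max_{p_1,\dots,p_K \,\ge\, 0} \;\; \sum_{k=1}^{K} \log_2\!\left(1 + \frac{h_{k,k}\,p_k}{\sigma^2 + \sum_{j \ne k} h_{j,k}\,p_j}\right)
\quad \text{s.t.} \quad \sum_{k=1}^{K} p_k \le P_{\max},

where p_k is the power allocated to user k, h_{j,k} is the channel gain from transmitter j to user k, \sigma^2 is the noise power, and P_{\max} is the AP power budget. The interference coupling in the denominator makes the objective non-convex, which is precisely the difficulty that the heuristic, optimization-based, and game theory-based tools below (and, later, DRL) attempt to work around.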
a) Heuristic-based techniques: These techniques allocate radio resources sub-optimally and without any performance guarantee. They are typically used to provide approximate and sub-optimal solutions in cases where the solution of the formulated optimization problem is quite complex or intractable. Modern wireless systems such as 4G LTE implement some types of greedy heuristics [64]. Examples of heuristic algorithms include the recursive branch-and-bound state-space search algorithm [65] and the alpha-beta search algorithm [66].
b) Optimization-based techniques: Typically, most of the RRAM optimization problems in modern HetNets are non-convex (e.g., continuous power allocation) [67], combinatorial (e.g., user association and channel access) [24], or mixed-integer nonlinear programming (MINP) problems (e.g., combined continuous- and discrete-type problems) [41]. Many algorithms have been developed to systematically solve such problems and find either the global optimum solution or a sub-optimal solution, e.g., [24], [68]–[71]. Such algorithms include fractional programming (FP) [67], [72], Weighted Minimum Mean Square Error (WMMSE) [67], [72], and evolutionary algorithms (e.g., particle swarm optimization (PSO) [73], [74], genetic algorithms [75], [76], and ant/bee colony optimization algorithms [77], [78]), among others. These algorithms are extremely computationally extensive and are typically executed in a central RNC with full and real-time information about network statistics and CSI.
c) Game theory-based methods: Game theory techniques are used for distributed RRAM in modern HetNets when network entities (i.e., players) cooperate or compete on radio resources. Such techniques have shown efficient results, and they are widely used as tools to model complex wireless optimization problems in a decentralized fashion [1]. In particular, the RRAM problem is formulated as a cooperative or non-cooperative game/optimization problem between network entities (e.g., BSs, RANs' APs, and user devices). In cooperative game techniques, players collaboratively solve the underlying RRAM game using heuristic- or optimization-based techniques to achieve a specific network goal (e.g., sum-rate or SE/EE maximization). However, in non-cooperative game techniques, players try to solve the RRAM game in a greedy and non-collaborative fashion in order to achieve their own goals (e.g., to satisfy their own QoS demands). The main goal of most game theory algorithms is to find the Nash Equilibrium (NE) solution for the underlying RRAM problem.

C. LIMITATIONS OF CONVENTIONAL RRAM TECHNIQUES
Unfortunately, all these state-of-the-art approaches will encounter severe limitations in future HetNets, which mainly motivate the usage of DRL in RRAM. Here we summarize the main limitations; the interested reader can also refer to [1].
• Most of these approaches require complete or quasi-complete knowledge of the wireless environment, including accurate channel models and real-time CSI. However, obtaining such accurate information in future HetNets is quite difficult or even impossible due to the large-scale, ultra-dense, and massively heterogeneous nature of the system.
• These approaches are generally not scalable, as they encounter several challenges when the number of user devices becomes very large or when used in UDNs.
The main reason is that the optimization space becomes prohibitively large to cater to the whole network, which will lead to a significant increase in computational complexity when finding optimal solutions. With the large-scale and massive heterogeneity of future networks, it becomes essential to engineer and devise more efficient and practical implementations from a computation performance perspective. Also, it becomes challenging in many scenarios to mathematically formulate RRAM optimization problems, or we may end up with non-well-defined or even intractable optimization problems. These cases are encountered for many reasons, including the uncertain nature of wireless channels, network traffic load, and users' mobility patterns. Hence, new innovative RRAM solutions must be developed to address such challenges. In this context, the data-driven AI-based RRAM techniques are feasible alternatives, and they have shown efficient adaptivity when applied to dynamic HetNets.
• Such approaches are heavily system-dependent and will not be accurate for rapidly varying environments. They need, however, reconfiguration to reflect the new system settings. Unfortunately, modern HetNets need to support highly dynamic systems characterized by massive rapidity, such as vehicular and railway networks. This renders conventional methods impractical for such scenarios.
• Most of these methods are computationally expensive and incur considerable timing overhead. This renders them inefficient for most emerging time-sensitive applications, e.g., autonomous vehicle/drone applications. Also, the computational complexity of these methods proportionally increases with the increase in network size, making them unscalable and unsuitable for modern large-scale networks. Furthermore, since most conventional algorithms are computationally expensive, they can be implemented only in sophisticated infrastructures with high computational capabilities, such as supercomputers and servers. Hence, tiny and self-powered user devices will not support them.
• RRAM optimization problems in HetNets are generally complex and non-convex [41]. Hence, leveraging conventional optimization algorithms to solve them will likely result in local optimal solutions rather than global ones. This case is regularly encountered in wireless optimization problems, which have too many local optima.
• Game theory-based techniques are unsuitable for networks characterized by massive heterogeneity in system architecture and user devices. In particular, NE solutions are obtained by assuming that all players are homogeneous, have statistically equal capabilities, and have complete network information. Unfortunately, this is not the case in modern HetNets, in which network entities are massively heterogeneous in terms of physical, communication, and computational capabilities.
• Finally, the complexity of game theory-based techniques and the amount of information exchanged between cooperating/competing players is proportional to the number of playing nodes. Unfortunately, future HetNets will be prohibitively large-scale in terms of the number of network APs and user devices [2], [6]. Hence, such techniques will fail. In particular, exchanging and updating the tremendous amount of data and signaling among the massive number of players will create extra and unmanageable overhead as well as a drastic increase in the delay, computation, and energy/memory consumption of network players.

D. ADVANTAGES OF USING DRL-BASED TECHNIQUES FOR RRAM
Emerging AI tools, such as ML, DL, and DRL methods, have recently been used to effectively address various problems and challenges in different areas of wireless communications and networking, including RRAM [1], [8], [13], [15], [24], [26], [27], [79], [80]. Next generation wireless networks will generate a tremendous amount of data related to network statistics, such as user traffic, channel occupancy, channel quality, etc. AI algorithms can leverage this data to develop automated and fine-grained schemes to optimize network radio resources. This paper is solely dedicated to providing a comprehensive survey on DRL applications for RRAM in modern wireless networks. However, the applications of ML and DL techniques in various wireless networking fields can be found in [1], [24], [26], [27], [81] and the references therein.

DRL is an advanced data-driven AI technique that combines neural networks (NNs) with traditional reinforcement learning (RL). It is mainly utilized to enhance the learning rate of RL algorithms and to address wireless communication and networking problems having high dimensionality [8], [9], [36], [37]. DRL techniques have gained considerable fame lately due to their superiority in making judicious control decisions in uncertain environments like wireless channels. They enable various network components, such as BSs, RAT APs, edge servers (ESs), gateway nodes, and user devices, to make autonomous and local decisions, such as RRAM, RAT selection, caching, and offloading, that achieve the objectives of various wireless networks, including sum-rate maximization and SE/EE maximization. Since traditional approaches will not be able to address the RRAM issue of future wireless networks, DRL methods have been proposed lately as alternative solutions. In particular, DRL techniques are appealing for next generation communication networks due to the following distinct features.

First, they enable network controllers to solve complex network optimization problems, including RRAM and other wireless control problems, with only limited information about the wireless networks. Second, DRL methods enable network entities (e.g., BSs, RAT APs, ESs, gateway nodes, and user devices) to act as agents (i.e., decision-makers) to learn and build knowledge about the wireless environment. This is achieved by learning optimal policies, such as radio resource allocation, RAT selection, and scheduling
decisions, based on continuous interaction between agents and the wireless environment, without knowing the accurate channel models or statistics of the underlying systems a priori. DRL algorithms employ the data collected during the continuous interaction with the environment as a training data-set to train their models. Once DRL agents have learned the optimal policies, they can be deployed in an online fashion to make intelligent and autonomous decisions based on local observations made on the wireless environment.

TABLE 4. List of the model-free DRL algorithms that are widely used in RRAM for modern wireless networks.

DRL techniques provide efficient solutions from both the network and user devices' points of view to overcome the problems of the conventional RRAM approaches. By employing DRL techniques, various network entities are enabled to learn wireless environments in order to optimize system configuration. Network entities will be able to optimally and autonomously allocate the optimal transmit power to mitigate signal interference and reduce energy consumption. For this purpose, advanced DRL techniques such as the deep deterministic policy gradient (DDPG) method and its variants can be utilized. On the other hand, DRL can also enable smart devices to autonomously access the radio channels. For this purpose, the deep Q-network (DQN) and its variants can be leveraged. The wireless channels are extremely stochastic due to, e.g., the rapid mobility of user devices and channel objects. Hence, accurate and real-time knowledge of channel state information (CSI) becomes quite difficult to obtain, and DRL techniques can be efficiently used to learn wireless channel statistics.

Finally, spectrum prediction and forecasting is also another promising field enabled by DRL techniques. Emerging DL models, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), can be integrated with DRL to add the "prediction" capability to the DRL algorithms. Also, conventional optimization techniques do not incorporate the context, and hence they cannot adapt and react according to the sudden variations and changes in the wireless environments. Therefore, such conventional approaches will result in unreliable and poor resource management and utilization. DRL techniques can, however, dynamically adapt and learn the context of wireless environments, which makes their RRAM solutions more accurate and reliable.

To sum up, DRL techniques are required in RRAM problems in four main scenarios: when there is insufficient knowledge about the statistics of the wireless networks, when accurate mathematical models do not exist, when inference information is required to be incorporated into the decision process, or when a mathematical model exists but applying conventional algorithms is not possible. In general, most of the RRAM problems in modern wireless networks fall under the above scenarios. The main reason is the large-scale and massively heterogeneous nature of networks in terms of the types and numbers of underlying infrastructures, user devices, and QoS demands of applications.

All the aforementioned unique features of DRL techniques make them one of the leading AI-based enabling technologies that can be leveraged to address the RRAM in future wireless communication networks [2], [3].

FIGURE 7. Taxonomy of all DRL algorithms [37]. Algorithms colored in blue are covered in Section III.

III. OVERVIEW OF DRL TECHNIQUES USED FOR RRAM
In this section, we briefly review the foundations of DRL, such as the Markov Decision Process (MDP), and show how RRAM problems can be modeled as MDPs. Fig. 7 shows a detailed taxonomy of existing DRL techniques/algorithms. Reviewing all these techniques is beyond the scope of this paper; we rather focus on the most widely used ones in the literature to address RRAM problems. Interested readers, however, can refer to [7], [15] for a thorough review of the remaining algorithms. Furthermore, we briefly review other emerging technologies used for RRAM problems, such as multi-agent DRL models. Hence, this section is deliberately designed to provide the reader with adequate knowledge of the basics, advantages, limitations, and use-cases of the most widely used DRL techniques employed in the RRAM field.


Table 4 lists the most widely used DRL techniques/algorithms in RRAM of modern wireless networks. Note that all of them are model-free learning algorithms, which means that the agent does not build a model of the wireless environment or reward; instead, it directly maps states to the corresponding actions.

Depending on the dimensionality of the RRAM problem, we can select the most appropriate DRL algorithm that fits the problem settings. For example, RRAM problems could have a discrete action space, such as channel access, user association, RAN assignment, etc., or could have a continuous action space, such as power allocation and continuous spectrum allocation.

A. THE MARKOV DECISION PROCESS (MDP)
Under the uncertain and stochastic environments of modern HetNets, the problem of RRAM, or any decision-making problem including control problems, is typically modeled by the so-called Markov Decision Process (MDP). It provides a mathematical framework for modeling decision-making problems whose outcome is random and controlled by a decision-maker, aka the agent. The MDP also has another variant, called the partially observable MDP (POMDP), which models decision-making problems in partially observable wireless environments.

The general practice in RRAM is to formulate the radio resource allocation (RRA) as an optimization problem whose objective is to maximize/minimize some network utility/cost function while constraining the available network radio resources and optional QoS demands of user devices. However, as we discussed in Section II, tremendous challenges are encountered during formulating such problems and/or even during solving them, which renders conventional approaches inapplicable. Hence, RL/DRL techniques are utilized instead.

In order to apply DRL to solve RRA problems, we need first to convert the formulated optimization problem into the MDP framework. The resultant MDP-based model must contain seven elements: the agent(s), environment, action space A, state space S, instantaneous reward function r, transition probability p, and policy π, as shown in Fig. 8. The MDP is represented mathematically by the tuple (S, A, p, r).

FIGURE 8. Framework of DRL models [15].

In RRAM problems, the dynamicity of the agent's learning process according to the MDP framework is shown in Fig. 8. At time t, the agent observes a state s_t from the state space S. The state space should contain useful and effective information about the wireless environment, such as the available radio resources, SNR, the number of user devices, and the required QoS. Then, the agent takes an action a_t from the action space A, such as the RRA and RAN assignment. The taken action must achieve a network utility goal, such as sum-rate/SE/EE maximization. Then the state moves to a new state s_{t+1} with a transition probability p, and the agent receives a feedback numerical instantaneous reward r_t, which quantifies the quality of the taken action. This interaction, i.e., (s_t, a_t, r_t, s_{t+1}), between the agent and the wireless environment repeatedly continues, and the agent utilizes the received reward to adjust its strategy until it learns the optimal policy π*. The agent's policy π defines the mapping from states to the corresponding actions, S → A, i.e., a_t = π(s_t). Typically, we define the long-term reward as the expected accumulated discounted instantaneous reward over the time horizon T, which is given by R = E[ Σ_{t=1}^{T} γ^t r_t(s_t, π(s_t)) ]. The parameter 0 ≤ γ ≤ 1 is the discount factor, which trades off between instantaneous and future rewards. The main goal of the agent in the MDP is to obtain π* (i.e., allocating optimal radio resources) that maximizes the long-term reward, i.e., π* = arg max_π R.
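As a minimal illustration of this agent-environment loop (and not code from any of the surveyed works), the following Python sketch accumulates the discounted return R under a given policy. The WirelessEnv class, its Rayleigh-distributed per-channel SNR states, and the spectral-efficiency reward are purely hypothetical placeholders for an RRAM task such as channel selection:

import numpy as np

class WirelessEnv:
    """Toy RRAM environment: the state is a vector of per-channel SNRs and an
    action is the index of the channel selected for transmission."""
    def __init__(self, num_channels=4, seed=0):
        self.num_channels = num_channels
        self.rng = np.random.default_rng(seed)
        self.state = None

    def reset(self):
        # initial state s_0
        self.state = self.rng.rayleigh(scale=1.0, size=self.num_channels)
        return self.state

    def step(self, action):
        # reward r_t: spectral efficiency (bits/s/Hz) on the channel chosen in state s_t
        reward = float(np.log2(1.0 + self.state[action]))
        # channels fade independently between decision epochs -> next state s_{t+1}
        self.state = self.rng.rayleigh(scale=1.0, size=self.num_channels)
        return self.state, reward

def discounted_return(env, policy, horizon=100, gamma=0.9):
    """Accumulates R = sum_t gamma^t * r_t(s_t, pi(s_t)) over one episode."""
    state, ret = env.reset(), 0.0
    for t in range(horizon):
        action = policy(state)                 # a_t = pi(s_t)
        next_state, reward = env.step(action)  # environment transition and reward r_t
        ret += (gamma ** t) * reward
        state = next_state
    return ret

# usage: evaluate a naive policy that always picks the instantaneously best channel
greedy_policy = lambda s: int(np.argmax(s))
print(discounted_return(WirelessEnv(), greedy_policy))

A DRL agent replaces the fixed policy above with one it improves from the observed rewards, which is exactly what the value-based and policy-based algorithms discussed next do.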


Next, we discuss the most widely used DRL algorithms to handle MDP problems, i.e., RRAM problems. As shown in Fig. 7, these algorithms belong to two main families of methods: the value-based and the policy-based methods.

B. VALUE-BASED ALGORITHMS
This family of methods is used to estimate the value function of the agent. This value function is then utilized to implicitly and greedily obtain the optimal policy. Two value functions exist: the state-value function V^π(s) and the state-action value function Q^π(s, a). Both represent the expected accumulated discounted reward received when taking an action a_t (in state s_t for V^π(s), or at the pair (s_t, a_t) for Q^π(s, a)) and then following the policy π thereafter. These functions are important as they represent the link between the MDP mathematical formulation and the DRL formulation, and they are given by [7]:

V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_t(s_t, a_t, s_{t+1}) \,\Big|\, a_t \sim \pi(\cdot\,|\,s_t),\; s_0 = s \right],

Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_t(s_t, a_t, s_{t+1}) \,\Big|\, a_t \sim \pi(\cdot\,|\,s_t),\; s_0 = s,\; a_0 = a \right].

The optimal value function V*(s) and state-action function Q*(s, a) are obtained by solving the following Bellman equations [7], [15]:

V^{*}(s) = \max_{a_t} \left[ r_t(s_t, a_t) + \gamma\, \mathbb{E}_{\pi}\!\left[ V^{*}(s_{t+1}) \right] \right],

Q^{*}(s, a) = r_t(s_t, a_t) + \gamma\, \mathbb{E}_{\pi}\!\left[ \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) \right].

Recall that the main goal of the MDP is to obtain the optimal policy π* (i.e., mapping states to optimum actions), which is given by π* = arg max_π R = arg max_π E[ Σ_{t=1}^{T} γ^t r_t(s_t, π(s_t)) ]. Hence, the optimal actions are the ones that maximize the above value functions, and the optimal policy will be the one that maximizes these value functions [7]. In particular, the Q-function Q^π(s, a) is commonly used, and the problem of obtaining the optimal policy becomes π*(s) = arg max_a Q^π(s, a). The ultimate goal of all the value-based DRL algorithms is to approximate this function, as discussed next.

1) Q-LEARNING TECHNIQUE
In RL, Q-learning is one of the most widely used algorithms to address MDPs. It obtains the optimal values of the Q-function iteratively using the following Bellman equation:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \left[ r_t(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],

where α_t is the learning rate that defines how much the new information contributes to the existing Q-value. The main idea of this Bellman rule relies on finding the temporal difference (TD) between the target Q-value, r_t(s_t, a_t) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}), and the current Q-value, Q(s_t, a_t). The Q-learning algorithm uses this rule to construct a table of all possible Q values for each state-action pair. The algorithm terminates when we reach a certain number of iterations or when all Q-values have converged. In such a case, the optimal policy will determine the optimal action to take at each state such that Q^π(s_t, a_t) is maximized for all states in the state space, i.e., π* = arg max_a Q^π(s_t, a_t).

However, the Q-learning algorithm has many limitations when applied for RRAM in modern HetNets. First, it is applicable only to problems with low dimensionality of both state and action spaces, making it unscalable. Second, it is applicable only to RRAM problems with discrete state and action spaces, such as channel access and RAN assignment. If, however, it is applied to problems with continuous action spaces, e.g., power allocation, the action space must be digitized. This renders it inaccurate due to quantization error.
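For concreteness, the following is a minimal tabular Q-learning sketch in Python that implements the update rule above (an illustrative implementation, not code from the surveyed works). It assumes a generic discrete RRAM task whose dynamics are supplied through user-provided reset() and step() callables, e.g., a small channel-access problem with a handful of discrete states:

import numpy as np

def q_learning(num_states, num_actions, reset, step,
               episodes=500, horizon=50, alpha=0.1, gamma=0.9, epsilon=0.1):
    """reset() -> s (int); step(s, a) -> (s_next, r). Returns the learned Q-table."""
    rng = np.random.default_rng(0)
    Q = np.zeros((num_states, num_actions))  # Q(s, a) table
    for _ in range(episodes):
        s = reset()
        for _ in range(horizon):
            # epsilon-greedy action selection over the current Q estimates
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r = step(s, a)
            # temporal-difference (Bellman) update of the Q-table
            td_target = r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q  # greedy policy: pi_star(s) = argmax_a Q[s, a]

The Q-table makes the scalability limitation explicit: its size grows with the product of the numbers of states and actions, which motivates the function-approximation methods discussed next.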
1) Q-LEARNING TECHNIQUE
In RL, Q-learning is one of the most widely used algorithms to address MDPs. It obtains the optimal values of the Q-function iteratively using the following Bellman equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α_t [ r_t(s_t, a_t) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ],

where α_t is the learning rate that defines how much the new information contributes to the existing Q-value. The main idea of this Bellman rule relies on finding the Temporal Difference (TD) between the current Q-value Q(s_t, a_t) and the predicted (target) Q-value r_t(s_t, a_t) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}). The Q-learning algorithm uses this rule to construct a table of all possible Q values for each state-action pair. The algorithm terminates when we reach a certain number of iterations or when all Q-values have converged. In such a case, the optimal policy will determine the optimal action to take at each state such that Q^π(s_t, a_t) is maximized for all states in the state space, i.e., π* = argmax_a Q^π(s_t, a_t).
However, the Q-learning algorithm has many limitations when applied for RRAM in modern HetNets. First, it is applicable only to problems with low dimensionality of both state and action spaces, making it unscalable. Second, it is applicable only to RRAM problems with discrete state and action spaces, such as channel access and RAN assignment. If it is applied to problems with continuous action spaces, e.g., power allocation, the action space must be discretized. This renders it inaccurate due to quantization error.
This renders them inaccurate due to quantization error. Q(st , at ) = rt (st , at ) + γ maxQ(st+1 , at ),
at+1



and the DQN algorithm is then optimized by iteratively updating θ to minimize the following Bellman loss function:

L(θ_t) = E_{(s_t, a_t, r_t, s_{t+1}) ∈ D} [ ( r_t(s_t, a_t) + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}|θ′) − Q(s_t, a_t|θ) )^2 ].

The DQN algorithm is applicable to a wide variety of RRAM problems, specifically problems characterized by a discrete action space. As we will elaborate in depth in Section IV, the DQN technique can be used efficiently for channel allocation, access control, spectrum access, user association, and RAN assignment. The DQN algorithm can also be used for RRAM problems with a continuous action space, such as power control, by discretizing the action space. However, such a methodology makes DQN vulnerable to serious quantization error that may considerably deteriorate its accuracy. There are also other limitations in the basic DQN, and various DRL algorithms have been proposed to overcome them, as we discuss in the following sections.
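The snippet below is one possible PyTorch rendering of the DQN machinery just described: a replay memory D, a main Q network with weights θ, a target network with weights θ′, and a minibatch update that minimizes the Bellman loss. The state/action dimensions and network sizes are assumed for illustration only; in an RRAM setting the state vector would typically carry local CSI or interference measurements and the discrete actions would index power levels or channels.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative dimensions (assumed): state = vector of local measurements,
# actions = discrete power levels or channels.
STATE_DIM, N_ACTIONS, GAMMA = 8, 5, 0.9

def build_q_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))

q_net = build_q_net()                      # main Q network, weights theta
target_net = build_q_net()                 # target Q network, weights theta'
target_net.load_state_dict(q_net.state_dict())

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)              # replay memory D of (s, a, r, s') tuples
# usage while interacting with the environment: replay.append((s, a, r, s_next))

def train_step(batch_size=32):
    """One DQN update: sample a minibatch from D and minimize the Bellman loss."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s_next = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    a = a.long()

    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)          # Q(s,a|theta)
    with torch.no_grad():                                        # target uses theta'
        target = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Periodically copy theta into theta' to stabilize training."""
    target_net.load_state_dict(q_net.state_dict())
```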
family of methods. They provide an alternative way to solve
3) DOUBLE DQN ALGORITHM MDP problems having high dimensionality and continuous
action spaces. Recall that the main idea of the value-based
The Double DQN technique has been proposed in [82] to methods discussed before is to find the state-action value
enhance the basic DQN algorithm. The DQN algorithm tends function Q(s, a). This function is defined as the expected
to overestimate the Q values, which can degrade the training total discounted reward received by taking a particular action
process and lead to suboptimal policies. The overestimation from the state. If these Q values are known, the optimal
results from the fact that the same training transitions are policy is obtained by selecting actions that maximize the Q
utilized in selecting and evaluating an action in the Bellman values in each state. However, in environments with continu-
equation. As a solution, the authors in [82] propose to use ous action spaces, such as power control in wireless systems,
two Q value functions, one for selecting the best action and the Q function cannot be obtained as it is impossible to con-
the other to evaluate the best action. The action selection is duct a full search in a continuous action space to obtain the
still based on the online weights θ , while the second weights optimal action. Hence, value-based methods are inaccurate
parameters θ  are used to evaluate the value of this policy. for such problems, and the policy-based methods are applied
So, as in the conventional Q learning, the value of the policy instead.
is still estimated based on the current Q values. The weights In policy-based approaches [7], [84], we avoid calculat-
θ  are updated via switching between θ and θ  . ing Q values and directly obtain the optimal policy πθ (a|s)
The target Q values are derived from the following that maximizes the agent’s expected accumulated reward

modified Bellman equation [82]: J, i.e., J(θ ) = Eπθ [ ∞ t=0 γ t r (s , a )]. The policy gradi-
t t t
ent approaches learn the optimal weights θ ∗ via performing
Q(st , at ) = rt (st , at ) + γ Q st+1 , argmaxQ(st+1 , at |θt ), θt , gradient ascent on the function J. In particular, the pol-
at+1 icy gradients are derived from trajectories obtained via the
current policy, such that in each gradient update the agent
4) DUELING DQN ALGORITHM
This algorithm is another enhancement to the basic DQN algorithm [83]. Recall that the goal of the network is to estimate the Q values, i.e., Q(s_t, a_t). This function can be divided into two terms: the state-value function V(s), which tells the importance of being in a particular state, and the action-value function (or advantage function) A(s, a), which tells the importance of selecting a particular action among all available actions. Hence, the Q value function can be written as Q(s, a) = V(s) + A(s, a). The authors in [83] utilized this concept and suggested having two independent paths of fully-connected layers instead of a single path as in the basic DQN. One path estimates V(s), and the other estimates A(s, a). The two paths are eventually combined to produce a single output, which is Q(s, a). Here, the loss function is obtained similarly to the DQN and Double DQN algorithms.
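A minimal sketch of the two-stream (dueling) head described above is given below. The layer sizes are assumptions, and the mean-advantage subtraction is the common identifiability trick used in practice rather than something stated in the text above.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Two-stream Q network: one head estimates V(s), the other A(s, a).
    Dimensions are illustrative assumptions matching the earlier DQN sketch."""
    def __init__(self, state_dim=8, n_actions=5):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value_head = nn.Linear(64, 1)          # V(s)
        self.adv_head = nn.Linear(64, n_actions)    # A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, adv = self.value_head(h), self.adv_head(h)
        # Q(s,a) = V(s) + A(s,a); subtracting the mean advantage is the usual
        # identifiability trick used in implementations.
        return v + adv - adv.mean(dim=1, keepdim=True)
```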
C. POLICY-BASED ALGORITHM
The policy-based techniques are part of the policy gradient family of methods. They provide an alternative way to solve MDP problems having high dimensionality and continuous action spaces. Recall that the main idea of the value-based methods discussed before is to find the state-action value function Q(s, a). This function is defined as the expected total discounted reward received by taking a particular action from the state. If these Q values are known, the optimal policy is obtained by selecting actions that maximize the Q values in each state. However, in environments with continuous action spaces, such as power control in wireless systems, the Q function cannot be obtained, as it is impossible to conduct a full search in a continuous action space to find the optimal action. Hence, value-based methods are inaccurate for such problems, and the policy-based methods are applied instead.
In policy-based approaches [7], [84], we avoid calculating Q values and directly obtain the optimal policy π_θ(a|s) that maximizes the agent's expected accumulated reward J, i.e., J(θ) = E_{π_θ}[ Σ_{t=0}^{∞} γ^t r_t(s_t, a_t) ]. The policy-gradient approaches learn the optimal weights θ* via performing gradient ascent on the function J. In particular, the policy gradients are derived from trajectories obtained via the current policy, such that in each gradient update the agent interacts with the environment to collect new and fresh trajectories; this is why policy-gradient methods are called on-policy algorithms.

1) REINFORCE ALGORITHM
The main idea of this algorithm is to increase the probabilities of good actions and reduce the probabilities of bad ones. The REINFORCE algorithm differs from the Q-learning methods in three aspects. First, the REINFORCE algorithm does not need a replay buffer D during training, as it belongs to the on-policy family, which requires only fresh training transitions. Although this enhances its convergence speed,


it needs more interactions with the environment. Second, the REINFORCE algorithm implicitly performs the exploration process, as it depends on the probabilities returned by the network, which incorporate uniformly random agent behavior. Third, no target network is required in the REINFORCE method, as the Q values are obtained from the experiences in the environment.
The disadvantage of the REINFORCE algorithm is that it suffers from high variance, meaning that any small shift in the return leads to a different policy. This limitation motivated the actor-critic algorithms.
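The following sketch illustrates a single REINFORCE update from one freshly collected trajectory, matching the three properties above (no replay buffer, exploration through the returned action probabilities, no target network). All dimensions and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative policy network over N_ACTIONS discrete actions (assumed sizes).
STATE_DIM, N_ACTIONS, GAMMA = 8, 5, 0.99
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One REINFORCE update from a single on-policy trajectory:
    increase the log-probabilities of actions in proportion to the
    discounted return that followed them."""
    returns, g = [], 0.0
    for r in reversed(rewards):                 # discounted returns G_t
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)

    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.long)
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs.gather(1, actions.view(-1, 1)).squeeze(1)

    loss = -(chosen * returns).mean()           # gradient ascent on J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```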
2) ACTOR-CRITIC ALGORITHM
The actor-critic methods are mainly developed to enhance the convergence speed and stability (i.e., reduce the variance) of the policy-gradient method. Like the policy-based methods, it utilizes the accumulated discounted reward J to obtain the gradient of the policy ∇J, which provides the direction that enhances the policy. This algorithm learns a critic to reduce the variance of the gradient estimates, since it utilizes various samples, whereas the REINFORCE algorithm utilizes only a single sample trajectory.
To select the best action in any state, the total discounted reward of the action is used, i.e., Q(s, a). The total reward can be decomposed into a state-value function V(s) and an advantage function A(s, a), i.e., Q(s, a) = V(s) + A(s, a). So, another DNN is utilized to estimate V(s), which is trained based on the Bellman equation. The estimated V(s) is then leveraged to obtain the policy gradient and update the policy network such that the probabilities of actions with good advantage values are increased. Hence, the actor is the policy network π(a|s) that takes actions by returning the probability distribution over actions, while the critic network evaluates the quality of the taken actions, V(s). This algorithm is also called the advantage actor-critic (A2C) method.
In the A2C algorithm, the weights of the actor network θ_π and the critic network θ_v are updated using the accumulated policy gradients ∂θ_π and value gradients ∂θ_v, respectively, to move in the direction of the policy gradients and the opposite direction of the value gradients.
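As a rough illustration of the actor-critic split described above, the sketch below performs a one-step A2C update: the critic estimates V(s), the resulting advantage drives the actor's policy-gradient step, and the critic is regressed toward its Bellman target. Network sizes and learning rates are assumed for illustration.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 8, 5, 0.99       # illustrative sizes

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))      # pi(a|s), weights theta_pi
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))             # V(s), weights theta_v
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def a2c_update(s, a, r, s_next, done):
    """One-step advantage actor-critic update on a single transition."""
    s = torch.tensor(s, dtype=torch.float32).unsqueeze(0)
    s_next = torch.tensor(s_next, dtype=torch.float32).unsqueeze(0)

    v_s = critic(s).squeeze()
    with torch.no_grad():
        v_next = 0.0 if done else critic(s_next).squeeze()
        advantage = r + GAMMA * v_next - v_s.detach()   # A(s,a) estimate

    log_prob = torch.log_softmax(actor(s), dim=1)[0, a]
    actor_loss = -log_prob * advantage                  # ascend the policy gradient
    critic_loss = (r + GAMMA * v_next - v_s) ** 2       # descend the value gradient

    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()
```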
3) A3C ALGORITHM
The asynchronous advantage actor-critic (A3C) algorithm is an extension of the basic A2C [85]. This algorithm is used to solve the high-variance issue in gradients that results in non-optimal policies. The A3C algorithm conducts a parallel implementation of the actor-critic algorithm, where the actor and critic share the network layers. A global NN is trained to output action probabilities and an estimate of the advantage function A(s_t, a_t|θ_π, θ_v), given by Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}|θ_v) − V(s_t|θ_v), where k depends on the state and is upper-bounded by the maximum number of time steps.
Several parallel actor-learners are instantiated with copies of both the environment and the global NN weights. Each learner independently interacts with its environment and gathers training transitions to derive the gradients with respect to its NN weights. Learners then propagate their gradients to the global NN to update its weights. This mechanism ensures a periodic update of the global model with diverse transitions from each learner.
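The core A3C pattern, local learners that push their gradients into a shared global network and then pull back its weights, can be sketched as below. For readability this is a simplified, synchronous illustration of the gradient exchange; real A3C runs the workers asynchronously in parallel threads or processes, and all dimensions here are assumed.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 8, 5                       # illustrative sizes

def make_net():
    # shared trunk with a policy head and a value head, as described above
    trunk = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
    return nn.ModuleDict({"trunk": trunk,
                          "policy": nn.Linear(64, N_ACTIONS),
                          "value": nn.Linear(64, 1)})

global_net = make_net()
global_opt = torch.optim.Adam(global_net.parameters(), lr=1e-3)

def worker_update(local_net, loss):
    """One learner step: compute gradients on the local copy, push them into
    the global network, then pull the fresh global weights back."""
    local_net.zero_grad()
    loss.backward()
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp.grad = lp.grad.clone() if lp.grad is not None else None
    global_opt.step()
    local_net.load_state_dict(global_net.state_dict())   # synchronize the learner
```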
4) DEEP DETERMINISTIC POLICY GRADIENT (DDPG) ALGORITHM
DDPG is one of the most widely used DRL techniques for addressing RRAM problems in wireless networks characterized by high dimensionality and continuous action spaces [86]. The DDPG algorithm belongs to the actor-critic family, and it combines both Q-learning and policy-gradient algorithms. It consists of actor and critic networks. The actor network takes the state as its input and outputs the exact "deterministic" action, not a probability distribution over actions as in the actor-critic algorithm. The critic, in contrast, is a Q-value network that takes both the state and the action as inputs and outputs the Q-value as a single output.
The deterministic policy gradient (DPG) algorithm is proposed in [87] to overcome the limitation caused by the max operator in the Q-learning algorithm, i.e., max_{a_{t+1}} Q(s_{t+1}, a_{t+1}). It simultaneously learns both the Q-function and the policy. In particular, the DPG algorithm has a parameterized actor function μ(s|θ^μ) with weights θ^μ, which learns the deterministic policy that gives the optimal action corresponding to max_{a_{t+1}} Q(s_{t+1}, a_{t+1}). The critic Q(s, a) is learned via minimizing the Bellman loss function as in the Q-learning algorithm. The actor policy is updated using gradient ascent with respect to θ^μ in order to optimize the objective given by the following chain rule [87]:

J(θ^μ) = E_{s∈D}[ Q(s, μ(s|θ^μ)) ],
∇_{θ^μ} J = E_{s∈D}[ ∇_a Q(s, a|θ^Q)|_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_t} ].

The DDPG algorithm proposed in [86] is built based on the DPG algorithm, where both the policy and the critic are DNNs, as shown in Fig. 10. The DDPG algorithm creates a copy of both the actor and critic networks, Q′(s, a|θ^{Q′}) and μ′(s|θ^{μ′}), respectively, to compute the target values. The weights of these target networks, θ^{Q′} and θ^{μ′}, are then updated to slowly track the weights of the learned networks to provide more stable training, using θ′ ← τθ + (1 − τ)θ′ with τ ≪ 1. The critic network is updated to minimize the following Bellman loss:

L(θ^Q) = E_{(s_t, a_t, r_t, s_{t+1}) ∈ D}[ ( r_t(s_t, a_t) + γ Q′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′}) − Q(s_t, a_t|θ^Q) )^2 ].

Note that the DDPG algorithm is off-policy, which means that we use a replay buffer D to store training transitions. The exploration-exploitation issue is addressed by adding the Ornstein-Uhlenbeck (OU) process or some Gaussian noise N to the action selected by the policy, i.e., μ(s_t|θ_t^μ) + εN [86].



FIGURE 10. Illustration of the DDPG actor-critic architecture [88].

D. OTHER DRL ALGORITHMS
The DRL algorithms discussed above are the commonly used approaches to address the problem of RRAM in wireless networks, as we will discuss in the next section. Although there are several other algorithms, they are rarely utilized for such types of problems. Therefore, they are not included in this article. However, generally speaking, all the other variants are mainly developed to enhance the performance of the basic algorithms discussed above. For completeness, this section highlights some of these variants for the interested reader.
Other variants of the value-based algorithms are developed to enhance the performance of the vanilla DQN algorithm in terms of stability, convergence speed, implementation complexity, sample/learning efficiency, etc. Such variants include prioritized experience replay DQN [89], distributed prioritized experience replay DQN [90], distributional DQN [91], Rainbow DQN [92], and recurrent DQN [93].
For the policy-based algorithms, several variants are envisioned to mitigate the overestimation issue, such as the Twin Delayed DDPG (TD3) [94], to enhance stability and robustness, such as the Soft Actor-Critic (SAC) [95], and to enhance stability, convergence, and sample efficiency, such as the distributed distributional DDPG (D4PG) [96].
E. MULTI-AGENT DRL ALGORITHMS
Multi-agent DRL (MADRL) is a natural generalization of single-agent DRL that allows multiple agents to concurrently learn optimal RRAM policies based on their interactions with the environment and with each other. These agents can either be deployed cooperatively, in which case all agents interact with each other to learn the same global policy, or non-cooperatively, in which case each agent learns its own policy. MADRL provides several performance advantages over the single-agent case regarding the quality of the learned policies, convergence speed, etc. However, it encounters several challenges such as scalability, partial observability, and agents' non-stationarity. Nguyen et al. [97] provide a survey on MADRL systems and their applications, reviewing different methods along with their advantages and disadvantages. In [98], the authors provide a selective overview of the theories and algorithms for MARL.
MADRL is widely employed in addressing various RRAM problems in modern wireless networks. The authors in [14] provide an overview of MADRL algorithms and highlight their applications in future wireless networks; the learning frameworks in MADRL are also investigated. The application of MARL in solving problems for vehicular networks is studied in [99]. In [100], an overview of the evolution of cooperative MARL algorithms is presented with an emphasis on distributed optimization.
Most of the RRAM problems in modern HetNets are of a multi-agent nature [14]. Network entities such as user devices, BSs, and APs can act as cooperative or non-cooperative agents to learn optimal RRA policies and solve complex network optimization problems. For example, channel access control may be formulated as a MADRL problem in which each user device represents a learning agent that senses the radio channels and coordinates with other agents to avoid collisions, as illustrated in the sketch below. Next, we discuss how RRAM problems in HetNets are formulated and solved using these algorithms.
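To make the multi-agent channel-access example concrete, the sketch below instantiates independent (non-cooperative) DQN learners, one per device, each trained on its own local observations. The toy dimensions, the reward convention in the closing comment, and the environment itself are assumptions for illustration and are not taken from any of the surveyed works.

```python
import random
from collections import deque

import torch
import torch.nn as nn

N_AGENTS, N_CHANNELS, STATE_DIM, GAMMA = 3, 2, 4, 0.9   # illustrative sizes

class IndependentAgent:
    """Each device trains its own DQN on local observations (non-cooperative MADRL)."""
    def __init__(self):
        self.q = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                               nn.Linear(32, N_CHANNELS))
        self.opt = torch.optim.Adam(self.q.parameters(), lr=1e-3)
        self.memory = deque(maxlen=5_000)                 # local replay of (o, a, r, o')

    def act(self, obs, eps=0.1):
        if random.random() < eps:
            return random.randrange(N_CHANNELS)
        with torch.no_grad():
            return int(self.q(torch.tensor(obs, dtype=torch.float32)).argmax())

    def learn(self, batch_size=32):
        if len(self.memory) < batch_size:
            return
        o, a, r, o2 = (torch.tensor(x, dtype=torch.float32)
                       for x in zip(*random.sample(self.memory, batch_size)))
        q_sa = self.q(o).gather(1, a.long().view(-1, 1)).squeeze(1)
        with torch.no_grad():
            target = r + GAMMA * self.q(o2).max(dim=1).values
        loss = nn.functional.mse_loss(q_sa, target)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

agents = [IndependentAgent() for _ in range(N_AGENTS)]
# In each slot every agent picks a channel; a collision (two agents on the same
# channel) would yield a negative reward, while a clear channel yields a positive one.
```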
IV. DRL-BASED RESOURCE ALLOCATION AND MANAGEMENT FOR FUTURE HETEROGENEOUS NETWORKS
This section provides an extensive and in-depth review of the related works on RRAM using DRL techniques. We classify them based on the radio resources (or issues) they investigate as well as on the wireless network types they cover, as shown in Figs. 2 and 3, respectively. It must be noted that this survey is dedicated only to the application of DRL algorithms for radio resources; computation resources are not covered, and the interested reader is referred to [15].
DRL algorithms enable various network entities to efficiently learn the wireless environment, which allows them to make optimal control decisions that achieve some network utility function. For example, DRL methods can be deployed to maximize network sum-rate, minimize network energy consumption, or enhance spectral efficiency. In this section, we review the applications of DRL methods for the following RRAM issues: power allocation, spectrum allocation and access control, rate control, and the joint use of these radio resources.

A. DRL FOR POWER ALLOCATION
Energy-efficient communication is one of the main objectives of modern wireless networks. It is achieved via efficient


power allocation to ensure high QoS, better coverage, and enhanced data rate, as shown in Fig. 11. Power allocation is mainly involved in vital network operations such as modulation and coding schemes, path loss compensation, interference management, etc. On the other hand, almost all modern user devices and IoT sensors are battery-powered with very limited battery capacity and charging capabilities. Hence, designing energy-efficient resource allocation schemes, protocols, and algorithms becomes fundamental in dynamic wireless network environments.

FIGURE 11. Importance of power allocation in modern wireless communication networks.

Several conventional approaches have been applied for power allocation and management. Most of them rely on solving power-constrained optimization problems, such as the FP algorithm [72] and the WMMSE algorithm [101]. These approaches are iterative and model-driven, which means that they need a mathematically tractable and accurate model. They are typically executed in a centralized fashion in which a network controller has full CSI. In such a mechanism, BSs, wireless APs, and/or user devices have to wait until the centralized controller's iterations converge and the outcome is sent back over backhaul links. However, as discussed in Section II, such approaches become impractical due to the large-scale nature of modern wireless networks and the difficulty in obtaining accurate and instantaneous CSI. Hence, DRL techniques are used instead due to their superiority in obtaining optimal power allocation policies based on limited CSI.

1) IN CELLULAR NETWORKS
In the following paragraphs, we review works that employ DRL algorithms to address the power allocation problem in cellular, cellular IoT, and wireless homogeneous networks (HomNets) depicted in Fig. 3.
Power allocation in small-cell multi-user cellular systems is fundamental to increase system performance while reducing inter-cell interference. In [102], the authors propose single- and multi-agent actor-critic DRL methods to tackle the problem of downlink sum-rate maximization through power allocation in multi-cell, multi-user cellular networks. In their model, the agents are the base stations (BSs), whose state space is continuous and comprises the network CSI and the transmit power allocation of previous BSs. The action space is continuous, representing the power allocation, while the reward function is the cellular network sum SE. Experimental results demonstrate that their proposed DRL-based method can achieve higher SE than conventional optimization algorithms, such as fractional programming (FP) and weighted minimum mean-squared error (WMMSE), while performing two times faster than these conventional methods.
In the same context, the authors in [103] address the power allocation issue by building on their initial investigation in [104]. A multi-agent DQN-based DRL algorithm is proposed in which each BS-user link is considered as an agent. The state space is continuous, comprised of a logarithmically normalized interference measure, the link's corresponding downlink rate, and the transmit power. The action space is discrete, corresponding to the downlink power allocation, while the reward is continuous and is a function of the downlink data rate of the communication link. Experimental results indicate that their proposed DQN outperforms benchmark algorithms such as FP, WMMSE, random power allocation, and maximum power allocation in terms of achievable averaged sum-rate and convergence time when considering different user densities.
A pioneering work is presented in [105], in which the authors design a multi-agent DQN- and DDPG-based DRL framework to address the problem of power allocation in HetNets. A centralized-training-distributed-execution algorithm is designed in which the APs are the agents, each of which implements a local DNN. The state space of each local DNN is continuous, representing the local state information, while the local action space is continuous, representing the transmit power. Then, a multiple-actor-shared-critic (MASC) method is proposed to separately train each of these local DNNs in an online fashion. The main idea is that the MASC training method is composed of multiple actor DNNs and a shared critic DNN. An actor DNN is first established in the core network for each local DNN, and the structure of each actor DNN is the same as the corresponding local DNN. Then, a shared critic DNN is established in the core network for these actor DNNs. Historical global information is provided to the critic DNN, and the output of the critic DNN evaluates whether the output power of each actor DNN is optimal or not from a global view. The reward function is continuous, representing the data rate between each AP and its associated user. Simulation results show that their proposed algorithm outperforms the WMMSE and FP algorithms in terms of both convergence rate and computational complexity.



Similar to the work in [105], the authors in [106] address the buffer size of nearby relay users. The action space is to
the problem of sum-rate maximization via continuous power specify the required user pairing between the near relay user
allocation in wireless mobile networks based on a distribu- group and edge user group, along with the pre-processing of
tive multi-agent DDPG algorithm. Unlike authors’ previous EE power allocation. The reward is a function of the EE of
work in [107], which was based on the DQN technique, the the mmWave network. Experimental results are compared
authors extended their work to leverage the unique advan- with a conventional centralized iteration algorithm, which
tages of the DDPG algorithm when addressing problems demonstrate both the superiority of their proposed algorithm
with continuous state space nature. Particularly, in [105], in terms of the convergence speed and the efficiency to
the agents are each transmitter (e.g., mobile devices, links, provide near-optimal results.
etc.) whose state is a combination of three feature groups; DRL methods have also been investigated for beam-
the local information, interfering neighbors, and interfered forming design in cellular networks. The authors in [111]
neighbors feature groups. Each agent’s action is to choose propose a single-agent DDPG-based model to address the
the transmit power level, while the reward is a function of problem of SE maximization via hybrid beamforming design
the sum-rate maximization problem. Simulation results show in mmWave MIMO cellular systems. The action space is
that their proposed method gives better performance results continuous, comprised of the digital beamformer and analog
than the conventional FP methods and comparable results combiner. The state space is also continuous representing
with the WMMSE methods. the digital beamformer and analog combiner at the previous
D2D underlying cellular communication has emerged as time step. The reward is a continuous function defined in
one of the main enabling technologies for modern wireless terms of network SE. Simulation results show the efficiency
networks. Establishing communication links in such highly of their proposed model in terms of SE, bit error rate, and
dynamic environments is an essential issue. In this con- computation time.
text, the authors in [108] present a centralized multi-agent
DQN-based DRL algorithm to address the problem of power
2) IN IOT AND OTHER EMERGING WIRELESS
allocation of D2D cellular communication in a time-varying
NETWORKS
environment. The agents are the D2D transmitters, whose
state space is continuous, comprised of the SINR and chan- In the following paragraphs, we review works that utilize
nel gain of users. The action space is discrete, representing DRL algorithms to address the power allocation issue in
the transmit power of each D2D user, while the reward is IoT and other emerging wireless networks shown in Fig. 2.
a function of system throughput. Simulation results show Developing efficient spectrum sharing schemes is regarded
that their proposed algorithm outperforms the traditional RL as one of the main persistent objectives and challenges
methods in terms of network capacity and user’s achieved in CRNs. In [112], the authors propose a non-cooperative
QoS. single-agent DQN-based DRL scheme to address the
5G UDNs are characterized by their high vulnerability to problem of spectrum sharing via power control in CRNs.
inter-cell interference, which can be greatly reduced via judi- In their model, the agent is the SU, whose action space is
cious power management. Towards this, Saeidian et al. [109] discrete, corresponding to selecting the transmit power from
propose a data-driven approach based on a multi-agent DQN a pre-defined power set. The state space is discrete, defined
algorithm to tackle the downlink power control in dense 5G by four parts; the transmit power of PU and SU, the path
cellular networks. The agents are the BSs, whose state space loss between PU and a sensor that measures the RSS, the
is continuous, comprised of path-gain, SINR, downlink rate, path loss between the SU and a sensor that measures the
and downlink power. The action space is discrete, represent- RSS, and some Gaussian random variable. The reward is a
ing the downlink power, while the reward is a function of discrete function defined by the achieved SINR level and the
the network-wide harmonic-mean of throughput. Simulation minimum SINR requirements of both PU and SU. Simulation
results indicate that their approach can improve data rates results show that their proposed algorithm is robust against
at the cell edge while ensuring a reduced transmitted power random variation in state observations, and the SU interacts
compared to the baseline fixed power allocation approaches. with PU efficiently until they reach a state in which both
Non-orthogonal multiple access (NOMA) technology has users successfully transmit their own data.
recently emerged as an efficient tool to enhance the QoS and In another interesting work in [25], the authors present
EE of millimeter-wave (mmWave) communication systems a non-cooperative multi-agent algorithm to address the
by enhancing the power level of received signals. The problem of power allocation in D2D underlying communica-
authors in [110] propose a multi-agent DQN-based DRL tion networks based on three DQNs, namely, DQN, Double
framework to optimize the EE in downlink full-duplex coop- DQN, and Dueling DQN. The agents are the D2D trans-
erative NOMA of mmWave UDNs. The agents are the relay mitters in each D2D pair, whose state space is discrete,
near users, whose state space is continuous, consisting of comprised of the level of the interference indicator func-
information related to wireless environment and channel, tion. The action space is discrete, representing the set of
the user’s battery capacity, energy power transfer coeffi- transmitting power levels, while the reward is a function
cient, self-interference cancellation residue coefficient, and of the system EE. Simulation results show the ability of


their DQN-based models to provide energy-efficient power Also, the state space is continuous, defined by; each MR
allocation for the underlying D2D network. own signal channel, local observation information of each
UAV IoT networks are attracting considerable attention MR, i.e., beamforming design, each MR achievable rate,
recently due to their ability to provide enhanced QoS com- and each MR transmit power in the last time step. The
munication in harsh and vital environments. However, power reward function is continuous, defined by the achievable
management is one of the key challenges in such networks. sum-rate of the network. Simulation results demonstrate that
In this context, the authors in [113] address the problem of the SE of their proposed algorithm is comparable to the
downlink power control in ultra-dense UAV networks with full digital beamforming scheme, and it outperforms con-
the aim of maximizing the network’s EE. A multi-agent ventional approaches such as maximum power allocation,
DQN-based DRL model is proposed in which the agents random power allocation, DQN, and FP.
are the UAVs in the network. The state space is continu- Federated deep reinforcement learning (FDRL) is an
ous, representing the remaining energy of the UAV and the emerging AI paradigm that integrates FD and DRL methods.
interference caused by neighboring UAVs. The action space FDRL can be utilized as an efficient technique to enhance
is discrete, representing the set of possible discrete transmit the RRAM solutions in large-scale distributed systems. As
power values, while the reward function is the EE of the UAV an example, an interesting approach is proposed in [115],
network. Simulation results are compared with Q-learning in which the authors propose a cooperative multi-agent
and random algorithms, which show the superiority of their actor-critic-based FDRL framework for distributed wire-
proposed scheme in terms of both the convergence speed less networks. The authors particularly address the problem
and EE. of energy/spectrum efficiency maximization via distributed
In the same context for multi-UAV wireless networks, the power allocation for network edge users. In their proposed
authors in [114] propose a multi-agent DDPG-based DRL model, the agents are the edge users, whose action space is
to address the problem of joint trajectory design and power continuous, defined as the power allocation strategies. The
allocation. In their scheme, the agents are the UAVs, whose state space is continuous, defined by the allocated trans-
state space is a discrete binary indicator function representing mit power, the SINR on the assigned RBs, and the reward
whether the QoS of the user ends (UEs) are satisfied or not. of the previous training time step. The system is defined
The action space is also discrete, corresponding to selecting in terms of a local continuous cost function expressed in
UAVs’ trajectory and transmission power. The reward is a terms of SINR, power, path loss, and environmental noise.
continuous function defined by the joint trajectory design Using simulation results, the authors demonstrate that their
and power allocation as well as the number of UEs covered proposed framework achieves better performance accuracy
by the UAVs. Simulation results show that the proposed in terms of power allocation than other approaches such
algorithm achieves higher network utility and capacity than as greedy, non-cooperation power allocation, and traditional
the other optimization methods in wireless UAV networks FL.
with reduced computational complexity.
Another interesting work [107] proposes a multi-agent 3) IN SATELLITE NETWORKS
DQN-based DRL method to study the problem of trans- In the following paragraphs, we review works that employ
mit power control in wireless networks. The agents are DRL techniques to address the power allocation issue in
the transmitters whose state space is continuous, consisting satellite networks as well as emerging satellite IoT systems.
of three main feature groups; local information, interfering Managing downlink transmit power in satellite networks is
neighbors, and interfered neighbors. The action space is also one of the major persistent challenges. To this end, the
discrete corresponds to discrete power levels, while the authors in [116] extended their work in [117] and present
reward is a function of the weighted sum-rate of the whole a single-agent Proximal Policy Optimization (PPO)-based
network. Experimental results demonstrate that the proposed DRL model to solve the problem of power allocation in
distributed algorithm provides comparable and even better multi-beam satellite systems. In their model, the agent is the
performance results to the state-of-the-art optimization-based processing engine that allocates power within the satellite,
algorithms available in the literature. whose state space is continuous, comprises the set of demand
High-speed railway (HSR) systems are one of the emerg- requirements per beam, and the optimal power allocations for
ing IoT applications for next-generation wireless networks. the two previous time steps. The action space is continuous,
Such systems are characterized by their rapid variations in representing the allocation of the power for each beam, while
the wireless environment, which mandate the development the reward is a function of both the link data rate achieved by
of light-weighted RRAM solutions. As a response to this, the beam and the power set of the agent. Experimental results
Xu and Ai [88] propose a multi-agent DDPG-based DRL demonstrate the robustness of their proposed DRL algorithm
model to address the problem of sum-rate maximization in dynamic power resource allocation for multi-beam satellite
via power allocation in hybrid beamforming-based mmWave systems.
HSR systems. In their approach, each mobile relay (MR) NOMA technique has shown efficient results in improv-
acts as an agent. The action space is continuous, corre- ing the performance of terrestrial mmWave cellular
sponding to the transmit power level of each MR agent. systems [118]. This has motivated the use of NOMA for



satellite communication systems. However, managing the discrete, which is a function of users’ achievable and target
radio resources in such a system becomes an imperative rates from the RF and VLC APs. The reward is also discrete,
issue. In this context, Yan et al. [119] conducted a pioneer which is a function of the achieved and target rates from all
work to study the problem of power allocation for NOMA- RF and VLC APs. Experimental results demonstrate that not
enabled SIoT using a single-agent DQN-based DRL scheme. only the users’ target rates are satisfied, but also the ability
In their system, the agent is the satellite, whose action space of their algorithm to adapt to the network’s dynamics.
is discrete, corresponding to selecting the power allocation For the same network settings as in [42], Ciftler et al. [44]
coefficient for each NOMA user. The state space is con- propose a DRL-based scheme to enhance the results and
tinuous, consisting of the average SNR, link budget, and overcome the shortcomings. While the work in [42] was
delay-QoS requirements of NOMA users, while the reward based on the vanilla Q-learning algorithm, the work in [44]
is discrete, which is a function of the effective capacity has shown the advantages of utilizing the DQN algorithm to
of each NOMA user. Experimental results demonstrate that improve the convergence rate and accuracy. In particular, the
their proposed DRL-based power allocation scheme can pro- authors in [44] propose a non-cooperative multi-agent DQN-
duce optimal/near-optimal actions, and it provides superior based algorithm to address the problem of power allocation
performance to both the fixed power allocation strategies in hybrid RF/VLC networks. The agents are the RF and VLC
and OMA scheme. APs whose action space is discrete, representing the transmit
power. The state space is continuous, comprised of the actual
4) IN MULTI-RAT NETWORKS and target rates, while the reward function is continuous
Multi-RAT wireless HetNets is one of the main enabling and is a function of target rate band, target rate, and actual
technologies for modern wireless systems, including 6G rate. Using simulation results, the authors demonstrate that
networks [3]. In HetNets, several RATs with different oper- the DQN-based algorithm converges with a rate of 96.1%
ating characteristics coexist to enhance network coverage compared with the Q-learning-based algorithm’s convergence
and reliability while providing enhanced QoE to users. rate of 72.3%.
The underlying RATs have non-overlapping radio resources; Findings and lessons learned: In this section, we review
therefore, there would not be typically interference in the the applications of DRL techniques for power allocation and
network. management in modern wireless networks. The reviewed
Since a stand-alone network with a single RAT would papers are summarized in Table 5. We observe that various
not be able to support the stringent QoS requirements DRL techniques can efficiently solve the power alloca-
of emerging disruptive applications, modern user devices tion optimization problems in diversified wireless network
are equipped with advanced capabilities that enable them scenarios, and their performance outperforms the state-of-
to aggregate various radio resources to boost their QoE. the-art heuristic approaches. Besides, as we discussed in the
Modern user devices can operate in a multi-mode scenario, previous paragraphs, DRL methods can provide compara-
in which each user device can be connected to a single ble results to the conventional centralized optimization-based
RAT at any time. Alternatively, user devices can operate approaches that have full knowledge of the wireless environ-
in a multi-homing scenario such that they can be con- ments as reported in [106], or even better results as reported
nected simultaneously to various RATs to aggregate their in [105]. Moreover, note that the main motivations of using
radio resources, such as bandwidth and data rate. Multi- DRL techniques in all the papers presented in this subsec-
RAT networks include the coexistence of RATs, such as the tion are the complexity of the formulated power allocation
licensed band networks, unlicensed bands networks, hybrid problems, the limited information about network dynamics
systems, and any combination of the wireless networks that and CSI, and the difficulty in applying conventional methods
are shown in Fig. 3. to solve the formulated power allocation problems.
Visible Light Communication (VLC) is a promising RAT We also observe that most of the papers implement
that can support multi-Gbps of data rates over wireless multi-agent DRL interactions, and the value-based DRL
links [120]. It is mainly developed for indoor applica- algorithms, such as DQN and Q-learning, are utilized more
tions; however, it is gaining considerable attention lately than the policy-based counterparts. However, since the power
for outdoor applications as well [121]. This has motivated allocation problem falls in the continuous action space, the
researchers to propose solutions that integrate VLC with con- use of value-based algorithms to address these types of prob-
ventional radio systems to boost data rates. Managing radio lems makes the learned policies vulnerable to discretization
resources in these integrated systems, however, becomes errors that degrade the accuracy and reliability of the learned
a challenge. In this context, in [42], the authors propose models. Hence, the emerging policy-based algorithms, such
a multi-agent Q-learning-based two-time scale scheme to as DDPG and actor-critic, have received more attention
address the power allocation issue for multi-Homing hybrid lately, and they have shown more accurate and reliable
RF/VLC networks. In their technique, the agents are the RF results compared to the value-based counterparts with addi-
and VLC APs, whose action space is discrete, corresponding tional complexity, as discussed in [88], [102], [105], [106].
to selecting the downlink power level that ensures the QoS’s In addition, we observe that the definition of the state space
satisfaction of the multi-homing users. The state space is and the reward function for the RRAM problems must be


TABLE 5. A summary list of papers related to DRL for power allocation.

deliberately engineered as they play a crucial role in the DRL techniques have attracted considerable research interest
convergence and accuracy of the learned policies. For policy- recently due to their robustness in making optimal deci-
based power allocation algorithms, it is more convenient to sions in dynamic and stochastic environments. This section
define the reward as a continuous function since the learn- presents the related works to the applications of DRL algo-
ing process depends on taking its derivative, which is not rithms for radio spectrum allocation in modern wireless
necessarily the case with the value-based algorithms. networks. This includes issues, such as dynamic network
It is also observed that DRL-based power allocation algo- access, user association or cell selection, spectrum access or
rithms can be deployed in a centralized and distributed channels selection/assignment, and the joint of any of these
fashion, depending on the deployment scenario. Distributed issues, as shown in Fig. 3.
scenarios provide more accurate and reliable policies than In modern wireless networks, a massive number of
centralized ones at the expense of added complexity and user devices may request to access the wireless channel
signaling overhead, especially as the number of agents simultaneously. This may drastically overload and congest
increases. Therefore, the tradeoff between the centralized and the channel, causing communication failure and unreliable
distributed policies heavily depends on the scenario under QoS. Hence, efficient communication schemes and protocols
investigation. For example, it is preferable to deploy DRL must be developed to address this issue in channel access
models in a distributed fashion to address the power allo- via employing various access scheduling and prioritization
cation problem for time-sensitive applications. However, for techniques. RRAM for modern wireless networks requires
ultra-reliable applications, it is preferable to adopt centralized considering dynamic load balancing and access management
DRL deployment. Moreover, most of the papers consider the methods to support the massive capacity and connectivity
rate maximization, SE, and EE as key performance metrics requirements of the future wireless networks while utilizing
(e.g., [88], [102], [110]). However, other KPI metrics must their radio resources efficiently. DRL methods have been
be considered as well during the design of DRL frame- used recently to address these issues, and they have demon-
works, such as latency, reliability, and coverage, especially strated efficient results in the context of massive channel
for emerging real-time and time-sensitive IoT applications. access.
We also observe from Table 5 that both the cellular On the other hand, user devices in cellular networks are
HomNets and emerging IoT wireless networks gain more required to associate or be assigned to BS(s) or network
attention than satellite and multi-RAT networks that still in AP(s) to get a service. The association process could be
their early stages and require more in-depth investigation. symmetric, i.e., both uplink and downlink are from the same
BS/AP, or it may be asymmetric in which the uplink and
B. DRL FOR SPECTRUM ALLOCATION AND ACCESS downlink may associate to different BSs/APs. This associa-
CONTROL tion or cell selection process must be carefully addressed as
One of the significant challenges in modern wireless com- it strongly affects the allocation of network radio resources.
munication networks that still needs more investigation is Unfortunately, such types of problems are typically non-
spectrum management and access control. In this context, convex and combinatorial [41] and need accurate network



information to obtain the optimal solution. In this context, network and the individual association for only a single IoT
DRL techniques have also shown efficient results in address- device. The reward function of the first DQN algorithm is
ing user association and cell selection issues for modern the sum-rate of all IoT devices, while for the second DQN
wireless networks. includes both the current transmission rate of IoT devices
and the interference with other IoT devices. Experimental
1) IN CELLULAR NETWORKS results demonstrate that their proposed DRL framework both
In the following paragraphs, we review works that employ scalable and achieves performance comparable to the optimal
DRL algorithms to address the spectrum and access control user association policy.
problem in cellular networks depicted in Fig. 2. Emerging integrated access and backhaul (IAB) cellular
Users-BSs association and bandwidth allocation in UAV- networks are characterized by their dynamic environment and
assisted cellular networks are also among the main emerging large-scale deployment. In another interesting work in [126],
challenges. Towards this end, interesting work is proposed the authors study the problem of spectrum allocation in the
in [122] based on the multi-agent DQN model to address IAB networks. The problem is first formulated as a non-
the joint user association, spectrum allocation, and content convex mix-integer and non-linear programming, and then
caching in an LTE network consisting of UAVs serving a DRL framework based on single-agent Double DQN and
ground users. In their model, the agents are the UAVs, which actor-critic algorithms is proposed to solve it. In their model,
have storage units and have the ability to cached contents the agent is a center-located controller or distributed UE. The
in LTE-BSs. These UAV agents can access the licensed as state space is discrete, indicating the status of UEs’ QoS,
well as the unlicensed spectrum bands, and a remote cloud- and the action space is discrete, corresponding to the allo-
based server is used to control them. The licensed cellular cation matrix for the donor BS and IAB nodes. The reward
spectrum band is used in the transmissions from the cloud function is modeled to optimize the proportional fairness
to the UAVs. Each UAV agent has to obtain 1) its user asso- allocation of the network. Experimental results demonstrate
ciation, 2) bandwidth assignment indicators in the licensed that their framework has promising results compared to other
spectrum band, 3) time slot indicators in the unlicensed spec- conventional spectrum allocation policies.
trum band, and 4) content that the users request. The input The problem of load balancing in large-scale and dynamic
of the DQL is the other agents’ actions (the UAV-user asso- wireless networks is also another important issue. In this con-
ciation schemes), and the output is the set of users that the text, the authors in [127] present a multi-agent Q-learning-
UAV can handle. Simulation results demonstrate that their based algorithm to address the problem of user association
proposed DQL strategy enhances the number of users up to for load balancing in cellular vehicular networks. In their
50% compared to the standard Q-learning strategy. scheme, the agents are the BSs, whose action space is
Based on their initial work in [123], the authors in [124] discrete, representing the associations with the network’s
propose a multi-agent Dueling Double DQN (D3QN)-based vehicles. The state space is a hybrid (continuous and dis-
DRL model to handle the joint BS and channel selections crete), consisting of the service resources and its service
in macro and femto BS networks sharing a set of radio demands, SINR matrix, and association matrix. The reward
channels. In their scheme, the agents are the UEs, whose is a continuous function defined through the association and
state space is a discrete binary vector that shows whether SINR matrices. The main advantage of this paper is that the
UEs’ SINR higher than the minimum QoS requirement or performance of their proposed algorithm is evaluated using
not. The action space is discrete, corresponding to the BS and experiments on real-field taxi movements. The authors show
channel association. The reward function is discrete in which that their approach provides higher quality load balancing
the UE agent will receive a utility as a reward if the QoS is compared to conventional association methods.
met; otherwise, it will receive a negative value for the reward. Most recently, Zheng et al. [128] propose a single agent
Simulation results demonstrate that their proposed strategy actor-critic-based DRL algorithm to address the problem of
outperforms the standard Q-learning strategy in terms of channel assignment for the emerging hybrid NOMA-based
generalization, system capacity, and convergence speed. 5G cellular networks. The agent is the BS, whose action
The problem of user association in cellular IoT networks space is discrete, corresponding to assigning channels for
is studied in an interesting work in [125]. The goal is to users. The state space is a hybrid (continuous and discrete)
assign IoT devices to particular cellular users to maximize comprised of three elements; the CSI matrix, achieved users’
the sum-rate of the IoT network. Two single-agent DQN data rate in the previous time slot, and the assigned chan-
DRL algorithms are proposed; the first one utilizes global nels in the previous time slot. The reward is a discrete
information to make decisions for all IoT devices at one function defined in terms of users’ SE, the number of chan-
time, while the other algorithm uses local information to nels that use NOMA for transmission, and the number of
make a distributed decision for only a single IoT device at users whose data rate is zero. Simulation results demonstrate
one time. In their model, the BS acts as the agent whose that their proposed method outperforms some conventional
state space is continuous, consisting of both historical CSI approaches, such as greedy, random, match theory-based,
and interference information. The action space is discrete, and Genetic Algorithms, in terms of both network SE and
representing both all possible association schemes of the sum-rate.


The problem of spectrum management in wireless DSA IoT sensor networks. In that work, the agent is one BS, which
is addressed in [129] based on distributed multi-agent DQN. controls the channel assignments for energy harvesting-
In their approach, the agents are each DSA user, whose enabled sensors. The problem of the agent is to predict the
action space is discrete, corresponding to the transmit power battery level of the sensors and to assign channels to sen-
change for each channel. The state space is discrete, defined sors such that the total rate is maximized. The DQL model
as the transmit power on wireless channels. The reward is a used to solve this problem has two long-short-term-memory
continuous function defined by the SE and the penalty caused (LSTM) neural network (NN) layers, one for predicting the
by the interference to PUs. Experimental results show that sensor’s battery state and one for obtaining channel access
their proposed model with echo state network-based DQN policy based on the predicted states obtained from the first
achieves a higher reward with both achievable data rate and layer. The agent’s action is all the available sensors that
PU protections. require to access the channels. The state contains the his-
Antenna selection is widely used for physical layer secu- tory of channel access scheduling, true and predicted battery
rity in multi-antenna-based cellular networks. In this context, information history and the current sensor’s CSI. Simulation
the authors in [130] propose a single-agent DQN algorithm results show that the total rates obtained using the DQL
to predict the optimal transmit antenna in the MIMO wire- scheme are 6.8 kbps compared to 7 kbps obtained from the
tap channel. The state space is discrete, defined in terms optimal scheme rate.
of the security capacity and maximum SNR of the MIMO Managing spectrum allocation is one of the main objec-
wiretap channel. The action space is discrete, corresponds to tives in cognitive radio networks (CRNs). The main idea
selecting the transmit antenna. The reward function is dis- is to efficiently utilize the available spectrum via enabling
crete, defined in terms of the achieved SNR at the antenna. SUs to use the spectrum resources when the PUs are inac-
Experimental results demonstrate that their proposed algo- tive. The authors in [135] propose a multi-agent DQN-based
rithm proactively predicts the optimal antenna while reducing model to address the cooperative spectrum sensing issue in
the secrecy outage probability of MIMO wiretap system CRNs. In their scheme, the agents are the SUs whose action
compared to the support vector machine and conventional is discrete, corresponding to sensing the spectrum for possi-
approaches. ble transmission without interfering with the PUs. The state
space is discrete, and it is comprised of four elements repre-
2) IN IOT AND OTHER EMERGING WIRELESS senting cases when the spectrum is sensed as occupied, the
NETWORKS spectrum is not sensed in a particular time slot, the spectrum
In the following paragraphs, we review works that employ is sensed as idle, and one of the other SUs broadcast the
DRL algorithms to address the spectrum and access control sensing result first. The reward function is the binary indi-
problem in IoT and emerging wireless networks illustrated cator, which is “+1” if the spectrum is sensed as idle and
in Fig. 2. “0” otherwise. Simulation results show that their proposed
IoT sensor networks are characterized by their high algorithm has a faster convergence speed and better reward
dynamicity, which necessitates efficient channel access for performance than the conventional Q-learning algorithm.
the connecting nodes. In [131], the authors build on their For the same network in [135], the authors in [136] extend
initial work in [132] and propose a single-agent DQN-based the work and propose a multi-agent DQN-based DRL scheme
DRL scheme to tackle the problem of dynamic channel to address the problem of dynamic joint spectrum access
access for IoT sensor networks. In their scheme, the agent is and mode selection (SAMS) in CRNs. The agents are the
the sensor, and its action is discrete, corresponding to select- secondary users (SUs) whose action space is discrete, corre-
ing one channel to transmit its packets at each time slot. The sponding to selecting the access channel and access mode.
state space is discrete, which is a combination of rewards The state space of each SU agent is discrete, comprised of
and actions in the previous time slots. The reward func- the action taken by the mth SU agent, the ACKs of all SUs
tion is also discrete, which is “+1” if the selected channel agents, and the ACK of the mth SU agent. The reward func-
is in low interference in such case a successful transmis- tion is discrete, which is “1” if the action selection process is
sion occurs; otherwise, the reward is “−1” in such case the successful; otherwise, there is a collision, and the agent will
selected channel is in high interference, and a transmission receive a “0” reward. Simulation results demonstrate that
failure occurs. Simulation results show that their proposed their proposed DQN algorithm provides comparable results
scheme achieves an average reward of 4.4 compared to 4.5 to the Max benchmark after the model’s convergence.
obtained using the conventional myopic policy [133], which Xu et al. [137] propose a single-agent DQN and DDQN-
needs a compact knowledge of the transition matrix of the based DRL approaches to address the problem of dynamic
system. spectrum access in wireless networks. In their model, the
Energy consumption is considered one of the persistent agent is a wireless node (e.g., a user) whose action is discrete,
challenges for emerging wireless sensor networks. In this corresponding to sensing the discrete frequency channel for
context, an interesting work is proposed in [134] in which possible data transmission. The state space is discrete, defin-
the authors develop a single-agent DQN-based DRL model ing if the channel is occupied or idle at time slot t. The
to address the channel selection in energy harvesting-based reward function is discrete, which is ranging from 0 to 100



for successful transmission; otherwise, the reward is “−10” if the channel state is occupied and the user transmission fails. It is shown using simulation results that both DQN and DDQN can learn different nodes’ communication patterns and achieve near-optimal performance without prior knowledge of system dynamics.

Allocating spectrum resources is also a major challenge in vehicular IoT networks. Based on their initial work in [138], the authors in [139] propose distributed single- and multi-agent DQN-based DRL schemes to address the spectrum sharing problem in V2X networks. In their proposed system, multiple V2V links reuse the frequency spectrum preoccupied with V2I links. The agents are the V2V links whose action space is discrete, corresponding to spectrum sub-band and power selection. Each agent’s local observation space includes local channel information (such as its own signal channel gain, interference channels from other V2V transmitters, the interference channel from its own transmitter to the BS, and the interference channel from all V2I transmitters), the remaining V2V payload, and the remaining time budget. The reward is continuous, which is a function of both the instantaneous sum capacity of all V2I links and the effective V2V transmission rate until the payload is delivered. Experimental results show that the agents cooperatively learn a policy that enables them to simultaneously improve the sum capacity of V2I links and the payload delivery rate of V2V links. The authors also show that their proposed models for the single-agent and multi-agent settings provide very close performance to the conventional exhaustive search.
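The following short sketch illustrates how such a composite objective can be collapsed into a single scalar learning signal by mixing the instantaneous V2I sum capacity with the effective V2V rate while a payload is still pending. The weights, normalization, and delivery bonus are assumptions for illustration only and are not the exact reward of [139].

```python
# Illustrative reward shaping for V2X spectrum sharing (weights and names are assumptions,
# not the exact formulation of [139]): combine the V2I sum capacity with the V2V rate
# while the payload is still being delivered.
import numpy as np

def v2x_reward(v2i_sinr, v2v_rate_bps, payload_remaining_bits, w_v2i=0.5, w_v2v=0.5):
    """Continuous reward from instantaneous V2I capacities and the effective V2V rate."""
    v2i_capacity = np.sum(np.log2(1.0 + np.asarray(v2i_sinr)))   # sum capacity of V2I links
    # Once the V2V payload is delivered, the V2V term switches to a constant bonus.
    v2v_term = v2v_rate_bps if payload_remaining_bits > 0 else 1.0e6
    return w_v2i * v2i_capacity + w_v2v * v2v_term / 1.0e6       # normalize the rate to Mb/s

print(v2x_reward(v2i_sinr=[10.0, 6.3], v2v_rate_bps=2.4e6, payload_remaining_bits=512))
```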
Multi-sensor networks are an emerging technology that is expected to play a key role in future wireless networks. In this context, the authors in [140] propose a single-agent DQN model to address the joint channel access and packet forwarding in a multi-sensor scenario. In the proposed scheme, one sensor is the agent, which acts as a relay to forward packets arriving from its surrounding sensors to the sink. The agent has a buffer to store arriving packets. The agent’s action is to choose channels for the packet forwarding, the packets transmitted on these channels, and a modulation scheme at each time slot to maximize its utility (defined as the ratio of the number of transmitted packets to the transmit power). The state is the combination of the buffer and channel states. The input of the DQL model is the state, while the output is the corresponding action selection. Simulation results demonstrate that the proposed DQL scheme enhances system utility (i.e., 0.63) compared to the conventional random action selection scheme (i.e., 0.37).

One of the major challenges in mmWave wireless networks is establishing radio links and coping with the high vulnerability of intermittent communication. This issue is even exacerbated in mmWave V2X due to the high mobility of vehicles. Towards this end, Khan et al. [141] propose a multi-agent A3C-based DRL scheme to address the problem of vehicle-cell association in mmWave V2X networks. The agents are the RSUs whose action is discrete, corresponding to determining the optimal vehicle-RSU association for each RSU. The state space is a hybrid (discrete and continuous) defined in terms of the last channel observations, a rate threshold violation indicator, and the experienced data rate of vehicles. The reward function is continuous, defined in terms of the average rate per vehicle and the threshold rate. Using experimental results, it is shown that their proposed algorithm achieves around 15% sum-rate gains and a 20% reduction in vehicular user outages compared to baseline approaches.

The problem of dynamic spectrum access in CRNs is investigated in [142] through combining DRL and evolutionary game theory. In particular, uncooperative multi-agent DQN is considered in which the agents are the SUs whose action is discrete, corresponding to selecting the access channel. The state space is discrete, which includes two main parts; the channel selected by the agent and the utility obtained after transmission on the selected channel. The reward function is defined in terms of evolutionary game theory. Simulation results indicate the performance enhancement of their proposed algorithm over the case without learning in terms of average system capacity.

Another interesting work is presented in [143] in which the authors propose a multi-agent DQN-based DRL algorithm to address the problem of optimum multi-user access control in Non-Terrestrial Networks (NTNs). In their model, UEs are the independent agents that report their experiences and local observations to a centralized trainer controller located at the backhaul network. The latter will then utilize the collected experiences to update the global DQN parameters. The agent’s state space is continuous, comprised of the connected NT-BS of UEs at the previous time slots, the RSS of UEs, the number of connected UEs of each NT-BS, and the transmission rate of UEs. The action space is discrete, representing the binary indicator functions of UEs, while the reward is a function of the transmission rate of UEs. Experimental performance results show that their proposed scheme is efficient in addressing the fundamental issues in the deployment of NTNs infrastructure, and it outperforms the traditional algorithms in terms of both the data rate and the number of handovers.

The integration of various DRL algorithms to improve the efficiency and accuracy of the learned RRAM policies has shown promising results lately. In this context, Tomovic and Radusinovic [144] propose an interesting single-agent DRL model based on the integration of the Double deep Q-learning architecture and RNN to address the problem of DSA in multi-channel wireless networks. In particular, the agent is the SU node, whose action space is discrete, representing the selection of a channel for sensing. The state space is also discrete, comprised of a history of the binary observations and a history of taken actions. The reward function is a discrete binary function, which is “1” if the observation is “1” and “0” otherwise. Simulation results show that their proposed method is able to find a near-optimal policy in a smaller number of iterations, and it can support a wide range of communication environment conditions.
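Several of the works above rely on Double DQN to mitigate Q-value over-estimation. The generic sketch below shows the key step: the online network selects the next action while the target network evaluates it. It is a minimal, assumption-based illustration rather than the implementation of any surveyed paper.

```python
# Generic Double-DQN target computation: the online network selects the next action,
# the target network evaluates it. A minimal sketch (sizes and gamma are assumptions).
import torch
import torch.nn as nn

def double_dqn_targets(online_net: nn.Module, target_net: nn.Module,
                       rewards: torch.Tensor, next_states: torch.Tensor,
                       dones: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    with torch.no_grad():
        # Action selection with the online network ...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... but evaluation with the target network (reduces over-estimation bias).
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
    return rewards + gamma * (1.0 - dones) * next_q

if __name__ == "__main__":
    net = nn.Linear(8, 4)           # stand-in Q-network: 8-dim state, 4 discrete actions
    tgt = nn.Linear(8, 4)
    s2 = torch.randn(32, 8)
    r = torch.ones(32)
    d = torch.zeros(32)
    print(double_dqn_targets(net, tgt, r, s2, d).shape)  # torch.Size([32])
```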

In other work in [145], the authors propose both a single-agent and multi-agent deep actor-critic DRL-based framework for dynamic multi-channel access in wireless networks. In their system, the agents are the users whose action space is discrete, corresponding to selecting a channel. The observation space is also discrete, which is defined based on the status of the channel and collision status. The reward function is discrete, which is “+1” if the selected channel is good; otherwise, it is “−1”. Using simulation results, the authors show that their proposed actor-critic framework outperforms the DQN-based algorithm, random access, and the optimal policy when there is full knowledge of the channel dynamics.
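For reference, the sketch below shows a compact advantage actor-critic (A2C-style) update for discrete channel selection, in the spirit of the actor-critic frameworks discussed above. The architecture, loss weighting, and returns used here are assumptions and are not taken from [145].

```python
# Minimal advantage actor-critic (A2C-style) update for discrete channel selection.
# A sketch only; network sizes, learning signal, and loss weighting are assumptions.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, n_channels: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, n_channels)   # channel-selection logits
        self.value_head = nn.Linear(64, 1)              # state-value estimate

    def forward(self, state):
        h = self.shared(state)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_loss(model, states, actions, returns):
    """Policy-gradient loss weighted by the advantage, plus a value-regression term."""
    logits, values = model(states)
    log_probs = torch.log_softmax(logits, dim=1).gather(1, actions.unsqueeze(1)).squeeze(1)
    advantages = returns - values.detach()
    policy_loss = -(advantages * log_probs).mean()
    value_loss = nn.functional.mse_loss(values, returns)
    return policy_loss + 0.5 * value_loss

model = ActorCritic(state_dim=6, n_channels=3)
loss = a2c_loss(model, torch.randn(16, 6), torch.randint(0, 3, (16,)), torch.randn(16))
loss.backward()
```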
The problem of DSA for the CRN is studied in [146] based on an uncoordinated and distributed multi-agent DQN model. The agents are the CRs, whose action is discrete, representing the possible transmit powers. The state space is discrete, reflecting whether the limits for DSA are being met or not, depending on the relative throughput change at all the primary network links. The reward is also discrete, which is a function of the throughput of the links and the environment’s state. Experimental results reveal that their proposed scheme finds policies that yield performance within 3% of an exhaustive search solution, and it finds the optimal policy in nearly 70% of cases.

Industrial IoT (IIoT) has emerged recently as an innovative networking ecosystem that facilitates data collection and exchange in order to improve network efficiency, productivity, and other economic benefits [147]. RRAM in such a sophisticated paradigm is also a challenge that needs more investigation. The work in [148] can be considered a pioneer, in which the authors propose a solution for spectrum resource management for IIoT networks, with the goal of enabling spectrum sharing between different kinds of UEs. In particular, a single-agent DQN algorithm is proposed in which the agent is the BS. The action space is discrete, which corresponds to the allocation of spectrum resources for all UEs. The observation space is a hybrid (continuous and discrete) consisting of four elements; the current action (i.e., the allocation of spectrum resources), the data priority of type I UEs, the buffer length of type II UEs, and the communication status of the first type of UEs. The reward function is continuous, defined to address their optimization problem. It is divided into four objectives; 1) maximizing the spectrum resource utilization; 2) quickly transmitting the high-priority data; 3) meeting the threshold rate requirement of the first type of UEs; 4) ensuring that the second type of UEs completes the transmission in time. Using simulation results, it is demonstrated that their proposed algorithm achieves better network performance with a faster convergence rate compared with other algorithms.

Most recently in [149], the authors propose a multi-agent Double DQN-based DRL model to address the problem of DSA in distributed wireless networks. In particular, they design a channel access scheme to maximize channel throughput regarding fair channel access. The agents in their scheme are the users. The action space is discrete, which is “0” if the user does not attempt to transmit packets during the current time slot, and it is “1” if it has attempted to transmit. The state space is discrete, consisting of four main elements; each user action taken on the current time slot, channel capacity (which could be negative, positive, or zero), a binary acknowledgment signal showing if the user transmits successfully or not, and a parameter that enables each user to estimate other users’ situations. The reward is a discrete binary function that takes the value of “1” if the user transmits successfully; otherwise, it is “0”, meaning that the user’s transmission collided. It is shown using simulation results that their scheme can maximize the total throughput while trying to make fair resource allocation among users. Also, it is shown that their proposed scheme outperforms the conventional Slotted-Aloha scheme in terms of sum-throughput.

Vehicular ad hoc networks (VANETs) are one of the promising networks for next generation wireless systems, where networks are formed and information is relayed among vehicles. Wang et al. [150] address the problem of DSA in VANETs by proposing an interesting scheme that combines multi-hop forwarding via vehicles and DSA. The optimal DSA policy is defined to be the joint maximization of channel utilization and minimization of the packet loss rate. A multi-agent DRL network structure is presented that combines RNN and DQN for learning the time-varying process of the system. In their scheme, each user acts as an agent whose action space is discrete, corresponding to choosing a channel for transmission at time slot t. The state space is discrete, composed of three components; a binary transmission condition η, which is “1” if the transmission is successful and “0” otherwise, the channel selection action, and the channel status indicator after each dynamic access process. The reward is a discrete binary function, which takes a positive value if η = 1; otherwise it takes the value of “0”. Simulation results show that their proposed scheme: 1) reduces the packet loss rate from 0.8 to around 0.1, 2) outperforms Slotted-Aloha and DQN in terms of reducing collision probability and channel idle probability by about 60%, and 3) enhances the transmission success rate by around 20%.

Due to their ability to improve communication in harsh environments, UAV networks have gained considerable research attention lately [151]. For example, most recently in [152], the authors propose efficient multi-agent DRL-based schemes to address the problem of joint cooperative spectrum sensing and channel access in clustered cognitive UAV (CUAV) communication networks. Three algorithms are proposed: 1) a time slot multi-round revisit exhaustive search based on a virtual controller (VC-EXH), 2) Q-learning based on an independent learner (IL-Q), and 3) DQN based on an independent learner (IL-DQN). The agents are the CUAVs in the network. The action space of any CUAV agent is a discrete function defined by the steps that this agent moves clockwise in time slot t relative to the channel selected in time slot t − 1 on the PU channel ring. The state space is a
discrete set consisting of two main elements: 1) the number of CUAV agents that have selected a particular channel to sense and access in the previous time slot and 2) a binary indicator function that shows the occupancy status of a particular channel in the previous time slot. The reward is a discrete function defined in terms of spectrum sensing, channel access, utility, and cost. Experimental results show that all three proposed algorithms show efficient results in terms of convergence speed and the enhancement of the utilization of idle spectrum resources.

An interesting work is conducted in [153] in which the authors propose a multi-agent deep recurrent Q-network-based model to address the problem of DSA in dynamic heterogeneous environments with partial observations. In their work, the authors consider a case-study with multiple heterogeneous PUs sharing multiple independent radio channels. The agents are the SUs, whose action space is discrete, corresponding to deciding whether to transmit in a particular band or wait during the next time slot. The state space is discrete, representing whether the channels are occupied, idle, or unknown. The reward function is discrete, represented by two values; 100 per channel for successful transmission and −50 per channel for collision. Using simulation results, the authors show that their proposed algorithm handles various dynamic communication environments, and its performance outperforms the myopic conventional methods and is very close to the optimization-based approaches that have a full observation of the environment.
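Partial observability is typically handled by recurrent Q-networks such as those in [144] and [153], where an RNN summarizes the observation history before the Q-value head. The following generic sketch (with assumed dimensions) illustrates the idea.

```python
# Generic deep recurrent Q-network (DRQN) sketch for partially observable channel access:
# a GRU summarizes the observation history before the Q-value head. Sizes are assumptions.
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(input_size=obs_dim, hidden_size=hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim), e.g., per-channel occupied/idle/unknown flags
        out, h_n = self.gru(obs_seq, h0)
        q_values = self.q_head(out[:, -1, :])   # Q-values for the latest time step
        return q_values, h_n

drqn = DRQN(obs_dim=3, n_actions=4)
q, h = drqn(torch.randn(2, 10, 3))
print(q.shape)   # torch.Size([2, 4])
```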
to selecting which sensors to allocate channels to. The state
3) IN SATELLITE NETWORKS space is discrete, comprised of three parts; the number of tasks
In the following paragraphs, we review works that employ in each time step, the bandwidth that a sensor node requires,
DRL algorithms to address the spectrum and access con- and the duration of a new task. The reward is continuous,
trol problem in satellite networks and emerging satellite IoT which is a function of the duration of data transmission for
systems. the sensor. Using simulation results, it is shown that their
The work in [154] proposes a single-agent DQN-based proposed algorithm both provides higher transmission success
DRL algorithm that considers the problem of channel rates and reduces data transmission latency by at least 87.4%
assignment in multi-beam satellite systems. In their scheme, compared to the conventional channel allocation algorithms.
the agent is the satellite, whose action is discrete, including An interesting work is reported by Zheng et al. [158] in
an index that indicates the channel that the newly arrived user which the authors propose a single-agent Q-learning-based
has occupied. The agent’s reward is discrete, which contains RL model to address the problem of combination allocation
a positive value if the service is satisfied, and a negative value of fixed channel pre-allocation and dynamic channel schedul-
if the service is not satisfied or blocked. The state space is ing in a network architecture of LEO satellites that utilizes a
also discrete, which comprises three elements; the current centralized resource pool. In their model, the satellite serves
users, the current channel assignment matrix, and a list of as an agent whose action is discrete, corresponding to assign-
the new user arrivals. Experimental results demonstrate that ing channels to users. The state space is discrete, defined by
their proposed scheme decreases the blocking probability and the channel assignment of users in each beam. The reward is
improves the carried traffic up to 24.4% as well as enhances continuous, which is a function of the user’s supply-demand
the spectrum efficiency compared to the conventional fixed ratio. Experimental results demonstrate that their proposed
channel assignment approach. approach enhances the system supply-demand ratio by 14%
In the same context, the authors in [155] propose a single- and 18% compared to the static channel allocation and the
agent DQN-based DRL algorithm to address the problem of Lagrange algorithm channel allocation methods, respectively.
dynamic channel allocation in multi-beam satellite systems.
In particular, an image-like tensor formulation on the system
4) IN MULTI-RAT NETWORKS
environments is considered in order to extract traffic spa-
tial and temporal features. The agent in their model is In the following paragraphs, we review works that employ
the satellite, whose action space is discrete, corresponding DRL algorithms to address the spectrum and access control

problem in multi-RAT HetNets. This includes the coexistence of various variants of the wireless networks shown in Fig. 2.

Managing the spectrum bands in unlicensed cellular networks is also another persistent challenge. In this context, the authors in [159] present a multi-agent DQN-based model that jointly tackles the dynamic channel selection and interference management in Small Base Stations (SBSs) cellular networks that share a set of unlicensed channels in Long Term Evolution (LTE) networks. In the proposed scheme, the SBSs are the agents who choose one of the available channels for transmitting packets in each time slot. The agent’s action is channel access and channel selection probability. The DQL input includes the channels’ traffic history of both the SBSs and Wireless Local Area Networks (WLAN), while the output is the agent’s predicted action vectors. Simulation results reveal that their proposed DQL strategy enhances the average data rate by up to 28% compared to the conventional Q-learning scheme.

For the same network settings in [159], the authors in [160] propose a single-agent DQN-based model to tackle the dynamic spectrum allocation for multiple users that share a set of K channels. In their scheme, the agent is the user whose action is either choosing a channel with a particular attempt probability or selecting not to transmit. The agent’s state includes the history of the actions of the agent and its current observations. The DQL model input is the previous actions along with their observations, while the output is the Q-values corresponding to the actions. Simulation results demonstrate that their proposed DQL strategy achieves double the data rate compared to the state-of-the-art Slotted-Aloha scheme.
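Agents whose state is a history of past actions and observations, as in [160], usually maintain a fixed-length sliding window that is flattened into the DQN input. The helper below is an illustrative sketch of this construction; the window length and encoding are assumptions.

```python
# Illustrative construction of the agent state used by history-based DQN schemes
# (e.g., previous actions and observations): a fixed-length sliding window that is
# flattened into the network input. Window length and encoding are assumptions.
from collections import deque
import numpy as np

class HistoryState:
    def __init__(self, n_channels: int, window: int = 8):
        self.n_channels = n_channels
        self.buffer = deque(maxlen=window)

    def push(self, action: int, observation: float) -> None:
        one_hot = np.zeros(self.n_channels)
        one_hot[action] = 1.0
        self.buffer.append(np.append(one_hot, observation))   # [action one-hot | obs]

    def vector(self) -> np.ndarray:
        pad = [np.zeros(self.n_channels + 1)] * (self.buffer.maxlen - len(self.buffer))
        return np.concatenate(pad + list(self.buffer))         # fixed-size DQN input

hist = HistoryState(n_channels=4)
hist.push(action=2, observation=1.0)   # e.g., ACK received on channel 2
print(hist.vector().shape)             # (40,) = 8 x (4 + 1)
```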
The integration of cellular networks and indoor networks has also shown efficient results in enhancing the QoS of wireless communication in terms of coverage and data rate. Towards this end, Wang and Lv [161] propose an efficient single-agent prediction-DDPG-based DRL algorithm to study the problem of dynamic multichannel access (MCA) for the hybrid long-term evolution and wireless local area network (LTE-WLAN) aggregation in dynamic HetNets. The agent is the central BS controller, whose state space is continuous, consisting of both the channels’ service rates and the users’ requirement rates. The action space, on the other hand, is discrete, representing the users’ index. Two reward functions are provided; an online traffic real reward and an online traffic prediction reward, each of which is a function of users’ requirements, channels’ supplies, degree of system fluctuation, the relative resource utilization, and the quality of user experience. Using simulation results, the authors demonstrate the efficiency of the proposed prediction-DDPG model in solving the dynamic MCA problem compared to conventional methods.

In another interesting work in [162], the authors investigate the joint allocation of the spectrum, computing, and storing resources in multi-access edge computing (MEC)-based vehicular networks. In particular, the authors propose multi-agent DDPG-based DRL algorithms to address the problem in a hierarchical fashion considering a network comprised of a macro eNodeB (MeNB) and Wi-Fi APs. The agents are the controllers installed at MEC servers. The agents’ action space is discrete, including the spectrum slicing ratio set, spectrum allocation fraction sets for the MeNB and for each Wi-Fi AP, the computing resource allocation fraction, and the storing resource allocation fraction. The state space is discrete, representing information of the vehicles within the coverage area of the MEC server, including the vehicles’ number, x-y coordinates, moving state, position, and task information. The reward function is discrete, defined in terms of the delay requirement and the requested storing resources required to guarantee the QoS demands of an offloaded task. Provided experimental results reveal that their proposed schemes achieve high QoS satisfaction ratios compared with the random assignment techniques.

The integration of various cellular wireless networks is also one of the main enabling technologies for the next generation wireless networks. Recently in [163], the authors propose an efficient single-agent DQN algorithm based on Monte Carlo Tree Search (MCTS) to address the problem of dynamic spectrum sharing between 4G LTE and 5G NR systems. In particular, the authors used the MuZero algorithm to enable a proactive BW split between 4G LTE and 5G NR. The agent is a controller located at the network core, whose action space is discrete, corresponding to a horizontal line splitting the BW between 4G LTE and 5G NR. The state space is discrete, defined by five elements: 1) an indicator if the user is an NR user or not, 2) the number of bits in the user’s buffer, 3) an indicator of whether the user is configured with multimedia broadcast single frequency network (MBSFN) or not, 4) the number of bits that can be transmitted for the user in a given subframe, and 5) the number of bits that will arrive for each user in the future subframes. The reward function is a continuous function defined as a summation of the exponential of the delayed packets per user. Experimental results show that their proposed scheme provides comparable performance to the state-of-the-art optimal solutions.

TABLE 6. A summary list of papers related to DRL for spectrum allocation and access control.

Findings and lessons learned: This section reviews the applications of DRL for dynamic spectrum allocation and access control in modern wireless networks. These types of radio resources are inherently coupled with user association, network/RAT selection, dynamic multi-channel access, and DSA. Table 6 summarizes the reviewed papers in this section. In general, the application of DRL for spectrum allocation and access control problems has received considerable attention lately. We observe that most DRL algorithms, when deployed for non-IoT networks, are implemented in centralized fashions at network controllers, such as BSs, RSUs, and satellites [125], [141], [154]. This is done to utilize the controllers’ powerful and advanced hardware capabilities in collecting network information and designing cross-layer policies. Hence, we observe that DRL models are deployed as a single-agent at the network controllers [148]. On the contrary, DRL provides a flexible tool in diversified IoT networks and systems, conventionally involving dynamic
system modeling and multi-agent interactions, such as CRNs and distributed systems. Also, note that the main motivations of using DRL techniques in almost all the papers presented in this subsection are the complexity of the formulated spectrum allocation and access control problems, the inability to obtain accurate CSI, and the inadequacy of conventional methods to solve the formulated problems.

In addition, the management of such types of radio resources falls in general in the discrete action space. Therefore, the value-based algorithms are utilized more than the policy-based ones, and they have shown efficient results, as we discussed in the surveyed papers. We also observe that embedding prediction-based DRL algorithms, such as RNN, with the conventional DNN models has shown efficient results in enabling DRL to perform proactive spectrum prediction. Such integrated models have been seen in [144], [150], [153], and we expect that they will attract more attention in the future. In addition, it is always preferable to utilize DQN-based algorithms over the Q-learning algorithm, as they provide better performance in terms of convergence speed and accuracy of the learned policies. Moreover, as is the case with the other DRL models, the definitions of the state space and reward function are crucial, and they must provide representative and rich information about the system and environment to the agent in order to learn efficient and reliable RRAM policies.

We also observe from Table 6 that the use of DRL techniques for IoT and emerging wireless networks receives more attention than other wireless networks, especially for cognitive radio-based systems as in [152].

The exponential increase in smart IoT devices mandates making autonomous decisions locally, especially for delay-sensitive IoT applications and services. In this context, we anticipate that the research on spectrum allocation and access control using distributed multi-agent DRL algorithms for future IoT networks will attract more attention, as in [139], [141], [150], [152].

C. DRL FOR RATE CONTROL
This refers to the adaptive rate control in the uplink and downlink of wireless networks. With the explosive increase in the number of user devices and the emergence of massive types of data-hungry applications [3], it becomes essential to keep high network KPIs in terms of data rates and users’ QoE. Adaptive rate control schemes must ensure satisfactory QoS in highly dynamic and unpredictable wireless environments. In this context, DRL techniques can be efficiently deployed to solve adaptive rate control problems instead of conventional approaches that possess high complexity and heavily rely on accurate network information and instantaneous CSI.

In the following paragraphs, we review works that employ DRL algorithms to address the rate control issue in cellular networks.

5G network slicing is a technique based on the network virtualization concept that enables dividing single network connections into multiple unique virtual connections to provide various radio resources to various types of traffic. Liu et al. [165] conduct a pioneer DRL-based work to address the problem of network resource allocation, in
terms of rate, for 5G network slices. The problem is decomposed into a master-slave structure, and a multi-agent DDPG-based DRL scheme is then proposed to solve it. The agents are located in every network slice, whose action space is continuous, defining the resource allocation to users in the network slice. The state space is continuous and has two main parts; the first one shows how much utility the user obtained compared to its minimum utility requirement, while the second part shows the auxiliary and dual variables from the master problem. The reward is a continuous function defined in terms of utility, utility requirements, and auxiliary and dual variables. Simulation results demonstrate that their proposed algorithm outperforms the baseline approaches and gives a near-optimal solution.
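As an illustration of how a DDPG-style actor can emit a continuous rate allocation for the users of a slice, the sketch below constrains the action to a scaled simplex via a softmax output. The layer sizes and the slice budget are assumptions and do not reflect the exact design of [165].

```python
# Illustrative DDPG-style actor for continuous resource (rate) allocation within a slice:
# a softmax output layer yields a non-negative allocation that sums to the slice budget.
# Layer sizes and the budget are assumptions, not the design of [165].
import torch
import torch.nn as nn

class SliceActor(nn.Module):
    def __init__(self, state_dim: int, n_users: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                  nn.Linear(128, n_users))

    def forward(self, state: torch.Tensor, budget: float = 100.0) -> torch.Tensor:
        # Softmax keeps the continuous action on the simplex; scale by the budget (Mb/s).
        share = torch.softmax(self.body(state), dim=-1)
        return budget * share

actor = SliceActor(state_dim=12, n_users=5)
allocation = actor(torch.randn(12))
print(allocation.sum())   # approximately 100.0, the assumed slice rate budget
```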
High mobility networks are characterized by their rapid variations that render link establishment a major issue. In this context, the authors in [166] propose an interesting work using a single-agent DQN-based DRL algorithm to address the problem of dynamic uplink/downlink radio resource allocation in terms of network capacity in high mobility 5G HetNets. Their proposed algorithm is based on the Time Division Duplex (TDD) configuration in which the agent is the BS, whose action space is discrete, corresponding to the configurations of TDD sub-frame allocation at the BS. The state space is discrete, comprised of different kinds of features of the BS, including uplink/downlink occupancy, buffer occupancy, and the channel condition of all uplinks/downlinks to the BS. The reward is discrete, defined as a function of the uplink and downlink channel utility, which mainly depends on channel occupancy with the chosen TDD configuration. Using experimental results, the authors show that their proposed algorithm achieves performance improvement in terms of both network throughput and packet loss rate, compared to some conventional TDD resource allocation algorithms.

Findings and lessons learned: This section reviews the use of DRL techniques for adaptive rate control in next generation wireless networks. In general, there is limited research that is solely dedicated to addressing the rate radio resource issue. We consider [165] and [166] as pioneer works in this type of RRAM. Most of the research in the literature is dedicated to video streaming applications, and the paper [15] highlighted some of them. However, as we discussed in the previous sections, the data rate control issue is typically addressed via controlling other radio resources such as power, user association, and spectrum. In addition, adaptive rate control is typically addressed as a joint optimization with other radio resources, as we will elaborate in the next section, e.g., as in [167], [168].

We also observe that DRL-based solutions for cellular networks receive more attention than other wireless networks, and there is a lack of research on adaptive rate control for IoT and satellite networks. This also deserves more in-depth investigation and analysis.

D. DRL FOR JOINT RRAM
Due to the massive complexity and large-scale nature of modern wireless networks, it becomes necessary to design efficient schemes that account for the joint radio resources. In many scenarios, the design problem in wireless networks might end up with competing objectives. For example, in UDNs, increasing the power level is beneficial in combating path loss and enhancing the received signal quality. However, this might cause serious interference to the neighboring user devices and BSs. Hence, the joint design of power level control and interference management becomes mandatory. Conventional approaches for solving joint RRAM problems require complete and instantaneous knowledge about network statistics, such as traffic load, channel quality, and radio resource availability. However, obtaining such information is not possible in such large-scale networks. In this context, DRL techniques can be adopted to learn system dynamics and communication context to overcome the limited knowledge of wireless parameters.

This section intensively reviews the most important and influential works that implement DRL algorithms for the problem of joint RRAM in modern wireless networks. Particularly, we present related works that jointly optimize the radio resources shown in Fig. 2, such as power allocation, spectrum resources, user association, dynamic channel access, cell selection, etc.

1) IN CELLULAR NETWORKS
In the following paragraphs, we review works that employ DRL algorithms to address the joint RRAM issue in cellular networks shown in Fig. 2.

Cellular vehicular communication (CV2X) is regarded as one of the main enabling technologies for next generation wireless networks. RRAM in such networks has received significant momentum using conventional methods, and it is now gaining notable attention using DRL methods. For example, an interesting work is reported in [169], in which the authors study the problem of joint optimization of transmission mode selection and resource allocation for CV2X. They propose single-agent settings in which DQN and federated learning (FL) models are integrated to improve the model’s robustness. The agent in their model is each V2V pair. The action space is discrete, representing the resource block (RB) allocation, communication mode selection, and transmit power level of the V2V transmitter. The state space is a hybrid (continuous and discrete) consisting of five main parts; the received interference power at the V2V receiver and the BS on each RB at the previous subframe, the number of selected neighbors on each RB at the previous subframe, the large-scale channel gains from the V2V transmitter to its corresponding V2V receiver and the BS, the current load, and the remaining time to meet the latency threshold. The reward is a continuous function defined in terms of the sum-capacity of vehicular UEs as well as the QoS requirements of both vehicular UEs and V2V pairs. Using experimental results, the authors show that their proposed two-timescale federated DRL algorithm outperforms other decentralized baselines.

RRAM in small cell networks is one of the ongoing challenges for cellular operators. Towards this end,
Jang and Yang [170] propose a multi-agent DQN-based algorithm to address the problem of sum-rate maximization via a joint optimization of resource allocation and power control in small cell wireless networks. The agents in their proposed model are the small cell BSs, whose action space is discrete, corresponding to selecting the resource allocation and power control of the small BS on each RB. The state space is continuous, including all the CSI that the small BS collects on the RB, such as local CSI, local CSI at the transmitter, etc. The reward is a continuous function expressed by the average sum-rate of its own serving users and the other small BSs. Experimental results show that their proposed approach both outperforms the conventional algorithms under the same CSI assumptions and provides a flexible tradeoff between the amount of CSI and the achievable sum-rate.
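Many of the joint allocation schemes in this subsection use a Shannon-capacity-based sum-rate as the reward (or part of it). The following small sketch shows one such computation; the bandwidth and SINR values are assumptions for illustration.

```python
# Illustrative sum-rate reward used by many of the joint allocation schemes above:
# Shannon capacity summed over the agent's own users (bandwidth and SINRs are assumptions).
import numpy as np

def sum_rate_reward(sinr_linear, bandwidth_hz=180e3):
    """Sum-rate (bit/s) over the served users, given their linear SINRs."""
    rates = bandwidth_hz * np.log2(1.0 + np.asarray(sinr_linear, dtype=float))
    return rates.sum()

# Example: three users on one resource block with SINRs of 12 dB, 5 dB, and 0 dB.
sinr_db = np.array([12.0, 5.0, 0.0])
print(sum_rate_reward(10 ** (sinr_db / 10.0)))
```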
In the same context, another interesting work is presented in [171] in which the authors propose a model-driven multi-agent Double DQN-based framework for resource allocation in UDNs. In particular, they first develop a DNN-based optimization framework comprised of a series of ADMM iterative procedures that uses the CSI as the learned weights. Then, a channel-information-absent Q-learning resource allocation algorithm is presented to train the DNN-based optimization scheme without massive labeling data, where the EE, SE, and fairness are jointly optimized. The agents are each D2D transmitter, whose action space is discrete, corresponding to selecting a subcarrier and the corresponding transmission power. The state space is a hybrid (continuous and discrete) consisting of two parts; user association information and interference power. The reward function is continuous, comprised of two components; the network EE and the fairness of service quality, which is expressed by the variance of throughput between authorized users. Using experimental results, it is demonstrated that their proposed algorithm has a rapid convergence speed, well characterizes the extent of the optimization objective with partial CSI, and outperforms other existing resource allocation algorithms.

D2D-enabled cellular networks are also one of the key enabling technologies for next generation cellular systems. RRAM in such networks is one major concern, especially for mmWave-based cellular networks, as the D2D links require frequent link re-establishment to combat the high blockage rate. The authors in [172] propose a multi-agent Double DQN-based scheme to address the problem of joint subcarrier assignment and power allocation in D2D underlying 5G cellular networks. The agents in their model are the D2D pairs, whose action space is discrete, corresponding to determining the transmit power allocation on the available subcarriers. The state space is a hybrid (continuous and discrete), comprised of four components: 1) local information (including the previous transmit power, previous SE achieved by transmitters, channel gain, and SINR), 2) the interference that each agent causes at the BS side, 3) the interference received from the agent’s interfering neighbors and the SE achieved by the agent’s neighbors, and 4) the interference that each agent causes to its neighbors. The reward is a continuous function comprised of three elements: 1) the SE achieved by each agent, 2) the SE degradation of the agent’s interfered neighbors, and 3) the penalty due to the interference generated at the BS. Experimental results show that their proposed algorithm outperforms both the exhaustive and the random subcarrier and even power (RSEP) assignment methods in terms of the SE of D2D pairs.

Mission-Critical Communications (MCC) is an emerging service in the next generation wireless networks. It is envisioned to enable First Responders, such as firefighters and emergency medical personnel, to replace conventional radio with advanced communication capabilities available to next generation smartphones and IoT devices. Most recently, a pioneer work is conducted by Wang et al. [173] in which the authors propose a multi-agent DQN-based DRL scheme to address the problem of spectrum allocation and power control for MCC in 5G networks. In MCC, multiple D2D users reuse non-orthogonal wireless resources of cellular users without the BS in order to enhance the network’s reliability. The paper aims to help the D2D users autonomously select the channel and allocate power to maximize system capacity and SE while minimizing interference to cellular users. The agents are the D2D transmitters whose action space is discrete, corresponding to channel and power level selection. The state space is discrete, defined in a three-dimensional matrix, which includes information on the channel state of users, the state of the power level, and the number of the D2D pairs. The reward function is discrete, defined in terms of the total system capacity and constraints. Simulation results show that their proposed learning approach significantly improves spectrum allocation and power control compared to traditional methods.

RRAM in OFDM-based systems is also one of the main challenging issues. In this context, the authors in [174] propose a multi-agent DQN-based model to address the problem of joint user association and power control in OFDMA-based wireless HetNets. The agents are the UEs, whose action space is discrete, corresponding to jointly associating with the BS and determining the transmit power. The state space is discrete, which is defined by the situation of all UEs’ association with the BS and power control. The reward function is continuous, which is defined in terms of the sum-EE of all UEs. Using simulation results, it is shown that their proposed method outperforms the Q-learning method in terms of convergence and EE.

Another interesting work is reported in [175] in which the authors propose a single-agent DQN-based DRL model to address the problem of joint optimization of user association, resource allocation, and power allocation in HetNets. The agent is the BS, whose action is discrete, corresponding to power allocation to users. The state space is discrete, defined by the channel gain matrix and the set of user associations. The reward function is continuous, defined by the
utility function of users’ achieved data rate. Using simulation results, the authors show that their proposed algorithm outperforms some of the existing methods in terms of SE and convergence speed.

2) IN IOT AND OTHER EMERGING WIRELESS NETWORKS
In the following paragraphs, we review works that employ DRL algorithms to address the joint RRAM issue in IoT and emerging wireless networks depicted in Fig. 2.

For the same system settings in [106], [107], the authors in [67] extended their work and propose a multi-agent DDPG-based DRL framework to address the problem of joint spectrum and power allocation in wireless networks. Two DRL-based algorithms are proposed, which are executed and trained simultaneously in two layers in order to jointly optimize the discrete subband selection and continuous power allocation. The agent in their approach is each transmitter. In the top layer, the action space of all agents is discrete, representing the discrete subband selection, while the bottom layer has a continuous action space corresponding to the transmit power allocation. The state space is a hybrid (continuous and discrete), containing information on achieved SE, transmit power, sub-band selection, rank, and downlink channel gain. The reward is shared by both layers, which is a continuous function defined in terms of the externality of agents to interference and the spectral efficiency. Using experimental results, the authors show that their proposed framework outperforms the conventional fractional programming algorithm.
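The layered treatment of hybrid action spaces, where a value-based layer picks a discrete sub-band and a policy-based layer outputs a continuous power level, can be sketched structurally as below. The networks, their sizes, and the way the two layers interact here are simplified assumptions and not the training scheme of [67].

```python
# Structural sketch of a two-layer hybrid action, in the spirit of layered designs that
# pair a discrete sub-band choice with a continuous power level. Networks, sizes, and the
# interaction between the layers are simplified assumptions, not the scheme of [67].
import torch
import torch.nn as nn

class TopLayerQ(nn.Module):            # value-based layer: picks the sub-band
    def __init__(self, state_dim, n_subbands):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_subbands))
    def forward(self, s):
        return self.q(s)

class BottomLayerActor(nn.Module):     # policy-based layer: picks the transmit power
    def __init__(self, state_dim, n_subbands, p_max=1.0):
        super().__init__()
        self.p_max = p_max
        self.actor = nn.Sequential(nn.Linear(state_dim + n_subbands, 64), nn.ReLU(),
                                   nn.Linear(64, 1), nn.Sigmoid())
    def forward(self, s, subband_onehot):
        return self.p_max * self.actor(torch.cat([s, subband_onehot], dim=-1))

state = torch.randn(10)
top, bottom = TopLayerQ(10, 4), BottomLayerActor(10, 4)
subband = torch.argmax(top(state))                     # discrete sub-band selection
onehot = torch.zeros(4)
onehot[subband] = 1.0
power = bottom(state, onehot)                          # continuous power in [0, p_max]
print(int(subband), float(power))
```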
Based on their initial work in [176], the authors in [177] extended their work and propose a distributed multi-agent DQN-based DRL scheme to address the problem of joint channel selection and power control in D2D networks. The agents in their model are the D2D pairs, whose action space is discrete, corresponding to selecting a channel and a transmit power. The state space of each agent is a hybrid (continuous and discrete) which contains three sets of information; local information, non-local information from the agent’s receiver-neighbor set, and non-local information from the agent’s transmitter-neighbor set. The reward function of each agent is continuous, which is decomposed into the following elements; its own received signal power, its own total received SINR, its interference caused to transmitter-neighbors, the received signal power, and the total received SINR of transmitter-neighbors. Using simulation results, it is shown that the performance of their scheme closely approaches that of the FP-based algorithm even without knowing the instantaneous global CSI.

In [178], the authors extended their previous works in [179], [180] and present a distributed multi-agent DQN-based model to address the problem of joint sub-band selection and power level control in V2V communication networks. Their proposed model is applicable to both unicast and broadcast scenarios. The agents are the V2V links or vehicles whose action space is discrete, corresponding to the selection of the frequency band and transmission power level that generate small interference to all V2V and V2I links while ensuring enough resources to meet latency constraints. The state space is continuous, containing the following information; the CSI of the V2I link, the received interference signal strength in the previous time slot, the channel indices selected by neighbors in the previous time slot, the remaining load for transmission, and the time left before violating the latency constraint. The reward function is continuous, consisting of three components; the capacity of V2I links, the capacity of V2V links, and the latency constraint. Using experimental results, it is shown that agents learn to satisfy the latency constraints on V2V links while minimizing the interference to V2I communications.

Another pioneering work is reported in [181] in which the authors propose a single-agent Double DQN-based DRL model to address the problem of joint channel selection and power allocation with network slicing in CRNs. Their study aims to provide high SE and QoS for cognitive users. The agent is the overall CRN, whose action space is discrete, corresponding to the channel selection and power allocation of SUs. The state space is continuous, defined by the SINR of the PU. The reward function is continuous, which is a function of the system SE, user QoS, interference temperature, and the interference temperature threshold. Experimental results show that their proposed algorithm improves the SE and QoS and provides faster convergence and more stable performance than the Q-learning and DQN algorithms.

NOMA-based systems are characterized by their ability to provide enhanced QoS in cellular networks. However, allocating radio resources in such systems is quite challenging. In this context, the problem of joint subcarrier assignment and power allocation in uplink multi-user 5G-based NOMA systems is addressed in [182]. A multi-agent two-step DRL algorithm is proposed; the first step employs the DQN algorithm to output the optimum subcarrier assignment policy, while the second step employs the DDPG algorithm to dynamically allocate the transmit power for the network’s users. The agent is a controller located at the BS, whose action space is a hybrid (discrete and continuous), corresponding to the subcarrier assignment decisions and power allocation decisions. The state space is continuous, which is defined by the users’ channel gains at each subcarrier. The reward function is defined as the sum EE of the NOMA system. Experimental results show that their proposed algorithm provides better EE than the fixed and DQN-based power allocation schemes.

Unlike the work in [182], a pioneer work is reported in [183] in which the authors propose a multi-agent DDPG-based model to address the problem of joint power and spectrum allocation in NOMA-based V2X networks. In particular, the authors are looking to maximize the sum-rate of V2I communications. The agents are the V2V communication links. The action space is discrete, defined by a set of actions performed by V2I and V2V communication links. The set includes the transmit power of both V2I and V2V
links as well as the spectrum multiplexing factor of both V2V and V2I links. The state space is continuous, defined by five parts; the local channel gain information of each V2V link, interference channels from other V2V communication links, the interference channel from each link’s own transmitter to the BS, the interference channel from all V2I transmitters, and the state of the queue length in the buffer of each V2V transmitter. The reward function is continuous, defined by the achieved sum-rate of V2I communication links and the delivery probability of V2V communication links. Compared with both the DQN algorithm and a random resource allocation scheme, simulation results show that their proposed algorithm outperforms both of them in terms of maximizing the sum-rate of V2I communication links while meeting the latency and reliability requirements of V2V communications.

On the other hand, another interesting work is conducted by Munaye et al. [168] in which they propose a multi-agent DQN-based DRL model to address the problem of joint radio resources of bandwidth, throughput, and power in UAV-assisted IoT networks. The agents are the UAVs, whose action space is discrete, corresponding to jointly selecting the channel allocation of bandwidth, throughput, and power. The state space is discrete, comprising three components; the air-to-ground channel used by users, the rate of power consumption, and the interference. The reward is a discrete function, defined in terms of throughput, power allocation, bandwidth, and SINR levels. Simulation results show that their proposed algorithm outperforms other algorithms in terms of accuracy, convergence speed, and error.

Reconfigurable intelligent surface (RIS) technology has emerged recently as one of the main technologies for future wireless networks [187]. RISs employ many passive reflecting elements with controllable phase shifts and negligible power consumption, which provide a favorable wireless propagation environment for transmitted signals. In particular, RISs can be used to overcome blockage by providing virtual LoS links between transmitters and receivers, interference cancellation, and physical layer security [187], [188]. However, RISs encounter massive challenges related to environment uncertainty and real-time channel estimation issues [187], [188]. Hence, DRL approaches have attracted considerable research lately as efficient tools to assist the RIS technology. In this context, the authors in [184] provide an interesting work based on the multi-agent dueling DQN model to address the problem of power minimization in UAV-RIS-based multi-cell HetNets. In particular, they proposed to solve their problem in two stages. The first stage employs dueling DQN to solve the problem of UAVs’ trajectories/velocities, RISs’ phase control, and subcarrier allocations for the microwave band. The second stage employs alternating methods to solve active beamforming and subcarrier allocation for mmWave. The agents are the UAV-RISs, whose state space is continuous, defined by the trajectory of the UAV-RISs and all channel gains, i.e., from the BS to the RISs, the RISs to users, and small-cell BSs to users. The action space is discrete, defined by the possible directions and speeds of the UAVs, the RISs’ phase shifts, and the association indicator. The reward function is continuous, defined in terms of power consumption. Simulation results show that their proposed algorithm reduces the transmit power consumption by 6 dBm compared to other baseline methods.

3) IN MULTI-RAT NETWORKS
In the following paragraphs, we review works that employ DRL algorithms to address the joint RRAM problem in multi-RAT HetNets. This includes the coexistence of various variants of the wireless networks as illustrated in Fig. 2.

Integrating RF and VLC RATs is a promising solution to enhance networks’ QoS. Towards this, recently in [185], the authors present a multi-agent DQN-based algorithm to address the problem of joint optimization of bandwidth, power, and user association in hybrid RF/VLC systems. The APs are the agents whose action is discrete, representing the bandwidth, association function, and power level. The state space is discrete, which is a function of the problem constraints such as system bandwidth, association function, and power levels. The reward is discrete, which is a function of the rates delivered by the APs. Experimental results show that their algorithms improve the achievable sum-rate and the number of iterations for convergence by 10% and 54% compared to that obtained using conventional optimization approaches.

Another interesting work is proposed recently by Alwarafy et al. [41]. The authors propose a hierarchical multi-agent DQN and DDPG-based algorithm to address the problem of sum-rate maximization in multi-RAT multi-connectivity wireless HetNets. The authors addressed the problem of multi-RAT assignment and continuous power allocation that maximize the network sum rate. In their model, single and multi-agents are proposed. The edge server acts as a single agent employed by DQN, while RAT APs behave as multi-agents employed by DDPG. For the single-agent DQN model, the action space is discrete, corresponding to the RATs-EDs assigning process. The state space of the DQN is continuous, comprised of the link gains and the required data rates of users. The reward function of the DQN agent is continuous, defined by the difference between the achieved rate and the required rate by users. For the multi-agent DDPG models, the action space is continuous, representing the power allocation of each RAT AP agent. The state space is a hybrid (continuous and discrete) consisting of four elements: the RATs-EDs assignment process performed by the DQN agent, the minimum data rate of users, the gains of the links, and the achieved data rate. Experimental results show that their algorithm’s performance is approximately 98.1% and 95.6% compared to the conventional CVXPY solver that assumes full knowledge of the wireless environment.

Hybrid access networks are a special architecture for broadband access networks where different types of access networks are integrated to improve bandwidth. Huang et al. [186] propose a single-agent DQN model to
address the problem of delay minimization via joint spectrum and power resource allocation in mmWave mobile hybrid access networks. The agent is located in the roadside BS, whose action space is discrete, corresponding to allocating spectrum and power resources for data. The state space is discrete, consisting of information about the current power and spectrum of the resource pool, the required spectrum and power, and the number of spectrum and power levels. The reward signal is a continuous function defined in terms of queueing delay and the resource length required for each data. Using simulation results, it is shown that their proposed model guarantees the URLLC delay constraint when the load does not exceed 130%, which outperforms other conventional methods such as random and greedy algorithms.
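When spectrum and power are allocated jointly with a purely value-based agent, the joint discrete action is commonly flattened into a single index. The snippet below illustrates this encoding; the numbers of channels and power levels are assumptions.

```python
# Illustrative encoding of a joint discrete action (spectrum resource, power level) as a
# single DQN action index, a common trick in the joint-allocation works surveyed here.
# The numbers of channels and power levels below are assumptions.
N_CHANNELS, N_POWER_LEVELS = 6, 4      # action space size = 6 x 4 = 24

def encode(channel: int, power_level: int) -> int:
    return channel * N_POWER_LEVELS + power_level

def decode(action_index: int) -> tuple[int, int]:
    return divmod(action_index, N_POWER_LEVELS)

a = encode(channel=3, power_level=2)
print(a, decode(a))   # 14 (3, 2)
```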
Healthcare systems are one of the main services in next note that the main motivations of using DRL techniques
generation wireless systems. Unlike the work in [41], a pio- in addressing the joint radio resources problems presented
neer work is presented in [167] to address the problem of in this subsection are the complexity of these formulated
network selection with the aim of optimizing medical data problems, the limited information about system dynamics
delivery over heterogeneous health systems. In particular, and CSI, and the difficulty in applying traditional RRAM
an optimization problem is formulated in which the network methods to solve such problems.
selection problem is integrated with adaptive compression We also observe that multi-agent DRL deployment based
to minimize network energy consumption and latency while on value-based algorithms receives more attention than
meeting applications’ QoS requirements. A single-agent policy-based algorithms. The reason is that users tend to
DDPG-based DRL model is proposed to solve it. The agent is have more control over their channel selection, data control,
a centralized entity that can access all radio access networks and transmission mode selection, and hence we find a more
(RANs) information and Patient Edge Node (PEN) data popular implementation of DRL agents at local IoT devices.
running in the core network. The action space is discrete, In addition, the integration of value-based and policy-based
corresponding to the joint selection of data split over the algorithms for joint RRAM is also an interesting concept
existing RANs and the adequate compression ratio. The state that requires more investigation, especially for multi-agent
space is a hybrid (continuous and discrete) defined by two deployment scenarios. In particular, depending on the type
elements: the fraction of time that the PENs should use over of radio resources under investigation, resources with contin-
a particular RAN and the PEN investigated in the current uous nature such as power typically implement policy-based
timestamp. The reward is a continuous function, which is algorithms, while resources with discreet nature such as
defined in terms of: the fraction of data of PEN that will be channel allocation and user association typically implement
transferred through RAN, the energy consumed by PEN to value-based algorithms. Simultaneous dealing with continu-
transfer bits over RAN, distortion, expected latency of RANs, ous and discrete types of radio resources may integrate both
the monetary cost of PENs to use RANs, the resource share, the policy- and value-based DRL algorithms to learn a global
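As a rough illustration of the integration discussed above, the sketch below combines a value-based head (discrete channel selection) with a deterministic policy head (continuous transmit power) on top of a shared state encoder. This is only a hedged sketch of the general idea, not the architecture of [41] or [182]; the layer sizes, the number of channels, and the power cap are assumptions.

```python
# Hypothetical hybrid action head: a value-based (Q) branch picks a discrete
# channel, while a deterministic policy branch outputs a continuous transmit
# power. All dimensions and the power cap are illustrative assumptions.
import torch
import torch.nn as nn

class HybridPolicy(nn.Module):
    def __init__(self, state_dim: int, n_channels: int, p_max: float):
        super().__init__()
        self.p_max = p_max
        self.shared = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.q_head = nn.Linear(128, n_channels)   # Q-value per discrete channel
        self.power_head = nn.Linear(128, 1)        # continuous power in (0, p_max)

    def forward(self, state: torch.Tensor):
        h = self.shared(state)
        channel = torch.argmax(self.q_head(h), dim=-1)           # discrete decision
        power = self.p_max * torch.sigmoid(self.power_head(h))   # continuous decision
        return channel, power.squeeze(-1)

# Example: one joint (channel, power) decision for two device states.
policy = HybridPolicy(state_dim=8, n_channels=4, p_max=0.2)   # 0.2 W is an assumed cap
with torch.no_grad():
    channels, powers = policy(torch.rand(2, 8))
print(channels.tolist(), powers.tolist())
```

In a full agent, the Q-head would be trained with a value-based loss and the power head with a policy-gradient (e.g., DDPG-style) loss, which is one way to avoid quantizing the continuous resource.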



TABLE 8. Advantages and disadvantages/shortcomings of DRL methods when applied for RRAM problems in next generation HetNets.

We also observe that DRL methods for cellular networks as well as IoT wireless networks gain more attention than multi-RAT networks, particularly for D2D and V2V communications. In addition, there is a lack of research on applications of DRL for emerging IoT applications, such as healthcare systems as investigated recently in [167], which is also a promising field that requires more attention. Furthermore, we observe a lack of research on DRL applications for joint RRAM in satellite networks, which also deserves more in-depth investigation.

V. OPEN CHALLENGES AND FUTURE RESEARCH DIRECTIONS
Throughout the previous section, we have demonstrated the superiority of DRL algorithms over traditional methods in addressing complex RRAM problems for modern wireless networks. However, there are still several challenges and open issues that are either not explored yet or need further exploration. This section highlights these open challenges and provides insights on future research directions in the context of DRL-based RRAM for next generation wireless networks. Table 8 summarizes the advantages and disadvantages/shortcomings of DRL methods when applied for RRAM in next generation wireless networks.

A. OPEN CHALLENGES
1) CENTRALIZED VS. DECENTRALIZED RRAM TECHNIQUES
Future wireless networks are characterized mainly by their massive heterogeneity in wireless RANs, the number of user devices, and types of applications. Centralized DRL-based RRAM schemes are efficient in guaranteeing enhanced network QoS and fairness in allocating radio resources. They also ensure that RRAM optimization problems will not get stuck in local minima due to their holistic view of the system. However, formulating and solving RRAM optimization problems become tough tasks in such large-scale HetNets. Hence, centralized DRL-based RRAM solutions are typically unscalable. This motivates distributed multi-agent DRL-based algorithms that enable edge devices to make resource allocation decisions locally. Stochastic Game-based DRL algorithms are one promising research direction in this context [14]. However, the rapid increase in the number of edge devices (players) makes information exchange in such networks unmanageable. Also, the partial observability of agents might lead to suboptimal RRAM policies. Therefore, it is an open challenge to develop DRL-assisted algorithms that optimally balance between centralization and distribution in RRAM. A possible solution is to develop hybrid ecosystems that implement some DRL models at the network's edge, e.g., at the ESs or user devices, instead of deploying all DRL algorithms on a centralized network.

2) DIMENSIONALITY OF STATE SPACE IN HETNETS
In modern wireless HetNets, service requirements and network conditions are rapidly changing. Hence, single-agent DRL algorithms must be designed to capture and respond to these fast network changes. To this end, it is required to reduce the state space and action space during the learning process, which inevitably degrades the quality of the learned policies. The existence of multiple agents and their interactions will also complicate the agents' environment and prohibitively increase the dimensionality of the state space, which will slow down the learning algorithms. A possible solution to this issue is to split the large state spaces into smaller ones through state-space decomposition. The idea is to use smaller DNNs to learn the dynamics of the decomposed sub-state spaces, while another DNN considers the relatively less frequent interactions between the different sub-state spaces [189]. This approach enables us to distribute computation and accelerate agents' training.
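A minimal sketch of the state-space decomposition idea is given below, assuming the state can be split into a channel-related sub-state and a traffic-related sub-state: two small DNNs encode the sub-states separately, and a third network models their coupling. The split and all dimensions are illustrative assumptions and do not reproduce the exact design of [189].

```python
# Minimal sketch (assumed dimensions) of state-space decomposition: two small
# DNNs learn the dynamics of two sub-state spaces separately, and a third DNN
# models their less frequent interactions.
import torch
import torch.nn as nn

class SubStateEncoder(nn.Module):
    def __init__(self, sub_dim: int, embed_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sub_dim, 32), nn.ReLU(),
                                 nn.Linear(32, embed_dim), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class DecomposedQNetwork(nn.Module):
    def __init__(self, channel_dim: int, traffic_dim: int, n_actions: int):
        super().__init__()
        self.channel_enc = SubStateEncoder(channel_dim)   # sub-state 1: channel gains
        self.traffic_enc = SubStateEncoder(traffic_dim)   # sub-state 2: queues / traffic
        self.coordinator = nn.Sequential(                 # models cross sub-state coupling
            nn.Linear(2 * 16, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, channel_state, traffic_state):
        z = torch.cat([self.channel_enc(channel_state),
                       self.traffic_enc(traffic_state)], dim=-1)
        return self.coordinator(z)

q = DecomposedQNetwork(channel_dim=12, traffic_dim=6, n_actions=8)
print(q(torch.rand(1, 12), torch.rand(1, 6)).shape)   # -> torch.Size([1, 8])
```

Because each encoder sees only its own sub-state, the encoders can in principle be trained in parallel, which is the distribution/acceleration benefit mentioned above.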


3) RELIABILITY OF TRAINING DATASET
Although the DRL-based solutions for RRAM we reviewed previously demonstrate efficient performance results, almost all the models are developed based on simulated training and testing datasets. The simulated dataset is typically produced based on some stochastic models, which provide simplified versions of practical systems and greatly ignore hidden system patterns. This methodology greatly weakens the reliability of the developed policies, as their performance on practical networks would be questionable. Hence, it is imperative to develop more effective and reliable approaches that generate precise simulation datasets and capture practical system settings as much as possible. This ensures high reliability and confidence during the training and testing modes of the developed RRAM policies. Developing such approaches is still a challenge due to the large-scale nature and rapid variations of future wireless environments.

On the other hand, DRL models are sensitive to any change in the input data. Any minor changes in the input data will cause considerable change in the models' output. This mainly deteriorates the reliability of DRL algorithms, especially when deployed for modern IoT applications that require ultra-reliability, such as remote surgery or any other mission-critical IoT applications. Hence, ensuring high reliability for DRL models is a challenging issue. A possible solution to such issues is to exploit real-field measurement data collected from various cellular and IoT wireless scenarios to train and test the DRL-based RRAM models. This will increase the reliability of the learned policies and also enable DRL model generalization.

4) ENGINEERING OF DRL MODELS FOR RRAM
Since DRL employs DNNs as function approximators for the reward functions, DRL models will inherit some of the challenges that exist in the DNN world. For example, it is still quite challenging to optimize the DNN hyperparameters, such as the type of DNN used (e.g., convolutional, fully connected, or RNN), the number of hidden layers, the number of neurons per hidden layer, the learning rate, the batch size, etc. DRL models suffer from high sensitivity to these hyperparameters. This challenge is even exacerbated in multi-agent settings, as all agents share the same radio resources and must converge simultaneously to some policies. A possible solution is to implement some optimization techniques from the deep learning field, such as grid and random search methods, to find the optimal configuration of these hyperparameters [190].

On the other hand, the engineering of DRL parameters such as the state space and reward function is challenging for RRAM. The state space must be engineered to capture useful and representative information about the wireless environment, such as the available radio resources, users' QoS requirements, channel quality, etc. Such information is crucial and heavily defines the learning and convergence behaviors of DRL agents. Again, the presence of multiple agents will make it even more challenging, as discussed in [14]. Also, since DRL models are reward-driven learning algorithms, the design of the reward function is essential to guide the agent during the policy-learning stage. Formulating reward functions that capture the network objective and account for the available radio resources is also challenging.
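The random-search suggestion above can be sketched as follows. The hyperparameter ranges are assumptions, and the evaluate() routine is a placeholder that would normally run a full DRL training-and-validation cycle and return a network-level score (e.g., average sum-rate or negative delay).

```python
# Illustrative random search over DRL hyperparameters. The ranges are
# assumptions, and `evaluate` stands in for a full train-and-validate run of
# the DRL agent (here it is a dummy scorer so the sketch runs on its own).
import random

SEARCH_SPACE = {
    "hidden_layers": [1, 2, 3],
    "neurons": [32, 64, 128, 256],
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [32, 64, 128],
}

def sample_config() -> dict:
    return {name: random.choice(values) for name, values in SEARCH_SPACE.items()}

def evaluate(config: dict) -> float:
    """Placeholder: should train the DRL agent with `config` and return a
    validation score such as the achieved average sum-rate."""
    return -config["learning_rate"] * config["batch_size"] + random.random()

def random_search(n_trials: int = 20):
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config()
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

if __name__ == "__main__":
    cfg, score = random_search()
    print("best configuration:", cfg, "score:", round(score, 3))
```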



5) SYSTEM DEPENDENCY OF DRL MODELS
DRL models are system-dependent, as they are trained and tested for specific wireless environments and networks. Therefore, they provide effective results when employed to solve the specific types of problems for which they are trained. However, if there is a significant change in the characteristics of the wireless environment or in the nature of the RRAM problem, such as the network topology and available radio resources, the DRL model must be retrained, as the old model no longer reflects the new training experiences. In modern wireless HetNets, such cases are frequently encountered, especially with real-time applications or in highly dynamic environments. In such a case, it becomes quite challenging for DRL agents to update and retrain their DNNs with rapidly changing input information from the HetNet environment [1]. A possible solution is to design DRL-based RRAM models in a manner that supports generalization via transfer learning and meta-learning. Multi-task DRL approaches [191], [192] are efficient frameworks to support these aspects.

On the other hand, if domain knowledge is available or easy to obtain, it becomes hard for DRL algorithms to beat well-designed algorithms based on full domain knowledge. This fact has been observed and reported in the surveyed papers in Section IV.

6) CONTINUOUS TRAINING OF DRL MODELS
DRL algorithms require big datasets to train their models, which is typically associated with a high cost [15]. The network system pays this cost during the information collection process due to, e.g., the high delays, extra overhead, and energy consumption. The emergence of a large number of real-time applications and services has even increased this training cost. In this context, DRL models need to be continuously retrained with fresh data collected from the wireless environment to stay up-to-date and ensure accurate and long-term control decisions. It is not practical to conduct manual retraining of the models in such large-scale HetNet settings. Also, manually monitoring and updating DRL models in multi-agent scenarios becomes an expensive task. Therefore, continuous retraining can solve this issue, in which a dedicated autonomous system is employed to continuously assess and retrain old DRL models.

7) CONTEXT OF RRAM
The implementation of DRL algorithms basically depends on the use-cases. The context and deployment scenarios in which RRAM is required must be considered during the development of DRL models. For example, RRAM in health-sector IoT applications is different from its environmental IoT counterparts. Due to the high sensitivity of data in health-sector applications, extra data pre-processing must be performed, including data compression and encryption [167]. This will directly affect the amount of radio resources to be allocated for such applications. Hence, DRL models must be aware of the context aspect of applications, which is considered another challenge. A possible solution is to develop context-aware DRL models that are able to learn context variables in an unsupervised manner and adapt the policy to the existing context, e.g., as in [193].

8) COMPETING OBJECTIVE DESIGN OF DRL MODELS
Next generation wireless networks are expected to provide enhanced system QoS in terms of high data rate, high EE/SE, and reduced latency in order to support the emerging vital IoT applications. Depending on the deployment scenario, formulating multi-objective RRAM optimization problems usually ends with many competing objectives and/or constraints. For instance, in cellular UDNs, high utilization of resources such as power or channels may cause severe interference. Also, for IoT applications such as vehicular communications, we need to ensure ultra-reliable and low-latency communication links, which are usually competing objectives. Therefore, developing multi-objective DRL-based RRAM models that accommodate these competing requirements is still a persisting challenge. For example, frameworks that facilitate the development of multi-agent algorithms similar to those presented in [194] can be adopted for RRAM problems.
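One common way to encode such competing objectives is to scalarize them into a single reward with weights and constraint penalties, as in the hedged sketch below. The weights, the reliability target, and the penalty value are illustrative assumptions, not values taken from the surveyed works.

```python
# Illustrative scalarized reward for competing objectives (assumed weights and
# thresholds): throughput and energy efficiency are rewarded, latency is
# penalized, and violating a URLLC reliability target incurs an extra penalty.
from dataclasses import dataclass

@dataclass
class KPIs:
    rate_mbps: float        # achieved data rate
    energy_j: float         # energy consumed in the slot
    latency_ms: float       # experienced latency
    reliability: float      # packet delivery ratio in [0, 1]

def multi_objective_reward(k: KPIs,
                           w_rate: float = 1.0,
                           w_ee: float = 0.5,
                           w_delay: float = 0.2,
                           reliability_target: float = 0.99999,
                           violation_penalty: float = 10.0) -> float:
    energy_efficiency = k.rate_mbps / max(k.energy_j, 1e-9)   # bits-per-Joule proxy
    reward = (w_rate * k.rate_mbps
              + w_ee * energy_efficiency
              - w_delay * k.latency_ms)
    if k.reliability < reliability_target:                    # hard URLLC constraint
        reward -= violation_penalty
    return reward

print(multi_objective_reward(KPIs(rate_mbps=50.0, energy_j=0.4,
                                  latency_ms=3.0, reliability=0.99999)))
```

The weights themselves become design parameters, which is exactly why this trade-off remains an open challenge rather than a solved recipe.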


B. FUTURE RESEARCH DIRECTIONS
1) DRL WITH EXPLAINABLE AI (XDRL) FOR RRAM
Explainable DRL (XDRL) has recently emerged as an efficient technology to improve the performance of DRL models. It is mainly envisioned to unlock the "black-box" nature of conventional ML approaches and provide interpretability and explainability for DRL models [195]. In particular, XDRL explains the reasons behind certain predictions made by DRL models (or ML models in general) by fully understanding the precise working principle of these models, hence ensuring trust, reliability, and transparency in the DRL algorithms' policy development and decision-making processes. The research on XDRL technologies in wireless communication is still at its initial stages, and there are still some key issues for future research in the context of RRAM for next generation wireless networks [196]. For example, DRL models can easily get stuck in locally optimal solutions when utilized to solve complex RRAM problems. This issue can be significantly avoided with the help of XDRL. Fortunately, the heterogeneity of information in modern wireless HetNets will significantly help to achieve the interpretation of DRL algorithms. In this context, developing RRAM schemes for wireless HetNets through entity recognition, Shapley value-based methods, entity-relationship extraction, and representation learning (e.g., Hindsight Experience Replay, hierarchical DRL, and self-attention) makes the DRL models' interpretation more reliable, accurate, and intuitive, which is a promising research direction.

2) INTEGRATING DRL AND BLOCKCHAIN TECHNIQUES
Blockchain-based RRAM has emerged recently as one of the promising enabling technologies for future wireless HetNets [3]. It has gained considerable momentum lately due to its ability to provide intelligent, secure, and highly efficient distributed resource sharing and management. The integration of DRL with Blockchain is also an interesting research direction, as in [197]–[199]. For example, DRL algorithms can be distributively deployed within participants or within the centralized spectrum-access systems to facilitate spectrum auctions and transactions [199]. Also, DRL can be utilized to ensure the efficiency of the consensus process, enhance energy-efficient resource allocation, and reduce computation overhead in Blockchain-enabled wireless networks [200]. In addition, many of the auction winner-determination problems in future wireless HetNets are expected to be extremely complex and intractable due to the massive increase in the number of participants, e.g., bidders and sellers. Hence, DRL algorithms are efficient tools that can be utilized to solve such types of problems.

3) FEDERATED DRL (FDRL)-BASED RRAM
The federated learning (FL) framework is envisioned mainly to preserve data privacy in ML algorithms [201]–[203]. In FL, ML algorithms are locally distributed at the wireless network edge, and the data is processed locally and not shared globally. The local ML models are then utilized for training a centralized global model. In this context, the federated DRL (FDRL) scheme can be leveraged when many user devices require making autonomous decisions locally. In such a case, DRL agents do not exchange their local observations, and not all agents necessarily receive reward signals [204].

Developing fine-grained policies in DRL becomes challenging when the state space is small and the training dataset is very limited [205]. In FDRL, the direct exchange of data between agents is not possible, as this would breach the privacy promise of the FL scheme. Instead, local DRL models can be developed and trained for agents with the help of other agents while preserving users' data privacy, as in [206]. Hence, developing algorithms and schemes that guarantee data and model privacy during both information sharing and model updating is an interesting research direction.

The FDRL framework can also be exploited in the RRAM of modern HetNets. For example, it can be deployed for solving complex wireless network optimization problems, such as power control in cellular UDNs. In this context, FDRL can ensure a global solution for optimization problems without sharing information between BSs; each BS solves its optimization problem locally and shares the results with neighboring BSs. Also, FDRL can be adapted in distributed optimization settings, such as user association and channel access, to ensure optimal global solutions.

4) DRL-BASED LOAD BALANCING FOR SELF-SUSTAINING NETWORKS
Load balancing in modern wireless UDNs is another promising research direction. The objective is to balance the wireless networks by moving some users from the heavily congested BSs to uncongested ones, thus improving BS utilization and providing enhanced QoS provisioning. Although the load balancing field has been heavily investigated in the literature using conventional resource management approaches, as in [207]–[209], there is still a research gap in applying DRL to such a field. In this context, DRL can be adopted to realize the self-sustaining (or self-organization) vision of next generation wireless networks [3]. Hence, developing single/multi-agent DRL models to achieve intelligent load balancing in future HetNets is a possible research direction.

5) MADRL ALGORITHMS IN SUPPORT OF MASSIVE HETEROGENEITY AND MOBILITY
Developing multi-agent DRL (MADRL) models that can cope with the massive heterogeneity of future HetNets, in terms of RANs, user devices, and applications, is another possible research direction. Such models must be agile to network dynamics, including varying users' mobility patterns and network resources availability.

6) DRL-BASED RRAM WITH GENERATIVE ADVERSARIAL NETWORKS (GANS)
Ensuring the reliability of DRL algorithms is one of the major challenges and objectives in DRL-based RRAM methods. In many real-life scenarios, we may need to deploy DRL models to allocate resources in vital systems requiring ultra-reliability, such as IoT healthcare applications [167]. In this context, there are proposals on Generative Adversarial Networks (GANs), which have emerged recently as an effective technique to enhance the reliability of DRL algorithms [210].

In practice, the shortage of realistic training datasets required to train DRL models and learn optimal policies is a challenging issue. To overcome this, GANs are utilized, which generate large amounts of realistic datasets synthetically by expanding the available limited amounts of real-time datasets. From a DRL perspective, GAN-generated synthetic data is more effective and reliable than traditional augmentation methods [79]. This is because DRL agents will be exposed to various extremely challenging and practical situations by merging the realistic and synthetic data, enabling DRL models to be trained on unpredicted and rare events. Another advantage of GANs over traditional data augmentation methods is that they eliminate dataset biases in the synthetic data, which greatly enhances the quality of the generated data and leads to more reliability in DRL models' training and learning processes.

In general, the research on GAN-based DRL methods for RRAM is still in its early stages, and we believe that it will gain further pace in the future. For example, developing experienced DRL-based algorithms for URLL communication using GANs, in which DRL models are pre-trained based on a mix of real and synthetic data, is a promising research direction, as in [211].

7) DRL FOR RRAM IN RIS-ASSISTED WIRELESS NETWORKS
Reconfigurable Intelligent Surfaces (RISs) have emerged recently as an innovative technology to enhance the QoS of future wireless networks [212], [213]. RISs can be deployed in cellular networks as passive reflecting elements to provide near line-of-sight communication links to users, hence enhancing communication reliability, increasing throughput, and reducing latency [214], [215]. Deploying RISs to assist cellular communication, however, requires judicious RRAM schemes to optimize network performance. This research field is still nascent, and there is much to do in terms of future research and investigation, especially in the context of DRL-based RRAM techniques. Towards this, it is required to develop end-to-end DRL-based algorithms that jointly optimize the configuration of the RIS system, i.e., the elements' phases and amplitudes, and the radio resources of the BSs. For instance, designing DRL models that intelligently and optimally allocate the downlink BSs' transmit power and/or BSs' beamforming configuration on one side and the amplitudes and phase shifts of the RIS elements on the other side is a promising research direction, as in [43]. We also believe that the currently ongoing research in RIS-assisted wireless networks, e.g., [43], [216]–[218], will serve as cornerstones.
tion using GANs in which DRL models are pre-trained based research directions provided in this section.



TABLE 9. Summary of challenges and future research directions in the context of using DRL for RRAM in future wireless networks.

8) DRL FOR RRAM IN WIRELESS DIGITAL TWIN NETWORKS
Digital twin (DT) has recently emerged as a promising technology for future wireless networks [219]. A DT is a virtual representation of the components and dynamics of a given physical system, which is envisioned to bridge the connection gap between physical systems and digital spaces. The digital replicas of physical systems, such as user devices, BSs, and machines, are constructed at the server based on historical data and real-time running status. DT utilizes tools from ML, data analytics, and multiphysics simulation to study and analyze the dynamics of physical systems. Therefore, DT enables system monitoring, real-time interaction, and reliable communication between physical systems and the digital space in order to optimize the operation of physical systems [220]. With these promising features, DT is gaining considerable interest in enhancing the performance of wireless communication networks for applications such as computation offloading, content caching, and RRAM. For example, a promising research direction is to develop DRL algorithms that address various problems in wireless DT networks, such as the DT placement and migration problems [221], capturing the dynamics of UAV-based networks [222], and enhancing network security and user privacy in blockchain-based networks [223].

Table 9 summarizes the open challenges and future research directions provided in this section.

VI. CONCLUSION
This paper presents a comprehensive survey on the applications of DRL techniques in RRAM for next generation wireless HetNets. We thoroughly review the conventional approaches for RRAM, including their types, advantages, and limitations. We then illustrate how the emerging DRL approaches can overcome these shortcomings to enable DRL-based RRAM. After that, we illustrate how RRAM optimization problems can be formulated as an MDP before solving them using DRL techniques. Furthermore, we conduct an extensive overview of the most efficient DRL algorithms that are widely leveraged in addressing RRAM problems, including the value- and policy-based algorithms. The advantages, limitations, and use-cases of each algorithm are provided. We then conduct a comprehensive and in-depth literature review and classify the existing related works based on both the radio resources they are addressing and the type of wireless networks they are considering. To this end, the types of DRL models developed in these related works and their main elements are carefully identified. Finally, we outline important open challenges and provide insights into future research directions in the context of DRL-based RRAM.

REFERENCES
[1] F. Hussain, S. A. Hassan, R. Hussain, and E. Hossain, "Machine learning for resource management in cellular and IoT networks: Potentials, current solutions, and open challenges," IEEE Commun. Surveys Tuts., vol. 22, no. 2, pp. 1251–1275, 2nd Quart., 2020.
[2] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, "The roadmap to 6G: AI empowered wireless networks," IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, Aug. 2019.
[3] W. Saad, M. Bennis, and M. Chen, "A vision of 6G wireless systems: Applications, trends, technologies, and open research problems," IEEE Netw., vol. 34, no. 3, pp. 134–142, May/Jun. 2020.
[4] Z. Zhang et al., "6G wireless networks: Vision, requirements, architecture, and key technologies," IEEE Veh. Technol. Mag., vol. 14, no. 3, pp. 28–41, Sep. 2019.
[5] "6G Summit Connecting the Unconnected." [Online]. Available: https://6gsummit.org (accessed Feb. 18, 2022).
[6] "Cisco visual networking index: Global mobile data traffic forecast update, 2017–2022," Cisco, San Jose, CA, USA, White Paper, 2019.
[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[8] Y. L. Lee and D. Qin, "A survey on applications of deep reinforcement learning in resource management for 5G heterogeneous networks," in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA ASC), 2019, pp. 1856–1862.
[9] F. Obite, A. D. Usman, and E. Okafor, "An overview of deep reinforcement learning for spectrum sensing in cognitive radio networks," Digit. Signal Process., vol. 113, Jun. 2021, Art. no. 103014.
[10] S. Gupta, G. Singal, and D. Garg, "Deep reinforcement learning techniques in diversified domains: A survey," Arch. Comput. Methods Eng., vol. 28, pp. 4715–4754, Feb. 2021.
[11] Z. Du, Y. Deng, W. Guo, A. Nallanathan, and Q. Wu, "Green deep reinforcement learning for radio resource management: Architecture, algorithm compression, and challenges," IEEE Veh. Technol. Mag., vol. 16, no. 1, pp. 29–39, Mar. 2021.
[12] Y. Qian, J. Wu, R. Wang, F. Zhu, and W. Zhang, "Survey on reinforcement learning applications in communication networks," J. Commun. Inf. Netw., vol. 4, no. 2, pp. 30–39, Jun. 2019.
[13] Z. Xiong, Y. Zhang, D. Niyato, R. Deng, P. Wang, and L.-C. Wang, "Deep reinforcement learning for mobile 5G and beyond: Fundamentals, applications, and challenges," IEEE Veh. Technol. Mag., vol. 14, no. 2, pp. 44–52, Jun. 2019.
[14] A. Feriani and E. Hossain, "Single and multi-agent deep reinforcement learning for AI-enabled wireless networks: A tutorial," IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 1226–1252, 2nd Quart., 2021.
[15] N. C. Luong et al., "Applications of deep reinforcement learning in communications and networking: A survey," IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3133–3174, 4th Quart., 2019.


[16] I. Tomkos, D. Klonidis, E. Pikasis, and S. Theodoridis, “Toward the [37] Z. Zhang, D. Zhang, and R. C. Qiu, “Deep reinforcement learning
6G network era: Opportunities and challenges,” IT Prof., vol. 22, for power system applications: An overview,” CSEE J. Power Energy
no. 1, pp. 34–38, Jan./Feb. 2020. Syst., vol. 6, no. 1, pp. 213–225, Mar. 2019.
[17] P. Yang, Y. Xiao, M. Xiao, and S. Li, “6G wireless communica- [38] Y. Xu, G. Gui, H. Gacanin, and F. Adachi, “A survey on resource
tions: Vision and potential techniques,” IEEE Netw., vol. 33, no. 4, allocation for 5G heterogeneous networks: Current research, future
pp. 70–75, Jul./Aug. 2019. trends, and challenges,” IEEE Commun. Surveys Tuts., vol. 23, no. 2,
[18] K. David and H. Berndt, “6G vision and requirements: Is there any pp. 668–695, 2nd Quart., 2021.
need for beyond 5G?” IEEE Veh. Technol. Mag., vol. 13, no. 3, [39] T. S. Rappaport et al., “Wireless communications and applications
pp. 72–80, Sep. 2018. above 100 GHz: Opportunities and challenges for 6G and beyond,”
[19] S. Elmeadawy and R. M. Shubair, “6G wireless communications: IEEE Access, vol. 7, pp. 78729–78757, 2019.
Future technologies and research challenges,” in Proc. Int. Conf. [40] H. Tataria, M. Shafi, A. F. Molisch, M. Dohler, H. Sjöland, and
Electr. Comput. Technol. Appl. (ICECTA), 2019, pp. 1–5. F. Tufvesson, “6G wireless systems: Vision, requirements, chal-
[20] T. Huang, W. Yang, J. Wu, J. Ma, X. Zhang, and D. Zhang, “A lenges, insights, and opportunities,” Proc. IEEE, vol. 109, no. 7,
survey on green 6G network: Architecture and technologies,” IEEE pp. 1166–1199, Jul. 2021.
Access, vol. 7, pp. 175758–175768, 2019. [41] A. Alwarafy, B. S. Ciftler, M. Abdallah, and M. Hamdi, “DeepRAT:
[21] A. Alwarafy, K. A. Al-Thelaya, M. Abdallah, J. Schneider, and A DRL-based framework for multi-RAT assignment and power allo-
M. Hamdi, “A survey on security and privacy issues in edge- cation in hetnets,” in Proc. IEEE Int. Conf. Commun. Workshops
computing-assisted Internet of Things,” IEEE Internet Things J., (ICC Workshops), 2021, pp. 1–6.
vol. 8, no. 6, pp. 4004–4022, Mar. 2021. [42] J. Kong, Z.-Y. Wu, M. Ismail, E. Serpedin, and K. A. Qaraqe,
[22] A. I. Sulyman, A. Alwarafy, G. R. MacCartney, T. S. Rappaport, “Q-learning based two-timescale power allocation for multi-homing
and A. Alsanie, “Directional radio propagation path loss mod- hybrid RF/VLC networks,” IEEE Wireless Commun. Lett., vol. 9,
els for millimeter-wave wireless networks in the 28-, 60-, and no. 4, pp. 443–447, Apr. 2020.
73-GHz bands,” IEEE Trans. Wireless Commun., vol. 15, no. 10, [43] G. Lee, M. Jung, A. T. Z. Kasgari, W. Saad, and M. Bennis, “Deep
pp. 6939–6947, Oct. 2016. reinforcement learning for energy-efficient networking with recon-
[23] A. Alwarafy, A. Albaseer, B. S. Ciftler, M. Abdallah, and figurable intelligent surfaces,” in Proc. IEEE Int. Conf. Commun.
A. Al-Fuqaha, “AI-based radio resource allocation in support of (ICC), 2020, pp. 1–6.
the massive heterogeneity of 6G networks,” in Proc. IEEE 4th 5G [44] B. S. Ciftler, M. Abdallah, A. Alwarafy, and M. Hamdi, “DQN-based
World Forum (5GWF), Oct. 2021, pp. 464–469. multi-user power allocation for hybrid RF/VLC networks,” in Proc.
[24] L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep-learning-based wireless IEEE Int. Conf. Commun., 2021, pp. 1–6.
resource allocation with application to vehicular networks,” Proc. [45] A. Ahmad, S. Ahmad, M. H. Rehmani, and N. U. Hassan,
IEEE, vol. 108, no. 2, pp. 341–356, Feb. 2020. “A survey on radio resource allocation in cognitive radio sensor
networks,” IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 888–917,
[25] K. K. Nguyen, T. Q. Duong, N. A. Vien, N.-A. Le-Khac, and
2nd Quart., 2015.
M.-N. Nguyen, “Non-cooperative energy efficient power allocation
[46] M. El Tanab and W. Hamouda, “Resource allocation for underlay
game in D2D communication: A multi-agent deep reinforcement
cognitive radio networks: A survey,” IEEE Commun. Surveys Tuts.,
learning approach,” IEEE Access, vol. 7, pp. 100480–100490, 2019.
vol. 19, no. 2, pp. 1249–1276, 2nd Quart., 2017.
[26] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah,
[47] M. Naeem, A. Anpalagan, M. Jaseemuddin, and D. C. Lee, “Resource
“Artificial neural networks-based machine learning for wireless
allocation techniques in cooperative cognitive radio networks,” IEEE
networks: A tutorial,” IEEE Commun. Surveys Tuts., vol. 21, no. 4,
Commun. Surveys Tuts., vol. 16, no. 2, pp. 729–744, 2nd Quart.,
pp. 3039–3071, 4th Quart., 2019.
2014.
[27] A. Zappone, M. D. Renzo, and M. Debbah, “Wireless networks
[48] S. Manap, K. Dimyati, M. N. Hindia, M. S. A. Talip, and
design in the era of deep learning: Model-based, AI-based, or both?”
R. Tafazolli, “Survey of radio resource management in 5G het-
IEEE Trans. Commun., vol. 67, no. 10, pp. 7331–7376, Oct. 2019.
erogeneous networks,” IEEE Access, vol. 8, pp. 131202–131223,
[28] H. Khorasgani, H. Wang, and C. Gupta, “Challenges of apply- 2020.
ing deep reinforcement learning in dynamic dispatching,” 2020, [49] M. Peng, C. Wang, J. Li, H. Xiang, and V. Lau, “Recent advances in
arXiv:2011.05570. underlay heterogeneous networks: Interference control, resource allo-
[29] G. S. Rahman, T. Dang, and M. Ahmed, “Deep reinforcement cation, and self-organization,” IEEE Commun. Surveys Tuts., vol. 17,
learning based computation offloading and resource allocation for no. 2, pp. 700–729, 2nd Quart., 2015.
low-latency fog radio access networks,” Intell. Converged Netw., [50] Y. Teng, M. Liu, F. R. Yu, V. C. M. Leung, M. Song, and
vol. 1, no. 3, pp. 243–257, 2020. Y. Zhang, “Resource allocation for ultra-dense networks: A survey,
[30] A. Mohammed, H. Nahom, A. Tewodros, Y. Habtamu, and some research issues and challenges,” IEEE Commun. Surveys Tuts.,
G. Hayelom, “Deep reinforcement learning for computation offload- vol. 21, no. 3, pp. 2134–2168, 3rd Quart., 2019.
ing and resource allocation in blockchain-based multi-UAV-enabled [51] K. Piamrat, A. Ksentini, J.-M. Bonnin, and C. Viho, “Radio resource
mobile edge computing,” in Proc. 17th Int. Comput. Conf. management in emerging heterogeneous wireless networks,” Comput.
Wavelet Active Media Technol. Inf. Process. (ICCWAMTIP), 2020, Commun., vol. 34, no. 9, pp. 1066–1076, 2011.
pp. 295–299. [52] N. Xia, H. Chen, and C. Yang, “Radio resource management in
[31] S. Sheng, P. Chen, Z. Chen, L. Wu, and Y. Yao, “Deep reinforcement machine-to-machine communications—A survey,” IEEE Commun.
learning-based task scheduling in IoT edge computing,” Sensors, Surveys Tuts., vol. 20, no. 1, pp. 791–828, 1st Quart., 2018.
vol. 21, no. 5, p. 1666, 2021. [53] S. Sadr, A. Anpalagan, and K. Raahemifar, “Radio resource alloca-
[32] X. Chen and G. Liu, “Energy-efficient task offloading and resource tion algorithms for the downlink of multiuser OFDM communication
allocation via deep reinforcement learning for augmented reality in systems,” IEEE Commun. Surveys Tuts., vol. 11, no. 3, pp. 92–106,
mobile edge networks,” IEEE Internet Things J., vol. 8, no. 13, 3rd Quart., 2009.
pp. 10843–10856, Jul. 2021. [54] E. Yaacoub and Z. Dawy, “A survey on uplink resource allocation in
[33] Q. Liu, T. Han, and E. Moges, “Edgeslice: Slicing wireless edge OFDMA wireless networks,” IEEE Commun. Surveys Tuts., vol. 14,
computing network with decentralized deep reinforcement learning,” no. 2, pp. 322–337, 2nd Quart., 2012.
2020, arXiv:2003.12911. [55] R. O. Afolabi, A. Dadlani, and K. Kim, “Multicast scheduling and
[34] M. Lin and Y. Zhao, “Artificial intelligence-empowered resource resource allocation algorithms for OFDMA-based systems: A sur-
management for future wireless communications: A survey,” China vey,” IEEE Commun. Surveys Tuts., vol. 15, no. 1, pp. 240–254,
Commun., vol. 17, no. 3, pp. 58–77, Mar. 2020. 1std Quart., 2013.
[35] Q. T. A. Pham, K. Piamrat, and C. Viho, “Resource management in [56] S. Chieochan and E. Hossain, “Adaptive radio resource allocation
wireless access networks: A layer-based classification-version 1.0,” in OFDMA systems: A survey of the state-of-the-art approaches,”
Rep. PI-2017, 2014, p. 23. Wireless Commun. Mobile Comput., vol. 9, no. 4, pp. 513–527, 2009.
[36] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, [57] D. Niyato and E. Hossain, “Radio resource management in MIMO-
“Deep reinforcement learning: A brief survey,” IEEE Signal Process. OFDM- mesh networks: Issues and approaches,” IEEE Commun.
Mag., vol. 34, no. 6, pp. 26–38, Nov. 2017. Mag., vol. 45, no. 11, pp. 100–107, Nov. 2007.



[58] W. Zhao and S. Wang, “Resource sharing scheme for device-to- [80] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos,
device communication underlaying cellular networks,” IEEE Trans. “Learning to optimize: Training deep neural networks for interference
Commun., vol. 63, no. 12, pp. 4838–4848, Dec. 2015. management,” IEEE Trans. Signal Process., vol. 66, no. 20,
[59] L. Song, D. Niyato, Z. Han, and E. Hossain, “Game-theoretic pp. 5438–5453, Oct. 2018.
resource allocation methods for device-to-device communication,” [81] H. Yang, A. Alphones, Z. Xiong, D. Niyato, J. Zhao, and K. Wu,
IEEE Wireless Commun., vol. 21, no. 3, pp. 136–144, Jun. 2014. “Artificial-intelligence-enabled intelligent 6G networks,” IEEE Netw.,
[60] Y. Kawamoto, H. Nishiyama, N. Kato, F. Ono, and R. Miura, “Toward vol. 34, no. 6, pp. 272–280, Nov./Dec. 2020.
future unmanned aerial vehicle networks: Architecture, resource allo- [82] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
cation and field experiments,” IEEE Wireless Commun., vol. 26, no. 1, with double Q-learning,” in Proc. AAAI Conf. Artif. Intell., vol. 30,
pp. 94–99, Feb. 2019. 2016, pp. 2094–2100.
[61] A. Masmoudi, K. Mnif, and F. Zarai, “A survey on radio resource
allocation for V2X communication,” Wireless Commun. Mobile [83] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot,
Comput., vol. 2019, Oct. 2019, Art. no. 2430656. and N. Freitas, “Dueling network architectures for deep rein-
forcement learning,” in Proc. Int. Conf. Mach. Learn., 2016,
[62] M. Allouch, S. Kallel, A. Soua, O. Shagdar, and S. Tohme,
pp. 1995–2003.
“Survey on radio resource allocation in long-term evolution-
vehicle,” Concurrency Comput. Pract. Exp., vol. 34, no. 7, 2021, [84] R. Sutton, “Policy gradient methods for reinforcement learning
Art. no. e6228. with function approximation,” in Advances in Neural Information
[63] S. Xu, G. Zhu, B. Ai, and Z. Zhong, “A survey on high-speed Processing Systems, vol. 12. Cambridge, MA, USA: MIT Press,
railway communications: A radio resource management perspective,” 2000, pp. 1057–1063.
Comput. Commun., vol. 86, pp. 12–28, Jul. 2016. [85] V. Mnih et al., “Asynchronous methods for deep reinforcement
[64] S. Dhilipkumar, C. Arunachalaperumal, and K. Thanigaivelu, “A learning,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
comparative study of resource allocation schemes for D2D networks [86] T. P. Lillicrap et al., “Continuous control with deep reinforcement
underlay cellular networks,” Wireless Personal Commun., vol. 106, learning,” 2015, arXiv:1509.02971.
no. 3, pp. 1075–1087, 2019. [87] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and
[65] J. Clausen, Branch and Bound Algorithms-Principles and Examples, M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc.
Univ. Copenhagen, Copenhagen, Denmark, 1999, pp. 1–30. 31st Int. Conf. Mach. Learn., vol. 32, 2014, pp. 387–395. [Online].
[66] D. S. Nau, V. Kumar, and L. Kanal, “General branch and bound, and Available: http://proceedings.mlr.press/v32/silver14.html
its relation to A∗ and AO∗ ,” Artif. Intell., vol. 23, no. 1, pp. 29–58,
[88] J. Xu and B. Ai, “Experience-driven power allocation using multi-
1984.
agent deep reinforcement learning for millimeter-wave high-speed
[67] Y. S. Nasir and D. Guo, “Deep reinforcement learning for
railway systems,” IEEE Trans. Intell. Transp. Syst., early access,
joint spectrum and power allocation in cellular networks,” 2020,
Feb. 3, 2021, doi: 10.1109/TITS.2021.3054511.
arXiv:2012.10682.
[68] Y. Xu et al., “Robust resource allocation for two-tier HetNets: An [89] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized
interference-efficiency perspective,” IEEE Trans. Green Commun. experience replay,” 2015, arXiv:1511.05952.
Netw., vol. 5, no. 3, pp. 1514–1528, Sep. 2021. [90] D. Horgan et al., “Distributed prioritized experience replay,” 2018,
[69] Y. Xu, G. Li, Y. Yang, M. Liu, and G. Gui, “Robust resource alloca- arXiv:1803.00933.
tion and power splitting in SWIPT enabled heterogeneous networks: [91] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional per-
A robust minimax approach,” IEEE Internet Things J., vol. 6, no. 6, spective on reinforcement learning,” in Proc. Int. Conf. Mach. Learn.,
pp. 10799–10811, Dec. 2019. 2017, pp. 449–458.
[70] Y. Xu, H. Xie, and R. Q. Hu, “Max-min beamforming design for het- [92] M. Hessel et al., “Rainbow: Combining improvements in deep rein-
erogeneous networks with hardware impairments,” IEEE Commun. forcement learning,” in Proc. AAAI Conf. Artif. Intell., vol. 32, 2018,
Lett., vol. 25, no. 4, pp. 1328–1332, Apr. 2020. pp. 3215–3222.
[71] Y. Xu, H. Xie, C. Liang, and F. R. Yu, “Robust secure energy-
[93] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially
efficiency optimization in SWIPT-aided heterogeneous networks with
observable MDPs,” in Proc. AAAI Fall Symp. Series, 2015, pp. 1–7.
a nonlinear energy-harvesting model,” IEEE Internet Things J., vol. 8,
no. 19, pp. 14908–14919, Oct. 2021. [94] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approx-
[72] K. Shen and W. Yu, “Fractional programming for communication imation error in actor-critic methods,” in Proc. Int. Conf. Mach.
systems—Part I: Power control and beamforming,” IEEE Trans. Learn., 2018, pp. 1587–1596.
Signal Process., vol. 66, no. 10, pp. 2616–2630, May 2018. [95] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-
[73] X. Zhang, X. Zhang, and Z. Wu, “Utility- and fairness-based spec- critic: Off-policy maximum entropy deep reinforcement learning
trum allocation of cellular networks by an adaptive particle swarm with a stochastic actor,” in Proc. Int. Conf. Mach. Learn., 2018,
optimization algorithm,” IEEE Trans. Emerg. Topics Comput. Intell., pp. 1861–1870.
vol. 4, no. 1, pp. 42–50, Feb. 2020. [96] G. Barth-Maron et al., “Distributed distributional deterministic policy
[74] X. He, X. Li, H. Ji, and H. Zhang, “Resource allocation for secrecy gradients,” 2018, arXiv:1804.08617.
rate optimization in UAV-assisted cognitive radio network,” in Proc. [97] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforce-
IEEE Wireless Commun. Netw. Conf. (WCNC), 2021, pp. 1–6. ment learning for multiagent systems: A review of challenges,
[75] M. Kim and I.-Y. Ko, “An efficient resource allocation approach solutions, and applications,” IEEE Trans. Cybern., vol. 50, no. 9,
based on a genetic algorithm for composite services in IoT environ- pp. 3826–3839, Sep. 2020.
ments,” in Proc. IEEE Int. Conf. Web Services, 2015, pp. 543–550.
[98] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement
[76] Y. El Morabit, F. Mrabti, and E. H. Abarkan, “Spectrum allocation learning: A selective overview of theories and algorithms,” 2019,
using genetic algorithm in cognitive radio networks,” in Proc. 3rd arXiv:1911.10635.
Int. Workshop RFID Adapt. Wireless Sens. Netw. (RAWSN), 2015,
pp. 90–93. [99] I. Althamary, C.-W. Huang, and P. Lin, “A survey on multi-agent
[77] M. B. Satria, I. W. Mustika, and Widyawan, “Resource alloca- reinforcement learning methods for vehicular networks,” in Proc.
tion in cognitive radio networks based on modified ant colony 15th Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), 2019,
optimization,” in Proc. 4th Int. Conf. Sci. Technol. (ICST), 2018, pp. 1154–1159.
pp. 1–5. [100] D. Lee, N. He, P. Kamalaruban, and V. Cevher, “Optimization
[78] M. Tian, H. Deng, and M. Xu, “Immune parallel artificial bee colony for reinforcement learning: From a single agent to cooperative
algorithm for spectrum allocation in cognitive radio sensor networks,” agents,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 123–135,
in Proc. Int. Conf. Comput. Inf. Telecommun. Syst. (CITS), 2020, May 2020.
pp. 1–4. [101] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively
[79] M. Naeem, S. T. H. Rizvi, and A. Coronato, “A gentle introduction to weighted MMSE approach to distributed sum-utility maximization
reinforcement learning and its application in different fields,” IEEE for a MIMO interfering broadcast channel,” IEEE Trans. Signal
Access, vol. 8, pp. 209320–209344, 2020. Process., vol. 59, no. 9, pp. 4331–4340, Sep. 2011.


[102] A. A. Khan and R. S. Adve, “Centralized and distributed deep rein- [123] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, and Y. Jiang, “Deep
forcement learning methods for downlink sum-rate optimization,” reinforcement learning for user association and resource allocation
IEEE Trans. Wireless Commun., vol. 19, no. 12, pp. 8410–8426, in heterogeneous networks,” in Proc. IEEE Global Commun. Conf.
Dec. 2020. (GLOBECOM), 2018, pp. 1–6.
[103] F. Meng, P. Chen, L. Wu, and J. Cheng, “Power allocation in multi- [124] N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, “Deep
user cellular networks: Deep reinforcement learning approaches,” reinforcement learning for user association and resource allocation
IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6255–6267, in heterogeneous cellular networks,” IEEE Trans. Wireless Commun.,
Oct. 2020. vol. 18, no. 11, pp. 5141–5152, Nov. 2019.
[104] F. Meng, P. Chen, and L. Wu, “Power allocation in multi-user cellular [125] Q. Zhang, Y.-C. Liang, and H. V. Poor, “Intelligent user association
networks with deep Q learning approach,” in Proc. IEEE Int. Conf. for symbiotic radio networks using deep reinforcement learning,”
Commun. (ICC), 2019, pp. 1–6. IEEE Trans. Wireless Commun., vol. 19, no. 7, pp. 4535–4548,
[105] L. Zhang and Y.-C. Liang, “Deep reinforcement learning for multi- Jul. 2020.
agent power control in heterogeneous networks,” IEEE Trans. [126] W. Lei, Y. Ye, and M. Xiao, “Deep reinforcement learning-based
Wireless Commun., vol. 20, no. 4, pp. 2551–2564, Apr. 2021. spectrum allocation in integrated access and backhaul networks,”
[106] Y. S. Nasir and D. Guo, “Deep actor-critic learning for distributed IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 3, pp. 970–979,
power control in wireless mobile networks,” in Proc. 54th Asilomar Sep. 2020.
Conf. Signals Syst. Comput., 2020, pp. 398–402. [127] Z. Li, C. Wang, and C.-J. Jiang, “User association for load balancing
[107] Y. S. Nasir and D. Guo, “Multi-agent deep reinforcement learning in vehicular networks: An online reinforcement learning approach,”
for dynamic power allocation in wireless networks,” IEEE J. Sel. IEEE Trans. Intell. Transp. Syst., vol. 18, no. 8, pp. 2217–2228,
Areas Commun., vol. 37, no. 10, pp. 2239–2250, Oct. 2019. Aug. 2017.
[108] Z. Bi and W. Zhou, “Deep reinforcement learning based power allo-
[128] J. Zheng, X. Tang, X. Wei, H. Shen, and L. Zhao, “Channel assign-
cation for D2D network,” in Proc. IEEE 91st Veh. Technol. Conf.
ment for hybrid NOMA systems with deep reinforcement learning,”
(VTC-Spring), 2020, pp. 1–5.
IEEE Wireless Commun. Lett., vol. 10, no. 7, pp. 1370–1374,
[109] S. Saeidian, S. Tayamon, and E. Ghadimi, “Downlink power con-
Jul. 2021.
trol in dense 5G radio access networks through deep reinforcement
learning,” in Proc. IEEE Int. Conf. Commun. (ICC), 2020, pp. 1–6. [129] H. Song, L. Liu, J. Ashdown, and Y. Yi, “A deep reinforcement
[110] Z. Zhang, H. Qu, J. Zhao, and W. Wang, “Deep reinforcement learn- learning framework for spectrum management in dynamic spectrum
ing method for energy efficient resource allocation in next generation access,” IEEE Internet Things J., vol. 8, no. 14, pp. 11208–11218,
wireless networks,” in Proc. Int. Conf. Comput. Netw. Internet Things, Jul. 2021.
2020, pp. 18–24. [130] Y. Hu et al., “Optimal transmit antenna selection strategy for MIMO
[111] Q. Wang, K. Feng, X. Li, and S. Jin, “PrecoderNet: Hybrid wiretap channel based on deep reinforcement learning,” in Proc.
beamforming for millimeter wave systems with deep reinforce- IEEE/CIC Int. Conf. Commun. China (ICCC), 2018, pp. 803–807.
ment learning,” IEEE Wireless Commun. Lett., vol. 9, no. 10, [131] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep
pp. 1677–1681, Oct. 2020. reinforcement learning for dynamic multichannel access in wire-
[112] X. Li, J. Fang, W. Cheng, H. Duan, Z. Chen, and H. Li, less networks,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2,
“Intelligent power control for spectrum sharing in cognitive radios: pp. 257–265, Jun. 2018.
A deep reinforcement learning approach,” IEEE Access, vol. 6, [132] S. Wang, H. Liu, P. Gomes, and B. Krishnamachari, “Deep rein-
pp. 25463–25473, 2018. forcement learning for dynamic multichannel access,” in Proc. Int.
[113] L. Li, Q. Cheng, K. Xue, C. Yang, and Z. Han, “Downlink transmit Conf. Comput. Netw. Commun. (ICNC), 2017, pp. 257–265.
power control in ultra-dense UAV network based on mean field game [133] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari,
and deep reinforcement learning,” IEEE Trans. Veh. Technol., vol. 69, “Optimality of myopic sensing in multichannel opportunistic access,”
no. 12, pp. 15594–15605, Dec. 2020. IEEE Trans. Inf. Theory, vol. 55, no. 9, pp. 4040–4050, Sep. 2009.
[114] N. Zhao, Z. Liu, and Y. Cheng, “Multi-agent deep reinforcement [134] M. Chu, H. Li, X. Liao, and S. Cui, “Reinforcement learning-based
learning for trajectory design and power allocation in multi-UAV multiaccess control and battery prediction with energy harvesting in
networks,” IEEE Access, vol. 8, pp. 139670–139679, 2020. IoT systems,” IEEE Internet Things J., vol. 6, no. 2, pp. 2009–2020,
[115] M. Yan, B. Chen, G. Feng, and S. Qin, “Federated cooperation Apr. 2019.
and augmentation for power allocation in decentralized wireless [135] Y. Zhang, P. Cai, C. Pan, and S. Zhang, “Multi-agent deep reinforce-
networks,” IEEE Access, vol. 8, pp. 48088–48100, 2020. ment learning-based cooperative spectrum sensing with upper confi-
[116] J. G. Luis, M. Guerster, I. del Portillo, E. Crawley, and B. Cameron, dence bound exploration,” IEEE Access, vol. 7, pp. 118898–118906,
“Deep reinforcement learning architecture for continuous power 2019.
allocation in high throughput satellites,” 2019, arXiv:1906.00571. [136] N. Yang, H. Zhang, and R. Berry, “Partially observable multi-agent
[117] J. J. G. Luis, M. Guerster, I. del Portillo, E. Crawley, and B. Cameron, deep reinforcement learning for cognitive resource management,” in
“Deep reinforcement learning for continuous power allocation in Proc. IEEE Global Commun. Conf., 2020, pp. 1–6.
flexible high throughput satellites,” in Proc. IEEE Cogn. Commun.
[137] Y. Xu, J. Yu, W. C. Headley, and R. M. Buehrer, “Deep reinforcement
Aerosp. Appl. Workshop (CCAAW), 2019, pp. 1–4.
learning for dynamic spectrum access in wireless networks,” in Proc.
[118] O. Maraqa, A. S. Rajasekaran, S. Al-Ahmadi, H. Yanikomeroglu, and
IEEE Military Commun. Conf. (MILCOM), 2018, pp. 207–212.
S. M. Sait, “A survey of rate-optimal power domain NOMA with
enabling technologies of future wireless networks,” IEEE Commun. [138] L. Liang, H. Ye, and G. Y. Li, “Multi-agent reinforcement learning
Surveys Tuts., vol. 22, no. 4, pp. 2192–2235, 4th Quart., 2020. for spectrum sharing in vehicular networks,” in Proc. IEEE 20th Int.
[119] X. Yan, K. An, Q. Zhang, G. Zheng, S. Chatzinotas, and J. Han, Workshop Signal Process. Adv. Wireless Commun. (SPAWC), 2019,
“Delay constrained resource allocation for NOMA enabled satellite pp. 1–5.
Internet of Things with deep reinforcement learning,” IEEE Internet [139] L. Liang, H. Yee, and G. Y. Li, “Spectrum sharing in vehicular
Things J., early access, 2020. [Online]. Available: https://orbilu.uni. networks based on multi-agent reinforcement learning,” IEEE J. Sel.
lu/bitstream/10993/45468/1/Delay%20Constrained%20Resource% Areas Commun., vol. 37, no. 10, pp. 2282–2292, Oct. 2019.
20Allocation%20for%20NOMA.pdf [140] J. Zhu, Y. Song, D. Jiang, and H. Song, “A new deep-Q-learning-
[120] A. Alwarafy, M. Alresheedi, A. F. Abas, and A. Alsanie, based transmission scheduling mechanism for the cognitive Internet
“Performance evaluation of space time coding techniques for indoor of Things,” IEEE Internet Things J., vol. 5, no. 4, pp. 2375–2385,
visible light communication systems,” in Proc. Int. Conf. Opt. Netw. Aug. 2018.
Design Model. (ONDM), 2018, pp. 88–93. [141] H. Khan, A. Elgabli, S. Samarakoon, M. Bennis, and C. S. Hong,
[121] A. Memedi and F. Dressler, “Vehicular visible light communications: “Reinforcement learning-based vehicle-cell association algorithm for
A survey,” IEEE Commun. Surveys Tuts., vol. 23, no. 1, pp. 161–181, highly mobile millimeter wave communication,” IEEE Trans. Cogn.
1st Quart., 2021. Commun. Netw., vol. 5, no. 4, pp. 1073–1085, Dec. 2019.
[122] M. Chen, W. Saad, and C. Yin, “Liquid state machine learning for [142] P. Yang et al., “Dynamic spectrum access in cognitive radio networks
resource allocation in a network of cache-enabled LTE-U UAVs,” in using deep reinforcement learning and evolutionary game,” in Proc.
Proc. IEEE Global Commun. Conf., 2017, pp. 1–6. IEEE/CIC Int. Conf. Commun. China (ICCC), 2018, pp. 405–409.



[143] Y. Cao, S.-Y. Lien, and Y.-C. Liang, “Deep reinforcement learning for [165] Q. Liu, T. Han, N. Zhang, and Y. Wang, “Deepslicing: Deep rein-
multi-user access control in non-terrestrial networks,” IEEE Trans. forcement learning assisted resource allocation for network slicing,”
Commun., vol. 69, no. 3, pp. 1605–1619, Mar. 2021. in Proc. IEEE Global Commun. Conf., 2020, pp. 1–6.
[144] S. Tomovic and I. Radusinovic, “A novel deep Q-learning method [166] F. Tang, Y. Zhou, and N. Kato, “Deep reinforcement learning
for dynamic spectrum access,” in Proc. 28th Telecommun. Forum for dynamic uplink/downlink resource allocation in high mobil-
(TELFOR), 2020, pp. 1–4. ity 5G HetNet,” IEEE J. Sel. Areas Commun., vol. 38, no. 12,
[145] C. Zhong, Z. Lu, M. C. Gursoy, and S. Velipasalar, “A deep actor- pp. 2773–2782, Dec. 2020.
critic reinforcement learning framework for dynamic multichannel [167] Z. Chkirbene, A. A. Abdellatif, A. Mohamed, A. Erbad, and
access,” IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 4, M. Guizani, “Deep reinforcement learning for network selection over
pp. 1125–1139, Dec. 2019. heterogeneous health systems,” IEEE Trans. Netw. Sci. Eng., vol. 9,
[146] A. Tondwalkar and D. A. Kwasinski, “Deep reinforcement learning no. 1, pp. 258–270, Jan./Feb. 2022.
for distributed uncoordinated cognitive radios resource allocation,” [168] Y. Y. Munaye, R.-T. Juang, H.-P. Lin, G. B. Tarekegn, and D.-B. Lin,
2019, arXiv:1911.03366. “Deep reinforcement learning based resource management in UAV-
[147] K. Al-Gumaei et al., “A survey of Internet of Things and big assisted IoT networks,” Appl. Sci., vol. 11, no. 5, p. 2163, 2021.
data integrated solutions for industrie 4.0,” in Proc. IEEE 23rd [169] X. Zhang, M. Peng, S. Yan, and Y. Sun, “Deep-reinforcement-
Int. Conf. Emerg. Technol. Factory Autom. (ETFA), vol. 1, 2018, learning-based mode selection and resource allocation for cellular
pp. 1417–1424. V2X communications,” IEEE Internet Things J., vol. 7, no. 7,
pp. 6380–6391, Jul. 2019.
[148] Z. Shi, X. Xie, H. Lu, H. Yang, M. Kadoch, and M. Cheriet,
[170] J. Jang and H. J. Yang, “Deep reinforcement learning-based
“Deep-reinforcement-learning-based spectrum resource management
resource allocation and power control in small cells with limited
for Industrial Internet of Things,” IEEE Internet Things J., vol. 8,
information exchange,” IEEE Trans. Veh. Technol., vol. 69, no. 11,
no. 5, pp. 3476–3489, Mar. 2021.
pp. 13768–13783, Nov. 2020.
[149] S. B. Janiar and V. Pourahmadi, “Deep-reinforcement learning for fair [171] X. Liao, J. Shi, Z. Li, L. Zhang, and B. Xia, “A model-driven deep
distributed dynamic spectrum access in wireless networks,” in Proc. reinforcement learning heuristic algorithm for resource allocation in
IEEE 18th Annu. Consum. Commun. Netw. Conf. (CCNC), 2021, ultra-dense cellular networks,” IEEE Trans. Veh. Technol., vol. 69,
pp. 1–4. no. 1, pp. 983–997, Jan. 2020.
[150] Y. Wang, X. Li, P. Wan, and R. Shao, “Intelligent dynamic spec- [172] X. Zhang, Z. Lin, B. Ding, B. Gu, and Y. Han, “Deep multi-agent
trum access using deep reinforcement learning for VANETs,” IEEE reinforcement learning for resource allocation in D2D communica-
Sensors J., vol. 21, no. 14, pp. 15554–15563, Jul. 2021. tion underlaying cellular networks,” in Proc. 21st Asia-Pacific Netw.
[151] Y. Hu, M. Chen, W. Saad, H. V. Poor, and S. Cui, “Distributed multi- Oper. Manage. Sympo. (APNOMS), 2020, pp. 55–60.
agent meta learning for trajectory design in wireless drone networks,” [173] D. Wang, H. Qin, B. Song, K. Xu, X. Du, and M. Guizani, “Joint
IEEE J. Sel. Areas Commun., vol. 39, no. 10, pp. 3177–3192, resource allocation and power control for D2D communication with
Oct. 2021. deep reinforcement learning in MCC,” Phys. Commun., vol. 45,
[152] W. Jiang and W. Yu, “Multi-agent reinforcement learning based joint Apr. 2021, Art. no. 101262.
cooperative spectrum sensing and channel access for cognitive UAV [174] H. Ding, F. Zhao, J. Tian, D. Li, and H. Zhang, “A deep reinforcement
networks,” 2021, arXiv:2103.08181. learning for user association and power control in heterogeneous
[153] Y. Xu, J. Yu, and R. M. Buehrer, “The application of deep networks,” Ad Hoc Netw., vol. 102, 2020, Art. no. 102069.
reinforcement learning to distributed spectrum access in dynamic [175] Y. Zhang, C. Kang, Y. Teng, S. Li, W. Zheng, and J. Fang, “Deep
heterogeneous environments with partial observations,” IEEE Trans. reinforcement learning framework for joint resource allocation in
Wireless Commun., vol. 19, no. 7, pp. 4494–4506, Jul. 2020. heterogeneous networks,” in Proc. IEEE 90th Veh. Technol. Conf.
[154] S. Liu, X. Hu, and W. Wang, “Deep reinforcement learning (VTC–Fall), 2019, pp. 1–6.
based dynamic channel allocation algorithm in multibeam satellite [176] J. Tan, L. Zhang, and Y.-C. Liang, “Deep reinforcement learning
systems,” IEEE Access, vol. 6, pp. 15733–15742, 2018. for channel selection and power control in D2D networks,” in Proc.
[155] X. Hu, S. Liu, R. Chen, W. Wang, and C. Wang, “A deep reinforce- IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
ment learning-based framework for dynamic resource allocation in [177] J. Tan, Y.-C. Liang, L. Zhang, and G. Feng, “Deep reinforce-
multibeam satellite systems,” IEEE Commun. Lett., vol. 22, no. 8, ment learning for joint channel selection and power control in
pp. 1612–1615, Aug. 2018. D2D networks,” IEEE Trans. Wireless Commun., vol. 20, no. 2,
[156] B. Zhao, J. Liu, Z. Wei, and I. You, “A deep reinforcement learning pp. 1363–1378, Feb. 2021.
based approach for energy-efficient channel allocation in satellite [178] H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep reinforcement learning
Internet of Things,” IEEE Access, vol. 8, pp. 62197–62206, 2020. based resource allocation for V2V communications,” IEEE Trans.
[157] J. Liu, B. Zhao, Q. Xin, and H. Liu, “Dynamic channel allocation Veh. Technol., vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
for satellite Internet of Things via deep reinforcement learning,” in [179] H. Ye and G. Y. Li, “Deep reinforcement learning for resource allo-
Proc. Int. Conf. Inf. Netw. (ICOIN), 2020, pp. 465–470. cation in V2V communications,” in Proc. IEEE Int. Conf. Commun.
[158] F. Zheng, Z. Pi, Z. Zhou, and K. Wang, “Leo satellite channel (ICC), 2018, pp. 1–6.
allocation scheme based on reinforcement learning,” Mobile Inf. Syst., [180] H. Ye and G. Li, “Deep reinforcement learning based dis-
vol. 2020, Dec. 2020, Art. no. 8868888. tributed resource allocation for V2V broadcasting,” in Proc. 14th
Int. Wireless Commun. Mobile Comput. Conf. (IWCMC), 2018,
[159] U. Challita, L. Dong, and W. Saad, “Proactive resource man-
pp. 440–445.
agement in LTE-U systems: A deep learning perspective,” 2017,
[181] S. Yuan, Y. Zhang, W. Qie, T. Ma, and S. Li, “Deep reinforcement
arXiv:1702.07031.
learning for resource allocation with network slicing in cognitive
[160] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learn- radio network,” Comput. Sci. Inf. Syst., vol. 18, no. 3, pp. 979–999,
ing for distributed dynamic spectrum access,” IEEE Trans. Wireless 2020.
Commun., vol. 18, no. 1, pp. 310–323, Jan. 2019. [182] Y. Zhang, X. Wang, and Y. Xu, “Energy-efficient resource allocation
[161] S. Wang and T. Lv, “Deep reinforcement learning based dynamic in uplink NOMA systems with deep reinforcement learning,” in Proc.
multichannel access in hetnets,” in Proc. IEEE Wireless Commun. 11th Int. Conf. Wireless Commun. Signal Process. (WCSP) 2019,
Netw. Conf. (WCNC), 2019, pp. 1–6. pp. 1–6.
[162] H. Peng and X. Shen, “Deep reinforcement learning based resource [183] Y.-H. Xu, C.-C. Yang, M. Hua, and W. Zhou, “Deep deterministic
management for multi-access edge computing in vehicular networks,” policy gradient (DDPG)-based resource allocation scheme for NOMA
IEEE Trans. Netw. Sci. Eng., vol. 7, no. 4, pp. 2416–2428, vehicular communications,” IEEE Access, vol. 8, pp. 18797–18807,
Oct.–Dec. 2020. 2020.
[163] U. Challita and D. Sandberg, “Deep reinforcement learning for [184] A. Khalili, E. M. Monfared, S. Zargari, M. R. Javan, N. Mokari,
dynamic spectrum sharing of LTE and NR,” 2021, arXiv:2102.11176. and E. A. Jorswieck, “Resource management for transmit power
[164] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep minimization in UAV-assisted RIS HetNets supported by dual con-
reinforcement learning for dynamic multichannel access in wireless nectivity,” IEEE Trans. Wireless Commun., early access, Aug. 31,
networks,” 2018, arXiv:1802.06958. 2021, doi: 10.1109/TWC.2021.3107306.

VOLUME 3, 2022 363


ALWARAFY et al.: FRONTIERS OF DRL FOR RESOURCE MANAGEMENT IN FUTURE WIRELESS HetNets

[185] S. Shrivastava, B. Chen, C. Chen, H. Wang, and M. Dai, “Deep [207] B. Das and S. Roy, “Load balancing techniques for wireless mesh
Q-network learning based downlink resource allocation for hybrid networks: A survey,” in Proc. Int. Symp. Comput. Bus. Intell., 2013,
RF/VLC systems,” IEEE Access, vol. 8, pp. 149412–149434, 2020. pp. 247–253.
[186] Q. Huang, X. Xie, and M. Cheriet, “Reinforcement learning-based [208] L. Zhu, W. Shen, S. Pan, R. Li, and Z. Li, “A dynamic load balancing
hybrid spectrum resource allocation scheme for the high load of method for spatial data network service,” in Proc. 5th Int. Conf.
URLLC services,” EURASIP J. Wireless Commun. Netw., vol. 2020, Wireless Commun. Netw. Mobile Comput., 2009, pp. 1–3.
no. 1, pp. 1–21, 2020. [209] H. Desai and R. Oza, “A study of dynamic load balancing in grid
[187] H. Yang et al., “Intelligent reflecting surface assisted anti-jamming environment,” in Proc. Int. Conf. Wireless Commun. Signal Process.
communications based on reinforcement learning,” in Proc. IEEE Netw. (WiSPNET), 2016, pp. 128–132.
Global Commun. Conf., 2020, pp. 1–6. [210] F. Khayatian, Z. Nagy, and A. Bollinger, “Using generative adver-
[188] H. Yang et al., “Intelligent reflecting surface assisted anti-jamming sarial networks to evaluate robustness of reinforcement learning
communications: A fast reinforcement learning approach,” IEEE agents against uncertainties,” Energy Build., vol. 251, Nov. 2021,
Trans. Wireless Commun., vol. 20, no. 3, pp. 1963–1974, Mar. 2021. Art. no. 111334.
[189] E. Wong, K. Leung, and T. Field, “State-space decomposition for [211] A. T. Z. Kasgari, W. Saad, M. Mozaffari, and H. V. Poor,
reinforcement learning,” Dept. Comput., Imperial College London, “Experienced deep reinforcement learning with generative adver-
London, U.K., Rep., 2021. sarial networks (GANs) for model-free ultra reliable low latency
communication,” IEEE Trans. Commun., vol. 69, no. 2, pp. 884–899,
[190] L. Zahedi, F. G. Mohammadi, S. Rezapour, M. W. Ohland, and
Feb. 2021.
M. H. Amini, “Search algorithms for automated hyper-parameter
[212] R. Alghamdi et al., “Intelligent surfaces for 6G wireless networks: A
tuning,” 2021, arXiv:2104.14677.
survey of optimization and performance analysis techniques,” IEEE
[191] N. V. Varghese and Q. H. Mahmoud, “A survey of multi-task deep Access, vol. 8, pp. 202795–202818, 2020.
reinforcement learning,” Electronics, vol. 9, no. 9, p. 1363, 2020. [213] C. Huang et al., “Holographic MIMO surfaces for 6G wireless
[192] K. Lei, Y. Liang, and W. Li, “Congestion control in SDN-based networks: Opportunities, challenges, and trends,” IEEE Wireless
networks via multi-task deep reinforcement learning,” IEEE Netw., Commun., vol. 27, no. 5, pp. 118–125, Oct. 2020.
vol. 34, no. 4, pp. 28–34, Jul./Aug. 2020. [214] Q. Wu and R. Zhang, “Towards smart and reconfigurable environ-
[193] H. Eghbal-Zadeh, F. Henkel, and G. Widmer, “Context-adaptive ment: Intelligent reflecting surface aided wireless network,” IEEE
reinforcement learning using unsupervised learning of context vari- Commun. Mag., vol. 58, no. 1, pp. 106–112, Jan. 2020.
ables,” in Proc. Workshop Pre-Registration Mach. Learn., 2021, [215] C. Pan et al., “Reconfigurable intelligent surface for 6G and beyond:
pp. 236–254. Motivations, principles, applications, and research directions,” 2020,
[194] T. T. Nguyen, N. D. Nguyen, P. Vamplew, S. Nahavandi, R. Dazeley, arXiv:2011.04300.
and C. P. Lim, “A multi-objective deep reinforcement learning frame- [216] A. Taha, Y. Zhang, F. B. Mismar, and A. Alkhateeb, “Deep reinforce-
work,” Eng. Appl. Artif. Intell., vol. 96, Nov. 2020, Art. no. 103915. ment learning for intelligent reflecting surfaces: Towards standalone
[195] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey operation,” in Proc. IEEE 21st Int. Workshop Signal Process. Adv.
on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, Wireless Commun. (SPAWC), 2020, pp. 1–5.
pp. 52138–52160, 2018. [217] C. Huang, R. Mo, and C. Yuen, “Reconfigurable intelligent sur-
[196] A. Heuillet, F. Couthouis, and N. Díaz-Rodríguez, “Explainability face assisted multiuser MISO systems exploiting deep reinforce-
in deep reinforcement learning,” Knowl. Based Syst., vol. 214, ment learning,” IEEE J. Sel. Areas Commun., vol. 38, no. 8,
Feb. 2021, Art. no. 106685. pp. 1839–1850, Aug. 2020.
[197] Y. He, Y. Wang, C. Qiu, Q. Lin, J. Li, and Z. Ming, “Blockchain- [218] Z. Yang, Y. Liu, Y. Chen, and J. T. Zhou, “Deep reinforcement learn-
based edge computing resource allocation in IoT: A deep reinforce- ing for ris-aided non-orthogonal multiple access downlink networks,”
ment learning approach,” IEEE Internet Things J., vol. 8, no. 4, in Proc. IEEE Global Commun. Conf., 2020, pp. 1–6.
pp. 2226–2237, Feb. 2021. [219] L. U. Khan, W. Saad, D. Niyato, Z. Han, and C. S. Hong, “Digital-
[198] F. Guo, F. R. Yu, H. Zhang, H. Ji, M. Liu, and V. C. M. Leung, twin-enabled 6G: Vision, architectural trends, and future directions,”
“Adaptive resource allocation in future wireless networks with 2021, arXiv:2102.12169.
blockchain and mobile edge computing,” IEEE Trans. Wireless [220] W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn, “Digital
Commun., vol. 19, no. 3, pp. 1689–1703, Mar. 2020. twin in manufacturing: A categorical literature review and clas-
[199] S. Hu, Y.-C. Liang, Z. Xiong, and D. Niyato, “Blockchain and arti- sification,” IFAC-PapersOnLine, vol. 51, no. 11, pp. 1016–1022,
ficial intelligence for dynamic resource sharing in 6G and beyond,” 2018.
IEEE Wireless Commun., vol. 28, no. 4, pp. 145–151, Aug. 2021. [221] Y. Lu, S. Maharjan, and Y. Zhang, “Adaptive edge association for
[200] L. Yang, M. Li, P. Si, R. Yang, E. Sun, and Y. Zhang, “Energy- wireless digital twin networks in 6G,” IEEE Internet Things J., vol. 8,
efficient resource allocation for blockchain-enabled Industrial no. 22, pp. 16219–16230, Nov. 2021.
[222] W. Sun, N. Xu, L. Wang, H. Zhang, and Y. Zhang, “Dynamic
Internet of Things with deep reinforcement learning,” IEEE Internet
digital twin and federated learning with incentives for air-ground
Things J., vol. 8, no. 4, pp. 2318–2329, Feb. 2020.
networks,” IEEE Trans. Netw. Sci. Eng., vol. 9, no. 1, pp. 321–333,
[201] O. A. Wahab, A. Mourad, H. Otrok, and T. Taleb, “Federated
Jan./Feb. 2022.
machine learning: Survey, multi-level classification, desirable criteria [223] Y. Lu, X. Huang, K. Zhang, S. Maharjan, and Y. Zhang,
and future directions in communication and networking systems,” “Communication-efficient federated learning and permissioned
IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 1342–1397, blockchain for digital twin edge networks,” IEEE Internet Things
2nd Quart., 2021. J., vol. 8, no. 4, pp. 2276–2288, Feb. 2021.
[202] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong,
“Federated learning over wireless networks: Optimization model
design and analysis,” in Proc. IEEE Conf. Comput. Commun., 2019,
pp. 1387–1395.
[203] A. M. Albaseer, M. Abdallah, A. Al-Fuqaha, and A. Erbad, “Fine-
grained data selection for improved energy efficiency of federated
edge learning,” IEEE Trans. Netw. Sci. Eng., early access, Jul. 29,
2021, doi: 10.1109/TNSE.2021.3100805.
ABDULMALIK ALWARAFY received the B.S. degree in electrical engineering with a minor in communication from IBB University, Yemen, and the M.Sc. degree in electrical engineering with a minor in communication from King Saud University, Saudi Arabia. He is currently pursuing the Ph.D. degree with the College of Science and Engineering, Hamad Bin Khalifa University, Qatar. His current research interests include radio resource allocation and management for mobile networks, and deep reinforcement learning techniques for 6G and beyond wireless communication networks.
MOHAMED ABDALLAH (Senior Member, IEEE) received the B.Sc. degree from Cairo University, Giza, Egypt, in 1996, and the M.Sc. and Ph.D. degrees from the University of Maryland at College Park, College Park, MD, USA, in 2001 and 2006, respectively.

From 2006 to 2016, he held academic and research positions with Cairo University and Texas A&M University at Qatar, Doha, Qatar. He is currently a Founding Faculty Member with the rank of Associate Professor with the College of Science and Engineering, Hamad Bin Khalifa University, Doha. He has published more than 150 journal and conference papers and four book chapters, and co-invented four patents. His current research interests include wireless networks, wireless security, smart grids, optical wireless communication, and blockchain applications for emerging networks. He is a recipient of the Research Fellow Excellence Award at Texas A&M University at Qatar in 2016, the Best Paper Award at multiple IEEE conferences, including IEEE BlackSeaCom 2019 and the IEEE First Workshop on Smart Grid and Renewable Energy in 2015, and the Nortel Networks Industrial Fellowship for five consecutive years, 1999–2003. His professional activities include serving as an Associate Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS and the IEEE OPEN ACCESS JOURNAL OF COMMUNICATIONS, the Track Co-Chair of the IEEE VTC Fall 2019 Conference, the Technical Program Chair of the 10th International Conference on Cognitive Radio-Oriented Wireless Networks, and a technical program committee member of several major IEEE conferences.

BEKIR SAIT ÇIFTLER (Member, IEEE) received the B.S. degree from Middle East Technical University, Ankara, Turkey, in 2011, the M.S. degree from the TOBB University of Economics and Technology, Ankara, in 2013, and the Ph.D. degree in electrical and computer engineering from Florida International University, Miami, FL, USA, in 2017. His current research interests include wireless localization and tracking for the Internet of Things, software-defined radios, 5G and mmWave networks, and MIMO systems.

ALA AL-FUQAHA (Senior Member, IEEE) received the Ph.D. degree in computer engineering and networking from the University of Missouri–Kansas City, Kansas City, MO, USA, in 2004. He is currently a Professor with Hamad Bin Khalifa University. His research interests include the use of machine learning in general and deep learning in particular in support of the data-driven and self-driven management of large-scale deployments of the IoT and smart city infrastructure and services, wireless vehicular networks, cooperation and spectrum access etiquette in cognitive radio networks, and management and planning of software-defined networks. He is an ABET Program Evaluator. He has also served as the Chair, the Co-Chair, and a Technical Program Committee Member of multiple international conferences, including the IEEE VTC, IEEE Globecom, IEEE ICC, and IWCMC. He serves on the editorial boards of multiple journals, including the IEEE COMMUNICATIONS LETTERS and the IEEE Network Magazine.

MOUNIR HAMDI (Fellow, IEEE) received the B.S. degree (Hons.) in electrical engineering (computer engineering) from the University of Louisiana at Lafayette, Lafayette, LA, USA, in 1985, and the M.S. and Ph.D. degrees in electrical engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 1987 and 1991, respectively.

He was a Chair Professor and a Founding Member of the Hong Kong University of Science and Technology (HKUST), Hong Kong, where he was the Head of the Department of Computer Science and Engineering. From 1999 to 2000, he was a Visiting Professor with Stanford University, Stanford, CA, USA, and the Swiss Federal Institute of Technology, Zürich, Switzerland. He is currently the Founding Dean of the College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar. His area of research is high-speed wired/wireless networking, in which he has more than 360 publications, has graduated more than 50 M.S./Ph.D. students, and has been awarded numerous research grants. In addition, he has frequently consulted for companies and governmental organizations in the USA, Europe, and Asia. He received the Best 10 Lecturer Award and the Distinguished Engineering Teaching Appreciation Award from HKUST. He is frequently involved in higher education quality assurance activities as well as engineering program accreditation all over the world. He is a Fellow of the IEEE for his contributions to the design and analysis of high-speed packet switching, which is the highest research distinction bestowed by the IEEE. He is also a frequent keynote speaker at international conferences and forums. He is/was on the editorial boards of more than ten prestigious journals and magazines. He has chaired more than 20 international conferences and workshops. In addition to his commitment to research and academic/professional service, he is also a dedicated teacher and a quality assurance educator.