Multi-Agent Reinforcement Learning Based Resource Allocation For UAV Networks
Abstract
Unmanned aerial vehicles (UAVs) are capable of serving as aerial base stations (BSs) for providing
both cost-effective and on-demand wireless communications. This article investigates dynamic resource
allocation for multiple-UAV enabled communication networks with the goal of maximizing long-term
rewards. More particularly, each UAV communicates with a ground user by automatically selecting its
communicating users, power levels and subchannels without any information exchange among UAVs.
To model the uncertainty of environments, we formulate the long-term resource allocation problem as a
stochastic game for maximizing the expected rewards, where each UAV becomes a learning agent and
each resource allocation solution corresponds to an action taken by the UAVs. Afterwards, we develop
a multi-agent reinforcement learning (MARL) framework in which each agent discovers its best strategy
according to its local observations through learning. More specifically, we propose an agent-independent
method, for which all agents conduct a decision algorithm independently but share a common structure
based on Q-learning. Finally, simulation results reveal that: 1) appropriate parameters for exploitation and
exploration are capable of enhancing the performance of the proposed MARL based resource allocation
algorithm; 2) the proposed MARL algorithm provides acceptable performance compared to the case
with complete information exchanges among UAVs. By doing so, it strikes a good tradeoff between
performance gains and information exchange overheads.
Index Terms
Dynamic resource allocation, multi-agent reinforcement learning (MARL), stochastic games, UAV
communications
J. Cui, Y. Liu and A. Nallanathan are with the School of Electronic Engineering and Computer Science, Queen Mary University
of London, London E1 4NS, U.K. (email: {j.cui, yuanwei.liu, a.nallanathan}@qmul.ac.uk).
I. INTRODUCTION
A. Prior Works
Compared to terrestrial BSs, UAVs are generally faster to deploy and more flexible to config-
ure. The deployment of UAVs in terms of altitude and distance between UAVs was investigated
for UAV-enabled small cells in [5]. In [6], a three-dimensional (3D) deployment algorithm based
on circle packing was developed for maximizing the downlink coverage performance. Additionally,
a 3D deployment algorithm of a single UAV was developed for maximizing the number of covered
users in [7]. Moreover, by fixing the altitudes, a successive UAV placement approach was
proposed to minimize the number of UAVs required while guaranteeing each ground user to
be covered by at least one UAV in [8].
In addition to the deployment optimization of UAVs, trajectory designs of UAVs for optimizing the
communication performance have attracted tremendous attention, such as in [9]–[11]. In [9],
the authors considered one UAV as a mobile relay and investigated the throughput maximization
problem by optimizing power allocation and the UAV's trajectory. Then, a design approach for the
UAV's trajectory based on successive convex approximation (SCA) techniques was proposed
in [9]. By transforming the continuous trajectory into a set of discrete waypoints, the authors in
[10] investigated the UAV’s trajectory design with minimizing the mission completion time in a
UAV-enabled multicasting system. Additionally, multiple-UAV enabled wireless communication
networks (multi-UAV networks) were considered in [11], where a joint design for optimizing
trajectory and resource allocation was studied with the goal of guaranteeing fairness by maximiz-
ing the minimum throughput among users. In [12], the authors proposed a joint subchannel
assignment and trajectory design approach to strike a tradeoff between the sum rate and the
delay of sensing tasks for a multi-UAV aided uplink single cell network.
Due to the versatility and manoeuvrability of UAVs, human intervention becomes restrictive for
UAVs' control design. Therefore, machine learning based intelligent control of UAVs is desired
for enhancing the performance of UAV-enabled communication networks. Neural network based
trajectory designs were considered from the perspective of UAVs’ manufactured structures in
[13] and [14]. Regarding UAVs enabled communication networks, a weighted expectation based
predictive on-demand deployment approach of UAVs was proposed to minimize the transmit
power in [15], where Gaussian mixture model was used for building data distributions. In [16],
the authors studied the autonomous path planning of UAVs by jointly taking energy efficiency,
latency and interference into consideration, in which an echo state network based deep rein-
forcement learning algorithm was proposed. In [17], the authors proposed a liquid state machine
(LSM) based resource allocation algorithm for cache enabled UAVs over LTE licensed and
unlicensed bands. Additionally, a log-linear learning based joint channel-slot selection algorithm
was developed for multi-UAV networks in [18].
As discussed above, machine learning is a promising and powerful tool to provide autonomous
and effective solutions in an intelligent manner to enhance the UAV-enabled communication
networks. However, most research contributions focus on the deployment and trajectory designs
of UAVs in communication networks, such as [15]–[17]. Though resource allocation schemes
such as transmit power and subchannels were considered for UAV-enabled communication net-
works in [11] and [12], the prior studies focused on time-independent scenarios. That is, the
optimization design is independent for each time slot. Moreover, for time-dependent scenarios,
[17] and [18] investigated the potentials of machine learning based resource allocation algorithms.
However, most of the proposed machine learning algorithms mainly focused on single UAV
scenarios or multi-UAV scenarios by assuming the availability of complete network information
for each UAV. In practice, it is non-trivial to obtain perfect knowledge of dynamic environments
due to the high movement speed of UAVs [19], [20], which imposes formidable challenges on
the design of reliable UAV-enabled wireless communications. Besides, most existing research
contributions focus on centralized approaches, which makes modeling and computational tasks
challenging as the network size continues to increase. Multi-agent reinforcement learning
(MARL) is capable of providing a distributed perspective on the intelligent resource management
for UAV-enabled communication networks especially when these UAVs only have individual local
information.
The main benefits of MARL are: 1) agents consider individual application-specific nature and
environment; 2) local exchanges between agents can be modeled and investigated; 3) difficulties
in modelling and computation can be handled in a distributed manner. The applications of MARL
for cognitive radio networks were studied in [21] and [22]. Specifically, in [21], the authors
focused on the feasibilities of MARL based channel selection algorithms for a specific scenario
with two secondary users. A real-time aggregated interference scheme based on MARL was
investigated in [22] for wireless regional area networks (WRANs). Moreover, in [23], the authors
proposed a MARL based channel and power level selection algorithm for device-to-device
(D2D) pairs in heterogeneous cellular networks. The potential of machine learning based user
clustering for mmWave-NOMA networks was presented in [24]. Therefore, invoking MARL
to UAV-enabled communication networks provides a promising solution for intelligent resource
management. However, due to the high mobility and adaptive altitude of UAVs, to the best of
our knowledge, resource allocation for multi-UAV networks has not been well investigated from
the perspective of MARL. Moreover, it is challenging for MARL based multi-UAV networks to
specify a suitable objective and to strike an exploration-exploitation tradeoff.
Motivated by the features of MARL and UAVs, this article aims to develop a MARL framework
for multi-UAV networks. More specifically, we consider a multi-UAV enabled downlink wireless
network, in which multiple UAVs try to communicate with ground users simultaneously. Each
UAV flies according to the predefined trajectory. It is assumed that all UAVs communicate with
ground users without the assistance of a central controller. Hence, each UAV can only observe
its local information. Based on the proposed framework, our major contributions are summarized
as follows:
1) We investigate the optimization problem of maximizing long-term rewards of multi-UAV
downlink networks by jointly designing user, power level and subchannel selection strate-
gies. Specifically, we formulate a quality of service (QoS) constrained energy efficiency
function as the reward function for providing reliable communications. Because of the
C. Organization
The rest of this article is organized as follows. In Section II, the system model for downlink
multi-UAV networks is presented. The problem of resource allocation is formulated and a
stochastic game framework for the considered multi-UAV network is presented in Section III. In
Section IV, a Q-learning based MARL algorithm for resource allocation is designed. Simulation
results are presented in Section V, which is followed by the conclusions in Section VI.
Fig. 1: An illustration of the considered multi-UAV network at a time slot, where three UAVs fly along predefined trajectories and communicate with ground users (legend: communication link; UAV's trajectory).
denoted by M = {1, · · · , M } and L = {1, · · · , L}, respectively. The ground users are randomly
distributed in the considered disk with radius rd . As shown in Fig. 1, multiple UAVs fly over
this region and communicate with ground users by providing direct communication connectivity
from the sky [1]. The total bandwidth $W$ on which the UAVs can operate is divided into $K$ orthogonal
subchannels, denoted by K = {1, · · · , K}. Note that the subchannels occupied by UAVs may
overlap with each other. Moreover, it is assumed that UAVs fly autonomously without human
intervention based on pre-programmed flight plans as in [27]. That is, the trajectories of UAVs
are predefined based on the pre-programmed flight plans. As shown in Fig. 1, three
UAVs fly over the considered region along their respective predefined trajectories. This
article focuses on the dynamic design of resource allocation for multi-UAV networks in terms of
user, power level and subchannel selections. Additionally, it is assumed that all UAVs communicate
without the assistance of a central controller and have no global knowledge of wireless com-
munication environments. In other words, the channel state information (CSI) between a UAV
and users is known only locally. This assumption is reasonable in practice due to the mobility of
UAVs, which is similar to the research contributions [19], [20].
$PL_{m,l}^{\mathrm{LoS}} = L_{m,l}^{\mathrm{FS}}(t) + \eta^{\mathrm{LoS}}$, (2a)
$PL_{m,l}^{\mathrm{NLoS}} = L_{m,l}^{\mathrm{FS}}(t) + \eta^{\mathrm{NLoS}}$, (2b)
where $L_{m,l}^{\mathrm{FS}}(t)$ denotes the free space pathloss with $L_{m,l}^{\mathrm{FS}}(t) = 20\log(d_{m,l}(t)) + 20\log(f) + 20\log(\frac{4\pi}{c})$, and $f$ is the carrier frequency. Furthermore, $\eta^{\mathrm{LoS}}$ and $\eta^{\mathrm{NLoS}}$ are the mean additional losses for LoS and NLoS, respectively. Therefore, at time slot $t$, the average pathloss between UAV $m$ and user $l$ can be expressed as
2) LoS Model: As discussed in [9], the LoS model provides a good approximation for practical
UAV-to-ground communications. In the LoS model, the path loss between a UAV and a ground
user relies on the locations of the UAV and the ground user as well as the type of propagation.
Specifically, under the LoS model, the channel gains between the UAVs and the users follow the
free space path loss model, which is determined by the distance between the UAV and the user.
Therefore, at time slot $t$, the LoS channel power gain from the $m$-th UAV to the $l$-th ground user can be expressed as
$g_{m,l}(t) = \beta_0 d_{m,l}^{-\alpha}(t) = \dfrac{\beta_0}{\left[\|\mathbf{v}_l - \mathbf{u}_m(t)\|^2 + H_m^2\right]^{\alpha/2}}$, (4)
where $\mathbf{u}_m(t) = (x_m(t), y_m(t))$, and $(x_m(t), y_m(t))$ denotes the location of UAV $m$ in the horizontal dimension at time slot $t$. Correspondingly, $\mathbf{v}_l = (x_l, y_l)$ denotes the location of user $l$. Furthermore, $\beta_0$ denotes the channel power gain at the reference distance of $d_0 = 1$ m, and $\alpha \geq 2$ is the path loss exponent.
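To make the LoS model concrete, the following is a minimal numerical sketch of the channel gain in (4); the function name and the example coordinates are illustrative, while $\beta_0 = -60$ dB, $\alpha = 2$ and $H = 100$ m follow the simulation settings in Section V.

```python
import numpy as np

def los_channel_gain(uav_xy, user_xy, altitude, beta0_db=-60.0, alpha=2.0):
    """LoS channel power gain g_{m,l}(t) = beta0 * d^{-alpha}, as in (4)."""
    beta0 = 10 ** (beta0_db / 10)                      # gain at the reference distance d0 = 1 m
    horiz_sq = np.sum((np.asarray(uav_xy) - np.asarray(user_xy)) ** 2)
    dist_sq = horiz_sq + altitude ** 2                 # ||v_l - u_m(t)||^2 + H_m^2
    return beta0 / dist_sq ** (alpha / 2)

# Example: UAV hovering above (0, 0) at 100 m altitude, user at (50, 120) m.
print(f"{los_channel_gain((0.0, 0.0), (50.0, 120.0), altitude=100.0):.3e}")
```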
B. Signal Model
In the UAV-to-ground transmission, the interference to each UAV-to-ground user pair is created by other UAVs operating on the same subchannel. Let $c_m^k(t)$ denote the subchannel indicator, where $c_m^k(t) = 1$ if subchannel $k$ is occupied by UAV $m$ at time slot $t$; $c_m^k(t) = 0$, otherwise. It satisfies
$\sum_{k \in \mathcal{K}} c_m^k(t) \leq 1$. (5)
That is, each UAV can only occupy a single subchannel at each time slot. Let $a_m^l(t)$ be the user indicator: $a_m^l(t) = 1$ if user $l$ is served by UAV $m$ in time slot $t$; $a_m^l(t) = 0$, otherwise. Therefore, the observed signal-to-interference-plus-noise ratio (SINR) for a UAV-to-ground user communication between UAV $m$ and user $l$ over subchannel $k$ at time slot $t$ is given by
$\gamma_{m,l}^k(t) = \dfrac{G_{m,l}^k(t)\, a_m^l(t)\, c_m^k(t)\, P_m(t)}{I_{m,l}^k(t) + \sigma^2}$, (6)
where $G_{m,l}^k(t)$ denotes the channel gain between UAV $m$ and user $l$ over subchannel $k$ at time slot $t$, $P_m(t)$ denotes the transmit power selected by UAV $m$ at time slot $t$, and $I_{m,l}^k(t)$ is the interference to UAV $m$ with $I_{m,l}^k(t) = \sum_{j \in \mathcal{M}, j \neq m} G_{j,l}^k(t)\, c_j^k(t)\, P_j(t)$. Therefore, at any time slot $t$, the SINR for UAV $m$ can be expressed as
$\gamma_m(t) = \sum_{l \in \mathcal{L}} \sum_{k \in \mathcal{K}} \gamma_{m,l}^k(t)$. (7)
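As a hedged illustration of (6) and (7), the sketch below computes the per-UAV SINR from given channel gains and binary selection indicators; the array layout and function name are assumptions made here for clarity only.

```python
import numpy as np

def uav_sinr(G, a, c, P, noise=1e-11):
    """Sum SINR per UAV as in (6)-(7).
    G[m, l, k]: channel gain between UAV m and user l on subchannel k.
    a[m, l]   : 1 if UAV m serves user l, else 0.
    c[m, k]   : 1 if UAV m occupies subchannel k, else 0.
    P[m]      : transmit power of UAV m."""
    M, L, K = G.shape
    gamma = np.zeros(M)
    for m in range(M):
        for l in range(L):
            for k in range(K):
                if a[m, l] * c[m, k] == 0:
                    continue
                # Interference from the other UAVs transmitting on subchannel k.
                interf = sum(G[j, l, k] * c[j, k] * P[j] for j in range(M) if j != m)
                gamma[m] += G[m, l, k] * P[m] / (interf + noise)
    return gamma

# Example: 2 UAVs, 3 users, 2 subchannels; both UAVs share subchannel 0.
rng = np.random.default_rng(0)
G = rng.uniform(1e-9, 1e-7, size=(2, 3, 2))
print(uav_sinr(G, a=np.array([[1, 0, 0], [0, 1, 0]]),
               c=np.array([[1, 0], [1, 0]]), P=np.array([0.1, 0.1])))
```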
In this article, discrete transmit power control is adopted at the UAVs [28]. The transmit power values that each UAV can use to communicate with its connected user are expressed as a vector $\mathbf{P} = \{P_1, \cdots, P_J\}$. For each UAV $m$, we define a binary variable $p_m^j(t)$, $j \in \mathcal{J} = \{1, \cdots, J\}$: $p_m^j(t) = 1$ if UAV $m$ selects to transmit at power level $P_j$ at time slot $t$; and $p_m^j(t) = 0$, otherwise. Since only one power level can be selected by UAV $m$ at each time slot $t$, we have
$\sum_{j \in \mathcal{J}} p_m^j(t) \leq 1, \ \forall m \in \mathcal{M}$. (8)
As a result, we can define a finite set of possible power level selection decisions made by UAV $m$, as follows:
$\mathcal{P}_m = \{p_m(t) \in \mathbf{P} \mid \sum_{j \in \mathcal{J}} p_m^j(t) \leq 1\}, \ \forall m \in \mathcal{M}$. (9)
Similarly, we also define finite sets of all possible subchannel selections and user selections by UAV $m$, respectively, which are given as follows:
$\mathcal{C}_m = \{c_m(t) \in \mathcal{K} \mid \sum_{k \in \mathcal{K}} c_m^k(t) \leq 1\}, \ \forall m \in \mathcal{M}$, (10)
$\mathcal{A}_m = \{a_m(t) \in \mathcal{L} \mid \sum_{l \in \mathcal{L}} a_m^l(t) \leq 1\}, \ \forall m \in \mathcal{M}$. (11)
To proceed further, we assume that the considered multi-UAV network operates on a discrete-
time basis where the time axis is partitioned into equal non-overlapping time intervals (slots).
Furthermore, the communication parameters are assumed to remain constant during each time
slot. Let $t$ denote an integer-valued time slot index. Particularly, each UAV holds the CSI of
all ground users and its decisions for a fixed time interval of $T_s \geq 1$ slots, which is called the decision
period. We consider the following scheduling strategy for the transmissions of UAVs: any UAV
is assigned a time slot t to start its transmission and must finish its transmission and select the
new strategy or reselect the old strategy by the end of its decision period, i.e., at slot t + Ts .
We also assume that the UAVs do not know the accurate duration of their stay in the network.
This feature motivates us to design an on-line learning algorithm for optimizing the long-term
energy-efficiency performance of multi-UAV networks.
In this section, we first describe the optimization problem investigated in this article. Then, to
model the uncertainty of stochastic environments, we formulate the problem of joint user, power
level and subchannel selections by UAVs as a stochastic game.
A. Problem Formulation
Note that from (6), to achieve the maximal throughput, each UAV tends to transmit at the maximal power level, which, in turn, results in increasing interference to other UAVs. Hence, to provide reliable communications among UAVs, the main goal of the dynamic design for joint user, power level and subchannel selection is to ensure that the SINRs provided by the UAVs are no less than the predefined thresholds. Specifically, the mathematical form can be expressed as
$\gamma_m(t) \geq \bar{\gamma}_m, \ \forall m \in \mathcal{M}$, (12)
where $\bar{\gamma}_m$ denotes the targeted QoS threshold of the users served by UAV $m$. At time slot $t$, if the
constraint (12) is satisfied, then the UAV obtains a reward Rm (t), defined as the difference
between the throughput and the cost of power consumption achieved by the selected user,
subchannel and power level. Otherwise, it receives a zero reward. Therefore, we can express
the reward function Rm (t) of UAV m at time slot t, as follows:
$R_m(t) = \begin{cases} \frac{W}{K}\log(1+\gamma_m(t)) - \omega_m P_m(t), & \text{if } \gamma_m(t) \geq \bar{\gamma}_m, \\ 0, & \text{otherwise}, \end{cases}$ (13)
for all m ∈ M and the corresponding immediate reward is denoted as Rm (t). In (13), ωm is the
cost per unit level of power. Note that at any time slot t, the instantaneous reward of UAV m
in (13) relies on: 1) the observed information: the individual user, subchannel and power level
decisions of UAV m, i.e., $a_m(t)$, $c_m(t)$ and $p_m(t)$; in addition, it also depends on the current
channel gain Gkm,l (t); 2) unobserved information: the subchannels and power levels selected by
other UAVs and the channel gains. It should be pointed out that we omit the fixed power
consumption for UAVs, such as the power consumed by controller units and data processing
[29].
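A minimal sketch of the reward in (13) is given below; the per-subchannel bandwidth, power cost $\omega$ and SINR threshold mirror the simulation values in Section V, and the natural logarithm is used here as one possible reading of log in (13).

```python
import numpy as np

def uav_reward(gamma_m, P_m, bw_per_subchannel=75e3, omega=100.0, gamma_bar=2.0):
    """QoS-constrained reward R_m(t) in (13): throughput minus a power cost
    when the SINR target is met, and zero otherwise."""
    if gamma_m >= gamma_bar:
        return bw_per_subchannel * np.log(1.0 + gamma_m) - omega * P_m
    return 0.0

# Example: SINR of 2.5 (above a ~3 dB target) with 0.1 W transmit power.
print(uav_reward(gamma_m=2.5, P_m=0.1))
```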
Next, we consider maximizing the long-term reward $v_m(t)$ by selecting the served user,
subchannel and transmit power level at each time slot. Particularly, we adopt a future discounted
reward [30] as the measurement for each UAV. Specifically, at a certain time slot of the process,
the discounted reward is the sum of its payoff in the present time slot, plus the sum of future
rewards discounted by a constant factor. Therefore, the considered long-term reward of UAV m
is given by
$v_m(t) = \sum_{\tau=0}^{+\infty} \delta^{\tau} R_m(t+\tau+1)$, (14)
where δ denotes the discount factor with 0 ≤ δ < 1. Specifically, values of δ reflect the effect of
future rewards on the optimal decisions: if δ is close to 0, the decision emphasizes
the near-term gain; by contrast, if δ is close to 1, it gives more weight to future rewards and
we say the decisions are farsighted.
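The discounted sum in (14) can be evaluated directly once a reward sequence is available; the snippet below is a small illustration with an arbitrary reward sequence and $\delta = 0.9$.

```python
def discounted_return(rewards, delta=0.9):
    """Long-term reward v_m(t) in (14): future rewards discounted by delta."""
    return sum((delta ** tau) * r for tau, r in enumerate(rewards))

# Rewards observed from slot t+1 onwards: 5 + 0.9*4 + 0.81*6 + ...
print(discounted_return([5.0, 4.0, 6.0, 3.0]))
```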
Next we introduce the set of all possible user, subchannel and power level decisions made
by UAV m, m ∈ M, which can be denoted as Θm = Am ⊗ Cm ⊗ Pm with ⊗ denoting
the Cartesian product. Consequently, the objective of each UAV $m$ is to make a selection $\theta_m^*(t) = (a_m^*(t), c_m^*(t), p_m^*(t)) \in \Theta_m$, which maximizes its long-term reward in (14). Hence, the optimization problem for UAV $m$, $m \in \mathcal{M}$, can be formulated as
$\theta_m^*(t) = \arg\max_{\theta_m \in \Theta_m} R_m(t)$. (15)
Note that the optimization design for the considered multi-UAV network consists of $M$ subproblems, which correspond to the $M$ different UAVs. Moreover, each UAV has no information about other UAVs, such as their rewards, hence one cannot solve problem (15) accurately. To solve the optimization problem (15) in stochastic environments, we formulate the problem of joint user, subchannel and power level selections by UAVs as a stochastic non-cooperative game in the following subsection.
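Since each UAV's action set is the Cartesian product $\Theta_m = \mathcal{A}_m \otimes \mathcal{C}_m \otimes \mathcal{P}_m$, it can be enumerated directly; the sketch below does so for illustrative set sizes, with a hypothetical helper name.

```python
from itertools import product

def build_action_space(num_users, num_subchannels, num_power_levels):
    """Enumerate Theta_m = A_m x C_m x P_m, i.e., every (user, subchannel,
    power level) triple a UAV may select in one time slot."""
    return list(product(range(num_users), range(num_subchannels), range(num_power_levels)))

# Example: L = 4 users, K = 2 subchannels, J = 3 power levels -> 24 actions.
actions = build_action_space(4, 2, 3)
print(len(actions), actions[:3])
```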
In this subsection, we model the formulated problem (15) by adopting a stochastic
game (also called Markov game) framework [25], since it is a generalization of Markov
decision processes to the multi-agent case.
In the considered network, $M$ UAVs communicate with users without any information about
the operating environment. It is assumed that all UAVs are selfish and rational. Hence, at any
time slot t, all UAVs select their actions non-cooperatively to maximize the long-term rewards
in (15). Note that the action for each UAV m is selected from its action space Θm . The action
conducted by UAV m at time slot t, is a triple θm (t) = (am (t), cm (t), pm (t)) ∈ Θm , where
am (t), cm (t) and pm (t) represent the selected user, subchannel and power level respectively, for
UAV m at time slot t. For each UAV m, denote by θ−m (t) the actions conducted by the other
M − 1 UAVs at time slot t, i.e., θ−m (t) ∈ Θ \ Θm .
As a result, the instantaneous SINR of UAV $m$ at time slot $t$ can be rewritten as
$\gamma_m(t)[\theta_m(t), \theta_{-m}(t), \mathbf{G}_m(t)] = \sum_{l \in \mathcal{L}} \sum_{k \in \mathcal{K}} \dfrac{S_{m,l}^k(t)[\theta_m(t), \theta_{-m}(t), \mathbf{G}_{m,l}(t)]}{I_{m,l}^k(t)[\theta_m(t), \theta_{-m}(t), \mathbf{G}_{m,l}(t)] + \sigma^2}$, (16)
where $S_{m,l}^k(t) = G_{m,l}^k(t)\, a_m^l(t)\, c_m^k(t)\, P_m(t)$, and $I_{m,l}^k(t)(\cdot)$ is given in (6). Furthermore, $\mathbf{G}_{m,l}(t)$ denotes the matrix of instantaneous channel responses between UAV $m$ and user $l$ at time slot $t$, which can be expressed as
$\mathbf{G}_{m,l}(t) = \begin{pmatrix} G_{1,l}^1(t) & \cdots & G_{1,l}^K(t) \\ \vdots & \ddots & \vdots \\ G_{M,l}^1(t) & \cdots & G_{M,l}^K(t) \end{pmatrix}$. (17)
Let s = (s1 , · · · , sM ) be a state vector for all UAVs. In this article, UAV m does not know the
states of other UAVs as the UAVs cannot cooperate with each other.
We assume that the actions of each UAV satisfy the Markov property, that is, the
reward of a UAV depends only on the current state and action. As discussed in [26], a Markov
chain is used to describe the dynamics of the states of a stochastic game where each player
has a single action in each state. Specifically, the formal definition of Markov chains is given
as follows.
Definition 1. A finite state Markov chain is a discrete stochastic process, which can be described
as follows: let a finite set of states $\mathcal{S} = \{s_1, \cdots, s_q\}$ and a $q \times q$ transition matrix $\mathbf{F}$ with each entry $0 \leq F_{i,j} \leq 1$ and $\sum_{j=1}^{q} F_{i,j} = 1$ for any $1 \leq i \leq q$. The process starts in one of the states and moves to another state successively. Assume that the chain is currently in state $s_i$. The probability of moving to the next state $s_j$ is given by the entry $F_{i,j}$, which depends only on the present state and not on the previous states; this is also called the Markov
property.
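As a small illustration of Definition 1 (not tied to the UAV setting), the following sketch samples a trajectory from a toy three-state transition matrix, where the distribution of the next state depends only on the present one.

```python
import numpy as np

# Toy 3-state transition matrix: each row sums to one, as required by Definition 1.
F = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])

rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(5):
    # Row F[state] gives the probabilities of the next state: the Markov property.
    state = rng.choice(len(F), p=F[state])
    trajectory.append(int(state))
print(trajectory)
```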
Here we put the time slot index $t$ in the superscript for notational compactness, and this convention is adopted in the remainder of this article. In (20), the instantaneous transmit power is a function of the action $\theta_m^t$, and the instantaneous rate of UAV $m$ is given by
$C_m^t(\theta_m^t, \theta_{-m}^t, \mathbf{G}_m^t) = \frac{W}{K}\log\left(1 + \gamma_m^t(\theta_m^t, \theta_{-m}^t, \mathbf{G}_m^t)\right)$. (21)
Notice that from (20), at any time slot $t$, the reward $r_m^t$ received by UAV $m$ depends on the current state $s_m^t$, which is fully observed, and the partially-observed actions $(\theta_m^t, \theta_{-m}^t)$. At the next time slot $t+1$, UAV $m$ moves to a new random state $s_m^{t+1}$ whose probability depends only on the previous state $s_m^t$ and the selected actions $(\theta_m^t, \theta_{-m}^t)$. This procedure repeats for an indefinite number of slots. Specifically, at any time slot $t$, UAV $m$ can observe its state $s_m^t$ and the corresponding action $\theta_m^t$, but it does not know the actions of the other players, $\theta_{-m}^t$, or the precise values of $\mathbf{G}_m^t$. The state transition probabilities are also unknown to each player UAV $m$. Therefore, the considered UAV system can be formulated as a stochastic game [31].
In a stochastic game, a mixed strategy πm : Sm → Θm , denoting the mapping from the state set
to the action set, is a collection of probability distributions over the available actions. Specifically, for UAV $m$ in state $s_m$, its mixed strategy is $\pi_m(s_m) = \{\pi_m(s_m, \theta_m) \mid \theta_m \in \Theta_m\}$, where each element $\pi_m(s_m, \theta_m)$ of $\pi_m(s_m)$ is the probability of UAV $m$ selecting action $\theta_m$ in state $s_m$. A joint strategy $\pi = \{\pi_1(s_1), \cdots, \pi_M(s_M)\}$ is a vector of strategies for the $M$ players with one strategy for each player. Let $\pi_{-m} = \{\pi_1, \cdots, \pi_{m-1}, \pi_{m+1}, \cdots, \pi_M\}$ denote the same strategy profile but without the strategy $\pi_m$ of player UAV $m$. Based on the above discussions, the optimization goal of each player UAV $m$ in the formulated stochastic game is to maximize its expected reward over time. Therefore, for player UAV $m$ under a joint strategy $\pi = (\pi_1, \cdots, \pi_M)$, which assigns a strategy $\pi_i$ to each UAV $i$, the optimization objective in (14) can be reformulated
as
$V_m(s, \pi) = \mathbb{E}\left\{\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+1} \mid s^t = s\right\}$, (22)
where $r_m^{t+\tau+1}$ represents the immediate reward received by UAV $m$ at time $t+\tau+1$ and
$\mathbb{E}\{\cdot\}$ denotes the expectation operation. In the formulated stochastic game, each player (UAV) has an individual expected reward that depends on the joint strategy and not only on its own strategy. Hence, one cannot simply expect the players to maximize their expected rewards, as it may not be possible for all players to achieve this goal at the same time. Next, we
describe a solution for the stochastic game by Nash equilibrium [32].
Definition 3. A Nash equilibrium is a collection of strategies, one for each player, so that each
individual strategy is a best-response to the others. That is, if a solution $\pi^* = \{\pi_1^*, \cdots, \pi_M^*\}$ is a Nash equilibrium, then for each UAV $m$, the strategy $\pi_m^*$ is such that
$V_m(\pi_m^*, \pi_{-m}^*) \geq V_m(\pi_m', \pi_{-m}^*), \ \forall \pi_m'$. (23)
It means that in a Nash equilibrium, each UAV’s action is the best response to other UAVs’
choice. Thus, in a Nash equilibrium solution, no UAV can benefit by changing its strategy as
long as all the other UAVs keep their strategies constant. Note that the presence of imperfect
information in the formulated non-cooperative stochastic game provides opportunities for the
players to learn their optimal strategies through repeated interactions with the stochastic envi-
ronment. Hence, each player UAV m is regarded as a learning agent whose task is to find a Nash
equilibrium strategy for any state sm . In the next section, we propose a multi-agent reinforcement-
learning framework for maximizing the sum expected reward in (22) with partial observations.
In this section, we first describe the proposed MARL framework for multi-UAV networks.
Then a Q-Learning based resource allocation algorithm will be proposed for maximizing the
expected long-term reward of the considered multi-UAV network.
Fig. 2: The MARL framework for multi-UAV networks, in which each UAV m observes its local state s_m(t) and reward r_m(t) and selects an action; the joint action of all UAVs interacts with the environment (CSI and network configurations).
Fig. 2 describes the key components of MARL studied in this article. Specifically, for each
UAV $m$, the left-hand side of the box is the locally observed information at time slot $t$, i.e., the state $s_m^t$ and reward $r_m^t$; the right-hand side of the box is the action of UAV $m$ at time slot $t$. The decision problem faced by a player in a stochastic game when all other players choose a fixed strategy profile is equivalent to a Markov decision process (MDP) [26]. An agent-independent
method is proposed, for which all agents conduct a decision algorithm independently but share
a common structure based on Q-learning.
Since the Markov property is used to model the dynamics of the environment, the rewards of UAVs are based only on the current state and action. The MDP for agent (UAV) $m$ consists of: 1) a discrete set of environment states $\mathcal{S}_m$, 2) a discrete set of possible actions $\Theta_m$, 3) a one-slot dynamics of the environment given by the state transition probabilities $F_{s_m^t \rightarrow s_m^{t+1}} = F(s_m^t, \theta, s_m^{t+1})$, and 4) a reward function giving the expected next reward for UAV $m$. For instance, given the current state $s_m$, action $\theta_m$ and the next state $s_m'$: $R_m(s_m, \theta_m, s_m') = \mathbb{E}\{r_m^{t+1} \mid s_m^t = s_m, \theta_m^t = \theta_m, s_m^{t+1} = s_m'\}$, where $r_m^{t+1}$ denotes the immediate
reward of the environment to UAV m at time t + 1. Notice that UAVs cannot interact with each
other, hence each UAV knows imperfect information of its operating stochastic environment.
In this article, Q-learning is used to solve MDPs, for which a learning agent operates in an
unknown stochastic environment and does not know the reward and transition functions [33].
Next we describe the Q-learning algorithm for solving the MDP of one UAV. Without loss of
generality, UAV m is considered for simplicity. Two fundamental concepts for
solving the above MDP are the state value function and the action value function (Q-function) [34].
Specifically, the former is in fact the expected reward for some state in (22), given that the agent follows some policy. Similarly, the Q-function for UAV $m$ is the expected reward starting from state $s_m$, taking the action $\theta_m$ and thereafter following policy $\pi$, which can be expressed as
$Q_m(s_m, \theta_m, \pi) = \mathbb{E}\left\{\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+1} \mid s^t = s, \theta_m^t = \theta_m\right\}$, (24)
where the corresponding values of (24) are called action values (Q-values).
Proposition 1. A recursive relationship for the state value function can be derived from the established return. Specifically, for any strategy $\pi$ and any state $s_m$, the following condition holds between two consecutive states $s_m^t = s_m$ and $s_m^{t+1} = s_m'$, with $s_m, s_m' \in \mathcal{S}_m$:
$V_m(s_m, \pi) = \mathbb{E}\left\{\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+1} \mid s_m^t = s_m\right\} = \sum_{s_m' \in \mathcal{S}_m} \sum_{\theta \in \Theta} F(s_m, \theta, s_m') \prod_{j \in \mathcal{M}} \pi_j(s_j, \theta_j) \times \left[R_m(s_m, \theta, s_m') + \delta V(s_m', \pi)\right]$, (25)
Note that the state value function $V_m(s_m, \pi)$ is the expected return when starting in state $s_m$ and following a strategy $\pi$ thereafter. Based on Proposition 1, we can also rewrite the Q-function in (24) into a recursive form, which is given by
$Q_m(s_m, \theta_m, \pi) = \mathbb{E}\left\{r_m^{t+1} + \delta\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+2} \mid s_m^t = s_m, \theta_m^t = \theta, s_m^{t+1} = s_m'\right\} = \sum_{s_m' \in \mathcal{S}_m} \sum_{\theta_{-m} \in \Theta_{-m}} F(s_m, \theta, s_m') \prod_{j \in \mathcal{M}\setminus\{m\}} \pi_j(s_j, \theta_j) \times \left[R(s_m, \theta, s_m') + \delta V_m(s_m', \pi)\right]$. (26)
Note that from (26), Q-values depend on the actions of all the UAVs. It should be pointed
out that (25) and (26) are the basic equations for the Q-learning based reinforcement learning
algorithm for solving the MDP of each UAV. From (25) and (26), we also can derive the following
relationship between state values and Q-values:
$V_m(s_m, \pi) = \sum_{\theta_m \in \Theta_m} \pi_m(s_m, \theta_m) Q_m(s_m, \theta_m, \pi)$. (27)
As discussed above, the goal of solving an MDP is to find an optimal strategy that obtains the maximal reward. An optimal strategy for UAV $m$ at state $s_m$ can be defined, from the perspective of the state value function, as in (28). Substituting (27) into (28), the optimal state value equation in (28) can be reformulated as in (30), where the fact that $\sum_{\theta_m} \pi(s_m, \theta_m) Q_m^*(s_m, \theta_m) \leq \max_{\theta_m} Q_m^*(s_m, \theta_m)$ was applied to obtain (30).
Note that in (30), the optimal state value equation is a maximization over the action space instead
of the strategy space.
Next, by combining (30) with (25) and (26), one can obtain the Bellman optimality equations for state values and for Q-values, respectively:
$V_m^*(s_m) = \sum_{\theta_{-m} \in \Theta_{-m}} \prod_{j \in \mathcal{M}\setminus\{m\}} \pi_j(s_j, \theta_j) \times \max_{\theta_m} \sum_{s_m'} F(s_m, \theta, s_m') \left[R(s_m, \theta_m, s_m') + \delta V_m^*(s_m')\right]$, (31)
and
$Q_m^*(s_m, \theta_m) = \sum_{\theta_{-m} \in \Theta_{-m}} \prod_{j \in \mathcal{M}\setminus\{m\}} \pi_j(s_j, \theta_j) \times \sum_{s_m'} F(s_m, \theta, s_m') \left[R(s_m, \theta_m, s_m') + \delta \max_{\theta_m'} Q_m^*(s_m', \theta_m')\right]$. (32)
Note that (32) indicates that the optimal strategy will always choose an action that maximizes the
Q-function for the current state. In the multi-agent case, the Q-function of each agent depends
on the joint action and is conditioned on the joint policy, which makes it complex to find an
optimal joint strategy [34]. To overcome these challenges, we consider UAVs as independent
learners (ILs); that is, UAVs do not observe the rewards and actions of the other UAVs, and they
interact with the environment as if no other UAVs exist.
In this subsection, an ILs [35] based MARL algorithm is proposed to solve the resource
allocation among UAVs. Specifically, each UAV runs a standard Q-learning algorithm to learn
its optimal Q-values and simultaneously determines an optimal strategy for the MDP. Specifically,
the selection of an action in each iteration depends on the Q-values of two states: $s_m$ and its
successor. Hence, Q-values provide insights into the future quality of the actions in the successor
state. The update rule for Q-learning [33] is given by
$Q_m^{t+1}(s_m, \theta_m) = Q_m^t(s_m, \theta_m) + \alpha^t\left[r_m^t + \delta \max_{\theta_m' \in \Theta_m} Q_m^t(s_m', \theta_m') - Q_m^t(s_m, \theta_m)\right]$, (33)
with $s_m^t = s_m$ and $\theta_m^t = \theta_m$, where $s_m'$ and $\theta_m'$ correspond to $s_m^{t+1}$ and $\theta_m^{t+1}$, respectively. Note
that an optimal action-value function can be obtained recursively from the corresponding action-
values. Specifically, each agent learns the optimal action-values based on the updating rule in
(33), where $\alpha^t$ denotes the learning rate and $Q_m^t$ is the action-value of UAV $m$ at time slot $t$.
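A minimal sketch of the update rule in (33) on a tabular Q-function is shown below; the table layout and the example numbers are assumptions made for illustration.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha, delta=0.9):
    """One application of the update rule (33) on a |S| x |Theta| Q-table."""
    td_target = reward + delta * np.max(Q[s_next])     # r^t + delta * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: 4 states, 3 actions, one observed transition (s=0, a=2) -> s'=1.
Q = np.zeros((4, 3))                                   # zero-initialized Q-values
print(q_update(Q, s=0, a=2, reward=1.5, s_next=1, alpha=0.5)[0])
```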
Another important component of Q-learning is action selection mechanisms, which are used
to select the actions that the agent will perform during the learning process. Its purpose is to
strike a balance between exploration and exploitation so that the agent can reinforce the evaluation it already knows to be good but also explore new actions [33]. In this article, we consider $\epsilon$-greedy exploration. In $\epsilon$-greedy selection, the agent selects a random action with probability $\epsilon$ and selects the best action, which corresponds to the highest Q-value at the moment, with probability $1-\epsilon$. As such, the probability of selecting action $\theta_m$ at state $s_m$ is given by
$\pi_m(s_m, \theta_m) = \begin{cases} 1-\epsilon, & \text{if } Q_m \text{ of } \theta_m \text{ is the highest}, \\ \epsilon, & \text{otherwise}, \end{cases}$ (34)
where $\epsilon \in (0, 1)$. To ensure the convergence of Q-learning, the learning rate $\alpha^t$ is set as in [36], which is given by
$\alpha^t = \dfrac{1}{(t + c_\alpha)^{\varphi_\alpha}}$, (35)
where $c_\alpha > 0$ and $\varphi_\alpha \in (\frac{1}{2}, 1]$.
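The $\epsilon$-greedy rule in (34) and the decaying learning rate in (35) can be sketched as follows, with $c_\alpha = 0.5$ and $\varphi_\alpha = 0.8$ taken from the simulation settings; the helper names are illustrative only.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Action selection as in (34): explore with probability epsilon,
    otherwise pick the action with the highest Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def learning_rate(t, c_alpha=0.5, phi_alpha=0.8):
    """Decaying step size alpha^t = 1 / (t + c_alpha)^phi_alpha as in (35)."""
    return 1.0 / (t + c_alpha) ** phi_alpha

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.2, 1.1, 0.4]), epsilon=0.5, rng=rng), learning_rate(10))
```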
Note that each UAV runs the Q-learning procedure independently in the proposed ILs based
MARL algorithm. Hence, for each UAV $m$, $m \in \mathcal{M}$, the Q-learning procedure is summarized in Algorithm 1. In Algorithm 1, the initial Q-values are set to zero; therefore, it is also called zero-initialized Q-learning [37]. Since UAVs have no prior information on the initial state, a UAV initially takes a strategy with equal probabilities, i.e., $\pi_m(s_m, \theta_m) = \frac{1}{|\Theta_m|}$.
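As an illustration of the per-UAV Q-learning procedure in Algorithm 1, the following is a minimal sketch that combines zero-initialized Q-values, the $\epsilon$-greedy selection in (34), the learning rate in (35) and the update rule in (33); the environment interface (observe/step) is a hypothetical stand-in for a UAV's local observations and rewards.

```python
import numpy as np

def run_independent_learner(env, num_states, num_actions, T=500, epsilon=0.5,
                            delta=0.9, c_alpha=0.5, phi_alpha=0.8, seed=0):
    """Independent Q-learning loop for a single UAV agent."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))           # zero-initialized Q-table
    s = env.observe()                                 # local state s_m^t (assumed API)
    for t in range(T):
        if rng.random() < epsilon:                    # explore
            a = int(rng.integers(num_actions))
        else:                                         # exploit the current Q-values
            a = int(np.argmax(Q[s]))
        reward, s_next = env.step(a)                  # local reward r_m^t and next state
        alpha = 1.0 / (t + c_alpha) ** phi_alpha      # learning rate in (35)
        Q[s, a] += alpha * (reward + delta * np.max(Q[s_next]) - Q[s, a])  # update (33)
        s = s_next
    return Q
```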
In this subsection, we investigate the convergence of the proposed MARL based resource
allocation algorithm. Notice that the proposed MARL algorithm can be treated as an independent
multi-agent Q-learning algorithm, in which each UAV as a learning agent makes a decision
based on the Q-learning algorithm. Therefore, the convergence is concluded in the following
proposition.
Proposition 2. In the proposed MARL algorithm of Algorithm 1, the Q-learning procedure of
each UAV always converges to the Q-values of its individually optimal strategy.
The proof of Proposition 2 depends on the following observations. Due to the non-cooperative
property of UAVs, the convergence of the proposed MARL algorithm is dependent on the
convergence of the Q-learning algorithm [35]. Therefore, we focus on the proof of convergence
for the Q-learning algorithm in Algorithm 1.
Theorem 1. The Q-learning algorithm in Algorithm 1 with the update rule in (33) converges
with probability one (w.p.1) to the optimal Q∗m (sm , θm ) value if
1) The state and action spaces are finite;
2) $\sum_{t=0}^{+\infty} \alpha^t = \infty$ and $\sum_{t=0}^{+\infty} (\alpha^t)^2 < \infty$ uniformly w.p. 1;
3) $\mathrm{Var}\{r_m^t\}$ is bounded;
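Note that the learning rate in (35) satisfies condition 2: since $\varphi_\alpha \in (\frac{1}{2}, 1]$, the corresponding series behave as

```latex
\sum_{t=0}^{\infty} \alpha^t = \sum_{t=0}^{\infty} \frac{1}{(t+c_\alpha)^{\varphi_\alpha}} = \infty
\ \ (\varphi_\alpha \le 1),
\qquad
\sum_{t=0}^{\infty} (\alpha^t)^2 = \sum_{t=0}^{\infty} \frac{1}{(t+c_\alpha)^{2\varphi_\alpha}} < \infty
\ \ (2\varphi_\alpha > 1).
```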
V. SIMULATION RESULTS
In this section, we verify the effectiveness of the proposed MARL based resource allocation
algorithm for multi-UAV networks by simulations. We consider multi-UAV networks deployed in
a disc area with a radius rd = 500 m. The ground users are randomly and uniformly distributed
inside the disk. All UAVs are assumed to fly at a fixed altitude H = 100 m. In the simulations,
the noise power is assumed to be $\sigma^2 = -80$ dBm, the subchannel bandwidth is $\frac{W}{K} = 75$ kHz, and $T_s = 0.1$ s. For the probabilistic model, the channel parameters in the simulations follow
[7], where a = 9.61 and b = 0.16. Moreover, the carrier frequency is f = 2 GHz, η LoS = 1 and
η NLoS = 20. For the LoS channel model, the channel power gain at reference distance d0 = 1
m is set as β0 = −60 dB and the path loss coefficient is set as α = 2 [11]. In the simulations,
the maximal power level number is J = 3, the maximal power for each UAV is Pm = P = 23
dBm, where the maximal power is equally divided into J discrete power values. The cost per
unit level of power is ωm = ω = 100 and the minimum SINR for the users is set as γ0 = 3 dB.
Moreover, $c_\alpha = 0.5$, $\varphi_\alpha = 0.8$ and $\delta = 1$.
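For reference, these simulation parameters can be grouped into a single configuration sketch; the dictionary and key names below are illustrative only, with dB/dBm values converted to linear scale where noted.

```python
SIM_PARAMS = {
    "cell_radius_m": 500.0,
    "uav_altitude_m": 100.0,
    "noise_power_w": 10 ** (-80 / 10) * 1e-3,   # -80 dBm
    "subchannel_bw_hz": 75e3,                   # W / K
    "decision_period_s": 0.1,                   # T_s
    "carrier_freq_hz": 2e9,
    "plos_a": 9.61,
    "plos_b": 0.16,
    "eta_los_db": 1.0,
    "eta_nlos_db": 20.0,
    "beta0_db": -60.0,
    "path_loss_exponent": 2.0,
    "num_power_levels": 3,                      # J
    "max_power_w": 10 ** (23 / 10) * 1e-3,      # 23 dBm
    "power_cost_omega": 100.0,
    "sinr_threshold_db": 3.0,
    "c_alpha": 0.5,
    "phi_alpha": 0.8,
    "discount_delta": 1.0,
}
```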
In Fig. 3, we consider a random realization of a multi-UAV network in the horizontal plane, where
L = 100 users are uniformly distributed in a disk with radius r = 500 m and two UAVs are
Fig. 3: A random realization of the considered multi-UAV network in the horizontal plane (legend: user, UAV, predefined trajectory).
initially located at the edge of the disk with the angle $\phi = \frac{\pi}{4}$. For illustrative purposes, Fig. 4 shows the average reward and the average reward per time slot of the UAVs under the setup of Fig. 3, where the speed of the UAVs is set as 40 m/s. Fig. 4(a) shows the average rewards with different $\epsilon$, which are calculated as $v^t = \frac{1}{M}\sum_{m \in \mathcal{M}} v_m^t$. As can be observed from Fig. 4(a), the average reward increases with the algorithm iterations. This is because the long-term reward can be improved by the proposed MARL algorithm. However, the curves of the average reward become flat when $t$ is higher than 250 time slots. In fact, the UAVs will fly outside the disk when
t > 250. As a result, the average reward will not increase. Correspondingly, Fig. 4(b) illustrates
the average instantaneous reward per time slot $r^t = \sum_{m \in \mathcal{M}} r_m^t$. As can be observed from Fig. 4(b), the average reward per time slot decreases with algorithm iterations. This is because the learning rate $\alpha^t$ in the adopted Q-learning procedure is a function of $t$ in (35), where $\alpha^t$ decreases as the time slot increases. Notice from (35) that $\alpha^t$ decreases with algorithm iterations, which means that the update rate of the Q-values becomes slow with increasing $t$. Moreover, Fig. 4 also investigates the average reward with different $\epsilon \in \{0, 0.2, 0.5, 0.9\}$. If $\epsilon = 0$, each UAV chooses a greedy action, which is also called the exploit strategy. If $\epsilon$ goes to 1, each UAV chooses a random action with a higher probability. Notice from Fig. 4 that $\epsilon = 0.5$ is a good choice in the considered setup.
In Fig. 5 and Fig. 6, we investigate the average reward under different system configurations.
Fig. 5 illustrates the average reward with the LoS channel model given in (4) over different $\epsilon$.
Fig. 4: Comparisons of average rewards with different ε, where M = 2 and L = 100: (a) comparisons of average rewards; (b) average rewards per time slot.
Fig. 5: Average rewards with different ε under the LoS channel model: (a) comparisons of average rewards; (b) average rewards per time slot.
Moreover, Fig. 6 illustrates the average reward under the probabilistic model with $M = 4$, $K = 3$ and $L = 200$. Specifically, the UAVs are initially randomly distributed at the cell edge. In the iteration procedure, each UAV flies over the cell along a straight line through the cell center, that is, the center of the disk. As can be observed from Fig. 5 and Fig. 6, the curves of the average reward show similar trends to those of Fig. 4 under different $\epsilon$. Besides, the considered multi-UAV network attains the optimal average reward when $\epsilon = 0.5$ under different network configurations.
Fig. 6: Average rewards with different ε under the probabilistic channel model, where M = 4, K = 3 and L = 200.
In Fig. 7, we investigate the average reward of the proposed MARL algorithm by comparing
it to the matching theory based resource allocation algorithm (Match). In Fig. 7, we consider the
same setup as in Fig. 4 but with $J = 1$ for simplicity of algorithm implementation, which indicates that the UAV's action only contains the user selection at each time slot. Furthermore, we assume that complete information exchange among UAVs is performed in the matching theory based user selection algorithm, that is, each UAV knows the other UAVs' actions before making its own decision. For comparison, in the matching theory based user selection procedure, we adopt the Gale-Shapley (GS) algorithm [38] at each time slot. Moreover, we also consider the performance of the random user selection algorithm (Rand) as a baseline scheme in Fig. 7. As can be observed from Fig. 7, the achieved average reward of the matching based user selection algorithm outperforms that of the proposed MARL algorithm. This is because there is no information exchange in the proposed MARL algorithm. In this case, each UAV cannot observe the other UAVs' information, such as rewards and decisions, and thus it makes its decision independently. Moreover, as can be observed from Fig. 7, the average reward of the random user selection algorithm is lower than that of the proposed MARL algorithm. This is because, due to the randomness of user selection, it cannot exploit the observed information effectively. As a result, the proposed MARL algorithm
can achieve a tradeoff between reducing the information exchange overhead and improving the
system performance.
Fig. 7: Comparisons of average rewards among the proposed MARL algorithm (MARLA), the matching theory based algorithm (Mach) and the random user selection algorithm (Rand).

In Fig. 8, we investigate the average reward as a function of the algorithm iterations and the UAV's speed, where a UAV, starting from a random initial location at the disc edge, flies over the disc along a straight line through the disc center with different speeds. The setup in Fig. 8 is the same as
that in Fig. 6 but with $M = 1$ and $K = 1$ for illustrative purposes. As can be observed, for a fixed speed, the average reward increases monotonically with the algorithm iterations. Besides, for a fixed time slot, the average reward with larger speeds increases faster than that with smaller speeds when $t$ is smaller than 150. This is due to the randomness of the locations of the users and the UAV: at the starting point, the UAV may not find an appropriate user satisfying its QoS requirement. Fig. 8 also shows that the achieved average reward decreases when the speed increases at the end of the algorithm iterations. This is because, if the UAV flies at a high speed, it takes less time to fly out of the disc. As a result, a UAV with a higher speed has less serving time than one with a lower speed.
VI. CONCLUSIONS
In this article, we investigated the real-time designs of resource allocation for multi-UAV
downlink networks to maximize the long-term rewards. Motivated by the uncertainty of en-
vironments, we proposed a stochastic game formulation for the dynamic resource allocation
problem of the considered multi-UAV networks, in which the goal of each UAV was to find a
strategy of the resource allocation for maximizing its expected reward. To reduce the overhead of information exchange and computation, we developed an ILs based MARL algorithm to solve the formulated stochastic game, where each UAV made its decisions independently based
on Q-learning. Simulation results revealed that the proposed MARL based resource allocation
algorithm for the multi-UAV networks can attain a tradeoff between the information exchange
overhead and the system performance. One promising extension of this work is to consider
more complicated joint learning algorithms for multi-UAV networks with the partial information
exchange, that is, with the need for cooperation. Moreover, incorporating the optimization of deploy-
ment and trajectories of UAVs into multi-UAV networks is capable of further improving energy
efficiency of multi-UAV networks, which is another promising future research direction.
Here, we derive the recursive relationship of the state value function for UAV $m$ in (25). For UAV $m$ with state $s_m \in \mathcal{S}_m$ at time step $t$, its state value function can be expressed as
$V(s_m, \pi) = \mathbb{E}\left\{\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+1} \mid s_m^t = s_m\right\} = \mathbb{E}\left\{r_m^{t+1} + \delta\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+2} \mid s_m^t = s_m\right\} = \mathbb{E}\left\{r_m^{t+1} \mid s_m^t = s_m\right\} + \delta\,\mathbb{E}\left\{\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+2} \mid s_m^t = s_m\right\}$, (A.1)
where the first part and the second part represent the expected immediate reward and the state value function, respectively, at time $t+1$ over the state space and the action space. Next we show the relationship between the first part and the reward function $R(s_m, \theta, s_m')$ with $s_m^t = s_m$, $\theta_m^t = \theta$ and $s_m^{t+1} = s_m'$:
$\mathbb{E}\left\{r_m^{t+1} \mid s_m^t = s_m\right\} = \sum_{s_m' \in \mathcal{S}_m} \sum_{\theta \in \Theta} F(s_m, \theta, s_m') \prod_{j \in \mathcal{M}} \pi_j(s_j, \theta_j) \times \mathbb{E}\left\{r^{t+1} \mid s_m^t = s_m, \theta_m^t = \theta_m, s_m^{t+1} = s_m'\right\} = \sum_{s_m' \in \mathcal{S}_m} \sum_{\theta \in \Theta} F(s_m, \theta, s_m') \prod_{j \in \mathcal{M}} \pi_j(s_j, \theta_j) R_m(s_m, \theta, s_m')$, (A.2)
where the definition of Rm (sm , θ, s0m ) has been used to obtain the final step. Similarly, the second
part can be transformed into
$\mathbb{E}\left\{\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+2} \mid s_m^t = s_m\right\} = \sum_{s_m' \in \mathcal{S}_m} \sum_{\theta \in \Theta} F(s_m, \theta, s_m') \prod_{j \in \mathcal{M}} \pi_j(s_j, \theta_j) \times \mathbb{E}\left\{\sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+2} \mid s_m^t = s_m, \theta_m^t = \theta_m, s_m^{t+1} = s_m'\right\} = \sum_{s_m' \in \mathcal{S}_m} \sum_{\theta \in \Theta} F(s_m, \theta, s_m') \prod_{j \in \mathcal{M}} \pi_j(s_j, \theta_j) V(s_m', \pi)$. (A.3)
The proof of Theorem 1 follows from the idea in [36], [39]. Here we give a more general
procedure for Algorithm 1. Note that the Q-learning algorithm is a stochastic form of value
iteration [36], which can be observed from (26) and (32). That is to perform a step of value
iteration requires knowing the expected reward and the transition probabilities. Therefore, to
prove the convergence of the Q-learning algorithm, stochastic approximation theory is applied.
We first introduce a result of stochastic approximation given in [36].
converges to zero w.p.1 if and only if the following conditions are satisfied.
$\Delta_m^{t+1}(s_m, \theta_m) = (1 - \alpha^t)\Delta_m^t(s_m, \theta_m) + \alpha^t \delta \Psi_m^t(s_m, \theta_m)$, (B.3)
where
$\Psi_m^t(s_m, \theta_m) = r_m^t + \delta \max_{\theta_m' \in \Theta_m} Q_m^t(s_m', \theta_m') - Q_m^*(s_m, \theta_m)$. (B.5)
Therefore, the Q-learning algorithm can be seen as the random process of Lemma 1 with
β t = αt .
Next we prove that $\Psi_m^t(s_m, \theta_m)$ has properties 3) and 4) of Lemma 1. We start by
showing that $\Psi_m^t(s_m, \theta_m)$ corresponds to a contraction mapping with respect to some maximum norm.
for any x1 , x2 ∈ X .
Proposition 3. There exists a contraction mapping H for the function q with the form of the
optimal Q-function in (B.8). That is
$\|Hq_1(s_m, \theta_m) - Hq_2(s_m, \theta_m)\|_\infty \leq \delta \|q_1(s_m, \theta_m) - q_2(s_m, \theta_m)\|_\infty$, (B.7)
Proof. From (32), the optimal Q-function for Algorithm 1 can be expressed as
$Q_m^*(s_m, \theta_m) = \sum_{s_m'} F(s_m, \theta_m, s_m') \times \left[R(s_m, \theta_m, s_m') + \delta \max_{\theta_m'} Q_m^*(s_m', \theta_m')\right]$. (B.8)
Hence, we have
$Hq(s_m, \theta_m) = \sum_{s_m'} F(s_m, \theta_m, s_m') \times \left[R(s_m, \theta_m, s_m') + \delta \max_{\theta_m'} q(s_m', \theta_m')\right]$. (B.9)
To obtain (B.7), we make the calculations in (B.10). Note that the definition of $q$ is used in step (a), while steps (b) and (c) follow from properties of absolute value inequalities. Moreover, step (d) comes from the definition of the infinity norm and step (e) is based on the maximum calculation.
$\|Hq_1(s_m, \theta_m) - Hq_2(s_m, \theta_m)\|_\infty \overset{(a)}{=} \max_{s_m, \theta_m} \delta \left| \sum_{s_m'} F(s_m, \theta_m, s_m') \left[\max_{\theta_m'} q_1(s_m', \theta_m') - \max_{\theta_m'} q_2(s_m', \theta_m')\right] \right| \overset{(b)}{\leq} \max_{s_m, \theta_m} \delta \sum_{s_m'} F(s_m, \theta_m, s_m') \left|\max_{\theta_m'} q_1(s_m', \theta_m') - \max_{\theta_m'} q_2(s_m', \theta_m')\right| \overset{(c)}{\leq} \max_{s_m, \theta_m} \delta \sum_{s_m'} F(s_m, \theta, s_m') \max_{\theta_m'} \left|q_1(s_m', \theta_m') - q_2(s_m', \theta_m')\right| \overset{(d)}{=} \max_{s_m, \theta_m} \delta \sum_{s_m'} F(s_m, \theta, s_m') \|q_1(s_m', \theta_m') - q_2(s_m', \theta_m')\|_\infty \overset{(e)}{=} \delta \|q_1(s_m', \theta_m') - q_2(s_m', \theta_m')\|_\infty$. (B.10)
Furthermore, the expectation of $\Psi_m^t(s_m, \theta_m)$ satisfies
$\mathbb{E}\{\Psi_m^t(s_m, \theta_m)\} = HQ_m^t(s_m, \theta_m) - Q_m^*(s_m, \theta_m)$. (B.11)
Finally, for the variance,
$\mathrm{Var}\{\Psi_m^t(s_m, \theta_m)\} = \mathbb{E}\left\{r_m^t + \delta \max_{\theta_m' \in \Theta_m} Q_m^t(s_m', \theta_m') - Q_m^*(s_m, \theta_m) - HQ_m^t(s_m, \theta_m) + Q_m^*(s_m, \theta_m)\right\} = \mathbb{E}\left\{r_m^t + \delta \max_{\theta_m' \in \Theta_m} Q_m^t(s_m', \theta_m') - HQ_m^t(s_m, \theta_m)\right\} = \mathrm{Var}\left\{r_m^t + \delta \max_{\theta_m' \in \Theta_m} Q_m^t(s_m', \theta_m')\right\}$. (B.13)
[16] U. Challita, W. Saad, and C. Bettstetter, “Cellular-connected UAVs over 5G: Deep reinforcement learning for interference
management,” CoRR, vol. abs/1801.05500, 2018. [Online]. Available: http://arxiv.org/abs/1801.05500
[17] M. Chen, W. Saad, and C. Yin, “Liquid state machine learning for resource allocation in a network of cache-enabled
LTE-U UAVs,” in IEEE Proc. of Global Commun. Conf. (GLOBECOM), Dec 2017, pp. 1–6.
[18] J. Chen, Q. Wu, Y. Xu, Y. Zhang, and Y. Yang, “Distributed demand-aware channel-slot selection for multi-UAV networks:
A game-theoretic learning approach,” IEEE Access, vol. 6, pp. 14 799–14 811, 2018.
[19] N. Sun and J. Wu, “Minimum error transmissions with imperfect channel information in high mobility systems,” in IEEE
Proc. of Mil. Commun. Conf. (MILCOM), Nov. 2013, pp. 922–927.
[20] Y. Cai, F. R. Yu, J. Li, Y. Zhou, and L. Lamont, “Medium access control for unmanned aerial vehicle (UAV) ad-hoc
networks with full-duplex radios and multipacket reception capability,” IEEE Trans. Veh. Technol., vol. 62, no. 1, pp.
390–394, Jan. 2013.
[21] H. Li, “Multi-agent Q-learning of channel selection in multi-user cognitive radio systems: A two by two case,” in IEEE
International Conference on Systems, Man and Cybernetics, Oct. 2009, pp. 1893–1898.
[22] A. Galindo-Serrano and L. Giupponi, “Distributed Q-learning for aggregated interference control in cognitive radio
networks,” IEEE Trans. Veh. Technol., vol. 59, no. 4, pp. 1823–1834, May 2010.
[23] A. Asheralieva and Y. Miyanaga, “An autonomous learning-based algorithm for joint channel and power level selection
by D2D pairs in heterogeneous cellular networks,” IEEE Trans. Commun., vol. 64, no. 9, pp. 3996–4012, Sep. 2016.
[24] J. Cui, Z. Ding, P. Fan, and N. Al-Dhahir, “Unsupervised machine learning based user clustering in mmWave-NOMA
systems,” IEEE Trans. Wireless Commun., to be published.
[25] A. Nowé, P. Vrancx, and Y.-M. De Hauwere, “Game theory and multi-agent reinforcement learning,” in Reinforcement
Learning. Springer, 2012, pp. 441–470.
[26] A. Neyman, “From Markov chains to stochastic games,” in Stochastic Games and Applications. Springer, 2003, pp. 9–25.
[27] J. How, Y. Kuwata, and E. King, “Flight demonstrations of cooperative control for UAV teams,” in AIAA 3rd “Unmanned
Unlimited” Technical Conference, Workshop and Exhibit, 2004, p. 6490.
[28] J. Zheng, Y. Cai, Y. Liu, Y. Xu, B. Duan, and X. Shen, “Optimal power allocation and user scheduling in multicell
networks: Base station cooperation using a game-theoretic approach,” IEEE Trans. Wireless Commun., vol. 13, no. 12, pp.
6928–6942, Dec. 2014.
[29] B. Uragun, “Energy efficiency for unmanned aerial vehicles,” in International Conference on Machine Learning and
Applications and Workshops, vol. 2, Dec. 2011, pp. 316–320.
[30] Y. Shoham and K. Leyton-Brown, Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge
University Press, 2008.
[31] A. Neyman and S. Sorin, Stochastic games and applications. Springer Science & Business Media, 2003, vol. 570.
[32] M. J. Osborne and A. Rubinstein, A course in game theory. MIT press, 1994.
[33] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998.
[34] G. Neto, “From single-agent to multi-agent reinforcement learning: Foundational concepts and methods,” Learning theory
course, 2005.
[35] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, “Independent reinforcement learners in cooperative markov games: a
survey regarding coordination problems,” The Knowledge Engineering Review, vol. 27, no. 1, pp. 1–31, 2012.
[36] T. Jaakkola, M. I. Jordan, and S. P. Singh, “Convergence of stochastic iterative dynamic programming algorithms,” in
Advances in neural information processing systems, 1994, pp. 703–710.
[37] S. Koenig and R. G. Simmons, “Complexity analysis of real-time reinforcement learning applied to finding shortest paths
in deterministic domains,” Carnegie Mellon University, School of Computer Science, Tech. Rep., 1992.
[38] D. Gale and L. S. Shapley, “College admissions and the stability of marriage,” The American Mathematical Monthly,
vol. 69, no. 1, pp. 9–15, 1962.
[39] F. S. Melo, “Convergence of Q-learning: A simple proof,” Institute Of Systems and Robotics, Tech. Rep, pp. 1–4, 2001.