Reinforcement Learning-Based Resource Allocation For M2M Communications Over Cellular Networks
∗
Dept. of Electrical, Electronic and Communication Engineering, Military Institute of Science and Technology, Bangladesh
†
Dept. of Electrical and Electronic Engineering, American International University-Bangladesh, Bangladesh
‡
James Watt School of Engineering, University of Glasgow, UK
?
Department of Engineering, Manchester Metropolitan University, UK
Email: [email protected], [email protected], [email protected], [email protected],
[email protected]
Abstract—The spectrum efficiency can be greatly enhanced by the deployment of machine-to-machine (M2M) communications through cellular networks. Existing resource allocation approaches allocate maximum resource blocks (RBs) for cellular user equipments (CUEs). However, M2M user equipments (MUEs) share the same frequency among themselves within the same tier. This generates co-tier interference, which may deteriorate the MUE's quality-of-service (QoS). To tackle this problem and improve the user experience, in this paper, we propose a novel resource utilization policy, which exploits a reinforcement learning (RL) algorithm considering the pointer network (PN). In particular, we design an optimization problem that determines the optimal frequency and power allocation needed to maximize the achievable rate performance of all M2M pairs and CUEs in the network subject to the co-tier interference and QoS constraints. The proposed scheme enables the user equipment (UE) to autonomously select an available channel and optimal power to maximize the network capacity and spectrum efficiency while minimizing co-tier interference. Moreover, the proposed scheme is compared with traditional spectrum allocation schemes. Simulation results demonstrate the superiority of the proposed scheme over the traditional schemes. Moreover, the convergence of the proposed scheme, which reduces the computational complexity (CC), is investigated.

Index Terms—M2M communications, resource allocation, throughput, pointer network, reinforcement learning.

I. INTRODUCTION

By 2020, 4 billion devices are linked over 25 billion embedded intelligent systems creating 50 trillion GB of data [1]. Following these figures, the internet of things (IoT), in particular wireless IoT, is a potential candidate for the future smart world. Its vast adoption puts forward several technical challenges, which include network design and storage architecture for smart devices, effective information transmission protocols, proactive IoT device identification, malicious attack prevention, technology standardization and appliance interfaces. As a result, machine-to-machine (M2M) communications is deemed a promising paradigm for addressing these issues and offering efficient operation of beyond fifth generation (B5G) and sixth generation (6G) cellular networks. Besides, unlike conventional communications, M2M communications involve direct link transmission with the evolved node B (eNB), resulting in many benefits, such as enhancing user data rate, decreasing power consumption, and reducing end-to-end (E2E) latency [2]. For M2M user equipments (MUEs) coexisting with cellular user equipments (CUEs), there are two different types of deployment: (i) overlaying mode, and (ii) underlaying mode. In the underlaying mode, CUEs and MUEs share the same radio resources and therefore interfere with each other. In the overlaying mode, dedicated spectrum resources are allocated without creating cross-tier interference. However, since a high number of users exist in the wireless network, the radio resources are usually inadequate [3], [4]. In order to fully exploit the potential of underlaid M2M communications, it is essential to provide the proper power for each UE by using a power control scheme and to design an efficient machine learning (ML)-based resource utilization policy that mitigates the co-tier and cross-tier interference between MUEs and CUEs.

A. Related Works

In recent years, there has been a great deal of attention to stochastic optimization and robust optimization methods because of the need to handle the unpredictable value of channel state information (CSI) in M2M communications. The studies in [4] addressed CSI by maximizing the predicted linking capacity through the use of stochastic optimization. In addition, accurate CSI is often not possible or may require high feedback rates, which makes the channel condition uncertain. The authors of [2], [5] studied how to maximize the resource efficiency (RE) in M2M communications by using joint power control and spectrum allocation to improve the user's data rate and prolong the battery lifetime of UE by facilitating the reuse of radio resources between MUEs and CUEs. In real-time signal transmission, the size and shape parameters of the uncertainty set would fluctuate with the channel conditions [6], [7]. That is, CSI frequently changes due to the high mobility of CUEs and MUEs. However, the preceding works presented so far emphasize M2M communications with ideal CSI [3]. Therefore, it is critical to investigate how to meet the increasing demand of higher transmission rate in M2M communications.
Authorized licensed use limited to: McGill University. Downloaded on March 31,2024 at 16:59:09 UTC from IEEE Xplore. Restrictions apply.
978-1-6654-4266-4/22/$31.00 ©2022 IEEE 1473
2022 IEEE Wireless Communications and Networking Conference (WCNC)
B. Motivation

Next generation wireless networks will generate a tremendous amount of data related to network statistics, such as user traffic, channel occupancy, channel quality, etc. This will induce unmanageable overhead that largely increases delay, computation, and energy consumption of network elements [8]. A neural network (NN) can leverage this data to develop automated and fine-grained schemes to optimize network radio resources. When integrated with an NN, the multipurpose pointer network (PN) can predict sequences over variable-length input dictionaries, improving the sequences with the assistance of input attention and generalizing to variable-size output dictionaries. However, the PN based on the NN is not taken into account in [7], and interferences are not taken into account in [4]. Furthermore, resource allocation techniques based on the conventional optimization theory, such as multidimensional knapsack problems (MKP), greedy algorithm, and heuristic algorithm [6], [7], are generally highly complex and not feasible for real-time applications. Reinforcement learning (RL) is deemed a promising algorithm to solve cellular communication problems, especially for spectrum allocation, data offloading, adaptive modulation, power control and interference mitigation, more efficiently compared to supervised and unsupervised learning algorithms. However, RL algorithms exhibit low convergence speed and low overall efficiency while working with large state–action spaces in complex networks. Moreover, in combined resource sharing and power controlling schemes, RL is unable to manage large action and state spaces. Therefore, this paper considers RL-based PN for solving multi-dimensional state space and complex discrete action space problems and proposes a joint power and spectrum allocation algorithm. The key contributions of this paper can be summarized as follows.

• This paper proposes an RL-based resource utilization policy for M2M communications over cellular networks.
• We formulate a mixed integer non-linear programming problem, which is NP-hard. Then, we assign orthogonal subfrequencies to different MUEs and CUEs to maximize the sum rate of the network. Moreover, the M2M pairs are permitted to reuse resources to better use the scarce resources when the co-tier interference is below the threshold interference. Besides, MUEs can choose a number of channels and proper power to transmit services as soon as possible without affecting the traditional communication of CUEs.
• This paper considers the RL algorithm empowered PN architecture, which is based on a low-complexity process to effectively utilize the spectrum resources. Besides, the PN is a prominent type of NN which efficiently solves combinatorial optimization related problems in M2M communications. Furthermore, this method achieves the goal of significantly improving the QoS of the system, such as optimizing system capacity and simultaneously reducing interference.
• Extensive simulations are carried out for evaluating the resource utilization policy of the proposed scheme. It is also found that the proposed scheme reduces the computational complexity (CC). Also, the proposed scheme provides better network performance than the traditional schemes.

The rest of the paper is organized as follows. Section II illustrates the system model. We put forward the radio resource utilization policy and develop the RL algorithm with low CC in Section III. Section IV presents simulation results, followed by conclusions in Section V.

II. PROPOSED NETWORK MODEL

A. System Model

The system model considers the downlink data transfer situation when the eNB is located in the center of the cell. The M2M communication comprises a pair of devices that are able to directly transmit data without the help of the eNB. On the other hand, the CUEs are mobile terminals that can only be connected via the eNB. There are M M2M pairs and N CUEs deployed randomly in the coverage area of the eNB as shown in Fig. 1. UEs operate in orthogonal frequency bands following the orthogonal frequency division multiple access (OFDMA) technique. In this case, MUEs do not share the spectrum with CUEs. The shortage of resource blocks (RBs) forces the MUEs to share the same frequency, which also causes co-tier interference between MUEs. Moreover, we assume that the complete CSI is accessible [9]. In other words, for simplicity, the eNB is capable of obtaining the full CSI between CUEs and M2M pairs.

Fig. 1. Proposed system model of M2M communications over cellular networks.

B. Performance Metrics

As OFDMA is incorporated for CUE and MUE transmissions, MUE receivers are subject to the interference caused only by other MUE transmitters that reuse the same frequency. In this sense, the co-tier interference for the MUE receiver at subfrequency k is given as follows

$$I_k = \sum_{m=1}^{M} v^*_{m,k}\, p^*_{m,k}\, h^*_{m,k}, \tag{1}$$

where $v^*_{m,k} \in \{0,1\}$ designates whether the subfrequency k is allocated to M2M pair m; if the m-th M2M pair reuses the
subfrequency k, $v^*_{m,k}$ is set to 1, and to 0 otherwise. Also, $p^*_{m,k}$ denotes the transmit power while $h^*_{m,k}$ is the channel gain on subfrequency k between the MUE receiver and other MUE transmitters, including the 3rd generation partnership project (3GPP) path loss (PL) model and channel fading on both M2M pairs and CUEs that follows the Rayleigh distribution with uniform variance. The signal to interference-plus-noise ratio (SINR) expression for CUE n and the M2M pair m at

$$\sum_{k=1}^{K} \sum_{n=1}^{N} w_n^k \le 1, \quad \forall n \in N,\ \forall k \in K, \tag{8}$$

$$w_n^k \in \{0, 1\}, \quad \forall k, n, \tag{9}$$

$$v_m^k \in \{0, 1\}, \quad \forall k, m, \tag{10}$$

$$\sum^{N} \sum^{K} \ldots$$
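As a hedged illustration of the co-tier interference in (1), the following Python sketch evaluates the sum for one subfrequency. The helper names, transmit powers and distances are hypothetical. The path loss uses the M2M model listed in Table I, with the logarithm assumed to be base 10, and the Rayleigh fading is assumed to have unit variance, so that its power gain is exponentially distributed with mean 1.

```python
import math
import random


def path_loss_db(distance_km):
    """M2M-pair path loss in dB: 148 + 40*log10(d[km]) (base 10 is assumed)."""
    return 148.0 + 40.0 * math.log10(distance_km)


def channel_gain(distance_km, rng):
    """Linear channel gain: path loss combined with a Rayleigh-fading power term.

    Assumption: unit-variance Rayleigh fading, so the power gain is
    exponentially distributed with mean 1.
    """
    fading_power = rng.expovariate(1.0)
    return fading_power * 10.0 ** (-path_loss_db(distance_km) / 10.0)


def co_tier_interference(v, p, h):
    """Eq. (1): I_k = sum over m of v[m] * p[m] * h[m] on one subfrequency k."""
    if any(flag not in (0, 1) for flag in v):
        raise ValueError("allocation indicators must be binary")
    return sum(vm * pm * hm for vm, pm, hm in zip(v, p, h))


rng = random.Random(0)
v = [1, 0, 1]            # pairs 0 and 2 reuse subfrequency k (illustrative)
p = [0.2, 0.2, 0.1]      # transmit powers in watts (illustrative)
h = [channel_gain(d, rng) for d in (0.05, 0.08, 0.12)]  # distances in km
I_k = co_tier_interference(v, p, h)
```

Only the pairs whose indicator equals 1 contribute to `I_k`, which mirrors the role of the binary variables in (10).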
power control technique to mitigate the interference at the M2M receivers, such as

$$p^{k}_{n^*} \ge \frac{\gamma_{th}\,(I^{k}_{max} + N_0)}{h^{k}_{n^*}}, \tag{15}$$

where $I^{k}_{max}$ is the highest permitted interference at each subfrequency.

2) Spectrum allocation for each MUE: After finding the allocation for each CUE, the subfrequency assignment for M2M pairs can be given as

$$OP3: \max_{\{V\}} R_m, \tag{16}$$

s.t. (10), (12), and (13).

From equation (2), (12) is rewritten as follows

$$\min_{n} \frac{p_n^k h_n^k}{\gamma_{th}} - N_0 B_n \ge 0. \tag{17}$$

Similarly, (13) is rewritten as

$$\sum_{m=1}^{M} v^*_{m,k}\, p^*_{m,k}\, h^*_{m,k} \le \min_{m \in S_k} \frac{P h_m^k}{\gamma_{th}} - N_0 B_m + v_m^k P h_m^k. \tag{18}$$

After the alteration of (12) and (13), (OP3) is transformed into a two-dimensional (2D) knapsack problem (KP); that is, the KP has two weight dimensions. It is applied to achieve the optimal solution for resource allocation in M2M communications under power and threshold interference constraints [9]. This type of issue is a part of the MKP [9] and only water-filling algorithms can be applied to address this type of NP-hard problem. However, efficient water-filling algorithms require high computation time, making them unsuitable for real-time applications [5]. In this article, the PN is employed to successfully handle the problem of combinatorial optimization.

C. Proposed Algorithm Representation

1) PN architecture: The RL-based optimal resource allocation scheme is proposed in Algorithm 1. As above, a 2D KP is a resource optimization problem for each subfrequency k. Given the CUEs' resource distribution state, each of the M2M pairs is a 3D characteristics vector (v, x, y), where v is the achievable data rate for M2M pairs according to (5), and x and y are the weights on the 2D KP limitations, which may be obtained from (17) and (18), respectively. The PN is a special type of recurrent NN which differentiates the encoder and the decoder via distinct colors. The input of v, x and y should be in a sequence that comprises the 3D characteristics vectors as specified, because the PN is built on the sequence-to-sequence model. The output is also a sequence and may be obtained from the PN by using the pointing mechanism, which re-arranges the input. The output is a collection of valid entities that meet the requirements. In particular, we traverse the solution sequence and terminate when the obtained entities exceed the limitations of (17) and (18). The identified entities are the solution to the KP. We name it solution o and designate V(o) as the total value of the corresponding set of the entities. The details of the PN framework are provided in [9].

Algorithm 1: RL-based resource utilization policy
  Step 1: Spectrum distribution for CUEs;
  Set n = [1, 2, ..., N], m = [1, 2, ..., M] and entire number of simulation times;
  Initialize: $R_m = 0$, $R_n = 0$ and $p_n^k = P^{max}_N$;
  for k = [1, 2, ..., K] do
      Obtain $n^*$ on subfrequency k;
      Update the optimal transmission power of each CUE according to (15);
      Update the data rate of each CUE according to (17);
  end
  Step 2: Spectrum distribution for MUEs;
  Assign the training set S, amount of training phases T, batch size Q and PN parameter P;
  for t = [1, 2, ..., T] do
      Choose a batch of samples $s_b$ for b ∈ [1, 2, ..., Q];
      Trial solution $o_b$ based on $\theta_p(\cdot|s_b)$ for b ∈ [1, 2, ..., Q];
      Calculate value: $V(o_b|s_b)$;
      Update the PG ($g_p$) according to (22);
      Update the parameter of the PN according to (23);
      Update the baseline function $q(s_b) = q(s_b) + \alpha (V(o_b|s_b) - q(s_b))$ for b ∈ [1, 2, ..., Q];
  end
  for k = [1, 2, ..., K] do
      Utilize the PN to calculate $v_m^k$ for the m-th MUE;
      Update the data rate for each MUE;
  end

2) RL algorithm: The RL algorithm is considered a proper method to train an NN while solving combinatorial optimization problems. Our proposed low-complexity policy-empowered RL aims at optimizing the parameters of a PN, which are symbolized as P. Besides, the expected tour length expressed by an input graph s is given below [10]

$$J(P|s) = E_{o \sim \theta_p(\cdot|s)} V(o|s). \tag{19}$$

The graphs are generated from distribution S while training, where the overall training target includes sampling from the graph distribution and is written as follows

$$J(P) = E_{s \sim S} J(P|s). \tag{20}$$

For optimizing the parameters, we resort to policy gradient (PG) techniques and stochastic gradient descent. Using the RL algorithm, the gradient of (20) can be expressed as [10]

$$\nabla_p J(P|s) = E_{o \sim \theta_p(\cdot|s)} \left[ (V(o|s) - q(s)) \nabla_p \log \theta_p(o|s) \right], \tag{21}$$

where q(s) denotes a baseline function in the training procedure that does not depend on the ordering o produced by the proposed network and estimates the expected value to reduce the variance of the gradients [10]. The popular RL algorithm has been utilized to extract the gradient for improving the
network parameters using the Adam method [10]. The proposed resource allocation scheme uses the RL algorithm in a more practical form with Monte Carlo sampling. After assigning the batch size Q, by producing Q independently and identically distributed sample KPs, the gradient is stated in a stochastic mean form as

$$g_p = \frac{1}{Q} \sum_{b=1}^{Q} (V(o_b|s_b) - q(s_b)) \nabla_p \log \theta_p(o_b|s_b), \tag{22}$$

$$P = ADAM(P, g_p). \tag{23}$$

The baseline function is an exponential moving average of the system rewards achieved by the network over time, to account for the fact that the policy improves with training.

3) CC analysis: Here we analyze the CC of the proposed scheme. According to Step 1, the CC of the suggested Algorithm 1 is O(N). The long short-term memory (LSTM) units with attention are the elementary modules in the PN of Step 2. The CC for every given fine-tuned PN is O(M^2), as the attention variable calculation is done M times at every generation step [9], [10]. Thus, the entire CC of the proposed scheme is O(M^2 + N).

A model-free policy-based RL algorithm optimizes the parameters without knowledge of the environment. This algorithm measures the time of making model inference, that is, the step for the trained model to make decisions which minimizes the error on training samples, while keeping the bound on its model complexity small. Thus, the computation time of the proposed scheme is shorter than that of existing ML schemes, as analyzed in Fig. 4.

IV. SIMULATION RESULTS AND ANALYSIS

A. Simulation Setup

The MATLAB-based simulation results for the suggested scheme are presented in this section. The CUEs and M2M pairs are randomly placed in the network. For both CUEs and M2M pairs, the channel fading follows the Rayleigh distribution with uniform variance. Table I lists the main parameters of the simulation in detail. We compare the proposed algorithm with two traditional schemes, namely M2M mode and cellular mode. In the M2M mode, the M2M pairs communicate with each other without using learning-based resource utilization [5]. Furthermore, in the cellular mode, the CUE communicates without using learning-based resource utilization in the proposed network [5]. Moreover, the proposed mechanism selects the suitable communication mode using the RL-empowered PN-based resource utilization policy without considering the UE selection.

TABLE I
THE PARAMETERS OF SIMULATION

Parameter                             Value
Cell radius                           250 m
Carrier frequency                     2 GHz
System bandwidth                      6 MHz
Per RB bandwidth                      200 kHz
Number of M2M pairs                   30
Number of CUEs                        10
M2M communication distance            50 m
Maximum transmission power [2]        30 dBm
Noise power                           −174 dBm/Hz
PL model between eNB and CUE [2]      128.1 + 37.6 log(d[km])
PL model between M2M pair [2]         148 + 40 log(d[km])
Threshold SINR                        10 dB
Training samples                      2000
Testing samples                       500
Batch size                            64
Hidden layer                          64

B. The Throughput Comparison of M2M Pairs

Fig. 2 shows the average throughput versus the number of M2M pairs. The average throughput achieved by the proposed scheme rises as the number of M2M pairs rises. However, the rise is not remarkable due to the interference limitation in the cellular mode. From the figure, it can be observed that the proposed approach achieves a significant improvement in the network performance while mitigating the co-tier interference of the M2M links. The M2M mode maximizes the SINR of M2M links compared to traditional methods, leading to only a slightly better performance than the cellular mode.

Fig. 2. Average throughput versus number of M2M pairs.

C. The Convergence of the Proposed Mechanism

Fig. 3 shows the reward comparison of the proposed scheme, M2M mode and cellular mode. As illustrated in the figure, our proposed scheme outperforms other existing M2M distribution algorithms in the proposed network. It shows that when the iteration increases, the network performance of users is improved, and our proposed method is much better than other methods. Compared to traditional resource distribution approaches, MUEs achieve reasonable system performance while co-tier interference is appropriately managed. Moreover, we can observe that the proposed scheme converges very rapidly. This is because the proposed scheme adopts the PN, which reduces the CC. The numbers of M2M pairs and CUEs are set to be 30 and 10, respectively, which are randomly
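To make the batch gradient in (22) and the baseline rule of Algorithm 1 more concrete, the following minimal Python sketch applies them to a toy problem. It rests on several assumptions that are not part of the paper: a softmax policy over hypothetical logits stands in for the PN, plain gradient ascent stands in for the Adam step of (23), a single scalar baseline tracks the batch-mean value instead of a per-sample q(s_b), and the rewards are illustrative numbers.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - pr for i, pr in enumerate(probs)]


def batch_gradient(logits, actions, values, baseline):
    """Eq. (22): batch mean of (V - q) * gradient of the log-probability."""
    grad = [0.0] * len(logits)
    for action, value in zip(actions, values):
        advantage = value - baseline
        for i, g in enumerate(grad_log_prob(logits, action)):
            grad[i] += advantage * g / len(actions)
    return grad


def train(rewards, steps=300, batch_size=64, lr=0.5, alpha=0.1, seed=0):
    """Toy training loop; returns the final action probabilities and baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        actions = rng.choices(range(len(rewards)), weights=probs, k=batch_size)
        values = [rewards[a] for a in actions]
        grad = batch_gradient(logits, actions, values, baseline)
        # Plain gradient ascent is used here in place of the Adam step in (23).
        logits = [w + lr * g for w, g in zip(logits, grad)]
        # Moving-average baseline: q <- q + alpha * (V - q), with V the batch mean.
        baseline += alpha * (sum(values) / batch_size - baseline)
    return softmax(logits), baseline


probs, baseline = train(rewards=[1.0, 2.0, 5.0])
```

In this toy setting the probability mass should concentrate on the action with the largest reward, while the baseline drifts toward the average reward the policy currently obtains.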
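The decoding rule described for the PN architecture (walk the pointer-ordered sequence and stop once the limits derived from (17) and (18) would be exceeded) can be sketched as follows. This is a minimal sketch under stated assumptions: the (v, x, y) vectors and the two limits are hypothetical numbers, the function and variable names are illustrative, and a value-per-weight sort stands in for the ordering that a trained PN would output.

```python
def decode_knapsack(order, items, x_limit, y_limit):
    """Walk the ordered sequence and stop at the first item that would
    exceed either weight limit; return chosen indices and their total value."""
    chosen = []
    total_value = 0.0
    used_x = 0.0
    used_y = 0.0
    for index in order:
        value, x, y = items[index]
        if used_x + x > x_limit or used_y + y > y_limit:
            break  # terminate once a limit would be exceeded
        chosen.append(index)
        total_value += value
        used_x += x
        used_y += y
    return chosen, total_value


# Hypothetical (v, x, y) vectors for four M2M pairs on one subfrequency.
items = [(3.0, 2.0, 1.0), (5.0, 4.0, 3.0), (2.0, 1.0, 1.0), (4.0, 3.0, 2.0)]

# Stand-in for the PN output: sort by value per unit of combined weight.
order = sorted(
    range(len(items)),
    key=lambda i: items[i][0] / (items[i][1] + items[i][2]),
    reverse=True,
)

chosen, total = decode_knapsack(order, items, x_limit=6.0, y_limit=4.0)
```

Here `chosen` plays the role of the solution o and `total` the role of V(o); replacing the sort with the output sequence of a trained PN leaves the decoding step unchanged.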