Gedikli et al., 2022 — Deep reinforcement learning based flexible preamble allocation
Computer Networks
Keywords: Deep reinforcement learning; Preamble allocation; Network slicing; 5G; RAN; M2M

Abstract: One of the most difficult challenges in radio access network slicing occurs in the connection establishment phase, where multiple devices use a common random access channel in order to gain access to the network. It is now very well known that random access channel congestion is a serious issue in the case of sporadic arrivals of machine-to-machine nodes and may result in a significant delay for all nodes. Hence, random access channel resources also need to be allocated to different services to enable random access network slicing. In the random access channel procedure, the nodes transmit a selected preamble from a predefined set of preambles. If multiple nodes transmit the same preamble at the same random access channel opportunity, a collision occurs at the eNodeB. To isolate one service class from others during this phase, one approach is to allocate different preamble subsets to different service classes. This research proposes an adaptive preamble subset allocation method using deep reinforcement learning. The proposed method can distribute preambles to different service classes according to their priority, providing virtual isolation for service classes. The results indicate that the proposed mechanism can quickly adapt the preamble allocation according to the changing traffic demands of service classes.
https://doi.org/10.1016/j.comnet.2022.109202
Received 19 April 2021; Received in revised form 16 May 2022; Accepted 15 July 2022
Available online 21 July 2022
1389-1286/© 2022 Elsevier B.V. All rights reserved.
services. In both LTE and 5G, there are 54 PRACH preambles, each one selected by devices randomly and transmitted for channel allocation requests. If multiple nodes transmit the same preamble at the same RAO, a collision occurs at the base station. To isolate one service class from others during this phase, one approach is to divide RACH resources among different service classes. Static allocation of those resources, however, may result in inefficiencies when the traffic generated by each service changes significantly over time. For example, in the event of a power outage and restore, all IoT devices in the network may try to reconnect to the network at the same time since they wake up at the same time. There may also be temporary dense networks, such as networks of thousands of spectators gathered in a stadium for a football match [9,10]. Moreover, since 5G requires more heterogeneous and smaller cells, the number of users in each cell could vary greatly and dynamically due to user mobility [11]. All these example scenarios highlight the need for a dynamic allocation mechanism for up-link resources, since the contention on resources may cause delays and collisions. Hence, slicing the resource into dynamic groups is more suitable to be able to keep the delay and collision probabilities around the desired levels.

Dynamic slicing of the RACH preamble resources is a challenging problem since the eNodeB has limited information about the number of nodes waiting for different services. Previous studies on service differentiation in the RACH context focused either on heuristic methods or on analytical models. This research focuses on a reinforcement learning (RL) approach for this complex and dynamic problem. RL methods have the advantage of being able to operate without exact system models. RL-based approaches have become very attractive for solving problems in networking and communication due to their very characteristics, such as autonomous decision making and obtaining the best policy with the minimum information for network entities [12]. Moreover, it is stated in the literature [13] that DRL-based network resource allocation schemes outperform conventional resource allocation schemes. Also, they have the ability to adapt to changes in an environment and to give a suitable action automatically with the help of their reward mechanism [14]. Furthermore, where a large amount of data may not be available to train complex deep learning models, RL might provide a practical solution by learning from a network environment [15]. These features make RL very attractive for dynamic preamble allocation and suitable for changing service requirements over time. Therefore, in this study, the use of reinforcement learning for automatic RAN slicing is investigated.

An adaptive preamble subset allocation method is proposed in this research using deep reinforcement learning (DRL) in order to increase QoS by preserving the balance between service types, such that the PRACH could be shared with respect to the priorities of services and the traffic in the network. The proposed method can distribute preambles to different service classes according to their priority, providing virtual isolation of service classes. The results indicate that the proposed mechanism can quickly adapt the preamble allocation according to the changing traffic demands of service classes. Besides, several reward functions are proposed for the RL algorithm and the behavior of these functions is mathematically analyzed. Simulations indicate that the behavior of the proposed RL mechanism closely follows the proposed mathematical analyses. The contributions of this current study could be summarized as follows:

• DRL is proposed in order to solve the preamble allocation problem in 5G. To the best of our knowledge, this is the first study in the literature that explores the use of DRL for allocating preambles dynamically based on the priorities of services and the traffic in the network.
• The proposed method is evaluated thoroughly on varying simulated networks where traffic could increase or decrease suddenly or constantly. Since at least three services are defined in 5G and this number could increase up to 5 services [6], network scenarios with 2, 3 and 5 slices are considered in the experiments. Please note that evaluations are generally carried out for 2 or 3 slices in the literature; however, 5 slices are also taken into account here.
• The experimental results show that the proposed approach successfully allocates preambles dynamically and improves the access probability of slices to the PRACH based on their priority levels. Moreover, it can adapt very quickly to changes in the network. The proposed approach is compared with a recent approach [16] in the literature and shown to produce results comparable to [16].
• Three reward functions, namely the Successful Preambles Reward Function (SRF), the Proportional Reward Function (PRF) and the Collision Penalizing Reward Function (CRF), are proposed for the reinforcement learning and mathematically analyzed. CRF has been shown to be the most successful, since it penalizes collisions.

The rest of the paper is organized in the following manner. In Section 2, related work is introduced. The system model is given in Section 3. In Section 4, RL-based preamble grouping is explained. The analysis of the proposed 3 reward functions of RL is carried out in Section 5. The evaluation and results are discussed in Section 6. The discussion on results, limitations and possible future work is given in Section 7. Finally, Section 8 is devoted to concluding remarks.

2. Related work

In this section, studies on RACH congestion control, reinforcement learning in wireless networks and RAN slicing in the literature are reviewed.

The best-known congestion control technique, Access Class Barring (ACB), prevents user equipment (UE) from accessing RACH resources when there is congestion in the network. The eNodeB periodically distributes a probability factor and a barring time. Then, the UEs select a random number. If the selected random number is lower than the probability factor, the UE is permitted to access RACH resources. Otherwise, the UE waits a random amount of time based on the barring time distributed by the eNodeB.

Since ACB does not consider multiple service classes, Extended Access Barring (EAB) is proposed as an extension of ACB for service differentiation among different M2M devices. In this scheme, the main aim is to reduce the number of collisions among delay-tolerant M2M devices, so generally the delay is higher than in ACB [17]. There are also a few studies that incorporate service differentiation into the random access channel by distributing multiple barring factors to different service classes in combination with different techniques [18–20]. While the study in [18] bases its strategy on three (low, medium and high) ACB factors and names it multiple ACB (MACB), the authors in [19] propose a method for delay-sensitive devices by combining ACB and RACH partitioning. Moreover, they assert that the work in [18] does not realize partitioning of the RACH and that it ignores the quality of delay-tolerant devices. Another multiple ACB method [20] introduces a priority-based access class barring (PACB) algorithm that achieves higher throughput when compared to MACB and some other relevant techniques.

There are also a limited number of studies proposing to prioritize RACH access through preamble separation [21–24]. Most of these studies propose a fixed preamble separation configuration which divides the preambles into two groups. For example, the total set of preambles is divided into two groups in [21,23]: one group is exclusively reserved for H2H devices and the other group is either reserved for M2M or can be used by both M2M and H2H devices. The authors in [24] proposed to partition the preambles into 3 groups by introducing a higher priority traffic type for the smart grid. However, the groupings are not adaptive. Another static approach where PRACH slots are assigned to different service classes is also proposed in [22]. The issue with these works is that the preamble groupings are static, which may be inadequate to respond to changes in the traffic of different service classes.

There are few studies that investigate the use of dynamic preamble subset allocation in the literature [6,25,16]. The main difference between the approach in [6] and this research is that the authors
use heuristic algorithms for subset allocation whereas reinforcement learning is used in this research. Besides, they consider a setup where nodes transmit multiple consecutive preambles whereas this research considers single preamble transmission as in the LTE standard. Another dynamic preamble separation method, which is combined with binary exponential back-off with respect to three different priority classes, is proposed in [26]. A load adaptive dynamic preamble allocation method called LATMAPA [25] is proposed for prioritization in the context of 5G Random Access. However, it proposes a solution for only 2 slices, which are delay tolerant and delay intolerant.

Very recently, an online control method for dynamic preamble distribution over prioritized preamble groups is proposed in [16]. At first, the number of active devices in each priority is estimated in a recursive Bayesian way. Then, together with this estimation, a novel heuristic algorithm that distributes the preambles over the service groups with respect to their priority levels in a dynamic manner is applied. Finally, they extend their approach with ACB. In terms of separating the preambles and prioritizing the services, [16] shows similarity to the approach proposed in this current study. For example, while they classify the devices into different priorities by assigning priority weights, priority coefficients are employed in this study. Furthermore, both studies support any number of services. Therefore, the approach in this current study is compared with the mentioned approach by using a scenario given in [16]. Moreover, the ideal algorithm given as the baseline in [16] is used in the experiments.

RL-based approaches have recently been proposed for several problems in radio access networks. In general, these works are focused on optimizing the allocation of resources and increasing the quality of service (QoS). One relevant study on RACH proposes ACB barring rate adaptation using Q-learning to increase the success probability of M2M communications with low impact on H2H communication [27]. Another study employs RL to optimize the joint allocation of fiber and radio resources for Cloud RANs. The authors note that RL improves performance with respect to genetic and tabu search algorithms [28]. There are also several studies about mobile edge computing (MEC) to optimize the allocation of network and computing resources [29–31]. The study in [29] proposes a DRL solution to allocate the networking resources adaptively to reduce service time for users under the diversified MEC environment. The authors in [30] exploit Deep Q-Network (DQN) in order to solve the computational offloading problem in multi-user MEC systems and to avoid the curse of dimensionality. A similar work [31] focuses on vehicle-assisted computational offloading using DQN in a similar manner. Another application of RL is to improve the energy efficiency of heterogeneous networks through user scheduling and resource allocation [32].

A DRL-based random access optimization method is proposed in [33] in order to dynamically adjust the ACB parameter. The prioritization is addressed using different ACB parameters for service types; however, only three types are sampled. Furthermore, the types are assumed to have default traffic characteristics. Hence, the diversity of service types and flexible characteristics of traffic are not addressed in [33]. To the best of our knowledge, RL has not been applied to the problem of preamble subset allocation for network slicing. The advantage of using RL with respect to approaches based on heuristic methods is that there is no assumption on the traffic model in RL. RL can learn to maximize channel performance regardless of the traffic distribution.

Network slicing is also gaining popularity as a key enabler technology for the 5G vision, and RAN slicing is one of its main components. The main aim of RAN slicing is to enable dynamic on-demand allocation of radio resources among multiple services. A run time slicing method that isolates RAN slices and a set of algorithms in order to partition inter-slice resources are proposed in [34]. Another RAN slicing method allocates radio resources between enhanced mobile broadband and vehicle-to-everything services using RL combined with a heuristic algorithm to maximize utilization [35]. A two-layer scheduler approach is proposed in [36] in order to manage the balance between isolation and efficiency, where the first layer is used to allocate resources to each slice and the second one is used to allocate resources for each user. Another RL-based slice admission controller which makes its decisions based on resource availability is also proposed in [37].

3. System model

In a typical LTE configuration, 64 preambles are used in the random access procedure. While 10 preambles are reserved for contention-free access, 54 preambles are allocated for contention-based access [38]. Hence, it is assumed that the total number of available preambles is 54 and that the system contains 𝑁 slices with different service requirements. Therefore, the 54 preambles used in the random access procedure are divided into 𝑁 different groups. Each preamble group is assigned to a slice and a UE is only allowed to select a preamble from the group of its service class. Hence, each slice is isolated from the rest of the slices in the sense that the preamble transmitted by a UE can only collide with the other UEs in the same slice. There are studies in the literature that aim to model and analyze the PRACH considering factors like path loss, low transmission power of nodes and deafness of nodes [39–41]. In this research, these factors are ignored and the system is assumed to have an ideal channel where preamble losses occur only due to collisions.

In the model, in each RAO, the nodes which have collided will remain backlogged in the system and will attempt to transmit a new preamble in the next round. The preamble groupings are adaptive and it is assumed that these groupings are announced by the eNodeB before each RAO. Hence, the backlogged nodes will respect the new preamble groupings announced by the eNodeB in the next round. As well as the backlogged nodes, newly arriving nodes will also be attempting to transmit in the next slot.

The notations used throughout this study and how they are calculated are listed in Table 1. Here, 𝑀 represents the total available contention-based PRACH orthogonal distinguishable signal (preamble) count in each RAO period (54). 𝑁 shows the number of slices in the network. 𝑊 is the maximum number of allowed preamble transmission attempts; if it is reached, the preamble requests are removed from the backlog and evaluated as dropped. 𝑚𝑗 represents the number of preambles reserved for slice 𝑗 and it is announced by the eNodeB periodically. At the end of each RAO period, the contention on the PRACH is resolved, and the number of collided preambles for slice 𝑗 (𝑐𝑗), the number of successful preambles for slice 𝑗 (𝑠𝑗) and the number of unused preambles for slice 𝑗 (𝑢𝑗) are obtained in that period. These values, together with the priority coefficients (𝑘𝑗), are used in the reward functions; hence, the reward value for slice 𝑗 (𝑟𝑗) can be calculated. 𝑛𝑗 here represents the number of new arrival preamble requests for slice 𝑗. The 𝑛𝑗 values are determined for each scenario in testing and these requests are added to the testing backlogs (𝑏ᵗ𝑗) in every RAO period (10 ms). On the other hand, the 𝑛𝑗 values are randomly generated in training, and the requests are added to the training backlogs (𝑏ᵗʳ𝑗). Finally, 𝜆𝑗, which is widely used in the literature, represents the normalized arrival rate for slice 𝑗.

A sample RAO is shown in Fig. 1. In the figure, the eNodeB has announced that 9 preambles are reserved for slice-1, 12 preambles are reserved for slice-2, etc. There were 22 UEs that transmitted a preamble from slice-1 and four of them became successful, as they are the ones that did not experience a preamble collision. Hence, the number of backlogged UEs is reduced to 18 after the RAO. Slicing the preamble resources among services as exemplified in Fig. 1 is called preamble grouping.

4. Reinforcement learning based preamble grouping

In recent years, there has been a tremendous increase in the use of deep reinforcement learning in order to solve highly-complex problems. Here, it is used for the adaptive selection of the preamble groups. In the following parts, the proposed RL-based preamble grouping technique
Table 1
Notations.

Notation | Explanation | Value | Formula
𝑀 | Total number of available preambles for contention-based RACH | 54 | –
𝑁 | Number of slices in the available system model | – | –
𝑊 | Maximum number of allowed preamble transmission attempts | 10 | –
𝑚𝑗 | Number of preambles reserved to slice 𝑗 for the RAO | – | –
𝑐𝑗 | Number of collided preambles in the RAO period for slice 𝑗 | – | –
𝑠𝑗 | Number of successful preambles in the RAO period for slice 𝑗 | – | –
𝑢𝑗 | Number of unused preambles in the RAO period for slice 𝑗 | – | 𝑚𝑗 − 𝑐𝑗 − 𝑠𝑗
𝑛𝑗 | Number of new arrival preamble requests in the RAO period for slice 𝑗 | – | –
𝑏ᵗʳ𝑗 | Number of backlogged preamble requests waiting for the next RAO period for slice 𝑗, used in the training | – | –
𝑏ᵗ𝑗 | Number of backlogged preamble requests waiting for the next RAO period for slice 𝑗, used in the testing | – | –
𝑏𝑗 | Number of backlogged nodes waiting for the next RAO period for slice 𝑗 | – | –
𝜆𝑗 | Normalized arrival rate for slice 𝑗 | – | 𝜆𝑗 = 𝑛𝑗∕𝑀
𝑘𝑗 | Reward or priority coefficient for successful transmission of a preamble request for slice 𝑗 | 𝑗 | –
𝑟𝑗 | Collected reward value in a RAO period for slice 𝑗 | – | 𝑠𝑗𝑘𝑗 − 𝑐𝑗 for CRF; 𝑘𝑗 𝑠𝑗∕(𝑚𝑗 − 𝑢𝑗) for PRF
is explained in detail. Firstly, the problem focused on in this research is formulated in the RL context and, then, the RL algorithm that is used to solve the problem at hand is described. Further, the training procedure used in the system model is clarified.

In the RL context, the agent takes actions according to its policy to interact with the environment. The policy is a function that maps the environment's state into the agent's actions. As a result of the actions, the environment returns a reward and the state of the environment changes. Observing this new state, the agent chooses a new action and the interaction goes on. During this interaction, the agent refines its policy based on the collected rewards to reach an optimal policy that maximizes the expected cumulative reward.

In this problem, the learning agent is the eNodeB and the action in RL corresponds to the preamble groupings for each slice. More specifically, the eNodeB decides on the number of preambles for each slice such that their sum will be 54. Before each random access opportunity, the grouping is distributed in a System Information Block (SIB) along with other RACH parameters. The grouping information includes the slice id, the number of preambles reserved for this slice and the first preamble index of the group. The environment is the random access channel where a number of UEs are waiting to transmit a preamble. Based on the announced preamble groupings, each UE transmits one of the preambles from its respective preamble group. Each node randomly selects its preamble and the number of UEs transmitting is not known to the eNodeB. Moreover, the number of UEs waiting to transmit a preamble changes as new UEs arrive at the channel over time. Hence, the environment is stochastic.

Fig. 2 illustrates the interaction between the agent and the simulation environment. There are two neural networks in policy optimization based DRL. The policy network plays the actor role and produces the action space to the environment. Meanwhile, the value network produces state values by assigning each state a score calculated using the sum of rewards and the state of the previous round, so that the states with higher rewards have more value in the network. An action which results in a better state is preferred since it produces a higher reward.

Here, a scenario where the eNodeB policy is trained offline but tested with actual traffic after deployment is assumed. RL algorithms start by exploring the action space. At the initial state, the first actions are mostly random. Moreover, their initial behavior may be very far from optimum. Hence, RL algorithms take a considerable amount of time in order to converge, especially if the training process starts from scratch.

The most crucial aspect in the RL framework is the reward function. The reward function defines the objective of the problem. In this study, the reward is defined as a function of the number of successful preambles and collided preambles for each service class. For each service class
𝑗, a reward function is calculated as follows:

𝑅𝑗(𝑡) = 𝑠𝑗𝑘𝑗 − 𝑐𝑗    (1)

where 𝑠𝑗 and 𝑐𝑗 are the number of successful and collided preambles from class 𝑗, respectively. 𝑘𝑗 is defined as the reward or priority coefficient for class 𝑗, which is used to prioritize among different slices. The higher the priority of a slice, the higher its reward coefficient should be. Then, the total reward is calculated as the sum of rewards for all service classes:

𝑅(𝑡) = Σ𝑗 𝑅𝑗(𝑡).    (2)

The selection of this reward function is discussed in detail in Section 5. The state of the environment is the number of UEs from each service class waiting to transmit a preamble. However, this information is not available to the eNodeB; hence, it uses the most recent observation about the channel in terms of successful and collided preambles. The state is defined as

𝑆(𝑡) = ⋃𝑗 𝑆𝑗(𝑡)    (3)
𝑆𝑗(𝑡) = {𝑐𝑗, 𝑠𝑗}.    (4)

The policy of the agent maps the state to actions and the RL algorithm optimizes the policy. Here, trust region policy optimization (TRPO), which has recently achieved state-of-the-art results in various AI tasks, is employed [42]. In the TRPO algorithm, the two dependent neural networks shown in Fig. 2 are for approximating the value function and the policy function. From the TRPO perspective, the value network is used to calculate the expected future reward from the current policy. The policy network is used for approximating the policy function. This network gets the current state of the environment as input and produces the probability distributions of each action using the value function. Then, the algorithm selects the most probable action [43].

In this study, an open implementation of TRPO [44] is used. This framework uses the ADAM optimizer for both neural networks due to its fast convergence [45,46]. The default values of the framework [44] are used in the experiments. Only the number of episodes and the batch size parameters are modified in these settings. The batch size is set to 1 in order to increase the frequency of policy evaluation and the number of episodes is set to 100,000 in order to increase the training time. The simulation parameters and their values used in the experiments are given in Table 2. The details of other parameters used in the framework can be found in [44].

Over-fitting is a crucial problem to avoid in all types of machine learning approaches. RL algorithms are also prone to over-fitting to the environment and the trained policies may not generalize very well to newer environments. In this scenario, it is considered that the system could over-fit to the traffic arrival distributions used in training. In order to avoid over-fitting, the policy is trained by using a randomly generated traffic model. In the model, the generator redistributes the number of preamble requests randomly in 5 s periods. Recalling that 𝑀 is the total number of available preambles, it is shown in [47] that the number of successful transmissions in a round yields 𝑀∕𝑒 if the number of arriving preamble requests is equal to 𝑀 in a round. In the next round, the number of backlogged requests is 𝑀 − 𝑀∕𝑒 if no new request is made. As a result, if the arrival request count is constantly greater than 𝑀∕𝑒, the number of backlogged preamble requests increases in time and the RACH becomes unstable irreversibly. Therefore, the number of preamble requests is distributed to slices in a way that their average sum yields 𝑀∕𝑒 in the long run. Since this randomly generated traffic does not show any patterns as might exist in real traffic, it is less prone to over-fitting. Therefore, it is preferred to be used in training. Then the generated model is evaluated in testing. The performance of the model in varying network scenarios also supports that the model does not over-fit.

5. Analysis of reward functions

The reward function is the objective function used for evaluating the model and has to be maximized over successive steps. The choice of the reward function is crucial, since a poor selection may result in sub-optimal allocation of resources. In this part, analytical expressions are derived for several possible reward functions for a two-slice scenario. Then, the behavior of these functions is evaluated. Finally, these analytical results are compared with the experimental results presented in Section 6.

Let 𝑚1 and 𝑚2 be the number of preambles assigned to each slice such that 𝑀 = 𝑚1 + 𝑚2. Let 𝑛1 and 𝑛2 be the average number of arrivals per RAO for each slice. First, consider the case where 𝑛1 < 𝑚1∕𝑒 and 𝑛2 < 𝑚2∕𝑒. In this case, all incoming traffic can be supported because the capacity of the RACH with 𝑚 preambles is approximated as 𝑚∕𝑒 [48].

When there are 𝑏𝑗 nodes randomly selecting preambles from a set with size 𝑚𝑗, the expected number of successful preambles is given by [49]:

𝑠𝑗 = 𝑏𝑗(1 − 1∕𝑚𝑗)^(𝑏𝑗−1).    (5)

Similarly, the expected number of unused preambles can be found as [49]:

𝑢𝑗 = 𝑚𝑗(1 − 1∕𝑚𝑗)^𝑏𝑗.    (6)

The remaining preambles are collided:

𝑐𝑗 = 𝑚𝑗 − 𝑠𝑗 − 𝑢𝑗.    (7)
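As an illustration of Eqs. (5)–(7), the expected per-slice outcome of a RAO can be evaluated and cross-checked against a direct Monte Carlo draw of the contention process. The short Python sketch below is illustrative only; the function and variable names are assumptions and are not part of the study's implementation:

```python
import numpy as np

def expected_counts(b, m):
    """Expected successful, unused and collided preambles (Eqs. (5)-(7))
    when b backlogged nodes pick uniformly among m preambles."""
    s = b * (1.0 - 1.0 / m) ** (b - 1)   # Eq. (5)
    u = m * (1.0 - 1.0 / m) ** b         # Eq. (6)
    c = m - s - u                        # Eq. (7)
    return s, u, c

def simulate_rao(b, m, rng):
    """One random access opportunity: preambles chosen exactly once are
    successful, never-chosen ones are unused, the rest are collided."""
    picks = rng.integers(0, m, size=b)          # each node picks one preamble
    counts = np.bincount(picks, minlength=m)
    s = int(np.sum(counts == 1))
    u = int(np.sum(counts == 0))
    return s, u, m - s - u

rng = np.random.default_rng(0)
b, m = 22, 9   # the Fig. 1 example: 22 backlogged UEs contend for 9 preambles
print(expected_counts(b, m))                                       # analytical expectation
print(np.mean([simulate_rao(b, m, rng) for _ in range(10_000)], axis=0))  # Monte Carlo average
```

With 22 contending UEs and only 9 preambles, the slice operates well above the 𝑚∕𝑒 capacity, which is exactly the overload regime the adaptive grouping is meant to detect and correct.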
Table 2
Simulation parameters.

Network scenario | 2-slice | 3-slice | 5-slice
𝑁 | 2 | 3 | 5
𝑘𝑗 | 𝑘1 = 1, 𝑘2 = 2 | 𝑘1 = 1, 𝑘2 = 2, 𝑘3 = 3 | 𝑘1 = 1, 𝑘2 = 2, 𝑘3 = 3, 𝑘4 = 4, 𝑘5 = 5
Reward functions | CRF, PRF | CRF, PRF | CRF

Common parameters (all scenarios): discount factor 0.9; number of episodes 100,000; batch size 1; GAE (𝜆) 0.98; 𝐷𝐾𝐿 target 0.003; size of the first hidden layer 10; initial policy log-variance −1.
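To make the agent–environment formulation of Section 4 concrete, a minimal environment skeleton is sketched below. It is only illustrative: the class and method names are assumptions, the action is passed directly as a preamble split summing to 𝑀, and the actual experiments use the open TRPO framework [44] with the parameters of Table 2.

```python
import numpy as np

M, N = 54, 3                      # contention-based preambles and slices (3-slice case)
K = np.arange(1, N + 1)           # priority coefficients k_j = j
W = 10                            # maximum number of transmission attempts

class PreambleGroupingEnv:
    """Illustrative sketch: state is the latest (c_j, s_j) per slice,
    the action is a preamble split m with sum(m) == M, reward is the CRF."""

    def __init__(self, rng):
        self.rng = rng
        self.backlog = [np.zeros(0, dtype=int) for _ in range(N)]  # per-node attempt counters

    def step(self, m, new_arrivals):
        state, reward = [], 0.0
        for j in range(N):
            # add new requests, then let every backlogged node pick a preamble
            self.backlog[j] = np.concatenate(
                [self.backlog[j], np.zeros(new_arrivals[j], dtype=int)])
            b = len(self.backlog[j])
            picks = self.rng.integers(0, m[j], size=b)   # assumes m[j] >= 1
            counts = np.bincount(picks, minlength=m[j])
            ok = counts[picks] == 1                      # nodes whose preamble was unique
            s_j = int(np.sum(counts == 1))               # successful preambles
            c_j = int(np.sum(counts > 1))                # collided preambles
            # unsuccessful nodes stay backlogged; drop those exceeding W attempts
            left = self.backlog[j][~ok] + 1
            self.backlog[j] = left[left < W]
            state += [c_j, s_j]
            reward += K[j] * s_j - c_j                   # CRF, Eq. (1)
        return np.array(state), float(reward)

env = PreambleGroupingEnv(np.random.default_rng(0))
print(env.step(m=[18, 18, 18], new_arrivals=[5, 4, 2]))
```

A TRPO agent would then be trained on the (state, reward) pairs produced by step(), with its output mapped to a non-negative integer split of the 54 preambles before each RAO.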
Fig. 3. Change in the (a) SRF (b) PRF (c) CRF as 𝑚1 changes for 𝑘1 = 1 and 𝑘2 = 2 for different arrival rates.
When the RACH is stable, the expected number of successful preambles must be equal to the average number of arrivals in the long run, i.e., 𝑠𝑗 = 𝑛𝑗. To satisfy this equality, the number of attempting nodes at each slot needs to increase to a level 𝑏𝑗 > 𝑛𝑗. This value can be found by solving (5), which gives 𝑏𝑗 = −𝑚𝑗𝑊(−𝑛𝑗∕𝑚𝑗), where 𝑊() is the principal branch of the Lambert W function [48]. Using the approximation (1 − 1∕𝑥)^𝑛 ≈ 𝑒^(−𝑛∕𝑥), 𝑢 and 𝑐 can be further simplified as

𝑢𝑗 ≈ 𝑚𝑗𝑒^(−𝑏𝑗∕𝑚𝑗) = 𝑚𝑗𝑒^(𝑊(−𝑛𝑗∕𝑚𝑗))    (8)

and

𝑐𝑗 ≈ 𝑚𝑗(1 − 𝑒^(𝑊(−𝑛𝑗∕𝑚𝑗))) − 𝑛𝑗.    (9)

5.1. Successful preambles reward function (SRF)

The most intuitive and simple reward function is the number of successfully transmitted preambles at each RAO. In this case, the reward for a slice is 𝑟𝑗 = 𝑠𝑗 and the total reward can be defined as the weighted sum of the rewards of each slice:

𝑟𝑠𝑢𝑐𝑐 = 𝑘1𝑟1 + 𝑘2𝑟2 = 𝑘1𝑠1 + 𝑘2𝑠2    (10)

where 𝑘1 and 𝑘2 are used to prioritize different slices. This reward function, however, has some undesirable properties. For any choice of 𝑚1 and 𝑚2 which satisfies 𝑛1 < 𝑚1∕𝑒 and 𝑛2 < 𝑚2∕𝑒, the RACH is stable and the number of successful preambles equals the number of arrivals, 𝑠𝑗 = 𝑛𝑗. Hence, the reward function is flat in this region. For 𝑛1 > 𝑚1∕𝑒 and 𝑛2 < 𝑚2∕𝑒, all preambles will be unsuccessful for slice-1 and all traffic can be supported in slice-2, again leading to a flat reward function. In general, the reward function can be written as:

𝑟𝑠𝑢𝑐𝑐 = 𝑘1𝑛1 + 𝑘2𝑛2   if 𝑛1 < 𝑚1∕𝑒, 𝑛2 < 𝑚2∕𝑒;
𝑟𝑠𝑢𝑐𝑐 = 𝑘1𝑛1          if 𝑛1 < 𝑚1∕𝑒, 𝑛2 > 𝑚2∕𝑒;
𝑟𝑠𝑢𝑐𝑐 = 𝑘2𝑛2          if 𝑛1 > 𝑚1∕𝑒, 𝑛2 < 𝑚2∕𝑒;
𝑟𝑠𝑢𝑐𝑐 = 0             if 𝑛1 > 𝑚1∕𝑒, 𝑛2 > 𝑚2∕𝑒.    (11)

The behavior of this reward function is illustrated in Fig. 3(a) for 𝑛1 = 𝑛2 = 5. As the reward function is flat for the stable region, the RL agent does not have any incentive to change the groupings as long as the RACH is stable. This behavior may bring the system very close to instability when the number of preambles reserved for one slice is barely sufficient for the traffic of that slice. In that case, a slight increase in traffic may result in serious congestion which could have been easily avoided if the agent used a more robust allocation of preambles. Hence, a reward function is needed which would "steer" the system towards a better operating point.

5.2. Proportional reward function (PRF)

Another intuitive reward function for the RACH channel is the ratio of successful preambles among the number of transmitted preambles. Weighting each ratio corresponding to each slice, it is possible to differentiate preambles among slices according to their priority. Let 𝑘1 and 𝑘2 denote the priority coefficients of each slice. Then, for 𝑛1 < 𝑚1∕𝑒 and 𝑛2 < 𝑚2∕𝑒, the total reward for a single RAO is given by

𝑟𝑝𝑟 = 𝑟1 + 𝑟2 = 𝑘1 𝑠1∕(𝑚1 − 𝑢1) + 𝑘2 𝑠2∕(𝑚2 − 𝑢2).    (12)

For 𝑛1 > 𝑚1∕𝑒, there are not enough preambles for slice-1, which will result in a growing backlog, and all preambles will collide, resulting in
𝑟1 = 0. Similarly, for 𝑛2 > 𝑚2∕𝑒, 𝑟2 = 0. Hence,

𝑟𝑝𝑟 = Σ(𝑗=1..2) 𝑘𝑗𝑛𝑗 ∕ [𝑚𝑗(1 − 𝑒^(𝑊(−𝑛𝑗∕𝑚𝑗)))]   if 𝑛1 < 𝑚1∕𝑒, 𝑛2 < 𝑚2∕𝑒;
𝑟𝑝𝑟 = 𝑘1𝑛1 ∕ [𝑚1(1 − 𝑒^(𝑊(−𝑛1∕𝑚1)))]            if 𝑛1 < 𝑚1∕𝑒, 𝑛2 > 𝑚2∕𝑒;
𝑟𝑝𝑟 = 𝑘2𝑛2 ∕ [𝑚2(1 − 𝑒^(𝑊(−𝑛2∕𝑚2)))]            if 𝑛1 > 𝑚1∕𝑒, 𝑛2 < 𝑚2∕𝑒;
𝑟𝑝𝑟 = 0                                          if 𝑛1 > 𝑚1∕𝑒, 𝑛2 > 𝑚2∕𝑒.    (13)

For a given 𝑛1, 𝑛2, 𝑘1 and 𝑘2, this function can be numerically maximized to find the values of 𝑚1 and 𝑚2 such that 𝑚1 + 𝑚2 = 𝑚. The behavior of the PRF is shown in Fig. 3(b) for 𝑛1 = 𝑛2 = 5. This reward function does not suffer from the problem mentioned in the previous part. The optimum values of 𝑚1 for different priority coefficients are 25, 24 and 23 for (𝑛1, 𝑛2) = (5, 4), (6, 6) and (7, 8), respectively. Even when the traffic of the low-priority slice is higher than the traffic of the high-priority slice, this reward function allocates a lower number of preambles to the lower-priority traffic.

The optimum allocation can be found through an exhaustive search for the 2-slice scenario since the complete state space is relatively small. However, in the context of 5G there are at least 3 defined service types: extreme mobile broadband (xMBB), massive machine-type communications (mMTC) and ultra-reliable machine-type communications (uMTC) [5]. This number, however, can increase up to 5 as delay-tolerant IoT, emergency, high priority IoT, human-to-human and mobile broadband [6]. On the other hand, increasing the number of slices also increases the state space drastically and finding the optimum policy through analysis gets intractable. Here, the simulations are also conducted for 3 and 5-slice scenarios besides the 2-slice scenario.

In the simulations, slices are prioritized using their reward coefficients, 𝑘𝑗. Slice numbers are used as the prioritization factor such that 𝑘𝑗 = 𝑗. In other words, assuming there are 𝑁 slices, the 𝑁th slice has the highest priority. This prioritization scheme is employed in all simulations throughout this paper. Therefore, the slice numbers in all figures also indicate their priority. In addition, the first and lowest priority slice will be named the low-priority slice and the 𝑁th and highest priority slice will be named the high-priority slice in the rest of the paper. Figs. 4–8 are plotted using the average of the collected data of 10 simulation runs.

Here, 𝑥 denotes the proportionality factor which satisfies the above equation. Assuming the 𝑏𝑗 values are known, 𝑥 can also be found by solving the equation and, hence, the 𝑚𝑗 values are obtained.
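To illustrate how the expressions in Eqs. (11) and (13) can be maximized numerically over the preamble split, the following Python sketch sweeps 𝑚1 for the two-slice case. It is illustrative only; scipy's Lambert W routine is assumed for 𝑊(), and no exact numerical results of the paper are claimed for it.

```python
import numpy as np
from scipy.special import lambertw

M = 54  # contention-based preambles

def srf(m1, n1, n2, k1=1.0, k2=2.0):
    """Successful-preambles reward, Eq. (11)."""
    m2 = M - m1
    r = 0.0
    if n1 < m1 / np.e: r += k1 * n1
    if n2 < m2 / np.e: r += k2 * n2
    return r

def prf(m1, n1, n2, k1=1.0, k2=2.0):
    """Proportional reward, Eq. (13)."""
    def term(k, n, m):
        if n >= m / np.e:                 # overloaded slice contributes nothing
            return 0.0
        w = lambertw(-n / m).real         # principal branch, real for n/m < 1/e
        return k * n / (m * (1.0 - np.exp(w)))
    return term(k1, n1, m1) + term(k2, n2, M - m1)

for n1, n2 in [(5, 4), (6, 6), (7, 8)]:
    best_m1 = max(range(1, M), key=lambda m1: prf(m1, n1, n2))
    print(n1, n2, best_m1)   # PRF-optimal m1 for each arrival pair
```

Such a sweep reproduces the qualitative behavior discussed for Fig. 3(b): the higher-priority slice receives the larger share of preambles even when its own traffic is lower.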
Fig. 4. Preamble allocations computed by the exhaustive search, ideal algorithm and mathematical analysis against the DRL-based approach in the 2-slice scenario: (a)–(b)–(c) respectively for the CRF and (d)–(e)–(f) for the PRF reward function.
Fig. 5. The ratio of dropped messages to all transmitted messages for the 2-slice scenario while the arrival rate of the second slice is consistently increased. (a)–(b)–(c) show the performance for CRF, PRF and the ideal algorithm, respectively.
Fig. 6. The average waiting time for the 2-slice scenario while the arrival rate of the second slice is consistently increased. (a)–(b)–(c) show the performance for CRF, PRF and the ideal algorithm, respectively.
Fig. 7. The performance of the proposed method for 3 network slices while the arrival rate of the third slice is consistently increased.
Fig. 8. The performance of the ideal algorithm for 3 network slices while the arrival rate of the third slice is consistently increased.
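The two performance metrics reported in Figs. 5–8 and used throughout Section 6.3 (the ratio of dropped messages and the average waiting time of successful requests) can be computed from per-RAO logs roughly as follows. The sketch is illustrative; the cap of 10 × 𝑊 ms for RAOs without any successful request follows the description in Section 6.3, while counting the wait of a first-attempt success as 0 ms is an assumption.

```python
RAO_PERIOD_MS = 10   # RAO period used in these experiments
W = 10               # maximum number of preamble transmission attempts

def dropped_ratio(dropped, transmitted):
    """Ratio of dropped messages to all transmitted messages in one RAO."""
    return dropped / transmitted if transmitted else 0.0

def avg_waiting_time_ms(success_attempt_counts):
    """Average backlog waiting time of the requests that succeeded in this RAO.
    Assumption: a request succeeding on its first attempt waited 0 ms.
    If no request succeeded, report the maximum value, 10 x W ms."""
    if not success_attempt_counts:
        return RAO_PERIOD_MS * W
    waits = [(attempts - 1) * RAO_PERIOD_MS for attempts in success_attempt_counts]
    return sum(waits) / len(waits)
```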
6.3. Simulation results

Firstly, the proposed DRL-based approach is compared with the mathematical analysis given in Section 5, the exhaustive search and the ideal algorithm for the scenario with 2 slices. Since the 2-slice scenario has a smaller state space, it is preferred for a baseline comparison. Fig. 4 plots the optimum number of reserved preambles for each slice that maximizes the reward functions against the mathematical analysis, the ideal algorithm and the exhaustive search. The results show that the preamble allocations which yield maximum rewards for each technique behave similarly.

Fig. 5 demonstrates the ratio of dropped messages to all transmitted messages in a random access opportunity in the 2-slice scenario. The results for the reward functions CRF, PRF and the ideal algorithm are given in Fig. 5(a), Fig. 5(b) and Fig. 5(c), respectively. While the value of the normalized arrival rate of the high-priority slice approaches 0.3, the total load 𝑛1∕𝑀 + 𝑛2∕𝑀 ≈ 0.38 exceeds the capacity 1∕𝑒 ≈ 0.37. In both figures, the normalized arrival rate of the high-priority slice rises up to 𝑛2∕𝑀 ≈ 0.39 and the total load at that maximum point passes (𝑛1 + 𝑛2)∕𝑀 = 26∕54 ≈ 0.48. In Fig. 5, while the high-priority slice outperforms the unsliced case for both reward functions, the low-priority slice also gets a considerable amount of reserved preambles. Fig. 5(a) points out that the high-priority slice using CRF can satisfy almost all preamble requests, so that preamble request messages are nearly never dropped even when the total load passes (𝑛1 + 𝑛2)∕𝑀 = 25∕54 ≈ 0.46. On the other hand, the unsliced plot has a sharp increase after passing the maximum normalized traffic load limit that the random access channel can handle (1∕𝑒 ≈ 0.37).

For the same scenario, Fig. 6 presents the average waiting time, which denotes how much time successful preamble requests waited in the message backlog on average. If there is no successful preamble request in the random access opportunity, the average waiting time cannot be calculated for that random access opportunity. Nevertheless, to show the performance, the value is set to the maximum average waiting time, which is 10 × 𝑊 ms. The average waiting time for the high-priority slice stays almost constant despite the increasing traffic rate. On the other hand, the average waiting time for the high-priority slice for PRF is higher in comparison to the waiting time of the same slice for CRF. Moreover, while the PRF scheme distributes the preambles more evenly between the two slices as seen in Fig. 4, when the total load passes (𝑛1 + 𝑛2)∕𝑀 = 21∕54 ≈ 0.39, the dropped preamble requests from the high-priority slice increase dramatically. Therefore, the results which use only CRF are presented in the rest of the experiments.

The simulation results for the 3-slice scenario are given in Fig. 7. As noted above, here the performance metrics are evaluated only for the CRF reward function. At time 𝑡 = 0, 𝑛1 = 5, 𝑛2 = 4, 𝑛3 = 2 and the relation between the numbers of reserved preambles is 𝑚1 > 𝑚2 > 𝑚3. Since the total load is ((𝑛1 + 𝑛2 + 𝑛3)∕54 ≈ 0.20) < (1∕𝑒 ≈ 0.37), the proposed method allocates more preambles to the 1st and 2nd slices, as shown in Fig. 7(a), in order to relieve the traffic of the lower priority slices when the high-priority slice easily handles its traffic. As 𝑛3 increases, the proposed method aggressively increases the reserved preamble count for the 3rd slice in order to avoid dropping messages and long average waiting. The proposed method manages not to drop any preamble allocation requests from the 3rd slice even when the total load passes (𝑛1 + 𝑛2 + 𝑛3)∕𝑀 = 26∕54 ≈ 0.48. The average waiting time for the 3rd slice follows a horizontal line up to the normalized arrival rate of 0.3. After that, it slightly increases due to the high load on the random access channel; yet, the proposed method is able to prevent a sharp increase. On the other hand, the unsliced curve shows a sharp increase after passing the maximum normalized traffic load limit that the random access channel can handle (1∕𝑒 ≈ 0.37), as in the two-slice scenario.

Figs. 5(c), 6(c) and 8 demonstrate the behavior of the ideal algorithm for the scenarios with 2 slices and 3 slices. It slightly prioritizes the slices, in a similar way to the unsliced plot, with respect to their prioritization levels in the figures. Furthermore, Fig. 8(a) shows that for the 3-slice scenario it prefers to allocate a larger number of preambles
Fig. 9. The timeline simulation graphs for 3-slice (a) and for 5-slice (b) using CRF.
Fig. 10. The timeline simulation graphs for 3-slice (a) and for 5-slice (b) using the ideal algorithm.
to the 3rd slice, even in the beginning of the scenario when the total load is much less than 1∕𝑒 ≈ 0.37 and the normalized arrival rate for the 1st slice is highest. As a result of using the ideal algorithm, there are dropped preamble requests observed in the 1st slice even at the early stage of the scenario, when the arrival rate of the 3rd slice is between 0.05 and 0.1 and the total load is 0.24, as shown in Fig. 8(b).

6.4. Performance of the reinforcement learning based method in a dynamic environment

In this part, the temporal behavior of the proposed method is presented to show how it reacts to changes in the traffic demands and how much time it needs to rearrange the preamble allocations to meet the requirements. Different from the previous experiments, here not only the requests of the high-priority service change, but all service requests may also vary in time. The simulations for the 5-slice scenario are also run in addition to the 3-slice scenario. For these simulations, only CRF is used as the reward function. The traffic pattern and the behavior of the proposed method are shown in Fig. 9 for both scenarios. The uppermost plots in Figs. 9(a) and 9(b) show the normalized arrival rate for each slice over time. The plots below them show the response of the proposed method to the changing environment as the reserved preamble counts for each slice. The next plots demonstrate the collision rates of each slice in that random access opportunity, which are found using 𝑐𝑗∕𝑚𝑗. Finally, the bottom plots show the ratio of dropped messages to all transmitted messages at that random access opportunity. The same scenarios are also implemented with the ideal algorithm. The performance of the ideal algorithm is plotted in Fig. 10.

Fig. 9(a) plots the simulation results for the 3-slice scenario. From 𝑡 = 0 to 𝑡 = 2, the reserved preamble count for each slice stays almost constant since there exists no congestion and no change in traffic. At 𝑡 = 2, 𝜆1 increases suddenly while the other slices still have the same normalized arrival rates. After a reaction time of approximately 50 ms, the proposed method gives most of the preambles to slice-1 immediately. First, the collision rate of slice-1 increases a little, then,
immediately drops thanks to the reallocation of preambles. Similarly, at 𝑡 = 2.5, 𝜆2 increases to a higher value than 𝜆1. Right after, 𝑚2 increases and slice-2 gets a major part of the preambles. Up until 𝑡 = 3, 𝑚1 and 𝑚2 stay around the same levels with little difference between them. In Fig. 9(a), the total load exceeds the total capacity at 𝑡 = 3. At this point, since the preamble resources are not enough for all slices, the method allocates most of the preambles to the high-priority slice, slice-3. Therefore, 𝑚3 spikes, 𝑚2 is nearly halved and 𝑚1 drops to between about 5 and 10. As all slices cannot be supported at the same time, the proposed method sacrifices the low-priority slice in favor of the other slices. The collision rate for slice-1 oscillates around 1 up to 𝑡 = 8. The collision rates for the other two slices change between 0.2 and 0.8, and the collision rate for slice-2 is higher than that of slice-3 even though the traffic to slice-3 is more than twice as high. In the last two seconds, 𝜆3 falls off suddenly and, right after, 𝑚3 also drops. 𝑚1 and 𝑚2 increase to their previous values from right before 𝜆3 increased. Then, 𝜆2 and 𝜆3 respectively drop to their previous values and the preamble allocation returns to its initial assignment.

Similarly, Fig. 9(b) plots the timeline simulation for the 5-slice scenario. From 𝑡 = 0 to 𝑡 = 2, there is no congestion and the preamble allocation of the slices does not change. During this period, the number of preambles reserved to each slice is proportional to their priority levels. The normalized arrival rates of the slices are increased starting from the low-priority slice to the high-priority slice. At the end of the simulation, the rates are decreased in the reverse order. The preamble allocation behavior is similar to the 3-slice scenario and the RL agent successfully prioritizes the slices. The full load is exceeded when the normalized arrival rate of slice-5 increases. In that case, the majority of the preambles are reserved for slice-5.

Figs. 10(a) and 10(b) plot the timeline simulation of the ideal algorithm for the 3-slice and 5-slice scenarios, respectively. As pointed out above, the algorithm slightly prioritizes the slices in a way that the collision rates in each figure are very close to each other; hence the allocation of each slice 𝑗 is slightly better than the allocation of slice 𝑗 − 1. The plots in the time intervals [2,3] and [8,9] for 3-slice in Fig. 10(a), and in the time intervals [2,4] and [7,9] for 5-slice in Fig. 10(b), show that the ideal algorithm cannot prevent collisions for lower priority slices even when the total load in the network is less than (1∕𝑒 ≈ 0.37). Fig. 9 shows that the DRL-based approach manages not to drop any preamble requests in these time intervals and manages to keep the collision rates lower than the ideal algorithm does. On the other hand, when the arrival rate of the high-priority slice increases and the total load becomes bigger than (1∕𝑒 ≈ 0.37), the DRL-based approach sacrifices the lowest priority slices (the 1st one in the 3-slice scenario, the 1st and 2nd ones in the 5-slice scenario) in favor of higher priority ones. Although the ideal algorithm tries to keep the collision rates under control even for the lowest priority slice when the total load is bigger than (1∕𝑒 ≈ 0.37), this may not be the best behavior when the total load is that high. Especially in some real life scenarios, such as where a remote surgery slice has the highest priority and a sensor-IoT slice has the lowest priority, the lowest priority slice could need to be sacrificed.

In summary, in the timeline simulations, the sudden increase of arrival preamble requests to slices from low-priority to high-priority and then the sudden decrease from high-priority to low-priority are tested. For both the 3-slice and 5-slice scenarios, when the traffic of the low-priority slice increases, the preamble allocation is increased in favor of the low-priority slice. Following this, the traffic to the slices increases from lower to higher priority ones respectively, and with each increase more preambles are reserved to the slice whose traffic has most recently increased. At the moment the traffic to the high-priority slice is increased, the total normalized arrival rate reaches 0.39, so it passes the full load (1∕𝑒 ≈ 0.37). After that, the proposed method suddenly allocates the vast majority of preambles to the high-priority slice. Nonetheless, Figs. 9(a) and 9(b) show that the proposed method does not neglect the performance of lower priority slices, while the collision rate of the high-priority slice is kept under control. Approaching the end of the timeline, the traffic assigned to the slices decreases, starting from the high-priority slice down to the low-priority slice. The preamble allocation of the proposed method acts in the reverse direction. In the end, the preamble allocations return to the first position. These results confirm the effectiveness and adaptability of the proposed approach in a dynamic environment.

6.5. Two priorities scenario used in [16]

The authors in [16] propose to assign weights 𝛾𝑖 to services based on their priorities. In this specific scenario, they use two services having priority weights 𝛾1 = 2 and 𝛾2 = 1. For each service, 10,000 devices are activated over 10 s (2000 RAO slots of 5 ms each), with uniform arrival for the low-priority service and bursty beta arrival for the high-priority service. They state that their algorithm produces results close to the ideal case, in which the average delay (slots/device) is close to 0. In addition, they specify that the average delay for the high-priority service is less than for the low-priority one. For a fair comparison, the same scenario is implemented for 2 slices and the results are plotted in Fig. 11. Please note that the RAO period is taken as 5 ms here, whereas it is taken as 10 ms in the previous experiments. Although the authors use 𝑊 = ∞ and this value is taken as 10 in this current study, since no dropped preamble request is observed in these simulations, 𝑊 has no effect on the comparison.

Fig. 11 shows the comparison of the proposed approach with the ideal algorithm on the scenario defined above [16]. As shown in the figure, the average waiting time never reaches 5 ms for either slice. The average waiting times for the high-priority slice are 0.98 ms in the DRL-based approach and 0.65 ms in the ideal algorithm. The average waiting times for the low-priority slice are 1.2 ms in the DRL-based approach and 4.7 ms in the ideal algorithm. Please note that the average waiting time is how much time successful preamble requests waited in the message backlog on average. Since it is taken as the time spent by one device/service for successfully accessing the channel (slots/device) in [16], the average waiting time here is divided by the slot interval (5 ms) in order to make a fair comparison with [16]. For the high-priority slice, it is calculated as 0.98 ms∕5 ms ≈ 0.2 by the proposed approach and as 0.65 ms∕5 ms ≈ 0.13 by the ideal algorithm. For the low-priority slice, it is calculated as 1.2 ms∕5 ms ≈ 0.24 and 4.7 ms∕5 ms ≈ 0.94 for the two algorithms, respectively.

To sum up, the proposed method can successfully prioritize the slices in a way that the average delay (slots/device) of each slice is close to 0. Moreover, the high-priority slice has a lower average delay than the low-priority slice. It is also stated in [16] that they obtain average delays (slots/device) close to 0 without specifying exact values. Although the ideal algorithm manages to keep the average delays (slots/device) lower than 1, there is a noticeable difference between the slices (0.94 for the low-priority slice and 0.13 for the high-priority slice). Even though the ideal algorithm performs slightly better than the proposed approach for the high-priority slice based on the average delay metric, it drops a few preamble requests from the low-priority slice. As stated above, the proposed approach does not drop any requests.

7. Discussion

The experimental results show the effectiveness and adaptability of the proposed DRL-based approach in a dynamic environment. Different reward functions are evaluated and it is shown that penalizing collisions in addition to rewarding successful preambles gives the best performance. When penalizing is used in the reward function, reinforcement learning acts more quickly and sharply due to the very nature of penalizing, which is fed into the system in a very short time (in a few RAOs). However, using the other functions which do not include penalizing, the response time of the algorithm is longer and it adapts more softly to the changing environment. Moreover, when penalizing is applied in
Fig. 11. The timeline simulation graphs for 2-slice using the two priorities scenario in [16], for the DRL-based method (a) and for the ideal algorithm (b).
reward functions, the number of reserved preambles is inclined to be given to lower priority slices when there is no traffic congestion. In other words, more importance is given to lower slices when there is no high traffic for higher priority slices. This is the reason why the DRL-based approach using penalizing can perform better than the ideal algorithm under some scenarios. On the other hand, in the event of traffic congestion, the DRL-based approach shows a steady and more prioritizing attitude in favor of higher priority slices when compared to the ideal algorithm. There exists an obvious opposite attitude of the DRL-based approach towards slices between the low and high traffic congestion cases. These outcomes indicate that the DRL-based approach can change its pattern of prioritization behavior under varying network traffic load. Please note that the same priority coefficient values for slices are employed in all reward functions in order to have a comparison basis. However, these coefficients could affect the reward functions in different ways; therefore, in the future, different coefficients could be applied for different reward functions.

The running time of the proposed approach is proportional to that of the TRPO algorithm. However, an increase in the number of slices, which is varied in the experiments, causes an expansion in the state space. Recalling that 𝑀 is the number of available preambles, the state space size for 𝑁 slices can be calculated as (𝑀 + 3𝑁 − 1)! ∕ (𝑀!(3𝑁 − 1)!), since there are 3 state space members 𝑠𝑗, 𝑐𝑗 and 𝑢𝑗. The action space size is (𝑀 + 𝑁 − 1)! ∕ (𝑀!(𝑁 − 1)!), since the action list includes 𝑁 members. Therefore, increasing 𝑁 by 1, the state space size expands nearly 𝑀³ times and the action space size expands 𝑀 times. Since the solution space is the mapping from the state space to the action space, increasing 𝑁 expands the solution space immensely. It is not easy to build a mathematical model for representing such a complex solution space. However, DRL is known to be theoretically able to diminish the curse of dimensionality caused by the state space [50].

In this study, the proposed approach is trained and evaluated by using simulated network traffic, since real world RACH preamble traffic data is not available. It is hard to collect such data from eNodeBs since they cannot detect how many nodes are available in the environment. Moreover, since the nodes in the environment are distributed and mobile, collecting such data synchronously is hard. Due to all of these reasons, randomly generated traffic is used in training in this study. On the other hand, having real world data for different service types with various QoS would help to generate a more precise model for that environment. Moreover, in the experiments of this current study, the effects of some parameters such as the transmission power of nodes, the distance of nodes to eNodeBs, deafness of nodes, path loss and shadowing are not simulated. In the future, these parameters could be included in the experiments for more realistic simulations.

8. Conclusion

With the increasing significance of RAN resource allocation in 5G, flexible preamble allocation becomes an important problem. This study aims to find a solution to the RAN slicing problem by prioritizing slices. It explores deep reinforcement learning (DRL) to solve the optimum resource allocation problem in RAN slicing. Here, three reward functions are proposed for the RL formulation and mathematically analyzed.

Since at least 3 service types are defined for 5G [5] and some studies propose 5 service types [6] for 5G in the literature, 3-slice and 5-slice scenarios are implemented in the experiments. The proposed approach is evaluated by the following metrics: the number of reserved preambles, the average waiting time and the ratio of dropped messages to all transmitted messages. The proposed approach is compared with a recent study [16] in the literature. Moreover, in order to show its effectiveness, it is compared with the unsliced scenario and the exhaustive search, in which all possible preamble allocations for a given traffic load are searched through and the best preamble allocation is chosen. The results show that the proposed method distributes preambles to different service classes according to their priorities successfully. It produces results comparable to [16]. In some network scenarios, it outperforms the ideal algorithm given in [16] by giving importance to lower priority slices in addition to higher priority slices when there is no traffic congestion.

To sum up, the proposed DRL-based approach is shown to be a suitable approach for RAN slicing, since it adapts to changes in the environment in a timely manner. It can also be deployed as an instantiable Virtual Network Function (VNF) at eNodeBs. Afterwards, eNodeBs slice PRACH preambles for network service classes that have different QoS needs under the dynamic environment. The VNF can be instantiated at the very beginning of eNodeB operation; then it can step in when there exists network congestion, as in a power outage and restore scenario or in a temporary dense network formation scenario.
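The state- and action-space sizes discussed in Section 7 can be evaluated directly for the scenarios used in this paper; a short, illustrative computation is given below.

```python
from math import comb

M = 54  # contention-based preambles

def state_space_size(n_slices):
    # (M + 3N - 1)! / (M! (3N - 1)!) = C(M + 3N - 1, M): three members (s_j, c_j, u_j) per slice
    return comb(M + 3 * n_slices - 1, M)

def action_space_size(n_slices):
    # (M + N - 1)! / (M! (N - 1)!) = C(M + N - 1, M): splits of M preambles into N groups
    return comb(M + n_slices - 1, M)

for n in (2, 3, 5):
    print(n, state_space_size(n), action_space_size(n))
```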
CRediT authorship contribution statement

Ahmet Melih Gedikli: Conceptualization, Methodology, Software, Visualization, Validation, Investigation, Data curation, Writing – original draft. Mehmet Koseoglu: Formal analysis, Writing – review & editing, Software, Investigation. Sevil Sen: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] J. Kim, J. Lee, J. Kim, J. Yun, M2M service platforms: Survey, issues, and enabling technologies, IEEE Commun. Surv. Tutor. 16 (1) (2014) 61–76.
[2] Internet of Things (IoT) connected devices installed base worldwide from 2015 to 2025 (in billions), 2019, https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/ (Accessed: 12 Oct 2019).
[3] Cisco, Cisco visual networking index: Forecast and trends, 2017–2022 white paper, 2019, (Visited December 2019) [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html.
[4] G. Wunder, P. Jung, M. Kasparick, T. Wild, F. Schaich, Y. Chen, S.T. Brink, I. Gaspar, N. Michailow, A. Festag, L. Mendes, N. Cassiau, D. Ktenas, M. Dryjanski, S. Pietrzyk, B. Eged, P. Vago, F. Wiedmann, 5GNOW: Non-orthogonal, asynchronous waveforms for future mobile applications, IEEE Commun. Mag. 52 (2) (2014) 97–105.
[5] P. Marsch, O. Bulakci, I. Silva, P. Arnold, N. Bayer, J. Belschner, T. Rosowski, G. Zimmermann, M. Ericson, A. Kaloxylos, P. Spapis, A. Ibrahim, Y. Yang, S. Singh, H. Celik, J. Gebert, A. Prasad, F. Moya, M. Säily, J. Monserrat, D2.2 draft overall 5G RAN design, 2016, http://dx.doi.org/10.13140/RG.2.2.17831.14245.
[6] S. Vural, N. Wang, P. Bucknell, G. Foster, R. Tafazolli, J. Muller, Dynamic preamble subset allocation for RAN slicing in 5G networks, IEEE Access 6 (2018) 13015–13032.
[7] W. Tian, M. Fan, C. Zeng, Y. Liu, D. He, Q. Zhang, Telerobotic spinal surgery based on 5G network: The first 12 cases, Neurospine 17 (2020) 114–120.
[8] M. Vilgelm, S. Schiessl, H. Al-Zubaidy, W. Kellerer, J. Gross, On the reliability of LTE random access: Performance bounds for machine-to-machine burst resolution time, in: 2018 IEEE International Conference on Communications, ICC, 2018, pp. 1–7.
[9] J. Oueis, E. Strinati, Uplink traffic in future mobile networks: Pulling the alarm, 2016, pp. 583–593.
[10] Ericsson, Ericsson Mobility Report: On the pulse of the networked society, 2015, URL http://www.ericsson.com/res/docs/2015/ericsson-mobility-report-june-2015.pdf.
[11] Y. Dong, Z. Chen, P. Fan, K.B. Letaief, Mobility-aware uplink interference model for 5G heterogeneous networks, IEEE Trans. Wireless Commun. 15 (3) (2016) 2231–2244.
[12] C. Nguyen, H. Dinh Thai, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, D.I. Kim, Applications of deep reinforcement learning in communications and networking: A survey, 2018.
[13] Z. Xiong, Y. Zhang, D. Niyato, R. Deng, P. Wang, L. Wang, Deep reinforcement learning for mobile 5G and beyond: Fundamentals, applications, and challenges, IEEE Veh. Technol. Mag. 14 (2) (2019) 44–52.
[14] R. Boutaba, M. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, O. Caicedo Rendon, A comprehensive survey on machine learning for networking: Evolution, applications and research opportunities, J. Internet Serv. Appl. 9 (2018).
[15] Y. Shi, Y.E. Sagduyu, T. Erpek, Reinforcement learning for dynamic resource optimization in 5G radio access network slicing, in: 2020 IEEE 25th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks, CAMAD, 2020, pp. 1–6.
[16] J. Liu, M. Agiwal, M. Qu, H. Jin, Online control of preamble groups with priority in massive IoT networks, IEEE J. Sel. Areas Commun. (2020) 1.
[17] W.T. Toor, H. Jin, Comparative study of access class barring and extended access barring for machine type communications, in: 2017 International Conference on Information and Communication Technology Convergence, ICTC, 2017, pp. 604–609.
[18] N. Zangar, S. Gharbi, M. Abdennebi, Service differentiation strategy based on MACB factor for M2M communications in LTE-A networks, in: 2016 13th IEEE Annual Consumer Communications Networking Conference, CCNC, 2016, pp. 693–698.
[19] N. Li, C. Cao, C. Wang, Dynamic resource allocation and access class barring scheme for delay-sensitive devices in machine to machine (M2M) communications, Sensors 17 (6) (2017) 1407, URL http://dx.doi.org/10.3390/s17061407.
[20] Y. Sim, D.-H. Cho, Performance analysis of priority-based access class barring scheme for massive MTC random access, IEEE Syst. J. 14 (4) (2020) 5245–5252.
[21] K.-D. Lee, S. Kim, B. Yi, Throughput comparison of random access methods for M2M service over LTE networks, in: 2011 IEEE GLOBECOM Workshops, GC Wkshps, 2011, pp. 373–377, ISSN: 2166-0077.
[22] T. Lin, C. Lee, J. Cheng, W. Chen, PRADA: Prioritized random access with dynamic access barring for MTC in 3GPP LTE-A networks, IEEE Trans. Veh. Technol. 63 (5) (2014) 2467–2472.
[23] K. Lee, M. Reisslein, K. Ryu, S. Kim, Handling randomness of multi-class random access loads in LTE-advanced network supporting small data applications, in: 2012 IEEE Globecom Workshops, 2012, pp. 436–440.
[24] C. Kalalas, F. Vazquez-Gallego, J. Alonso-Zarate, Handling mission-critical communication in smart grid distribution automation services through LTE, in: 2016 IEEE International Conference on Smart Grid Communications, SmartGridComm, 2016, pp. 399–404.
[25] M. Vilgelm, M. Gürsu, W. Kellerer, M. Reisslein, LATMAPA: Load-adaptive throughput-maximizing preamble allocation for prioritization in 5G random access, IEEE Access PP (2017) 1.
[26] X. Zhao, J. Zhai, G. Fang, An access priority level based random access scheme for QoS guarantee in TD-LTE-A systems, in: 2014 IEEE 80th Vehicular Technology Conference, VTC2014-Fall, 2014, pp. 1–5.
[27] D. Pacheco-Paramo, L. Tello-Oquendo, V. Pla, J. Martinez-Bauset, Deep reinforcement learning mechanism for dynamic access control in wireless networks handling mMTC, Ad Hoc Netw. 94 (2019) 101939, URL http://www.sciencedirect.com/science/article/pii/S1570870519300976.
[28] A. Mohammed Mikaeil, W. Hu, L. Li, Joint allocation of radio and fronthaul resources in multi-wavelength-enabled C-RAN based on reinforcement learning, J. Lightwave Technol. 37 (23) (2019) 5780–5789.
[29] J. Wang, L. Zhao, J. Liu, N. Kato, Smart resource allocation for mobile edge computing: A deep reinforcement learning approach, IEEE Trans. Emerg. Top. Comput. (2019) 1.
[30] J. Li, H. Gao, T. Lv, Y. Lu, Deep reinforcement learning based computation offloading and resource allocation for MEC, in: 2018 IEEE Wireless Communications and Networking Conference, WCNC, 2018, pp. 1–6.
[31] Y. Liu, H. Yu, S. Xie, Y. Zhang, Deep reinforcement learning for offloading and resource allocation in vehicle edge computing and networks, IEEE Trans. Veh. Technol. 68 (11) (2019) 11158–11168.
[32] Y. Wei, F.R. Yu, M. Song, Z. Han, User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach, IEEE Trans. Wireless Commun. 17 (1) (2018) 680–692.
[33] Z. Chen, D.B. Smith, Heterogeneous machine-type communications in cellular networks: Random access optimization by deep reinforcement learning, in: 2018 IEEE International Conference on Communications, ICC, 2018, pp. 1–6.
[34] C. Chang, N. Nikaein, RAN runtime slicing system for flexible and dynamic service execution environment, IEEE Access 6 (2018) 34018–34042.
[35] H.D.R. Albonda, J. Pérez-Romero, An efficient RAN slicing strategy for a heterogeneous network with eMBB and V2X services, IEEE Access 7 (2019) 44771–44782.
[36] D. Marabissi, R. Fantacci, Highly flexible RAN slicing approach to manage isolation, priority, efficiency, IEEE Access 7 (2019) 97130–97142.
[37] M.R. Raza, C. Natalino, P. Öhlen, L. Wosinska, P. Monti, Reinforcement learning for slicing in a 5G flexible RAN, J. Lightwave Technol. 37 (20) (2019) 5161–5169.
[38] S. Sesia, I. Toufik, M. Baker, The UMTS Long Term Evolution: From Theory to Practice, second ed., 2011.
[39] F.H.S. Pereira, C.A. Astudillo, T.P.C. De Andrade, N.L.S. Da Fonseca, PRACH power control mechanism for improving random-access energy efficiency in long term evolution, in: 2018 IEEE 10th Latin-American Conference on Communications, LATINCOM, 2018, pp. 1–6.
[40] J. Choi, J. Ding, P. Le, Z. Ding, Grant-free random access in machine-type communication: Approaches and challenges, 2020.
[41] V. Savaux, A. Kountouris, Y. Louët, C. Moy, Modeling of time and frequency random access network and throughput capacity analysis, EAI Endorsed Trans. Cogn. Commun. 3 (11) (2017) e2, URL https://hal.archives-ouvertes.fr/hal-01531239.
[42] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, 2017.
[43] Y. Li, Deep reinforcement learning: An overview, 2017, CoRR, abs/1701.07274, arXiv:1701.07274.
[44] P. Coady, TRPO, 2018, GitHub repository, URL http://dx.doi.org/10.5281/zenodo.1183378.
[45] D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, 2014.
[46] S. Reddi, S. Kale, S. Kumar, On the convergence of Adam and Beyond, in: International Conference on Learning Representations, 2018.
[47] M. Koseoglu, Pricing-based load control of M2M traffic for the LTE-A random access channel, IEEE Trans. Commun. 65 (3) (2017) 1353–1365.
[48] M. Koseoglu, Lower bounds on the LTE-A average random access delay under massive M2M arrivals, IEEE Trans. Commun. 64 (5) (2016) 2104–2115.
[49] M. Koseoglu, Smart pricing for service differentiation and load control of the LTE-A IoT system, in: 2015 IEEE 2nd World Forum on Internet of Things, WF-IoT, 2015, pp. 187–192.
[50] C.P. Andriotis, K.G. Papakonstantinou, Managing engineering systems with large state and action spaces through deep reinforcement learning, 2018.

Mehmet Koseoglu is a principal data scientist at Udemy working in the catalog search team. Prior to joining Udemy, he was an Assistant Professor of Computer Engineering at Hacettepe University, Turkey between 2015 and 2020. He visited the UCLA ECE department during the 2017–2018 academic year as a Fulbright postdoctoral scholar.