Deep Reinforcement Learning For Resource Allocation With Network Slicing in Cognitive Radio Network
Siyu Yuan1,2, Yong Zhang1,2,**, Wenbo Qie1, Tengteng Ma1,2, and Sisi Li1
1 School of Electronic Engineering, Beijing University of Posts and Telecommunications, 100876, Beijing, China
{yuanisyu, yongzhang, qwb, mtt, ssl123}@bupt.edu.cn
2 Beijing Key Laboratory of Work Safety Intelligent Monitoring, Beijing University of Posts and Telecommunications, 100876, Beijing, China
1. Introduction
With the development of wireless communication technology, wireless services worldwide are trending toward high mobility, huge capacity, and intelligent mechanisms. The fifth-generation (5G) cellular network is the key technology of current wireless communications. The deployment of 5G networks will promote the rapid development of IoT (Internet of Things) and cloud computing services such as 4K video, VR (Virtual Reality), AR (Augmented Reality), driverless cars, intelligent power grids, and telemedicine [14]. The 5G network is characterized by network virtualization and programmability, and introduces a new technology called network slicing [4]. Network slicing is an on-demand networking model that allows operators to separate multiple virtual networks on a unified infrastructure. Each network slice is logically isolated
* This paper is an extended version of [27], which was published in the International Conference on Human-Centered Computing 2019.
** Corresponding author
from the wireless access network to the core network, so that each slice can adapt to a particular type of application. The 5G network supports three general service scenarios: eMBB (Enhanced Mobile Broadband), URLLC (Ultra-Reliable and Low-Latency Communication) and mMTC (Massive Machine-Type Communications). eMBB further improves the user experience of existing mobile broadband scenarios; its most intuitive benefit is a greatly increased transmission rate, and it mainly serves 4K video and large file downloads. URLLC is characterized by high reliability and low latency, and mainly serves unmanned driving and remote surgery. The mMTC scenario mainly serves large-scale IoT services. In order to provide better performance and cost-effective services, network slicing leaves much room for research in terms of resource management. With resource management algorithms, a wireless network can effectively increase the total transmission rate of the wireless access network [17], the spectrum efficiency [11], and the user-perceived QoE [8].
At present, the demand for higher data rates keeps growing, and so does the demand for spectrum resources. However, spectrum resources are scarce: according to current spectrum policies, most of the available spectrum has already been allocated or licensed to wireless service providers. Cognitive radio technology has become the key to solving this spectrum scarcity problem [12]. Cognitive radio monitors the working state of authorized users by sensing the spectrum environment and dynamically schedules the available idle spectrum, provided that the interference caused to the authorized users stays within a certain range, thereby improving spectrum utilization. In a cognitive radio network, depending on how cognitive users access the idle licensed spectrum, the sharing of licensed spectrum can be divided into two models: overlay and underlay. In the overlay mode, cognitive users can use the authorized spectrum only when the authorized users are not communicating. The underlay mode allows cognitive users to transmit data on the spectrum of the authorized users at the same time as the authorized users. Cognitive users then cause some interference to the authorized users, but this interference must be kept within a certain range. In order to restrict the interference caused by cognitive users, the interference temperature constraint plays a key role in cognitive radio resource allocation. Interference temperature is a concept defined by the FCC (Federal Communications Commission) to improve spectrum utilization efficiency and support the application of cognitive radio [22]; it quantifies the communication interference caused by cognitive users.
Currently in China, the 230 MHz frequency band is used for the construction of electric power wireless private networks. It is a dedicated spectrum resource allocated to industries such as electric power, water conservancy, and geology. Many frequency bands in electric power wireless private networks are licensed bands, and private network users cannot use the licensed bands of other private networks, which leaves the 230 MHz band with weak transmission capability and low spectrum utilization [2]. With the development of wireless communication technology, the current power wireless private network based on the LTE system has begun to evolve toward 5G, and multi-slice services need to be carried in a spectrum-aware environment. Applying cognitive radio technology to 5G networks can effectively alleviate spectrum scarcity, improve spectrum utilization, and support the construction of 5G-based power wireless private network systems.
In this article, we apply the DDQN (Double Deep Q-Network) algorithm and propose a deep reinforcement learning framework called CNDDQN for cognitive radio networks. The framework is used to solve the resource allocation problem in cognitive radio networks with network slicing. Under the underlay model of the cognitive radio network, the framework jointly optimizes the overall spectrum efficiency of the cognitive network and the QoE of the secondary users by managing the channel selection and power allocation of the secondary users. It learns the optimal resource allocation strategy by establishing a mapping from the known channel selection and power allocation strategies of the primary users to the channel selection and power allocation strategies of the secondary users. We first introduce a cognitive radio network model combined with network slicing. Secondly, we introduce the basic concepts of reinforcement learning, the Q-Learning algorithm and the DDQN algorithm. Subsequently, we present the details of the CNDDQN algorithm. Finally, we conduct simulation experiments to verify the stability and effectiveness of the CNDDQN algorithm.
The main contributions of this paper are as follows:
1) This paper proposes a cognitive radio model in the 5G network slicing scenario, which provides effective help for the construction of 5G-based electric power wireless private network systems.
2) The proposed resource allocation algorithm takes user QoE into account and jointly optimizes the network spectrum efficiency and user QoE to guarantee the user experience.
3) This paper proposes a resource allocation algorithm based on DDQN to solve the overestimation problem of the DQN algorithm.
The remainder of this paper is organized as follows. Section 2 reviews work related to this article. Section 3 introduces the system model of the cognitive radio network and the formulation of the resource allocation problem. Section 4 presents the proposed deep reinforcement learning algorithm (CNDDQN). The simulation results and analysis are given in Section 5. Section 6 concludes this article.
2. Related Work
Resource allocation in cognitive radio networks has been widely studied; [18,6] summarize the existing work. The main optimization objectives of resource allocation in cognitive radio networks include maximizing throughput, spectrum efficiency and energy efficiency, minimizing interference, and ensuring users' quality of service. [7] proposes a distributed user association and resource allocation algorithm based on matching theory to maximize the total throughput of primary and secondary users. [9] proposes a deep reinforcement learning based method for cognitive uplink users of cellular networks and deploys sensors to help secondary users collect signal strength information at different locations in the wireless environment, so that a secondary user can share spectrum with the primary user without knowing the primary user's power allocation strategy. However, [9] does not consider the channel selection strategy of secondary users.
As a key technology of 5G networks, network slicing is considered in many kinds of resource allocation scenarios, and there is some research on applying network slicing to cognitive radio resource allocation [10,3]. In [15], network slicing is classified into spectrum-level, infrastructure-level and network-level slicing. In [10], the allocation of wireless slicing resources among multiple users is modeled as a bankruptcy game, which achieves fairness of allocation. [3] proposes a multi-time-scale resource allocation model for cognitive radio network slicing, which is decomposed into inter-slice subchannel pre-assignment over a large time period and intra-slice subchannel and power scheduling within each time slot. [3] formulates the inter-slice problem as an integer optimization problem and the intra-slice problem as a mixed optimization problem with integer variables, and adopts a Lyapunov optimization method with a heuristic subchannel assignment procedure and a fast barrier-based power allocation procedure. The above papers use traditional optimization methods, such as game theory and Lyapunov optimization. These methods need to transform the optimization objective into a convex optimization problem to obtain the optimal solution, which imposes certain restrictions on the communication network scenario: for example, the locations of users are fixed, and more users bring higher algorithm complexity and longer computation time.
In order to solve the resource allocation problem in complex communication network scenarios, we propose a reinforcement learning architecture for resource allocation optimization in communication networks. Existing reinforcement learning algorithms applied to resource allocation are mainly divided into distributed multi-agent reinforcement learning algorithms [5] and centralized single-agent reinforcement learning algorithms [27,25]. A centralized algorithm needs global information, achieves better utility, and can balance the users of the whole network; a distributed algorithm only needs local information, so it has a lower communication cost. [27] proposes a centralized reinforcement learning algorithm based on DQN, which uses the underlay access mode to maximize the spectrum efficiency of secondary users under the interference temperature limit acceptable to the primary user, but its network model does not consider network slicing. [5] proposes a distributed reinforcement learning algorithm based on Q-Learning and SARSA; the secondary users are organized into a random dynamic team in a decentralized and cooperative way, which speeds up the convergence of the algorithm, improves the network capacity, and obtains the optimal energy-saving resource allocation strategy. However, [5] only considers a single kind of service slice (a high-rate service slice) in its network model, and, because it uses the table-based Q-Learning and SARSA algorithms, the state space has to be discretized, which introduces quantization error. [25] proposes a graph convolutional network-based reinforcement learning algorithm built on DQN: secondary users are formed into a graph, their information features are extracted by graph convolution, and the DQN algorithm is then used for policy learning to maximize the data rate of secondary users while guaranteeing users' quality of service. In this paper, we propose a centralized reinforcement learning algorithm based on DDQN and use the DDQN algorithm to solve the overestimation problem of DQN, so as to improve the convergence speed and stability of the algorithm. In addition, in the network scenario, we consider multiple service slices combined with the cognitive radio network, namely rate-sensitive eMBB service slices and delay-sensitive URLLC service slices. In terms of the optimization goal, optimizing only the overall spectral efficiency of the cognitive network may sacrifice the experience of some users; therefore, we jointly optimize the spectral efficiency of the cognitive network and the user-perceived QoE of each user.
3. System Model and Problem Formulation
The total secondary user set is $SU = \{1, 2, \ldots, N_{su}\}$, where $N_{su} = N_{su}^{eMBB} + N_{su}^{URLLC}$ is the total number of secondary users. The total primary user set is $PU = \{1, 2, \ldots, N_{pu}\}$, where $N_{pu}$ is the total number of primary users.
There are $k$ channels for users to use. The channel set is $C = \{1, 2, \ldots, k\}$ and the bandwidth of each channel is $B$, so the total network bandwidth is $W = k \cdot B$. Assuming that each primary user can occupy multiple channels at the same time, the primary user-channel association matrix is $PCA = \{a^{pc}_{n,k}\}_{N_{pu} \times k}$. If PU $n$ occupies channel $k$, then $a^{pc}_{n,k} = 1$; otherwise $a^{pc}_{n,k} = 0$. Each secondary user can only occupy one channel, and the secondary user-channel association matrix is $SCA = \{a^{sc}_{n,k}\}_{N_{su} \times k}$. If secondary user $n$ occupies channel $k$, then $a^{sc}_{n,k} = 1$; otherwise $a^{sc}_{n,k} = 0$.
The channel gains between each primary user and the PBS (primary base station) form the matrix $PPG = \{g^{pP}_{n}\}_{N_{pu}}$, and the channel gains between each primary user and the CBS (cognitive base station) form the matrix $PCG = \{g^{pC}_{n}\}_{N_{pu}}$. The channel gains between each secondary user and the PBS form the matrix $SPG = \{g^{sP}_{n}\}_{N_{su}}$, and the channel gains between each secondary user and the CBS form the matrix $SCG = \{g^{sC}_{n}\}_{N_{su}}$.
Assume that the maximum transmission powers of the PBS and the CBS are $P^{PBS}_{max}$ and $P^{CBS}_{max}$, respectively. $P^{pu}_{n,k}$ denotes the transmission power of primary user $n$ on channel $k$, and $P^{su}_{n,k}$ denotes the transmission power of secondary user $n$ on channel $k$.
According to the definition of the signal-to-interference-plus-noise ratio, (1) and (2) give the SINR expressions of the PU and the SU:

$$\delta^{pu}_{n} = \frac{\sum_{k \in C} a^{pc}_{n,k} \cdot g^{pP}_{n} \cdot P^{pu}_{n,k}}{\sum_{a \in PU, a \neq n} \sum_{k \in C} a^{pc}_{n,k} \cdot a^{pc}_{a,k} \cdot g^{pP}_{a} \cdot P^{pu}_{a,k} + \sum_{a \in SU} \sum_{k \in C} a^{pc}_{n,k} \cdot a^{sc}_{a,k} \cdot g^{pC}_{a} \cdot P^{su}_{a,k} + \sigma^{2}} \,. \quad (1)$$

$$\delta^{su}_{n} = \frac{\sum_{k \in C} a^{sc}_{n,k} \cdot g^{sC}_{n} \cdot P^{su}_{n,k}}{\sum_{a \in PU} \sum_{k \in C} a^{sc}_{n,k} \cdot a^{pc}_{a,k} \cdot g^{sP}_{a} \cdot P^{pu}_{a,k} + \sum_{a \in SU, a \neq n} \sum_{k \in C} a^{sc}_{n,k} \cdot a^{sc}_{a,k} \cdot g^{sC}_{a} \cdot P^{su}_{a,k} + \sigma^{2}} \,. \quad (2)$$
According to the Shannon channel formula $R = B \cdot \log(1 + \delta)$, the transmission rate of primary user $R^{pu}_{n}$ and secondary user $R^{su}_{n}$ can be calculated. Therefore, the total transmission rate of the cognitive network is $R_{cn} = \sum_{n \in SU} R^{su}_{n}$, and (3) gives the total spectrum efficiency of the cognitive network:

$$\eta_{cn} = \frac{R_{cn}}{W} = \frac{\sum_{n \in SU} B \cdot \log(1 + \delta^{su}_{n})}{k \cdot B} = \frac{1}{k} \sum_{n \in SU} \log(1 + \delta^{su}_{n}) \,. \quad (3)$$
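To make the per-user SINR and the spectrum-efficiency computation of (2)-(3) concrete, the following is a minimal Python sketch. It assumes a simplified downlink setting in which each secondary user occupies one channel and the primary network's interference is summarized as an aggregate per-channel term; all numeric values (gains, powers, noise) are toy assumptions rather than the paper's simulation parameters.

```python
import numpy as np

# Minimal sketch of Eqs. (2)-(3) under simplifying assumptions: one cognitive base
# station serves N_su secondary users, each on exactly one channel. All values are toy.
rng = np.random.default_rng(0)
N_su, K = 4, 3                        # secondary users, channels
sigma2 = 1e-9                         # noise power

chan = rng.integers(0, K, size=N_su)  # channel chosen by each secondary user
g_sC = rng.uniform(1e-3, 1e-2, N_su)  # gain between each SU and the CBS
P_su = rng.uniform(0.1, 1.0, N_su)    # CBS transmit power toward each SU
pu_interf = rng.uniform(0, 1e-10, K)  # aggregate primary-network interference per channel

def sinr(n):
    """Simplified SINR of secondary user n: own signal over co-channel SU
    interference, primary interference on that channel, and noise."""
    k = chan[n]
    signal = g_sC[n] * P_su[n]
    i_su = sum(g_sC[a] * P_su[a] for a in range(N_su) if a != n and chan[a] == k)
    return signal / (i_su + pu_interf[k] + sigma2)

# Eq. (3): total spectrum efficiency of the cognitive network (bit/s/Hz)
eta_cn = sum(np.log2(1.0 + sinr(n)) for n in range(N_su)) / K
print(f"spectrum efficiency eta_cn = {eta_cn:.2f} bit/s/Hz")
```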
The user's QoE mainly reflects the user's communication needs and is defined as the ratio of the number of packets meeting the communication requirement to the total number of packets. The communication requirement of eMBB slice users is that the transmission rate is higher than a certain threshold, and the communication requirement of URLLC slice users is that the transmission delay is lower than a certain threshold.
The transmission rate of a data packet is expressed by the user's transmission rate, and the composition of the transmission delay of a data packet is shown in Fig. 2.
The transmission delay of a data packet is mainly composed of the queueing delay when entering the base station ($t_1$), the processing delay of channel allocation at the base station ($t_2$), the queueing delay of entering the channel ($t_3$), and the transmission delay in the channel ($t_4$). To simplify the delay model, since $t_1$ and $t_2$ belong to the processing delay of the base station and $t_4$ is very small, they are not considered in this paper. Therefore, the transmission delay of a data packet is the queueing delay of the packet entering the channel, $t_3$. We use the M/M/1 queue model to calculate the queueing delay. According to the average waiting time formula of the M/M/1 queue, $W_s = 1/(\mu - \lambda)$, where $\mu$ is the service rate and $\lambda$ is the arrival rate, we obtain the queueing delay $t_3 = 1/(r_{package} - \lambda)$, where $\lambda$ is the packet arrival rate, $r_{package} = R_n / L$ is the transmission rate of each data packet, $R_n$ is the transmission rate of the user, and $L$ is the packet length. We assume that the packet length is normally distributed. Therefore, (4) gives the transmission delay of a data packet:

$$t = \frac{1}{R_{n}/L - \lambda} \,. \quad (4)$$
Let $t_{max}$ and $R_{min}$ be the thresholds of packet transmission delay and transmission rate that meet the communication requirements. The conditions for meeting the communication requirements are shown in (5); the user's QoE equals the ratio of the number of packets satisfying the corresponding inequality to the total number of packets:

$$\begin{cases} R_{n} \geq R_{min} & \text{for eMBB users} \\ t = \dfrac{1}{R_{n}/L - \lambda} \leq t_{max} & \text{for URLLC users} \end{cases} \quad (5)$$
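As an illustration of the QoE definition in (4)-(5), the sketch below counts the fraction of packets that satisfy the rate (eMBB) or delay (URLLC) requirement. The thresholds $R_{min}$ and $t_{max}$, the arrival rate $\lambda$, and the packet length statistics are hypothetical values chosen only for the example.

```python
import numpy as np

# Sketch of the QoE in Eqs. (4)-(5): ratio of packets whose rate (eMBB) or M/M/1
# queueing delay (URLLC) meets its threshold. All numeric values are assumptions.
rng = np.random.default_rng(1)

def qoe(rates, slice_type, L_mean=1000.0, lam=50.0, R_min=1e6, t_max=10e-3):
    """rates: per-packet user transmission rate R_n (bit/s)."""
    n_packets = len(rates)
    L = np.maximum(rng.normal(L_mean, 0.1 * L_mean, n_packets), 1.0)  # normal packet length
    if slice_type == "eMBB":
        ok = rates >= R_min                          # rate requirement in Eq. (5)
    else:                                            # URLLC
        r_pkg = rates / L                            # packet rate r_package = R_n / L
        t = 1.0 / np.maximum(r_pkg - lam, 1e-9)      # queueing delay, Eq. (4)
        ok = (r_pkg > lam) & (t <= t_max)            # delay requirement in Eq. (5)
    return ok.mean()                                 # satisfied packets / total packets

rates = rng.uniform(0.5e6, 2e6, 200)
print("eMBB QoE :", qoe(rates, "eMBB"))
print("URLLC QoE:", qoe(rates, "URLLC"))
```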
In order to balance the spectral efficiency and the users' QoE, we set an attention coefficient $\alpha \in [0, 1]$ between the spectral efficiency and the user QoE. $\alpha = 1$ means that the optimization goal is only to maximize the system spectral efficiency, and $\alpha = 0$ means that the optimization goal is only to maximize the user QoE. Therefore, our optimization goal is given in (6).
The interference temperature caused by the cognitive network is given in (7):

$$IT = \frac{\sum_{a \in SU} \sum_{k \in C} a^{sc}_{a,k} \cdot g^{pC}_{a} \cdot P^{su}_{a,k}}{k_{cons} \cdot W} \,. \quad (7)$$
Let the maximum interference temperature caused by the cognitive network that is acceptable to the PU be $IT^{max}$.
Constraint C1 indicates that each secondary user can only be associated with one channel. Constraint C2 is the maximum total power constraint of the cognitive base station. Constraint C3 is the primary user's interference temperature constraint on the cognitive network.
Therefore, the optimization problem can be expressed as (8)-(11).
$$C3: \frac{\sum_{a \in SU} \sum_{k \in C} a^{sc}_{a,k} \cdot g^{pC}_{a} \cdot P^{su}_{a,k}}{k_{cons} \cdot W} \leq IT^{max} \,. \quad (11)$$
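The following sketch evaluates the interference temperature of (7) and checks constraint C3 of (11). We read $k_{cons}$ as Boltzmann's constant, the usual normalization in the FCC interference temperature definition; this reading and all numeric values below are assumptions.

```python
import numpy as np

# Sketch of the interference temperature in Eq. (7) and constraint C3 in Eq. (11).
K_CONS = 1.380649e-23          # assumed meaning of k_cons: Boltzmann constant (J/K)
K, B = 3, 1e6                  # channels and per-channel bandwidth (Hz), toy values
W = K * B

def interference_temperature(a_sc, g, P_su):
    """a_sc: (N_su, K) SU-channel association; g: (N_su,) gain toward the primary
    receiver; P_su: (N_su, K) SU transmit powers. Returns IT as in Eq. (7)."""
    return np.sum(a_sc * g[:, None] * P_su) / (K_CONS * W)

rng = np.random.default_rng(2)
a_sc = np.eye(K)[rng.integers(0, K, 4)]           # 4 SUs, one channel each
g = rng.uniform(1e-6, 1e-5, 4)
P_su = a_sc * rng.uniform(0.1, 1.0, (4, K))
IT_MAX = 1e12                                     # toy threshold
print("C3 satisfied:", interference_temperature(a_sc, g, P_su) <= IT_MAX)
```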
Due to the nonlinear constraints involving both continuous variables (such as $P^{su}_{a,k}$) and binary variables (such as $a^{sc}_{a,k}$), the optimization problem is non-convex. Using deep reinforcement learning to solve such non-convex problems is a common approach. Therefore, we propose a deep reinforcement learning algorithm to solve this optimization problem.
Then, we define two value functions: the state value function and the action value function. The state value function $V(s)$ is defined as the expectation of the long-term reward that can be obtained from state $s$ at time $t$. It represents the value of a state regardless of which action is chosen in that state; taking the current state as the starting point, it is a weighted sum over all possible actions: $V_{\pi}(s) = E_{\pi}[R_t \mid S_t = s]$, where $\pi$ is the strategy, defined as $\pi(a \mid s) = P[A_t = a \mid S_t = s]$. The action value function $G(s, a)$ is defined as the long-term reward that can be obtained by selecting action $a$ in state $s$ at time $t$. It represents the value of an action in a certain state and is the weighted sum of all possible long-term rewards for a given state and action: $G_{\pi}(s, a) = E_{\pi}[R_t \mid S_t = s, A_t = a]$.
Usually, a finite Markov decision process is described by a quadruple $M = (S, A, P, R)$, where $S$ is the finite state space, $A$ is the action space, $P$ is the state transition probability matrix, and $R$ is the expected reward. The Markov decision process relies on the Markov assumption that the probability of the next state $S_{t+1}$ depends only on the current state $S_t$ and action $A_t$, not on any earlier state or action. In the Markov decision process, given a state $s \in S$ and an action $a \in A$, the process transitions to the next state $s' \in S$ with a certain probability. $P^{a}_{ss'}$ is the state transition probability, i.e., the probability of reaching state $s'$ when starting from state $s$ and taking action $a$: $P^{a}_{ss'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$. $r^{a}_{ss'}$ is the expected reward obtained when starting from state $s$, taking action $a$, and transferring to state $s'$: $r^{a}_{ss'} = E(r_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s')$.
Combining the definition of the action value function with the cumulative discounted reward function, we can obtain the Bellman equation form of the action value function through a similar derivation, as shown in (17); (18) and (19) are the Bellman optimality equations.

$$G_{\pi}(s, a) = \sum_{s'} P^{a}_{ss'} \Big[ R^{a}_{ss'} + \gamma \sum_{a'} \pi(a' \mid s') \, G_{\pi}(s', a') \Big] \,. \quad (17)$$
Q-Learning is a classic reinforcement learning algorithm, but it uses a Q-Table to store Q values. This limits Q-Learning to problems in which the action space and the state space are very small and generally discrete. If there are many kinds of states and actions in the model, the Q-Table becomes very large, possibly larger than the computer's memory, and searching such a huge table at every update is very time-consuming. However, more complex tasks that are closer to real situations often have large state and action spaces. Deep learning performs well on high-dimensional data, and deep reinforcement learning combines reinforcement learning with deep learning, using a neural network instead of the original table to represent the value function.
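For reference, a tabular Q-Learning update (the table-based approach discussed above) can be sketched as follows; the state and action counts, learning rate and discount factor are arbitrary example values.

```python
import numpy as np

# Minimal tabular Q-Learning update, to contrast with the neural-network-based
# value function used by DQN/DDQN below. Sizes and hyperparameters are arbitrary.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9   # learning rate and discount factor (assumed values)

def q_learning_update(s, a, r, s_next):
    """One Q-Table update: move Q(s,a) toward the TD target r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_learning_update(s=0, a=1, r=1.0, s_next=3)
```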
DQN is a representative deep reinforcement learning algorithm. Instead of the Q-Table used by the original Q-Learning, the Q value (action value function) is computed by a neural network in DQN. In the decision-making process, DQN takes the state as the input of the neural network, computes the Q value of each action through the network, and then selects an action according to a principle similar to Q-Learning. Fig. 4 compares the Q value calculation process of Q-Learning and DQN. The original Q value $Q(s, a)$ is replaced by a parameterized form $Q(s, a; \theta)$, where $\theta$ denotes the parameters of the neural network.
In order to reduce the problems caused by correlation between data, DQN introduces two key techniques: experience replay and a fixed target value network.
In supervised learning, the samples are independent and identically distributed. However, the samples of reinforcement learning are obtained through the agent's continuous exploration, which makes them highly correlated and non-stationary and causes the training to converge with difficulty. Experience replay is used to solve this problem: the collected samples are first put into a sample pool, and samples are then drawn randomly from the pool for network training. Random sampling removes the correlation between samples and makes them independent of each other, thereby improving the stability and convergence of network training.
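A minimal experience replay buffer along the lines described above might look as follows; the capacity and batch size are assumed values.

```python
import random
from collections import deque

# Sketch of the experience replay buffer: transitions are stored in a pool and
# sampled uniformly at random to break their temporal correlation.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.pool.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        batch = random.sample(self.pool, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.pool)
```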
In the original Q-Learning, as described in (20), the TD error is obtained by calculating the difference between the target Q value and the estimated Q value. The TD target is calculated with the Bellman equation: it is the reward of the current action plus the discounted highest Q value of the next state. However, the same parameters are used when calculating the TD target and estimating the Q value, and the correlation between the two makes the model prone to oscillation and divergence. In order to solve this problem, DQN builds an independent target Q network, updated more slowly than the current Q network, to calculate the TD target, which makes oscillation and divergence during training less likely and the training more stable.
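The fixed target network can be sketched as a periodically synchronized copy of the training network, as below; the layer sizes and synchronization period are assumptions, not the paper's settings.

```python
import copy
import torch.nn as nn

# Sketch of the fixed target network: a periodically synchronized copy of the
# training Q network, so the TD target changes slowly during training.
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # toy sizes
target_net = copy.deepcopy(q_net)
SYNC_EVERY = 200  # assumed synchronization period (in training steps)

def maybe_sync(step):
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```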
In Q-Learning, updating a Q value directly changes the value of the corresponding entry in the table. In DQN, the Q value is updated by updating the parameters of the neural network, and the parameter update is driven by back-propagating the loss function. The loss function of DQN is defined as the squared error between the target Q value and the estimated Q value, as shown in (21):

$$Loss_{DQN} = \big[ r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big]^{2} \,. \quad (21)$$
DQN still suffers from overestimation, which means that the estimated value function is larger than the true value function; its root cause is the maximization operation in Q-Learning. When calculating the target Q value, the maximum Q value of the next state is taken. For a real strategy in a given state, the action that maximizes the Q value is not selected every time, because real strategies are generally stochastic, so always taking the maximum Q value here tends to make the target value higher than the true value. Double DQN (DDQN) solves the overestimation problem on the basis of DQN by using different value functions for action selection and action evaluation; DQN already provides two Q networks for this purpose. The calculation of the target Q value in DDQN can therefore be split into two steps. In the first step, the action that maximizes the Q value is obtained from the estimation (training) Q network. In the second step, the action value of that action is obtained from the target Q network. Combining the two steps, the loss function of DDQN is obtained, as shown in (22):

$$Loss_{DDQN} = \big[ r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\big) - Q(s, a; \theta) \big]^{2} \,. \quad (22)$$
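The difference between the DQN target in (21) and the DDQN target can be illustrated as follows; q_net and target_net are assumed to map a batch of states to per-action Q values, and the discount factor is an example value.

```python
import torch

# Contrast between the DQN target (Eq. 21) and the DDQN target: DDQN selects the
# action with the training network and evaluates it with the target network.
def dqn_target(reward, next_state, q_net, target_net, gamma=0.9):
    with torch.no_grad():
        return reward + gamma * target_net(next_state).max(dim=1).values

def ddqn_target(reward, next_state, q_net, target_net, gamma=0.9):
    with torch.no_grad():
        best_action = q_net(next_state).argmax(dim=1, keepdim=True)        # step 1: selection
        q_eval = target_net(next_state).gather(1, best_action).squeeze(1)  # step 2: evaluation
        return reward + gamma * q_eval

# In both variants the loss is the squared error between the target and Q(s, a; theta).
```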
Except for the change of the loss function, the main process of DDQN is the same as
that of DQN. Fig. 5 is a flowchart of the operation of the DDQN algorithm.
The transmission power of each secondary user on its channel is discretized into 20 levels, as shown in (23):

$$P^{su}_{n,k} \in \left\{ 0, \; \frac{P^{CBS}_{max}}{19}, \; \frac{2 P^{CBS}_{max}}{19}, \; \ldots, \; P^{CBS}_{max} \right\} . \quad (23)$$
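The discrete action space implied by (23), combining a channel choice with one of the 20 power levels, can be enumerated as in the sketch below; the number of channels and the CBS maximum power are example values.

```python
import itertools

# Sketch of one secondary user's discrete action set: a channel choice combined
# with one of 20 power levels {0, P_max/19, ..., P_max} as in Eq. (23).
K = 3                      # number of channels (assumed)
P_max_cbs = 10.0           # CBS maximum transmit power (assumed value, in watts)
power_levels = [i * P_max_cbs / 19 for i in range(20)]
actions = list(itertools.product(range(K), power_levels))   # (channel, power) pairs
print(len(actions), "actions per secondary user")           # K * 20
```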
The reward function is modified on the basis of (8) by adding the interference temperature constraint C3 to its definition. If the interference temperature constraint is satisfied, a normal reward is obtained; if it is not satisfied, only a zero reward can be obtained. The characteristics of the step function meet this expectation. We take the difference between the interference temperature threshold and the actual interference temperature to obtain the interference temperature constraint function (24):
$$f(a^{sc}_{a,k}, P^{su}_{a,k}) = \varepsilon\left( IT^{max} - \frac{\sum_{a \in SU} \sum_{k \in C} a^{sc}_{a,k} \cdot g^{pC}_{a} \cdot P^{su}_{a,k}}{k_{cons} \cdot W} \right) . \quad (24)$$

$\varepsilon(x)$ is the step function:

$$\varepsilon(x) = \begin{cases} 1 & x > 0 \\ 0.5 & x = 0 \\ 0 & x < 0 \end{cases}$$
The step function is discontinuous at 0, which makes gradient descent difficult. Therefore, we use the Sigmoid function as a smooth approximation of the step function, and the interference temperature constraint function becomes (25):

$$f(a^{sc}_{a,k}, P^{su}_{a,k}) = Sigmoid\left( IT^{max} - \frac{\sum_{a \in SU} \sum_{k \in C} a^{sc}_{a,k} \cdot g^{pC}_{a} \cdot P^{su}_{a,k}}{k_{cons} \cdot W} \right) . \quad (25)$$

The Sigmoid function is $Sigmoid(x) = \frac{1}{1 + e^{-x}}$. Combined with the interference temperature constraint function, the reward function is defined as shown in (26):

$$Reward = f(a^{sc}_{a,k}, P^{su}_{a,k}) \cdot \big[ \alpha \, \eta_{cn} + (1 - \alpha) \, QoE \big] \,. \quad (26)$$
There are two neural networks in this algorithm: the training network and the target network. The two networks have the same structure but are updated differently. The neural network used here is a simple fully connected network with two fully connected layers; its structure is shown in Fig. 6.
Fig. 6. Neural network structure ($N_{pu}$: total number of primary users, $N_{su}$: total number of secondary users, $k$: total number of channels)
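A possible realization of the two-layer fully connected Q network of Fig. 6 is sketched below in PyTorch; the hidden width and the input/output dimensions (which in the paper depend on $N_{pu}$, $N_{su}$ and $k$) are assumptions.

```python
import torch.nn as nn

# Sketch of the fully connected Q network in Fig. 6 (two fully connected layers).
# The input is assumed to encode the network state; the output gives one Q value
# per discrete (channel, power) action. Dimensions below are placeholders.
class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=64, n_actions=60)   # example sizes only
```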
5. Performance Evaluation
In this section, we first introduce the settings of cognitive network model parameters and
DDQN hyperparameters. Then, we provide the simulation performance results.
The three curves in the figure show the DDQN reward value versus the number of iterations for learning rates of 0.05, 0.005 and 0.0005. When the learning rate is small (δ = 0.0005), the CNDDQN algorithm converges after about 1500 iterations, so the convergence speed is slow. As the learning rate increases to δ = 0.005, the convergence speed of the CNDDQN algorithm increases, and the algorithm converges after about 700 iterations. When the learning rate is too large (δ = 0.05), the CNDDQN algorithm converges after about 400 iterations, but the final converged reward value is lower than the converged reward values obtained with learning rates of 0.005 and 0.0005. It can be seen that a low learning rate leads to a slower convergence rate and requires more iterations to reach convergence, while a learning rate that is too high causes CNDDQN to converge to a lower final reward value than a moderate learning rate. Therefore, the learning rate should be chosen moderately; both too high and too low a learning rate degrade the performance of the algorithm. In the simulations of this paper, the learning rate δ = 0.005 is an appropriate value.
Observing the curve for learning rate δ = 0.005 in Fig. 8, we can see that the reward value is low and unstable at the beginning of the iterations. As training progresses, the reward keeps growing, and after a certain number of iterations the reward converges, which means that the CNDDQN algorithm has learned the optimal action strategy. After the CNDDQN algorithm converges, the remaining jitter of the reward is caused by the ε-greedy exploration in the CNDDQN algorithm.
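The ε-greedy exploration responsible for this residual jitter can be sketched as follows; the value of ε is an example, not the paper's setting.

```python
import random

# Sketch of epsilon-greedy action selection: with probability epsilon a random
# action is taken (exploration), otherwise the action with the highest Q value.
def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: list/array of Q values for the current state."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

action = epsilon_greedy([0.2, 0.8, 0.5])
```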
6. Conclusion
In this article, we propose a resource allocation algorithm (CNDDQN) for cognitive radio networks with network slicing. The algorithm targets cognitive radio scenarios in underlay mode, in which secondary users are allowed to access the frequency bands licensed to the primary user as long as the interference remains acceptable to the primary user. In order to quantify the interference caused by secondary users, we introduce the concept of interference temperature. In order to solve the formulated non-convex and NP-hard resource allocation problem, we use a deep reinforcement learning algorithm (DDQN). The algorithm jointly optimizes the overall spectrum efficiency of the cognitive network and the QoE of the secondary users by managing the channel selection and power allocation of the secondary users. Through continuous iterative learning, the algorithm keeps updating the resource allocation strategy of the secondary users and finally reaches the optimal strategy. Simulation results show that, compared with other reinforcement learning methods, the proposed CNDDQN can effectively reach a near-optimal solution within a smaller number of iterations.
Acknowledgments. The authors would like to thank the reviewers for their detailed reviews and
constructive comments, which have helped improve the quality of this paper. This work is supported
by the National Natural Science Foundation of China under Grant No.61971057.
References
1. Chen, J., Chen, S., Wang, Q., Cao, B., Feng, G., Hu, J.: iraf: A deep reinforcement learning ap-
proach for collaborative mobile edge computing iot networks. IEEE Internet of Things Journal
6(4), 7011–7024 (2019)
2. Guo, D., Zhang, Y., Xu, G., Hyeongchun, P.: Spectrum aggregation scheme in a wireless broad-
band data transceiver system. international conference on robotics and automation 33(5) (2018)
3. Jiang, H., Wang, T., Wang, S.: Multi-scale hierarchical resource management for wireless net-
work virtualization. IEEE Transactions on Cognitive Communications and Networking 4(4),
919–928 (2018)
4. Katsalis, K., Nikaein, N., Schiller, E., Ksentini, A., Braun, T.: Network slices toward 5g com-
munications: Slicing the lte network. IEEE Communications Magazine 55(8), 146–154 (2017)
5. Kaur, A., Kumar, K.: Energy-efficient resource allocation in cognitive radio networks under co-
operative multi-agent model-free reinforcement learning schemes. IEEE Transactions on Net-
work and Service Management 17(3), 1337–1348 (2020)
6. Kumar, A., Kumar, K.: Multiple access schemes for cognitive radio networks: A survey. Phys-
ical Communication 38, 100953 (2020)
7. LeAnh, T., Tran, N.H., Saad, W., Le, L.B., Niyato, D., Ho, T.M., Hong, C.S.: Matching theory
for distributed user association and resource allocation in cognitive femtocell networks. IEEE
Transactions on Vehicular Technology 66(9), 8413–8428 (2017)
8. Li, R., Zhao, Z., Sun, Q., Chihlin, I., Yang, C., Chen, X., Zhao, M., Zhang, H.: Deep rein-
forcement learning for resource management in network slicing. IEEE Access 6, 74429–74441
(2018)
9. Li, X., Fang, J., Cheng, W., Duan, H., Chen, Z., Li, H.: Intelligent power control for spectrum
sharing in cognitive radios: A deep reinforcement learning approach. IEEE Access 6, 25463–
25473 (2018)
10. Liu, B., Tian, H.: A bankruptcy game-based resource allocation approach among virtual mobile
operators. IEEE Communications Letters 17(7), 1420–1423 (2013)
11. Ma, T., Zhang, Y., Wang, F., Wang, D., Guo, D.: Slicing resource allocation for embb and urllc
in 5g ran. Wireless Communications and Mobile Computing 2020, 1–11 (2020)
12. Mitola, J., Maguire, G.Q.: Cognitive radio: making software radios more personal. IEEE Per-
sonal Communications 6(4), 13–18 (1999)
13. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller,
M.: Playing atari with deep reinforcement learning. arXiv: Learning (2013)
14. Pengpeng, L.I., Zheng, N., Kang, P., Tan, H., Fang, J.: Overview and inspiration of global 5g
spectrum researches. Telecommunication Engineering (2017)
15. Richart, M., Baliosian, J., Serrat, J., Gorricho, J.L.: Resource slicing in virtual wireless net-
works: A survey. IEEE Transactions on Network and Service Management 13(3), 462–476
(2016)
16. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv: Learning
(2015)
17. Tang, L., Tan, Q., Shi, Y., Wang, C., Chen, Q.: Adaptive virtual resource allocation in 5g net-
work slicing using constrained markov decision process. IEEE Access 6, 61184–61195 (2018)
18. Tarek, D., Benslimane, A., Darwish, M., Kotb, A.M.: Survey on spectrum sharing/allocation
for cognitive radio networks internet of things. Egyptian Informatics Journal (2020)
19. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double q-learning pp.
2094–2100 (2016)
20. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., De Freitas, N.: Dueling network
architectures for deep reinforcement learning pp. 1995–2003 (2016)
21. Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning 8(3-4), 279–292 (1992)
22. Yongjun, X., Xiaohui, Z.: Optimal power allocation for multiuser underlay cognitive radio
networks under qos and interference temperature constraints. China Communications 10(10),
91–100 (2013)
23. Zhang, Y., Kang, C., Ma, T., Teng, Y., Guo, D.: Power allocation in multi-cell networks using
deep reinforcement learning pp. 1–6 (2018)
24. Zhang, Y., Kang, C., Teng, Y., Li, S., Zheng, W., Fang, J.: Deep reinforcement learning frame-
work for joint resource allocation in heterogeneous networks pp. 1–6 (2019)
25. Zhao, D., Qin, H., Song, B., Han, B., Du, X., Guizani, M.: A graph convolutional network-
based deep reinforcement learning approach for resource allocation in a cognitive radio net-
work. Sensors 20(18), 5216 (2020)
26. Zhao, N., Liang, Y., Niyato, D., Pei, Y., Jiang, Y.: Deep reinforcement learning for user associ-
ation and resource allocation in heterogeneous networks pp. 1–6 (2018)
27. Zheng, W., Wu, G., Qie, W., Zhang, Y.: Deep reinforcement learning for joint channel selection
and power allocation in cognitive internet of things. In: International Conference on Human
Centered Computing. pp. 683–692. Springer (2019)
Siyu Yuan received the B.E. degree in electronic science and technology from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2019. He is currently pursuing the Ph.D. degree with the School of Electronic Engineering, BUPT, Beijing, China. His research interests include reinforcement learning, cognitive network slicing and wireless network resource allocation. Email: [email protected]
Zhang Yong received the Ph.D. degree from Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2007. He is a Professor with the School of Electronic Engineering, BUPT, and is currently the Director of the Fab. X Artificial Intelligence Research Center, BUPT. He is the Deputy Head of the mobile internet service and platform working group of the China Communications Standards Association. He has authored or coauthored more than 80 papers and holds 30 granted China patents. His research interests include artificial intelligence, wireless communication, and the Internet of Things. Email: [email protected]
Qie Wenbo received the B.E. degree from Yanshan University, China, in 2018. She is currently pursuing the M.S. degree in electronic science and technology at Beijing University of Posts and Telecommunications, Beijing, China. Her research interests include reinforcement learning, cognitive network slicing, and resource allocation. Email: [email protected]
Ma Tengteng received the B.S. degree from the School of Science, Qufu Normal University, Jining, China, in 2015. He is currently pursuing the Ph.D. degree with the School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China. His current research interests include network slicing, QoS, cognitive radio and virtual network resource allocation. Email: [email protected]
Li Sisi received the B.S. degree from Beijing University of Posts and Telecommunications (BUPT) in 2019. She is currently pursuing the Ph.D. degree in computer science and technology at BUPT. Her research areas include wireless communication, network slicing, network resource allocation, and mobile edge computing. Email: [email protected]