A Q-Learning-Based Adaptive MAC Protocol For Internet of Things Networks
Digital Object Identifier 10.1109/ACCESS.2021.3103718
ABSTRACT In Internet of Things (IoT) applications, quality of service (QoS) is often required, whether as throughput for transmitting video or as bounded delay for controlling a sensor node. A traditional contention-based medium access control (MAC) protocol cannot meet the adaptive traffic demands of these networks and imposes delay-related constraints. Q-learning (QL), one of the reinforcement learning (RL) mechanisms, is a promising machine learning scheme for future spectrum MAC protocols in IoT networks. In this study, a QL-based MAC protocol is proposed to facilitate adaptive adjustment of the length of the contention period in response to the ongoing traffic rate in IoT networks. The novelty of the QL-based MAC lies in its use of RL to adjust the length of the contention period dynamically according to the traffic rate; the protocol adapts to environmental variations during training without requiring additional input information. We confirm that the proposed QL-based MAC protocol is robust under node contention. In addition, we show that our proposed QL-based MAC protocol achieves higher system throughput, lower end-to-end delay, and lower energy consumption in MAC contention than contention-based MAC protocols.
INDEX TERMS Internet of Things, quality of service, medium access control, reinforcement learning,
Q-learning.
One standard for wireless network design is defined in the IEEE 802.11 protocol, which is used extensively as a testbed and as a reference model for simulation in wireless network studies. IEEE 802.11 defines the request-to-send/clear-to-send (RTS/CTS) scheme to overcome the hidden terminal problem in wireless networks. When many nodes simultaneously send the RTS control frame in the contention period, a collision occurs [9]. Furthermore, if the receiver node cannot reply with the CTS control frame after receiving the RTS control frame, the connection also fails. This scheme can be used to solve the hidden terminal problem in wireless networks, which improves the system throughput.

Accordingly, a TDMA scheme based on carrier sense multiple access with collision avoidance (CSMA/CA) has been proposed and termed ‘‘hybrid-TDMA’’ [10]. In this scheme, a shorter end-to-end delay and higher channel utilization for data transmission can be achieved. Different approaches have been used to create systems that adapt dynamically to network traffic. In [11], [12], node density and mobility speed are used as the reference conditions for determining the length of the contention period. This ensures that system performance is maintained when the traffic rate changes.

In [13], the authors propose a delay-collision CSMA (DC-CSMA) scheme based on nonpersistent CSMA/CA. The high system performance of the DC-CSMA scheme is achieved by balancing the contention probability and channel access time. A low contention latency and a high successful channel access probability are achieved using DC-CSMA under a nonuniform contention probability distribution. DC-CSMA also achieves robustness under a dynamic number of contenders, contention window size, and packet size.

In [14], the authors propose a contention window control mechanism for wireless networks based on IEEE 802.11. The size of the contention window is determined by the traffic load, transmission rate, and packet size. The authors also develop a model to demonstrate the system performance in terms of collision probability, collision time, and back-off time. In [15], the authors propose a token adaptive MAC protocol (token-based MAC) for a mobile network. The token-based MAC protocol uses a two-hop wireless mobile network with a fixed licensed channel. Each node can send packets when it receives a token, and the length of the token cycle corresponds to the number of nodes. Therefore, the cycles are long when token-based MAC is used with a large number of nodes, which results in increased propagation delays.

In traditional MAC protocols, a fixed length of the contention period is designed into the MAC contention mechanism. Many studies have used the number of wireless network nodes to define the length of the contention period, which does not allow adaptation to network traffic rates. This, in turn, reduces the system throughput and leads to a longer MAC contention delay than if adaptive lengths were used.

Machine learning (ML) can enable machines to perform as intelligent tools in IoT-enabled wireless networks, which can independently access system resources based on the condition of the wireless channels. Q-learning (QL) is a reinforcement learning (RL) scheme that may also serve as an ML strategy for facilitating IoT networks in the future [16]. ML, RL, and QL have been previously investigated with regard to their potential roles in wireless networks. In [17], a novel Markov decision process (MDP) model is proposed. Using this model, the highest throughput can be achieved by adjusting the power and probabilities of node transmission in the wireless networks. A model-free RL mechanism is also employed over many states to solve the centralized MDP model.

Future research on Q-learning-based MAC protocols with duty-cycling mechanisms is expected to focus on achieving energy efficiency and delay awareness [18]. In [19], a power control scheme based on game theory is proposed. The maximum energy efficiency is determined by the selection of the source nodes, and the data transmission rate is maximized by the selection of the relay nodes. This selection process is achieved using a QL-based algorithm to identify convergence to the best Nash equilibrium points. In [20], the authors propose an ML-based mechanism for cognitive radio technology. In this approach, the efficiency of opportunistic spectrum access is determined by a QL algorithm; as a result, prior knowledge of the environment's characteristics is not required. In addition, identifying the ongoing interaction between radio nodes and the environment by this method facilitates increased performance.

In [21], the authors proposed a Q-learning-based packet-flow ALOHA protocol that applies a simple Q-learning mechanism without an ACK control framework to achieve a collision-free protocol. When a packet is received successfully, the reward is +1, and it is −1 otherwise. The node selects its slot according to the Q-value to mitigate collisions and to achieve a collision-free protocol.

In [22], a novel RL-based, model-free scheme is proposed for wireless networks that combines expected Sarsa and an eligibility trace. In this scheme, all values of the possible successive actions are averaged to update the targets, which reduces the variance caused by random sampling. In [23], the authors propose a deep-RL MAC (DRL-MAC) protocol for heterogeneous wireless networks that considers the time-slot sharing problem for multiple different MAC protocols in multiple time-slotted networks.

In [24], the authors propose a QL-based MAC (QL-MAC) protocol to solve the sleep-wake scheduling problem in wireless networks. QL-MAC is a distributed QL mechanism and can accordingly adapt to dynamic traffic load. The QL-MAC protocol can increase network lifetimes and packet delivery ratios compared to other MAC protocols for wireless sensor networks. In [25], the authors propose an RL-based adaptive MAC (RLA-MAC) protocol for wireless sensor networks to optimize the sleep-wake schedule of sensor nodes and thus reduce energy consumption in MAC contention. Whereas most protocols use adaptive duty cycles to optimize energy utilization,
in RLA-MAC, each node receives information and actively infers the status of other nodes. RLA-MAC then uses the RL mechanism to create a duty-cycle function based on traffic. This protocol enables high system throughput and low energy consumption under dynamic traffic conditions.

Thus, previous works have shown that RL can be applied to improve the system performance of wireless networks and reduce the collision probability. RL can also be applied to decrease the propagation delay and energy consumption while increasing system throughput. Therefore, RL can overcome the disadvantages of existing network protocols. Furthermore, in other environments, ML mechanisms have been widely used to solve problems without prior knowledge. Thus, in this study, a QL-based MAC protocol for an IoT network is proposed to improve the system performance. In this QL-based MAC protocol, the length of the contention period is adaptive and regulated by the QL algorithm to ensure the quality of service (QoS) of the system in an IoT network. Overall, the main contributions of this work are as follows:
1) We apply the QL algorithm in pervasive applications for an IoT network.
2) We propose a QL algorithm for controlling channel access in a contention-based IoT network.
3) We employ QL to adaptively regulate the length of the contention period for an IoT network. Our QL-based MAC protocol is formulated to enable adaptive contention period sizing according to traffic rates in an IoT network.
4) We propose a QL algorithm to generalize our QL-based MAC protocol for adaptive contention period sizing to achieve higher system performance.

The remainder of this paper is organized as follows. The preliminaries are introduced in Section II. The system model is described in Section III. The QL-based MAC protocol for IoT networks is presented in Section IV. Performance evaluation of the protocol is discussed in Section V, and the final section presents our conclusions.

II. PRELIMINARIES
A. CSMA/CA AND TDMA
For the communications of IoT networks, one important issue is whether the MAC protocol is scalable. IoT networks may involve a large number of IoT nodes, and with the increasing applications of IoT networks, the density of IoT nodes is increasing. In addition, because nodes keep leaving and joining the IoT network, the number of contending IoT nodes is dynamic. Therefore, an important requirement is to propose a MAC protocol with scalability for IoT networks that adapts to the variation in the number of IoT nodes [26]. When the number of contending nodes increases for contention-based MAC protocols, the collision probability increases due to hidden terminal problems, and the system performance decreases. IEEE 802.11 CSMA/CA is one contention-based MAC protocol that has no scalability with respect to an increase in IoT nodes [27].

In [28], the authors proposed a time division multiple access (TDMA) slot assignment protocol to improve the system performance. The length of the contention period of the TDMA protocol is adaptive and can be adjusted dynamically as the number of nodes increases. The length of the contention period is increased by a power of 2 when the unassigned time slots are insufficient for new nodes. A collision-free MAC protocol is achieved owing to the dynamic length of the contention period for different numbers of nodes.

In [29], [30], the authors proposed hybrid MAC protocols, which exhibit effective performance under low traffic load. However, the system performance deteriorates because of the high collision probability when the traffic load is high. The hybrid MAC protocol is composed of the CSMA/CA and TDMA protocols. For wireless networks, the system performance is determined by the contention conditions. The hybrid MAC protocol can achieve high system performance under dynamic traffic load. However, a scalable network for hybrid MAC protocols cannot be achieved due to the accumulation of high traffic load.

A token-based MAC protocol was proposed for an IoT network in [15]. In token-based MAC protocols, the number of time slots of the contention period is equal to the number of nodes in the IoT network. Each node must wait to send a packet until it receives the token. If the node receives the token and has no packets to transmit, the token is released and randomly rotated to another node. Therefore, the propagation delay of the token-based MAC protocol for an IoT network is longer than that of a traffic-adaptive MAC protocol, and the system throughput is lower. Therefore, an efficient traffic-adaptive MAC protocol for IoT networks is important, and the current barriers to its implementation are significant challenges to be overcome.

B. Q-LEARNING IN MARKOVIAN ENVIRONMENTS
The selection of the length of the contention period is a trade-off problem. When the length of the contention period is large enough to support high IoT network traffic, no collision may occur in the contention period. However, the length of the contention period need not be large for low traffic loads: a long contention period under a low traffic load leads to a long packet propagation delay. In that case, many IoT nodes have to wait for a long time, and the channel utilization is also low. Therefore, the QL algorithm is applied to adjust the length of the contention period dynamically.

The QL algorithm does not require complex computational capability from MAC controllers and has a low communication overhead for IoT network nodes. In wireless networks, the acknowledgment of reception is applied to verify the reliability of unicast communication. Most contention-based MAC protocols have this function to confirm whether the packet is received by the receiver or not. Therefore, contention-based MAC protocols have a high overhead because of the acknowledgement scheme [31].
promptly and upload it to an IoT destination during a specified time slot. To send the data to the receiver, each IoT sender must create a connection; this first requires it to contend with the other nodes in the contention period, following which the successfully contending IoT sender can transmit data to the IoT receiver. The destination of a connection can be an IoT cluster head or other IoT nodes; however, the contention period in an IoT network should be able to adapt to the traffic by adjusting the length of the contention field. After the contention period is completed, the cluster head broadcasts the status of transmission and the length of the next contention period.

In this study, the QL algorithm is applied to IoT networks to dynamically adjust the length of the contention period according to the change in traffic rate. Therefore, we introduce the RL algorithm for IoT networks here. In Fig. 1, the agent is defined as the learner and labelled as a decision node. In addition to the agent, the environment interacts with all other network components. The agent selects the action, and the environment adapts to the resulting new situation and creates a reward. The agent seeks to maximize the reward over time by selecting the action.

The state information of the environment is received at each time step t by the agent, St ∈ S; subsequently, an action is selected by the agent, such that At ∈ A. The agent receives a reward, Rt ∈ R, after one time step based on the selected action, after which the agent moves to the new state, St+1 [32].

In our proposed system model, the agent is any IoT node. The state, St, is the length of the contention period at time t. The action, At, is the step size for adjusting the length of the contention period. The step size may be plus d time slots or minus d time slots with respect to the current length of the contention period. In addition, the step size may also be zero, in which case the next length of the contention period is the same as the current length; d is an integer. The reward, Rt, depends on the contention results of the channel access at time t.

Therefore, the interactions between the IoT node (agent) and the environment at time step t are as follows [22]:
1) The IoT node observes the environment in the IoT network and determines the current length of the contention period, St.
2) The IoT node decides the current step size for adjusting the length of the contention period according to the current length of the contention period, St, and then determines the next action, At.
3) The IoT node applies the selected action At, and after one time step the IoT node obtains the feedback reward Rt+1 from the IoT network.
4) The IoT node moves from the state St to the new state St+1.
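For concreteness, this state/action mapping can be captured in a few lines of C. The type names, the action encoding, and the 4-slot lower bound are illustrative assumptions for this sketch, not the authors' implementation.

/* Illustrative sketch of the state/action mapping described above. */
typedef int state_t;                                   /* St: contention-period length in slots */
typedef enum { DEC = 0, STAY = 1, INC = 2 } action_t;  /* At: -d, 0, or +d slots                */

/* Apply an action to the current length L with integer step size d;
   4 slots is assumed here to be the smallest admissible length. */
static state_t next_length(state_t L, action_t a, int d)
{
    if (a == DEC && L - d >= 4) return L - d;
    if (a == INC)               return L + d;
    return L;   /* STAY, or DEC clipped at the minimum length */
}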
IV. QL-BASED MAC PROTOCOL DESIGN
If the length of the contention period is fixed, there is a high collision probability in high-traffic environments when the contention period is short. Furthermore, there is a high propagation delay in low-traffic environments when the contention period is long. As such, we herein propose a traffic-adaptive MAC protocol for IoT networks. The regulation of the adaptive length of the contention period complies with the MDP formulation, and QL is used to design a MAC protocol that enables the cluster head to select the appropriate length of the contention period in the network based on experience gained from agent-environment interactions. The proposed QL-based MAC protocol features a QL-based scheme that adaptively regulates the length of the contention period to maximize the system throughput and minimize the propagation delay. Table 1 shows the symbols used in the performance evaluation of this approach.

A. QL-BASED MAC PROTOCOL FOR IoT NETWORKS
The adaptive-length contention period problem suits the MDP formulation. The QL algorithm is used to design an IoT MAC protocol that can adaptively adjust the length of the contention period dynamically based on the experience obtained in the IoT communication area. The proposed QL-based adaptive IoT MAC protocol adjusts the length of the contention period based on the broadcast of the cluster head in order to avoid contention collisions.

In the remaining parts of this section, we use the QL function (3) as a learning and self-improving control mechanism among multiple nodes that share the wireless channel of an IoT network. The basic procedure for this mechanism is as follows. A node that wants to create a connection sends an RTS control frame in the contention period. After the end of the contention period, the cluster head obtains the contention status of each slot in the contention period. The cluster head computes the reward using the reward function, which combines the contention results; the contention results are determined by the numbers of successful, collision, and idle slots. The Q-learning agent calculates the Q-value using the QL function (3). The Q-learning agent then takes the maximum Q-value from the Q-table and performs the corresponding action. This action determines the length of the next contention period. Therefore, the Q-learning agent dynamically adjusts the length of the next contention period, and the nodes then send their RTS control frames according to the new length of the contention period. This procedure is then repeated. The baseline operation of the proposed Q-learning-based MAC protocol is shown in Fig. 2.
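A compact C sketch of this per-contention-period loop is given below. It assumes the standard one-step Q-learning update for the QL function (3), which is not reproduced in this excerpt; the table dimensions, helper names, and random-number handling are illustrative rather than the authors' code.

#include <stdlib.h>

#define N_STATES  16   /* contention-period lengths 4, 4+d, ..., 4+15d (assumed range) */
#define N_ACTIONS 3    /* step sizes -d, 0, +d                                          */

static double Q[N_STATES][N_ACTIONS];   /* Q-table, all entries start at zero */

static int best_action(int s)
{
    int best = 0;
    for (int a = 1; a < N_ACTIONS; a++)
        if (Q[s][a] > Q[s][best]) best = a;
    return best;
}

/* One learning step, run after the cluster head reports the contention results:
   move Q(s, a) toward reward + gamma * max_a' Q(s', a'), then pick the next
   action (step size for the contention-period length) epsilon-greedily. */
static int ql_step(int s, int a, double reward, int s_next,
                   double alpha, double gamma, double epsilon)
{
    double target = reward + gamma * Q[s_next][best_action(s_next)];
    Q[s][a] += alpha * (target - Q[s][a]);

    double p = (double)rand() / RAND_MAX;
    return (p < epsilon) ? rand() % N_ACTIONS : best_action(s_next);
}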
By observing the current state and receiving the reward, which provides the IoT node with a specific goal, the QL-based MAC protocol can satisfy IoT network applications. The IoT network can dynamically adjust the length of the contention period according to the traffic load. We can develop the reward function according to the performance metrics of the system, based on the mechanism of adapting the length of the contention period. In our proposed QL-based MAC protocol for IoT networks, time and spectrum access are opportunistically created in the beacon interval by division. There are contention and data periods in each beacon interval.
generating the reward in past situations. However, to discover such preferred actions, the agent cannot restrict itself to previously selected actions that have generated the reward; it must exploit the experience that can be used to obtain the reward, but it must also explore potential outcomes in order to identify better actions for future situations. As such, neither exploration nor exploitation can be carried out exclusively, which results in a dilemma [32].

The simplest action selection rule is to select one of the actions with the greatest estimated value. If there is more than one ‘‘greedy’’ action (an action that results in the maximum estimated value), then one of them is selected at random. The greedy action selection method can be expressed as [32]:

π(s) = argmax_a Q(s, a),   (5)

where argmax_a denotes the action a for which the expression is maximized.

Continuously exploiting the greedy mechanism of the Q-value cannot satisfy the first convergence condition because it does not correctly explore all (state, action) pairs. On the other hand, all (state, action) pairs are continuously explored when a purely random mechanism is used; however, a randomly exploring controller is sub-optimal. A compromise between these two extremes is the ε-greedy method [32], in which the greedy action is exploited with probability 1 − ε and a random action is explored with probability ε, so the balance between exploration and exploitation is guaranteed with good system performance.
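The following C fragment sketches the greedy rule of Eq. (5), including the random tie-breaking mentioned above, together with the ε-greedy compromise. The table layout mirrors the illustrative sketch in Section IV-A and is an assumption, not the authors' code.

#include <stdlib.h>

#define N_STATES  16
#define N_ACTIONS 3

static double Q[N_STATES][N_ACTIONS];

/* pi(s) = argmax_a Q(s, a); ties between greedy actions are broken at random. */
static int greedy_action(int s)
{
    int best[N_ACTIONS];
    int n_best = 1;
    best[0] = 0;
    for (int a = 1; a < N_ACTIONS; a++) {
        if (Q[s][a] > Q[s][best[0]]) {
            best[0] = a;
            n_best  = 1;
        } else if (Q[s][a] == Q[s][best[0]]) {
            best[n_best++] = a;
        }
    }
    return best[rand() % n_best];
}

/* epsilon-greedy: explore with probability epsilon, otherwise exploit Eq. (5). */
static int epsilon_greedy_action(int s, double epsilon)
{
    double p = (double)rand() / RAND_MAX;
    return (p < epsilon) ? rand() % N_ACTIONS : greedy_action(s);
}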
C. ε-GREEDY METHOD
From the first transmission of the RTS control packet, using the strategy defined by Eq. (6) can result in instant performance benefits. For an iterative algorithm such as QL, an initial condition is needed; after the first update, the initial condition is changed. For QL, the initial condition is always zero. The initial untrained Q-table is set as in Table 2.

TABLE 2. The initial Q-value table.

The first column in Table 2 represents the possible states and the first row represents the action space. Here, L denotes the current length of the contention period. The action (At) L − d denotes that the length of the contention period will be decreased by d, whereas the action (At) L + d denotes that the length of the contention period will be increased by d at time t. The action (At) L denotes that the length of the contention period will remain the same as the current length. Let the initial length of the contention period be 4, and let nd denote n times d. The length of the contention period then has the following states (St): 4, 4 + d, 4 + 2d, . . . , 4 + (n − 1)d, or 4 + nd time slots.
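As a concrete illustration, the untrained table of Table 2 can be initialized as below; the value of n, the index mapping, and the names are assumptions made only for this sketch.

#define N_ROWS 16          /* states St: 4, 4+d, ..., 4+15d (assumes n = 15) */
#define N_COLS 3           /* actions At: L-d, L, L+d                        */

static double Q[N_ROWS][N_COLS];

/* Map a contention-period length to its row index for a given step size d. */
static int state_index(int L, int d)
{
    return (L - 4) / d;
}

static void init_q_table(void)
{
    for (int s = 0; s < N_ROWS; s++)
        for (int a = 0; a < N_COLS; a++)
            Q[s][a] = 0.0;   /* for QL, the initial condition is always zero */
}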
Each node uses a controller that depends on the traffic rate and the length of the contention period. In this study, all the nodes are in a one-hop environment, and the destination of the traffic is the cluster head or another IoT node in the IoT network. The controller is trained a priori with γ = 0.9 and a decay period of 10000 s in a 100-node IoT network.

The convergence speed of QL algorithms depends on the application and its associated environmental complexities [34]. The ε-greedy method is used to focus the exploration of our proposed MAC protocol on the most probable contention-period-length trajectories. The scheme forces the agent, with a given probability, to sample (s, a) pairs over time. Therefore, the proposed QL-based adaptive MAC protocol can satisfy the convergence rules; however, further optimization of the convergence speed and system suitability is required.

The current state of knowledge is used to maximize the immediate reward by greedy action selection. However, greedy action selection spends no additional time sampling to see whether better options can be selected. Thus, ε-greedy action selection behaves greedily in most cases but, with a small probability ε, selects an action at random. The ε-greedy method is defined such that the exploratory action is not simply drawn from all actions with the same probability; the greedy action is instead selected independently based on the action-value estimates. As the number of steps increases, the number of samplings of each action approaches infinity, and Qt(a) is guaranteed to converge to q∗(a); this implies that the probability of selecting an optimal action converges to a value greater than 1 − ε.

When QL is applied in a new environment, the agent must explore and exploit the reward to gradually discover the optimal action At that maximizes the Q-value. Then, ε is defined as follows:

ε = e^(−Trun/Tsimu),   (6)

where Trun is the simulation running time and Tsimu is the system simulation time. Convergence to an optimal policy is guaranteed by the decay function in our proposed QL-based MAC for an IoT network.

The policy is based on the ε-greedy mechanism. The value of ε varies over time and is not constant. The starting value of ε is 1, and the agent initially explores different values randomly to cover all (state, action) pairs. The value of ε gradually becomes smaller over time and finally reaches its smallest value. This ε-greedy mechanism can achieve optimal convergence: once the optimal action has been learned, further exploratory actions are not needed. In our implementation, the ε value decreases as the simulation time increases. If the ε value reaches the minimum value (0.05), the portion of random exploration occupies 5 percent [31].

According to Eq. (6), the agent gradually limits the rate at which the reward obtained from random exploration overrides the current experience.
The agent thus becomes more confident in the obtained experience as time passes. Finally, the agent finds the optimal state and performs better control to avoid large oscillations of the L value.
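The decay schedule of Eq. (6), together with the 0.05 floor mentioned above, can be written directly in C; the function name is illustrative.

#include <math.h>

#define EPSILON_MIN 0.05

/* Eq. (6): epsilon = e^(-Trun/Tsimu), clamped at the 5% exploration floor. */
static double epsilon_schedule(double t_run, double t_simu)
{
    double eps = exp(-t_run / t_simu);
    return (eps < EPSILON_MIN) ? EPSILON_MIN : eps;
}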
D. REWARD FUNCTION FORMULATION
The QL agent gets a positive or negative reward depending on whether the action it learns in the IoT network is correct. The main purpose of IoT networks is to achieve low contention collision, which provides a high system throughput, low propagation delay, and low-energy MAC contention. This can be achieved by using the binary reward function: the agent gets a reward according to the results of contention.

Each node that wants to transmit data in an IoT network must send an RTS control packet in the RTS sub-period. After one contention period has ended, the cluster head collects the transmission status of each slot. There are three possible statuses for each slot: success, collision, and idle. The cluster head calculates the reward of Eq. (7) by assigning suitable values to the parameters Fsucc, Fcoll, and Fidle:

Rt = Fsucc ∗ Slotsucc − Fcoll ∗ AvgSend ∗ Slotcoll − Fidle ∗ Slotidle,   (7)

where Slotsucc, Slotcoll, and Slotidle denote the numbers of successful, collision, and idle slots in a contention period, respectively. AvgSend denotes the average number of transmissions until a successful transmission. Fsucc, Fcoll, and Fidle denote the impact factors of successful contention, collision contention, and idle time slots, respectively.
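Eq. (7) translates directly into a small C helper; the struct that carries the per-period contention statistics is an illustrative assumption.

/* Per-contention-period statistics collected by the cluster head. */
struct contention_stats {
    int    slot_succ;   /* Slot_succ: successful slots                   */
    int    slot_coll;   /* Slot_coll: collision slots                    */
    int    slot_idle;   /* Slot_idle: idle slots                         */
    double avg_send;    /* AvgSend: average transmissions until success  */
};

/* Eq. (7): Rt = Fsucc*Slotsucc - Fcoll*AvgSend*Slotcoll - Fidle*Slotidle. */
static double reward(const struct contention_stats *c,
                     double f_succ, double f_coll, double f_idle)
{
    return f_succ * c->slot_succ
         - f_coll * c->avg_send * c->slot_coll
         - f_idle * c->slot_idle;
}

With the values used later in Section V (Fsucc = 1.5, Fcoll = 0.55, Fidle = 0.4), successful slots raise the reward, while collision slots, weighted by AvgSend, are penalized more heavily than idle slots.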
Generally, the discount factor γ is set between 0.6 and 0.99 [31], and its determination is considered to be part of the problem. In addition, the estimation speed increases with an increase in the learning rate and then decreases over time as the learning rate decreases. If the learning rate α is set to 0, then no new information is learned; if the learning rate α is set to 1, then only the most recent information is considered.

Therefore, we set γ to 0.9 and α to 0.1. Subsequently, we regulate Fsucc, Fcoll, and Fidle under the above-mentioned γ (0.9) and α (0.1) values. The optimal length of the contention period for a traffic rate of 8 is approximately 4, and that for a traffic rate of 192 is approximately 16, when using zero primary user (PU) channels in the active ON state [35]. We regulate Fsucc, Fcoll, and Fidle to observe whether the state is stable while the traffic rate is low (traffic rate: 8) or high (traffic rate: 192). First, we let Rt = 0 be a baseline for determining Fsucc, Fcoll, and Fidle. Of course, this question has multiple groups of answers. We let Fsucc be higher than Fcoll and Fidle so that a positive reward can subsequently be obtained in the learning process. In addition, the number of collision time slots is more influential than the number of idle time slots for a negative reward; hence, we let Fcoll be higher than Fidle. Therefore, Rt may be positive or negative owing to slight changes in Slotsucc and Slotcoll under the above-mentioned regulations of Fsucc, Fcoll, and Fidle. The agent can then keep learning using QL. Subsequently, state stability is achieved by sustained regulation of Fsucc, Fcoll, and Fidle. Thus, the values of Fsucc, Fcoll, and Fidle are obtained.

After Fsucc, Fcoll, and Fidle are determined, we again regulate γ and α. According to the QL function (3), we first performed the experiments using γ = 0.9 and α = 0.5. If the discount factor γ is set too small, then the Q-value obtained by the learning scheme will always be negative under a small value of γ maxa Q(St+1, a) − Q(St, At), even if the reward (Rt) is positive. To prevent the learning step size from being too slow, the learning rate may need to be increased; to prevent the learning step size from being too fast, the learning rate should be reduced. We use a step decay schedule to decrease the learning rate by a factor every few epochs and observe the variations of the state. We observed that the state was stable under γ = 0.9 and α = 0.1.
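A step decay schedule of the kind described above can be sketched as follows; the decay factor and the number of epochs per drop are placeholders, since the paper does not report them.

#include <math.h>

/* Step decay: multiply the initial learning rate by `factor` once every
   `epochs_per_drop` epochs (integer division keeps alpha piecewise constant). */
static double alpha_step_decay(double alpha0, int epoch,
                               double factor, int epochs_per_drop)
{
    return alpha0 * pow(factor, (double)(epoch / epochs_per_drop));
}

/* Example: alpha_step_decay(0.5, epoch, 0.5, 10) halves alpha every 10 epochs. */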
After Fsucc, Fcoll, Fidle, γ, and α are determined, to achieve a stable state and optimal throughput, we also perform different numbers of iterations based on the above-mentioned Fsucc, Fcoll, Fidle, γ, and α values. We consider a simulation time of 10,000 s as one iteration. State stability is achieved when the number of iterations is 10 in our simulation. Therefore, the convergence time is 10 iterations.

In the Algorithm (Q-Learning Adaptive MAC Protocol), the action is selected based on Pε. When Pε < ε, the length of the contention period for the next period is randomly selected from (Lt − d, Lt, Lt + d). Otherwise, the action is decided by the controller (Eq. (5)). Pε is randomly selected from (0, 1).

In this design, the agent must overcome the problem that each state has a different reward gradient. A more detailed reward function provides more information and thus speeds up the convergence of the algorithm. The feedback for specified goals returns different rewards, so L converges faster and provides higher system performance.

In fact, we can let the Q-learning agent prefer some specified L over other L values. If the contention results have many successful slots, few collision slots, and few idle slots, this L is preferred. If the contention results have few successful slots, many collision slots, and many idle slots, this L is not preferred. Therefore, the adaptation of L can be achieved using the numbers of successful, collision, and idle slots. We therefore define the reward function to be determined by the numbers of successful, collision, and idle slots in the contention period. If the number of successful slots in the contention period increases, the system throughput increases. If the number of collision slots decreases, the energy consumption decreases. A high number of successful slots and a low number of collision slots for a fixed traffic rate mean that the end-to-end delay decreases.
Algorithm: Q-Learning Adaptive MAC Protocol
01: Initialize Q0(L, A) at t = 0
02: L0 = 4 at t = 0
03: if Trun < Tsimu then
04:     ε, α ←− decay function
05: else
06:     ε, α ←− constant
07: end if
08: procedure Action-selection(Lt, d)
09:     randomly select Pε ∈ (0, 1)
10:     if Pε < ε then
11:         At+1 ←− random (Lt − d, Lt, Lt + d)
12:     else
13:         At+1 ←− Aπ
14:     end if
15:     Lt+1 ←− L(At+1)
16: end procedure
17: procedure Feedback(Lt+1, At+1)
18:     set Fsucc, Fcoll and Fidle
19:     cluster head collects Slotsucc, Slotcoll and Slotidle
20:     cluster head calculates AvgSend for each node
21:     cluster head calculates the reward of Eq. (7):
22:     Rt = Fsucc ∗ Slotsucc − Fcoll ∗ AvgSend ∗ Slotcoll − Fidle ∗ Slotidle
23:     update Q(Lt+1, At+1)
24:     Action-selection(Lt, d)
25: end procedure
V. PERFORMANCE EVALUATION OF AN IoT NETWORK
In this paper, the QL algorithm for our proposed QL-based MAC protocol is applied offline in the IoT network. In this section, the simulation results for the proposed QL-based MAC protocol in an IoT network are presented. The C programming language was used to implement the simulation. For hybrid MAC and token-based MAC, the length of the contention period is fixed; for QL-based MAC, the length of the contention period is adjusted dynamically according to the traffic rate. The lengths of the contention period of the token-based MAC and hybrid MAC protocols were fixed as 4, 8, 12, 16, 20, 24, 28, or 32 slots.

In token-based MAC, the length of the contention period was fixed for the simulation. Whether a node could transmit data was determined by whether it could access the token. If a node wanted to transmit data and received the token, then the transmission was successful. Conversely, if a node wanted to transmit data and could not receive the token, then the transmission failed. The IoT sender transmitted the data and determined the next owner of the token randomly. Here, we considered the effect of sending a token to be similar to that of the RTS control frame in hybrid MAC and IoT-based MAC. Thus, the end-to-end delay of token-based MAC increased under low traffic loads. With greater end-to-end propagation delays, the system performance also decreased.

The length of the contention period is also fixed in hybrid MAC. In this case, each IoT node transmitted the RTS control frame randomly during the contention period. Here, we only considered the RTS control frame in hybrid MAC. After the transmission of the RTS, the cluster head broadcasted the BTC control frame. If the contention of the RTS was successful, then the connection was created successfully. The probability of collision increased under high traffic, as did the end-to-end delay of CSMA/CA systems. The system performance of the IoT network when CSMA/CA was used also decreased because of the longer end-to-end delay.

If a node wants to transmit data, it must exchange and negotiate information on the control-frame licensed channel. Therefore, token-based MAC and hybrid MAC schemes will have low system performance under the variable traffic load in an IoT network.

A uniform distribution with various overall loads is assumed for the traffic in IoT networks. The traffic is the ratio of the arrival rate to the departure rate. The arrival rate is defined as the number of new connections created per second, whereas the departure rate is defined as the number of connections ended per second. The inverse of the departure rate is also the lifetime of each connection.

In our simulation, the departure rate was fixed at 0.5, and the desired traffic load is obtained by varying the arrival rate. Therefore, if the arrival rate is 2, then the overall load is 4, which means that the average number of active connections in the network is 4. The arrival rate was set as 1, 2, 4, 8, 16, 32, 64, 96, 128, 160, 192, 224, or 256. The system load is defined as the arrival rate divided by the departure rate. For example, if the arrival rate is 16, then the system load is 32, which indicates that there is an average of 32 active connections in the IoT network at a given time. The optimal/near-optimal length of the contention period is determined by selecting the maximum Q-value for each state of the length of the contention period.

The system performance of QL-based MAC is compared with those of token-based MAC and hybrid MAC in this section. Here, the relationship between nodes in the IoT network is one-hop. For our proposed QL-based MAC protocol, the length of the contention period is dynamically adjusted according to the traffic rate. The bandwidth of the unlicensed channel is 2 Mbps. The energy consumption of transmitting and receiving each control frame is 1,675 and 1,425 mW, respectively. We also set 1 time slot = 1 s. Here, we focus on the one-hop IoT environment. For a square region of 200 m × 200 m, the largest distance between any two objects is 282.8 m; therefore, we set the transmission range of an IoT node to 300 m. Table 3 shows the other parameters of our proposed QL-based MAC protocol for IoT networks.

Regarding QL training and evaluation, the values of Fsucc, Fcoll, and Fidle are set as 1.5, 0.55, and 0.4, respectively. In addition, the discount factor γ is set at 0.9, and the learning rate α is set at 0.1. In the simulation process, 30 seeds were used to create 30 topologies, and the simulation results for these thirty seeds were summed and averaged. The system throughput, end-to-end delay, and energy consumption were used as the performance metrics.
TABLE 3. Parameters for our proposed QL-based MAC scheme.

arrival rate of 256 and a contention period length of 32 slots; the throughput ranged from 0.184 to 1.016 Mbps.

When using hybrid MAC, the throughput increased at the beginning of the simulation as the traffic rate increased for the different contention period lengths. However, the throughput plateaued once a certain traffic rate was reached, beyond which it gradually decreased.

For token-based MAC, the probability of simultaneously holding the token and having data to send increases with the traffic rate. Therefore, the throughput increases with the traffic rate; however, the throughput is accordingly low while the traffic rate is low.

For QL-based MAC, however, the length of the contention period dynamically adapts to different traffic rates using the QL algorithm. Therefore, the throughput of QL-based MAC systems can always increase for different traffic rates, which differs from hybrid MAC and token-based MAC in IoT networks.

FIGURE 3. System throughput for QL-based MAC, hybrid MAC and token-based MAC as a function of the arrival rate in an IoT network.

B. END-TO-END DELAY
The average end-to-end delay for MAC contention is denoted by E[Tdelay]. The MAC delay per connection is defined as the average time elapsed between the generation of a frame and its successful and complete receipt. In the QL-based MAC protocol, the MAC delay per connection is the sum of the MAC delays that occur prior to the complete transmission of a packet. Let t_delay^j denote the contention delay of the jth trial for a connection involved in a MAC contention in an IoT network. E[Tdelay] can then be computed as follows [38]:

E[Tdelay] = Σ_{i=1..Csucc} Σ_{j=1..ntrial, conn=i} t_delay^j.   (9)
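Eq. (9) can be evaluated with a simple double loop; the data layout below (one record per successful connection) is an illustrative assumption.

/* One record per connection that completed a MAC contention. */
struct connection {
    int     n_trial;    /* n_trial: number of contention trials for this connection */
    double *t_delay;    /* t_delay[j]: contention delay of the (j+1)th trial        */
};

/* Eq. (9): sum t_delay^j over all trials of all Csucc successful connections. */
static double total_mac_delay(const struct connection *conn, int c_succ)
{
    double sum = 0.0;
    for (int i = 0; i < c_succ; i++)
        for (int j = 0; j < conn[i].n_trial; j++)
            sum += conn[i].t_delay[j];
    return sum;
}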
FIGURE 5. Energy consumption for MAC contention of successful transmission by IoT nodes for QL-based MAC, hybrid MAC and token-based MAC protocols as a function of arrival rate in an IoT network.

FIGURE 6. Energy consumption of failed MAC contention of nodes in QL-based MAC, hybrid MAC and token-based MAC protocols as a function of the arrival rate in an IoT network.
with one unlicensed channel. The maximum throughput using Sarsa-based MAC was 1.300 Mbps for an arrival rate of 64; the throughput using Sarsa-based MAC ranged from 1.077 to 1.300 Mbps. The highest energy consumption for MAC contention of successful transmission using Sarsa-based MAC, shown on the right side of Fig. 7, is 3,363,500 mW for an arrival rate of 192; the energy consumption for MAC contention of successful transmission ranged from 1,695,080 to 3,363,500 mW.

For Sarsa-based MAC, the length of the contention period dynamically adapts to different traffic rates, as in the QL algorithm. Therefore, the throughput of Sarsa-based MAC systems can also always increase for different traffic rates, as with the QL algorithm in IoT networks. In addition, for Sarsa-based MAC, the energy consumption for MAC contention of successful transmission is less than that of hybrid MAC and greater than that of token-based MAC, similar to the QL algorithm in an IoT network.

However, the throughput index ζ of the Sarsa-based MAC is smaller than that of the QL-based MAC owing to the different Q functions. The energy consumption for MAC contention of successful transmission is determined by the number of successful transmissions. Therefore, the energy consumption for MAC contention of successful transmission of the Sarsa-based MAC is smaller than that of the QL-based MAC.
VI. CONCLUSION
The length of the contention period of a traditional MAC protocol is fixed in IoT networks. The collision probability increases for a fixed length under high traffic; in contrast, the slot utilization decreases for a fixed length under low traffic. In an IoT network, a node that has high power or is connected to the wired network can be selected as the cluster head. The cluster head is responsible for collecting the slot contention information from the nodes and broadcasting the suitable length of the next contention period. Therefore, the proposed QL-MAC protocol is feasible, and the optimal length of the next contention period can be achieved by using the Q-learning algorithm. The proposed QL-based MAC protocol can dynamically adjust the length of the contention period to achieve a high-throughput, energy-efficient IoT network with a short end-to-end delay. The reduction of idle slots in the QL-based MAC protocol exceeds that in hybrid MAC and token-based MAC due to the dynamic contention period. Therefore, the increase in the system throughput of QL-based MAC is greater than that of hybrid MAC and token-based MAC in IoT networks. Furthermore, the proposed QL-based MAC scheme had a shorter MAC delay and higher throughput than token-based MAC; however, token-based MAC had lower energy consumption due to the rotating token mechanism. The simulation results showed that the greatest reduction in end-to-end delay using QL-based MAC compared to hybrid MAC and token-based MAC was 87.3% (for a contention period length of 32 slots with traffic rate = 64). The simulation results also showed that the maximum throughput using QL-based MAC compared to hybrid MAC and token-based MAC was increased by 101.7% and 344.3%, respectively (for a contention period length of 32 slots with traffic rate = 64). Therefore, the proposed QL-MAC protocol is worth using instead of classical MAC protocols.

ACKNOWLEDGMENT
The authors would like to thank the editor and the reviewers for their valuable comments and suggestions.

REFERENCES
[1] A. A. Khan, M. H. Rehmani, and A. Rachedi, ‘‘Cognitive-radio-based Internet of Things: Applications, architectures, spectrum related functionalities, and future research directions,’’ IEEE Wireless Commun., vol. 24, no. 3, pp. 17–25, Jun. 2017.
[2] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, ‘‘Internet of Things (IoT): A vision, architectural elements, and future directions,’’ Future Gener. Comput. Syst., vol. 29, no. 7, pp. 1645–1660, Sep. 2013.
[3] H. Nishiyama, T. Ngo, S. Oiyama, and N. Kato, ‘‘Relay by smart device: Innovative communications for efficient information sharing among vehicles and pedestrians,’’ IEEE Veh. Technol. Mag., vol. 10, no. 4, pp. 54–62, Dec. 2015.
[4] J. Liu and N. Kato, ‘‘A Markovian analysis for explicit probabilistic stopping-based information propagation in postdisaster ad hoc mobile networks,’’ IEEE Trans. Wireless Commun., vol. 15, no. 1, pp. 81–90, Jan. 2016.
[5] P. Wang, H. Jiang, and W. Zhuang, ‘‘Capacity improvement and analysis for voice/data traffic over WLANs,’’ IEEE Trans. Wireless Commun., vol. 6, no. 4, pp. 1530–1541, Apr. 2007.
[6] J. L. Sobrinho and A. S. Krishnakumar, ‘‘Quality-of-service in ad hoc carrier sense multiple access wireless networks,’’ IEEE J. Sel. Areas Commun., vol. 17, no. 8, pp. 1353–1368, Aug. 1999.
[7] S. Jiang, J. Rao, D. He, X. Ling, and C. C. Ko, ‘‘A simple distributed PRMA for MANETs,’’ IEEE Trans. Veh. Technol., vol. 51, no. 2, pp. 293–305, Mar. 2002.
[8] P. Wang and W. Zhuang, ‘‘A collision-free MAC scheme for multimedia wireless mesh backbone,’’ IEEE Trans. Wireless Commun., vol. 8, no. 7, pp. 3577–3589, Jul. 2009.
[9] Wireless LAN Media Access Control (MAC) and Physical Layer (PHY) Specifications, IEEE Standard 802.11, 1999.
[10] R. Zhang, L. Cai, and J. Pan, ‘‘Performance analysis of reservation and contention-based hybrid MAC for wireless networks,’’ in Proc. IEEE Int. Conf. Commun., Cape Town, South Africa, May 2010, pp. 1–5.
[11] M. Wang, Q. Shen, R. Zhang, H. Liang, and X. Shen, ‘‘Vehicle-density-based adaptive MAC for high throughput in drive-thru networks,’’ IEEE Internet Things J., vol. 1, no. 6, pp. 533–543, Dec. 2014.
[12] W. Alasmary and W. Zhuang, ‘‘Mobility impact in IEEE 802.11p infrastructureless vehicular networks,’’ Ad Hoc Netw., vol. 10, no. 2, pp. 222–230, Mar. 2012.
[13] N. Cordeschi, F. De Rango, and M. Tropea, ‘‘Exploiting an optimal delay-collision tradeoff in CSMA-based high-dense wireless systems,’’ IEEE/ACM Trans. Netw., early access, Jun. 30, 2021, doi: 10.1109/TNET.2021.3089825.
[14] J. Choi, S. Byeon, S. Choi, and K. B. Lee, ‘‘Activity probability-based performance analysis and contention control for IEEE 802.11 WLANs,’’ IEEE Trans. Mobile Comput., vol. 16, no. 7, pp. 1802–1814, Jul. 2017.
[15] Q. Ye and W. Zhuang, ‘‘Token-based adaptive MAC for a two-hop Internet-of-Things enabled MANET,’’ IEEE Internet Things J., vol. 4, no. 5, pp. 1739–1753, Oct. 2017.
[16] R. Ali, Y. A. Qadri, Y. B. Zikria, T. Umer, B.-S. Kim, and S. W. Kim, ‘‘Q-learning-enabled channel access in next-generation dense wireless networks for IoT-based eHealth systems,’’ J. Wireless Commun. Netw., vol. 178, pp. 1–12, Jul. 2019.
[17] G. Naddafzadeh-Shirazi, P.-Y. Kong, and C.-K. Tham, ‘‘Distributed reinforcement learning frameworks for cooperative retransmission in wireless networks,’’ IEEE Trans. Veh. Technol., vol. 59, no. 8, pp. 4157–4162, Oct. 2010.
[18] S. Sarwar, R. Sirhindi, L. Aslam, G. Mustafa, M. M. Yousaf, and S. W. U. Q. Jaffry, ‘‘Reinforcement learning based adaptive duty cycling in LR-WPANs,’’ IEEE Access, vol. 8, pp. 161157–161174, 2020.
[19] F. Shams, G. Bacci, and M. Luise, ‘‘Energy-efficient power control for multiple-relay cooperative networks using Q-learning,’’ IEEE Trans. Wireless Commun., vol. 14, no. 3, pp. 1567–1580, Mar. 2015.
[20] P. Venkatraman, B. Hamdaoui, and M. Guizani, ‘‘Opportunistic bandwidth sharing through reinforcement learning,’’ IEEE Trans. Veh. Technol., vol. 59, no. 6, pp. 3148–3153, Jul. 2010.
[21] I. B. Alhassan and P. D. Mitchell, ‘‘Packet flow based reinforcement learning MAC protocol for underwater acoustic sensor networks,’’ Sensors, vol. 21, no. 7, p. 2284, 2021.
[22] H. Jiang, R. Gui, Z. Chen, L. Wu, J. Dang, and J. Zhou, ‘‘An improved Sarsa(λ) reinforcement learning algorithm for wireless communication systems,’’ IEEE Access, vol. 7, pp. 115418–115427, 2019.
[23] Y. Yu, T. Wang, and S. C. Liew, ‘‘Deep-reinforcement learning multiple access for heterogeneous wireless networks,’’ in Proc. IEEE Int. Conf. Commun. (ICC), Kansas City, MO, USA, May 2018, pp. 20–24.
[24] S. Galzarano, A. Liotta, and G. Fortino, ‘‘QL-MAC: A Q-learning based MAC for wireless sensor networks,’’ in Proc. ICAPP, Vietri sul Mare, Italy, Dec. 2013, pp. 267–275.
[25] Z. Liu and I. Elhanany, ‘‘RL-MAC: A QoS-aware reinforcement learning based MAC protocol for wireless sensor networks,’’ in Proc. IEEE Int. Conf. Netw., Sens. Control, Ft. Lauderdale, FL, USA, Apr. 2006, pp. 768–773.
[26] A. Rajandekar and B. Sikdar, ‘‘A survey of MAC layer issues and protocols for machine-to-machine communications,’’ IEEE Internet Things J., vol. 2, no. 2, pp. 175–186, Apr. 2015.
[27] R. Jurdak, C. V. Lopes, and P. Baldi, ‘‘A survey, classification and comparative analysis of medium access control protocols for ad hoc networks,’’ IEEE Commun. Surveys Tuts., vol. 6, no. 1, pp. 2–16, 1st Quart., 2004.
[28] A. Kanzaki, T. Uemukai, T. Hara, and S. Nishio, ‘‘Dynamic TDMA slot assignment in ad hoc networks,’’ in Proc. 17th Int. Conf. Adv. Inf. Netw. Appl. (AINA), Xi'an, China, Mar. 2003, pp. 330–335.
[29] W. Hu, H. Yousefi'zadeh, and X. Li, ‘‘Load adaptive MAC: A hybrid MAC protocol for MIMO SDR MANETs,’’ IEEE Trans. Wireless Commun., vol. 10, no. 11, pp. 3924–3933, Nov. 2011.
[30] Y. Liu, C. Yuen, X. Cao, N. U. Hassan, and J. Chen, ‘‘Design of a scalable hybrid MAC protocol for heterogeneous M2M networks,’’ IEEE Internet Things J., vol. 1, no. 1, pp. 99–111, Feb. 2014.
[31] A. Pressas, Z. Sheng, F. Ali, and D. Tian, ‘‘A Q-learning approach with collective contention estimation for bandwidth-efficient and fair access control in IEEE 802.11p vehicular networks,’’ IEEE Trans. Veh. Technol., vol. 68, no. 9, pp. 9136–9150, Sep. 2019.
[32] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1988.
[33] C. J. C. H. Watkins and P. Dayan, ‘‘Q-learning,’’ Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992.
[34] A. Pressas, Z. Sheng, F. Ali, D. Tian, and M. Nekovee, ‘‘Contention-based learning MAC protocol for broadcast vehicle-to-vehicle communication,’’ in Proc. IEEE Veh. Netw. Conf. (VNC), Nov. 2017, pp. 263–270.
[35] C.-M. Wu, M.-S. Wu, Y.-J. Yang, and C.-Y. Sie, ‘‘Cluster-based distributed MAC protocol for multichannel cognitive radio ad hoc networks,’’ IEEE Access, vol. 7, pp. 65781–65796, 2019.
[36] L. T. Tan and L. B. Le, ‘‘Channel assignment for throughput maximization in cognitive radio networks,’’ in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Paris, France, Apr. 2012, pp. 1427–1431.
[37] Z. Sadreddini, O. Makul, T. Cavdar, and F. B. Gunay, ‘‘Performance analysis of licensed shared access based secondary users activity on cognitive radio networks,’’ in Proc. Electric Electron., Comput. Sci., Biomed. Engineerings' Meeting (EBBT), Istanbul, Turkey, Apr. 2018, pp. 1–4.
[38] A. Azarfar, J. F. Frigon, and B. Sansò, ‘‘Delay analysis of multichannel opportunistic spectrum access MAC protocols,’’ IEEE Trans. Mobile Comput., vol. 15, no. 1, pp. 92–106, Jan. 2016.
[39] D. Jung, R. Kim, and H. Lim, ‘‘Power-saving strategy for balancing energy and delay performance in WLANs,’’ Comput. Commun., vol. 50, pp. 3–9, Sep. 2014.
[40] S. Maleki, A. Pandharipande, and G. Leus, ‘‘Energy-efficient distributed spectrum sensing for cognitive sensor networks,’’ IEEE Sensors J., vol. 11, no. 3, pp. 565–573, Mar. 2011.
[41] S. Atapattu, C. Tellambura, and H. Jiang, ‘‘Energy detection for spectrum sensing in cognitive radio,’’ in Communications and Networks. Berlin, Germany: Springer, 2014.
[42] S. Gao, L. Qian, and D. Vaman, ‘‘Distributed energy efficient spectrum access in cognitive radio wireless ad hoc networks,’’ IEEE Trans. Wireless Commun., vol. 8, no. 10, pp. 5202–5213, Oct. 2009.

CHIEN-MIN WU was born in Yunlin, Taiwan, in 1966. He received the B.S. degree in automatic control engineering from Feng Chia University, Taichung, Taiwan, in 1989, the M.S. degree in electrical and information engineering from Yuan Ze University, Chung-Li, Taiwan, in 1994, and the Ph.D. degree from the Department of Electrical Engineering, National Chung Cheng University, Chiayi, Taiwan, in 2004. In July 1994, he joined the Technical Development Department, Philips Company Ltd., where he was a member of the Technical Staff. He is currently a Professor with the Department of Computer Science and Information Engineering, Nanhua University. His current research interests include reinforcement learning, cognitive radio networks, ad hoc networks, and MAC protocol design.

YEN-CHUN KAO was born in Yunlin, Taiwan, in 1998. He received the B.S. degree in computer science and information engineering from Nanhua University, Chiayi, Taiwan, in 2020. His research interests include reinforcement learning, cognitive radio networks, ad hoc networks, and MAC protocol design.

KAI-FU CHANG was born in Kaohsiung, Taiwan, in 1997. He received the B.S. degree in computer science and information engineering from Nanhua University, Chiayi, Taiwan, in 2020. His research interests include reinforcement learning, cognitive radio networks, ad hoc networks, and MAC protocol design.

CHENG-TAI TSAI was born in Pingtung, Taiwan, in 1999. He is currently pursuing the B.S. degree in computer science and information engineering from Nanhua University, Chiayi, Taiwan. His research interests include reinforcement learning, cognitive radio networks, ad hoc networks, and MAC protocol design.

CHENG-CHUN HOU was born in Taoyuan, Taiwan, in 1995. He is currently pursuing the B.S. degree in computer science and information engineering from Nanhua University, Chiayi, Taiwan. His research interests include reinforcement learning, cognitive radio networks, ad hoc networks, and MAC protocol design.