
Decentralized Deep Reinforcement Learning Approach for Channel Access Optimization

Sheila C. da S. J. Cruz (INATEL)
Felipe Augusto Pereira de Figueiredo (INATEL)
Rausley A. A. de Souza (INATEL)

Research Article

Keywords: Wi-Fi, contention-based channel access, channel utilization optimization, reinforcement learning, NS-3, NS3-gym

Posted Date: June 11th, 2024

DOI: https://doi.org/10.21203/rs.3.rs-4555252/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: The authors declare no competing interests.

Decentralized Deep Reinforcement Learning
Approach for Channel Access Optimization
Sheila C. da S. J. Cruz, Felipe A. P. de Figueiredo and Rausley A. A. de Souza

Abstract— The IEEE 802.11 standard's binary exponential back-off (BEB) algorithm is the prevailing method for tackling the collision avoidance problem. Under the BEB paradigm, the back-off period increases each time a collision occurs, aiming to minimize the likelihood of subsequent collisions. However, this provides sub-optimal results, degrading network performance and leading to bandwidth wastage, especially in dynamic dense networks. To overcome these drawbacks, this paper proposes using a decentralized approach with deep reinforcement learning algorithms, namely Deep Q Learning (DQN) and Deep Deterministic Policy Gradient (DDPG), to optimize the contention window value and maximize throughput while minimizing collisions. Simulations with the NS-3 simulator and NS3-gym toolkit reveal that DQN and DDPG outperform BEB in both static and dynamic scenarios, achieving up to a 37.16% network throughput improvement in dense networks while keeping a high and stable throughput as the number of stations increases.

Keywords— Wi-Fi, contention-based channel access, channel utilization optimization, reinforcement learning, NS-3, NS3-gym.

This work was partially funded by CNPq (Grant Nos. 403612/2020-9, 311470/2021-1, and 403827/2021-3), by the Minas Gerais Research Foundation (FAPEMIG) (Grant Nos. APQ-00810-21 and PPE-00124-23), by FADCOM - Fundo de Apoio ao Desenvolvimento das Comunicações, presidential decree no. 264/10, November 26, 2010, Republic of Angola, and by the projects XGM-AFCCT-2024-2-5-1 and XGM-FCRH-2024-2-1-1 supported by xGMobile - EMBRAPII-Inatel Competence Center on 5G and 6G Networks, with financial resources from the PPI IoT/Manufatura 4.0 from MCTI grant number 052/2023, signed with EMBRAPII.

I. INTRODUCTION

Wireless networks are widely used and applied in different domains where the stations connected to the wireless network require a fair share of the spectrum resources to guarantee better performance while accessing the channel to transmit data [1]. One key challenge is managing collisions, where multiple stations transmit simultaneously, causing interference and data loss. The IEEE 802.11 standard uses the carrier-sensing multiple access with collision avoidance (CSMA/CA) [2] protocol in the MAC layer to mitigate collision occurrences by employing a contention window (CW) value, which defines a random back-off time used to minimize collisions. Each new collision doubles the CW value, ranging from CWMin (15 or 31) to CWMax (1023), in order to reduce the chance of different stations selecting the same back-off value, deferring the transmission to a later time.

The binary exponential back-off (BEB) algorithm is responsible for this deferring method, which is used in CSMA/CA [3]. However, the BEB algorithm has significant limitations and often provides sub-optimal results, particularly under high loads; it is unable to adapt to changing network conditions and also lacks fairness [4]. To overcome these drawbacks, machine learning-based solutions such as deep reinforcement learning (DRL) algorithms have been proposed and applied in various domains, including wireless networks. DRL algorithms have the capacity to learn from network states and adjust to evolving network conditions, thereby optimizing cumulative rewards over time. This adaptability renders DRL particularly suitable for solving many optimization and decision-making problems of Wi-Fi networks, providing optimal solutions that are flexible and adaptable to different learning scenarios. DRL algorithms offer potential improvements by optimizing CW values, especially in dynamic scenarios where the number of nodes increases over time.

In [5], a centralized single-agent DRL approach is presented. In that solution, the DRL agent is located at the access point (AP). Since it has a global view of the network, a unique CW value is optimized and broadcast to all associated stations, ensuring that all stations have the same CW value. It showed a considerable increase in throughput. However, a decentralized approach offers a more robust and efficient solution with guaranteed high scalability, especially in dense dynamic scenarios. By leveraging distributed computing, new stations can be easily added to the corresponding scenario, accelerating convergence to optimal collision avoidance solutions in wireless local area networks (WLANs).

Several multi-agent reinforcement learning (MARL) methods have been proposed in the literature to enhance the performance of WLANs by providing decentralized solutions. For instance, the authors in [6] have proposed a MARL approach to optimize spectrum occupation prediction and enhance multi-channel slotted wireless network access. The overuse of the wireless spectrum due to various network technologies causes collisions. To mitigate this, a multi-agent DRL mechanism with six known and two unknown nodes, using ring and AP topologies, was proposed. A distributed reinforcement learning (RL)-based scheduler allocates slots to avoid collisions, with agents trained via online supervised learning with experience replay. Using radio frequency observations for predictions, this approach reduces inter-network collisions by 30% and increases overall throughput by 10% compared to the traditional exponentially weighted moving average (EWMA) algorithm.

To optimize the system spectral efficiency, the authors in [7] have proposed a MARL algorithm for power allocation and joint subcarrier assignment in multi-cell orthogonal frequency-division multiplexing systems. Each base station independently calculates resource allocation based on local conditions but collaborates by exchanging information for global optimization. The MARL algorithm demonstrated fast convergence and up to 53.6% higher efficiency compared to conventional Q-learning. In heterogeneous networks, where multiple APs and users share the same spectrum, effective power control is crucial for managing interference.
However, obtaining instantaneous global channel state information in rapidly changing environments is challenging.

Previous works present limitations in directly optimizing the network's throughput and in adapting to extremely dense dynamic scenarios, and they exhibit high computational complexity. Our decentralized solution differs from previous works by treating each station as a DRL agent that updates its own CW value, producing optimized individual throughputs that are passed to the AP, which sums up the individual throughputs from the self-learning agents. The AP then broadcasts the total throughput value back to the stations, maximizing the network's overall performance and leading to a more flexible, adaptable, and robust collision avoidance solution for static and dynamic scenarios.

The decentralized solution proposed in this work uses the collision probability, i.e., the transmission failure probability, as the network-state information for training the independent DRL agents. The study compares the collision probabilities of the decentralized approach and the conventional BEB algorithm, demonstrating that the decentralized method outperforms BEB and better adapts to changing network conditions, making it well-suited for addressing collision avoidance in WLANs.

Therefore, this work involves analyzing two scenarios: static, with a fixed number of nodes, and dynamic, with an increasing number of nodes. We propose the use of DRL algorithms, specifically Deep Q Learning (DQN) and Deep Deterministic Policy Gradient (DDPG), using the well-known collision probability as the observation metric to optimize network performance.

The remainder of the paper is organized as follows. Section II presents a brief theoretical background and the methodology used for the simulation. Section III describes the simulation results. Finally, Section IV presents the conclusions and future works.

II. SIMULATION METHODOLOGY

A. Theoretical Background

RL involves single-agent learning through interaction with the environment, making decisions that maximize the overall cumulative reward [8]. The main idea of RL algorithms is to find a policy, i.e., a rule to explore and exploit the environment, that maximizes the total future reward. DRL combines RL with artificial neural networks to handle high-dimensional data and accelerate convergence [9]. DRL includes the DQN [10] and DDPG [11] algorithms. This work focuses on using these two algorithms to assist in optimizing the CW, with the primary goal of reducing node collisions while improving network performance.

A decentralized RL system is a multi-agent RL approach designed for decision-making and optimization involving multiple self-learning agents in a shared environment [12]. Based on game theory, MARL involves agents independently maximizing their rewards by using local information without considering other agents, resulting in a competitive and non-communicative learning process [13]. A decentralized approach offers significant advantages in the context of WLAN networks, such as providing optimal solutions to channel access, collision avoidance, resource management, and improved network performance. It supports scalability and robustness, allowing easy integration of new agents and continued operation despite agent failures. Agents can also share information to enhance convergence. However, decentralized MARL faces challenges like the non-stationarity problem [12], [13], which leads to the problem of moving targets: the optimal policy changes whenever the other agents' policies change. Furthermore, the computational complexity increases exponentially with more agents, and achieving agent communication, i.e., collaboration, is difficult due to limited local information and a lack of awareness of other agents' actions and rewards.

B. Methodology

The proposed approach encompasses a decentralized algorithm that runs on multiple stations simultaneously. Each station observes the network state independently and selects appropriate CW values to optimize overall network performance. Next, we describe each part of the decentralized solution; a code sketch showing how these elements fit together is given after the list.

1) Agent: each agent is represented by a DRL algorithm (DQN or DDPG) running on one of the stations, whose number varies from 5 to 50.

2) Current state: the environment status, s, of all stations associated with the AP. However, it is impossible to obtain this information because of the nature of the optimization problem. Therefore, we model the problem as a partially observable Markov decision process (POMDP) instead of a Markov decision process (MDP). A POMDP assumes the environment's state cannot be perfectly observed [14].

3) Observation, O: the network information, based on the collision probability, used to observe the overall network's status. This information is saved into a buffer of recent observations. Then a moving average is calculated, producing the mean value, µ, and the variance, σ², which are transferred to a two-dimensional vector used to train the DRL agent.

4) Action, a: determines the CW value. As we compare DRL algorithms with discrete and continuous action spaces, the actions are integer values between 0 and 6 in the discrete case and real values within the interval [0, 6] in the continuous case. This interval is selected so that the action space stays within the 802.11 standard's CW range, which goes from 15 up to 1023. Therefore, the CW value to be calculated for each station can be obtained by applying CW = ⌊2^(a+4)⌋ − 1.

5) Reward, r: the normalized network throughput. It is calculated by dividing the actual throughput by the expected maximum throughput for each station. Each station's agent receives individual rewards, and the cumulative reward to be broadcast to every station is the sum of these individual rewards, resulting in a real value within the interval [0, 1].
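To make the agent-environment interface concrete, the following minimal Python sketch shows how one station's interaction loop could be written on top of the NS3-gym bindings of [15]. It is only an illustration under several assumptions: the port number and simulation arguments are placeholders, the NS-3 scenario is assumed to report the collision probability as the observation and the broadcast normalized throughput as the reward, and the naive policy below stands in for a trained DQN or DDPG agent.

import numpy as np
from ns3gym import ns3env  # NS3-gym Python bindings [15]

ENV_STEP_TIME = 0.010    # 10 ms interaction interval (Table II)
TRAINING_PERIOD = 840.0  # training stage duration in seconds (Table II)
OBS_HISTORY = 300        # size of the observation history memory (Table II)

def preprocess(history):
    # Item 3): reduce the recent collision-probability samples to (mu, sigma^2).
    window = np.asarray(history[-OBS_HISTORY:], dtype=np.float64)
    return np.array([window.mean(), window.var()])

def policy(mu_var, explore):
    # Placeholder for the action function A_theta: returns an action a in [0, 6].
    # A trained DQN (discrete) or DDPG (continuous) agent would be queried here.
    greedy = 6.0 * mu_var[0]  # naive rule: more collisions -> larger CW
    noise = np.random.normal(0.0, 0.5) if explore else 0.0
    return float(np.clip(greedy + noise, 0.0, 6.0))

def action_to_cw(a):
    # Item 4): CW = floor(2^(a+4)) - 1, i.e., a value between 15 and 1023.
    return int(np.floor(2.0 ** (a + 4))) - 1

# Assumes an NS-3 scenario compiled against ns3-gym is already listening on this port.
env = ns3env.Ns3Env(port=5555, stepTime=ENV_STEP_TIME, startSim=False, simArgs={})
try:
    obs = env.reset()
    history, sim_time, done = [], 0.0, False
    while not done:
        history.append(float(np.ravel(obs)[0]))  # collision probability sample
        mu_var = preprocess(history)
        training = sim_time < TRAINING_PERIOD    # learning stage vs. operational stage
        a = policy(mu_var, explore=training)
        cw = action_to_cw(a)                     # CW the station would apply (illustration only)
        obs, reward, done, info = env.step(a)    # reward: broadcast normalized throughput
        sim_time += ENV_STEP_TIME
finally:
    env.close()

In the actual solution, the CW derived from the action is applied inside the NS-3 simulation, and the experience tuples formed by (µ, σ², a, r) are stored in the replay buffer for mini-batch updates, as detailed in Algorithm 1.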
The collision probability is a good characterization of the environment state. It is the probability of collision, p_col, observed by the network. It can also be interpreted as the probability of transmission failure. It is calculated based on the number of transmitted, N_t, and correctly received, N_r, frames, that is

    p_col = (N_t − N_r) / N_t.    (1)

This collision rate approximates the actual probability of collision as the number of frames used to calculate it increases. Thus, this rate represents the probability of a frame not being received because another station is transmitting a frame at the same time. These probabilities are calculated within the interaction periods and provide information on the performance of the selected CW value.

The decentralized CW optimization with DRL follows three stages: i) pre-learning, where the observation buffer is filled with data; ii) learning, where the agent selects CW values; and iii) operational, where the agent uses the actions learned during training that provide the best rewards. The decentralized solution includes a preprocessing step that uses the recent observation history to calculate a moving average, generating the mean, µ, and the variance, σ², values that fill the two-dimensional vector used for training the agent. Exploration is enabled by adding a decreasing noise factor to the actions: DQN uses a noise factor corresponding to the probability of taking a random action instead of the action predicted by the agent, whereas DDPG adds Gaussian noise directly to the actions. The implementation of the proposed decentralized solution follows the pseudo-code shown in Algorithm 1, which presents how the three aforementioned stages operate.

Algorithm 1: DRL-based Decentralized CW Optimization

▷ ### Initialization ###
 1: Define the maximum number of stations, WifiNode
 2: Initialize the observation buffer of each station, O(i), with zeroes
 3: Initialize the weights of each agent, θ(i)
 4: Get the action function of each station, A_θ(i), which each agent uses to choose the action according to its current state
 5: Initialize the algorithm's interaction period with the environment, envStepTime
 6: Initialize the number of episodes, N_episodes
 7: Initialize the number of steps per episode, N_spe
 8: Initialize the training stage period, trainingPeriod
 9: Set trainingFlag ← True to indicate that the algorithm is in the training stage
10: Initialize the experience replay buffer of each station, E(i), with zeroes
11: trainingStartTime ← currentTime
12: lastUpdate ← currentTime
13: Initialize the previous mean value of each station, µ_prev(i) ← 0
14: Initialize the previous variance value of each station, σ²_prev(i) ← 0
15: Set CW(i) ← 15, ∀i
16: for e = 1, ..., N_episodes do
17:     Reset and run the environment, i.e., reset and run the NS-3 simulator
18:     for t = 1, ..., N_spe do
19:         for i = 1, ..., WifiNode do
                ▷ ### Pre-learning stage ###
20:             N_t(i) ← get the number of transmitted frames of the i-th station
21:             N_r(i) ← get the number of received frames of the i-th station
22:             observation(i) ← (N_t(i) − N_r(i)) / N_t(i)
23:             O(i).append(observation(i))
24:             if currentTime ≥ lastUpdate + envStepTime then
                    ▷ ### Learning and operational stages ###
25:                 µ(i), σ²(i) ← preprocess(O(i))
26:                 a(i) ← A_θ(i)(µ(i), σ²(i), trainingFlag)
27:                 CW(i) ← ⌊2^(a(i)+4)⌋ − 1
28:                 if trainingFlag == True then
29:                     N_RP(i) ← get the number of received packets of the i-th station
30:                     tput(i) ← N_RP(i) / envStepTime
31:                     Send the throughput of each station to the access point
32:                     r ← normalize(tput(i))
33:                     Broadcast the new reward value r to all associated stations
34:                     E(i).append((µ(i), σ²(i), a(i), r, µ_prev(i), σ²_prev(i)))
35:                     µ_prev(i) ← µ(i)
36:                     σ²_prev(i) ← σ²(i)
37:                     mb(i) ← get a random mini-batch from E(i)
38:                     Update θ(i) based on mb(i)
39:                 end if
40:                 lastUpdate ← currentTime
41:             end if
                ▷ ### Transition between the learning and operational stages ###
42:             if currentTime ≥ trainingStartTime + trainingPeriod then
43:                 trainingFlag ← False
44:             end if
45:         end for
46:     end for
47: end for

III. SIMULATION RESULTS

In this section, we compare the proposed solution's performance against that of the conventional BEB algorithm under static and dynamic scenarios.

A. Simulation Scenario Description

In the decentralized solution, the stations with RL agents follow a distributed topology with one AP, as shown in Fig. 1. Each station acts as an autonomous agent, adjusting its CW value and receiving individual rewards. The cumulative reward is the total sum of all individual rewards. The stations' arrangement occurs statically and dynamically. The NS3-gym toolkit and the NS-3 simulator are used to train the agents and implement the DRL algorithms using TensorFlow and PyTorch [15]. Tables I and II summarize the parameter setup used in NS-3 for creating the agents' environment and in NS3-gym for training the DRL agents, respectively. Both DRL algorithms include one recurrent long short-term memory (LSTM) layer and two fully connected layers as the network architecture, forming an 8 × 128 × 64 topology.

Fig. 1. System model for the decentralized DRL-based CW optimization.
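The text does not break the 8 × 128 × 64 topology down layer by layer, so the PyTorch sketch below is only one plausible reading of it (an LSTM over a short window of (µ, σ²) observations followed by two fully connected layers); the window length, output dimension, and activation are assumptions rather than the authors' exact design.

import torch
import torch.nn as nn

class CwPolicyNet(nn.Module):
    # One plausible reading of the "LSTM + two fully connected layers" architecture.
    # The input window (4 steps of (mu, sigma^2) = 8 values), the output size,
    # and the ReLU activation are assumptions.
    def __init__(self, obs_dim: int = 2, n_outputs: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(input_size=obs_dim, hidden_size=128, batch_first=True)
        self.fc1 = nn.Linear(128, 64)
        self.head = nn.Linear(64, n_outputs)  # 7 Q-values for DQN, or 1 action for a DDPG actor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, obs_dim)
        _, (h_n, _) = self.lstm(x)
        h = torch.relu(self.fc1(h_n[-1]))
        return self.head(h)

# Example: Q-values for a single window of four (mu, sigma^2) observations.
q_values = CwPolicyNet()(torch.zeros(1, 4, 2))
print(q_values.shape)  # torch.Size([1, 7])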
TABLE I
NS-3 ENVIRONMENT CONFIGURATION PARAMETERS

Configuration Parameter        Value
Wi-Fi standard                 IEEE 802.11ax
Number of APs                  1
Number of static stations      5, 15, 30, or 50
Number of dynamic stations     increases steadily from 5 to 50
Frame aggregation              disabled
Packet size                    1500 bytes
Max queue size                 100 packets
Frequency                      5 GHz
Channel BW                     20 MHz
Traffic                        constant bit-rate UDP of 150 Mbps
MCS                            HeMcs (1024-QAM with a 5/6 coding rate)
Guard interval                 800 ns
Propagation delay model        ConstantSpeedPropagationDelayModel
Propagation loss model         MatrixPropagationLossModel

TABLE II
NS3-GYM AGENT CONFIGURATION PARAMETERS

Configuration Parameter                   Value
DQN's learning rate                       4 × 10⁻⁴
DDPG's actor learning rate                4 × 10⁻⁴
DDPG's critic learning rate               4 × 10⁻³
Reward discount rate                      0.7
Batch size                                32
Replay memory size                        18000
Size of observation history memory        300
trainingPeriod                            840 s
envStepTime (i.e., interaction interval)  10 ms
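For readers reproducing the setup, the agent-side hyperparameters of Table II could be grouped into a single configuration object, as in the hedged Python sketch below; the field names are illustrative and are not part of NS3-gym or the authors' code.

from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    # Hyperparameters taken from Table II; the field names are illustrative only.
    dqn_lr: float = 4e-4             # DQN's learning rate
    ddpg_actor_lr: float = 4e-4      # DDPG's actor learning rate
    ddpg_critic_lr: float = 4e-3     # DDPG's critic learning rate
    gamma: float = 0.7               # reward discount rate
    batch_size: int = 32
    replay_memory_size: int = 18000
    obs_history_size: int = 300      # size of the observation history memory
    training_period_s: float = 840.0
    env_step_time_s: float = 0.010   # 10 ms interaction interval

config = AgentConfig()
print(config.gamma, config.batch_size)  # 0.7 32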

B. Static Scenario

In the static scenario, a fixed number of stations associated with the AP is kept constant during the experiment execution. Fig. 2 presents a comparison of the network throughput as a function of the number of stations for the decentralized DRL algorithms and the conventional BEB. The improvements of DDPG over BEB are 5.19%, 17.64%, 16.67%, and 27.78% for 5, 15, 30, and 50 stations, respectively. The improvements of DQN over BEB are 5.19%, 17.21%, 16.27%, and 27.10% for 5, 15, 30, and 50 stations, respectively. The results demonstrate that DDPG is slightly better than DQN, especially for 15, 30, and 50 stations. This can be attributed to DDPG's ability to select any real CW value within the [0, 6] range, which is well suited for tracking the network's dynamics [11].

Fig. 2. Comparison of the network throughput in a static scenario as the number of stations increases from 5 to 50.

Fig. 3 depicts the mean CW value throughout 15 episodes with 30 stations in the static scenario. The results demonstrate that in the initial episodes there is a higher variance in the CW, but as the training progresses, the variance reduces. This happens because the number of random actions decreases over time, and the agent converges to a result and learns correctly. After the 10th episode, the mean CW value is kept stable around the same value.

Fig. 3. Mean CW value for 30 stations in the static scenario.

C. Dynamic Scenario

In the dynamic scenario, the number of stations grows steadily during the simulation execution, increasing from 5 to 50. The higher the number of stations, the higher the collision probability. This experiment evaluated whether the DRL algorithms correctly act upon network changes. Fig. 4 shows that the DRL algorithms effectively enhance the network's throughput compared to the BEB algorithm in the decentralized dynamic scenario. The degrees of DQN's and DDPG's improvement in comparison to BEB were similar: the improvements over BEB are 7.89%, 11.94%, 9.27%, and 8.43% for 5, 15, 30, and 50 stations, respectively.

Fig. 4. Comparison of the network throughput in the dynamic scenario as the number of stations increases from 5 to 50.

Fig. 5 shows the mean CW for the dynamic scenario as a function of the number of episodes. As with the static scenario, after some episodes, the CW value remains stable around the same value.

Fig. 6 illustrates the instantaneous selected CW values and the number of stations as a function of the simulation time for the decentralized dynamic scenario, with the number of stations incrementally rising from 5 to 50 every 1.2 seconds. It is possible to observe that DQN varies between discrete neighboring CW values, while DDPG consistently raises the CW value, resulting in a lower CW value for 50 stations and subsequently improving the throughput.
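The qualitative difference visible in Fig. 6, where DQN jumps between neighboring discrete CW values while DDPG adjusts the CW smoothly, follows from the two action spaces and their exploration noise. The short sketch below illustrates this with an arbitrary greedy action and an assumed decaying noise schedule; it does not reproduce the trained agents.

import numpy as np

rng = np.random.default_rng(0)

def cw_from_action(a):
    # Map an action in [0, 6] to a contention window: CW = floor(2^(a+4)) - 1.
    return int(np.floor(2.0 ** (a + 4))) - 1

def dqn_action(greedy, epsilon):
    # Discrete case: with probability epsilon pick a random action in {0, ..., 6}.
    return int(rng.integers(0, 7)) if rng.random() < epsilon else greedy

def ddpg_action(greedy, sigma):
    # Continuous case: add Gaussian noise to the actor's output and clip to [0, 6].
    return float(np.clip(greedy + rng.normal(0.0, sigma), 0.0, 6.0))

# Arbitrary greedy actions near a = 4; the noise scale decays as training progresses.
for step, decay in enumerate((1.0, 0.5, 0.1)):
    print(step,
          cw_from_action(dqn_action(greedy=4, epsilon=0.3 * decay)),
          cw_from_action(ddpg_action(greedy=4.2, sigma=0.5 * decay)))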
Fig. 5. Mean CW value for 30 stations in the dynamic scenario.

Fig. 6. Selected CW value with the number of stations increasing from 5 to 50 in the dynamic scenario.

Fig. 7 compares the instantaneous network throughput in the decentralized dynamic scenario, where the number of stations progressively increases from 5 to 50. The elevated number of stations modifies the CW value, affecting the instantaneous network throughput. The BEB's throughput decays to approximately 26 Mbps when the number of stations connected to the AP reaches 50, unlike the decentralized mode, which yields a throughput of 35.8 Mbps, an increase of 37.16% over BEB. The proposed DRL algorithms (with either DQN or DDPG) present an almost constant behavior, keeping a high and stable throughput as the number of stations progressively increases. These findings make the DRL algorithms well suited to address the collision avoidance challenges in dense wireless networks.

Fig. 7. Comparison of the instantaneous network throughput in the dynamic scenario.

IV. CONCLUSIONS

This work has proposed a decentralized solution using multiple self-learning agents with RL algorithms (DQN and DDPG) to optimize the CW parameter and enhance network throughput. Simulation results have shown that this approach outperforms the traditional BEB, with DQN and DDPG achieving up to a 37.16% increase in throughput with 50 stations. Both DRL algorithms have performed similarly, making either suitable for improving network efficiency in static and dynamic environments. These findings have confirmed the effectiveness of DRL algorithms in addressing collision avoidance in WLANs. Future work could focus on station cooperation through information sharing, which is crucial for agents to optimize network-wide rewards.

REFERENCES

[1] S. Giannoulis, C. Donato, R. Mennes, F. A. P. de Figueiredo, I. Jabandžic, Y. De Bock, M. Camelo, J. Struye, P. Maddala, M. Mehari, A. Shahid, D. Stojadinovic, M. Claeys, F. Mahfoudhi, W. Liu, I. Seskar, S. Latre, and I. Moerman, "Dynamic and collaborative spectrum sharing: The SCATTER approach," in 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), 2019, pp. 1–6.
[2] X. Guo, S. Wang, H. Zhou, J. Xu, Y. Ling, and J. Cui, "Performance evaluation of the networks with Wi-Fi based TDMA coexisting with CSMA/CA," Wireless Personal Communications, vol. 114, no. 2, pp. 1763–1783, 2020.
[3] P. Patel and D. K. Lobiyal, "A simple but effective collision and error aware adaptive back-off mechanism to improve the performance of IEEE 802.11 DCF in error-prone environment," Wireless Personal Communications, vol. 83, pp. 1477–1518, 2015.
[4] B.-J. Kwak, N.-O. Song, and L. E. Miller, "Performance analysis of exponential backoff," IEEE/ACM Transactions on Networking, vol. 13, no. 2, pp. 343–355, 2005.
[5] S. J. Sheila de Cássia, M. A. Ouameur, and F. A. P. de Figueiredo, "Reinforcement learning-based Wi-Fi contention window optimization," Journal of Communication and Information Systems, vol. 38, no. 1, pp. 128–143, 2023.
[6] R. Mennes, F. A. P. De Figueiredo, and S. Latré, "Multi-agent deep learning for multi-channel access in slotted wireless networks," IEEE Access, vol. 8, pp. 95032–95045, 2020.
[7] Y. Hu, M. Chen, Z. Yang, M. Chen, and G. Jia, "Optimization of resource allocation in multi-cell OFDM systems: a distributed reinforcement learning approach," in IEEE 31st Annual International Symposium on Personal, Indoor and Mobile Radio Communications, 2020, pp. 1–6.
[8] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. USA: Prentice Hall Press, 2009.
[9] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," arXiv preprint arXiv:1708.05866, 2017.
[10] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[12] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[13] D. Huh and P. Mohapatra, "Multi-agent reinforcement learning: A comprehensive survey," arXiv preprint arXiv:2312.10256, 2023.
[14] A. R. Cassandra, "A survey of POMDP applications," in Working Notes of the AAAI 1998 Fall Symposium on Planning with Partially Observable Markov Decision Processes, vol. 1724, 1998.
[15] P. Gawłowicz and A. Zubow, "NS3-gym: Extending OpenAI Gym for networking research," arXiv preprint arXiv:1810.03943, 2018.
