Deep Reinforcement Learning Algorithm Based On Multi-Agent Parallelism and Its Application in Game Environment
Keywords: Multi-agent parallelism; Deep reinforcement learning; Game environment; Convolutional neural network; Policy gradient

Abstract: Deep reinforcement learning has become a prominent area of research in artificial intelligence in recent years. Its application in solving complex tasks and game environments has garnered significant attention. This study aims to develop a deep reinforcement learning algorithm based on multi-agent parallelism to enhance intelligent decision-making in game environments. The algorithm combines a deep Q-network with a multi-agent cooperation strategy. Through parallel training of multiple agents, the learning process is accelerated and decision accuracy is improved. The experimental results indicated that, in terms of precision, recall, and average fitness, the Actor-Critic algorithm combined with multi-agent parallelism achieves a relatively high accuracy index, which stabilizes above 0.95. The recall index was also above 0.8, and the average fitness was in a relatively high range. The research shows that the deep reinforcement learning algorithm based on multi-agent parallelism performs better and is more effective in game environments. It can learn the optimal strategy faster and obtain higher rewards. This not only provides a new technical means for game development but also offers a useful reference for the application of multi-agent systems in complex environments.
* Corresponding author.
E-mail address: [email protected] (C. Liu).
https://doi.org/10.1016/j.entcom.2024.100670
Received 27 November 2023; Received in revised form 7 March 2024; Accepted 2 April 2024
Available online 4 April 2024
1875-9521/© 2024 Elsevier B.V. All rights reserved.
…concerns about user privacy and data protection. In the context of gaming, this study serves as a reminder of the importance of prioritizing user privacy and data security. It also explores effective methods for utilizing data to enhance training and learning outcomes while safeguarding user privacy. The application of the DRL algorithm based on multi-agent parallelism in GEs not only brings innovation and technological breakthroughs to the game industry but also provides useful references and insights for research and applications in related fields. The article comprises four sections, the first of which is a literature review discussing and analyzing the current state of DRL and its research in the gaming industry at both domestic and international levels. The second section introduces a MAP-based DRL algorithm implemented under the GE. The third section conducts experiments to demonstrate the algorithm's effectiveness in game applications. The fourth section provides a summary of the research findings.

2. Related works

DRL, an approach that amalgamates DL and RL techniques, has found frequent utilization across various sectors, such as robot control, gaming, and self-driving cars. It exhibits the potential to enhance performance through constant optimization and autonomous learning, and has shown results surpassing human performance in certain tasks. In order to obtain a marketing investment strategy to increase the visibility of its corresponding brand in marketing scenarios, the Vargas-Perez team developed a DRL agent based on a dual DQN algorithm. The results of the study showed that this decision support system was beneficial for optimizing an online dynamic learning environment [7]. In order to better control the vehicle's intelligence to learn errors based on its actions and interactions with the environment, Quek et al. used DQN based on fractional and pixel inputs to implement agent learning in the vehicle. It was experimentally confirmed that this method enabled the self-driving car to learn maneuvering operations and gradually gain the ability to successfully navigate and avoid obstacles [8]. Yang et al. suggested a DRL approach based on a 3D simulation platform to control the activity of the hydraulic support window, in response to the situation where the window is difficult to control. The method's performance was confirmed using three-dimensional simulation trials, which also yielded higher economic benefits and increased mining efficiency for top coal [9]. In order to optimize traditional network intrusion detection methods and improve the detection rate, Yang's team proposed a DL-based encrypted-network malicious traffic detection model, in which features of the encrypted malicious traffic are extracted automatically. An experimental demonstration of the model's 99.95 % accuracy in differentiating between regular and anomalous encrypted network data was conducted [10]. Wang et al. conducted a systematic classification of DRL algorithms and their applications, and provided a detailed review of existing methods. They categorized the algorithms into model-based, model-free, and advanced RL methods. The authors also summarized the current representative applications and analyzed four open questions for future research [11].

In addition to the application of DRL in the above-mentioned fields, many scholars have also applied it to the game field and achieved notable results. Zhu et al. improved the profit of the vehicle resource management game and proposed a resource management scheme based on DRL and the Stackelberg game, with which the profit of the vehicle and the vehicle edge computing server can be maximized. The scheme was empirically analyzed, and a large number of experimental results showed its effectiveness [12]. Li's team proposed two efficient DRL network architectures to improve the accuracy of game strategies in various games, and evaluated these two DRL network architectures. The evaluation results showed that both architectures can greatly improve the accuracy of game strategies and are practical [13]. Li et al. proposed a goal generation strategy incorporating maximum entropy exploration and a DRL algorithm in order to avoid deceptive games more effectively, comparing this strategy with the traditional goal generation strategy for deceptive games in the GE. The results revealed that the method can effectively avoid deceptive reward traps compared to the traditional goal generation strategy [14]. Liu's team proposed a strategy integrating a self-play actor-critic and DRL for the problem of instability of self-play RL in game application scenarios, and applied the strategy to agents playing computer games. The results showed that the overall effect of the agents trained using this strategy was better than that of the agents trained with the traditional strategy [15]. To address the issue of bugs in the game field, Rani et al. developed a model using deep reinforcement neural learning to detect vulnerabilities in the GE. This model automated error detection and minimized human intervention. The research results demonstrated the model's effectiveness in detecting errors in both fuzzy and non-fuzzy input on various platforms [16].

In summary, DRL has found wide application in various fields, including the gaming industry. However, the scope of its application in gaming is relatively limited due to the requirement of a large amount of data for training. Generating valid data in the gaming industry is often time-consuming and expensive. The generalization ability of current DRL algorithms is limited when dealing with complex environments and dynamic changes in games, making it difficult to adapt to different situations. DRL often produces decisions that lack interpretability, making it difficult for players and developers to understand the decision-making process and logic. Improving the robustness of DRL is crucial due to its high dependence on data and environment. Training DRL requires powerful computing resources, such as GPUs or TPUs, which can be a challenge for small game development teams. DRL is primarily applied to specific game types and tasks, such as esports and RPGs, and may not be suitable for more complex games or those requiring real-time interaction. It is important to note that the performance requirements of these types of games may not be met by DRL algorithms. Therefore, this study proposes a MAP-based DRL algorithm and applies it to several gaming environments. The aim is to enhance the integration of the artificial intelligence and gaming fields, thereby promoting intelligent development in the gaming industry.

3. Actor-Critic PG algorithm and MAP in GE

The research employs the positivist paradigm and the scientific method to analyze the DRL algorithm based on multi-agent parallelism and its application in the GE. The research findings are obtained through objective scientific methods and experiments [17]. The study uses an Actor-Critic Algorithm (ACA) based on the advantage function and reduces the bias in the estimation of the return value through the idea of multi-step TD. The study then designs a deep neural network for the approximation of the value function and the strategy function in the algorithm, based on the particular scenario of the GE. After that, the study proposes a parallelization framework for the ACA to make the training data weakly correlated. Finally, the implementation details are appropriately optimized to avoid the adverse effect of the policy delay problem on the training process.
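The multi-step TD idea referred to here is not written out in the surviving text. For reference only, and as an assumption about the intended form rather than a quotation from the paper, the k-step advantage estimate conventionally used in advantage actor-critic methods is:

```latex
% k-step advantage estimate (standard textbook form, assumed)
A_t^{(k)} = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t)
```

Bootstrapping only after k steps reduces the bias introduced by the value estimate while keeping the variance lower than a full Monte Carlo return, which matches the stated goal of reducing the bias in the return estimate.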
3.1. RGB to grayscale based game image preprocessing

DRL is a machine learning method that learns how to make optimal decisions to maximize cumulative rewards through the interaction between an agent program and its environment. DRL uses deep neural networks as function approximators to represent and learn policies or value functions with more expressive power than traditional machine learning methods [18]. The standard RL model consists of four components: environment, reward, action, and state. In RL, the agent and the environment interact at successive moments, and the process is shown in Fig. 1. It is possible to reduce any RL problem to a Markov Decision Process (MDP). Eq. (1) illustrates how the MDP characterizes a fully observable process, in which the qualities needed for decision-making are entirely determined by the content of the observed state.
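Eq. (1) itself does not survive in the extracted text. For reference, the fully observable Markov property that this paragraph describes is conventionally written as follows (standard textbook form, assumed rather than quoted from the paper):

```latex
% Markov property of a fully observable MDP (standard form, assumed)
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```

That is, the next state depends only on the current observed state and action, which is what allows the preprocessed game screen to serve as the decision-making state.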
The main idea behind DRL is to train deep neural networks with RL algorithms so that they may learn from their surroundings and modify their weights based on feedback signals to make better decisions. The commonly used RL algorithms in DRL include Q-learning, DQN, PG, etc. DRL has demonstrated excellent performance in many fields, such as …

Fig. 2. Schematic diagram of deep reinforcement learning framework based on multi-agent parallelism.
… [21,22]. First, the original image is converted from RGB and its size is reduced to 110×84; the image is then cropped to 84×84 so as to cover the full game area as much as possible. Finally, the individual pixel values are normalized to expedite the network's convergence. A common method of converting an RGB image to a grayscale image is to take a weighted average of the pixel values of the three RGB channels. Here is a simple method: assume that the input RGB image is a matrix of size M × N, where each pixel has three channels (red, green, and blue). The values of the red, green and blue channels of each pixel are denoted as R, G and B. The RGB to grayscale mapping is shown in Eq. (9).

Gray(i, j) = 0.299·R(i, j) + 0.587·G(i, j) + 0.114·B(i, j)   (9)

In Eq. (9), (i, j) denotes each pixel, and this non-uniform weight distribution is more sensitive to color than the direct averaging method. People usually perceive green as brighter than red and blue as weaker than red, so the transformed image retains more information. The simplest approach is to crop directly at the center, though the crop can be specified in any way. The study stacks the previous k consecutive observations as the state of the environment at the current time point before inputting them into the agent, in order to fully perceive the state of the environment. This avoids using only the observation at the current time point as the state of the environment. The game image cropping preprocessing process is shown in Fig. 3.
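A minimal sketch of this preprocessing pipeline is given below, assuming NumPy/OpenCV-style image arrays. The exact crop offset and the value of k are illustrative assumptions; the paper itself only states the 110×84 resize, the 84×84 crop, pixel normalization, and the stacking of the previous k observations.

```python
from collections import deque

import cv2
import numpy as np


def preprocess_frame(rgb_frame: np.ndarray) -> np.ndarray:
    """Convert one RGB game frame to a normalized 84x84 grayscale image."""
    r = rgb_frame[..., 0].astype(np.float32)
    g = rgb_frame[..., 1].astype(np.float32)
    b = rgb_frame[..., 2].astype(np.float32)
    gray = 0.299 * r + 0.587 * g + 0.114 * b   # Eq. (9)
    gray = cv2.resize(gray, (84, 110))         # resize to 110x84 (cv2 takes (width, height))
    gray = gray[13:97, :]                      # crop to 84x84; this offset is an illustrative choice
    return gray / 255.0                        # normalize pixel values


class FrameStack:
    """Stack the previous k preprocessed frames as the agent's state."""

    def __init__(self, k: int = 4):            # k = 4 is an assumption, not taken from the paper
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        frame = preprocess_frame(first_frame)
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames, axis=0)    # shape (k, 84, 84)

    def step(self, new_frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess_frame(new_frame))
        return np.stack(self.frames, axis=0)
```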
In addition to the previously mentioned preprocessing, Gym does further processing of the original environment after encapsulating the ALE, namely frame skipping. In general, at each moment the environment sends an action emitted by the agent to the emulator and feeds the result of that behavior back to the agent. Frame skipping is a little different: instead of feeding the current picture back to the agent immediately after completing an action, it ignores n frames of pictures in a row, and since the simulation program still works in the mode of one action per frame, the way to ignore them is to repeatedly send to the simulator the action entered by the agent at the most recent point in time. With the introduction of frame skipping, the agent is able to perform the selection of n actions sequentially in a single computation, which reduces the computation time and greatly improves the learning efficiency of the agent. By default, Gym automatically skips between 2 and 4 pictures after each action is selected.

Finally, a clear definition of the agent-environment interaction process is needed to prevent the interaction from entering a long cycle, which would result in slower learning. For instance, in the game of Breakout, if there are only a few bricks left on the board, the agent is likely to remain in a fixed position due to the decreasing probability of hitting and the uncertainty of the ball's trajectory. This can cause the agent to maintain a fixed trajectory for an extended period of time. The method is also prone to local extremes. This project therefore suggests adding a constraint number N to the agent-environment interaction in order to improve network performance. In other words, if the agent is still in the game after trying N operation options, the game is forcibly ended and an end signal is returned. Deep neural networks can learn the features and optimal strategies of the GE to enable prediction and decision-making for the GE. This approach has been very successful in RL tasks and game AI; for example, it performs well in complex environments such as Atari games.
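The frame-skipping and forced-termination behavior described above can be sketched as a thin environment wrapper. The sketch below assumes a Gym-style `env` with `reset`/`step`; the skip count, the value of the constraint N, and the accumulation of reward over skipped frames are assumptions about typical practice rather than details given in the paper.

```python
class SkipAndLimitWrapper:
    """Repeat each selected action over `skip` frames and force an episode
    end after `max_steps` interactions, as described in the text."""

    def __init__(self, env, skip: int = 4, max_steps: int = 10_000):
        self.env = env
        self.skip = skip            # Gym's ALE wrapper skips 2-4 frames by default
        self.max_steps = max_steps  # the constraint number N; the value here is illustrative
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.env.reset()

    def step(self, action):
        total_reward, done, obs, info = 0.0, False, None, {}
        for _ in range(self.skip):        # resend the most recent action for each skipped frame
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.steps += 1
        if self.steps >= self.max_steps:  # forcibly end the game and return an end signal
            done = True
        return obs, total_reward, done, info
```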
3.2. Implementation of parallelization framework based on Actor-Critic algorithm

After preprocessing, the game screen meets the needs of 2D convolution. Taking into account the intricacy of the game screen elements, the study built a convolutional neural network with four convolutional layers and three pooling layers connected to them, as illustrated in Fig. 4.

On this basis, the study proposes a method based on convolutional window shifting: the shift step of the convolutional window is set to 1 and its boundary is padded with zeros, so that the feature maps of each convolutional layer have the same dimensions as the input features. Eq. (10) illustrates the link between the convolutional layers' input and output scales.

output shape = input shape / step size   (10)
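A sketch of such a network is shown below in PyTorch, assuming 84×84 inputs stacked over k frames. The channel widths and kernel sizes are illustrative assumptions (the paper's Fig. 4 is not reproduced here); the point of the sketch is the structure of four stride-1, zero-padded convolutions, which preserve the spatial size in line with Eq. (10), with three pooling layers doing the downsampling.

```python
import torch
import torch.nn as nn


class GameScreenCNN(nn.Module):
    """Four stride-1, zero-padded convolutions interleaved with three pooling
    layers; channel widths are illustrative, not taken from the paper."""

    def __init__(self, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=3, stride=1, padding=1),  # 84x84 -> 84x84
            nn.ReLU(),
            nn.MaxPool2d(2),                                               # 84x84 -> 42x42
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),         # 42x42 -> 42x42
            nn.ReLU(),
            nn.MaxPool2d(2),                                               # 42x42 -> 21x21
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),         # 21x21 -> 21x21
            nn.ReLU(),
            nn.MaxPool2d(2),                                               # 21x21 -> 10x10
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),         # 10x10 -> 10x10
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)


# Example: a batch of stacked-frame states keeps its spatial size through every convolution.
if __name__ == "__main__":
    out = GameScreenCNN()(torch.zeros(8, 4, 84, 84))
    print(out.shape)  # torch.Size([8, 64, 10, 10])
```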
The role of the activation function in a neural network is to make the network nonlinear [23]. A neural network can only represent linear mappings when its only operations are linear convolutions and fully connected layers; fitting nonlinear inputs in real scenarios then becomes challenging even with increased network depth. In order to give the network a hierarchical, nonlinear mapping capability, the activation function is introduced. Activation functions are therefore essential components of deep neural networks. ReLU, a piecewise linear function of the form described in Eq. (11), is used in the study as the activation function of the hidden-layer neurons.
g(z) = ReLU(z) = { z, z ≥ 0; 0, z < 0 }   (11)

ReLUs are very similar to linear units and hence are easy to optimize. Almost all optimization algorithms for neural networks are based on gradient optimization. The most typical of these is Stochastic Gradient Descent (SGD). SGD is an extension of the gradient descent algorithm that treats the gradient as an expectation and approximates that estimate from a small sample. SGD updates the parameters as shown in Eq. (12).

∇_θ J(θ) = (1/m) · Σ_{i=1}^{m} ∇_θ L(f(x^(i); θ), y^(i)),   θ = θ − ε_k · g   (12)

In Eq. (12), m is the number of samples, L is the loss function of a single sample, g is the estimated gradient, and ε_k is the learning rate of the kth iteration. That is, in each step a small number of samples is drawn from the training set, and the parameters of the network are gradually updated using this randomly selected small sample, which greatly accelerates the training speed. However, the learning rate of SGD cannot change over time, so it must be adjusted manually during the iterations. Therefore, this algorithm is not used directly for deep neural network training. The study utilizes the RMSProp algorithm to update the parameters of the neural network. The RMSProp algorithm is an improvement on AdaGrad that better handles nonconvex problems by accumulating the squared gradients as an exponentially weighted moving average rather than as a simple sum. AdaGrad has strong convergence for convex functions, but for nonconvex problems its learning trajectory traverses multiple structures before finally arriving at a locally convex region. Because AdaGrad scales down the learning rate according to the entire history of squared gradients, its learning rate may already be very low before this convex structure is reached. The RMSProp algorithm introduces a new hyperparameter, ρ, which discards distant historical gradients through exponentially decayed averaging, allowing the method to converge quickly once a locally convex region is reached, much like an instance of AdaGrad initialized within that region. In practice, RMSProp is an efficient and practical deep-network optimization algorithm and is commonly used in the current DL field.
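As a reference for the two optimizers discussed above, a minimal NumPy sketch of one parameter update under SGD and under RMSProp is given below; the hyperparameter values are illustrative defaults, not the settings used in the paper.

```python
import numpy as np


def sgd_step(theta: np.ndarray, grad: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Plain SGD: theta <- theta - lr * g, as in Eq. (12); lr must be tuned by hand."""
    return theta - lr * grad


def rmsprop_step(theta: np.ndarray, grad: np.ndarray, state: dict,
                 lr: float = 1e-3, rho: float = 0.9, eps: float = 1e-8) -> np.ndarray:
    """RMSProp: keep an exponentially decaying average of squared gradients (decay rho)
    and scale the step by its square root, so the effective learning rate adapts."""
    state["sq_avg"] = rho * state.get("sq_avg", np.zeros_like(grad)) + (1.0 - rho) * grad ** 2
    return theta - lr * grad / (np.sqrt(state["sq_avg"]) + eps)


# One illustrative update on a toy quadratic loss L(theta) = 0.5 * ||theta||^2 (gradient = theta).
theta = np.array([1.0, -2.0])
theta = sgd_step(theta, grad=theta)
theta = rmsprop_step(theta, grad=theta, state={})
```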
Although the improved ACA can reduce the estimation error of the return value to some extent, like other RL methods it fails to effectively solve the correlation problem among the training samples. The research therefore presents a method that makes the training data only weakly correlated. The experience replay technique involves random sampling of past experiences to obtain weakly correlated training samples. It is based on a single agent, since serialized training data generated by one agent must be strongly correlated. However, in the case of multiple agents, the correlation between the microbatch data generated by each agent is poor because the agents are independent of each other. In this way, the microbatch data generated by multiple agents can be combined to obtain a training batch that is approximately randomly sampled. On this basis, the study proposes a parallelization framework for the ACA, as shown in Fig. 5.

Fig. 5. Parallelization framework.

The framework is divided into three main parts: Worker, Master and Model. The Worker is a process that encapsulates a standard RL model that interacts with the environment through an agent. However, the behavior of this agent is different from that of an agent in a typical reinforcement learning algorithm. The method inputs behaviors into the system through the selection of learned strategies, and then determines whether the system is reset through termination markers fed back from the external environment. However, the agent is not rewarded by feedback from the surroundings. In other words, all this agent has to do is select a behavior and input it into the environment for execution, without computing or training anything. On this basis, a very small pool of migrated experiences is added to the Worker to temporarily save the state-transition relations generated by the agent while interacting with the environment, up to k + 1 of them.
Therefore, this migration experience pool requires much less memory than the experience replay method. After the agent has finished interacting with the environment, the Worker transmits the rewards and states it has obtained to the Master through the pipeline. The specific workflow of the Worker is shown in Fig. 6.

The Master is also a process, which encapsulates three modules. First, the Logic Control is the most important part, which logically controls the entire process. It communicates with all the worker processes through a single pipeline and processes the reports and status information from each worker. This process can be divided into three stages. In the first scenario, if the transmitted experience is empty, meaning that the environment is still in its initial state, then the rewards at this point in time are meaningless. Therefore, it is sufficient to add this state to a predefined queue, and the Prediction Queue will then output the corresponding strategy. The second scenario is that if the experience pool is not empty, the new experience value is added to the current experience. When an event occurs in a given time interval, the first step is to iterate through the multi-step TD algorithm for each event to get the corresponding microbatch samples; the samples are then fed into the Training Queue, where they are trained by the Trainer. Next, the state is added to the Prediction Queue and the strategy request is sent to the Prediction Queue. It should be noted that here the maximum number of interactions is not the same as the maximum number of steps per iteration, which is K + 1. The goal is to take the state value obtained by the agent's (K + 1)th strategy request as the estimated value of the return for the next step, in order to circumvent needless network operations and make the program easy to run. In the third scenario, if the pool of transmitted experience is not empty but the episode is not currently finished and the number of agents has not yet reached the upper limit, then the reward is added to the current transmitted experience, and the same steps as in the first scenario are followed. The specific workflow of the Master is shown in Fig. 7.
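The Worker-Master interaction described above can be sketched with standard multiprocessing queues. The sketch below is a simplified, assumed rendering of the framework in Figs. 5-7: the names `prediction_queue` and `training_queue` follow the text, while `make_env`, `model.act` and `model.train_on` are placeholders, and the exact message formats and control logic of the paper's implementation are not reproduced.

```python
import multiprocessing as mp
import time


def worker(worker_id, prediction_queue, action_pipe, training_queue, make_env):
    """Worker: interact with the game, request actions from the Master, and
    send back small batches of (state, action, reward, done) transitions."""
    env = make_env()
    state, experiences = env.reset(), []
    while True:
        prediction_queue.put((worker_id, state))   # ask the Master for a policy decision
        action = action_pipe.recv()                # the Worker itself does no computation or training
        next_state, reward, done, _ = env.step(action)
        experiences.append((state, action, reward, done))
        if done or len(experiences) >= 5:          # tiny migrated-experience pool (size is illustrative)
            training_queue.put(experiences)
            experiences = []
        state = env.reset() if done else next_state


def master(prediction_queue, action_pipes, training_queue, model):
    """Master (Logic Control): serve policy requests from the shared model and
    hand completed experience batches to the Trainer."""
    while True:
        while not prediction_queue.empty():
            worker_id, state = prediction_queue.get()
            action_pipes[worker_id].send(model.act(state))   # Prediction Queue -> policy output
        while not training_queue.empty():
            model.train_on(training_queue.get())             # Training Queue -> Trainer
        time.sleep(0.001)                                    # avoid a tight busy loop
```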
Although the parallelism of the ACA greatly improves the convergence and stability of the network, there are still some issues that cause instability in the learning process. During the execution of the program, the agents tend to generate more data than the Trainer consumes, which results in the expiration of training samples. The so-called "expiration" refers to the network parameters: for example, during the (t − k)th iteration of the network, an agent chooses an action A_{t−k} from the predicted policies in state S_{t−k}, and then generates a training sample through the control module and adds it to the Training Queue. Since Trainer training is slow, that training sample may not actually be used until the tth iteration. Thus, when this data starts to be used in training, its corresponding PG is calculated as shown in Eq. (13).

∇_{θ_t} J(θ_t) = ∇_{θ_t} [δ · log π(A_{t−k} | S_{t−k}; θ_t) + β · H(π(S_{t−k}; θ_t))]   (13)

Again, because the parameters of the network are changing all the time during the iteration process, the parameters of the network at the (t − k)th iteration are different from the parameters at the tth iteration. The strategy function used in calculating the gradient values is therefore different from the one used in generating the migration experience. The instability of the network training is caused by the use of stale data, and is largely due to the logarithmic nature of the policy function. When the migration experience is generated, the probability values are usually large because the agents' behaviors are selected according to a probability distribution. However, after k iterations the probability distribution of the behaviors changes, making these values small or even close to zero; the magnitude of the logarithm of such a value approaches infinity, leading to failure of the optimization. To avoid this phenomenon, training samples with large policy delay need to be filtered out. Therefore, the study makes some modifications to the policy function used for calculating the gradient, and the modified PG is shown in Eq. (14).

∇_{θ_t} J(θ_t) = ∇_{θ_t} [δ · log(max{ρ, ε}) + β · ρ · log(max{ρ, ε})]   (14)

In Eq. (14), ρ = π(A_{t−k} | S_{t−k}; θ_t) and H(ρ) = −ρ · log ρ. When ρ < ε, the loss function of the strategy network is constant and does not depend on the network parameters, so its corresponding PG is 0. Based on this, the study applies a policy-delay filter: the gradient of such a sample with respect to the parameters is also 0, so the sample is effectively screened out. While the improved ACA can reduce return value estimation errors to some extent, like other RL methods it does not address the issue of correlation between training samples. In RL, training data correlation refers to the correlation between consecutive state-behavior sequences. Traditional RL algorithms often assume that states are independent and identically distributed. However, in practice this assumption is often invalid, and dealing with the correlation of training data is a crucial issue. One approach to address this problem is to introduce a temporal dependency model that can capture time dependencies between states and make better use of training data. A method called 'delayed reward' can be used to achieve this. The basic idea of this method is that at each time step t, not only do the current state s_t and behavior a_t receive an immediate reward r_t, but a delayed reward r_{t+1} is also obtained based on the state s_{t+1} and behavior a_{t+1}. In this way, the model is able to learn the relationship between behavior and future rewards, thereby making better use of the training data.
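A minimal sketch of a loss corresponding to Eq. (14) is given below in PyTorch. Here `probs` are the current probabilities π(A_{t−k} | S_{t−k}; θ_t) of the stored actions, `delta` is the advantage estimate, and `beta` and the threshold `eps` are assumed inputs; clamping the probability at ε is the mechanism by which samples with a large policy delay contribute zero gradient.

```python
import torch


def delayed_policy_loss(probs: torch.Tensor, delta: torch.Tensor,
                        beta: float = 0.01, eps: float = 1e-3) -> torch.Tensor:
    """Sketch of the filtered policy-gradient loss in the spirit of Eq. (14).

    probs: pi(A_{t-k} | S_{t-k}; theta_t) for each sample in the batch.
    delta: advantage estimates, treated as constants with respect to theta_t.
    """
    log_p = torch.log(torch.clamp(probs, min=eps))   # log(max{rho, eps})
    # For probs < eps the clamp is constant, so these samples contribute zero
    # gradient and stale (policy-delayed) data is effectively screened out.
    loss = -(delta.detach() * log_p + beta * probs.detach() * log_p)
    return loss.mean()
```

Minimizing this loss ascends the gradient of Eq. (14) only for samples whose probability is still above ε.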
4. Research on MAP-based DRL algorithm and its application in GE

The study mainly analyzes the feasibility and stability of this DRL algorithm. The article conducts comparative experiments on parameterized configurations that may affect the algorithm's performance, as analyzed in the previous section. It compares the algorithm with the existing Deep Q-learning (DQL) algorithm based on experience replay and with the training process of the GPU-based Asynchronous Advantage Actor-Critic (GA3C) algorithm, which is also based on MAP.
4.1. Performance analysis of MAP-based DRL algorithm

… performance. The average accuracies of the four algorithms were compared, and the results are shown in Fig. 10. In Fig. 10, the average accuracy of method 2 is 83 %, that of method 3 is 90 %, and that of method 4 is 88 %, while that of method 1, i.e., the ACA combined with MAP, is 96 %, which is higher than the average accuracy of the other three methods.
… DQL algorithm after 14 h of training and still has not reached a stable value. It shows that the DRL algorithm is able to learn the control strategy under a complex GE quickly. The study compares the performance of the DRL algorithm with the State-Action-Reward-State-Action algorithm, the Linear Learner algorithm and the Random algorithm in the games Breakout, Pong, Boxing, Qbert and Space Invaders. The algorithms used in this study include the Random algorithm, which employs a randomized strategy to generate actions; the State-Action-Reward-State-Action algorithm, which learns from manually labeled features; the Linear Learner algorithm, which learns by using linear function approximation based on the SARSA algorithm; and the DRL algorithm, which was chosen for comparison and run for 18 h. The final performance of the different methods in the different games is shown in Table 2.

In Table 2, the performance of the DRL algorithm in the five games comprehensively outperforms the traditional RL algorithms based on artificially labeled features, which indicates that in the absence of artificially labeled features and prior knowledge, the DRL algorithm implemented in the study is still able to learn the high-level features in …
5. Discussion

6. Conclusion
… finally stabilized above 0.95, the recall index was also finally above 0.8, and the average fitness was also in a relatively high interval. The DRL algorithm comprehensively outperformed the traditional RL algorithm based on manually labeled features in the five games. The research indicated that the multi-agent parallel DRL algorithm offers significant advantages in GEs. When compared to traditional single-agent RL algorithms, multi-agent parallel training accelerated the learning process and enhanced the accuracy and efficiency of decision-making. The MAP-based DRL algorithm had significant advantages in GEs: it enabled cooperative decision-making among multiple agents and achieved high performance. However, challenges and problems still exist, and current algorithms have limited generalization ability when dealing with complex environments and dynamic changes in games. Future research can focus on improving the model's generalization ability to better adapt to different game scenarios and environmental changes. DRL often produces decisions that lack interpretability, making it difficult for players and developers to understand the decision-making process and logic. Future research could explore techniques for interpreting machine learning to enhance the transparency of the decision-making process. DRL heavily relies on data and environment, which can pose challenges to its robustness and stability. Therefore, future research should focus on improving the algorithm's stability and robustness to ensure it can operate effectively in diverse gaming environments. Future research should also consider how to regulate and adjust the behavior of algorithms to ensure that they meet ethical and regulatory requirements when DRL is used in games.

CRediT authorship contribution statement

Chao Liu: Writing – review & editing, Writing – original draft, Methodology, Formal analysis, Data curation, Conceptualization. Di Liu: Writing – review & editing, Writing – original draft, Visualization, Methodology, Investigation, Formal analysis, Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] W. Chen, X. Qiu, T. Cai, H.N. Dai, Z. Zheng, Y. Zhang, Deep reinforcement learning for internet of things: a comprehensive survey, IEEE Commun. Surv. Tut. 23 (3) (May 2021) 1659–1692, https://doi.org/10.1109/COMST.2021.3073036.
[2] A. Feriani, E. Hossain, Single and multi-agent deep reinforcement learning for AI-enabled wireless networks: a tutorial, IEEE Commun. Surv. Tut. 23 (2) (September 2021) 1226–1252, https://doi.org/10.1109/COMST.2021.3063822.
[3] Z. Zhu, K. Lin, A.K. Jain, J. Zhou, Transfer learning in deep reinforcement learning: a survey, PAMI 45 (11) (Nov 2023) 13344–13362, https://doi.org/10.1109/TPAMI.2023.3292075.
[4] B.R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A.A. Sallab, S. Yogamani, P. Pérez, Deep reinforcement learning for autonomous driving: a survey, T-ITS 23 (6) (June 2022) 4909–4926, https://doi.org/10.1109/TITS.2021.3054625.
[5] J. Hu, H. Niu, J. Carrasco, B. Lennox, F. Arvin, Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning, TVT 69 (12) (Dec 2020) 14413–14423, https://doi.org/10.1109/TVT.2020.3034800.
[6] A. Haydari, Y. Yılmaz, Deep reinforcement learning for intelligent transportation systems: a survey, T-ITS 23 (1) (Jan 2022) 11–32, https://doi.org/10.1109/TITS.2020.3008612.
[7] V.A. Vargas-Perez, P. Mesejo, M. Chica, O. Cordon, Deep reinforcement learning in agent-based simulations for optimal media planning, Inform. Fusion 15 (7) (March 2023) 644–664, https://doi.org/10.1016/j.inffus.2022.10.029.
[8] Y.T. Quek, W.A. Tso, W.L. Woo, N.T. Koh, L.L. Koh, Deep Q-network implementation for simulated autonomous vehicle control, IET Intell. Transp. Sy. 15 (7) (May 2021) 875–885, https://doi.org/10.1049/itr2.12067.
[9] Y. Yang, X. Li, H. Li, R. Yuan, Deep Q-network for optimal decision for top-coal caving, Energies 13 (7) (April 2020) 1618–1697, https://doi.org/10.3390/en13071618.
[10] J. Yang, G. Liang, B. Li, G. Wen, T. Aao, A deep-learning- and reinforcement-learning-based system for encrypted network malicious traffic detection, Electron. Lett. 57 (9) (February 2021) 23–37, https://doi.org/10.1049/ell2.12125.
[11] H. Wang, N. Liu, Y. Zhang, D.W. Feng, F. Huang, D.S. Li, Y.M. Zhang, Deep reinforcement learning: a survey, Front. Inform. Tech. El. 21 (12) (October 2021) 1726–1744, https://doi.org/10.1631/FITEE.1900533.
[12] X. Zhu, Y. Luo, A. Liu, N.N. Xiong, M. Dong, S. Zhang, A deep reinforcement learning-based resource management game in vehicular edge computing, T-ITS 23 (3) (March 2022) 2422–2433, https://doi.org/10.1109/TITS.2021.3114295.
[13] Y. Li, Y. Fang, Z. Akhtar, Accelerating deep reinforcement learning model for game strategy, NC 408 (9) (September 2020) 157–168, https://doi.org/10.1016/j.neucom.2019.06.110.
[14] C. Li, X. Wei, Y. Zhao, X. Geng, An effective maximum entropy exploration approach for deceptive game in reinforcement learning, NC 403 (25) (August 2020) 98–108, https://doi.org/10.1016/j.neucom.2020.04.068.
[15] S. Liu, J. Cao, Y. Wang, W. Chen, Y. Liu, Self-play reinforcement learning with comprehensive critic in computer games, NC 449 (18) (August 2021) 207–213, https://doi.org/10.1016/j.neucom.2021.04.006.
[16] G. Rani, U. Pandey, A.A. Wagde, V.S. Dhaka, A deep reinforcement learning technique for bug detection in video games, IJIT 15 (1) (August 2023) 355–367, https://doi.org/10.1007/s41870-022-01047-z.
[17] S. Rahi, Research design and methods: a systematic review of research paradigms, sampling issues and instruments development, Int. J. Econ. Manag. Sci. 6 (2) (June 2017) 1–5, https://doi.org/10.4172/2162-6359.1000403.
[18] K. Zhang, J. Cao, Y. Zhang, Adaptive digital twin and multiagent deep reinforcement learning for vehicular edge computing and networks, IEEE T Ind. Inform. 18 (2) (Feb 2022) 1405–1413, https://doi.org/10.1109/TII.2021.3088407.
[19] X. Tang, J. Chen, T. Liu, Y. Qin, D. Cao, Distributed deep reinforcement learning-based energy and emission management strategy for hybrid electric vehicles, IEEE T Veh. Technol. 70 (10) (Oct 2021) 9922–9934, https://doi.org/10.1109/TVT.2021.3107734.
[20] C. Huang, R. Mo, C. Yuen, Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning, IEEE J. Sel. Area Comm. 38 (8) (Aug 2020) 1839–1850, https://doi.org/10.1109/JSAC.2020.3000835.
[21] W.J. Yun, S. Park, J. Kim, M.J. Shin, S. Jung, D.A. Mohaisen, Cooperative multiagent deep reinforcement learning for reliable surveillance via autonomous multi-UAV control, IEEE T Ind. Inform. 18 (10) (Oct 2022) 7086–7096, https://doi.org/10.1109/TII.2022.3143175.
[22] H. Yang, Z. Xiong, J. Zhao, D. Niyato, L. Xiao, Q. Wu, Deep reinforcement learning-based intelligent reflecting surface for secure wireless communications, IEEE TWC 20 (1) (2021) 375–388, https://doi.org/10.1109/TWC.2020.3024860.
[23] E. Nsugbe, Toward a self-supervised architecture for semen quality prediction using environmental and lifestyle factors, AI&A 1 (1) (October 2023) 35–42, https://doi.org/10.47852/bonviewAIA2202303.