Deep Reinforcement Learning Algorithm Based On Multi-Agent Parallelism and Its Application in Game Environment
Keywords: Multi-agent parallelism; Deep reinforcement learning; Game environment; Convolutional neural network; Policy gradient

Abstract: Deep reinforcement learning has become a prominent area of research in artificial intelligence in recent years. Its application in solving complex tasks and game environments has garnered significant attention. This study aims to develop a deep reinforcement learning algorithm based on multi-agent parallelism to enhance intelligent decision-making in game environments. The algorithm combines a deep Q-network with a multi-agent cooperation strategy. Through parallel training of multiple agents, the learning process is accelerated and decision accuracy is improved. The experimental results indicated that, in terms of precision, recall, and average fitness, the Actor-Critic algorithm combined with multi-agent parallelism achieves a relatively high accuracy index, which stabilizes above 0.95. The recall index was also above 0.8, and the average fitness was in a relatively high range. The research shows that the deep reinforcement learning algorithm based on multi-agent parallelism performs better and is more effective in game environments. It can learn the optimal strategy faster and obtain higher rewards. This not only provides a new technical means for game development but also offers a useful reference for the application of multi-agent systems in complex environments.
* Corresponding author.
E-mail address: [email protected] (C. Liu).
https://doi.org/10.1016/j.entcom.2024.100670
Received 27 November 2023; Received in revised form 7 March 2024; Accepted 2 April 2024
Available online 4 April 2024
1875-9521/© 2024 Elsevier B.V. All rights reserved.
…concerns about user privacy and data protection. In the context of gaming, this study serves as a reminder of the importance of prioritizing user privacy and data security. It also explores effective methods for utilizing data to enhance training and learning outcomes while safeguarding user privacy. The application of the DRL algorithm based on multi-agent parallelism in GEs not only brings innovation and technological breakthroughs to the game industry but also provides useful references and insights for research and applications in related fields. The article comprises four sections, the first of which is a literature review discussing and analyzing the current state of DRL and its research in the gaming industry at both domestic and international levels. The second section introduces a MAP-based DRL algorithm implemented under the GE. The third section conducts experiments to demonstrate the algorithm's effectiveness in game applications. The fourth section provides a summary of the research findings.

2. Related works

DRL, an approach that amalgamates DL and RL techniques, has found frequent utilization across various sectors, such as robot control, gaming, and self-driving cars. It exhibits the potential to enhance performance through constant optimization and autonomous learning, and has shown results surpassing human performance in certain tasks. In order to obtain a marketing investment strategy to increase the visibility of its corresponding brand in marketing scenarios, the Vargas-Perez team developed a DRL agent based on a dual DQN algorithm. The results of the study showed that this decision support system was beneficial for optimizing an online dynamic learning environment [7]. In order to better control the vehicle's intelligence to learn errors based on its actions and interactions with the environment, Quek et al. used DQN based on fractional and pixel inputs to implement agent learning in the vehicle. It was experimentally confirmed that this method enabled the self-driving car to learn maneuvering operations and gradually gain the ability to successfully navigate and avoid obstacles [8]. Yang et al. suggested a DRL approach based on a 3D simulation platform to control the activity of the hydraulic support window, in response to the situation where the window is difficult to control. The method's performance was confirmed using three-dimensional simulation trials, which also yielded higher economic benefits and increased mining efficiency for top coal [9]. In order to optimize traditional network intrusion detection methods and improve the detection rate, Yang's team proposed a DL-based encrypted-network malicious traffic detection model, in which features of the encrypted malicious traffic are extracted automatically. An experimental demonstration of the model's 99.95 % accuracy in differentiating between regular and anomalous encrypted network data was conducted [10]. Wang et al. conducted a systematic classification of DRL algorithms and their applications, and provided a detailed review of existing methods. They categorized the algorithms into model-based, model-free, and advanced RL methods. The authors also summarized the current representative applications and analyzed four open questions for future research [11].

In addition to the application of DRL in the above-mentioned fields, many scholars have also applied it to the game field and achieved notable results. Zhu et al. improved the profit of the vehicle resource management game and proposed a resource management scheme based on DRL and the Stackelberg game, with which the profit of the vehicle and the vehicle edge computing server can be maximized. The scheme was empirically analyzed, and a large number of experimental results showed its effectiveness [12]. Li's team proposed two efficient DRL network architectures to improve the accuracy of game strategies in various games, and evaluated these two DRL network architectures. The evaluation results showed that both architectures can greatly improve the accuracy of game strategies and are practical [13]. Li et al. proposed a goal generation strategy incorporating maximum entropy exploration and a DRL algorithm in order to avoid deceptive games more effectively, comparing this strategy with the traditional goal generation strategy for deceptive games in the GE. The results revealed that the method can effectively avoid deceptive reward traps compared to the traditional goal generation strategy [14]. Liu's team proposed a strategy integrating a self-play actor-critic and DRL for the problem of instability of self-play RL in game application scenarios, and applied the strategy to agents playing computer games. The results showed that the overall effect of the agents trained using this strategy was better than that of the agents trained with the traditional strategy [15]. To address the issue of bugs in the game field, Rani et al. developed a model using deep reinforcement neural learning to detect vulnerabilities in the GE. This model automated error detection and minimized human intervention. The research results demonstrated the model's effectiveness in detecting errors in both fuzzy and non-fuzzy input on various platforms [16].

In summary, DRL has found wide application in various fields, including the gaming industry. However, the scope of its application in gaming is relatively limited due to the requirement of a large amount of data for training. Generating valid data in the gaming industry is often time-consuming and expensive. The generalization ability of current DRL algorithms is limited when dealing with complex environments and dynamic changes in games, making it difficult to adapt to different situations. DRL often produces decisions that lack interpretability, making it difficult for players and developers to understand the decision-making process and logic. Improving the robustness of DRL is crucial due to its high dependence on data and environment. Training DRL requires powerful computing resources, such as GPUs or TPUs, which can be a challenge for small game development teams. DRL is primarily applied to specific game types and tasks, such as esports and RPGs, and may not be suitable for more complex games or those requiring real-time interaction. It is important to note that the performance requirements of these types of games may not be met by DRL algorithms. Therefore, this study proposes a MAP-based DRL algorithm and applies it to several gaming environments. The aim is to enhance the integration of the artificial intelligence and gaming fields, thereby promoting intelligent development in the gaming industry.

3. Actor-Critic PG algorithm and MAP in GE

The research employs the positivist paradigm and the scientific method to analyze the DRL algorithm based on multi-agent parallelism and its application in the GE. The research findings are obtained through objective scientific methods and experiments [17]. The study uses an Actor-Critic Algorithm (ACA) based on the advantage function and reduces the bias in the estimation of the return value through the idea of multi-step TD. The study then designs a deep neural network for the approximation of the value function and the strategy function in the algorithm, based on the particular scenario of the GE. After that, the study proposes a parallelization framework for the ACA to make the training data weakly correlated. Finally, the implementation details are appropriately optimized to avoid the adverse effect of the policy delay problem on the training process.
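The multi-step TD idea referred to here is not written out in the surviving text. For reference only, and as an assumption about the intended form rather than a quotation from the paper, the k-step advantage estimate conventionally used in advantage actor-critic methods is:

```latex
% k-step advantage estimate (standard textbook form, assumed)
A_t^{(k)} = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t)
```

Bootstrapping only after k steps reduces the bias introduced by the value estimate while keeping the variance lower than a full Monte Carlo return, which matches the stated goal of reducing the bias in the return estimate.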
3.1. RGB to grayscale based game image preprocessing

DRL is a machine learning method that learns how to make optimal decisions to maximize cumulative rewards through the interaction between an agent program and its environment. DRL uses deep neural networks as function approximators to represent and learn policies or value functions with more expressive power than traditional machine learning methods [18]. The standard RL model consists of four components: environment, reward, action, and state. In RL, the agent and the environment interact at successive moments, and the process is shown in Fig. 1. It is possible to reduce any RL problem to a Markov Decision Process (MDP). Eq. (1) illustrates how the MDP characterizes a fully observable process, in which the qualities needed for decision-making are entirely determined by the content of the observed state.
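Eq. (1) itself does not survive in the extracted text. For reference, the fully observable Markov property that this paragraph describes is conventionally written as follows (standard textbook form, assumed rather than quoted from the paper):

```latex
% Markov property of a fully observable MDP (standard form, assumed)
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```

That is, the next state depends only on the current observed state and action, which is what allows the preprocessed game screen to serve as the decision-making state.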
The main idea behind DRL is to train deep neural networks with RL algorithms so that they may learn from their surroundings and modify their weights based on feedback signals to make better decisions. The commonly used RL algorithms in DRL include Q-learning, DQN, PG, etc. DRL has demonstrated excellent performance in many fields, such as …

Fig. 2. Schematic diagram of deep reinforcement learning framework based on multi-agent parallelism.
… [21,22]. First, the original image is converted from RGB and its size is reduced to 110×84; the image is then cropped to 84×84 so as to cover the full game area as much as possible. Finally, the individual pixel values are normalized to expedite the network's convergence. A common method of converting an RGB image to a grayscale image is to take a weighted average of the pixel values of the three RGB channels. Here is a simple method: assume that the input RGB image is a matrix of size M × N, where each pixel has three channels (red, green, and blue). The values of the red, green and blue channels of each pixel are denoted as R, G and B. The RGB to grayscale mapping is shown in Eq. (9).

Gray(i, j) = 0.299·R(i, j) + 0.587·G(i, j) + 0.114·B(i, j)   (9)

In Eq. (9), (i, j) denotes each pixel, and this non-uniform weight distribution is more sensitive to color than the direct averaging method. People usually perceive green as brighter than red and blue as weaker than red, so the transformed image retains more information. The simplest approach is to crop directly at the center, though the crop can be specified in any way. The study stacks the previous k consecutive observations as the state of the environment at the current time point before inputting them into the agent, in order to fully perceive the state of the environment. This avoids using only the observation at the current time point as the state of the environment. The game image cropping preprocessing process is shown in Fig. 3.
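A minimal sketch of this preprocessing pipeline is given below, assuming NumPy/OpenCV-style image arrays. The exact crop offset and the value of k are illustrative assumptions; the paper itself only states the 110×84 resize, the 84×84 crop, pixel normalization, and the stacking of the previous k observations.

```python
from collections import deque

import cv2
import numpy as np


def preprocess_frame(rgb_frame: np.ndarray) -> np.ndarray:
    """Convert one RGB game frame to a normalized 84x84 grayscale image."""
    r = rgb_frame[..., 0].astype(np.float32)
    g = rgb_frame[..., 1].astype(np.float32)
    b = rgb_frame[..., 2].astype(np.float32)
    gray = 0.299 * r + 0.587 * g + 0.114 * b   # Eq. (9)
    gray = cv2.resize(gray, (84, 110))         # resize to 110x84 (cv2 takes (width, height))
    gray = gray[13:97, :]                      # crop to 84x84; this offset is an illustrative choice
    return gray / 255.0                        # normalize pixel values


class FrameStack:
    """Stack the previous k preprocessed frames as the agent's state."""

    def __init__(self, k: int = 4):            # k = 4 is an assumption, not taken from the paper
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        frame = preprocess_frame(first_frame)
        for _ in range(self.k):
            self.frames.append(frame)
        return np.stack(self.frames, axis=0)    # shape (k, 84, 84)

    def step(self, new_frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess_frame(new_frame))
        return np.stack(self.frames, axis=0)
```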
In addition to the previously mentioned preprocessing, Gym does further processing of the original environment after encapsulating the ALE, namely frame skipping. In general, at each moment the environment sends an action emitted by the agent to the emulator and feeds the result of that behavior back to the agent. Frame skipping is a little different: instead of feeding the current picture back to the agent immediately after completing an action, it ignores n frames of pictures in a row, and since the simulation program still works in the mode of one action per frame, the way to ignore them is to repeatedly send to the simulator the action entered by the agent at the most recent point in time. With the introduction of frame skipping, the agent is able to perform the selection of n actions sequentially in a single computation, which reduces the computation time and greatly improves the learning efficiency of the agent. By default, Gym automatically skips between 2 and 4 pictures after each action is selected.

Finally, a clear definition of the agent-environment interaction process is needed to prevent the interaction from entering a long cycle, which would result in slower learning. For instance, in the game of Breakout, if there are only a few bricks left on the board, the agent is likely to remain in a fixed position due to the decreasing probability of hitting and the uncertainty of the ball's trajectory. This can cause the agent to maintain a fixed trajectory for an extended period of time. The method is also prone to local extremes. This project therefore suggests adding a constraint number N to the agent-environment interaction in order to improve network performance. In other words, if the agent is still in the game after trying N operation options, the game is forcibly ended and an end signal is returned. Deep neural networks can learn the features and optimal strategies of the GE to enable prediction and decision-making for the GE. This approach has been very successful in RL tasks and game AI; for example, it performs well in complex environments such as Atari games.
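The frame-skipping and forced-termination behavior described above can be sketched as a thin environment wrapper. The sketch below assumes a Gym-style `env` with `reset`/`step`; the skip count, the value of the constraint N, and the accumulation of reward over skipped frames are assumptions about typical practice rather than details given in the paper.

```python
class SkipAndLimitWrapper:
    """Repeat each selected action over `skip` frames and force an episode
    end after `max_steps` interactions, as described in the text."""

    def __init__(self, env, skip: int = 4, max_steps: int = 10_000):
        self.env = env
        self.skip = skip            # Gym's ALE wrapper skips 2-4 frames by default
        self.max_steps = max_steps  # the constraint number N; the value here is illustrative
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.env.reset()

    def step(self, action):
        total_reward, done, obs, info = 0.0, False, None, {}
        for _ in range(self.skip):        # resend the most recent action for each skipped frame
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.steps += 1
        if self.steps >= self.max_steps:  # forcibly end the game and return an end signal
            done = True
        return obs, total_reward, done, info
```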
3.2. Implementation of parallelization framework based on Actor-Critic algorithm

After preprocessing, the game screen meets the needs of 2D convolution. Taking into account the intricacy of the game screen elements, the study built a convolutional neural network with four convolutional layers and three pooling layers connected to them, as illustrated in Fig. 4.

On this basis, the study proposes a method based on convolutional window shifting: the shift step of the convolutional window is set to 1 and its boundary is padded with zeros, so that the feature maps of each convolutional layer have the same dimensions as the input features. Eq. (10) illustrates the link between the convolutional layers' input and output scales.

output shape = input shape / step size   (10)
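A sketch of such a network is shown below in PyTorch, assuming 84×84 inputs stacked over k frames. The channel widths and kernel sizes are illustrative assumptions (the paper's Fig. 4 is not reproduced here); the point of the sketch is the structure of four stride-1, zero-padded convolutions, which preserve the spatial size in line with Eq. (10), with three pooling layers doing the downsampling.

```python
import torch
import torch.nn as nn


class GameScreenCNN(nn.Module):
    """Four stride-1, zero-padded convolutions interleaved with three pooling
    layers; channel widths are illustrative, not taken from the paper."""

    def __init__(self, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=3, stride=1, padding=1),  # 84x84 -> 84x84
            nn.ReLU(),
            nn.MaxPool2d(2),                                               # 84x84 -> 42x42
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),         # 42x42 -> 42x42
            nn.ReLU(),
            nn.MaxPool2d(2),                                               # 42x42 -> 21x21
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),         # 21x21 -> 21x21
            nn.ReLU(),
            nn.MaxPool2d(2),                                               # 21x21 -> 10x10
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),         # 10x10 -> 10x10
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)


# Example: a batch of stacked-frame states keeps its spatial size through every convolution.
if __name__ == "__main__":
    out = GameScreenCNN()(torch.zeros(8, 4, 84, 84))
    print(out.shape)  # torch.Size([8, 64, 10, 10])
```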
The role of the activation function in a neural network is to make the network nonlinear [23]. A neural network can only represent linear mappings when its only operations are linear convolutions and fully connected layers; fitting nonlinear inputs in real scenarios then becomes challenging even with increased network depth. In order to give the network a hierarchical, nonlinear mapping capability, the activation function is introduced. Activation functions are therefore essential components of deep neural networks. ReLU, a piecewise linear function of the form described in Eq. (11), is used in the study as the activation function of the hidden-layer neurons.
g(z) = ReLU(z) = { z, z ≥ 0; 0, z < 0 }   (11)

ReLUs are very similar to linear units and hence are easy to optimize. Almost all optimization algorithms for neural networks are based on gradient optimization. The most typical of these is Stochastic Gradient Descent (SGD). SGD is an extension of the gradient descent algorithm that treats the gradient as an expectation and approximates that estimate from a small sample. SGD updates the parameters as shown in Eq. (12).

∇_θ J(θ) = (1/m) · Σ_{i=1}^{m} ∇_θ L(f(x^(i); θ), y^(i)),   θ = θ − ε_k · g   (12)

In Eq. (12), m is the number of samples, L is the loss function of a single sample, g is the estimated gradient, and ε_k is the learning rate of the kth iteration. That is, in each step a small number of samples is drawn from the training set, and the parameters of the network are gradually updated using this randomly selected small sample, which greatly accelerates the training speed. However, the learning rate of SGD cannot change over time, so it must be adjusted manually during the iterations. Therefore, this algorithm is not used directly for deep neural network training. The study utilizes the RMSProp algorithm to update the parameters of the neural network. The RMSProp algorithm is an improvement on AdaGrad that better handles nonconvex problems by accumulating the squared gradients as an exponentially weighted moving average rather than as a simple sum. AdaGrad has strong convergence for convex functions, but for nonconvex problems its learning trajectory traverses multiple structures before finally arriving at a locally convex region. Because AdaGrad scales down the learning rate according to the entire history of squared gradients, its learning rate may already be very low before this convex structure is reached. The RMSProp algorithm introduces a new hyperparameter, ρ, which discards distant historical gradients through exponentially decayed averaging, allowing the method to converge quickly once a locally convex region is reached, much like an instance of AdaGrad initialized within that region. In practice, RMSProp is an efficient and practical deep-network optimization algorithm and is commonly used in the current DL field.
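As a reference for the two optimizers discussed above, a minimal NumPy sketch of one parameter update under SGD and under RMSProp is given below; the hyperparameter values are illustrative defaults, not the settings used in the paper.

```python
import numpy as np


def sgd_step(theta: np.ndarray, grad: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Plain SGD: theta <- theta - lr * g, as in Eq. (12); lr must be tuned by hand."""
    return theta - lr * grad


def rmsprop_step(theta: np.ndarray, grad: np.ndarray, state: dict,
                 lr: float = 1e-3, rho: float = 0.9, eps: float = 1e-8) -> np.ndarray:
    """RMSProp: keep an exponentially decaying average of squared gradients (decay rho)
    and scale the step by its square root, so the effective learning rate adapts."""
    state["sq_avg"] = rho * state.get("sq_avg", np.zeros_like(grad)) + (1.0 - rho) * grad ** 2
    return theta - lr * grad / (np.sqrt(state["sq_avg"]) + eps)


# One illustrative update on a toy quadratic loss L(theta) = 0.5 * ||theta||^2 (gradient = theta).
theta = np.array([1.0, -2.0])
theta = sgd_step(theta, grad=theta)
theta = rmsprop_step(theta, grad=theta, state={})
```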
Although the improved ACA can reduce the estimation error of the return value to some extent, like other RL methods it fails to effectively solve the correlation problem among the training samples. The research therefore presents a method that makes the training data only weakly correlated. The experience replay technique involves random sampling of past experiences to obtain weakly correlated training samples. It is based on a single agent, since serialized training data generated by one agent must be strongly correlated. However, in the case of multiple agents, the correlation between the microbatch data generated by each agent is poor because the agents are independent of each other. In this way, the microbatch data generated by multiple agents can be combined to obtain a training batch that is approximately randomly sampled. On this basis, the study proposes a parallelization framework for the ACA, as shown in Fig. 5.

Fig. 5. Parallelization framework.

The framework is divided into three main parts: Worker, Master and Model. The Worker is a process that encapsulates a standard RL model that interacts with the environment through an agent. However, the behavior of this agent is different from that of an agent in a typical reinforcement learning algorithm. The method inputs behaviors into the system through the selection of learned strategies, and then determines whether the system is reset through termination markers fed back from the external environment. However, the agent is not rewarded by feedback from the surroundings. In other words, all this agent has to do is select a behavior and input it into the environment for execution, without computing or training anything. On this basis, a very small pool of migrated experiences is added to the Worker to temporarily save the state-transition relations generated by the agent while interacting with the environment, up to k + 1 of them.
Therefore, this migration experience pool requires much less memory than the experience replay method. After the agent has finished interacting with the environment, the Worker transmits the rewards and states it has obtained to the Master through the pipeline. The specific workflow of the Worker is shown in Fig. 6.

The Master is also a process, which encapsulates three modules. First, the Logic Control is the most important part, which logically controls the entire process. It communicates with all the worker processes through a single pipeline and processes the reports and status information from each worker. This process can be divided into three stages. In the first scenario, if the transmitted experience is empty, meaning that the environment is still in its initial state, then the rewards at this point in time are meaningless. Therefore, it is sufficient to add this state to a predefined queue, and the Prediction Queue will then output the corresponding strategy. The second scenario is that if the experience pool is not empty, the new experience value is added to the current experience. When an event occurs in a given time interval, the first step is to iterate through the multi-step TD algorithm for each event to get the corresponding microbatch samples; the samples are then fed into the Training Queue, where they are trained by the Trainer. Next, the state is added to the Prediction Queue and the strategy request is sent to the Prediction Queue. It should be noted that here the maximum number of interactions is not the same as the maximum number of steps per iteration, which is K + 1. The goal is to take the state value obtained by the agent's (K + 1)th strategy request as the estimated value of the return for the next step, in order to circumvent needless network operations and make the program easy to run. In the third scenario, if the pool of transmitted experience is not empty but the episode is not currently finished and the number of agents has not yet reached the upper limit, then the reward is added to the current transmitted experience, and the same steps as in the first scenario are followed. The specific workflow of the Master is shown in Fig. 7.
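The Worker-Master interaction described above can be sketched with standard multiprocessing queues. The sketch below is a simplified, assumed rendering of the framework in Figs. 5-7: the names `prediction_queue` and `training_queue` follow the text, while `make_env`, `model.act` and `model.train_on` are placeholders, and the exact message formats and control logic of the paper's implementation are not reproduced.

```python
import multiprocessing as mp
import time


def worker(worker_id, prediction_queue, action_pipe, training_queue, make_env):
    """Worker: interact with the game, request actions from the Master, and
    send back small batches of (state, action, reward, done) transitions."""
    env = make_env()
    state, experiences = env.reset(), []
    while True:
        prediction_queue.put((worker_id, state))   # ask the Master for a policy decision
        action = action_pipe.recv()                # the Worker itself does no computation or training
        next_state, reward, done, _ = env.step(action)
        experiences.append((state, action, reward, done))
        if done or len(experiences) >= 5:          # tiny migrated-experience pool (size is illustrative)
            training_queue.put(experiences)
            experiences = []
        state = env.reset() if done else next_state


def master(prediction_queue, action_pipes, training_queue, model):
    """Master (Logic Control): serve policy requests from the shared model and
    hand completed experience batches to the Trainer."""
    while True:
        while not prediction_queue.empty():
            worker_id, state = prediction_queue.get()
            action_pipes[worker_id].send(model.act(state))   # Prediction Queue -> policy output
        while not training_queue.empty():
            model.train_on(training_queue.get())             # Training Queue -> Trainer
        time.sleep(0.001)                                    # avoid a tight busy loop
```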
Although the parallelism of the ACA greatly improves the convergence and stability of the network, there are still some issues that cause instability in the learning process. During the execution of the program, the agents tend to generate more data than the Trainer consumes, which results in the expiration of training samples. The so-called "expiration" refers to the network parameters: for example, during the (t − k)th iteration of the network, an agent chooses an action A_{t−k} from the predicted policies in state S_{t−k}, and then generates a training sample through the control module and adds it to the Training Queue. Since Trainer training is slow, that training sample may not actually be used until the tth iteration. Thus, when this data starts to be used in training, its corresponding PG is calculated as shown in Eq. (13).

∇_{θ_t} J(θ_t) = ∇_{θ_t} [δ · log π(A_{t−k} | S_{t−k}; θ_t) + β · H(π(S_{t−k}; θ_t))]   (13)

Again, because the parameters of the network are changing all the time during the iteration process, the parameters of the network at the (t − k)th iteration are different from the parameters at the tth iteration. The strategy function used in calculating the gradient values is therefore different from the one used in generating the migration experience. The instability of the network training is caused by the use of stale data, and is largely due to the logarithmic nature of the policy function. When the migration experience is generated, the probability values are usually large because the agents' behaviors are selected according to a probability distribution. However, after k iterations the probability distribution of the behaviors changes, making these values small or even close to zero; the magnitude of the logarithm of such a value approaches infinity, leading to failure of the optimization. To avoid this phenomenon, training samples with large policy delay need to be filtered out. Therefore, the study makes some modifications to the policy function used for calculating the gradient, and the modified PG is shown in Eq. (14).

∇_{θ_t} J(θ_t) = ∇_{θ_t} [δ · log(max{ρ, ε}) + β · ρ · log(max{ρ, ε})]   (14)

In Eq. (14), ρ = π(A_{t−k} | S_{t−k}; θ_t) and H(ρ) = −ρ · log ρ. When ρ < ε, the loss function of the strategy network is constant and does not depend on the network parameters, so its corresponding PG is 0. Based on this, the study applies a policy-delay filter: the gradient of such a sample with respect to the parameters is also 0, so the sample is effectively screened out. While the improved ACA can reduce return value estimation errors to some extent, like other RL methods it does not address the issue of correlation between training samples. In RL, training data correlation refers to the correlation between consecutive state-behavior sequences. Traditional RL algorithms often assume that states are independent and identically distributed. However, in practice this assumption is often invalid, and dealing with the correlation of training data is a crucial issue. One approach to address this problem is to introduce a temporal dependency model that can capture time dependencies between states and make better use of training data. A method called 'delayed reward' can be used to achieve this. The basic idea of this method is that at each time step t, not only do the current state s_t and behavior a_t receive an immediate reward r_t, but a delayed reward r_{t+1} is also obtained based on the state s_{t+1} and behavior a_{t+1}. In this way, the model is able to learn the relationship between behavior and future rewards, thereby making better use of the training data.
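A minimal sketch of a loss corresponding to Eq. (14) is given below in PyTorch. Here `probs` are the current probabilities π(A_{t−k} | S_{t−k}; θ_t) of the stored actions, `delta` is the advantage estimate, and `beta` and the threshold `eps` are assumed inputs; clamping the probability at ε is the mechanism by which samples with a large policy delay contribute zero gradient.

```python
import torch


def delayed_policy_loss(probs: torch.Tensor, delta: torch.Tensor,
                        beta: float = 0.01, eps: float = 1e-3) -> torch.Tensor:
    """Sketch of the filtered policy-gradient loss in the spirit of Eq. (14).

    probs: pi(A_{t-k} | S_{t-k}; theta_t) for each sample in the batch.
    delta: advantage estimates, treated as constants with respect to theta_t.
    """
    log_p = torch.log(torch.clamp(probs, min=eps))   # log(max{rho, eps})
    # For probs < eps the clamp is constant, so these samples contribute zero
    # gradient and stale (policy-delayed) data is effectively screened out.
    loss = -(delta.detach() * log_p + beta * probs.detach() * log_p)
    return loss.mean()
```

Minimizing this loss ascends the gradient of Eq. (14) only for samples whose probability is still above ε.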
4. Research on MAP-based DRL algorithm and its application in GE

The study mainly analyzes the feasibility and stability of this DRL algorithm. The article conducts comparative experiments on parameterized configurations that may affect the algorithm's performance, as analyzed in the previous section. It compares the algorithm with the existing Deep Q-learning (DQL) algorithm based on experience replay and with the training process of the GPU-based Asynchronous Advantage Actor-Critic (GA3C) algorithm, which is also based on MAP.
4.1. Performance analysis of MAP-based DRL algorithm

… performance. The average accuracies of the four algorithms were compared, and the results are shown in Fig. 10. In Fig. 10, the average accuracy of method 2 is 83 %, that of method 3 is 90 %, and that of method 4 is 88 %, while that of method 1, i.e., the ACA combined with MAP, is 96 %, which is higher than the average accuracy of the other three methods.
… DQL algorithm after 14 h of training and still has not reached a stable value. It shows that the DRL algorithm is able to learn the control strategy under a complex GE quickly. The study compares the performance of the DRL algorithm with the State-Action-Reward-State-Action algorithm, the Linear Learner algorithm and the Random algorithm in the games Breakout, Pong, Boxing, Qbert and Space Invaders. The algorithms used in this study include the Random algorithm, which employs a randomized strategy to generate actions; the State-Action-Reward-State-Action algorithm, which learns from manually labeled features; the Linear Learner algorithm, which learns by using linear function approximation based on the SARSA algorithm; and the DRL algorithm, which was chosen for comparison and run for 18 h. The final performance of the different methods in the different games is shown in Table 2.

In Table 2, the performance of the DRL algorithm in the five games comprehensively outperforms the traditional RL algorithms based on artificially labeled features, which indicates that in the absence of artificially labeled features and prior knowledge, the DRL algorithm implemented in the study is still able to learn the high-level features in …
5. Discussion

6. Conclusion
… finally stabilized above 0.95, the recall index was also finally above 0.8, and the average fitness was also in a relatively high interval. The DRL algorithm comprehensively outperformed the traditional RL algorithm based on manually labeled features in the five games. The research indicated that the multi-agent parallel DRL algorithm offers significant advantages in GEs. When compared to traditional single-agent RL algorithms, multi-agent parallel training accelerated the learning process and enhanced the accuracy and efficiency of decision-making. The MAP-based DRL algorithm had significant advantages in GEs: it enabled cooperative decision-making among multiple agents and achieved high performance. However, challenges and problems still exist, and current algorithms have limited generalization ability when dealing with complex environments and dynamic changes in games. Future research can focus on improving the model's generalization ability to better adapt to different game scenarios and environmental changes. DRL often produces decisions that lack interpretability, making it difficult for players and developers to understand the decision-making process and logic. Future research could explore techniques for interpreting machine learning to enhance the transparency of the decision-making process. DRL heavily relies on data and environment, which can pose challenges to its robustness and stability. Therefore, future research should focus on improving the algorithm's stability and robustness to ensure it can operate effectively in diverse gaming environments. Future research should also consider how to regulate and adjust the behavior of algorithms to ensure that they meet ethical and regulatory requirements when DRL is used in games.

CRediT authorship contribution statement

Chao Liu: Writing – review & editing, Writing – original draft, Methodology, Formal analysis, Data curation, Conceptualization. Di Liu: Writing – review & editing, Writing – original draft, Visualization, Methodology, Investigation, Formal analysis, Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] W. Chen, X. Qiu, T. Cai, H.N. Dai, Z. Zheng, Y. Zhang, Deep reinforcement learning for internet of things: a comprehensive survey, IEEE Commun. Surv. Tut. 23 (3) (May 2021) 1659–1692, https://doi.org/10.1109/COMST.2021.3073036.
[2] A. Feriani, E. Hossain, Single and multi-agent deep reinforcement learning for AI-enabled wireless networks: a tutorial, IEEE Commun. Surv. Tut. 23 (2) (September 2021) 1226–1252, https://doi.org/10.1109/COMST.2021.3063822.
[3] Z. Zhu, K. Lin, A.K. Jain, J. Zhou, Transfer learning in deep reinforcement learning: a survey, PAMI 45 (11) (Nov 2023) 13344–13362, https://doi.org/10.1109/TPAMI.2023.3292075.
[4] B.R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A.A. Sallab, S. Yogamani, P. Pérez, Deep reinforcement learning for autonomous driving: a survey, T-ITS 23 (6) (June 2022) 4909–4926, https://doi.org/10.1109/TITS.2021.3054625.
[5] J. Hu, H. Niu, J. Carrasco, B. Lennox, F. Arvin, Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning, TVT 69 (12) (Dec 2020) 14413–14423, https://doi.org/10.1109/TVT.2020.3034800.
[6] A. Haydari, Y. Yılmaz, Deep reinforcement learning for intelligent transportation systems: a survey, T-ITS 23 (1) (Jan 2022) 11–32, https://doi.org/10.1109/TITS.2020.3008612.
[7] V.A. Vargas-Perez, P. Mesejo, M. Chica, O. Cordon, Deep reinforcement learning in agent-based simulations for optimal media planning, Inform. Fusion 15 (7) (March 2023) 644–664, https://doi.org/10.1016/j.inffus.2022.10.029.
[8] Y.T. Quek, W.A. Tso, W.L. Woo, N.T. Koh, L.L. Koh, Deep Q-network implementation for simulated autonomous vehicle control, IET Intell. Transp. Sy. 15 (7) (May 2021) 875–885, https://doi.org/10.1049/itr2.12067.
[9] Y. Yang, X. Li, H. Li, R. Yuan, Deep Q-network for optimal decision for top-coal caving, Energies 13 (7) (April 2020) 1618–1697, https://doi.org/10.3390/en13071618.
[10] J. Yang, G. Liang, B. Li, G. Wen, T. Aao, A deep-learning- and reinforcement-learning-based system for encrypted network malicious traffic detection, Electron. Lett. 57 (9) (February 2021) 23–37, https://doi.org/10.1049/ell2.12125.
[11] H. Wang, N. Liu, Y. Zhang, D.W. Feng, F. Huang, D.S. Li, Y.M. Zhang, Deep reinforcement learning: a survey, Front. Inform. Tech. El. 21 (12) (October 2021) 1726–1744, https://doi.org/10.1631/FITEE.1900533.
[12] X. Zhu, Y. Luo, A. Liu, N.N. Xiong, M. Dong, S. Zhang, A deep reinforcement learning-based resource management game in vehicular edge computing, T-ITS 23 (3) (March 2022) 2422–2433, https://doi.org/10.1109/TITS.2021.3114295.
[13] Y. Li, Y. Fang, Z. Akhtar, Accelerating deep reinforcement learning model for game strategy, NC 408 (9) (September 2020) 157–168, https://doi.org/10.1016/j.neucom.2019.06.110.
[14] C. Li, X. Wei, Y. Zhao, X. Geng, An effective maximum entropy exploration approach for deceptive game in reinforcement learning, NC 403 (25) (August 2020) 98–108, https://doi.org/10.1016/j.neucom.2020.04.068.
[15] S. Liu, J. Cao, Y. Wang, W. Chen, Y. Liu, Self-play reinforcement learning with comprehensive critic in computer games, NC 449 (18) (August 2021) 207–213, https://doi.org/10.1016/j.neucom.2021.04.006.
[16] G. Rani, U. Pandey, A.A. Wagde, V.S. Dhaka, A deep reinforcement learning technique for bug detection in video games, IJIT 15 (1) (August 2023) 355–367, https://doi.org/10.1007/s41870-022-01047-z.
[17] S. Rahi, Research design and methods: a systematic review of research paradigms, sampling issues and instruments development, Int. J. Econ. Manag. Sci. 6 (2) (June 2017) 1–5, https://doi.org/10.4172/2162-6359.1000403.
[18] K. Zhang, J. Cao, Y. Zhang, Adaptive digital twin and multiagent deep reinforcement learning for vehicular edge computing and networks, IEEE T Ind. Inform. 18 (2) (Feb 2022) 1405–1413, https://doi.org/10.1109/TII.2021.3088407.
[19] X. Tang, J. Chen, T. Liu, Y. Qin, D. Cao, Distributed deep reinforcement learning-based energy and emission management strategy for hybrid electric vehicles, IEEE T Veh. Technol. 70 (10) (Oct 2021) 9922–9934, https://doi.org/10.1109/TVT.2021.3107734.
[20] C. Huang, R. Mo, C. Yuen, Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning, IEEE J. Sel. Area Comm. 38 (8) (Aug 2020) 1839–1850, https://doi.org/10.1109/JSAC.2020.3000835.
[21] W.J. Yun, S. Park, J. Kim, M.J. Shin, S. Jung, D.A. Mohaisen, Cooperative multiagent deep reinforcement learning for reliable surveillance via autonomous multi-UAV control, IEEE T Ind. Inform. 18 (10) (Oct 2022) 7086–7096, https://doi.org/10.1109/TII.2022.3143175.
[22] H. Yang, Z. Xiong, J. Zhao, D. Niyato, L. Xiao, Q. Wu, Deep reinforcement learning-based intelligent reflecting surface for secure wireless communications, IEEE TWC 20 (1) (2021) 375–388, https://doi.org/10.1109/TWC.2020.3024860.
[23] E. Nsugbe, Toward a self-supervised architecture for semen quality prediction using environmental and lifestyle factors, AI&A 1 (1) (October 2023) 35–42, https://doi.org/10.47852/bonviewAIA2202303.