
RESEARCH ARTICLE

www.advintellsyst.com

Deep Reinforcement Learning-Based Air Defense Decision-Making Using Potential Games

Minrui Zhao,* Gang Wang, Qiang Fu,* Xiangke Guo, and Tengda Li

This study addresses the challenge of intelligent decision-making for command-and-control systems in air defense combat operations. Current autonomous decision-making systems suffer from limited rationality and insufficient intelligence during operation. Recent studies have proposed methods based on deep reinforcement learning (DRL) to address these issues; however, DRL methods typically face weak interpretability, a lack of convergence guarantees, and high computing-power requirements. To address these issues, a novel technique for large-scale air defense decision-making that combines DRL with game theory is discussed. The proposed method transforms the target assignment problem into a potential game, which provides theoretical guarantees of a Nash equilibrium (NE) from a distributed perspective. The air defense decision problem is decomposed into separate target selection and target assignment problems. A DRL method is used to solve the target selection problem, while the target assignment problem is translated into a target assignment optimization game. This game is proven to be an exact potential game with theoretical convergence guarantees for an NE. The proposed decision-making method is simulated in a digital battlefield environment, and its effectiveness is demonstrated.

M. Zhao, G. Wang, Q. Fu, X. Guo, T. Li
College of Air and Missile Defense, Air Force Engineering University
No. 1 Changle East Road, Xi'an, Shaanxi 710051, China
E-mail: [email protected]; [email protected]

The ORCID identification number(s) for the author(s) of this article can be found under https://doi.org/10.1002/aisy.202300151.

© 2023 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

DOI: 10.1002/aisy.202300151
Adv. Intell. Syst. 2023, 5, 2300151

1. Introduction

Modern warfare is heavily reliant on advanced weaponry, a decisive factor in determining the victor of a conflict.[1] Decision-making is the cornerstone of combat operations, necessitating the judicious utilization of multitype and multiplatform weaponry, based on intelligence collected about the adversary, the rational assignment of incoming targets, and the successful completion of countermeasures.[2]

The battlefield dynamics faced by air defense operations have been further complicated and rendered more unpredictable by the emergence of new random air-attack weapons, such as unmanned cluster weaponry.[3] Such unpredictable weaponry poses a new challenge to the operational decision systems configured to defend against air attacks. Figure 1 outlines an air defense operational process. Efficient decision-making in air defense operations can increase operational effectiveness by more than three times compared to unstructured decision-making.[4] The rational scheduling of combat resources to enhance interception efficacy is therefore a pressing concern for operational decision-making systems.

Figure 1. Intelligent decision-making system of air defense: a schematic diagram.

Recently, deep reinforcement learning (DRL) has become widely adopted in control,[5] scheduling,[3b,6] and decision-making systems[7] and has emerged as a notable asset in the field of intelligent decision-making. Yang et al.[8] proposed a real-time scheduling algorithm for unmanned aerial vehicle (UAV) clusters, driven by reinforcement learning (RL), to address the channel assignment problem in UAV cluster scheduling. Song et al.[6] developed an enhanced multiobjective reinforcement learning algorithm configured to efficiently balance multiple objectives; it was applied to the problem of offloading an application consisting of dependent tasks in multiaccess edge computing. Sun et al.[9] developed a multiagent hierarchical policy gradient algorithm to enable autonomous decision-making for air-combat confrontations, discovering that the agent has tactical evolutionary abilities. Li et al.[10] proposed a generative adversarial deep reinforcement learning scheduling algorithm that leveraged expert knowledge to direct the DRL training process, resulting in high-quality scheduling of workloads and optimization objectives. Ren et al.[7b] employed a reinforcement learning framework based on the multiagent deep deterministic policy gradient (DDPG) to enable autonomous decision-making in air warfare and allow for collaborative operations. Hu et al.[11] incorporated a dynamic quality replay method into conventional DRL, decreasing reliance on expert knowledge and augmenting the efficiency of the new algorithm; this method enabled autonomous cooperative operations involving multiple UAVs. Zhang et al.[12] designed an air combat maneuver decision model, using the actor–critic architecture, to achieve cooperative air combat decision-making with multiple UAVs. Liu et al.[13] developed a reinforcement learning approach, using one general agent with multiple narrow agents, to overcome the slow convergence of traditional task assignment methods and enable intelligent task assignment for air defense combat.

However, DRL methods suffer from weak interpretability, a lack of convergence guarantees, and high computing-power requirements in intelligent decision-making applications.[14] To solve these problems, methods combining game theory and reinforcement learning have emerged in recent years.[15] Yarahmadi et al.[16] modeled the multiagent credit assignment problem (MPC) in multiagent reinforcement learning as a bankruptcy game and then solved the MPC problem by selecting the best assignment result through an evolutionary game; the results showed that the proposed method performs well in terms of group learning rate, confidence, expertness, certainty, and correctness metrics. Zhang et al.[17] proposed a reciprocal reputation mechanism, combined with a double deep Q-network algorithm, to suppress attack motivation in vehicular ad hoc networks and reduce selfish node attacks. Xu et al.[18] formulated the task offloading problem in vehicular edge computing as an exact potential game, achieving task offloading resource optimization by solving the Nash equilibrium (NE) using multiagent reinforcement learning. Duan et al.[19] combined game theory and reinforcement learning to enable autonomous vehicle driving, proposing a multiagent negotiation model based on an incomplete information game to facilitate automatic negotiation among agents. Zhu et al.[20] combined the minimax Q algorithm with game theory to solve the NE of two-player zero-sum games; subsequent simulation results showed that the proposed method was applicable to both symmetric and nonsymmetric games. Cao et al.[15b] proposed a game theory-based inverse learning framework to obtain the parameters of both the dynamic system and the individual costs of multistage games, by solving affine-quadratic games and calculating the gradient of the game system parameters. Peng et al.[21] formulated the distributed control problem as a differential graphical game and combined it with DRL to achieve bounded control. Albaba et al.[15a] combined hierarchical game theory with DRL for simultaneous decision-making in multiagent traffic scenarios, experimentally demonstrating that the proposed method modeled more traffic situations and effectively reduced the collision rate compared to conventional methods. Gao et al.[22] proposed a passivity-based methodology for multiagent finite games in reinforcement learning dynamics, as well as algorithms for analysis and design; results indicated improved convergence.

Previous studies have applied DRL to decision-making; however, limited research has explored the integration of potential games into intelligent decision-making systems for air defense combat. To address this research gap, and to improve the interpretability and convergence of DRL when used in decision-making, potential game theory is introduced into the DRL framework. Furthermore, proof that the target assignment (TA) game is an exact potential game is provided, the primary contribution of this article.

In this article, an intelligent decision system for air defense operations is proposed based on potential games and DRL,
leveraging the advantages of both DRL and game theory to attain an effective, high-quality decision system for air defense operations. Potential games are a branch of game theory with desirable properties that guarantee the existence of NE solutions. Specifically, potential game theory is utilized to construct a multitarget, multi-interception-unit resource assignment model for efficient TA, as well as to obtain an NE solution in the TA phase. The primary contributions of this article are as follows.

First, an air defense operational decision-making framework is proposed that combines DRL and potential games for target selection and radar control. A type-different-based reward mechanism is employed to incentivize efficient decision-making and TA.

Second, a DRL-based target selection and radar control module is proposed to facilitate intelligent decision-making for air defense operations. The air defense combat problem is modeled as a Markov decision process (MDP); the state space, action space, and reward are formulated from the characteristics of the problem. To facilitate the learning of better strategies for this MDP, a bidirectional gate recurrent unit (BiGRU)-based feature extraction method for the air situation and a multihead attention mechanism are proposed, both tailored to the specific characteristics of the air defense combat problem.

Finally, a potential game-based TA (PG-TA) module is proposed that enables agents to coordinate their behaviors through interactions and reach a cooperative equilibrium. To optimize the air defense combat TA, a TA optimization game is formulated from a distributed perspective, and it is proved that this game is an exact potential game. Hence, an NE solution can be obtained by applying the best response method to solve the game while avoiding incorrect, duplicate, and omitted assignments, thereby increasing decision effectiveness.

The remainder of this article is organized as follows. Section 2 provides an overview of reinforcement learning and potential games. Section 3 describes the air defense combat problem and the methods used to solve it. Section 4 outlines the experimental setup, states the results of the experiments, and evaluates them. Section 5 concludes and discusses potential future research directions.

2. Related Works

2.1. Deep Reinforcement Learning

The reinforcement learning process involves an intelligent agent interacting with the external environment in order to maximize the cumulative external reward.[23] The problem is typically modeled as a Markov reward process, represented by the tuple M = <S, A, T, R, γ>, where S = (s_1, s_2, ..., s_n) is the set of states, A = (a_1, a_2, ..., a_m) is the set of actions, T is the transition probability matrix, R is the reward function, and γ ∈ [0, 1] is the discount factor.

At each time step t of the reinforcement learning process, the agent is in state s_t, observes the environment, performs an action a_t, receives environmental feedback R_t, and transitions to the next state s_{t+1} at time step t + 1. In an MDP, a state is called a Markov state when it satisfies the following condition.

P[s_{t+1} | s_t] = P[s_{t+1} | s_1, ..., s_t]   (1)

That is, given the current state, the next state is independent of all earlier states. The state transition probability matrix describes the probability of transitioning from the current state s to the subsequent state s′.

P_{ss′} = P[s_{t+1} = s′ | s_t = s]   (2)

The reward function represents the expected reward the agent receives for taking action a in state s

r_s^a = E[r_{t+1} | s_t = s, a_t = a]   (3)

where E is the expectation operator.

The return R_t, also known as the cumulative reward, is the discounted sum of all rewards from the current step to the end of the episode

R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}   (4)

The state value function V^π(s) is the expected cumulative reward when in state s, given a policy π

V^π(s) = E[R_t | s_t = s, π]   (5)

π(a|s) = P[a_t = a | s_t = s]   (6)

The action–state value function Q^π(s, a) is the expected cumulative reward of taking action a in state s and thereafter following policy π

Q^π(s, a) = E[R_t | s_t = s, a_t = a, π]   (7)

Neural networks are typically utilized to estimate the action–state value function Q^π(s, a, θ) and the policy π when the state space is large or continuous, where θ is a parameter of the neural network. This idea forms the foundation of the policy gradient approach: computing the gradient with respect to the parameter θ allows the expected reward under the policy to be improved. The objective function is defined as

J(θ) = E_{π_θ}[π_θ(s, a) Q^{π_θ}(s, a)]   (8)

The gradient is computed as follows

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(s, a) Q^{π_θ}(s, a)]   (9)

The two main categories of reinforcement learning techniques are policy-based and value-based methods. The advantages of these two approaches are combined in the actor–critic approach, in which the actor module employs a policy-based approach and the critic module employs a value-based approach. The asynchronous advantage actor–critic (A3C) technique[24] is a popular implementation of the general actor–critic method.

Based on the advantage actor–critic (A2C) method,[25] the A3C algorithm employs asynchronous parallelism. Typically, there are m worker nodes and one server node. The server node holds the most recent policy network π_θ(s_t, a_t) and value network Q_w(s, a) or V(s_t; θ_w), where θ denotes the policy network parameter and w denotes the value network parameter. Based on the gradient data transmitted by the worker nodes, the server node updates these parameters. Each worker node has a copy π_{θ′}(s_t, a_t) of the policy network and a copy Q_{w′}(s, a) or V(s_t; θ_{w′}) of the server node's value network, where θ′ and w′ represent the worker node's policy network and value network parameters, respectively. The gradient information is sent to the server node at regular intervals to update the server node's parameters; the latest parameters are then requested from the server node, ensuring that the conditions θ′ = θ and w′ = w are met. Generally, the parameters of the policy network and the value network are shared in the nonoutput layers.

The updated objective gradient is defined as

∇_{θ′} J(θ′) = Â(s_t, a_t; θ, θ_w) · ∇_{θ′} log π(a_t | s_t; θ′)   (10)

where Â represents an estimate of the advantage function

Â(s_t, a_t; θ, θ_w) = r_t + γ V(s_{t+1}; θ_w) − V(s_t; θ_w)   (11)

Â indicates whether it is advantageous or disadvantageous to perform action a_t.

2.2. Potential Game

The player, the strategy, and the utility function are the three components used to define a game in game theory.[26] A game is denoted by Γ = <N, S, {U_i}_{i∈N}>, where each element i is a player and N = {1, 2, ..., n} is the set of players. S = S_1 × S_2 × ··· × S_N represents the strategy space of the game Γ, S_i is the strategy space of the i-th player, and s_i ∈ S_i is a strategy of the i-th player. Let A represent a subset of N, −A represent the complement of A, and Y_A represent the Cartesian product ∏_{i∈A} Y_i. Y_A is shortened to Y_i if A has only one element. Thus, a strategy combination y = (y_1, y_2, ..., y_n) can be rearranged as y = (y_i, y_{−i}), with y_i ∈ Y_i, y_{−i} ∈ Y_{−i}, y ∈ Y. The utility function of the i-th player maps each combination of strategies to a real number; this mapping is denoted by U_i: Y → R.

Definition 1 (Pure Strategy NE): A strategy combination y* = (y_i*, y_{−i}*) is said to be a pure strategy NE if, for ∀i ∈ N and ∀y_i ∈ Y_i, it satisfies Equation (12).

U_i(y_i*, y_{−i}*) ≥ U_i(y_i, y_{−i}*)   (12)

Ordinal potential games, weighted potential games, and exact potential games are the different categories of potential games. Below is a description of the exact potential game used in this study.

Definition 2 (Exact Potential Game): The game Γ = <N, S, {U_i}_{i∈N}> is said to be an exact potential game if there exists a function G: Y → R which, for ∀i ∈ N, ∀y_{−i} ∈ Y_{−i}, ∀x, z ∈ Y_i, satisfies

U_i(x, y_{−i}) − U_i(z, y_{−i}) = G(x, y_{−i}) − G(z, y_{−i})   (13)

Lemma 1: Every finite potential game has a pure strategy NE solution.

3. Problem Formulation and Proposed Method

This section introduces the air defense combat intelligent decision problem and models it as an MDP. The state space and action space of the air defense combat decision problem are proposed, an event-based reward mechanism is developed, and a potential game-based TA model configured to solve the air defense decision problem is proposed.

3.1. Problem Formulation

The air defense combat intelligent decision problem is typically modeled mathematically so as to minimize the probability of missing high-value targets, minimize the losses of defended key locations, maximize the probability of effective kills, and minimize the consumption of interceptor resources.[2b]

First, the air defense combat intelligent decision problem is framed as an MDP and solved using a DRL-based target selection and radar control (DRL-TSRC) module. Neural networks and DRL techniques are used to analyze the air situation and the threat levels of individual targets, assign the air targets to be intercepted and their threat levels, and decide when to turn on the guidance radar of each fire unit. Second, the PG-TA module is used to assign the targets selected by the DRL-TSRC module for interception and to control the fire units during interception. The process is illustrated in Figure 2.

Figure 2. The architecture of the proposed decision-making method. The proposed method is composed of a DRL-TSRC module and a PG-TA module.

For air defense decision problems, several equipment limitations should be considered: 1) The guidance radar cannot be turned off after it is turned on. 2) The guidance radar of a fire unit must be on before the fire unit can launch a missile to intercept a target. 3) A guidance radar can only guide missiles launched by its own fire unit. 4) The blue side cannot detect a fire unit's position until that unit's guidance radar is turned on. 5) The air situation posture of the red side is given by the airborne warning and control system and does not require guidance radar detection. 6) If the guidance radar of a fire unit is destroyed, the fire unit cannot launch or guide missiles. 7) Each fire unit satisfies rational choice theory.

It should be noted that in this article, the term "red side" refers to the defending side, which utilizes ground-to-air missiles as its main weapons, while the term "blue side" refers to the attacking side, which employs UAVs and fighter aircraft as its main weapons.

3.2. Deep Reinforcement Learning-Based Target Selection and Radar Control Module (DRL-TSRC)

An agent takes action a_i to intercept an enemy target in battlefield state s_i. The Markov process model consists of a state set S = [s_1, s_2, ..., s_n] and an action set A = [a_1, a_2, ..., a_n]. The agent selects action a_i from the action set A at state s_i according to the policy π: S × A → [0, 1]. The battlefield environment transitions to the next state according to the state transition
function P: S × A × S → [0, 1]. The goal of the agent is to maximize the cumulative reward function

R_t = E[Σ_{t=0}^{T} γ^t r_t]   (14)

where r_t is the reward received at step t, T is the time horizon, and γ is the discount factor. Equation (5) provides the calculation for the state value function V^π(s), which is the expectation of the cumulative reward R_t, and Equation (7) provides the calculation for the action–state value function Q^π(s, a). Compared to the average reward under strategy π, that is V^π(s), the advantage function A^π(s_t, a_t) = Q^π(s, a) − V^π(s) indicates how good the reward earned by taking action a_t in the current state s_t is.

3.2.1. State and Action Space of the Air Defense Problem

In this section, the air defense combat intelligent decision problem is formulated as an MDP by specifying the states, actions, rewards, and goals in air defense combat. The rewards are related to the type of fire unit used and are specified in Section 3.2.4.

The state space of the air defense combat problem includes the state of the defense sites, the state of the fire units, the state of the detectable enemy targets, and the state of the attackable enemy targets. This state information is maintained by the digital battlefield environment. The state information of a defense site comprises the site number, location, type, and attack status. The state information of a fire unit includes the fire unit tag number, location, number of missiles remaining, usability, the number of targets the unit can attack, and attack status. The state of a reconnoitered enemy target includes the target number, location, type, movement, and under-attack status. The entire air defense combat state space is outlined in Table 1.

Table 1. Air defense operation states definition.

State variable | Meaning
id | Strategic location number
xd | Strategic location position
td | Strategic location type
ad | Strategic location attacked mode
if | Fire unit number
xf | Fire unit position
mf | Fire unit missiles left
uf | Fire unit usability
iαf | Number of targets that can be targeted by the unit
af | Fire unit attacked mode
is | Enemy target number
xs | Enemy target position
ts | Enemy target type
vs | Enemy target velocity
as | Enemy target attacked mode

The action space of the air defense combat problem includes target selection, target threat level, radar selection, and radar action. The entire air defense combat action space is outlined in Table 2.

3.2.2. Air Situation Feature Extraction Based on the BiGRU Network

The BiGRU method is used to analyze air situation information in air defense missions, which are time-series events; it considers the state inputs over a period of time before the current time. The BiGRU network consists of a forward GRU and a reverse GRU, and its structure is shown in Figure 3. The BiGRU method enhances the ability of the model to learn air situation features by analyzing the laws of state information in two directions: past to present and present to past.


Table 2. Air defense operation actions definition.

Action variable | Meaning
T | Target selection
L | Target threat
R | Radar selection
A | Radar action

The GRU is a simplified version of the long short-term memory network that uses an update gate and a reset gate instead of an input gate and a forget gate. The update gate determines how much historical information is retained, and the reset gate determines how the historical information is combined with the current input. The main quantities of the GRU are calculated as follows

z_t = σ(w_z [h_{t−1}, x_t] + b_z)
r_t = σ(w_r [h_{t−1}, x_t] + b_r)
h̃_t = tanh(w_h [r_t ⊙ h_{t−1}, x_t] + b_h)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t   (15)

where x_t is the input at time t, h_t is the output at time t, r_t is the reset gate, z_t is the update gate, h̃_t is the candidate state generated according to the reset gate, σ is the sigmoid activation function, tanh is the hyperbolic tangent activation function, and w and b are the weight and bias terms, respectively.

The forward unit in the BiGRU network analyzes the state sequence in the forward direction (from past to present), and the reverse unit analyzes the state sequence in the reverse direction. The main process is computed as follows

h_t^→ = f(x_t, h_{t−1}^→)
h_t^← = f(x_t, h_{t+1}^←)
h_t = w_{t1} h_t^→ + w_{t2} h_t^← + b_t   (16)

where h_t^→ denotes the forward hidden layer state, h_t^← denotes the reverse hidden layer state, w_{t1} and w_{t2} denote the forward and backward propagation hidden layer output weights, b_t is the bias, and f is the activation function; the sigmoid function is used in this paper.

Figure 3. Diagram of the Bi-GRU structure.

3.2.3. Multiheaded Attention Mechanism

The digital battlefield environment is characterized by a high-dimensional state space and a large amount of information to be processed by the network. The multiheaded attention mechanism enables the neural network to identify and prioritize important information with high attention. The multiheaded attention scheme assembles multiple self-attention layers; the parallel attention layers linearly transform the same input from different angles to obtain information from different subspaces. Figure 4 illustrates this multiheaded attention structure.

Figure 4. Diagram of multihead attention structure.

For single-head self-attention, the processed situation information is transformed into three vectors: the query Q, key K, and value V vectors. The linear transformation is given by the following equation

P = W_P x, P ∈ (Q, K, V)   (17)

where W_P is the linear transformation matrix. The attention is calculated as a scaled dot-product attention, computed as follows

Attention(Q, K, V) = softmax(Q K^T / √d_k) V   (18)

For the parallel stacking of multiple self-attention modules, the multihead attention mechanism is calculated as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)   (19)

where W_i^Q, W_i^K, and W_i^V are the matrices of learnable parameters of the data projections, and h is the number of heads. In this article, h = 4.
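Equations (17)–(19) can be sketched in a few lines of NumPy. The token count, dimensions, and random weights below are illustrative assumptions, not the network sizes used in the paper (apart from h = 4 heads).

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Equation (18)."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multihead(x, Wq, Wk, Wv, Wo):
    """Equation (19): project the input into h subspaces (Equation (17)),
    attend within each head, concatenate, and apply the output projection."""
    heads = [attention(x @ wq, x @ wk, x @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# Illustrative sizes: 6 situation tokens, model dim 8, h = 4 heads of dim 2.
rng = np.random.default_rng(0)
n, d, h, d_k = 6, 8, 4, 2
Wq = [rng.normal(size=(d, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d))
x = rng.normal(size=(n, d))
out = multihead(x, Wq, Wk, Wv, Wo)   # one enriched vector per token, shape (6, 8)
```

Each head attends to the same six tokens through its own learned projection, which is what lets the network weigh the same air situation input from several angles at once.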


3.2.4. Type-Different Reward Mechanism fire units is denoted by LM ¼ fLM1 , : : : , LM0 g, while the set
of short-range fire units is denoted by SM ¼ fSM1 , : : : , SM h g.
Since the firing and exposure costs of long-range fire units are The number of long-range fire units and short-range fire units
higher than those of short-range fire units, it is important to train is s, and k, respectively, as indicated by the number of players,
the agent to use a higher proportion of short-range fire units to where jN m j ¼ s þ k ¼ m is the number of players.
reduce the usage costs. To this end, the type-different-based
reward mechanism (TDRM) was designed, which provides dif-
3.3.2. Strategy Space
ferent reward values for different types of fire units. Depending
on the type of fire unit, different rewards are given for radar
The strategy space is defined as S ¼ S1  S2  : : :  SN , where
detection, missile launch, and target interception. Since the state
Si represents the set of strategies of the i-th player.
space and action space of air defense operations are large, giving
Si ¼ fsi1 , si2 , : : : , sij , : : : , siT g is the strategy of the i-th player.
a reward to the agent after each round of combat is not conducive
to training. Therefore, this article proposes a critical event-based For each element sij , i ∈ N m , j ∈ N t in Si
reward mechanism for different fire units to provide timely feed- 
1, Shoot the j-th target
back after the agent makes a decision, thereby speeding up the sij ¼ (20)
training process. The specific reward design is shown in Table 3. 0, Not Shoot the j-th target or C ij ¼ 0

If the target j-th satisfies the interception condition of player i,


3.3. Potential Game-based Target Assignment Module it is denoted as Cij ¼ 1. Otherwise it is denoted as Cij ¼ 0.
The equipment of each type of fire unit is limited. Long-range
The target to be intercepted, as provided by the DRL-TSRC mod-
fire units can fire at up to eight targets at once, while short-range
ule, and the fire units for which the guidance radar has been acti-
fire units can fire at up to six targets at once. These limits act as
vated, are modeled using potential game theory. The target
constraints
groups are defined as f1, 2, · · · , Tg, where the targets to be
intercepted are the targets provided by the DRL-TSRC module. X 
T
8, i ∈ LM
For the red side, the total number of fire units is denoted by sij ≤ (21)
6, i ∈ SM
mðm ¼ l þ sÞ, consisting of long-range fire units l sets and short- j¼1

range fire units s sets. To defend the red strategic location against
blue air targets, the red side cooperates to intercept the targets
using the launchers and guidance radars of each fire unit. 3.3.3. Utility Functions

3.3.1. Player Set The amount of fire units that can be fired is a local constraint that
affects the player and was taken into account when designing the
The players are defined as the firing units whose guiding strategy set. It is preferable to intercept the target with the least
radar has been activated, with the set of players defined as amount of fire resources possible, in order to conserve fire
N m ¼ fLM1 , : : : , LM 0 , SM 1 , : : : , SMh g. The set of long-range resources. The penalty function is defined as

Table 3. Air defense operation reward design.

Categories         | Event name                      | Weight (long-range / short-range) | Description
Episodic reward    | Win                             | 50        | Foe's fighter losses over 30%
Episodic reward    | Loss                            | 0         | 1. Ally's command post has been attacked more than twice, or 2. the distance between foe's bombers and the ally's command post is less than 10 km
Event-based reward | Strategic location attacked     | -4        | One attack on ally's command post or airport
Event-based reward | Long-range radar attacked       | -2        | One attack on a long-range radar
Event-based reward | Short-range radar attacked      | -1        | One attack on a short-range radar
Event-based reward | AEW attacked                    | -1        | One attack on the AEW (airborne early warning) aircraft
Event-based reward | Long-range missile launched     | -0.05     | A long-range missile launched
Event-based reward | Short-range missile launched    | -0.03     | A short-range missile launched
Event-based reward | Intercept UAV                   | 3 / 3.5   | A UAV intercepted
Event-based reward | Intercept fighter               | 5 / 5.5   | A fighter aircraft intercepted
Event-based reward | Intercept bomber                | 5 / 5.5   | A bomber intercepted
Event-based reward | Intercept cruise missile        | 1.5 / 2.5 | A cruise missile intercepted
Event-based reward | Intercept ARM                   | 1.5 / 2   | An antiradiation missile intercepted
Event-based reward | Intercept air-to-ground missile | 1.5 / 2   | An air-to-ground missile intercepted
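The event and interception weights in Table 3 lend themselves to a simple lookup. The sketch below is an illustrative Python encoding of the table, not the paper's implementation; the event keys and function name are our own, while the weights are taken directly from Table 3.

```python
# Illustrative encoding of Table 3's reward design (event names are
# hypothetical; weights are copied from the table).
INTERCEPT_REWARD = {  # event -> (long-range weight, short-range weight)
    "uav": (3.0, 3.5),
    "fighter": (5.0, 5.5),
    "bomber": (5.0, 5.5),
    "cruise_missile": (1.5, 2.5),
    "arm": (1.5, 2.0),
    "air_to_ground_missile": (1.5, 2.0),
}
EVENT_PENALTY = {
    "strategic_location_attacked": -4.0,
    "long_range_radar_attacked": -2.0,
    "short_range_radar_attacked": -1.0,
    "aew_attacked": -1.0,
    "long_range_missile_launched": -0.05,
    "short_range_missile_launched": -0.03,
}
WIN_REWARD, LOSS_REWARD = 50.0, 0.0

def step_reward(events):
    """Sum the reward for a list of (event, fire_unit_type) tuples,
    where fire_unit_type is 'long' or 'short' for interception events."""
    total = 0.0
    for event, unit in events:
        if event in INTERCEPT_REWARD:
            long_w, short_w = INTERCEPT_REWARD[event]
            total += long_w if unit == "long" else short_w
        else:
            total += EVENT_PENALTY[event]
    return total

# e.g. a short-range unit intercepts a UAV while a long-range missile flies
r = step_reward([("uav", "short"), ("long_range_missile_launched", None)])
```

Note how the same interception is worth slightly more to a short-range unit, which nudges the agent toward the cheaper short-range assets.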

Adv. Intell. Syst. 2023, 5, 2300151 2300151 (7 of 15) © 2023 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
www.advancedsciencenews.com www.advintellsyst.com

F_c(y^i, y^{-i}) = \sum_{i=1}^{m} (1 - y_{ij})^2    (22)

Each player may improve their own gain in this distributed target assignment problem, and in order to prevent the waste of fire resources, the penalty function is added to each player in order to develop the utility function

U_i(y^i, y^{-i}) = \sum_{i \cup i' \in J_i} \left[ \sum_j \alpha_j \cdot f_i^t - \frac{\sum_j r_j^i}{\sum_j r_j^{max}} \right] - \beta \sum_{i=1}^{m} (1 - y_{ij})^2, \quad \forall j \in N_t    (23)

where \alpha_j denotes the threat level of target j, t represents the target type, n_i^t represents the number of targets of type t assigned to the fire unit, and f_i^t represents the reward value obtained by the fire unit for intercepting a type-t target. Define J_i = {i' | i' \in N_m, C_{ij} = 1, C_{i'j} = 1, i' \neq i} as the set of proximity fire units of i. r_j^i is the range shortcut between fire unit i and target j; r_j^{max} is the maximum value of the course shortcut from target j to the fire units that can intercept it. The penalty coefficient \beta is a large constant used to ensure that the penalty function term converges to zero as the game progresses, thus achieving cost savings. The reward function of the i-th fire unit is denoted by

F_i(y^i, y^{-i}) = \sum_j \alpha_j \cdot f_i^t - \frac{\sum_j r_j^i}{\sum_j r_j^{max}}    (24)

Then, the utility function can be written as

U_i(y^i, y^{-i}) = \sum_{i \cup i' \in J_i} F_i(y^i, y^{-i}) - \beta F_c(y^i, y^{-i}), \quad \forall j \in N_t    (25)

The reward value for intercepting the same target varies between fire units according to their respective equipment characteristics and interdiction costs, as shown in Table 3.

3.3.4. Target Assignment Optimization Game

The target assignment optimization game (\Gamma) is proposed to solve the target assignment problem and is defined by the following equation

(\Gamma): \max_{y^i \in Y_i} U_i(y^i, y^{-i})    (26)

For the target assignment optimization game (\Gamma), a potential function exists

G(Y) = \sum_{i \in N_m} F_i(y^i, y^{-i}) - \beta F_c(y^i, y^{-i}), \quad \forall j \in N_t    (27)

Theorem 1: There exists an NE for the target assignment optimization game (\Gamma), and the optimal solution to the target assignment problem converges to the NE.

Proof: The target assignment optimization game (\Gamma) is an exact potential game. For the potential function G(y^i, y^{-i}), for player i and strategies y^i, z^i \in Y_i,

U_i(z^i, y^{-i}) - U_i(y^i, y^{-i})
  = \left[ \sum_{i \cup i' \in J_i} F_i(z^i, y^{-i}) - \beta F_c(z^i, y^{-i}) \right] - \left[ \sum_{i \cup i' \in J_i} F_i(y^i, y^{-i}) - \beta F_c(y^i, y^{-i}) \right]
  = \left[ F_i(z^i, y^{-i}) + \sum_{i' \in J_i} F_{i'}(z^i, y^{-i}) - \beta F_c(z^i, y^{-i}) \right] - \left[ F_i(y^i, y^{-i}) + \sum_{i' \in J_i} F_{i'}(y^i, y^{-i}) - \beta F_c(y^i, y^{-i}) \right]
  = \left[ F_i(z^i, y^{-i}) + \sum_{i' \in J_i} F_{i'}(y^i, y^{-i}) - \beta F_c(z^i, y^{-i}) \right] - \left[ F_i(y^i, y^{-i}) + \sum_{i' \in J_i} F_{i'}(y^i, y^{-i}) - \beta F_c(y^i, y^{-i}) \right]
  = \left[ F_i(z^i, y^{-i}) - F_i(y^i, y^{-i}) \right] - \beta \left[ F_c(z^i, y^{-i}) - F_c(y^i, y^{-i}) \right]    (28)

G(z^i, y^{-i}) - G(y^i, y^{-i})
  = \left[ \sum_{i \in N_m} F_i(z^i, y^{-i}) - \beta F_c(z^i, y^{-i}) \right] - \left[ \sum_{i \in N_m} F_i(y^i, y^{-i}) - \beta F_c(y^i, y^{-i}) \right]
  = \left[ F_i(z^i, y^{-i}) + \sum_{i' \in N_m, i' \neq i} F_{i'}(z^i, y^{-i}) - \beta F_c(z^i, y^{-i}) \right] - \left[ F_i(y^i, y^{-i}) + \sum_{i' \in N_m, i' \neq i} F_{i'}(y^i, y^{-i}) - \beta F_c(y^i, y^{-i}) \right]
  = \left[ F_i(z^i, y^{-i}) + \sum_{i' \in N_m, i' \neq i} F_{i'}(y^i, y^{-i}) - \beta F_c(z^i, y^{-i}) \right] - \left[ F_i(y^i, y^{-i}) + \sum_{i' \in N_m, i' \neq i} F_{i'}(y^i, y^{-i}) - \beta F_c(y^i, y^{-i}) \right]
  = \left[ F_i(z^i, y^{-i}) - F_i(y^i, y^{-i}) \right] - \beta \left[ F_c(z^i, y^{-i}) - F_c(y^i, y^{-i}) \right]    (29)

When the strategy of the i-th player changes from y^i to z^i, the change in the potential function is equal to the change in the utility function of the i-th player: \Delta U = \Delta G. This implies that the target assignment optimization game (\Gamma) is an exact potential game.

For the potential game, by Lemma 1, the NE must exist and the optimal point of the potential function converges to the pure strategy NE. Hence, the equilibrium solution of the target assignment optimization game (\Gamma) exists and is optimal. Therefore, it is feasible to design an algorithm to search for the NE of the game. It is not difficult to prove by contradiction that \sum_{i=1}^{m} y_{ij} = 1 holds at the optimal point, that is, each target is assigned to only one fire unit.

3.4. Best Response-Based NE Solving Method

Definition 3: The best response dynamics can be defined as the dynamics at which each player i is able to maximize the utility


function by choosing a strategy based on the strategies of other players when making decisions. In other words,

S_i = \arg\max_{a_i \in A_i} u_i(a_i, S_{-i})    (30)

Lemma 2: The best response dynamics converges the exact potential game to a pure strategy NE.

Theorem 2: The use of best response dynamics converges the target assignment optimization game (\Gamma) to an NE.

Proof: Based on Theorem 1, the target assignment optimization game (\Gamma) is an exact potential game. When combined with Lemma 2, Theorem 2 can be obtained.

For the target assignment optimization game (\Gamma) proposed in this article, an NE solving algorithm is proposed. This game is a noncooperative game, and the best response approach can be used to solve the NE. Since the strategy change of a single player will lead to strategy changes of the other players, a best-response-based NE solving algorithm is proposed to account for this interaction. However, since the dynamic best-response algorithm requires participants to make sequential decisions, which typically requires a decision coordinator, it is not conducive to distributed decision-making. Therefore, this section proposes a distributed decision-maker selection protocol that randomly selects a participant to make decisions in each round.

Algorithm 1 sets a maximum waiting time \tau_max at the outset and, at the start of the iteration, each player i generates a random waiting time \tau_i in the interval [0, \tau_max]. After the start, player i waits for the first \tau_i time units and abandons the decision-maker selection if it receives a DR signal from other players. Otherwise, a DR signal is sent at time \tau_i, and player i is identified as the decision maker for this round.

Algorithm 2. Best response-based target assignment NE solving (BRTA) algorithm.

Initialize the target assignment game \Gamma = <N, S, {U_i}_{i \in N}>, where M and T are the number of fire units and the number of targets, respectively; N_it is the number of iterations.
Randomly initialize the strategy combination S_r; let n denote the player that changes strategy in the iteration.
for k = 1 : N_it do
    The decision maker generated by Algorithm 1 is player i.
    Let S_i denote the set of available strategies for the i-th fire unit and N_S denote the number of elements in the set S_i;
    Let S_i denote the strategy for the i-th fire unit, S_i \in S_r; let S'_i = S_i - {S_i};
    Take the strategy S_i for the i-th fire unit, S_i \in S_r, and calculate the utility function U_i(S_i, S_{-i});
    S_local = S_r
    U_local = U_i(S_i, S_{-i})
    for j = 1 : N_S do
        S'_r = S_r
        Select S'_r randomly from S'_i as the strategy for the n-th fire unit and update the strategy set S'_i = S'_i - {S'_r};
        if C_{i'}(S'_i) = 1 then
            Let the i-th fire unit temporarily change its strategy from S'_n to S_n
        end if
        Update the strategy set S'_r and calculate the utility function U_i(S'_i, S_{-i})
        if U_i(S'_i, S_{-i}) > U_local then
            S_local = S'_r
            U_local = U_i(S'_i, S_{-i})
        end if
    end for
    Update strategy profile S_r = S_local
end for
Pure strategy profile NE: S_NE = S_r
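The convergence claim of Lemma 2 can be illustrated on a toy instance. The sketch below is a self-contained Python toy, not the paper's BRTA implementation: it uses an identical-interest game (the simplest exact potential game, where every utility equals the potential G, so \Delta U_i = \Delta G holds trivially) and runs round-robin best-response dynamics to a fixed point, which is then checked to be a pure-strategy NE. The strategy space and the particular G are our own choices.

```python
# Toy two-player exact potential game, purely illustrative -- not the
# paper's target-assignment game. Identical-interest games (every utility
# equals the potential G) are the simplest exact potential games.
STRATS = [0, 1, 2]

def G(s):
    """Potential function over the joint strategy s = (s1, s2)."""
    s1, s2 = s
    return -(s1 - 1) ** 2 - (s2 - 2) ** 2 - 0.5 * (s1 - s2) ** 2

def U(i, s):
    """Player i's utility; here U_i == G, so Delta U_i == Delta G exactly."""
    return G(s)

def best_response(i, s):
    """Player i's best reply, holding the other player's strategy fixed."""
    others = list(s)
    return max(STRATS, key=lambda a: U(i, tuple(others[:i] + [a] + others[i + 1:])))

# Round-robin best-response dynamics until a fixed point (Lemma 2).
s = (0, 0)
for _ in range(20):
    new = list(s)
    for i in range(2):
        new[i] = best_response(i, tuple(new))
    if tuple(new) == s:
        break
    s = tuple(new)

# At the fixed point no unilateral deviation helps: a pure-strategy NE.
is_ne = all(U(i, s) >= U(i, s[:i] + (a,) + s[i + 1:])
            for i in range(2) for a in STRATS)
```

Because each best response weakly increases G and the joint strategy space is finite, the dynamics cannot cycle, which is the argument behind Lemma 2.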
Using Algorithm 2, a pure strategy NE point of the game (\Gamma) can be obtained, which is the optimal solution of the target assignment optimization game according to Theorem 1. The proposed algorithm has a complexity of O(N_it N_S), whereas the exhaustive search algorithm has a complexity of O(N_S!/(N_S - N)!). Consequently, the proposed algorithm performs better than the exhaustive search approach.

Algorithm 1. Distributed Decision Maker Selection Protocol.

Set maximum waiting time \tau_max
The iteration starts with each player i generating a random wait time \tau_i in [0, \tau_max]
Start timer;
if player i receives a Decision Request (DR) signal from another player in the first \tau_i time after the iteration starts then
    Player i withdraws from the selection of decision maker
else
    Player i is identified as the decision maker for the round and sends DR signals to the other players
end if

4. Results and Analysis

In this section, the effectiveness of the proposed method is evaluated in a simulated environment that faithfully replicates the physical laws, terrain occlusion, and Earth curvature of a realistic battlefield environment for air defense operations. The weapon parameters used in the simulation are consistent with those of a real battlefield. The digital battlefield environment is shown in Figure 5.

4.1. Experimental Section

4.1.1. Force Configuration of Red Army

In the simulations, the red army (acting as the defending side) has a number of defensive units that can be employed: 1) Strategic locations: command post, airport; 2) Short-range fire unit: one short-range radar and three short-range launch vehicles (each vehicle carrying eight short-range missiles); 3) Long-range fire units: six sets, each including one long-range radar and eight long-range launch vehicles (each vehicle carrying six long-range missiles and three close-range missiles); and 4) AEW aircraft: one unit with a detection range of 400 km.
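The red order of battle above can be captured in a small data structure. The sketch below (Python; the class and field names are our own invention) encodes the stated composition and totals the missile inventory of the six long-range fire units.

```python
from dataclasses import dataclass

# Sketch of the red order of battle from Section 4.1.1. Class and field
# names are hypothetical; launcher and missile counts follow the text.
@dataclass
class FireUnit:
    radar: str
    launchers: int
    missiles_per_launcher: dict  # missile type -> count per launch vehicle

    def inventory(self):
        """Total missile stock of one fire unit, by missile type."""
        return {m: n * self.launchers
                for m, n in self.missiles_per_launcher.items()}

short_range_unit = FireUnit("short-range radar", 3, {"short-range": 8})
long_range_unit = FireUnit("long-range radar", 8,
                           {"long-range": 6, "close-range": 3})

# Six long-range fire units are fielded in the simulation.
long_range_total = {m: 6 * n for m, n in long_range_unit.inventory().items()}
```

Totalling the inventory this way makes the asymmetry explicit: the long-range units carry far more missiles, which is why the reward design later penalizes long-range launches more heavily.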


Figure 5. Schematic diagram of digital battlefield environment.

4.1.2. Force Configuration of Blue Army

In the simulations, the blue army (acting as the attacking side) has a number of attacking units that can be employed: 1) Fighters: 12, each carrying 6 ARMs and 2 air-to-surface missiles; 2) UAVs: 20, each carrying 2 ARMs and 1 air-to-surface missile; 3) Bombers: 4; 4) Cruise missiles: 18; and 5) Electronic jammers: 2.

4.1.3. Battlefield Environment Rules

A number of constraints have been placed on the simulation as a result of the battlefield environment. These rules are listed below: 1) If the guidance radar of a fire unit is destroyed, the fire unit is rendered inoperable. 2) During the guidance procedure, the guidance radar must be turned on and produces electromagnetic radiation. 3) The maximum range of the guidance radar is R_max = 4.12(\sqrt{H_T(m)} + \sqrt{H_R(m)}) km, where H_T is the altitude of the target and H_R is the altitude of the radar antenna, which is set as H_R = 4 m in this simulation. 4) The range of the guidance radar is affected by terrain shading and the Earth's curvature, taking atmospheric refraction into account. 5) An antiaircraft missile follows the minimum-energy trajectory during flight. 6) The short-range and long-range antiaircraft missiles have maximum interception ranges of 40 and 160 km, respectively. 7) The high and low kill probabilities in the kill zone for cruise missiles are 45% and 35%, respectively. 8) The high and low kill probabilities in the kill zone for fighters, UAVs, bombers, ARMs, and air-to-surface missiles are 75% and 55%, respectively. 9) The air-to-surface missiles have a range of 60 km and an 80% hit rate. 10) The ARMs have a range of 110 km and an 80% hit rate. 11) The effective jamming direction of an electronic jammer is a cone of about 15°, and jamming against a radar lowers the kill probability of a missile.

4.2. Performance Analysis

This section compares the performance of the proposed decision-making method, the A3CPG algorithm, with mainstream methods such as the A3C, proximal policy optimization (PPO), and DDPG algorithms, with the DDPG algorithm being used as the baseline in the comparison.

Figure 6 illustrates the reward curves obtained by the agent during the training process, with the horizontal axis representing the confrontation episodes. It can be observed that the various methods show different degrees of reward growth as the number of training rounds increases, and that the growth rate of reward and the rewards obtained also differ. From Figure 6, it is clear that the highest degree of reward growth is achieved by the A3CPG algorithm, outperforming the other conventional methods. Of the conventional methods, the A3C algorithm outperformed the PPO algorithm, which outperformed the DDPG algorithm. The proposed method is also significantly better than the other methods in terms of average reward (P < 0.05). In terms of learning speed, from best to worst: A3CPG > A3C > PPO ≈ DDPG. The proposed method is also significantly better than the other methods in terms of convergence speed (P < 0.05). It can be seen that the proposed A3CPG method produces more effective target assignment results than the A3C method during the target assignment process, helping the agent to accelerate the learning process and obtain a better air defense strategy.
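The radar-range rule in Section 4.1.3 (rule 3) is the standard radar-horizon approximation and can be checked numerically. The helper name below is our own; the constant 4.12 and the antenna height H_R = 4 m come from rule 3.

```python
import math

def radar_horizon_km(target_alt_m, radar_alt_m=4.0):
    """Maximum guidance-radar range from rule 3:
    R_max = 4.12 * (sqrt(H_T) + sqrt(H_R)), heights in metres, range in km."""
    return 4.12 * (math.sqrt(target_alt_m) + math.sqrt(radar_alt_m))

# With the simulation's radar antenna at H_R = 4 m, a target flying at
# 100 m altitude is visible out to roughly 49.4 km.
r = radar_horizon_km(100.0)
```

This makes the tactical effect of low-altitude penetrators concrete: a target hugging the ground shrinks the detection range toward the 8.24 km floor set by the antenna height alone.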


Figure 6. Change curve of agent reward during the training process.

The DDPG method had demanding requirements for the level of reward design and feedback in the training process. In contrast, the PPO method normalized rewards during training and had less demanding requirements for the reasonableness of reward setting and real-time feedback. In the experimental environment of this study, real-time feedback of rewards was challenging due to the large state-action space and multiple entities, which may explain why the PPO method performed better than the DDPG method. The A3CPG and A3C methods were more efficient at exploring the environment and were able to utilize more computational resources due to parallel computing, which may explain why they outperformed the PPO and DDPG methods.

4.2.1. Cumulative Reward and Winning Rate

Figure 7 shows the training winning rate statistics; analysis combined with Figure 6 shows that the higher-reward method has a higher winning rate. The reward function is designed to guide the agent in learning the winning strategy. Consequently, a method that yields a high final reward is indicative of a reasonable reward design and of the ability of the designed reward to guide the agent toward better strategic choices. The proposed method achieved the highest winning rate of 81.7%, followed by the A3C method with 77.3% and the PPO method with 70.5%, with the DDPG method displaying by far the worst winning rate of 45.4%. The proposed method showed a statistically significant superiority over the other conventional methods (P < 0.05), with a 5.69% increase in winning rate compared to the next best method (the A3C method). It is possible that the superior performance of the A3CPG method in the context of target assignment can be attributed to the potential game module it incorporates. By leveraging the NE, this module facilitates efficient target assignment and enables the attainment of optimal strategies, resulting in a higher winning rate. The variances of the A3CPG, A3C, DDPG, and PPO methods were 6.9, 7.1, 9.3, and 11.2, respectively. In terms of algorithm stability, the A3CPG method performed slightly better than the A3C method. This behavior can be attributed to the A3CPG system being embedded with a potential game method, which was more stable than the A3C system based purely on a neural network.

Figure 7. Air defense combat battle winning rate after training.

Table 4. Ablation experimental design.

Method | Multihead attention mechanism | Bi-GRU mechanism | Type-different reward mechanism
A3CPG  | yes | yes | yes
A3C+M  | yes | no  | no
A3C+B  | no  | yes | no
A3C+T  | no  | no  | yes
A3C    | no  | no  | no

"yes": the mechanism is used in the method; "no": the mechanism is not used in the method.

Figure 8. Change curve of agent reward during the training process.
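Of the three mechanisms ablated in Table 4, the multihead attention step admits a compact sketch. The code below is a generic numpy illustration of multihead scaled dot-product attention, not the paper's DRL-TSRC network; the dimensions, weight initialisation, and names are our own assumptions.

```python
import numpy as np

# Generic multihead scaled dot-product attention in numpy (illustrative
# only; not the paper's DRL-TSRC implementation).
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(x, n_heads, w_q, w_k, w_v, w_o):
    """x: (seq_len, d_model). Splits d_model across n_heads, applies scaled
    dot-product attention per head, concatenates, and re-projects."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(d_head))
        heads.append(scores @ v[:, sl])
    return np.concatenate(heads, axis=-1) @ w_o

d_model, n_heads, seq_len = 8, 2, 5  # e.g. 5 air targets, 8 features each
w = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y = multihead_attention(rng.standard_normal((seq_len, d_model)), n_heads, *w)
```

Reading each row as one air target's feature vector, attention reweights every target's representation by all the others, which is the mechanism that lets the agent focus on the more threatening targets in the situation picture.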


4.2.2. Mechanism Effectiveness Analysis

To evaluate the effectiveness of the proposed multihead attention mechanism, BiGRU mechanism, and TDRM, ablation experiments were conducted. The experimental design of the ablation experiment is shown in Table 4, which compares the effects of the three mechanisms proposed in this article on the reward. The experiment uses the A3C method as the baseline and compares the effectiveness of using a single mechanism with the A3C approach against the A3CPG approach, in which all three mechanisms are used. The results of the ablation experiment are shown in Figure 8. As shown, the use of any mechanism with the A3C approach is able to enhance the reward compared to the baseline approach, indicating that the three proposed mechanisms are effective in the experimental scenario. In particular, the multihead attention mechanism displayed the best reward results and the fastest training speed of all the single-mechanism methods at the end of training. This indicates that the reward after training was higher for the A3C+M method than for the

Figure 9. Air defense combat scene during training, a) scene 1 and b) scene 2.


other two single-mechanism methods, and that the slope of the reward curve was steeper for the A3C+M method. This could be attributed to the addition of the multihead attention mechanism, which allowed the agent to analyze the battlefield situation information more efficiently, thus speeding up training and focusing on the more important targets to obtain better air defense strategies. With the addition of the type-different reward mechanism in the A3C+T approach, the trained reward of the agent was slightly improved compared to the A3C+B and baseline approaches. This could be due to the TDRM helping the agent to apply the strategy of deploying a more reasonable amount of defensive force, using fewer long-range fire units and less exposure, thereby accelerating the convergence of the agent significantly. After the BiGRU mechanism was added, the convergence speed of the agent was significantly accelerated relative to the baseline. This is evidenced by the fact that the slope of the A3C+B reward curve before 48 episodes was significantly higher than that of the A3C+M, A3C+T, and baseline methods, almost reaching the convergence level. The reason could be that the BiGRU mechanism had a greater capability for analyzing the air situation information, which allowed the agent to have a better perception of the current situation and thus accelerated training. However, the final reward was similar to the baseline due to the lack of attention to the global situation.

4.2.3. Convergence Analysis of Potential Function

Figure 10 shows the variation of the potential function during training, where Figure 10a shows the case of scene 1 in Figure 9 and Figure 10b shows the case of scene 2 in Figure 9. In these figures, the horizontal axis represents the number of iterations of the potential game and the vertical axis represents the potential function value.

From Figure 10, it can be seen that in both scenes the potential function is able to converge to the NE, with the potential function value in scene 1 converging to 9.56 and the potential function value in scene 2 converging to 13.38. The convergence to the NE occurred at step 57 in scene 1 and at step 74 in scene 2. The greater number of steps required for convergence in scene 2 relative to scene 1 can be attributed to the larger number of targets in scene 2, which further complicates the air situation. As a result, the process of convergence toward the NE entails a greater number of steps.

Figure 10. The convergence process of the target assignment optimization game. a) The convergence process of scene 1. b) The convergence process of scene 2.

4.2.4. Rational Analysis of Target Assignment

In order to compare the performance of the proposed method and the traditional method, two metrics are introduced. The first is the ratio of missiles consumed to targets eliminated (ROT), which is calculated as

ROT = N_c / N_e    (31)

where N_c is the number of consumed missiles and N_e is the number of eliminated targets. The second is the ratio of consumed short-range missiles to consumed long-range missiles (ROS), which can be calculated by

ROS = N_s / N_l    (32)

where N_s is the number of consumed short-range missiles and N_l is the number of consumed long-range missiles.

According to Figure 11, the ROT values of the two methods at 20 k episodes were 4.26 and 4.28, respectively, with no significant difference (P > 0.05). However, at 100 k, the ROT values for the A3CPG and A3C methods were 1.91 and 2.31, respectively, a statistically significant reduction of 17.32% for the A3CPG method over the A3C method (P < 0.05). This improvement can be attributed to the potential game module embedded in the proposed A3CPG method, which can optimize the assignment of fire resources by assigning targets to the appropriate fire units. The observed decrease in ROT values for both methods during training suggests that the agent learnt effective strategies to conserve fire resources.

According to Figure 12, the ROS values of the two methods were 0.541 and 0.592 at 20 k, respectively, and increased to 1.971 and 1.537 as training progressed to 100 k. The observed increase in ROS values suggests that the agent learnt to preferentially use the short-range missile strategy. At 20 k, no significant difference in ROS values was observed between the two methods (P > 0.05). However, after training, the A3CPG method showed a 28.24% improvement over the A3C method at 100 k,


and the difference was statistically significant (P < 0.05). These results indicate that the potential game module in A3CPG can effectively enhance the rationality of firepower resource use and promote the preferential use of short-range missiles for target interception by the agent, over the more costly long-range missiles.

Figure 11. ROT comparison between different methods.

Figure 12. ROS comparison between different methods.

5. Conclusion

In this article, a potential game-embedded reinforcement learning-based intelligent decision-making method is proposed. The proposed method is divided into two modules: the DRL-TSRC and PG-TA modules. The DRL-TSRC module is based on DRL and uses BiGRU and multihead attention mechanisms for battlefield situational information processing to achieve target selection and radar control. The PG-TA module models the multitarget multi-fire-unit assignment problem as a target assignment optimization game, proves that the game is a potential game, and integrates the potential game into the intelligent decision system to ensure that the target assignment result is an NE solution. The experimental results demonstrate that the proposed intelligent decision-making method can effectively enhance the winning rate (5.69% improvement compared to the traditional method), optimize the rationality of target assignment, conserve firepower resources, and generate superior strategies compared to other methods (17.32% reduction in ROT and 28.24% improvement in ROS compared to the traditional method).

In the future, we plan to explore two directions for improvement. First, we will seek to embed the potential game more deeply into the DRL model and develop game theory-inspired DRL algorithms to solve the agent game problem. Second, we plan to design more complex tasks, such as scheduling air defense firepower, fighter aircraft, and early warning aircraft, and develop the corresponding algorithms to solve these tasks.

Acknowledgements

The authors would like to thank the anonymous reviewers, whose insightful comments greatly improved the quality of this article. The work described in this article was supported by the National Natural Science Foundation of China (grants: 62106283 and 52175282) and the Basic Natural Science Research Program of Shaanxi Province (grant: 2021JM-226).

Conflict of Interest

The authors declare no conflict of interest.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Keywords

air defense, deep reinforcement learning, intelligent decision-making, Nash equilibrium, potential games

Received: March 27, 2023
Revised: June 27, 2023
Published online: August 18, 2023



