Deep Reinforcement Learning-Based Air Defense Decision-Making
Adv. Intell. Syst. 2023, 5, 2300151 2300151 (1 of 15) © 2023 The Authors. Advanced Intelligent Systems published by Wiley-VCH GmbH
[Figure 1: Air defense combat scenario. Labeled elements include UAVs, anti-radiation missiles, air-to-surface missiles, short-range and long-range radars and interception units, short-range and long-range missiles, the command post, and the airport.]
traditional task assignment methods and enable intelligent task assignment for air defense combat.

However, DRL methods suffer from weak interpretability, a lack of convergence guarantees, and high computing power requirements in intelligent decision-making applications.[14] To solve these problems, methods combining game theory and reinforcement learning have emerged in recent years.[15] Yarahmadi et al.[16] modeled the multiagent credit assignment (MCA) problem in multiagent reinforcement learning as a bankruptcy game. They then solved the MCA problem by selecting the best assignment result through an evolutionary game. The results showed that the proposed method performed well in terms of group learning rate, confidence, expertness, certainty, and correctness metrics. Zhang et al.[17] proposed a reciprocal reputation mechanism, combined with a double deep Q-network algorithm, to suppress attack motivation in vehicular ad hoc networks and reduce selfish node attacks. Xu et al.[18] formulated the task offloading problem in vehicular edge computing as an exact potential game, achieving task offloading resource optimization by solving the Nash equilibrium (NE) using multiagent reinforcement learning. Duan et al.[19] combined game theory and reinforcement learning to enable autonomous vehicle driving, proposing a multiagent negotiation model based on an incomplete information game to facilitate automatic negotiation among agents. Zhu et al.[20] combined the minimax Q algorithm with game theory to solve the NE of two-player zero-sum games. Subsequent simulation results showed that the proposed method was applicable to both symmetric and nonsymmetric games. Cao et al.[15b] proposed a game theory-based inverse learning framework to obtain the parameters of both the dynamic system and the individual costs of multistage games, by solving affine-quadratic games and calculating the gradient of the game system parameters. Peng et al.[21] formulated the distributed control problem as a differential graphical game and combined it with DRL to achieve bounded control. Albaba et al.[15a] combined hierarchical game theory with DRL for simultaneous decision-making in multiagent traffic scenarios. They experimentally demonstrated that the proposed method modeled more traffic situations and effectively reduced the collision rate compared to conventional methods. Gao et al.[22] proposed a passivity-based methodology for multiagent finite games in reinforcement learning dynamics, as well as algorithms for analysis and design. Results indicated that the proposed method improved convergence.

Previous studies have applied DRL to decision-making; however, limited research has explored the integration of potential games into intelligent decision-making systems for air defense combat. To address this research gap, and to improve the interpretability and convergence of DRL when used in decision-making, potential game theory is introduced into the DRL framework. Furthermore, proof that the target assignment (TA) game is an exact potential game is provided, the primary contribution of this article.

In this article, an intelligent decision system for air defense operations is proposed based on potential games and DRL,
leveraging the advantages of both DRL and game theory to attain an effective and high-quality decision system for air defense operations. Potential games are a branch of game theory that possess desirable properties guaranteeing the existence of NE solutions. Specifically, potential game theory is utilized to construct a multitarget multi-interception-unit resource assignment model for efficient TA, as well as to obtain an NE solution in the TA phase. The primary contributions of this article are as follows.

First, an air defense operational decision-making framework is proposed that combines DRL and potential games for target selection and radar control. A type-different-based reward mechanism is employed to incentivize efficient decision-making and TA.

Second, a DRL-based target selection and radar control module is proposed to facilitate intelligent decision-making for air defense operations. The air defense combat problem is modeled as a Markov decision process (MDP). The state space, action space, and reward are formulated from the characteristics of the problem. To facilitate the learning of better strategies for the above MDP, a bidirectional gated recurrent unit (BiGRU)-based feature extraction method configured to extract the type of air situation and a multihead attention mechanism are proposed.

…next state s_{t+1} at time step t+1. In an MDP, a state is called a Markov state when it satisfies the following condition.

P[s_{t+1} | s_t] = P[s_{t+1} | s_1, ⋯, s_t]   (1)

That is, the next state depends only on the current state and is independent of all earlier states. The state transition probability matrix describes the probability of transitioning from the current state s to the subsequent state s′.

P_{ss′} = P[s_{t+1} = s′ | s_t = s]   (2)

The reward function represents the expected reward received by the agent for taking action a in state s, after the transition

r_s^a = E[r_{t+1} | s_t = s, a_t = a]   (3)

where E is the expectation operator.

The return R_t, also known as the cumulative reward, is the sum of all discounted rewards from the beginning to the end of the round

R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}   (4)
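As a rough illustration of Equation (4) (a sketch, not the authors' code), the return of a finite episode can be computed by folding the rewards backward through the recursion R_t = r_{t+1} + γR_{t+1}:

```python
def discounted_return(rewards, gamma):
    """Return [R_1, ..., R_T] where R_t = sum_k gamma^k * r_{t+k+1}.

    `rewards` holds r_1, ..., r_T for one finite episode, so the
    infinite sum in Equation (4) truncates at the episode end.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Fold backward using the recursion R_t = r_{t+1} + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three-step episode, gamma = 0.5
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

The backward fold avoids recomputing the tail sum for every step, which matters when episodes are long.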
m worker nodes and 1 server node. The server node holds the most recent policy network π_θ(s_t, a_t) and value network (Q_w(s, a) or V(s_t; θ_w)), where θ denotes the policy network parameters and w denotes the value network parameters. Based on the gradient data transmitted by the worker nodes, the server node updates the parameters. Each worker node has a copy π_θ′(s_t, a_t) of the policy network and a copy Q_w′(s, a) or V(s_t; θ_w′) of the server node's value network, where θ′ and w′ represent the parameters of the worker node's policy network and value network, respectively. The gradient information is sent to the server node at regular intervals to update the server node's parameters. The latest parameters are then requested from the server node, ensuring that the conditions θ′ = θ and w′ = w are met. Generally, the parameters of the policy network and the value network are shared in the nonoutput layers.

The policy gradient update is defined as

∇_θ′ J(θ′) = Â(s_t, a_t; θ, θ_w) · ∇_θ′ log π(a_t | s_t; θ′)   (10)

where Â is an estimate of the advantage function

Â(s_t, a_t; θ, θ_w) = r_t + γV(s_{t+1}; θ_w) − V(s_t; θ_w)   (11)

Â indicates whether it is advantageous or disadvantageous to perform action a_t.

2.2. Potential Game

The player, the strategy, and the utility function are the three components used to define a game in game theory.[26] A game is denoted by Γ = <N, S, {U_i}_{i∈N}>, where each element i is a player and N = {1, 2, …, n} is the set of players. S = S_1 × S_2 × ⋯ × S_N represents the strategy space of the game Γ, S_i is the strategy space of the i-th player, and s_i ∈ S_i is the strategy of the i-th player. Let A represent a subset of N, −A represent the complement of A, and Y_A represent the Cartesian product ∏_{i∈A} Y_i. Y_A is shortened to Y_i if A has only one element. Thus, the strategy combination y = (y_1, y_2, ⋯, y_n) can be rearranged as y = (y_i, y_{−i}), with y_i ∈ Y_i, y_{−i} ∈ Y_{−i}, y ∈ Y. The utility function of the i-th player maps a combination of strategies to a real number. This mapping is denoted by U_i: Y → R.

Definition 1 (Pure Strategy NE): The strategy combination z is said to be a pure strategy NE if, for ∀i ∈ N, ∀y_i ∈ Y_i, it satisfies Equation (12).

U_i(z_i, z_{−i}) ≥ U_i(y_i, z_{−i})   (12)

Ordinal potential games, weighted potential games, and exact potential games are the different categories of potential games. Below is a description of the exact potential game used in this study.

Definition 2 (Exact Potential Game): The game Γ = <N, S, {U_i}_{i∈N}> is said to be an exact potential game if there exists a function G: Y → R which, for ∀i ∈ N, ∀y_{−i} ∈ Y_{−i}, ∀x, z ∈ Y_i, satisfies

U_i(x, y_{−i}) − U_i(z, y_{−i}) = G(x, y_{−i}) − G(z, y_{−i})   (13)

Lemma 1: All finite potential games have pure strategy NE solutions.

3. Problem Formulation and Proposed Method

This section introduces the air defense combat intelligent decision problem and models it as an MDP. The state space and action space of the air defense combat decision problem are proposed. An event-based reward mechanism is developed, and a potential game-based TA model is proposed that is configured to solve the air defense decision problem.

3.1. Problem Formulation

The air defense combat intelligent decision problem is typically modeled mathematically in order to minimize the probability of missing high-value targets, minimize the loss of defended key locations, maximize the probability of effective kills, and minimize the consumption of interceptor resources.[2b]

First, the air defense combat intelligent decision problem is framed as an MDP and solved using a DRL-based target selection and radar control (DRL-TSRC) module. Neural networks and DRL techniques are used to analyze the air situation and the threat level of individual targets, and to assign the air targets to be intercepted and their threat levels, before deciding when to turn on the guidance radar of each fire unit.

Second, the PG-TA module is used to assign the targets selected by the DRL-TSRC module for interception and to control the fire units for interception. The process is illustrated in Figure 2.

For air defense decision problems, there are several equipment limitations that should be considered: 1) The guidance radar cannot be turned off after it is turned on. 2) The guidance radar of a fire unit must be turned on before the fire unit can launch a missile to intercept a target. 3) The guidance radar can only guide missiles launched by its own fire unit. 4) The blue side cannot detect the fire unit position until the guidance radar is turned on. 5) The air situation posture of the red side is given by the airborne warning and control system and does not require guidance radar detection. 6) If the guidance radar of a fire unit is destroyed, the fire unit cannot launch and guide missiles. 7) Each fire unit satisfies rational choice theory.

It should be noted that in this article, the term "red side" refers to the defending side, which utilizes ground-to-air missiles as its main weapons, while the term "blue side" refers to the attacking side, which employs UAVs and fighter aircraft as its main weapons.

3.2. Deep Reinforcement Learning-Based Target Selection and Radar Control Module (DRL-TSRC)

An agent takes action a_i to intercept an enemy target at battlefield state s_i. The Markov process model consists of a state set S = [s_1, s_2, ⋯, s_n] and an action set A = [a_1, a_2, ⋯, a_n]. The agent selects action a_i from the action set A at state s_i according to the policy π: S × A → [0, 1]. The battlefield environment transitions to the next state according to the state transition
function P: S × A × S → [0, 1]. The goal of the agent is to maximize the cumulative reward function

R_t = E[ Σ_{t=0}^{T} γ^t r_t ],   (14)

where r_t is the reward received at step t, T is the time horizon, and γ is the discount factor. Equation (5) provides the calculation for the state-value function V_π(s), which is the expectation of the cumulative reward function R_t, and Equation (7) provides the calculation for the action–state–value function Q_π(s, a). Compared to the average reward under strategy π, that is V_π(s), the advantage function A_π(s_t, a_t) = Q_π(s, a) − V_π(s) indicates how good the reward earned by taking action a_t in the current state s_t was.

Figure 2. The architecture of the proposed decision-making method. The proposed method is composed of a DRL-TSRC module and a PG-TA module.

3.2.1. State and Action Space of Air Defense Problem

In this section, the air defense combat intelligent decision problem is formulated as an MDP by specifying the states, actions, rewards, and goals in air defense combat. The rewards are related to the type of fire unit used and are specified in Section 3.2.4.

The state space of the air defense combat problem includes the state of the defense sites, the state of the fire units, the state of the detectable enemy targets, and the state of the attackable enemy targets. This state information is maintained by the digital battlefield environment. The state information of the defense sites comprises the site number, location, type, and attack status. The state information of the fire units includes the fire unit tag number, location, number of missiles remaining, usability, number of targets the unit can attack, and attack status. The state of the reconnoitered enemy targets includes the target number, location, type, movement, and attack status. The entire air defense combat state space is outlined in Table 1.

Table 1. Air defense operation states definition.

State variable   Meaning
id               Strategic location number
xd               Strategic location position
td               Strategic location type
ad               Strategic location attack status
if               Fire unit number
xf               Fire unit position
mf               Fire unit missiles remaining
uf               Fire unit usability
iαf              Number of targets that can be targeted by the unit
af               Fire unit attack status
is               Enemy target number
xs               Enemy target position
ts               Enemy target type
vs               Enemy target velocity
as               Enemy target attack status

The action space of the air defense combat problem includes target selection, target threat level, radar selection, and radar action. The entire air defense combat action space is outlined in Table 2.

3.2.2. Air Situation Feature Extraction Based on BiGRU Network

The BiGRU method is used to analyze air situation information in air defense missions, which are time-series events. This method considers the state input over a period of time before the current moment. The BiGRU network consists of a forward GRU and a reverse GRU, and its structure is shown in Figure 3. The BiGRU method enhances the ability of the model to learn air situation features by analyzing the laws of state information in two directions: past to present and present to past.
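As a hedged sketch of this two-direction pass (a toy stand-in, not the paper's network; the real GRU gating is omitted), a simple recurrent cell can be run over the state sequence forward and backward, and the two hidden states fused per time step:

```python
import math

def simple_rnn(seq, w_in=0.5, w_rec=0.3):
    """Toy recurrent cell standing in for a GRU (gating omitted):
    h_k = tanh(w_in * x_k + w_rec * h_{k-1})."""
    h, out = 0.0, []
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
        out.append(h)
    return out

def bigru_like(seq, w1=0.5, w2=0.5, b=0.0):
    """Fuse forward (past -> present) and backward (present -> past)
    hidden states per step, mirroring h_t = w_t1*h_fwd_t + w_t2*h_bwd_t + b_t."""
    fwd = simple_rnn(seq)
    bwd = simple_rnn(seq[::-1])[::-1]  # run on the reversed sequence, then re-align
    return [w1 * f + w2 * g + b for f, g in zip(fwd, bwd)]

features = bigru_like([0.1, 0.9, 0.4])  # one fused feature per time step
```

Because each fused feature mixes a past-to-present and a present-to-past summary, every time step sees context from both sides of the sequence, which is the point of the bidirectional design.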
Table 2. Air defense operation actions definition.

h_t = w_t1 · h⃗_t + w_t2 · h⃖_t + b_t,

where h⃗_t and h⃖_t are the hidden states of the forward and reverse GRUs at time t, w_t1 and w_t2 are the corresponding weights, and b_t is the bias.

Figure 3. Diagram of the BiGRU structure.

3.2.3. Multiheaded Attention Mechanism

Figure 4. Diagram of the multihead attention structure.
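The mechanism of Section 3.2.3 is standard scaled dot-product attention; as a hedged sketch (toy vectors and dimensions assumed, not the paper's configuration), one head can be written as below, with multihead attention running several such heads on learned projections and concatenating their outputs:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """One head of scaled dot-product attention:
    softmax(q . k / sqrt(d))-weighted sum of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Toy example: the query aligns with the first key, so the output
# leans toward the first value vector.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Applied to battlefield features, the weights let the agent emphasize the targets whose features best match the current query, which is the "focus on important targets" behavior discussed later in the ablation study.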
3.2.4. Type-Different Reward Mechanism

Since the firing and exposure costs of long-range fire units are higher than those of short-range fire units, it is important to train the agent to use a higher proportion of short-range fire units to reduce usage costs. To this end, the type-different-based reward mechanism (TDRM) was designed, which provides different reward values for different types of fire units. Depending on the type of fire unit, different rewards are given for radar detection, missile launch, and target interception. Since the state space and action space of air defense operations are large, giving a reward to the agent only after each round of combat is not conducive to training. Therefore, this article proposes a critical event-based reward mechanism for different fire units to provide timely feedback after the agent makes a decision, thereby speeding up the training process. The specific reward design is shown in Table 3.

…range fire units s sets. To defend the red strategic location against blue air targets, the red side cooperates to intercept the targets using the launchers and guidance radars of each fire unit.

3.3.1. Player Set

The players are defined as the fire units whose guidance radar has been activated, with the set of players defined as N_m = {LM_1, …, LM_s, SM_1, …, SM_k}. The set of long-range fire units is denoted by LM = {LM_1, …, LM_s}, while the set of short-range fire units is denoted by SM = {SM_1, …, SM_k}. The numbers of long-range fire units and short-range fire units are s and k, respectively, where |N_m| = s + k = m is the number of players.

3.3.2. Strategy Space

The strategy space is defined as S = S_1 × S_2 × ⋯ × S_N, where S_i represents the set of strategies of the i-th player. s_i = {s_i1, s_i2, …, s_ij, …, s_iT} is the strategy of the i-th player. Each element s_ij, i ∈ N_m, j ∈ N_t, in s_i satisfies

s_ij = { 1, shoot the j-th target
       { 0, do not shoot the j-th target, or C_ij = 0   (20)

3.3.3. Utility Functions

The amount of fire a unit can deliver is a local constraint that affects the player and was taken into account when designing the strategy set. It is preferable to intercept the target with the least amount of fire resources possible, in order to conserve fire resources. The penalty function is defined as
F_c(y_i, y_{−i}) = Σ_{i=1}^{m} (1 − y_ij)²   (22)

Each player may improve its own gain in this distributed target assignment problem; in order to prevent the waste of fire resources, the penalty function is added for each player to develop the utility function

U_i(y_i, y_{−i}) = Σ_{i∪i′∈J_i} [ Σ_j α_j · f_it · ( Σ_j r_j^i / Σ_j r_j^max ) ] − β Σ_{i=1}^{m} (1 − y_ij)²,  ∀j ∈ N_t   (23)

where α_j denotes the threat level of target j, t represents the target type, n_it represents the number of targets of type t assigned to the fire unit, and f_it represents the reward value obtained by the fire unit for intercepting a type-t target. Define J_i = {i′ | i′ ∈ N_m, C_ij = 1, C_i′j = 1, i′ ≠ i} as the set of proximity fire units of i. r_j^i is the range shortcut between fire unit i and target j, and r_j^max is the maximum value of the course shortcut from target j to the fire units that can intercept it. The penalty coefficient β is a large constant used to ensure that the penalty function term converges to zero as the game progresses, thus achieving cost savings.

The reward function of the i-th fire unit is denoted by

F_i(y_i, y_{−i}) = Σ_j α_j · f_it · ( Σ_j r_j^i / Σ_j r_j^max )   (24)

Then, the utility function can be written as

U_i(y_i, y_{−i}) = Σ_{i∪i′∈J_i} F_i(y_i, y_{−i}) − βF_c(y_i, y_{−i}),  ∀j ∈ N_t   (25)

The reward value for intercepting the same target varies between fire units according to their respective equipment characteristics and interdiction costs, as shown in Table 3.

U_i(z_i, y_{−i}) − U_i(y_i, y_{−i})
= [ Σ_{i∪i′∈J_i} F_i(z_i, y_{−i}) − βF_c(z_i, y_{−i}) ] − [ Σ_{i∪i′∈J_i} F_i(y_i, y_{−i}) − βF_c(y_i, y_{−i}) ]
= [ F_i(z_i, y_{−i}) + Σ_{i′∈J_i} F_i′(z_i, y_{−i}) − βF_c(z_i, y_{−i}) ] − [ F_i(y_i, y_{−i}) + Σ_{i′∈J_i} F_i′(y_i, y_{−i}) − βF_c(y_i, y_{−i}) ]
= [ F_i(z_i, y_{−i}) + Σ_{i′∈J_i} F_i′(y_i, y_{−i}) − βF_c(z_i, y_{−i}) ] − [ F_i(y_i, y_{−i}) + Σ_{i′∈J_i} F_i′(y_i, y_{−i}) − βF_c(y_i, y_{−i}) ]
= [ F_i(z_i, y_{−i}) − F_i(y_i, y_{−i}) ] − β[ F_c(z_i, y_{−i}) − F_c(y_i, y_{−i}) ]   (28)

G(z_i, y_{−i}) − G(y_i, y_{−i})
= [ Σ_{i∈N_m} F_i(z_i, y_{−i}) − βF_c(z_i, y_{−i}) ] − [ Σ_{i∈N_m} F_i(y_i, y_{−i}) − βF_c(y_i, y_{−i}) ]
= [ F_i(z_i, y_{−i}) + Σ_{i′∈N_m, i′≠i} F_i′(z_i, y_{−i}) − βF_c(z_i, y_{−i}) ] − [ F_i(y_i, y_{−i}) + Σ_{i′∈N_m, i′≠i} F_i′(y_i, y_{−i}) − βF_c(y_i, y_{−i}) ]
= [ F_i(z_i, y_{−i}) + Σ_{i′∈N_m, i′≠i} F_i′(y_i, y_{−i}) − βF_c(z_i, y_{−i}) ] − [ F_i(y_i, y_{−i}) + Σ_{i′∈N_m, i′≠i} F_i′(y_i, y_{−i}) − βF_c(y_i, y_{−i}) ]
= [ F_i(z_i, y_{−i}) − F_i(y_i, y_{−i}) ] − β[ F_c(z_i, y_{−i}) − F_c(y_i, y_{−i}) ]   (29)
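The derivation above can be checked numerically on a toy instance. The sketch below (illustrative payoffs, not the paper's scenario) builds a two-unit assignment game of the form U_i = F_i − β·F_c, verifies the exact-potential identity of Equation (13) against the candidate potential G = ΣF_i − β·F_c, and finds a pure NE by best-response iteration, as Lemma 1 guarantees:

```python
import itertools

# Toy 2-unit / 2-target assignment game (hypothetical numbers).
# F[i][a] is unit i's reward for engaging target a, depending only on
# its own choice; Fc penalizes both units picking the same target.
F = [{0: 2.0, 1: 1.0}, {0: 1.0, 1: 3.0}]
BETA = 5.0

def Fc(profile):
    return 1.0 if profile[0] == profile[1] else 0.0

def U(i, profile):
    return F[i][profile[i]] - BETA * Fc(profile)

def G(profile):
    # Candidate potential: total reward minus the shared penalty term.
    return sum(F[i][p] for i, p in enumerate(profile)) - BETA * Fc(profile)

# 1) Exact-potential check (Equation (13)): every unilateral deviation
#    changes U_i and G by exactly the same amount.
for profile in itertools.product([0, 1], repeat=2):
    for i in (0, 1):
        for dev in (0, 1):
            alt = profile[:i] + (dev,) + profile[i + 1:]
            assert abs((U(i, alt) - U(i, profile)) - (G(alt) - G(profile))) < 1e-12

# 2) Best-response iteration converges to a pure NE (Lemma 1).
profile = (0, 0)
for _ in range(10):
    for i in (0, 1):
        best = max((0, 1), key=lambda a: U(i, profile[:i] + (a,) + profile[i + 1:]))
        profile = profile[:i] + (best,) + profile[i + 1:]
is_ne = all(U(i, profile) >= U(i, profile[:i] + (a,) + profile[i + 1:])
            for i in (0, 1) for a in (0, 1))
print(profile, is_ne)  # (1, 0) True
```

The identity holds here because each F_i depends only on the unit's own assignment, so all cross terms cancel — exactly the cancellation used in the step from Equation (28) to Equation (29).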
function by choosing a strategy based on the strategies of the other players when making decisions. In other words,

S_i* = arg max_{S_i ∈ S_i} U_i(S_i, S_{−i}).   (30)

Lemma 2: Best response dynamics converges an exact potential game to a pure strategy NE.

Theorem 2: The use of best response dynamics converges the target assignment optimization game (Γ) to an NE.

Proof: Based on Theorem 1, the target assignment optimization game (Γ) is an exact potential game. When combined with Lemma 2, Theorem 2 is obtained.

For the target assignment optimization game (Γ) proposed in this article, an NE solving algorithm is proposed. This game is a noncooperative game, and the best response approach can be used to solve the NE. Since the strategy change of a single player will lead to strategy changes of the other players, a best-response-based NE solving algorithm is proposed to account for this interaction. However, since the dynamic best-response algorithm requires participants to make sequential decisions, which typically requires a decision coordinator, it is not conducive to distributed decision-making. Therefore, this section proposes a distributed decision-maker selection protocol that randomly selects a participant to make decisions in each round.

Algorithm 1 sets a maximum waiting time τ_max at the outset and, at the start of each iteration, each player i generates a random waiting time τ_i in the interval [0, τ_max]. After the start, player i waits for the first τ_i time units and abandons the decision-maker selection if it receives a DR signal from another player. Otherwise, a DR signal is sent at time τ_i, and player i is identified as the decision maker for this round.

Algorithm 2. Best response-based target assignment NE solving (BRTA) algorithm.

Initialize the target assignment game Γ = <N, S, {U_i}_{i∈N}>, where M and T are the number of fire units and the number of targets, respectively. N_it is the number of iterations.
Randomly initialize the strategy combination S_r; let i denote the player that changes strategy in the iteration.
for k = 1 : N_it do
  The decision maker generated by Algorithm 1 is player i.
  Let S_i denote the set of available strategies for the i-th fire unit and N_S denote the number of elements in the set S_i;
  Let s_i denote the strategy of the i-th fire unit, s_i ∈ S_r; let S′_i = S_i − {s_i};
  Take the strategy s_i of the i-th fire unit, s_i ∈ S_r, and calculate the utility function U_i(s_i, s_{−i});
  S_local = S_r
  U_local = U_i(s_i, s_{−i})
  for j = 1 : N_S do
    S′_r = S_r
    Select s′ randomly from S′_i as the strategy for the i-th fire unit and update the strategy set S′_i = S′_i − {s′};
    if C_i(s′) = 1 then
      Let the i-th fire unit temporarily change its strategy from s_i to s′
    end if
    Update the strategy profile S′_r and calculate the utility function U_i(s′, s_{−i})
    if U_i(s′, s_{−i}) > U_local then
      S_local = S′_r
      U_local = U_i(s′, s_{−i})
    end if
  end for
  Update strategy profile S_r = S_local
end for
Pure strategy NE profile: S_NE = S_r

Using Algorithm 2, a pure strategy NE point of the game (Γ) can be obtained, which is the optimal solution of the target assignment optimization game according to Theorem 1. The proposed algorithm has a complexity of O(N_it · N_S), whereas the exhaustive search algorithm has a complexity of O(N_S!/(N_S − N)!). Consequently, the proposed algorithm performs better than the exhaustive search approach.

4. Results and Analysis

In this section, the effectiveness of the proposed method is evaluated in a simulated environment that faithfully replicates the physical laws, terrain occlusion, and Earth curvature of a realistic battlefield environment for air defense operations. The weapon parameters used in the simulation are consistent with those of a real battlefield. The digital battlefield environment is shown in Figure 5.
4.1.2. Force Configuration of Blue Army

In the simulations, the blue army (acting as the attacking side) has a number of attacking units that can be employed: 1) Fighters: 12, each carrying 6 ARMs and 2 air-to-surface missiles; 2) UAVs: 20, each carrying 2 ARMs and 1 air-to-surface missile; 3) Bombers: 4; 4) Cruise missiles: 18; 5) Electronic jammers: 2.

4.1.3. Battlefield Environment Rules

There are a number of constraints placed on the simulation as a result of the battlefield environment. These rules are listed below: 1) If the guidance radar of a fire unit is destroyed, the fire unit is rendered inoperable. 2) During the guidance procedure, the guidance radar must be turned on and produces electromagnetic radiation. 3) The maximum range of the guidance radar is R_max = 4.12(√H_T(m) + √H_R(m)) (km), where H_T is the altitude of the target and H_R is the altitude of the radar antenna, which is set as H_R = 4 m in this simulation. 4) The range of the guidance radar is affected by terrain shading and the Earth's curvature, taking atmospheric refraction into account. 5) An antiaircraft missile follows the minimum-energy trajectory during flight. 6) The short-range and long-range antiaircraft missiles have maximal interception ranges of 40 and 160 km, respectively. 7) The high and low kill probabilities in the kill zone for cruise missiles are 45% and 35%, respectively. 8) The high and low kill probabilities in the kill zone for fighters, UAVs, bombers, ARMs, and air-to-surface missiles are 75% and 55%, respectively. 9) The air-to-surface missiles have a range of 60 km and an 80% hit rate. 10) The ARMs have a range of 110 km and an 80% hit rate. 11) The effective jamming direction of an electronic jammer is a cone of about 15°, and jamming against the radar lowers the kill probability of a missile.

4.2. Performance Analysis

This section compares the performance of the proposed decision-making method, the A3CPG algorithm, with mainstream methods such as the A3C, proximal policy optimization (PPO), and DDPG algorithms, with the DDPG algorithm being used as the baseline in the comparison.

Figure 6 illustrates the reward curves obtained by the agents during the training process, with the horizontal axis representing the confrontation episodes. It can be observed that the various methods show different degrees of reward growth as the number of training rounds increases during the learning process. Similarly, the growth rate of the reward and the rewards obtained also differ. From Figure 6, it is clear that the highest degree of reward growth is achieved by the A3CPG algorithm, outperforming the other conventional methods. Among the conventional methods, the A3C algorithm outperformed the PPO algorithm, which outperformed the DDPG algorithm. The proposed method is also significantly better than the other methods in terms of average reward (P < 0.05). In terms of learning speed, from best to worst: A3CPG > A3C > PPO > DDPG. The proposed method is also significantly better than the other methods in terms of convergence speed (P < 0.05). It can be seen that the proposed A3CPG method can produce more effective target assignment results than the A3C method in the target assignment process, helping the agent to accelerate the learning process and obtain a better air defense strategy.
Figure 7 shows the training winning rate statistics, and the analysis combined with Figure 6 shows that the higher-reward methods have higher winning rates. The reward function is designed to guide the agent in learning the winning strategy. Consequently, a method that yields a high final reward is

Figure 7. Air defense combat battle winning rate after training.

Figure 8. Change curve of agent reward during the training process.
4.2.2. Mechanism Effectiveness Analysis

To evaluate the effectiveness of the proposed multihead attention mechanism, BiGRU mechanism, and TDRM, ablation experiments were conducted. The experimental design of the ablation experiment is shown in Table 4, which compares the effects of the three mechanisms proposed in this article on the reward. The experiment uses the A3C method as a baseline and compares the effectiveness of using a single mechanism with the A3C approach against the A3CPG approach, in which all three mechanisms are used. The results of the ablation experiment are shown in Figure 8.

As shown, the use of any mechanism with the A3C approach is able to enhance the reward compared to the baseline approach, indicating that the three proposed mechanisms are effective in the experimental scenario. In particular, the multihead attention mechanism displayed the best reward results and the fastest training speed of all the single-mechanism methods at the end of training. This indicated that the reward after training was higher for the A3C-M method than for the

Figure 9. Air defense combat scene during training, a) scene 1 and b) scene 2.
other two single-mechanism methods, and that the slope of the where N s is the number of consumed short-range missiles, and
reward curve was steeper for the A3C-M method. This could be N l is the number of consumed long-range missiles.
attributed to the addition of the multiheaded attention mecha- According to Figure 11, the ROT values of the two methods at
nism, which allowed the agent to analyze the battlefield situation 20 k episodes were 4.26 and 4.28, respectively, with no significant
information more efficiently, thus speeding up training and difference (P > 0.05). However, at 100 k, the ROT values for the
focusing on more important targets to obtain better air defense A3CPG and A3C methods were 1.91 and 2.31 respectively, with a
strategies. With the addition of different types of reward mech- statistically significant reduction of 17.32% for the A3CPG
anisms in the A3CPG approach, the trained reward of the agent method over the A3C method (P < 0.05). This improvement
was slightly improved compared to the A3C-B and baseline can be attributed to the potential game module embedded in
approaches. This could be due to the TDRM helping the agent the proposed A3CPG method, which can optimize the assign-
to apply the strategy in obtaining a more reasonable amount of ment of firefighting resources by assigning targets to the appro-
defensive force and using less long-range fire units and exposure, priate firefighting units. The observed decrease in ROT values for
thereby accelerating the convergence speed of the agent signifi- both methods during training suggests that the agent learnt
cantly. After the BiGRU mechanism was added, the convergence effective strategies to conserve firefighting resources.
speed of the agent was significantly accelerated relative to the According to Figure 12, the ROS values of the two methods
baseline. This is evidenced by the fact that the slope of the were 0.541 and 0.592 at 20 k, respectively, and increased to
A3C-B method before 48 episodes was significantly higher than 1.971 and 1.537 as training progressed to 100 k. The observed
that of the A3C-M, A3C-T, and baseline methods, almost reach- increase in ROS values suggests that the agent learnt to prefer-
ing the convergence level. The reason for this could be attributed entially use the short-range missile strategy. At 20 k, no signifi-
to the fact that the BiGRU mechanism had more capabilities for cant difference in ROS values was observed between the two
analyzing the air situation information, which allowed the agent methods (P > 0.05). However, after training, the A3CPG method
to have a better perception of the current situation and thus accel- showed a 28.24% improvement over the A3C method at 100k,
erate the training. However, the final reward was similar to the
baseline due to the lack of attention to the global situation. (a)
ROT ¼ N c =N e , (31)
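The ratio metrics and the reported relative changes can be reproduced with a few lines of code. The sketch below is illustrative only: the function names and the example missile counts are assumptions, and only the ROT/ROS values at 100k episodes are taken from the experiments above.

```python
def ratio_metric(numerator: float, denominator: float) -> float:
    """Generic consumption ratio, e.g., ROS = N_s / N_l."""
    if denominator == 0:
        raise ValueError("denominator count must be nonzero")
    return numerator / denominator

def relative_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, as a percentage."""
    return (new - old) / old * 100.0

# Hypothetical example: 63 short-range vs 41 long-range missiles consumed.
ros_example = ratio_metric(63, 41)

# Values reported at 100k episodes for A3CPG and A3C, respectively:
rot_a3cpg, rot_a3c = 1.91, 2.31
ros_a3cpg, ros_a3c = 1.971, 1.537

# ROT decreases (lower is better), so negate the change to get the reduction.
print(round(-relative_change(rot_a3cpg, rot_a3c), 2))  # 17.32 (% reduction)
print(round(relative_change(ros_a3cpg, ros_a3c), 2))   # 28.24 (% improvement)
```

This confirms that the 17.32% ROT reduction and the 28.24% ROS improvement quoted in the text are the relative changes of the A3CPG values with respect to the A3C values at 100k episodes.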
Figure 11. ROT comparison between different methods.

Figure 12. ROS comparison between different methods.

Acknowledgements
The authors would like to thank the anonymous reviewers, whose insightful comments greatly improved the quality of this article. The work described in this article was supported by the National Natural Science Foundation of China (grants: 62106283 and 52175282) and the Basic Natural Science Research Program of Shaanxi Province (grant: 2021JM-226).

Conflict of Interest
The authors declare no conflict of interest.

Keywords
air defense, deep reinforcement learning, intelligent decision-making, Nash equilibrium, potential games
[8] J. Yang, X. H. You, G. X. Wu, M. M. Hassan, A. Almogren, J. Guna, Future Gener. Comput. Syst. 2019, 95, 140.
[9] Z. X. Sun, H. Y. Piao, Z. Yang, Eng. Appl. Artif. Intell. 2021, 98, 104112.
[10] J. B. Li, X. J. Zhang, J. Wei, Z. Y. Ji, Z. Wei, Future Gener. Comput. Syst. 2022, 135, 259.
[11] D. Hu, R. Yang, Y. Zhang, L. Yue, M. Yan, J. Zuo, X. Zhao, Eng. Appl. Artif. Intell. 2022, 111, 104767.
[12] Z. Jiandong, Y. Qiming, S. Guoqing, L. Yi, W. Yong, J. Syst. Eng. Electron. 2021, 32, 1421.
[13] J. Y. Liu, G. Wang, Q. Fu, S. H. Yue, S. Y. Wang, Def. Technol. 2022, 19, 210.
[14] a) S. Gronauer, K. Diepold, Artif. Intell. Rev. 2022, 55, 895; b) G. A. Vouros, ACM Comput. Surv. 2023, 55, 92.
[15] a) B. M. Albaba, Y. Yildiz, IEEE Trans. Control Syst. Technol. 2022, 30, 885; b) K. Cao, L. H. Xie, IEEE Trans. Neural Netw. Learn. Syst. 2022, https://doi.org/10.1109/tnnls.2022.3148376.
[16] H. Yarahmadi, M. E. Shiri, H. Navidi, A. Sharifi, M. Challenger, Swarm Evol. Comput. 2023, 77, 101229.
[17] B. W. Zhang, X. L. Wang, R. Xie, C. C. Li, H. Z. Zhang, F. Jiang, Future Gener. Comput. Syst. 2023, 139, 17.
[18] X. C. Xu, K. Liu, P. L. Dai, F. Y. Jin, H. L. Ren, C. J. Zhan, S. T. Guo, J. Syst. Architect. 2023, 134, 102780.
[19] W. P. Duan, Z. Y. Tang, W. Liu, H. B. Zhou, Expert Syst. 40, e13191.
[20] Y. H. Zhu, D. B. Zhao, IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1228.
[21] B. W. Peng, A. Stancu, S. P. Dang, Z. T. Ding, IEEE Trans. Cybern. 2022, 52, 8897.
[22] B. L. Gao, L. Pavel, IEEE Trans. Automat. Contr. 2021, 66, 121.
[23] Y. X. Huang, S. F. Wu, Z. Y. Kang, Z. C. Mu, H. Huang, X. F. Wu, A. J. Tang, X. B. Cheng, Chin. J. Aeronaut. 2023, 36, 284.
[24] Q. G. Lu, X. F. Liao, S. J. Deng, H. Q. Li, IEEE Trans. Parallel Distrib. Syst. 2023, 34, 16.
[25] X. Y. Gong, J. Y. Yu, S. Lu, H. W. Lu, Inf. Sci. 2022, 582, 633.
[26] a) Y. R. Zhang, P. Hang, C. Huang, C. Lv, Adv. Intell. Syst. 2022, 4, 2100211; b) M. Diamant, S. Baruch, E. Kassem, K. Muhsen, D. Samet, M. Leshno, U. Obolski, Nat. Commun. 2021, 12, 1148.