Reinforcement Learning-Based Dynamic Anti-Jamming Power Control in UAV Networks: An Effective Jamming Signal Strength Based Approach
Abstract— Unmanned aerial vehicle (UAV) assisted air-to-ground (A2G) communication is vulnerable to malicious jamming due to the broadcast nature of wireless communications. In this letter, an anti-jamming power control framework with an unknown jamming model and unknown jamming power is proposed. In particular, the probability density function (PDF) of the effective jamming signal strength (EJSS) is first estimated via kernel density estimation (KDE). Then, utilizing the EJSS, a deep deterministic policy gradient (DDPG) based framework is proposed to acquire the power control strategy in real time. Moreover, a trajectory design scheme based on K-means++ is proposed to track the locations of the users. The simulation results show that the proposed framework yields an improved sum rate and energy efficiency over the reference schemes.

Index Terms— UAV, anti-jamming power control, unknown jamming model, deep deterministic policy gradient, kernel density estimation.

Manuscript received 30 June 2022; accepted 16 July 2022. Date of publication 21 July 2022; date of current version 10 October 2022. This work was supported in part by the National Natural Science Foundation of China under Grant 62071485, Grant 61901519, and Grant 62001513, and in part by the Basic Research Project of Jiangsu Province under Grant BK20192002 and the Natural Science Foundation of Jiangsu Province under Grant BK20201334 and Grant BK20200579. The associate editor coordinating the review of this letter and approving it for publication was F. Kara. (Corresponding author: Kui Xu.) The authors are with the College of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/LCOMM.2022.3193309

I. INTRODUCTION

AS SUPPLEMENTS to cellular mobile communication systems, unmanned aerial vehicles (UAVs) can greatly improve coverage performance due to their deployment flexibility [1]. However, the links between UAV base stations (BSs) and ground users are vulnerable to malicious jamming due to the broadcast nature of wireless transmission. This makes the anti-jamming problem a critical issue in UAV-assisted air-to-ground (A2G) communication systems.

Anti-jamming power control has often been investigated through game theory due to its noncooperative nature [2]–[6]. In [7], the authors proposed a beam-domain anti-jamming transmission scheme for the downlink massive multi-input multi-output (MIMO) system, in which the hierarchical interactions between the jammer and the BS were formulated as a Bayesian Stackelberg game. In [8], a noncooperative power control game was formulated and solved to determine the users' power investment in a dual communication environment. However, the drawbacks of these schemes are twofold: 1) they require the BS to perfectly know the jamming model between the smart jammer and the users [9], and this model is usually difficult to obtain; and 2) they are executed in an iterative manner that lowers the efficiency of making an optimal decision, especially in a time-varying environment.

Reinforcement learning (RL) is a promising approach for solving such a problem when the jamming model is not available in advance. In [10], a cooperative anti-jamming resource allocation algorithm was proposed, where the BS selected an appropriate power and sub-band based on deep RL to realize energy-efficient anti-jamming transmission. In [11], an RL-based algorithm was proposed to improve the quality of service of low-latency Visual Internet of Things (VIoT) video streaming in a jamming environment. In [9], the authors proposed a Q-learning based algorithm to optimize the utility of the BS in a millimeter-wave massive MIMO system with a smart jammer. However, most of the previous works assume discrete power levels and perfect estimation of the jamming power by the BS, which is not always available in practice.

Motivated by the above research, in this letter we model the interaction between a UAV and a smart jammer as a Markov decision process (MDP) and then use a deep RL-based algorithm to contend with such a problem. Considering the mobility of the users and the kinematic constraints of fixed-wing UAVs, we propose a location-based trajectory design scheme to track the users. Under the assumption that the system has no prior knowledge of the jamming model and jamming power, we use the jamming signal strength (JSS) as an indicator of the jamming strategy, and a kernel density estimation (KDE) based effective JSS (EJSS) estimation method is proposed. Moreover, unlike the above RL-based works, we address continuous power control with the deep deterministic policy gradient (DDPG). Simulation results show the superiority of the proposed framework compared with Q-learning and the benchmark algorithm.

II. SYSTEM MODEL

The UAV-assisted A2G communication system with a smart jammer is illustrated in Fig. 1. It consists of one fixed-wing UAV with M antennas and K single-antenna users located within a circle of radius r_d. The UAV serves as a BS to provide internet access for the users. The smart jammer is equipped with M antennas and wishes to jam the users with suitable power.

A. Channel Model

We consider a UAV that flies along a circular trajectory at a fixed altitude H_U. The time duration for the UAV to fly through one rotation is T, which is equally divided into N epochs. In the nth epoch, the locations of the UAV and of user k are denoted by (x_U(n), y_U(n), H_U) and (x_k(n), y_k(n), 0) in a three-dimensional (3D) Cartesian coordinate system, respectively. The channel vector between the UAV and user k is expressed as [12]

$h_{U,k}(n) = \beta_{U,k}(n)\, g_{U,k}(n), \quad n \in \{1, \ldots, N\},$   (1)
For the smart jammer, the agent is designed as follows:

1) State: the state of the jammer in epoch n is expressed as s_jam(n) = [P_U(n-1)], where P_U(n-1) is the transmission power of the UAV estimated by the jammer.

2) Action: the action of the jammer in epoch n is the total jamming power P_J(n).

3) Reward: the reward function of the jammer in epoch n is expressed as:

$R_{\mathrm{jam}}(n) = -\sum_{k=1}^{K} \log_2\bigl(1+\gamma_k(n)\bigr) - c_J P_J(n),$   (12)

where c_J denotes the transmission cost of the jammer.

III. ANTI-JAMMING POWER CONTROL FRAMEWORK WITHOUT THE JAMMING MODEL AND POWER

In this work, we model the interactions between the UAV and the jammer as an MDP, and, considering the mobility of the users and continuous power optimization, a KDE-based DDPG framework is proposed to realize anti-jamming power control at the UAV.

A. Trajectory Design Based on K-Means++

Considering the flight payload and the on-board battery, a fixed-wing UAV is adopted. Thus, the UAV must maintain forward flight to remain aloft [16], and the majority of studies on trajectory optimization [17] are not applicable to this scenario.

Instead, we use the location information of the users to determine a circular trajectory, which makes the UAV capable of adapting its trajectory to track the users and provide better service. First, we determine the cluster trajectory center [cc_x, cc_y] and radius r_{c,uav}(n+1) with K-means++ [18]. Then, we make the actual trajectory track the change of the cluster trajectory slowly, i.e., the next center and radius are updated by [c_x(n+1), c_y(n+1)] = 0.5[c_x(n), c_y(n)] + 0.5[cc_x, cc_y] and r_uav(n+1) = r_uav(n) + δ_r, where δ_r = r_{c,uav}(n+1) - r_{c,uav}(n) denotes the change of the cluster radius, and [c_x(n), c_y(n)], r_uav(n), and r_{c,uav}(n) are the current center, radius, and cluster trajectory radius, respectively.
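The following Python sketch illustrates one possible implementation of this trajectory update. It is only a sketch under stated assumptions: the letter does not specify how many clusters K-means++ uses or how the cluster radius r_{c,uav} is obtained, so a single cluster and the mean user-to-centroid distance are assumed here, and the function and variable names are ours.

```python
# Hypothetical sketch of the trajectory update in Section III-A.
# Assumptions: a single K-means++ cluster; the cluster radius is taken as
# the mean user distance from the centroid.
import numpy as np
from sklearn.cluster import KMeans

def cluster_center_and_radius(user_xy: np.ndarray):
    """user_xy: (K, 2) array of user positions in the current epoch."""
    km = KMeans(n_clusters=1, init="k-means++", n_init=10).fit(user_xy)
    center = km.cluster_centers_[0]                       # [cc_x, cc_y]
    radius = np.mean(np.linalg.norm(user_xy - center, axis=1))
    return center, radius

def update_trajectory(center_n, r_uav_n, rc_uav_n, user_xy):
    """One update of the UAV's circular trajectory (applied every N epochs)."""
    cc, rc_next = cluster_center_and_radius(user_xy)
    center_next = 0.5 * np.asarray(center_n) + 0.5 * cc   # slow tracking of the cluster center
    r_uav_next = r_uav_n + (rc_next - rc_uav_n)           # r_uav(n+1) = r_uav(n) + delta_r
    return center_next, r_uav_next, rc_next
```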
B. The KDE-Based EJSS Estimation

We define the term $\frac{P_J}{K}\bigl[(\mathbf{H}_J(n)\mathbf{H}_J(n)^{H})^{-1}\bigr]_{k,k}$ in the denominator of (11) as the JSS at user k, denoted by P̄_{J,k}. We consider the JSS as an indicator of the jammer's current strategy, and it is used to construct the state space of the UAV in the next subsection. In practice, the JSS can be estimated from the SINR fed back by the users. However, due to channel fluctuations, the JSS may have a large dynamic range, which results in a massive state space for the UAV. In this subsection, we propose an EJSS estimation algorithm based on KDE to address this problem.

In particular, we define the total EJSS over all K users as P̂_J, which can be estimated from the JSS P̄_J = Σ_{k=1}^{K} P̄_{J,k}. To compress the dynamic range and reduce the variance of the JSS, a logarithmic operation is applied, i.e., P̃_J(i) = 10 log_10 P̄_J(i), 1 ≤ i ≤ n. Moreover, we obtain the probability density function (PDF) of the compressed JSS P̃_J by applying KDE to the historical values of the JSS.

KDE is a nonparametric method for estimating probability densities [19], [20]. If x_1, x_2, ..., x_n is a set of independent sample points drawn from an unknown density f_h(x), the estimate of f_h(x) can be obtained as follows:

$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} O\!\left(\frac{x - x_i}{h}\right),$   (13)

where h denotes the bandwidth parameter and O(·) denotes the kernel function, which can be selected as a Gaussian kernel or a rectangular kernel [21].

We use KDE to estimate the PDF of the compressed JSS samples, i.e., P̃_J(1), P̃_J(2), ..., P̃_J(n). After the PDF is estimated, the dominant interval [P̃_{J,min}, P̃_{J,max}] of the compressed JSS is defined as the minimum interval that satisfies the condition

$\int_{\tilde{P}_{J,\min}}^{\tilde{P}_{J,\max}} \hat{f}_h(x)\,dx \;\ge\; \eta \int_{-\infty}^{+\infty} \hat{f}_h(x)\,dx,$   (14)

where the parameter η is a positive value close to 1. Then, the UAV can compress the JSS into the dominant interval, and outliers are eliminated.

The KDE-based estimation of the EJSS in epoch n+1 is summarized in Algorithm 1.

Algorithm 1 The KDE-Based Estimation of the EJSS
Input: the JSS quantization level L_num, the kernel function O(·), the ratio η, and the JSS P̄_J(n+1) observed in epoch n+1.
1: Perform logarithmic compression on the historical JSS data, i.e., P̃_J(i) = 10 log_10 P̄_J(i), 1 ≤ i ≤ n.
2: Estimate the PDF f̂_h of the compressed JSS from P̃_J(i), 1 ≤ i ≤ n, by KDE.
3: Determine the dominant interval [P̃_{J,min}, P̃_{J,max}] of the data according to (14) and the pre-defined ratio η.
4: Quantize the dominant interval into L_num levels, with quantization step ξ = (P̃_{J,max} - P̃_{J,min})/(L_num - 1).
5: The quantized value P̂_J(n+1) = (P̃_J(n+1) - P̃_{J,min})/ξ is defined as the EJSS.
Output: the EJSS P̂_J(n+1).
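A minimal Python sketch of Algorithm 1 is given below. It assumes a Gaussian kernel (via scipy.stats.gaussian_kde) and approximates the minimum dominant interval of (14) by trimming (1 - η)/2 of the estimated probability mass from each tail; both choices, and all identifiers, are ours rather than the letter's.

```python
# Hypothetical sketch of Algorithm 1 (KDE-based EJSS estimation).
# Assumption: the dominant interval of (14) is approximated by trimming
# (1 - eta)/2 of the estimated probability mass from each tail.
import numpy as np
from scipy.stats import gaussian_kde

def estimate_ejss(jss_history, jss_new, L_num=16, eta=0.95, n_grid=2048):
    """jss_history: P_bar_J(1..n); jss_new: P_bar_J(n+1). Returns the EJSS level."""
    p_tilde = 10.0 * np.log10(np.asarray(jss_history))        # step 1: log compression
    kde = gaussian_kde(p_tilde)                                # step 2: PDF estimate (Gaussian kernel)

    grid = np.linspace(p_tilde.min(), p_tilde.max(), n_grid)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]                                             # normalized CDF on the grid
    tail = (1.0 - eta) / 2.0
    p_min = grid[np.searchsorted(cdf, tail)]                   # step 3: dominant interval
    p_max = grid[np.searchsorted(cdf, 1.0 - tail)]

    xi = (p_max - p_min) / (L_num - 1)                         # step 4: quantization step
    p_new = np.clip(10.0 * np.log10(jss_new), p_min, p_max)    # clip outliers to the interval
    return int(round((p_new - p_min) / xi))                    # step 5: quantized EJSS
```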
C. The DDPG Based Power Control Framework

In the proposed framework, the design of the UAV agent is formulated as follows:

1) State: the state of the UAV in epoch n is expressed as s_uav(n) = [r(n-1), P̂_J(n-1)], where r(n-1) = Σ_{k=1}^{K} log_2(1 + γ_k(n-1)) is the sum rate of the users and P̂_J(n-1) is the EJSS estimated by the UAV.

2) Action: the action of the UAV in epoch n is the total transmission power P_U(n).

3) Reward: the reward function of the UAV in epoch n is expressed as:

$R_{\mathrm{uav}}(n) = \sum_{k=1}^{K} \log_2\bigl(1+\gamma_k(n)\bigr) - c_U P_U(n),$   (15)

where c_U denotes the transmission cost of the UAV.
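As a small illustration, the reward in (15) can be computed from the users' SINRs as follows. This is a sketch: the function name is ours, and the default cost c_U = 15 is taken from the simulation setup in Section IV.

```python
# Minimal sketch of the UAV reward in (15); sinr_users holds gamma_k for the
# current epoch and c_u weights the transmission cost.
import numpy as np

def uav_reward(sinr_users, power_uav, c_u=15.0):
    sum_rate = np.sum(np.log2(1.0 + np.asarray(sinr_users)))  # sum_k log2(1 + gamma_k)
    return sum_rate - c_u * power_uav
```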
We consider continuous power optimization; thus, value-based approaches are not applicable because they can only generate discrete actions.

DDPG is an actor-critic (AC) algorithm. It concurrently learns a policy network approximation μ(s|θ^μ), i.e., the actor network, and a Q-function network approximation Q(s, a|θ^Q), called the critic. Since the output of the policy network in DDPG directly maps states to actions, instead of computing a probability distribution across a discrete action space, it is more suitable for a continuous action space.

Because the target policy in DDPG is deterministic, the Q-function is trained using the Bellman equation

$Q^{\mu}(s, a) = \mathbb{E}_{(s,a,r,s') \in \mathcal{D}}\left[ r(s, a) + \gamma Q^{\mu}\bigl(s', \mu(s'|\theta^{\mu})\bigr) \right],$   (16)

where s' is the next state of the UAV and γ denotes the discount factor. D is the experience replay buffer: in each epoch, the tuple (s, a, r, s') is stored in the replay buffer. At each timestep, the actor and critic are updated by sampling a minibatch uniformly from the buffer, in which the samples are assumed to be independently and identically distributed.

Moreover, because the updated network Q^μ(s, a) is also used in calculating the target value (16), it is difficult for the Q update to achieve convergence. To overcome this drawback, we create a copy of the actor and critic networks, namely the target actor network μ'(s|θ^{μ'}) and the target critic network Q'(s, a|θ^{Q'}), that are used for calculating the target values. Thus, for each transition (s(n), a(n), r(n), s(n+1)) in the minibatch, the target value y(n) is calculated as

$y(n) = r(n) + \gamma\, Q'\!\left(s(n+1), \mu'(s(n+1)|\theta^{\mu'}) \,\middle|\, \theta^{Q'}\right).$   (17)

The critic network is updated by minimizing the loss L(θ):

$L(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r}\left( Q(s(i), a(i)|\theta^{Q}) - y(i) \right)^2,$   (18)

where N_r is the size of a minibatch. Then, the actor network update using the sampled gradient is expressed as:

$\nabla_{\theta^{\mu}} \mu \big|_{s(n)} \approx \frac{1}{N_r}\sum_{i=1}^{N_r} \nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s(i),\, a=\mu(s(i))}\; \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s(i)}.$   (19)

In the deep Q-network (DQN) algorithm, the weights of the target network are regularly copied from the current network, which decreases the stability of training. In the DDPG algorithm, a "soft update" is used to address this issue, which means that the weights of the target actor and critic networks are updated by having them slowly track the learned networks:

$\theta^{Q'} \leftarrow \varsigma\,\theta^{Q} + (1-\varsigma)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \varsigma\,\theta^{\mu} + (1-\varsigma)\,\theta^{\mu'},$   (20)

where ς ∈ (0, 1) denotes the soft coefficient that determines the update speed of the target networks.

To explore the continuous action space effectively, we construct an exploration policy μ_e by adding noise sampled from a noise process N to the actor policy:

$\mu_e(s) = \mu(s|\theta^{\mu}) + \mathcal{N},$   (21)

where N can be chosen according to the environment.

Based on the above, the DDPG based anti-jamming power control framework is presented in Algorithm 2.

Algorithm 2 DDPG Based Anti-Jamming Power Control Framework
Initialization: Randomly initialize the actor network μ(s|θ^μ) and the critic network Q(s, a|θ^Q) with weights θ^μ and θ^Q. Initialize the target networks μ' and Q' with weights θ^{μ'} ← θ^μ and θ^{Q'} ← θ^Q. Initialize the replay buffer D. Initialize a random process N for action exploration and randomly generate the initial state s_uav(1).
1: Generate the initial trajectory with center [0, 0] and a pre-defined radius.
2: for n = 1 : N_epoch do
3:   Select action a(n) = μ(s|θ^μ) + N.
4:   Execute action a(n) and observe the reward R_uav(n).
5:   Estimate the EJSS P̂_J(n) following Algorithm 1 and construct the next state s_uav(n+1).
6:   Store the transition (s_uav(n), a(n), R_uav(n), s_uav(n+1)) in the replay buffer D.
7:   Sample a minibatch of N_r transitions randomly from the buffer D.
8:   Calculate the target value y(n) following (17).
9:   Update the weights of the critic network θ^Q by minimizing the loss in (18).
10:  Update the weights of the actor network θ^μ with the sampled gradient in (19).
11:  Update the target critic network θ^{Q'} and the target actor network θ^{μ'} following (20).
12:  if n = kN, k ∈ {1, ..., N_epoch/N} then
13:    Update the trajectory according to Part A, Section III.
14:  end if
15: end for
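The update rules (17)–(21) can be summarized in the following PyTorch sketch. The hidden-layer sizes, the Adam optimizers, the Gaussian exploration noise, and the soft-update coefficient ς = 0.005 are assumptions; only the state and action dimensions, the discount factor, and the learning rates are taken from the letter.

```python
# Hypothetical PyTorch sketch of one DDPG update step, following (17)-(21).
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, out_dim))
    def forward(self, x):
        return self.net(x)

state_dim, action_dim, gamma, varsigma = 2, 1, 0.95, 0.005      # varsigma: soft-update coefficient (assumed)
actor, critic = MLP(state_dim, action_dim), MLP(state_dim + action_dim, 1)
actor_t, critic_t = MLP(state_dim, action_dim), MLP(state_dim + action_dim, 1)
actor_t.load_state_dict(actor.state_dict()); critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=5e-4)           # actor learning rate from Section IV
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)          # critic learning rate from Section IV

def select_action(s, p_max=1.5, noise_std=0.1):
    """Exploration policy (21): deterministic actor output plus noise, clipped to [0, P_max]."""
    with torch.no_grad():
        a = actor(s) + noise_std * torch.randn(action_dim)
    return a.clamp(0.0, p_max)

def update(batch):
    s, a, r, s2 = batch                                          # tensors of shape (N_r, dim); r is (N_r, 1)
    with torch.no_grad():                                        # target value, Eq. (17)
        y = r + gamma * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
    critic_loss = ((critic(torch.cat([s, a], dim=1)) - y) ** 2).mean()   # Eq. (18)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()         # ascends the sampled gradient in (19)
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    for tgt, src in ((actor_t, actor), (critic_t, critic)):              # soft update, Eq. (20)
        for pt, p in zip(tgt.parameters(), src.parameters()):
            pt.data.mul_(1.0 - varsigma).add_(varsigma * p.data)
```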
IV. SIMULATION RESULTS

In this section, the performance of the proposed framework is presented and compared with Q-learning and the benchmark algorithm (the benchmark algorithm was proposed in [9] and is simulated without EJSS estimation and trajectory optimization). The UAV flies over a circular mission area with a radius of r_d = 500 m, either along the trajectory determined by the scheme in Part A, Section III, or along a pre-defined trajectory with a radius of 250 m. The smart jammer is located at the center of the mission area. The altitude of the UAV is H_U = 100 m. The number of antennas employed by the UAV and by the jammer is M = 16. The number of users is K = 10. The parameters of the A2G channels are G_array = 8 [14], a = 9.61, b = 0.16, η_LoS = 1, and η_NLoS = 20. The path loss at the reference distance is τ_0 = -40 dB. The noise power is σ² = -110 dBm. ψ is a log-normal random variable with a standard deviation of σ_shadow = 8 dB. κ = 3.8 denotes the path loss factor. The maximal transmission power of the UAV is P_U^max = 1.5 W and can be continuously adjusted. The maximal jamming power is P_J^max = 150 W, which is equally divided into L_num = 16 levels. In the DDPG algorithm, the learning rates of the actor and critic are 0.0005 and 0.001, respectively. The discount factor is γ = 0.95. The ratio η used in KDE is 0.95. The number of epochs in one flying cycle is N = 200, and N_epoch = 10^5. The transmission costs are c_U = 15 and c_J = 1. In addition, we assume that the users move independently and randomly during each epoch. The CSI error at the jammer follows CN(0, 0.1).
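For convenience, the setup above can be collected in a single configuration object; the values are taken directly from the text, while the field names (and the reading of the array-gain symbol) are ours.

```python
# Simulation setup of Section IV as a configuration dictionary (sketch).
SIM_CONFIG = {
    "mission_radius_m": 500.0, "fixed_trajectory_radius_m": 250.0,
    "uav_altitude_m": 100.0, "num_antennas": 16, "num_users": 10,
    "array_gain": 8, "a": 9.61, "b": 0.16, "eta_los_db": 1, "eta_nlos_db": 20,
    "ref_path_loss_db": -40, "noise_power_dbm": -110, "shadowing_std_db": 8,
    "path_loss_exponent": 3.8, "p_uav_max_w": 1.5, "p_jam_max_w": 150.0,
    "jss_levels": 16, "lr_actor": 5e-4, "lr_critic": 1e-3,
    "discount_factor": 0.95, "kde_eta": 0.95,
    "epochs_per_cycle": 200, "total_epochs": 100_000,
    "cost_uav": 15.0, "cost_jammer": 1.0, "csi_error_var": 0.1,
}
```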
The time complexity of a deep neural network is represented by the number of floating-point operations (FLOPs). Hence, the FLOPs of the proposed DDPG during inference are

$\mathrm{FLOPs} = 2\cdot(128\times|\mathcal{S}| + 64\times|\mathcal{A}|) + 32768 = 2\cdot(128\times 2 + 64\times 1) + 32768 = 33408.$   (22)
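As a quick sanity check, the count in (22) can be reproduced as follows; keeping the 32768-FLOP term outside the factor of two is our reading of the expression, chosen so that the stated total of 33408 is obtained.

```python
# Reproduce the inference-cost count in (22) for |S| = 2 and |A| = 1.
S_DIM, A_DIM = 2, 1
flops = 2 * (128 * S_DIM + 64 * A_DIM) + 32768
assert flops == 33408
```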
Fig. 2 shows the average sum rate as a function of the number of flying cycles. It can be seen that the scheme with the adaptive trajectory achieves an average sum rate gain of 2 bit/s/Hz for Q-learning and 0.5 bit/s/Hz for DDPG, due to its ability to track the users' real-time locations.

Fig. 3 compares the outage probability (the fraction of users whose SINR is less than 1) of each scheme. The scheme based on DDPG and trajectory optimization converges to the lowest outage probability. However, this metric does not converge to zero, because the smart jammer also optimizes its jamming policy according to the ongoing transmission.

Moreover, the proposed DDPG framework achieves a lower outage probability with less energy consumption, as shown in Fig. 4. This advantage may derive from the continuous power optimization approach, which could decrease the estimation precision at the jammer because the jammer adopts a discrete state space.
Furthermore, compared with the Q-learning algorithm, the DDPG based framework can reduce the power requirement by 38% without incurring a performance loss. Thus, the DDPG based framework can achieve higher energy efficiency, as shown in Fig. 5, which is more attractive for energy-limited UAVs.

In addition, the performance improves substantially when the jammer has imperfect CSI. The reason is that the CSI error makes the jammer unable to achieve effective precoding; thus, the jamming efficiency decreases, and the UAV can achieve better performance with less energy consumption.

Fig. 2. Average sum rate.
Fig. 3. Outage probability.
Fig. 4. Energy consumption.
Fig. 5. Energy efficiency.

V. CONCLUSION

In this letter, an anti-jamming DDPG based framework is proposed to optimize the anti-jamming power control strategy of the UAV. Considering the mobility of the users, a trajectory optimization scheme based on K-means++ is used, and a KDE-based EJSS estimation approach is proposed to address the difficulty the UAV faces in obtaining the jamming power. Simulation results show that the proposed DDPG based framework with the EJSS and adaptive trajectory can achieve better performance with lower energy consumption than the Q-learning and benchmark algorithms.
REFERENCES

[1] B. Ji et al., "Several key technologies for 6G: Challenges and opportunities," IEEE Commun. Standards Mag., vol. 5, no. 2, pp. 44–51, Jun. 2021.
[2] D. Yang et al., "Coping with a smart jammer in wireless networks: A Stackelberg game approach," IEEE Trans. Wireless Commun., vol. 12, no. 8, pp. 4038–4047, Aug. 2013.
[3] L. Xiao et al., "Anti-jamming transmission Stackelberg game with observation errors," IEEE Commun. Lett., vol. 19, no. 6, pp. 949–952, Jun. 2015.
[4] L. Jia et al., "Bayesian Stackelberg game for antijamming transmission with incomplete information," IEEE Commun. Lett., vol. 20, no. 10, pp. 1991–1994, Oct. 2016.
[5] Y. Xu et al., "Anti-jamming transmission in UAV communication networks: A Stackelberg game approach," in Proc. IEEE/CIC Int. Conf. Commun. China (ICCC), Chengdu, China, Oct. 2017, pp. 1–6.
[6] A. Garnaev et al., "A power control game with uncertainty on the type of the jammer," in Proc. IEEE Global Conf. Signal Inf. Process. (GlobalSIP), Ottawa, ON, Canada, Nov. 2019, pp. 1–5.
[7] Z. Shen et al., "Beam-domain anti-jamming transmission for downlink massive MIMO systems: A Stackelberg game perspective," IEEE Trans. Inf. Forensics Security, vol. 16, pp. 2727–2742, 2021.
[8] P. Vamvakas et al., "Exploiting prospect theory and risk-awareness to protect UAV-assisted network operation," EURASIP J. Wireless Commun. Netw., vol. 2019, no. 1, pp. 286–305, Dec. 2019.
[9] Z. Xiao et al., "Learning based power control for mmWave massive MIMO against jamming," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Abu Dhabi, United Arab Emirates, Dec. 2018, pp. 1–6.
[10] Y. Li et al., "Power and frequency selection optimization in anti-jamming communication: A deep reinforcement learning approach," in Proc. IEEE 5th Int. Conf. Comput. Commun. (ICCC), Changchun, China, Dec. 2019, pp. 815–820.
[11] Y. Xiao et al., "Learning-based low-latency VIoT video streaming against jamming and interference," IEEE Wireless Commun., vol. 28, no. 4, pp. 12–18, Aug. 2021.
[12] X. Xia et al., "Toward digitalizing the wireless environment: A unified A2G information and energy delivery framework based on binary channel feature map," IEEE Trans. Wireless Commun., early access, Feb. 10, 2022, doi: 10.1109/TWC.2022.3149636.
[13] A. Al-Hourani et al., "Optimal LAP altitude for maximum coverage," IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, Dec. 2014.
[14] L. Liu et al., "CoMP in the sky: UAV placement and movement optimization for multi-user communications," IEEE Trans. Commun., vol. 67, no. 8, pp. 5645–5658, Aug. 2019.
[15] H. Q. Ngo et al., "Energy and spectral efficiency of very large multiuser MIMO systems," IEEE Trans. Commun., vol. 61, no. 4, pp. 1436–1449, Apr. 2013.
[16] Q. Song et al., "A survey of prototype and experiment for UAV communications," Sci. China Inf. Sci., vol. 64, no. 4, pp. 1–21, Feb. 2021.
[17] X. Wang et al., "Jamming-resilient path planning for multiple UAVs via deep reinforcement learning," in Proc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), Montreal, QC, Canada, Jun. 2021, pp. 1–6.
[18] J. An and F. Zhao, "Trajectory optimization and power allocation algorithm in MBS-assisted cell-free massive MIMO systems," IEEE Access, vol. 9, pp. 30417–30425, 2021.
[19] D. W. Scott, Multivariate Density Estimation. Hoboken, NJ, USA: Wiley, 1992.
[20] R. O. Duda, Pattern Classification. Hoboken, NJ, USA: Wiley, 2000.
[21] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
[22] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015, arXiv:1509.02971.