
IEEE Communications Letters

Deep Q-Network Enabled Low Complexity Beam Alignment
for MmWave Massive MIMO System

Journal: IEEE Communications Letters
Manuscript ID: CL2024-2666
Manuscript Type: Original Article
EDICS: CL1.2 Wireless Communications; CL1.2.2 Wireless Networks < CL1.2 Wireless Communications; CL1.2.6 Wireless Systems < CL1.2 Wireless Communications
Key Words: Artificial intelligence, Beams, Millimeter wave communication, MIMO systems, Road vehicles

Deep Q-Network Enabled Low Complexity Beam Alignment for
MmWave Massive MIMO System

Author 1, Author 2, Author 3, Author 4

Abstract—In this letter, we first model the beam alignment problem in millimeter-wave (mmWave) massive MIMO systems as a Markov decision process (MDP). Then, by leveraging the locations of the target vehicle and its neighboring vehicles, we propose a deep Q-network (DQN)-based beam alignment algorithm with low computational complexity. In addition, to address the issue of model obsolescence caused by changes in traffic density, an online mode is elaborated for the proposed DQN, aiming to adapt efficiently to the environment in practical scenarios. Simulation results show that the proposed DQN scheme achieves better alignment performance and lower complexity than the existing baseline schemes.

Index Terms—Beam alignment, millimeter-wave (mmWave), massive MIMO, deep Q-network (DQN).

Fig. 1. The downlink mmWave communication system.
I. INTRODUCTION

MILLIMETER-WAVE (mmWave) technology has emerged as one of the candidates for 6G wireless communications, owing to its potential for achieving high data rates [1]. However, it suffers from severe spatial path loss due to its high carrier frequency. To combat this path loss, the transmitter and receiver need to be equipped with large antenna arrays to achieve beamforming gains. To ensure a high beamforming gain, the link between the transmitter and receiver should achieve maximum signal strength, which is known as beam alignment. Traditional exhaustive and hierarchical search methods require significant beam training overhead, making them impractical for massive multi-input multi-output (MIMO) systems [2], [3].

Reinforcement learning has increasingly demonstrated promising potential in beam alignment owing to its excellent adaptive learning capability. Moreover, through online training it eliminates the need for the large amounts of data that supervised learning requires in advance. To reduce the training overhead, [4] proposed a deep Q-network (DQN)-based beam training scheme, which used beam image construction to sense the environment. The multi-armed bandit model is widely used as a lightweight form of reinforcement learning. For example, an intelligent beam alignment scheme was proposed to reduce latency in [5], where the training beams were selected using the upper confidence bound strategy. In [6], based on the quality of the current operating downlink beam measured by ACK/NACK packets, the online reinforcement learning technique called Thompson sampling is used to select new downlink beams.

The highly directional nature of mmWave communication makes the beam alignment problem inherently dependent on the locations of the transmitter and receiver, as well as the surrounding environment's geometry [7]–[12]. Specifically, [7] used light detection and ranging data to guide the beam prediction task. In [8], a panoramic point cloud built from images taken by the base station's camera was used for beam selection. Unfortunately, the acquisition of such sensing data is costly. Position information, which is readily available in contemporary wireless systems, has also been studied for beam alignment [9], [10]. However, these methods focus solely on the target vehicle's knowledge, neglecting information from other objects that impact the channel. This omission makes it difficult to predict non-line-of-sight links. To address this disadvantage, [11] utilized the location and type of all vehicles to learn the optimal beam pair index, and [12] employed computer vision to extract the size and position information of dynamic vehicles around the target user for beam alignment. These studies demonstrate that a richer set of vehicle information can effectively identify promising beamforming directions. However, considering the dynamics of wireless communication channels and the complexity of data collection, these existing supervised learning-based schemes are challenging to implement in real-world scenarios.

Each scheme mentioned above has its own drawbacks, which motivates us to investigate alternative schemes. In this letter, we propose an environment-aware and low-complexity approach for beam alignment. We utilize an advanced DQN algorithm that leverages the locations of the target vehicle and its neighboring vehicles. To address the issue of model obsolescence resulting from fluctuations in traffic density, an online mode is developed to adapt to the changed environment in real-world scenarios. Our simulation results demonstrate that the proposed scheme outperforms the benchmark methods.

II. SYSTEM MODEL

We consider the downlink mmWave massive MIMO system depicted in Fig. 1, where half-wave spaced uniform planar arrays (UPAs) with Nt and Nr antennas are equipped at the transmitter and receiver, respectively.
The sizes of the UPAs are denoted as Nt = Nr = Na × Nb, where Na and Nb represent the numbers of antennas in the vertical and horizontal dimensions, respectively. The antenna array response vector is given by

$$\alpha(\theta,\phi)=\frac{1}{\sqrt{N_a N_b}}\Big[1,\cdots,e^{j\pi\left(p\cos\theta+q\sin\theta\sin\phi\right)},\cdots,e^{j\pi\left((N_a-1)\cos\theta+(N_b-1)\sin\theta\sin\phi\right)}\Big]^{T} \quad (1)$$

where θ and ϕ denote the elevation and azimuth with respect to the array plane, and 0 ≤ p ≤ Na − 1, 0 ≤ q ≤ Nb − 1.

We adopt a geometric channel model between the base station (BS) and the user equipment (UE). The channel matrix H ∈ C^(Nr×Nt) can be written as

$$\mathbf{H}=\sum_{m=1}^{M}\beta_{m}\,\alpha\left(\theta_{m}^{r},\phi_{m}^{r}\right)\alpha^{H}\left(\theta_{m}^{t},\phi_{m}^{t}\right) \quad (2)$$

where β_m is the complex gain of the m-th path, and θ_m^r and ϕ_m^r are the elevation and azimuth of the angle-of-arrival. Likewise, θ_m^t and ϕ_m^t are the elevation and azimuth of the angle-of-departure. It is assumed that the BS and the UE each have a single RF chain at the mmWave band to reduce expenditure. To simplify the beamforming procedure, we use Discrete Fourier Transform (DFT) codebooks F = {f1, . . . , fNt} for the BS beamformer and W = {w1, . . . , wNr} for the UE beamformer [11]. Then, the received downlink signal can be expressed as

$$y=\sqrt{P_t}\,\mathbf{w}^{H}\mathbf{H}\mathbf{f}\,s+\mathbf{w}^{H}\mathbf{n} \quad (3)$$

where Pt is the transmission power, f and w are the beamforming vectors, and s represents the transmitted signal with unit power. n ∼ CN(0, σ²I) is the additive Gaussian noise.

The primary objective of beam alignment is to identify the optimal beam pair (w⋆, f⋆) that maximizes the received power. (w⋆, f⋆) can be obtained by

$$(\mathbf{w}^{\star},\mathbf{f}^{\star})=\arg\max_{\mathbf{w}\in\mathcal{W},\,\mathbf{f}\in\mathcal{F}}\ \left\|\mathbf{w}^{H}\mathbf{H}\mathbf{f}\right\|^{2}. \quad (4)$$
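For concreteness, the short NumPy sketch below instantiates the array response (1), a synthetic geometric channel in the form of (2), DFT codebooks, and the exhaustive search (4). The 4 × 4 panel size and the random gains and angles are illustrative placeholders, not the 16 × 16 arrays and QuaDRiGa channels used in Section IV.

```python
import numpy as np

rng = np.random.default_rng(0)
Na, Nb = 4, 4                       # illustrative panel size (Section IV uses 16 x 16)
Nt = Nr = Na * Nb

def upa_response(theta, phi):
    """Array response vector (1) of a half-wave spaced Na x Nb UPA."""
    p = np.arange(Na)[:, None]      # vertical antenna index 0..Na-1
    q = np.arange(Nb)[None, :]      # horizontal antenna index 0..Nb-1
    phase = np.pi * (p * np.cos(theta) + q * np.sin(theta) * np.sin(phi))
    return np.exp(1j * phase).ravel() / np.sqrt(Na * Nb)

def geometric_channel(M=25):
    """Channel matrix (2): superposition of M paths with random gains and angles."""
    H = np.zeros((Nr, Nt), dtype=complex)
    for _ in range(M):
        beta = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        a_r = upa_response(rng.uniform(0, np.pi), rng.uniform(-np.pi, np.pi))
        a_t = upa_response(rng.uniform(0, np.pi), rng.uniform(-np.pi, np.pi))
        H += beta * np.outer(a_r, a_t.conj())
    return H

def dft_codebook(Na, Nb):
    """Unit-norm 2-D DFT beams as columns (Kronecker product of two 1-D DFT matrices)."""
    Da = np.fft.fft(np.eye(Na)) / np.sqrt(Na)
    Db = np.fft.fft(np.eye(Nb)) / np.sqrt(Nb)
    return np.kron(Da, Db)

F_cb, W_cb = dft_codebook(Na, Nb), dft_codebook(Na, Nb)   # BS and UE codebooks
H = geometric_channel()

# Exhaustive search of (4): entry (i, j) is |w_i^H H f_j|^2 for beam pair (w_i, f_j).
power = np.abs(W_cb.conj().T @ H @ F_cb) ** 2
i_opt, j_opt = np.unravel_index(np.argmax(power), power.shape)
print(f"optimal pair (w_{i_opt}, f_{j_opt}), received power {power[i_opt, j_opt]:.3f}")
```

Sweeping all Nt·Nr pairs in this way is exactly the training overhead that the reduced action space and the learned policy of Section III are designed to avoid.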
III. DQN-BASED LOCATION CONTEXT-ASSISTED BEAM ALIGNMENT ALGORITHM

A. MDP Modeling

A Markov decision process (MDP) describes the decision-making process of an agent over a discrete time series and is characterized by the tuple ε = {S, A, P, R}, where S, A, P, and R are the state space, action space, transition probabilities, and rewards, respectively. After the agent takes action at ∈ A in state st ∈ S at time step t, it obtains an immediate reward rt, and the environment transitions to a new state st+1 ∈ S with transition probability p(st+1 | st, at). R = {rt | st ∈ S, at ∈ A} is the set of rewards over all possible state-action pairs. The beam alignment problem is modeled as an MDP by elaborately designing the state space, action space, and reward function.

1) State: For an mmWave massive MIMO communication system with a fixed BS location, the wireless channel and all its related channel knowledge, denoted as H, are determined by the radio wave properties (e.g., the carrier frequency), the location of the UE, and the radio propagation environment [13]. In the considered propagation environment, we disregard pedestrians and bicycles due to their lower profile compared to larger vehicles. Thus, static buildings and dynamic vehicles are the main sources of channel reflection and blockage [11].

Since buildings are permanent structures, identifying the UE's location inherently reveals the surrounding static environment. Therefore, the key factors affecting H are the UE's location and the nearby dynamic vehicles. Given the similarity in vehicle sizes and the regularity of the lane layout, the shape and type of the vehicles can be characterized using location information [11]. Therefore, for a specified radio wave, the channel matrix can be simplified as

$$\mathbf{H}=f\left(L,E_{\mathrm{veh}}\right) \quad (5)$$

where f(·, ·) is a function that maps the UE's location L and the dynamic vehicles' locations Eveh to the channel knowledge. We call L and Eveh the location context. The proposed approach is based on the assumption that the BS is aware of the locations of all vehicles and that this location information can be fed back via low-frequency links [14].

We define the location of the i-th vehicle as (xi, yi, zi), where xi and yi denote the horizontal and vertical distances from the BS, respectively, and zi represents the height of the i-th vehicle. Therefore, the state at time step t is defined as

$$s_{t}=\big[\underbrace{(x_{1},y_{1},z_{1})}_{L};\underbrace{(x_{2},y_{2},z_{2});\ldots;(x_{V_{\max}},y_{V_{\max}},z_{V_{\max}})}_{E_{\mathrm{veh}}}\big]^{T} \quad (6)$$

where (xi, yi, zi) represents the i-th element of st, and Vmax is the maximum number of supported vehicles in the system. If the number of vehicles v ≤ Vmax, the last (Vmax − v) elements of st are each set to 0_(1×3), where 0_(1×3) represents a 1 × 3 all-zero vector.
in state st ∈ S at time step t, it obtains an immediate [8], [10], [11], is unrealistic due to dimensional explosions.
47
reward rt , and the environment transitions to a new state As demonstrated in [9], the geometry of a given environment
48
st+1 ∈ S with the transition probability p (st+1 | st , at ). imposes constraints on the number of beam directions that
49
R = {rt | st ∈ S, at ∈ A} is a reward set of all possible can be employed for UE within a particular location bin.
50
state-action pairs. The beam alignment problem is modeled as This results in the majority of beam pairs being invalid.
51
an MDP by elaborately designing the state space, action space, Therefore, we define the action space as the set of possible
52
and reward function. beam directions. The process of identifying possible beam
53
1) State: For the mmWave massive MIMO communication directions by observing Nobs vehicles is depicted in Table I,
54
system with a fixed BS location, the wireless channel and all where Rij represents the received power of the i-th observed
55 PNobs
56 its related channel knowledge, denoted as H, are determined vehicle for the j-th beam pair, and R̄j = i=1 Rij /Nobs and
57 by the radio wave property (e.g., the carrier frequency), the C = Nt Nr . We consider top-P beam pairs ranked according
58 locations of the UE, and the radio propagation environment to the average received power as possible beam directions.
59 [13]. In the considered propagation environment, we disregard The size of P is a system parameter chosen to balance the
60
The size of P is a system parameter chosen to balance the dimension of the action space against the performance of the beam alignment. The final action space is given by

$$\mathcal{A}=\{I_{(1)},I_{(2)},\cdots,I_{(P)}\} \quad (7)$$

where I_(i) represents the beam pair index with the i-th highest average received power.
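A sketch of this offline action-space construction, assuming the received powers Rij of Table I have already been collected for Nobs observed vehicles over all C beam pairs (the Rayleigh-distributed powers below are placeholders):

```python
import numpy as np

def possible_beam_directions(R, P=64):
    """R[i, j]: received power of observed vehicle i under beam pair j (Table I).
    Returns the indices of the top-P beam pairs ranked by average power, i.e. A in (7)."""
    avg_power = R.mean(axis=0)               # average power R_j for each of the C beam pairs
    return np.argsort(avg_power)[::-1][:P]   # I_(1), ..., I_(P)

rng = np.random.default_rng(1)
R = rng.rayleigh(scale=1.0, size=(50, 1024))  # N_obs = 50 vehicles, C = 1024 pairs (toy value)
action_space = possible_beam_directions(R, P=64)
print(action_space[:5])
```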
3) Reward: We naturally consider the received power as the reward. The immediate reward at time step t is defined as

$$r_{t}=\left\|\mathbf{w}^{H}\mathbf{H}\mathbf{f}\right\|^{2}. \quad (8)$$

B. DQN-based Location Context Sensing Beam Alignment Algorithm

The beam alignment problem has been formulated as an MDP, and reinforcement learning (RL) has proven to be an effective and widely used approach for solving MDPs. Due to the discrete nature of the actions, we use a DQN to find the optimal policy. DQN builds upon the principles of Q-learning by utilizing neural networks to approximate the optimal action-value (also known as Q) function under a policy π; the optimal Q function is given by

$$Q^{\star}(s,a)=\max_{\pi}\ \mathbb{E}\left[r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\ldots \mid s_{t}=s,\,a_{t}=a,\,\pi\right] \quad (9)$$

where γ ∈ (0, 1) is the discount factor.

It is well known that RL is unstable when a nonlinear function approximator such as a neural network is used to learn the Q function. Hence, two key techniques contribute to the stability of DQN training: experience replay and periodic target updates. Experience replay breaks the correlation between observed sequences by randomizing the training data, while periodic updates of the target network reduce the variance of the target value, improving convergence.

We use a deep neural network to parameterize an approximation Q(s, a; W), where W denotes the weights of the Q-network. To perform experience replay, the DQN stores the experience (st, at, rt, st+1) in a replay buffer D at each time step t. During training, the DQN minimizes the loss function by drawing samples uniformly at random from D. The loss function is given by

$$L(W)=\mathbb{E}_{(s,a,r,s^{\prime})\sim U(D)}\left[\left(r+\gamma \max_{a^{\prime}\in\mathcal{A}} Q\left(s^{\prime},a^{\prime};W^{-}\right)-Q(s,a;W)\right)^{2}\right] \quad (10)$$

where W⁻ denotes the parameters of the target network. The target network is a periodically updated copy of the Q-network that stabilizes training.

For clarity, the location context sensing beam alignment algorithm based on DQN is summarized in Algorithm 1. The algorithm begins by initializing the DQN in Step 1. At the beginning of each episode¹, the BS first sends a training request to all users, who then respond with acknowledgments containing their location information. The BS then generates the specific location context information for the target user based on (6). To avoid the misalignment caused by selecting a single beam pair, a beam pair subset B is chosen according to an ϵ-greedy strategy for beam training. Here, B is determined either by selecting the top-|B| Q-value actions or by randomly selecting |B| actions, depending on the value of ϵ. Following the interaction with the environment, the maximum received power among the measured results of the B beam pairs, together with the corresponding beam pair, is employed as the reward rt and the action at, respectively. Finally, the parameters of the DQN are updated by performing steps (d)-(g).

In practice, we consider two typical modes for implementing the proposed scheme. The first mode involves online learning and offline prediction, and is primarily suitable for environments with stable traffic density. In this mode, the algorithm first learns the channel knowledge by interacting with the environment. When the model has converged, it transitions to the offline prediction stage, in which the agent directly outputs the beam pair subset B based on the location context information. This is referred to as the offline mode, and the corresponding scheme is called DQN-offline. However, if the traffic density changes drastically, the trained model may become obsolete and fail to perform properly. To save training time, the model can continue to learn new environments based on its existing knowledge, optimizing its underlying neural network at each time step to maintain performance. This is known as the online mode, and the corresponding scheme is called DQN-online.

Algorithm 1 Location Context Sensing Beam Alignment Algorithm.
Input: contextual information
Output: beam pair subset B
1 Initialize DQN: replay memory D; Q-network with random weights W; target network with weights W⁻ = W;
2 for each episode do
  (1) get state s1 by (6);
  (2) for t = 1 to T do
      (a) choose action subset B in state st according to the ϵ-greedy strategy;
      (b) execute B for beam training and obtain the optimal action at;
      (c) record the reward rt and observe the next state st+1;
      (d) store the transition (st, at, rt, st+1) in D;
      (e) sample a random minibatch of transitions from D;
      (f) perform a gradient descent step w.r.t. W;
      (g) update the target network parameters W⁻ = W every K steps;
  end
end

¹In this paper, an episode continues for a fixed duration of T time steps, during which the agent explores the environment and learns from its interactions. The episode ends after the T-th time step, at which point the environment resets and the next episode begins.
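The PyTorch sketch below mirrors one time step of Algorithm 1: the ϵ-greedy choice of the beam pair subset B (top-|B| Q-values or |B| random actions), the replay-buffer update, one gradient step on the loss (10), and the periodic copy of the weights into the target network. The layer sizes, the Adam optimizer, and the flattened-state input are illustrative assumptions rather than details specified in the letter.

```python
import copy
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class BeamDQN:
    def __init__(self, state_dim, num_actions, subset_size=5, gamma=0.99, lr=1e-3,
                 buffer_size=5000, batch_size=64, target_every=500):
        self.q_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU(),
                                   nn.Linear(128, num_actions))
        self.target_net = copy.deepcopy(self.q_net)            # W^- = W
        self.opt = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.replay = deque(maxlen=buffer_size)                 # replay memory D
        self.num_actions, self.subset_size = num_actions, subset_size
        self.gamma, self.batch_size, self.target_every = gamma, batch_size, target_every
        self.steps = 0

    def select_subset(self, state, epsilon):
        """Step (a): beam pair subset B, random with prob. epsilon, otherwise top-|B| Q-values."""
        if random.random() < epsilon:
            return random.sample(range(self.num_actions), self.subset_size)
        with torch.no_grad():
            q = self.q_net(torch.as_tensor(state, dtype=torch.float32))
        return torch.topk(q, self.subset_size).indices.tolist()

    def update(self, transition):
        """Steps (d)-(g): store (s_t, a_t, r_t, s_{t+1}), sample a minibatch, descend on (10)."""
        self.replay.append(transition)
        if len(self.replay) < self.batch_size:
            return
        s, a, r, s_next = zip(*random.sample(self.replay, self.batch_size))
        s = torch.as_tensor(np.asarray(s), dtype=torch.float32)
        a = torch.as_tensor(a, dtype=torch.int64)
        r = torch.as_tensor(r, dtype=torch.float32)
        s_next = torch.as_tensor(np.asarray(s_next), dtype=torch.float32)

        q_sa = self.q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                                    # target computed with W^-
            y = r + self.gamma * self.target_net(s_next).max(dim=1).values
        loss = F.mse_loss(q_sa, y)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

        self.steps += 1
        if self.steps % self.target_every == 0:                  # every K steps
            self.target_net.load_state_dict(self.q_net.state_dict())
```

Within the loop of Algorithm 1, the reward rt would be the largest received power measured over the |B| trained pairs and at the corresponding pair index; the DQN-offline scheme simply stops calling update() after convergence, whereas the DQN-online scheme keeps calling it at every time step.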
IV. SIMULATION RESULTS

This section assesses the performance of the proposed beam alignment algorithm in comparison with the baseline approaches. We generate the mmWave channel for the urban-macro scenario using QuaDRiGa [15]. To simulate the traffic scene, three types of vehicles are placed, and their distances follow an Erlang distribution (α, β), where α is the shape parameter and β is the inverse scale parameter.
TABLE II
VEHICLE PARAMETERS FOR SIMULATION

Type | Length (m) | Height (m) | Proportion
Car  | 3.71       | 1.55       | 3
Van  | 5.20       | 2.47       | 2
Bus  | 11.1       | 3.33       | 1

TABLE III
CRITICAL PARAMETERS OF QUADRIGA AND DQN

Parameter | Description                          | Value
—         | Carrier frequency                    | 60 GHz
Nt, Nr    | Number of antennas                   | 16 × 16, 16 × 16
M         | Number of paths                      | 25
Pt        | Transmission power                   | 0 dBm
σ²        | Noise power                          | −33 dBm
—         | BS location                          | (0, 0, 7) m
L         | UE location                          | (30 ± 2.5, 8.5, 1.55) m
Vmax      | Maximum number of vehicles           | 50
ς0        | Initial learning rate                | 0.001
η         | Decaying rate of the learning rate   | 0.01
γ         | Discount factor                      | 0.99
|D|       | Size of replay memory                | 5 × 10³
K         | Target network update interval       | 500
T         | Episode limit                        | 100
Nobs      | Number of vehicle observations       | 50
P         | Size of action space                 | 64

Fig. 2. The training process of DQN (|B| = 5): (a) offline mode, α = 6, β = 1.254; (b) online mode, α = 6, β = 0.6, 2.5. (Axes: reward versus time step.)

Fig. 3. Performance of different algorithms (α = 6, β = 1.254): alignment probability, 3 dB loss probability, and spectral efficiency versus the number of searched beam pairs.
The relevant vehicle parameters are listed in Table II. Without loss of generality, we assume that there are two lanes and that all vehicles drive in the center of their respective lanes. The vehicles in the lane closer to the BS have a vertical distance of 5 m from the BS, i.e., yi = 5, while the vehicles in the other lane have yi = 8.5. According to [9], we allow for UE position inaccuracy within a 5 m location bin. The detailed system parameters of QuaDRiGa and the model parameters of the DQN are listed in Table III.
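One possible way to generate such a traffic snapshot is sketched below (the road length and the type-assignment details are assumptions for illustration; the letter specifies only the Erlang spacing, the two lanes, and the proportions of Table II). The spacings are drawn from Erlang(α, β), i.e., a gamma distribution with integer shape α and scale 1/β.

```python
import numpy as np

rng = np.random.default_rng(2024)
HEIGHTS = {"car": 1.55, "van": 2.47, "bus": 3.33}   # vehicle heights from Table II
PROBS = np.array([3.0, 2.0, 1.0]) / 6.0              # proportion 3 : 2 : 1

def sample_lane(alpha, beta, y, road_length=60.0):
    """Drop vehicles along one lane; successive spacings ~ Erlang(alpha, beta)."""
    positions, x = [], 0.0
    while True:
        x += rng.gamma(shape=alpha, scale=1.0 / beta)  # Erlang = Gamma with integer shape
        if x > road_length:
            break
        positions.append(x)
    kinds = rng.choice(list(HEIGHTS), size=len(positions), p=PROBS)
    return [(x, y, HEIGHTS[k]) for x, k in zip(positions, kinds)]

# two lanes at y = 5 m and y = 8.5 m, as in the simulation setup
scene = sample_lane(6, 1.254, y=5.0) + sample_lane(6, 1.254, y=8.5)
print(len(scene), "vehicles, e.g.", scene[:2])
```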
Similar to [9], we regard the alignment probability and the 3 dB loss probability as performance metrics. Additionally, spectral efficiency serves as a critical indicator for evaluating performance, and is defined as

$$S=\log_{2}\left(1+\frac{P_{t}\left\|\mathbf{w}^{H}\mathbf{H}\mathbf{f}\right\|^{2}}{\sigma^{2}}\right). \quad (11)$$
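For one channel realization, these metrics could be computed as in the sketch below; the alignment and 3 dB events follow the usual convention of comparing the best trained pair against the overall optimum (an assumption consistent with [9], since the letter does not restate the definitions), and the powers are taken in milliwatts so that Pt = 1 mW and σ² = 10^(−3.3) mW match Table III.

```python
import numpy as np

def evaluate_one(power, searched_pairs, Pt=1.0, sigma2=10 ** (-33 / 10)):
    """power: received powers |w^H H f|^2 (mW) over all beam pairs for one realization.
    searched_pairs: indices of the pairs actually trained (e.g. the subset B)."""
    best_found = power[searched_pairs].max()
    best_possible = power.max()
    aligned = best_found == best_possible              # alignment event
    loss_3db = best_found < 0.5 * best_possible        # more than 3 dB below the optimum
    se = np.log2(1.0 + Pt * best_found / sigma2)       # spectral efficiency (11)
    return aligned, loss_3db, se

# Averaging the first two indicators over many realizations yields the alignment
# probability and the 3 dB loss probability reported in Figs. 3 and 4.
```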
To verify the performance of our proposed DQN scheme, we compare the simulation results with the common upper confidence bound (UCB) algorithm in [5], the location-aided greedy UCB algorithm and the location-aided risk-aware greedy UCB algorithm in [9], the location context-assisted Linear UCB algorithm in [16], the Thompson sampling (TS) algorithm in [6], and the hierarchical search algorithm in [3]. The following text and figures refer to these benchmarks as "UCB", "LgUCB", "LRgUCB", "LinUCB", "TS", and "HierS" for short.

Fig. 2 (a) and Fig. 2 (b) show the training process of the DQN-offline scheme with different learning rates and of the DQN-online scheme with different traffic densities, respectively. The initial network used in the online mode is the Q-network saved in the offline mode at 50,000 steps. It is well known that rewards are a direct signal for the agent to measure the quality of an action. As shown in Fig. 2, the reward curves gradually increase and stabilize in both the DQN-offline and DQN-online modes, which indicates that both modes learn effective policies with good learning ability and convergence. In Fig. 2 (a), the case utilizing the decaying learning rate shows faster convergence [17]. It can be observed from Fig. 2 (b) that the reward curve of the DQN-online scheme demonstrates rapid convergence, which shows its ability to adapt swiftly to environmental changes. Further, increasing the traffic density brings higher returns than decreasing it, because with more vehicles the scatterers are more abundant, enhancing the strength of the communication paths.

We evaluate the performance of the different beam alignment algorithms as shown in Fig. 3. The evaluation metrics are averaged over 2,000 channel realizations. It is observed from Fig. 3 (a) that the proposed DQN and the LinUCB scheme outperform several other algorithms, primarily due to their ability to leverage the location context for environmental sensing. The proposed DQN outperforms LinUCB by handling complex nonlinear features, whereas LinUCB is limited to linear features. UCB and TS accrue experience by measuring multiple beam pairs, whereas LgUCB and LRgUCB depend solely on the UE's position information, which precludes them from identifying effective reflectors within the surrounding environment. Since the 3 dB loss probability is a more tolerant evaluation metric than the alignment probability, the gap between the various schemes decreases, as expected. Finally, the 3 dB loss probability of the proposed DQN scheme is lower than 1%. Fig. 3 (b) shows the spectral efficiency of our proposed DQN scheme and the benchmarks, where the proposed DQN scheme achieves the highest spectral efficiency. Notably, it achieves near-optimal performance with a training budget of only 8 beam pairs.

Subsequently, we evaluate the performance of the different beam alignment algorithms under varying traffic densities.
To show the performance more clearly, we omit the UCB and HierS schemes, which perform poorly. From Fig. 4, it can be observed that the DQN-online scheme is more effective than the other schemes, as it can continuously learn online and quickly adapt to changes in the environment. Further, the superior performance of the DQN-offline scheme indicates that the trained model is robust and maintains a high alignment probability even under fluctuating conditions.

Fig. 4. Performance of different algorithms under varying traffic densities: (a), (b) α = 6, β = 2.5; (c), (d) α = 6, β = 0.6. (Axes: alignment probability, 3 dB loss probability, and spectral efficiency versus the number of searched beam pairs.)

Lastly, the average CPU execution times are compared in Table IV. The training time refers to the period from the beginning of learning until the algorithm converges, whereas the test time is the duration taken to output a decision at each time step in a practical application. The training of the proposed DQN-based scheme is time-consuming. However, its operating time is significantly lower than that of the other algorithms, meeting the timeliness requirement of mmWave massive MIMO vehicular communication.

TABLE IV
CPU EXECUTION TIME (S)

Algorithm         | Training time | Test time
DQN (decaying ς)  | 350.84        | 6.00 × 10⁻⁴
DQN (ς = 10⁻⁵)    | 657.83        | 6.00 × 10⁻⁴
LRgUCB            | 46.470        | 9.02 × 10⁻³
LgUCB             | 38.121        | 4.18 × 10⁻³
UCB               | 61.330        | 3.13 × 10⁻²
TS                | 351.55        | 7.04 × 10⁻¹
LinUCB            | 232.30        | 2.49
HierS             | 2.4375        | 2.44

V. CONCLUSIONS

In this paper, we develop a location context sensing beam alignment algorithm based on DQN for the mmWave massive MIMO system. The proposed scheme leverages the locations of the target vehicle and its neighboring vehicles to sense the environment and recommend a beam pair subset for beam training. Additionally, to address the model mismatch caused by changes in traffic density, we introduce an online mode that adapts quickly and effectively to the evolving environment. Simulation results demonstrate the effectiveness and superiority of our proposed scheme.

REFERENCES

[1] Q. Xue et al., "A survey of beam management for mmWave and THz communications towards 6G," IEEE Communications Surveys & Tutorials, vol. 26, no. 3, pp. 1520–1559, 2024.
[2] S. Hur et al., "Millimeter wave beamforming for wireless backhaul and access in small cell networks," IEEE Transactions on Communications, vol. 61, no. 10, pp. 4391–4403, 2013.
[3] Z. Xiao et al., "Hierarchical codebook design for beamforming training in millimeter-wave communication," IEEE Transactions on Wireless Communications, vol. 15, no. 5, pp. 3380–3392, 2016.
[4] J. Zhang et al., "Intelligent interactive beam training for millimeter wave communications," IEEE Transactions on Wireless Communications, vol. 20, no. 3, pp. 2034–2048, 2021.
[5] Z. Jianjun et al., "Beam alignment and tracking for millimeter wave communications via bandit learning," IEEE Transactions on Communications, vol. 68, no. 9, pp. 5519–5533, 2020.
[6] M. Krunz et al., "Online reinforcement learning for beam tracking and rate adaptation in millimeter-wave systems," IEEE Transactions on Mobile Computing, vol. 23, no. 2, pp. 1830–1845, 2024.
[7] S. Jiang et al., "LiDAR aided future beam prediction in real-world millimeter wave V2I communications," IEEE Wireless Communications Letters, vol. 12, no. 2, pp. 212–216, 2023.
[8] W. Xu et al., "3D scene-based beam selection for mmWave communications," IEEE Wireless Communications Letters, vol. 9, no. 11, pp. 1850–1854, 2020.
[9] V. Va et al., "Online learning for position-aided millimeter wave beam training," IEEE Access, vol. 7, pp. 30507–30526, 2019.
[10] K. Satyanarayana et al., "Deep learning aided fingerprint-based beam alignment for mmWave vehicular communication," IEEE Transactions on Vehicular Technology, vol. 68, no. 11, pp. 10858–10871, 2019.
[11] Y. Wang et al., "MmWave vehicular beam selection with situational awareness using machine learning," IEEE Access, vol. 7, pp. 87479–87493, 2019.
[12] W. Xu et al., "Computer vision aided mmWave beam alignment in V2X communications," IEEE Transactions on Wireless Communications, vol. 22, no. 4, pp. 2699–2714, 2023.
[13] Y. Zeng et al., "A tutorial on environment-aware communications via channel knowledge map for 6G," IEEE Communications Surveys & Tutorials, vol. 26, no. 3, pp. 1478–1519, 2024.
[14] Y. Heng et al., "Machine learning-assisted beam alignment for mmWave systems," IEEE Transactions on Cognitive Communications and Networking, vol. 7, no. 4, pp. 1142–1155, 2021.
[15] S. Jaeckel et al., "QuaDRiGa: A 3-D multi-cell channel model with time evolution for enabling virtual field trials," IEEE Transactions on Antennas and Propagation, vol. 62, no. 6, pp. 3242–3256, 2014.
[16] S. Dinh-van et al., "Rapid beam training at terahertz frequency with contextual multi-armed bandit learning," in 2024 IEEE International Conference on Industrial Technology (ICIT), 2024, pp. 1–7.
[17] X. Liu et al., "Machine learning empowered trajectory and passive beamforming design in UAV-RIS wireless networks," IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 2042–2055, 2021.
