An Attention-Aided Deep Learning Framework For Massive MIMO Channel Estimation
Abstract: Channel estimation is one of the key issues in practical massive multiple-input multiple-output (MIMO) systems. Compared with conventional estimation algorithms, deep learning (DL) based ones have exhibited great potential in terms of performance and complexity. In this paper, an attention mechanism, exploiting the channel distribution characteristics, is proposed to improve the estimation accuracy of highly separable channels with narrow angular spread by realizing the “divide-and-conquer” policy. Specifically, we introduce a novel attention-aided DL channel estimation framework for conventional massive MIMO systems and devise an embedding method to effectively integrate the attention mechanism into the fully connected neural network for the hybrid analog-digital (HAD) architecture. Simulation results show that in both scenarios, the channel estimation performance is significantly improved with the aid of attention at the cost of a small complexity overhead. Furthermore, strong robustness under different system and channel parameters can be achieved by the proposed approach, which further strengthens its practical value. We also investigate the distributions of learned attention maps to reveal the role of attention, which endows the proposed approach with a certain degree of interpretability.

Index Terms: Massive MIMO, channel estimation, deep learning, attention mechanism, hybrid analog-digital, divide-and-conquer.

I. INTRODUCTION

In prior works, least squares (LS) and minimum mean-squared error (MMSE) [3] are the two most commonly used estimators for channel estimation. LS is relatively simple and easy to implement, but its performance is unsatisfactory. MMSE, on the other hand, can refine the LS estimation if an accurate channel correlation matrix (CCM) is available. However, the complexity of MMSE estimation is much higher than that of LS estimation due to the matrix inversion operation. Moreover, to reduce the hardware and energy cost, the hybrid analog-digital (HAD) architecture is usually adopted in practical massive MIMO systems, where the multi-antenna array is connected to only a limited number of radio-frequency (RF) chains through phase shifters in the analog domain [4]–[6]. With HAD, channel estimation becomes even more difficult since the received signals at the BS are only a few linear combinations of the original signals. If LS is used, multiple estimations are required since only part of the antennas’ channels can be estimated at a time due to the limited number of RF chains. To avoid the dramatically increased overhead of LS, the slowly changing directions of arrival of channel paths are obtained first in the preamble stage in [7], and then only the channel gains of each path are re-estimated over a long period. Another alternative is to exploit the channel
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on May 23,2022 at 02:17:20 UTC from IEEE Xplore. Restrictions apply.
1824 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 21, NO. 3, MARCH 2022
comparable performance as the iterative optimization algorithm. An unsupervised learning-based beamforming network has been developed for intelligent reconfigurable surface aided massive MIMO systems in [19]. In [21], channel estimation and signal detection in orthogonal frequency division multiplexing systems have been performed jointly by a DNN. Then, a model-driven based approach is further proposed in [22] to exploit the advantages of both conventional algorithms and DNN. In [23], rather than directly using a black-box DNN, the conventional orthogonal approximate message passing algorithm (OAMP) is unfolded for the detection network.

There are mainly two categories of approaches for DL-based massive MIMO channel estimation. In the first category, “deep unfolding” methods unfold various iterative optimization algorithms and enhance their estimation performance by inserting learnable parameters. In [24], the AMP algorithm is unfolded into a cascaded neural network for millimeter wave channel estimation, where the denoiser is learned by a DNN. Thanks to the power of DL, the proposed method can outperform a series of conventional denoising-AMP based algorithms. In [25], the iterative shrinkage thresholding algorithm is unfolded to solve sparse linear inverse problems, where massive MIMO channel estimation is used as a case study. However, “unfolding” is only feasible for iterative algorithms with simple structures, and the computational complexity is also high. In the other category, DL is used to directly learn the mapping from available channel-related information to the CSI for performance improvement or complexity reduction. In [26], a DNN has been proposed to refine the coarse estimation in HAD massive MIMO systems, where the channel correlation in the frequency and time domains is exploited for further performance improvement. In [28], the estimation performance is further improved by jointly training the pilot signals and channel estimator with an autoencoder in downlink massive MIMO systems. In [29], a graph neural network has been used for massive MIMO channel tracking. Deep multimodal learning has been used for massive MIMO channel estimation and prediction in [30]. To reduce the complexity, the amplitudes of beamspace channels are predicted by a DNN and the dominant entries are estimated by LS in [31], thus avoiding the greedy search commonly adopted by CS algorithms. In [32], the uplink-to-downlink channel mapping in frequency-division duplex (FDD) systems is learned by a sparse complex valued network.

Nevertheless, current DL-based channel estimation methods have seldom exploited the characteristics of channel distribution. In practice, the BS is often located at a high altitude with few surrounding scatterers [33], so the angular spread of each user’s incident signal at the BS is narrow. Thus, the global distribution of channels corresponding to different users in the entire angular space can be viewed as the composition of many local distributions, where each local distribution represents channels within a small angular region. Due to narrow angular spread, a certain angular region contains much fewer channel cases than the entire angular space because of the limited angular range of channel paths, making the local distributions much simpler than the global distribution. Besides, different local distributions can be highly distinguishable from each other if the entire angular space is properly segmented into different angular regions. Under such a condition, the classic “divide-and-conquer” policy, which tackles a complex main problem by solving a series of its simplified sub-problems, is very suitable. Specifically, the estimation of channels in the entire angular space can be regarded as the main problem and the estimation of channels in different small angular regions can be regarded as different sub-problems. Motivated by this, in this paper, we propose a novel attention-aided DL-based channel estimation framework, where the “divide-and-conquer” policy is realized automatically through the dynamic adaptation of attention maps. The main contributions of this paper are summarized as follows:
• An attention-aided DL-based channel estimation framework is proposed for massive MIMO systems, which achieves better performance than its counterpart without attention in simulation. To the best of the authors’ knowledge, this is the first work that introduces the attention mechanism to DL-based channel estimation.¹
• We extend the above framework to the scenario with HAD and an embedding method is proposed to effectively integrate the attention mechanism into the fully connected neural network (FNN), which expands the application range of the proposed approach.
• We visually explain the “divide-and-conquer” policy reflected in the distributions of learned attention maps, which enhances the interpretability and rationality of the proposed approach.
• Based on our results, the performance gain of attention mainly comes from the narrow angular spread characteristic of channels. Therefore, the proposed approach can be extended to many other problems apart from channel estimation as long as the channel distribution has certain separability, such as multi-user beamforming, FDD downlink channel prediction, and so forth.

¹ Attention has already been used in some literature to aid DL-based communication systems, such as CSI compression [34], [35] and joint source and channel coding [36]. Nevertheless, the considered channel distribution in [34] does not possess strong separable property, and the proposed method in [36] requires extra side information. As for [35], the non-local neural network model is utilized to exploit the self-attention in the spatial dimension of channels.

The rest of this paper is organized as follows. Section II introduces the system model, channel model, and problem formulation. Section III presents the attention-aided DL-based channel estimation framework, which is extended to the HAD scenario in Section IV. Simulation results are demonstrated in Section V. Eventually, the paper is concluded in Section VI.

Here are some notations used subsequently. We use italic, bold-face lower-case and bold-face upper-case letters to denote scalar, vector, and matrix, respectively. $\mathbf{A}^T$ and $\mathbf{A}^H$ denote the transpose and Hermitian or complex conjugate transpose of matrix $\mathbf{A}$, respectively. $[\mathbf{A}]_{i,j}$ denotes the element at the $i$-th row and $j$-th column of matrix $\mathbf{A}$. $\|\mathbf{x}\|$ denotes the $\ell_2$ norm of vector $\mathbf{x}$, and $|a|$ denotes the amplitude of complex number $a$. $\mathbb{C}^{x \times y}$ denotes the $x \times y$ complex space. $\mathcal{CN}(\mu, \sigma^2)$ denotes the distribution of a circularly symmetric
GAO et al.: ATTENTION-AIDED DL FRAMEWORK FOR MASSIVE MIMO CHANNEL ESTIMATION 1825
complex Gaussian random variable with mean $\mu$ and variance $\sigma^2$. $\mathcal{U}[a, b]$ denotes the uniform distribution between $a$ and $b$.

II. SYSTEM MODEL AND PROBLEM FORMULATION
In this section, system model and channel model are first introduced. Then, the conventional massive MIMO channel estimation problem is formulated.

A. System Model
Consider a single cell massive MIMO system, where the BS is equipped with an $N$-antenna uniform linear array (ULA) and $K$ single-antenna users are randomly distributed in the cell of the corresponding BS, as illustrated in Fig. 1.

B. Channel Model
Following the same channel model as in [37], the uplink channel from user $k$ to the BS can be expressed as

$$\mathbf{h}_k = \frac{1}{N_p} \sum_{i=1}^{N_p} \alpha_{ki}\, \mathbf{a}(\theta_{ki}) \in \mathbb{C}^{N \times 1}, \tag{1}$$

where $N_p$ is the number of paths, $\alpha_{ki}$ and $\theta_{ki}$ are the complex gain and angle of arrival (AoA) at the BS of the $i$-th path from the $k$-th user, respectively. Without loss of generality, we consider half-wavelength antenna spacing in this paper, then the steering vector of the ULA can be written as $\mathbf{a}(\theta) = [1, e^{j\pi \sin(\theta)}, \cdots, e^{j\pi \sin(\theta)(N-1)}]^T$. Define the average AoA and the angular spread of user $k$’s channel paths as $\bar{\theta}_k$ and $\theta_{\Delta}$, respectively, that is, $\theta_{ki}$ follows a uniform distribution $\mathcal{U}[\bar{\theta}_k - \theta_{\Delta}, \bar{\theta}_k + \theta_{\Delta}]$. As in [11], [37], the narrow angular spread assumption is adopted, i.e., $\theta_{\Delta} \ll \pi$.

To better understand this channel characteristic, we convert the original channel to the angular domain by

$$\mathbf{x}_k = \mathbf{F} \mathbf{h}_k \in \mathbb{C}^{N \times 1}, \tag{2}$$

where $\mathbf{x}_k$ denotes the angular domain channel of user $k$, and $\mathbf{F} \in \mathbb{C}^{N \times N}$ is a shift-version discrete Fourier transform matrix [11], with the $n$-th row given by $\mathbf{f}_n = \frac{1}{\sqrt{N}} [1, e^{-j\pi \eta_n}, \cdots, e^{-j\pi \eta_n (N-1)}]$, for $\eta_n = \frac{-N+1}{N}, \frac{-N+3}{N}, \cdots, \frac{N-1}{N}$. Due to narrow angular spread assumption, the angular domain channel exhibits the spatial-clustered sparsity structure [11]. Specifically, as shown in the right half of Fig. 1, $\mathbf{x}_k$ only has a few significant elements appearing in a cluster. If properly exploited, such sparsity structure can help to improve estimation performance and reduce estimation overhead.

C. Problem Formulation
During the uplink training, orthogonal pilot sequences are sent by different users. Denote the pilot sequence of the $k$-th user as $\mathbf{p}_k \in \mathbb{C}^{1 \times L_p}$, where $L_p \geq K$ is the length of pilot sequences. Notice that the channel during pilot training phase is assumed to be unchanged [11] since $L_p$ is relatively small. Therefore, the superimposed received signal at the BS can be expressed as

$$\mathbf{Y} = \sum_{k=1}^{K} \mathbf{h}_k \mathbf{p}_k + \mathbf{N} \in \mathbb{C}^{N \times L_p}, \tag{3}$$

where $\mathbf{N} \sim \mathcal{CN}(0, \sigma^2) \in \mathbb{C}^{N \times L_p}$ is the zero-mean additive white Gaussian noise at the BS with variance $\sigma^2$. Without loss of generality, we fix the power of pilot sequences to unit and adjust the transmit signal-to-noise ratio (SNR) by changing the noise variance. Then, we have $\mathbf{p}_i \mathbf{p}_j^H = 0, \forall i \neq j$ and $\mathbf{p}_i \mathbf{p}_i^H = 1, \forall i$. Exploiting the orthogonality of the pilot sequences, the LS estimation of user $k$’s channel can be obtained as

$$\hat{\mathbf{h}}_k = \mathbf{Y} \mathbf{p}_k^H = \mathbf{h}_k + \tilde{\mathbf{n}}_k \in \mathbb{C}^{N \times 1}, \tag{4}$$

where $\tilde{\mathbf{n}}_k \triangleq \mathbf{N} \mathbf{p}_k^H$ is the effective noise for user $k$. For brevity, we will consider a specific user from now on and omit subscript $k$. Besides, we use $\hat{\mathbf{h}}_{LS}$ to denote the LS estimation.
Therefore, the goal of channel estimation² is to find a function that maps $\hat{\mathbf{h}}_{LS}$ to $\mathbf{h}$.

One of the conventional methods is the MMSE estimation, where the LS estimation is refined by the CCM. However, accurate CCM is hard to obtain in practice and the complexity of matrix inversion in MMSE estimation is very high, especially when the antenna number is large. In [38], DL-based methods have been proposed to refine the channel estimation. In this paper, we will develop an attention-aided DL framework for conventional massive MIMO channel estimation by exploiting the characteristics of channel distribution.

III. ATTENTION-AIDED DL FRAMEWORK FOR MASSIVE MIMO CHANNEL ESTIMATION
In this section, input and output processing, network structure design, and detailed network training method of the proposed framework are introduced.

A. Input and Output Processing
Since channel parameters can be canonically expressed in the angular domain, the input and output of the networks are all in the angular domain in the proposed framework. In simulation, we find that the more sparse angular domain input and output can lead to better channel estimation performance than the original ones. Once the angular domain channel estimation, $\hat{\mathbf{x}}$, is obtained, the original channel estimation can be readily recovered by $\hat{\mathbf{h}} = \mathbf{F}^H \hat{\mathbf{x}}$. Besides, the real and imaginary parts have to be separately processed since complex training is still not well supported by current DL libraries. To promote efficient training, we also perform standard normalization preprocessing on the input.

B. Attention-Aided Channel Estimation Network Structure
As shown in Fig. 2, convolutional neural network (CNN) is a suitable choice for the network structure to exploit the local correlation in the input data due to the spatial-clustered sparsity structure of the angular domain channel. In this paper, one-dimensional convolution (Conv1D) is used due to the shape of input data. The input of a Conv1D layer is organized as a $(F, C)$-dimensional feature matrix, where $C$ denotes the number of channels³ and $F$ denotes the number of features in each channel. Then, the convolution operation slides C filters over the input feature matrix in certain strides to obtain the output feature matrix, which is also the input of the next layer. Specifically, each filter contains a $(L, C)$-dimensional trainable weight matrix and a scalar bias term, where $L$ denotes the filter size. When a filter is located in a certain position of the feature matrix, the cross-correlation between the corresponding chunk of the feature matrix and the weight matrix of the filter is computed and the bias is added to obtain the convolution output of the position [39]. In the proposed channel estimation network, $N_B$ convolutional blocks and an output Conv1D layer are used to refine the LS coarse channel estimation. As depicted in the dashed box, in each convolutional block, a batch normalization (BN) layer to prevent gradient explosion or vanishing [43] and a ReLU activation function are inserted after the Conv1D layer. Besides, the Conv1D layer in the first block has $F$ filters of size $L_I$ and the Conv1D layers in the next $N_B - 1$ blocks have $F$ filters of size $L_H$. The optimal values of $N_B$ and $F$ can be determined through simulation. Finally, the output Conv1D layer has 2 filters of size $L_O$, corresponding to the real and imaginary parts of the channel prediction, respectively. The stride is set to S and all the Conv1D layers pad zeros to keep the dimension $N$ of the feature matrix unchanged.

Fig. 2. Structure of the channel estimation network.

To effectively exploit the distribution characteristics of channel, the attention mechanism⁴ is applied in the network structure design. In the original CNN, all the features are used for all data samples with equal importance. However, certain features can definitely be more important or informative than others to certain data samples in practice, especially for highly separable data like narrow angular spread channel. For instance, key features, which are only aimed at dealing with channel distribution in a specific angular region, might be useless or even disruptive for the estimation of channels in another region far apart. Therefore, the idea of feature importance reweighting can be used here to improve network performance.

As is demonstrated in Fig. 3, the original feature matrix is multiplied by an attention map in a channel-wise manner to obtain the reweighted feature matrix in the attention module, where more important or informative features to the current data sample will be paid more “attention” to. For the learning process of the attention map, global average pooling is performed first on the original feature matrix, $\mathbf{Z}_O$, to embed the global information into a $(1, C)$-dimensional squeezed feature matrix, $\mathbf{z}$. Specifically, the $c$-th element of $\mathbf{z}$ is calculated by $z_c = \sum_{f=1}^{F} [\mathbf{Z}_O]_{f,c} / F$ [40]. Then, the $(1, C)$-dimensional attention map, $\mathbf{m}$, is predicted by a dedicated attention network based on $\mathbf{z}$. The attention network contains two fully connected (FC) layers. The first FC layer with $C/r$ neurons is followed by a ReLU activation, $f_{ReLU}(x) = \max(0, x)$, where $r \geq 1$ denotes the reduction ratio. The second FC layer with $C$ neurons is followed by a Sigmoid activation, $f_{Sigmoid}(x) = 1/(1 + e^{-x})$, which limits the elements of $\mathbf{m}$ between 0 and 1. As can be seen in Fig. 2, an attention module is inserted at the end of each convolutional block in the proposed channel estimation network. Besides, $r$ is set to

² Here we use the term “channel estimation” for consistency, actually “channel refinement” is more proper.
³ Here channel is a term in CNN representing a dimension of feature matrix, not the communication channel.
⁴ Notice that, the term attention can refer to many related methods including [40]–[42]. In this paper, we use the classic “SENet” proposed in [40].
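The attention module described above (global average pooling, two FC layers with ReLU and Sigmoid, then channel-wise rescaling) can be sketched in plain Python as follows. This is an illustrative forward pass only, assuming the weights are given as nested lists with one row per output neuron; the function names and the toy weights are hypothetical and are not taken from the authors' implementation.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def dense(inputs, weights, bias):
    """Fully connected layer; weights[j] holds the input weights of neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def se_attention(features, w1, b1, w2, b2):
    """Reweight an (F, C) feature matrix with a learned (1, C) attention map.

    features: F rows, each holding C channel values.
    w1, b1:   first FC layer, C inputs -> C/r neurons (followed by ReLU).
    w2, b2:   second FC layer, C/r inputs -> C neurons (followed by Sigmoid).
    """
    n_features, n_channels = len(features), len(features[0])
    # Squeeze: global average pooling over the feature axis, z_c = sum_f Z[f][c] / F.
    squeezed = [sum(features[f][c] for f in range(n_features)) / n_features
                for c in range(n_channels)]
    # Excite: predict one scale factor in (0, 1) per channel.
    attention = sigmoid(dense(relu(dense(squeezed, w1, b1)), w2, b2))
    # Scale: channel-wise multiplication with the attention map.
    reweighted = [[features[f][c] * attention[c] for c in range(n_channels)]
                  for f in range(n_features)]
    return reweighted, attention


if __name__ == "__main__":
    # Toy example with F = 2 features, C = 4 channels, reduction ratio r = 2.
    feats = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
    w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    w2 = [[0.0, 0.0], [0.0, 0.0], [10.0, 0.0], [-10.0, 0.0]]
    out, att = se_attention(feats, w1, [0.0, 0.0], w2, [0.0] * 4)
    print(att)
```

In the toy weights, channels with a zero pre-activation receive a neutral factor of 0.5, while a large positive or negative pre-activation pushes the factor towards 1 (keep) or 0 (suppress).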
C. Network Training
To train the designed network, the mean-squared error (MSE) between the true angular domain channel, $\mathbf{x}$, and the predicted angular domain channel, $\hat{\mathbf{x}}$, is used as the loss function, which can be calculated by

$$\text{MSE Loss} = \frac{1}{n} \sum_{i=1}^{n} \|\hat{\mathbf{x}}_i - \mathbf{x}_i\|^2, \tag{5}$$

where subscript $i$ denotes the $i$-th data sample in a mini-batch and $n = 500$ is the size of the mini-batch. Xavier [44] is used as the weight initializer and Adam [45] is used as the weight optimizer. The initial learning rate is set to 0.001. To balance the training complexity and testing performance, we generate a total of 200,000 data samples according to the adopted channel and transmission models. Then, the generated dataset is split into training, validation, and testing sets with a ratio of 3:1:1. In order to accelerate loss convergence at the beginning and reduce loss oscillation near the end of training, the learning rate is divided by 10 if the validation loss does not decrease in 10 consecutive epochs. Besides, early stopping [46] with a patience of 25 epochs is applied to prevent overfitting and speed up the training process.

IV. EXTENSION TO THE HAD SCENARIO
In practice, the HAD architecture is often adopted in massive MIMO systems to save hardware and energy cost. Due to the effect of phase shifters in the analog domain in the HAD architecture, the problem formulation of channel estimation changes and the channel estimation network structure has to be customized correspondingly as well. In the HAD architecture, we assume there are only $M \ll N$ RF chains available at the BS, as illustrated in Fig. 4.

Fig. 4. Massive MIMO system with HAD.

where $\mathbf{W} \in \mathbb{C}^{M \times N}$ denotes the analog combining matrix. As the phase shifters only change the phase of signals, we have $|[\mathbf{W}]_{i,j}| = 1/\sqrt{N}, \forall i, j$ after normalization. We set $\mathbf{W}$ to a matrix whose rows are length-$N$ Zadoff-Chu sequences with different shifting steps as in [10]. Again, exploiting the orthogonality of the pilot sequences, the received signal corresponding to user $k$ can be obtained as

$$\mathbf{y}_k = \mathbf{Y}_{HAD}\, \mathbf{p}_k^H = \mathbf{W} \mathbf{h}_k + \bar{\mathbf{n}}_k \in \mathbb{C}^{M \times 1}, \tag{7}$$

where $\bar{\mathbf{n}}_k \triangleq \mathbf{W} \tilde{\mathbf{n}}_k$ is the effective noise for user $k$ with HAD. Considering a specific user and omitting the subscript $k$, the goal of channel estimation now becomes finding a function that maps $\mathbf{y}$ to $\mathbf{h}$.

Since the overhead of LS estimation increases dramatically due to the limited number of RF chains, CS algorithms are conventionally more often adopted to solve the channel estimation problem in HAD massive MIMO systems. However, the performance of CS algorithms is highly dependent on channel sparsity and the computational complexity is relatively high due to complex operations and a large number of iterations. Therefore, we extend the proposed framework to the HAD scenario and use DL to overcome these issues.
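To make the HAD observation model of Eq. (7) concrete, the sketch below builds a combining matrix from cyclically shifted Zadoff-Chu sequences and applies it to a channel vector. The exact construction used in [10] is not given in the text, so the root index, the shift step, and the helper names here are assumptions chosen only to satisfy the stated constant-modulus property $|[\mathbf{W}]_{i,j}| = 1/\sqrt{N}$.

```python
import cmath
import math


def zadoff_chu(length, root=1):
    """Zadoff-Chu sequence; the phase formula depends on the parity of length."""
    if length % 2 == 0:
        return [cmath.exp(-1j * math.pi * root * n * n / length)
                for n in range(length)]
    return [cmath.exp(-1j * math.pi * root * n * (n + 1) / length)
            for n in range(length)]


def combining_matrix(n_rf, n_ant, shift_step=1, root=1):
    """M x N matrix whose i-th row is the sequence cyclically shifted by i * shift_step.

    Assumption: n_rf * shift_step <= n_ant, so that all shifts are distinct.
    """
    sequence = zadoff_chu(n_ant, root)
    scale = 1.0 / math.sqrt(n_ant)
    return [[scale * sequence[(j + i * shift_step) % n_ant] for j in range(n_ant)]
            for i in range(n_rf)]


def combine(matrix, channel):
    """Noise-free part of Eq. (7): y = W h, with only M observations of N unknowns."""
    return [sum(w * h for w, h in zip(row, channel)) for row in matrix]


if __name__ == "__main__":
    n_ant, n_rf = 16, 4
    W = combining_matrix(n_rf, n_ant)
    h = [cmath.exp(1j * math.pi * 0.3 * n) for n in range(n_ant)]  # a single ray
    y = combine(W, h)
    print(len(y), [round(abs(v), 3) for v in y])
```

With a root index coprime to $N$, distinct cyclic shifts of a Zadoff-Chu sequence are mutually orthogonal, so the rows of this $\mathbf{W}$ are orthonormal; the estimator still has to recover $N$ channel coefficients from only $M$ combined observations.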
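For reference, the training recipe of Section III-C (mini-batch MSE loss, a 3:1:1 dataset split, learning-rate decay on a validation plateau, and early stopping) can be sketched without any DL library. The bookkeeping below is an assumption about how the two patience counters interact, in particular that the plateau counter restarts after each decay; the text does not specify this, and the function names are hypothetical.

```python
def mse_loss(predictions, targets):
    """Eq. (5): mean over the mini-batch of the squared l2 error (real-valued vectors)."""
    total = 0.0
    for x_hat, x in zip(predictions, targets):
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(predictions)


def split_sizes(n_samples, ratio=(3, 1, 1)):
    """Sizes of the training, validation, and testing sets for a given ratio."""
    return [n_samples * part // sum(ratio) for part in ratio]


def plateau_schedule(val_losses, lr=1e-3, factor=0.1, lr_patience=10, stop_patience=25):
    """Replay a validation-loss curve and return (lr used per epoch, stop epoch or None)."""
    best = float("inf")
    since_best = 0   # epochs since the last improvement (drives early stopping)
    since_decay = 0  # non-improving epochs since the last decay (drives lr decay)
    history = []
    for epoch, loss in enumerate(val_losses):
        history.append(lr)
        if loss < best:
            best, since_best, since_decay = loss, 0, 0
            continue
        since_best += 1
        since_decay += 1
        if since_decay >= lr_patience:
            lr *= factor          # divide the learning rate by 10
            since_decay = 0       # assumption: the counter restarts after a decay
        if since_best >= stop_patience:
            return history, epoch
    return history, None


if __name__ == "__main__":
    print(split_sizes(200_000))
    losses = [1.0, 0.5] + [0.5] * 30
    rates, stop = plateau_schedule(losses)
    print(stop, sorted(set(rates), reverse=True))
```

Since the real and imaginary parts are processed separately, `mse_loss` here operates on real-valued vectors in which both parts are assumed to be stacked.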
a channel sample, the angular region it belongs to will be estimated first⁵ and the corresponding CCM will be selected for channel refinement. Compared with using a single CCM for all channel samples, using multiple CCMs matching different angular regions can effectively exploit the narrow angular spread characteristic of channels and improve performance significantly. Actually, it can be regarded as the manual implementation of the “divide-and-conquer” policy, i.e., the channel samples are “divided” by their angular regions and “conquered” by different corresponding CCMs.
• FNN: The FNN structure consists of three FC layers with 512, 1024, and 256 neurons, respectively, with one BN layer inserted between every two FC layers. The activation function of the first two FC layers is ReLU while the last FC layer does not use activation.
• CNN Without Attention: The same CNN structure but with all the attention modules removed.

With HAD: The following algorithms are selected as baselines:
• Separate LS: A total of $N/M$ estimates are executed. In each estimate, only $M$ antennas are switched on by adjusting $\mathbf{W}$, and their channels are obtained by LS estimation [7].
• S-VBI: One of the state-of-the-art CS-based algorithms designed for narrow angular spread channel estimation in HAD massive MIMO systems, where the spatial-clustered channel sparsity is embedded to improve the estimation performance [11]. The source code is provided by the authors of [11].
• FNN Without Attention: Adopt the same structure of FNN as in the former scenario while the number of neurons reduces to 256, 512, and 256, respectively, with smaller input dimension.
• CNN: The structure of CNN is also similar to the former scenario, except that the output layer is changed from Conv1D to FC for dimension conversion.
• CNN Without Attention: The same CNN structure but with all the attention modules removed.

⁵ The angular region estimations of samples are assumed to be accurate for simplicity.

A. Impacts of Network Parameters
To determine the best network structures for two scenarios, we investigate the impacts of key network parameters on network performance. Without HAD, the structure of CNN is mainly determined by the number of convolutional blocks, $N_B$, and the number of filters of each Conv1D layer, $F$. As illustrated in Fig. 6(a), attention can improve the performance of CNNs with various numbers of convolutional blocks and filters and the performance of a two-layer attention-aided CNN is even better than a four-layer CNN without attention, which indicates the superiority of the attention mechanism. In general, the performance of networks is better with stronger representation capability brought by more convolutional blocks. However, with enough filters, the performance improvement of attention-aided CNN is marginal if the number of filters keeps growing and it can even be harmful to CNN without attention sometimes. Besides, deeper and wider CNNs also have heavier computing and storage burdens. To strike a balance between performance and complexity, we choose to use four convolutional blocks and 96 filters for each Conv1D layer.

With HAD, the structure of the attention-aided FNN is mainly determined by the number of neurons of the hidden FC layer $F \times C$ and the way of reshaping in the attention embedding module. As in Fig. 6(b), the network performs best when $F \times C = 3072$ and the performance will deteriorate with either too few or many neurons. Besides, as can be indicated from the bowl shape of curves, a medium number of features in each channel performs best when $F \times C$ is fixed. The reason is that the number of channels is too small and there is not enough degrees of freedom for dynamic adjustment of attention maps when $F$ is too large, while each channel does not contain enough features to effectively capture the global information [40] when $F$ is too small. So, we choose to reshape the feature vector into 192 channels with 16 features in each channel.

B. Impacts of System Parameters
In this subsection, the impacts of various system parameters are investigated to validate the superiority and universality of the proposed approach.
Fig. 8. Impact of angular spread in the two considered scenarios.

Fig. 9. Impact of antenna number and RF chain ratio in the two considered scenarios.
or, more precisely, its sine value. So, we select three sine
value ranges for comparison, where the first two ranges are
close to each other and the third range is far away from the
first two ranges. The average attention maps of validation
data samples whose average AoAs are inside the three ranges
are plotted in Fig. 11. The number of elements of each
attention map equals to the corresponding channel number of
the feature matrix and the values of the elements represent
the scale factors acting on the original features. Due to space
limitation, only the 16-th to the 48-th channels are displayed
here. A larger scale factor indicates more important channel of
features. From the figure, we have the following observations:
• Without HAD, the role of attention is different in different
depths of the attention-aided CNN. Specifically, as is
shown in the first two subfigures, features are scaled in
an angle-agnostic manner in shallower layers with small
differences among average attention maps of different
sine value ranges while the distributions of average atten-
tion maps become increasingly angle-specific in deeper
layers. Notice that, the mean value of the 38-th scale
factor of the third attention map varies significantly with
sine value ranges. Reasonably, it can be inferred as a key
angle-related feature in the considered problem. Such a
phenomenon is also consistent with a typical discipline in
DNNs that earlier layer features are more general while
later layer features exhibit greater specificity [47].
• The distributions of average attention maps of closer sine
Fig. 10. Generalization to SNRs with different training methods in the two value ranges are more similar. From the second subfigure,
considered scenarios. the curves of the first two ranges are very close to each
where the MSE is weighted by the SNR of data sample other, while the curve of the third range is apparently
and n is the number of samples in a mini-batch. As indicated by the two close curves marked with circle and cross, networks trained with mixed SNRs achieve similar performance to those trained with accurate SNRs and significantly outperform networks trained with a single SNR point.
As for the generalization to other parameters, detailed results are omitted here due to space limitations, while the trends and patterns are similar. In conclusion, through mixed-parameter training and proper design of the loss function, a single network with strong robustness can be obtained to handle all situations during testing, which is very appealing in practical applications.
D. The Role of Attention
Although it is hard to rigorously analyze the representations learned by DNNs, we still try to attain at least a primitive understanding of the role of attention. Intuitively, the performance gain of attention can be considered to come from the “divide-and-conquer” policy realized by the dynamic adjustment of attention maps. In this way, sample-specific processing can be performed on different data samples to improve the performance. Without attention, the processing performed by the network is fixed for all data samples, which is less flexible. Next, we will analyze the distributions of learned attention maps to roughly corroborate this.
different from them. It can be regarded as the embodiment of “divide-and-conquer” since the channel estimation for data samples in the first two ranges and the third range can be regarded as two different subproblems, which are “divided” by different attention maps first and then “conquered” subsequently.
• As illustrated in the third subfigure, all scale factors in the fourth attention map are 0.5, which is due to the zero output of the preceding ReLU activation function and the Sigmoid activation function used to predict the attention map. Therefore, the last attention module is actually useless and can be removed during testing to further reduce the complexity [40].
• From the fourth subfigure, the differences of average attention maps between sine value ranges are bigger and the binarization level of scale factors is higher in the HAD scenario. Only one attention module is used in the attention-aided FNN, so the “divide” process has to be realized more intensely, which is different from the attention-aided CNN used in the scenario without HAD. Another reason might be that, compared with the denoising process in the former scenario, reversing the effect of W is more angle-related, and therefore the “divide-and-conquer” policy is reflected more fully. When dealing with a certain subproblem, only specific features are kept and others are totally abandoned.
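The observation that a zero ReLU output followed by a Sigmoid yields scale factors of exactly 0.5 can be reproduced with a small NumPy sketch. It assumes a squeeze-and-excitation style module in the spirit of [40] without bias terms; the function names (`attention_map`, `apply_attention`), the toy dimensions, and all weight values are hypothetical and are not taken from the trained networks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_map(features, w1, w2):
    """Bias-free SE-style attention (illustrative assumption).

    features: (C, L) array, C feature channels of length L.
    w1: (C_r, C) weights of the first dense layer (followed by ReLU).
    w2: (C, C_r) weights of the second dense layer (followed by Sigmoid).
    Returns one scale factor in (0, 1) per channel.
    """
    squeezed = features.mean(axis=1)         # global average pooling, shape (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # ReLU
    return sigmoid(w2 @ hidden)              # attention map, shape (C,)

def apply_attention(features, scale):
    # Channel-wise rescaling of the feature matrix.
    return features * scale[:, None]

# Two toy samples with non-negative features (hypothetical values).
features_a = np.array([[1.0, 2.0, 3.0],
                       [0.0, 1.0, 2.0],
                       [4.0, 4.0, 4.0],
                       [0.0, 0.0, 3.0]])
features_b = features_a[::-1]

w1_live = np.array([[0.1, 0.2, 0.0, -0.1],
                    [0.0, -0.1, 0.1, 0.2]])
w2 = np.array([[1.0, -1.0],
               [0.5, 0.5],
               [-1.0, 1.0],
               [2.0, 0.0]])

# "Live" module: the map changes from sample to sample.
live_a = attention_map(features_a, w1_live, w2)
live_b = attention_map(features_b, w1_live, w2)

# "Dead" module: non-positive weights on non-negative inputs make every
# ReLU output zero, so the Sigmoid sees zeros and returns 0.5 everywhere.
w1_dead = -np.abs(w1_live)
dead_a = attention_map(features_a, w1_dead, w2)
dead_b = attention_map(features_b, w1_dead, w2)

print("live maps:", live_a, live_b)
print("dead maps:", dead_a, dead_b)
```

In this sketch the dead module returns the same map for every sample, so its only effect is a fixed scaling by 0.5 and no sample-specific processing takes place, which is consistent with the remark that such a module can be dropped at test time.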
Due to the narrow angular spread characteristic, the channel distribution is highly related to the average AoA parameter,
Apart from the statistical characteristics, Fig. 12 also presents the attention maps of two exemplary data samples
with close average AoAs. Although the average AoAs are almost the same, the attention maps of these two data samples are still dramatically different, which reveals the sample-specific nature of attention. The reason is that although average AoA
Fig. 11. Average attention maps of data samples in three ranges. The legend (a, b) denotes the range where the minimum and maximum sine values of average AoAs are a and b, respectively.
Fig. 12. The attention maps of two exemplary data samples with very close average AoAs.
TABLE II
COMPLEXITY COMPARISON WITHOUT HAD
E. Complexity Comparison
Under typical system settings where N = 128, M = 32, Lp = K = 10, and IE = 50, the specific complexity of different algorithms is compared in Table II and Table III. Note that the last attention layer in the attention-aided CNN is removed during testing, as mentioned above. Besides, for MMSE 3°, CCMs computed from channel samples whose average AoAs have the same sine values can be shared to halve the number of parameters.
As we can see, without HAD, the number of parameters increases by only 19.86% with the use of attention, and the additional FLOPs overhead introduced by attention is almost negligible. Although the FLOPs of the attention-aided CNN are currently slightly higher than those of MMSE, they will be much smaller than those of MMSE if the antenna number keeps growing. Besides, the parameter number of MMSE 3° is also quite large since
tens of CCMs are required to exploit the narrow angular spread characteristic of channels.
TABLE III
COMPLEXITY COMPARISON WITH HAD
In the scenario with HAD, we only compare three algorithms with practical performance. The attention-aided CNN and FNN have similar parameter numbers, while the FLOPs of the attention-aided FNN are much lower. Recall that its performance is also better than that of the attention-aided CNN, which indicates its superiority. The FLOPs of S-VBI are significantly higher than those of the DL-based methods. In simulation, when both run on CPU, the attention-aided FNN can be hundreds of times faster than S-VBI in terms of clock time, and the advantage will be even larger if accelerated by GPU.
VI. CONCLUSION
In this paper, we have proposed a novel attention-aided DL framework for massive MIMO channel estimation. Both the scenarios without and with HAD are considered, and scenario-specific neural networks are customized correspondingly. By integrating the attention mechanism into the CNN and FNN, the narrow angular spread characteristic of the channel can be effectively exploited, which is realized by the “divide-and-conquer” policy to dynamically adjust attention maps. The proposed approach significantly improves the performance with relatively low complexity.
REFERENCES
[1] F. Rusek et al., “Scaling up MIMO: Opportunities and challenges with very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, Jan. 2013.
[2] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta, “Massive MIMO for next generation wireless systems,” IEEE Commun. Mag., vol. 52, no. 2, pp. 186–195, Feb. 2014.
[3] Y. S. Cho, J. Kim, W. Y. Yang, and C.-G. Kang, MIMO-OFDM Wireless Communications With MATLAB. Singapore: Wiley, 2010.
[4] F. Sohrabi and W. Yu, “Hybrid digital and analog beamforming design for large-scale antenna arrays,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 501–513, Apr. 2016.
[5] A. F. Molisch et al., “Hybrid beamforming for massive MIMO: A survey,” IEEE Commun. Mag., vol. 55, no. 9, pp. 134–141, Sep. 2017.
[6] S. Guo, H. Zhang, P. Zhang, P. Zhao, L. Wang, and M.-S. Alouini, “Generalized beamspace modulation using multiplexing: A breakthrough in mmWave MIMO,” IEEE J. Sel. Areas Commun., vol. 37, no. 9, pp. 2014–2028, Sep. 2019.
[7] D. Fan et al., “Angle domain channel estimation in hybrid millimeter wave massive MIMO systems,” IEEE Trans. Wireless Commun., vol. 17, no. 12, pp. 8165–8179, Dec. 2018.
[8] J. Lee, G.-T. Gil, and Y. H. Lee, “Channel estimation via orthogonal matching pursuit for hybrid MIMO systems in millimeter wave communications,” IEEE Trans. Commun., vol. 64, no. 6, pp. 2370–2386, Jun. 2016.
[9] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Trans. Signal Process., vol. 56, no. 6, pp. 2346–2356, Jun. 2008.
[10] Y. Wang, A. Liu, X. Xia, and K. Xu, “Learning the structured sparsity: 3-D massive MIMO channel estimation and adaptive spatial interpolation,” IEEE Trans. Veh. Technol., vol. 68, no. 11, pp. 10663–10678, Nov. 2019.
[11] X. Xia, K. Xu, S. Zhao, and Y. Wang, “Learning the time-varying massive MIMO channels: Robust estimation and data-aided prediction,” IEEE Trans. Veh. Technol., vol. 69, no. 8, pp. 8080–8096, Aug. 2020.
[12] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep learning in physical layer communications,” IEEE Wireless Commun., vol. 26, no. 2, pp. 93–99, Apr. 2019.
[13] H. Ye, L. Liang, G. Y. Li, and B.-H. F. Juang, “Deep learning-based end-to-end wireless communication systems with conditional GANs as unknown channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3133–3143, May 2020.
[14] J. Gao, X. Yi, C. Zhong, X. Chen, and Z. Zhang, “Deep learning for spectrum sensing,” IEEE Wireless Commun. Lett., vol. 8, no. 6, pp. 1727–1730, Dec. 2019.
[15] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for wireless resource management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[16] H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 3163–3173, Apr. 2019.
[17] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2282–2292, Oct. 2019.
[18] L. Liang, H. Ye, G. Yu, and G. Y. Li, “Deep-learning-based wireless resource allocation with application to vehicular networks,” Proc. IEEE, vol. 108, no. 2, pp. 341–356, Feb. 2020.
[19] J. Gao, C. Zhong, X. Chen, H. Lin, and Z. Zhang, “Unsupervised learning for passive beamforming,” IEEE Commun. Lett., vol. 24, no. 5, pp. 1052–1056, May 2020.
[20] H. Song, M. Zhang, J. Gao, and C. Zhong, “Unsupervised learning-based joint active and passive beamforming design for reconfigurable intelligent surfaces aided wireless networks,” IEEE Commun. Lett., vol. 25, no. 3, pp. 892–896, Mar. 2021, doi: 10.1109/LCOMM.2020.3041510.
[21] H. Ye, G. Y. Li, and B.-H. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
[22] P. Jiang et al., “Artificial intelligence-aided OFDM receiver: Design and experimental results,” Dec. 2018, arXiv:1812.06638. [Online]. Available: http://arxiv.org/abs/1812.06638
[23] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Model-driven deep learning for MIMO detection,” IEEE Trans. Signal Process., vol. 68, pp. 1702–1715, Feb. 2020.
[24] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Deep learning-based channel estimation for beamspace mmWave massive MIMO systems,” IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 852–855, Oct. 2018.
[25] W. Chen, B. Zhang, S. Jin, B. Ai, and Z. Zhong, “Solving sparse linear inverse problems in communication systems: A deep learning approach with adaptive depth,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 4–17, Jan. 2021.
[26] P. Dong, H. Zhang, G. Y. Li, I. Gaspar, and N. NaderiAlizadeh, “Deep CNN-based channel estimation for mmWave massive MIMO systems,” IEEE J. Sel. Topics Signal Process., vol. 13, no. 5, pp. 989–1000, Sep. 2019.
[27] P. Wu and J. Cheng, “Deep unfolding basis pursuit: Improving sparse channel reconstruction via data-driven measurement matrices,” Jul. 2020, arXiv:2007.05177. [Online]. Available: http://arxiv.org/abs/2007.05177
[28] X. Ma and Z. Gao, “Data-driven deep learning to design pilot and channel estimator for massive MIMO,” IEEE Trans. Veh. Technol., vol. 69, no. 5, pp. 5677–5682, May 2020.
[29] Y. Yang, S. Zhang, F. Gao, J. Ma, and O. A. Dobre, “Graph neural network-based channel tracking for massive MIMO networks,” IEEE Commun. Lett., vol. 24, no. 8, pp. 1747–1751, Aug. 2020.
[30] Y. Yang, F. Gao, C. Xing, J. An, and A. Alkhateeb, “Deep multimodal learning: Merging sensory data for massive MIMO channel prediction,” Jul. 2020, arXiv:2007.09366. [Online]. Available: http://arxiv.org/abs/2007.09366
[31] M. Wenyan, Q. Chenhao, Z. Zhang, and J. Cheng, “Sparse channel estimation and hybrid precoding using deep learning for millimeter wave massive MIMO,” IEEE Trans. Commun., vol. 68, no. 5, pp. 2838–2849, Feb. 2020.
[32] Y. Yang, F. Gao, G. Y. Li, and M. Jian, “Deep learning-based downlink channel prediction for FDD massive MIMO system,” IEEE Commun. Lett., vol. 23, no. 11, pp. 1994–1998, Nov. 2019.
[33] A. Al-Hourani, S. Kandeepan, and A. Jamalipour, “Modeling air-to-ground path loss for low altitude platforms in urban environments,” in Proc. IEEE Global Commun. Conf., Austin, TX, USA, Dec. 2014, pp. 2898–2904.
[34] Q. Cai, C. Dong, and K. Niu, “Attention model for massive MIMO CSI compression feedback and recovery,” in Proc. IEEE Wireless Commun. Netw. Conf. (WCNC), Marrakesh, Morocco, Apr. 2019, pp. 1–5.
[35] D. J. Ji and D.-H. Cho, “ChannelAttention: Utilizing attention layers for accurate massive MIMO channel feedback,” IEEE Wireless Commun. Lett., vol. 10, no. 5, pp. 1079–1082, May 2021.
[36] J. Xu, B. Ai, W. Chen, A. Yang, P. Sun, and M. Rodrigues, “Wireless image transmission using deep source channel coding with attention modules,” Nov. 2020, arXiv:2012.00533. [Online]. Available: http://arxiv.org/abs/2012.00533
[37] H. Xie, F. Gao, S. Zhang, and S. Jin, “A unified transmission strategy for TDD/FDD massive MIMO systems with spatial basis expansion model,” IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3170–3184, Apr. 2017.
[38] Y. Yang, F. Gao, X. Ma, and S. Zhang, “Deep learning-based channel estimation for doubly selective fading channels,” IEEE Access, vol. 7, pp. 36579–36589, Mar. 2019.
[39] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[40] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” Sep. 2017, arXiv:1709.01507. [Online]. Available: http://arxiv.org/abs/1709.01507
[41] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” Nov. 2017, arXiv:1711.07971. [Online]. Available: http://arxiv.org/abs/1711.07971
[42] J. Fu et al., “Dual attention network for scene segmentation,” Sep. 2018, arXiv:1809.02983. [Online]. Available: http://arxiv.org/abs/1809.02983
[43] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, Lille, France, Jul. 2015, pp. 448–456.
[44] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. AISTATS, vol. 9, May 2010, pp. 249–256.
[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Dec. 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[46] R. Caruana, S. Lawrence, and L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Proc. NIPS, Denver, CO, USA, Dec. 2000, pp. 1–7.
[47] A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” in Proc. ICLR, Vancouver, BC, Canada, Apr./May 2018, pp. 1–5.
Jiabao Gao (Student Member, IEEE) received the B.S. degree in information engineering from Zhejiang University, Hangzhou, China, in 2019, where he is currently pursuing the Ph.D. degree with Zhejiang Provincial Key Laboratory of Information Processing, Communication and Networking. In October 2021, he will become a Visiting Student with the Department of Electrical and Electronic Engineering, Imperial College London, England. His current research interests include massive MIMO, channel estimation, and machine learning for wireless communications.
Mu Hu (Graduate Student Member, IEEE) received the B.E. degree in information engineering from Zhejiang University, Hangzhou, China, in 2019, where he is currently pursuing the M.Sc. degree with the Information Science and Electronic Engineering College. His current research interests are depth completion in computer vision and model acceleration in deep learning.
Caijun Zhong (Senior Member, IEEE) received the B.S. degree in information engineering from Xi’an Jiaotong University, Xi’an, China, in 2004, and the M.S. degree in information security and the Ph.D. degree in telecommunications from University College London, London, U.K., in 2006 and 2010, respectively.
From September 2009 to September 2011, he was a Research Fellow at the Institute for Electronics, Communications and Information Technologies (ECIT), Queen’s University Belfast, Belfast, U.K. Since September 2011, he has been with Zhejiang University, Hangzhou, China, where he is currently a Professor. His current research interests include reconfigurable intelligent surfaces assisted communications and artificial intelligence-based wireless communications. He was a recipient of the 2013 IEEE ComSoc Asia-Pacific Outstanding Young Researcher Award. He and his coauthors have been awarded the Best Paper Award at the IEEE GLOBECOM 2020 and IEEE ICC 2019. He was an Editor of the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS and IEEE COMMUNICATIONS LETTERS. He is an Editor of Science China Information Sciences and China Communications.
Geoffrey Ye Li (Fellow, IEEE) has been a Chair Professor at Imperial College London, U.K., since 2020. Before moving to Imperial, he was a Professor with Georgia Institute of Technology, GA, USA, for 20 years, and a Principal Technical Staff Member with AT&T Labs Research, NJ, USA, for five years. His general research interests include statistical signal processing and machine learning for wireless communications. In the related areas, he has published over 600 journal and conference papers in addition to over 40 granted patents and several books. His publications have been cited over 48,000 times and he has been recognized as a Highly Cited Researcher, by Thomson Reuters, almost every year.
He was awarded an IEEE Fellow for his contributions to signal processing for wireless communications in 2005. He won several prestigious awards from IEEE Signal Processing Society (Donald G. Fink Overview Paper Award in 2017), IEEE Vehicular Technology Society (James Evans Avant Garde Award in 2013 and Jack Neubauer Memorial Award in 2014), and IEEE Communications Society (Stephen O. Rice Prize Paper Award in 2013, the Award for Advances in Communication in 2017, and Edwin Howard Armstrong Achievement Award in 2019). He received the 2015 Distinguished ECE Faculty Achievement Award from Georgia Tech. He has organized and chaired many international conferences, including as the Technical Program Vice-Chair of the IEEE ICC’03 and the General Co-Chair of the IEEE GlobalSIP’14, the IEEE VTC’19 (Fall), and the IEEE SPAWC’20. He has been involved in editorial activities for over 20 technical journals, including as the Founding Editor-in-Chief of IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS (JSAC) Special Series on ML in Communications and Networking.
Zhaoyang Zhang (Senior Member, IEEE) received the Ph.D. degree from Zhejiang University, Hangzhou, China, in 1998. He is currently a Qiushi Distinguished Professor with Zhejiang University. He has published more than 300 peer-reviewed international journal and conference papers, including six conference best papers. His current research interests are mainly focused on the fundamental aspects of wireless communications and networking, such as information theory and coding, network signal processing and distributed learning, AI-empowered communications and networking, network intelligence with synergetic sensing, and computation and communication. He was awarded the National Natural Science Fund for Outstanding Young Scholars by NSFC in 2017. He is serving as an Editor for IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, IEEE TRANSACTIONS ON COMMUNICATIONS, and IET Communications. He has served as the General Chair and the TPC Co-Chair or the Symposium Co-Chair for WCSP 2013/2018, GLOBECOM 2014 Wireless Communications Symposium, and VTC-Spring 2017 Workshop HMWC.