A General Survey On Attention Mechanisms in Deep Learning

This paper provides a survey of attention mechanisms in deep learning. It introduces a general attention model and taxonomy to classify different attention types. The paper reviews evaluation measures for attention models and how their structure can be analyzed. It aims to explain attention mechanisms across domains using consistent notation. Future work directions in combining attention with other deep learning models are also discussed.

Uploaded by

Ajay Mukund

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 35, NO. 4, APRIL 2023 3279

A General Survey on Attention Mechanisms in Deep Learning
Gianni Brauwers and Flavius Frasincar

Abstract—Attention is an important mechanism that can be employed for a variety of deep learning models across many different
domains and tasks. This survey provides an overview of the most important attention mechanisms proposed in the literature.
The various attention mechanisms are explained by means of a framework consisting of a general attention model, uniform notation,
and a comprehensive taxonomy of attention mechanisms. Furthermore, the various measures for evaluating attention models are
reviewed, and methods to characterize the structure of attention models based on the proposed framework are discussed. Last, future
work in the field of attention models is considered.

Index Terms—Attention models, deep learning, introductory and survey, neural nets, supervised learning

1 INTRODUCTION

The idea of mimicking human attention first arose in the field of computer vision [1], [2] in an attempt to reduce the computational complexity of image processing while improving performance by introducing a model that would only focus on specific regions of images instead of the entire picture. However, the true starting point of the attention mechanisms we know today is often attributed to the field of natural language processing [3]. Bahdanau et al. [3] implement attention in a machine translation model to address certain issues with the structure of recurrent neural networks. After Bahdanau et al. [3] emphasized the advantages of attention, the attention techniques were refined [4] and quickly became popular for a variety of tasks, such as text classification [5], [6], image captioning [7], [8], sentiment analysis [6], [9], and speech recognition [10], [11], [12].

Attention has become a popular technique in deep learning for several reasons. First, models that incorporate attention mechanisms attain state-of-the-art results for all of the previously mentioned tasks, and many others. Furthermore, most attention mechanisms can be trained jointly with a base model, such as a recurrent neural network or a convolutional neural network, using regular backpropagation [3]. Additionally, attention introduces a certain type of interpretation into neural network models [8] that are generally known to be highly complicated to interpret. Moreover, the popularity of attention mechanisms was additionally boosted after the introduction of the Transformer model [13] that further proved how effective attention can be. Attention was originally introduced as an extension to recurrent neural networks [14]. However, the Transformer model proposed in [13] represents a major development in attention research as it demonstrates that the attention mechanism is sufficient to build a state-of-the-art model. This means that disadvantages, such as the fact that recurrent neural networks are particularly difficult to parallelize, can be circumvented. As was the case for the introduction of the original attention mechanism [3], the Transformer model was created for machine translation, but was quickly adopted for other tasks, such as image processing [15], video processing [16], and recommender systems [17].

The purpose of this survey is to explain the general form of attention, and provide a comprehensive overview of attention techniques in deep learning. Other surveys have already been published on the subject of attention models. For example, in [18], a survey is presented on attention in computer vision, [19] provides an overview of attention in graph models, and [20], [21], [22] are all surveys on attention in natural language processing. This paper partly builds on the information presented in the previously mentioned surveys. Yet, we provide our own significant contributions. The main difference between this survey and the previously mentioned ones is that the other surveys generally focus on attention models within a certain domain. This survey, however, provides a cross-domain overview of attention techniques. We discuss the attention techniques in a general way, allowing them to be understood and applied in a variety of domains. Furthermore, we found the taxonomies presented in previous surveys to be lacking the depth and structure needed to properly distinguish the various attention mechanisms. Additionally, certain significant attention techniques have not yet been properly discussed in previous surveys, while other presented attention mechanisms seem to be lacking either technical details or intuitive explanations. Therefore, in this paper, we present important attention techniques by means of a single framework using a uniform notation, a combination of both technical and intuitive explanations for each presented attention technique, and a comprehensive taxonomy of attention mechanisms.

The authors are with the Erasmus School of Economics, Erasmus University Rotterdam, 3000 DR, Rotterdam, The Netherlands. E-mail: {brauwers, frasincar}@ese.eur.nl.
Manuscript received 6 July 2020; revised 8 Oct. 2021; accepted 29 Oct. 2021. Date of publication 9 Nov. 2021; date of current version 7 Mar. 2023.
(Corresponding author: Flavius Frasincar.)
Recommended for acceptance by L. Chen.
Digital Object Identifier no. 10.1109/TKDE.2021.3126456

1041-4347 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on October 23,2023 at 05:05:47 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. An illustration of the general structure of the task model.

Fig. 2. The inner mechanisms of the general attention module.

The structure of this paper is as follows. Section 2 introduces a general attention model that provides the reader with a basic understanding of the properties of attention and how it can be applied. One of the main contributions of this paper is the taxonomy of attention techniques presented in Section 3. In this section, attention mechanisms are explained and categorized according to the presented taxonomy. Section 4 provides an overview of performance measures and methods for evaluating attention models. Furthermore, the taxonomy is used to evaluate the structure of various attention models. Lastly, in Section 5, we give our conclusions and suggestions for further research.

2 GENERAL ATTENTION MODEL

This section presents a general form of attention with corresponding notation. The notation introduced here is based on the notation that was introduced in [23] and popularized in [13]. The framework presented in this section is used throughout the rest of this paper.

To implement a general attention model, it is necessary to first describe the general characteristics of a model that can employ attention. We will refer to the complete model as the task model, of which the structure is presented in Fig. 1. This model simply takes an input, carries out the specified task, and produces the desired output. For example, the task model can be a language model that takes as input a piece of text, and produces as output a summary of the contents, a classification of the sentiment, or the text translated word for word to another language. Alternatively, the task model can take an image, and produce a caption or segmentation for that image. The task model consists of four submodels: the feature model, the query model, the attention model, and the output model. In Section 2.1, the feature model and query model are discussed, which are used to prepare the input for the attention calculation. In Section 2.2, the attention model and output model are discussed, which are concerned with producing the output. Last, in Section 2.3, we highlight the applications of attention.

2.1 Attention Input

Suppose the task model takes as input the matrix $X \in \mathbb{R}^{d_x \times n_x}$, where $d_x$ represents the size of the input vectors and $n_x$ represents the number of input vectors. The columns in this matrix can represent the words in a sentence, the pixels in an image, the characteristics of an acoustic sequence, or any other collection of inputs. The feature model is then employed to extract the $n_f$ feature vectors $f_1, \dots, f_{n_f} \in \mathbb{R}^{d_f}$ from $X$, where $d_f$ represents the size of the feature vectors. The feature model can be a recurrent neural network (RNN), a convolutional neural network (CNN), a simple embedding layer, a linear transformation of the original data, or no transformation at all. Essentially, the feature model consists of all the steps that transform the original input $X$ into the feature vectors $f_1, \dots, f_{n_f}$ that the attention model will attend to.

To determine which vectors to attend to, the attention model requires the query $q \in \mathbb{R}^{d_q}$, where $d_q$ indicates the size of the query vector. This query is extracted by the query model, and is generally designed based on the type of output that is desired of the model. A query tells the attention model which feature vectors to attend to. It can be interpreted literally as a query, or a question. For example, for the task of image captioning, suppose that one uses a decoder RNN model to produce the output caption based on feature vectors obtained from the image by a CNN. At each prediction step, the hidden state of the RNN model can be used as a query to attend to the CNN feature vectors. In each step, the query is a question in the sense that it asks for the necessary information from the feature vectors based on the current prediction context.

2.2 Attention Output

The feature vectors and query are used as input for the attention model. This model consists of a single general attention module, or a collection of them. An overview of a general attention module is presented in Fig. 2. The input of the general attention module is the query $q \in \mathbb{R}^{d_q}$, and the matrix of feature vectors $F = [f_1, \dots, f_{n_f}] \in \mathbb{R}^{d_f \times n_f}$. Two separate matrices are extracted from the matrix $F$: the keys matrix $K = [k_1, \dots, k_{n_f}] \in \mathbb{R}^{d_k \times n_f}$, and the values matrix $V = [v_1, \dots, v_{n_f}] \in \mathbb{R}^{d_v \times n_f}$, where $d_k$ and $d_v$ indicate, respectively, the dimensions of the key vectors (columns of $K$) and value vectors (columns of $V$). The general way of obtaining these matrices is through a linear transformation of $F$ using the weight matrices $W_K \in \mathbb{R}^{d_k \times d_f}$ and $W_V \in \mathbb{R}^{d_v \times d_f}$, for $K$ and $V$, respectively. The calculations of $K$ and $V$ are presented in (1). Both weight matrices can be learned during training or predefined by the researcher. For example, one can choose to define both $W_K$ and $W_V$ as equal to the identity matrix to retain the original feature vectors. Other ways of defining the keys and the values are also possible, such as using completely separate inputs for the keys and values. The only constraint to be obeyed is that the number of columns in $K$ and $V$ remains the same.

$$\underset{d_k \times n_f}{K} = \underset{d_k \times d_f}{W_K} \times \underset{d_f \times n_f}{F}, \qquad \underset{d_v \times n_f}{V} = \underset{d_v \times d_f}{W_V} \times \underset{d_f \times n_f}{F}. \tag{1}$$
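As a small illustration of the linear transformations in (1), the following sketch computes a keys matrix and a values matrix from a toy feature matrix. It is not taken from the survey: the helper names `matmul` and `identity`, the toy numbers, and the particular choice of weights are assumptions made only for this example, and matrices are stored as plain lists of rows so that no external library is needed.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Toy feature matrix F with d_f = 2 and n_f = 3; each COLUMN is a feature vector.
F = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

# Hypothetical weights (in practice these would usually be learned):
W_K = [[1.0, -1.0]]   # d_k x d_f = 1 x 2, projects features to 1-dimensional keys
W_V = identity(2)     # d_v x d_f = 2 x 2, retains the original feature vectors

K = matmul(W_K, F)    # keys matrix,   d_k x n_f
V = matmul(W_V, F)    # values matrix, d_v x n_f

# The only constraint stated in the text: K and V have the same number of columns.
assert len(K[0]) == len(V[0])
print(K)
print(V)
```

Setting `W_V` to the identity mirrors the example in the text in which the original feature vectors are retained as values, while `W_K` shows that the keys may live in a space of a different dimension.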

The goal of the attention module is to produce a weighted average of the value vectors in $V$. The weights used to produce this output are obtained via an attention scoring and alignment step. The query $q$ and the keys matrix $K$ are used to calculate the vector of attention scores $e = [e_1, \dots, e_{n_f}] \in \mathbb{R}^{n_f}$. This is done via the score function $\mathrm{score}()$, as illustrated in (2).

$$\underset{1 \times 1}{e_l} = \mathrm{score}\big(\underset{d_q \times 1}{q}, \underset{d_k \times 1}{k_l}\big). \tag{2}$$

As discussed before, the query symbolizes a request for information. The attention score $e_l$ represents how important the information contained in the key vector $k_l$ is according to the query. If the dimensions of the query and key vectors are the same, an example of a score function would be to take the dot-product of the vectors. The different types of score functions are further discussed in Section 3.2.1.

Next, the attention scores are processed further through an alignment layer. The attention scores can generally have a wide range outside of $[0, 1]$. However, since the goal is to produce a weighted average, the scores are redistributed via an alignment function $\mathrm{align}()$ as defined in (3).

$$\underset{1 \times 1}{a_l} = \mathrm{align}\big(\underset{1 \times 1}{e_l}; \underset{n_f \times 1}{e}\big), \tag{3}$$

where $a_l \in \mathbb{R}^1$ is the attention weight corresponding to the $l$th value vector. One example of an alignment function would be to use a softmax function, but the various other alignment types are discussed in Section 3.2.2. The attention weights provide a rather intuitive interpretation for the attention module. Each weight is a direct indication of how important each feature vector is relative to the others for this particular problem. This can provide us with a more in-depth understanding of the model behaviour, and the relations between inputs and outputs. The vector of attention weights $a = [a_1, \dots, a_{n_f}] \in \mathbb{R}^{n_f}$ is used to produce the context vector $c \in \mathbb{R}^{d_v}$ by calculating a weighted average of the columns of the values matrix $V$, as shown in (4).

$$\underset{d_v \times 1}{c} = \sum_{l=1}^{n_f} \underset{1 \times 1}{a_l} \times \underset{d_v \times 1}{v_l}. \tag{4}$$

As illustrated in Fig. 1, the context vector is then used in the output model to create the output $\hat{y}$. This output model translates the context vector into an output prediction. For example, it could be a simple softmax layer that takes as input the context vector $c$, as shown in (5).

$$\underset{d_y \times 1}{\hat{y}} = \mathrm{softmax}\big(\underset{d_y \times d_v}{W_c} \times \underset{d_v \times 1}{c} + \underset{d_y \times 1}{b_c}\big), \tag{5}$$

where $d_y$ is the number of output choices or classes, and $W_c \in \mathbb{R}^{d_y \times d_v}$ and $b_c \in \mathbb{R}^{d_y}$ are trainable weights.

2.3 Attention Applications

Attention is a rather general mechanism that can be used in a wide variety of problem domains. Consider the task of machine translation using an RNN model. Also, consider the problem of image classification using a basic CNN model. While an RNN produces a sequence of hidden state vectors, a CNN creates feature maps, where each region in the image is represented by a feature vector. The RNN hidden states are organized sequentially, while the CNN feature maps are organized spatially. Yet, attention can still be applied in both situations, since the attention mechanism does not inherently depend on the organization of the feature vectors. This characteristic makes attention easy to implement in a wide variety of models in different domains.

Another domain where attention can be applied is audio processing [24], [25]. Acoustic sequences can be represented by a sequence of feature vectors that relate to certain time periods of the audio sample. These vectors could simply be the raw input audio, or they can be extracted via, for example, an RNN or CNN. Video processing is another domain where attention can be applied intuitively [26], [27]. Video data consists of sequences of images, so attention can be applied to the individual images, as well as the entire sequence. Recommender systems often incorporate a user's interaction history to produce recommendations. Feature vectors can be extracted based on, for example, the IDs or other characteristics of the products the user interacted with, and attention can be applied to them [28]. Attention can generally also be applied to many problems that use a time series as input, be it medical [29], financial [30], or anything else, as long as feature vectors can be extracted.

The fact that attention does not rely on the organization of the feature vectors allows it to be applied to various problems that each use data with different structures, as illustrated by the previous domain examples. Yet, this can be taken even further by applying attention to data with an irregular structure. For example, protein structures, city traffic flows, and communication networks cannot always be represented using neatly structured organizations, such as sequences, like time series, or grids, like images. In such cases, the different aspects of the data are often represented as nodes in a graph. These nodes can be represented by feature vectors, meaning that attention can be applied in domains that use graph-structured data as well [19], [31].

In general, attention can be applied to any problem for which a set of feature vectors can be defined or extracted. As such, the general attention model presented in Fig. 2 is applicable to a wide range of domains. The problem, however, is that there is a large variety of different applications and extensions of the general attention module. As such, in Section 3, a comprehensive overview is provided of a collection of different attention mechanisms.

3 ATTENTION TAXONOMY

There are many different types of attention mechanisms and extensions, and a model can use different combinations of these attention techniques. As such, we propose a taxonomy that can be used to classify different types of attention mechanisms. Fig. 3 provides a visual overview of the different categories and subcategories that the attention mechanisms can be organized in. The three major categories are based on whether an attention technique is designed to handle specific types of feature vectors (feature-related), specific types of model queries (query-related), or whether it is simply a general mechanism that is related to neither the feature model, nor the query model (general). Further explanations of these categories and their subcategories are provided in the following subsections. Each mechanism discussed in this section is either a modification to the existing inner mechanisms of the general attention module presented in Section 2, or an extension of it.

Fig. 3. A taxonomy of attention mechanisms.

The presented taxonomy can also be used to analyze the architecture of attention models. Namely, the major categories and their subcategories can be interpreted as orthogonal dimensions of an attention model. An attention model can consist of a combination of techniques taken from any or all categories. Some characteristics, such as the scoring and alignment functions, are generally required for any attention model. Other mechanisms, such as multi-head attention or co-attention, are not necessary in every situation. Lastly, in Table 1, an overview of used notation with corresponding descriptions is provided.

3.1 Feature-Related Attention Mechanisms

Based on a particular set of input data, a feature model extracts feature vectors so that the attention model can attend to these various vectors. These features may have specific structures that require special attention mechanisms to handle them. These mechanisms can be categorized to deal with one of the following feature characteristics: the multiplicity of features, the levels of features, or the representations of features.

3.1.1 Multiplicity of Features

For most tasks, a model only processes a single input, such as an image, a sentence, or an acoustic sequence. We refer to such a mechanism as singular features attention. Other models are designed to use attention based on multiple inputs to allow one to introduce more information into the model that can be exploited in various ways. However, this does imply the presence of multiple feature matrices that require special attention mechanisms to be fully used. For example, [32] introduces a concept named co-attention to allow the proposed visual question answering (VQA) model to jointly attend to both an image and a question.

TABLE 1
Notation

Symbol    Description
$F$       Matrix of size $d_f \times n_f$ containing the feature vectors $f_1, \dots, f_{n_f} \in \mathbb{R}^{d_f}$ as columns. These feature vectors are extracted by the feature model.
$K$       Matrix of size $d_k \times n_f$ containing the key vectors $k_1, \dots, k_{n_f} \in \mathbb{R}^{d_k}$ as columns. These vectors are used to calculate the attention scores.
$V$       Matrix of size $d_v \times n_f$ containing the value vectors $v_1, \dots, v_{n_f} \in \mathbb{R}^{d_v}$ as columns. These vectors are used to calculate the context vector.
$W_K$     Weights matrix of size $d_k \times d_f$ used to create the $K$ matrix from the $F$ matrix.
$W_V$     Weights matrix of size $d_v \times d_f$ used to create the $V$ matrix from the $F$ matrix.
$q$       Query vector of size $d_q$. This vector essentially represents a question, and is used to calculate the attention scores.
$c$       Context vector of size $d_v$. This vector is the output of the attention model.
$e$       Score vector of size $n_f$ containing the attention scores $e_1, \dots, e_{n_f} \in \mathbb{R}^1$. These are used to calculate the attention weights.
$a$       Attention weights vector of size $n_f$ containing the attention weights $a_1, \dots, a_{n_f} \in \mathbb{R}^1$. These are the weights used in the calculation of the context vector.
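The notation of Table 1 and the steps of the general attention module in Section 2.2 can be tied together in a short sketch. The code below is only one possible instance under stated assumptions, not the implementation of any particular model: it assumes a dot-product score function (which requires $d_q = d_k$), a softmax alignment function, keys and values set equal to the features (identity $W_K$ and $W_V$), and a softmax output layer as in (5). All function names and toy numbers are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [x / total for x in exps]


def attention(q, K, V):
    """General attention module: scores (2), weights (3), context vector (4).

    K and V are lists of rows; their COLUMNS are the key and value vectors.
    """
    n_f = len(K[0])
    # (2) Dot-product score between the query and each key (assumes d_q == d_k).
    e = [sum(q[i] * K[i][l] for i in range(len(q))) for l in range(n_f)]
    # (3) Softmax alignment turns the scores into weights that sum to one.
    a = softmax(e)
    # (4) Context vector: weighted average of the columns of V.
    c = [sum(a[l] * V[i][l] for l in range(n_f)) for i in range(len(V))]
    return e, a, c


def output_model(c, W_c, b_c):
    """(5) A simple softmax output layer applied to the context vector."""
    logits = [sum(w * x for w, x in zip(row, c)) + b for row, b in zip(W_c, b_c)]
    return softmax(logits)


# Toy example: d_f = d_k = d_v = d_q = 2 and n_f = 3.
F = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]
K, V = F, F                      # assumption: W_K and W_V are identity matrices
q = [1.0, 0.0]                   # hypothetical query vector

e, a, c = attention(q, K, V)

W_c = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical output weights, d_y = 2
b_c = [0.0, 0.0]
y_hat = output_model(c, W_c, b_c)
print(e, a, c, y_hat)
```

In this toy setting the third feature vector receives the largest attention weight because its key has the largest dot product with the query, which illustrates how the weights in $a$ can be read as relative importance.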

Co-attention mechanisms can generally be split up into two groups [33]: coarse-grained co-attention and fine-grained co-attention. The difference between the two groups is the way attention scores are calculated based on the two feature matrices. Coarse-grained attention mechanisms use a compact representation of one feature matrix as a query when attending to the other feature vectors. Fine-grained co-attention, on the other hand, uses all feature vectors of one input as queries. As such, no information is lost, which is why these mechanisms are called fine-grained.

As an example of coarse-grained co-attention, [32] proposes an alternating co-attention mechanism that uses the context vector (which is a compact representation) from one attention module as the query for the other module, and vice versa. Alternating co-attention is presented in Fig. 4. Given a set of two input matrices $X^{(1)}$ and $X^{(2)}$, features are extracted by a feature model to produce the feature matrices $F^{(1)} \in \mathbb{R}^{d_f^{(1)} \times n_f^{(1)}}$ and $F^{(2)} \in \mathbb{R}^{d_f^{(2)} \times n_f^{(2)}}$, where $d_f^{(1)}$ and $d_f^{(2)}$ represent, respectively, the dimension of the feature vectors extracted from the first and second inputs, while $n_f^{(1)}$ and $n_f^{(2)}$ represent, respectively, the number of feature vectors extracted from the first and second inputs. In [32], co-attention is used for VQA, so the two input matrices are the image data and the question data, for which the feature model for the image consists of a CNN model, and the feature model for the question consists of word embeddings, a convolutional layer, a pooling layer, and an LSTM model.

Fig. 4. An illustration of alternating co-attention.

First, attention is calculated for the first set of features $F^{(1)}$ without the use of a query (Attention Module$_1$ in Fig. 4). In [32], an adjusted additive attention score function is used for this attention mechanism. The general form of the regular additive score function can be seen in (6).

$$\mathrm{score}\big(\underset{d_q \times 1}{q}, \underset{d_k \times 1}{k_l}\big) = \underset{1 \times d_w}{w^T} \times \mathrm{act}\big(\underset{d_w \times d_q}{W_1} \times \underset{d_q \times 1}{q} + \underset{d_w \times d_k}{W_2} \times \underset{d_k \times 1}{k_l} + \underset{d_w \times 1}{b}\big), \tag{6}$$

where $\mathrm{act}()$ is a non-linear activation function, and $w \in \mathbb{R}^{d_w}$, $W_1 \in \mathbb{R}^{d_w \times d_q}$, $W_2 \in \mathbb{R}^{d_w \times d_k}$, and $b \in \mathbb{R}^{d_w}$ are trainable weights, for which $d_w$ is a predefined dimension of the weight matrices. A variant of this score function adapted to be calculated without a query for the application at hand can be seen in (7).

$$\underset{1 \times 1}{e_l^{(0)}} = \underset{1 \times d_w}{w^{(1)T}} \times \mathrm{act}\big(\underset{d_w \times d_k^{(1)}}{W^{(1)}} \times \underset{d_k^{(1)} \times 1}{k_l^{(1)}} + \underset{d_w \times 1}{b^{(1)}}\big), \tag{7}$$

where $w^{(1)} \in \mathbb{R}^{d_w}$, $W^{(1)} \in \mathbb{R}^{d_w \times d_k^{(1)}}$, and $b^{(1)} \in \mathbb{R}^{d_w}$ are trainable weights for Attention Module$_1$, $k_l^{(1)} \in \mathbb{R}^{d_k^{(1)}}$ is the $l$th column of the keys matrix $K^{(1)}$ that was obtained from $F^{(1)}$ via a linear transformation (see (1)), for which $d_w$ is a prespecified dimension of the weight matrices and $d_k^{(1)}$ is a prespecified dimension of the key vectors.

Perhaps one may wonder why the query is absent when calculating attention in this manner. Essentially, the query in this attention model is learned alongside the other trainable parameters. As such, the query can be interpreted as a general question: "Which feature vectors contain the most important information?". This is also known as a self-attentive mechanism, since attention is calculated based only on the feature vectors themselves. Self-attention is explained in more detail in Section 3.3.1.

The scores are combined with an alignment function (see (3)), such as the softmax function, to create attention weights used to calculate the context vector $c^{(0)} \in \mathbb{R}^{d_v^{(1)}}$ (see (4)). This context vector is not used as the output of the attention model, but rather as a query for calculating the context vector $c^{(2)} \in \mathbb{R}^{d_v^{(2)}}$, based on the second feature matrix $F^{(2)}$, where $d_v^{(2)}$ is the dimension of the value vectors obtained from $F^{(2)}$ via a linear transformation (see (1)). For this module (Attention Module$_2$ in Fig. 4), attention scores are calculated using another score function with $c^{(0)}$ as query input, as presented in (8). Any function can be used in this situation, but an additive function is used in [32].

$$\underset{1 \times 1}{e_l^{(2)}} = \mathrm{score}\big(\underset{d_v^{(1)} \times 1}{c^{(0)}}, \underset{d_k^{(2)} \times 1}{k_l^{(2)}}\big). \tag{8}$$

These attention scores are then used to calculate attention weights using, for example, a softmax function as alignment function, after which the context vector $c^{(2)}$ can be derived as a weighted average of the second set of value vectors. Finally, the context vector $c^{(2)}$ is used as a query for the first attention module, which will produce the context vector $c^{(1)}$ for the first feature matrix $F^{(1)}$. Attention scores are calculated according to (9). In [32], the same function and weight matrices as seen in (7) are used, but with an added query, making it the same as the general additive score function (see (6)). The rest of the attention calculation is similar to before.

$$\underset{1 \times 1}{e_l^{(1)}} = \mathrm{score}\big(\underset{d_v^{(2)} \times 1}{c^{(2)}}, \underset{d_k^{(1)} \times 1}{k_l^{(1)}}\big). \tag{9}$$

The produced context vectors $c^{(1)}$ and $c^{(2)}$ are concatenated and used for prediction in the output model. Alternating co-attention inherently contains a form of sequentiality due to the fact that context vectors need to be calculated one after another. This may come with a computational disadvantage since it is not possible to parallelize. Instead of using a sequential mechanism like alternating co-attention, [34] proposes the interactive co-attention mechanism that can calculate attention on both feature matrices in parallel, as depicted in Fig. 5. Instead of using the context vectors as queries, unweighted averages of the key vectors are used as queries. The calculation of the average keys is provided in (10), and the calculation of the attention scores is shown in (11). Any score function can be used in this case, but an additive score function is used in [34].

Fig. 5. An illustration of interactive co-attention.

$$\underset{d_k^{(1)} \times 1}{\bar{k}^{(1)}} = \frac{1}{n_f^{(1)}} \sum_{l=1}^{n_f^{(1)}} \underset{d_k^{(1)} \times 1}{k_l^{(1)}}, \qquad \underset{d_k^{(2)} \times 1}{\bar{k}^{(2)}} = \frac{1}{n_f^{(2)}} \sum_{l=1}^{n_f^{(2)}} \underset{d_k^{(2)} \times 1}{k_l^{(2)}}, \tag{10}$$

$$\underset{1 \times 1}{e_l^{(1)}} = \mathrm{score}\big(\underset{d_k^{(2)} \times 1}{\bar{k}^{(2)}}, \underset{d_k^{(1)} \times 1}{k_l^{(1)}}\big), \qquad \underset{1 \times 1}{e_l^{(2)}} = \mathrm{score}\big(\underset{d_k^{(1)} \times 1}{\bar{k}^{(1)}}, \underset{d_k^{(2)} \times 1}{k_l^{(2)}}\big). \tag{11}$$

From the attention scores, attention weights are created via an alignment function, and are used to produce the context vectors $c^{(1)}$ and $c^{(2)}$.

While coarse-grained co-attention mechanisms use a compact representation of one input as a query when calculating attention for another input, fine-grained co-attention considers every element of each input individually when calculating attention scores. In this case, the query becomes a matrix. An example of fine-grained co-attention is parallel co-attention [32]. Similarly to interactive co-attention, parallel co-attention calculates attention on the two feature matrices at the same time, as shown in Fig. 6. We start by evaluating the keys matrices $K^{(1)} \in \mathbb{R}^{d_k^{(1)} \times n_f^{(1)}}$ and $K^{(2)} \in \mathbb{R}^{d_k^{(2)} \times n_f^{(2)}}$ that are obtained by linearly transforming the feature matrices $F^{(1)}$ and $F^{(2)}$, where $d_k^{(1)}$ and $d_k^{(2)}$ are prespecified dimensions of the keys. The idea is to use the keys matrix from one input as the query for calculating attention on the other input. However, since $K^{(1)}$ and $K^{(2)}$ have completely different dimensions, an affinity matrix $A \in \mathbb{R}^{n_f^{(1)} \times n_f^{(2)}}$ is calculated that is used to essentially translate one keys matrix to the space of the other keys. In [32], $A$ is calculated as shown in (12).

Fig. 6. An illustration of parallel co-attention.

$$\underset{n_f^{(1)} \times n_f^{(2)}}{A} = \mathrm{act}\big(\underset{n_f^{(1)} \times d_k^{(1)}}{K^{(1)T}} \times \underset{d_k^{(1)} \times d_k^{(2)}}{W_A} \times \underset{d_k^{(2)} \times n_f^{(2)}}{K^{(2)}}\big), \tag{12}$$

where $W_A \in \mathbb{R}^{d_k^{(1)} \times d_k^{(2)}}$ is a trainable weights matrix and $\mathrm{act}()$ is an activation function for which the $\tanh()$ function is used in [32]. [35] proposes a different way of calculating this matrix, i.e., one can use (13) to calculate each individual element $A_{i,j}$ of the matrix $A$.

$$\underset{1 \times 1}{A_{i,j}} = \underset{1 \times 3d_k}{w_A^T} \times \mathrm{concat}\big(\underset{d_k \times 1}{k_i^{(1)}}, \underset{d_k \times 1}{k_j^{(2)}}, \underset{d_k \times 1}{k_i^{(1)}} \circ \underset{d_k \times 1}{k_j^{(2)}}\big), \tag{13}$$

where $w_A \in \mathbb{R}^{3d_k}$ denotes a trainable vector of weights, $\mathrm{concat}()$ denotes vector concatenation, and $\circ$ denotes element-wise multiplication, also known as the Hadamard product. Note that the keys of each keys matrix in this case must have the same dimension $d_k$ for the element-wise multiplication to work. The affinity matrix can be interpreted as a similarity matrix for the columns of the two keys matrices, and helps translate, for example, image keys to the same space as the keys of the words in a sentence, and vice versa. The vectors of attention scores $e^{(1)}$ and $e^{(2)}$ can be calculated using an altered version of the additive score function as presented in (14) and (15). The previous attention score examples in this survey all used a score function to calculate each attention score for each value vector individually. However, (14) and (15) are used to calculate the complete vector of all attention scores. Essentially, the attention scores are calculated in an aggregated form.

$$\underset{1 \times n_f^{(1)}}{e^{(1)}} = \underset{1 \times d_w}{w_1} \times \mathrm{act}\big(\underset{d_w \times d_k^{(2)}}{W_2} \times \underset{d_k^{(2)} \times n_f^{(2)}}{K^{(2)}} \times \underset{n_f^{(2)} \times n_f^{(1)}}{A^T} + \underset{d_w \times d_k^{(1)}}{W_1} \times \underset{d_k^{(1)} \times n_f^{(1)}}{K^{(1)}}\big), \tag{14}$$

$$\underset{1 \times n_f^{(2)}}{e^{(2)}} = \underset{1 \times d_w}{w_2} \times \mathrm{act}\big(\underset{d_w \times d_k^{(1)}}{W_1} \times \underset{d_k^{(1)} \times n_f^{(1)}}{K^{(1)}} \times \underset{n_f^{(1)} \times n_f^{(2)}}{A} + \underset{d_w \times d_k^{(2)}}{W_2} \times \underset{d_k^{(2)} \times n_f^{(2)}}{K^{(2)}}\big), \tag{15}$$

where $w_1 \in \mathbb{R}^{d_w}$, $w_2 \in \mathbb{R}^{d_w}$, $W_1 \in \mathbb{R}^{d_w \times d_k^{(1)}}$, and $W_2 \in \mathbb{R}^{d_w \times d_k^{(2)}}$ are trainable weights, for which $d_w$ is a prespecified dimension of the weight matrices. Note that $\tanh()$ is used in [32] for the activation function, and the feature matrices are used as the key matrices. In that case, the affinity matrix $A$ can be seen as a translator between feature spaces. As mentioned before, the affinity matrix is essentially a similarity matrix for the key vectors of the two inputs. In [33], this fact is used to propose a different way of determining attention scores. Namely, one could take the maximum similarity value in a row or column as the attention score, as shown in (16).

$$\underset{1 \times 1}{e_i^{(1)}} = \max_{j = 1, \dots, n_f^{(2)}} \underset{1 \times 1}{A_{i,j}}, \qquad \underset{1 \times 1}{e_j^{(2)}} = \max_{i = 1, \dots, n_f^{(1)}} \underset{1 \times 1}{A_{i,j}}. \tag{16}$$

Next, the attention scores are used to calculate attention weights using an alignment function, so that two context vectors $c^{(1)}$ and $c^{(2)}$ can be derived as weighted averages of the value vectors that are obtained from linearly transforming the features. For the alignment function, [32] proposes to use a softmax function, and the value vectors are simply set equal to the feature vectors. The resulting context vectors can be either concatenated or added together.

Finally, coarse-grained and fine-grained co-attention can be combined to create an even more complex co-attention mechanism. [33] proposes the multi-grained co-attention mechanism that calculates both coarse-grained and fine-grained co-attention for two inputs. Each mechanism produces one context vector per input. The four resulting

context vectors are concatenated and used in the output model for prediction.

A mechanism separate from co-attention that still uses multiple inputs is the rotatory attention mechanism [36]. This technique is typically used in a text sentiment analysis setting where there are three inputs involved: the phrase for which the sentiment needs to be determined (target phrase), the text before the target phrase (left context), and the text after the target phrase (right context). The words in these three inputs are all encoded by the feature model, producing the following feature matrices: $\mathbf{F}^t = [\mathbf{f}_1^t, \dots, \mathbf{f}_{n_f^t}^t] \in \mathbb{R}^{d_f^t \times n_f^t}$, $\mathbf{F}^l = [\mathbf{f}_1^l, \dots, \mathbf{f}_{n_f^l}^l] \in \mathbb{R}^{d_f^l \times n_f^l}$, and $\mathbf{F}^r = [\mathbf{f}_1^r, \dots, \mathbf{f}_{n_f^r}^r] \in \mathbb{R}^{d_f^r \times n_f^r}$, for the target phrase words, left context words, and right context words, respectively, where $d_f^t$, $d_f^l$, and $d_f^r$ represent the dimensions of the feature vectors for the corresponding inputs, and $n_f^t$, $n_f^l$, and $n_f^r$ represent the number of feature vectors for the corresponding inputs. The feature model used in [36] consists of word embeddings and separate Bi-LSTM models for the target phrase, the left context, and the right context. This means that the feature vectors are in fact the hidden state vectors obtained from the Bi-LSTM models. Using these features, the idea is to extract a single vector $\mathbf{r}$ from the inputs such that a softmax layer can be used for classification. As such, we are now faced with two challenges: how to represent the inputs as a single vector, and how to incorporate the information from the left and right context into that vector. [36] proposes to use the rotatory attention mechanism for this purpose.

First, a single target phrase representation is created by using a pooling layer that takes the average over the columns of $\mathbf{F}^t$, as shown in (17).

$$\underset{d_f^t\times 1}{\mathbf{r}^t} = \frac{1}{n_f^t} \sum_{i=1}^{n_f^t} \underset{d_f^t\times 1}{\mathbf{f}_i^t}. \tag{17}$$

$\mathbf{r}^t$ is then used as a query to create a context vector out of the left and right contexts, separately. For example, for the left context, the key vectors $\mathbf{k}_1^l, \dots, \mathbf{k}_{n_f^l}^l \in \mathbb{R}^{d_k^l}$ and value vectors $\mathbf{v}_1^l, \dots, \mathbf{v}_{n_f^l}^l \in \mathbb{R}^{d_v^l}$ are extracted from the left context feature vectors $\mathbf{f}_1^l, \dots, \mathbf{f}_{n_f^l}^l \in \mathbb{R}^{d_f^l}$, similarly as before, where $d_k^l$ and $d_v^l$ are the dimensions of the key and value vectors, respectively. Note that [36] proposes to use the original feature vectors as keys and values, meaning that the linear transformation consists of a multiplication by an identity matrix. Next, the scores are calculated using (18).

$$\underset{1\times 1}{e_i^l} = \text{score}\Big(\underset{d_f^t\times 1}{\mathbf{r}^t},\ \underset{d_k^l\times 1}{\mathbf{k}_i^l}\Big). \tag{18}$$

For the score function, [36] proposes to use an activated general score function [34] with a tanh activation function. The attention scores can be combined with an alignment function and the corresponding value vectors to produce the context vector $\mathbf{r}^l \in \mathbb{R}^{d_v^l}$. The alignment function used in [36] takes the form of a softmax function. An analogous procedure can be performed to obtain the representation of the right context, $\mathbf{r}^r$. These two context representations can then be used to create new representations of the target phrase, again, using attention. First, the key vectors $\mathbf{k}_1^t, \dots, \mathbf{k}_{n_f^t}^t \in \mathbb{R}^{d_k^t}$ and value vectors $\mathbf{v}_1^t, \dots, \mathbf{v}_{n_f^t}^t \in \mathbb{R}^{d_v^t}$ are extracted from the target phrase feature vectors $\mathbf{f}_1^t, \dots, \mathbf{f}_{n_f^t}^t \in \mathbb{R}^{d_f^t}$, similarly as before, using a linear transformation, where $d_k^t$ and $d_v^t$ are the dimensions of the key and value vectors, respectively. Note, again, that the original feature vectors are used as keys and values in [36]. The attention scores for the left-aware target representation are then calculated using (19).

$$\underset{1\times 1}{e_i^{l_t}} = \text{score}\Big(\underset{d_v^l\times 1}{\mathbf{r}^l},\ \underset{d_k^t\times 1}{\mathbf{k}_i^t}\Big). \tag{19}$$

The attention scores can be combined with an alignment function and the corresponding value vectors to produce the context vector $\mathbf{r}^{l_t} \in \mathbb{R}^{d_v^t}$. For this attention calculation, [34] proposes to use the same score and alignment functions as before. The right-aware target representation $\mathbf{r}^{r_t}$ can be calculated in a similar manner. Finally, to obtain the full representation vector $\mathbf{r}$ that is used to determine the classification, the vectors $\mathbf{r}^l$, $\mathbf{r}^r$, $\mathbf{r}^{l_t}$, and $\mathbf{r}^{r_t}$ are concatenated together, as shown in (20).

$$\underset{(d_v^l + d_v^r + d_v^t + d_v^t)\times 1}{\mathbf{r}} = \text{concat}\Big(\underset{d_v^l\times 1}{\mathbf{r}^l},\ \underset{d_v^r\times 1}{\mathbf{r}^r},\ \underset{d_v^t\times 1}{\mathbf{r}^{l_t}},\ \underset{d_v^t\times 1}{\mathbf{r}^{r_t}}\Big). \tag{20}$$

To summarize, rotatory attention uses the target phrase to compute new representations for the left and right context using attention, and then uses these left and right representations to calculate new representations for the target phrase. The first step is designed to capture the words in the left and right contexts that are most important to the target phrase. The second step is there to capture the most important information in the actual target phrase itself. Essentially, the mechanism rotates attention between the target and the contexts to improve the representations.

There are many applications where combining information from different inputs into a single model can be highly beneficial. For example, in the field of medical data, there are often many different types of data available, such as various scans or documents, that can provide different types of information. In [37], a co-attention mechanism is used for automatic medical report generation to attend to both images and semantic tags simultaneously. Similarly, in [38], a co-attention model is proposed that combines general demographics features and patient medical history features to predict future health information. Additionally, an ablation study is used in [38] to show that the co-attention part of the model specifically improves performance. A field where multi-feature attention has been extensively explored is the domain of recommender systems. For example, in [39], a co-attention network is proposed that attends to both product reviews and the reviews a user has written. In [40], a model is proposed for video recommendation that attends to both user features and video features. Co-attention techniques have also been used in combination with graph networks for the purpose of, for example, reading comprehension across multiple documents [41] and fake news detection [42]. In comparison to co-attention, rotatory attention has typically been explored

only in the field of sentiment analysis, which is most likely due to the specific structure of the data that is necessary to use this technique. An implementation of rotatory attention is proposed in [43] for sentiment analysis, where the mechanism is extended by repeating the attention rotation to iteratively further improve the representations.

3.1.2 Feature Levels

The previously discussed attention mechanisms process data at a single level. We refer to these attention techniques as single-level attention mechanisms. However, some data types can be analyzed and represented on multiple levels. For example, when analyzing documents, one can analyze the document at the sentence level, word level, or even the character level. When representations or embeddings of all these levels are available, one can exploit the extra levels of information. For example, one could choose to perform translation based on either just the characters, or just the words of the sentence. However, in [44], a technique named attention-via-attention is introduced that allows one to incorporate information from both the character and the word levels. The idea is to predict the sentence translation character-by-character, while also incorporating information from a word-level attention module.

To begin with, a feature model (consisting of, for example, word embeddings and RNNs) is used to encode the input sentence into both a character-level feature matrix $\mathbf{F}^{(c)} \in \mathbb{R}^{d_f^{(c)} \times n_f^{(c)}}$, and a word-level feature matrix $\mathbf{F}^{(w)} \in \mathbb{R}^{d_f^{(w)} \times n_f^{(w)}}$, where $d_f^{(c)}$ and $n_f^{(c)}$ represent, respectively, the dimension of the embeddings of the characters, and the number of characters, while $d_f^{(w)}$ and $n_f^{(w)}$ represent the same but at the word level. It is crucial for this method that each level in the data can be represented or embedded. When attempting to predict a character in the translated sentence, a query $\mathbf{q}^{(c)} \in \mathbb{R}^{d_q}$ is created by the query model (like a character-level RNN), where $d_q$ is the dimension of the query vectors. As illustrated in Fig. 7, the query is used to calculate attention on the word-level feature vectors $\mathbf{F}^{(w)}$. This generates the context vector $\mathbf{c}^{(w)} \in \mathbb{R}^{d_v^{(w)}}$, where $d_v^{(w)}$ represents the dimension of the value vectors for the word-level attention module. This context vector summarizes which words contain the most important information for predicting the next character. If we know which words are most important, then it becomes easier to identify which characters in the input sentence are most important. Thus, the next step is to attend to the character-level features in $\mathbf{F}^{(c)}$, with an additional query input: the word-level context vector $\mathbf{c}^{(w)}$. The actual query input for the attention model will therefore be the concatenation of the query $\mathbf{q}^{(c)}$ and the word context vector $\mathbf{c}^{(w)}$. The output of this character-level attention module is the context vector $\mathbf{c}^{(c)}$. The complete context output of the attention model is the concatenation of the word-level and character-level context vectors.

Fig. 7. An illustration of attention-via-attention.

The attention-via-attention technique uses representations for each level. However, accurate representations may not always be available for each level of the data, or it may be desirable to let the model create the representations during the process by building them from lower level representations. A technique referred to as hierarchical attention [5] can be used in this situation. Hierarchical attention is another technique that allows one to apply attention on different levels of the data. Yet, the exact mechanisms work quite differently compared to attention-via-attention. The idea is to start at the lowest level, and then create representations, or summaries, of the next level using attention. This process is repeated until the highest level is reached. To make this a little clearer, suppose one attempts to create a model for document classification, similarly to the implementation from [5]. We analyze a document containing $n_S$ sentences, with the $s$th sentence containing $n_s$ words, for $s = 1, \dots, n_S$. One could use attention based on just the collection of words to classify the document. However, a significant amount of important context is then left out of the analysis, since the model will consider all words as a single long sentence, and will therefore not consider the context within the separate sentences. Instead, one can use the hierarchical structure of a document (words form sentences, and sentences form the document).

Fig. 8. An illustration of hierarchical attention.

Fig. 8 illustrates the structure of hierarchical attention. For each sentence in the document, a sentence representation $\mathbf{c}^{(s)} \in \mathbb{R}^{d_v^{(S)}}$ is produced, for $s = 1, \dots, n_S$, where $d_v^{(S)}$ is the dimension of the value vectors used in the attention model for sentence representations (Attention Module$_S$ in Fig. 8). The representation is a context vector from an attention module that essentially summarizes the sentence. Each sentence is first put through a feature model to extract the feature matrix $\mathbf{F}^{(s)} \in \mathbb{R}^{d_f^{(S)} \times n_s}$, for $s = 1, \dots, n_S$, where $d_f^{(S)}$ represents the dimension of the feature vector for each word, and $n_s$ represents the number of words in sentence $s$. For extra clarification, the columns of $\mathbf{F}^{(s)}$ are feature vectors that correspond to the words in sentence $s$. As shown in Fig. 8, each feature matrix $\mathbf{F}^{(s)}$ is used as input for an attention model, which produces the context vector $\mathbf{c}^{(s)}$, for each $s = 1, \dots, n_S$. No queries are used in this step, so it can be considered a self-attentive mechanism. The context vectors are essentially summaries of the words in the sentences. The matrix of context vectors $\mathbf{C} = [\mathbf{c}^{(1)}, \dots, \mathbf{c}^{(n_S)}] \in \mathbb{R}^{d_v^{(S)} \times n_S}$

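TABLE 2
Overview of Score Function ($\text{score}(\mathbf{q}, \mathbf{k}_l)$) Forms

Name                               Function                                                                                          Parameters
Additive (Concatenate) [3]         $\mathbf{w}^T \times \text{act}(\mathbf{W}_1 \times \mathbf{q} + \mathbf{W}_2 \times \mathbf{k}_l + \mathbf{b})$    $\mathbf{w} \in \mathbb{R}^{d_w}$, $\mathbf{W}_1 \in \mathbb{R}^{d_w \times d_q}$, $\mathbf{W}_2 \in \mathbb{R}^{d_w \times d_k}$, $\mathbf{b} \in \mathbb{R}^{d_w}$
Multiplicative (Dot-Product) [4]   $\mathbf{q}^T \times \mathbf{k}_l$                                                                -
Scaled Multiplicative [13]         $\dfrac{\mathbf{q}^T \times \mathbf{k}_l}{\sqrt{d_k}}$                                            -
General [4]                        $\mathbf{k}_l^T \times \mathbf{W} \times \mathbf{q}$                                              $\mathbf{W} \in \mathbb{R}^{d_k \times d_q}$
Biased General [54]                $\mathbf{k}_l^T \times (\mathbf{W} \times \mathbf{q} + \mathbf{b})$                               $\mathbf{W} \in \mathbb{R}^{d_k \times d_q}$, $\mathbf{b} \in \mathbb{R}^{d_k}$
Activated General [34]             $\text{act}(\mathbf{k}_l^T \times \mathbf{W} \times \mathbf{q} + b)$                              $\mathbf{W} \in \mathbb{R}^{d_k \times d_q}$, $b \in \mathbb{R}^{1}$
Similarity [55]                    $\text{similarity}(\mathbf{q}, \mathbf{k}_l)$                                                     -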
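The forms in Table 2 can be made concrete with a small sketch. The following pure-Python snippet is an illustration under stated assumptions, not an implementation taken from any of the cited works: the helper names (`dot`, `matvec`, and the score function names) are hypothetical, vectors are plain lists, matrices are lists of rows, `math.tanh` stands in for the unspecified activation act(·), and cosine similarity is used as one possible choice for the similarity row.

```python
import math

def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Matrix-vector product, with M given as a list of rows."""
    return [dot(row, v) for row in M]

def additive(q, k, w, W1, W2, b, act=math.tanh):
    """Additive (concatenate): w^T act(W1 q + W2 k + b)."""
    pre = [x + y + z for x, y, z in zip(matvec(W1, q), matvec(W2, k), b)]
    return dot(w, [act(p) for p in pre])

def multiplicative(q, k):
    """Multiplicative (dot-product): q^T k, which requires d_q == d_k."""
    return dot(q, k)

def scaled_multiplicative(q, k):
    """Scaled multiplicative: q^T k / sqrt(d_k)."""
    return dot(q, k) / math.sqrt(len(k))

def general(q, k, W):
    """General: k^T W q, with W of shape d_k x d_q."""
    return dot(k, matvec(W, q))

def biased_general(q, k, W, b):
    """Biased general: k^T (W q + b), with b of size d_k."""
    return dot(k, [x + y for x, y in zip(matvec(W, q), b)])

def activated_general(q, k, W, b, act=math.tanh):
    """Activated general: act(k^T W q + b), with a scalar bias b."""
    return act(dot(k, matvec(W, q)) + b)

def cosine_similarity(q, k):
    """One possible similarity(q, k): the cosine of the angle between q and k."""
    return dot(q, k) / (math.sqrt(dot(q, q)) * math.sqrt(dot(k, k)))

# Toy inputs (assumed values, d_q = d_k = 2).
q = [1.0, 2.0]
k = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(multiplicative(q, k))                          # 11.0
print(scaled_multiplicative(q, k))                   # 11 / sqrt(2)
print(general(q, k, identity))                       # equals the dot product when W = I
print(biased_general(q, k, identity, [1.0, 1.0]))    # 18.0
```

With `W` set to the identity matrix, the general form reduces to the multiplicative form, which illustrates how the later rows of the table extend the earlier ones with additional trainable parameters.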

is constructed by grouping all the obtained context vectors together as columns. Finally, attention is calculated using $\mathbf{C}$ as feature input, producing the representation of the entire document in the context vector $\mathbf{c}^{(D)} \in \mathbb{R}^{d_v^{(D)}}$, where $d_v^{(D)}$ is the dimension of the value vectors in the attention model for document representation (Attention Module$_D$ in Fig. 8). This context vector can be used to classify the document, since it is essentially a summary of all the sentences (and therefore also the words) in the document.

Multi-level models can be used in a variety of tasks. For example, in [28], hierarchical attention is used in a recommender system to model user preferences at the long-term level and the short-term level. Similarly, [45] proposes a hierarchical model for recommending social media images based on user preferences. Hierarchical attention has also been successfully applied in other domains. For example, [46] proposes to use hierarchical attention in a video action recognition model to capture motion information at the long-term level and the short-term level. Furthermore, [47] proposes a hierarchical attention model for cross-domain sentiment classification. In [48], a hierarchical attention model for chatbot response generation is proposed. Lastly, using image data, [49] proposes a hierarchical attention model for crowd counting.

3.1.3 Feature Representations

In a basic attention model, a single embedding or representation model is used to produce feature representations for the model to attend to. This is referred to as single-representational attention. Yet, one may also opt to incorporate multiple representations into the model. In [50], it is argued that allowing a model access to multiple embeddings can allow one to create even higher quality representations. Similarly, [51] incorporates multiple representations of the same book (textual, syntactic, semantic, visual, etc.) into the feature model. Feature representations are an important part of the attention model, but attention can also be an important part of the feature model. The idea is to create a new representation by taking a weighted average of multiple representations, where the weights are determined via attention. This technique is referred to as multi-representational attention, and allows one to create so-called meta-embeddings. Suppose one wants to create a meta-embedding for a word $\mathbf{x}$ for which $E$ embeddings $\mathbf{x}^{(e_1)}, \dots, \mathbf{x}^{(e_E)}$ are available. Each embedding $\mathbf{x}^{(e_i)}$ is of size $d_{e_i}$, for $i = 1, \dots, E$. Since not all embeddings are of the same size, a transformation is performed to normalize the embedding dimensions. Using embedding-specific weight parameters, each embedding $\mathbf{x}^{(e_i)}$ is transformed into the size-normalized embedding $\mathbf{x}^{(t_i)} \in \mathbb{R}^{d_t}$, where $d_t$ is the size of every transformed word embedding, as shown in (21).

$$\underset{d_t\times 1}{\mathbf{x}^{(t_i)}} = \underset{d_t\times d_{e_i}}{\mathbf{W}_{e_i}} \times \underset{d_{e_i}\times 1}{\mathbf{x}^{(e_i)}} + \underset{d_t\times 1}{\mathbf{b}_{e_i}}, \tag{21}$$

where $\mathbf{W}_{e_i} \in \mathbb{R}^{d_t \times d_{e_i}}$ and $\mathbf{b}_{e_i} \in \mathbb{R}^{d_t}$ are trainable, embedding-specific weights. The final embedding $\mathbf{x}^{(e)} \in \mathbb{R}^{d_t}$ is a weighted average of the previously calculated transformed representations, as shown in (22).

$$\underset{d_t\times 1}{\mathbf{x}^{(e)}} = \sum_{i=1}^{E} \underset{1\times 1}{a_i} \times \underset{d_t\times 1}{\mathbf{x}^{(t_i)}}. \tag{22}$$

The final representation $\mathbf{x}^{(e)}$ can be interpreted as the context vector from an attention model, meaning that the weights $a_1, \dots, a_E \in \mathbb{R}^{1}$ are attention weights. Attention can be calculated as usual, where the columns of the features matrix $\mathbf{F}$ are the transformed representations $\mathbf{x}^{(t_1)}, \dots, \mathbf{x}^{(t_E)}$. The query in this case can be ignored since it is constant in all cases. Essentially, the query is "Which representations are the most important?" in every situation. As such, this is a self-attentive mechanism.

While an interesting idea, applications of multi-representational attention are limited. One example of the application of this technique is found in [52], where a multi-representational attention mechanism has been applied to generate multi-lingual meta-embeddings. Another example is [53], where a multi-representational text classification model is proposed that incorporates different representations of the same text. For example, the proposed model uses embeddings from part-of-speech tagging, named entity recognizers, and character-level and word-level embeddings.

3.2 General Attention Mechanisms

This major category consists of attention mechanisms that can be applied in any type of attention model. The structure of this component can be broken down into the following sub-aspects: the attention score function, the attention alignment, and attention dimensionality.
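To make the first two sub-aspects concrete, the sketch below chains a score function and an alignment function and then forms a context vector as a weighted average of the value vectors. It is a minimal illustration under stated assumptions, not the method of any cited work: the function names `softmax` and `attend` are hypothetical, keys and values are passed as lists of column vectors, and the scaled multiplicative score with softmax alignment is chosen only because it is simple. Each value vector receives a single scalar weight, which corresponds to the simplest choice for the third sub-aspect.

```python
import math

def softmax(scores):
    """Softmax alignment; subtracting the maximum avoids overflow in exp."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(query, keys, values):
    """Return (scores, weights, context) for one query.

    keys and values are lists holding one vector per feature
    (the columns of K and V); query must have the same size as each key.
    """
    d_k = len(keys[0])
    # Attention scoring: scaled multiplicative score for every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Attention alignment: turn the scores into weights that sum to one.
    weights = softmax(scores)
    # Context vector: weighted average of the value vectors.
    d_v = len(values[0])
    context = [sum(a * v[i] for a, v in zip(weights, values)) for i in range(d_v)]
    return scores, weights, context

# Toy example (assumed values): three keys and values of dimension 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
scores, weights, context = attend([2.0, 0.0], keys, values)
print(weights)   # the second key is orthogonal to the query and gets the smallest weight
print(context)
```

Swapping the scoring line for another score function, or the `softmax` call for another alignment function, changes the corresponding sub-aspect without affecting the rest of the calculation.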

3.2.1 Attention Scoring

The attention score function is a crucial component in how attention is calculated. Various approaches have been developed that each have their own advantages and disadvantages. An overview of these functions is provided in Table 2. Each row of Table 2 presents a possible form for the function $\text{score}(\mathbf{q}, \mathbf{k}_l)$, as seen in (23), where $\mathbf{q}$ is the query vector, and $\mathbf{k}_l$ is the $l$th column of $\mathbf{K}$. Note that the score functions presented in this section can be more efficiently calculated in matrix form using $\mathbf{K}$ instead of each column separately. Nevertheless, the score functions are presented using $\mathbf{k}_l$ to more clearly illustrate the relation between a key and query.

$$\underset{1\times 1}{e_l} = \text{score}\Big(\underset{d_q\times 1}{\mathbf{q}},\ \underset{d_k\times 1}{\mathbf{k}_l}\Big). \tag{23}$$

Due to their simplicity, the most popular choices for the score function are the concatenate score function [3] and the multiplicative score function [4]. The multiplicative score function has the advantage of being computationally inexpensive due to highly optimized vector operations. However, the multiplicative function may produce non-optimal results when the dimension $d_k$ is too large [56]. When $d_k$ is large, the dot-product between $\mathbf{q}$ and $\mathbf{k}_l$ can grow large in magnitude. To illustrate this, in [13], an example is used where the elements of $\mathbf{q}$ and $\mathbf{k}_l$ are all independent and normally distributed with a mean equal to zero, and a variance equal to one. Then, the dot-product of the vectors has a variance of $d_k$. A higher variance means a higher chance of numbers that are large in magnitude. When the softmax function of the alignment step is then applied using these large numbers, the gradient will become very small, meaning the model will have trouble converging [13]. To adjust for this, [13] proposes to scale the multiplicative function by the factor $\frac{1}{\sqrt{d_k}}$, producing the scaled multiplicative score function.

In [4], the multiplicative score function is extended by introducing a weights matrix $\mathbf{W}$. This form, referred to as the general score function, allows for an extra transformation of $\mathbf{k}_l$. The biased general score function [54] is a further extension of the general function that introduces a bias weight vector $\mathbf{b}$. A final extension on this function named the activated general score function is introduced in [34], and includes the use of both a bias weight $b$, and an activation function $\text{act}(\cdot)$.

The previously presented score functions are all based on determining a type of similarity between the key vector and the query vector. As such, more typical similarity measures, such as the Euclidean ($L_2$) distance and cosine similarity, can also be implemented [55]. These scoring methods are summarized under the similarity score function which is represented by the $\text{similarity}(\cdot)$ function.

There typically is no common usage across domains regarding score functions. The choice of score function for a particular task is most often based on empirical experiments. However, there are exceptions when, for example, efficiency is vital. In models where this is the case, the multiplicative or scaled multiplicative score functions are typically the best choice. An example of this is the Transformer model, which is generally computationally expensive.

3.2.2 Attention Alignment

The attention alignment is the step after the attention scoring. This alignment process directly determines which parts of the input data the model will attend to. The alignment function is denoted as $\text{align}(\cdot)$ and has various forms. The $\text{align}(\cdot)$ function takes as input the previously calculated attention score vector $\mathbf{e}$ and calculates for each element $e_l$ of $\mathbf{e}$ the attention weight $a_l$. These attention weights can then be used to create the context vector $\mathbf{c}$ by taking a weighted average of the value vectors $\mathbf{v}_1, \dots, \mathbf{v}_{n_f}$:

$$\underset{d_v\times 1}{\mathbf{c}} = \sum_{l=1}^{n_f} \underset{1\times 1}{a_l} \times \underset{d_v\times 1}{\mathbf{v}_l}. \tag{24}$$

The most popular alignment method to calculate these weights is a simple softmax function, as depicted in (25).

$$\underset{1\times 1}{a_l} = \text{align}\Big(\underset{1\times 1}{e_l};\ \underset{n_f\times 1}{\mathbf{e}}\Big) = \frac{\exp(e_l)}{\sum_{j=1}^{n_f} \exp(e_j)}. \tag{25}$$

This alignment method is often referred to as soft alignment in computer vision settings [8], or global alignment for sequence data [4]. Nevertheless, both these terms represent the same function and can be interpreted similarly. Soft/global alignment can be interpreted as the model attending to all feature vectors. For example, the model attends to all regions in an image, or all words in a sentence. Even though the attention model generally does focus more on specific parts of the input, every part of the input will receive at least some amount of attention due to the nature of the softmax function. Furthermore, an advantage of the softmax function is that it introduces a probabilistic interpretation to the input vectors. This allows one to easily analyze which parts of the input are important to the output predictions.

In contrast to soft/global alignment, other methods aim to achieve a more focused form of alignment. For example, hard alignment [8], also known as hard attention or non-deterministic attention, is an alignment type that forces the attention model to focus on exactly one feature vector. First, this method implements the softmax function in the exact same way as global alignment. However, the outputs $a_1, \dots, a_{n_f}$ are not used as weights for the context vector calculation. Instead, these values are used as probabilities from which the choice of the one value vector is drawn. A value $m \in \mathbb{R}^{1}$ is drawn from a multinomial distribution with $a_1, \dots, a_{n_f}$ as parameters for the probabilities. Then, the context vector is simply defined as follows:

$$\underset{d_v\times 1}{\mathbf{c}} = \underset{d_v\times 1}{\mathbf{v}_m}. \tag{26}$$

Hard alignment is typically more efficient at inference compared to soft alignment. On the other hand, the main disadvantage of hard attention is that, due to the stochastic alignment of attention, the training of the model cannot be done via the regular backpropagation method. Instead, simulation and sampling, or reinforcement learning [57] are required to calculate the gradient at the hard attention layer. As such, soft/global attention is generally preferred. However, a compromise can be made in certain situations. Local alignment [4] is a method that implements a softmax

distribution, similarly to soft/global alignment. But, the 3.2.3 Attention Dimensionality


softmax distribution is calculated based only on a subset of All previous model specifications of attention use a scalar
the inputs. This method is generally used in combination weight al for each value vector v l . This technique is referred
with sequence data. One has to specify a variable p 2 R1 to as single-dimensional attention. However, instead of deter-
that determines the position of the region. Feature vectors mining a single attention score and weight for the entire
close to p will be attended to by the model, and vectors too vector, [64] proposes to calculate weights for every single
far from p will be ignored. The size of the subset will be feature in those vectors separately. This technique is
determined by the variable D 2 R1 . Summarizing, the atten- referred to as multi-dimensional attention, since the attention
tion model will apply a softmax function on the attention weights now become higher dimensional vectors. The idea
scores in the subset ½p  D; p þ D. In other words, a win- is that the model no longer has to attend to entire vectors,
dow is placed on the input and soft/global attention is cal- but it can instead pick and choose specific elements from
culated within that window: those vectors. More specifically, attention is calculated for
  each dimension. As such, the model must create a vector of
expðel Þ attention weights a l 2 Rdv for each value vector v l 2 Rdv .
al ¼ align el ; e ¼ PpþD : (27)
11 11 nf 1 j¼pD expðej Þ The context vector can then be calculated by summing the
element-wise multiplications () of the value vectors
The question that remains is how to determine the location v 1 ; . . . ; v nf 2 Rdv and the corresponding attention weight
parameter p. The first method is referred to as monotonic vectors a 1 ; . . . ; a nf 2 Rdv , as follows:
alignment. This straightforward method entails simply set-
ting the location parameter equal to the location of the pre- X
nf
diction in the output sequence. Another method of c ¼ al  vl : (30)
dv 1
determining the position of the region is referred to as pre- l¼1 dv 1 dv 1
dictive alignment. As the name entails, the model attempts to However, since one needs to create attention weight vectors,
actually predict the location of interest in the sequence: this technique requires adjusted attention score and weight
   calculations. For example, the concatenate score function
found in Table 2 can be adjusted by changing the w 2 Rdw
p ¼ S sigmoid w Tp tanh W p  q ; (28)
11 11
1dp dp dq dq 1 weights vector to the weight matrix W d 2 Rdw dv :
 
where S 2 R1 is the length of the input sequence, and w p 2 el ¼ W Td act W 1  q þ W 2  kl þ b :
Rdp and W p 2 Rdp dq are both trainable weights parameters. dv 1 dv dw dw dq dq 1 dw dk dk 1 dw 1
The sigmoid function multiplied by S makes sure that p is (31)
in the range ½0; S. Additionally, in [4], it is recommended to
add an additional term to the alignment function to favor This new score function produces the attention score vectors
alignment around p: e1 ; . . . ; e nf 2 Rdv . These score vectors can be combined into a
matrix of scores e ¼ ½ee1 ; . . . ; enf  2 Rdv nf . To produce multi-
   
ðl  pÞ2 Þ dimensional attention weights, the alignment function stays
al ¼ align el ; e exp  ; (29) the same, but it is applied for each feature across the atten-
11 11 nf 1 2s 2
tion score columns. To illustrate, when implementing soft
where s 2 R1 is empirically set equal to D2 according to [4]. attention, the attention weight produced from the ith ele-
Another proposed method for compromising between soft ment of score vector el is defined as follows:
and hard alignment is reinforced alignment [58]. Similarly to  
local alignment, a subset of the feature vectors is deter- expðel;i Þ
al;i ¼ align el;i ; e ¼ Pnf ; (32)
j¼1 expðej;i Þ
mined, for which soft alignment is calculated. However, 11 11 dv nf
instead of using a window to determine the subset, rein-
forced alignment uses a reinforcement learning agent [57], where el;i represents the ith element of score vector el , and
similarly to hard alignment, to choose the subset of feature al;i is the ith element of the attention weights vector a l .
vectors. The attention calculation based on these chosen fea- Finally, these attention weight vectors can be used to com-
ture vectors is the same as regular soft alignment. pute the context vector as presented in (30).
Soft alignment is often regarded as the standard align- Multi-dimensional attention is a very general mechanism
ment function for attention models in practically every that can be applied in practically every attention model, but
domain. Yet, the other alignment methods have also seen actual applications of the technique have been relatively
interesting uses in various domains. For example, hard sparse. One application example is [65], where multi-
attention is used in [59] for the task of visual question dimensional attention is used in a model for named entity
answering. In [60], both soft and hard attention are used in recognition based on text and visual context from multime-
a graph attention model for multi-agent game abstraction. dia posts. In [66], multi-dimensional attention is used in a
Similarly, in [61], both global and local alignment are used model for answer selection in community question answer-
for review rating predictions. Reinforced alignment has ing. In [67], the U-net model for medical image segmenta-
been employed in combination with a co-attention structure tion is extended with a multi-dimensional attention
in [62] for the task of aspect sentiment classification. In [63], mechanism. Similarly, in [68], the Transformer model is
reinforced alignment is used for the task of person re-identi- extended with the multi-dimensional attention mechanism
fication using surveillance images. for the task of dialogue response generation. In [69], multi-
dimensional attention is used to extend graph attention networks for dialogue state tracking. Lastly, for the task of next-item recommendation, [70] proposes a model that incorporates multi-dimensional attention.

3.3 Query-Related Attention Mechanisms

Queries are an important part of any attention model, since they directly determine which information is extracted from the feature vectors. These queries are based on the desired output of the task model, and can be interpreted as literal questions. Some queries have specific characteristics that require specific types of mechanisms to process them. As such, this category encapsulates the attention mechanisms that deal with specific types of query characteristics. The mechanisms in this category deal with one of the two following query characteristics: the type of queries or the multiplicity of queries.

3.3.1 Type of Queries

Different attention models employ attention for different purposes, meaning that distinct query types are necessary. There are basic queries, which are queries that are typically straightforward to define based on the data and model. For example, the hidden state for one prediction in an RNN is often used as the query for the next prediction. One could also use a vector of auxiliary variables as query. For example, when doing medical image classification, general patient characteristics can be incorporated into a query.

Some attention mechanisms, such as co-attention, rotatory attention, and attention-over-attention, use specialized queries. For example, rotatory attention uses the context vector from another attention module as query, while interactive co-attention uses an averaged keys vector based on another input. Another case one can consider is when attention is calculated based purely on the feature vectors. This concept has been mentioned before and is referred to as self-attention or intra-attention [71]. We say that the models use self-attentive queries. There are two ways of interpreting such queries. First, one can say that the query is constant. For example, document classification requires only a single classification as the output of the model. As such, the query is always the same, namely: “What is the class of the document?”. The query can be ignored and attention can be calculated based only on the features themselves. Score functions can be adjusted for this by making the query vector a vector of constants or removing it entirely:

$$\mathrm{score}(\underset{d_k \times 1}{\boldsymbol{k}_l}) = \underset{1 \times d_w}{\boldsymbol{w}^{T}} \times \mathrm{act}(\underset{d_w \times d_k}{\boldsymbol{W}} \times \underset{d_k \times 1}{\boldsymbol{k}_l} + \underset{d_w \times 1}{\boldsymbol{b}}). \tag{33}$$

Second, one can interpret self-attention as learning the query along the way, meaning that the query can be defined as a trainable vector of weights. For example, the dot-product score function may take the following form:

$$\mathrm{score}(\underset{d_k \times 1}{\boldsymbol{k}_l}) = \underset{1 \times d_k}{\boldsymbol{q}^{T}} \times \underset{d_k \times 1}{\boldsymbol{k}_l}, \tag{34}$$

where $\boldsymbol{q} \in \mathbb{R}^{d_k}$ is a trainable vector of weights. One could also interpret vector $\boldsymbol{b} \in \mathbb{R}^{d_w}$ as the query in (33). Another use of self-attention is to uncover the relations between the feature vectors $\boldsymbol{f}_1, \ldots, \boldsymbol{f}_{n_f}$. These relations can then be used as additional information to incorporate into new representations of the feature vectors. With basic attention mechanisms, the keys matrix $\boldsymbol{K}$ and the values matrix $\boldsymbol{V}$ are extracted from the features matrix $\boldsymbol{F}$, while the query $\boldsymbol{q}$ is produced separately. For this type of self-attention, the query vectors are extracted in the same way as the keys and values, via a transformation matrix of trainable weights $\boldsymbol{W}_Q \in \mathbb{R}^{d_q \times d_f}$. We define the matrix $\boldsymbol{Q} = [\boldsymbol{q}_1, \ldots, \boldsymbol{q}_{n_f}] \in \mathbb{R}^{d_q \times n_f}$, which can be obtained as follows:

$$\underset{d_q \times n_f}{\boldsymbol{Q}} = \underset{d_q \times d_f}{\boldsymbol{W}_Q} \times \underset{d_f \times n_f}{\boldsymbol{F}}. \tag{35}$$

Each column of $\boldsymbol{Q}$ can be used as the query for the attention model. When attention is calculated using a query $\boldsymbol{q}$, the resulting context vector $\boldsymbol{c}$ will summarize the information in the feature vectors that is important to the query. Since the query, or a column of $\boldsymbol{Q}$, is now also a feature vector representation, the context vector contains the information of all feature vectors that are important to that specific feature vector. In other words, the context vectors capture the relations between the feature vectors. For example, self-attention allows one to extract the relations between words: which verbs refer to which nouns, which pronouns refer to which nouns, etc. For images, self-attention can be used to determine which image regions relate to each other.

While self-attention is placed in the query-related category, it is also very much related to the feature model. Namely, self-attention is a technique that is often used in the feature model to create improved representations of the feature vectors. For example, the Transformer model for language processing [13], and the Transformer model for image processing [15], both use multiple rounds of (multi-head) self-attention to improve the representation of the feature vectors. The relations captured by the self-attention mechanism are incorporated into new representations. A simple method of determining such a new representation is to set the feature vectors equal to the acquired self-attention context vectors [71], as presented in (36).

$$\underset{d_f \times 1}{\boldsymbol{f}^{(\mathrm{new})}} = \underset{d_f \times 1}{\boldsymbol{c}}, \tag{36}$$

where $\boldsymbol{f}^{(\mathrm{new})}$ is the updated feature vector. Another possibility is to add the context vectors to the previous feature vectors with an additional normalization layer [13]:

$$\underset{d_f \times 1}{\boldsymbol{f}^{(\mathrm{new})}} = \mathrm{Normalize}(\underset{d_f \times 1}{\boldsymbol{f}^{(\mathrm{old})}} + \underset{d_f \times 1}{\boldsymbol{c}}), \tag{37}$$

where $\boldsymbol{f}^{(\mathrm{old})}$ is the previous feature vector, and $\mathrm{Normalize}(\cdot)$ is a normalization layer [72]. Using such techniques, self-attention has been used to create improved word or sentence embeddings that enhance model accuracy [71].

Self-attention is arguably one of the more important types of attention, partly due to its vital role in the highly popular Transformer model. Self-attention is a very general mechanism and can be applied to practically any problem. As such, self-attention has been extensively explored in many different fields in both Transformer-based architectures and other types of models. For example, in [73], self-
attention is explored for image recognition tasks, and results indicate that the technique may have substantial advantages with regard to robustness and generalization. In [74], self-attention is used in a generative adversarial network (GAN) [75] to determine which regions of the input image to focus on when generating the regions of a new image. In [76], self-attention is used to design a state-of-the-art medical image segmentation model. Naturally, self-attention can also be used for video processing. In [77], a self-attention model is proposed for the purpose of video summarization that reaches state-of-the-art results. In other fields, like audio processing, self-attention has been explored as well. In [78], self-attention is used to create a speech recognition model. Self-attention has also been explored in overlapping domains. For example, in [79], the self-attention Transformer architecture is used to create a model that can recognize phrases from audio and by lip-reading from a video. For the problem of next item recommendation, [80] proposes a Transformer model that explicitly captures item-item relations using self-attention. Self-attention also has applications in many natural language processing fields. For example, in [81], self-attention is used for sentiment analysis. Self-attention is also highly popular for graph models. For example, self-attention is explored in [82] for the purpose of representation learning in communication networks and rating networks. Additionally, the first attention model for graph networks was based on self-attention [83].

3.3.2 Multiplicity of Queries

In previous examples, the attention model generally used a single query for a prediction. We say that such models use singular query attention. However, there are attention architectures that allow the model to compute attention using multiple queries. Note that this is different from, for example, an RNN that may involve multiple queries to produce a sequence of predictions. Namely, such a model still requires only a single query per prediction.

One example of a technique that incorporates multiple queries is multi-head attention [13], as presented in Fig. 9. Multi-head attention works by implementing multiple attention modules in parallel by utilizing multiple different versions of the same query. The idea is to linearly transform the query $\boldsymbol{q}$ using different weight matrices. Each newly formed query essentially asks for a different type of relevant information, allowing the attention model to introduce more information into the context vector calculation. An attention model implements $d \geq 1$ heads with each attention head having its own query vector, keys matrix, and values matrix: $\boldsymbol{q}^{(j)}$, $\boldsymbol{K}^{(j)}$ and $\boldsymbol{V}^{(j)}$, for $j = 1, \ldots, d$. The query $\boldsymbol{q}^{(j)}$ is obtained by linearly transforming the original query $\boldsymbol{q}$, while the matrices $\boldsymbol{K}^{(j)}$ and $\boldsymbol{V}^{(j)}$ are obtained through linear transformations of $\boldsymbol{F}$. As such, each attention head has its own learnable weights matrices $\boldsymbol{W}_q^{(j)}$, $\boldsymbol{W}_K^{(j)}$ and $\boldsymbol{W}_V^{(j)}$ for these transformations. The calculation of the query, keys, and values for the $j$th head are defined as follows:

$$\underset{d_q \times 1}{\boldsymbol{q}^{(j)}} = \underset{d_q \times d_q}{\boldsymbol{W}_q^{(j)}} \times \underset{d_q \times 1}{\boldsymbol{q}}, \quad \underset{d_k \times n_f}{\boldsymbol{K}^{(j)}} = \underset{d_k \times d_f}{\boldsymbol{W}_K^{(j)}} \times \underset{d_f \times n_f}{\boldsymbol{F}}, \quad \underset{d_v \times n_f}{\boldsymbol{V}^{(j)}} = \underset{d_v \times d_f}{\boldsymbol{W}_V^{(j)}} \times \underset{d_f \times n_f}{\boldsymbol{F}}. \tag{38}$$

Thus, each head creates its own representations of the query $\boldsymbol{q}$ and the input matrix $\boldsymbol{F}$. Each head can therefore learn to focus on different parts of the inputs, allowing the model to attend to more information. For example, when training a machine translation model, one attention head can learn to focus on which nouns (e.g., student, car, apple) certain verbs (e.g., walking, driving, buying) refer to, while another attention head learns to focus on which nouns certain pronouns (e.g., he, she, it) refer to [13]. Each head will also create its own vector of attention scores $\boldsymbol{e}^{(j)} = [e_1^{(j)}, \ldots, e_{n_f}^{(j)}] \in \mathbb{R}^{n_f}$, and a corresponding vector of attention weights $\boldsymbol{a}^{(j)} = [a_1^{(j)}, \ldots, a_{n_f}^{(j)}] \in \mathbb{R}^{n_f}$. As can be expected, each attention model produces its own context vector $\boldsymbol{c}^{(j)} \in \mathbb{R}^{d_v}$, as follows:

$$\underset{d_v \times 1}{\boldsymbol{c}^{(j)}} = \sum_{l=1}^{n_f} \underset{1 \times 1}{a_l^{(j)}} \times \underset{d_v \times 1}{\boldsymbol{v}_l^{(j)}}. \tag{39}$$

The goal is still to create a single context vector as output of the attention model. As such, the context vectors produced by the individual attention heads are concatenated into a single vector. Afterwards, a linear transformation is applied using the weight matrix $\boldsymbol{W}_O \in \mathbb{R}^{d_c \times d_v d}$ to make sure the resulting context vector $\boldsymbol{c} \in \mathbb{R}^{d_c}$ has the desired dimension. This calculation is presented in (40). The dimension $d_c$ can be pre-specified by, for example, setting it equal to $d_v$, so that the context vector dimension is unchanged.

$$\underset{d_c \times 1}{\boldsymbol{c}} = \underset{d_c \times d_v d}{\boldsymbol{W}_O} \times \mathrm{concat}(\underset{d_v \times 1}{\boldsymbol{c}^{(1)}}, \ldots, \underset{d_v \times 1}{\boldsymbol{c}^{(d)}}). \tag{40}$$

Fig. 9. An illustration of multi-head attention.

Fig. 10. An example illustration of multi-hop attention. Solid arrows represent the base multi-hop model structure, while dotted arrows represent optional connections.

Multi-head attention processes multiple attention modules in parallel, but attention modules can also be implemented sequentially to iteratively adjust the context vectors. Each of these attention modules is referred to as a “repetition” or “round” of attention. Such attention architectures are referred to as multi-hop attention models, also known as multi-step attention models. An important note to consider is the fact that multi-hop attention is a mechanism that has been proposed in various forms throughout various works. While the mechanism always involves multiple rounds of attention, the multi-hop implementation
proposed in [84] differs from the mechanism proposed in
[85] or [86]. Another interesting example is [87], where a
“multi-hop” attention model is proposed that would actu-
ally be considered alternating co-attention in this survey, as
explained in Section 3.1.1.
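Returning briefly to multi-head attention, the computation in (38)-(40) can be made concrete with a small example. The following is a minimal pure-Python sketch under stated assumptions, not the implementation of [13]: the scaled dot-product score, the random weights standing in for trained ones, the chosen dimensions, and the helper names (`matvec`, `softmax`, `attention_head`, `multi_head_attention`, `random_matrix`) are all illustrative choices that do not come from the survey.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def softmax(scores):
    """Soft alignment: map scores to non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, features, W_q, W_K, W_V):
    """One head j: the transformations in (38), then the context vector in (39)."""
    q_j = matvec(W_q, q)                         # q^(j) = W_q^(j) q
    keys = [matvec(W_K, f) for f in features]    # columns of K^(j) = W_K^(j) F
    values = [matvec(W_V, f) for f in features]  # columns of V^(j) = W_V^(j) F
    # Assumption: a scaled dot-product score; the text leaves score() generic.
    scale = math.sqrt(len(q_j))
    scores = [sum(a * b for a, b in zip(q_j, k)) / scale for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]
    return context, weights


def multi_head_attention(q, features, heads, W_O):
    """Concatenate the d head contexts and project them with W_O, as in (40)."""
    concatenated = []
    for W_q, W_K, W_V in heads:
        context, _ = attention_head(q, features, W_q, W_K, W_V)
        concatenated.extend(context)
    return matvec(W_O, concatenated)


def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights standing in for trained ones."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Illustrative sizes; d_q equals d_k so that the dot product is defined.
n_f, d_f, d_q, d_k, d_v, d = 5, 4, 3, 3, 2, 3
features = [[rng.uniform(-1.0, 1.0) for _ in range(d_f)] for _ in range(n_f)]
q = [rng.uniform(-1.0, 1.0) for _ in range(d_q)]
heads = [
    (random_matrix(d_q, d_q, rng), random_matrix(d_k, d_f, rng), random_matrix(d_v, d_f, rng))
    for _ in range(d)
]
d_c = d_v  # keep the context vector dimension unchanged, as suggested for (40)
W_O = random_matrix(d_c, d_v * d, rng)
c = multi_head_attention(q, features, heads, W_O)
print("context vector dimension:", len(c))
```

Each head sees the same query and features but through its own weight matrices, so the heads can produce different attention weights over the same inputs; only the final projection mixes them.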
We present a general form of multi-hop attention that is
largely a generalization of the techniques introduced in [85]
and [88]. Fig. 10 provides an example implementation of a
multi-hop attention mechanism.
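As a preview of the general form developed in the remainder of this section, the loop defined by the query update (41) and the adjusted score function (42) might be sketched as follows. This is a hedged pure-Python illustration, not the method of [85] or [88]: the `transform` function (a tanh of a linear map), the bilinear score through a hypothetical matrix `W_s`, the choice of using the feature vectors directly as both keys and values, and the random weights are all assumptions made only for this example. The same weights are reused in every hop, which mirrors the weight sharing the survey attributes to [85].

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def softmax(scores):
    """Soft alignment: map scores to non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transform(q, c, W_t):
    """Hypothetical transform() for (41): tanh of a linear map of concat(q, c)."""
    return [math.tanh(z) for z in matvec(W_t, q + c)]


def multi_hop_attention(q0, keys, values, W_t, W_s, hops):
    """Run `hops` rounds of attention, refining the context vector each round."""
    n_f, d_v = len(values), len(values[0])
    # c^(0): unweighted average of the value vectors.
    c = [sum(v[i] for v in values) / n_f for i in range(d_v)]
    q = q0
    weights_per_hop = []
    for _ in range(hops):
        q = transform(q, c, W_t)                # (41): q^(s+1) from q^(s) and c^(s)
        query_input = q + c                     # concat(q^(s+1), c^(s))
        projected = matvec(W_s, query_input)    # assumption: bilinear score via W_s
        scores = [sum(a * b for a, b in zip(projected, k)) for k in keys]  # (42)
        weights = softmax(scores)
        # Next context vector c^(s+1): weighted average of the value vectors.
        c = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]
        weights_per_hop.append(weights)
    return c, weights_per_hop


def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights standing in for trained ones."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(1)
n_f, d_f, d_q, hops = 6, 4, 3, 3
features = [[rng.uniform(-1.0, 1.0) for _ in range(d_f)] for _ in range(n_f)]
keys = values = features        # assumption: K = V = F, so d_k = d_v = d_f
d_k = d_v = d_f
W_t = random_matrix(d_q, d_q + d_v, rng)   # shared across hops
W_s = random_matrix(d_k, d_q + d_v, rng)   # shared across hops
q0 = [rng.uniform(-1.0, 1.0) for _ in range(d_q)]
c_final, weights_per_hop = multi_hop_attention(q0, keys, values, W_t, W_s, hops)
print("hops run:", len(weights_per_hop), "context dimension:", len(c_final))
```

Because every hop depends on the context vector of the previous one, the rounds cannot be computed in parallel, in contrast to the heads of multi-head attention.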
The general idea is to iteratively transform the query, and use the query to transform the context vector, such that the model can extract different information in each step. Remember that a query is similar to a literal question. As such, one can interpret the transformed queries as asking the same question in a different manner or from a different perspective, similarly to the queries in multi-head attention. The query that was previously denoted by $\boldsymbol{q}$ is now referred to as the initial query, and is denoted by $\boldsymbol{q}^{(0)}$. At hop $s$, the current query $\boldsymbol{q}^{(s)}$ is transformed into a new query representation $\boldsymbol{q}^{(s+1)}$, possibly using the current context vector $\boldsymbol{c}^{(s)}$ as another input, and some transformation function $\mathrm{transform}(\cdot)$:

$$\underset{d_q \times 1}{\boldsymbol{q}^{(s+1)}} = \mathrm{transform}(\underset{d_q \times 1}{\boldsymbol{q}^{(s)}}, \underset{d_v \times 1}{\boldsymbol{c}^{(s)}}). \tag{41}$$

For the specific form of the transformation function $\mathrm{transform}(\cdot)$, [85] proposes to use a mechanism similar to self-attention. Essentially, the queries used by the question answer matching model proposed in [85] were originally based on a set of feature vectors extracted from a question. [85] also defines the original query $\boldsymbol{q}^{(0)}$ as the unweighted average of these feature vectors. At each hop $s$, attention can be calculated on these feature vectors using the previous query $\boldsymbol{q}^{(s)}$ as the query in this process. The resulting context vector of this calculation is the next query vector. Using the context vector $\boldsymbol{c}^{(s)}$ instead of $\boldsymbol{q}^{(s)}$ as the query for this process is also a possibility, which is similar to the LCR-Rot-hop model proposed in [43] and the multi-step model proposed in [88]. Such a connection is represented by the dotted arrows in Fig. 10. The transformation mechanism uses either the query $\boldsymbol{q}^{(s)}$ or the context vector $\boldsymbol{c}^{(s)}$ as query, but a combination via concatenation is also possible.

Each query representation is used as input for the attention module to compute attention on the columns of the feature matrix $\boldsymbol{F}$, as seen previously. One main difference, however, is that the context vector $\boldsymbol{c}^{(s)}$ is also used as input, so that the actual query input for the attention model is the concatenation of $\boldsymbol{c}^{(s)}$ and $\boldsymbol{q}^{(s+1)}$. The adjusted attention score function is presented in (42). Note that the initial context vector $\boldsymbol{c}^{(0)}$ is predefined. One way of doing this is by setting it equal to the unweighted average of the value vectors $\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{n_f} \in \mathbb{R}^{d_v}$ extracted from $\boldsymbol{F}$.

$$\underset{1 \times 1}{e_l^{(s)}} = \mathrm{score}(\mathrm{concat}(\underset{d_q \times 1}{\boldsymbol{q}^{(s+1)}}, \underset{d_v \times 1}{\boldsymbol{c}^{(s)}}), \underset{d_k \times 1}{\boldsymbol{k}_l}). \tag{42}$$

An alignment function and the value vectors are then used to produce the next context vector $\boldsymbol{c}^{(s+1)}$. One must note that in [85], the weights used in each iteration are the same weights, meaning that the number of parameters does not scale with the number of repetitions. Yet, using multiple hops with different weight matrices can also be viable, as shown by the Transformer model [13] and in [88]. It may be difficult to grasp why $\boldsymbol{c}^{(s)}$ is part of the query input for the attention model. Essentially, this technique is closely related to self-attention in the sense that, in each iteration, a new context representation is created from the feature vectors and the context vector. The essence of this mechanism is that one wants to iteratively alter the query and the context vector, while attending to the feature vectors. In the process, the new representations of the context vector absorb more different kinds of information. This is also the main difference between this type of attention and multi-head attention. Multi-head attention creates multiple context vectors from multiple queries and combines them to create a final context vector as output. Multi-hop attention iteratively refines the context vector by incorporating information from the different queries. This does have the disadvantage of having to calculate attention sequentially.

Interestingly, due to the variations in which multi-hop attention has been proposed, some consider the Transformer model’s encoder and decoder to consist of several single-hop attention mechanisms [84] instead of being a multi-hop model. However, in the context of this survey, we consider the Transformer model to be an alternative form of the multi-hop mechanism, as the features matrix $\boldsymbol{F}$ is not directly reused in each step. Instead, $\boldsymbol{F}$ is only used as an input for the first hop, and is transformed via self-attention into a new representation. The self-attention mechanism uses each feature vector in $\boldsymbol{F}$ as a query, resulting in a matrix of context vectors as output of each attention hop. The intermediate context vectors are turned into matrices and represent iterative transformations of the matrix $\boldsymbol{F}$, which are used in the consecutive steps. Thus, the Transformer model iteratively refines the features matrix $\boldsymbol{F}$ by extracting and incorporating new information.

Fig. 11. An illustration of capsule-based attention.

When dealing with a classification task, another idea is to use a different query for each class. This is the basic principle behind capsule-based attention [89], as inspired by the capsule networks [90]. Suppose we have the feature vectors $\boldsymbol{f}_1, \ldots, \boldsymbol{f}_{n_f} \in \mathbb{R}^{d_f}$, and suppose there are $d_y$ classes that the model can predict. Then, a capsule-based attention model defines a capsule for each of the $d_y$ classes, each of which takes the feature vectors as input. Each capsule consists of, in order, an attention module, a probability module, and a reconstruction module, which are depicted in Fig. 11. The
attention modules all use self-attentive queries, so each module learns its own query: “Which feature vectors are important to identify this class?”. In [89], a self-attentive multiplicative score function is used for this purpose:

$$\underset{1 \times 1}{e_{c,l}} = \underset{1 \times d_k}{\boldsymbol{q}_c^{T}} \times \underset{d_k \times 1}{\boldsymbol{k}_l}, \tag{43}$$

where $e_{c,l} \in \mathbb{R}^{1}$ is the attention score for vector $l$ in capsule $c$, and $\boldsymbol{q}_c \in \mathbb{R}^{d_k}$ is a trainable query for capsule $c$, for $c = 1, \ldots, d_y$. Each attention module then uses an alignment function, and uses the produced attention weights to determine a context vector $\boldsymbol{c}_c \in \mathbb{R}^{d_v}$. Next, the context vector $\boldsymbol{c}_c$ is fed through a probability layer consisting of a linear transformation with a sigmoid activation function:

$$\underset{1 \times 1}{p_c} = \mathrm{sigmoid}(\underset{1 \times d_v}{\boldsymbol{w}_c^{T}} \times \underset{d_v \times 1}{\boldsymbol{c}_c} + \underset{1 \times 1}{b_c}), \tag{44}$$

where $\boldsymbol{w}_c \in \mathbb{R}^{d_v}$ and $b_c \in \mathbb{R}^{1}$ are trainable capsule-specific weight parameters, and $p_c \in \mathbb{R}^{1}$ is the predicted probability that the correct class is class $c$. The final layer is the reconstruction module that creates a class vector representation. This representation $\boldsymbol{r}_c \in \mathbb{R}^{d_v}$ is determined by simply multiplying the context vector $\boldsymbol{c}_c$ by the probability $p_c$:

$$\underset{d_v \times 1}{\boldsymbol{r}_c} = \underset{1 \times 1}{p_c} \times \underset{d_v \times 1}{\boldsymbol{c}_c}. \tag{45}$$

The capsule representation is used when training the model. First of all, the model is trained to predict the probabilities $p_1, \ldots, p_{d_y}$ as accurately as possible compared to the true values. Second, via a joint loss function, the model is also trained to accurately construct the capsule representations $\boldsymbol{r}_1, \ldots, \boldsymbol{r}_{d_y}$. A features representation $\boldsymbol{f} \in \mathbb{R}^{d_f}$ is defined which is simply the unweighted average of the original feature vectors. The idea is to train the model such that vector representations from capsules that are not the correct class differ significantly from $\boldsymbol{f}$ while the representation from the correct capsule is very similar to $\boldsymbol{f}$. A dot-product between the capsule representations and the features representation is used in [89] as a measure of the similarity between the vectors. Note that $d_v$ must equal $d_f$ in this case, otherwise the vectors would have incompatible dimensions. Interestingly, since attention is calculated for each class individually, one can track which specific feature vectors are important for which specific class. In [89], this idea is used to discover which words correspond to which sentiment class.

The number of tasks that can make use of multiple queries is substantial, due to how general the mechanisms are. As such, the techniques described in this section have been extensively explored in various domains. For example, multi-head attention has been used for speaker recognition based on audio spectrograms [91]. In [92], multi-head attention is used for recommendation of news articles. Additionally, multi-head attention can be beneficial for graph attention models as well [83]. As for multi-hop attention, quite a few papers have been mentioned before, but there are still many other interesting examples. For example, in [93], a multi-hop attention model is proposed for medication recommendation. Furthermore, practically every Transformer model makes use of both multi-head and multi-hop attention. The Transformer model has been extensively explored in various domains. For example, in [94], a Transformer model is implemented for image captioning. In [95], Transformers are explored for medical image segmentation. In [96], a Transformer model is used for emotion recognition in text messages. A last example of an application of Transformers is [17], which proposes a Transformer model for recommender systems. In comparison with multi-head and multi-hop attention, capsule-based attention is arguably the least popular of the mechanisms discussed for the multiplicity of queries. One example is [97], where an attention-based capsule network is proposed that also includes a multi-hop attention mechanism for the purpose of visual question answering. Another example is [98], where capsule-based attention is used for aspect-level sentiment analysis of restaurant reviews.

The multiplicity of queries is a particularly interesting category due to the Transformer model [13], which combines a form of multi-hop and multi-head attention. Due to the initial success of the Transformer model, many improvements and iterations of the model have been produced that typically aim to improve the predictive performance, the computational efficiency, or both. For example, the Transformer-XL [99] is an extension of the original Transformer that uses a recurrence mechanism to not be limited by a context window when processing the outputs. This allows the model to learn significantly longer dependencies while also being computationally more efficient during the evaluation phase. Another extension of the Transformer is known as the Reformer model [100]. This model is significantly more efficient computationally, by means of locality-sensitive hashing, and memory-wise, by means of reversible residual layers. Such computational improvements are vital, since one of the main disadvantages of the Transformer model is the sheer computational cost due to the complexity of the model scaling quadratically with the number of input feature vectors. The Linformer model [101] manages to reduce the complexity of the model to scale linearly, while achieving similar performance to the Transformer model. This is achieved by approximating the attention weights using a low-rank matrix. The Lite-Transformer model proposed in [102] achieves similar results by implementing two branches within the Transformer block that specialize in capturing global and local information. Another interesting Transformer architecture is the Synthesizer [103]. This model replaces the pairwise self-attention mechanism with “synthetic” attention weights. Interestingly, the performance of this model is relatively close to the original Transformer, meaning that the necessity of the pairwise self-attention mechanism of the Transformer model may be questionable. For a more comprehensive overview of Transformer architectures, we refer to [104].

4 EVALUATION OF ATTENTION MODELS

In this section, we present various types of evaluation for attention models. First, one can evaluate the structure of attention models using the taxonomy presented in Section 3. For such an analysis, we consider the attention mechanism categories (see Fig. 3) as orthogonal dimensions of a model. The structure of a model can be analyzed by determining which mechanism a model uses for each category. Table 3 provides an overview of attention models found in the
literature with a corresponding analysis based on the attention mechanisms the models implement.

TABLE 3
Attention Models Analyzed Based on the Proposed Taxonomy

A plus sign (+) between two mechanisms indicates that both techniques were combined in the same model, while a comma (,) indicates that both mechanisms were tested in the same paper, but not necessarily as a combination in the same model.

Second, we discuss various techniques for evaluating the performance of attention models. The performance of attention models can be evaluated using extrinsic or intrinsic performance measures, which are discussed in Sections 4.1 and 4.2, respectively.

4.1 Extrinsic Evaluation

In general, the performance of an attention model is measured using extrinsic performance measures. For example, performance measures typically used in the field of natural language processing are the BLEU [107], METEOR [108], and Perplexity [109] metrics. In the field of audio processing, the Word Error Rate [110] and Phoneme Error Rate [111] are generally employed. For general classification tasks, error rates, precision, and recall are generally used. For computer vision tasks, the PSNR [112], SSIM [113], or IoU [114] metrics are used. Using these performance measures, an attention model can either be compared to other state-of-the-art models, or an ablation study can be performed. If possible, the importance of the attention mechanism can be tested by replacing it with another mechanism and observing whether the overall performance of the model decreases [105], [115]. An example of this is replacing the weighted average used to produce the context vector with a simple unweighted average and observing whether there is a decrease in overall model performance [35]. This ablation method can be used to evaluate whether the attention weights can actually distinguish important from irrelevant information.

4.2 Intrinsic Evaluation

Attention models can also be evaluated using attention-specific intrinsic performance measures. In [4], the attention weights are formally evaluated via the Alignment Error Rate (AER) to measure the accuracy of the attention weights with respect to annotated attention vectors. [116] incorporates this idea into an attention model by supervising the attention mechanism using gold attention vectors. A joint loss function consisting of the regular task-specific loss and the attention weights loss function is constructed for this purpose. The gold attention vectors are based on annotated text data sets where keywords are hand-labelled. However, since attention is inspired by human attention, one could evaluate attention models by comparing them to the attention behaviour of humans.

4.2.1 Evaluation via Human Attention

In [117], the concept of attention correctness is proposed, which is a quantitative intrinsic performance metric that evaluates the quality of the attention mechanism based on actual human attention behaviour. First, the calculation of this metric requires data that includes the attention behaviour of a human. An example is a data set containing images together with the regions that a human focuses on when performing a certain task, such as image captioning. The collection of regions focused on by the human is referred to as the ground truth region. Suppose an attention model attends to the $n_f$ feature vectors $\boldsymbol{f}_1, \ldots, \boldsymbol{f}_{n_f} \in \mathbb{R}^{d_f}$. Feature vector $\boldsymbol{f}_i$ corresponds to region $R_i$ of the given image, for $i = 1, \ldots, n_f$. We define the set $G$ as the set of regions that belong to the ground truth region, such that $R_i \in G$ if $R_i$ is part of the ground truth region. The attention model calculates the attention weights $a_1, \ldots, a_{n_f} \in \mathbb{R}^{1}$ via the usual attention process. The Attention Correctness (AC) metric can then be calculated using (46).

$$\underset{1 \times 1}{\mathrm{AC}} = \sum_{i: R_i \in G} \underset{1 \times 1}{a_i}. \tag{46}$$

Thus, this metric is equal to the sum of the attention weights for the ground truth regions. Since the attention weights sum up to 1 due to, for example, a softmax alignment
function, the AC value will be a value between 0 and 1. If the model attends to only the ground truth regions, then AC is equal to 1, and if the attention model does not attend to any of the ground truth regions, AC will be equal to 0.

In [118], a rank correlation metric is used to compare the generated attention weights to the attention behaviour of humans. The conclusion of this work is that attention maps generated by standard attention models generally do not correspond to human attention. Attention models often focus on much larger regions or multiple small non-adjacent regions. As such, a technique to improve attention models is to allow the model to learn from human attention patterns via a joint loss of the regular loss function and an attention weight loss function based on the human gaze behaviour, similarly to how annotated attention vectors are used in [116] to supervise the attention mechanism. [117] proposes to use human attention data to supervise the attention mechanism in such a manner. Similarly, a state-of-the-art video captioning model is proposed in [119] that learns from human gaze data to improve the attention mechanism.

4.2.2 Manual Evaluation

A method that is often used to evaluate attention models is the manual inspection of attention weights. As previously mentioned, the attention weights are a direct indication of which parts of the data the attention model finds most important. Therefore, observing which parts of the inputs the model focuses on can be helpful in determining if the model is behaving correctly. This allows for some interpretation of the behaviour of models that are typically known to be black boxes. However, rather than checking if the model focuses on the most important parts of the data, some use the attention weights to determine which parts of the data are most important. This would imply that attention models provide a type of explanation, which is a subject of contention among researchers. Particularly, in [120], extensive experiments are conducted for various natural language processing tasks to investigate the relation between attention weights and important information to determine whether attention can actually provide meaningful explanations. In this paper titled “Attention is not Explanation”, it is found that attention weights do not tend

be diagnosed via the attention weights if the model is found to focus on the incorrect parts of the data, if such information is available. Yet, conversely, attention weights may only be used to obtain plausible explanations for why certain parts of the data are focused on, rather than concluding that those parts are significant to the problem [121]. However, one should still be cautious as the viability of such approaches can depend on the model architecture [122].

5 CONCLUSION

In this survey, we have provided an overview of recent research on attention models in deep learning. Attention mechanisms have been a prominent development for deep learning models as they have shown to improve model performance significantly, producing state-of-the-art results for various tasks in several fields of research. We have presented a comprehensive taxonomy that can be used to categorize and explain the diverse number of attention mechanisms proposed in the literature. The organization of the taxonomy was motivated based on the structure of a task model that consists of a feature model, an attention model, a query model, and an output model. Furthermore, the attention mechanisms have been discussed using a framework based on queries, keys, and values. Last, we have shown how one can use extrinsic and intrinsic measures to evaluate the performance of attention models, and how one can use the taxonomy to analyze the structure of attention models.

The attention mechanism is typically relatively simple to understand and implement and can lead to significant improvements in performance. As such, it is no surprise that this is a highly active field of research with new attention mechanisms and models being developed constantly. Not only are new mechanisms consistently being developed, but there is also still ample opportunity for the exploration of existing mechanisms for new tasks. For example, multi-dimensional attention [64] is a technique that shows promising results and is general enough to be implemented in almost any attention model. However, it has not seen much application in current works. Similarly, multi-head attention [13] is a technique that can be efficiently parallelized and implemented in practically
to correlate with important features. Additionally, the any attention model. Yet, it is mostly seen only in Trans-
authors are able to replace the produced attention weights former-based architectures. Lastly, similarly to how [43]
with completely different values while keeping the model combines rotatory attention with multi-hop attention,
output the same. These so-called “adversarial” attention combining multi-dimensional attention, multi-head atten-
distributions show that an attention model may focus on tion, capsule-based attention, or any of the other mecha-
completely different information and still come to the same nisms presented in this survey may produce new state-of-
conclusions, which makes interpretation difficult. Yet, in the-art results for the various fields of research mentioned
another paper titled “Attention is not not Explanation” in this survey.
[121], the claim that attention is not explanation is ques- This survey has mainly focused on attention mecha-
tioned by challenging the assumptions of the previous nisms for supervised models, since these comprise the
work. It is found that the adversarial attention distributions largest proportion of the attention models in the litera-
do not perform as reliably well as the learned attention ture. In comparison to the total amount of research that
weights, indicating that it was not proved that attention is has been done on attention models, research on attention
not viable for explanation. models for semi-supervised learning [123], [124] or unsu-
In general, the conclusion regarding the interpretability pervised learning [125], [126] has received limited atten-
of attention models is that researchers must be extremely tion and has only become active recently. Attention may
careful when drawing conclusions based on attention pat- play a more significant role for such tasks in the future as
terns. For example, problems with an attention model can obtaining large amounts of labeled data is a difficult task.
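
To make the use of human attention data from Section 4.2.1 concrete, the following Python sketch illustrates two ideas discussed there: scoring the agreement between model attention and human attention with a rank correlation, and adding an attention term to the training loss. It is a minimal illustration under stated assumptions, not the procedure of [116]-[119]: Spearman's coefficient is assumed as the rank correlation, squared error is assumed as the attention loss, and the names `ranks`, `spearman`, `joint_loss`, `lam`, and the toy attention vectors are all hypothetical.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(model_attn, human_attn):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(model_attn), ranks(human_attn)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def joint_loss(task_loss, model_attn, human_attn, lam=0.5):
    """Regular task loss plus a weighted attention loss.

    The attention loss is a squared error between the model's attention
    weights and the human attention map (an assumption of this sketch).
    """
    attn_loss = sum((a - h) ** 2 for a, h in zip(model_attn, human_attn))
    return task_loss + lam * attn_loss


# Hypothetical attention weights over four input regions.
human = [0.70, 0.20, 0.05, 0.05]    # normalized human gaze map
aligned = [0.60, 0.25, 0.10, 0.05]  # model that looks where humans look
diffuse = [0.05, 0.25, 0.30, 0.40]  # model that looks elsewhere

print("rank correlation (aligned):", spearman(aligned, human))
print("rank correlation (diffuse):", spearman(diffuse, human))
print("joint loss (aligned):", joint_loss(1.0, aligned, human))
print("joint loss (diffuse):", joint_loss(1.0, diffuse, human))
```

With the same task loss, the model whose attention disagrees with the human map receives the larger joint loss, so training is pushed toward human-like attention. The weight `lam` controls how strongly; choosing it is a tuning decision that this sketch does not address.
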
Yet, as larger and more detailed data sets become available, the research on attention models can advance even further. For example, we mentioned the fact that attention weights can be trained directly based on hand-annotated data [116] or actual human attention behaviour [117], [119]. As new data sets are released, future research may focus on developing attention models that can incorporate those types of data.

While attention is intuitively easy to understand, there still is a substantial lack of theoretical support for attention. As such, we expect more theoretical studies to additionally contribute to the understanding of the attention mechanisms in complex deep learning systems. Nevertheless, the practical advantages of attention models are clear. Since attention models provide significant performance improvements in a variety of fields, and as there are ample opportunities for more advancements, we foresee that these models will still receive significant attention in the time to come.

REFERENCES

[1] H. Larochelle and G. E. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine,” in Proc. 24th Annu. Conf. Neural Inf. Process. Syst., 2010, pp. 1243–1251.
[2] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in Proc. 27th Annu. Conf. Neural Inf. Process. Syst., 2014, pp. 2204–2212.
[3] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. 3rd Int. Conf. Learn. Representations, 2015.
[4] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2015, pp. 1412–1421.
[5] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technologies, 2016, pp. 1480–1489.
[6] Y. Wang, M. Huang, X. Zhu, and L. Zhao, “Attention-based LSTM for aspect-level sentiment classification,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2016, pp. 606–615.
[7] P. Anderson et al., “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6077–6086.
[8] K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[9] Y. Ma, H. Peng, and E. Cambria, “Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5876–5883.
[10] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Proc. 28th Annu. Conf. Neural Inf. Process. Syst., 2015, pp. 577–585.
[11] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2016, pp. 4945–4949.
[12] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2017, pp. 4835–4839.
[13] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Annu. Conf. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[14] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proc. 8th Workshop Syntax Semantics Structure Statist. Transl., 2014, pp. 103–111.
[15] N. Parmar et al., “Image Transformer,” in Proc. 35th Int. Conf. Mach. Learn., 2018, pp. 4055–4064.
[16] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8739–8748.
[17] F. Sun et al., “BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer,” in Proc. 28th ACM Int. Conf. Inf. Knowl. Manage., 2019, pp. 1441–1450.
[18] F. Wang and D. M. J. Tax, “Survey on the attention based RNN model and its applications in computer vision,” 2016, arXiv:1601.06823.
[19] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, “Attention models in graphs: A survey,” ACM Trans. Knowl. Discovery Data, vol. 13, pp. 62:1–62:25, 2019.
[20] S. Chaudhari, V. Mithal, G. Polatkan, and R. Ramanath, “An attentive survey of attention models,” ACM Trans. Intell. Syst. Technol., vol. 12, no. 5, pp. 1–32, 2021.
[21] D. Hu, “An introductory survey on attention mechanisms in NLP problems,” in Proc. Intell. Syst. Conf., ser. AISC, vol. 1038, 2020, pp. 432–448.
[22] A. Galassi, M. Lippi, and P. Torroni, “Attention, please! a critical review of neural attention models in natural language processing,” 2019, arXiv:1902.02181.
[23] M. Daniluk, T. Rocktäschel, J. Welbl, and S. Riedel, “Frustratingly short attention spans in neural language modeling,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[24] Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley, “Attention and localization based on a deep convolutional recurrent model for weakly supervised audio tagging,” in Proc. 18th Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 3083–3087.
[25] C. Yu, K. S. Barsim, Q. Kong, and B. Yang, “Multi-level attention model for weakly supervised audio classification,” in Proc. Detection Classification Acoustic Scenes Events Workshop, 2018, pp. 188–192.
[26] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” in Proc. 4th Int. Conf. Learn. Representations Workshop, 2016.
[27] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based LSTM and semantic consistency,” IEEE Trans. Multimedia, vol. 19, no. 9, pp. 2045–2055, Sep. 2017.
[28] H. Ying et al., “Sequential recommender system based on hierarchical attention networks,” in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, pp. 3926–3932.
[29] H. Song, D. Rajan, J. Thiagarajan, and A. Spanias, “Attend and diagnose: Clinical time series analysis using attention models,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 4091–4098.
[30] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj, “Temporal attention-augmented bilinear network for financial time-series data analysis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1407–1418, May 2019.
[31] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in Proc. 6th Int. Conf. Learn. Representations, 2018.
[32] J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in Proc. 30th Annu. Conf. Neural Inf. Process. Syst., 2016, pp. 289–297.
[33] F. Fan, Y. Feng, and D. Zhao, “Multi-grained attention network for aspect-level sentiment classification,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2018, pp. 3433–3442.
[34] D. Ma, S. Li, X. Zhang, and H. Wang, “Interactive attention networks for aspect-level sentiment classification,” in Proc. 26th Int. Joint Conf. Artif. Intell., 2017, pp. 4068–4074.
[35] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention flow for machine comprehension,” in Proc. 4th Int. Conf. Learn. Representations, 2016.
[36] S. Zheng and R. Xia, “Left-center-right separated neural network for aspect-based sentiment analysis with rotatory attention,” 2018, arXiv:1802.00892.
[37] B. Jing, P. Xie, and E. Xing, “On the automatic generation of medical imaging reports,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 2577–2586.
[38] J. Gao et al., “CAMP: Co-attention memory networks for diagnosis prediction in healthcare,” in Proc. IEEE Int. Conf. Data Mining, 2019, pp. 1036–1041.
[39] Y. Tay, A. T. Luu, and S. C. Hui, “Multi-pointer co-attention networks for recommendation,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 2309–2318.
[40] S. Liu, Z. Chen, H. Liu, and X. Hu, “User-video co-attention network for personalized micro-video recommendation,” in Proc. World Wide Web Conf., 2019, pp. 3020–3026.
[41] M. Tu, G. Wang, J. Huang, Y. Tang, X. He, and B. Zhou, “Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2704–2713.
[42] Y.-J. Lu and C.-T. Li, “GCAN: Graph-aware co-attention networks for explainable fake news detection on social media,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 505–514.
[43] O. Wallaart and F. Frasincar, “A hybrid approach for aspect-based sentiment analysis using a lexicalized domain ontology and attentional neural models,” in Proc. 16th Extended Semantic Web Conf., 2019, pp. 363–378.
[44] S. Zhao and Z. Zhang, “Attention-via-attention neural machine translation,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 563–570.
[45] L. Wu, L. Chen, R. Hong, Y. Fu, X. Xie, and M. Wang, “A hierarchical attention model for social contextual image recommendation,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 10, pp. 1854–1867, Oct. 2020.
[46] Y. Wang, S. Wang, J. Tang, N. O’Hare, Y. Chang, and B. Li, “Hierarchical attention network for action recognition in videos,” 2016, arXiv:1607.06416.
[47] Z. Li, Y. Wei, Y. Zhang, and Q. Yang, “Hierarchical attention transfer network for cross-domain sentiment classification,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5852–5859.
[48] C. Xing, Y. Wu, W. Wu, Y. Huang, and M. Zhou, “Hierarchical recurrent attention network for response generation,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5610–5617.
[49] V. A. Sindagi and V. M. Patel, “HA-CCN: Hierarchical attention-based crowd counting network,” IEEE Trans. Image Process., vol. 29, pp. 323–335, 2020.
[50] D. Kiela, C. Wang, and K. Cho, “Dynamic meta-embeddings for improved sentence representations,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2018, pp. 1466–1477.
[51] S. Maharjan, M. Montes, F. A. Gonzalez, and T. Solorio, “A genre-aware attention model to improve the likability prediction of books,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2018, pp. 3381–3391.
[52] G. I. Winata, Z. Lin, and P. Fung, “Learning multilingual meta-embeddings for code-switching named entity recognition,” in Proc. 4th Workshop Representation Learn. NLP, 2019, pp. 181–186.
[53] R. Jin, L. Lu, J. Lee, and A. Usman, “Multi-representational convolutional neural networks for text classification,” Comput. Intell., vol. 35, no. 3, pp. 599–609, 2019.
[54] A. Sordoni, P. Bachman, A. Trischler, and Y. Bengio, “Iterative alternating neural attention for machine reading,” 2016, arXiv:1606.022456.
[55] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing machines,” 2014, arXiv:1410.5401.
[56] D. Britz, A. Goldie, M.-T. Luong, and Q. Le, “Massive exploration of neural machine translation architectures,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2017, pp. 1442–1451.
[57] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3, pp. 229–256, 1992.
[58] T. Shen, T. Zhou, G. Long, J. Jiang, S. Wang, and C. Zhang, “Reinforced self-attention network: a hybrid of hard and soft attention for sequence modeling,” in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, pp. 4345–4352.
[59] M. Malinowski, C. Doersch, A. Santoro, and P. Battaglia, “Learning visual question answering by bootstrapping hard attention,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–20.
[60] Y. Liu, W. Wang, Y. Hu, J. Hao, X. Chen, and Y. Gao, “Multi-agent game abstraction via graph attention neural network,” in Proc. 34th AAAI Conf. Artif. Intell., 2020, pp. 7211–7218.
[61] S. Seo, J. Huang, H. Yang, and Y. Liu, “Interpretable convolutional neural networks with dual local and global attention for review rating prediction,” in Proc. 11th ACM Conf. Recommender Syst., 2017, pp. 297–305.
[62] J. Wang et al., “Aspect sentiment classification towards question-answering with reinforced bidirectional attention network,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 3548–3557.
[63] M. Jiang, C. Li, J. Kong, Z. Teng, and D. Zhuang, “Cross-level reinforced attention network for person re-identification,” J. Vis. Commun. Image Representation, vol. 69, 2020, Art. no. 102775.
[64] T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang, “DiSAN: Directional self-attention network for RNN/CNN-free language understanding,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5446–5455.
[65] O. Arshad, I. Gallo, S. Nawaz, and A. Calefati, “Aiding intra-text representations with visual context for multimodal named entity recognition,” in Proc. Int. Conf. Document Anal. Recognit., 2019, pp. 337–342.
[66] W. Wu, X. Sun, and H. Wang, “Question condensing networks for answer selection in community question answering,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1746–1755.
[67] O. Oktay et al., “Attention U-Net: Learning where to look for the pancreas,” in Proc. 1st Med. Imag. Deep Learn. Conf., 2018.
[68] R. Tan, J. Sun, B. Su, and G. Liu, “Extending the transformer with context and multi-dimensional mechanism for dialogue response generation,” in Proc. 8th Int. Conf. Natural Lang. Process. Chinese Comput., 2019, pp. 189–199.
[69] L. Chen, B. Lv, C. Wang, S. Zhu, B. Tan, and K. Yu, “Schema-guided multi-domain dialogue state tracking with graph attention neural networks,” in Proc. 34th AAAI Conf. Artif. Intell., 2020, pp. 7521–7528.
[70] H. Wang, G. Liu, A. Liu, Z. Li, and K. Zheng, “DMRAN: A hierarchical fine-grained attention-based network for recommendation,” in Proc. 28th Int. Joint Conf. Artif. Intell., 2019, pp. 3698–3704.
[71] Z. Lin et al., “A structured self-attentive sentence embedding,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[72] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” 2016, arXiv:1607.06450.
[73] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10 076–10 085.
[74] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 7354–7363.
[75] I. Goodfellow et al., “Generative adversarial nets,” in Proc. 27th Annu. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[76] A. Sinha and J. Dolz, “Multi-scale self-guided attention for medical image segmentation,” IEEE J. Biomed. Health Inform., vol. 25, no. 1, pp. 121–130, Jan. 2021.
[77] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Proc. Asian Conf. Comput. Vis., 2018, pp. 39–54.
[78] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2019, pp. 7115–7119.
[79] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2018.2889052.
[80] S. Zhang, Y. Tay, L. Yao, and A. Sun, “Next item recommendation with self-attention,” 2018, arXiv:1808.06414.
[81] G. Letarte, F. Paradis, P. Giguere, and F. Laviolette, “Importance of self-attention for sentiment analysis,” in Proc. Workshop BlackboxNLP: Analyzing Interpreting Neural Netw., 2018, pp. 267–275.
[82] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang, “Dysat: Deep neural representation learning on dynamic graphs via self-attention networks,” in Proc. 13th Int. Conf. Web Search Data Mining, 2020, pp. 519–527.
[83] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[84] S. Iida, R. Kimura, H. Cui, P.-H. Hung, T. Utsuro, and M. Nagata, “Attention over heads: A multi-hop attention for neural machine translation,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics: Student Res. Workshop, 2019, pp. 217–222.
[85] N. K. Tran and C. Niederée, “Multihop attention networks for question answer matching,” in Proc. 41st ACM SIGIR Int. Conf. Res. Develop. Inf. Retrieval, 2018, pp. 325–334.
[86] Y. Gong and S. R. Bowman, “Ruminating reader: Reasoning with gated multi-hop attention,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[87] S. Yoon, S. Byun, S. Dey, and K. Jung, “Speech emotion recognition using multi-hop attention mechanism,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2019, pp. 2822–2826.
[88] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2016, pp. 21–29.
[89] Y. Wang, A. Sun, J. Han, Y. Liu, and X. Zhu, “Sentiment analysis by capsules,” in Proc. World Wide Web Conf., 2018, pp. 1165–1174.
[90] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Proc. 31st Annu. Conf. Neural Inf. Process. Syst., 2017, pp. 3859–3869.
[91] M. India, P. Safari, and J. Hernando, “Self multi-head attention for speaker recognition,” in Proc. 20th Annu. Conf. Int. Speech Commun. Assoc., 2019, pp. 2822–2826.
[92] C. Wu, F. Wu, S. Ge, T. Qi, Y. Huang, and X. Xie, “Neural news recommendation with multi-head self-attention,” in Proc. Conf. Empir. Methods Natural Lang. Process., 9th Int. Joint Conf. Natural Lang. Process., 2019, pp. 6389–6394.
[93] Y. Wang, W. Chen, D. Pi, and L. Yue, “Adversarially regularized medication recommendation model with multi-hop memory network,” Knowl. Inf. Syst., vol. 63, no. 1, pp. 125–142, 2021.
[94] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10 578–10 587.
[95] J. Chen et al., “TransUnet: Transformers make strong encoders for medical image segmentation,” 2021, arXiv:2102.04306.
[96] P. Zhong, D. Wang, and C. Miao, “Knowledge-enriched transformer for emotion detection in textual conversations,” in Proc. Conf. Empir. Methods Natural Lang. Process., 9th Int. Joint Conf. Natural Lang. Process., 2019, pp. 165–176.
[97] Y. Zhou, R. Ji, J. Su, X. Sun, and W. Chen, “Dynamic capsule attention for visual question answering,” in Proc. 33rd AAAI Conf. Artif. Intell., 2019, pp. 9324–9331.
[98] Y. Wang, A. Sun, M. Huang, and X. Zhu, “Aspect-level sentiment analysis using AS-capsules,” in Proc. World Wide Web Conf., 2019, pp. 2033–2044.
[99] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2978–2988.
[100] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient Transformer,” in Proc. 8th Int. Conf. Learn. Representations, 2020.
[101] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” 2020, arXiv:2006.04768.
[102] Z. Wu, Z. Liu, J. Lin, Y. Lin, and S. Han, “Lite transformer with long-short range attention,” in Proc. 8th Int. Conf. Learn. Representations, 2020.
[103] Y. Tay, D. Bahri, D. Metzler, D. C. Juan, Z. Zhao, and C. Zheng, “Synthesizer: Rethinking self-attention for transformer models,” in Proc. 38th Int. Conf. Mach. Learn., vol. 139, 2021, pp. 10183–10192.
[104] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” 2020, arXiv:2009.06732.
[105] X. Li et al., “Beyond RNNs: Positional self-attention with co-attention for video question answering,” in Proc. 33rd AAAI Conf. Artif. Intell., 2019, pp. 8658–8665.
[106] A. W. Yu et al., “QANet: Combining local convolution with global self-attention for reading comprehension,” in Proc. 6th Int. Conf. Learn. Representations, 2018.
[107] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[108] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proc. Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summarization, 2005, pp. 65–72.
[109] R. Sennrich, “Perplexity minimization for translation model domain adaptation in statistical machine translation,” in Proc. 13th Conf. Eur. Chapter Assoc. Comput. Linguistics, 2012, pp. 539–549.
[110] M. Popovic and H. Ney, “Word error rates: Decomposition over POS classes and applications for error analysis,” in Proc. 2nd Workshop Statist. Mach. Transl., 2007, pp. 48–55.
[111] P. Schwarz, P. Matejka, and J. Cernocký, “Towards lower error rates in phoneme recognition,” in Proc. 7th Int. Conf. Text Speech Dialogue, 2004, pp. 465–472.
[112] D. S. Turaga, Y. Chen, and J. Caviedes, “No reference PSNR estimation for compressed pictures,” Signal Process.: Image Commun., vol. 19, no. 2, pp. 173–184, 2004.
[113] P. Ndajah, H. Kikuchi, M. Yukawa, H. Watanabe, and S. Muramatsu, “SSIM image quality metric for denoised images,” in Proc. 3rd WSEAS Int. Conf. Vis. Imaging Simul., 2010, pp. 53–58.
[114] M. A. Rahman and Y. Wang, “Optimizing intersection-over-union in deep neural networks for image segmentation,” in Proc. 12th Int. Symp. Vis. Comput., 2016, pp. 234–244.
[115] X. Chen, L. Yao, and Y. Zhang, “Residual attention U-net for automated multi-class segmentation of COVID-19 chest CT images,” 2020, arXiv:2004.05645.
[116] S. Liu, Y. Chen, K. Liu, and J. Zhao, “Exploiting argument information to improve event detection via supervised attention mechanisms,” in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 1789–1798.
[117] C. Liu, J. Mao, F. Sha, and A. Yuille, “Attention correctness in neural image captioning,” in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 4176–4182.
[118] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra, “Human attention in visual question answering: Do humans and deep networks look at the same regions?,” Comput. Vis. Image Understanding, vol. 163, pp. 90–100, 2017.
[119] Y. Yu, J. Choi, Y. Kim, K. Yoo, S.-H. Lee, and G. Kim, “Supervising neural attention models for video captioning by human gaze data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6119–6127.
[120] S. Jain and B. C. Wallace, “Attention is not explanation,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Language Technol., 2019, pp. 3543–3556.
[121] S. Wiegreffe and Y. Pinter, “Attention is not not explanation,” in Proc. Conf. Empir. Methods Natural Lang. Process., 9th Int. Joint Conf. Natural Lang. Process., 2019, pp. 11–20.
[122] A. K. Mohankumar, P. Nema, S. Narasimhan, M. M. Khapra, B. V. Srinivasan, and B. Ravindran, “Towards transparent and explainable attention models,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 4206–4216.
[123] K. K. Thekumparampil, C. Wang, S. Oh, and L.-J. Li, “Attention-based graph neural network for semi-supervised learning,” 2018, arXiv:1803.03735.
[124] D. Nie, Y. Gao, L. Wang, and D. Shen, “ASDNet: Attention based semi-supervised deep networks for medical image segmentation,” in Proc. 21st Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2018, pp. 370–378.
[125] Y. Alami Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim, “Unsupervised attention-guided image-to-image translation,” in Proc. 32nd Annu. Conf. Neural Inf. Process. Syst., 2018, pp. 3693–3703.
[126] R. He, W. S. Lee, H. T. Ng, and D. Dahlmeier, “An unsupervised neural attention model for aspect extraction,” in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 388–397.

Gianni Brauwers received the BS degree in econometrics and operations research from Erasmus University Rotterdam, Rotterdam, the Netherlands, in 2019. He is currently working toward the MS degree in econometrics and management science at Erasmus University Rotterdam. He is a research assistant with Erasmus University Rotterdam, focusing his research on neural attention models and sentiment analysis.

Flavius Frasincar received the MS degree in computer science, in 1996, the MPhil degree in computer science from the Politehnica University of Bucharest, Bucharest, Romania, in 1997, the PDEng degree in computer science, in 2000, and the PhD degree in computer science from the Eindhoven University of Technology, Eindhoven, the Netherlands, in 2005. Since 2005, he has been an assistant professor in computer science with Erasmus University Rotterdam, Rotterdam, the Netherlands. He has published in numerous conferences and journals in the areas of databases, Web information systems, personalization, machine learning, and the Semantic Web. He is a member of the editorial boards of Decision Support Systems, International Journal of Web Engineering and Technology, and Computational Linguistics in the Netherlands Journal, and co-editor-in-chief of the Journal of Web Engineering. He is a member of the Association for Computing Machinery.