A General Survey On Attention Mechanisms in Deep Learning
Abstract—Attention is an important mechanism that can be employed for a variety of deep learning models across many different
domains and tasks. This survey provides an overview of the most important attention mechanisms proposed in the literature.
The various attention mechanisms are explained by means of a framework consisting of a general attention model, uniform notation,
and a comprehensive taxonomy of attention mechanisms. Furthermore, the various measures for evaluating attention models are
reviewed, and methods to characterize the structure of attention models based on the proposed framework are discussed. Last, future
work in the field of attention models is considered.
Index Terms—Attention models, deep learning, introductory and survey, neural nets, supervised learning
The structure of this paper is as follows. Section 2 introduces a general attention model that provides the reader with a basic understanding of the properties of attention and how it can be applied. One of the main contributions of this paper is the taxonomy of attention techniques presented in Section 3. In this section, attention mechanisms are explained and categorized according to the presented taxonomy. Section 4 provides an overview of performance measures and methods for evaluating attention models. Furthermore, the taxonomy is used to evaluate the structure of various attention models. Lastly, in Section 5, we give our conclusions and suggestions for further research.

2 GENERAL ATTENTION MODEL
This section presents a general form of attention with corresponding notation. The notation introduced here is based on the notation that was introduced in [23] and popularized in [13]. The framework presented in this section is used throughout the rest of this paper.

To implement a general attention model, it is necessary to first describe the general characteristics of a model that can employ attention. We will refer to the complete model as the task model, of which the structure is presented in Fig. 1. This model simply takes an input, carries out the specified task, and produces the desired output. For example, the task model can be a language model that takes as input a piece of text, and produces as output a summary of the contents, a classification of the sentiment, or the text translated word for word to another language. Alternatively, the task model can take an image, and produce a caption or segmentation for that image. The task model consists of four submodels: the feature model, the query model, the attention model, and the output model. In Section 2.1, the feature model and query model are discussed, which are used to prepare the input for the attention calculation. In Section 2.2, the attention model and output model are discussed, which are concerned with producing the output. Last, in Section 2.3, we highlight the applications of attention.

2.1 Attention Input
Suppose the task model takes as input the matrix $X \in \mathbb{R}^{d_x \times n_x}$, where $d_x$ represents the size of the input vectors and $n_x$ represents the number of input vectors. The columns in this matrix can represent the words in a sentence, the pixels in an image, the characteristics of an acoustic sequence, or any other collection of inputs. The feature model is then employed to extract the $n_f$ feature vectors $f_1, \dots, f_{n_f} \in \mathbb{R}^{d_f}$ from $X$, where $d_f$ represents the size of the feature vectors. The feature model can be a recurrent neural network (RNN), a convolutional neural network (CNN), a simple embedding layer, a linear transformation of the original data, or no transformation at all. Essentially, the feature model consists of all the steps that transform the original input $X$ into the feature vectors $f_1, \dots, f_{n_f}$ that the attention model will attend to.

To determine which vectors to attend to, the attention model requires the query $q \in \mathbb{R}^{d_q}$, where $d_q$ indicates the size of the query vector. This query is extracted by the query model, and is generally designed based on the type of output that is desired of the model. A query tells the attention model which feature vectors to attend to. It can be interpreted literally as a query, or a question. For example, for the task of image captioning, suppose that one uses a decoder RNN model to produce the output caption based on feature vectors obtained from the image by a CNN. At each prediction step, the hidden state of the RNN model can be used as a query to attend to the CNN feature vectors. In each step, the query is a question in the sense that it asks for the necessary information from the feature vectors based on the current prediction context.

2.2 Attention Output
Fig. 2. The inner mechanisms of the general attention module.

The feature vectors and query are used as input for the attention model. This model consists of a single, or a collection of general attention modules. An overview of a general attention module is presented in Fig. 2. The input of the general attention module is the query $q \in \mathbb{R}^{d_q}$, and the matrix of feature vectors $F = [f_1, \dots, f_{n_f}] \in \mathbb{R}^{d_f \times n_f}$. Two separate matrices are extracted from the matrix $F$: the keys matrix $K = [k_1, \dots, k_{n_f}] \in \mathbb{R}^{d_k \times n_f}$, and the values matrix $V = [v_1, \dots, v_{n_f}] \in \mathbb{R}^{d_v \times n_f}$, where $d_k$ and $d_v$ indicate, respectively, the dimensions of the key vectors (columns of $K$) and value vectors (columns of $V$). The general way of obtaining these matrices is through a linear transformation of $F$ using the weight matrices $W_K \in \mathbb{R}^{d_k \times d_f}$ and $W_V \in \mathbb{R}^{d_v \times d_f}$, for $K$ and $V$, respectively. The calculations of $K$ and $V$ are presented in (1). Both weight matrices can be learned during training or predefined by the researcher. For example, one can choose to define both $W_K$ and $W_V$ as equal to the identity matrix to retain the original feature vectors. Other ways of defining the keys and the values are also possible, such as using completely separate inputs for the keys and values. The only constraint to be obeyed is that the number of columns in $K$ and $V$ remains the same.

$$K = W_K F, \qquad V = W_V F. \qquad (1)$$
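To make the notation concrete, the following minimal NumPy sketch extracts the keys and values matrices from a feature matrix as in (1). It is illustrative only; the dimensions and the random weight matrices are our own choices and are not taken from any surveyed model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_f, n_f = 8, 5          # feature dimension and number of feature vectors
d_k, d_v = 6, 6          # key and value dimensions

F = rng.normal(size=(d_f, n_f))      # feature matrix F, columns are feature vectors
W_K = rng.normal(size=(d_k, d_f))    # weight matrix for the keys (here random, normally learned)
W_V = rng.normal(size=(d_v, d_f))    # weight matrix for the values

K = W_K @ F   # keys matrix, shape (d_k, n_f), as in (1)
V = W_V @ F   # values matrix, shape (d_v, n_f), as in (1)
```

Setting both weight matrices to identity matrices would reproduce the special case mentioned above in which the original feature vectors are used directly as keys and values.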
The goal of the attention module is to produce a weighted average of the value vectors in $V$. The weights used to produce this output are obtained via an attention scoring and alignment step. The query $q$ and the keys matrix $K$ are used to calculate the vector of attention scores $e = [e_1, \dots, e_{n_f}] \in \mathbb{R}^{n_f}$. This is done via the score function $\mathrm{score}()$, as illustrated in (2).

$$e_l = \mathrm{score}(q, k_l). \qquad (2)$$

As discussed before, the query symbolizes a request for information. The attention score $e_l$ represents how important the information contained in the key vector $k_l$ is according to the query. If the dimensions of the query and key vectors are the same, an example of a score function would be to take the dot-product of the vectors. The different types of score functions are further discussed in Section 3.2.1.

Next, the attention scores are processed further through an alignment layer. The attention scores can generally have a wide range outside of $[0, 1]$. However, since the goal is to produce a weighted average, the scores are redistributed via an alignment function $\mathrm{align}()$ as defined in (3).

$$a_l = \mathrm{align}(e_l, e), \qquad (3)$$

where $a_l \in \mathbb{R}^1$ is the attention weight corresponding to the $l$th value vector. One example of an alignment function would be to use a softmax function, but the various other alignment types are discussed in Section 3.2.2. The attention weights provide a rather intuitive interpretation for the attention module. Each weight is a direct indication of how important each feature vector is relative to the others for this particular problem. This can provide us with a more in-depth understanding of the model behaviour, and the relations between inputs and outputs. The vector of attention weights $a = [a_1, \dots, a_{n_f}] \in \mathbb{R}^{n_f}$ is used to produce the context vector $c \in \mathbb{R}^{d_v}$ by calculating a weighted average of the columns of the values matrix $V$, as shown in (4).

$$c = \sum_{l=1}^{n_f} a_l v_l. \qquad (4)$$

As illustrated in Fig. 1, the context vector is then used in the output model to create the output $\hat{y}$. This output model translates the context vector into an output prediction. For example, it could be a simple softmax layer that takes as input the context vector $c$, as shown in (5).

$$\hat{y} = \mathrm{softmax}(W_c c + b_c), \qquad (5)$$

where $d_y$ is the number of output choices or classes, and $W_c \in \mathbb{R}^{d_y \times d_v}$ and $b_c \in \mathbb{R}^{d_y}$ are trainable weights.
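The complete module of (2) through (5) can be sketched as follows. This is an illustrative NumPy implementation that assumes a dot-product score function and a softmax alignment, which are only two of the options discussed in Section 3.2; all names and dimensions are our own.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_module(q, K, V):
    """General attention module: scores (2), softmax alignment (3), context vector (4)."""
    e = K.T @ q      # dot-product attention scores, shape (n_f,); assumes d_q == d_k
    a = softmax(e)   # attention weights, redistributed to sum to 1
    c = V @ a        # context vector: weighted average of the value columns
    return c, a

rng = np.random.default_rng(1)
d_f, n_f, d_k, d_v, d_y = 8, 5, 6, 6, 3
F = rng.normal(size=(d_f, n_f))                    # feature matrix from the feature model
W_K, W_V = rng.normal(size=(d_k, d_f)), rng.normal(size=(d_v, d_f))
q = rng.normal(size=d_k)                           # query produced by the query model

c, a = attention_module(q, W_K @ F, W_V @ F)

# simple output model as in (5): softmax over d_y classes
W_c, b_c = rng.normal(size=(d_y, d_v)), rng.normal(size=d_y)
y_hat = softmax(W_c @ c + b_c)
```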
2.3 Attention Applications
Attention is a rather general mechanism that can be used in a wide variety of problem domains. Consider the task of machine translation using an RNN model. Also, consider the problem of image classification using a basic CNN model. While an RNN produces a sequence of hidden state vectors, a CNN creates feature maps, where each region in the image is represented by a feature vector. The RNN hidden states are organized sequentially, while the CNN feature maps are organized spatially. Yet, attention can still be applied in both situations, since the attention mechanism does not inherently depend on the organization of the feature vectors. This characteristic makes attention easy to implement in a wide variety of models in different domains.

Another domain where attention can be applied is audio processing [24], [25]. Acoustic sequences can be represented by a sequence of feature vectors that relate to certain time periods of the audio sample. These vectors could simply be the raw input audio, or they can be extracted via, for example, an RNN or CNN. Video processing is another domain where attention can be applied intuitively [26], [27]. Video data consists of sequences of images, so attention can be applied to the individual images, as well as the entire sequence. Recommender systems often incorporate a user's interaction history to produce recommendations. Feature vectors can be extracted based on, for example, the id's or other characteristics of the products the user interacted with, and attention can be applied to them [28]. Attention can generally also be applied to many problems that use a time series as input, be it medical [29], financial [30], or anything else, as long as feature vectors can be extracted.

The fact that attention does not rely on the organization of the feature vectors allows it to be applied to various problems that each use data with different structures, as illustrated by the previous domain examples. Yet, this can be taken even further by applying attention to data where there is irregular structure. For example, protein structures, city traffic flows, and communication networks cannot always be represented using neatly structured organizations, such as sequences, like time series, or grids, like images. In such cases, the different aspects of the data are often represented as nodes in a graph. These nodes can be represented by feature vectors, meaning that attention can be applied in domains that use graph-structured data as well [19], [31].

In general, attention can be applied to any problem for which a set of feature vectors can be defined or extracted. As such, the general attention model presented in Fig. 2 is applicable to a wide range of domains. The problem, however, is that there is a large variety of different applications and extensions of the general attention module. As such, in Section 3, a comprehensive overview is provided of a collection of different attention mechanisms.

3 ATTENTION TAXONOMY
There are many different types of attention mechanisms and extensions, and a model can use different combinations of these attention techniques. As such, we propose a taxonomy that can be used to classify different types of attention mechanisms. Fig. 3 provides a visual overview of the different categories and subcategories that the attention mechanisms can be organized in. The three major categories are based on whether an attention technique is designed to handle specific types of feature vectors (feature-related), specific types of model queries (query-related), or whether it is simply a general mechanism that is related to neither the feature model, nor the query
model (general). Further explanations of these categories and their subcategories are provided in the following subsections. Each mechanism discussed in this section is either a modification to the existing inner mechanisms of the general attention module presented in Section 2, or an extension of it.

The presented taxonomy can also be used to analyze the architecture of attention models. Namely, the major categories and their subcategories can be interpreted as orthogonal dimensions of an attention model. An attention model can consist of a combination of techniques taken from any or all categories. Some characteristics, such as the scoring and alignment functions, are generally required for any attention model. Other mechanisms, such as multi-head attention or co-attention, are not necessary in every situation. Lastly, in Table 1, an overview of used notation with corresponding descriptions is provided.

3.1 Feature-Related Attention Mechanisms
Based on a particular set of input data, a feature model extracts feature vectors so that the attention model can attend to these various vectors. These features may have specific structures that require special attention mechanisms to handle them. These mechanisms can be categorized to deal with one of the following feature characteristics: the multiplicity of features, the levels of features, or the representations of features.

3.1.1 Multiplicity of Features
For most tasks, a model only processes a single input, such as an image, a sentence, or an acoustic sequence. We refer to such a mechanism as singular features attention. Other models are designed to use attention based on multiple inputs to allow one to introduce more information into the model that can be exploited in various ways. However, this does imply the presence of multiple feature matrices that require special attention mechanisms to be fully used. For example, [32] introduces a concept named co-attention to allow the proposed visual question answering (VQA) model to jointly attend to both an image and a question.
TABLE 1
Notation

Symbol   Description
F        Matrix of size $d_f \times n_f$ containing the feature vectors $f_1, \dots, f_{n_f} \in \mathbb{R}^{d_f}$ as columns. These feature vectors are extracted by the feature model.
K        Matrix of size $d_k \times n_f$ containing the key vectors $k_1, \dots, k_{n_f} \in \mathbb{R}^{d_k}$ as columns. These vectors are used to calculate the attention scores.
V        Matrix of size $d_v \times n_f$ containing the value vectors $v_1, \dots, v_{n_f} \in \mathbb{R}^{d_v}$ as columns. These vectors are used to calculate the context vector.
W_K      Weights matrix of size $d_k \times d_f$ used to create the $K$ matrix from the $F$ matrix.
W_V      Weights matrix of size $d_v \times d_f$ used to create the $V$ matrix from the $F$ matrix.
q        Query vector of size $d_q$. This vector essentially represents a question, and is used to calculate the attention scores.
c        Context vector of size $d_v$. This vector is the output of the attention model.
e        Score vector of size $n_f$ containing the attention scores $e_1, \dots, e_{n_f} \in \mathbb{R}^1$. These are used to calculate the attention weights.
a        Attention weights vector of size $n_f$ containing the attention weights $a_1, \dots, a_{n_f} \in \mathbb{R}^1$. These are the weights used in the calculation of the context vector.
For co-attention, attention scores are computed for both inputs. For example, the scores for the second input can be calculated as

$$e^{(2)} = w_2^T \, \mathrm{act}\!\left(W_1 K^{(1)} A + W_2 K^{(2)}\right), \qquad (11)$$

where $\mathrm{act}()$ is an activation function, $w_2 \in \mathbb{R}^{d_w}$, $W_1 \in \mathbb{R}^{d_w \times d_k^{(1)}}$, and $W_2 \in \mathbb{R}^{d_w \times d_k^{(2)}}$ are trainable weights, and $A \in \mathbb{R}^{n_f^{(1)} \times n_f^{(2)}}$ is an affinity matrix relating the keys of the two inputs. The elements of this affinity matrix can, for instance, be obtained as

$$A_{i,j} = w_A^T \, \mathrm{concat}\!\left(k_i^{(1)}, k_j^{(2)}, k_i^{(1)} \circ k_j^{(2)}\right), \qquad (13)$$

where $w_A \in \mathbb{R}^{3 d_k}$ is a vector of trainable weights. From the attention scores, attention weights are created via an alignment function, and are used to produce the context vectors.
The context vectors are concatenated and used in the output model for prediction.

A mechanism separate from co-attention that still uses multiple inputs is the rotatory attention mechanism [36]. This technique is typically used in a text sentiment analysis setting where there are three inputs involved: the phrase for which the sentiment needs to be determined (target phrase), the text before the target phrase (left context), and the text after the target phrase (right context). The words in these three inputs are all encoded by the feature model, producing the following feature matrices: $F^t = [f^t_1, \dots, f^t_{n^t_f}] \in \mathbb{R}^{d^t_f \times n^t_f}$, $F^l = [f^l_1, \dots, f^l_{n^l_f}] \in \mathbb{R}^{d^l_f \times n^l_f}$, and $F^r = [f^r_1, \dots, f^r_{n^r_f}] \in \mathbb{R}^{d^r_f \times n^r_f}$, for the target phrase words, left context words, and right context words, respectively, where $d^t_f$, $d^l_f$, and $d^r_f$ represent the dimensions of the feature vectors for the corresponding inputs, and $n^t_f$, $n^l_f$, and $n^r_f$ represent the number of feature vectors for the corresponding inputs. The feature model used in [36] consists of word embeddings and separate Bi-LSTM models for the target phrase, the left context, and the right context. This means that the feature vectors are in fact the hidden state vectors obtained from the Bi-LSTM models. Using these features, the idea is to extract a single vector $r$ from the inputs such that a softmax layer can be used for classification. As such, we are now faced with two challenges: how to represent the inputs as a single vector, and how to incorporate the information from the left and right context into that vector. [36] proposes to use the rotatory attention mechanism for this purpose.

First, a single target phrase representation $r^t \in \mathbb{R}^{d^t_f}$ is created by using a pooling layer that takes the average over the columns of $F^t$, as shown in (17).

$$r^t = \frac{1}{n^t_f} \sum_{i=1}^{n^t_f} f^t_i. \qquad (17)$$

$r^t$ is then used as a query to create a context vector out of the left and right contexts, separately. For example, for the left context, the key vectors $k^l_1, \dots, k^l_{n^l_f} \in \mathbb{R}^{d^l_k}$ and value vectors $v^l_1, \dots, v^l_{n^l_f} \in \mathbb{R}^{d^l_v}$ are extracted from the left context feature vectors $f^l_1, \dots, f^l_{n^l_f} \in \mathbb{R}^{d^l_f}$, similarly as before, where $d^l_k$ and $d^l_v$ are the dimensions of the key and value vectors, respectively. Note that [36] proposes to use the original feature vectors as keys and values, meaning that the linear transformation consists of a multiplication by an identity matrix. Next, the scores are calculated using (18).

$$e^l_i = \mathrm{score}(r^t, k^l_i). \qquad (18)$$

For the score function, [36] proposes to use an activated general score function [34] with a tanh activation function. The attention scores can be combined with an alignment function and the corresponding value vectors to produce the context vector $r^l \in \mathbb{R}^{d^l_v}$. The alignment function used in [36] takes the form of a softmax function. An analogous procedure can be performed to obtain the representation of the right context, $r^r$. These two context representations can then be used to create new representations of the target phrase, again, using attention. First, the key vectors $k^t_1, \dots, k^t_{n^t_f} \in \mathbb{R}^{d^t_k}$ and value vectors $v^t_1, \dots, v^t_{n^t_f} \in \mathbb{R}^{d^t_v}$ are extracted from the target phrase feature vectors $f^t_1, \dots, f^t_{n^t_f} \in \mathbb{R}^{d^t_f}$, similarly as before, using a linear transformation, where $d^t_k$ and $d^t_v$ are the dimensions of the key and value vectors, respectively. Note, again, that the original feature vectors are used as keys and values in [36]. The attention scores for the left-aware target representation are then calculated using (19).

$$e^{lt}_i = \mathrm{score}(r^l, k^t_i). \qquad (19)$$

The attention scores can be combined with an alignment function and the corresponding value vectors to produce the context vector $r^{lt} \in \mathbb{R}^{d^t_v}$. For this attention calculation, [34] proposes to use the same score and alignment functions as before. The right-aware target representation $r^{rt}$ can be calculated in a similar manner. Finally, to obtain the full representation vector $r$ that is used to determine the classification, the vectors $r^l$, $r^r$, $r^{lt}$, and $r^{rt}$ are concatenated together, as shown in (20).

$$r = \mathrm{concat}(r^l, r^r, r^{lt}, r^{rt}) \in \mathbb{R}^{d^l_v + d^r_v + 2 d^t_v}. \qquad (20)$$

To summarize, rotatory attention uses the target phrase to compute new representations for the left and right context using attention, and then uses these left and right representations to calculate new representations for the target phrase. The first step is designed to capture the words in the left and right contexts that are most important to the target phrase. The second step is there to capture the most important information in the actual target phrase itself. Essentially, the mechanism rotates attention between the target and the contexts to improve the representations.
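The rotation described above can be sketched as follows. The sketch is illustrative: it uses the feature vectors directly as keys and values, as in [36], but substitutes a plain dot-product score and softmax alignment for the activated general score function of [36]; all dimensions and names are our own assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Dot-product score plus softmax alignment; returns the context vector."""
    return V @ softmax(K.T @ q)

rng = np.random.default_rng(2)
d = 16                                   # shared feature dimension (e.g., Bi-LSTM hidden size)
F_t = rng.normal(size=(d, 4))            # target phrase feature vectors
F_l = rng.normal(size=(d, 7))            # left context feature vectors
F_r = rng.normal(size=(d, 6))            # right context feature vectors

# Step 1: pooled target representation (17) queries both contexts, cf. (18).
r_t = F_t.mean(axis=1)
r_l = attend(r_t, F_l, F_l)              # left context representation
r_r = attend(r_t, F_r, F_r)              # right context representation

# Step 2: the context representations query the target phrase, cf. (19).
r_lt = attend(r_l, F_t, F_t)             # left-aware target representation
r_rt = attend(r_r, F_t, F_t)             # right-aware target representation

r = np.concatenate([r_l, r_r, r_lt, r_rt])   # full representation (20)
```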
There are many applications where combining information from different inputs into a single model can be highly beneficial. For example, in the field of medical data, there are often many different types of data available, such as various scans or documents, that can provide different types of information. In [37], a co-attention mechanism is used for automatic medical report generation to attend to both images and semantic tags simultaneously. Similarly, in [38], a co-attention model is proposed that combines general demographics features and patient medical history features to predict future health information. Additionally, an ablation study is used in [38] to show that the co-attention part of the model specifically improves performance. A field where multi-feature attention has been extensively explored is the domain of recommender systems. For example, in [39], a co-attention network is proposed that attends to both product reviews and the reviews a user has written. In [40], a model is proposed for video recommendation that attends to both user features and video features. Co-attention techniques have also been used in combination with graph networks for the purpose of, for example, reading comprehension across multiple documents [41] and fake news detection [42]. In comparison to co-attention, rotatory attention has typically been explored
only in the field of sentiment analysis, which is most likely due to the specific structure of the data that is necessary to use this technique. An implementation of rotatory attention is proposed in [43] for sentiment analysis, where the mechanism is extended by repeating the attention rotation to iteratively further improve the representations.

3.1.2 Feature Levels
The previously discussed attention mechanisms process data at a single level. We refer to these attention techniques as single-level attention mechanisms. However, some data types can be analyzed and represented on multiple levels. For example, when analyzing documents, one can analyze the document at the sentence level, word level, or even the character level. When representations or embeddings of all these levels are available, one can exploit the extra levels of information. For example, one could choose to perform translation based on either just the characters, or just the words of the sentence. However, in [44], a technique named attention-via-attention is introduced that allows one to incorporate information from both the character, and the word levels. The idea is to predict the sentence translation character-by-character, while also incorporating information from a word-level attention module.

To begin with, a feature model (consisting of, for example, word embeddings and RNNs) is used to encode the input sentence into both a character-level feature matrix $F^{(c)} \in \mathbb{R}^{d^{(c)}_f \times n^{(c)}_f}$, and a word-level feature matrix $F^{(w)} \in \mathbb{R}^{d^{(w)}_f \times n^{(w)}_f}$, where $d^{(c)}_f$ and $n^{(c)}_f$ represent, respectively, the dimension of the embeddings of the characters, and the number of characters, while $d^{(w)}_f$ and $n^{(w)}_f$ represent the same but at the word level. It is crucial for this method that each level in the data can be represented or embedded.

When attempting to predict a character in the translated sentence, a query $q^{(c)} \in \mathbb{R}^{d_q}$ is created by the query model (like a character-level RNN), where $d_q$ is the dimension of the query vectors. As illustrated in Fig. 7, the query is used to calculate attention on the word-level feature vectors $F^{(w)}$. This generates the context vector $c^{(w)} \in \mathbb{R}^{d^{(w)}_v}$, where $d^{(w)}_v$ represents the dimension of the value vectors for the word-level attention module. This context vector summarizes which words contain the most important information for predicting the next character. If we know which words are most important, then it becomes easier to identify which characters in the input sentence are most important. Thus, the next step is to attend to the character-level features in $F^{(c)}$, with an additional query input: the word-level context vector $c^{(w)}$. The actual query input for the attention model will therefore be the concatenation of the query $q^{(c)}$ and the word context vector $c^{(w)}$. The output of this character-level attention module is the context vector $c^{(c)}$. The complete context output of the attention model is the concatenation of the word-level, and character-level context vectors.

The attention-via-attention technique uses representations for each level. However, accurate representations may not always be available for each level of the data, or it may be desirable to let the model create the representations during the process by building them from lower level representations. A technique referred to as hierarchical attention [5] can be used in this situation. Hierarchical attention is another technique that allows one to apply attention on different levels of the data. Yet, the exact mechanisms work quite differently compared to attention-via-attention. The idea is to start at the lowest level, and then create representations, or summaries, of the next level using attention. This process is repeated till the highest level is reached. To make this a little clearer, suppose one attempts to create a model for document classification, similarly to the implementation from [5]. We analyze a document containing $n_S$ sentences, with the $s$th sentence containing $n_s$ words, for $s = 1, \dots, n_S$. One could use attention based on just the collection of words to classify the document. However, a significant amount of important context is then left out of the analysis, since the model will consider all words as a single long sentence, and will therefore not consider the context within the separate sentences. Instead, one can use the hierarchical structure of a document (words form sentences, and sentences form the document).

Fig. 8. An illustration of hierarchical attention.

Fig. 8 illustrates the structure of hierarchical attention. For each sentence in the document, a sentence representation $c^{(s)} \in \mathbb{R}^{d^{(S)}_v}$ is produced, for $s = 1, \dots, n_S$, where $d^{(S)}_v$ is the dimension of the value vectors used in the attention model for sentence representations (Attention Module$_S$ in Fig. 8). The representation is a context vector from an attention module that essentially summarizes the sentence. Each sentence is first put through a feature model to extract the feature matrix $F^{(s)} \in \mathbb{R}^{d^{(S)}_f \times n_s}$, for $s = 1, \dots, n_S$, where $d^{(S)}_f$ represents the dimension of the feature vector for each word, and $n_s$ represents the number of words in sentence $s$. For extra clarification, the columns of $F^{(s)}$ are feature vectors that correspond to the words in sentence $s$. As shown in Fig. 8, each feature matrix $F^{(s)}$ is used as input for an attention model, which produces the context vector $c^{(s)}$, for each $s = 1, \dots, n_S$. No queries are used in this step, so it can be considered a self-attentive mechanism. The context vectors are essentially summaries of the words in the sentences. The matrix of context vectors $C = [c^{(1)}, \dots, c^{(n_S)}] \in \mathbb{R}^{d^{(S)}_v \times n_S}$
is constructed by grouping all the obtained context vectors together as columns. Finally, attention is calculated using $C$ as feature input, producing the representation of the entire document in the context vector $c^{(D)} \in \mathbb{R}^{d^{(D)}_v}$, where $d^{(D)}_v$ is the dimension of the value vectors in the attention model for document representation (Attention Module$_D$ in Fig. 8). This context vector can be used to classify the document, since it is essentially a summary of all the sentences (and therefore also the words) in the document.
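A minimal sketch of this two-level procedure is given below. It is illustrative only: the query-free (self-attentive) scoring is implemented with a trainable score vector, which is one of the options discussed in Section 3.3.1, the feature vectors are used directly as values, and all dimensions are our own assumptions rather than those of [5].

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attentive_summary(F, w):
    """Query-free attention: a trainable vector w scores each column of F."""
    a = softmax(F.T @ w)          # attention weights over the columns of F
    return F @ a                  # context vector summarizing F

rng = np.random.default_rng(3)
d = 12                                                       # feature (and value) dimension
sentences = [rng.normal(size=(d, n)) for n in (5, 8, 6)]     # one feature matrix per sentence

w_word = rng.normal(size=d)       # parameters of the word-level (sentence summary) module
w_sent = rng.normal(size=d)       # parameters of the sentence-level (document summary) module

# Attention Module S: summarize each sentence into a context vector c^(s).
C = np.stack([self_attentive_summary(F_s, w_word) for F_s in sentences], axis=1)

# Attention Module D: summarize the sentence summaries into the document vector c^(D).
c_D = self_attentive_summary(C, w_sent)
```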
Multi-level models can be used in a variety of tasks. For example, in [28], hierarchical attention is used in a recommender system to model user preferences at the long-term level and the short-term level. Similarly, [45] proposes a hierarchical model for recommending social media images based on user preferences. Hierarchical attention has also been successfully applied in other domains. For example, [46] proposes to use hierarchical attention in a video action recognition model to capture motion information at the long-term level and the short-term level. Furthermore, [47] proposes a hierarchical attention model for cross-domain sentiment classification. In [48], a hierarchical attention model for chatbot response generation is proposed. Lastly, using image data, [49] proposes a hierarchical attention model for crowd counting.

3.1.3 Feature Representations
In a basic attention model, a single embedding or representation model is used to produce feature representations for the model to attend to. This is referred to as single-representational attention. Yet, one may also opt to incorporate multiple representations into the model. In [50], it is argued that allowing a model access to multiple embeddings can allow one to create even higher quality representations. Similarly, [51] incorporates multiple representations of the same book (textual, syntactic, semantic, visual etc.) into the feature model. Feature representations are an important part of the attention model, but attention can also be an important part of the feature model. The idea is to create a new representation by taking a weighted average of multiple representations, where the weights are determined via attention. This technique is referred to as multi-representational attention, and allows one to create so-called meta-embeddings. Suppose one wants to create a meta-embedding for a word $x$ for which $E$ embeddings $x^{(e_1)}, \dots, x^{(e_E)}$ are available. Each embedding $x^{(e_i)}$ is of size $d_{e_i}$, for $i = 1, \dots, E$. Since not all embeddings are of the same size, a transformation is performed to normalize the embedding dimensions. Using embedding-specific weight parameters, each embedding $x^{(e_i)}$ is transformed into the size-normalized embedding $x^{(t_i)} \in \mathbb{R}^{d_t}$, where $d_t$ is the size of every transformed word embedding, as shown in (21).

$$x^{(t_i)} = W_{e_i} x^{(e_i)} + b_{e_i}, \qquad (21)$$

where $W_{e_i} \in \mathbb{R}^{d_t \times d_{e_i}}$ and $b_{e_i} \in \mathbb{R}^{d_t}$ are trainable, embedding-specific weight parameters. The final embedding $x^{(e)} \in \mathbb{R}^{d_t}$ is a weighted average of the previously calculated transformed representations, as shown in (22).

$$x^{(e)} = \sum_{i=1}^{E} a_i x^{(t_i)}. \qquad (22)$$

The final representation $x^{(e)}$ can be interpreted as the context vector from an attention model, meaning that the weights $a_1, \dots, a_E \in \mathbb{R}^1$ are attention weights. Attention can be calculated as normal, where the columns of the features matrix $F$ are the transformed representations $x^{(t_1)}, \dots, x^{(t_E)}$. The query in this case can be ignored since it is constant in all cases. Essentially, the query is "Which representations are the most important?" in every situation. As such, this is a self-attentive mechanism.
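The following sketch illustrates (21) and (22) for a single word with three hypothetical embeddings of different sizes; the use of a trainable score vector for the query-free weighting is our own simplification, and all sizes are invented for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d_t = 10                                                     # size of the transformed embeddings
embeddings = [rng.normal(size=d) for d in (50, 100, 25)]     # E = 3 embeddings of one word

# Size normalization (21): one embedding-specific projection per source.
W = [rng.normal(size=(d_t, x.size)) for x in embeddings]
b = [rng.normal(size=d_t) for _ in embeddings]
X_t = np.stack([W_i @ x + b_i for W_i, x, b_i in zip(W, embeddings, b)], axis=1)

# Self-attentive weighting (22): the constant query is replaced by a trainable score vector.
w_score = rng.normal(size=d_t)
a = softmax(X_t.T @ w_score)     # one attention weight per representation
x_e = X_t @ a                    # meta-embedding: weighted average of the transformed embeddings
```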
While an interesting idea, applications of multi-representational attention are limited. One example of the application of this technique is found in [52], where a multi-representational attention mechanism has been applied to generate multi-lingual meta-embeddings. Another example is [53], where a multi-representational text classification model is proposed that incorporates different representations of the same text. For example, the proposed model uses embeddings from part-of-speech tagging, named entity recognizers, and character-level and word-level embeddings.

3.2 General Attention Mechanisms
This major category consists of attention mechanisms that can be applied in any type of attention model. The structure of this component can be broken down into the following sub-aspects: the attention score function, the attention alignment, and attention dimensionality.
Multi-dimensional attention has also been used to extend graph attention networks for dialogue state tracking. Lastly, for the task of next-item recommendation, [70] proposes a model that incorporates multi-dimensional attention.

3.3 Query-Related Attention Mechanisms
Queries are an important part of any attention model, since they directly determine which information is extracted from the feature vectors. These queries are based on the desired output of the task model, and can be interpreted as literal questions. Some queries have specific characteristics that require specific types of mechanisms to process them. As such, this category encapsulates the attention mechanisms that deal with specific types of query characteristics. The mechanisms in this category deal with one of the two following query characteristics: the type of queries or the multiplicity of queries.

3.3.1 Type of Queries
Different attention models employ attention for different purposes, meaning that distinct query types are necessary. There are basic queries, which are queries that are typically straightforward to define based on the data and model. For example, the hidden state for one prediction in an RNN is often used as the query for the next prediction. One could also use a vector of auxiliary variables as query. For example, when doing medical image classification, general patient characteristics can be incorporated into a query.

Some attention mechanisms, such as co-attention, rotatory attention, and attention-over-attention, use specialized queries. For example, rotatory attention uses the context vector from another attention module as query, while interactive co-attention uses an averaged keys vector based on another input. Another case one can consider is when attention is calculated based purely on the feature vectors. This concept has been mentioned before and is referred to as self-attention or intra-attention [71]. We say that the models use self-attentive queries. There are two ways of interpreting such queries. First, one can say that the query is constant. For example, document classification requires only a single classification as the output of the model. As such, the query is always the same, namely: "What is the class of the document?". The query can be ignored and attention can be calculated based only on the features themselves. Score functions can be adjusted for this by making the query vector a vector of constants or removing it entirely:

$$\mathrm{score}(k_l) = w^T \mathrm{act}(W k_l + b), \qquad (33)$$

where $w \in \mathbb{R}^{d_w}$, $W \in \mathbb{R}^{d_w \times d_k}$, and $b \in \mathbb{R}^{d_w}$ are trainable weights. Additionally, one can also interpret self-attention as learning the query along the way, meaning that the query can be defined as a trainable vector of weights. For example, the dot-product score function may take the following form:

$$\mathrm{score}(k_l) = q^T k_l, \qquad (34)$$

where $q \in \mathbb{R}^{d_k}$ is a trainable vector of weights. One could also interpret vector $b \in \mathbb{R}^{d_w}$ as the query in (33). Another use of self-attention is to uncover the relations between the feature vectors $f_1, \dots, f_{n_f}$. These relations can then be used as additional information to incorporate into new representations of the feature vectors. With basic attention mechanisms, the keys matrix $K$, and the values matrix $V$ are extracted from the features matrix $F$, while the query $q$ is produced separately. For this type of self-attention, the query vectors are extracted in a similar process as the keys and values, via a transformation matrix of trainable weights $W_Q \in \mathbb{R}^{d_q \times d_f}$. We define the matrix $Q = [q_1, \dots, q_{n_f}] \in \mathbb{R}^{d_q \times n_f}$, which can be obtained as follows:

$$Q = W_Q F. \qquad (35)$$

Each column of $Q$ can be used as the query for the attention model. When attention is calculated using a query $q$, the resulting context vector $c$ will summarize the information in the feature vectors that is important to the query. Since the query, or a column of $Q$, is now also a feature vector representation, the context vector contains the information of all feature vectors that are important to that specific feature vector. In other words, the context vectors capture the relations between the feature vectors. For example, self-attention allows one to extract the relations between words: which verbs refer to which nouns, which pronouns refer to which nouns, etc. For images, self-attention can be used to determine which image regions relate to each other.

While self-attention is placed in the query-related category, it is also very much related to the feature model. Namely, self-attention is a technique that is often used in the feature model to create improved representations of the feature vectors. For example, the Transformer model for language processing [13], and the Transformer model for image processing [15], both use multiple rounds of (multi-head) self-attention to improve the representation of the feature vectors. The relations captured by the self-attention mechanism are incorporated into new representations. A simple method of determining such a new representation is to simply set the feature vectors equal to the acquired self-attention context vectors [71], as presented in (36).

$$f^{(\mathrm{new})} = c, \qquad (36)$$

where $f^{(\mathrm{new})}$ is the updated feature vector. Another possibility is to add the context vectors to the previous feature vectors with an additional normalization layer [13]:

$$f^{(\mathrm{new})} = \mathrm{Normalize}\!\left(f^{(\mathrm{old})} + c\right), \qquad (37)$$

where $f^{(\mathrm{old})}$ is the previous feature vector, and $\mathrm{Normalize}()$ is a normalization layer [72]. Using such techniques, self-attention has been used to create improved word or sentence embeddings that enhance model accuracy [71].
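A compact sketch of (35) through (37) is given below. It is illustrative only: it scores all queries against all keys with dot-products and a column-wise softmax, keeps the values equal to the features, and replaces the layer normalization of [72] with a simple column-wise normalization as a stand-in.

```python
import numpy as np

def softmax_cols(E):
    """Column-wise softmax: each column of E becomes a weight vector summing to 1."""
    E = E - E.max(axis=0, keepdims=True)
    e = np.exp(E)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(5)
d_f, n_f, d_k = 8, 6, 8            # d_v = d_f here so that (36)/(37) are well defined
F = rng.normal(size=(d_f, n_f))

W_Q = rng.normal(size=(d_k, d_f))  # query projection (35)
W_K = rng.normal(size=(d_k, d_f))  # key projection
W_V = np.eye(d_f)                  # values kept equal to the features for simplicity

Q, K, V = W_Q @ F, W_K @ F, W_V @ F

E = K.T @ Q                        # scores: column j holds the scores for query q_j
A = softmax_cols(E)                # attention weights, one column per query
C = V @ A                          # one context vector per feature vector, shape (d_f, n_f)

F_new = C                                                        # update rule (36)
F_res = (F + C) / np.linalg.norm(F + C, axis=0, keepdims=True)   # residual plus a simple stand-in for (37)
```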
Self-attention is arguably one of the more important types of attention, partly due to its vital role in the highly popular Transformer model. Self-attention is a very general mechanism and can be applied to practically any problem. As such, self-attention has been extensively explored in many different fields in both Transformer-based architectures and other types of models.
For example, in [73], self-attention is explored for image recognition tasks, and results indicate that the technique may have substantial advantages with regards to robustness and generalization. In [74], self-attention is used in a generative adversarial network (GAN) [75] to determine which regions of the input image to focus on when generating the regions of a new image. In [76], self-attention is used to design a state-of-the-art medical image segmentation model. Naturally, self-attention can also be used for video processing. In [77], a self-attention model is proposed for the purpose of video summarization that reaches state-of-the-art results. In other fields, like audio processing, self-attention has been explored as well. In [78], self-attention is used to create a speech recognition model. Self-attention has also been explored in overlapping domains. For example, in [79], the self-attention Transformer architecture is used to create a model that can recognize phrases from audio and by lip-reading from a video. For the problem of next item recommendation, [80] proposes a Transformer model that explicitly captures item-item relations using self-attention. Self-attention also has applications in many natural language processing fields. For example, in [81], self-attention is used for sentiment analysis. Self-attention is also highly popular for graph models. For example, self-attention is explored in [82] for the purpose of representation learning in communication networks and rating networks. Additionally, the first attention model for graph networks was based on self-attention [83].

Fig. 9. An illustration of multi-head attention.
Fig. 10. An example illustration of multi-hop attention. Solid arrows represent the base multi-hop model structure, while dotted arrows represent optional connections.

In multi-head attention [13], each attention head applies its own linear transformations of $F$. As such, each attention head has its own learnable weights matrices $W^{(j)}_q$, $W^{(j)}_K$, and $W^{(j)}_V$ for these transformations. The calculations of the query, keys, and values for the $j$th head are defined as follows:

$$q^{(j)} = W^{(j)}_q q, \qquad K^{(j)} = W^{(j)}_K F, \qquad V^{(j)} = W^{(j)}_V F. \qquad (38)$$

Thus, each head creates its own representations of the query $q$, and the input matrix $F$. Each head can therefore learn to focus on different parts of the inputs, allowing the model to attend to more information. For example, when training a machine translation model, one attention head can learn to focus on which nouns (e.g., student, car, apple) do certain verbs (e.g., walking, driving, buying) refer to, while another attention head learns to focus on which nouns refer to certain pronouns (e.g., he, she, it) [13]. Each head will also create its own vector of attention scores $e^{(j)} = [e^{(j)}_1, \dots, e^{(j)}_{n_f}] \in \mathbb{R}^{n_f}$, and a corresponding vector of attention weights $a^{(j)} = [a^{(j)}_1, \dots, a^{(j)}_{n_f}] \in \mathbb{R}^{n_f}$. As can be expected, each attention model produces its own context vector $c^{(j)} \in \mathbb{R}^{d_v}$, as follows:

$$c^{(j)} = \sum_{l=1}^{n_f} a^{(j)}_l v^{(j)}_l. \qquad (39)$$
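The per-head computation of (38) and (39) can be sketched as follows. The dot-product score, the softmax alignment, and the concatenation of the head outputs (a common way of combining heads [13]) are our own assumptions, as are the dimensions, which are chosen such that $d_q = d_k$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_head(q, F, W_q, W_K, W_V):
    """One head: head-specific projections (38), then scores, weights and context (39)."""
    q_j, K_j, V_j = W_q @ q, W_K @ F, W_V @ F
    a_j = softmax(K_j.T @ q_j)          # head-specific attention weights
    return V_j @ a_j                    # head-specific context vector c^(j)

rng = np.random.default_rng(6)
d_f, n_f, n_heads = 8, 5, 3
d_q = d_k = d_v = 4                     # d_q == d_k so a dot-product score can be used
F, q = rng.normal(size=(d_f, n_f)), rng.normal(size=d_q)

heads = [
    attention_head(q, F,
                   rng.normal(size=(d_q, d_q)),    # W_q^(j)
                   rng.normal(size=(d_k, d_f)),    # W_K^(j)
                   rng.normal(size=(d_v, d_f)))    # W_V^(j)
    for _ in range(n_heads)
]
c = np.concatenate(heads)   # combine the head outputs, here by concatenation
```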
In capsule-based attention, a separate attention module is used for each output class. These attention modules all use self-attentive queries, so each module learns its own query: "Which feature vectors are important to identify this class?". In [89], a self-attentive multiplicative score function is used for this purpose:

$$e_{c,l} = q_c^T k_l, \qquad (43)$$

where $e_{c,l} \in \mathbb{R}^1$ is the attention score for vector $l$ in capsule $c$, and $q_c \in \mathbb{R}^{d_k}$ is a trainable query for capsule $c$, for $c = 1, \dots, d_y$. Each attention module then uses an alignment function, and uses the produced attention weights to determine a context vector $c_c \in \mathbb{R}^{d_v}$. Next, the context vector $c_c$ is fed through a probability layer consisting of a linear transformation with a sigmoid activation function:

$$p_c = \mathrm{sigmoid}(w_c^T c_c + b_c), \qquad (44)$$

where $w_c \in \mathbb{R}^{d_v}$ and $b_c \in \mathbb{R}^1$ are trainable capsule-specific weights parameters, and $p_c \in \mathbb{R}^1$ is the predicted probability that the correct class is class $c$. The final layer is the reconstruction module that creates a class vector representation. This representation $r_c \in \mathbb{R}^{d_v}$ is determined by simply multiplying the context vector $c_c$ by the probability $p_c$:

$$r_c = p_c c_c. \qquad (45)$$

The capsule representation is used when training the model. First of all, the model is trained to predict the probabilities $p_1, \dots, p_{d_y}$ as accurately as possible compared to the true values. Second, via a joint loss function, the model is also trained to accurately construct the capsule representations $r_1, \dots, r_{d_y}$. A features representation $f \in \mathbb{R}^{d_f}$ is defined, which is simply the unweighted average of the original feature vectors. The idea is to train the model such that vector representations from capsules that are not the correct class differ significantly from $f$, while the representation from the correct capsule is very similar to $f$. A dot-product between the capsule representations and the features representation is used in [89] as a measure of the distance between the vectors. Note that $d_v$ must equal $d_f$ in this case, otherwise the vectors would have incompatible dimensions. Interestingly, since attention is calculated for each class individually, one can track which specific feature vectors are important for which specific class. In [89], this idea is used to discover which words correspond to which sentiment class.
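A minimal sketch of (43) through (45) is given below, with one attention module per class; the softmax alignment, the use of the feature vectors as keys and values, and all dimensions are our own assumptions rather than the exact setup of [89].

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(7)
d_f, n_f, d_y = 8, 6, 3            # d_v = d_k = d_f so the reconstructions are comparable to f
F = rng.normal(size=(d_f, n_f))    # keys and values are the feature vectors themselves

probs, reconstructions = [], []
for c_idx in range(d_y):                        # one attention module per class capsule
    q_c = rng.normal(size=d_f)                  # trainable class-specific query, cf. (43)
    a_c = softmax(F.T @ q_c)                    # alignment of the class-specific scores
    c_c = F @ a_c                               # capsule context vector
    w_c, b_c = rng.normal(size=d_f), rng.normal()
    p_c = sigmoid(w_c @ c_c + b_c)              # class probability (44)
    probs.append(p_c)
    reconstructions.append(p_c * c_c)           # capsule representation (45)

f_avg = F.mean(axis=1)                          # features representation used in the joint loss
similarities = [r @ f_avg for r in reconstructions]   # dot-product comparison as in [89]
```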
The number of tasks that can make use of multiple queries is substantial, due to how general the mechanisms are. As such, the techniques described in this section have been extensively explored in various domains. For example, multi-head attention has been used for speaker recognition based on audio spectrograms [91]. In [92], multi-head attention is used for recommendation of news articles. Additionally, multi-head attention can be beneficial for graph attention models as well [83]. As for multi-hop attention, quite a few papers have been mentioned before, but there are still many other interesting examples. For example, in [93], a multi-hop attention model is proposed for medication recommendation. Furthermore, practically every Transformer model makes use of both multi-head and multi-hop attention. The Transformer model has been extensively explored in various domains. For example, in [94], a Transformer model is implemented for image captioning. In [95], Transformers are explored for medical image segmentation. In [96], a Transformer model is used for emotion recognition in text messages. A last example of an application of Transformers is [17], which proposes a Transformer model for recommender systems. In comparison with multi-head and multi-hop attention, capsule-based attention is arguably the least popular of the mechanisms discussed for the multiplicity of queries. One example is [97], where an attention-based capsule network is proposed that also includes a multi-hop attention mechanism for the purpose of visual question answering. Another example is [98], where capsule-based attention is used for aspect-level sentiment analysis of restaurant reviews.

The multiplicity of queries is a particularly interesting category due to the Transformer model [13], which combines a form of multi-hop and multi-head attention. Due to the initial success of the Transformer model, many improvements and iterations of the model have been produced that typically aim to improve the predictive performance, the computational efficiency, or both. For example, the Transformer-XL [99] is an extension of the original Transformer that uses a recurrence mechanism to not be limited by a context window when processing the outputs. This allows the model to learn significantly longer dependencies while also being computationally more efficient during the evaluation phase. Another extension of the Transformer is known as the Reformer model [100]. This model is significantly more efficient computationally, by means of locality-sensitive hashing, and memory-wise, by means of reversible residual layers. Such computational improvements are vital, since one of the main disadvantages of the Transformer model is the sheer computational cost due to the complexity of the model scaling quadratically with the number of input feature vectors. The Linformer model [101] manages to reduce the complexity of the model to scale linearly, while achieving similar performance as the Transformer model. This is achieved by approximating the attention weights using a low-rank matrix. The Lite-Transformer model proposed in [102] achieves similar results by implementing two branches within the Transformer block that specialize in capturing global and local information. Another interesting Transformer architecture is the Synthesizer [103]. This model replaces the pairwise self-attention mechanism with "synthetic" attention weights. Interestingly, the performance of this model is relatively close to the original Transformer, meaning that the necessity of the pairwise self-attention mechanism of the Transformer model may be questionable. For a more comprehensive overview of Transformer architectures, we refer to [104].

4 EVALUATION OF ATTENTION MODELS
In this section, we present various types of evaluation for attention models. First, one can evaluate the structure of attention models using the taxonomy presented in Section 3. For such an analysis, we consider the attention mechanism categories (see Fig. 3) as orthogonal dimensions of a model. The structure of a model can be analyzed by determining which mechanism a model uses for each category.
Table 3 provides an overview of attention models found in the literature with a corresponding analysis based on the attention mechanisms the models implement.

TABLE 3
Attention Models Analyzed Based on the Proposed Taxonomy
A plus sign (+) between two mechanisms indicates that both techniques were combined in the same model, while a comma (,) indicates that both mechanisms were tested in the same paper, but not necessarily as a combination in the same model.

Second, we discuss various techniques for evaluating the performance of attention models. The performance of attention models can be evaluated using extrinsic or intrinsic performance measures, which are discussed in Sections 4.1 and 4.2, respectively.

4.1 Extrinsic Evaluation
In general, the performance of an attention model is measured using extrinsic performance measures. For example, performance measures typically used in the field of natural language processing are the BLEU [107], METEOR [108], and Perplexity [109] metrics. In the field of audio processing, the Word Error Rate [110] and Phoneme Error Rate [111] are generally employed. For general classification tasks, error rates, precision, and recall are generally used. For computer vision tasks, the PSNR [112], SSIM [113], or IoU [114] metrics are used. Using these performance measures, an attention model can either be compared to other state-of-the-art models, or an ablation study can be performed. If possible, the importance of the attention mechanism can be tested by replacing it with another mechanism and observing whether the overall performance of the model decreases [105], [115]. An example of this is replacing the weighted average used to produce the context vector with a simple unweighted average and observing whether there is a decrease in overall model performance [35]. This ablation method can be used to evaluate whether the attention weights can actually distinguish important from irrelevant information.

4.2 Intrinsic Evaluation
Attention models can also be evaluated using attention-specific intrinsic performance measures. In [4], the attention weights are formally evaluated via the Alignment Error Rate (AER) to measure the accuracy of the attention weights with respect to annotated attention vectors. [116] incorporates this idea into an attention model by supervising the attention mechanism using gold attention vectors. A joint loss function consisting of the regular task-specific loss and the attention weights loss function is constructed for this purpose. The gold attention vectors are based on annotated text data sets where keywords are hand-labelled. However, since attention is inspired by human attention, one could evaluate attention models by comparing them to the attention behaviour of humans.

4.2.1 Evaluation via Human Attention
In [117], the concept of attention correctness is proposed, which is a quantitative intrinsic performance metric that evaluates the quality of the attention mechanism based on actual human attention behaviour. First, the calculation of this metric requires data that includes the attention behaviour of a human. For example, this could be a data set containing images with the corresponding regions that a human focuses on when performing a certain task, such as image captioning. The collection of regions focused on by the human is referred to as the ground truth region. Suppose an attention model attends to the $n_f$ feature vectors $f_1, \dots, f_{n_f} \in \mathbb{R}^{d_f}$. Feature vector $f_i$ corresponds to region $R_i$ of the given image, for $i = 1, \dots, n_f$. We define the set $G$ as the set of regions that belong to the ground truth region, such that $R_i \in G$ if $R_i$ is part of the ground truth region. The attention model calculates the attention weights $a_1, \dots, a_{n_f} \in \mathbb{R}^1$ via the usual attention process. The Attention Correctness (AC) metric can then be calculated using (46).

$$AC = \sum_{i: R_i \in G} a_i. \qquad (46)$$

Thus, this metric is equal to the sum of the attention weights for the ground truth regions. Since the attention weights sum up to 1 due to, for example, a softmax alignment
function, the AC value will be a value between 0 and 1. If the model attends to only the ground truth regions, then AC is equal to 1, and if the attention model does not attend to any of the ground truth regions, AC will be equal to 0.
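The metric in (46) amounts to a masked sum of the attention weights, as in the following short sketch (function and variable names are ours).

```python
import numpy as np

def attention_correctness(a, in_ground_truth):
    """AC (46): sum of the attention weights over the ground-truth regions."""
    a = np.asarray(a, dtype=float)
    mask = np.asarray(in_ground_truth, dtype=bool)
    return float(a[mask].sum())

# toy example: four regions, the first two form the ground truth region
print(attention_correctness([0.5, 0.3, 0.1, 0.1], [True, True, False, False]))  # 0.8
```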
In [118], a rank correlation metric is used to compare the tain parts of the data are focused on, rather than concluding
generated attention weights to the attention behaviour of that those parts are significant to the problem [121]. How-
humans. The conclusion of this work is that attention maps ever, one should still be cautious as the viability of such
generated by standard attention models generally do not approaches can depend on the model architecture [122].
correspond to human attention. Attention models often
focus on much larger regions or multiple small non-adjacent
regions. As such, a technique to improve attention models is 5 CONCLUSION
to allow the model to learn from human attention patterns In this survey, we have provided an overview of recent
via a joint loss of the regular loss function and an attention research on attention models in deep learning. Attention
weight loss function based on the human gaze behaviour, mechanisms have been a prominent development for deep
similarly to how annotated attention vectors are used in learning models as they have shown to improve model per-
[116] to supervise the attention mechanism. [117] proposes formance significantly, producing state-of-the-art results for
to use human attention data to supervise the attention various tasks in several fields of research. We have pre-
mechanism in such a manner. Similarly, a state-of-the-art sented a comprehensive taxonomy that can be used to cate-
video captioning model is proposed in [119] that learns gorize and explain the diverse number of attention
from human gaze data to improve the attention mechanism. mechanisms proposed in the literature. The organization of
4.2.2 Manual Evaluation
A method that is often used to evaluate attention models is the manual inspection of attention weights. As previously mentioned, the attention weights are a direct indication of which parts of the data the attention model finds most important. Therefore, observing which parts of the inputs the model focuses on can be helpful in determining if the model is behaving correctly. This allows for some interpretation of the behaviour of models that are typically known to be black boxes. However, rather than checking if the model focuses on the most important parts of the data, some use the attention weights to determine which parts of the data are most important. This would imply that attention models provide a type of explanation, which is a subject of contention among researchers. Particularly, in [120], extensive experiments are conducted for various natural language processing tasks to investigate the relation between attention weights and important information to determine whether attention can actually provide meaningful explanations. In this paper, titled “Attention is not Explanation”, it is found that attention weights do not tend to correlate with important features. Additionally, the authors are able to replace the produced attention weights with completely different values while keeping the model output the same. These so-called “adversarial” attention distributions show that an attention model may focus on completely different information and still come to the same conclusions, which makes interpretation difficult. Yet, in another paper, titled “Attention is not not Explanation” [121], the claim that attention is not explanation is questioned by challenging the assumptions of the previous work. It is found that the adversarial attention distributions do not perform as reliably well as the learned attention weights, indicating that it was not proved that attention is not viable for explanation.
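One lightweight version of this check, in the spirit of the experiments in [120], is to scramble the learned attention weights and measure how much the model output changes; the sketch below assumes a hypothetical output_from_context function that maps a context vector to the model's prediction.

    import torch

    def output_shift_under_permutation(features, attn_weights, output_from_context):
        # Prediction obtained with the learned attention weights.
        original = output_from_context(attn_weights @ features)
        # "Adversarial" variant: assign the same weights to different feature
        # vectors via a random permutation and recompute the prediction.
        permuted_weights = attn_weights[torch.randperm(attn_weights.size(0))]
        alternative = output_from_context(permuted_weights @ features)
        # A small shift means that very different attention patterns can produce
        # (nearly) the same output, which complicates interpreting the weights.
        return torch.norm(original - alternative).item()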
In general, the conclusion regarding the interpretability of attention models is that researchers must be extremely careful when drawing conclusions based on attention patterns. For example, problems with an attention model can be diagnosed via the attention weights if the model is found to focus on the incorrect parts of the data, if such information is available. Yet, conversely, attention weights may only be used to obtain plausible explanations for why certain parts of the data are focused on, rather than concluding that those parts are significant to the problem [121]. However, one should still be cautious, as the viability of such approaches can depend on the model architecture [122].

5 CONCLUSION
In this survey, we have provided an overview of recent research on attention models in deep learning. Attention mechanisms have been a prominent development for deep learning models, as they have been shown to improve model performance significantly, producing state-of-the-art results for various tasks in several fields of research. We have presented a comprehensive taxonomy that can be used to categorize and explain the diverse number of attention mechanisms proposed in the literature. The organization of the taxonomy was motivated by the structure of a task model that consists of a feature model, an attention model, a query model, and an output model. Furthermore, the attention mechanisms have been discussed using a framework based on queries, keys, and values. Last, we have shown how one can use extrinsic and intrinsic measures to evaluate the performance of attention models, and how one can use the taxonomy to analyze the structure of attention models.

The attention mechanism is typically relatively simple to understand and implement and can lead to significant improvements in performance. As such, it is no surprise that this is a highly active field of research, with new attention mechanisms and models being developed constantly. Not only are new mechanisms consistently being developed, but there is also still ample opportunity for the exploration of existing mechanisms for new tasks. For example, multi-dimensional attention [64] is a technique that shows promising results and is general enough to be implemented in almost any attention model. However, it has not seen much application in current works. Similarly, multi-head attention [13] is a technique that can be efficiently parallelized and implemented in practically any attention model. Yet, it is mostly seen only in Transformer-based architectures. Lastly, similarly to how [43] combines rotatory attention with multi-hop attention, combining multi-dimensional attention, multi-head attention, capsule-based attention, or any of the other mechanisms presented in this survey may produce new state-of-the-art results for the various fields of research mentioned in this survey.

This survey has mainly focused on attention mechanisms for supervised models, since these comprise the largest proportion of the attention models in the literature. In comparison to the total amount of research that has been done on attention models, research on attention models for semi-supervised learning [123], [124] or unsupervised learning [125], [126] has received limited attention and has only become active recently. Attention may play a more significant role for such tasks in the future, as obtaining large amounts of labeled data is a difficult task.
Yet, as larger and more detailed data sets become available, the research on attention models can advance even further. For example, we mentioned the fact that attention weights can be trained directly based on hand-annotated data [116] or actual human attention behaviour [117], [119]. As new data sets are released, future research may focus on developing attention models that can incorporate those types of data.

While attention is intuitively easy to understand, there still is a substantial lack of theoretical support for attention. As such, we expect more theoretical studies to additionally contribute to the understanding of the attention mechanisms in complex deep learning systems. Nevertheless, the practical advantages of attention models are clear. Since attention models provide significant performance improvements in a variety of fields, and as there are ample opportunities for more advancements, we foresee that these models will still receive significant attention in the time to come.
REFERENCES
[1] H. Larochelle and G. E. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine,” in Proc. 24th Annu. Conf. Neural Inf. Process. Syst., 2010, pp. 1243–1251.
[2] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in Proc. 27th Annu. Conf. Neural Inf. Process. Syst., 2014, pp. 2204–2212.
[3] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. 3rd Int. Conf. Learn. Representations, 2015.
[4] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2015, pp. 1412–1421.
[5] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technologies, 2016, pp. 1480–1489.
[6] Y. Wang, M. Huang, X. Zhu, and L. Zhao, “Attention-based LSTM for aspect-level sentiment classification,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2016, pp. 606–615.
[7] P. Anderson et al., “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6077–6086.
[8] K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[9] Y. Ma, H. Peng, and E. Cambria, “Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5876–5883.
[10] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Proc. 28th Annu. Conf. Neural Inf. Process. Syst., 2015, pp. 577–585.
[11] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2016, pp. 4945–4949.
[12] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2017, pp. 4835–4839.
[13] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Annu. Conf. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[14] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proc. 8th Workshop Syntax Semantics Structure Statist. Transl., 2014, pp. 103–111.
[15] N. Parmar et al., “Image Transformer,” in Proc. 35th Int. Conf. Mach. Learn., 2018, pp. 4055–4064.
[16] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end dense video captioning with masked transformer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8739–8748.
[17] F. Sun et al., “BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer,” in Proc. 28th ACM Int. Conf. Inf. Knowl. Manage., 2019, pp. 1441–1450.
[18] F. Wang and D. M. J. Tax, “Survey on the attention based RNN model and its applications in computer vision,” 2016, arXiv:1601.06823.
[19] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh, “Attention models in graphs: A survey,” ACM Trans. Knowl. Discovery Data, vol. 13, pp. 62:1–62:25, 2019.
[20] S. Chaudhari, V. Mithal, G. Polatkan, and R. Ramanath, “An attentive survey of attention models,” ACM Trans. Intell. Syst. Technol., vol. 12, no. 5, pp. 1–32, 2021.
[21] D. Hu, “An introductory survey on attention mechanisms in NLP problems,” in Proc. Intell. Syst. Conf., ser. AISC, vol. 1038, 2020, pp. 432–448.
[22] A. Galassi, M. Lippi, and P. Torroni, “Attention, please! A critical review of neural attention models in natural language processing,” 2019, arXiv:1902.02181.
[23] M. Daniluk, T. Rocktäschel, J. Welbl, and S. Riedel, “Frustratingly short attention spans in neural language modeling,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[24] Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley, “Attention and localization based on a deep convolutional recurrent model for weakly supervised audio tagging,” in Proc. 18th Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 3083–3087.
[25] C. Yu, K. S. Barsim, Q. Kong, and B. Yang, “Multi-level attention model for weakly supervised audio classification,” in Proc. Detection Classification Acoustic Scenes Events Workshop, 2018, pp. 188–192.
[26] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” in Proc. 4th Int. Conf. Learn. Representations Workshop, 2016.
[27] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based LSTM and semantic consistency,” IEEE Trans. Multimedia, vol. 19, no. 9, pp. 2045–2055, Sep. 2017.
[28] H. Ying et al., “Sequential recommender system based on hierarchical attention networks,” in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, pp. 3926–3932.
[29] H. Song, D. Rajan, J. Thiagarajan, and A. Spanias, “Attend and diagnose: Clinical time series analysis using attention models,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 4091–4098.
[30] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj, “Temporal attention-augmented bilinear network for financial time-series data analysis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1407–1418, May 2019.
[31] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Proc. 6th Int. Conf. Learn. Representations, 2018.
[32] J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in Proc. 30th Annu. Conf. Neural Inf. Process. Syst., 2016, pp. 289–297.
[33] F. Fan, Y. Feng, and D. Zhao, “Multi-grained attention network for aspect-level sentiment classification,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2018, pp. 3433–3442.
[34] D. Ma, S. Li, X. Zhang, and H. Wang, “Interactive attention networks for aspect-level sentiment classification,” in Proc. 26th Int. Joint Conf. Artif. Intell., 2017, pp. 4068–4074.
[35] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention flow for machine comprehension,” in Proc. 4th Int. Conf. Learn. Representations, 2016.
[36] S. Zheng and R. Xia, “Left-center-right separated neural network for aspect-based sentiment analysis with rotatory attention,” 2018, arXiv:1802.00892.
[37] B. Jing, P. Xie, and E. Xing, “On the automatic generation of medical imaging reports,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 2577–2586.
[38] J. Gao et al., “CAMP: Co-attention memory networks for diagnosis prediction in healthcare,” in Proc. IEEE Int. Conf. Data Mining, 2019, pp. 1036–1041.
[39] Y. Tay, A. T. Luu, and S. C. Hui, “Multi-pointer co-attention networks for recommendation,” in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 2309–2318.
[40] S. Liu, Z. Chen, H. Liu, and X. Hu, “User-video co-attention network for personalized micro-video recommendation,” in Proc. World Wide Web Conf., 2019, pp. 3020–3026.
[41] M. Tu, G. Wang, J. Huang, Y. Tang, X. He, and B. Zhou, “Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2704–2713.
[42] Y.-J. Lu and C.-T. Li, “GCAN: Graph-aware co-attention networks for explainable fake news detection on social media,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 505–514.
[43] O. Wallaart and F. Frasincar, “A hybrid approach for aspect-based sentiment analysis using a lexicalized domain ontology and attentional neural models,” in Proc. 16th Extended Semantic Web Conf., 2019, pp. 363–378.
[44] S. Zhao and Z. Zhang, “Attention-via-attention neural machine translation,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 563–570.
[45] L. Wu, L. Chen, R. Hong, Y. Fu, X. Xie, and M. Wang, “A hierarchical attention model for social contextual image recommendation,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 10, pp. 1854–1867, Oct. 2020.
[46] Y. Wang, S. Wang, J. Tang, N. O’Hare, Y. Chang, and B. Li, “Hierarchical attention network for action recognition in videos,” 2016, arXiv:1607.06416.
[47] Z. Li, Y. Wei, Y. Zhang, and Q. Yang, “Hierarchical attention transfer network for cross-domain sentiment classification,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5852–5859.
[48] C. Xing, Y. Wu, W. Wu, Y. Huang, and M. Zhou, “Hierarchical recurrent attention network for response generation,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5610–5617.
[49] V. A. Sindagi and V. M. Patel, “HA-CCN: Hierarchical attention-based crowd counting network,” IEEE Trans. Image Process., vol. 29, pp. 323–335, 2020.
[50] D. Kiela, C. Wang, and K. Cho, “Dynamic meta-embeddings for improved sentence representations,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2018, pp. 1466–1477.
[51] S. Maharjan, M. Montes, F. A. González, and T. Solorio, “A genre-aware attention model to improve the likability prediction of books,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2018, pp. 3381–3391.
[52] G. I. Winata, Z. Lin, and P. Fung, “Learning multilingual meta-embeddings for code-switching named entity recognition,” in Proc. 4th Workshop Representation Learn. NLP, 2019, pp. 181–186.
[53] R. Jin, L. Lu, J. Lee, and A. Usman, “Multi-representational convolutional neural networks for text classification,” Comput. Intell., vol. 35, no. 3, pp. 599–609, 2019.
[54] A. Sordoni, P. Bachman, A. Trischler, and Y. Bengio, “Iterative alternating neural attention for machine reading,” 2016, arXiv:1606.02245.
[55] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing machines,” 2014, arXiv:1410.5401.
[56] D. Britz, A. Goldie, M.-T. Luong, and Q. Le, “Massive exploration of neural machine translation architectures,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2017, pp. 1442–1451.
[57] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3, pp. 229–256, 1992.
[58] T. Shen, T. Zhou, G. Long, J. Jiang, S. Wang, and C. Zhang, “Reinforced self-attention network: A hybrid of hard and soft attention for sequence modeling,” in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, pp. 4345–4352.
[59] M. Malinowski, C. Doersch, A. Santoro, and P. Battaglia, “Learning visual question answering by bootstrapping hard attention,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–20.
[60] Y. Liu, W. Wang, Y. Hu, J. Hao, X. Chen, and Y. Gao, “Multi-agent game abstraction via graph attention neural network,” in Proc. 34th AAAI Conf. Artif. Intell., 2020, pp. 7211–7218.
[61] S. Seo, J. Huang, H. Yang, and Y. Liu, “Interpretable convolutional neural networks with dual local and global attention for review rating prediction,” in Proc. 11th ACM Conf. Recommender Syst., 2017, pp. 297–305.
[62] J. Wang et al., “Aspect sentiment classification towards question-answering with reinforced bidirectional attention network,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 3548–3557.
[63] M. Jiang, C. Li, J. Kong, Z. Teng, and D. Zhuang, “Cross-level reinforced attention network for person re-identification,” J. Vis. Commun. Image Representation, vol. 69, 2020, Art. no. 102775.
[64] T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang, “DiSAN: Directional self-attention network for RNN/CNN-free language understanding,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 5446–5455.
[65] O. Arshad, I. Gallo, S. Nawaz, and A. Calefati, “Aiding intra-text representations with visual context for multimodal named entity recognition,” in Proc. Int. Conf. Document Anal. Recognit., 2019, pp. 337–342.
[66] W. Wu, X. Sun, and H. Wang, “Question condensing networks for answer selection in community question answering,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1746–1755.
[67] O. Oktay et al., “Attention U-Net: Learning where to look for the pancreas,” in Proc. 1st Med. Imag. Deep Learn. Conf., 2018.
[68] R. Tan, J. Sun, B. Su, and G. Liu, “Extending the transformer with context and multi-dimensional mechanism for dialogue response generation,” in Proc. 8th Int. Conf. Natural Lang. Process. Chinese Comput., 2019, pp. 189–199.
[69] L. Chen, B. Lv, C. Wang, S. Zhu, B. Tan, and K. Yu, “Schema-guided multi-domain dialogue state tracking with graph attention neural networks,” in Proc. 34th AAAI Conf. Artif. Intell., 2020, pp. 7521–7528.
[70] H. Wang, G. Liu, A. Liu, Z. Li, and K. Zheng, “DMRAN: A hierarchical fine-grained attention-based network for recommendation,” in Proc. 28th Int. Joint Conf. Artif. Intell., 2019, pp. 3698–3704.
[71] Z. Lin et al., “A structured self-attentive sentence embedding,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[72] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” 2016, arXiv:1607.06450.
[73] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10076–10085.
[74] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 7354–7363.
[75] I. Goodfellow et al., “Generative adversarial nets,” in Proc. 27th Annu. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[76] A. Sinha and J. Dolz, “Multi-scale self-guided attention for medical image segmentation,” IEEE J. Biomed. Health Inform., vol. 25, no. 1, pp. 121–130, Jan. 2021.
[77] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Proc. Asian Conf. Comput. Vis., 2018, pp. 39–54.
[78] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2019, pp. 7115–7119.
[79] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2018.2889052.
[80] S. Zhang, Y. Tay, L. Yao, and A. Sun, “Next item recommendation with self-attention,” 2018, arXiv:1808.06414.
[81] G. Letarte, F. Paradis, P. Giguère, and F. Laviolette, “Importance of self-attention for sentiment analysis,” in Proc. Workshop BlackboxNLP: Analyzing Interpreting Neural Netw., 2018, pp. 267–275.
[82] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang, “DySAT: Deep neural representation learning on dynamic graphs via self-attention networks,” in Proc. 13th Int. Conf. Web Search Data Mining, 2020, pp. 519–527.
[83] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[84] S. Iida, R. Kimura, H. Cui, P.-H. Hung, T. Utsuro, and M. Nagata, “Attention over heads: A multi-hop attention for neural machine translation,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics: Student Res. Workshop, 2019, pp. 217–222.
[85] N. K. Tran and C. Niederée, “Multihop attention networks for question answer matching,” in Proc. 41st ACM SIGIR Int. Conf. Res. Develop. Inf. Retrieval, 2018, pp. 325–334.
[86] Y. Gong and S. R. Bowman, “Ruminating reader: Reasoning with gated multi-hop attention,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[87] S. Yoon, S. Byun, S. Dey, and K. Jung, “Speech emotion recognition using multi-hop attention mechanism,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2019, pp. 2822–2826.
[88] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2016, pp. 21–29.
[89] Y. Wang, A. Sun, J. Han, Y. Liu, and X. Zhu, “Sentiment analysis by capsules,” in Proc. World Wide Web Conf., 2018, pp. 1165–1174.
[90] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Proc. 31st Annu. Conf. Neural Inf. Process. Syst., 2017, pp. 3859–3869.
[91] M. India, P. Safari, and J. Hernando, “Self multi-head attention for speaker recognition,” in Proc. 20th Annu. Conf. Int. Speech Commun. Assoc., 2019, pp. 2822–2826.
[92] C. Wu, F. Wu, S. Ge, T. Qi, Y. Huang, and X. Xie, “Neural news recommendation with multi-head self-attention,” in Proc. Conf. Empir. Methods Natural Lang. Process., 9th Int. Joint Conf. Natural Lang. Process., 2019, pp. 6389–6394.
[93] Y. Wang, W. Chen, D. Pi, and L. Yue, “Adversarially regularized medication recommendation model with multi-hop memory network,” Knowl. Inf. Syst., vol. 63, no. 1, pp. 125–142, 2021.
[94] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10578–10587.
[95] J. Chen et al., “TransUNet: Transformers make strong encoders for medical image segmentation,” 2021, arXiv:2102.04306.
[96] P. Zhong, D. Wang, and C. Miao, “Knowledge-enriched transformer for emotion detection in textual conversations,” in Proc. Conf. Empir. Methods Natural Lang. Process., 9th Int. Joint Conf. Natural Lang. Process., 2019, pp. 165–176.
[97] Y. Zhou, R. Ji, J. Su, X. Sun, and W. Chen, “Dynamic capsule attention for visual question answering,” in Proc. 33rd AAAI Conf. Artif. Intell., 2019, pp. 9324–9331.
[98] Y. Wang, A. Sun, M. Huang, and X. Zhu, “Aspect-level sentiment analysis using AS-capsules,” in Proc. World Wide Web Conf., 2019, pp. 2033–2044.
[99] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2978–2988.
[100] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient Transformer,” in Proc. 8th Int. Conf. Learn. Representations, 2020.
[101] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” 2020, arXiv:2006.04768.
[102] Z. Wu, Z. Liu, J. Lin, Y. Lin, and S. Han, “Lite transformer with long-short range attention,” in Proc. 8th Int. Conf. Learn. Representations, 2020.
[103] Y. Tay, D. Bahri, D. Metzler, D. C. Juan, Z. Zhao, and C. Zheng, “Synthesizer: Rethinking self-attention for transformer models,” in Proc. 38th Int. Conf. Mach. Learn., vol. 139, 2021, pp. 10183–10192.
[104] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” 2020, arXiv:2009.06732.
[105] X. Li et al., “Beyond RNNs: Positional self-attention with co-attention for video question answering,” in Proc. 33rd AAAI Conf. Artif. Intell., 2019, pp. 8658–8665.
[106] A. W. Yu et al., “QANet: Combining local convolution with global self-attention for reading comprehension,” in Proc. 6th Int. Conf. Learn. Representations, 2018.
[107] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[108] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proc. Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summarization, 2005, pp. 65–72.
[109] R. Sennrich, “Perplexity minimization for translation model domain adaptation in statistical machine translation,” in Proc. 13th Conf. Eur. Chapter Assoc. Comput. Linguistics, 2012, pp. 539–549.
[110] M. Popović and H. Ney, “Word error rates: Decomposition over POS classes and applications for error analysis,” in Proc. 2nd Workshop Statist. Mach. Transl., 2007, pp. 48–55.
[111] P. Schwarz, P. Matějka, and J. Černocký, “Towards lower error rates in phoneme recognition,” in Proc. 7th Int. Conf. Text Speech Dialogue, 2004, pp. 465–472.
[112] D. S. Turaga, Y. Chen, and J. Caviedes, “No reference PSNR estimation for compressed pictures,” Signal Process.: Image Commun., vol. 19, no. 2, pp. 173–184, 2004.
[113] P. Ndajah, H. Kikuchi, M. Yukawa, H. Watanabe, and S. Muramatsu, “SSIM image quality metric for denoised images,” in Proc. 3rd WSEAS Int. Conf. Vis. Imaging Simul., 2010, pp. 53–58.
[114] M. A. Rahman and Y. Wang, “Optimizing intersection-over-union in deep neural networks for image segmentation,” in Proc. 12th Int. Symp. Vis. Comput., 2016, pp. 234–244.
[115] X. Chen, L. Yao, and Y. Zhang, “Residual attention U-Net for automated multi-class segmentation of COVID-19 chest CT images,” 2020, arXiv:2004.05645.
[116] S. Liu, Y. Chen, K. Liu, and J. Zhao, “Exploiting argument information to improve event detection via supervised attention mechanisms,” in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 1789–1798.
[117] C. Liu, J. Mao, F. Sha, and A. Yuille, “Attention correctness in neural image captioning,” in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 4176–4182.
[118] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra, “Human attention in visual question answering: Do humans and deep networks look at the same regions?,” Comput. Vis. Image Understanding, vol. 163, pp. 90–100, 2017.
[119] Y. Yu, J. Choi, Y. Kim, K. Yoo, S.-H. Lee, and G. Kim, “Supervising neural attention models for video captioning by human gaze data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6119–6127.
[120] S. Jain and B. C. Wallace, “Attention is not explanation,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Language Technol., 2019, pp. 3543–3556.
[121] S. Wiegreffe and Y. Pinter, “Attention is not not explanation,” in Proc. Conf. Empir. Methods Natural Lang. Process., 9th Int. Joint Conf. Natural Lang. Process., 2019, pp. 11–20.
[122] A. K. Mohankumar, P. Nema, S. Narasimhan, M. M. Khapra, B. V. Srinivasan, and B. Ravindran, “Towards transparent and explainable attention models,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 4206–4216.
[123] K. K. Thekumparampil, C. Wang, S. Oh, and L.-J. Li, “Attention-based graph neural network for semi-supervised learning,” 2018, arXiv:1803.03735.
[124] D. Nie, Y. Gao, L. Wang, and D. Shen, “ASDNet: Attention based semi-supervised deep networks for medical image segmentation,” in Proc. 21st Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2018, pp. 370–378.
[125] Y. Alami Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim, “Unsupervised attention-guided image-to-image translation,” in Proc. 32nd Annu. Conf. Neural Inf. Process. Syst., 2018, pp. 3693–3703.
[126] R. He, W. S. Lee, H. T. Ng, and D. Dahlmeier, “An unsupervised neural attention model for aspect extraction,” in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 388–397.

Gianni Brauwers received the BS degree in econometrics and operations research from Erasmus University Rotterdam, Rotterdam, the Netherlands, in 2019. He is currently working toward the MS degree in econometrics and management science at Erasmus University Rotterdam. He is a research assistant with Erasmus University Rotterdam, focusing his research on neural attention models and sentiment analysis.

Flavius Frasincar received the MS degree in computer science, in 1996, the MPhil degree in computer science from the Politehnica University of Bucharest, Bucharest, Romania, in 1997, the PDEng degree in computer science, in 2000, and the PhD degree in computer science from the Eindhoven University of Technology, Eindhoven, the Netherlands, in 2005. Since 2005, he has been an assistant professor in computer science with Erasmus University Rotterdam, Rotterdam, the Netherlands. He has published in numerous conferences and journals in the areas of databases, Web information systems, personalization, machine learning, and the Semantic Web. He is a member of the editorial boards of Decision Support Systems, International Journal of Web Engineering and Technology, and Computational Linguistics in the Netherlands Journal, and co-editor-in-chief of the Journal of Web Engineering. He is a member of the Association for Computing Machinery.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.