
Analyzing the Structure of Attention in a Transformer Language Model

Jesse Vig
Palo Alto Research Center
Machine Learning and Data Science Group, Interaction and Analytics Lab
Palo Alto, CA, USA
[email protected]

Yonatan Belinkov
Harvard John A. Paulson School of Engineering and Applied Sciences
and MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, MA, USA
[email protected]

Abstract

The Transformer is a fully attention-based alternative to recurrent networks that has achieved state-of-the-art results across a range of NLP tasks. In this paper, we analyze the structure of attention in a Transformer language model, the GPT-2 small pretrained model. We visualize attention for individual instances and analyze the interaction between attention and syntax over a large corpus. We find that attention targets different parts of speech at different layer depths within the model, and that attention aligns with dependency relations most strongly in the middle layers. We also find that the deepest layers of the model capture the most distant relationships. Finally, we extract exemplar sentences that reveal highly specific patterns targeted by particular attention heads.

1 Introduction

Contextual word representations have recently been used to achieve state-of-the-art performance across a range of language understanding tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). These representations are obtained by optimizing a language modeling (or similar) objective on large amounts of text. The underlying architecture may be recurrent, as in ELMo (Peters et al., 2018), or based on multi-head self-attention, as in OpenAI's GPT (Radford et al., 2018) and BERT (Devlin et al., 2018), which are based on the Transformer (Vaswani et al., 2017). Recently, the GPT-2 model (Radford et al., 2019) outperformed other language models in a zero-shot setting, again based on self-attention.

An advantage of using attention is that it can help interpret the model by showing how the model attends to different parts of the input (Bahdanau et al., 2015; Belinkov and Glass, 2019). Various tools have been developed to visualize attention in NLP models, ranging from attention matrix heatmaps (Bahdanau et al., 2015; Rush et al., 2015; Rocktäschel et al., 2016) to bipartite graph representations (Liu et al., 2018; Lee et al., 2017; Strobelt et al., 2018). A visualization tool designed specifically for multi-head self-attention in the Transformer (Vaswani et al., 2017) was introduced by Jones (2017) as part of the Tensor2Tensor library (Vaswani et al., 2018).

We extend the work of Jones (2017) by visualizing attention in the Transformer at three levels of granularity: the attention-head level, the model level, and the neuron level. We also adapt the original encoder-decoder implementation to the decoder-only GPT-2 model, as well as the encoder-only BERT model.

In addition to visualizing attention for individual inputs to the model, we also analyze attention in aggregate over a large corpus to answer the following research questions:

• Does attention align with syntactic dependency relations?

• Which attention heads attend to which part-of-speech tags?

• How does attention capture long-distance relationships versus short-distance ones?

We apply our analysis to the GPT-2 small pretrained model. We find that attention follows dependency relations most strongly in the middle layers of the model, and that attention heads target particular parts of speech depending on layer depth. We also find that attention spans the greatest distance in the deepest layers, but varies significantly between heads. Finally, our method for extracting exemplar sentences yields many intuitive patterns.

2 Related Work

Recent work suggests that the Transformer implicitly encodes syntactic information such as dependency parse trees (Hewitt and Manning, 2019; Raganato and Tiedemann, 2018), anaphora (Voita et al., 2018), and subject-verb pairings (Goldberg, 2019; Wolf, 2019). Other work has shown that RNNs also capture syntax, and that deeper layers in the model capture increasingly high-level constructs (Blevins et al., 2018).

In contrast to past work that measures a model's syntactic knowledge through linguistic probing tasks, we directly compare the model's attention patterns to syntactic constructs such as dependency relations and part-of-speech tags. Raganato and Tiedemann (2018) also evaluated dependency trees induced from attention weights in a Transformer, but in the context of encoder-decoder translation models.

3 Transformer Architecture

Stacked Decoder: GPT-2 is a stacked decoder Transformer, which inputs a sequence of tokens and applies position and token embeddings followed by several decoder layers. Each layer applies multi-head self-attention (see below) in combination with a feedforward network, layer normalization, and residual connections. The GPT-2 small model has 12 layers and 12 heads.

Self-Attention: Given an input x, the self-attention mechanism assigns to each token x_i a set of attention weights over the tokens in the input:

\mathrm{Attn}(x_i) = (\alpha_{i,1}(x), \alpha_{i,2}(x), \ldots, \alpha_{i,i}(x))    (1)

where α_{i,j}(x) is the attention that x_i pays to x_j. The weights are positive and sum to one. Attention in GPT-2 is right-to-left, so α_{i,j} is defined only for j ≤ i. In the multi-layer, multi-head setting, α is specific to a layer and head.

The attention weights α_{i,j}(x) are computed from the scaled dot-product of the query vector of x_i and the key vector of x_j, followed by a softmax operation. The attention weights are then used to produce a weighted sum of value vectors:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V    (2)

using query matrix Q, key matrix K, and value matrix V, where d_k is the dimension of K. In a multi-head setting, the queries, keys, and values are linearly projected h times, and the attention operation is performed in parallel for each representation, with the results concatenated.

4 Visualizing Individual Inputs

In this section, we present three visualizations of attention in the Transformer model: the attention-head view, the model view, and the neuron view. Source code and Jupyter notebooks are available at https://github.com/jessevig/bertviz, and a video demonstration can be found at https://vimeo.com/339574955. A more detailed discussion of the tool is provided in Vig (2019).

4.1 Attention-head View

The attention-head view (Figure 1) visualizes attention for one or more heads in a model layer. Self-attention is depicted as lines connecting the attending tokens (left) with the tokens being attended to (right). Colors identify the head(s), and line weight reflects the attention weight. This view closely follows the design of Jones (2017), but has been adapted to the GPT-2 model (shown in the figure) and BERT model (not shown).

Figure 1: Attention-head view of GPT-2 for layer 4, head 11, which focuses attention on the previous token.

This view helps focus on the role of specific attention heads. For instance, in the shown example, the chosen attention head attends primarily to the previous token position.
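For reference, the following is a minimal NumPy sketch (our illustration, not the authors' released code) of the per-head attention computation from Section 3 (Eqs. 1–2): random matrices stand in for the learned query and key projections, and the causal mask enforces the right-to-left constraint j ≤ i. The resulting α matrix is exactly the quantity that the views in this section display and that the analyses in Section 5 aggregate.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention_weights(X, W_q, W_k):
    """Return the (seq_len x seq_len) matrix of weights alpha[i, j] for one head,
    where alpha[i, j] is defined only for j <= i (Eq. 1)."""
    Q = X @ W_q                      # queries, one per token
    K = X @ W_k                      # keys, one per token
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot products (Eq. 2)
    # Causal (right-to-left) mask: token i may only attend to positions j <= i.
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores)           # each row is one alpha_i distribution

# Toy example: 5 tokens with 64-dimensional embeddings, head size 16.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))
W_q, W_k = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
alpha = causal_attention_weights(X, W_q, W_k)
assert np.allclose(alpha.sum(axis=1), 1.0)   # each row sums to one
```

A full multi-head layer would apply h such projections in parallel and concatenate the value-weighted outputs; the sketch stops at the attention weights themselves, since those are what the visualizations consume.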

4.2 Model View

Figure 2: Model view of GPT-2, for the same input as in Figure 1 (excludes layers 6–11 and heads 6–11).

The model view (Figure 2) visualizes attention across all of the model's layers and heads for a particular input. Attention heads are presented in tabular form, with rows representing layers and columns representing heads. Each head is shown in a thumbnail form that conveys the coarse shape of the attention pattern, following the small multiples design pattern (Tufte, 1990). Users may also click on any head to enlarge it and see the tokens.

This view facilitates the detection of coarse-grained differences between heads. For example, several heads in layer 0 share a horizontal-stripe pattern, indicating that tokens attend to the current position. Other heads have a triangular pattern, showing that they attend to the first token. In the deeper layers, some heads display a small number of highly defined lines, indicating that they are targeting specific relationships between tokens.

4.3 Neuron View

The neuron view (Figure 3) visualizes how individual neurons interact to produce attention. This view displays the queries and keys for each token, and demonstrates how attention is computed from the scaled dot product of these vectors. The element-wise product shows how specific neurons influence the dot product and hence attention.

Whereas the attention-head view and the model view show what attention patterns the model learns, the neuron view shows how the model forms these patterns. For example, it can help identify neurons responsible for specific attention patterns, as illustrated in Figure 3.

Figure 3: Neuron view for layer 8, head 6, which targets items in lists. Positive and negative values are colored blue and orange, respectively, and color saturation indicates magnitude. This view traces the computation of attention (Section 3) from the selected token on the left to each of the tokens on the right. Connecting lines are weighted based on attention between the respective tokens. The arrows (not in visualization) identify the neurons that most noticeably contribute to this attention pattern: the lower arrows point to neurons that contribute to attention towards list items, while the upper arrow identifies a neuron that helps focus attention on the first token in the sequence.

5 Analyzing Attention in Aggregate

In this section we explore the aggregate properties of attention across an entire corpus. We examine how attention interacts with syntax, and we compare long-distance versus short-distance relationships. We also extract exemplar sentences that reveal patterns targeted by each attention head.

5.1 Methods

5.1.1 Part-of-Speech Tags

Past work suggests that attention heads in the Transformer may specialize in particular linguistic phenomena (Vaswani et al., 2017; Raganato and Tiedemann, 2018; Vig, 2019). We explore whether individual attention heads in GPT-2 target particular parts of speech. Specifically, we measure the proportion of total attention from a given head that focuses on tokens with a given part-of-speech tag, aggregated over a corpus:

P_\alpha(\mathrm{tag}) = \frac{\sum_{x \in X}\sum_{i=1}^{|x|}\sum_{j=1}^{i} \alpha_{i,j}(x)\cdot \mathbf{1}_{\mathrm{pos}(x_j)=\mathrm{tag}}}{\sum_{x \in X}\sum_{i=1}^{|x|}\sum_{j=1}^{i} \alpha_{i,j}(x)}    (3)

where tag is a part-of-speech tag, e.g., NOUN, x is a sentence from the corpus X, α_{i,j} is the attention from x_i to x_j for the given head (see Section 3), and pos(x_j) is the part-of-speech tag of x_j. We also compute the share of attention directed from each part of speech in a similar fashion.

5.1.2 Dependency Relations

Recent work shows that Transformers and recurrent models encode dependency relations (Hewitt and Manning, 2019; Raganato and Tiedemann, 2018; Liu et al., 2019). However, different models capture dependency relations at different layer depths. In a Transformer model, the middle layers were most predictive of dependencies (Liu et al., 2019; Tenney et al., 2019). Recurrent models were found to encode dependencies in lower layers for language models (Liu et al., 2019) and in deeper layers for translation models (Belinkov, 2018).

We analyze how attention aligns with dependency relations in GPT-2 by computing the proportion of attention that connects tokens that are also in a dependency relation with one another.
We refer to this metric as dependency alignment:

\mathrm{DepAl}_\alpha = \frac{\sum_{x \in X}\sum_{i=1}^{|x|}\sum_{j=1}^{i} \alpha_{i,j}(x)\,\mathrm{dep}(x_i, x_j)}{\sum_{x \in X}\sum_{i=1}^{|x|}\sum_{j=1}^{i} \alpha_{i,j}(x)}    (4)

where dep(x_i, x_j) is an indicator function that returns 1 if x_i and x_j are in a dependency relation and 0 otherwise. We run this analysis under three alternate formulations of dependency: (1) the attending token (x_i) is the parent in the dependency relation, (2) the token receiving attention (x_j) is the parent, and (3) either token is the parent.

We hypothesized that heads that focus attention based on position—for example, the head in Figure 1 that focuses on the previous token—would not align well with dependency relations, since they do not consider the content of the text. To distinguish between content-dependent and content-independent (position-based) heads, we define attention variability, which measures how attention varies over different inputs; high variability would suggest a content-dependent head, while low variability would indicate a content-independent head:

\mathrm{Variability}_\alpha = \frac{\sum_{x \in X}\sum_{i=1}^{|x|}\sum_{j=1}^{i} |\alpha_{i,j}(x) - \bar{\alpha}_{i,j}|}{2\cdot\sum_{x \in X}\sum_{i=1}^{|x|}\sum_{j=1}^{i} \alpha_{i,j}(x)}    (5)

where \bar{\alpha}_{i,j} is the mean of α_{i,j}(x) over all x ∈ X. Variability_α represents the mean absolute deviation [1] of α over X, scaled to the [0, 1] interval. [2][3] Variability scores for three example attention heads are shown in Figure 4.

[1] We considered using variance to measure attention variability; however, attention is sparse for many attention heads after filtering first-token attention (see Section 5.2.3), resulting in a very low variance (due to α_{i,j}(x) ≈ 0 and \bar{\alpha}_{i,j} ≈ 0) for many content-sensitive attention heads. We did not use a probability distance measure, as attention values do not sum to one due to filtering first-token attention.
[2] The upper bound is 1 because the denominator is an upper bound on the numerator.
[3] When computing variability, we only include the first N tokens (N = 10) of each x ∈ X to ensure a sufficient amount of data at each position i. The positional patterns appeared to be consistent across the entire sequence.

5.1.3 Attention Distance

Past work suggests that deeper layers in NLP models capture longer-distance relationships than lower layers (Belinkov, 2018; Raganato and Tiedemann, 2018). We test this hypothesis on GPT-2 by measuring the mean distance (in number of tokens) spanned by attention for each head. Specifically, we compute the average distance between token pairs in all sentences in the corpus, weighted by the attention between the tokens:

\bar{D}_\alpha = \frac{\sum_{x \in X}\sum_{i=1}^{|x|}\sum_{j=1}^{i} \alpha_{i,j}(x)\cdot(i-j)}{\sum_{x \in X}\sum_{i=1}^{|x|}\sum_{j=1}^{i} \alpha_{i,j}(x)}    (6)
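As an illustration of how these corpus-level metrics can be computed (a sketch under assumed data structures, not the authors' analysis code), the function below evaluates Eqs. 3, 4, and 6 for one head from per-sentence attention matrices, POS tags, and dependency-head indices. Eq. 5 additionally requires the position-wise mean attention over the corpus, so it is omitted from the sketch.

```python
import numpy as np

def aggregate_metrics(sentences, tag="NOUN"):
    """Compute Eq. 3-, 4-, and 6-style metrics for one attention head.

    `sentences` is a list of dicts with (hypothetical input format):
      - "alpha": (n, n) lower-triangular attention matrix for this head
      - "pos":   list of n part-of-speech tags
      - "head":  list of n dependency-head indices (head[i] == i for the root)
    """
    pos_num = dep_num = dist_num = denom = 0.0
    for s in sentences:
        alpha, pos, head = s["alpha"], s["pos"], s["head"]
        n = alpha.shape[0]
        i_idx, j_idx = np.tril_indices(n)        # only j <= i is defined
        w = alpha[i_idx, j_idx]
        denom += w.sum()
        # Eq. 3: attention mass landing on tokens with the given POS tag.
        pos_num += w[np.array([pos[j] == tag for j in j_idx])].sum()
        # Eq. 4: attention mass on token pairs in a dependency relation,
        # counting either direction (formulation (3) in the text).
        dep = np.array([i != j and (head[i] == j or head[j] == i)
                        for i, j in zip(i_idx, j_idx)])
        dep_num += w[dep].sum()
        # Eq. 6: attention-weighted token distance.
        dist_num += (w * (i_idx - j_idx)).sum()
    return {"P_tag": pos_num / denom,
            "DepAl": dep_num / denom,
            "MeanDist": dist_num / denom}
```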
Figure 4: Attention heads in GPT-2 visualized for an example input sentence, along with aggregate metrics computed from all sentences in the corpus. Note that the average sentence length in the corpus is 27.7 tokens. Left: Focuses attention primarily on the current token position. Center: Disperses attention roughly evenly across all previous tokens. Right: Focuses on words in repeated phrases.

We also explore whether heads with more dispersed attention patterns (Figure 4, center) tend to capture more distant relationships. We measure attention dispersion based on the entropy [4] of the attention distribution (Ghader and Monz, 2017):

\mathrm{Entropy}_\alpha(x_i) = -\sum_{j=1}^{i} \alpha_{i,j}(x)\log(\alpha_{i,j}(x))    (7)

Figure 4 shows the mean distance and entropy values for three example attention heads.

[4] When computing entropy, we exclude attention to the first (null) token (see Section 5.2.3) and renormalize the remaining weights. We exclude tokens that focus over 90% of attention to the first token, to avoid a disproportionate influence from the remaining attention from these tokens.

5.2 Experimental Setup

5.2.1 Dataset

We focused our analysis on text from English Wikipedia, which was not included in the training set for GPT-2. We first extracted 10,000 articles, and then sampled 100,000 sentences from these articles. For the qualitative analysis described later, we used the full dataset; for the quantitative analysis, we used a subset of 10,000 sentences.

5.2.2 Tools

We computed attention weights using the pytorch-pretrained-BERT [5] implementation of the GPT-2 small model. We extracted syntactic features using spaCy (Honnibal and Montani, 2017) and mapped the features from the spaCy-generated tokens to the corresponding tokens from the GPT-2 tokenizer. [6]

[5] https://github.com/huggingface/pytorch-pretrained-BERT
[6] In cases where the GPT-2 tokenizer split a word into multiple pieces, we assigned the features to all word pieces.

Figure 5: Proportion of attention focused on first token, broken out by layer and head.

5.2.3 Filtering Null Attention

We excluded attention focused on the first token of each sentence from the analysis because it was not informative; other tokens appeared to focus on this token by default when no relevant tokens were found elsewhere in the sequence. On average, 57% of attention was directed to the first token. Some heads focused over 97% of attention to this token on average (Figure 5), which is consistent with recent work showing that individual attention heads may have little impact on overall performance (Voita et al., 2019; Michel et al., 2019). We refer to the attention directed to the first token as null attention.
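For readers who want to reproduce this kind of setup, the sketch below shows one way to obtain per-head GPT-2 attention weights and to filter null attention. It uses the current Hugging Face transformers package rather than the pytorch-pretrained-BERT version cited above, and the example sentence, layer, and head indices are arbitrary choices of ours.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Load GPT-2 small; `transformers` is the successor to pytorch-pretrained-BERT.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

sentence = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, output_attentions=True)

# out.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attn = torch.stack(out.attentions).squeeze(1)   # (layers, heads, seq, seq)

# Share of attention directed to the first ("null") token, per layer/head
# (roughly the quantity plotted in Figure 5, here for a single sentence).
null_share = attn[..., 0].mean(dim=-1)

# Filter null attention: drop the first-token column and renormalize rows,
# as done before computing entropy (footnote [4]).
filtered = attn[..., 1:].clone()
filtered = filtered / filtered.sum(dim=-1, keepdim=True).clamp(min=1e-9)

print(null_share[3, 10])   # e.g., layer 3, head 10 (0-indexed)
```

Aligning spaCy's part-of-speech and dependency annotations with GPT-2's word pieces (footnote [6]) is omitted here; the paper assigns each word's features to all of its word pieces.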
Figure 6: Each heatmap shows the proportion of total attention directed to the given part of speech, broken out by
layer (vertical axis) and head (horizontal axis). Scales vary by tag. Results for all tags available in appendix.

Figure 7: Each heatmap shows the proportion of total attention that originates from the given part of speech, broken
out by layer (vertical axis) and head (horizontal axis). Scales vary by tag. Results for all tags available in appendix.

5.3 Results

5.3.1 Part-of-Speech Tags

Figure 6 shows the share of attention directed to various part-of-speech tags (Eq. 3) broken out by layer and head. Most tags are disproportionately targeted by one or more attention heads. For example, nouns receive 43% of attention in layer 9, head 0, compared to a mean of 21% over all heads. For 13 of 16 tags, a head exists with an attention share more than double the mean for the tag.

The attention heads that focus on a particular tag tend to cluster by layer depth. For example, the top five heads targeting proper nouns are all in the last three layers of the model. This may be due to several attention heads in the deeper layers focusing on named entities (see Section 5.4), which may require the broader context available in the deeper layers. In contrast, the top five heads targeting determiners—a lower-level construct—are all in the first four layers of the model. This is consistent with previous findings showing that deeper layers focus on higher-level properties (Blevins et al., 2018; Belinkov, 2018).

Figure 7 shows the proportion of attention directed from various parts of speech. The values appear to be roughly uniform in the initial layers of the model. The reason is that the heads in these layers pay little attention to the first (null) token (Figure 5), and therefore the remaining (non-null) attention weights sum to a value close to one. Thus, the net weight for each token in the weighted sum (Section 5.1.1) is close to one, and the proportion reduces to the frequency of the part of speech in the corpus.

Beyond the initial layers, attention heads specialize in focusing attention from particular part-of-speech tags. However, the effect is less pronounced compared to the tags receiving attention; for 7 out of 16 tags, there is a head that focuses attention from that tag with a frequency more than double the tag average. Many of these specialized heads also cluster by layer. For example, the top ten heads for focusing attention from punctuation are all in the last six layers.

5.3.2 Dependency Relations

Figure 8 shows the dependency alignment scores (Eq. 4) broken out by layer. Attention aligns with dependency relations most strongly in the middle layers, consistent with recent syntactic probing analyses (Liu et al., 2019; Tenney et al., 2019).

One possible explanation for the low alignment in the initial layers is that many heads in these layers focus attention based on position rather than content, according to the attention variability (Eq. 5) results in Figure 10. Figure 4 (left and center) shows two examples of position-focused heads from layer 0 that have relatively low dependency alignment [7] (0.04 and 0.10, respectively); the first head focuses attention primarily on the current token position (which cannot be in a dependency relation with itself) and the second disperses attention roughly evenly, without regard to content.
Figure 8: Proportion of attention that is aligned with dependency relations, aggregated by layer. The orange line shows the baseline proportion of token pairs that share a dependency relationship, independent of attention.

Figure 9: Proportion of attention directed to various dependency types, broken out by layer.

Figure 10: Attention variability by layer / head. High values indicate content-dependent heads, and low values indicate content-independent (position-based) heads.

An interesting counterexample is layer 4, head 11 (Figure 1), which has the highest dependency alignment out of all the heads (DepAl_α = 0.42) [7] but is also the most position-focused (Variability_α = 0.004). This head focuses attention on the previous token, which in our corpus has a 42% chance of being in a dependency relation with the adjacent token. As we'll discuss in the next section, token distance is highly predictive of dependency relations.

[7] Assuming the relation may be in either direction.

One hypothesis for why attention diverges from dependency relations in the deeper layers is that several attention heads in these layers target very specific constructs (Tables 1 and 2) as opposed to more general dependency relations. The deepest layers also target longer-range relationships (see next section), whereas dependency relations span relatively short distances (3.89 tokens on average).

We also analyzed the specific dependency types of tokens receiving attention (Figure 9). Subjects (csubj, csubjpass, nsubj, nsubjpass) were targeted more in deeper layers, while auxiliaries (aux), conjunctions (cc), determiners (det), expletives (expl), and negations (neg) were targeted more in lower layers, consistent with previous findings (Belinkov, 2018). For some other dependency types, the interpretations were less clear.

5.3.3 Attention Distance

We found that attention distance (Eq. 6) is greatest in the deepest layers (Figure 11, right), confirming that these layers capture longer-distance relationships. Attention distance varies greatly across heads (SD = 3.6), even when the heads are in the same layer, due to the wide variation in attention structures (e.g., Figure 4, left and center).
Figure 11: Mean attention distance by layer / head (left), and by layer (right).

Figure 12: Mean attention entropy by layer / head. Higher values indicate more diffuse attention.

We also explored the relationship between attention distance and attention entropy (Eq. 7), which measures how diffuse an attention pattern is. Overall, we found a moderate correlation (r = 0.61, p < 0.001) between the two. As Figure 12 shows, many heads in layers 0 and 1 have high entropy (e.g., Figure 4, center), which may explain why these layers have a higher attention distance compared to layers 2–4.

One counterexample is layer 5, head 1 (Figure 4, right), which has the highest mean attention distance of any head (14.2), and one of the lowest mean entropy scores (0.41). This head concentrates attention on individual words in repeated phrases, which often occur far apart from one another.

We also explored how attention distance relates to dependency alignment. Across all heads, we found a negative correlation between the two quantities (r = −0.73, p < 0.001). This is consistent with the fact that the probability of two tokens sharing a dependency relation decreases as the distance between them increases [8]; for example, the probability of being in a dependency relation is 0.42 for adjacent tokens, 0.07 for tokens at a distance of 5, and 0.02 for tokens at a distance of 10. The layers (2–4) in which attention spanned the shortest distance also had the highest dependency alignment.

[8] This is true up to a distance of 18 tokens; 99.8% of dependency relations occur within this distance.

5.4 Qualitative Analysis

To get a sense of the lexical patterns targeted by each attention head, we extracted exemplar sentences that most strongly induced attention in that head. Specifically, we ranked sentences by the maximum token-to-token attention weight within each sentence. Results for three attention heads are shown in Tables 1–3. We found other attention heads that detected entities (people, places, dates), passive verbs, acronyms, nicknames, paired punctuation, and other syntactic and semantic properties. Most heads captured multiple types of patterns.
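A minimal sketch of this ranking procedure follows (our illustration, not the authors' published code; whether null attention is excluded at this step is our assumption and is not stated in the paper):

```python
import numpy as np

def exemplar_sentences(sentences, alphas, top_k=3, exclude_first_token=True):
    """Rank sentences by the maximum token-to-token attention weight for one head.

    `sentences` is a list of token lists; `alphas` is a parallel list of
    (n, n) attention matrices for the head under inspection.
    """
    scored = []
    for tokens, alpha in zip(sentences, alphas):
        a = alpha.copy()
        if exclude_first_token:
            a[:, 0] = 0.0           # optionally ignore null attention to the first token
        i, j = np.unravel_index(np.argmax(a), a.shape)
        scored.append((a[i, j], tokens, i, j))
    scored.sort(key=lambda t: t[0], reverse=True)
    # Each result: (max weight, tokens, attending position i, attended position j).
    return scored[:top_k]
```

The top-ranked sentences, together with the (i, j) pair that produced the maximum weight, are what Tables 1–3 display for three heads.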
Table 1: Exemplar sentences for layer 10, head 10, which focuses attention from acronyms to the associated phrase. The tokens with maximum attention are underlined; the attending token is bolded and the token receiving attention is italicized. It appears that attention is directed to the part of the phrase that would help the model choose the next word piece in the acronym (after the token paying attention), reflecting the language modeling objective.

Rank 1: The Australian search and rescue service is provided by Aus S AR , which is part of the Australian Maritime Safety Authority ( AM SA ).
Rank 2: In 1925 , Bapt ists worldwide formed the Baptist World Alliance ( B WA ).
Rank 3: The Oak dale D ump is listed as an Environmental Protection Agency Super fund site due to the contamination of residential drinking water wells with volatile organic compounds ( V OC s ) and heavy metals .

Table 2: Exemplar sentences for layer 11, head 2, which focuses attention from commas to the preceding place name (or the last word piece thereof). The likely purpose of this attention head is to help the model choose the related place name that would follow the comma, e.g. the country or state in which the city is located.

Rank 1: After the two prototypes were completed , production began in Mar iet ta , Georgia , ...
Rank 3: The fictional character Sam Fisher of the Spl inter Cell video game series by Ubisoft was born in Tow son , as well as residing in a town house , as stated in the novel izations ...
Rank 4: Suicide bombers attack three hotels in Am man , Jordan , killing at least 60 people .

Table 3: Exemplar sentences for layer 11, head 10, which focuses attention from the end of a noun phrase to the head noun. In the first sentence, for example, the head noun is prospects and the remainder of the noun phrase is of Anglo - American assistance in another war with Germany. The purpose of this attention pattern is likely to predict the word (typically a verb) that follows the noun phrase, as the head noun is a strong predictor of this.

Rank 1: With the United States isolation ist and Britain stout ly refusing to make the " continental commitment " to defend France on the same scale as in World War I , the prospects of Anglo - American assistance in another war with Germany appeared to be doubtful ...
Rank 2: The show did receive a somewhat favorable review from noted critic Gilbert Se ld es in the December 15 , 1962 TV Guide : " The whole notion on which The Beverly Hill bill ies is founded is an encouragement to ignorance ...
Rank 3: he Arch im edes won significant market share in the education markets of the UK , Ireland , Australia and New Zealand ; the success of the Arch im edes in British schools was due partly to its predecessor the BBC Micro and later to the Comput ers for Schools scheme ...

6 Conclusion

In this paper, we analyzed the structure of attention in the GPT-2 Transformer language model. We found that many attention heads specialize in particular part-of-speech tags and that different tags are targeted at different layer depths. We also found that the deepest layers capture the most distant relationships, and that attention aligns most strongly with dependency relations in the middle layers, where attention distance is lowest.

Our qualitative analysis revealed that the structure of attention is closely tied to the training objective; for GPT-2, which was trained using left-to-right language modeling, attention often focused on words most relevant to predicting the next token in the sequence. For future work, we would like to extend the analysis to other Transformer models such as BERT, which has a bidirectional architecture and is trained on both token-level and sentence-level tasks.

Although the Wikipedia sentences used in our analysis cover a diverse range of topics, they all follow a similar encyclopedic format and style. Further study is needed to determine how attention patterns manifest in other types of content, such as dialog scripts or song lyrics. We would also like to analyze attention patterns in text much longer than a single sentence, especially for new Transformer variants such as the Transformer-XL (Dai et al., 2019) and Sparse Transformer (Child et al., 2019), which can handle very long contexts.

We believe that interpreting a model based on attention is complementary to linguistic probing approaches (Section 2). While linguistic probing precisely quantifies the amount of information encoded in various components of the model, it requires training and evaluating a probing classifier. Analyzing attention is a simpler process that also produces human-interpretable descriptions of model behavior, though recent work casts doubt on its role in explaining individual predictions (Jain and Wallace, 2019). The results of our analyses were often consistent with those from probing approaches.

7 Acknowledgements

Y.B. was supported by the Harvard Mind, Brain, and Behavior Initiative.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Yonatan Belinkov. 2018. On Internal Language Representations in Deep Learning: An Analysis of Machine Translation and Speech Recognition. Ph.D. thesis, Massachusetts Institute of Technology.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. arXiv preprint arXiv:1805.04218.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Hamidreza Ghader and Christof Monz. 2017. What does attention in neural machine translation pay attention to? In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 30–39.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. CoRR, abs/1902.10186.

Llion Jones. 2017. Tensor2tensor transformer visualization. https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/visualization.

Jaesong Lee, Joong-Hwi Shin, and Jun-Seok Kim. 2017. Interactive visualization and manipulation of attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Shusen Liu, Tao Li, Zhimin Li, Vivek Srikumar, Valerio Pascucci, and Peer-Timo Bremer. 2018. Visual interrogation of attention-based models for natural language inference and machine comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? arXiv preprint arXiv:1905.10650.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in Transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297. Association for Computational Linguistics.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In International Conference on Learning Representations (ICLR).

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, and A. M. Rush. 2018. Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models. ArXiv e-prints.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the Association for Computational Linguistics.

Edward Tufte. 1990. Envisioning Information. Graphics Press, Cheshire, CT, USA.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Jesse Vig. 2019. A multiscale visualization of attention in the Transformer model. In Proceedings of the Association for Computational Linguistics: System Demonstrations.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.

Thomas Wolf. 2019. Some additional experiments extending the tech report "Assessing BERT's syntactic abilities" by Yoav Goldberg. Technical report.
A Appendix

Figures A.1 and A.2 show the results from Figures 6 and 7 for the full set of part-of-speech tags.

Figure A.1: Each heatmap shows the proportion of total attention directed to the given part of speech, broken out by layer (vertical axis) and head (horizontal axis).

Figure A.2: Each heatmap shows the proportion of attention originating from the given part of speech, broken out by layer (vertical axis) and head (horizontal axis).
