2 Related Work

Recent work suggests that the Transformer implicitly encodes syntactic information such as dependency parse trees (Hewitt and Manning, 2019; Raganato and Tiedemann, 2018), anaphora (Voita et al., 2018), and subject-verb pairings (Goldberg, 2019; Wolf, 2019). Other work has shown that RNNs also capture syntax, and that deeper layers in the model capture increasingly high-level constructs (Blevins et al., 2018).

In contrast to past work that measures a model's syntactic knowledge through linguistic probing tasks, we directly compare the model's attention patterns to syntactic constructs such as dependency relations and part-of-speech tags. Raganato and Tiedemann (2018) also evaluated dependency trees induced from attention weights in a Transformer, but in the context of encoder-decoder translation models.

3 Transformer Architecture

Stacked Decoder: GPT-2 is a stacked decoder Transformer, which inputs a sequence of tokens and applies position and token embeddings followed by several decoder layers. Each layer applies multi-head self-attention (see below) in combination with a feedforward network, layer normalization, and residual connections. The GPT-2 small model has 12 layers and 12 heads.

Self-Attention: Given an input x, the self-attention mechanism assigns to each token x_i a set of attention weights α_{i,j}(x) over the tokens x_j (j ≤ i) in the input. These weights are computed from the scaled dot product of the query and key vectors:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V    (2)

using query matrix Q, key matrix K, and value matrix V, where d_k is the dimension of K. In a multi-head setting, the queries, keys, and values are linearly projected h times, and the attention operation is performed in parallel for each representation, with the results concatenated.
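To make the computation in Equation 2 concrete, the following NumPy sketch (not from the original paper; the function name, shapes, and toy input are illustrative assumptions) computes scaled dot-product attention for a single head with a causal mask, so that each token attends only to itself and preceding positions, as in a decoder-only model such as GPT-2.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with a causal (lower-triangular) mask.

    Q, K, V: arrays of shape (seq_len, d_k) for a single attention head.
    Returns (output, weights), where weights[i, j] is the attention that
    token i directs to token j (zero for j > i).
    """
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)             # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights

# Toy example: 5 tokens with head dimension 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
output, alpha = scaled_dot_product_attention(Q, K, V)
print(alpha.round(2))   # each row i sums to 1 over positions j <= i
```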
4 Visualizing Individual Inputs

In this section, we present three visualizations of attention in the Transformer model: the attention-head view, the model view, and the neuron view. Source code and Jupyter notebooks are available at https://github.com/jessevig/bertviz, and a video demonstration can be found at https://vimeo.com/339574955. A more detailed discussion of the tool is provided in Vig (2019).

4.1 Attention-head View

The attention-head view (Figure 1) visualizes attention for one or more heads in a model layer. Self-attention is depicted as lines connecting the attending tokens (left) with the tokens being attended to (right). Colors identify the head(s), and line weight reflects the attention weight. This view closely follows the design of Jones (2017), but has been adapted to the GPT-2 model (shown in the figure) and the BERT model (not shown).

This view helps focus on the role of specific attention heads. For instance, in the example shown, the chosen attention head attends primarily to the previous token position.
Figure 2: Model view of GPT-2, for the same input as in Figure 1 (excludes layers 6–11 and heads 6–11).

4.2 Model View

The model view (Figure 2) visualizes attention across all of the model's layers and heads for a particular input. Attention heads are presented in tabular form, with rows representing layers and columns representing heads. Each head is shown in a thumbnail form that conveys the coarse shape of the attention pattern, following the small multiples design pattern (Tufte, 1990). Users may also click on any head to enlarge it and see the tokens.

This view facilitates the detection of coarse-grained differences between heads. For example, several heads in layer 0 share a horizontal-stripe pattern, indicating that tokens attend to the current position. Other heads have a triangular pattern, showing that they attend to the first token. In the deeper layers, some heads display a small number of highly defined lines, indicating that they are targeting specific relationships between tokens.

4.3 Neuron View

The neuron view (Figure 3) visualizes how individual neurons interact to produce attention. This view displays the queries and keys for each token, and demonstrates how attention is computed from the scaled dot product of these vectors. The element-wise product shows how specific neurons influence the dot product and hence attention.

Whereas the attention-head view and the model view show what attention patterns the model learns, the neuron view shows how the model forms these patterns. For example, it can help identify neurons responsible for specific attention patterns, as illustrated in Figure 3.

Figure 3: Neuron view for layer 8, head 6, which targets items in lists. Positive and negative values are colored blue and orange, respectively, and color saturation indicates magnitude. This view traces the computation of attention (Section 3) from the selected token on the left to each of the tokens on the right. Connecting lines are weighted based on attention between the respective tokens. The arrows (not in visualization) identify the neurons that most noticeably contribute to this attention pattern: the lower arrows point to neurons that contribute to attention towards list items, while the upper arrow identifies a neuron that helps focus attention on the first token in the sequence.

5 Analyzing Attention in Aggregate

5.1 Methods

5.1.1 Part-of-Speech Tags

Past work suggests that attention heads in the Transformer may specialize in particular linguistic phenomena (Vaswani et al., 2017; Raganato and Tiedemann, 2018; Vig, 2019). We explore whether individual attention heads in GPT-2 target particular parts of speech. Specifically, we measure the proportion of total attention from a given head that focuses on tokens with a given part-of-speech tag, aggregated over a corpus:

    P_\alpha(\mathrm{tag}) = \frac{\sum_{x \in X} \sum_{i=1}^{|x|} \sum_{j=1}^{i} \alpha_{i,j}(x) \cdot \mathbf{1}_{\mathrm{pos}(x_j) = \mathrm{tag}}}{\sum_{x \in X} \sum_{i=1}^{|x|} \sum_{j=1}^{i} \alpha_{i,j}(x)}    (3)

where tag is a part-of-speech tag, e.g., NOUN, x is a sentence from the corpus X, α_{i,j} is the attention from x_i to x_j for the given head (see Section 3), and pos(x_j) is the part-of-speech tag of x_j. We also compute the share of attention directed from each part of speech in a similar fashion.
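For illustration, Equation 3 could be computed roughly as in the following sketch, given per-sentence attention matrices for one head and aligned part-of-speech tags. This is not the paper's actual code; the function name, data layout, and toy example are assumptions.

```python
import numpy as np

def pos_attention_proportion(attentions, pos_tags, tag):
    """Proportion of a head's total attention directed to tokens with `tag` (Eq. 3).

    attentions: list of (len(x), len(x)) arrays, one per sentence, where
        attentions[s][i, j] is the attention from token i to token j (j <= i).
    pos_tags:   list of lists of part-of-speech tags, aligned with each sentence.
    """
    numer = denom = 0.0
    for alpha, tags in zip(attentions, pos_tags):
        indicator = np.array([t == tag for t in tags], dtype=float)  # 1 if pos(x_j) == tag
        lower = np.tril(alpha)              # attention only flows to positions j <= i
        numer += (lower * indicator[None, :]).sum()
        denom += lower.sum()
    return numer / denom if denom > 0 else 0.0

# Toy example with a hypothetical 3-token sentence.
alpha = np.array([[1.0, 0.0, 0.0],
                  [0.6, 0.4, 0.0],
                  [0.1, 0.7, 0.2]])
print(pos_attention_proportion([alpha], [["DET", "NOUN", "VERB"]], "NOUN"))
```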
5.1.2 Dependency Relations

Recent work shows that Transformers and recurrent models encode dependency relations (Hewitt and Manning, 2019; Raganato and Tiedemann, 2018; Liu et al., 2019). However, different models capture dependency relations at different layer depths. In a Transformer model, the middle layers were most predictive of dependencies (Liu et al., 2019; Tenney et al., 2019). Recurrent models were found to encode dependencies in lower layers for language models (Liu et al., 2019) and in deeper layers for translation models (Belinkov, 2018).

We analyze how attention aligns with dependency relations in GPT-2 by computing the proportion of attention that connects tokens that are also in a dependency relation with one another. We refer to this metric as dependency alignment:

    \mathrm{DepAl}_\alpha = \frac{\sum_{x \in X} \sum_{i=1}^{|x|} \sum_{j=1}^{i} \alpha_{i,j}(x) \cdot \mathrm{dep}(x_i, x_j)}{\sum_{x \in X} \sum_{i=1}^{|x|} \sum_{j=1}^{i} \alpha_{i,j}(x)}    (4)

where dep(x_i, x_j) is an indicator function that returns 1 if x_i and x_j are in a dependency relation and 0 otherwise. We run this analysis under three alternate formulations of dependency: (1) the attending token (x_i) is the parent in the dependency relation, (2) the token receiving attention (x_j) is the parent, and (3) either token is the parent.
We hypothesized that heads that focus attention based on position (for example, the head in Figure 1 that focuses on the previous token) would not align well with dependency relations, since they do not consider the content of the text. To distinguish between content-dependent and content-independent (position-based) heads, we define attention variability, which measures how attention varies over different inputs; high variability would suggest a content-dependent head, while low variability would indicate a content-independent head:

    \mathrm{Variability}_\alpha = \frac{\sum_{x \in X} \sum_{i=1}^{|x|} \sum_{j=1}^{i} |\alpha_{i,j}(x) - \bar{\alpha}_{i,j}|}{2 \cdot \sum_{x \in X} \sum_{i=1}^{|x|} \sum_{j=1}^{i} \alpha_{i,j}(x)}    (5)

where ᾱ_{i,j} is the mean of α_{i,j}(x) over all x ∈ X. Variability_α represents the mean absolute deviation [1] of α over X, scaled to the [0, 1] interval [2, 3]. Variability scores for three example attention heads are shown in Figure 4.

[1] We considered using variance to measure attention variability; however, attention is sparse for many attention heads after filtering first-token attention (see Section 5.2.3), resulting in a very low variance (due to α_{i,j}(x) ≈ 0 and ᾱ_{i,j} ≈ 0) for many content-sensitive attention heads. We did not use a probability distance measure, as attention values do not sum to one due to filtering first-token attention.

[2] The upper bound is 1 because the denominator is an upper bound on the numerator.

[3] When computing variability, we only include the first N tokens (N = 10) of each x ∈ X to ensure a sufficient amount of data at each position i. The positional patterns appeared to be consistent across the entire sequence.
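The following sketch illustrates Equation 5, truncating each sentence to its first N = 10 tokens as in note [3]; the stacking of attention matrices and the toy positional head are assumptions made for the example, not the authors' code.

```python
import numpy as np

def attention_variability(attentions, max_len=10):
    """Attention variability (Eq. 5): scaled mean absolute deviation of attention
    weights across sentences, computed position-wise over the first `max_len`
    tokens so that every sentence contributes to each position.
    """
    # Truncate every attention matrix to the same (max_len, max_len) grid.
    truncated = [a[:max_len, :max_len] for a in attentions if a.shape[0] >= max_len]
    stacked = np.stack(truncated)                 # (num_sentences, max_len, max_len)
    mean_alpha = stacked.mean(axis=0)             # alpha-bar_{i,j}
    numer = np.abs(stacked - mean_alpha).sum()
    denom = 2.0 * stacked.sum()
    return numer / denom if denom > 0 else 0.0

# Toy example: a purely positional head (always the previous token) has variability 0.
prev_token = np.tril(np.eye(12, k=-1) + np.diag([1] + [0] * 11))
print(attention_variability([prev_token.copy() for _ in range(5)]))  # -> 0.0
```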
Figure 4: Attention heads in GPT-2 visualized for an example input sentence, along with aggregate metrics computed from all sentences in the corpus. Note that the average sentence length in the corpus is 27.7 tokens. Left: Focuses attention primarily on current token position. Center: Disperses attention roughly evenly across all previous tokens. Right: Focuses on words in repeated phrases.

5.1.3 Attention Distance

Past work suggests that deeper layers in NLP models capture longer-distance relationships than lower layers (Belinkov, 2018; Raganato and Tiedemann, 2018). We test this hypothesis on GPT-2 by measuring the mean distance (in number of tokens) spanned by attention for each head. Specifically, we compute the average distance between token pairs in all sentences in the corpus, weighted by the attention between the tokens:

    \bar{D}_\alpha = \frac{\sum_{x \in X} \sum_{i=1}^{|x|} \sum_{j=1}^{i} \alpha_{i,j}(x) \cdot (i - j)}{\sum_{x \in X} \sum_{i=1}^{|x|} \sum_{j=1}^{i} \alpha_{i,j}(x)}    (6)

We also explore whether heads with more dispersed attention patterns (Figure 4, center) tend to capture more distant relationships. We measure attention dispersion based on the entropy [4] of the attention distribution (Ghader and Monz, 2017):

    \mathrm{Entropy}_\alpha(x_i) = -\sum_{j=1}^{i} \alpha_{i,j}(x) \log(\alpha_{i,j}(x))    (7)

Figure 4 shows the mean distance and entropy values for three example attention heads.

[4] When computing entropy, we exclude attention to the first (null) token (see Section 5.2.3) and renormalize the remaining weights. We exclude tokens that focus over 90% of attention to the first token, to avoid a disproportionate influence from the remaining attention from these tokens.
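A sketch of Equations 6 and 7 is given below. It computes the raw entropy of a token's attention distribution and does not apply the null-attention filtering described in note [4]; the function names and the toy uniform-attention example are assumptions.

```python
import numpy as np

def mean_attention_distance(attentions):
    """Attention-weighted mean distance (i - j) between token pairs (Eq. 6)."""
    numer = denom = 0.0
    for alpha in attentions:
        n = alpha.shape[0]
        i_idx, j_idx = np.indices((n, n))
        lower = np.tril(alpha)                  # attention only flows to positions j <= i
        numer += (lower * (i_idx - j_idx)).sum()
        denom += lower.sum()
    return numer / denom if denom > 0 else 0.0

def attention_entropy(alpha, i):
    """Entropy of token i's attention distribution over positions j <= i (Eq. 7)."""
    row = alpha[i, : i + 1]
    row = row[row > 0]                          # treat 0 * log(0) as 0
    return float(-(row * np.log(row)).sum())

# Toy example: a head that spreads attention evenly over all previous tokens.
n = 6
alpha = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]
print(mean_attention_distance([alpha]))    # larger for more dispersed heads
print(attention_entropy(alpha, i=5))       # log(6) nats for a uniform distribution
```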
5.2 Experimental Setup

5.2.1 Dataset

We focused our analysis on text from English Wikipedia, which was not included in the training data for GPT-2.

5.2.2 Tools

We computed attention weights using the pytorch-pretrained-BERT [5] implementation of the GPT-2 small model. We extracted syntactic features using spaCy (Honnibal and Montani, 2017) and mapped the features from the spaCy-generated tokens to the corresponding tokens from the GPT-2 tokenizer. [6]

[5] https://github.com/huggingface/pytorch-pretrained-BERT

[6] In cases where the GPT-2 tokenizer split a word into multiple pieces, we assigned the features to all word pieces.

Figure 5: Proportion of attention focused on first token, broken out by layer and head.

5.2.3 Filtering Null Attention

We excluded attention focused on the first token of each sentence from the analysis because it was not informative; other tokens appeared to focus on this token by default when no relevant tokens were found elsewhere in the sequence. On average, 57% of attention was directed to the first token. Some heads focused over 97% of attention to this token on average (Figure 5), which is consistent with recent work showing that individual attention heads may have little impact on overall performance (Voita et al., 2019; Michel et al., 2019). We refer to the attention directed to the first token as null attention.
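The two preprocessing steps above might be implemented along the following lines. This is a hypothetical sketch, not the authors' code; the helper names and toy inputs are assumptions. Feature mapping follows note [6] (every word piece inherits its word's features), and null-attention filtering zeroes out the first-token column, optionally renormalizing the remaining weights as done for entropy (note [4]).

```python
import numpy as np

def spread_features_to_word_pieces(word_features, piece_to_word):
    """Map per-word features (e.g., spaCy POS tags) onto GPT-2 word pieces.

    word_features: list of features, one per word.
    piece_to_word: list giving, for each GPT-2 word piece, the index of the
        word it came from; every piece of a split word receives that word's feature.
    """
    return [word_features[w] for w in piece_to_word]

def filter_null_attention(alpha, renormalize=False):
    """Zero out attention directed to the first (null) token.

    With renormalize=True, the remaining weights in each row are rescaled to sum
    to one; otherwise the rows are left unnormalized, so totals may be below one.
    """
    filtered = alpha.copy()
    filtered[:, 0] = 0.0
    if renormalize:
        row_sums = filtered.sum(axis=-1, keepdims=True)
        filtered = np.divide(filtered, row_sums, out=np.zeros_like(filtered),
                             where=row_sums > 0)
    return filtered

# Toy example: the word "AusSAR" split into the pieces "Aus", "S", "AR".
print(spread_features_to_word_pieces(["PROPN"], [0, 0, 0]))  # ['PROPN', 'PROPN', 'PROPN']
alpha = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.5, 0.3, 0.2]])
print(filter_null_attention(alpha, renormalize=True))
```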
Figure 6: Each heatmap shows the proportion of total attention directed to the given part of speech, broken out by
layer (vertical axis) and head (horizontal axis). Scales vary by tag. Results for all tags available in appendix.
Figure 7: Each heatmap shows the proportion of total attention that originates from the given part of speech, broken
out by layer (vertical axis) and head (horizontal axis). Scales vary by tag. Results for all tags available in appendix.
Figure 8: Proportion of attention that is aligned with dependency relations, aggregated by layer. The orange line
shows the baseline proportion of token pairs that share a dependency relationship, independent of attention.
Figure 9: Proportion of attention directed to various dependency types, broken out by layer.
Figure 11: Mean attention distance by layer / head (left), and by layer (right).
Rank Sentence
1 The Australian search and rescue service is provided by Aus S AR , which is part of the
Australian Maritime Safety Authority ( AM SA ).
2 In 1925 , Bapt ists worldwide formed the Baptist World Alliance ( B WA ).
3 The Oak dale D ump is listed as an Environmental Protection Agency Super fund site due
to the contamination of residential drinking water wells with volatile organic compounds (
V OC s ) and heavy metals .
Table 1: Exemplar sentences for layer 10, head 10, which focuses attention from acronyms to the associated phrase.
The tokens with maximum attention are underlined; the attending token is bolded and the token receiving attention
is italicized. It appears that attention is directed to the part of the phrase that would help the model choose the next
word piece in the acronym (after the token paying attention), reflecting the language modeling objective.
Rank Sentence
1 After the two prototypes were completed , production began in Mar iet ta , Georgia , ...
3 The fictional character Sam Fisher of the Spl inter Cell video game series by Ubisoft was
born in Tow son , as well as residing in a town house , as stated in the novel izations ...
4 Suicide bombers attack three hotels in Am man , Jordan , killing at least 60 people .
Table 2: Exemplar sentences for layer 11, head 2, which focuses attention from commas to the preceding place
name (or the last word piece thereof). The likely purpose of this attention head is to help the model choose the
related place name that would follow the comma, e.g. the country or state in which the city is located.
Rank Sentence
1 With the United States isolation ist and Britain stout ly refusing to make the " continental
commitment " to defend France on the same scale as in World War I , the prospects of Anglo
- American assistance in another war with Germany appeared to be doubtful ...
2 The show did receive a somewhat favorable review from noted critic Gilbert Se ld es in the
December 15 , 1962 TV Guide : " The whole notion on which The Beverly Hill bill ies is
founded is an encouragement to ignorance ...
3 The Arch im edes won significant market share in the education markets of the UK , Ireland
, Australia and New Zealand ; the success of the Arch im edes in British schools was due
partly to its predecessor the BBC Micro and later to the Comput ers for Schools scheme ...
Table 3: Exemplar sentences for layer 11, head 10, which focuses attention from the end of a noun phrase to the
head noun. In the first sentence, for example, the head noun is prospects and the remainder of the noun phrase is
of Anglo - American assistance in another war with Germany. The purpose of this attention pattern is likely to
predict the word (typically a verb) that follows the noun phrase, as the head noun is a strong predictor of this.
…bidirectional architecture and is trained on both token-level and sentence-level tasks.

Although the Wikipedia sentences used in our analysis cover a diverse range of topics, they all follow a similar encyclopedic format and style. Further study is needed to determine how attention patterns manifest in other types of content, such as dialog scripts or song lyrics. We would also like to analyze attention patterns in text much longer than a single sentence, especially for new Transformer variants such as the Transformer-XL (Dai et al., 2019) and Sparse Transformer (Child et al., 2019), which can handle very long contexts.

We believe that interpreting a model based on attention is complementary to linguistic probing approaches (Section 2). While linguistic probing precisely quantifies the amount of information encoded in various components of the model, it requires training and evaluating a probing classifier. Analyzing attention is a simpler process that also produces human-interpretable descriptions of model behavior, though recent work casts doubt on its role in explaining individual predictions (Jain and Wallace, 2019). The results of our analyses were often consistent with those from probing approaches.

7 Acknowledgements

Y.B. was supported by the Harvard Mind, Brain, and Behavior Initiative.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Yonatan Belinkov. 2018. On Internal Language Representations in Deep Learning: An Analysis of Machine Translation and Speech Recognition. Ph.D. thesis, Massachusetts Institute of Technology.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. arXiv preprint arXiv:1805.04218.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jaesong Lee, Joong-Hwi Shin, and Jun-Seok Kim. 2017. Interactive visualization and manipulation of attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Shusen Liu, Tao Li, Zhimin Li, Vivek Srikumar, Valerio Pascucci, and Peer-Timo Bremer. 2018. Visual interrogation of attention-based models for natural language inference and machine comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? arXiv preprint arXiv:1905.10650.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, and A. M. Rush. 2018. Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models. ArXiv e-prints.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the Association for Computational Linguistics.

Edward Tufte. 1990. Envisioning Information. Graphics Press, Cheshire, CT, USA.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Jesse Vig. 2019. A multiscale visualization of attention in the Transformer model. In Proceedings of the Association for Computational Linguistics: System Demonstrations.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.

Thomas Wolf. 2019. Some additional experiments extending the tech report "Assessing BERT's Syntactic Abilities" by Yoav Goldberg. Technical report.
A Appendix
Figures A.1 and A.2 show the results from Figures 6 and 7 for the full set of part-of-speech tags.
Figure A.1: Each heatmap shows the proportion of total attention directed to the given part of speech, broken out
by layer (vertical axis) and head (horizontal axis).
Figure A.2: Each heatmap shows the proportion of attention originating from the given part of speech, broken out
by layer (vertical axis) and head (horizontal axis).