What Does BERT Look At? An Analysis of BERT’s Attention

Kevin Clark†, Urvashi Khandelwal†, Omer Levy‡, Christopher D. Manning†
†Computer Science Department, Stanford University
‡Facebook AI Research
{kevclark,urvashik,manning}@cs.stanford.edu
omerlevy@fb.com


arXiv:1906.04341v1 [cs.CL] 11 Jun 2019

Abstract

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT’s attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT’s attention.

1 Introduction

Large pre-trained language models achieve very high accuracy when fine-tuned on supervised tasks (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018), but it is not fully understood why. The strong results suggest pre-training teaches the models about the structure of language, but what specific linguistic features do they learn?

Recent work has investigated this question by examining the outputs of language models on carefully chosen input sentences (Linzen et al., 2016) or examining the internal vector representations of the model through methods such as probing classifiers (Adi et al., 2017; Belinkov et al., 2017). Complementary to these approaches, we study¹ the attention maps of a pre-trained model. Attention (Bahdanau et al., 2015) has been a highly successful neural network component. It is naturally interpretable because an attention weight has a clear meaning: how much a particular word will be weighted when computing the next representation for the current word. Our analysis focuses on the 144 attention heads in BERT² (Devlin et al., 2019), a large pre-trained Transformer (Vaswani et al., 2017) network that has demonstrated excellent performance on many tasks.

We first explore generally how BERT’s attention heads behave. We find that there are common patterns in their behavior, such as attending to fixed positional offsets or attending broadly over the whole sentence. A surprisingly large amount of BERT’s attention focuses on the delimiter token [SEP], which we argue is used by the model as a sort of no-op. Generally we find that attention heads in the same layer tend to behave similarly.

We next probe each attention head for linguistic phenomena. In particular, we treat each head as a simple no-training-required classifier that, given a word as input, outputs the most-attended-to other word. We then evaluate the ability of the heads to classify various syntactic relations. While no single head performs well at many relations, we find that particular heads correspond remarkably well to particular relations. For example, we find heads that find direct objects of verbs, determiners of nouns, objects of prepositions, and objects of possessive pronouns with >75% accuracy. We perform a similar analysis for coreference resolution, also finding a BERT head that performs quite well. These results are intriguing because the behavior of the attention heads emerges purely from self-supervised training on unlabeled data, without explicit supervision for syntax or coreference.

¹ Code will be released at https://github.com/clarkkev/attention-analysis.
² We use the English base-sized model.

Figure 1: Examples of heads exhibiting the patterns discussed in Section 3. The darkness of a line indicates the strength of the attention weight (some attention weights are so low they are invisible). Panels: Head 1-1 attends broadly; Head 3-1 attends to the next token; Head 8-7 attends to [SEP]; Head 11-6 attends to periods.


Our findings show that particular heads specialize to specific aspects of syntax. To get a more overall measure of the attention heads’ syntactic ability, we propose an attention-based probing classifier that takes attention maps as input. The classifier achieves 77 UAS at dependency parsing, showing BERT’s attention captures a substantial amount about syntax. Several recent works have proposed incorporating syntactic information to improve attention (Eriguchi et al., 2016; Chen et al., 2018; Strubell et al., 2018). Our work suggests that to an extent this kind of syntax-aware attention already exists in BERT, which may be one of the reasons for its success.

2 Background: Transformers and BERT

Although our analysis methods are applicable to any model that uses an attention mechanism, in this paper we analyze BERT (Devlin et al., 2019), a large Transformer (Vaswani et al., 2017) network. Transformers consist of multiple layers where each layer contains multiple attention heads. An attention head takes as input a sequence of vectors h = [h_1, ..., h_n] corresponding to the n tokens of the input sentence. Each vector h_i is transformed into query, key, and value vectors q_i, k_i, v_i through separate linear transformations. The head computes attention weights α between all pairs of words as softmax-normalized dot products between the query and key vectors. The output o of the attention head is a weighted sum of the value vectors.

$$\alpha_{ij} = \frac{\exp(q_i^T k_j)}{\sum_{l=1}^{n} \exp(q_i^T k_l)} \qquad o_i = \sum_{j=1}^{n} \alpha_{ij} v_j$$

Attention weights can be viewed as governing how “important” every other token is when producing the next representation for the current token.

BERT is pre-trained on 3.3 billion tokens of English text to perform two tasks. In the “masked language modeling” task, the model predicts the identities of words that have been masked-out of the input text. In the “next sentence prediction” task, the model predicts whether the second half of the input follows the first half of the input in the corpus, or is a random paragraph. Further training the model on supervised data results in impressive performance across a variety of tasks ranging from sentiment analysis to question answering. An important detail of BERT is the preprocessing used for the input text. A special token [CLS] is added to the beginning of the text and another token [SEP] is added to the end. If the input consists of multiple separate texts (e.g., a reading comprehension example consists of a separate question and context), [SEP] tokens are also used to separate them. As we show in the next section, these special tokens play an important role in BERT’s attention. We use the “base” sized BERT model, which has 12 layers containing 12 attention heads each. We use <layer>-<head number> to denote a particular attention head.

3 Surface-Level Patterns in Attention

Before looking at specific linguistic phenomena, we first perform an analysis of surface-level patterns in how BERT’s attention heads behave. Examples of heads exhibiting these patterns are shown in Figure 1.
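
The attention head computation defined in Section 2 can be illustrated with a minimal NumPy sketch. The function name `attention_head` and the projection matrices `W_q`, `W_k`, `W_v` are illustrative assumptions, not names from the paper or its released code. The sketch follows the equations exactly as written here; production Transformer implementations, BERT included, additionally scale the dot products by the square root of the key dimension.

```python
import numpy as np

def attention_head(h, W_q, W_k, W_v):
    """Single attention head following the equations in Section 2.

    h: (n, d) array, one row per input token.
    W_q, W_k, W_v: (d, d_head) projection matrices (hypothetical names).
    Returns (alpha, o): alpha[i, j] is the weight token i puts on token j,
    and o[i] is the weighted sum of value vectors for token i.
    """
    q = h @ W_q  # query vectors q_i
    k = h @ W_k  # key vectors k_i
    v = h @ W_v  # value vectors v_i

    scores = q @ k.T  # scores[i, j] = q_i^T k_j
    # Subtract the row maximum before exponentiating; this leaves the
    # softmax unchanged but avoids overflow.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    alpha = weights / weights.sum(axis=-1, keepdims=True)

    o = alpha @ v  # o_i = sum_j alpha_ij v_j
    return alpha, o


# Usage with random inputs: 5 tokens, hidden size 8, head size 4.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
alpha, o = attention_head(h, W_q, W_k, W_v)
```

Each row of `alpha` is a probability distribution over the input tokens, which is the property the analyses in this paper rely on.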
Figure 2: Each point corresponds to the average attention a particular BERT attention head puts toward a token type. Above: heads often attend to “special” tokens. Early heads attend to [CLS], middle heads attend to [SEP], and deep heads attend to periods and commas. Often more than half of a head’s total attention is to these tokens. Below: heads attend to [SEP] tokens even more when the current token is [SEP] itself. (Both panels plot average attention against layer. The top panel shows [CLS], [SEP], and periods/commas; the bottom panel shows [SEP] -> [SEP] and other -> [SEP].)

Figure 3: Gradient-based feature importance estimates for attention to [SEP], periods/commas, and other tokens. (The plot shows the average ∂L/∂α against layer for all unmasked tokens, [SEP], and periods/commas.)

Figure 4: Entropies of attention distributions. In the first layer there are particularly high-entropy heads that produce bag-of-vector-like representations. (The plot shows the average attention entropy in nats of each BERT head against layer, with a reference line for uniform attention.)
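
The relative-position and entropy statistics analyzed in this section can be sketched as follows. This is a minimal illustration assuming one head's attention map is available as a NumPy array whose rows sum to 1. The function names are hypothetical, and the exact averaging scheme (here, a mean over the tokens of a single segment) is an assumption rather than the paper's documented procedure.

```python
import numpy as np

def offset_attention_mass(attn, offset):
    """Average attention a token puts on the token `offset` positions away.

    attn: (n, n) array, attn[i, j] = attention from token i to token j.
    offset: +1 for the next token, -1 for the previous token, 0 for itself.
    Only tokens that have a neighbor at that offset enter the average.
    """
    # np.diagonal(attn, offset=k) collects attn[i, i + k] for all valid i.
    return float(np.diagonal(attn, offset=offset).mean())

def attention_entropy(attn, eps=1e-12):
    """Mean entropy in nats of the per-token attention distributions."""
    # Clip before the log so zero weights contribute 0 rather than NaN.
    log_p = np.log(np.clip(attn, eps, 1.0))
    return float(-(attn * log_p).sum(axis=-1).mean())


# Uniform attention over a 128-token segment gives the reference entropy.
uniform = np.full((128, 128), 1.0 / 128)
reference = attention_entropy(uniform)  # equals ln(128), about 4.85 nats
```

A head that always attends to the next token would give `offset_attention_mass(attn, 1)` close to 1 and an entropy close to 0, while a broadly attending head approaches the uniform reference value.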
Setup. We extract the attention maps from BERT-base over 1000 random Wikipedia segments. We follow the setup used for pre-training BERT where each segment consists of at most 128 tokens corresponding to two consecutive paragraphs of Wikipedia (although we do not mask out input words as in BERT’s training). The input presented to the model is [CLS]<paragraph-1>[SEP]<paragraph-2>[SEP].

3.1 Relative Position

First, we compute how often BERT’s attention heads attend to the current token, the previous token, or the next token. We find that most heads put little attention on the current token. However, there are heads that specialize to attending heavily on the next or previous token, especially in earlier layers of the network. In particular four attention heads (in layers 2, 4, 7, and 8) on average put >50% of their attention on the previous token and five attention heads (in layers 1, 2, 2, 3, and 6) put >50% of their attention on the next token.

3.2 Attending to Separator Tokens

Interestingly, we found that a substantial amount of BERT’s attention focuses on a few tokens (see Figure 2). For example, over half of BERT’s attention in layers 6-10 focuses on [SEP]. To put this in context, since most of our segments are 128 tokens long, the average attention for a token occurring twice in a segment, like [SEP], would normally be around 1/64. [SEP] and [CLS] are guaranteed to be present and are never masked out, while periods and commas are the most common tokens in the data excluding “the,” which might be why the model treats these tokens differently. A similar pattern occurs for the uncased BERT model, suggesting there is a systematic reason for the attention to special tokens rather than it being an artifact of stochastic training.

One possible explanation is that [SEP] is used to aggregate segment-level information which can then be read by other heads. However, further analysis makes us doubtful this is the case. If this
explanation were true, we would expect attention heads processing [SEP] to attend broadly over the whole segment to build up these representations. However, they instead almost entirely (more than 90%; see bottom of Figure 2) attend to themselves and the other [SEP] token. Furthermore, qualitative analysis (see Figure 5) shows that heads with specific functions attend to [SEP] when the function is not called for. For example, in head 8-10 direct objects attend to their verbs. For this head, non-nouns mostly attend to [SEP]. Therefore, we speculate that attention over these special tokens might be used as a sort of “no-op” when the attention head’s function is not applicable.

To further investigate this hypothesis, we apply gradient-based measures of feature importance (Sundararajan et al., 2017). In particular, we compute the magnitude of the gradient of the loss from BERT’s masked language modeling task with respect to each attention weight. Intuitively, this value measures how much changing the attention to a token will change BERT’s outputs. Results are shown in Figure 3. Starting in layer 5, the same layer where attention to [SEP] becomes high, the gradients for attention to [SEP] become very small. This indicates that attending more or less to [SEP] does not substantially change BERT’s outputs, supporting the theory that attention to [SEP] is used as a no-op for attention heads.

3.3 Focused vs Broad Attention

Lastly, we measure whether attention heads focus on a few words or attend broadly over many words. To do this, we compute the average entropy of each head’s attention distribution (see Figure 4). We find that some attention heads, especially in lower layers, have very broad attention. These high-entropy attention heads typically spend at most 10% of their attention mass on any single word. The output of these heads is roughly a bag-of-vectors representation of the sentence.

We also measured entropies for all attention heads from only the [CLS] token. While the average entropies from [CLS] for most layers are very close to the ones shown in Figure 4, the last layer has a high entropy from [CLS] of 3.89 nats, indicating very broad attention. This finding makes sense given that the representation for the [CLS] token is used as input for the “next sentence prediction” task during pre-training, so it attends broadly to aggregate a representation for the whole input in the last layer.

4 Probing Individual Attention Heads

Next, we investigate individual attention heads to probe what aspects of language they have learned. In particular, we evaluate attention heads on labeled datasets for tasks like dependency parsing. An overview of our results is shown in Figure 5.

4.1 Method

We wish to evaluate attention heads at word-level tasks, but BERT uses byte-pair tokenization (Sennrich et al., 2016), which means some words (∼8% in our data) are split up into multiple tokens. We therefore convert token-token attention maps to word-word attention maps. For attention to a split-up word, we sum up the attention weights over its tokens. For attention from a split-up word, we take the mean of the attention weights over its tokens. These transformations preserve the property that the attention from each word sums to 1. For a given attention head and word, we take whichever other word receives the most attention weight as that model’s prediction.³

³ We ignore [SEP] and [CLS], although in practice this does not significantly change the accuracies for most heads.

4.2 Dependency Syntax

Setup. We extract attention maps from BERT on the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993) annotated with Stanford Dependencies. We evaluate both “directions” of prediction for each attention head: the head word attending to the dependent and the dependent attending to the head word. Some dependency relations are simpler to predict than others: for example a noun’s determiner is often the immediately preceding word. Therefore as a point of comparison, we show predictions from a simple fixed-offset baseline. For example, a fixed offset of -2 means the word two positions to the left of the dependent is always considered to be the head.

Results. Table 1 shows that there is no single attention head that does well at syntax “overall”; the best head gets 34.5 UAS, which is not much better than the right-branching baseline, which gets 26.3 UAS. This finding is similar to the one reported by Raganato and Tiedemann (2018), who also evaluate individual attention heads for syntax.

However, we do find that certain attention heads specialize to specific dependency relations, some-
Panels of Figure 5:
- Head 8-10: direct objects attend to their verbs; 86.8% accuracy at the dobj relation.
- Head 8-11: noun modifiers (e.g., determiners) attend to their noun; 94.3% accuracy at the det relation.
- Head 7-6: possessive pronouns and apostrophes attend to the head of the corresponding NP; 80.5% accuracy at the poss relation.
- Head 4-10: passive auxiliary verbs attend to the verb they modify; 82.5% accuracy at the auxpass relation.
- Head 9-6: prepositions attend to their objects; 76.3% accuracy at the pobj relation.
- Head 5-4: coreferent mentions attend to their antecedents; 65.1% accuracy at linking the head of a coreferent mention to the head of an antecedent.

Figure 5: BERT attention heads that correspond to linguistic phenomena. In the example attention maps, the darkness of a line indicates the strength of the attention weight. All attention to/from red words is colored red; these colors are there to highlight certain parts of the attention heads’ behaviors. For Head 9-6, we don’t show attention to [SEP] for clarity. Despite not being explicitly trained on these tasks, BERT’s attention heads perform remarkably well, illustrating how syntax-sensitive behavior can emerge from self-supervised training alone.
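
The per-head accuracies reported in Figure 5 rest on the word-level conversion described in Section 4.1, which can be sketched as below. This is a minimal illustration, not the released implementation: the function names and the `word_ids` input (one contiguous word index per token, starting at 0) are assumptions, and the sketch does not handle the removal of [CLS] and [SEP]. `head_accuracy` covers only the direction in which the dependent attends to its head.

```python
import numpy as np

def tokens_to_words(attn, word_ids):
    """Convert a token-token attention map to a word-word attention map.

    attn: (n_tokens, n_tokens) array whose rows sum to 1.
    word_ids: for each token, the index of the word it belongs to
              (assumed contiguous, 0 .. n_words - 1).
    """
    word_ids = np.asarray(word_ids)
    n_tokens, n_words = len(word_ids), int(word_ids.max()) + 1
    # membership[t, w] = 1 if token t is a piece of word w.
    membership = np.zeros((n_tokens, n_words))
    membership[np.arange(n_tokens), word_ids] = 1.0

    # Attention TO a split-up word: sum the weights over its tokens.
    to_words = attn @ membership  # (n_tokens, n_words)
    # Attention FROM a split-up word: mean of the weights over its tokens.
    tokens_per_word = membership.sum(axis=0)
    return (membership.T @ to_words) / tokens_per_word[:, None]

def most_attended_word(word_attn):
    """For each word, the index of the OTHER word it attends to most."""
    masked = word_attn.copy()
    np.fill_diagonal(masked, -np.inf)  # exclude attention to itself
    return masked.argmax(axis=-1)

def head_accuracy(word_attn, dependents, heads):
    """Fraction of dependents whose most-attended word is their gold head."""
    predictions = most_attended_word(word_attn)
    return float(np.mean(predictions[np.asarray(dependents)] == np.asarray(heads)))


# Three tokens where the last two are pieces of the same word.
attn = np.array([[0.2, 0.3, 0.5],
                 [0.6, 0.1, 0.3],
                 [0.2, 0.4, 0.4]])
word_attn = tokens_to_words(attn, [0, 1, 1])
```

Summing over columns and averaging over rows keeps each row of `word_attn` a probability distribution, which matches the property stated in Section 4.1.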
times achieving high accuracy and substantially outperforming the fixed-offset baseline. We find that for all relations in Table 1 except pobj, the dependent attends to the head word rather than the other way around, likely because each dependent has exactly one head but heads have multiple dependents. We also note heads can disagree with standard annotation conventions while still performing syntactic behavior. For example, head 7-6 marks ’s as the dependent for the poss relation, while gold-standard labels mark the complement of an ’s as the dependent (the accuracy in Table 1 counts ’s as correct). Such disagreements highlight how these syntactic behaviors in BERT are learned as a by-product of self-supervised training, not by copying a human design.

Relation   Head   Accuracy   Baseline
All        7-6    34.5       26.3 (1)
prep       7-4    66.7       61.8 (-1)
pobj       9-6    76.3       34.6 (-2)
det        8-11   94.3       51.7 (1)
nn         4-10   70.4       70.2 (1)
nsubj      8-2    58.5       45.5 (1)
amod       4-10   75.6       68.3 (1)
dobj       8-10   86.8       40.0 (-2)
advmod     7-6    48.8       40.2 (1)
aux        4-10   81.1       71.5 (1)
poss       7-6    80.5       47.7 (1)
auxpass    4-10   82.5       40.5 (1)
ccomp      8-1    48.8       12.4 (-2)
mark       8-2    50.7       14.5 (2)
prt        6-7    99.1       91.4 (-1)

Table 1: The best performing attention heads of BERT on WSJ dependency parsing by dependency type. Numbers after baseline accuracies show the best offset found (e.g., (1) means the word to the right is predicted as the head). We show the 10 most common relations as well as 5 other ones attention heads do well on. Bold highlights particularly effective heads.

Figure 5 shows some examples of the attention behavior. While the similarity between machine-learned attention weights and human-defined syntactic relations is striking, we note these are relations for which attention heads do particularly well. There are many relations for which BERT only slightly improves over the simple baseline, so we would not say individual attention heads capture dependency structure as a whole. We think it would be interesting future work to extend our analysis to see if the relations well-captured by attention are similar or different for other languages.

4.3 Coreference Resolution

Having shown BERT attention heads reflect certain aspects of syntax, we now explore using attention heads for the more challenging semantic task of coreference resolution. Coreference links are usually longer than syntactic dependencies and state-of-the-art systems generally perform much worse at coreference compared to parsing.

Setup. We evaluate the attention heads on coreference resolution using the CoNLL-2012 dataset⁴ (Pradhan et al., 2012). In particular, we compute antecedent selection accuracy: what percent of the time does the head word of a coreferent mention most attend to the head of one of that mention’s antecedents. We compare against three baselines for selecting an antecedent:

• Picking the nearest other mention.
• Picking the nearest other mention with the same head word as the current mention.
• A simple rule-based system inspired by Lee et al. (2011). It proceeds through 4 sieves: (1) full string match, (2) head word match, (3) number/gender/person match, (4) all other mentions. The nearest mention satisfying the earliest sieve is returned.

We also show the performance of a recent neural coreference system from Wiseman et al. (2015).

Results. Results are shown in Table 2. We find that one of BERT’s attention heads achieves decent coreference resolution performance, improving by over 10 accuracy points on the string-matching baseline and performing close to the rule-based system. It is particularly good with nominal mentions, perhaps because it is capable of fuzzy matching between synonyms as seen in the bottom right of Figure 5.

⁴ We truncate documents to 128 tokens long to keep memory usage manageable.

5 Probing Attention Head Combinations

Since individual attention heads specialize to particular aspects of syntax, the model’s overall “knowledge” about syntax is distributed across multiple attention heads. We now measure this overall ability by proposing a novel family of attention-based probing classifiers and applying
    Model             All   Pronoun   Proper   Nominal
    Nearest           27    29        29       19
    Head match        52    47        67       40
    Rule-based        69    70        77       60
    Neural coref      83*   –         –        –
    Head 5-4          65    64        73       58

*Only roughly comparable because on non-truncated documents and with different mention detection.

Table 2: Accuracies (%) for systems at selecting a correct antecedent given a coreferent mention in the CoNLL-2012 data. One of BERT’s attention heads performs fairly well at coreference.

them to dependency parsing. For these classifiers we treat the BERT attention outputs as fixed, i.e., we do not back-propagate into BERT and only train a small number of parameters.
   The probing classifiers are basically graph-based dependency parsers. Given an input word, the classifier produces a probability distribution over other words in the sentence indicating how likely each other word is to be the syntactic head of the current one.

Attention-Only Probe. Our first probe learns a simple linear combination of attention weights.

$$p(i|j) \propto \exp\left(\sum_{k=1}^{n} \left( w_k \alpha_{ij}^{k} + u_k \alpha_{ji}^{k} \right)\right)$$

where $p(i|j)$ is the probability of word $i$ being word $j$’s syntactic head, $\alpha_{ij}^{k}$ is the attention weight from word $i$ to word $j$ produced by head $k$, and $n$ is the number of attention heads. We include both directions of attention: candidate head to dependent as well as dependent to candidate head. The weight vectors $w$ and $u$ are trained using standard supervised learning on the train set.

Attention-and-Words Probe. Given our finding that heads specialize to particular syntactic relations, we believe probing classifiers should benefit from having information about the input words. In particular, we build a model that sets the weights of the attention heads based on the GloVe (Pennington et al., 2014) embeddings for the input words. Intuitively, if the dependent and candidate head are “the” and “cat,” the probing classifier should learn to assign most of the weight to the head 8-11, which achieves excellent performance at the determiner relation. The attention-and-words probing classifier assigns the probability of word $i$ being word $j$’s head as

$$p(i|j) \propto \exp\left(\sum_{k=1}^{n} \left( W_{k,:}(v_i \oplus v_j)\,\alpha_{ij}^{k} + U_{k,:}(v_i \oplus v_j)\,\alpha_{ji}^{k} \right)\right)$$

where $v$ denotes GloVe embeddings and $\oplus$ denotes concatenation. The GloVe embeddings are held fixed in training, so only the two weight matrices $W$ and $U$ are learned. The dot product $W_{k,:}(v_i \oplus v_j)$ produces a word-sensitive weight for the particular attention head.

Results. We evaluate our methods on the Penn Treebank dev set annotated with Stanford dependencies. We compare against three baselines:

   • A right-branching baseline that always predicts the head is to the dependent’s right.
   • A simple one-hidden-layer network that takes as input the GloVe embeddings for the dependent and candidate head as well as distance features between the two words.5
   • Our attention-and-words probe, but with attention maps from a BERT network with pre-trained word/positional embeddings but randomly initialized other weights. This kind of baseline is surprisingly strong at other probing tasks (Conneau et al., 2018).

   Results are shown in Table 3. We find the Attn + GloVe probing classifier substantially outperforms our baselines and achieves a decent UAS of 77, suggesting BERT’s attention maps have a fairly thorough representation of English syntax.
   As a rough comparison, we also report results from the structural probe from Hewitt and Manning (2019), which builds a probing classifier on top of BERT’s vector representations rather than attention. The scores are not directly comparable because the structural probe only uses a single layer of BERT, produces undirected rather than directed parse trees, and is trained to produce the syntactic distance between words rather than directly predicting the tree structure. Nevertheless, the similarity in score to our Attn + GloVe probing classifier suggests there is not much more syntactic information in BERT’s vector representations compared to its attention maps.

   5 Indicator features for short distances as well as continuous distance features, with distance ahead/behind treated separately to capture word order.
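
The scoring rules of the two probes in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function names, the nested-list layout `attn[k][a][b]` (attention from word a to word b in head k), and the choice to normalize only over words other than the dependent are all illustrative, and the root of the sentence is ignored. Training of `w`, `u`, `W`, and `U` is not shown.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def attention_only_scores(attn, w, u, j):
    """Score every word i as a candidate head of dependent j.

    attn[k][a][b]: attention weight from word a to word b in head k.
    w, u: one learned scalar per head for the two attention directions.
    """
    n_heads, n_words = len(attn), len(attn[0])
    return [
        sum(w[k] * attn[k][i][j] + u[k] * attn[k][j][i] for k in range(n_heads))
        for i in range(n_words)
    ]


def attention_and_words_scores(attn, W, U, emb, j):
    """Same scoring, but each head's weight depends on the word pair.

    emb[a]: fixed embedding of word a (GloVe in the paper).
    W[k], U[k]: learned rows of length 2 * len(emb[a]).
    """
    n_heads, n_words = len(attn), len(attn[0])
    scores = []
    for i in range(n_words):
        pair = emb[i] + emb[j]  # list concatenation plays the role of v_i ⊕ v_j
        scores.append(sum(
            dot(W[k], pair) * attn[k][i][j] + dot(U[k], pair) * attn[k][j][i]
            for k in range(n_heads)
        ))
    return scores


def head_distribution(scores, j):
    """Softmax over candidate heads i != j (assumption: a word is not its own head)."""
    candidates = [i for i in range(len(scores)) if i != j]
    top = max(scores[i] for i in candidates)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - top) for i in candidates}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}
```

With a single head and `w = [1.0]`, `u = [0.0]`, the attention-only probe simply ranks candidate heads by how much they attend to the dependent; the attention-and-words probe reduces to the same scores when `W[k] · (v_i ⊕ v_j)` happens to equal `w[k]` for every word pair.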
    Model                                  UAS
    Structural probe                    80 UUAS*
    Right-branching                         26
    Distances + GloVe                       58
    Random Init Attn + GloVe                30
    Attn                                    61
    Attn + GloVe                            77

Table 3: Results of attention-based probing classifiers
on dependency parsing. A simple model taking BERT
attention maps and GloVe embeddings as input per-
forms quite well. *Not directly comparable to our num-
bers; see text.
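
For readers unfamiliar with the metric in Table 3, UAS (unlabeled attachment score) is the percentage of words whose predicted head is the correct one. The sketch below computes it and builds one possible right-branching prediction. The encoding is an assumption made for illustration: heads are word indices, `-1` marks the root, and the right-branching baseline is read as "the head is the next word", with the last word predicted as root. The paper states only that the baseline predicts a head to the dependent's right.

```python
def uas(predicted_heads, gold_heads):
    """Percentage of words whose predicted head index equals the gold head index."""
    if len(predicted_heads) != len(gold_heads):
        raise ValueError("predicted and gold head lists must have the same length")
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return 100.0 * correct / len(gold_heads)


def right_branching_heads(n_words):
    """Assumed baseline: each word's head is the next word; the last word is the root (-1)."""
    return [i + 1 for i in range(n_words - 1)] + [-1]


# "she saw the cat": she -> saw, saw -> root, the -> cat, cat -> saw
gold = [1, -1, 3, 1]
baseline_score = uas(right_branching_heads(len(gold)), gold)
```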


   Overall, our results from probing both individual attention heads and combinations of heads suggest that BERT learns some aspects of syntax purely as a by-product of self-supervised training. Other work has drawn similar conclusions from examining BERT’s predictions on agreement tasks (Goldberg, 2019) or internal vector representations (Hewitt and Manning, 2019; Liu et al., 2019). Traditionally, syntax-aware models have been developed through architecture design (e.g., recursive neural networks) or from direct supervision from human-curated treebanks. Our findings are part of a growing body of work indicating that indirect supervision from rich pre-training tasks like language modeling can also produce models sensitive to language’s hierarchical structure.

6     Clustering Attention Heads
Figure 6: BERT attention heads embedded in two-dimensional space. Distance between points approximately matches the average Jensen-Shannon divergences between the outputs of the corresponding heads. Heads in the same layer tend to be close together. Attention head “behavior” was found through the analysis methods discussed throughout this paper.

Are attention heads in the same layer similar to each other or different? Can attention heads be clearly grouped by behavior? We investigate these questions by computing the distances between all pairs of attention heads. Formally, we measure the distance between two heads $H_i$ and $H_j$ as:

$$\sum_{\text{token} \in \text{data}} JS\big(H_i(\text{token}),\, H_j(\text{token})\big)$$
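
The distance above can be sketched in plain Python as follows. This is an illustrative sketch rather than the code behind Figure 6: the function names are hypothetical, each head is represented as a list holding one attention distribution per token, both heads are assumed to be evaluated on the same tokens, and the natural logarithm is assumed (so the Jensen-Shannon divergence of two distributions lies between 0 and ln 2).

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def head_distance(head_a, head_b):
    """Sum of JS divergences over tokens.

    head_x[t] is the attention distribution that head x produces for token t.
    """
    return sum(js_divergence(p, q) for p, q in zip(head_a, head_b))


# Two toy heads over two tokens, each attending over three positions.
head_a = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]]
head_b = [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2]]
distance = head_distance(head_a, head_b)
```

Because the mixture `m` is positive wherever `p` or `q` is, the divergence stays finite even when the two heads attend to disjoint positions, which is one reason it is convenient for comparing sparse attention distributions.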

Here, JS denotes the Jensen-Shannon divergence between attention distributions. Using these distances, we visualize the attention heads by applying multidimensional scaling (Kruskal, 1964) to embed each head in two dimensions such that the Euclidean distance between embeddings reflects the Jensen-Shannon distance between the corresponding heads as closely as possible.
   Results are shown in Figure 6. We find that there are several clear clusters of heads that behave similarly, often corresponding to behaviors we have already discussed in this paper. Heads within the same layer are often fairly close to each other, meaning that heads within the layer have similar attention distributions. This finding is a bit surprising given that Tu et al. (2018) show that encouraging attention heads to have different behaviors can improve Transformer performance at machine translation. One possibility for the apparent redundancy in BERT’s attention heads is the use
of attention dropout, which causes some attention weights to be zeroed out during training.

7   Related Work

There has been substantial recent work performing analysis to better understand what neural networks learn, especially from language model pre-training. One line of research examines the outputs of language models on carefully chosen input sentences (Linzen et al., 2016; Khandelwal et al., 2018; Gulordava et al., 2018; Marvin and Linzen, 2018). For example, the model’s performance at subject-verb agreement (generating the correct number of a verb far away from its subject) provides a measure of the model’s syntactic ability, although it does not reveal how that ability is captured by the network.
   Another line of work investigates the internal vector representations of the model (Adi et al., 2017; Giulianelli et al., 2018; Zhang and Bowman, 2018), often using probing classifiers. Probing classifiers are simple neural networks that take the vector representations of a pre-trained model as input and are trained to do a supervised task (e.g., part-of-speech tagging). If a probing classifier achieves high accuracy, it suggests that the input representations reflect the corresponding aspect of language (e.g., low-level syntax). Like our work, some of these studies have also demonstrated models capturing aspects of syntax (Shi et al., 2016; Blevins et al., 2018) or coreference (Tenney et al., 2018, 2019; Liu et al., 2019) without explicitly being trained for the tasks.
   With regard to analyzing attention, Vig (2019) builds a visualization tool for BERT’s attention and reports observations about the attention behavior, but does not perform quantitative analysis. Burns et al. (2018) analyze the attention of memory networks to understand model performance on a question answering dataset. There has also been some initial work in correlating attention with syntax. Raganato and Tiedemann (2018) evaluate the attention heads of a machine translation model on dependency parsing, but only report overall UAS scores instead of investigating heads for specific syntactic relations or using probing classifiers. Marecek and Rosa (2018) propose heuristic ways of converting attention scores to syntactic trees, but do not quantitatively evaluate their approach. For coreference, Voita et al. (2018) show that the attention of a context-aware neural machine translation system captures anaphora, similar to our finding for BERT.
   Concurrently with our work, Voita et al. (2019) identify syntactic, positional, and rare-word-sensitive attention heads in machine translation models. They also demonstrate that many attention heads can be pruned away without substantially hurting model performance. Interestingly, the important attention heads that remain after pruning tend to be ones with identified behaviors. Michel et al. (2019) similarly show that many of BERT’s attention heads can be pruned. Although our analysis in this paper only found interpretable behaviors in a subset of BERT’s attention heads, these recent works suggest that there might not be much to explain for some attention heads because they have little effect on model performance.
   Jain and Wallace (2019) argue that attention often does not “explain” model predictions. They show that attention weights frequently do not correlate with other measures of feature importance. Furthermore, attention weights can often be substantially changed without altering model predictions. However, our motivation for looking at attention is different: rather than explaining model predictions, we are seeking to understand information learned by the models. For example, if a particular attention head learns a syntactic relation, we consider that an important finding from an analysis perspective even if that head is not always used when making predictions for some downstream task.

8   Conclusion

We have proposed a series of analysis methods for understanding the attention mechanisms of models and applied them to BERT. While most recent work on model analysis for NLP has focused on probing vector representations or model outputs, we have shown that a substantial amount of linguistic knowledge can be found not only in the hidden states, but also in the attention maps. We think probing attention maps complements these other model analysis techniques, and should be part of the toolkit used by researchers to understand what neural networks learn about language.

Acknowledgements

We thank the anonymous reviewers for their thoughtful comments and suggestions. Kevin is supported by a Google PhD Fellowship.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer
  Lavi, and Yoav Goldberg. 2017. Fine-grained
  analysis of sentence embeddings using auxiliary
  prediction tasks. In ICLR.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
  Bengio. 2015. Neural machine translation by jointly
  learning to align and translate. In ICLR.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi,
  Hassan Sajjad, and James R. Glass. 2017. What do
  neural machine translation models learn about
  morphology? In ACL.
Terra Blevins, Omer Levy, and Luke S. Zettlemoyer.
  2018. Deep rnns encode soft hierarchical syntax. In
  ACL.
Kaylee Burns, Aida Nematzadeh, Alison Gopnik, and
  Thomas L. Griffiths. 2018. Exploiting attention to
  reveal shortcomings in memory models. In
  BlackboxNLP@EMNLP.
Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro
  Sumita, and Tiejun Zhao. 2018. Syntax-directed
  attention for neural machine translation. In AAAI.
Alexis Conneau, Germán Kruszewski, Guillaume
  Lample, Loïc Barrault, and Marco Baroni. 2018.
  What you can cram into a single $&!#* vector:
  Probing sentence embeddings for linguistic
  properties. In ACL.
Andrew M Dai and Quoc V Le. 2015. Semi-supervised
  sequence learning. In NIPS.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
  Kristina Toutanova. 2019. BERT: Pre-training of
  deep bidirectional transformers for language
  understanding. In NAACL-HLT.
Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa
  Tsuruoka. 2016. Tree-to-sequence attentional
  neural machine translation. In ACL.
Mario Giulianelli, Jack Harding, Florian Mohnert,
  Dieuwke Hupkes, and Willem H. Zuidema. 2018.
  Under the hood: Using diagnostic classifiers to
  investigate and improve how language models track
  agreement information. In BlackboxNLP@EMNLP.
Yoav Goldberg. 2019. Assessing BERT’s syntactic
  abilities. arXiv preprint arXiv:1901.05287.
Kristina Gulordava, Piotr Bojanowski, Edouard Grave,
  Tal Linzen, and Marco Baroni. 2018. Colorless
  green recurrent networks dream hierarchically. In
  NAACL-HLT.
John Hewitt and Christopher D. Manning. 2019.
  Finding syntax with structural probes. In NAACL-HLT.
Sarthak Jain and Byron C. Wallace. 2019. Attention is
  not explanation. arXiv preprint arXiv:1902.10186.
Urvashi Khandelwal, He He, Peng Qi, and Daniel
  Jurafsky. 2018. Sharp nearby, fuzzy far away: How
  neural language models use context. In ACL.
Joseph B Kruskal. 1964. Multidimensional scaling by
  optimizing goodness of fit to a nonmetric
  hypothesis. Psychometrika, 29(1):1–27.
Heeyoung Lee, Yves Peirsman, Angel Chang,
  Nathanael Chambers, Mihai Surdeanu, and Dan
  Jurafsky. 2011. Stanford’s multi-pass sieve
  coreference resolution system at the conll-2011
  shared task. In CoNLL.
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg.
  2016. Assessing the ability of lstms to learn
  syntax-sensitive dependencies. TACL.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov,
  M. Peters, and Noah A. Smith. 2019. Linguistic
  knowledge and transferability of contextual
  representations. In NAACL-HLT.
Mitchell P Marcus, Mary Ann Marcinkiewicz, and
  Beatrice Santorini. 1993. Building a large annotated
  corpus of english: The Penn treebank.
  Computational linguistics, 19(2):313–330.
David Marecek and Rudolf Rosa. 2018. Extracting
  syntactic trees from transformer encoder
  self-attentions. In BlackboxNLP@EMNLP.
Rebecca Marvin and Tal Linzen. 2018. Targeted
  syntactic evaluation of language models. In EMNLP.
Paul Michel, Omer Levy, and Graham Neubig. 2019.
  Are sixteen heads really better than one? arXiv
  preprint arXiv:1905.10650.
Jeffrey Pennington, Richard Socher, and Christopher
  Manning. 2014. Glove: Global vectors for word
  representation. In EMNLP.
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt
  Gardner, Christopher Clark, Kenton Lee, and Luke
  Zettlemoyer. 2018. Deep contextualized word
  representations. In NAACL-HLT.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue,
  Olga Uryupina, and Yuchen Zhang. 2012.
  Conll-2012 shared task: Modeling multilingual
  unrestricted coreference in ontonotes. In Joint
  Conference on EMNLP and CoNLL-Shared Task.
Alec Radford, Karthik Narasimhan, Tim Salimans,
  and Ilya Sutskever. 2018. Improving language
  understanding by generative pre-training.
  https://blog.openai.com/language-unsupervised.
Alessandro Raganato and Jörg Tiedemann. 2018.
  An analysis of encoder representations in
  transformer-based machine translation. In
  BlackboxNLP@EMNLP.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
  2016. Neural machine translation of rare words with
  subword units. In ACL.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does
  string-based neural mt learn source syntax?    In
  EMNLP.
Emma Strubell, Patrick Verga, Daniel Andor,
  David I Weiss, and Andrew McCallum. 2018.
  Linguistically-informed self-attention for semantic
  role labeling. In EMNLP.
Mukund Sundararajan, Ankur Taly, and Qiqi Yan.
 2017. Axiomatic attribution for deep networks. In
 ICML.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019.
   Bert rediscovers the classical nlp pipeline. arXiv
   preprint arXiv:1905.05950.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang,
   Adam Poliak, R Thomas McCoy, Najoung Kim,
   Benjamin Van Durme, Samuel R Bowman, Dipan-
   jan Das, et al. 2018. What do you learn from con-
   text? probing for sentence structure in contextual-
   ized word representations. In ICLR.
Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and
  Tong Zhang. 2018. Multi-head attention with dis-
  agreement regularization. In EMNLP.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
  Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
  Kaiser, and Illia Polosukhin. 2017. Attention is all
  you need. In NIPS.
Jesse Vig. 2019. Visualizing attention in transformer-
   based language models.             arXiv preprint
   arXiv:1904.02679.
Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan
  Titov. 2018. Context-aware neural machine transla-
  tion learns anaphora resolution. In ACL.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sen-
  nrich, and Ivan Titov. 2019. Analyzing multi-
  head self-attention: Specialized heads do the heavy
  lifting, the rest can be pruned. arXiv preprint
  arXiv:1905.09418.
Sam Joshua Wiseman, Alexander Matthew Rush, Stu-
  art Merrill Shieber, and Jason Weston. 2015. Learn-
  ing anaphoricity and antecedent ranking features for
  coreference resolution. In ACL.
Kelly W. Zhang and Samuel R. Bowman. 2018. Lan-
  guage modeling teaches you more syntax than trans-
  lation does: Lessons learned through auxiliary task
  analysis. In BlackboxNLP@EMNLP.