What Does BERT Look At? An Analysis of BERT's Attention

Metadata

Slug: bertology_2019
Year: 2019
Venue: ACL Workshop
Authors: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning
Reading status: read complete
Compute regime: TPU and accelerator Transformer era
Primary sources: PDF, extracted text

Compute Setup

The extracted text does not list the hardware used for the analysis. The paper studies a pretrained English BERT-base model rather than running new pretraining, so the scale figures the source does state concern the model and the analysis. BERT-base has 12 layers with 12 attention heads each, for 144 heads in total, and was pretrained on 3.3B English tokens in the original BERT setup. This paper only extracts attention maps from that model and does not train BERT itself.

For surface-level attention analysis, the authors extract attention maps from BERT-base over 1,000 random Wikipedia segments. They follow the BERT pretraining input format with at most 128 tokens from two consecutive paragraphs, using [CLS]<paragraph-1>[SEP]<paragraph-2>[SEP]. For syntax and coreference probes, they use fixed BERT attention maps and train only small probing classifiers. The paper explicitly says the BERT attention outputs are fixed and that they do not backpropagate into BERT. Under the project rule, hardware is inferred as 2019 CPU/GPU/accelerator forward-pass analysis over a downloaded pretrained model, not TPU-scale pretraining.
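The input format described above can be sketched in a few lines. This is only an illustration: the function name build_segment and the truncation rule (trim the longer paragraph from the end) are assumptions made here, not details taken from the paper, and the inputs are assumed to be already tokenized.

```python
CLS, SEP = "[CLS]", "[SEP]"

def build_segment(paragraph_1, paragraph_2, max_len=128):
    """Pack two token lists as [CLS] p1 [SEP] p2 [SEP], at most max_len tokens."""
    budget = max_len - 3  # room left after the three special tokens
    a, b = list(paragraph_1), list(paragraph_2)
    # Assumed truncation rule: drop tokens from the end of the longer
    # paragraph until the pair fits the budget.
    while len(a) + len(b) > budget:
        longer = a if len(a) >= len(b) else b
        longer.pop()
    return [CLS] + a + [SEP] + b + [SEP]

segment = build_segment(["the", "cat", "sat"], ["it", "purred"])
print(segment)
```

Every segment built this way contains exactly one [CLS] and two [SEP] tokens, which is what makes the special-token analysis later in these notes possible.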

Bottleneck

The bottleneck is not raw training cost but the interpretability of an already-expensive pretrained Transformer. BERT's attention tensor is high-dimensional: every input segment produces token-token maps for 144 heads. Since self-attention is quadratic in sequence length, even analysis code has to choose manageable sequence lengths and datasets. The paper truncates CoNLL-2012 documents to 128 tokens to keep memory usage manageable.
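A quick back-of-the-envelope calculation shows why sequence length matters even for forward-pass analysis. The sketch below counts attention entries per input segment for BERT-base's 144 heads; the float32 storage size is an assumption for illustration, not a figure from the paper.

```python
def attention_entries(seq_len, layers=12, heads_per_layer=12):
    """Number of token-token attention weights produced for one input segment."""
    return layers * heads_per_layer * seq_len * seq_len

for n in (128, 256, 512):
    entries = attention_entries(n)
    # Assumes 4 bytes per weight (float32); purely illustrative.
    print(n, entries, f"{entries * 4 / 2**20:.1f} MiB")
```

Doubling the sequence length quadruples the number of attention weights to store, so a 128-token cap keeps each segment at about 2.4 million weights.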

The compute structure also shapes the kind of science the paper can do. Full ablations or retraining would entangle analysis with a large optimization run. Instead, the authors treat BERT as a fixed object, convert attention maps into features, and ask what structure is already present. This turns a large-model analysis into repeated forward passes plus lightweight supervised probes.

Method Adaptation

The method deliberately avoids updating BERT and centers on attention. First, it extracts attention maps from a fixed pretrained model. Second, because BERT uses subword tokenization and around 8% of words in their data are split into several tokens, it converts token-token attention into word-word attention. Attention to a split word is the sum of the attention to its tokens, and attention from a split word is the mean of the attention from its tokens. This preserves the property that each word's outgoing attention sums to 1.
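The token-to-word conversion can be sketched as follows. The function name tokens_to_words and the word_ids alignment list are hypothetical helpers introduced here; the paper describes the sum-then-mean rule but not this code.

```python
def tokens_to_words(attn, word_ids):
    """Convert token-token attention into word-word attention.

    attn[i][j]  : attention from token i to token j (each row sums to 1)
    word_ids[t] : index of the word that token t belongs to
    """
    n_tokens = len(word_ids)
    n_words = max(word_ids) + 1
    # Attention TO a split word: sum over its tokens (merge columns).
    summed = [[0.0] * n_words for _ in range(n_tokens)]
    for i in range(n_tokens):
        for j in range(n_tokens):
            summed[i][word_ids[j]] += attn[i][j]
    # Attention FROM a split word: mean over its tokens (merge rows).
    counts = [0] * n_words
    for w in word_ids:
        counts[w] += 1
    word_attn = [[0.0] * n_words for _ in range(n_words)]
    for i in range(n_tokens):
        w = word_ids[i]
        for j in range(n_words):
            word_attn[w][j] += summed[i][j] / counts[w]
    return word_attn

# Toy example: "play" + "##ing" form word 0, "well" is word 1.
attn = [[0.2, 0.3, 0.5],
        [0.6, 0.2, 0.2],
        [0.1, 0.1, 0.8]]
word_attn = tokens_to_words(attn, [0, 0, 1])
print(word_attn)
```

Summing columns keeps each row's total at 1, and averaging rows that each sum to 1 gives a row that also sums to 1, which is why the word-level maps remain valid distributions.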

For individual-head probes, the method uses attention itself as a no-training classifier: for a word and a head, predict whichever other word receives the most attention, testing both head-to-dependent and dependent-to-head directions for dependencies. For combined probes, it trains small graph-based parsers over attention weights, optionally conditioning head weights on fixed GloVe embeddings. The key compute adaptation is that only probe parameters are learned; the expensive BERT model remains frozen.
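The no-training classifier for a single attention head can be sketched as below. The function names and the exact scoring of the head-to-dependent direction are assumptions made for illustration; the attention matrix is a made-up toy, not model output.

```python
def most_attended(word_attn):
    """For each word, return the index of the other word it attends to most."""
    predictions = []
    for i, row in enumerate(word_attn):
        candidates = [j for j in range(len(row)) if j != i]
        predictions.append(max(candidates, key=lambda j: row[j]))
    return predictions

def head_accuracy(word_attn, gold_heads):
    """Score one attention head against gold dependency arcs in both directions.

    gold_heads[d] is the index of dependent d's syntactic head, or None for the root.
    """
    predictions = most_attended(word_attn)
    arcs = [(d, h) for d, h in enumerate(gold_heads) if h is not None]
    # Dependent attends to its head.
    dep_to_head = sum(predictions[d] == h for d, h in arcs) / len(arcs)
    # Head attends to its dependent (assumed scoring: one prediction per head word).
    head_to_dep = sum(predictions[h] == d for d, h in arcs) / len(arcs)
    return dep_to_head, head_to_dep

# Toy sentence "the dog barks": the -> dog, dog -> barks, barks is the root.
toy_attn = [[0.1, 0.8, 0.1],
            [0.2, 0.1, 0.7],
            [0.1, 0.6, 0.3]]
print(head_accuracy(toy_attn, [1, 2, None]))
```

No parameters are fitted here: the attention weights themselves are the predictions, so any accuracy above a baseline reflects structure already present in the frozen model.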

The analysis also exploits BERT's input format. Since [CLS] and [SEP] are always present, the authors can examine whether attention mass to these tokens acts as a computational sink or "no-op" path for heads whose function is not applicable to a token. Gradient-based feature-importance checks are used to see whether attention to [SEP] affects masked-LM outputs.

Evidence

The surface analysis finds that special tokens dominate many heads. In layers 6-10, over half of BERT's attention focuses on [SEP]. For comparison, in a segment of about 128 tokens, a token that appears twice (as [SEP] does) would receive only about 2/128 = 1/64 of the attention if attention were spread uniformly. When [SEP] is the current token, some heads put more than 90% of their attention on [SEP] tokens. Starting around layer 5, gradients of the masked-LM loss with respect to attention to [SEP] become very small, supporting the interpretation that the model can use [SEP] as a low-effect attention target.
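The comparison against a uniform baseline can be made concrete with a small sketch. The helper names are hypothetical, and the attention matrix below is a uniform toy rather than real BERT output.

```python
def mean_attention_to(attn, tokens, target="[SEP]"):
    """Average, over source tokens, of the attention mass placed on `target` tokens."""
    columns = [j for j, tok in enumerate(tokens) if tok == target]
    return sum(sum(row[j] for j in columns) for row in attn) / len(attn)

def uniform_baseline(tokens, target="[SEP]"):
    """Attention mass `target` would receive if attention were spread uniformly."""
    return tokens.count(target) / len(tokens)

# A 128-token segment with two [SEP] tokens gives a baseline of 2/128 = 1/64.
tokens_128 = ["[CLS]"] + ["w"] * 62 + ["[SEP]"] + ["w"] * 63 + ["[SEP]"]
print(uniform_baseline(tokens_128))

# Sanity check on a toy head that really is uniform.
toy_tokens = ["[CLS]", "a", "[SEP]", "b", "[SEP]"]
toy_attn = [[0.2] * 5 for _ in range(5)]
print(mean_attention_to(toy_attn, toy_tokens), uniform_baseline(toy_tokens))
```

A head whose measured value is far above the baseline, such as the greater-than-50% figures reported for layers 6-10, is concentrating attention on [SEP] rather than spreading it.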

The positional analysis finds heads that attend strongly to neighboring tokens. Four heads in layers 2, 4, 7, and 8 put more than 50% average attention on the previous token; five heads in layers 1, 2, 2, 3, and 6 put more than 50% on the next token.
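A positional statistic of this kind can be computed with a sketch like the following. Skipping tokens that have no neighbor at the requested offset is an assumption made here; the paper's exact averaging may differ. The matrix is a toy example.

```python
def mean_offset_attention(attn, offset):
    """Average attention each token puts on the token `offset` positions away.

    offset = -1 measures attention to the previous token, +1 to the next.
    Tokens without a neighbor at that offset are skipped (an assumption).
    """
    n = len(attn)
    values = [attn[i][i + offset] for i in range(n) if 0 <= i + offset < n]
    return sum(values) / len(values)

# Toy head that mostly attends to the previous token.
toy_attn = [[1.0, 0.0, 0.0],
            [0.9, 0.1, 0.0],
            [0.0, 0.8, 0.2]]
print(mean_offset_attention(toy_attn, -1))
print(mean_offset_attention(toy_attn, +1))
```

Under this measure, a head counts as a previous-token head in the paper's sense when the offset -1 average exceeds 0.5.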

The linguistic probes give concrete scores. Individual heads identify some dependency relations with high accuracy: head 8-10 reaches 86.8% on direct objects, head 8-11 reaches 94.3% on determiners, head 7-6 reaches 80.5% on possessives, head 4-10 reaches 82.5% on passive auxiliaries, and head 9-6 reaches 76.3% on prepositional objects. However, the best single head for overall syntax only gets 34.5 UAS, compared with a 26.3 right-branching baseline. No single head captures syntax as a whole; the information is distributed across heads.

Coreference shows a similar pattern. Head 5-4 reaches 65% antecedent-selection accuracy on truncated CoNLL-2012 examples, above the 52% head-match baseline and close to the 69% rule-based system, though below a neural coreference system reported at roughly 83%.

Combining heads recovers much more syntax than any single head. The attention-plus-GloVe dependency probe reaches 77 UAS, compared with 61 for attention alone, 58 for distances plus GloVe, 30 for randomly initialized attention plus GloVe, and 26 for right branching.
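The attention-only dependency probe can be sketched as a softmax over candidate head words, scoring each candidate with a weighted sum of attention in both directions across heads. The sketch assumes the form p(i | j) proportional to exp(sum over k of w_k * a_k[i][j] + u_k * a_k[j][i]), where a_k[x][y] is attention from word x to word y in attention head k. The weights below are set by hand for illustration, whereas in the paper they are learned from annotated parses; excluding a word from being its own head is an assumption, and the GloVe-conditioned variant is not shown.

```python
import math

def head_distribution(attn_by_head, w, u, j):
    """Probability that each word i is the syntactic head of word j.

    attn_by_head[k][x][y] : attention from word x to word y in attention head k
    w[k], u[k]            : probe weights for the two attention directions
    """
    n_words = len(attn_by_head[0])
    scores = {}
    for i in range(n_words):
        if i == j:
            continue  # assumption: a word cannot be its own head
        scores[i] = sum(
            w[k] * attn_by_head[k][i][j] + u[k] * attn_by_head[k][j][i]
            for k in range(len(attn_by_head))
        )
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {i: math.exp(s - top) for i, s in scores.items()}
    total = sum(exps.values())
    return [exps.get(i, 0.0) / total for i in range(n_words)]

# Two toy attention heads over three words; the second is uniform.
head_a = [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.1, 0.6, 0.3]]
head_b = [[1 / 3] * 3 for _ in range(3)]
probs = head_distribution([head_a, head_b], w=[0.0, 0.0], u=[5.0, 0.0], j=0)
print(probs)
```

Only w and u would be trained in such a probe; the attention maps are inputs, which is consistent with the note that BERT itself stays frozen.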

Historical Effect

This paper helped define "BERTology" as a compute-light mode of studying large pretrained models. Rather than treating BERT only as a benchmark score, it made frozen attention maps an empirical object. That mattered historically because the field was entering an era where pretraining was expensive enough that many researchers could analyze, probe, and reuse models more easily than they could train them from scratch.

Limits

The paper itself cautions against equating attention with explanation. Some heads have interpretable behavior, but many do not, and individual heads do not capture full dependency structure. The analysis is for English BERT-base and 128-token inputs, so it does not establish behavior for larger models, longer contexts, other languages, or fine-tuned task heads. Hardware is not listed, and the paper should not be cited as evidence about BERT pretraining compute.