BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Metadata

Slug: bert_2018
Year: 2018
Venue: NAACL
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Reading status: read complete
Compute regime: TPU and accelerator Transformer era
Primary sources: PDF, extracted text

Compute Setup

The paper lists its TPU training hardware explicitly, which makes BERT a clean example of the 2018 Cloud TPU Transformer regime. Appendix A.2 states that BERTBASE was trained on 4 Cloud TPUs in Pod configuration (16 TPU chips total) and BERTLARGE on 16 Cloud TPUs (64 TPU chips total), and that each pre-training run took 4 days. The same appendix gives the training scale: a batch size of 256 sequences at a maximum sequence length of 512, which the paper rounds to 128,000 tokens per batch (256 × 512 = 131,072), for 1,000,000 steps. The pretraining corpus is BooksCorpus (800M words) plus English Wikipedia (2,500M words).
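
The budget figures above can be checked with a few lines of arithmetic. The following is a rough sketch, not a reproduction of the paper's accounting: it assumes every step runs at the full 512-token length and treats one WordPiece token as one corpus word, both of which overstate the real data pass count. The constant and function names are illustrative.

```python
# Back-of-envelope pretraining budget from the figures quoted in these notes.
SEQS_PER_BATCH = 256
MAX_SEQ_LEN = 512
STEPS = 1_000_000
CORPUS_WORDS = 800_000_000 + 2_500_000_000  # BooksCorpus + English Wikipedia

exact_tokens_per_batch = SEQS_PER_BATCH * MAX_SEQ_LEN  # 131,072
quoted_tokens_per_batch = 128_000                      # rounded figure as quoted
total_tokens = quoted_tokens_per_batch * STEPS

# Assumption: one token counts as one word, and every step is full length.
approx_epochs = total_tokens / CORPUS_WORDS


def chip_days(chips, days=4):
    """Accelerator budget in TPU-chip-days for one pretraining run."""
    return chips * days


print(exact_tokens_per_batch, total_tokens, round(approx_epochs, 1))
print(chip_days(16), chip_days(64))  # BERTBASE, BERTLARGE
```

Under these assumptions the run covers a little under 40 passes over the 3.3B-word corpus, at 64 chip-days for BERTBASE and 256 chip-days for BERTLARGE.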

The model sizes are also compute-relevant. BERTBASE is L=12, H=768, A=12, 110M parameters; BERTLARGE is L=24, H=1024, A=16, 340M parameters. Fine-tuning is much cheaper than pretraining: the paper says all results can be replicated from the released pretrained model in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, and gives SQuAD as about 30 minutes on a single Cloud TPU to reach Dev F1 91.0. Thus the paper separates a large shared accelerator pretraining bill from many smaller downstream fine-tuning jobs.
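
The quoted parameter counts can be roughly recovered from L and H alone. The sketch below is an estimate under stated assumptions rather than the paper's own accounting: a 30,000-entry WordPiece vocabulary, 512 positions, 2 segment embeddings, a feed-forward width of 4H, one pooler layer, and no pretraining output heads. The function name encoder_params is hypothetical. The head count A does not appear because it splits H without changing the number of weights.

```python
def encoder_params(layers, hidden, vocab=30_000, max_pos=512, segments=2):
    """Rough parameter count (weights + biases) for a BERT-style encoder."""
    ffn = 4 * hidden              # assumed feed-forward width of 4H
    layer_norm = 2 * hidden       # gain and bias vectors
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden           # assumed single dense pooler
    return embeddings + layers * per_layer + pooler


base = encoder_params(layers=12, hidden=768)
large = encoder_params(layers=24, hidden=1024)
print(f"BERTBASE  ~{base / 1e6:.1f}M parameters")
print(f"BERTLARGE ~{large / 1e6:.1f}M parameters")
```

The estimate lands close to the quoted 110M and 340M; the small gap for BERTLARGE presumably reflects details this sketch leaves out.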

Bottleneck

The central compute bottleneck is how to spend dense Transformer accelerator throughput on bidirectional language representation without making the prediction target visible. Left-to-right language modeling is easy to mask causally, but it cannot use right context at every layer. A fully bidirectional encoder can use both left and right context, but under a naive objective that predicts every token, each token can indirectly see itself through the other positions, so the prediction becomes trivial. BERT's masked language model avoids this by hiding a sparse set of prediction targets, and the appendix notes the cost consequence: compared with standard language model training, masked LM predicts only 15% of tokens in each batch, so more pre-training steps may be required to converge.

The second bottleneck is sequence length. The paper explicitly says longer sequences are disproportionately expensive because attention is quadratic in sequence length. A full 512-token batch has four times the sequence length of 128-token training, but self-attention inside each layer has sixteen times the token-pair matrix area. On 2018 TPU pods, the dense matrix multiplications are favorable, but memory and attention compute still scale sharply with length.
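
The 4x versus 16x comparison is simple to verify. This sketch counts only the size of the attention score matrix and ignores the projection and feed-forward layers, which scale linearly in sequence length; the function names are illustrative.

```python
def attention_pair_ratio(long_len, short_len):
    """Ratio of per-sequence attention score-matrix areas (n^2 scaling)."""
    return (long_len ** 2) / (short_len ** 2)


def batch_pair_ratio_fixed_tokens(long_len, short_len):
    """Same ratio per batch when the token count per batch is held fixed.

    With T tokens per batch there are T / n sequences of n^2 pairs each,
    so total pair area is T * n and the ratio is linear in length.
    """
    return long_len / short_len


length_ratio = 512 / 128
pair_ratio = attention_pair_ratio(512, 128)
print(length_ratio, pair_ratio)  # 4.0 16.0
print(batch_pair_ratio_fixed_tokens(512, 128))  # 4.0
```

The 16x figure applies per sequence. If the batch is instead held to a fixed number of tokens, the longer setting has a quarter as many sequences, so the attention pair area per batch grows 4x.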

Method Adaptation

BERT adapts Transformer pretraining to this setup by making bidirectional pretraining fit dense accelerator batches. It selects 15% of WordPiece positions and predicts only the original vocabulary id at those positions. To reduce the mismatch between pretraining and fine-tuning, a selected position is not always replaced by [MASK]: 80% of the time it becomes [MASK], 10% of the time a random token, and 10% of the time it is left unchanged. The appendix compares this 80/10/10 recipe against alternatives. The next-sentence-prediction objective adds a cheap sentence-pair signal: 50% of the time sentence B is the actual next sentence after A, and 50% of the time it is a random sentence from the corpus. This makes the same packed input representation useful for downstream entailment, paraphrase, and QA tasks.
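
The masking step can be sketched as follows, using the 80/10/10 replacement split (80% [MASK], 10% random token, 10% unchanged). This is a minimal illustration, not the released implementation. It assumes an independent 15% coin flip per position instead of choosing an exact number of positions per sequence. It also assumes a toy word-level vocabulary and that [CLS] and [SEP] are never selected. The name mask_tokens is hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, {position: original token}) for masked LM.

    Only the positions recorded in the returned dict contribute to the loss.
    """
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:
            continue                          # assumption: never mask specials
        if rng.random() >= select_prob:
            continue                          # position not selected
        targets[i] = tok                      # model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets


rng = random.Random(0)
vocab = ["my", "dog", "is", "hairy", "the", "man", "went", "to", "store"]
sentence = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
print(mask_tokens(sentence, vocab, rng))
```

Because the loss is computed only at the selected positions, roughly 15% of each batch carries training signal, which is the sparse-target cost noted in the Bottleneck section.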

The most direct compute adaptation is the sequence-length schedule. BERT pre-trains with sequence length 128 for 90% of steps and then sequence length 512 for the remaining 10% to learn positional embeddings. That keeps most updates in the high-throughput, short-attention regime while reserving long-context training for the end. BERT also amortizes the expensive encoder across tasks: fine-tuning keeps the architecture almost unchanged and changes batch size, learning rate, and epochs. The paper's GLUE setup uses batch size 32 for 3 epochs, SQuAD v1.1 uses batch size 32, and SWAG uses batch size 16, which are small downstream runs compared with the 1M-step pretraining job.
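
The saving from the 90/10 schedule can be estimated with a short sketch. It assumes the same number of sequences per step in both phases, which these notes do not confirm, and compares against a hypothetical run that uses 512 tokens throughout. The function names and the power parameter are illustrative: power=2 models the attention score matrix, power=1 models costs that are linear in tokens.

```python
def seq_len_for_step(step, total_steps=1_000_000, short=128, long=512,
                     short_frac=0.9):
    """Sequence length used at a given step under the two-phase schedule."""
    return short if step < int(short_frac * total_steps) else long


def relative_cost(power, short=128, long=512, short_frac=0.9):
    """Cost of the schedule relative to training at `long` for every step.

    Assumes a fixed number of sequences per step and a per-sequence cost
    proportional to length ** power.
    """
    scheduled = short_frac * short ** power + (1 - short_frac) * long ** power
    return scheduled / long ** power


print(seq_len_for_step(0), seq_len_for_step(900_000))
print(round(relative_cost(power=2), 5))  # attention pair area
print(round(relative_cost(power=1), 5))  # token-linear work
```

Under these assumptions the schedule costs about 16% of the all-512 attention pair area and about a third of the token-linear work, which shows why most steps stay in the short-sequence regime.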

Evidence

The headline evidence ties performance to this pretrain-once, fine-tune-many compute structure. The abstract reports new state-of-the-art results on eleven NLP tasks, including GLUE 80.5, MultiNLI 86.7%, SQuAD v1.1 Test F1 93.2, and SQuAD v2.0 Test F1 83.1. Table 1 reports BERTLARGE with a GLUE table average of 82.1, while the text says the official leaderboard score is 80.5 compared with OpenAI GPT at 72.8. Table 2 reports SQuAD v1.1 results including BERTLARGE single at 84.1 EM / 90.9 F1, and the ensemble with TriviaQA at 86.2 EM / 92.2 F1 on dev and 87.4 EM / 93.2 F1 on test. Table 3 reports SQuAD v2.0 BERTLARGE single at 80.0 EM / 83.1 F1 on test.

Scaling evidence is explicit: BERTBASE has 110M parameters and BERTLARGE 340M, and the model-size ablation argues that larger pretrained models improve even small downstream tasks when sufficiently pretrained. The appendix also asks whether BERT needs 128,000 words per batch times 1,000,000 steps; it reports that BERTBASE gains almost 1.0% MNLI accuracy when trained for 1M steps rather than 500k. That is not just an architecture claim; it is evidence that BERT's quality depended on sustained accelerator-scale pretraining.

Historical Effect

BERT established pretrain-then-fine-tune Transformer encoders as the dominant NLP compute pattern before decoder-only LLMs took over. In the compute spine, it is a TPU-era example of spending accelerator days once, releasing a large reusable representation, and amortizing it across many downstream tasks with cheap fine-tuning. It also made the training budget legible: chips, batch tokens, sequence length, steps, and pretraining days are all exposed in the paper, so model quality can be read as a function of dense accelerator throughput rather than as an isolated modeling trick.

Limits

BERT's compute choices also define its limits. [MASK] creates a mismatch because the token is not present during fine-tuning. The paper mitigates this with the 80/10/10 replacement rule but does not remove the mismatch. Masked LM predicts only a sparse subset of tokens, so it may require more pretraining steps than standard language modeling. The 512-token length cap leaves long-document modeling outside the main design, and the 90% short-sequence schedule is a practical concession to quadratic attention cost. Finally, the encoder-only interface is excellent for classification, ranking, extraction, and representation transfer, but it is not a general autoregressive generation interface.