Attention Is All You Need
Metadata
- Slug: attention_is_all_you_need_2017
- Year: 2017
- Venue: NeurIPS
- Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- Reading status: read complete
- Compute regime: TPU and accelerator Transformer era
- Primary sources: PDF, extracted text
Compute Setup
The paper explicitly lists the main training setup: one machine with eight NVIDIA P100 GPUs. This is not inferred. For the base models, each training step took about 0.4 seconds and training ran for 100,000 steps, or about 12 hours. For the big models, each step took about 1.0 second and training ran for 300,000 steps, or 3.5 days.
The data pipeline is token-batched rather than sentence-count-batched. WMT 2014 English-German has about 4.5 million sentence pairs and uses a shared BPE vocabulary of about 37,000 tokens. WMT 2014 English-French has 36 million sentences and a 32,000 word-piece vocabulary. Sentence pairs are grouped by approximate sequence length; each training batch contains about 25,000 source tokens and 25,000 target tokens. This reduces padding waste and keeps dense matrix multiplications better filled.
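A minimal sketch of token-budget batching, to make the idea above concrete. It is not the paper's pipeline: the function name `token_batches`, the `Pair` alias, and the choice to count unpadded tokens are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (source token ids, target token ids).
Pair = Tuple[List[int], List[int]]


def token_batches(pairs: List[Pair], max_tokens: int = 25_000) -> List[List[Pair]]:
    """Group sentence pairs of similar length into batches under a token budget."""
    # Sort by length so neighbours in a batch need little padding.
    ordered = sorted(pairs, key=lambda p: max(len(p[0]), len(p[1])))
    batches: List[List[Pair]] = []
    current: List[Pair] = []
    src_count = tgt_count = 0
    for src, tgt in ordered:
        over_budget = (src_count + len(src) > max_tokens
                       or tgt_count + len(tgt) > max_tokens)
        # Close the current batch once either side would exceed the budget.
        if current and over_budget:
            batches.append(current)
            current, src_count, tgt_count = [], 0, 0
        current.append((src, tgt))
        src_count += len(src)
        tgt_count += len(tgt)
    if current:
        batches.append(current)
    return batches
```

Under this scheme the number of sentences per batch varies, while the amount of work per step stays roughly constant. A single pair longer than the budget still gets its own batch in this sketch.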
The paper also estimates training cost in FLOPs by multiplying training time, GPU count, and an assumed sustained single-precision capacity for each GPU. For P100 it uses 9.5 TFLOPS, making accelerator cost part of the reported result rather than background infrastructure.
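The arithmetic behind these figures can be checked directly. The sketch below uses only numbers stated above (step counts, step times, 8 GPUs, 9.5 TFLOPS per P100); the function and constant names are illustrative, not from the paper.

```python
P100_SUSTAINED_FLOPS = 9.5e12  # sustained single-precision value assumed for a P100
NUM_GPUS = 8


def training_seconds(steps: int, seconds_per_step: float) -> float:
    """Wall-clock time implied by a step count and a per-step time."""
    return steps * seconds_per_step


def estimated_training_flops(seconds: float,
                             gpus: int = NUM_GPUS,
                             flops_per_gpu: float = P100_SUSTAINED_FLOPS) -> float:
    """Training time x GPU count x assumed sustained FLOPS per GPU."""
    return seconds * gpus * flops_per_gpu


# Step arithmetic: 40,000 s is about 11.1 hours; 300,000 s is about 3.47 days.
base_seconds = training_seconds(100_000, 0.4)
big_seconds = training_seconds(300_000, 1.0)

# Cost estimates from the rounded durations of 12 hours and 3.5 days.
base_flops = estimated_training_flops(12 * 3600)        # about 3.3e18
big_flops = estimated_training_flops(3.5 * 24 * 3600)   # about 2.3e19
```

The rounded durations reproduce the training-cost figures reported for the base and big models, which suggests the estimate is a simple product rather than a measurement.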
Bottleneck
The paper identifies the sequence bottleneck of recurrent models: recurrence prevents parallelization within a single example. That bottleneck becomes more severe for long sequences because memory limits force smaller batches, and sequential dependency increases the number of operations on the critical path.
For 2017 accelerators, the favorable primitive was large batched matrix multiplication. RNNs used GPU arithmetic, but they could not expose enough parallel work across the sequence dimension.
The paper formalizes this with three criteria: per-layer complexity, parallelizable computation, and maximum path length for long-range dependencies. A recurrent layer requires O(n) sequential operations over a length-n sequence. A self-attention layer connects positions with a constant number of sequential operations, converting sequence modeling from a time-unrolled loop into batched attention and feed-forward matrix operations.
The tradeoff is quadratic attention. The authors argue that self-attention is faster than recurrence when sequence length n is smaller than representation dimension d, which is usually true for their machine-translation sentence representations. They already note that very long sequences may need restricted attention, foreshadowing later memory-efficient attention work.
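The n-versus-d argument can be illustrated with the per-layer operation counts from the paper's complexity comparison: on the order of n^2 * d for self-attention and n * d^2 for a recurrent layer. The sketch below ignores constant factors and memory traffic, and the function names are hypothetical.

```python
def self_attention_ops(n: int, d: int) -> int:
    """Per-layer cost of full self-attention, up to constants: n^2 * d."""
    return n * n * d


def recurrent_ops(n: int, d: int) -> int:
    """Per-layer cost of a recurrent layer, up to constants: n * d^2."""
    return n * d * d


def cheaper_layer(n: int, d: int) -> str:
    """Name the layer type with the lower operation count for this n and d."""
    if self_attention_ops(n, d) < recurrent_ops(n, d):
        return "self-attention"
    return "recurrent"


# A translation-length sentence with d = 512 favours self-attention;
# a sequence longer than d does not, on operation count alone.
short_case = cheaper_layer(n=50, d=512)
long_case = cheaper_layer(n=1024, d=512)
```

Operation count is only one of the three criteria. Even when n exceeds d, self-attention keeps a constant number of sequential operations, while the recurrent layer still needs O(n) of them.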
Method Adaptation
The Transformer is a direct adaptation to accelerator structure:
- Self-attention replaces recurrence, reducing sequential operations to a constant number of layer steps.
- Scaled dot-product attention is implemented with optimized matrix multiplication, making it faster and more memory-efficient than additive attention on the available hardware.
- Multi-head attention keeps projections dense and batched while allowing different representation subspaces.
- Positional encodings add order information without reintroducing recurrent computation.
- The feed-forward blocks are position-wise dense layers, another accelerator-friendly primitive.
Multi-head attention with h = 8 heads keeps each head dimension small (d_k = d_v = d_model / h = 64, with d_model = 512 in the base model), so total computation is similar to single-head full-dimensional attention while exposing multiple dense projections. The position-wise feed-forward network is also easy to batch across positions and examples.
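A NumPy sketch of scaled dot-product attention and the head split described above, showing that the whole computation reduces to a few batched matrix multiplications. It is a simplified illustration, not the paper's implementation: it omits masking, biases, and dropout, and it fuses the per-head projections into single d_model x d_model matrices, which is a common implementation choice assumed here.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v, batched over any leading axes."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=8):
    """Self-attention over x of shape (n, d_model) with fused projections."""
    n, d_model = x.shape
    d_head = d_model // num_heads  # 64 when d_model = 512 and num_heads = 8

    def split(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    heads = scaled_dot_product_attention(split(x @ w_q), split(x @ w_k), split(x @ w_v))
    # (num_heads, n, d_head) -> (n, d_model), then the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o


rng = np.random.default_rng(0)
n, d_model = 10, 512
x = rng.standard_normal((n, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o)  # shape (10, 512)
```

No step loops over sequence positions: every position is handled inside the same matrix products, which is the property the note attributes to the architecture.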
The optimizer and schedule are also tuned to this accelerator regime. The paper uses Adam with beta values 0.9 and 0.98, a 4000-step linear warmup, then inverse-square-root decay. Warmup matters because the model is trained with large token batches on a new architecture; it lets the run ramp up without wasting the expensive early P100 steps on instability.
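The schedule can be written as one expression, following the paper's formula lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The sketch below assumes the base model's d_model = 512; the function name is illustrative.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup for warmup_steps, then decay proportional to step^-0.5."""
    step = max(step, 1)  # guard against step 0, where step^-0.5 is undefined
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak_lr = transformer_lr(4000)       # the two branches meet at the end of warmup
halfway_lr = transformer_lr(2000)    # half the peak, since warmup is linear
decayed_lr = transformer_lr(16000)   # half the peak, since 16000 = 4 * 4000
```

With these defaults the peak is roughly 7e-4, reached exactly at the end of warmup.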
Evidence
- On WMT 2014 English-German, the big Transformer reached 28.4 BLEU.
- On WMT 2014 English-French, the paper reports 41.8 BLEU in the abstract and Table 2 for Transformer big; the results text also states 41.0 BLEU for the big English-French model. Both are reported as single-model results against prior systems.
- Training the big English-German model took 3.5 days on 8 P100 GPUs.
- Table 2 reports estimated training cost of 3.3 * 10^18 FLOPs for the base model and 2.3 * 10^19 FLOPs for Transformer big, compared with larger costs for several prior recurrent, convolutional, and ensemble systems.
- The base model surpasses all previously published English-German models and ensembles at a fraction of the listed training cost.
- At inference/evaluation time, the paper uses beam search with beam size 4 and length penalty 0.6, and averages the last 5 checkpoints for base models or last 20 for big models.
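Checkpoint averaging, mentioned in the last item, amounts to an elementwise mean over saved parameter sets. The sketch below assumes each checkpoint is a dict mapping parameter names to NumPy arrays; that storage format and the function name are assumptions for illustration, not details from the paper.

```python
import numpy as np


def average_checkpoints(checkpoints):
    """Average parameters elementwise across a list of {name: array} dicts."""
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in names}


# Hypothetical usage: keep the last 5 saved checkpoints for a base model.
saved = [{"w": np.full((2, 2), float(i)), "b": np.array([float(i)])} for i in range(8)]
averaged = average_checkpoints(saved[-5:])  # averages checkpoints 3..7
```

The result is a single set of weights, so evaluation still runs one model rather than an ensemble.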
Historical Effect
The paper moved mainstream sequence modeling from recurrence-shaped computation to matrix-multiply-shaped computation. This is the core architectural step that made later accelerator-scale language models practical.
In the compute-structure spine, the Transformer is not just a method innovation. It is the point where sequence modeling becomes well aligned with GPU/TPU throughput, enabling dense scaling in later BERT, GPT, and large language model systems.
Once recurrence is removed, scaling model width, batch tokens, and training FLOPs becomes a more direct engineering path for language and sequence modeling.
Limits
- Full self-attention has quadratic sequence-length cost.
- The paper's strongest evidence is machine translation, not yet general-purpose language modeling at internet scale.
- It uses a single multi-GPU machine, not the later datacenter-scale training systems.
- The reported cost model uses estimated sustained single-precision TFLOPS, not measured end-to-end utilization.
Links
- Parent regime: compute spine
- Later linked card: GPT-3 2020
- Method index: transformer
- Ledger updates: compute bottlenecks