Reformer: The Efficient Transformer

Download PDF

Reformer: The Efficient Transformer

Metadata

Compute Setup

The paper states that training for all experiments was parallelized across 8 devices, specifically "8 GPUs or 8 TPU v3 cores." It does not list the GPU model, memory size, or exact TPU slice beyond TPU v3 cores. Under the project rule, this is an explicitly multi-accelerator 2020 setup, but the exact non-TPU device SKU is not specified. The main long-sequence experiments use enwik8-64K and imagenet64. For ablations, the paper uses 3-layer models with d_model = 1024, d_ff = 4096, 8 attention heads, total batch size 8 sequences, and the Adafactor optimizer. WMT English-German experiments follow Transformer hyperparameters. The paper also emphasizes that large Reformer models can fit on a single core, but the hardware model of that core is not given.

Bottleneck

Reformer analyzes memory as a compound bottleneck rather than only an attention bottleneck. Dense QK^T attention has shape [batch, length, length]. At sequence length 64K, even batch size 1 gives a 64K x 64K matrix, which in float32 takes 16GB before values, gradients, or other activations are counted. The introduction gives a second arithmetic example: activations for 64K tokens, embedding size 1024, and batch size 8 are 64K x 1K x 8, or about 0.5B floats, requiring another 2GB. Feed-forward layers can be worse because d_ff is often much larger than d_model; with d_ff = 4K and 16 layers, the paper again calls the 64K-token memory demand impractical.

The practical consequence is that large Transformer models can require a multi-accelerator setup even for one training step. Standard attention also wastes computation when the useful attention pattern is sparse but the implementation still forms all pairwise scores.

Method Adaptation

Reformer combines three compute adaptations. First, LSH attention replaces all-pairs attention with approximate nearest-neighbor attention. Queries and keys are shared, vectors are assigned to locality-sensitive hash buckets, sorted by bucket and position, and attention is computed within chunks after sorting. The goal is to avoid materializing the length-by-length matrix while still letting similar distant tokens attend to each other. Multiple hash rounds can be run in parallel to reduce missed neighbors, trading extra compute for better approximation.

Second, the model uses reversible residual layers. Instead of storing activations at every layer for backpropagation, each layer's inputs can be reconstructed from its outputs and the layer parameters. This removes the activation-memory factor that grows with the number of layers. Third, feed-forward chunking processes sections of the sequence at a time, because feed-forward computations are independent across positions. This reduces the memory pressure from d_ff without changing the numerical function. The paper also notes that, with long enough batch x sequence products, moving parameters to and from CPU memory can be amortized more effectively than in ordinary small-batch Transformers.

The design is therefore not a single sparse-attention trick. LSH attacks the O(L^2) attention matrix; reversibility attacks stored layer activations; chunking attacks wide feed-forward intermediates.

Evidence

The paper's first evidence is controlled ablation. Shared query-key attention does not hurt training curves and appears slightly faster on enwik8. Reversible layers have learning curves nearly the same as ordinary Transformer layers with the same parameter count. On a synthetic duplication task that requires non-local lookup, a 1-layer LSH model trained with 4 hashes reaches 99.9% accuracy when evaluated with 4 hashes and 100% when evaluated with 8 hashes, showing the compute/accuracy tradeoff of hash rounds.

The long-sequence experiments are the central evidence. Reformer trains on enwik8 chunked into 2^16 = 64K-token subsequences and on imagenet64 generation with sequence length around 12K. The paper reports that regular attention gets slower as sequence length grows while LSH attention speed remains flat when total tokens are held fixed. It also reports that 20-layer big Reformers fit in memory and train on enwik8 and imagenet64, while comparable Transformer baselines were too slow and memory-hungry. A 12-layer enwik8 Reformer trained for 20K steps reaches 1.19 bits/dim, and a longer tuned run reaches 1.05 bits/dim on the enwik8 test set. In translation, reversible Transformer models reach WMT14 En-De BLEU 27.6 at base 100K steps, 28.0 at base 500K steps, and 29.1 for a big 300K-step model, close to strong Transformer baselines.

Historical Effect

Reformer broadened the long-context Transformer design space from sparse patterns alone to full training-memory accounting. Its influence was in making reversible layers, activation recomputation by construction, feed-forward chunking, and approximate attention part of the same conversation. It also gave the community a memorable hardware-scale argument: a single 64K attention matrix can consume an accelerator's memory before the rest of the model is counted.

Limits

The paper's own evidence shows several limits. LSH attention is approximate, and quality depends on the number of hash rounds; more rounds improve accuracy but spend more compute. Sorting, chunking, causal masking, and shared-QK details make the implementation more complex than dense attention. The exact GPU hardware is not listed, which limits reproducibility of wall-clock and memory comparisons. Finally, some benefits only appear clearly at very long sequence lengths; for short sentence-level WMT examples, the authors do not apply LSH attention because typical examples are shorter than the 128-token LSH chunk size.

Links