Linformer: Self-Attention with Linear Complexity

Metadata

Reading status: read complete
Year: 2020
Compute regime: Sparse and memory-efficient scaling (sparse_memory_efficient_scaling)
PDF: 2020-linformer_2020.pdf
Extracted text: 2020-linformer_2020.txt
PDF URL: https://arxiv.org/pdf/2006.04768.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly reports accelerator hardware. All pretraining experiments use BookCorpus plus English Wikipedia, about 3300M words, with an MLM objective, and are "parallelized across 64 Tesla V100 GPUs with 250k updates." The paper also says it uses mixed-precision training by default, but this is orthogonal to the proposed attention mechanism. For deployment-style measurement, it benchmarks inference speed and memory on a single 16GB Tesla V100 GPU card. Those inference tests use randomly generated inputs up to sequence length 65536 and choose batch size according to the maximum batch that fits in memory, so the memory numbers are effectively device-capacity results rather than just theoretical counts.

Bottleneck

Linformer targets the same quadratic self-attention bottleneck as the other 2020 long-context papers, but with a different diagnosis. Standard self-attention forms the context mapping matrix P in R^(n x n), requiring multiplication of n x d matrices and O(n^2) time and space. The paper emphasizes both training and deployment cost: large Transformers are slow to train, and real-world inference often needs distillation or compression. It also argues that common memory-saving techniques do not solve the whole problem. Microbatching and gradient checkpointing can fit training into memory, but they trade time for memory and do not speed up inference. Distillation speeds a smaller student but does not speed the teacher's training. Sparse and LSH methods reduce asymptotic cost but can introduce sequential operations, constants, or performance drops.

Method Adaptation

The method adapts attention by treating the attention matrix as low rank. The paper first performs spectrum analysis of pretrained RoBERTa attention matrices and finds long-tail singular-value distributions, implying much of P can be approximated by a small number of directions. Rather than computing an SVD for every attention matrix, Linformer inserts learned projection matrices E_i and F_i in R^(n x k) for keys and values. Keys and values are projected from n positions down to k projected positions, and attention is computed as an n x k context matrix. With k much smaller than n, this changes attention cost from O(n^2) to O(nk), and the theorem section argues for k choices that need not grow linearly with sequence length.

The design is device-friendly because it remains dense linear algebra. Unlike arbitrary sparse attention, Linformer keeps the computation as projections and dense matrix multiplies, which V100 GPUs execute efficiently. The paper also explores projection sharing to reduce parameter and memory overhead: headwise sharing, key-value sharing, and layerwise sharing. Layerwise sharing uses a single projection matrix across the whole model, and the experiments report little degradation, which matters for deployment memory as well as training memory.

Evidence

The pretraining evidence is that low-rank projection does not collapse language modeling. With max sequence lengths n = 512 and n = 1024, validation perplexity approaches the standard Transformer as k increases; the paper highlights k = 128 for n = 512 and k = 256 for n = 1024 as nearly on par. In the longer-sequence pretraining comparison, n ranges over 512, 1024, 2048, and 4096 while k stays fixed at 256, and final perplexities remain about the same after convergence. That is the core empirical support for the claim that compute is controlled by k rather than the ratio n/k.

Downstream results preserve that story. On SST-2, IMDB, QNLI, and QQP, the same-corpus RoBERTa-base baseline averages 92.25. Linformer at n = 512, k = 256 with shared key-value and layerwise projections averages 92.30, and n = 1024, k = 256 variants remain around 91.83-92.18. The paper notes that all baselines and Linformer variants were pretrained with the same objective, corpus, and up to 250K updates, while Linformer reaches those updates with less wall-clock time.

The single-GPU inference table gives the clearest compute-device evidence. On the 16GB V100, Linformer with k = 128 is 1.5x faster and allows a 1.7x larger maximum batch size even at n = 512. At n = 4096, k = 128 gives 3.4x time savings and 14x memory savings; k = 256 gives 3.2x and 13x. At n = 65536, k = 128 gives 20x time savings and 60x memory savings, while k = 256 gives 18x and 52x. These are reported against a standard Transformer under the same V100 capacity constraint.

Historical Effect

Linformer became a key early example of replacing the attention matrix with a dense low-rank surrogate rather than a sparse pattern. Its compute contribution was the claim that long-context attention could remain GPU-friendly dense linear algebra and still avoid quadratic memory. It also influenced later evaluation habits by pairing theoretical complexity with concrete inference memory and maximum-batch measurements on a named GPU.

Limits

The limits are important. The largest sequence-length evidence, up to 65536, is an inference benchmark on randomly generated data, not a trained long-document application at that length. The low-rank assumption may fail on tasks that require many fine-grained global interactions, and fixed projections may lose information that sparse or exact attention could preserve. The paper reports strong parity on selected GLUE and IMDB tasks, but those tasks do not by themselves prove quality on very long reasoning or retrieval-heavy documents. Finally, the pretraining setup still uses 64 V100 GPUs, so the method reduces per-sequence attention cost without making pretraining itself small.