Longformer: The Long-Document Transformer

Compute Setup

The paper explicitly lists GPU hardware for several parts of the work. For autoregressive character language modeling, the appendix says the experiments ran on 48GB RTX8000 GPUs with mixed-precision training and gradient checkpointing, and that attention was kept in fp32 because fp16 attention caused overflow and NaNs later in training. The small character model ran on 4 RTX8000 GPUs for 16 days, and the large model ran on 8 RTX8000 GPUs for 13 days. The staged character-LM schedule starts at sequence length 2048 and reaches 23040 in phase 5, which the paper marks as the GPU memory limit.

For pretrained Longformer, the paper continues from the RoBERTa checkpoint rather than training from scratch. It trains base and large models for 65K gradient updates at sequence length 4096 and batch size 64, with a maximum learning rate of 3e-5 and a 500-step warmup. The appendix also states that some task models ran on 32GB V100 GPUs, with small models taking about 1 day on 4 GPUs and large models about 1 day on 8 GPUs. WikiHop fine-tuning is listed as training on a single RTX8000 GPU, with Longformer-base taking about a day for 5 epochs.
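The pretraining budget above can be turned into token counts with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, "65K" is read as 65,000 updates, the warmup is assumed to be linear, and whatever decay follows the warmup is left out because it is not described here.

```python
def tokens_per_update(batch_size: int = 64, seq_len: int = 4096) -> int:
    """Tokens consumed by one gradient update."""
    return batch_size * seq_len


def total_tokens(updates: int = 65_000) -> int:
    """Tokens seen over the whole continued-pretraining run."""
    return updates * tokens_per_update()


def warmup_lr(step: int, max_lr: float = 3e-5, warmup_steps: int = 500) -> float:
    """Assumed linear ramp to max_lr; held constant afterwards in this sketch."""
    return max_lr * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    print(tokens_per_update())   # 262144, i.e. 2**18 tokens per update
    print(total_tokens())        # about 1.7e10 tokens in total
    print(warmup_lr(250))        # halfway through the assumed linear ramp
```

Under these assumptions each update covers 2**18 tokens, so the 65K-update run processes roughly 17 billion tokens.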

Bottleneck

The bottleneck is the quadratic attention matrix on tasks whose evidence is longer than BERT's 512-token window. The paper reports dataset lengths directly: WikiHop averages 1535 wordpieces with 95th percentile 3627, TriviaQA averages 6589 with 95th percentile 17126, HotpotQA averages 1316, OntoNotes averages 506, and Hyperpartisan averages 705 with 95th percentile 1975. The paper's motivation is that chunking or truncating such inputs loses cross-chunk information and forces complex, task-specific models to stitch the pieces back together.
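To make the quadratic cost concrete, the following sketch counts attention-score entries per head for full attention and for a fixed window. The window of 512, the fp32 storage size, and all function names are assumptions chosen for illustration.

```python
def full_attention_entries(n: int) -> int:
    """Number of QK^T scores per head with full self-attention."""
    return n * n


def windowed_attention_entries(n: int, w: int) -> int:
    """Upper bound on scores per head when each token sees w neighbours."""
    return n * min(w, n)


def mib(entries: int, bytes_per_entry: int = 4) -> float:
    """Size in MiB of one score matrix stored at the given precision."""
    return entries * bytes_per_entry / 2**20


if __name__ == "__main__":
    window = 512  # assumed window size, matching BERT's 512-token span
    for n in (512, 4096, 16384):
        full = mib(full_attention_entries(n))
        local = mib(windowed_attention_entries(n, window))
        print(f"n={n:>6}: full {full:8.1f} MiB, window {local:8.1f} MiB")
```

At 4096 tokens the full matrix is 8 times larger than the windowed one under these assumptions, and the gap widens linearly with sequence length.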

There is also an implementation bottleneck. A sliding-window attention matrix is banded, but PyTorch and TensorFlow do not directly provide the banded matrix multiplication needed to compute only selected diagonals of QK^T. A naive loop implementation is memory-efficient but unusably slow. A vectorized chunking implementation is fast because it uses a single PyTorch matrix multiplication, but it computes some zeros and consumes about 2x the memory of a perfectly optimized implementation. The custom CUDA path is needed when the longest sequences or dilation matter.
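The gap between a loop over diagonals and a dense product can be sketched in NumPy (used here instead of PyTorch to keep the example small). Both function names are hypothetical, and `r` is a one-sided window radius, which is a convention of this sketch rather than the paper's.

```python
import numpy as np


def banded_scores_loop(q: np.ndarray, k: np.ndarray, r: int) -> np.ndarray:
    """Compute only the QK^T entries with |i - j| <= r, one query at a time.

    Returns an (n, 2r + 1) array; column c holds the score for offset c - r.
    Positions that fall outside the sequence are set to -inf.
    """
    n = q.shape[0]
    out = np.full((n, 2 * r + 1), -np.inf)
    for i in range(n):  # one Python iteration per query: small memory, slow
        lo, hi = max(0, i - r), min(n, i + r + 1)
        out[i, lo - i + r: hi - i + r] = k[lo:hi] @ q[i]
    return out


def banded_scores_dense(q: np.ndarray, k: np.ndarray, r: int) -> np.ndarray:
    """Reference: build the full (n, n) product, then read off the same band."""
    n = q.shape[0]
    full = q @ k.T  # quadratic memory, which is what the band avoids
    out = np.full((n, 2 * r + 1), -np.inf)
    for off in range(-r, r + 1):
        i = np.arange(max(0, -off), min(n, n - off))
        out[i, off + r] = full[i, i + off]
    return out
```

Both functions return the same band; the loop never materializes the n x n matrix but pays for it with one small product per query, while the dense version is a single large multiply that wastes memory on entries outside the band.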

Method Adaptation

Longformer builds attention from locality plus a small number of global positions. Each token attends to a fixed-size window, giving O(n × w) complexity for a fixed window size w. Stacking layers grows the receptive field, and dilated windows can reach tens of thousands of tokens without increasing the per-layer window size. For task outputs that require whole-document aggregation, the model adds symmetric global attention: selected tokens attend to all tokens, and all tokens attend to them. The paper uses [CLS] for classification, question tokens for QA, and task-specific global positions for candidate-answer style inputs.
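The local-plus-global pattern can be written down as a boolean mask. This is a minimal NumPy sketch for illustration, not the paper's implementation: the function name is hypothetical, `w` is treated as the total window with `w // 2` tokens on each side, and the dense n x n mask is only practical for small n.

```python
import numpy as np


def longformer_mask(n: int, w: int, global_idx=()) -> np.ndarray:
    """Boolean (n, n) mask: True where query i may attend to key j."""
    pos = np.arange(n)
    # Local band: w // 2 tokens on each side of the query, plus the query itself.
    mask = np.abs(pos[:, None] - pos[None, :]) <= w // 2
    g = np.asarray(global_idx, dtype=int)
    mask[g, :] = True  # global tokens attend to every token
    mask[:, g] = True  # every token attends to the global tokens
    return mask


if __name__ == "__main__":
    m = longformer_mask(n=16, w=4, global_idx=[0])  # position 0 plays [CLS]
    print(m.astype(int))
```

Because the global rows and columns are set together, the mask is symmetric, and the number of allowed pairs grows linearly in n as long as the window and the number of global positions stay fixed.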

The device adaptation is three-tiered. Longformer-loop is simple and memory-efficient but too slow for experiments. Longformer-chunks is the pretraining/fine-tuning workhorse because it keeps the computation in large dense PyTorch multiplies, accepting extra zero work in exchange for speed. Longformer-cuda is a TVM-generated custom CUDA kernel for banded attention; it supports dilation and autoregressive mode, is the most memory-efficient implementation, and is reported to be about as fast as optimized full self-attention. This division mirrors hardware tradeoffs: use dense library kernels when memory permits, and use a custom kernel when sequence length strains the memory budget or dilation is required.
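The idea behind the middle tier can be illustrated with a simplified blocked variant. This is an assumption-laden sketch, not the paper's overlapping-chunk layout: queries are split into blocks of `r` tokens, each block is multiplied against its previous, own, and next block of keys in one batched product, and entries outside the band are masked afterwards. The function name and the one-sided radius `r` are hypothetical.

```python
import numpy as np


def banded_scores_chunked(q: np.ndarray, k: np.ndarray, r: int):
    """Scores for |i - j| <= r using one batched matmul over blocks of size r.

    Returns (scores, key_pos): scores has shape (b, r, 3r) with -inf outside
    the band, and key_pos[c, 0, s] is the key position held in slot s of block c.
    Assumes the sequence length n is a multiple of r.
    """
    n, d = q.shape
    assert n % r == 0, "pad the sequence to a multiple of r first"
    b = n // r
    q_blocks = q.reshape(b, r, d)
    # Pad keys by one block on each side so edge blocks also have neighbours.
    k_pad = np.concatenate([np.zeros((r, d)), k, np.zeros((r, d))])
    # k_ctx[c] covers original key positions (c - 1) * r ... (c + 2) * r - 1.
    k_ctx = np.stack([k_pad[c * r:(c + 3) * r] for c in range(b)])
    scores = np.einsum("bqd,bkd->bqk", q_blocks, k_ctx)  # dense, some waste
    key_pos = (np.arange(b)[:, None, None] - 1) * r + np.arange(3 * r)[None, None, :]
    qry_pos = np.arange(b)[:, None, None] * r + np.arange(r)[None, :, None]
    valid = (np.abs(key_pos - qry_pos) <= r) & (key_pos >= 0) & (key_pos < n)
    return np.where(valid, scores, -np.inf), key_pos
```

Each query row here computes 3r scores to keep 2r + 1 of them, which is the same kind of trade described above: extra masked work in exchange for a single large dense multiply.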

The encoder-decoder LED variant makes the same compute choice at a higher level. The encoder uses Longformer local plus global attention, while the decoder retains full attention over the generated prefix and encoded tokens. That is appropriate because long-document summarization has very long inputs and comparatively short outputs.

Evidence

The language-modeling results show how far the memory budget stretches. Longformer evaluates character LM on sequences of length 32256 and trains through staged phases ending at 23040 characters because of GPU memory. The small model reports BPC 1.10 on text8 and 1.00 on enwik8, described as state of the art in that setting. The large enwik8 model reports 0.99 BPC, competitive with larger long-context models.

The pretraining evidence is also compute-specific. Before continued pretraining, simply changing RoBERTa to Longformer at sequence length 4096 with copied position embeddings gives base BPC 1.957; after 2K updates it improves to 1.753, and after 65K updates to 1.705. The large model similarly improves to 1.358 after 65K updates. This shows that the model can reuse RoBERTa weights while learning to operate in a 4096-token sparse-attention regime.
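One plausible reading of "copied position embeddings" is that the 512 learned positions are tiled until they cover 4096. The sketch below shows that reading with NumPy; the function name is hypothetical, the random table stands in for real RoBERTa weights, and details such as padding offsets in the actual checkpoint are ignored.

```python
import numpy as np


def extend_position_embeddings(pos_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Tile a (old_len, dim) position table until it covers new_len rows."""
    old_len = pos_emb.shape[0]
    repeats = -(-new_len // old_len)  # ceiling division
    return np.tile(pos_emb, (repeats, 1))[:new_len].copy()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    short_table = rng.normal(size=(512, 8))  # stand-in for learned embeddings
    long_table = extend_position_embeddings(short_table, 4096)
    print(long_table.shape)  # (4096, 8)
```

Initializing this way keeps the local position structure the pretrained weights already rely on, which is consistent with the model starting at a reasonable BPC before any further updates.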

Downstream results support the long-context budget. Longformer-base improves over RoBERTa-base on WikiHop (75.0 versus 72.4), TriviaQA (75.2 versus 74.3), HotpotQA (64.4 versus 63.5), IMDB (95.7 versus 95.3), and Hyperpartisan (94.8 versus 87.4). Longformer-large reports leaderboard scores of 81.9 on WikiHop and 77.3 on TriviaQA, ahead of the then-current state of the art in the table. The WikiHop ablation speaks most directly to the architecture: removing both the separate global projections and global attention drops accuracy to 65.5, while the full 4096-token Longformer reaches 73.8 or 75.0 depending on training length. LED shows input-length scaling on arXiv summarization, with 16384-token LED-large at ROUGE-1/2/L 46.63/19.62/41.83.

Historical Effect

Longformer helped establish long-document Transformers as a GPU-engineering problem, not just an architecture sketch. It connected a simple sparse pattern, local plus global attention, with the practical question of how to make banded attention run under real GPU memory limits. It also showed that sparse long-context attention could be a drop-in replacement for RoBERTa attention in the pretrain/fine-tune workflow, which mattered for adoption.

Limits

The limits follow from the same design choices. Global attention is task-specific, so the model needs human choices about which tokens should be global. The fastest and most memory-efficient paths depend on a specialized CUDA/TVM implementation, and the chunked implementation pays extra memory by computing masked zeros. The paper reports that 23040 characters was a GPU memory limit in character LM, and pretrained Longformer is mainly a 4096-token model, so longer inputs still require careful batching, truncation, or specialized LED-style setups.
