Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Metadata

Slug: transformer_xl_2019
Year: 2019
Venue: ACL
Authors: Zihang Dai et al.
Reading status: read complete
Compute regime: TPU and accelerator Transformer era
Primary sources: PDF, extracted text

Compute Setup

The local extracted text does not list the full training hardware or cluster size. Under the project rule, training is inferred as GPU-based language-model training in the late-2018/2019 accelerator era because the paper discusses GPU memory and reports evaluation speed on one GPU. The clearest hardware statement is the evaluation-speed table: slowdown is measured by per-token time on one GPU. The recurrence section also says the model can cache as many previous segments as GPU memory allows.

The experiments still define a compute scale. WikiText-103 has 103M training tokens from 28K articles, with an average article length of 3.6K tokens; attention length is 384 during training and 1600 during evaluation. For enwik8/text8, attention length is 784 during training and 3800 during evaluation. For the One Billion Word ablation, a 20-layer Transformer-XL with about 0.3B parameters is trained for 400K steps.

Bottleneck

The bottleneck is the fixed-length Transformer segment. Vanilla Transformers train on separate segments, so information cannot flow across segment boundaries. This creates context fragmentation: tokens at the start of a segment lack prior context, and dependencies longer than the segment length are unavailable. Self-attention memory and compute grow quadratically with segment length, so simply lengthening the segment is constrained by GPU memory.

Evaluation adds recomputation. A vanilla Transformer can use more context by sliding the segment one position at a time and predicting only the last token, but each new position recomputes almost the whole segment. For sequence length N and segment length L, that is on the order of N x L token computations rather than N.
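The recomputation gap can be illustrated with a rough count of token positions processed during evaluation. This is a simplified sketch, not the paper's measurement: the function names are hypothetical, each forward pass over a window is counted as one computation per token in the window, and the cost of attention inside the window is ignored. It is not meant to reproduce the slowdown factors the paper reports.

```python
def sliding_window_positions(n_tokens: int, seg_len: int) -> int:
    """Positions processed when the window slides one token at a time.

    Each of the n_tokens predictions reruns a window of up to seg_len tokens.
    """
    return sum(min(t + 1, seg_len) for t in range(n_tokens))


def cached_positions(n_tokens: int) -> int:
    """Positions processed when earlier hidden states are read from a cache.

    Each token is encoded once; earlier states are reused, not recomputed.
    """
    return n_tokens


if __name__ == "__main__":
    n, seg = 10_000, 512  # illustrative sizes, not taken from the paper
    slow = sliding_window_positions(n, seg)
    fast = cached_positions(n)
    print(f"sliding window: {slow} positions, cached: {fast} positions")
    print(f"ratio: {slow / fast:.1f}x")
```

For long sequences the ratio approaches the segment length, which is why the gap widens as the evaluation window grows.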

Method Adaptation

Transformer-XL caches hidden states from prior segments and reuses them as memory for the next segment. During training, the cached states are treated as constants (gradients are stopped at the cache), so backpropagation stays within the current segment while the forward computation attends to longer context. During evaluation, the cache avoids rerunning a full overlapping window for every token.
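A minimal NumPy sketch of the caching idea follows. It is an illustration under stated assumptions, not the paper's implementation: it uses one layer and one attention head, omits positional encodings, and the names attend_with_memory and update_memory are hypothetical. NumPy has no autodiff, so the stop-gradient is only noted in comments; in an autodiff framework the cached array would be wrapped in a stop-gradient operation such as jax.lax.stop_gradient or PyTorch's Tensor.detach.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_with_memory(h, mem, w_q, w_k, w_v):
    """One attention head over [memory ; current segment].

    h:   (L, d) hidden states of the current segment
    mem: (M, d) cached hidden states from earlier segments, treated as
         constants (this is where a stop-gradient would be applied)
    """
    seg_len, mem_len = h.shape[0], mem.shape[0]
    context = np.concatenate([mem, h], axis=0)   # (M + L, d)
    q = h @ w_q                                  # queries: current segment only
    k = context @ w_k                            # keys: memory + current segment
    v = context @ w_v                            # values: memory + current segment
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (L, M + L)
    # Causal mask: query i sits at slot M + i and may see key slots j <= M + i.
    i = np.arange(seg_len)[:, None]
    j = np.arange(mem_len + seg_len)[None, :]
    scores = np.where(j <= mem_len + i, scores, -np.inf)
    return softmax(scores) @ v                   # (L, d)


def update_memory(mem, h, mem_len):
    """Append the new segment's states and keep only the most recent mem_len."""
    if mem_len <= 0:
        return h[:0].copy()
    return np.concatenate([mem, h], axis=0)[-mem_len:].copy()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, seg_len, mem_len = 8, 4, 6  # illustrative sizes only
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
    mem = np.zeros((0, d))         # empty cache before the first segment
    for step in range(3):
        h = rng.normal(size=(seg_len, d))
        out = attend_with_memory(h, mem, w_q, w_k, w_v)
        mem = update_memory(mem, h, mem_len)
        print(step, out.shape, mem.shape)
```

In a multi-layer model, each layer would keep its own cache of the states it attends over; the single-layer version here only shows the bookkeeping.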

This design is explicitly memory-shaped. The memory length M and current segment length L determine the size of the attention context, and the paper notes that multiple previous segments can be cached as GPU memory permits. Backpropagation length can remain shorter than evaluation attention length: in one ablation, training backpropagation length is only 128, but evaluation attention can increase to 640 when recurrence and the new positional encoding are used. That separation is the key compute adaptation. It does not make attention free, but it lets inference use longer context than training backpropagation would otherwise allow.
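The separation between backpropagation length and attention length can be written down as simple bookkeeping. The sketch below assumes that attention length means memory length plus segment length; the ContextConfig class is hypothetical. The values 128 and 640 echo the ablation described above, while the way 640 is split between memory and segment is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextConfig:
    seg_len: int  # L: tokens in the current segment
    mem_len: int  # M: cached positions from earlier segments

    @property
    def attn_len(self) -> int:
        """Positions a query can attend over: memory plus current segment."""
        return self.mem_len + self.seg_len

    @property
    def backprop_len(self) -> int:
        """Positions that receive gradients in training: current segment only."""
        return self.seg_len


if __name__ == "__main__":
    # The training memory length here is assumed for illustration.
    train = ContextConfig(seg_len=128, mem_len=128)
    evaluate = ContextConfig(seg_len=128, mem_len=512)
    print("train: backprop_len =", train.backprop_len, "attn_len =", train.attn_len)
    print("eval:  attn_len =", evaluate.attn_len)
```

Only mem_len changes between the two settings, so evaluation context can grow without changing how far gradients were propagated during training.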

The second adaptation is relative positional encoding. Reusing hidden states across segments breaks naive absolute positions because cached states are interpreted at new temporal offsets. Transformer-XL makes attention depend on relative distance and uses an efficient relative-attention form rather than a naive construction that would grow too expensive with memory length.
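The position problem can be seen from index bookkeeping alone. In the sketch below (function names are hypothetical), restarting absolute position labels in every segment gives cached and current tokens the same labels, whereas the query-to-key distance stays well defined across the segment boundary. This shows only the indexing, not the paper's efficient relative-attention computation.

```python
import numpy as np


def relative_distances(seg_len: int, mem_len: int) -> np.ndarray:
    """Distance from each query to each key over [memory ; current segment].

    Row i is the i-th query of the current segment, at slot mem_len + i.
    Column j is a key slot in the concatenated context.
    Entry = (mem_len + i) - j; negative entries are future positions that a
    causal mask would hide.
    """
    q_pos = mem_len + np.arange(seg_len)[:, None]
    k_pos = np.arange(mem_len + seg_len)[None, :]
    return q_pos - k_pos


def per_segment_absolute_labels(seg_len: int, n_segments: int) -> np.ndarray:
    """Position labels when every segment restarts numbering at 0."""
    return np.tile(np.arange(seg_len), n_segments)


if __name__ == "__main__":
    # Absolute labels repeat across segments, so a cached token and a
    # current token can carry the same label.
    print(per_segment_absolute_labels(seg_len=3, n_segments=2))
    # Relative distances are unique per (query, key) pair within the context.
    print(relative_distances(seg_len=3, mem_len=2))
```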

Evidence

The benchmark evidence ties accuracy to long-context compute. On WikiText-103, the paper reports reducing prior state-of-the-art perplexity from 20.5 to 18.3. It reports 0.99 bpc on enwik8, 1.08 bpc on text8, and One Billion Word perplexity improving from 23.7 to 21.8. The abstract summarizes dependency length as 80% longer than RNNs and 450% longer than vanilla Transformers.

The WikiText-103 ablation shows why the structure matters. For a 128M-parameter model, recurrence plus the paper's positional encoding reaches a best perplexity (PPL best) of 26.77 at attention length 500. Removing recurrence raises PPL best to 29.02 and shrinks the useful attention length to 260. In the 151M setting, perplexity improves from 23.43 to 23.16 at attention length 450, and another listed result reaches 23.09 at attention length 640. Under the same GPU memory constraints, the appendix states Transformer-XL remains superior despite shorter backpropagation length.

The evaluation-speed table is the clearest device-backed compute result. It reports per-token evaluation time on one GPU as the slowdown of Al-Rfou et al.'s vanilla Transformer baseline relative to Transformer-XL at several attention lengths: 1,874x slower at attention length 3,800, 1,409x at 2,800, 773x at 1,800, and 363x at 800. These large factors come from cache reuse rather than a faster individual matrix multiply.

Historical Effect

Transformer-XL made segment recurrence a standard long-context Transformer compute idea. It showed that context length could be expanded by reusing activations across segments, not only by increasing quadratic attention length inside one segment. It separated training backpropagation length, evaluation attention length, and dependency length, reframing hidden states as a finite GPU-resident cache that trades memory for less recomputation.

Limits

The method spends GPU memory to save recomputation. A longer cache improves context but increases attention memory and bandwidth because current tokens attend over both memory and the current segment. Cached states are treated as constants during training, so gradients do not flow through the full extended history.
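A back-of-the-envelope estimate shows how the two costs scale with memory length. The sketch assumes one cached hidden-state tensor per layer and 4-byte values; the function names and all sizes are hypothetical and are not taken from the paper.

```python
def cache_bytes(n_layers: int, mem_len: int, d_model: int,
                batch_size: int = 1, bytes_per_value: int = 4) -> int:
    """Bytes to hold one (batch, mem_len, d_model) cached tensor per layer."""
    return n_layers * batch_size * mem_len * d_model * bytes_per_value


def score_entries(seg_len: int, mem_len: int) -> int:
    """Attention-score entries per head per layer for one segment."""
    return seg_len * (mem_len + seg_len)


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the scaling.
    for mem_len in (0, 512, 2048):
        mib = cache_bytes(n_layers=16, mem_len=mem_len, d_model=1024) / 2**20
        entries = score_entries(seg_len=128, mem_len=mem_len)
        print(f"mem_len={mem_len:5d}  cache={mib:7.1f} MiB  scores/head/layer={entries}")
```

Both quantities grow linearly in memory length for a fixed segment length, which is the trade the method makes in exchange for avoiding recomputation.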

The benefit depends on the relative positional encoding. The paper's ablations show that recurrence without the right encoding loses much of the long-attention advantage, and absolute encodings generalize poorly when evaluation attention length differs from training. Finally, because the local extracted text does not list complete training hardware, this card should not infer a specific number or model of training accelerator. The source-backed device statement is one-GPU per-token evaluation plus the stated GPU-memory cache constraint.