Fast Inference from Transformers via Speculative Decoding
Metadata
- Reading status: read complete
- Year: 2023
- Compute regime: Efficient and edge inference (efficient_edge_inference)
- PDF: 2023-speculative_decoding_2023.pdf
- Extracted text: 2023-speculative_decoding_2023.txt
- OpenAlex:
- Citation count source/date:
- Citation count:
- Reading card created: 2026-06-15
Compute Setup
The paper explicitly states the hardware for its wall-clock measurements: batch size 1 on a single TPU-v4. The implemented latency experiments use T5 version 1.1 encoder-decoder checkpoints, with T5-XXL 11B as the target model and smaller T5 checkpoints as approximation models: T5-large 800M, T5-base 250M, and T5-small 77M. The tested tasks are English-German translation fine-tuned on WMT En-De and CNN/DailyMail summarization.
The source also reports broader model-scale probes without wall-clock hardware: a GPT-like 97M decoder target with a 6M approximation model on LM1B, and a LaMDA 137B decoder target with LaMDA 8B, 2B, and 100M approximation models on dialog. Those measure acceptance rates, not device-specific latency.
Bottleneck
The bottleneck is serial autoregressive decoding from a large target model. Standard generation gets one token per target-model step, so latency is tied to the number of sequential target passes. The paper emphasizes that large-model inference is often not bottlenecked on arithmetic operations but on memory bandwidth and communication. That is especially true at batch size 1, where each decode step rereads model weights and KV cache with limited arithmetic reuse.
Speculative decoding attacks the memory/latency side of the problem. It may spend more total arithmetic, because one algorithm iteration evaluates multiple positions of the target model in parallel and also runs the approximation model. But if the hardware can evaluate the target positions concurrently without increasing wall time, it reduces serial target-model calls and can reduce memory accesses to target weights and KV cache per generated token. The method is therefore best suited to a serving regime with spare parallel compute but a serial latency or memory-bandwidth bottleneck.
Method Adaptation
The method uses a small approximation model Mq to propose gamma future tokens, then runs the large target model Mp over the proposed prefix positions in parallel. An acceptance/rejection procedure then walks the draft left to right, accepting each token with a probability set by the ratio of its target probability to its draft probability. At the first rejection it samples a correction token from an adjusted target distribution; if every draft token is accepted, it samples one extra token from the target. The important algorithmic constraint is exactness: outputs follow the same distribution as standard target-model decoding, including stochastic sampling, not just greedy decoding.
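The draft-then-verify loop described above can be sketched on a toy vocabulary. This is a minimal illustration, not the paper's implementation: `speculative_step`, `sample`, and the two toy model functions are hypothetical names, the models are plain Python callables that return a probability list for a given prefix, and the target "parallel pass" is simulated with a sequential list comprehension.

```python
import random

def sample(dist, rng):
    # Draw a token id from a list of (possibly unnormalized) non-negative weights.
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def speculative_step(target_dist, draft_dist, prefix, gamma, rng):
    """One speculative iteration; returns between 1 and gamma + 1 new token ids."""
    # 1) Draft gamma tokens autoregressively with the cheap model.
    drafted, q_dists = [], []
    for _ in range(gamma):
        q = draft_dist(prefix + drafted)
        q_dists.append(q)
        drafted.append(sample(q, rng))
    # 2) Score all gamma + 1 prefixes with the target.
    #    A real system would do this in one parallel pass; here it is sequential.
    p_dists = [target_dist(prefix + drafted[:i]) for i in range(gamma + 1)]
    # 3) Accept each drafted token x with probability min(1, p(x) / q(x)).
    out = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # 4) First rejection: resample from max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # 5) Every draft accepted: take one extra token from the target.
    out.append(sample(p_dists[gamma], rng))
    return out

# Toy, prefix-independent models over a 3-token vocabulary (illustrative only).
def toy_target(prefix):
    return [0.6, 0.3, 0.1]

def toy_draft(prefix):
    return [0.3, 0.3, 0.4]

print(speculative_step(toy_target, toy_draft, [], gamma=4, rng=random.Random(0)))
```

Sampling many first tokens from this function and comparing their frequencies with the target distribution is a simple way to check the exactness property empirically.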
The compute adaptation is controlled by two quantities. The acceptance rate alpha is the expected probability that a draft token is accepted by the target, and the cost coefficient c is the wall-clock cost of one draft step relative to one target step. The paper reports that for approximation models a couple of orders of magnitude smaller than the target, c was always below 0.05 and often close to zero. The theoretical wall-time improvement depends on alpha, gamma, and c, and the analysis assumes that gamma + 1 target evaluations can run concurrently. That assumption makes hardware parallelism part of the algorithm, not just an implementation detail.
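For reference, the dependence on alpha, gamma, and c can be written out. The expressions below are the closed forms as recalled from the paper's analysis, under its simplifying assumption that each draft token is accepted independently with probability alpha. Treat the notation as a sketch, not a quotation.

```latex
% Expected number of tokens produced by one iteration
% (gamma draft steps followed by one parallel target pass)
\mathbb{E}[\text{tokens per iteration}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}

% Expected wall-time improvement over standard decoding,
% where one iteration costs (\gamma c + 1) target-step times
\text{improvement} = \frac{1 - \alpha^{\gamma+1}}{(1 - \alpha)(\gamma c + 1)}
```

In this idealized model, speculation pays off only when the second expression exceeds 1, which is why a cheap draft (small c) can tolerate a modest alpha.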
The choice of draft size is a trade-off between acceptance rate and draft overhead. T5-small is cheap and gives strong speedups despite lower acceptance than larger drafts. T5-large has slightly higher acceptance but costs more per step and produces lower measured speedups in the reported settings. This is the core serving lesson: the best draft is not necessarily the most accurate small model; it is the model that balances acceptance against draft overhead and target parallelism.
Evidence
The main wall-clock table supports the compute argument. On a single TPU-v4 with batch size 1, WMT En-De using T5-XXL and T5-small reaches 3.4x speedup at temperature 0 with gamma = 7, alpha = 0.75, and draft cost around 0.02. At temperature 1, the same task/model pair reaches 2.6x with alpha = 0.62. CNN/DailyMail with T5-small reaches 3.1x at temperature 0 and 2.3x at temperature 1. T5-base also performs well, reaching 2.8x (temperature 0) and 2.4x (temperature 1) on En-De, and 3.0x and 2.2x on CNN/DailyMail.
The table also shows why draft overhead matters. T5-large has higher acceptance than T5-small in several settings, but its larger cost coefficient, about 0.11, cuts speedups to 1.7x on En-De temperature 0, 1.4x on En-De temperature 1, 2.2x on CNN/DailyMail temperature 0, and 1.7x on CNN/DailyMail temperature 1. The smaller draft wins because the modest gain in acceptance from a larger draft does not offset its higher per-step cost.
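The acceptance-versus-overhead trade-off can be explored numerically. The sketch below assumes the paper's idealized wall-time model, in which one iteration costs gamma * c + 1 target steps and yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation. The function names are hypothetical. The small-draft inputs are the En-De temperature-0 values quoted above; for the larger draft, c = 0.11 is quoted above but alpha = 0.82 is an assumed value. The model is an idealization and is not expected to reproduce the measured speedups exactly.

```python
def walltime_improvement(alpha, gamma, c):
    """Idealized speedup: expected tokens per iteration / iteration cost in target steps."""
    if alpha >= 1.0:
        expected_tokens = gamma + 1
    else:
        expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return expected_tokens / (gamma * c + 1)

def best_gamma(alpha, c, max_gamma=20):
    """Return (gamma, improvement) maximizing the idealized speedup over 1..max_gamma."""
    candidates = ((g, walltime_improvement(alpha, g, c)) for g in range(1, max_gamma + 1))
    return max(candidates, key=lambda pair: pair[1])

# Small draft: alpha and c as quoted for En-De at temperature 0.
small = best_gamma(alpha=0.75, c=0.02)
# Larger draft: c as quoted; alpha = 0.82 is an ASSUMED illustrative value.
large = best_gamma(alpha=0.82, c=0.11)

print("small draft (gamma, improvement):", small)
print("large draft (gamma, improvement):", large)
```

Under these inputs the cheaper draft comes out ahead even though its assumed acceptance rate is lower, which mirrors the ordering in the measured table.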
The broader acceptance-rate experiments show generality. For the GPT-like 97M target, a 6M GPT-like draft gives alpha around 0.88 to 0.89, while unigram and bigram drafts are much weaker but nonzero. For T5-XXL, T5-small/base/large approximation models usually produce alpha between about 0.53 and 0.82 across tasks and temperatures. LaMDA 137B is included as a large-scale target for acceptance analysis, but not as a wall-clock benchmark in the paper.
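To make the acceptance numbers concrete: if a draft token x is sampled from q and accepted with probability min(1, p(x) / q(x)), the per-position acceptance probability works out to the sum over the vocabulary of min(p(x), q(x)). The sketch below computes that quantity for toy distributions and averages it over positions as a crude estimate of alpha. The function names and the example distributions are illustrative assumptions, not values from the paper.

```python
def acceptance_probability(p, q):
    """Probability that a token drafted from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def estimate_alpha(pairs):
    """Average per-position acceptance over (p, q) pairs, as a rough alpha estimate."""
    pairs = list(pairs)
    return sum(acceptance_probability(p, q) for p, q in pairs) / len(pairs)

# Toy example: one target distribution, two candidate drafts.
target = [0.6, 0.3, 0.1]
close_draft = [0.5, 0.4, 0.1]      # resembles the target
uniform_draft = [1 / 3, 1 / 3, 1 / 3]  # context-free, like a very weak draft

print(acceptance_probability(target, close_draft))
print(acceptance_probability(target, uniform_draft))
print(estimate_alpha([(target, close_draft), (target, uniform_draft)]))
```

Even the uniform draft has nonzero acceptance here, which is consistent with the observation that very weak drafts such as unigram models still get some tokens accepted.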
Historical Effect
Speculative decoding established a clean route to lossless LLM latency reduction: use a cheap model to create likely future tokens and use the expensive model to verify many of them in one parallel pass. Its historical importance is that it separates distributional correctness from serial decoding. Earlier blockwise or adaptive computation ideas often changed outputs, supported only greedy decoding, or required architecture changes; this paper made the exact stochastic version practical with existing checkpoints.
In serving terms, the paper helped shift the question from "how many FLOPs per token?" to "how many serial target-model passes and memory reads per token?" That framing remains central in modern LLM inference stacks, especially for batch-1 or low-latency workloads where memory bandwidth, KV cache traffic, and decode serialization dominate.
Limits
The method is not free compute. It can increase total arithmetic, particularly when acceptance is low, because rejected draft tokens still required target verification. It also assumes enough hardware concurrency to run gamma + 1 target positions in parallel without increasing wall time. If the serving system is already fully compute-saturated with large batches, the same algorithm may deliver smaller latency gains or none.
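The extra arithmetic can be estimated with simple bookkeeping. Each iteration evaluates gamma draft steps and gamma + 1 target positions, and emits (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation under the independent-acceptance assumption. The sketch below divides one by the other. `total_ops_factor` and `c_hat` are hypothetical names; `c_hat` stands for the arithmetic cost of one draft step relative to one target step, which need not equal the wall-clock ratio c.

```python
def total_ops_factor(alpha, gamma, c_hat):
    """Expected arithmetic per emitted token, relative to standard decoding.

    c_hat is an assumed parameter: draft-step arithmetic / target-step arithmetic.
    """
    # gamma draft steps plus gamma + 1 target positions, in units of one target step
    ops_per_iteration = gamma * c_hat + gamma + 1
    if alpha >= 1.0:
        expected_tokens = gamma + 1
    else:
        expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return ops_per_iteration / expected_tokens

# Illustrative inputs only: low acceptance inflates total work far more than high acceptance.
for alpha in (0.3, 0.6, 0.9):
    print(alpha, round(total_ops_factor(alpha, gamma=7, c_hat=0.01), 2))
```

The factor approaches gamma + 1 when almost nothing is accepted and approaches 1 as alpha nears 1, which is the sense in which low acceptance makes the method arithmetic-hungry.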
Speedup is highly task-, temperature-, and draft-dependent. The paper shows higher speedups for argmax decoding than stochastic temperature-1 decoding, because sharper distributions raise acceptance. It also shows that larger drafts are not automatically better. Finally, wall-clock results are reported only for the T5 models; the GPT-like and LaMDA cases are acceptance-rate analyses, so device-specific speedup should not be extrapolated without new measurements.
Links
- Compute regime: history/compute_regimes/efficient_edge_inference/README.md
- Source PDF and extracted text are listed in metadata above.
- Queue status: read_complete
- Method index: speculative_decoding
- Ledger updates: compute bottlenecks