Speculative decoding
Compute interpretation
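Inference acceleration method that trades extra compute for lower latency on large target models: a small draft model proposes several tokens, and the target model verifies them in a single parallel forward pass. The added cost is the draft model's work plus target-model compute spent on rejected proposals.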
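To make the latency/compute trade-off concrete, here is a back-of-the-envelope calculator. It is a sketch under one stated simplifying assumption: each of the gamma tokens drafted per step is accepted independently with the same probability alpha. One step is modeled as gamma sequential draft-model calls followed by one parallel target-model call. The function names and the parameters `alpha`, `gamma`, `c` (draft call time / target call time) and `c_hat` (draft ops per token / target ops per token) are illustrative, and the example numbers are hypothetical, not measurements.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model call, assuming each of the
    gamma drafted tokens is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the target itself.
        return float(gamma + 1)
    # Geometric series 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def walltime_speedup(alpha: float, gamma: int, c: float) -> float:
    """Latency improvement over plain autoregressive decoding.
    c = time of one draft-model call / time of one target-model call."""
    # One step costs gamma draft calls plus one (parallel) target call.
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)


def compute_overhead(alpha: float, gamma: int, c_hat: float) -> float:
    """Factor by which total arithmetic grows versus plain decoding.
    c_hat = draft-model ops per token / target-model ops per token."""
    # Per step the target scores gamma + 1 positions and the draft runs gamma times.
    ops_per_step = (gamma + 1) + gamma * c_hat
    return ops_per_step / expected_tokens_per_step(alpha, gamma)


if __name__ == "__main__":
    # Hypothetical setting: alpha = 0.8, gamma = 4, c = c_hat = 0.05.
    print(f"{expected_tokens_per_step(0.8, 4):.3f}")  # 3.362 tokens per target call
    print(f"{walltime_speedup(0.8, 4, 0.05):.3f}")    # 2.801x lower latency
    print(f"{compute_overhead(0.8, 4, 0.05):.3f}")    # 1.547x more arithmetic
```

The trade-off is visible in the two ratios. `compute_overhead` is never below 1, because the target always scores gamma + 1 positions per step. `walltime_speedup` falls below 1 when the draft is poorly aligned (low alpha) or expensive (large c). For example, with alpha = 0 each step yields a single token but still pays for gamma draft calls.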
Supporting reading cards
- Fast Inference from Transformers via Speculative Decoding (2023, efficient_edge_inference)
Obsolete or less central under later compute
Track this only through linked reading cards; do not treat this method page as standalone evidence.