Memory-efficient attention
Compute interpretation
Attention implementations that treat the GPU memory hierarchy and IO traffic as first-class constraints.
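As an illustration of what treating IO as a constraint can mean, the sketch below compares ordinary softmax attention with a blockwise variant that visits keys and values one block at a time and keeps a running softmax, so the full row of scores is never held at once. It is a minimal pure-Python sketch of the arithmetic only, not the kernel from any of the cards listed on this page. The names `naive_attention`, `tiled_attention`, and `block_size` are hypothetical. Real implementations such as FlashAttention perform this tiling inside a fused GPU kernel that keeps blocks in on-chip memory.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: builds the full score row for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Same result, but K and V are read in blocks with a running softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        running_max = float("-inf")  # largest score seen so far
        running_sum = 0.0            # softmax denominator so far
        acc = [0.0] * d_v            # unnormalized weighted sum of values
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new reference maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions return the same values up to floating-point rounding. The difference is in working memory: the blockwise version needs only `block_size` scores per query at any time, which is the property that lets a GPU kernel avoid writing the full attention matrix to slow device memory.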
Supporting reading cards
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019, tpu_accelerator_transformer_era)
- Big Bird: Transformers for Longer Sequences (2020, sparse_memory_efficient_scaling)
- Linformer: Self-Attention with Linear Complexity (2020, sparse_memory_efficient_scaling)
- Longformer: The Long-Document Transformer (2020, sparse_memory_efficient_scaling)
- Reformer: The Efficient Transformer (2020, sparse_memory_efficient_scaling)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022, sparse_memory_efficient_scaling)
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023, sparse_memory_efficient_scaling)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2024, sparse_memory_efficient_scaling)
- Qwen3 Technical Report (2025, hyperscale_dense_llm_training)
- Gemma 3 Technical Report (2025, efficient_edge_inference)
- Kimi K2: Open Agentic Intelligence (2025, sparse_memory_efficient_scaling)
- DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025, sparse_memory_efficient_scaling)
Obsolete or less central under later compute
Track this only through linked reading cards; do not treat this method page as standalone evidence.