Memory-efficient attention

Compute interpretation

Attention implementations that treat the GPU memory hierarchy and IO traffic as first-class constraints, rather than optimizing for FLOP count alone.
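As a rough illustration of the IO-aware idea, the sketch below computes exact softmax attention for a single query while visiting keys and values one block at a time, keeping only a running maximum, a running normalizer, and an output accumulator. This is the online-softmax rescaling that FlashAttention-style kernels rely on to avoid writing the full score matrix to slow memory. It is plain Python for readability, not a GPU kernel; the function names, the `block_size` parameter, and the toy inputs are assumptions made for this example.

```python
import math


def naive_attention(q, K, V):
    """Reference version: materializes every score before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(dim)]


def blockwise_attention(q, K, V, block_size=2):
    """Same result, but only one block of scores exists at any time.

    Hypothetical sketch: each loop iteration stands in for loading one
    K/V tile into fast on-chip memory.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(V[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values

    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max

    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0]
    K = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, -1.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(naive_attention(q, K, V))
    print(blockwise_attention(q, K, V, block_size=2))
```

The two functions agree up to floating-point rounding for any block size. What changes is the working set: the blockwise version needs memory proportional to the block size and the value dimension, not to the sequence length.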

Supporting reading cards

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019, tpu_accelerator_transformer_era)
Big Bird: Transformers for Longer Sequences (2020, sparse_memory_efficient_scaling)
Linformer: Self-Attention with Linear Complexity (2020, sparse_memory_efficient_scaling)
Longformer: The Long-Document Transformer (2020, sparse_memory_efficient_scaling)
Reformer: The Efficient Transformer (2020, sparse_memory_efficient_scaling)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022, sparse_memory_efficient_scaling)
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023, sparse_memory_efficient_scaling)
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2024, sparse_memory_efficient_scaling)
Qwen3 Technical Report (2025, hyperscale_dense_llm_training)
Gemma 3 Technical Report (2025, efficient_edge_inference)
Kimi K2: Open Agentic Intelligence (2025, sparse_memory_efficient_scaling)
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (2025, sparse_memory_efficient_scaling)

Obsolete or less central under later compute

Track this only through linked reading cards; do not treat this method page as standalone evidence.