Sparse and memory-efficient scaling

Memory limits, activation cost, and communication pressure drive the adoption of MoE layers, IO-aware attention kernels, sharding, and recomputation.

14 papers · Regime 6 of 10

Device/setup

Accelerator clusters where memory bandwidth, HBM capacity, interconnect routing, and IO-aware kernels constrain useful scaling.

Bottleneck

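Dense attention and dense FFN costs grow faster than available memory and communication budgets: full attention scales quadratically with sequence length, and a dense FFN applies every parameter to every token.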
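To make the quadratic growth concrete, the following is a minimal arithmetic sketch. The function name, the head count of 32, and the 2-byte element size are illustrative assumptions, not figures from any paper listed here. It counts only the bytes needed to materialize one layer's full score matrices for a single sequence.

```python
def attention_score_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes to hold the full seq_len x seq_len score matrix for every head.

    Assumes one layer, one sequence, and a naive kernel that materializes
    the whole matrix (hypothetical setup for illustration).
    """
    return num_heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    heads = 32  # assumed head count
    for n in (2048, 8192, 32768):
        mib = attention_score_bytes(n, heads) / 2**20
        print(f"seq_len={n:6d}  score matrices = {mib:10.0f} MiB")
```

Under these assumptions, each 4x increase in sequence length multiplies the score-matrix footprint by 16, which is the pressure that IO-aware and sparse attention methods respond to.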

Methods that fit

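Mixture-of-experts layers with Switch-, GLaM-, or Mixtral-style routing, along with economical MoE designs, spend compute conditionally. Automatic sharding distributes parameters and computation across devices. FlashAttention and sparse, linear, or long-context attention reduce IO or attention cost.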
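As an illustration of spending compute conditionally, here is a toy sketch of top-k expert routing for a single token. It is a simplified stand-in, not the exact router of any listed paper: the experts are plain scalar functions, the names `top_k_route` and `moe_layer` are hypothetical, and the gate weights are assumed to be a softmax over the selected logits only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gates.

    Returns a list of (expert_index, gate_weight) pairs. Ties are broken
    in favor of the lower expert index (an arbitrary choice for this sketch).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))


def moe_layer(token, router_logits, experts, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    return sum(gate * experts[idx](token)
               for idx, gate in top_k_route(router_logits, k))


if __name__ == "__main__":
    # Four toy "experts"; only two of them run for this token.
    experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
    logits = [0.1, 2.0, -1.0, 2.0]
    print(top_k_route(logits))
    print(moe_layer(3.0, logits, experts))
```

The point of the sketch is that the cost per token depends on `k`, not on the number of experts, while all experts must still be stored somewhere.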

Methods that became obsolete or less central

Naive dense full-attention scaling and expert routing without load-balancing or systems support became less attractive.
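One way to see why routing without load balancing is unattractive is to compute an auxiliary balance term of the kind described in the Switch Transformers paper: the number of experts times the sum over experts of (fraction of tokens dispatched) times (mean router probability). The sketch below is a simplified version under stated assumptions: top-1 assignments, the scaling coefficient omitted, and a hypothetical function name and toy inputs.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary balance term, without the scaling coefficient.

    router_probs: per-token lists of router probabilities over experts.
    assignments:  per-token index of the expert the token was sent to (top-1).
    Returns num_experts * sum_i f_i * P_i, which equals 1.0 under perfectly
    uniform routing and grows as tokens concentrate on few experts.
    """
    num_tokens = len(router_probs)
    # f_i: fraction of tokens dispatched to expert i.
    frac = [0.0] * num_experts
    for expert in assignments:
        frac[expert] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens
                 for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


if __name__ == "__main__":
    balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)
    collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2)
    print(balanced, collapsed)
```

In this toy case the collapsed router, which sends every token to one expert, scores higher than the balanced one, so minimizing the term pushes the router toward using all experts.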

Representative papers

Rank | Year | Paper | Priority | Status
63 | 2017 | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | 5 | downloaded / read_complete
64 | 2020 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | 5 | downloaded / read_complete
65 | 2021 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | 5 | downloaded / read_complete
66 | 2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 5 | downloaded / read_complete
67 | 2021 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | 4 | downloaded / read_complete
68 | 2024 | Mixtral of Experts | 4 | downloaded / read_complete
69 | 2020 | Big Bird: Transformers for Longer Sequences | 3 | downloaded / read_complete
70 | 2020 | Linformer: Self-Attention with Linear Complexity | 3 | downloaded / read_complete
71 | 2020 | Longformer: The Long-Document Transformer | 3 | downloaded / read_complete
72 | 2020 | Reformer: The Efficient Transformer | 3 | downloaded / read_complete
73 | 2023 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | 3 | downloaded / read_complete
74 | 2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | 3 | downloaded / read_complete
129 | 2025 | Kimi K2: Open Agentic Intelligence | 4 | downloaded / read_complete
130 | 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models | 4 | downloaded / read_complete

Open questions

  • Track when sparsity saves wall-clock time or serving cost rather than only reducing nominal FLOP counts.
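A rough accounting sketch shows why nominal FLOP savings and serving cost can diverge. It assumes a two-matrix FFN per expert with no biases and ignores the router, attention, and communication; the function name and layer sizes are hypothetical.

```python
def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total_params, active_params_per_token) for one MoE FFN layer.

    Assumes each expert is an up projection plus a down projection
    (2 * d_model * d_ff weights), which is a simplification.
    """
    per_expert = 2 * d_model * d_ff
    total = num_experts * per_expert   # must be held in memory to serve
    active = top_k * per_expert        # touched when processing one token
    return total, active


if __name__ == "__main__":
    total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
    print(f"total={total:,}  active per token={active:,}  ratio={total / active:.1f}x")
```

In this sketch the per-token compute follows the active count, while the memory footprint follows the total count, so a 4x reduction in nominal FLOPs does not by itself imply a 4x reduction in wall-clock time or serving cost.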

Papers in this compute regime (14)