Sparse and memory-efficient scaling
Memory capacity, activation cost, and communication pressure motivate mixture-of-experts (MoE) layers, IO-aware attention kernels, sharding, and activation recomputation.
14 papers
Regime 6 of 10
Device/setup
Accelerator clusters where memory bandwidth, HBM capacity, interconnect routing, and IO-aware kernels constrain useful scaling.
Bottleneck
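Dense attention cost grows quadratically with sequence length, and dense FFN cost grows with every parameter applied to every token; both grow faster than available memory and communication budgets.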
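To make the bottleneck concrete, the following is a back-of-the-envelope sketch rather than a measurement. It assumes attention scores are fully materialised in 2-byte precision and that the FFN is a plain two-matrix block. The function names, the head count of 32, and the sequence lengths are illustrative assumptions, not values from any paper in this regime.

```python
def attention_score_bytes(seq_len: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes to materialise the full seq_len x seq_len score matrix for every head.

    Assumption: naive attention that stores all scores (one layer, one sequence).
    """
    return n_heads * seq_len * seq_len * bytes_per_elem


def dense_ffn_flops_per_token(d_model: int, d_ff: int) -> int:
    """Forward FLOPs per token for a two-matrix FFN.

    Each matrix multiply costs 2 * d_model * d_ff FLOPs (multiply plus add),
    and a dense FFN applies both matrices to every token.
    """
    return 2 * 2 * d_model * d_ff


if __name__ == "__main__":
    # Hypothetical configuration: 32 heads, fp16/bf16 scores.
    for seq_len in (2048, 8192, 32768):
        gib = attention_score_bytes(seq_len, n_heads=32) / 2**30
        print(f"seq_len={seq_len:>6}: {gib:.2f} GiB of scores per layer per sequence")
```

Under these assumptions, a 4x longer sequence needs 16x more score memory, which is the growth that IO-aware kernels and sparse or linear attention try to avoid.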
Methods that fit
Mixture-of-experts layers (Switch, GLaM, and Mixtral routing, plus economical MoE designs) spend compute conditionally; FlashAttention and sparse, linear, or long-context attention reduce IO or attention cost; automatic sharding distributes model state across devices.
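As an illustration of what "spending compute conditionally" means, here is a minimal top-k gating sketch in plain Python. It is a generic simplification, not the exact routing rule of Switch, GLaM, or Mixtral. The helper names (`top_k_route`, `moe_forward`) and the toy experts are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Pick the k most probable experts and renormalise their weights to sum to 1."""
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts (assumption for illustration): each scales its input by a constant.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]

if __name__ == "__main__":
    logits = [0.1, 2.0, 0.3, 2.0]
    print(top_k_route(logits, k=2))
    print(moe_forward([1.0, 1.0], logits, experts, k=2))
```

With four experts and k=2, only half of the expert parameters are touched per token, which is where the conditional-compute saving comes from. A real layer would route batches of tokens and dispatch them across devices.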
Methods that became obsolete or less central
Naive dense full-attention scaling, and expert routing that lacks load balancing or systems support, became less attractive.
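One common way to add load balancing is an auxiliary loss of the form N * sum_i f_i * P_i, as used in Switch-style routers, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below computes that quantity for a small batch. The function name and the example inputs are assumptions, and the scaling coefficient that usually multiplies this term is omitted.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Auxiliary balance term N * sum_i f_i * P_i.

    router_probs: list of per-token probability rows, each of length n_experts.
    assignments:  list of the expert index each token was dispatched to.
    The value is 1.0 under perfectly uniform routing and grows as routing collapses.
    """
    n_tokens = len(router_probs)
    # f_i: fraction of tokens dispatched to expert i.
    f = [0.0] * n_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    # P_i: mean router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2)
    collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], n_experts=2)
    print(balanced, collapsed)
```

The collapsed case scores higher than the balanced one, so minimising this term pushes the router away from sending every token to the same expert.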
Representative papers
Open questions
- Track when sparsity saves wall-clock or serving cost rather than only improving nominal FLOP counts.
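A rough way to reason about this question is a roofline-style estimate that separates nominal FLOP savings from estimated step time. The sketch below is a toy cost model, not a profiler. It assumes compute and memory traffic overlap perfectly and that communication is not overlapped at all. Every number in it (peak FLOP rate, bandwidths, workload sizes) is hypothetical.

```python
def step_time_s(flops, bytes_moved, comm_bytes, peak_flops, mem_bw, link_bw):
    """Estimated step time under a simple roofline-plus-communication model.

    Assumptions: compute and HBM traffic overlap (take the max of the two),
    and interconnect traffic is serialised after them (added on top).
    """
    on_device = max(flops / peak_flops, bytes_moved / mem_bw)
    return on_device + comm_bytes / link_bw


# Hypothetical hardware: 100 TFLOP/s, 1 TB/s HBM, 100 GB/s interconnect.
HW = dict(peak_flops=1e14, mem_bw=1e12, link_bw=1e11)

# Hypothetical workloads: the sparse model does 4x fewer FLOPs but reads more
# expert weights and pays for all-to-all token exchange.
dense = dict(flops=1e12, bytes_moved=1e9, comm_bytes=0.0)
sparse = dict(flops=0.25e12, bytes_moved=4e9, comm_bytes=2e8)

nominal_speedup = dense["flops"] / sparse["flops"]
wall_clock_speedup = step_time_s(**dense, **HW) / step_time_s(**sparse, **HW)

if __name__ == "__main__":
    print(f"nominal FLOP speedup:    {nominal_speedup:.2f}x")
    print(f"estimated wall-clock:    {wall_clock_speedup:.2f}x")
```

In this made-up configuration the sparse model becomes memory-bound and pays a communication cost, so the estimated wall-clock gain is well below the nominal FLOP reduction. Measured step times and serving cost on real hardware remain the quantities to track.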