Sparse and memory-efficient scaling

Device/setup

Accelerator clusters where memory bandwidth, HBM capacity, interconnect routing, and IO-aware kernels constrain useful scaling.

Bottleneck

Dense attention cost grows quadratically with sequence length, and a dense FFN applies every parameter to every token; both grow faster than available memory and communication budgets.
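
To make the bottleneck concrete, here is a back-of-envelope sketch of how much memory a naively materialized attention score matrix needs, and how many FLOPs a dense FFN spends per token. The function names, the 2-byte element size, and the 32-head configuration are illustrative assumptions, not figures taken from any of the papers listed below.

```python
def attention_score_bytes(seq_len: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes to materialize one layer's seq_len x seq_len score matrix for every head.

    Assumes a single sequence and a naive kernel that writes the full matrix
    to memory (the case IO-aware kernels are designed to avoid).
    """
    return n_heads * seq_len * seq_len * bytes_per_elem


def dense_ffn_flops_per_token(d_model: int, d_ff: int) -> int:
    """Approximate forward FLOPs per token for a two-matrix FFN.

    Two weight matrices of d_model * d_ff entries each, and 2 FLOPs
    (one multiply, one add) per weight.
    """
    return 2 * 2 * d_model * d_ff


if __name__ == "__main__":
    # Hypothetical configuration: 32 heads, 16-bit scores.
    for n in (2_048, 8_192, 32_768):
        gib = attention_score_bytes(n, n_heads=32) / 2**30
        print(f"seq_len={n:>6}: {gib:8.2f} GiB of scores per layer per sequence")
```

Quadrupling the sequence length multiplies the score-matrix footprint by sixteen, which is why long contexts hit HBM capacity and bandwidth limits well before they hit compute limits.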

Methods that fit

Methods here either spend compute conditionally or reduce IO: mixture-of-experts layers, automatic sharding, Switch/GLaM/Mixtral routing, FlashAttention, sparse, linear, and long-context attention, and economical MoE designs.
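
The two ideas in this list, conditional compute and reduced IO, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation from any listed paper. `top_k_route` renormalizes gate weights over only the selected experts, which is one common choice. `blockwise_softmax_average` shows the running-max and running-normalizer trick that lets a softmax-weighted sum be computed one block at a time, which is the general idea behind IO-aware exact attention. All function names and the scalar toy experts are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, router_logits, experts, k=2):
    """Conditional compute: only the k selected experts are evaluated."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


def blockwise_softmax_average(scores, values, block=2):
    """Reduced IO: softmax-weighted average of values, one block at a time.

    Only a running max, a running normalizer, and a running accumulator are
    kept, so the full list of softmax weights is never materialized.
    """
    running_max, normalizer, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        p_blk = [math.exp(s - new_max) for s in s_blk]
        normalizer = normalizer * rescale + sum(p_blk)
        acc = acc * rescale + sum(p * v for p, v in zip(p_blk, v_blk))
        running_max = new_max
    return acc / normalizer


if __name__ == "__main__":
    # Hypothetical toy experts: expert s multiplies its scalar input by s.
    experts = [lambda x, s=s: s * x for s in range(4)]
    logits = [0.0, 3.0, 1.0, -1.0]
    print(top_k_route(logits, k=2))
    print(moe_forward(2.0, logits, experts, k=2))
    print(blockwise_softmax_average([0.5, -1.0, 2.0], [1.0, 2.0, 3.0]))
```

In the routing half, the per-token cost depends on `k` rather than on the number of experts. In the block-wise half, the result matches an ordinary softmax-weighted sum up to floating-point rounding, so the saving is in memory traffic, not in accuracy.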

Methods that became obsolete or less central

Naive dense full-attention scaling, and expert routing that lacks load balancing or systems support, became less attractive.
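
As an illustration of what load balancing adds, the sketch below computes an auxiliary loss in the style described in the Switch Transformers paper: alpha * N * sum_i f_i * P_i, where N is the number of experts, f_i is the fraction of tokens whose top-1 expert is i, and P_i is the mean router probability for expert i. The function name, the list-of-lists input format, and the default `alpha=0.01` are assumptions made for this example.

```python
def load_balance_loss(router_probs, alpha=0.01):
    """Auxiliary load-balancing loss for top-1 routing.

    router_probs: one list per token, each holding that token's router
    probabilities over the N experts.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])

    # f[i]: fraction of tokens dispatched to expert i (top-1 choice).
    counts = [0] * n_experts
    for probs in router_probs:
        counts[max(range(n_experts), key=lambda i: probs[i])] += 1
    f = [c / n_tokens for c in counts]

    # p[i]: mean router probability mass assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(n_experts)]

    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    # Every token prefers a different expert (balanced).
    balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
                [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
    # Every token prefers expert 0 (collapsed).
    collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4
    print(load_balance_loss(balanced), load_balance_loss(collapsed))
```

The loss is smallest when tokens and probability mass are spread evenly across experts and grows as routing collapses onto a few of them, which is the failure mode that leaves most expert capacity idle.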

Representative papers

Rank	Year	Paper	Priority	Status
63	2017	Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer	5	downloaded / read_complete
64	2020	GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding	5	downloaded / read_complete
65	2021	Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity	5	downloaded / read_complete
66	2022	FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness	5	downloaded / read_complete
67	2021	GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	4	downloaded / read_complete
68	2024	Mixtral of Experts	4	downloaded / read_complete
69	2020	Big Bird: Transformers for Longer Sequences	3	downloaded / read_complete
70	2020	Linformer: Self-Attention with Linear Complexity	3	downloaded / read_complete
71	2020	Longformer: The Long-Document Transformer	3	downloaded / read_complete
72	2020	Reformer: The Efficient Transformer	3	downloaded / read_complete
73	2023	FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning	3	downloaded / read_complete
74	2024	DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model	3	downloaded / read_complete
129	2025	Kimi K2: Open Agentic Intelligence	4	downloaded / read_complete
130	2025	DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models	4	downloaded / read_complete

Open questions

Track when sparsity actually reduces wall-clock time or serving cost, rather than only improving nominal FLOP counts.
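
One rough way to frame this question is a roofline-style lower bound, where a step takes at least as long as the slower of its compute and its memory traffic. The sketch below is a deliberately crude model: it counts only weight reads and matmul FLOPs, and ignores KV cache, activations, and interconnect. The hardware numbers and parameter counts are hypothetical, and the function names are invented for this example.

```python
def roofline_seconds(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Lower bound on step time: the slower of compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


def decode_step_seconds(active_params, params_read, batch,
                        peak_flops, mem_bandwidth, bytes_per_param=2):
    """Estimate one decoding step.

    active_params: parameters applied to each token (drives FLOPs).
    params_read:   parameters read from memory in the step (drives bytes);
                   for an MoE with a large batch this can approach the
                   total parameter count, since most experts get some token.
    """
    flops = 2 * active_params * batch
    bytes_moved = params_read * bytes_per_param
    return roofline_seconds(flops, bytes_moved, peak_flops, mem_bandwidth)


if __name__ == "__main__":
    PEAK, BW = 1e15, 2e12  # hypothetical accelerator: FLOP/s and bytes/s
    for batch in (64, 4096):
        dense = decode_step_seconds(40e9, 40e9, batch, PEAK, BW)
        moe = decode_step_seconds(10e9, 40e9, batch, PEAK, BW)
        print(f"batch={batch}: dense={dense:.4f}s  moe={moe:.4f}s")
```

Under these assumptions, the sparse model has a quarter of the nominal FLOPs at both batch sizes, but it is only faster in the compute-bound regime; when weight reads dominate, the two estimates coincide.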