
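Sparse and memory-efficient scaling

Memory capacity, activation overhead, and communication pressure drove the adoption of MoE, attention kernels, sharding, and recomputation.

14 papers. Paradigm 6 of 10 compute paradigms.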

English source file: README.md

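Device / setting

Accelerator clusters where memory bandwidth, HBM capacity, interconnect routing, and the availability of IO-aware kernels limit effective scaling.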

Bottleneck

The cost of dense attention and dense FFNs grows faster than the available memory and communication budgets.
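As a rough illustration of this bottleneck, the arithmetic below compares the memory needed to materialize a full attention score matrix with the memory for the layer's token activations. The function names, the 32-head / 4096-wide shape, batch size 1, and 2-byte elements are assumptions chosen for the example, not figures from the listed papers.

```python
def attention_score_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes to materialize one layer's seq_len x seq_len score matrix for every head."""
    return num_heads * seq_len * seq_len * bytes_per_elem


def activation_bytes(seq_len: int, hidden_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes for one layer's (seq_len, hidden_dim) activation tensor."""
    return seq_len * hidden_dim * bytes_per_elem


# Hypothetical shape: 32 heads, hidden size 4096, batch size 1, fp16/bf16 elements.
for n in (2048, 8192, 32768):
    scores = attention_score_bytes(n, num_heads=32)
    acts = activation_bytes(n, hidden_dim=4096)
    print(f"seq_len={n:>6}  scores={scores / 2**30:8.2f} GiB  "
          f"activations={acts / 2**20:8.2f} MiB")
```

Doubling the sequence length doubles the activation tensor but quadruples the score matrix, which is why kernels that avoid writing the full matrix to HBM matter at long context.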

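Methods that fit

MoE, automatic sharding, Switch/GLaM/Mixtral routing, FlashAttention, sparse/linear/long-context attention, and economical MoE scale through conditional computation or reduced IO.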
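Conditional computation can be sketched with a minimal top-k router in NumPy. This is an illustrative toy, not the implementation from any of the listed papers: `top_k_moe`, the single-matrix "experts", and the choice to renormalize gates with a softmax over only the selected logits are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def top_k_moe(tokens, router_w, expert_ws, k=2):
    """tokens: (T, D); router_w: (D, E); expert_ws: (E, D, D), one toy linear expert each."""
    logits = tokens @ router_w                                  # (T, E) router scores
    top_idx = np.argsort(-logits, axis=-1)[:, :k]               # (T, k) chosen experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)   # (T, k)
    gates = softmax(top_logits, axis=-1)                        # weights over chosen experts only
    out = np.zeros_like(tokens)
    for e in range(expert_ws.shape[0]):
        token_ids, slot = np.nonzero(top_idx == e)              # tokens routed to expert e
        if token_ids.size == 0:
            continue                                            # unused expert: no compute spent
        expert_out = tokens[token_ids] @ expert_ws[e]
        out[token_ids] += gates[token_ids, slot][:, None] * expert_out
    return out, top_idx


rng = np.random.default_rng(0)
T, D, E = 8, 16, 4
tokens = rng.normal(size=(T, D))
router_w = rng.normal(size=(D, E))
expert_ws = rng.normal(size=(E, D, D))
out, top_idx = top_k_moe(tokens, router_w, expert_ws, k=2)
print(out.shape, top_idx.shape)
```

Each token only multiplies against `k` of the `E` expert matrices, so compute per token stays fixed while total parameters grow with `E`. A real system would also need capacity limits and cross-device dispatch, which this sketch omits.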

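Methods that became obsolete or less central

Naive dense scaling with full attention, and expert-routing schemes that lack load balancing or systems support, are no longer attractive.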
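To make the load-balancing point concrete, here is a small sketch of an auxiliary balance loss in the form used by Switch-style routers: the number of experts times the sum over experts of (fraction of tokens dispatched) times (mean router probability). The function name and the `alpha=0.01` default are assumptions for illustration.

```python
import numpy as np


def load_balance_loss(router_probs, expert_index, alpha=0.01):
    """router_probs: (T, E) softmax router outputs; expert_index: (T,) chosen expert per token."""
    num_tokens, num_experts = router_probs.shape
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_index, minlength=num_experts) / num_tokens
    # P_i: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    return alpha * num_experts * float(np.sum(f * p))


E = 4
# Perfectly balanced routing: uniform probabilities, tokens spread evenly.
uniform_probs = np.full((8, E), 1.0 / E)
balanced = load_balance_loss(uniform_probs, np.arange(8) % E)

# Collapsed routing: every token goes to expert 0 with probability 1.
collapsed_probs = np.zeros((8, E))
collapsed_probs[:, 0] = 1.0
collapsed = load_balance_loss(collapsed_probs, np.zeros(8, dtype=int))

print(balanced, collapsed)
```

Under these definitions the balanced case evaluates to `alpha` and the collapsed case to `alpha * E`, so the penalty grows as routing concentrates on fewer experts.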

Representative papers

Rank	Year	Paper	Priority	Status
63	2017	Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer	5	downloaded / read_complete
64	2020	GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding	5	downloaded / read_complete
65	2021	Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity	5	downloaded / read_complete
66	2022	FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness	5	downloaded / read_complete
67	2021	GLaM: Efficient Scaling of Language Models with Mixture-of-Experts	4	downloaded / read_complete
68	2024	Mixtral of Experts	4	downloaded / read_complete
69	2020	Big Bird: Transformers for Longer Sequences	3	downloaded / read_complete
70	2020	Linformer: Self-Attention with Linear Complexity	3	downloaded / read_complete
71	2020	Longformer: The Long-Document Transformer	3	downloaded / read_complete
72	2020	Reformer: The Efficient Transformer	3	downloaded / read_complete
73	2023	FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning	3	downloaded / read_complete
74	2024	DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model	3	downloaded / read_complete
129	2025	Kimi K2: Open Agentic Intelligence	4	downloaded / read_complete
130	2025	DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models	4	downloaded / read_complete

Open question

Determining when sparsity actually saves wall-clock time or serving cost, rather than merely lowering nominal FLOPs.
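One hedged way to reason about this question is a roofline-style lower bound, where step time is the slower of compute and weight traffic. Everything below is an assumption for illustration: the hardware numbers, the 8-expert top-2 layer, the worst-case rule that each routed slot hits a different expert, and the omission of attention, KV cache, and interconnect costs.

```python
def roofline_time_s(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Lower-bound step time: the slower of compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


def experts_touched(batch, top_k, num_experts):
    # Worst case: every routed slot lands on a different expert.
    return min(num_experts, batch * top_k)


def ffn_step_time(batch, active_params, loaded_params, peak_flops, mem_bw, bytes_per_param=2):
    flops = 2 * batch * active_params              # one multiply-add per active weight per token
    bytes_moved = loaded_params * bytes_per_param  # weights read from memory once per step
    return roofline_time_s(flops, bytes_moved, peak_flops, mem_bw)


# Hypothetical accelerator and layer sizes (not measurements).
PEAK_FLOPS, MEM_BW = 1e15, 3e12
EXPERT_PARAMS, NUM_EXPERTS, TOP_K = 1e9, 8, 2
DENSE_PARAMS = EXPERT_PARAMS * NUM_EXPERTS         # dense layer with the same total parameters

for batch in (1, 64):
    dense = ffn_step_time(batch, DENSE_PARAMS, DENSE_PARAMS, PEAK_FLOPS, MEM_BW)
    loaded = experts_touched(batch, TOP_K, NUM_EXPERTS) * EXPERT_PARAMS
    moe = ffn_step_time(batch, TOP_K * EXPERT_PARAMS, loaded, PEAK_FLOPS, MEM_BW)
    print(f"batch={batch:>3}  dense={dense * 1e3:.3f} ms  moe={moe * 1e3:.3f} ms  "
          f"speedup={dense / moe:.2f}x")
```

In this toy model the MoE layer always uses 4x fewer FLOPs per token, yet the estimated speedup is 4x at batch 1 and 1x at batch 64, because both layers end up bound by reading the same number of weight bytes. Measured serving cost would depend on factors the model leaves out.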
