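Sparse and memory-efficient scaling

Memory capacity, activation overhead, and communication pressure drove the development of MoE, attention kernels, sharding, and recomputation.

14 papers · paradigm 6 of 10 compute paradigms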

English source file: README.md

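Hardware / setting

Accelerator clusters in which memory bandwidth, HBM capacity, interconnect routing, and the availability of IO-aware kernels limit effective scaling.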

Bottleneck

The cost of dense attention and dense FFNs grows faster than the available memory and communication budgets.
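As a rough illustration of why dense attention outgrows memory, the sketch below counts the bytes needed to materialize the full attention score matrix. The head count, batch size, and 2-byte (fp16) element size are assumptions chosen for the example, not figures from the papers listed here.

```python
def attention_score_bytes(seq_len, n_heads=32, batch=1, bytes_per_elem=2):
    """Bytes to materialize one (seq_len x seq_len) score matrix per head.

    All default values are illustrative assumptions.
    """
    return batch * n_heads * seq_len * seq_len * bytes_per_elem


GIB = 1024 ** 3

for n in (4096, 8192, 16384):
    # Doubling the sequence length quadruples the score-matrix footprint.
    print(n, attention_score_bytes(n) / GIB, "GiB")
# 4096 1.0 GiB
# 8192 4.0 GiB
# 16384 16.0 GiB
```

Under these assumptions, each doubling of context length multiplies the score-matrix footprint by four, while HBM capacity per device stays fixed.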

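Methods that fit

MoE, automatic sharding, Switch/GLaM/Mixtral routing, FlashAttention, sparse/linear/long-context attention, and economical MoE all scale through conditional computation or reduced IO.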
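Conditional computation can be sketched as top-k expert routing: a gate scores every expert, but only the k highest-scoring experts run for a given token. The code below is a minimal, generic sketch and not the exact formulation of any one paper above; `top_k_route`, `moe_forward`, and the toy scaling experts are hypothetical names introduced for illustration.

```python
import math


def top_k_route(gate_logits, k=2):
    """Return [(expert_index, weight)] for the k largest logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum over the outputs of only the k selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Four toy "experts" that just scale their input (illustrative only).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(logits, k=2)
output = moe_forward(1.0, logits, experts, k=2)
```

Here experts 1 and 3 are selected with equal weight, and the other two experts are never evaluated, which is where the compute saving comes from.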

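Methods that became obsolete or less central

Naive dense scaling with full attention, and expert routing schemes that lack load balancing or systems support, are no longer attractive.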
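To make the load-balancing point concrete, here is a minimal sketch of an auxiliary balancing loss in the style of Switch Transformers: the number of experts times the sum, over experts, of the fraction of tokens dispatched to each expert multiplied by its mean router probability. The function name, the coefficient `alpha=0.01`, and the toy inputs are assumptions for illustration.

```python
def load_balance_loss(router_probs, assignments, alpha=0.01):
    """Auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs: per-token list of probabilities over the N experts.
    assignments:  per-token index of the expert the token was sent to.
    f_i is the fraction of tokens sent to expert i; P_i is the mean
    router probability assigned to expert i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    f = [assignments.count(e) / n_tokens for e in range(n_experts)]
    p = [sum(row[e] for row in router_probs) / n_tokens
         for e in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Balanced routing over two experts versus total collapse onto expert 0.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1])
collapsed = load_balance_loss([[1.0, 0.0]] * 4, [0, 0, 0, 0])
```

With uniform routing the loss equals `alpha`; when every token collapses onto one expert it rises to `alpha * N`, so minimizing it pushes the router toward an even spread.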

Representative papers

Rank | Year | Paper | Priority | Status
63 | 2017 | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | 5 | downloaded / read_complete
64 | 2020 | GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | 5 | downloaded / read_complete
65 | 2021 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | 5 | downloaded / read_complete
66 | 2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 5 | downloaded / read_complete
67 | 2021 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | 4 | downloaded / read_complete
68 | 2024 | Mixtral of Experts | 4 | downloaded / read_complete
69 | 2020 | Big Bird: Transformers for Longer Sequences | 3 | downloaded / read_complete
70 | 2020 | Linformer: Self-Attention with Linear Complexity | 3 | downloaded / read_complete
71 | 2020 | Longformer: The Long-Document Transformer | 3 | downloaded / read_complete
72 | 2020 | Reformer: The Efficient Transformer | 3 | downloaded / read_complete
73 | 2023 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | 3 | downloaded / read_complete
74 | 2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | 3 | downloaded / read_complete
129 | 2025 | Kimi K2: Open Agentic Intelligence | 4 | downloaded / read_complete
130 | 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models | 4 | downloaded / read_complete

Open problems

  • Determine when sparsity actually saves wall-clock time or serving cost, rather than only reducing nominal FLOPs.
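One hedged way to reason about this question is a roofline-style estimate, where step time is bounded by whichever is slower: compute or memory traffic. The sketch below uses made-up hardware and workload numbers (all hypothetical) to show a case where a 4x FLOP reduction yields only a 2x speedup because the sparse variant becomes memory-bound.

```python
def roofline_time(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Lower-bound step time in seconds: the slower of compute and memory."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BW = 1e12

# Hypothetical workloads: the sparse variant does 4x fewer FLOPs
# but moves the same number of bytes.
dense_time = roofline_time(2e12, 1e10, PEAK_FLOPS, MEM_BW)
sparse_time = roofline_time(0.5e12, 1e10, PEAK_FLOPS, MEM_BW)

flop_reduction = 2e12 / 0.5e12
speedup = dense_time / sparse_time
```

In this toy setting `flop_reduction` is 4 while `speedup` is about 2. A real assessment would also need measured kernel efficiency, routing overhead, and interconnect costs, which this model ignores.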

Related papers: 14