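Hyperscale dense LLM training

Training becomes a datacenter-scale problem: model and data parallelism, optimizer-state sharding, and compute-optimal scaling.

19 papers; paradigm 5 of 10 computing paradigms

English source file: README.md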

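Devices / setup

Large GPU/TPU clusters that combine model, data, and pipeline parallelism with optimizer-state sharding, and sometimes sequence parallelism; memory and interconnect dominate the design.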
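
As a rough illustration of how these parallelism dimensions compose, the sketch below factors a cluster into tensor, pipeline, and data-parallel groups and maps a global rank to its coordinates. The function names and the rank ordering (tensor index varying fastest, so that tensor-parallel peers sit on adjacent ranks) are assumptions made for this example; real frameworks define their own layouts.

```python
from typing import NamedTuple


class ParallelCoords(NamedTuple):
    """Position of one device in a hypothetical 3D parallel grid."""
    data: int
    pipeline: int
    tensor: int


def data_parallel_degree(world_size: int, tensor: int, pipeline: int) -> int:
    """Devices left for data parallelism after model parallelism is fixed."""
    model_parallel = tensor * pipeline
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor * pipeline")
    return world_size // model_parallel


def rank_to_coords(rank: int, world_size: int, tensor: int, pipeline: int) -> ParallelCoords:
    """Map a global rank to (data, pipeline, tensor) indices.

    Assumed ordering: tensor index varies fastest, then pipeline, then data.
    """
    # Validates divisibility as a side effect.
    data_parallel_degree(world_size, tensor, pipeline)
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    tensor_idx = rank % tensor
    pipeline_idx = (rank // tensor) % pipeline
    data_idx = rank // (tensor * pipeline)
    return ParallelCoords(data_idx, pipeline_idx, tensor_idx)
```

For example, with 512 devices, 8-way tensor parallelism, and 4 pipeline stages, 16 data-parallel replicas remain. Keeping the tensor dimension on adjacent ranks is one common way to place the most communication-heavy group on the fastest links, which is where the interconnect constraint shows up in practice.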

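Bottlenecks

Parameter and optimizer-state memory, communication, data scale, token allocation, reliability, and frontier-model inference cost.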
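
To make the parameter and optimizer memory bottleneck concrete, here is a minimal sketch of the per-device model-state accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter. The function name is hypothetical, and activations, temporary buffers, and fragmentation are deliberately ignored, so the result is a lower bound rather than a real memory footprint.

```python
BYTES_FP16_PARAMS = 2   # fp16 copy of the weights
BYTES_FP16_GRADS = 2    # fp16 gradients
BYTES_OPTIMIZER = 12    # fp32 master weights + Adam momentum + Adam variance


def model_state_bytes_per_device(num_params: float, dp_degree: int, zero_stage: int = 0) -> float:
    """Model-state bytes held by one device under data parallelism.

    zero_stage 0 replicates everything; stage 1 shards optimizer state,
    stage 2 also shards gradients, stage 3 also shards parameters.
    Activations and buffers are not counted (assumption of this sketch).
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if dp_degree < 1:
        raise ValueError("dp_degree must be at least 1")

    params = BYTES_FP16_PARAMS * num_params
    grads = BYTES_FP16_GRADS * num_params
    optimizer = BYTES_OPTIMIZER * num_params

    if zero_stage >= 1:
        optimizer /= dp_degree
    if zero_stage >= 2:
        grads /= dp_degree
    if zero_stage >= 3:
        params /= dp_degree
    return params + grads + optimizer


if __name__ == "__main__":
    for stage in range(4):
        gb = model_state_bytes_per_device(7.5e9, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB per device")
```

With 7.5B parameters and 64 data-parallel devices this gives about 120 GB unsharded versus roughly 1.9 GB with all three model states sharded, which is why optimizer-state sharding moves a model from impossible to feasible on a single accelerator's memory.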

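Methods that fit

GPT-style dense Transformers, Megatron tensor parallelism, ZeRO/FSDP, scaling laws, Chinchilla compute allocation, Pathways/TPU pod training, and open dense LLM releases all fit hyperscale dense clusters.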
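
The compute-allocation idea can be sketched with two widely used approximations: training cost C ≈ 6·N·D FLOPs for N parameters and D tokens, and a rule of thumb of about 20 training tokens per parameter drawn from the Chinchilla results. Both constants are approximations assumed for this example, not exact values from the paper's fits, and the function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # approximation: forward + backward cost per parameter per token
TOKENS_PER_PARAM = 20.0       # rule-of-thumb ratio (assumption, not an exact fitted constant)


def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute, C ~= 6 * N * D."""
    return FLOPS_PER_PARAM_TOKEN * num_params * num_tokens


def compute_optimal_split(flops_budget: float, tokens_per_param: float = TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) under D = tokens_per_param * N.

    Substituting D into C = 6 * N * D gives N = sqrt(C / (6 * tokens_per_param)).
    """
    if flops_budget <= 0 or tokens_per_param <= 0:
        raise ValueError("budget and ratio must be positive")
    num_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    num_tokens = tokens_per_param * num_params
    return num_params, num_tokens


if __name__ == "__main__":
    budget = training_flops(70e9, 1.4e12)  # a Chinchilla-like configuration
    n, d = compute_optimal_split(budget)
    print(f"budget={budget:.3g} FLOPs -> params={n:.3g}, tokens={d:.3g}")
```

Under these assumptions, a budget equal to 70B parameters trained on 1.4T tokens splits back into the same pair, which is a useful sanity check. Changing `tokens_per_param` shows how sensitive the recommended model size is to the assumed ratio.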

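Methods that became obsolete or less central

Scaling narratives that look only at parameter count and ignore token budget, inference budget, and reliability became less convincing after compute-optimal analysis.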

Representative papers

| Rank | Year | Paper | Priority | Status |
| --- | --- | --- | --- | --- |
| 45 | 2019 | Language Models are Unsupervised Multitask Learners | 5 | downloaded / read_complete |
| 46 | 2019 | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 5 | downloaded / read_complete |
| 47 | 2019 | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | 5 | downloaded / read_complete |
| 48 | 2020 | Language Models are Few-Shot Learners | 5 | downloaded / read_complete |
| 49 | 2020 | Scaling Laws for Neural Language Models | 5 | downloaded / read_complete |
| 50 | 2022 | Training Compute-Optimal Large Language Models | 5 | downloaded / read_complete |
| 51 | 2022 | PaLM: Scaling Language Modeling with Pathways | 5 | downloaded / read_complete |
| 52 | 2021 | Scaling Language Models: Methods, Analysis and Insights from Training Gopher | 4 | downloaded / read_complete |
| 53 | 2023 | Llama 2: Open Foundation and Fine-Tuned Chat Models | 4 | downloaded / read_complete |
| 54 | 2023 | LLaMA: Open and Efficient Foundation Language Models | 4 | downloaded / read_complete |
| 55 | 2022 | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | 3 | downloaded / read_complete |
| 56 | 2022 | GPT-NeoX-20B: An Open-Source Autoregressive Language Model | 3 | downloaded / read_complete |
| 57 | 2022 | OPT: Open Pre-trained Transformer Language Models | 3 | downloaded / read_complete |
| 58 | 2023 | Gemini: A Family of Highly Capable Multimodal Models | 3 | downloaded / read_complete |
| 59 | 2023 | Mistral 7B | 3 | downloaded / read_complete |
| 60 | 2023 | Textbooks Are All You Need | 3 | downloaded / read_complete |
| 61 | 2024 | The Llama 3 Herd of Models | 3 | downloaded / read_complete |
| 62 | 2023 | A Survey of Large Language Models | 1 | downloaded / read_complete |
| 124 | 2025 | Qwen3 Technical Report | 4 | downloaded / read_complete |

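Open questions

  • Distinguish closed-source frontier infrastructure claims from reproducible open training stacks and disclosed hardware details.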

Related papers: 19