Parallelism and sharding
English source file: parallelism.md
Compute explanation
Model, data, pipeline, and optimizer-state partitioning methods that let large models fit on distributed accelerator clusters.
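To make "fit" concrete, here is a rough sketch of how partitioning changes per-device memory for model states. It assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state, which is the accounting used in the ZeRO paper. The function name and flags are hypothetical, and activations, buffers, and fragmentation are ignored.

```python
# Assumed byte costs per parameter under mixed-precision Adam (illustrative).
BYTES_PARAMS = 2   # fp16 parameters
BYTES_GRADS = 2    # fp16 gradients
BYTES_OPTIM = 12   # fp32 master weights + momentum + variance


def model_state_bytes_per_device(num_params, num_devices,
                                 shard_optimizer=False,
                                 shard_grads=False,
                                 shard_params=False):
    """Estimate bytes of model state held by one device (hypothetical helper).

    A component that is not sharded is fully replicated on every device;
    a sharded component is split evenly across num_devices.
    """
    def part(bytes_per_param, sharded):
        total = bytes_per_param * num_params
        return total / num_devices if sharded else total

    return (part(BYTES_PARAMS, shard_params)
            + part(BYTES_GRADS, shard_grads)
            + part(BYTES_OPTIM, shard_optimizer))


if __name__ == "__main__":
    n, d = 7.5e9, 64  # example model size and device count
    gb = 1e9
    print("replicated (plain data parallel):",
          model_state_bytes_per_device(n, d) / gb, "GB")
    print("optimizer state sharded:",
          model_state_bytes_per_device(n, d, shard_optimizer=True) / gb, "GB")
    print("optimizer state, grads, params sharded:",
          model_state_bytes_per_device(n, d, True, True, True) / gb, "GB")
```

Under these assumptions, plain data parallelism replicates all 16 bytes per parameter on every device, while sharding each component divides its share by the device count. Model and pipeline parallelism reduce the footprint in a different way, by splitting the layers themselves, which this sketch does not model.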
Supporting reading cards
- Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016, multi_gpu_dense_training)
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (2017, multi_gpu_dense_training)
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour with Batch Normalization (2018, multi_gpu_dense_training)
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (2018, tpu_accelerator_transformer_era)
- Mesh-TensorFlow: Deep Learning for Supercomputers (2018, tpu_accelerator_transformer_era)
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2019, hyperscale_dense_llm_training)
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (2019, hyperscale_dense_llm_training)
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020, sparse_memory_efficient_scaling)
- PaLM: Scaling Language Modeling with Pathways (2022, hyperscale_dense_llm_training)
- Gemini: A Family of Highly Capable Multimodal Models (2023, hyperscale_dense_llm_training)
- The Llama 3 Herd of Models (2024, hyperscale_dense_llm_training)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025, inference_time_compute_post_training)
- Kimi K2: Open Agentic Intelligence (2025, sparse_memory_efficient_scaling)
- Kimi K2.5: Visual Agentic Intelligence (2026, inference_time_compute_post_training)
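What became outdated or secondary under later compute paradigms
Tracked only through the linked reading cards; this method page is not treated as an independent source of evidence.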