Parallelism and sharding

English source file: parallelism.md

Compute explanation

Model, data, pipeline, and optimizer-state partitioning methods that let large models fit on distributed accelerator clusters.
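To give a rough sense of why optimizer-state partitioning helps a model fit, the sketch below estimates per-device model-state memory for mixed-precision Adam training, following the accounting in the ZeRO paper listed among the reading cards on this page (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for fp32 optimizer states). The function name `per_device_bytes` and its parameters are hypothetical, chosen for illustration only. Activations, buffers, and fragmentation are ignored, so treat the numbers as a lower bound rather than a sizing guide.

```python
def per_device_bytes(num_params, num_devices, zero_stage=0, k=12):
    """Approximate model-state bytes per device (hypothetical helper).

    Assumes mixed-precision Adam as accounted for in the ZeRO paper:
    fp16 parameters (2 bytes), fp16 gradients (2 bytes), and k bytes per
    parameter of optimizer state (fp32 master weights, momentum, variance).
    """
    params = 2 * num_params
    grads = 2 * num_params
    opt_state = k * num_params

    # Each ZeRO stage shards one more component across data-parallel devices.
    if zero_stage >= 1:
        opt_state /= num_devices   # stage 1: optimizer states
    if zero_stage >= 2:
        grads /= num_devices       # stage 2: plus gradients
    if zero_stage >= 3:
        params /= num_devices      # stage 3: plus parameters
    return params + grads + opt_state


if __name__ == "__main__":
    # 7.5B parameters on 64 devices, the configuration used as an example
    # in the ZeRO paper.
    for stage in range(4):
        gb = per_device_bytes(7_500_000_000, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB per device")
```

Plain data parallelism corresponds to `zero_stage=0`, where every device holds a full replica of all three components. Tensor and pipeline parallelism reduce the per-device share differently (by splitting layers or layer groups) and are not modeled here.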

Supporting reading cards

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016, multi_gpu_dense_training)
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (2017, multi_gpu_dense_training)
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour with Batch Normalization (2018, multi_gpu_dense_training)
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (2018, tpu_accelerator_transformer_era)
Mesh-TensorFlow: Deep Learning for Supercomputers (2018, tpu_accelerator_transformer_era)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2019, hyperscale_dense_llm_training)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (2019, hyperscale_dense_llm_training)
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020, sparse_memory_efficient_scaling)
PaLM: Scaling Language Modeling with Pathways (2022, hyperscale_dense_llm_training)
Gemini: A Family of Highly Capable Multimodal Models (2023, hyperscale_dense_llm_training)
The Llama 3 Herd of Models (2024, hyperscale_dense_llm_training)
Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025, inference_time_compute_post_training)
Kimi K2: Open Agentic Intelligence (2025, sparse_memory_efficient_scaling)
Kimi K2.5: Visual Agentic Intelligence (2026, inference_time_compute_post_training)

What became obsolete or secondary under later compute paradigms

Trace this only through the linked reading cards; do not treat this method page as an independent source of evidence.