Multi-GPU dense training

The main bottlenecks shift to synchronization across GPUs, batch size, network depth, and memory stability.

12 papers. Paradigm 3 of 10 compute paradigms.


English source file: README.md

Device / setting

Synchronous multi-GPU servers and small clusters, where interconnects such as PCIe, NVLink, and InfiniBand become a core design constraint.
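To make the interconnect constraint concrete, here is a minimal sketch of a cost estimate for one ring all-reduce over a gradient buffer. It assumes the common bandwidth-optimal ring model, in which each GPU transfers 2(N-1)/N of the buffer over 2(N-1) steps. The function name, the parameter count, and both link speeds are illustrative assumptions, not measurements of any real system.

```python
def ring_allreduce_seconds(num_params, bytes_per_param, n_gpus,
                           bandwidth_bytes_per_s, latency_s=0.0):
    """Estimate the time of one ring all-reduce over the gradients.

    Assumed model: each GPU sends and receives 2 * (n - 1) / n of the
    gradient buffer, in 2 * (n - 1) steps that each pay a fixed latency.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    if n_gpus == 1:
        return 0.0  # nothing to synchronize on a single device
    payload = num_params * bytes_per_param
    steps = 2 * (n_gpus - 1)
    transfer = (steps / n_gpus) * payload / bandwidth_bytes_per_s
    return transfer + steps * latency_s


# Illustrative numbers only: a hypothetical 25.6M-parameter model with
# fp32 gradients on 8 GPUs, compared across two hypothetical link speeds.
for name, bandwidth in [("10 GB/s link", 10e9), ("100 GB/s link", 100e9)]:
    seconds = ring_allreduce_seconds(25_600_000, 4, 8, bandwidth)
    print(f"{name}: {seconds * 1e3:.2f} ms per all-reduce")
```

Under this model the per-step synchronization cost falls in proportion to link bandwidth, which is one way to see why the interconnect shapes design decisions at this scale.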

Bottleneck

Depth, batch-size scaling, gradient synchronization, numerical range, and communication overhead become the limiting factors.
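The numerical-range bottleneck can be illustrated with a small sketch of half-precision underflow and of loss scaling, the remedy associated with mixed-precision training. It uses only Python's `struct` module to round values to IEEE 754 half precision. The gradient value and the scale factor are hypothetical, chosen only to trigger the effect.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # hypothetical small gradient value
loss_scale = 1024.0  # hypothetical power-of-two scale factor

# Cast directly to fp16, the gradient is below the smallest subnormal
# and rounds to zero.
naive = to_fp16(grad)

# Scaling before the cast keeps the value representable; the unscaling
# is done afterwards in higher precision.
scaled = to_fp16(grad * loss_scale)
recovered = scaled / loss_scale

print("direct fp16 cast:", naive)
print("with loss scaling:", recovered)
```

The recovered value is not exact, because the scaled gradient still lands in the subnormal range of fp16, but it is no longer lost entirely.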

Methods that fit

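Residual connections, large-batch SGD, mixed precision, normalization variants, dense connectivity, depthwise separable convolutions, and distributed sequence models are all adapted to this cluster scale.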
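As one example from this list, large-batch SGD is commonly paired with a linear learning-rate scaling rule and a gradual warmup. The sketch below assumes a reference rate of 0.1 at batch size 256 and a linear ramp from that reference rate to the scaled target. The function name and the warmup length are hypothetical, and the schedule omits any later decay for brevity.

```python
def scaled_lr(step, batch_size, base_lr=0.1, base_batch=256, warmup_steps=500):
    """Learning rate under linear scaling with a linear warmup.

    The target rate is base_lr * (batch_size / base_batch). During warmup
    the rate ramps linearly from base_lr up to that target; afterwards it
    stays at the target (no decay is modeled in this sketch).
    """
    target = base_lr * batch_size / base_batch
    if step >= warmup_steps:
        return target
    fraction = step / warmup_steps
    return base_lr + fraction * (target - base_lr)


# Illustrative usage: an 8192-sample global batch, i.e. 32x the reference.
for step in (0, 250, 500, 1000):
    print(step, round(scaled_lr(step, 8192), 4))
```

Starting the ramp at the reference rate rather than at zero is a design choice in this sketch; other warmup shapes are possible.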

Methods that became obsolete or less central

Single-device methods that ignore all-reduce cost, activation memory, and batch-size effects transfer less well to this setting.
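The activation-memory point can be made concrete with a rough estimate of how much memory the stored feature maps need as the per-GPU batch grows. This is a minimal sketch: the layer shapes are hypothetical, and it counts one stored tensor per layer while ignoring weights, optimizer state, and framework workspace.

```python
def activation_bytes(batch, channels, height, width, bytes_per_value=4):
    """Bytes needed to keep one feature map for the backward pass."""
    return batch * channels * height * width * bytes_per_value


# Hypothetical feature-map shapes as (channels, height, width).
LAYERS = [(64, 112, 112), (256, 56, 56), (512, 28, 28), (1024, 14, 14)]


def total_activation_gib(batch, bytes_per_value=4):
    """Sum the stored activations over all layers, in GiB."""
    total = sum(activation_bytes(batch, c, h, w, bytes_per_value)
                for c, h, w in LAYERS)
    return total / 2**30


for batch in (32, 256):
    fp32 = total_activation_gib(batch, 4)
    fp16 = total_activation_gib(batch, 2)
    print(f"batch {batch}: {fp32:.3f} GiB in fp32, {fp16:.3f} GiB in fp16")
```

In this model the footprint grows linearly with batch size and halves when activations are stored in half precision, which is why batch size and precision interact with per-device memory limits.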

Representative papers

| Rank | Year | Paper | Priority | Status |
|---|---|---|---|---|
| 21 | 2016 | Identity Mappings in Deep Residual Networks | 6 | downloaded / read_complete |
| 22 | 2015 | Deep Residual Learning for Image Recognition | 5 | downloaded / read_complete |
| 23 | 2016 | Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation | 5 | downloaded / read_complete |
| 24 | 2017 | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour | 5 | downloaded / read_complete |
| 25 | 2017 | Mixed Precision Training | 5 | downloaded / read_complete |
| 26 | 2016 | Layer Normalization | 4 | downloaded / read_complete |
| 27 | 2016 | Xception: Deep Learning with Depthwise Separable Convolutions | 3 | downloaded / read_complete |
| 28 | 2018 | Group Normalization | 3 | downloaded / read_complete |
| 29 | 2019 | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | 3 | downloaded / read_complete |
| 30 | 2015 | Rethinking the Inception Architecture for Computer Vision | 2 | downloaded / read_complete |
| 31 | 2016 | Densely Connected Convolutional Networks | 2 | downloaded / read_complete |
| 32 | 2018 | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour with Batch Normalization | 2 | downloaded / read_complete |

Open questions

  • Trace how communication-aware optimization paved the way for Transformer-scale distributed training.

Related papers: 12