Multi-GPU dense training

The bottleneck shifts to synchronization, batch size, depth, and memory stability across multiple GPUs.

12 papers, Regime 3 of 10

Device/setup

Synchronous multi-GPU servers and small clusters, where PCIe, NVLink, and InfiniBand-class interconnects become a central design constraint.

Bottleneck

Depth, batch scaling, gradient synchronization, numerical range, and communication overhead became the limiting factors.
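
The "numerical range" item can be made concrete with a small sketch. It uses only the Python standard library to round values to IEEE 754 half precision, and shows a gradient that underflows to zero in FP16 surviving when it is multiplied by a loss scale before the cast and divided back afterwards. The gradient value and the constant scale of 1024 are illustrative assumptions, not figures taken from the papers listed here.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scaled_roundtrip(grad: float, loss_scale: float) -> float:
    """Scale before the FP16 cast, then unscale in higher precision."""
    return to_fp16(grad * loss_scale) / loss_scale


# Illustrative gradient below FP16's smallest subnormal (about 6e-8).
tiny_grad = 1e-8
# Hypothetical constant loss scale; real recipes often adjust it dynamically.
LOSS_SCALE = 1024.0

naive = to_fp16(tiny_grad)                           # cast without scaling
recovered = scaled_roundtrip(tiny_grad, LOSS_SCALE)  # cast with scaling

print("unscaled cast:", naive)
print("scaled cast, then unscaled:", recovered)
```

Without scaling the value is lost entirely; with scaling it comes back to within about one percent of the original, which is the effect loss scaling relies on.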

Methods that fit

Residual connections, large-minibatch SGD, mixed precision, normalization variants, dense connectivity, depthwise separable convolutions, and distributed sequence models fit this cluster scale.
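
As one illustration of large-minibatch SGD, the sketch below follows the linear scaling rule with gradual warmup described in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour": the learning rate grows in proportion to the global batch size and is ramped up linearly over the first steps. The function names are hypothetical, the defaults (0.1 at batch size 256) follow that paper's reference setting, and the later step decay is omitted.

```python
def scaled_lr(global_batch: int, base_lr: float = 0.1, base_batch: int = 256) -> float:
    """Linear scaling rule: learning rate proportional to global batch size."""
    return base_lr * global_batch / base_batch


def lr_at_step(step: int, warmup_steps: int, global_batch: int,
               base_lr: float = 0.1, base_batch: int = 256) -> float:
    """Ramp linearly from base_lr to the scaled rate, then hold it."""
    target = scaled_lr(global_batch, base_lr, base_batch)
    if warmup_steps <= 0 or step >= warmup_steps:
        return target
    return base_lr + (target - base_lr) * step / warmup_steps


# Hypothetical setup: 256 GPUs x 32 images each, 1000 warmup steps.
GLOBAL_BATCH = 256 * 32
WARMUP_STEPS = 1000

for step in (0, 500, 1000, 2000):
    print(step, lr_at_step(step, WARMUP_STEPS, GLOBAL_BATCH))
```

The warmup exists because jumping straight to the scaled rate tends to destabilize the first updates, when the weights are still changing quickly.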

Methods that became obsolete or less central

Single-device recipes that ignored all-reduce cost, activation memory, and batch-size effects became less portable.
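
To give a rough sense of the all-reduce cost a single-device recipe never pays, the sketch below estimates per-step gradient traffic under a bandwidth-only model of ring all-reduce, in which each GPU sends about 2(N-1)/N times the gradient size. Ring all-reduce is an assumed implementation choice, not something this page specifies; the model ignores latency and compute overlap, and the parameter count and link bandwidth are hypothetical round numbers.

```python
def ring_allreduce_bytes_per_gpu(param_count: int, bytes_per_param: int,
                                 num_gpus: int) -> float:
    """Bytes each GPU sends in one ring all-reduce (bandwidth term only)."""
    if num_gpus < 2:
        return 0.0  # a single device has nothing to synchronize
    payload = param_count * bytes_per_param
    return 2.0 * (num_gpus - 1) / num_gpus * payload


def ring_allreduce_seconds(param_count: int, bytes_per_param: int,
                           num_gpus: int, link_bytes_per_sec: float) -> float:
    """Lower-bound transfer time, ignoring latency and compute overlap."""
    sent = ring_allreduce_bytes_per_gpu(param_count, bytes_per_param, num_gpus)
    return sent / link_bytes_per_sec


# Hypothetical numbers: 25M FP32 parameters, 8 GPUs, 10 GB/s per link.
PARAMS = 25_000_000
LINK_BW = 10e9

for gpus in (1, 2, 8):
    seconds = ring_allreduce_seconds(PARAMS, 4, gpus, LINK_BW)
    print(gpus, "GPUs:", round(seconds * 1e3, 2), "ms per step")
```

Halving `bytes_per_param` to 2 for FP16 gradients halves the estimate, which is one reason reduced precision and communication cost are usually considered together.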

Representative papers

Rank | Year | Paper | Priority | Status
21 | 2016 | Identity Mappings in Deep Residual Networks | 6 | downloaded / read_complete
22 | 2015 | Deep Residual Learning for Image Recognition | 5 | downloaded / read_complete
23 | 2016 | Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation | 5 | downloaded / read_complete
24 | 2017 | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour | 5 | downloaded / read_complete
25 | 2017 | Mixed Precision Training | 5 | downloaded / read_complete
26 | 2016 | Layer Normalization | 4 | downloaded / read_complete
27 | 2016 | Xception: Deep Learning with Depthwise Separable Convolutions | 3 | downloaded / read_complete
28 | 2018 | Group Normalization | 3 | downloaded / read_complete
29 | 2019 | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks | 3 | downloaded / read_complete
30 | 2015 | Rethinking the Inception Architecture for Computer Vision | 2 | downloaded / read_complete
31 | 2016 | Densely Connected Convolutional Networks | 2 | downloaded / read_complete
32 | 2018 | Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour with Batch Normalization | 2 | downloaded / read_complete

Open questions

  • Trace how communication-aware optimization prepared the path for Transformer-scale distributed training.

Papers in this compute regime: 12