Multi-GPU dense training

Device/setup

Synchronous training on multi-GPU servers and small clusters, where interconnect communication (PCIe, NVLink, InfiniBand) became a central design constraint.

Bottleneck

The limiting factors became network depth, batch-size scaling, gradient synchronization across devices, the numerical range of reduced-precision formats, and communication overhead.
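
The numerical-range item can be made concrete with a small sketch. It uses only the Python standard library to round values to IEEE 754 half precision, and it illustrates the idea behind loss scaling as used in mixed-precision training. The gradient value and the scale factor are hypothetical, chosen only to show the effect.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # hypothetical small gradient magnitude
loss_scale = 1024.0  # hypothetical static loss scale (a power of two)

# Stored directly in FP16, the gradient lies below the smallest subnormal
# (about 6e-8) and rounds to zero.
unscaled = to_fp16(grad)

# Scaling the loss scales every gradient by the same factor, which moves
# this value into FP16's representable range.
scaled = to_fp16(grad * loss_scale)

# Unscale in higher precision (Python floats are FP64) before the update.
recovered = scaled / loss_scale

print(unscaled, scaled, recovered)
```

A power-of-two scale is used here so that scaling and unscaling introduce no rounding error of their own; the only loss is the FP16 rounding of the scaled value.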

Methods that fit

Residual connections, large-minibatch SGD, mixed-precision training, normalization variants (layer and group normalization), dense connectivity, depthwise separable convolutions, and distributed sequence models such as GNMT fit this scale.
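
As one illustration of large-minibatch SGD, the sketch below combines a linear learning-rate scaling rule with a gradual warmup. The function names, the reference setting (0.1 at batch size 256), the global batch size, and the warmup length are assumptions chosen for illustration, not a prescribed recipe.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: multiply the learning rate by the batch-size ratio."""
    return base_lr * batch / base_batch


def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Ramp linearly from start_lr to target_lr, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


base_lr, base_batch = 0.1, 256  # assumed single-node reference setting
batch = 8192                    # assumed global minibatch across all GPUs
warmup_steps = 500              # hypothetical warmup length in iterations

target = scaled_lr(base_lr, base_batch, batch)

# Learning rate at the start, middle, and end of warmup.
schedule = [warmup_lr(s, warmup_steps, base_lr, target) for s in (0, 250, 500)]
print(target, schedule)
```

Starting the ramp at the small-batch rate rather than at zero keeps the first updates comparable to the single-node recipe while the scaled rate is phased in.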

Methods that became obsolete or less central

Single-device recipes that ignored all-reduce cost, activation memory, and batch-size effects became less portable.
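
The all-reduce cost can be estimated with the usual latency-bandwidth model of a ring all-reduce, in which each worker takes 2(N-1) steps and sends 1/N of the buffer per step. The sketch below is a rough model only: the parameter count, link bandwidth, and per-step latency are hypothetical, and real collectives overlap communication with computation.

```python
def ring_allreduce_seconds(
    num_bytes: float,
    workers: int,
    bandwidth_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Estimate ring all-reduce time under a simple latency-bandwidth model."""
    if workers < 2:
        return 0.0  # nothing to synchronize on a single device
    steps = 2 * (workers - 1)        # reduce-scatter phase plus all-gather phase
    chunk = num_bytes / workers      # bytes sent per worker per step
    return steps * (latency_s + chunk / bandwidth_bytes_per_s)


params = 25_600_000      # hypothetical model size (parameters)
grad_bytes = params * 4  # FP32 gradients, 4 bytes each

t = ring_allreduce_seconds(
    grad_bytes,
    workers=8,
    bandwidth_bytes_per_s=10e9,  # hypothetical 10 GB/s per link
    latency_s=5e-6,              # hypothetical per-step latency
)
print(f"estimated all-reduce time per step: {t * 1e3:.2f} ms")
```

Under this model the bandwidth term approaches 2 * num_bytes / bandwidth as the worker count grows, so the per-iteration cost is set mainly by gradient size and link speed, while the latency term grows linearly with the number of workers.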

Representative papers

Rank	Year	Paper	Priority	Status
21	2016	Identity Mappings in Deep Residual Networks	6	downloaded / read_complete
22	2015	Deep Residual Learning for Image Recognition	5	downloaded / read_complete
23	2016	Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation	5	downloaded / read_complete
24	2017	Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour	5	downloaded / read_complete
25	2017	Mixed Precision Training	5	downloaded / read_complete
26	2016	Layer Normalization	4	downloaded / read_complete
27	2016	Xception: Deep Learning with Depthwise Separable Convolutions	3	downloaded / read_complete
28	2018	Group Normalization	3	downloaded / read_complete
29	2019	EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks	3	downloaded / read_complete
30	2015	Rethinking the Inception Architecture for Computer Vision	2	downloaded / read_complete
31	2016	Densely Connected Convolutional Networks	2	downloaded / read_complete
32	2018	Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour with Batch Normalization	2	downloaded / read_complete

Open questions

Trace how communication-aware optimization paved the way for Transformer-scale distributed training.