Multi-GPU dense training
The bottleneck shifts to gradient synchronization, batch-size scaling, network depth, and memory and numerical stability across multiple GPUs.
12 papers
Regime 3 of 10
Device/setup
Synchronous multi-GPU servers and small clusters, where PCIe, NVLink, or InfiniBand-class interconnects become a central design constraint.
Bottleneck
Depth, batch scaling, gradient synchronization, numerical range, and communication overhead became the limiting factors.
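To make the "numerical range" constraint concrete, here is a minimal sketch of why half-precision training usually relies on loss scaling. It uses only Python's standard `struct` module to round values to IEEE half precision. The gradient value and the scale factor of 1024 are illustrative assumptions, not figures from any paper in this regime.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Assumed values, chosen only to illustrate underflow.
grad = 1e-8          # a small gradient, below fp16's smallest subnormal (~6e-8)
loss_scale = 1024.0  # hypothetical static loss scale

# Stored directly in fp16, the gradient flushes to zero.
unscaled = to_fp16(grad)

# Scaling the loss (and therefore the gradient) keeps it representable.
scaled = to_fp16(grad * loss_scale)

# Unscale in higher precision before the weight update.
recovered = scaled / loss_scale

print(unscaled, scaled, recovered)
```

The same idea underlies dynamic loss scaling, where the scale is adjusted during training instead of being fixed.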
Methods that fit
Residual connections, large-minibatch SGD, mixed precision, normalization variants, dense connectivity, depthwise separable convolutions, and distributed sequence models fit this cluster scale.
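As one illustration of large-minibatch SGD, the sketch below applies the commonly used linear scaling rule with a gradual warmup: the learning rate grows in proportion to the global batch size and is ramped up from the base rate over the first steps. The function name `scaled_lr` and all default values (base rate 0.1, base batch 256, 500 warmup steps) are assumptions for illustration, not settings taken from this page.

```python
def scaled_lr(step: int, global_batch: int, base_lr: float = 0.1,
              base_batch: int = 256, warmup_steps: int = 500) -> float:
    """Linear scaling rule with linear warmup (hypothetical defaults)."""
    # Target rate grows linearly with the global batch size.
    target = base_lr * global_batch / base_batch
    if step >= warmup_steps:
        return target
    # Ramp linearly from the base rate to the scaled target.
    frac = step / warmup_steps
    return base_lr + frac * (target - base_lr)

# Example: 32 GPUs with 256 samples each gives a global batch of 8192.
for s in (0, 250, 500):
    print(s, scaled_lr(s, global_batch=8192))
```

The warmup matters because the scaled rate is often too aggressive for the first updates, when the weights are still changing rapidly.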
Methods that became obsolete or less central
Single-device recipes that ignored all-reduce cost, activation memory, and batch-size effects became less portable.
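A rough cost model shows why all-reduce cannot be ignored. In a ring all-reduce over N GPUs, each device sends about 2(N-1)/N times the gradient payload per step. The sketch below turns that into a time estimate; the function name, the 10 GB/s link bandwidth, and the 25-million-parameter model are hypothetical, and the model ignores overlap with computation.

```python
def ring_allreduce_seconds(num_params: int, num_gpus: int,
                           bandwidth_gb_per_s: float,
                           bytes_per_param: int = 4,
                           latency_s: float = 0.0) -> float:
    """Estimate the time for one ring all-reduce of the full gradient."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        return 0.0  # nothing to synchronize on a single device
    payload = num_params * bytes_per_param
    # Reduce-scatter plus all-gather: each GPU sends 2(N-1)/N of the payload.
    traffic = 2 * (num_gpus - 1) / num_gpus * payload
    # Each of the 2(N-1) ring steps also pays a fixed latency.
    steps = 2 * (num_gpus - 1)
    return traffic / (bandwidth_gb_per_s * 1e9) + steps * latency_s

# Assumed setup: 25M fp32 parameters, 8 GPUs, 10 GB/s effective bandwidth.
print(ring_allreduce_seconds(25_000_000, 8, 10.0))
```

Because the per-GPU traffic approaches twice the payload as N grows, the bandwidth term is nearly independent of GPU count, while the latency term grows linearly with it.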
Representative papers
Open questions
- Trace how communication-aware optimization prepared the path for Transformer-scale distributed training.