Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

2017 Multi-GPU dense training 2.6k citations
Download PDF

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Metadata

  • Slug: imagenet_one_hour_2017
  • Year: 2017
  • Venue: arXiv
  • Authors: Priya Goyal et al.
  • Reading status: read complete
  • Compute regime: Multi-GPU dense training
  • Primary sources: PDF, extracted text

Compute Setup

The paper explicitly describes the hardware system. Experiments use Facebook Big Basin GPU servers. Each server contains 8 NVIDIA Tesla P100 GPUs connected by NVIDIA NVLink, 3.2TB of local NVMe SSD storage, a Mellanox ConnectX-4 50Gbit Ethernet network card, and Wedge100 Ethernet switches. The main result trains ResNet-50 on ImageNet with 256 GPUs and minibatch size 8192 in one hour.

The paper also provides a bandwidth sanity check. ResNet-50 has about 25 million parameters, or about 100MB in 32-bit floats. Backpropagation on a single Tesla P100 takes 120ms. Because allreduce moves roughly twice the bytes of the buffer, the authors estimate a peak network requirement of about 200MB per 0.125 seconds, or 12.8Gbit/s, which fits within 50Gbit Ethernet. The baseline is one 8-GPU Big Basin server with minibatch 256, and the large run uses 32 servers at 8 GPUs each.

Bottleneck

The bottleneck is scaling synchronous SGD without changing the optimization problem too much. More GPUs increase throughput only if each synchronization covers enough examples, so the global minibatch grows. But large minibatches change early optimization dynamics and can destabilize training if the learning rate is simply raised from the start. Communication is the other bottleneck: every GPU must receive the summed gradients before the next update, and allreduce can erase speedup if it is not overlapped with backpropagation.

Batch normalization adds a less obvious constraint. The paper argues that changing the per-worker batch size changes the BN statistics and therefore the loss being optimized. The recipe keeps per-worker sample size n fixed at 32 while increasing the number of workers k, so the global batch kn grows without changing the BN minibatch definition.

Method Adaptation

The training method is a hardware-scaling recipe. The linear scaling rule sets the learning rate proportional to global minibatch size: eta = 0.1 * kn / 256. For the 8192-minibatch run, this gives eta = 3.2. Because that rate is too aggressive at initialization, the paper uses gradual warmup for the first 5 epochs, starting at eta = 0.1 and linearly increasing to the scaled value. All models still train for the standard 90 epochs with drops at epochs 30, 60, and 80.

The distributed implementation adapts communication to the P100 cluster. Gradients are allreduced as soon as each layer's gradient is ready, while backpropagation continues for lower layers. Intra-server reduction and broadcast use NCCL for sufficiently large buffers, and inter-server communication uses Gloo. The paper compares ring and halving/doubling allreduce and reports that, on 32 servers, halving/doubling is 3x faster than ring for the relevant smaller buffers. Local reductions, cross-server allreduce, and broadcasts are pipelined where possible.

The data pipeline also matters. Each epoch uses a single random shuffle partitioned across workers; the paper warns that incorrect multi-worker shuffling can contaminate scaling conclusions.

Evidence

The baseline ResNet-50 run uses 8 P100 GPUs, minibatch 256, and top-1 validation error 23.60% +/- 0.12. The large run uses 256 GPUs, minibatch 8192, learning rate 3.2, and gradual warmup, reaching 23.74% +/- 0.09. No warmup yields 24.84% +/- 0.37, and constant warmup yields 25.88% +/- 0.56, showing that the large-batch failure is primarily an early optimization problem rather than an unavoidable generalization penalty.

Wall-clock evidence is the point of the card. The paper states that an 8-P100 setup takes 29 hours for ResNet-50, while the 256-GPU system trains the same 90-epoch ImageNet workload in one hour. Timing plots show marginally below ideal throughput scaling, about 90% efficiency, and with 352 GPUs the implementation completes one pass over the roughly 1.28 million ImageNet training images in about 30 seconds. The paper attributes most of the efficiency to hiding allreduce communication behind gradient computation.

The method has a breaking point. Validation error remains stable from minibatch 64 to 8192, then begins to rise; beyond 64K, training diverges under the linear scaling rule. ImageNet-5k, a roughly 5x larger dataset, shows a similar curve, suggesting dataset size alone does not automatically permit much larger useful minibatches.

Historical Effect

This paper turned ImageNet training into a systems benchmark for synchronous multi-GPU training. It moved the compute question from "can one GPU train a CNN?" to "how do we preserve SGD behavior while using hundreds of GPUs?" The answer combined optimizer scaling, BN semantics, data shuffling, and communication overlap, making large-batch distributed SGD a standard recipe for dense vision models.

Limits

The result is tuned for ResNet/ImageNet-style dense data parallelism. It does not address model parallelism or memory-limited giant models. Accuracy degrades beyond the 8K-minibatch regime and diverges beyond 64K under the stated rule. The recipe depends on stable per-worker BN size, correct global shuffling, and communication that can be hidden under backpropagation; different networks or slower interconnects may not share the same scaling behavior.

Links