Mixed Precision Training

Metadata

Slug: mixed_precision_2017
Year: 2017
Venue: ICLR 2018 (arXiv preprint 2017)
Authors: Paulius Micikevicius et al.
Reading status: read complete
Compute regime: Multi-GPU dense training
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly states the GPU generations used. Baseline FP32 experiments are conducted on NVIDIA Maxwell or Pascal GPUs. Mixed precision experiments are conducted on Volta V100, using Tensor Core operations that multiply FP16 inputs and accumulate into FP32 for convolutions, fully connected layers, and recurrent matrix multiplies. Some speech-recognition experiments instead run on Maxwell GPUs with FP16 used for storage only, a setup that emulates Tensor Core behavior on non-Volta hardware.

The method is evaluated across CNNs, detection models, speech recognition, translation, language modeling, and GANs. Several runs are large for the period: the English DeepSpeech 2 model has about 115M parameters and is trained on 6,000 hours of speech; the Mandarin model has about 215M parameters and is trained on 2,600 hours; the bigLSTM language model uses two 8192-cell LSTM layers with a 1024-dimensional projection, a 793K-word vocabulary, sampled softmax with 8K negatives, and a batch size of 1024 aggregated over 4 GPUs.

Bottleneck

The bottleneck is the combined memory and arithmetic cost of dense training. The paper frames program speed as limited by arithmetic bandwidth, memory bandwidth, or latency. FP32 training stores weights, activations, and gradients as 32-bit values; for large networks, saved activations dominate training memory because they must be reused in backpropagation. Relative to FP16, FP32 also moves twice as many bytes per tensor, which increases memory-bandwidth pressure.

FP16 solves the storage and Tensor Core throughput side but creates numerical bottlenecks. FP16 has a narrower dynamic range than FP32. Gradients can underflow to zero, weight updates can disappear when added to much larger weights, and reductions can lose important small values. The compute problem is therefore not simply "train in half precision"; it is placing each precision where the hardware and optimizer can tolerate it.
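The two failure modes named above, gradient underflow and vanishing weight updates, can be reproduced with NumPy's `float16` type. This is a minimal sketch for illustration only; the specific values (`1e-8`, `1e-4`) are assumptions chosen to trigger the effects and do not come from the paper.

```python
import numpy as np

# The smallest positive FP16 value (a subnormal) is 2**-24, about 5.96e-8.
tiny_grad = np.float32(1e-8)        # representable in FP32
grad_fp16 = np.float16(tiny_grad)   # underflows to zero in FP16

# Near 1.0 the gap between adjacent FP16 values is 2**-10, about 9.8e-4,
# so an update of 1e-4 is less than half a gap and is rounded away.
weight = np.float16(1.0)
update = np.float16(1e-4)
new_weight = np.float16(weight + update)

print(grad_fp16)    # 0.0
print(new_weight)   # 1.0
```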

Method Adaptation

The recipe has three main device-facing adaptations. First, weights, activations, and gradients are stored in FP16 for forward and backward passes, cutting the storage and bandwidth of these tensors roughly in half. Second, the optimizer maintains an FP32 master copy of weights and applies updates to that copy, then rounds back to FP16 for forward/backward computation. This prevents small updates from being lost when they would be too small to affect an FP16 weight.
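The effect of the FP32 master copy can be sketched with scalar weights in NumPy. The constant update of `1e-4`, the starting weight of 1.0, and the step count are hypothetical values picked so that each individual update is too small to change an FP16 weight; a real optimizer step would of course compute the update from gradients.

```python
import numpy as np

update = np.float32(1e-4)   # hypothetical constant update per step
steps = 1000

# Variant A: the weight lives only in FP16, so every update is rounded away.
w_fp16_only = np.float16(1.0)
for _ in range(steps):
    w_fp16_only = np.float16(w_fp16_only + np.float16(update))

# Variant B: updates accumulate in an FP32 master copy; the FP16 copy used
# for forward/backward is produced by rounding the master weight.
w_master = np.float32(1.0)
for _ in range(steps):
    w_master = np.float32(w_master + update)
    w_fp16_copy = np.float16(w_master)

print(w_fp16_only)   # stays at 1.0
print(w_master)      # close to 1.1
print(w_fp16_copy)   # FP16 rounding of the master weight
```

In variant B the FP16 copy only changes once enough small updates have accumulated in the master weight to cross an FP16 rounding boundary, which is the behavior the recipe relies on.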

Third, loss scaling shifts gradients into the FP16-representable range. The paper shows SSD activation gradients with many small values and explains that scaling the loss before backprop scales all gradients by the same factor; gradients are then unscaled before clipping or update so hyperparameters remain unchanged. For SSD, a scaling factor of 8 is enough; for the bigLSTM language model, a factor of 128 recovers baseline behavior.
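A small numerical sketch of why scaling helps, using the factor of 128 mentioned above. The gradient value `2**-27` is a hypothetical choice that lies below the FP16 range but, once multiplied by 128, lands on an exactly representable FP16 value; real gradients would be recovered only approximately.

```python
import numpy as np

loss_scale = np.float32(128.0)       # factor reported for bigLSTM
true_grad = np.float32(2.0 ** -27)   # hypothetical small gradient value

# Without scaling, the FP16 gradient underflows to zero.
unscaled_fp16 = np.float16(true_grad)

# With a scaled loss, backprop yields loss_scale * grad, which fits in FP16.
scaled_fp16 = np.float16(true_grad * loss_scale)

# Unscale in FP32 before clipping or applying the weight update.
recovered = np.float32(scaled_fp16) / loss_scale

print(unscaled_fp16)            # 0.0
print(recovered == true_grad)   # True
```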

Arithmetic is also placed selectively. Vector dot products use FP16 multiply with FP32 accumulation, matching Volta Tensor Core capabilities. Large reductions, especially batchnorm statistics and softmax, are performed in FP32 while still reading and writing FP16 tensors. The paper notes that these layers are memory-bandwidth limited and not very sensitive to arithmetic speed, so FP32 arithmetic there does not erase the main gains.
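The value of FP32 accumulation can be illustrated with a dot product of FP16 vectors. The sketch below uses explicit Python loops to control the accumulator type (NumPy's built-in reductions use a different summation order), and the all-ones inputs of length 4096 are an assumption chosen to make the rounding effect easy to see. It emulates the arithmetic only and says nothing about Tensor Core performance.

```python
import numpy as np

n = 4096
a = np.ones(n, dtype=np.float16)
b = np.ones(n, dtype=np.float16)

# FP16 accumulator: once the sum reaches 2048 the gap between adjacent FP16
# values is 2, so adding 1.0 rounds back to 2048 and the sum stalls.
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)

# FP32 accumulator: FP16 inputs, with products and the running sum in FP32.
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc32 = np.float32(acc32 + np.float32(x) * np.float32(y))

print(acc16)   # 2048.0
print(acc32)   # 4096.0
```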

Evidence

The memory claim is direct: FP16 forward/backward storage roughly halves training memory, even after adding the FP32 master weights, because activations dominate. The paper also states that FP16 math throughput in recent GPUs is 2x to 8x higher than FP32, and that DeepBench operations on Volta show 2x-6x speedups over FP32 when limited by memory or arithmetic bandwidth. Latency-limited operations see smaller gains.
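The memory argument can be checked with back-of-envelope arithmetic. The function `training_bytes` and the parameter and activation counts below are hypothetical, not figures from the paper, and the sketch ignores optimizer state such as momentum. It only shows that when activations dominate, the extra FP32 master copy barely moves the total.

```python
def training_bytes(n_params, n_activations, mixed):
    """Rough byte count for weights, gradients, and saved activations."""
    if mixed:
        weight_bytes = n_params * (2 + 4)   # FP16 working copy + FP32 master
        grad_bytes = n_params * 2
        act_bytes = n_activations * 2
    else:
        weight_bytes = n_params * 4
        grad_bytes = n_params * 4
        act_bytes = n_activations * 4
    return weight_bytes + grad_bytes + act_bytes

# Hypothetical sizes, not taken from the paper.
n_params = 25_000_000
n_activations = 1_000_000_000

fp32_total = training_bytes(n_params, n_activations, mixed=False)
mixed_total = training_bytes(n_params, n_activations, mixed=True)
ratio = mixed_total / fp32_total
print(ratio)   # a little above 0.5
```

Under these assumptions weight storage grows by 50% (4 to 6 bytes per parameter), yet the total still lands close to half of the FP32 footprint.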

ImageNet classification shows accuracy preservation with identical hyperparameters. Reported ILSVRC12 top-1 accuracies (FP32 versus mixed precision) are: AlexNet 56.77% versus 56.93%; VGG-D 65.40% versus 65.43%; GoogLeNet 68.33% versus 68.43%; Inception v2 70.03% versus 70.02%; Inception v3 73.85% versus 74.13%; ResNet-50 75.92% versus 76.04%. Loss scaling is not required for these CNN classification runs.

Detection demonstrates why loss scaling matters. Faster R-CNN has 69.1% mAP baseline, 68.6% mixed precision without scaling, and 69.7% with scaling. Multibox SSD has 76.9% baseline, diverges without loss scaling, and reaches 77.1% with scaling. This is the clearest evidence that FP16 storage alone is not robust: preserving small gradients is necessary for some workloads.

Speech and language further stress the method. In pseudo-FP16, the English speech model improves from 2.20 to 1.99 CER and the Mandarin model from 15.82 to 15.01. Machine translation with 3- and 5-layer 1024-cell LSTMs matches FP32 with loss scaling, while training without scaling slightly degrades results. The bigLSTM language model diverges after 300K iterations without loss scaling but matches FP32 with a scale of 128.

Historical Effect

This paper established the practical recipe behind automatic mixed precision: FP16 tensors for throughput and memory, FP32 master weights and reductions for optimization stability, and loss scaling for underflow. Its historical importance is that it translated Volta Tensor Cores from a hardware feature into a general training method across vision, speech, and language workloads.

Limits

Speedups depend on kernels and bottleneck type. Operations limited by latency do not benefit much, and full-network gains depend on framework and library support. Some models require loss scaling, and a static scaling factor may need to be monitored for overflow. Mixed precision reduces memory and math cost but does not by itself solve distributed communication, input pipeline limits, or optimizer-state memory for very large models.
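One way to monitor a static scaling factor for overflow is to test the scaled FP16 gradients for non-finite values before unscaling. The helper `grads_overflowed` and the gradient values below are hypothetical, and skipping the update is only one possible response. Treat this as a sketch of the check, not a full training loop.

```python
import numpy as np

def grads_overflowed(scaled_grads_fp16):
    """Return True if any scaled FP16 gradient is inf or NaN."""
    return not bool(np.all(np.isfinite(scaled_grads_fp16)))

loss_scale = np.float32(128.0)
grads_fp32 = np.array([1e-3, 0.5, 1000.0], dtype=np.float32)  # hypothetical

# 1000 * 128 = 128000 exceeds the largest finite FP16 value (65504),
# so the cast to FP16 produces inf for that entry.
with np.errstate(over="ignore"):
    scaled = (grads_fp32 * loss_scale).astype(np.float16)

if grads_overflowed(scaled):
    print("overflow detected: skip this update")
else:
    print("gradients finite: unscale and apply the update")
```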