Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Metadata

Slug: dropout_2014
Year: 2014
Venue: JMLR
Authors: Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
Reading status: read complete
Compute regime: Single-GPU deep learning
Primary sources: PDF, extracted text

Compute Setup

The paper's appendix states that the implementation is GPU-based and built on the CUDA libraries cudamat and cuda-convnet. It does not name an exact GPU model for the dropout experiments. Under the project rule, the device context is therefore the 2012-2014 single-GPU CUDA deep-learning era, with convolutional workloads shaped by AlexNet-style cuda-convnet and dense matrix operations shaped by cudamat.

The training scale is broad rather than one monolithic run. The paper evaluates MNIST, SVHN, CIFAR-10/100, ImageNet, TIMIT speech, Reuters-RCV1 text, and a computational biology task. Table 1 lists data sizes from 60K MNIST training images to 1.2M ILSVRC-2012 training images and 1.1M TIMIT frames. The compute premise is not that dropout reduces training time; it does the opposite. The conclusion states that a dropout network typically takes 2-3 times longer to train than a standard neural network of the same architecture. Its compute advantage is at inference and model-combination scale: it approximates a large ensemble with one test-time network.

Bottleneck

The bottleneck is overfitting in high-capacity neural networks when practical ensemble compute is limited. The introduction notes that large networks are slow to use, which makes it difficult to address overfitting by combining the predictions of many different large neural nets at test time, and that using many large networks is infeasible in applications where quick response matters. In 2014 GPU terms, the tension is clear: larger models and CUDA kernels make state-of-the-art vision and speech systems possible, but memory and latency do not permit evaluating an exponentially large ensemble, or even a large finite one, for every input.

Dropout also creates a training bottleneck of its own. Each training case effectively samples a different random architecture, so gradients are noisy and are not exactly the gradients of the final test architecture. The paper identifies these noisy parameter updates as a major cause of the 2-3x training-time penalty. The method therefore trades more stochastic training work for lower inference cost and better generalization.

Method Adaptation

Dropout adapts model averaging to the single-GPU regime by moving the combinatorics into stochastic minibatch training. During training, each unit is retained with probability p; dropping a unit temporarily removes it, along with its incoming and outgoing connections, for that training case. A network with n units can therefore be viewed as a collection of 2^n thinned networks, but because they share weights, the stored parameter count remains that of one model. At test time, explicit Monte Carlo averaging over many thinned networks is replaced by a single unthinned network whose outgoing weights are multiplied by p, so that each unit's test-time output matches its expected output under training-time dropout.
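
A minimal NumPy sketch of the two modes for one dense layer follows. The function names (sample_mask, dense_train, dense_test) and the shapes are assumptions made for illustration, not the paper's cudamat code. Both functions return the pre-activation; any nonlinearity would be applied afterwards.

```python
import numpy as np

def sample_mask(shape, p, rng):
    """Bernoulli retention mask: 1.0 keeps a unit (probability p), 0.0 drops it."""
    return (rng.random(shape) < p).astype(np.float64)

def dense_train(y, W, b, mask):
    """Training mode: thin the incoming activations, then apply the dense layer."""
    return (y * mask) @ W + b

def dense_test(y, W, b, p):
    """Test mode: no mask; the outgoing weights of the dropped units are scaled by p."""
    return y @ (p * W) + b

# Illustrative shapes: a minibatch of 4 cases, 6 incoming units, 3 outgoing units.
rng = np.random.default_rng(0)
p = 0.5
y = rng.normal(size=(4, 6))
W = rng.normal(size=(6, 3))
b = np.zeros(3)

mask = sample_mask(y.shape, p, rng)   # an independent mask for every training case
z_train = dense_train(y, W, b, mask)  # one sampled thinned network per row
z_test = dense_test(y, W, b, p)       # single deterministic pass
```

For this linear pre-activation, averaging dense_train over all 2^n masks reproduces dense_test exactly. Once nonlinearities and several layers are stacked, weight scaling is only an approximation to the ensemble average.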

This design fits GPU execution because it keeps the usual dense minibatch computation and only multiplies activations elementwise by Bernoulli masks. It avoids storing separate ensemble members and avoids running many forward passes per test example. The appendix states that dropout RBMs sample a different mask for each training case in every minibatch, and the feed-forward algorithm likewise draws an independent Bernoulli vector per case. The practical recipe also uses max-norm constraints, high final momentum, decaying learning rates, and larger hidden layers when the retention probability is low. The appendix warns that a smaller p calls for a larger n, which slows training, and that lowering p without increasing capacity can cause underfitting.
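
Two pieces of this recipe are easy to sketch. The code below assumes weights stored as an (n_in, n_out) matrix, so that each column holds the incoming weights of one hidden unit, and it assumes the appendix's sizing heuristic that a dropout layer should have at least n/p units when n units suit a standard net. The function names and the bound c = 3.0 are illustrative choices, not taken from the paper's code.

```python
import math
import numpy as np

def max_norm_project(W, c):
    """Rescale each column of W so its L2 norm is at most c.

    Columns already inside the ball of radius c are left unchanged.
    Assumes W has shape (n_in, n_out), one column per hidden unit.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

def dropout_width(n_standard, p):
    """Hidden-layer width suggested by the n/p heuristic (rounded up)."""
    return math.ceil(n_standard / p)

# Illustrative use: project after a (hypothetical) gradient step.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 5)) * 10.0
W = max_norm_project(W, c=3.0)
wider = dropout_width(1024, 0.5)
```

The projection would typically be applied after every parameter update, which lets training use a large learning rate without weights growing unboundedly.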

Evidence

The evidence is unusually cross-domain. The abstract reports improvements in vision, speech recognition, document classification, and computational biology. On MNIST, DBM-pretrained dropout nets reach 0.79% error in the permutation-invariant setting. On SVHN, the text says the best non-dropout convolutional net gets 3.95% error, dropout in fully connected layers reduces it to 3.02%, and dropout in all layers reduces it to 2.55%. On CIFAR, the text states that adding dropout in fully connected layers reduces CIFAR-10 error from 14.98% to 14.32%, adding dropout in every layer reduces it to 12.61%, and on CIFAR-100 dropout reduces error from 43.48% to 37.20%.
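
The absolute drops quoted above are easier to compare as relative reductions. The short sketch below does only that arithmetic, using the error rates from the paragraph above; the dictionary layout and function name are illustrative.

```python
# Error rates (%) as quoted in the text: (best non-dropout figure, best dropout figure).
reported = {
    "SVHN": (3.95, 2.55),
    "CIFAR-10": (14.98, 12.61),
    "CIFAR-100": (43.48, 37.20),
}

def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed."""
    return (baseline - improved) / baseline

for name, (base, drop) in reported.items():
    absolute = base - drop
    relative = 100.0 * relative_reduction(base, drop)
    print(f"{name}: {absolute:.2f} points absolute, {relative:.1f}% relative")
```

On these figures the relative reductions are roughly 35% for SVHN, 16% for CIFAR-10, and 14% for CIFAR-100.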

ImageNet is the clearest accelerator-era benchmark. The paper discusses the ILSVRC subset with roughly 1,000 images in each of 1,000 categories and reports that convolutional nets with dropout won ILSVRC-2012. Table 6 reports a top-5 test error of about 16% for the dropout conv net family, versus about 26% for the best standard-vision-feature methods discussed in the text. The conclusion then gives the compute cost: dropout improves results across a broad set of benchmarks, but a dropout network usually takes 2-3 times longer to train.

Historical Effect

Dropout made larger GPU-trained networks usable before batch normalization, residual architectures, and much larger data regimes changed the regularization mix. It is a compute-structure method because it converts explicit ensemble averaging into one noisy training run plus one cheap inference model. That mattered when CUDA made large nets feasible but memory, latency, and training budgets still made ordinary ensembles difficult. It also normalized the idea that extra training compute can be spent to preserve inference simplicity, a recurring pattern in later distillation, pretraining, and model compression work.

Limits

The main limitation is the training-time penalty. The paper directly identifies noisy parameter updates as a major cause of the 2-3x slowdown. Dropout also introduces another hyperparameter, the retention probability p, whose best value depends on layer type and model width; typical hidden-unit values are 0.5 to 0.8, with input layers often around 0.8 for image patches or speech frames. The paper does not give exact hardware models, so device-level throughput and memory claims cannot be made beyond the stated CUDA implementation. Later normalization, residual architectures, stronger data augmentation, and much larger data sets reduced dropout's centrality in some convolutional models, although the compute idea of cheap test-time averaging persisted.