ImageNet Classification with Deep Convolutional Neural Networks

2012 Single-GPU deep learning 766 citations ★ Test-of-time award

Metadata

  • Slug: alexnet_2012
  • Year: 2012
  • Venue: NeurIPS
  • Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
  • Reading status: read complete
  • Compute regime: Single-GPU deep learning
  • Primary sources: PDF, extracted text

Compute Setup

The paper names the hardware explicitly: two NVIDIA GTX 580 GPUs with 3GB of memory each. Training took five to six days. The run used SGD on 1.2 million ImageNet training images for roughly 90 passes through the dataset, with batch size 128, momentum 0.9, weight decay 0.0005, and an initial learning rate of 0.01 that was divided by 10 three times, each time by hand when validation error stopped improving.
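
The update behind these hyperparameters can be written in a few lines. This is a minimal sketch, assuming plain Python lists for the weights and a batch-averaged gradient supplied by the caller; the names `sgd_momentum_step` and `learning_rate` and the toy values are illustrative, not from the paper. The form of the update (momentum on a velocity term, weight decay scaled by the learning rate) follows the rule stated in the paper.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: v <- momentum*v - weight_decay*lr*w - lr*grad; w <- w + v."""
    new_v = [momentum * vi - weight_decay * lr * wi - lr * gi
             for wi, vi, gi in zip(w, v, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


def learning_rate(num_drops, initial=0.01):
    """Learning rate after `num_drops` manual divisions by 10."""
    return initial / (10 ** num_drops)


# Toy example (hypothetical values): one weight, zero initial velocity.
w, v = sgd_momentum_step(w=[1.0], v=[0.0], grad=[0.5])
final_lr = learning_rate(3)  # rate after the three drops described above
```

With three drops the rate ends three orders of magnitude below its starting value, so most of the roughly 90 passes run at much smaller step sizes than the initial 0.01.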

This hardware is not incidental. The authors state that network size was limited mainly by GPU memory and tolerable training time. A single GTX 580 had only 3GB of memory, so the model was split across two cards. The split used direct GPU-to-GPU memory access without routing through host memory, and communication was restricted to selected layers so transfer stayed an acceptable fraction of computation.

The software stack is also part of the setup: a highly optimized GPU implementation of 2D convolution, plus CPU-side Python augmentation that runs while the GPU trains on the previous batch.
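
The overlap between CPU augmentation and GPU training is a producer/consumer pattern. The following is a minimal sketch, assuming a worker thread stands in for the CPU side and a plain function stands in for the GPU step; `augment`, `train_step`, and `run_pipeline` are hypothetical names, and the paper does not describe its implementation at this level of detail.

```python
import queue
import threading


def augment(batch):
    """Stand-in for CPU-side augmentation (hypothetical: just tags each item)."""
    return [f"aug({item})" for item in batch]


def train_step(batch):
    """Stand-in for the GPU training step (hypothetical: returns the batch size)."""
    return len(batch)


def run_pipeline(batches, prefetch=1):
    """Augment the next batch on a worker thread while the current one trains."""
    ready = queue.Queue(maxsize=prefetch)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            ready.put(augment(batch))  # blocks while the queue is full
        ready.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        batch = ready.get()
        if batch is done:
            break
        results.append(train_step(batch))
    worker.join()
    return results


results = run_pipeline([["a", "b"], ["c", "d"], ["e"]])
```

The bounded queue is the design choice that matters: the producer stays at most `prefetch` batches ahead, so augmented images never need to be stored in bulk.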

Bottleneck

At ImageNet scale, hand-engineered feature pipelines and smaller models were no longer sufficient, and the limiting resources became dense convolution throughput and GPU memory. CNNs were attractive before 2012, but the paper says they had been prohibitively expensive to apply at large scale to high-resolution images. ImageNet supplied enough labels to train a high-capacity model without severe overfitting, while GPUs supplied enough convolution throughput to make the experiment feasible.

Memory shaped the architecture. The final network has about 60 million parameters, five convolutional layers, and three fully connected layers. Because weights and activations had to fit in 3GB per GPU, the model is manually partitioned across the two cards. Communication is not free, so only some layers connect across GPUs and the rest read only local feature maps.
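
The 60 million figure can be checked with simple arithmetic. This is a rough sketch: the kernel shapes are the ones the paper lists for each layer, while the 6 x 6 x 256 input to the first fully connected layer and the 32-bit weight size are assumptions made here for illustration.

```python
# (name, kernel_height, kernel_width, input_depth_per_kernel, num_kernels)
CONV_LAYERS = [
    ("conv1", 11, 11, 3, 96),
    ("conv2", 5, 5, 48, 256),
    ("conv3", 3, 3, 256, 384),
    ("conv4", 3, 3, 192, 384),
    ("conv5", 3, 3, 192, 256),
]
# (name, inputs, outputs); the 6*6*256 input to fc6 is an assumption.
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]


def count_parameters():
    """Return a dict of per-layer parameter counts (weights plus biases)."""
    counts = {}
    for name, kh, kw, depth, kernels in CONV_LAYERS:
        counts[name] = kh * kw * depth * kernels + kernels
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_in * n_out + n_out
    return counts


counts = count_parameters()
total = sum(counts.values())
fc_share = sum(counts[name] for name, _, _ in FC_LAYERS) / total
weight_megabytes = total * 4 / 1e6  # assuming 32-bit floats
```

Under these assumptions the total comes to about 61 million, with the large majority in the fully connected layers. The weights alone are a small fraction of 3GB; the same memory also has to hold activations, gradients, and momentum state for a batch of 128 images.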

Controlling overfitting is also constrained by compute. Training many independent models would be too expensive because a single CNN already takes several days, so the paper uses cheaper regularization and augmentation instead of relying only on ensembles.
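
The paper's main regularizer of this kind is dropout in the fully connected layers. A minimal sketch of the scheme as the paper describes it (zero each hidden output with probability 0.5 during training, keep every unit and multiply outputs by 0.5 at test time) might look like this; the function names and the toy activations are illustrative.

```python
import random


def dropout_train(activations, rng, drop_prob=0.5):
    """Zero each activation independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


def dropout_test(activations, drop_prob=0.5):
    """Keep every unit but scale its output by the keep probability."""
    keep = 1.0 - drop_prob
    return [a * keep for a in activations]


rng = random.Random(0)  # seeded so repeated runs drop the same units
hidden = [1.0, 2.0, 3.0, 4.0]  # hypothetical hidden-layer outputs
train_out = dropout_train(hidden, rng)
test_out = dropout_test(hidden)
```

Each training step samples a different thinned network, but all of them share one set of weights, which is why the cost is extra iterations on one model rather than several separate multi-day runs.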

Method Adaptation

AlexNet fits the 2012 GPU setup in several concrete ways:

  • Convolutions are implemented on GPUs and dominate the useful compute.
  • The network is partitioned across two GPUs, with some layers split by channel and communication restricted to reduce transfer overhead.
  • ReLU nonlinearities reduce training time compared with saturating activations.
  • Local response normalization and overlapping pooling are used within a GPU-friendly convolutional pipeline.
  • Data augmentation is generated on the CPU while the GPU trains on the previous batch, making the augmentation effectively hidden behind GPU work.
  • Dropout is used in the fully connected layers to control overfitting in a high-capacity model trained on a large dataset.

The method is therefore a fit between ImageNet-scale data and commodity gaming GPUs. ReLUs are a compute adaptation: the paper reports that a four-layer convolutional network with ReLUs reaches 25% training error on CIFAR-10 six times faster than an equivalent network with tanh units, which cuts the number of expensive training iterations.
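
The paper attributes the ReLU speedup to the unit being non-saturating. The sketch below illustrates only that property by comparing local gradients at a few positive inputs; it does not reproduce the CIFAR-10 experiment, and the input values are arbitrary.

```python
import math


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2, which shrinks as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    """Derivative of max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


inputs = [0.5, 2.0, 5.0]  # arbitrary positive pre-activations
tanh_grads = [tanh_grad(x) for x in inputs]
relu_grads = [relu_grad(x) for x in inputs]
```

For large inputs the tanh gradient falls toward zero while the ReLU gradient stays at one, so error signals passing through active ReLU units are not attenuated by the nonlinearity itself.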

The two-GPU partition is both a memory and a bandwidth adaptation. Half the kernels or neurons are placed on each GPU, but the two columns are not fully independent. The third convolutional layer and the fully connected layers read from both GPUs, while the second, fourth, and fifth convolutional layers read only from feature maps on the same GPU. This gives enough cross-talk for accuracy while keeping inter-GPU traffic bounded.
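
One way to see the effect of this connectivity is to derive the input depth of each convolutional kernel from it. This is a sketch under the connectivity pattern the paper describes: a layer that reads from both GPUs sees all of the previous layer's maps, and a layer that reads locally sees half. The dictionary and function names are illustrative; the kernel counts are the paper's.

```python
ORDER = ["input", "conv1", "conv2", "conv3", "conv4", "conv5"]
# Total feature maps (or image channels) produced by each stage, across both GPUs.
KERNELS = {"input": 3, "conv1": 96, "conv2": 256,
           "conv3": 384, "conv4": 384, "conv5": 256}
# True where a layer reads the previous stage's maps from both GPUs.
# conv1 reads the full input image, which both GPUs have access to.
READS_BOTH_GPUS = {"conv1": True, "conv2": False, "conv3": True,
                   "conv4": False, "conv5": False}


def kernel_input_depth(layer):
    """Number of input maps each kernel in `layer` is connected to."""
    previous = ORDER[ORDER.index(layer) - 1]
    full_depth = KERNELS[previous]
    return full_depth if READS_BOTH_GPUS[layer] else full_depth // 2


depths = {layer: kernel_input_depth(layer) for layer in ORDER[1:]}
```

The derived depths (3, 48, 256, 192, 192) match the kernel sizes listed in the paper, and the single 256-deep layer marks the only convolutional stage where feature maps have to cross between cards.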

The augmentation and inference design also reflect the compute budget. Random 224 x 224 crops and horizontal flips of the 256 x 256 training images expand the effective training set by a factor of 2048 without storing transformed images. At test time, the network averages its softmax predictions over ten 224 x 224 crops: the four corners and the center, plus their horizontal reflections. Dropout approximates a large ensemble with shared weights, costing about a factor of two in training iterations rather than several multi-day runs.
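
The crop arithmetic can be sketched directly. The code below assumes 256 x 256 source images and 224 x 224 crops as in the paper; the function names are illustrative, and the factor is computed as 32 x 32 offsets times two reflections because that is the count that reproduces the paper's 2048 figure.

```python
def augmentation_factor(image_size=256, crop_size=224):
    """Crop offsets per axis, squared, times two for horizontal reflections."""
    offsets = image_size - crop_size
    return offsets * offsets * 2


def ten_crop_boxes(image_size=256, crop_size=224):
    """Return (left, top, flipped) for four corners and the center, plus reflections."""
    far = image_size - crop_size
    mid = far // 2
    positions = [(0, 0), (far, 0), (0, far), (far, far), (mid, mid)]
    return [(left, top, flipped)
            for flipped in (False, True)
            for left, top in positions]


def average_predictions(per_crop_probs):
    """Average class probabilities element-wise across crops."""
    n = len(per_crop_probs)
    return [sum(column) / n for column in zip(*per_crop_probs)]


factor = augmentation_factor()
boxes = ten_crop_boxes()
# Hypothetical two-class probabilities from two crops, for illustration only.
averaged = average_predictions([[0.2, 0.8], [0.4, 0.6]])
```

Because crops are described by offsets and a flip flag, they can be generated on the fly from the stored 256 x 256 images, which is what keeps the augmentation free of extra storage.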

Evidence

  • The paper reports ILSVRC-2010 top-1/top-5 test error of 37.5% and 17.0%.
  • In ILSVRC-2012, a variant of the model achieved a winning top-5 test error of 15.3%, far ahead of the second-best entry at 26.2%.
  • The two-GPU network reduces top-1 and top-5 error by 1.7 and 1.2 percentage points compared with a one-GPU network that has half as many kernels in each convolutional layer, and it trains slightly faster.
  • Training used SGD with batch size 128, momentum 0.9, weight decay 0.0005, and manual learning-rate drops.
  • Response normalization reduces top-1/top-5 error by 1.4 and 1.2 percentage points, and overlapping pooling reduces them by 0.4 and 0.3 points.
  • CPU-side augmentation is described as computationally free in practice because it overlaps with GPU training on the previous batch.
  • Without dropout, the network substantially overfits; dropout roughly doubles iterations to convergence but avoids training many full models.
  • ILSVRC-2010 comparison: the previous competition best was 47.1% top-1 and 28.2% top-5, while a later Fisher-vector system reached 45.7% and 25.7%; the CNN reached 37.5% and 17.0%.

Historical Effect

AlexNet marks the practical start of the GPU deep-learning regime in mainstream AI history. It showed not only that CNNs worked, but also that large supervised datasets plus programmable GPU throughput could beat CPU-era vision systems by a wide margin.

The compute lesson is that architecture, regularization, and data processing were all shaped around the memory and bandwidth constraints of two GTX 580 cards.

The paper also points forward explicitly: the authors say results can improve by waiting for faster GPUs and larger datasets. ImageNet creates the data pressure; AlexNet shows that commodity GPU convolution is the matching compute substrate.

Limits

  • The result depends on labeled supervision and heavy ImageNet-scale annotation.
  • Multi-GPU scaling is hand-structured and small by later standards.
  • The paper's device fit is excellent for dense convolution, but not yet for sequence modeling or large-scale language pretraining.
  • The two-GPU split is a bespoke architecture decision, not a general distributed training system.

Links