Very Deep Convolutional Networks for Large-Scale Image Recognition

Metadata

Slug: vgg_2014
Year: 2014
Venue: ICLR
Authors: Karen Simonyan, Andrew Zisserman
Reading status: read complete
Compute regime: Early multi-GPU CNN training
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly states the training hardware: a single system with four NVIDIA Titan Black GPUs. The implementation is a modified December 2013 branch of C++ Caffe. Multi-GPU training uses synchronous data parallelism: each minibatch is split into GPU batches, gradients are computed in parallel, then averaged so the result matches single-GPU training on the full batch. The authors report a 3.75x speedup over one GPU, and say one network took 2-3 weeks to train depending on architecture.
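The equivalence claimed above (averaged per-GPU gradients match the full-batch gradient) can be checked on a toy problem. The sketch below is not the authors' Caffe code; the one-parameter least-squares model, the data, and the function names are illustrative assumptions, and the "GPUs" are just list slices.

```python
# Sketch of synchronous data parallelism on a toy model y ~ w * x.
# Model, data, and helper names are illustrative assumptions, not from the paper.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to w, averaged over the batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, num_gpus):
    """Split the batch into equal shards, compute one gradient per shard, then average."""
    shard = len(xs) // num_gpus
    assert shard * num_gpus == len(xs), "batch must divide evenly across GPUs"
    shard_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_gpus)
    ]
    return sum(shard_grads) / num_gpus

# A minibatch of 256 examples split across 4 simulated GPUs (64 each).
batch_x = [float(i) for i in range(256)]
batch_y = [3.0 * x + 1.0 for x in batch_x]
full_grad = grad_mse(0.5, batch_x, batch_y)
split_grad = data_parallel_grad(0.5, batch_x, batch_y, num_gpus=4)
```

The two gradients agree because every shard has the same size, so the mean of shard means equals the mean over the whole batch. As a side calculation, a 3.75x speedup on four GPUs corresponds to about 94% scaling efficiency.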

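The main run uses minibatch size 256, momentum 0.9, weight decay 5e-4, dropout 0.5 on the first two fully connected layers, and an initial learning rate of 1e-2 that is cut by a factor of 10 when validation accuracy stops improving (three times in total). Training stops after 370K iterations, or 74 epochs. Input crops are fixed 224x224 RGB images. Configurations range from 11 to 19 weight layers and from 133M to 144M parameters, most of them in the fully connected classifier.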
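The iteration and epoch figures are consistent with each other, which a short calculation shows. The exact training-set size below is an assumption (the commonly cited ILSVRC-2012 count); the paper itself describes the training set as 1.3M images.

```python
# Assumption: ILSVRC-2012 training set size. The paper only says "1.3M images".
TRAIN_IMAGES = 1_281_167
BATCH_SIZE = 256        # minibatch size from the paper
ITERATIONS = 370_000    # total training iterations from the paper

images_seen = ITERATIONS * BATCH_SIZE
epochs = images_seen / TRAIN_IMAGES            # a little under 74
iters_per_epoch = TRAIN_IMAGES / BATCH_SIZE    # roughly 5,000 iterations per epoch

print(f"{images_seen:,} images seen, about {epochs:.1f} epochs")
```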

Bottleneck

VGG's bottleneck is the cost of dense convolutional depth under 2014-era GPU memory and wall-clock limits. The paper treats depth as the primary design variable while keeping the architecture simple, but deeper networks were harder to initialize, more expensive to train, and large enough that even a 4-GPU workstation needed weeks per model. The fully connected layers create a large parameter footprint, while dense multi-scale and multi-crop evaluation multiplies inference work.

The paper also identifies costs it avoids. Local response normalization does not improve ILSVRC performance in the A-LRN comparison and increases memory consumption and computation time. Large filters are another target: a 7x7 convolution has 49C^2 weights, while three stacked 3x3 layers have 27C^2 weights for the same effective receptive field, plus more nonlinearities.
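The 49C^2 versus 27C^2 comparison follows from two small formulas, sketched here. The helper names are hypothetical, biases are ignored, and every layer is assumed to have C input and C output channels, as in the paper's argument.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions with a square kernel."""
    return 1 + layers * (kernel - 1)

def conv_weights(kernel, channels, layers=1):
    """Weight count (no biases) for `layers` convolutions with `channels` in and out."""
    return layers * kernel * kernel * channels * channels

C = 512  # any channel count works; the ratio does not depend on C

single_7x7 = conv_weights(7, C)             # 49 * C^2
three_3x3 = conv_weights(3, C, layers=3)    # 27 * C^2
ratio = single_7x7 / three_3x3              # 49 / 27, roughly 1.81

assert stacked_receptive_field(3, 3) == stacked_receptive_field(7, 1) == 7
```

The same formula gives a 5x5 receptive field for two stacked 3x3 layers, so each extra 3x3 layer widens the field by two pixels while adding another nonlinearity.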

Method Adaptation

The architecture adapts to GPU CNN constraints by making depth regular and kernel sizes small. All convolution strides are 1, padding preserves spatial resolution, and 2x2 max-pooling with stride 2 halves the resolution between blocks. Repeated 3x3 convolutions preserve the parallel dense-convolution workload GPUs handle well while reducing parameter count versus equivalent large filters. Configurations D and E use 16 and 19 weight layers with a uniform channel schedule that starts at 64 and doubles after each pooling stage until it reaches 512 channels.
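The regular layout makes the parameter counts easy to reproduce. The sketch below encodes the channel schedules of configurations A, D, and E and counts weights plus biases. The counting convention (biases included, 1000 classes, 224x224 input) and the helper name are assumptions of this sketch rather than something the paper spells out.

```python
# Channel schedules for configurations A, D, and E; "M" marks a 2x2 max-pooling layer.
CONFIGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

def count_parameters(config, input_size=224, num_classes=1000):
    """Return (conv_params, fc_params), counting weights and biases."""
    conv_params, in_ch, size = 0, 3, input_size
    for item in config:
        if item == "M":
            size //= 2                                  # pooling halves each side
        else:
            conv_params += 3 * 3 * in_ch * item + item  # 3x3 kernel plus biases
            in_ch = item
    fc_params, fan_in = 0, in_ch * size * size          # 512 * 7 * 7 for a 224 input
    for fan_out in (4096, 4096, num_classes):
        fc_params += fan_in * fan_out + fan_out
        fan_in = fan_out
    return conv_params, fc_params

for name, config in CONFIGS.items():
    conv, fc = count_parameters(config)
    print(f"{name}: {(conv + fc) / 1e6:.0f}M total, {fc / (conv + fc):.0%} in FC layers")
```

Under these assumptions the totals round to 133M, 138M, and 144M, and the three fully connected layers hold well over 80% of the parameters in every configuration.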

Training is adapted around stability and reuse. Because bad initialization could stall deeper nets, the authors first train shallow configuration A and initialize the first four convolutional layers and last three fully connected layers of deeper models from it. For training scale S=384, they initialize from the S=256 model and use a smaller initial learning rate. Scale-jittered models are fine-tuned from fixed-scale models for speed reasons. These are compute-saving curriculum decisions, not just accuracy tricks.
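For the scale-jittered setting, the crop geometry can be sketched as follows. The paper samples the training scale S from [256, 512], rescales the image so its smallest side equals S, and crops a 224x224 input. Uniform integer sampling, the rounding rule, and the function name are assumptions of this sketch, and the paper's flipping and color augmentation are left out.

```python
import random

def sample_jittered_crop(width, height, rng, s_min=256, s_max=512, crop=224):
    """Pick a training scale S, the rescaled image size, and a random crop box.

    Assumes uniform integer sampling of S; only the geometry is computed.
    """
    s = rng.randint(s_min, s_max)            # training scale = smallest side
    scale = s / min(width, height)           # isotropic rescaling factor
    new_w, new_h = round(width * scale), round(height * scale)
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    return s, (new_w, new_h), (left, top, left + crop, top + crop)

rng = random.Random(0)                       # seeded only to make the sketch repeatable
s, resized, box = sample_jittered_crop(500, 375, rng)
```

Fixed-scale training is the special case `s_min == s_max`, for example 256 or 384.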

The test-time method is also compute-aware. The fully connected layers are converted to convolutional layers and applied densely over the whole rescaled image, avoiding repeated recomputation for many crops. The paper still evaluates 50 crops per scale over three scales as a reference, and combines dense plus multi-crop outputs for best accuracy, but it explicitly says the computation time of many crops may not justify the gains in practice. VGG therefore distinguishes the practical inference path from the benchmark-maximizing path.
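The effect of converting the fully connected layers to convolutions can be seen from the size of the resulting class score map. This sketch assumes five 2x2 poolings with floor rounding and a first fully connected layer that becomes a 7x7 convolution; the function name is hypothetical.

```python
def dense_score_map(height, width, num_pools=5, fc6_kernel=7):
    """Spatial size of the class score map when the FC layers run as convolutions.

    Assumes floor rounding in pooling and a first FC layer reinterpreted
    as a 7x7 convolution with no padding.
    """
    fh, fw = height, width
    for _ in range(num_pools):
        fh //= 2
        fw //= 2
    return fh - fc6_kernel + 1, fw - fc6_kernel + 1

for q in (224, 256, 384, 512):
    rows, cols = dense_score_map(q, q)
    print(f"{q}x{q} input: {rows}x{cols} score map")
```

A 224x224 input yields a single score vector, matching the original classifier, while a 384x384 input yields a 6x6 map from one forward pass. The map is then spatially averaged into one vector of class scores, and the convolutional features underneath are computed once rather than once per crop.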

Evidence

The architecture table gives the parameter and depth evidence. Model A has 11 weight layers and 133M parameters; model D has 16 layers and 138M; model E has 19 layers and 144M. In the single-scale validation table, deeper models improve error: A at S=256 has 29.6% top-1 and 10.4% top-5 error, while E with scale jittering has 25.5% top-1 and 8.0% top-5. A-LRN is slightly worse than A, 29.7%/10.5%, supporting the decision to drop LRN.

Multi-scale evaluation improves the strongest models. For D and E trained with S in [256,512] and tested at 256, 384, and 512, the best single-network validation result is 24.8% top-1 and 7.5% top-5. The paper states that configuration E reaches 7.3% top-5 test error at this stage. Combining dense and multi-crop evaluation improves validation top-5 to 7.1% for E, and the single-net test result in the state-of-the-art table is 7.0%.

Ensembling shows how much benchmark accuracy was bought with extra inference compute. The ILSVRC submission used seven networks and dense evaluation for 7.3% test error. Post-submission, an ensemble of the two best multi-scale models D and E reaches 7.0% test error with dense evaluation and 6.8% with combined dense and multi-crop evaluation. The paper reports that VGG placed second in ILSVRC 2014 classification and first in localization; the localization appendix reports 25.3% top-5 test localization error.
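The paper combines networks by averaging their softmax class posteriors. A minimal sketch of that step follows; the four-class logits are made-up illustrative values, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_posteriors(per_model_logits):
    """Average the softmax posteriors of several models for one image."""
    posteriors = [softmax(logits) for logits in per_model_logits]
    n = len(posteriors)
    return [sum(p[c] for p in posteriors) / n for c in range(len(posteriors[0]))]

def top_k(probs, k):
    """Indices of the k most probable classes, most probable first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]

# Made-up logits for two models on a 4-class toy problem.
model_d_logits = [2.0, 1.0, 0.1, -1.0]
model_e_logits = [0.5, 2.5, 0.0, -0.5]
averaged = ensemble_posteriors([model_d_logits, model_e_logits])
```

Inference cost grows linearly with ensemble size, since every network runs its own forward pass before the averaging step.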

The hardware evidence matters because those results are not cheap: each trained network required 2-3 weeks on four Titan Black GPUs, and the final comparisons depend on multiple separately trained networks.

Historical Effect

VGG made the "deep but simple" CNN backbone a durable baseline. In compute terms, an expensive 4-GPU training campaign produced representations that were easy to reuse, inspect, and transfer, so the training cost did not have to be paid again by every downstream user. Downstream detection and segmentation work adopted VGG-like backbones because the architecture was regular and the feature hierarchy was strong.

It also clarified a design tradeoff of the era. Rather than use complex branching or hand-tuned normalization, VGG spends dense convolutional compute and wall-clock time on depth. That is inefficient by later standards, but it gave the field a clean scaling axis before residual networks, bottleneck blocks, depthwise convolutions, and accelerator-optimized efficient CNNs changed the trade space.

Limits

The 133M-144M parameter count is large for the period, especially because of the fully connected layers.
Training a single network took 2-3 weeks on four Titan Black GPUs.
Best benchmark numbers use multi-scale, multi-crop, and ensemble inference that multiply serving cost.
The method scales dense computation rather than improving memory efficiency; later architectures achieved better accuracy-cost tradeoffs.