Going Deeper with Convolutions

Metadata

Slug: googlenet_2014
Year: 2014
Venue: CVPR
Authors: Christian Szegedy et al.
Reading status: read complete
Compute regime: CPU-based distributed training (DistBelief); GPU training discussed as feasible
Primary sources: PDF, extracted text

Compute Setup

The paper states explicitly that training used DistBelief with a modest amount of model and data parallelism, and that the reported implementation was CPU-based only. The authors add a GPU feasibility estimate: GoogLeNet could likely be trained to convergence on a few high-end GPUs within a week, with memory usage as the main limitation. This is a rare case in which the published result is not a GPU training result, even though the architecture is designed for the CPU/GPU dense-kernel era.

Device basis: explicit CPU-based distributed training for the reported training setup, with GPU feasibility discussed but not used as the main reported implementation.

Bottleneck

The bottleneck is increasing network capacity without letting parameters, multiply-adds, and memory grow out of control. The paper's systems argument is explicit: non-uniform sparse computation may reduce arithmetic counts, but the CPU/GPU libraries of the period are much better at dense matrix multiplication. Lookup overheads, cache misses, and engineering complexity can erase the nominal sparse savings. Convolutions already provide spatial sparsity while remaining dense inside local patches, and the authors note that the field had moved back to full connections across feature dimensions in order to better optimize parallel computation.

The other bottleneck is inference practicality. ImageNet-winning CNNs were growing deeper and wider, but the paper argues that power and memory use matter if models are to become usable rather than academic. GoogLeNet is therefore not just an accuracy architecture; it is an attempt to increase depth and width while keeping operations dense and regular and keeping the inference memory footprint low.

Method Adaptation

GoogLeNet adapts to this compute structure through Inception modules:

Approximate sparse multi-scale structure with dense branches.
Use 1 x 1 projections before expensive 3 x 3 and 5 x 5 convolutions.
Use average pooling before the classifier instead of large fully connected layers.
Add auxiliary classifiers for gradient flow and regularization.
Keep the branch structure regular enough for dense CPU/GPU kernels.
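The branch structure listed above can be sketched as a small NumPy forward pass. This is an illustrative sketch, not the authors' implementation: the function names, the tiny channel counts, the random weights, and the omission of biases are all assumptions, and the pooling branch (3 x 3 max pooling followed by a 1 x 1 projection) follows the common description of the module rather than anything stated in these notes.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d_same(x, w):
    """Stride-1, zero-padded convolution. x: (H, W, Cin); w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    pad = k // 2
    h, wd, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, wd, w.shape[3]))
    for dy in range(k):
        for dx in range(k):
            # Each kernel tap is one dense matrix multiply over the channel axis.
            out += xp[dy:dy + h, dx:dx + wd, :] @ w[dy, dx]
    return out

def maxpool3_same(x):
    """3 x 3 max pooling, stride 1, padded so the spatial size is unchanged."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    taps = [xp[dy:dy + h, dx:dx + wd, :] for dy in range(3) for dx in range(3)]
    return np.max(np.stack(taps), axis=0)

def init_weights(rng, c_in, n1, r3, n3, r5, n5, n_pool):
    """Hypothetical helper: random weights for the four branches (no biases)."""
    def w(k, ci, co):
        return rng.standard_normal((k, k, ci, co)) * 0.1
    return {
        "b1": w(1, c_in, n1),
        "b3_reduce": w(1, c_in, r3), "b3": w(3, r3, n3),
        "b5_reduce": w(1, c_in, r5), "b5": w(5, r5, n5),
        "pool_proj": w(1, c_in, n_pool),
    }

def inception_forward(x, p):
    # Four dense branches over the same input, concatenated along channels.
    b1 = relu(conv2d_same(x, p["b1"]))
    b3 = relu(conv2d_same(relu(conv2d_same(x, p["b3_reduce"])), p["b3"]))
    b5 = relu(conv2d_same(relu(conv2d_same(x, p["b5_reduce"])), p["b5"]))
    bp = relu(conv2d_same(maxpool3_same(x), p["pool_proj"]))
    return np.concatenate([b1, b3, b5, bp], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 12))  # toy 8 x 8 feature map with 12 channels
params = init_weights(rng, c_in=12, n1=4, r3=6, n3=8, r5=2, n5=3, n_pool=3)
y = inception_forward(x, params)
print(y.shape)  # (8, 8, 18): 4 + 8 + 3 + 3 output channels
```

Every branch preserves the spatial size, so the outputs can be concatenated along the channel axis, and every operation reduces to dense matrix multiplies of regular shape.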

The central adaptation is the 1 x 1 bottleneck. The paper says 1 x 1 convolutions are used as reductions before the expensive 3 x 3 and 5 x 5 convolutions, and that they also include rectified linear activations. This lets each Inception module aggregate visual information at multiple scales while limiting the input channel count of the costly branches. In compute terms, the architecture buys width with cheap dense projections, then spends compute selectively on larger receptive fields.
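The saving from a 1 x 1 reduction can be made concrete with a multiply-add count. The sketch below is a back-of-the-envelope calculation under assumed sizes (a 28 x 28 feature map with 192 input channels, a reduction to 16 channels, and 32 output channels for the 5 x 5 branch); these numbers are illustrative choices and are not taken from the notes above, and `conv_macs` is a hypothetical helper.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-adds for a stride-1, same-padded k x k convolution."""
    return h * w * k * k * c_in * c_out

H = W = 28                          # assumed spatial size
C_IN, C_REDUCE, C_OUT = 192, 16, 32  # assumed channel counts

# 5 x 5 convolution applied directly to all input channels.
direct = conv_macs(H, W, 5, C_IN, C_OUT)

# 1 x 1 reduction first, then the 5 x 5 convolution on the reduced channels.
reduced = conv_macs(H, W, 1, C_IN, C_REDUCE) + conv_macs(H, W, 5, C_REDUCE, C_OUT)

print(direct)                      # 120422400
print(reduced)                     # 12443648
print(round(direct / reduced, 2))  # 9.68
```

Under these assumptions the reduced branch needs roughly a tenth of the multiply-adds, because the cost of the 5 x 5 convolution scales with its input channel count and the 1 x 1 projection itself is cheap.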

Memory constraints shape depth placement. The paper says it was beneficial, for "memory efficiency during training," to start using Inception modules only at higher layers while leaving lower layers in more traditional convolutional form. Average pooling before the classifier cuts parameter count relative to large fully connected heads. Auxiliary classifiers attached at Inception 4a and 4d add training signal but are discarded at inference, so they spend training compute without increasing the deployed model.
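The role of the auxiliary classifiers can be sketched as a training-only term in the loss. This is a minimal sketch under assumptions: the 0.3 auxiliary weight is recalled from the paper and is not stated in these notes, the logits are made-up toy values, and the function names are hypothetical.

```python
import numpy as np

AUX_WEIGHT = 0.3  # assumption: auxiliary loss weight, not given in these notes

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def training_loss(main_logits, aux_logits_list, label):
    """Main loss plus down-weighted auxiliary losses (training only)."""
    main = cross_entropy(main_logits, label)
    aux = sum(cross_entropy(a, label) for a in aux_logits_list)
    return main + AUX_WEIGHT * aux

def inference_prediction(main_logits):
    """At inference the auxiliary heads are discarded; only the main head is used."""
    return int(np.argmax(main_logits))

# Toy logits standing in for the main head and the heads at Inception 4a and 4d.
main_logits = np.array([2.0, 0.5, -1.0])
aux_4a = np.array([0.3, 0.2, 0.1])
aux_4d = np.array([1.0, 0.0, -0.5])

loss = training_loss(main_logits, [aux_4a, aux_4d], label=0)
pred = inference_prediction(main_logits)
print(pred)  # 0
```

The auxiliary heads appear only in `training_loss`, which is why they add training compute but no parameters or operations to the deployed model.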

Evidence

Table 1 is the compute evidence: it lists each layer's parameters and operation counts. The early plain convolutions are listed at 34M ops for the 7 x 7 convolution and 360M ops for the 3 x 3 convolution, while later Inception modules use reduction branches to keep operations bounded despite larger channel counts. The final linear classifier is listed at about 1000K parameters and 1M ops after average pooling, which is far smaller than the fully connected heads common in AlexNet-style systems.
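The classifier-head figure can be checked with a short parameter count. The sketch below assumes a 1024-channel pooled feature vector and 1000 classes (consistent with the roughly 1000K parameters quoted above), a 7 x 7 final feature map for the hypothetical no-pooling case, and the commonly cited AlexNet-style head of 9216 -> 4096 -> 4096 -> 1000; the last two are assumptions, not values from these notes. Biases are ignored.

```python
def dense_weights(n_in, n_out):
    """Weight count of a fully connected layer, ignoring biases."""
    return n_in * n_out

# Linear classifier applied after average pooling to a 1024-dim vector.
pooled_head = dense_weights(1024, 1000)

# Same classifier if the assumed 7 x 7 x 1024 map were flattened instead of pooled.
flattened_head = dense_weights(7 * 7 * 1024, 1000)

# Assumed AlexNet-style head: 9216 -> 4096 -> 4096 -> 1000.
alexnet_style_head = (
    dense_weights(9216, 4096)
    + dense_weights(4096, 4096)
    + dense_weights(4096, 1000)
)

print(pooled_head)         # 1024000
print(flattened_head)      # 50176000
print(alexnet_style_head)  # 58621952
```

Under these assumptions, average pooling shrinks the classifier by a factor of 49 relative to flattening, and the pooled head is under 2% of the size of the AlexNet-style head.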

The network is 22 parameterized layers deep, or 27 if pooling is counted. The paper says it was designed for "computational efficiency and practicality" so that inference can run on individual devices, including devices with limited resources and a low memory footprint. In the ILSVRC classification setup, the final submission used seven trained models and aggressive test-time cropping: 4 image scales, 3 square positions per scale, 6 crops per square including the resized square, and mirrored versions of each crop, for 4 x 3 x 6 x 2 = 144 crops per image. Taking the single-model, single-crop cost as 1 unit, the full seven-model, 144-crop ensemble costs 7 x 144 = 1008 units and reaches 6.67% top-5 error. A single model with one crop reaches 10.07% top-5 error, and one model with 10 crops reaches 9.15%, so the winning number spent considerable inference-time compute beyond the efficient base network.
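The crop and cost arithmetic can be reproduced in a few lines. This sketch uses only the counts and error rates quoted above; treating cost as models times crops is the simplifying assumption behind the "relative units", and the variable names are illustrative.

```python
SCALES = 4             # image scales
SQUARES_PER_SCALE = 3  # square positions per scale
CROPS_PER_SQUARE = 6   # crops per square, including the resized square
MIRROR = 2             # each crop plus its mirrored version

crops_per_image = SCALES * SQUARES_PER_SCALE * CROPS_PER_SQUARE * MIRROR
print(crops_per_image)  # 144

# (models, crops) -> top-5 error in percent, as quoted in the notes above.
results = {(1, 1): 10.07, (1, 10): 9.15, (7, 144): 6.67}

for (models, crops), top5 in results.items():
    # Assumption: cost scales as forward passes = models x crops.
    relative_cost = models * crops
    print(models, crops, relative_cost, top5)

ensemble_cost = 7 * crops_per_image
print(ensemble_cost)  # 1008
```

The gap between the base configuration and the ensemble is 3.4 percentage points of top-5 error for about a thousand times more forward passes per image.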

The detection evidence reinforces the compute tradeoff. GoogLeNet's detection entry uses an R-CNN-like pipeline with the Inception model as the region classifier, combines Selective Search with MultiBox proposals, halves Selective Search proposals by increasing superpixel size, and adds back 200 MultiBox proposals. The paper says this yields about 60% of the proposals used by R-CNN while increasing coverage from 92% to 93%, improving single-model mAP by 1%. The detection ensemble of six ConvNets improves from about 40% to 43.9% mAP.
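The "about 60%" proposal figure follows from simple arithmetic once a baseline is fixed. The sketch below assumes a baseline of roughly 2000 Selective Search proposals per image for R-CNN; that number is an assumption for illustration and does not appear in these notes.

```python
RCNN_PROPOSALS = 2000  # assumption: approximate R-CNN baseline, not from these notes
MULTIBOX_PROPOSALS = 200

# Larger superpixels halve the Selective Search proposals.
selective_search = RCNN_PROPOSALS // 2

total = selective_search + MULTIBOX_PROPOSALS
fraction_of_baseline = total / RCNN_PROPOSALS

print(total)                 # 1200
print(fraction_of_baseline)  # 0.6
```

Fewer proposals mean fewer region classifications per image, so the pipeline cuts detection compute while the coverage figure quoted above still rises slightly.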

Historical Effect

GoogLeNet/Inception made compute-aware CNN modules mainstream: bottleneck projections, multi-scale dense branches, and parameter-count discipline became standard architectural tools. Historically, it sits between brute-force GPU CNN scaling and later efficient architecture families. It showed that a model could become deeper and more accurate without simply multiplying fully connected parameters or using hardware-unfriendly sparse matrices.

It also documents a transition moment in infrastructure. The reported training used CPU DistBelief, but the architecture is written with CPU/GPU dense-library efficiency in mind and explicitly discusses high-end GPU feasibility. The paper therefore captures the period when distributed internal training systems and commodity GPU deep learning overlapped, before single-framework GPU training became the default public recipe.

Limits

The architecture has many manual design knobs.
The paper frames its design principles as requiring further analysis.
The training recipe evolved over time and is hard to reduce to one clean procedure.
Reported training uses CPU DistBelief, not the later standard GPU implementation.
The best classification result uses a costly ensemble and 144-crop test-time procedure, so the headline number is not single-forward-pass inference.
Exact DistBelief cluster size and CPU hardware are not listed in the extracted text.