EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Metadata

Slug: efficientnet_2019
Year: 2019
Venue: ICML
Authors: Mingxing Tan, Quoc V. Le
Reading status: read complete
Compute regime: Multi-GPU dense training
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly lists the inference device for latency measurement: batch size 1 on a single core of an Intel Xeon CPU E5-2690, averaged over 20 runs. It does not list the training hardware in the extracted text. Under the project rule, the training environment is inferred as 2018-2019 Google accelerator infrastructure for ImageNet-scale CNN/NAS work, but the paper itself optimizes accuracy and FLOPs rather than claiming a specific GPU or TPU training setup.

The compute setup is therefore two-sided: models are trained at ImageNet scale and for transfer learning, but the resulting family is judged by parameters, FLOPs, and measured CPU latency. EfficientNet-B0 is found by neural architecture search with a 400M-FLOP target. B1 through B7 are then produced by scaling that same baseline to larger compute budgets.

Bottleneck

The bottleneck is inefficient CNN scaling. The paper argues that conventional scaling usually pushes only one dimension: deeper networks, wider networks, or higher input resolution. Each dimension creates a different device pressure. Depth increases sequential layer count and can make optimization harder. Width increases channel memory and the quadratic cost of convolutional channel products. Resolution increases activation sizes and spatial FLOPs, stressing memory bandwidth and activation storage.
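The different costs of the three dimensions can be illustrated with a toy FLOP count. This is a minimal sketch, assuming a network made of identical stride-1 standard convolutions; the function names and layer sizes are hypothetical and not taken from the paper.

```python
def conv_flops(height, width, c_in, c_out, kernel=3):
    """Multiply-adds for one stride-1, same-padded standard convolution."""
    return height * width * c_in * c_out * kernel * kernel


def network_flops(num_layers, resolution, channels, kernel=3):
    """Toy network: num_layers identical convolutions on a square input."""
    return num_layers * conv_flops(resolution, resolution, channels, channels, kernel)


# Hypothetical baseline sizes, chosen only for illustration.
base = network_flops(num_layers=10, resolution=64, channels=32)
deeper = network_flops(num_layers=20, resolution=64, channels=32)
wider = network_flops(num_layers=10, resolution=64, channels=64)
sharper = network_flops(num_layers=10, resolution=128, channels=32)

print("2x depth      ->", deeper / base)   # 2.0
print("2x width      ->", wider / base)    # 4.0
print("2x resolution ->", sharper / base)  # 4.0
```

Doubling depth doubles the cost, while doubling width or resolution quadruples it, which is why the three dimensions cannot be traded one for one.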

The paper explicitly notes that high-accuracy ConvNets had hit hardware memory limits and that further accuracy gains needed better efficiency. GPipe is the comparison point at the high end: it reaches 84.3% ImageNet top-1 with 557M parameters, but it is so large that it relies on specialized pipeline parallelism. EfficientNet's goal is to reach the same accuracy regime with a model that is smaller, cheaper in FLOPs, and faster on a real CPU.

Method Adaptation

EfficientNet adapts CNN scaling to resource constraints in two steps. First, it searches for a compact baseline, EfficientNet-B0, using a multi-objective target that includes accuracy and FLOPs. The baseline uses mobile-style MBConv blocks and squeeze-and-excitation, giving the scaling procedure a relatively efficient starting point.
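A minimal sketch of a multi-objective search reward of this kind. The form accuracy * (FLOPs / target)^w with w = -0.07 is the one the paper gives for its search objective, but it is not stated in this note and should be checked against the PDF. The function name and the candidate numbers below are hypothetical.

```python
def search_reward(accuracy, flops, target_flops=400e6, w=-0.07):
    """Reward that trades accuracy against FLOPs relative to a target budget.

    With a negative exponent w, candidates above the FLOP target are
    penalized and candidates below it are mildly rewarded.
    """
    return accuracy * (flops / target_flops) ** w


# Hypothetical candidates with equal accuracy but different cost.
on_budget = search_reward(accuracy=0.76, flops=400e6)
over_budget = search_reward(accuracy=0.76, flops=800e6)
under_budget = search_reward(accuracy=0.76, flops=200e6)

print(on_budget, over_budget, under_budget)
```

The small magnitude of w makes the penalty soft: a candidate at twice the budget loses only a few percent of reward, so accuracy still dominates the ranking.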

Second, it uses compound scaling. A single coefficient phi increases depth, width, and resolution together: depth by alpha^phi, width by beta^phi, and resolution by gamma^phi. This is resource aware because convolutional FLOPs scale roughly linearly with depth but quadratically with width and resolution. The paper constrains alpha * beta^2 * gamma^2 to be approximately 2, so total FLOPs grow by roughly 2^phi. It chooses alpha, beta, and gamma by a small grid search at phi=1, then holds them fixed to scale B1 through B7. In effect, the method allocates new FLOPs across sequential depth, channel parallelism, and spatial activation size instead of letting one dimension saturate.
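A minimal sketch of the compound scaling rule. The values alpha=1.2, beta=1.1, gamma=1.15 are the grid-search results reported in the paper; they do not appear in this note, so verify them against the PDF. The function name compound_scale is hypothetical, and the released B1-B7 configurations may not match these idealized multipliers exactly.

```python
# Grid-search coefficients as reported in the paper (not stated in this note).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return depth, width, resolution, and approximate FLOP multipliers."""
    depth_mult = ALPHA ** phi
    width_mult = BETA ** phi
    res_mult = GAMMA ** phi
    # FLOPs: linear in depth, quadratic in width and in resolution.
    flops_mult = depth_mult * width_mult ** 2 * res_mult ** 2
    return {
        "depth": depth_mult,
        "width": width_mult,
        "resolution": res_mult,
        "flops": flops_mult,
    }


for phi in range(0, 4):
    s = compound_scale(phi)
    print(
        f"phi={phi}: depth x{s['depth']:.2f}, width x{s['width']:.2f}, "
        f"resolution x{s['resolution']:.2f}, FLOPs x{s['flops']:.2f}"
    )
```

With these coefficients, alpha * beta^2 * gamma^2 is about 1.92, so each unit increase in phi roughly doubles the FLOP budget.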

The training recipe is also adjusted for scale. The paper reports RMSProp with momentum, batch normalization settings, weight decay, learning-rate decay, SiLU/Swish activation, AutoAugment, stochastic depth, and dropout increasing from 0.2 for B0 to 0.5 for B7. Those details matter because larger models need more regularization; otherwise the compute allocation would be confounded with overfitting or unstable optimization.
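A minimal sketch of two pieces of this recipe. The dropout endpoints (0.2 for B0, 0.5 for B7) come from the note above; interpolating linearly between them is an assumption here. The learning-rate numbers (initial 0.256, decay 0.97 every 2.4 epochs) are the paper's reported values but are not stated in this note, and the staircase form of the decay is also an assumption. Both function names are hypothetical.

```python
import math


def learning_rate(epoch, initial=0.256, decay=0.97, epochs_per_decay=2.4):
    """Staircase exponential decay (assumed form): multiply by `decay`
    once per completed block of `epochs_per_decay` epochs."""
    steps = math.floor(epoch / epochs_per_decay)
    return initial * decay ** steps


def dropout_rate(model_index, low=0.2, high=0.5, last_index=7):
    """Dropout for EfficientNet-B<model_index>, linearly interpolated
    between the B0 and B7 endpoints (linear spacing is an assumption)."""
    if not 0 <= model_index <= last_index:
        raise ValueError("model_index must be between 0 and last_index")
    return low + (high - low) * model_index / last_index


for i in range(8):
    print(f"B{i}: dropout={dropout_rate(i):.3f}")

for epoch in (0, 2.4, 4.8):
    print(f"epoch {epoch}: lr={learning_rate(epoch):.4f}")
```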

Evidence

The ImageNet table is the main compute evidence. EfficientNet-B0 reaches 77.1% top-1 with 5.3M parameters and 0.39B FLOPs, compared with ResNet-50 at 76.0%, 26M parameters, and 4.1B FLOPs. EfficientNet-B1 reaches 79.1% top-1 with 7.8M parameters and 0.70B FLOPs, while ResNet-152 reaches 77.8% with 60M parameters and 11B FLOPs. EfficientNet-B3 reaches 81.6% with 12M parameters and 1.8B FLOPs; the text highlights that it exceeds ResNeXt-101 while using 18x fewer FLOPs.
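The efficiency gaps in these comparisons can be made explicit with a short calculation. This is a sketch that uses only the figures quoted above; the dictionary layout and the helper name are hypothetical.

```python
# (top-1 %, parameters in millions, FLOPs in billions), as quoted above.
MODELS = {
    "EfficientNet-B0": (77.1, 5.3, 0.39),
    "ResNet-50": (76.0, 26.0, 4.1),
    "EfficientNet-B1": (79.1, 7.8, 0.70),
    "ResNet-152": (77.8, 60.0, 11.0),
}


def compare(efficient, baseline):
    """Return (accuracy gain, parameter ratio, FLOP ratio) of baseline vs efficient."""
    acc_e, params_e, flops_e = MODELS[efficient]
    acc_b, params_b, flops_b = MODELS[baseline]
    return acc_e - acc_b, params_b / params_e, flops_b / flops_e


for efficient, baseline in [
    ("EfficientNet-B0", "ResNet-50"),
    ("EfficientNet-B1", "ResNet-152"),
]:
    gain, param_ratio, flop_ratio = compare(efficient, baseline)
    print(
        f"{efficient} vs {baseline}: +{gain:.1f} top-1, "
        f"{param_ratio:.1f}x fewer parameters, {flop_ratio:.1f}x fewer FLOPs"
    )
```

On these numbers, B0 uses roughly 5x fewer parameters and 10x fewer FLOPs than ResNet-50, and B1 roughly 8x fewer parameters and 16x fewer FLOPs than ResNet-152, in both cases at higher accuracy.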

At the high end, EfficientNet-B7 reaches 84.3% top-1 and 97.0% top-5 with 66M parameters and 37B FLOPs. GPipe reaches the same 84.3%/97.0% ImageNet accuracy with 557M parameters and a specialized pipeline-parallel training system. The paper summarizes this as EfficientNet-B7 being 8.4x smaller and 6.1x faster than GPipe.

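The CPU latency table ties FLOPs to a device. On one Xeon E5-2690 core at batch size 1, EfficientNet-B1 runs at 0.098 seconds with 78.8% accuracy, versus ResNet-152 at 0.554 seconds with 77.8% accuracy, a 5.7x speedup. EfficientNet-B7 is listed at 3.1 seconds and 84.4% accuracy, compared with GPipe at 19.0 seconds and 84.3%, a 6.1x speedup. The accuracies listed in this table differ slightly from those in the ImageNet table (78.8% versus 79.1% for B1, 84.4% versus 84.3% for B7); the values here are as listed.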
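The quoted ratios can be reproduced from the raw figures. This is a small arithmetic check using only the latencies and parameter counts given above; the variable names are hypothetical.

```python
# CPU latencies in seconds (batch size 1, one Xeon E5-2690 core), as quoted above.
latency = {
    "EfficientNet-B1": 0.098,
    "ResNet-152": 0.554,
    "EfficientNet-B7": 3.1,
    "GPipe": 19.0,
}

# Parameter counts in millions, as quoted above.
params = {"EfficientNet-B7": 66.0, "GPipe": 557.0}

b1_speedup = latency["ResNet-152"] / latency["EfficientNet-B1"]
b7_speedup = latency["GPipe"] / latency["EfficientNet-B7"]
b7_size_ratio = params["GPipe"] / params["EfficientNet-B7"]

print(f"B1 vs ResNet-152 speedup: {b1_speedup:.1f}x")  # 5.7x
print(f"B7 vs GPipe speedup:      {b7_speedup:.1f}x")  # 6.1x
print(f"B7 vs GPipe size ratio:   {b7_size_ratio:.1f}x")  # 8.4x
```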

Historical Effect

EfficientNet shifted vision scaling from ad hoc "make it bigger" choices to explicit compute allocation across depth, width, and resolution. Historically, it is a bridge between neural architecture search and deployment-aware model scaling: search supplies a compact baseline, and compound scaling supplies a predictable family of models across resource budgets.

Its influence is also methodological. It helped make parameter count, FLOPs, and real-device latency standard reporting columns for high-accuracy CNNs, which moved ImageNet comparisons away from accuracy alone and away from very large pipeline-parallel models as the reference point.

Limits

The main limit is that FLOPs are an imperfect proxy for actual accelerator latency. The paper does measure CPU latency for representative models, but both the architecture search objective and the compound scaling constraint are expressed in FLOPs rather than in a full hardware cost model. Training hardware, chip count, and training wall-clock are not named in the extracted text, so training-scale claims should be limited to model, FLOP, and benchmark evidence.

Some gains also combine multiple factors: the B0 architecture, compound scaling, AutoAugment, stochastic depth, activation choice, and regularization schedule. EfficientNet is therefore not a pure scaling-law ablation. Finally, the local extraction interleaves columns and figures, though the main hardware, latency, FLOP, parameter, and accuracy values are readable.