
                                            EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks


Mingxing Tan ¹   Quoc V. Le ¹

¹ Google Research, Brain Team, Mountain View, CA. Correspondence to: Mingxing Tan <tanmingxing@google.com>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

arXiv:1905.11946v5 [cs.LG] 11 Sep 2020

Abstract

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.

To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.

[Figure 1 plot: ImageNet Top-1 Accuracy (%) versus Number of Parameters (Millions) for EfficientNet-B0 through B7 and for ResNet-34, ResNet-50, ResNet-152, DenseNet-201, Inception-v2, Inception-ResNet-v2, Xception, ResNeXt-101, NASNet-A, AmoebaNet-A, AmoebaNet-C, and SENet. The inset table of the plot follows.]

Model                             Top-1 Acc.   #Params
ResNet-152 (He et al., 2016)      77.8%        60M
EfficientNet-B1                   79.1%        7.8M
ResNeXt-101 (Xie et al., 2017)    80.9%        84M
EfficientNet-B3                   81.6%        12M
SENet (Hu et al., 2018)           82.7%        146M
NASNet-A (Zoph et al., 2018)      82.7%        89M
EfficientNet-B4                   82.9%        19M
GPipe (Huang et al., 2018) †      84.3%        556M
EfficientNet-B7                   84.3%        66M
† Not plotted

Figure 1. Model Size vs. ImageNet Accuracy. All numbers are for single-crop, single-model. Our EfficientNets significantly outperform other ConvNets. In particular, EfficientNet-B7 achieves new state-of-the-art 84.3% top-1 accuracy while being 8.4x smaller and 6.1x faster than GPipe. EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152. Details are in Tables 2 and 4.

1. Introduction

Scaling up ConvNets is widely used to achieve better accuracy. For example, ResNet (He et al., 2016) can be scaled up from ResNet-18 to ResNet-200 by using more layers; recently, GPipe (Huang et al., 2018) achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline model four times larger. However, the process of scaling up ConvNets has never been well understood and there are currently many ways to do it. The most common way is to scale up ConvNets by their depth (He et al., 2016) or width (Zagoruyko & Komodakis, 2016). Another less common, but increasingly popular, method is to scale up models by image resolution (Huang et al., 2018). In previous work, it is common to scale only one of the three dimensions: depth, width, and image size. Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling requires tedious manual tuning and still often yields sub-optimal accuracy and efficiency.

In this paper, we want to study and rethink the process of scaling up ConvNets. In particular, we investigate the central question: is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency? Our empirical study shows that it is critical to balance all dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with a constant ratio. Based on this observation, we propose a simple yet effective compound scaling method. Unlike conventional practice that arbitrarily scales these factors, our method uniformly scales network width, depth,



[Figure 2 panels: (a) baseline, with #channels, layer_i, and resolution HxW marked; (b) width scaling (wider); (c) depth scaling (deeper); (d) resolution scaling (higher resolution); (e) compound scaling (wider, deeper, higher resolution).]


Figure 2. Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling that only increases one dimension of network
width, depth, or resolution. (e) is our proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.
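
Panel (e) can be made concrete with a small sketch. It assumes each dimension has a constant base (named alpha, beta, and gamma here) that is raised to a user-chosen exponent N, as the introduction describes. The numeric bases below are placeholder values for illustration, and the cost model (cost growing linearly with depth and quadratically with width and resolution) is an assumption of this sketch that holds for regular convolutions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingBases:
    """Constant per-dimension bases; the defaults are illustrative placeholders."""
    alpha: float = 1.2   # depth base (assumed value)
    beta: float = 1.1    # width base (assumed value)
    gamma: float = 1.15  # resolution base (assumed value)


def compound_multipliers(n: float, bases: ScalingBases = ScalingBases()):
    """Return (depth, width, resolution) multipliers for compound exponent n."""
    return bases.alpha ** n, bases.beta ** n, bases.gamma ** n


def relative_cost(depth: float, width: float, resolution: float) -> float:
    """Approximate compute growth, assuming cost ~ depth * width^2 * resolution^2."""
    return depth * width ** 2 * resolution ** 2


if __name__ == "__main__":
    for n in range(4):
        d, w, r = compound_multipliers(n)
        print(f"N={n}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
              f"approx. cost x{relative_cost(d, w, r):.2f} (budget 2^N = {2 ** n})")
```

With bases chosen so that alpha * beta^2 * gamma^2 is close to 2, each increment of N roughly doubles the estimated cost, which is how a 2^N budget maps onto the three multipliers.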


and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\alpha^N$, width by $\beta^N$, and image size by $\gamma^N$, where $\alpha, \beta, \gamma$ are constant coefficients determined by a small grid search on the original small model. Figure 2 illustrates the difference between our scaling method and conventional methods.

Intuitively, the compound scaling method makes sense because if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on the bigger image. In fact, previous theoretical (Raghu et al., 2017; Lu et al., 2018) and empirical results (Zagoruyko & Komodakis, 2016) both show that there exists a certain relationship between network width and depth, but to the best of our knowledge, we are the first to empirically quantify the relationship among all three dimensions of network width, depth, and resolution.

We demonstrate that our scaling method works well on existing MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Notably, the effectiveness of model scaling heavily depends on the baseline network; to go even further, we use neural architecture search (Zoph & Le, 2017; Tan et al., 2019) to develop a new baseline network, and scale it up to obtain a family of models, called EfficientNets. Figure 1 summarizes the ImageNet performance, where our EfficientNets significantly outperform other ConvNets. In particular, our EfficientNet-B7 surpasses the best existing GPipe accuracy (Huang et al., 2018), while using 8.4x fewer parameters and running 6.1x faster on inference. Compared to the widely used ResNet-50 (He et al., 2016), our EfficientNet-B4 improves the top-1 accuracy from 76.3% to 83.0% (+6.7%) with similar FLOPS. Besides ImageNet, EfficientNets also transfer well and achieve state-of-the-art accuracy on 5 out of 8 widely used datasets, while reducing parameters by up to 21x compared with existing ConvNets.

2. Related Work

ConvNet Accuracy: Since AlexNet (Krizhevsky et al., 2012) won the 2012 ImageNet competition, ConvNets have become increasingly more accurate by going bigger: while the 2014 ImageNet winner GoogLeNet (Szegedy et al., 2015) achieves 74.8% top-1 accuracy with about 6.8M parameters, the 2017 ImageNet winner SENet (Hu et al., 2018) achieves 82.7% top-1 accuracy with 145M parameters. Recently, GPipe (Huang et al., 2018) further pushes the state-of-the-art ImageNet top-1 validation accuracy to 84.3% using 557M parameters: it is so big that it can only be trained with a specialized pipeline parallelism library by partitioning the network and spreading each part to a different accelerator. While these models are mainly designed for ImageNet, recent studies have shown better ImageNet models also perform better across a variety of transfer learning datasets (Kornblith et al., 2019), and other computer vision tasks such as object detection (He et al., 2016; Tan et al., 2019). Although higher accuracy is critical for many applications, we have already hit the hardware memory limit, and thus further accuracy gain needs better efficiency.

ConvNet Efficiency: Deep ConvNets are often over-parameterized. Model compression (Han et al., 2016; He et al., 2018; Yang et al., 2018) is a common way to reduce model size by trading accuracy for efficiency. As mobile phones become ubiquitous, it is also common to hand-craft efficient mobile-size ConvNets, such as SqueezeNets (Iandola et al., 2016; Gholami et al., 2018), MobileNets (Howard et al., 2017; Sandler et al., 2018), and ShuffleNets

(Zhang et al., 2018; Ma et al., 2018). Recently, neural architecture search has become increasingly popular in designing efficient mobile-size ConvNets (Tan et al., 2019; Cai et al., 2019), and achieves even better efficiency than hand-crafted mobile ConvNets by extensively tuning the network width, depth, convolution kernel types and sizes. However, it is unclear how to apply these techniques to larger models that have a much larger design space and much more expensive tuning cost. In this paper, we aim to study model efficiency for super large ConvNets that surpass state-of-the-art accuracy. To achieve this goal, we resort to model scaling.

Model Scaling: There are many ways to scale a ConvNet for different resource constraints: ResNet (He et al., 2016) can be scaled down (e.g., ResNet-18) or up (e.g., ResNet-200) by adjusting network depth (#layers), while WideResNet (Zagoruyko & Komodakis, 2016) and MobileNets (Howard et al., 2017) can be scaled by network width (#channels). It is also well-recognized that bigger input image size will help accuracy with the overhead of more FLOPS. Although prior studies (Raghu et al., 2017; Lin & Jegelka, 2018; Sharir & Shashua, 2018; Lu et al., 2018) have shown that network depth and width are both important for ConvNets' expressive power, it still remains an open question of how to effectively scale a ConvNet to achieve better efficiency and accuracy. Our work systematically and empirically studies ConvNet scaling for all three dimensions of network width, depth, and resolution.

3. Compound Model Scaling

In this section, we will formulate the scaling problem, study different approaches, and propose our new scaling method.

3.1. Problem Formulation

A ConvNet layer $i$ can be defined as a function: $Y_i = F_i(X_i)$, where $F_i$ is the operator, $Y_i$ is the output tensor, and $X_i$ is the input tensor, with tensor shape $\langle H_i, W_i, C_i \rangle$ ¹, where $H_i$ and $W_i$ are the spatial dimensions and $C_i$ is the channel dimension. A ConvNet $N$ can be represented by a list of composed layers: $N = F_k \odot \dots \odot F_2 \odot F_1(X_1) = \bigodot_{j=1 \dots k} F_j(X_1)$. In practice, ConvNet layers are often partitioned into multiple stages and all layers in each stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolutional type except that the first layer performs down-sampling. Therefore, we can define a ConvNet as:

\[
N = \bigodot_{i=1 \dots s} F_i^{L_i}\left(X_{\langle H_i, W_i, C_i \rangle}\right) \tag{1}
\]

where $F_i^{L_i}$ denotes that layer $F_i$ is repeated $L_i$ times in stage $i$, and $\langle H_i, W_i, C_i \rangle$ denotes the shape of input tensor $X$ of layer $i$. Figure 2(a) illustrates a representative ConvNet, where the spatial dimension is gradually shrunk but the channel dimension is expanded over layers, for example, from initial input shape $\langle 224, 224, 3 \rangle$ to final output shape $\langle 7, 7, 512 \rangle$.

¹ For the sake of simplicity, we omit batch dimension.

Unlike regular ConvNet designs that mostly focus on finding the best layer architecture $F_i$, model scaling tries to expand the network length ($L_i$), width ($C_i$), and/or resolution ($H_i, W_i$) without changing $F_i$ predefined in the baseline network. By fixing $F_i$, model scaling simplifies the design problem for new resource constraints, but there still remains a large design space to explore different $L_i, C_i, H_i, W_i$ for each layer. In order to further reduce the design space, we restrict that all layers must be scaled uniformly with a constant ratio. Our target is to maximize the model accuracy for any given resource constraints, which can be formulated as an optimization problem:

\[
\begin{aligned}
\max_{d, w, r} \quad & \mathrm{Accuracy}\big(N(d, w, r)\big) \\
\text{s.t.} \quad & N(d, w, r) = \bigodot_{i=1 \dots s} \hat{F}_i^{\, d \cdot \hat{L}_i}\left(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\right) \\
& \mathrm{Memory}(N) \le \text{target\_memory} \\
& \mathrm{FLOPS}(N) \le \text{target\_flops}
\end{aligned}
\tag{2}
\]

where $w, d, r$ are coefficients for scaling network width, depth, and resolution; $\hat{F}_i, \hat{L}_i, \hat{H}_i, \hat{W}_i, \hat{C}_i$ are predefined parameters in the baseline network (see Table 1 as an example).

3.2. Scaling Dimensions

The main difficulty of problem (2) is that the optimal $d, w, r$ depend on each other and the values change under different resource constraints. Due to this difficulty, conventional methods mostly scale ConvNets in one of these dimensions:

Depth ($d$): Scaling network depth is the most common way used by many ConvNets (He et al., 2016; Huang et al., 2017; Szegedy et al., 2015; 2016). The intuition is that a deeper ConvNet can capture richer and more complex features, and generalize well on new tasks. However, deeper networks are also more difficult to train due to the vanishing gradient problem (Zagoruyko & Komodakis, 2016). Although several techniques, such as skip connections (He et al., 2016) and batch normalization (Ioffe & Szegedy, 2015), alleviate the training problem, the accuracy gain of very deep networks diminishes: for example, ResNet-1000 has similar accuracy to ResNet-101 even though it has many more layers. Figure 3 (middle) shows our empirical study on scaling a baseline model with different depth coefficient $d$, further suggesting the diminishing accuracy return for very deep ConvNets.

Width ($w$): Scaling network width is commonly used for small size models (Howard et al., 2017; Sandler et al., 2018;

[Figure 3 plots: three panels of ImageNet Top-1 Accuracy (%) versus FLOPS (Billions). Left: width scaling with w = 1.0, 1.4, 1.8, 2.6, 3.8, 5.0. Middle: depth scaling with d = 1.0, 2.0, 3.0, 4.0, 6.0, 8.0. Right: resolution scaling with r = 1.0, 1.3, 1.5, 1.7, 1.9, 2.2, 2.5.]

Figure 3. Scaling Up a Baseline Model with Different Network Width (w), Depth (d), and Resolution (r) Coefficients. Bigger
networks with larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates after reaching
80%, demonstrating the limitation of single-dimension scaling. The baseline network is described in Table 1.


Tan et al., 2019) ². As discussed in (Zagoruyko & Komodakis, 2016), wider networks tend to be able to capture more fine-grained features and are easier to train. However, extremely wide but shallow networks tend to have difficulties in capturing higher level features. Our empirical results in Figure 3 (left) show that the accuracy quickly saturates when networks become much wider with larger $w$.

Resolution ($r$): With higher resolution input images, ConvNets can potentially capture more fine-grained patterns. Starting from 224x224 in early ConvNets, modern ConvNets tend to use 299x299 (Szegedy et al., 2016) or 331x331 (Zoph et al., 2018) for better accuracy. Recently, GPipe (Huang et al., 2018) achieves state-of-the-art ImageNet accuracy with 480x480 resolution. Higher resolutions, such as 600x600, are also widely used in object detection ConvNets (He et al., 2017; Lin et al., 2017). Figure 3 (right) shows the results of scaling network resolutions, where indeed higher resolutions improve accuracy, but the accuracy gain diminishes for very high resolutions ($r = 1.0$ denotes resolution 224x224 and $r = 2.5$ denotes resolution 560x560).

[Figure 4 plot: ImageNet Top-1 Accuracy (%) versus FLOPS (billions), one line per baseline: (d=1.0, r=1.0), (d=1.0, r=1.3), (d=2.0, r=1.0), (d=2.0, r=1.3).]

Figure 4. Scaling Network Width for Different Baseline Networks. Each dot in a line denotes a model with different width coefficient (w). All baseline networks are from Table 1. The first baseline network (d=1.0, r=1.0) has 18 convolutional layers with resolution 224x224, while the last baseline (d=2.0, r=1.3) has 36 layers with resolution 299x299.
The above analyses lead us to the first observation:

Observation 1 – Scaling up any dimension of network width, depth, or resolution improves accuracy,
but the accuracy gain diminishes for bigger models.

3.3. Compound Scaling

We empirically observe that different scaling dimensions are not independent. Intuitively, for
higher resolution images, we should increase network depth, such that the larger receptive fields
can help capture similar features that include more pixels in bigger images. Correspondingly, we
should also increase network width when resolution is higher, in order to capture more fine-grained
patterns with more pixels in high resolution images. These intuitions suggest that we need to
coordinate and balance different scaling dimensions rather than conventional single-dimension
scaling.

To validate our intuitions, we compare width scaling under different network depths and
resolutions, as shown in Figure 4. If we only scale network width w without changing depth (d=1.0)
and resolution (r=1.0), the accuracy saturates quickly. With deeper (d=2.0) and higher resolution
(r=2.0), width scaling achieves much better accuracy under the same FLOPS cost. These results lead
us to the second observation:

Observation 2 – In order to pursue better accuracy and efficiency, it is critical to balance all
dimensions of network width, depth, and resolution during ConvNet scaling.

² In some literature, scaling number of channels is called “depth multiplier”, which means the same
as our width coefficient w.

Table 1. EfficientNet-B0 baseline network – Each row describes a stage i with L̂_i layers, with
input resolution ⟨Ĥ_i, Ŵ_i⟩ and output channels Ĉ_i. Notations are adopted from Equation 2.

  Stage   Operator                   Resolution    #Channels   #Layers
  i       F̂_i                        Ĥ_i × Ŵ_i     Ĉ_i         L̂_i
  1       Conv3x3                    224 × 224     32          1
  2       MBConv1, k3x3              112 × 112     16          1
  3       MBConv6, k3x3              112 × 112     24          2
  4       MBConv6, k5x5              56 × 56       40          2
  5       MBConv6, k3x3              28 × 28       80          3
  6       MBConv6, k5x5              14 × 14       112         3
  7       MBConv6, k5x5              14 × 14       192         4
  8       MBConv6, k3x3              7 × 7         320         1
  9       Conv1x1 & Pooling & FC     7 × 7         1280        1

In fact, a few prior works (Zoph et al., 2018; Real et al., 2019) have already tried to arbitrarily
balance network width and depth, but they all require tedious manual tuning.

In this paper, we propose a new compound scaling method, which uses a compound coefficient φ to
uniformly scale network width, depth, and resolution in a principled way:

\begin{equation}
\begin{aligned}
\text{depth: }      & d = \alpha^{\phi} \\
\text{width: }      & w = \beta^{\phi} \\
\text{resolution: } & r = \gamma^{\phi} \\
\text{s.t. }        & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2 \\
                    & \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
\tag{3}
\end{equation}
where α, β, γ are constants that can be determined by a small grid search. Intuitively, φ is a
user-specified coefficient that controls how many more resources are available for model scaling,
while α, β, γ specify how to assign these extra resources to network depth, width, and resolution
respectively. Notably, the FLOPS of a regular convolution op is proportional to d, w², r², i.e.,
doubling network depth will double FLOPS, but doubling network width or resolution will increase
FLOPS by four times. Since convolution ops usually dominate the computation cost in ConvNets,
scaling a ConvNet with Equation 3 will approximately increase total FLOPS by
$(\alpha \cdot \beta^{2} \cdot \gamma^{2})^{\phi}$. In this paper, we constrain
$\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$ such that for any new φ, the total FLOPS will
approximately³ increase by $2^{\phi}$.

³ FLOPS may differ from the theoretical value due to rounding.

4. EfficientNet Architecture

Since model scaling does not change layer operators $\hat{F}_i$ in the baseline network, having a
good baseline network is also critical. We will evaluate our scaling method using existing
ConvNets, but in order to better demonstrate the effectiveness of our scaling method, we have also
developed a new mobile-size baseline, called EfficientNet.

Inspired by (Tan et al., 2019), we develop our baseline network by leveraging a multi-objective
neural architecture search that optimizes both accuracy and FLOPS. Specifically, we use the same
search space as (Tan et al., 2019), and use $ACC(m) \times [FLOPS(m)/T]^{w}$ as the optimization
goal, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS and
w=-0.07 is a hyperparameter for controlling the trade-off between accuracy and FLOPS. Unlike
(Tan et al., 2019; Cai et al., 2019), here we optimize FLOPS rather than latency since we are not
targeting any specific hardware device. Our search produces an efficient network, which we name
EfficientNet-B0. Since we use the same search space as (Tan et al., 2019), the architecture is
similar to MnasNet, except our EfficientNet-B0 is slightly bigger due to the larger FLOPS target
(our FLOPS target is 400M). Table 1 shows the architecture of EfficientNet-B0. Its main building
block is mobile inverted bottleneck MBConv (Sandler et al., 2018; Tan et al., 2019), to which we
also add squeeze-and-excitation optimization (Hu et al., 2018).

Starting from the baseline EfficientNet-B0, we apply our compound scaling method to scale it up
with two steps:

   • STEP 1: we first fix φ = 1, assuming twice as many resources are available, and do a small
     grid search of α, β, γ based on Equations 2 and 3. In particular, we find the best values for
     EfficientNet-B0 are α = 1.2, β = 1.1, γ = 1.15, under the constraint of α · β² · γ² ≈ 2.
   • STEP 2: we then fix α, β, γ as constants and scale up the baseline network with different φ
     using Equation 3, to obtain EfficientNet-B1 to B7 (details in Table 2).

Notably, it is possible to achieve even better performance by searching for α, β, γ directly around
a large model, but the search cost becomes prohibitively expensive on larger models. Our method
solves this issue by doing the search only once on the small baseline network (step 1), and then
using the same scaling coefficients for all other models (step 2).

5. Experiments

In this section, we will first evaluate our scaling method on existing ConvNets and the newly
proposed EfficientNets.

5.1. Scaling Up MobileNets and ResNets

As a proof of concept, we first apply our scaling method to the widely-used MobileNets
(Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Table 3 shows the
ImageNet results of scaling them in different ways. Compared to other single-dimension scaling
methods, our compound scaling method improves the accuracy on all these models, suggesting the
effectiveness of our proposed scaling method for general existing ConvNets.
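
The compound scaling rule of Equation 3 and the two-step procedure of Section 4 can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the 0.05 grid step, the upper bound of 2.0, and the tolerance used for "≈ 2" are assumptions chosen for the example, and the accuracy evaluation that STEP 1 would run for each candidate is omitted.

```python
from itertools import product

# Coefficients reported in the text for EfficientNet-B0 (result of STEP 1).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return (depth, width, resolution) multipliers from Equation 3."""
    return alpha ** phi, beta ** phi, gamma ** phi


def flops_multiplier(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Approximate FLOPS growth d * w^2 * r^2 = (alpha * beta^2 * gamma^2) ** phi."""
    d, w, r = compound_scale(phi, alpha, beta, gamma)
    return d * w ** 2 * r ** 2


def candidate_coefficients(step=0.05, upper=2.0, tol=0.1):
    """Enumerate grid points with alpha, beta, gamma >= 1 whose product
    alpha * beta^2 * gamma^2 lies within `tol` of 2.

    The grid step, upper bound, and tolerance are assumptions for this sketch.
    """
    n = int(round((upper - 1.0) / step))
    grid = [round(1.0 + i * step, 2) for i in range(n + 1)]
    return [
        (a, b, g)
        for a, b, g in product(grid, repeat=3)
        if abs(a * b ** 2 * g ** 2 - 2.0) <= tol
    ]


if __name__ == "__main__":
    # STEP 2: keep alpha, beta, gamma fixed and vary phi.
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: d={d:.3f} w={w:.3f} r={r:.3f} "
              f"FLOPS x{flops_multiplier(phi):.2f}")
```

With the reported coefficients, the multiplier for φ = 1 is 1.2 · 1.1² · 1.15² ≈ 1.92, which is close to the target of 2. In the procedure described above, each tuple returned by `candidate_coefficients` would be used to scale and train the baseline, and the most accurate one kept.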


Table 2. EfficientNet Performance Results on ImageNet (Russakovsky et al., 2015). All EfficientNet models are scaled from our
baseline EfficientNet-B0 using different compound coefficient φ in Equation 3. ConvNets with similar top-1/top-5 accuracy are grouped
together for efficiency comparison. Our scaled EfficientNet models consistently reduce parameters and FLOPS by an order of magnitude
(up to 8.4x parameter reduction and up to 16x FLOPS reduction) compared with existing ConvNets.

  Model                                        Top-1 Acc.   Top-5 Acc.   #Params   Ratio-to-EfficientNet   #FLOPs   Ratio-to-EfficientNet

  EfficientNet-B0                              77.1%        93.3%        5.3M      1x                      0.39B    1x
  ResNet-50 (He et al., 2016)                  76.0%        93.0%        26M       4.9x                    4.1B     11x
  DenseNet-169 (Huang et al., 2017)            76.2%        93.2%        14M       2.6x                    3.5B     8.9x

  EfficientNet-B1                              79.1%        94.4%        7.8M      1x                      0.70B    1x
  ResNet-152 (He et al., 2016)                 77.8%        93.8%        60M       7.6x                    11B      16x
  DenseNet-264 (Huang et al., 2017)            77.9%        93.9%        34M       4.3x                    6.0B     8.6x
  Inception-v3 (Szegedy et al., 2016)          78.8%        94.4%        24M       3.0x                    5.7B     8.1x
  Xception (Chollet, 2017)                     79.0%        94.5%        23M       3.0x                    8.4B     12x

  EfficientNet-B2                              80.1%        94.9%        9.2M      1x                      1.0B     1x
  Inception-v4 (Szegedy et al., 2017)          80.0%        95.0%        48M       5.2x                    13B      13x
  Inception-resnet-v2 (Szegedy et al., 2017)   80.1%        95.1%        56M       6.1x                    13B      13x

  EfficientNet-B3                              81.6%        95.7%        12M       1x                      1.8B     1x
  ResNeXt-101 (Xie et al., 2017)               80.9%        95.6%        84M       7.0x                    32B      18x
  PolyNet (Zhang et al., 2017)                 81.3%        95.8%        92M       7.7x                    35B      19x

  EfficientNet-B4                              82.9%        96.4%        19M       1x                      4.2B     1x
  SENet (Hu et al., 2018)                      82.7%        96.2%        146M      7.7x                    42B      10x
  NASNet-A (Zoph et al., 2018)                 82.7%        96.2%        89M       4.7x                    24B      5.7x
  AmoebaNet-A (Real et al., 2019)              82.8%        96.1%        87M       4.6x                    23B      5.5x
  PNASNet (Liu et al., 2018)                   82.9%        96.2%        86M       4.5x                    23B      6.0x

  EfficientNet-B5                              83.6%        96.7%        30M       1x                      9.9B     1x
  AmoebaNet-C (Cubuk et al., 2019)             83.5%        96.5%        155M      5.2x                    41B      4.1x

  EfficientNet-B6                              84.0%        96.8%        43M       1x                      19B      1x

  EfficientNet-B7                              84.3%        97.0%        66M       1x                      37B      1x
  GPipe (Huang et al., 2018)                   84.3%        97.0%        557M      8.4x                    -        -
   We omit ensemble and multi-crop models (Hu et al., 2018), or models pretrained on 3.5B Instagram images (Mahajan et al., 2018).


[Figure 5 plot: x-axis FLOPS (Billions); y-axis ImageNet Top-1 Accuracy (%). Labeled points:
EfficientNet-B0, B3, B4, B5, B6; AmoebaNet-A, AmoebaNet-C, NASNet-A, SENet, ResNeXt-101,
Inception-ResNet-v2, Xception, ResNet-152, DenseNet-201, ResNet-50, Inception-v2, ResNet-34.]

  Inset table in Figure 5:
  Model                               Top1 Acc.   FLOPS
  ResNet-152 (He et al., 2016)        77.8%       11B
  EfficientNet-B1                     79.1%       0.7B
  ResNeXt-101 (Xie et al., 2017)      80.9%       32B
  EfficientNet-B3                     81.6%       1.8B
  SENet (Hu et al., 2018)             82.7%       42B
  NASNet-A (Zoph et al., 2018)        80.7%       24B
  EfficientNet-B4                     82.9%       4.2B
  AmoebaNet-C (Cubuk et al., 2019)    83.5%       41B
  EfficientNet-B5                     83.6%       9.9B

Figure 5. FLOPS vs. ImageNet Accuracy – Similar to Figure 1 except it compares FLOPS rather than
model size.

          Table 3. Scaling Up MobileNets and ResNet.

  Model                                            FLOPS       Top-1 Acc.

  Baseline MobileNetV1 (Howard et al., 2017)         0.6B        70.6%
  Scale MobileNetV1 by width (w=2)                   2.2B        74.2%
  Scale MobileNetV1 by resolution (r=2)              2.2B        72.7%
  compound scale (d=1.4, w=1.2, r=1.3)               2.3B        75.6%

  Baseline MobileNetV2 (Sandler et al., 2018)        0.3B        72.0%
  Scale MobileNetV2 by depth (d=4)                   1.2B        76.8%
  Scale MobileNetV2 by width (w=2)                   1.1B        76.4%
  Scale MobileNetV2 by resolution (r=2)              1.2B        74.8%
  MobileNetV2 compound scale                         1.3B        77.4%

  Baseline ResNet-50 (He et al., 2016)               4.1B        76.0%
  Scale ResNet-50 by depth (d=4)                    16.2B        78.1%
  Scale ResNet-50 by width (w=2)                    14.7B        77.7%
  Scale ResNet-50 by resolution (r=2)               16.4B        77.5%
  ResNet-50 compound scale                          16.7B        78.8%
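
As a rough check of the FLOPS rule stated in Section 3.3 (the FLOPS of a regular convolution scale with d, w², r²), the sketch below applies that rule to the baseline FLOPS in Table 3 and prints the result next to the reported values. The function name and the row list are illustrative; only rows whose scaling factors are given in the table are included. The rule is only approximate, and the baseline figures are rounded, so the predictions are not expected to match the reported numbers exactly.

```python
def predicted_flops(base_flops, d=1.0, w=1.0, r=1.0):
    """Scale baseline FLOPS by d * w^2 * r^2 (regular-convolution rule)."""
    return base_flops * d * w ** 2 * r ** 2


# (row label, baseline FLOPS in billions, d, w, r, FLOPS reported in Table 3)
ROWS = [
    ("MobileNetV1 width (w=2)",      0.6, 1.0, 2.0, 1.0, 2.2),
    ("MobileNetV1 resolution (r=2)", 0.6, 1.0, 1.0, 2.0, 2.2),
    ("MobileNetV1 compound",         0.6, 1.4, 1.2, 1.3, 2.3),
    ("MobileNetV2 depth (d=4)",      0.3, 4.0, 1.0, 1.0, 1.2),
    ("MobileNetV2 width (w=2)",      0.3, 1.0, 2.0, 1.0, 1.1),
    ("ResNet-50 depth (d=4)",        4.1, 4.0, 1.0, 1.0, 16.2),
    ("ResNet-50 width (w=2)",        4.1, 1.0, 2.0, 1.0, 14.7),
    ("ResNet-50 resolution (r=2)",   4.1, 1.0, 1.0, 2.0, 16.4),
]

for label, base, d, w, r, reported in ROWS:
    estimate = predicted_flops(base, d, w, r)
    print(f"{label:30s} predicted {estimate:5.1f}B   reported {reported:5.1f}B")
```

The depth and resolution rows land close to the reported values, while the width rows come out somewhat higher than reported, which is consistent with the rule being stated for regular convolutions only.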
Table 4. Inference Latency Comparison – Latency is measured with batch size 1 on a single core of
Intel Xeon CPU E5-2690.

                      Acc. @ Latency                            Acc. @ Latency
    ResNet-152        77.8% @ 0.554s        GPipe               84.3% @ 19.0s
    EfficientNet-B1   78.8% @ 0.098s        EfficientNet-B7     84.4% @ 3.1s
    Speedup           5.7x                  Speedup             6.1x

5.2. ImageNet Results for EfficientNet

We train our EfficientNet models on ImageNet using similar settings as (Tan et al., 2019): RMSProp
optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99;


Table 5. EfficientNet Performance Results on Transfer Learning Datasets. Our scaled EfficientNet models achieve new state-of-the-
art accuracy for 5 out of 8 datasets, with 9.6x fewer parameters on average.

                    Comparison to best publicly available results                             Comparison to best reported results
 Dataset            Model          Acc.    #Param   Our Model         Acc.    #Param(ratio)   Model    Acc.    #Param   Our Model         Acc.    #Param(ratio)
 CIFAR-10           NASNet-A       98.0%   85M      EfficientNet-B0   98.1%   4M (21x)        GPipe†   99.0%   556M     EfficientNet-B7   98.9%   64M (8.7x)
 CIFAR-100          NASNet-A       87.5%   85M      EfficientNet-B0   88.1%   4M (21x)        GPipe    91.3%   556M     EfficientNet-B7   91.7%   64M (8.7x)
 Birdsnap           Inception-v4   81.8%   41M      EfficientNet-B5   82.0%   28M (1.5x)      GPipe    83.6%   556M     EfficientNet-B7   84.3%   64M (8.7x)
 Stanford Cars      Inception-v4   93.4%   41M      EfficientNet-B3   93.6%   10M (4.1x)      DAT‡     94.8%   -        EfficientNet-B7   94.7%   -
 Flowers            Inception-v4   98.5%   41M      EfficientNet-B5   98.5%   28M (1.5x)      DAT      97.7%   -        EfficientNet-B7   98.8%   -
 FGVC Aircraft      Inception-v4   90.9%   41M      EfficientNet-B3   90.7%   10M (4.1x)      DAT      92.9%   -        EfficientNet-B7   92.9%   -
 Oxford-IIIT Pets   ResNet-152     94.5%   58M      EfficientNet-B4   94.8%   17M (5.6x)      GPipe    95.9%   556M     EfficientNet-B6   95.4%   41M (14x)
 Food-101           Inception-v4   90.8%   41M      EfficientNet-B4   91.5%   17M (2.4x)      GPipe    93.0%   556M     EfficientNet-B7   93.0%   64M (8.7x)
 Geo-Mean                                                                     (4.7x)                                                              (9.6x)

  † GPipe (Huang et al., 2018) trains giant models with specialized pipeline parallelism library.
  ‡ DAT denotes domain adaptive transfer learning (Ngiam et al., 2018). Here we only compare ImageNet-based transfer learning results.
  Transfer accuracy and #params for NASNet (Zoph et al., 2018), Inception-v4 (Szegedy et al., 2017), ResNet-152 (He et al., 2016) are from (Kornblith et al., 2019).
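
The Geo-Mean row of Table 5 appears to be the geometric mean of the per-dataset parameter ratios. The sketch below recomputes it under that assumption, skipping rows where no ratio is reported; the helper and variable names are illustrative.

```python
import math


def geometric_mean(values):
    """Geometric mean computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# #Param ratios from the "best publicly available" half of Table 5 (8 datasets).
public_ratios = [21, 21, 1.5, 4.1, 1.5, 4.1, 5.6, 2.4]

# #Param ratios from the "best reported" half; the three DAT rows list no ratio.
reported_ratios = [8.7, 8.7, 8.7, 14, 8.7]

print(f"publicly available: {geometric_mean(public_ratios):.2f}x")
print(f"best reported:      {geometric_mean(reported_ratios):.2f}x")
```

Rounded to one decimal, the two results agree with the (4.7x) and (9.6x) entries in the table.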


[Figure 6 plots: eight panels (CIFAR10, CIFAR100, Birdsnap, Stanford Cars, Flowers, FGVC Aircraft,
Oxford-IIIT Pets, Food-101); x-axis: Number of Parameters (Millions, log-scale); y-axis: Accuracy (%).
Legend: DenseNet-201, ResNet-50, Inception-v1, ResNet-152, NASNet-A, GPIPE, ResNet-101, Inception-v3,
DenseNet-121, EfficientNet, Inception-ResNet-v2, DenseNet-169, Inception-v4.]

Figure 6. Model Parameters vs. Transfer Learning Accuracy – All models are pretrained on ImageNet and finetuned on new datasets.

weight decay 1e-5; initial learning rate 0.256 that decays                                                     being more accurate but 8.4x smaller than the previous
by 0.97 every 2.4 epochs. We also use SiLU (Swish-1) ac-                                                       best GPipe (Huang et al., 2018). These gains come from
tivation (Ramachandran et al., 2018; Elfwing et al., 2018;                                                     both better architectures, better scaling, and better training
Hendrycks & Gimpel, 2016), AutoAugment (Cubuk et al.,                                                          settings that are customized for EfficientNet.
2019), and stochastic depth (Huang et al., 2016) with sur-
                                                                                                              Figure 1 and Figure 5 illustrates the parameters-accuracy
vival probability 0.8. As commonly known that bigger mod-
                                                                                                              and FLOPS-accuracy curve for representative ConvNets,
els need more regularization, we linearly increase dropout
(Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to
0.5 for B7. We reserve 25K randomly picked images from
the training set as a minival set, and perform early
stopping on this minival; we then evaluate the early-
stopped checkpoint on the original validation set to
report the final validation accuracy.

Table 2 shows the performance of all EfficientNet models
that are scaled from the same baseline EfficientNet-B0. Our
EfficientNet models generally use an order of magnitude
fewer parameters and FLOPS than other ConvNets with
similar accuracy. In particular, our EfficientNet-B7 achieves
84.3% top-1 accuracy with 66M parameters and 37B FLOPS,

where our scaled EfficientNet models achieve better accuracy
with far fewer parameters and FLOPS than other ConvNets.
Notably, our EfficientNet models are not only small, but
also computationally cheaper. For example, our
EfficientNet-B3 achieves higher accuracy than ResNeXt-101
(Xie et al., 2017) using 18x fewer FLOPS.

To validate the latency, we have also measured the inference
latency for a few representative ConvNets on a real CPU as
shown in Table 4, where we report average latency over
20 runs. Our EfficientNet-B1 runs 5.7x faster than the
widely used ResNet-152, while EfficientNet-B7 runs about
6.1x faster than GPipe (Huang et al., 2018), suggesting our
EfficientNets are indeed fast on real hardware.
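
The latency protocol described above (average over 20 runs, then a speedup ratio between two models) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the measurement harness used for Table 4: `average_latency` and `speedup` are hypothetical helper names, the warm-up runs are an added assumption, and the timed callable stands in for a real model's forward pass.

```python
import time
from statistics import mean


def average_latency(fn, runs=20, warmup=2):
    """Return the mean wall-clock time of fn() in seconds over `runs` calls.

    The warm-up calls are an assumption of this sketch (the text only
    states that latency is averaged over 20 runs).
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def speedup(baseline_latency, candidate_latency):
    """How many times faster the candidate is than the baseline."""
    return baseline_latency / candidate_latency


# Stand-in workload; a real measurement would call model inference here.
latency = average_latency(lambda: sum(range(10_000)), runs=20)
```

A statement such as "runs 5.7x faster" then corresponds to `speedup(baseline_latency, candidate_latency)` evaluated on the two averaged latencies.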

[Figure 7 image: columns are original image, baseline model, deeper (d=4), wider (w=2), higher resolution (r=2), and compound scaling; rows are the example classes bakeshop and maze.]

Figure 7. Class Activation Map (CAM) (Zhou et al., 2016) for Models with different scaling methods. Our compound scaling method
allows the scaled model (last column) to focus on more relevant regions with more object details. Model details are in Table 7.
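
For readers unfamiliar with class activation maps, the sketch below shows the basic computation as defined by Zhou et al. (2016): the map for a class is the weighted sum of the last convolutional feature maps, using that class's weights from the final linear classifier. It is a minimal pure-Python illustration, not the visualization code behind Figure 7; the function name and the toy inputs are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) with K class weights.

    feature_maps: list of K maps, each a list of H rows of W floats.
    class_weights: list of K floats (classifier weights for one class).
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


# Toy example: two 2x2 feature maps and their class weights.
toy_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
toy_cam = class_activation_map(toy_maps, [0.5, 2.0])
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image, which is what the heatmaps in a figure of this kind display.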

Table 6. Transfer Learning Datasets.

Dataset                                   Train Size   Test Size   #Classes
CIFAR-10 (Krizhevsky & Hinton, 2009)          50,000      10,000         10
CIFAR-100 (Krizhevsky & Hinton, 2009)         50,000      10,000        100
Birdsnap (Berg et al., 2014)                  47,386       2,443        500
Stanford Cars (Krause et al., 2013)            8,144       8,041        196
Flowers (Nilsback & Zisserman, 2008)           2,040       6,149        102
FGVC Aircraft (Maji et al., 2013)              6,667       3,333        100
Oxford-IIIT Pets (Parkhi et al., 2012)         3,680       3,369         37
Food-101 (Bossard et al., 2014)               75,750      25,250        101

5.3. Transfer Learning Results for EfficientNet

We have also evaluated our EfficientNet on a list of commonly
used transfer learning datasets, as shown in Table 6. We borrow
the same training settings from (Kornblith et al., 2019) and
(Huang et al., 2018), which take ImageNet pretrained checkpoints
and finetune on new datasets.

Table 5 shows the transfer learning performance: (1) Compared to
publicly available models, such as NASNet-A (Zoph et al., 2018)
and Inception-v4 (Szegedy et al., 2017), our EfficientNet models
achieve better accuracy with 4.7x average (up to 21x) parameter
reduction. (2) Compared to state-of-the-art models, including
DAT (Ngiam et al., 2018) that dynamically synthesizes training
data and GPipe (Huang et al., 2018) that is trained with
specialized pipeline parallelism, our EfficientNet models still
surpass their accuracy in 5 out of 8 datasets, while using 9.6x
fewer parameters.

Figure 6 compares the accuracy-parameters curve for a variety of
models. In general, our EfficientNets consistently achieve better
accuracy with an order of magnitude fewer parameters than existing
models, including ResNet (He et al., 2016), DenseNet (Huang et al.,
2017), Inception (Szegedy et al., 2017), and NASNet (Zoph et al.,
2018).

6. Discussion

To disentangle the contribution of our proposed scaling method
from the EfficientNet architecture, Figure 8 compares the ImageNet
performance of different scaling methods for the same
EfficientNet-B0 baseline network. In general, all scaling methods
improve accuracy at the cost of more FLOPS, but our compound
scaling method improves accuracy further, by up to 2.5%, over the
single-dimension scaling methods, suggesting the importance of our
proposed compound scaling.

[Figure 8 plot: ImageNet Top-1 Accuracy (%) versus FLOPS (Billions); curves: scale by width, scale by depth, scale by resolution, compound scaling.]

Figure 8. Scaling Up EfficientNet-B0 with Different Methods.

In order to further understand why our compound scaling method is
better than others, Figure 7 compares the class activation map
(Zhou et al., 2016) for a few representative models with different
scaling methods. All these models are scaled from the same
baseline, and their statistics are shown in Table 7. Images are
randomly picked from the ImageNet validation set. As shown in the
figure, the model with compound scaling tends to focus on more
relevant regions with more object details, while the other models
either lack object details or are unable to capture all objects in
the images.

Table 7. Scaled Models Used in Figure 7.

Model                                      FLOPS   Top-1 Acc.
Baseline model (EfficientNet-B0)            0.4B        77.3%
Scale model by depth (d=4)                  1.8B        79.0%
Scale model by width (w=2)                  1.8B        78.9%
Scale model by resolution (r=2)             1.9B        79.1%
Compound Scale (d=1.4, w=1.2, r=1.3)        1.8B        81.1%
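
The comparison in Table 7 can be checked with a few lines of arithmetic. The sketch below copies the FLOPS and top-1 numbers from Table 7 and computes each scaled model's accuracy gain over the EfficientNet-B0 baseline; the dictionary layout and the helper name `gains_over_baseline` are illustrative choices, not part of the paper.

```python
# (FLOPS in billions, top-1 accuracy in percent), copied from Table 7.
TABLE_7 = {
    "baseline": (0.4, 77.3),
    "depth (d=4)": (1.8, 79.0),
    "width (w=2)": (1.8, 78.9),
    "resolution (r=2)": (1.9, 79.1),
    "compound (d=1.4, w=1.2, r=1.3)": (1.8, 81.1),
}


def gains_over_baseline(table, baseline_key="baseline"):
    """Top-1 gain (percentage points) of each scaled model over the baseline."""
    _, base_acc = table[baseline_key]
    return {
        name: round(acc - base_acc, 1)
        for name, (_, acc) in table.items()
        if name != baseline_key
    }


gains = gains_over_baseline(TABLE_7)
compound_gain = gains["compound (d=1.4, w=1.2, r=1.3)"]
best_single = max(v for k, v in gains.items() if not k.startswith("compound"))
# Margin of compound scaling over the best single-dimension model.
margin = round(compound_gain - best_single, 1)
```

At roughly 1.8B FLOPS, the single-dimension models gain between 1.6 and 1.8 points over the baseline while the compound-scaled model gains 3.8, a margin of 2.0 points for these particular Table 7 models; the "up to 2.5%" figure quoted in the text refers to the curves in Figure 8.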

7. Conclusion

In this paper, we systematically study ConvNet scaling and
identify that carefully balancing network width, depth, and
resolution is an important but missing piece, preventing us
from better accuracy and efficiency. To address this issue,
we propose a simple and highly effective compound scaling
method, which enables us to easily scale up a baseline ConvNet
to any target resource constraints in a more principled
way, while maintaining model efficiency. Powered by this
compound scaling method, we demonstrate that a mobile-size
EfficientNet model can be scaled up very effectively,
surpassing state-of-the-art accuracy with an order of magnitude
fewer parameters and FLOPS, on both ImageNet and
five commonly used transfer learning datasets.

Acknowledgements

We thank Ruoming Pang, Vijay Vasudevan, Alok Aggarwal,
Barret Zoph, Hongkun Yu, Jonathon Shlens, Raphael Gontijo
Lopes, Yifeng Lu, Daiyi Peng, Xiaodan Song, Samy
Bengio, Jeff Dean, and the Google Brain team for their help.

Appendix

Since 2017, most research papers only report and compare
ImageNet validation accuracy; this paper also follows this
convention for better comparison. In addition, we have
also verified the test accuracy by submitting our predictions
on the 100k test set images to http://image-net.org;
results are in Table 8. As expected, the test accuracy is very
close to the validation accuracy.

Table 8. ImageNet Validation vs. Test Top-1/5 Accuracy.
              B0      B1      B2      B3      B4      B5      B6      B7
  Val top1   77.11   79.13   80.07   81.59   82.89   83.60   83.95   84.26
 Test top1   77.23   79.17   80.16   81.72   82.94   83.69   84.04   84.33
  Val top5   93.35   94.47   94.90   95.67   96.37   96.71   96.76   96.97
 Test top5   93.45   94.43   94.98   95.70   96.27   96.64   96.86   96.94

References

Berg, T., Liu, J., Woo Lee, S., Alexander, M. L., Jacobs,
  D. W., and Belhumeur, P. N. Birdsnap: Large-scale
  fine-grained visual categorization of birds. CVPR, pp.
  2011–2018, 2014.

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 –
  mining discriminative components with random forests.
  ECCV, pp. 446–461, 2014.

Cai, H., Zhu, L., and Han, S. Proxylessnas: Direct neural
  architecture search on target task and hardware. ICLR,
  2019.

Chollet, F. Xception: Deep learning with depthwise separable
  convolutions. CVPR, pp. 1610–02357, 2017.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le,
  Q. V. Autoaugment: Learning augmentation policies
  from data. CVPR, 2019.

Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted
  linear units for neural network function approximation
  in reinforcement learning. Neural Networks, 107:3–11,
  2018.

Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P.,
  Zhao, S., and Keutzer, K. Squeezenext: Hardware-aware
  neural network design. ECV Workshop at CVPR'18,
  2018.

Han, S., Mao, H., and Dally, W. J. Deep compression:
  Compressing deep neural networks with pruning, trained
  quantization and huffman coding. ICLR, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual
  learning for image recognition. CVPR, pp. 770–778,
  2016.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask
  r-cnn. ICCV, pp. 2980–2988, 2017.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S.
  Amc: Automl for model compression and acceleration
  on mobile devices. ECCV, 2018.

Hendrycks, D. and Gimpel, K. Gaussian error linear units
  (gelus). arXiv preprint arXiv:1606.08415, 2016.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,
  W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets:
  Efficient convolutional neural networks for mobile vision
  applications. arXiv preprint arXiv:1704.04861, 2017.

Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation
  networks. CVPR, 2018.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger,
  K. Q. Deep networks with stochastic depth. ECCV, pp.
  646–661, 2016.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,
  K. Q. Densely connected convolutional networks. CVPR,
  2017.

Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le,
  Q. V., and Chen, Z. Gpipe: Efficient training of giant
  neural networks using pipeline parallelism. arXiv preprint
  arXiv:1808.07233, 2018.

Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K.,
  Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level
  accuracy with 50x fewer parameters and <0.5 mb model
  size. arXiv preprint arXiv:1602.07360, 2016.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating
  deep network training by reducing internal covariate shift.
  ICML, pp. 448–456, 2015.

Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet
  models transfer better? CVPR, 2019.

Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a
  large-scale dataset of fine-grained cars. Second Workshop
  on Fine-Grained Visual Categorization, 2013.

Krizhevsky, A. and Hinton, G. Learning multiple layers of
  features from tiny images. Technical Report, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
  classification with deep convolutional neural networks.
  In NIPS, pp. 1097–1105, 2012.

Lin, H. and Jegelka, S. Resnet with one-neuron hidden
  layers is a universal approximator. NeurIPS, pp. 6172–
  6181, 2018.

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
  and Belongie, S. Feature pyramid networks for object
  detection. CVPR, 2017.

Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L.,
  Yuille, A., Huang, J., and Murphy, K. Progressive neural
  architecture search. ECCV, 2018.

Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive
  power of neural networks: A view from the width.
  NeurIPS, 2018.

Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2:
  Practical guidelines for efficient cnn architecture design.
  ECCV, 2018.

Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri,
  M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring
  the limits of weakly supervised pretraining. arXiv
  preprint arXiv:1805.00932, 2018.

Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi,
  A. Fine-grained visual classification of aircraft. arXiv
  preprint arXiv:1306.5151, 2013.

Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q. V.,
  and Pang, R. Domain adaptive transfer learning with
  specialist models. arXiv preprint arXiv:1811.07056, 2018.

Nilsback, M.-E. and Zisserman, A. Automated flower
  classification over a large number of classes. ICVGIP, pp.
  722–729, 2008.

Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C.
  Cats and dogs. CVPR, pp. 3498–3505, 2012.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and
  Sohl-Dickstein, J. On the expressive power of deep neural
  networks. ICML, 2017.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for
  activation functions. arXiv preprint arXiv:1710.05941,
  2018.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized
  evolution for image classifier architecture search.
  AAAI, 2019.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
  Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
  M., et al. Imagenet large scale visual recognition
  challenge. International Journal of Computer Vision, 115(3):
  211–252, 2015.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
  Chen, L.-C. Mobilenetv2: Inverted residuals and linear
  bottlenecks. CVPR, 2018.

Sharir, O. and Shashua, A. On the expressive power of
  overlapping architectures of deep learning. ICLR, 2018.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
  and Salakhutdinov, R. Dropout: a simple way to prevent
  neural networks from overfitting. The Journal of Machine
  Learning Research, 15(1):1929–1958, 2014.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
  Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich,
  A. Going deeper with convolutions. CVPR, pp. 1–9,
  2015.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,
  Z. Rethinking the inception architecture for computer
  vision. CVPR, pp. 2818–2826, 2016.

Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A.
  Inception-v4, inception-resnet and the impact of residual
  connections on learning. AAAI, 4:12, 2017.

Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M.,
  Howard, A., and Le, Q. V. MnasNet: Platform-aware
  neural architecture search for mobile. CVPR, 2019.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K.
  Aggregated residual transformations for deep neural networks.
  CVPR, pp. 5987–5995, 2017.

Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze,
  V., and Adam, H. Netadapt: Platform-aware neural
  network adaptation for mobile applications. ECCV, 2018.

Zagoruyko, S. and Komodakis, N. Wide residual networks.
  BMVC, 2016.

Zhang, X., Li, Z., Loy, C. C., and Lin, D. Polynet: A pursuit
  of structural diversity in very deep networks. CVPR, pp.
  3900–3908, 2017.

Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An
  extremely efficient convolutional neural network for mobile
  devices. CVPR, 2018.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba,
  A. Learning deep features for discriminative localization.
  CVPR, pp. 2921–2929, 2016.

Zoph, B. and Le, Q. V. Neural architecture search with
  reinforcement learning. ICLR, 2017.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning
  transferable architectures for scalable image recognition.
  CVPR, 2018.