Xception: Deep Learning with Depthwise Separable Convolutions

François Chollet
Google, Inc.
fchollet@google.com

arXiv:1610.02357v3 [cs.CV] 4 Apr 2017

Abstract

We present an interpretation of Inception modules in convolutional neural networks as being an
intermediate step in-between regular convolution and the depthwise separable convolution operation
(a depthwise convolution followed by a pointwise convolution). In this light, a depthwise separable
convolution can be understood as an Inception module with a maximally large number of towers. This
observation leads us to propose a novel deep convolutional neural network architecture inspired by
Inception, where Inception modules have been replaced with depthwise separable convolutions. We
show that this architecture, dubbed Xception, slightly outperforms Inception V3 on the ImageNet
dataset (which Inception V3 was designed for), and significantly outperforms Inception V3 on a
larger image classification dataset comprising 350 million images and 17,000 classes. Since the
Xception architecture has the same number of parameters as Inception V3, the performance gains are
not due to increased capacity but rather to a more efficient use of model parameters.

1. Introduction

Convolutional neural networks have emerged as the master algorithm in computer vision in recent
years, and developing recipes for designing them has been a subject of considerable attention. The
history of convolutional neural network design started with LeNet-style models [10], which were
simple stacks of convolutions for feature extraction and max-pooling operations for spatial
sub-sampling. In 2012, these ideas were refined into the AlexNet architecture [9], where
convolution operations were being repeated multiple times in-between max-pooling operations,
allowing the network to learn richer features at every spatial scale. What followed was a trend to
make this style of network increasingly deep, mostly driven by the yearly ILSVRC competition;
first with Zeiler and Fergus in 2013 [25] and then with the VGG architecture in 2014 [18].

At this point a new style of network emerged, the Inception architecture, introduced by Szegedy et
al. in 2014 [20] as GoogLeNet (Inception V1), later refined as Inception V2 [7], Inception V3 [21],
and most recently Inception-ResNet [19]. Inception itself was inspired by the earlier
Network-In-Network architecture [11]. Since its first introduction, Inception has been one of the
best performing families of models on the ImageNet dataset [14], as well as internal datasets in
use at Google, in particular JFT [5].

The fundamental building block of Inception-style models is the Inception module, of which several
different versions exist. In figure 1 we show the canonical form of an Inception module, as found
in the Inception V3 architecture. An Inception model can be understood as a stack of such modules.
This is a departure from earlier VGG-style networks which were stacks of simple convolution layers.

While Inception modules are conceptually similar to convolutions (they are convolutional feature
extractors), they empirically appear to be capable of learning richer representations with fewer
parameters. How do they work, and how do they differ from regular convolutions? What design
strategies come after Inception?

1.1. The Inception hypothesis

A convolution layer attempts to learn filters in a 3D space, with 2 spatial dimensions (width and
height) and a channel dimension; thus a single convolution kernel is tasked with simultaneously
mapping cross-channel correlations and spatial correlations.

The idea behind the Inception module is to make this process easier and more efficient by
explicitly factoring it into a series of operations that would independently look at cross-channel
correlations and at spatial correlations. More precisely, the typical Inception module first looks
at cross-channel correlations via a set of 1x1 convolutions, mapping the input data into 3 or 4
separate spaces that are smaller than the original input space, and then maps all correlations in
these smaller 3D spaces, via regular 3x3 or 5x5 convolutions. This is illustrated in figure 1. In
effect, the fundamental hypothesis behind Inception is that cross-channel correlations and spatial
correlations are sufficiently decoupled that it is preferable not to map them jointly¹.

   ¹ A variant of the process is to independently look at width-wise correlations and height-wise
   correlations. This is implemented by some of the modules found in Inception V3, which alternate
   7x1 and 1x7 convolutions. The use of such spatially separable convolutions has a long history in
   image processing and has been used in some convolutional neural network implementations since at
   least 2012 (possibly earlier).

Figure 1. A canonical Inception module (Inception V3).

Consider a simplified version of an Inception module that only uses one size of convolution (e.g.
3x3) and does not include an average pooling tower (figure 2). This Inception module can be
reformulated as a large 1x1 convolution followed by spatial convolutions that would operate on
non-overlapping segments of the output channels (figure 3). This observation naturally raises the
question: what is the effect of the number of segments in the partition (and their size)? Would it
be reasonable to make a much stronger hypothesis than the Inception hypothesis, and assume that
cross-channel correlations and spatial correlations can be mapped completely separately?

Figure 2. A simplified Inception module.

Figure 3. A strictly equivalent reformulation of the simplified Inception module.

1.2. The continuum between convolutions and separable convolutions

An “extreme” version of an Inception module, based on this stronger hypothesis, would first use a
1x1 convolution to map cross-channel correlations, and would then separately map the spatial
correlations of every output channel. This is shown in figure 4. We remark that this extreme form
of an Inception module is almost identical to a depthwise separable convolution, an operation that
has been used in neural network design as early as 2014 [15] and has become more popular since its
inclusion in the TensorFlow framework [1] in 2016.

Figure 4. An “extreme” version of our Inception module, with one spatial convolution per output
channel of the 1x1 convolution.

A depthwise separable convolution, commonly called “separable convolution” in deep learning
frameworks such as TensorFlow and Keras, consists of a depthwise convolution, i.e. a spatial
convolution performed independently over each channel of an input, followed by a pointwise
convolution, i.e. a 1x1 convolution, projecting the channels output by the depthwise convolution
onto a new channel space. This is not to be confused with a spatially separable convolution, which
is also commonly called “separable convolution” in the image processing community.

Two minor differences between an “extreme” version of an Inception module and a depthwise separable
convolution would be:

  • The order of the operations: depthwise separable convolutions as usually implemented (e.g. in
    TensorFlow) perform first channel-wise spatial convolution and then perform 1x1 convolution,
    whereas Inception performs the 1x1 convolution first.

  • The presence or absence of a non-linearity after the first operation. In Inception, both
    operations are followed by a ReLU non-linearity, however depthwise
    separable convolutions are usually implemented without non-linearities.

We argue that the first difference is unimportant, in particular because these operations are meant
to be used in a stacked setting. The second difference might matter, and we investigate it in the
experimental section (in particular see figure 10).

We also note that other intermediate formulations of Inception modules that lie in between regular
Inception modules and depthwise separable convolutions are possible: in effect, there is a discrete
spectrum between regular convolutions and depthwise separable convolutions, parametrized by the
number of independent channel-space segments used for performing spatial convolutions. A regular
convolution (preceded by a 1x1 convolution), at one extreme of this spectrum, corresponds to the
single-segment case; a depthwise separable convolution corresponds to the other extreme where there
is one segment per channel; Inception modules lie in between, dividing a few hundred channels into
3 or 4 segments. The properties of such intermediate modules appear not to have been explored yet.

Having made these observations, we suggest that it may be possible to improve upon the Inception
family of architectures by replacing Inception modules with depthwise separable convolutions, i.e.
by building models that would be stacks of depthwise separable convolutions. This is made practical
by the efficient depthwise convolution implementation available in TensorFlow. In what follows, we
present a convolutional neural network architecture based on this idea, with a similar number of
parameters as Inception V3, and we evaluate its performance against Inception V3 on two large-scale
image classification tasks.

2. Prior work

The present work relies heavily on prior efforts in the following areas:

  • Convolutional neural networks [10, 9, 25], in particular the VGG-16 architecture [18], which is
    schematically similar to our proposed architecture in a few respects.

  • The Inception architecture family of convolutional neural networks [20, 7, 21, 19], which first
    demonstrated the advantages of factoring convolutions into multiple branches operating
    successively on channels and then on space.

  • Depthwise separable convolutions, which our proposed architecture is entirely based upon. While
    the use of spatially separable convolutions in neural networks has a long history, going back to
    at least 2012 [12] (but likely even earlier), the depthwise version is more recent. Laurent
    Sifre developed depthwise separable convolutions during an internship at Google Brain in 2013,
    and used them in AlexNet to obtain small gains in accuracy and large gains in convergence speed,
    as well as a significant reduction in model size. An overview of his work was first made public
    in a presentation at ICLR 2014 [23]. Detailed experimental results are reported in Sifre’s
    thesis, section 6.2 [15]. This initial work on depthwise separable convolutions was inspired by
    prior research from Sifre and Mallat on transformation-invariant scattering [16, 15]. Later, a
    depthwise separable convolution was used as the first layer of Inception V1 and Inception V2
    [20, 7]. Within Google, Andrew Howard [6] has introduced efficient mobile models called
    MobileNets using depthwise separable convolutions. Jin et al. in 2014 [8] and Wang et al. in
    2016 [24] also did related work aiming at reducing the size and computational cost of
    convolutional neural networks using separable convolutions. Additionally, our work is only
    possible due to the inclusion of an efficient implementation of depthwise separable
    convolutions in the TensorFlow framework [1].

  • Residual connections, introduced by He et al. in [4], which our proposed architecture uses
    extensively.

3. The Xception architecture

We propose a convolutional neural network architecture based entirely on depthwise separable
convolution layers. In effect, we make the following hypothesis: that the mapping of cross-channel
correlations and spatial correlations in the feature maps of convolutional neural networks can be
entirely decoupled. Because this hypothesis is a stronger version of the hypothesis underlying the
Inception architecture, we name our proposed architecture Xception, which stands for “Extreme
Inception”.

A complete description of the specifications of the network is given in figure 5. The Xception
architecture has 36 convolutional layers forming the feature extraction base of the network. In our
experimental evaluation we will exclusively investigate image classification and therefore our
convolutional base will be followed by a logistic regression layer. Optionally one may insert
fully-connected layers before the logistic regression layer, which is explored in the experimental
evaluation section (in particular, see figures 7 and 8). The 36 convolutional layers are structured
into 14 modules, all of which have linear residual connections around them, except for the first
and last modules.

In short, the Xception architecture is a linear stack of depthwise separable convolution layers
with residual connections. This makes the architecture very easy to define and modify; it takes
only 30 to 40 lines of code using a high-level library such as Keras [2] or TensorFlow-Slim [17],
not unlike an architecture such as VGG-16 [18], but rather unlike architectures such as Inception
V2 or V3 which are far more complex to define. An open-source implementation of Xception using
Keras and TensorFlow is provided as part of the Keras Applications module², under the MIT license.

   ² https://keras.io/applications/#xception

4. Experimental evaluation

We choose to compare Xception to the Inception V3 architecture, due to their similarity of scale:
Xception and Inception V3 have nearly the same number of parameters (table 3), and thus any
performance gap could not be attributed to a difference in network capacity. We conduct our
comparison on two image classification tasks: one is the well-known 1000-class single-label
classification task on the ImageNet dataset [14], and the other is a 17,000-class multi-label
classification task on the large-scale JFT dataset.

4.1. The JFT dataset

JFT is an internal Google dataset for large-scale image classification, first introduced by Hinton
et al. in [5], which comprises over 350 million high-resolution images annotated with labels from a
set of 17,000 classes. To evaluate the performance of a model trained on JFT, we use an auxiliary
dataset, FastEval14k.

FastEval14k is a dataset of 14,000 images with dense annotations from about 6,000 classes (36.5
labels per image on average). On this dataset we evaluate performance using Mean Average Precision
for top 100 predictions (MAP@100), and we weight the contribution of each class to MAP@100 with a
score estimating how common (and therefore important) the class is among social media images. This
evaluation procedure is meant to capture performance on frequently occurring labels from social
media, which is crucial for production models at Google.

4.2. Optimization configuration

A different optimization configuration was used for ImageNet and JFT:

  • On ImageNet:

        – Optimizer: SGD
        – Momentum: 0.9
        – Initial learning rate: 0.045
        – Learning rate decay: decay of rate 0.94 every 2 epochs

  • On JFT:

        – Optimizer: RMSprop [22]
        – Momentum: 0.9
        – Initial learning rate: 0.001
        – Learning rate decay: decay of rate 0.9 every 3,000,000 samples

For both datasets, the exact same optimization configuration was used for both Xception and
Inception V3. Note that this configuration was tuned for best performance with Inception V3; we did
not attempt to tune optimization hyperparameters for Xception. Since the networks have different
training profiles (figure 6), this may be suboptimal, especially on the ImageNet dataset, on which
the optimization configuration used had been carefully tuned for Inception V3.

Additionally, all models were evaluated using Polyak averaging [13] at inference time.

4.3. Regularization configuration

  • Weight decay: The Inception V3 model uses a weight decay (L2 regularization) rate of 4e-5,
    which has been carefully tuned for performance on ImageNet. We found this rate to be quite
    suboptimal for Xception and instead settled for 1e-5. We did not perform an extensive search
    for the optimal weight decay rate. The same weight decay rates were used both for the ImageNet
    experiments and the JFT experiments.

  • Dropout: For the ImageNet experiments, both models include a dropout layer of rate 0.5 before
    the logistic regression layer. For the JFT experiments, no dropout was included due to the
    large size of the dataset which made overfitting unlikely in any reasonable amount of time.

  • Auxiliary loss tower: The Inception V3 architecture may optionally include an auxiliary tower
    which back-propagates the classification loss earlier in the network, serving as an additional
    regularization mechanism. For simplicity, we choose not to include this auxiliary tower in any
    of our models.

4.4. Training infrastructure

All networks were implemented using the TensorFlow framework [1] and trained on 60 NVIDIA K80 GPUs
each. For the ImageNet experiments, we used data parallelism with synchronous gradient descent to
achieve the best classification performance, while for JFT we used asynchronous gradient descent so
as to speed up training. The ImageNet experiments took approximately 3 days each, while the JFT
experiments took over one month each. The JFT models were not trained to full convergence, which
would have taken over three months per experiment.
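
As a rough illustration of the learning rate schedules listed in section 4.2, the following Python sketch computes the rate at a given point in training. It assumes a staircase decay (the rate drops in discrete steps rather than continuously), which the text does not specify; the function and parameter names are hypothetical and are not taken from the actual training code.

```python
# Sketch of the step-decay schedules from section 4.2.
# Assumption: staircase decay, i.e. the rate changes only at multiples of the decay interval.


def imagenet_learning_rate(epoch, initial_lr=0.045, decay_rate=0.94, decay_every_epochs=2):
    """SGD schedule on ImageNet: multiply the rate by 0.94 once every 2 epochs."""
    completed_steps = epoch // decay_every_epochs
    return initial_lr * decay_rate ** completed_steps


def jft_learning_rate(samples_seen, initial_lr=0.001, decay_rate=0.9,
                      decay_every_samples=3_000_000):
    """RMSprop schedule on JFT: multiply the rate by 0.9 once every 3,000,000 samples."""
    completed_steps = samples_seen // decay_every_samples
    return initial_lr * decay_rate ** completed_steps


if __name__ == "__main__":
    for epoch in (0, 2, 10, 50):
        print("ImageNet epoch", epoch, "->", imagenet_learning_rate(epoch))
    for samples in (0, 3_000_000, 30_000_000):
        print("JFT samples", samples, "->", jft_learning_rate(samples))
```

Under this assumption the ImageNet rate stays at 0.045 for epochs 0 and 1 and drops to 0.045 × 0.94 at epoch 2. A continuous exponential decay with the same constants would be an equally plausible reading of the configuration.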
Figure 5. The Xception architecture: the data first goes through the entry flow, then through the
middle flow which is repeated eight times, and finally through the exit flow. Note that all
Convolution and SeparableConvolution layers are followed by batch normalization [7] (not included in
the diagram). All SeparableConvolution layers use a depth multiplier of 1 (no depth expansion).

4.5. Comparison with Inception V3

4.5.1   Classification performance

All evaluations were run with a single crop of the input images and a single model. ImageNet
results are reported on the validation set rather than the test set (i.e. on the non-blacklisted
images from the validation set of ILSVRC 2012). JFT results are reported after 30 million
iterations (one month of training) rather than after full convergence. Results are provided in
table 1 and table 2, as well as figure 6, figure 7, and figure 8. On JFT, we tested both versions
of our networks that did not include any fully-connected layers, and versions that included two
fully-connected layers of 4096 units each before the logistic regression layer.

On ImageNet, Xception shows marginally better results than Inception V3. On JFT, Xception shows a
4.3% relative improvement on the FastEval14k MAP@100 metric. We also note that Xception outperforms
ImageNet results reported by He et al. for ResNet-50, ResNet-101 and ResNet-152 [4].

Table 1. Classification performance comparison on ImageNet (single crop, single model). VGG-16 and
ResNet-152 numbers are only included as a reminder. The version of Inception V3 being benchmarked
does not include the auxiliary tower.

                    Top-1 accuracy    Top-5 accuracy
   VGG-16               0.715             0.901
   ResNet-152           0.770             0.933
   Inception V3         0.782             0.941
   Xception             0.790             0.945

Table 2. Classification performance comparison on JFT (single crop, single model).

                                   FastEval14k MAP@100
   Inception V3 - no FC layers            6.36
   Xception - no FC layers                6.70
   Inception V3 with FC layers            6.50
   Xception with FC layers                6.78

The Xception architecture shows a much larger performance improvement on the JFT dataset compared
to the ImageNet dataset. We believe this may be due to the fact that Inception V3 was developed
with a focus on ImageNet and may thus be by design over-fit to this specific task. On the other
hand, neither architecture was tuned for JFT. It is likely that a search for better hyperparameters
for Xception on ImageNet (in particular optimization parameters and regularization parameters)
would yield significant additional improvement.

Figure 6. Training profile on ImageNet

Figure 7. Training profile on JFT, without fully-connected layers

Figure 8. Training profile on JFT, with fully-connected layers

4.5.2   Size and speed

Table 3. Size and training speed comparison.

                    Parameter count    Steps/second
   Inception V3       23,626,728            31
   Xception           22,855,952            28

In table 3 we compare the size and speed of Inception V3 and Xception. Parameter count is reported
on ImageNet (1000 classes, no fully-connected layers) and the number of training steps (gradient
updates) per second is reported on ImageNet with 60 K80 GPUs running synchronous gradient descent.
Both architectures have approximately the same size (within 3.5%), and Xception is marginally
slower. We expect that engineering optimizations at the level of the depthwise convolution
operations can make Xception faster than Inception V3 in the near future. The fact that both
architectures have almost the same number of parameters indicates that the improvement seen on
ImageNet and JFT does not come from added capacity but rather from a more efficient use of the
model parameters.

4.6. Effect of the residual connections

Figure 9. Training profile with and without residual connections.

To quantify the benefits of residual connections in the Xception architecture, we benchmarked on
ImageNet a modified version of Xception that does not include any residual
connections. Results are shown in figure 9. Residual connections are clearly essential in helping
with convergence, both in terms of speed and final classification performance. However we will note
that benchmarking the non-residual model with the same optimization configuration as the residual
model may be uncharitable and that better optimization configurations might yield more competitive
results.

Additionally, let us note that this result merely shows the importance of residual connections for
this specific architecture, and that residual connections are in no way required in order to build
models that are stacks of depthwise separable convolutions. We also obtained excellent results with
non-residual VGG-style models where all convolution layers were replaced with depthwise separable
convolutions (with a depth multiplier of 1), superior to Inception V3 on JFT at equal parameter
count.

4.7. Effect of an intermediate activation after pointwise convolutions

[...] found in Inception modules) the non-linearity is helpful, but for shallow ones (e.g. the
1-channel deep feature spaces of depthwise separable convolutions) it becomes harmful, possibly due
to a loss of information.

5. Future directions

We noted earlier the existence of a discrete spectrum between regular convolutions and depthwise
separable convolutions, parametrized by the number of independent channel-space segments used for
performing spatial convolutions. Inception modules are one point on this spectrum. We showed in our
empirical evaluation that the extreme formulation of an Inception module, the depthwise separable
convolution, may have advantages over a regular Inception module. However, there is no reason to
believe that depthwise separable convolutions are optimal. It may be that intermediate points on
the spectrum, lying between regular Inception modules and depthwise separable convolutions, hold
further advantages. This question is left for future investigation.
Figure 10. Training profile with different activations between the
depthwise and pointwise operations of the separable convolution      6. Conclusions
layers.
                                                                        We showed how convolutions and depthwise separable
                                                                     convolutions lie at both extremes of a discrete spectrum,
                                                                     with Inception modules being an intermediate point in be-
                                                                     tween. This observation has led to us to propose replacing
                                                                     Inception modules with depthwise separable convolutions in
                                                                     neural computer vision architectures. We presented a novel
                                                                     architecture based on this idea, named Xception, which has
                                                                     a similar parameter count as Inception V3. Compared to
                                                                     Inception V3, Xception shows small gains in classification
                                                                     performance on the ImageNet dataset and large gains on the
                                                                     JFT dataset. We expect depthwise separable convolutions
                                                                     to become a cornerstone of convolutional neural network
                                                                     architecture design in the future, since they offer similar
                                                                     properties as Inception modules, yet are as easy to use as
                                                                     regular convolution layers.

   We mentioned earlier that the analogy between depth-              References
wise separable convolutions and Inception modules suggests
that depthwise separable convolutions should potentially in-          [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,
clude a non-linearity between the depthwise and pointwise                 C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghe-
operations. In the experiments reported so far, no such non-              mawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia,
                                                                          R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané,
linearity was included. However we also experimentally
                                                                          R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster,
tested the inclusion of either ReLU or ELU [3] as intermedi-
                                                                          J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Van-
ate non-linearity. Results are reported on ImageNet in figure             houcke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden,
10, and show that the absence of any non-linearity leads to               M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. Tensor-
both faster convergence and better final performance.                     Flow: Large-scale machine learning on heterogeneous sys-
   This is a remarkable observation, since Szegedy et al. re-             tems, 2015. Software available from tensorflow.org.
port the opposite result in [21] for Inception modules. It may        [2] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
be that the depth of the intermediate feature spaces on which         [3] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and
spatial convolutions are applied is critical to the usefulness            accurate deep network learning by exponential linear units
of the non-linearity: for deep feature spaces (e.g. those                 (elus). arXiv preprint arXiv:1511.07289, 2015.
 [4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 [5] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network, 2015.
 [6] A. Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications. Forthcoming.
 [7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.
 [8] J. Jin, A. Dundar, and E. Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014.
 [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[10] Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard, et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural networks: the statistical mechanics perspective, 261:276, 1995.
[11] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[12] F. Mamalet and C. Garcia. Simplifying ConvNets for Fast Learning. In International Conference on Artificial Neural Networks (ICANN 2012), pages 58–65. Springer, 2012.
[13] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. 2014.
[15] L. Sifre. Rigid-motion scattering for image classification, 2014. Ph.D. thesis.
[16] L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 1233–1240, 2013.
[17] N. Silberman and S. Guadarrama. Tf-slim, 2016.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[22] T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012. Accessed: 2015-11-05.
[23] V. Vanhoucke. Learning visual representations at scale. ICLR, 2014.
[24] M. Wang, B. Liu, and H. Foroosh. Factorized convolutional neural networks. arXiv preprint arXiv:1608.04337, 2016.
[25] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.