Xception: Deep Learning with Depthwise Separable Convolutions

François Chollet
Google, Inc.
fchollet@google.com

arXiv:1610.02357v3 [cs.CV] 4 Apr 2017
Abstract

We present an interpretation of Inception modules in convolutional neural networks as being an intermediate step in-between regular convolution and the depthwise separable convolution operation (a depthwise convolution followed by a pointwise convolution). In this light, a depthwise separable convolution can be understood as an Inception module with a maximally large number of towers. This observation leads us to propose a novel deep convolutional neural network architecture inspired by Inception, where Inception modules have been replaced with depthwise separable convolutions. We show that this architecture, dubbed Xception, slightly outperforms Inception V3 on the ImageNet dataset (which Inception V3 was designed for), and significantly outperforms Inception V3 on a larger image classification dataset comprising 350 million images and 17,000 classes. Since the Xception architecture has the same number of parameters as Inception V3, the performance gains are not due to increased capacity but rather to a more efficient use of model parameters.

1. Introduction

Convolutional neural networks have emerged as the master algorithm in computer vision in recent years, and developing recipes for designing them has been a subject of considerable attention. The history of convolutional neural network design started with LeNet-style models [10], which were simple stacks of convolutions for feature extraction and max-pooling operations for spatial sub-sampling. In 2012, these ideas were refined into the AlexNet architecture [9], where convolution operations were repeated multiple times in-between max-pooling operations, allowing the network to learn richer features at every spatial scale. What followed was a trend to make this style of network increasingly deeper, mostly driven by the yearly ILSVRC competition; first with Zeiler and Fergus in 2013 [25] and then with the VGG architecture in 2014 [18].

At this point a new style of network emerged, the Inception architecture, introduced by Szegedy et al. in 2014 [20] as GoogLeNet (Inception V1), later refined as Inception V2 [7], Inception V3 [21], and most recently Inception-ResNet [19]. Inception itself was inspired by the earlier Network-In-Network architecture [11]. Since its first introduction, Inception has been one of the best performing families of models on the ImageNet dataset [14], as well as internal datasets in use at Google, in particular JFT [5].

The fundamental building block of Inception-style models is the Inception module, of which several different versions exist. In figure 1 we show the canonical form of an Inception module, as found in the Inception V3 architecture. An Inception model can be understood as a stack of such modules. This is a departure from earlier VGG-style networks which were stacks of simple convolution layers.

While Inception modules are conceptually similar to convolutions (they are convolutional feature extractors), they empirically appear to be capable of learning richer representations with fewer parameters. How do they work, and how do they differ from regular convolutions? What design strategies come after Inception?

1.1. The Inception hypothesis

A convolution layer attempts to learn filters in a 3D space, with 2 spatial dimensions (width and height) and a channel dimension; thus a single convolution kernel is tasked with simultaneously mapping cross-channel correlations and spatial correlations.

The idea behind the Inception module is to make this process easier and more efficient by explicitly factoring it into a series of operations that would independently look at cross-channel correlations and at spatial correlations. More precisely, the typical Inception module first looks at cross-channel correlations via a set of 1x1 convolutions, mapping the input data into 3 or 4 separate spaces that are smaller than the original input space, and then maps all correlations in these smaller 3D spaces, via regular 3x3 or 5x5 convolutions. This is illustrated in figure 1. In effect, the fundamental hypothesis behind Inception is that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly (see footnote 1).

Footnote 1: A variant of the process is to independently look at width-wise correlations and height-wise correlations. This is implemented by some of the modules found in Inception V3, which alternate 7x1 and 1x7 convolutions. The use of such spatially separable convolutions has a long history in image processing and has been used in some convolutional neural network implementations since at least 2012 (possibly earlier).

Figure 1. A canonical Inception module (Inception V3).

Consider a simplified version of an Inception module that only uses one size of convolution (e.g. 3x3) and does not include an average pooling tower (figure 2). This Inception module can be reformulated as a large 1x1 convolution followed by spatial convolutions that would operate on non-overlapping segments of the output channels (figure 3). This observation naturally raises the question: what is the effect of the number of segments in the partition (and their size)? Would it be reasonable to make a much stronger hypothesis than the Inception hypothesis, and assume that cross-channel correlations and spatial correlations can be mapped completely separately?

Figure 2. A simplified Inception module.

Figure 3. A strictly equivalent reformulation of the simplified Inception module.

1.2. The continuum between convolutions and separable convolutions

An “extreme” version of an Inception module, based on this stronger hypothesis, would first use a 1x1 convolution to map cross-channel correlations, and would then separately map the spatial correlations of every output channel. This is shown in figure 4. We remark that this extreme form of an Inception module is almost identical to a depthwise separable convolution, an operation that has been used in neural network design as early as 2014 [15] and has become more popular since its inclusion in the TensorFlow framework [1] in 2016.

Figure 4. An “extreme” version of our Inception module, with one spatial convolution per output channel of the 1x1 convolution.

A depthwise separable convolution, commonly called “separable convolution” in deep learning frameworks such as TensorFlow and Keras, consists of a depthwise convolution, i.e. a spatial convolution performed independently over each channel of an input, followed by a pointwise convolution, i.e. a 1x1 convolution, projecting the channels output by the depthwise convolution onto a new channel space. This is not to be confused with a spatially separable convolution, which is also commonly called “separable convolution” in the image processing community.

Two minor differences between an “extreme” version of an Inception module and a depthwise separable convolution would be:

• The order of the operations: depthwise separable convolutions as usually implemented (e.g. in TensorFlow) perform first channel-wise spatial convolution and then perform 1x1 convolution, whereas Inception performs the 1x1 convolution first.

• The presence or absence of a non-linearity after the first operation. In Inception, both operations are followed by a ReLU non-linearity, however depthwise
separable convolutions are usually implemented without non-linearities.

We argue that the first difference is unimportant, in particular because these operations are meant to be used in a stacked setting. The second difference might matter, and we investigate it in the experimental section (in particular see figure 10).

We also note that other intermediate formulations of Inception modules that lie in between regular Inception modules and depthwise separable convolutions are also possible: in effect, there is a discrete spectrum between regular convolutions and depthwise separable convolutions, parametrized by the number of independent channel-space segments used for performing spatial convolutions. A regular convolution (preceded by a 1x1 convolution), at one extreme of this spectrum, corresponds to the single-segment case; a depthwise separable convolution corresponds to the other extreme where there is one segment per channel; Inception modules lie in between, dividing a few hundred channels into 3 or 4 segments. The properties of such intermediate modules appear not to have been explored yet.

Having made these observations, we suggest that it may be possible to improve upon the Inception family of architectures by replacing Inception modules with depthwise separable convolutions, i.e. by building models that would be stacks of depthwise separable convolutions. This is made practical by the efficient depthwise convolution implementation available in TensorFlow. In what follows, we present a convolutional neural network architecture based on this idea, with a similar number of parameters as Inception V3, and we evaluate its performance against Inception V3 on two large-scale image classification tasks.

2. Prior work

The present work relies heavily on prior efforts in the following areas:

• Convolutional neural networks [10, 9, 25], in particular the VGG-16 architecture [18], which is schematically similar to our proposed architecture in a few respects.

• The Inception architecture family of convolutional neural networks [20, 7, 21, 19], which first demonstrated the advantages of factoring convolutions into multiple branches operating successively on channels and then on space.

• Depthwise separable convolutions, which our proposed architecture is entirely based upon. While the use of spatially separable convolutions in neural networks has a long history, going back to at least 2012 [12] (but likely even earlier), the depthwise version is more recent. Laurent Sifre developed depthwise separable convolutions during an internship at Google Brain in 2013, and used them in AlexNet to obtain small gains in accuracy and large gains in convergence speed, as well as a significant reduction in model size. An overview of his work was first made public in a presentation at ICLR 2014 [23]. Detailed experimental results are reported in Sifre’s thesis, section 6.2 [15]. This initial work on depthwise separable convolutions was inspired by prior research from Sifre and Mallat on transformation-invariant scattering [16, 15]. Later, a depthwise separable convolution was used as the first layer of Inception V1 and Inception V2 [20, 7]. Within Google, Andrew Howard [6] has introduced efficient mobile models called MobileNets using depthwise separable convolutions. Jin et al. in 2014 [8] and Wang et al. in 2016 [24] also did related work aiming at reducing the size and computational cost of convolutional neural networks using separable convolutions. Additionally, our work is only possible due to the inclusion of an efficient implementation of depthwise separable convolutions in the TensorFlow framework [1].

• Residual connections, introduced by He et al. in [4], which our proposed architecture uses extensively.

3. The Xception architecture

We propose a convolutional neural network architecture based entirely on depthwise separable convolution layers. In effect, we make the following hypothesis: that the mapping of cross-channel correlations and spatial correlations in the feature maps of convolutional neural networks can be entirely decoupled. Because this hypothesis is a stronger version of the hypothesis underlying the Inception architecture, we name our proposed architecture Xception, which stands for “Extreme Inception”.

A complete description of the specifications of the network is given in figure 5. The Xception architecture has 36 convolutional layers forming the feature extraction base of the network. In our experimental evaluation we will exclusively investigate image classification and therefore our convolutional base will be followed by a logistic regression layer. Optionally one may insert fully-connected layers before the logistic regression layer, which is explored in the experimental evaluation section (in particular, see figures 7 and 8). The 36 convolutional layers are structured into 14 modules, all of which have linear residual connections around them, except for the first and last modules.

In short, the Xception architecture is a linear stack of depthwise separable convolution layers with residual connections. This makes the architecture very easy to define and modify; it takes only 30 to 40 lines of code using a high-level library such as Keras [2] or TensorFlow-Slim [17], not unlike an architecture such as VGG-16 [18], but rather unlike architectures such as Inception V2 or V3 which are far more complex to define. An open-source implementation of Xception using Keras and TensorFlow is provided as part of the Keras Applications module (see footnote 2), under the MIT license.

Footnote 2: https://keras.io/applications/#xception

4. Experimental evaluation

We choose to compare Xception to the Inception V3 architecture, due to their similarity of scale: Xception and Inception V3 have nearly the same number of parameters (table 3), and thus any performance gap could not be attributed to a difference in network capacity. We conduct our comparison on two image classification tasks: one is the well-known 1000-class single-label classification task on the ImageNet dataset [14], and the other is a 17,000-class multi-label classification task on the large-scale JFT dataset.

4.1. The JFT dataset

JFT is an internal Google dataset for large-scale image classification, first introduced by Hinton et al. in [5], which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes. To evaluate the performance of a model trained on JFT, we use an auxiliary dataset, FastEval14k.

FastEval14k is a dataset of 14,000 images with dense annotations from about 6,000 classes (36.5 labels per image on average). On this dataset we evaluate performance using Mean Average Precision for top 100 predictions (MAP@100), and we weight the contribution of each class to MAP@100 with a score estimating how common (and therefore important) the class is among social media images. This evaluation procedure is meant to capture performance on frequently occurring labels from social media, which is crucial for production models at Google.

4.2. Optimization configuration

A different optimization configuration was used for ImageNet and JFT:

• On ImageNet:
  – Optimizer: SGD
  – Momentum: 0.9
  – Initial learning rate: 0.045
  – Learning rate decay: decay of rate 0.94 every 2 epochs

• On JFT:
  – Optimizer: RMSprop [22]
  – Momentum: 0.9
  – Initial learning rate: 0.001
  – Learning rate decay: decay of rate 0.9 every 3,000,000 samples

For both datasets, the exact same optimization configuration was used for both Xception and Inception V3. Note that this configuration was tuned for best performance with Inception V3; we did not attempt to tune optimization hyperparameters for Xception. Since the networks have different training profiles (figure 6), this may be suboptimal, especially on the ImageNet dataset, on which the optimization configuration used had been carefully tuned for Inception V3.

Additionally, all models were evaluated using Polyak averaging [13] at inference time.

4.3. Regularization configuration

• Weight decay: The Inception V3 model uses a weight decay (L2 regularization) rate of 4e-5, which has been carefully tuned for performance on ImageNet. We found this rate to be quite suboptimal for Xception and instead settled for 1e-5. We did not perform an extensive search for the optimal weight decay rate. The same weight decay rates were used both for the ImageNet experiments and the JFT experiments.

• Dropout: For the ImageNet experiments, both models include a dropout layer of rate 0.5 before the logistic regression layer. For the JFT experiments, no dropout was included due to the large size of the dataset which made overfitting unlikely in any reasonable amount of time.

• Auxiliary loss tower: The Inception V3 architecture may optionally include an auxiliary tower which back-propagates the classification loss earlier in the network, serving as an additional regularization mechanism. For simplicity, we choose not to include this auxiliary tower in any of our models.

4.4. Training infrastructure

All networks were implemented using the TensorFlow framework [1] and trained on 60 NVIDIA K80 GPUs each. For the ImageNet experiments, we used data parallelism with synchronous gradient descent to achieve the best classification performance, while for JFT we used asynchronous gradient descent so as to speed up training. The ImageNet experiments took approximately 3 days each, while the JFT experiments took over one month each. The JFT models were not trained to full convergence, which would have taken over three months per experiment.
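
The learning rate schedules listed in Section 4.2 can be made concrete with a short sketch. The text gives only the initial rate, the decay factor and the decay interval; the staircase shape used below (the rate drops once per completed interval) is an assumption, and the function names are hypothetical rather than taken from any library.

```python
def staircase_lr(initial_lr, decay_rate, decay_every, progress):
    """Learning rate after `progress` units (epochs or samples) of training.

    Assumption: staircase decay, i.e. the rate is multiplied by `decay_rate`
    once for every completed block of `decay_every` units. The paper states
    the factor and the interval but not the exact shape of the schedule.
    """
    completed_intervals = progress // decay_every
    return initial_lr * decay_rate ** completed_intervals


def imagenet_lr(epoch):
    # Section 4.2: initial rate 0.045, decay of rate 0.94 every 2 epochs.
    return staircase_lr(0.045, 0.94, 2, epoch)


def jft_lr(samples_seen):
    # Section 4.2: initial rate 0.001, decay of rate 0.9 every 3,000,000 samples.
    return staircase_lr(0.001, 0.9, 3_000_000, samples_seen)


if __name__ == "__main__":
    for epoch in (0, 1, 2, 10):
        print(f"ImageNet, epoch {epoch:>2}: lr = {imagenet_lr(epoch):.6f}")
    for samples in (0, 3_000_000, 30_000_000):
        print(f"JFT, {samples:>10,} samples: lr = {jft_lr(samples):.6f}")
```

Expressing both schedules through one helper makes it visible that they differ only in their constants and in the unit of progress (epochs for ImageNet, samples for JFT).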
Figure 5. The Xception architecture: the data first goes through the entry flow, then through the middle flow which is repeated eight times,
and finally through the exit flow. Note that all Convolution and SeparableConvolution layers are followed by batch normalization [7] (not
included in the diagram). All SeparableConvolution layers use a depth multiplier of 1 (no depth expansion).
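
The SeparableConvolution layers named in Figure 5 can be illustrated in plain Python. This is a minimal sketch of the operation as described in Section 1.2 (depthwise step first, then pointwise, no non-linearity in between, depth multiplier of 1), not the TensorFlow implementation used for the experiments. It assumes stride 1 and no padding, ignores biases and batch normalization, and all function names are hypothetical. The width of 728 channels in the parameter count is an illustrative value chosen here, not one stated in the text.

```python
def depthwise_conv(x, kernels):
    """One k x k spatial kernel per input channel (stride 1, no padding).

    x: list of C channels, each an H x W list of lists.
    kernels: list of C kernels, each k x k (depth multiplier of 1).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [sum(channel[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)
        ])
    return out


def pointwise_conv(x, weights):
    """1x1 convolution: a linear mix of channels at every spatial position.

    weights: C_out rows, each holding C_in coefficients.
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(coef * x[c][i][j] for c, coef in enumerate(row))
          for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


def separable_conv(x, depthwise_kernels, pointwise_weights):
    # Depthwise first, then pointwise, with no activation in between.
    return pointwise_conv(depthwise_conv(x, depthwise_kernels), pointwise_weights)


def regular_conv_params(k, c_in, c_out):
    # Every output channel has its own k x k x c_in kernel (biases ignored).
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    # k*k weights per input channel, plus a c_in x c_out pointwise projection.
    return k * k * c_in + c_in * c_out


if __name__ == "__main__":
    x = [
        [[1, 2, 3], [4, 5, 6], [7, 8, 9]],  # channel 0
        [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # channel 1
    ]
    ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(separable_conv(x, [ones, ones], [[1, 1], [1, -1]]))
    print(regular_conv_params(3, 728, 728), separable_conv_params(3, 728, 728))
```

For a 3x3 layer mapping 728 channels to 728 channels, the two counting functions give 4,769,856 weights for a regular convolution against 536,536 for the separable one, roughly a ninefold reduction.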

4.5. Comparison with Inception V3

4.5.1 Classification performance

All evaluations were run with a single crop of the input images and a single model. ImageNet results are reported on the validation set rather than the test set (i.e. on the non-blacklisted images from the validation set of ILSVRC 2012). JFT results are reported after 30 million iterations (one month of training) rather than after full convergence. Results are provided in table 1 and table 2, as well as figure 6, figure 7, figure 8. On JFT, we tested both versions of our networks without any fully-connected layers and versions that included two fully-connected layers of 4096 units each before the logistic regression layer.

On ImageNet, Xception shows marginally better results than Inception V3. On JFT, Xception shows a 4.3% relative improvement on the FastEval14k MAP@100 metric. We also note that Xception outperforms ImageNet results reported by He et al. for ResNet-50, ResNet-101 and ResNet-152 [4].

Table 1. Classification performance comparison on ImageNet (single crop, single model). VGG-16 and ResNet-152 numbers are only included as a reminder. The version of Inception V3 being benchmarked does not include the auxiliary tower.

                Top-1 accuracy   Top-5 accuracy
VGG-16          0.715            0.901
ResNet-152      0.770            0.933
Inception V3    0.782            0.941
Xception        0.790            0.945

The Xception architecture shows a much larger performance improvement on the JFT dataset compared to the ImageNet dataset. We believe this may be due to the fact that Inception V3 was developed with a focus on ImageNet and may thus be by design over-fit to this specific task. On the other hand, neither architecture was tuned for JFT. It is likely that a search for better hyperparameters for Xception on ImageNet (in particular optimization parameters and regularization parameters) would yield significant additional improvement.

Table 2. Classification performance comparison on JFT (single crop, single model).

                               FastEval14k MAP@100
Inception V3 - no FC layers    6.36
Xception - no FC layers        6.70
Inception V3 with FC layers    6.50
Xception with FC layers        6.78

Figure 6. Training profile on ImageNet

Figure 7. Training profile on JFT, without fully-connected layers

Figure 8. Training profile on JFT, with fully-connected layers

4.5.2 Size and speed

Table 3. Size and training speed comparison.

                Parameter count   Steps/second
Inception V3    23,626,728        31
Xception        22,855,952        28

In table 3 we compare the size and speed of Inception V3 and Xception. Parameter count is reported on ImageNet (1000 classes, no fully-connected layers) and the number of training steps (gradient updates) per second is reported on ImageNet with 60 K80 GPUs running synchronous gradient descent. Both architectures have approximately the same size (within 3.5%), and Xception is marginally slower. We expect that engineering optimizations at the level of the depthwise convolution operations can make Xception faster than Inception V3 in the near future. The fact that both architectures have almost the same number of parameters indicates that the improvement seen on ImageNet and JFT does not come from added capacity but rather from a more efficient use of the model parameters.

4.6. Effect of the residual connections

Figure 9. Training profile with and without residual connections.

To quantify the benefits of residual connections in the Xception architecture, we benchmarked on ImageNet a modified version of Xception that does not include any residual
connections. Results are shown in figure 9. Residual connections are clearly essential in helping with convergence, both in terms of speed and final classification performance. However we will note that benchmarking the non-residual model with the same optimization configuration as the residual model may be uncharitable and that better optimization configurations might yield more competitive results.

Additionally, let us note that this result merely shows the importance of residual connections for this specific architecture, and that residual connections are in no way required in order to build models that are stacks of depthwise separable convolutions. We also obtained excellent results with non-residual VGG-style models where all convolution layers were replaced with depthwise separable convolutions (with a depth multiplier of 1), superior to Inception V3 on JFT at equal parameter count.

4.7. Effect of an intermediate activation after pointwise convolutions

Figure 10. Training profile with different activations between the depthwise and pointwise operations of the separable convolution layers.

We mentioned earlier that the analogy between depthwise separable convolutions and Inception modules suggests that depthwise separable convolutions should potentially include a non-linearity between the depthwise and pointwise operations. In the experiments reported so far, no such non-linearity was included. However we also experimentally tested the inclusion of either ReLU or ELU [3] as intermediate non-linearity. Results are reported on ImageNet in figure 10, and show that the absence of any non-linearity leads to both faster convergence and better final performance.

This is a remarkable observation, since Szegedy et al. report the opposite result in [21] for Inception modules. It may be that the depth of the intermediate feature spaces on which spatial convolutions are applied is critical to the usefulness of the non-linearity: for deep feature spaces (e.g. those found in Inception modules) the non-linearity is helpful, but for shallow ones (e.g. the 1-channel deep feature spaces of depthwise separable convolutions) it becomes harmful, possibly due to a loss of information.

5. Future directions

We noted earlier the existence of a discrete spectrum between regular convolutions and depthwise separable convolutions, parametrized by the number of independent channel-space segments used for performing spatial convolutions. Inception modules are one point on this spectrum. We showed in our empirical evaluation that the extreme formulation of an Inception module, the depthwise separable convolution, may have advantages over a regular Inception module. However, there is no reason to believe that depthwise separable convolutions are optimal. It may be that intermediate points on the spectrum, lying between regular Inception modules and depthwise separable convolutions, hold further advantages. This question is left for future investigation.

6. Conclusions

We showed how convolutions and depthwise separable convolutions lie at the two extremes of a discrete spectrum, with Inception modules being an intermediate point in between. This observation has led us to propose replacing Inception modules with depthwise separable convolutions in neural computer vision architectures. We presented a novel architecture based on this idea, named Xception, which has a similar parameter count as Inception V3. Compared to Inception V3, Xception shows small gains in classification performance on the ImageNet dataset and large gains on the JFT dataset. We expect depthwise separable convolutions to become a cornerstone of convolutional neural network architecture design in the future, since they offer similar properties as Inception modules, yet are as easy to use as regular convolution layers.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
[3] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
[4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[5] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network, 2015.
[6] A. Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications. Forthcoming.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.
[8] J. Jin, A. Dundar, and E. Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[10] Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard, et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural networks: the statistical mechanics perspective, 261:276, 1995.
[11] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[12] F. Mamalet and C. Garcia. Simplifying ConvNets for Fast Learning. In International Conference on Artificial Neural Networks (ICANN 2012), pages 58–65. Springer, 2012.
[13] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. 2014.
[15] L. Sifre. Rigid-motion scattering for image classification, 2014. Ph.D. thesis.
[16] L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 1233–1240, 2013.
[17] N. Silberman and S. Guadarrama. Tf-slim, 2016.
[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
[20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[22] T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012. Accessed: 2015-11-05.
[23] V. Vanhoucke. Learning visual representations at scale. ICLR, 2014.
[24] M. Wang, B. Liu, and H. Foroosh. Factorized convolutional neural networks. arXiv preprint arXiv:1608.04337, 2016.
[25] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.