Source notes and reports

Sources

← Home

                                                                              Accurate, Large Minibatch SGD:
                                                                               Training ImageNet in 1 Hour

                                                         Priya Goyal             Piotr Dollár        Ross Girshick                                             Pieter Noordhuis
                                                    Lukasz Wesolowski            Aapo Kyrola         Andrew Tulloch                                           Yangqing Jia Kaiming He

                                                                                                  Facebook




arXiv:1706.02677v2 [cs.CV] 30 Apr 2018
                                                                                                                                               40
                                                                 Abstract




                                                                                                             ImageNet top-1 validation error
                                                                                                                                               35
                                             Deep learning thrives with large neural networks and
                                         large datasets. However, larger networks and larger
                                         datasets result in longer training times that impede re-                                              30

                                         search and development progress. Distributed synchronous
                                         SGD offers a potential solution to this problem by dividing                                           25
                                         SGD minibatches over a pool of parallel workers. Yet to
                                         make this scheme efficient, the per-worker workload must                                              20
                                                                                                                                                 64   128   256   512   1k    2k    4k    8k   16k   32k   64k
                                         be large, which implies nontrivial growth in the SGD mini-
                                                                                                                                                                        mini-batch size
                                         batch size. In this paper, we empirically show that on the
                                                                                                             Figure 1. ImageNet top-1 validation error vs. minibatch size.
                                         ImageNet dataset large minibatches cause optimization dif-
                                                                                                             Error range of plus/minus two standard deviations is shown. We
                                         ficulties, but when these are addressed the trained networks
                                                                                                             present a simple and general technique for scaling distributed syn-
                                         exhibit good generalization. Specifically, we show no loss          chronous SGD to minibatches of up to 8k images while maintain-
                                         of accuracy when training with large minibatch sizes up to          ing the top-1 error of small minibatch training. For all minibatch
                                         8192 images. To achieve this result, we adopt a hyper-              sizes we set the learning rate as a linear function of the minibatch
                                         parameter-free linear scaling rule for adjusting learning           size and apply a simple warmup phase for the first few epochs of
                                         rates as a function of minibatch size and develop a new             training. All other hyper-parameters are kept fixed. Using this
                                         warmup scheme that overcomes optimization challenges                simple approach, accuracy of our models is invariant to minibatch
                                         early in training. With these simple techniques, our Caffe2-        size (up to an 8k minibatch size). Our techniques enable a lin-
                                         based system trains ResNet-50 with a minibatch size of 8192         ear reduction in training time with ∼90% efficiency as we scale
                                         on 256 GPUs in one hour, while matching small minibatch             to large minibatch sizes, allowing us to train an accurate 8k mini-
                                                                                                             batch ResNet-50 model in 1 hour on 256 GPUs.
                                         accuracy. Using commodity hardware, our implementation
                                         achieves ∼90% scaling efficiency when moving from 8 to
                                         256 GPUs. Our findings enable training visual recognition           tation [8, 10, 28]. Moreover, this pattern generalizes: larger
                                         models on internet-scale data with high efficiency.                 datasets and neural network architectures consistently yield
                                                                                                             improved accuracy across all tasks that benefit from pre-
                                                                                                             training [22, 41, 34, 35, 36, 16]. But as model and data
                                         1. Introduction                                                     scale grow, so does training time; discovering the potential
                                                                                                             and limits of large-scale deep learning requires developing
                                            Scale matters. We are in an unprecedented era in AI              novel techniques to keep training time manageable.
                                         research history in which the increasing data and model                 The goal of this report is to demonstrate the feasibility of,
                                         scale is rapidly improving accuracy in computer vision              and to communicate a practical guide to, large-scale train-
                                         [22, 41, 34, 35, 36, 16], speech [17, 40], and natural lan-         ing with distributed synchronous stochastic gradient descent
                                         guage processing [7, 38]. Take the profound impact in com-          (SGD). As an example, we scale ResNet-50 [16] training,
                                         puter vision as an example: visual representations learned          originally performed with a minibatch size of 256 images
                                         by deep convolutional neural networks [23, 22] show excel-          (using 8 Tesla P100 GPUs, training time is 29 hours), to
                                         lent performance on previously challenging tasks like Ima-          larger minibatches (see Figure 1). In particular, we show
                                         geNet classification [33] and can be transferred to difficult       that with a large minibatch size of 8192, we can train
                                         perception problems such as object detection and segmen-            ResNet-50 in 1 hour using 256 GPUs while maintaining


                                                                                                         1
the same level of accuracy as the 256 minibatch baseline.              2. Large Minibatch SGD
While distributed synchronous SGD is now commonplace,
no existing results show that generalization accuracy can be              We start by reviewing the formulation of Stochastic Gra-
maintained with minibatches as large as 8192 or that such              dient Descent (SGD), which will be the foundation of our
high-accuracy models can be trained in such short time.                discussions in the following sections. We consider super-
                                                                       vised learning by minimizing a loss L(w) of the form:
    To tackle this unusually large minibatch size, we employ
a simple and hyper-parameter-free linear scaling rule to ad-                                      1 X
                                                                                        L(w) =            l(x, w).             (1)
just the learning rate. While this guideline is found in ear-                                    |X|
                                                                                                             x∈X
lier work [21, 4], its empirical limits are not well under-
stood and informally we have found that it is not widely               Here w are the weights of a network, X is a labeled training
known to the research community. To successfully apply                 set, and l(x, w) is the loss computed from samples x ∈ X
this rule, we present a new warmup strategy, i.e., a strategy          and their labels y. Typically l is the sum of a classification
of using lower learning rates at the start of training [16], to        loss (e.g., cross-entropy) and a regularization loss on w.
overcome early optimization difficulties. Importantly, not                 Minibatch Stochastic Gradient Descent [32], usually re-
only does our approach match the baseline validation error,            ferred to as simply as SGD in recent literature even though
but also yields training error curves that closely match the           it operates on minibatches, performs the following update:
small minibatch baseline. Details are presented in §2.                                                1X
                                                                                      wt+1 = wt − η         ∇l(x, wt ).           (2)
    Our comprehensive experiments in §5 show that opti-                                               n
                                                                                                              x∈B
mization difficulty is the main issue with large minibatches,
rather than poor generalization (at least on ImageNet), in             Here B is a minibatch sampled from X and n = |B| is the
contrast to some recent studies [20]. Additionally, we show            minibatch size, η is the learning rate, and t is the iteration
that the linear scaling rule and warmup generalize to more             index. Note that in practice we use momentum SGD; we
complex tasks including object detection and instance seg-             return to a discussion of momentum in §3.
mentation [9, 31, 14, 28], which we demonstrate via the                2.1. Learning Rates for Large Minibatches
recently developed Mask R-CNN [14]. We note that a ro-
bust and successful guideline for addressing a wide range                 Our goal is to use large minibatches in place of small
of minibatch sizes has not been presented in previous work.            minibatches while maintaining training and generalization
    While the strategy we deliver is simple, its successful            accuracy. This is of particular interest in distributed learn-
application requires correct implementation with respect to            ing, because it can allow us to scale to multiple workers2 us-
seemingly minor and often not well understood implemen-                ing simple data parallelism without reducing the per-worker
tation details within deep learning libraries. Subtleties in the       workload and without sacrificing model accuracy.
implementation of SGD can lead to incorrect solutions that                As we will show in comprehensive experiments, we
are difficult to discover. To provide more helpful guidance            found that the following learning rate scaling rule is sur-
we describe common pitfalls and the relevant implementa-               prisingly effective for a broad range of minibatch sizes:
tion details that can trigger these traps in §3.
                                                                             Linear Scaling Rule: When the minibatch size is
    Our strategy applies regardless of framework, but
                                                                             multiplied by k, multiply the learning rate by k.
achieving efficient linear scaling requires nontrivial com-
munication algorithms. We use the open-source Caffe21                  All other hyper-parameters (weight decay, etc.) are kept un-
deep learning framework and Big Basin GPU servers [24],                changed. As we will show in §5, the linear scaling rule can
which operates efficiently using standard Ethernet network-            help us to not only match the accuracy between using small
ing (as opposed to specialized network interfaces). We de-             and large minibatches, but equally importantly, to largely
scribe the systems algorithms that enable our approach to              match their training curves, which enables rapid debugging
operate near its full potential in §4.                                 and comparison of experiments prior to convergence.
    The practical advances described in this report are help-
                                                                       Interpretation. We present an informal discussion of the
ful across a range of domains. In an industrial domain,
                                                                       linear scaling rule and why it may be effective. Consider
our system unleashes the potential of training visual models
                                                                       a network at iteration t with weights wt , and a sequence
from internet-scale data, enabling training with billions of
                                                                       of k minibatches Bj for 0 ≤ j < k each of size n. We
images per day. Of equal importance, in a research domain,
                                                                       compare the effect of executing k SGD iterations with small
we have found it to simplify migrating algorithms from a
                                                                       minibatches Bj and learning rate η versus a single iteration
single-GPU to a multi-GPU implementation without requir-
                                                                       with a large minibatch ∪j Bj of size kn and learning rate η̂.
ing hyper-parameter search, e.g. in our experience migrat-
                                                                           2 We use the terms ‘worker’ and ‘GPU’ interchangeably in this work, al-
ing Faster R-CNN [31] and ResNets [16] from 1 to 8 GPUs.
                                                                       though other implementations of a ‘worker’ are possible. ‘Server’ denotes
  1 http://www.caffe2.ai                                               a set of 8 GPUs that does not require communication over a network.


                                                                   2
   According to (2), after k iterations of SGD with learning           2.2. Warmup
rate η and a minibatch size of n we have:
                                                                          As we discussed, for large minibatches (e.g., 8k) the lin-
                          1XX                                          ear scaling rule breaks down when the network is changing
          wt+k = wt − η                ∇l(x, wt+j ).     (3)
                          n                                            rapidly, which commonly occurs in early stages of train-
                             j<k x∈Bj
                                                                       ing. We find that this issue can be alleviated by a properly
On the other hand, taking a single step with the large mini-
                                                                       designed warmup [16], namely, a strategy of using less ag-
batch ∪j Bj of size kn and learning rate η̂ yields:
                                                                       gressive learning rates at the start of training.
                           1 XX
          ŵt+1 = wt − η̂              ∇l(x, wt ).       (4)           Constant warmup. The warmup strategy presented in [16]
                          kn                                           uses a low constant learning rate for the first few epochs of
                               j<k x∈Bj

As expected, the updates differ, and it is unlikely that               training. As we will show in §5, we have found constant
ŵt+1 = wt+k . However, if we could assume ∇l(x, wt ) ≈                warmup particularly helpful for prototyping object detec-
∇l(x, wt+j ) for j < k, then setting η̂ = kη would yield               tion and segmentation methods [9, 31, 26, 14] that fine-tune
ŵt+1 ≈ wt+k , and the updates from small and large mini-              pre-trained layers together with newly initialized layers.
batch SGD would be similar. Although this is a strong as-                  In our ImageNet experiments with a large minibatch of
sumption, we emphasize that if it were true the two updates            size kn, we have tried to train with the low learning rate of
are similar only if we set η̂ = kη.                                    η for the first 5 epochs and then return to the target learn-
    The above interpretation gives intuition for one case              ing rate of η̂ = kη. However, given a large k, we find that
where we may hope the linear scaling rule to apply. In our             this constant warmup is not sufficient to solve the optimiza-
experiments with η̂ = kη (and warmup), small and large                 tion problem, and a transition out of the low learning rate
minibatch SGD not only result in models with the same fi-              warmup phase can cause the training error to spike. This
nal accuracy, but also, the training curves match closely.             leads us to propose the following gradual warmup.
Our empirical results suggest that the above approximation             Gradual warmup. We present an alternative warmup that
might be valid in large-scale, real-world data.                        gradually ramps up the learning rate from a small to a large
    However, there are at least two cases when the condition           value. This ramp avoids a sudden increase of the learning
∇l(x, wt ) ≈ ∇l(x, wt+j ) will clearly not hold. First, in ini-        rate, allowing healthy convergence at the start of training.
tial training when the network is changing rapidly, it does            In practice, with a large minibatch of size kn, we start from
not hold. We address this by using a warmup phase, dis-                a learning rate of η and increment it by a constant amount at
cussed in §2.2. Second, minibatch size cannot be scaled in-            each iteration such that it reaches η̂ = kη after 5 epochs (re-
definitely: while results are stable for a large range of sizes,       sults are robust to the exact duration of warmup). After the
beyond a certain point accuracy degrades rapidly. Interest-            warmup, we go back to the original learning rate schedule.
ingly, this point is as large as ∼8k in ImageNet experiments.
                                                                       2.3. Batch Normalization with Large Minibatches
Discussion. The above linear scaling rule was adopted by
Krizhevsky [21], if not earlier. However, Krizhevsky re-                   Batch Normalization (BN) [19] computes statistics along
ported a 1% increase of error when increasing the minibatch            the minibatch dimension: this breaks the independence of
size from 128 to 1024, whereas we show how to maintain                 each sample’s loss, and changes in minibatch size change
accuracy across a much broader regime of minibatch sizes.              the underlying definition of the loss function being opti-
Chen et al. [5] presented a comparison of numerous dis-                mized. In the following we will show that a commonly used
tributed SGD variants, and although their work also em-                ‘shortcut’, which may appear to be a practical consideration
ployed the linear scaling rule, it did not establish a small           to avoid communication overhead, is actually necessary for
minibatch baseline. Li [25] (§4.6) showed distributed Ima-             preserving the loss function when changing minibatch size.
geNet training with minibatches up to 5120 without a loss                  We note that (1) and (2) assume the per-sample loss
in accuracy after convergence. However, their work did not             l(x, w) is independent of all other samples. This is not the
demonstrate a hyper-parameter search-free rule for adjust-             case when BN is performed and activations are computed
ing the learning rate as a function of minibatch size, which           across samples. We write lB (x, w) to denote that the loss of
is a central contribution of our work.                                 a single sample x depends on the statistics of all samples in
    In recent work, Bottou et al. [4] (§4.2) review theoretical        its minibatch B. We denote theP loss over a single minibatch
tradeoffs of minibatching and show that with the linear scal-          B of size n as L(B, w) = n1 x∈B lB (x, w). With BN, the
ing rule, solvers follow the same training curve as a function         training set can be thought of as containing all distinct sub-
of number of examples seen, and suggest the learning rate              sets of size n drawn from the original training set X, which
should not exceed a maximum rate independent of mini-                  we denote as X n . The training loss L(w) then becomes:
batch size (which justifies warmup). Our work empirically                                        1 X
                                                                                       L(w) =               L(B, w).              (5)
tests these theories with unprecedented minibatch sizes.                                       |X n |     n
                                                                                                      B∈X


                                                                   3
If we view B as a ‘single sample’ in X n , then the loss of           contributed by ε(x, wt ). If there is no weight decay term,
each single sample B is computed independently.                       there are many equivalent ways of scaling the learning rate,
    Note that the minibatch size n over which the BN statis-          including scaling the term ε(x, wt ). However, as can be
tics are computed is a key component of the loss: if the per-         seen from (8), in general this is not the case. We summarize
worker minibatch sample size n is changed, it changes the             these observations in the following remark:
underlying loss function L that is optimized. More specif-                    Remark 1: Scaling the cross-entropy loss is
ically, the mean/variance statistics computed by BN with                       not equivalent to scaling the learning rate.
different n exhibit different levels of random variation.
                                                                      Momentum correction. Momentum SGD is a commonly
    In the case of distributed (and multi-GPU) training, if the
                                                                      adopted modification to the vanilla SGD in (2). A reference
per-worker sample size n is kept fixed and the total mini-
                                                                      implementation of momentum SGD has the following form:
batch size is kn, it can be viewed a minibatch of k samples
with each sample Bj independently selected from X n , so                                              1X
                                                                                    ut+1 = mut +             ∇l(x, wt )
the underlying loss function is unchanged and is still de-                                            n                        (9)
                                                                                                        x∈B
fined in X n . Under this point of view, in the BN setting                          wt+1 = wt − ηut+1 .
after seeing k minibatches Bj , (3) and (4) become:
                               X                                      Here m is the momentum decay factor and u is the update
             wt+k = wt − η         ∇L(Bj , wt+j ),          (6)       tensor. A popular variant absorbs the learning rate η into
                              j<k                                     the update tensor. Substituting vt for ηut in (9) yields:
                              1X                                                                     1X
            ŵt+1 = wt − η̂      ∇L(Bj , wt ).             (7)                     vt+1 = mvt + η           ∇l(x, wt )
                              k                                                                      n                          (10)
                                j<k                                                                    x∈B
Following similar logic as in §2.1, we set η̂ = kη and we                          wt+1 = wt − vt+1 .
keep the per-worker sample size n constant when we change             For a fixed η, the two are equivalent. However, we note that
the number of workers k.                                              while u only depends on the gradients and is independent
    In this work, we use n = 32 which has performed well              of η, v is entangled with η. When η changes, to maintain
for a wide range of datasets and networks [19, 16]. If n is           equivalence with the reference variant in P   (9), the update for
adjusted, it should be viewed as a hyper-parameter of BN,             v should be: vt+1 = m ηt+1   ηt v t + η     1
                                                                                                              t+1 n    ∇l(x, wt ). We
not of distributed training. We also note that the BN statis-                              ηt+1
                                                                      refer to the factor ηt as the momentum correction. We
tics should not be computed across all workers, not only for          found that this is especially important for stabilizing train-
the sake of reducing communication, but also for maintain-            ing when ηt+1  ηt , otherwise the history term vt is too
ing the same underlying loss function being optimized.                small which leads to instability (for ηt+1 < ηt momentum
                                                                      correction is less critical). This leads to our second remark:
3. Subtleties and Pitfalls of Distributed SGD                                    Remark 2: Apply momentum correction
    In practice a distributed implementation has many sub-                      after changing learning rate if using (10).
tleties. Many common implementation errors change the                 Gradient aggregation. For k workers each with a per-
definitions of hyper-parameters, leading to models that train         worker minibatch of size n, following (4), gradient aggre-
but whose error may be higher than expected, and such is-
                                                                                        P P over the entire set of kn examples
                                                                      gation must be performed
                                                                                      1
sues can be difficult to discover. While the remarks below            according to kn      j    x∈Bj l(x, wt ). Loss layers are typi-
are straightforward, they are important to consider explic-           cally implemented to compute an average loss over their lo-
itly to faithfully implement the underlying solver.
                                                                      P input, which amounts to computing a per-worker loss of
                                                                      cal
Weight decay. Weight decay is actually the outcome of the                 l(x, wt )/n. Given this, correct aggregation requires av-
gradient of an L2-regularization term in the loss function.           eraging the k gradients in order to recover the missing 1/k
More formally, the per-sample loss in (1) can be written as           factor. However, standard communication primitives like
l(x, w) = λ2 kwk2 + ε(x, w). Here λ2 kwk2 is the sample-              allreduce [11] perform summing, not averaging. Therefore,
independent L2 regularization on the weights and ε(x, w)              it is more efficient to absorb the 1/k scaling into the loss,
is a sample-dependent term such as the cross-entropy loss.            in which case only the loss’s gradient with respect to its in-
The SGD update in (2) can be written as:                              put needs to be scaled, removing the need to scale the entire
                                                                      gradient vector. We summarize this as follows:
                                 1X
         wt+1 = wt − ηλwt − η          ∇ε(x, wt ).      (8)                    Remark 3: Normalize the per-worker loss by
                                n
                                      x∈B                                    total minibatch size kn, not per-worker size n.
In
P practice, usually only the sample-dependent term                    We also note that it may be incorrect to ‘cancel k’ by setting
   ∇ε(x, wt ) is computed by backprop; the term λwt is                η̂ = η (not kη) and normalizing the loss by 1/n (not 1/kn),
computed separately and added to the aggregated gradients             which can lead to incorrect weight decay (see Remark 1).


                                                                  4
Data shuffling. SGD is typically analyzed as a process that          number of GPU-to-host memory copies and a CPU reduc-
samples data randomly with replacement. In practice, com-            tion otherwise. NCCL uses GPU kernels to accelerate in-
mon SGD implementations apply random shuffling of the                traserver collectives, so this approach dedicates more time
training set during each SGD epoch, which can give better            on the GPU to backprop while using the CPU resources that
results [3, 13]. To provide fair comparisons with baselines          would otherwise have been idle to improve throughput.
that use shuffling (e.g., [16]), we ensure the samples in one            For interserver allreduce, we implemented two of the
epoch done by k workers are from a single consistent ran-            best algorithms for bandwidth-limited scenarios: the re-
dom shuffling of the training set. To achieve this, for each         cursive halving and doubling algorithm [30, 37] and the
epoch we use a random shuffling that is partitioned into k           bucket algorithm (also known as the ring algorithm) [2].
parts, each of which is processed by one of the k workers.           For both, each server sends and receives 2 p−1     p b bytes of
Failing to correctly implement random shuffling in multiple          data, where b is the buffer size in bytes and p is the num-
workers may lead to noticeably different behavior, which             ber of servers. While the halving/doubling algorithm con-
may contaminate results and conclusions. In summary:                 sists of 2 log2 (p) communication steps, the ring algorithm
                                                                     consists of 2(p − 1) steps. This generally makes the halv-
 Remark 4: Use a single random shuffling of the training             ing/doubling algorithm faster in latency-limited scenarios
 data (per epoch) that is divided amongst all k workers.             (i.e., for small buffer sizes and/or large server counts). In
                                                                     practice, we found the halving/doubling algorithm to per-
4. Communication                                                     form much better than the ring algorithm for buffer sizes
   In order to scale beyond the 8 GPUs in a single Big Basin         up to a million elements (and even higher on large server
server [24], gradient aggregation has to span across servers         counts). On 32 servers (256 GPUs), using halving/doubling
on a network. To allow for near perfect linear scaling, the          led to a speedup of 3× over the ring algorithm.
aggregation must be performed in parallel with backprop.                 The halving/doubling algorithm consists of a reduce-
This is possible because there is no data dependency be-             scatter collective followed by an allgather. In the first step
tween gradients across layers. Therefore, as soon as the gra-        of reduce-scatter, servers communicate in pairs (rank 0 with
dient for a layer is computed, it is aggregated across work-         1, 2 with 3, etc.), sending and receiving for different halves
ers, while gradient computation for the next layer continues         of their input buffers. For example, rank 0 sends the second
(as discussed in [5]). We give full details next.                    half of its buffer to 1 and receives the first half of the buffer
                                                                     from 1. A reduction over the received data is performed be-
4.1. Gradient Aggregation                                            fore proceeding to the next step, where the distance to the
                                                                     destination rank is doubled while the data sent and received
    For every gradient, aggregation is done using an allre-          is halved. After the reduce-scatter phase is finished, each
duce operation (similar to the MPI collective operation              server has a portion of the final reduced vector.
MPI Allreduce [11]). Before allreduce starts every GPU has
                                                                         This is followed by the allgather phase, which retraces
its locally computed gradients and after allreduce completes
                                                                     the communication pattern from the reduce-scatter in re-
every GPU has the sum of all k gradients. As the number
                                                                     verse, this time simply concatenating portions of the final
of parameters grows and compute performance of GPUs in-
                                                                     reduced vector. At each server, the portion of the buffer that
creases, it becomes harder to hide the cost of aggregation in
                                                                     was being sent in the reduce-scatter is received in the all-
the backprop phase. Training techniques to overcome these
                                                                     gather, and the portion that was being received is now sent.
effects are beyond the scope of this work (e.g., quantized
                                                                         To support non-power-of-two number of servers, we
gradients [18], Block-Momentum SGD [6]). However, at
                                                                     used the binary blocks algorithm [30]. This is a generalized
the scale of this work, collective communication was not
                                                                     version of the halving/doubling algorithm where servers
a bottleneck, as we were able to achieve near-linear SGD
                                                                     are partitioned into power-of-two blocks and two additional
scaling by using an optimized allreduce implementation.
                                                                     communication steps are used, one immediately after the
    Our implementation of allreduce consists of three phases
                                                                     intrablock reduce-scatter and one before the intrablock all-
for communication within and across servers: (1) buffers
                                                                     gather. Non-power-of-two cases have some degree of load
from the 8 GPUs within a server are summed into a sin-
                                                                     imbalance compared to power-of-two, though in our runs
gle buffer for each server, (2) the results buffers are shared
                                                                     we did not see significant performance degradation.
and summed across all servers, and finally (3) the results
are broadcast onto each GPU. For the local reduction and             4.2. Software
broadcast in phases (1) and (3) we used NVIDIA Collective
Communication Library (NCCL)3 for buffers of size 256                  The allreduce algorithms described are implemented in
KB or more and a simple implementation consisting of a               Gloo4 , a library for collective communication. It supports
  3 https://developer.nvidia.com/nccl                                  4 https://github.com/facebookincubator/gloo




                                                                 5
multiple communication contexts, which means no addi-                 5. Main Results and Analysis
tional synchronization is needed to execute multiple allre-
duce instances in parallel. Local reduction and broadcast                Our main result is that we can train ResNet-50 [16] on
(described as phases (1) and (3)) are pipelined with inter-           ImageNet [33] using 256 workers in one hour, while match-
server allreduce where possible.                                      ing the accuracy of small minibatch training. Applying the
                                                                      linear scaling rule along with a warmup strategy allows us to
   Caffe2 supports multi-threaded execution of the compute            seamlessly scale between small and large minibatches (up
graph that represents a training iteration. Whenever there is         to 8k images) without tuning additional hyper-parameters
no data dependency between subgraphs, multiple threads                or impacting accuracy. In the following subsections we:
can execute those subgraphs in parallel. Applying this to             (1) describe experimental settings, (2) establish the effec-
backprop, local gradients can be computed in sequence,                tiveness of large minibatch training, (3) perform a deeper
without dealing with allreduce or weight updates. This                experimental analysis, (4) show our findings generalize to
means that during backprop, the set of runnable subgraphs             object detection/segmentation, and (5) provide timings.
may grow faster than we can execute them. For subgraphs
that contain an allreduce run, all servers must choose to exe-        5.1. Experimental Settings
cute the same subgraph from the set of runnable subgraphs.                The 1000-way ImageNet classification task [33] serves
Otherwise, we risk distributed deadlock where servers are             as our main experimental benchmark. Models are trained
attempting to execute non-intersecting sets of subgraphs.             on the ∼1.28 million training images and evaluated by top-
With allreduce being a collective operation, servers would            1 error on the 50,000 validation images.
time out waiting. To ensure correct execution we impose a                 We use the ResNet-50 [16] variant from [12], noting that
partial order on these subgraphs. This is implemented using           the stride-2 convolutions are on 3×3 layers instead of on
a cyclical control input, where completion of the n-th allre-         1×1 layers as in [16]. We use Nesterov momentum [29]
duce unblocks execution of the (n + c)-th allreduce, with             with m of 0.9 following [12] but note that standard mo-
c being the maximum number of concurrent allreduce runs.              mentum as was used in [16] is equally effective. We use a
Note that this number should be chosen to be lower than the           weight decay λ of 0.0001 and following [16] we do not ap-
number of threads used to execute the full compute graph.             ply weight decay on the learnable BN coefficients (namely,
                                                                      γ and β in [19]). In order to keep the training objective
                                                                      fixed, which depends on the BN batch size n as described
4.3. Hardware                                                         in §2.3, we use n = 32 throughout, regardless of the overall
                                                                      minibatch size. As in [12], we compute the BN statistics
   We used Facebook’s Big Basin [24] GPU servers for                  using running average (with momentum 0.9).
our experiments. Each server contains 8 NVIDIA Tesla                      All models are trained for 90 epochs regardless of mini-
P100 GPUs that are interconnected with NVIDIA NVLink.                 batch sizes. We apply the linear scaling rule from §2.1 and
                                                                                                         kn
For local storage, each server has 3.2TB of NVMe SSDs.                use a learning rate of η = 0.1 · 256   that is linear in the mini-
For network connectivity, the servers have a Mellanox                 batch size kn. With k = 8 workers (GPUs) and n = 32
ConnectX-4 50Gbit Ethernet network card and are con-                  samples per worker, η = 0.1 as in [16]. We call this num-
                                                                                   kn
nected to Wedge100 [1] Ethernet switches.                             ber (0.1 · 256  ) the reference learning rate, and reduce it by
                                                                      1/10 at the 30-th, 60-th, and 80-th epoch, similar to [16].
    We have found 50Gbit of network bandwidth sufficient
                                                                          We adopt the initialization of [15] for all convolutional
for distributed synchronous SGD for ResNet-50, per the
                                                                      layers. The 1000-way fully-connected layer is initialized by
following analysis. ResNet-50 has approximately 25 mil-
                                                                      drawing weights from a zero-mean Gaussian with standard
lion parameters. This means the total size of parameters is
                                                                      deviation of 0.01. We have found that although SGD with a
25 · 106 · sizeof(float) = 100MB. Backprop for ResNet-50
                                                                      small minibatch is not sensitive to initialization due to BN,
on a single NVIDIA Tesla P100 GPU takes 120 ms. Given
                                                                      this is not the case for a substantially large minibatch. Addi-
that allreduce requires ∼2× bytes on the network compared
                                                                      tionally we require an appropriate warmup strategy to avoid
to the value it operates on, this leads to a peak bandwidth re-
                                                                      optimization difficulties in early training.
quirement of 200MB/0.125s = 1600MB/s, or 12.8 Gbit/s,
                                                                          For BN layers, the learnable scaling coefficient γ is ini-
not taking into account communication overhead. When we
                                                                      tialized to be 1, except for each residual block’s last BN
add a smudge factor for network overhead, we reach a peak
                                                                      where γ is initialized to be 0. Setting γ = 0 in the last BN of
bandwidth requirement for ResNet-50 of ∼15 Gbit/s.
                                                                      each residual block causes the forward/backward signal ini-
   As this peak bandwidth requirement only holds during               tially to propagate through the identity shortcut of ResNets,
backprop, the network is free to be used for different tasks          which we found to ease optimization at the start of training.
that are less latency sensitive then aggregation (e.g. reading        This initialization improves all models but is particularly
data or saving network snapshots) during the forward pass.            helpful for large minibatch training as we will show.


                                                                  6
   We use scale and aspect ratio data augmentation [36] as                                       k     n    kn    η     top-1 error (%)
in [12]. The network input image is a 224×224 pixel ran-            baseline (single server)      8    32   256   0.1    23.60 ±0.12
dom crop from an augmented image or its horizontal flip.            no warmup, Figure 2a         256   32    8k   3.2    24.84 ±0.37
The input image is normalized by the per-color mean and             constant warmup, Figure 2b   256   32    8k   3.2    25.88 ±0.56
standard deviation, as in [12].                                     gradual warmup, Figure 2c    256   32    8k   3.2    23.74 ±0.09

Handling random variation. As models are subject to                Table 1. Validation error on ImageNet using ResNet-50 (mean
random variation in training, we compute a model’s error           and std computed over 5 trials). We compare the small minibatch
rate as the median error of the final 5 epochs. Moreover,          model (kn=256) with large minibatch models (kn=8k) with vari-
                                                                   ous warmup strategies. Observe that the top-1 validation error for
we report the mean and standard deviation (std) of the error
                                                                   small and large minibatch training (with gradual warmup) is quite
from 5 independent runs. This gives us more confidence in          close: 23.60% ±0.12 vs. 23.74% ±0.09, respectively.
our results and also provides a measure of model stability.
   The random variation of ImageNet models has generally
not been reported in previous work (largely due to resource
                                                                   Training error. Training curves are shown in Figure 2.
limitations). We emphasize that ignoring random variation
                                                                   With no warmup (2a), the training curve for large minibatch
may cause unreliable conclusions, especially if results are
                                                                   of kn = 8k is inferior to training with a small minibatch of
from a single trial, or the best of many.
                                                                   kn = 256 across all epochs. A constant warmup strategy
Baseline. Under these settings, we establish a ResNet-50           (2b) actually degrades results: although the small constant
baseline using k = 8 (8 GPUs in one server) and n = 32             learning rate can decrease error during warmup, the error
images per worker (minibatch size of kn = 256), as in [16].        spikes immediately after and training never fully recovers.
Our baseline has a top-1 validation error of 23.60% ±0.12.             Our main result is that with gradual warmup, large mini-
As a reference, ResNet-50 from fb.resnet.torch [12]                batch training error matches the baseline training curve ob-
has 24.01% error, and that of the original ResNet paper [16]       tained with small minibatches, see Figure 2c. Although
has 24.7% under weaker data augmentation.                          the large minibatch curve starts higher due to the low η
                                                                   in the warmup phase, it catches up shortly thereafter. Af-
5.2. Optimization or Generalization Issues?                        ter about 20 epochs, the small and large minibatch training
   We establish our main results on large minibatch train-         curves match closely. The comparison between no warmup
ing by exploring optimization and generalization behaviors.        and gradual warmup suggests that large minibatch sizes are
We will demonstrate that with a proper warmup strategy,            challenged by optimization difficulties in early training and
large minibatch SGD can both match the training curves of          if these difficulties are addressed, the training error and its
small minibatch SGD and also match the validation error.           curve can match a small minibatch baseline closely.
In other words, in our experiments both optimization and
generalization of large minibatch training matches that of
small minibatch training. Moreover, in §5.4 we will show           Validation error. Table 1 shows the validation error for
that these models exhibit good generalization behavior to          the three warmup strategies. The no-warmup variant has
                                                                   ∼1.2% higher validation error than the baseline which is
the object detection/segmentation transfer tasks, matching
the transfer quality of small minibatch models.                    likely caused by the ∼2.1% increase in training error (Fig-
   For the following results, we use k = 256 and n = 32,           ure 2a), rather than overfitting or other causes for poor gen-
which results in a minibatch size kn = 8k (we use ‘1k’             eralization. This argument is further supported by our grad-
to denote 1024). As discussed, our baseline has a mini-            ual warmup experiment. The gradual warmup variant has
batch size of kn = 256 and a reference learning rate of            a validation error within 0.14% of the baseline (noting that
η = 0.1. Applying the linear scaling rule gives η = 3.2            std of these estimates is ∼0.1%). Given that the final train-
as the reference learning rate for our large minibatch runs.       ing errors (Figure 2c) match nicely in this case, it shows that
We test three warmup strategies as discussed in §2.2: no           if the optimization issues are addressed, there is no apparent
warmup, constant warmup with η = 0.1 for 5 epochs,                 generalization degradation observed using large minibatch
and gradual warmup which starts with η = 0.1 and is                training, even if the minibatch size goes from 256 to 8k.
linearly increased to η = 3.2 over 5 epochs. All models                Finally, Figure 4 shows both the training and valida-
are trained from scratch and all other hyper-parameters are        tion curves for the large minibatch training with gradual
kept fixed. We emphasize that while better results for any         warmup. As can be seen, validation error starts to match
particular minibatch size could be obtained by optimizing          the baseline closely after the second learning rate drop; ac-
hyper-parameters for that case; our goal is to match er-           tually, the validation curves can match earlier if BN statis-
rors across minibatch sizes by using a general strategy that       tics are recomputed prior to evaluating the error instead of
avoids hyper-parameter tuning for each minibatch size.             using the running average (see also caption in Figure 4).


                                                               7
                         100
                                        kn=256,     = 0.1, 23.60% 0.12                 kn=256,       = 0.1, 23.60% 0.12                 kn=256,       = 0.1, 23.60% 0.12
                          90
                                        kn= 8k,     = 3.2, 24.84% 0.37                 kn= 8k,       = 3.2, 25.88% 0.56                 kn= 8k,       = 3.2, 23.74% 0.09
                          80




      training error %
                          70

                          60

                          50

                          40

                          30

                          20
                               0   20         40           60     80         0    20            40         60      80         0    20            40         60      80
                                                  epochs                                        epochs                                           epochs
                                         (a) no warmup                                 (b) constant warmup                               (c) gradual warmup

Figure 2. Warmup. Training error curves for minibatch size 8192 using various warmup strategies compared to minibatch size 256.
Validation error (mean±std of 5 runs) is shown in the legend, along with minibatch size kn and reference learning rate η.

                   100
                                        kn=256,    = 0.1, 23.60% 0.12                 kn=256,    = 0.1, 23.60% 0.12                    kn=256,    = 0.1, 23.60% 0.12
                         90                                                           kn=512,    = 0.2, 23.48% 0.09                    kn= 1k,    = 0.4, 23.53% 0.08
                                        kn=128,    = 0.05 23.49% 0.12
                         80




training error %
                         70

                         60

                         50

                         40

                         30

                         20
                              0    20         40           60     80     0       20         40            60      80      0       20         40           60       80
                   100
                                        kn=256,    = 0.1, 23.60% 0.12                 kn=256,    = 0.1, 23.60% 0.12                    kn=256,    = 0.1, 23.60% 0.12
                         90                                                           kn= 4k,    = 1.6, 23.56% 0.12                    kn= 8k,    = 3.2, 23.74% 0.09
                                        kn= 2k,    = 0.8, 23.49% 0.11
                         80




training error %
                         70

                         60

                         50

                         40

                         30

                         20
                              0    20         40           60     80     0       20         40            60      80      0       20         40           60       80
                   100
                                        kn=256,    = 0.1, 23.60% 0.12                 kn=256,    = 0.1, 23.60% 0.12                    kn=256,    = 0.1, 23.60% 0.12
                         90
                                        kn=16k,    = 6.4, 24.79% 0.27                 kn=32k,    =12.8, 27.55% 0.28                    kn=64k,    =25.6, 33.96% 0.80
                         80




training error %
                         70

                         60

                         50

                         40

                         30

                         20
                              0    20         40           60     80     0       20         40            60      80      0       20         40           60       80
                                              epochs                                           epochs                                         epochs
Figure 3. Training error vs. minibatch size. Training error curves for the 256 minibatch baseline and larger minibatches using gradual
warmup and the linear scaling rule. Note how the training curves closely match the baseline (aside from the warmup period) up through 8k
minibatches. Validation error (mean±std of 5 runs) is shown in the legend, along with minibatch size kn and reference learning rate η.




                                                                                           8
              100                                                                                          100
                                   kn=256,        =0.1 [train]                                                            kn=256,    = 0.1, 23.60% 0.12
                                   kn=256,        =0.1 [val]                                                              kn=256,    = 0.2, 23.68% 0.09
               80                                                                                           80




                                                                                        training error %
                                   kn=8k,         =3.2 [train]
                                   kn=8k,         =3.2 [val]

    error %
               60                                                                                           60


               40                                                                                           40


               20                                                                                           20
                    0   20           40            60           80                                               0   20         40         60      80
                                      epochs                                                                                    epochs
Figure 4. Training and validation curves for large minibatch                        Figure 5. Training curves for small minibatches with different
SGD with gradual warmup vs. small minibatch SGD. Both sets                          learning rates η. As expected, changing η results in curves that do
of curves match closely after training for sufficient epochs. We                    not match. This is in contrast to changing batch-size (and linearly
note that the BN statistics (for inference only) are computed us-                   scaling η), which results in curves that do match, e.g. see Figure 3.
ing running average, which is updated less frequently with a large
minibatch and thus is noisier in early training (this explains the
larger variation of the validation error in early epochs).                          η = 0.1 gives best error but slightly smaller or larger η also
                                                                                    work well. When applying the linear scaling rule with a
                                                                                    minibatch of 8k images, the optimum error is also achieved
5.3. Analysis Experiments                                                           with η = 0.1 · 32, showing the successful application of the
Minibatch size vs. error. Figure 1 (page 1) shows top-                              linear scaling rule. However, in this case results are more
1 validation error for models trained with minibatch sizes                          sensitive to changing η. In practice we suggest to use a
ranging from of 64 to 65536 (64k). For all models we used                           minibatch size that is not close to the breaking point.
the linear scaling rule and set the reference learning rate                             Figure 5 shows the training curves of a 256 minibatch
               kn
as η = 0.1 · 256  . For models with kn > 256, we used                               using η = 0.1 or 0.2. It shows that changing the learning
the gradual warmup strategy always starting with η = 0.1                            rate η in general changes the overall shapes of the train-
and increasing linearly to the reference learning rate after                        ing curves, even if the final error is similar. Contrasting
5 epochs. Figure 1 illustrates that validation error remains                        this result with the success of the linear scaling rule (that
stable across a broad range of minibatch sizes, from 64 to                          can match both the final error and the training curves when
8k, after which it begins to increase. Beyond 64k training                          minibatch sizes change) may reveal some underlying invari-
diverges when using the linear learning rate scaling rule.5                         ance maintained between small and large minibatches.

Training curves for various minibatch sizes. Each of the
                                                                                        We also show two √ alternative strategies: keeping η fixed
                                                                                    at 0.1 or using 0.1· 32 according to the square root scaling
nine plots in Figure 3 shows the top-1 training error curve                         rule that was justified theoretically in [21] on grounds that it
for the 256 minibatch baseline (orange) and a second curve                          scales η by the inverse amount of the reduction in the gradi-
corresponding to different size minibatch (blue). Valida-                           ent estimator’s standard deviation.√  For fair comparisons we
tion errors are shown in the plot legends. As minibatch size                        also use gradual warmup for 0.1 · 32. Both policies work
increases, all training curves show some divergence from                            poorly in practice as the results show.
the baseline at the start of training. However, in the cases
where the final validation error closely matches the base-                          Batch Normalization γ initialization. Table 2b controls
line (kn ≤ 8k), the training curves also closely match after                        for the impact of the new BN γ initialization introduced in
the initial epochs. When the validation errors do not match                         §5.1. We show results for minibatch sizes 256 and 8k with
(kn ≥ 16k), there is a noticeable gap in the training curves                        the standard BN initialization (γ = 1 for all BN layers)
for all epochs. This suggests that when comparing a new                             and with our initialization (γ = 0 for the final BN layer
setting, the training curves can be used as a reliable proxy                        of each residual block). The results show improved per-
for success well before training finishes.                                          formance with γ = 0 for both minibatch sizes, and the
                                                                                    improvement is slightly larger for the 8k minibatch size.
Alternative learning rate rules. Table 2a shows results for                         This behavior also suggests that large minibatches are more
multiple learning rates. For small minibatches (kn = 256),                          easily affected by optimization difficulties. We expect that
    5 We note that because of the availability of hardware, we simulated dis-       improved optimization and initialization methods will help
tributed training of very large minibatches (≥12k) on a single server by us-        push the boundary of large minibatch training.
ing multiple gradient accumulation steps between SGD updates. We have
thoroughly verified that gradient accumulation on a single server yields            ResNet-101. Results for ResNet-101 [16] are shown in Ta-
equivalent results relative to distributed training.                                ble 2c. Training ResNet-101 with a batch-size of kn = 8k


                                                                                9
               kn               η                  top-1 error (%)                                                    40




                                                                                    ImageNet top-1 validation error
               256            0.05                  23.92 ±0.10
               256            0.10                  23.60 ±0.12                                                       35
               256            0.20                  23.68 ±0.09
               8k           0.05 · 32               24.27 ±0.08
               8k           0.10 · 32               23.74 ±0.09                                                       30
               8k           0.20 · 32               24.05 ±0.18
               8k             0.10√                 41.67 ±0.10
                                                                                                                      25
               8k          0.10 · 32                26.22 ±0.03
(a) Comparison of learning rate scaling rules. A reference learning rate
of η = 0.1 works best for kn = 256 (23.68% error). The linear scal-                                                   20
                                                                                                                       256      512     1k        2k     4k       8k    16k      32k   64k
ing rule suggests η = 0.1 · 32 when kn = 8k, which again gives best
                                                                                                                                                  mini-batch size
performance (23.74% error). Other ways of scaling η give worse results.
                                                                                    Figure 6. ImageNet-5k top-1 validation error vs. minibatch size
             kn          η          γ-init           top-1 error (%)                with a fixed 90 epoch training schedule. The curve is qualitatively
             256        0.1          1.0              23.84 ±0.18                   similar to results on ImageNet-1k (Figure 1) showing that a 5×
             256        0.1          0.0              23.60 ±0.12                   increase in training data does not lead to a significant change in
              8k        3.2          1.0              24.11 ±0.07                   the maximum effective minibatch size.
              8k        3.2          0.0              23.74 ±0.09
(b) Batch normalization γ initialization. Initializing γ = 0 in the last                                                     ImageNet pre-training                       COCO
BN layer of each residual block improves results for both small and large                                             kn         η     top-1 error (%)        box AP (%)    mask AP (%)
minibatches. This initialization leads to better optimization behavior which                                          256       0.1      23.60 ±0.12           35.9 ±0.1     33.9 ±0.1
has a larger positive impact when training with large minibatches.
                                                                                                                      512       0.2      23.48 ±0.09           35.8 ±0.1     33.8 ±0.2
                                                                                                                      1k        0.4      23.53 ±0.08           35.9 ±0.2     33.9 ±0.2
          model type          kn              η         top-1 error (%)                                               2k        0.8      23.49 ±0.11           35.9 ±0.1     33.9 ±0.1
          ResNet-101          256            0.1         22.08 ±0.06                                                  4k        1.6      23.56 ±0.12           35.8 ±0.1     33.8 ±0.1
          ResNet-101           8k            3.2         22.36 ±0.09                                                  8k        3.2      23.74 ±0.09           35.8 ±0.1     33.9 ±0.2
(c) The linear scaling rule applied to ResNet-101. The difference in error                                            16k       6.4      24.79 ±0.27           35.1 ±0.3     33.2 ±0.3
is about 0.3% between small and large minibatch training.                           (a) Transfer learning of large minibatch pre-training to Mask R-CNN.
                                                                                    Box and mask AP (on COCO minival) are nearly identical for ResNet-
Table 2. ImageNet classification experiments. Unless noted all                      50 models pre-trained with minibatches from 256 to 8k examples. With
experiments use ResNet-50 and are averaged over 5 trials.                           a minibatch pre-training size of 16k both ImageNet validation error and
                                                                                    COCO AP deteriorate. This indicates that as long as ImageNet error is
                                                                                    matched, large minibatches do not degrade transfer learning performance.
and a linearly scaled η = 3.2 results in an error of 22.36%
                                                                                                                      # GPUs      kn   η · 1000    iterations box AP (%)      mask AP (%)
vs. the kn = 256 baseline which achieves 22.08% with
                                                                                                                         1         2      2.5      1,280,000     35.7            33.6
η = 0.1. In other words, ResNet-101 trained with mini-
                                                                                                                         2         4      5.0       640,000      35.7            33.7
batch 8k has a small 0.28% increase in error vs. the baseline.                                                           4         8     10.0       320,000      35.7            33.5
It is likely that the minibatch size of 8k lies on the edge of                                                           8        16     20.0       160,000      35.6            33.6
the useful minibatch training regime for ResNet-101, simi-                          (b) Linear learning rate scaling applied to Mask R-CNN. Using the sin-
larly to ResNet-50 (see Figure 1).                                                  gle ResNet-50 model from [16] (thus no std is reported), we train Mask
    The training time of ResNet-101 is 92.5 minutes in our                          R-CNN using using from 1 to 8 GPUs following the linear learning rate
                                                                                    scaling rule. Box and mask AP are nearly identical across all configurations
implementation using 256 Tesla P100 GPUs and a mini-                                showing the successful generalization of the rule beyond classification.
batch size of 8k. We believe this is a compelling result if
the speed-accuracy tradeoff of ResNet-101 is preferred.                             Table 3. Object detection on COCO with Mask R-CNN [14].


ImageNet-5k. Observing the sharp increase in validation                             is very similar to the ImageNet-1k curve, showing that for
error between minibatch sizes of 8k and 16k on ImageNet-                            practitioners it is unlikely that even a 5× increase in dataset
1k (Figure 1), a natural question is if the position of this                        size will automatically lead to a meaningful increase in use-
‘elbow’ in the error curve is a function of dataset infor-                          able minibatch size. Quantitatively, using an 8k minibatch
mation content. To investigate this question, we adopt                              increases the validation error by 0.26% from 25.83% for a
the ImageNet-5k dataset suggested by Xie et al. [39] that                           256 minibatch to 26.09%. An understanding of the precise
extends ImageNet-1k to 6.8 million images (roughly 5×                               relationship between generalization error, minibatch size,
larger) by adding 4k additional categories from ImageNet-                           and dataset information content is open for future work.
22k [33]. We evaluate the 1k-way classification error on the
original ImageNet-1k validation set as in [39].                                     5.4. Generalization to Detection and Segmentation
   The minibatch size vs. validation error curve for                                  A low error rate on ImageNet is not typically an end
ImageNet-5k is shown in Figure 6. Qualitatively, the curve                          goal. Instead, the utility of ImageNet training lies in learn-


                                                                               10
                             0.3
                                                                          16
                                                                                                                               32k       ideal




time per iteration (secs)                                                       time per epoch (mins)
                            0.28                                                                                                         actual




                                                                                                             images / second
                                                                          8
                                                                                                                               16k
                            0.26                                          4
                                                                                                                                8k
                            0.24                                          2                                                     4k

                            0.22                                          1                                                     2k

                             0.2                                          0.5
                               256   512   1k        2k       4k   8k   11k                                                          8       16   32       64   128     256 352
                                            mini-batch size                                                                                            # GPUs
Figure 7. Distributed synchronous SGD timing. Time per itera-                                                Figure 8. Distributed synchronous SGD throughput. The small
tion (seconds) and time per ImageNet epoch (minutes) for training                                            overhead when moving from a single server with 8 GPUs to multi-
with different minibatch sizes. The baseline (kn = 256) uses 8                                               server distributed training (Figure 7, blue curve) results in linear
GPUs in a single server , while all other training runs distribute                                           throughput scaling that is marginally below ideal scaling (∼90%
training over (kn/256) server. With 352 GPUs (44 servers) our                                                efficiency). Most of the allreduce communication time is hid-
implementation completes one pass over all ∼1.28 million Ima-                                                den by pipelining allreduce operations with gradient computation.
geNet training images in about 30 seconds.                                                                   Moreover, this is achieved with commodity Ethernet hardware.


ing good features that transfer, or generalize well, to re-                                                  tection matches the AP of the small minibatch baseline. We
lated tasks. A question of key importance is if the features                                                 emphasize that we observed no generalization issues when
learned with large minibatches generalize as well as the fea-                                                transferring across datasets (from ImageNet to COCO) and
tures learned with small minibatches?                                                                        across tasks (from classification to detection/segmentation)
    To test this, we adopt the object detection and in-                                                      using models trained with large minibatches.
stance segmentation tasks on COCO [27] as these advanced                                                     Linear scaling rule applied to Mask R-CNN. We also
perception tasks benefit substantially from ImageNet pre-                                                    show evidence of the generality of the linear scaling rule us-
training [10]. We use the recently developed Mask R-CNN                                                      ing Mask R-CNN. In fact, this rule was already used with-
[14] system that is capable of learning to detect and segment                                                out explicit discussion in [16] and was applied effectively
object instances. We follow all of the hyper-parameter set-                                                  as the default Mask R-CNN training scheme when using 8
tings used in [14] and only change the ResNet-50 model                                                       GPUs. Table 3b provides experimental results showing that
used to initialize Mask R-CNN training. We train Mask R-                                                     when training with 1, 2, 4, or 8 GPUs the linear learning rate
CNN on the COCO trainval35k split and report results                                                         rule results in constant box and mask AP. For these experi-
on the 5k image minival split used in [14].                                                                  ments, we initialize Mask R-CNN from the released MSRA
    It is interesting to note that the concept of minibatch                                                  ResNet-50 model, as was done in [14].
size in Mask R-CNN is different from the classification
setting. As an extension of the image-centric Fast/Faster                                                    5.5. Run Time
R-CNN [9, 31], Mask R-CNN exhibits different minibatch                                                           Figure 7 shows two visualizations of the run time char-
sizes for different layers: the network backbone uses two                                                    acteristics of our system. The blue curve is the time per
images (per GPU), but each image contributes 512 Regions-                                                    iteration as minibatch size varies from 256 to 11264 (11k).
of-Interest for computing classification (multinomial cross-                                                 Notably this curve is relatively flat and the time per itera-
entropy), bounding-box regression (smooth-L1/Huber), and                                                     tion increases only 12% while scaling the minibatch size by
pixel-wise mask (28 × 28 binomial cross-entropy) losses.                                                     44×. Visualized another way, the orange curve shows the
This diverse set of minibatch sizes and loss functions pro-                                                  approximately linear decrease in time per epoch from over
vides a good test case to the robustness of our approach.                                                    16 minutes to just 30 seconds. Run time performance can
Transfer learning from large minibatch pre-training.                                                         also be viewed in terms of throughput (images / second), as
To test how large minibatch pre-training effects Mask R-                                                     shown in Figure 8. Relative to a perfectly efficient extrapo-
CNN, we take ResNet-50 models trained on ImageNet-1k                                                         lation of the 8 GPU baseline, our implementation achieves
                                                                                                             ∼90% scaling efficiency.
with 256 to 16k minibatches and use them to initialize Mask
R-CNN training. For each minibatch size we pre-train 5                                                       Acknowledgements. We would like to thank Leon Bottou for
models and then train Mask R-CNN using all 5 models on                                                       helpful discussions on theoretical background, Jerry Pan and
COCO (35 models total). We report the mean box and mask                                                      Christian Puhrsch for discussions on efficient data loading, An-
APs, averaged over the 5 trials, in Table 3a. The results                                                    drew Dye for help with debugging distributed training, and Kevin
show that as long as ImageNet validation error is kept low,                                                  Lee, Brian Dodds, Jia Ning, Koh Yew Thoon, Micah Harris, and
which is true up to 8k batch size, generalization to object de-                                              John Volk for Big Basin and hardware support.


                                                                                                        11
References                                                                 [20] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and
                                                                                P. T. P. Tang. On large-batch training for deep learning: Gen-
 [1] J. Bagga, H. Morsy, and Z. Yao.                      Opening               eralization gap and sharp minima. ICLR, 2017.
     designs for 6-pack and Wedge 100.             https:                  [21] A. Krizhevsky. One weird trick for parallelizing convolu-
     //code.facebook.com/posts/203733993317833/                                 tional neural networks. arXiv:1404.5997, 2014.
     opening-designs-for-6-pack-and-wedge-100, 2016.                       [22] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classi-
 [2] M. Barnett, L. Shuler, R. van De Geijn, S. Gupta, D. G.                    fication with deep convolutional neural nets. In NIPS, 2012.
     Payne, and J. Watts. Interprocessor collective communica-             [23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E.
     tion library (intercom). In Scalable High-Performance Com-                 Howard, W. Hubbard, and L. D. Jackel. Backpropagation
     puting Conference, 1994.                                                   applied to handwritten zip code recognition. Neural compu-
 [3] L. Bottou. Curiously fast convergence of some stochastic                   tation, 1989.
     gradient descent algorithms. Unpublished open problem of-             [24] K. Lee. Introducing Big Basin: Our next-generation
     fered to the attendance of the SLDS 2009 conference, 2009.                 AI hardware.            https://code.facebook.com/posts/
 [4] L. Bottou, F. E. Curtis, and J. Nocedal. Opt. methods for                  1835166200089399/introducing-big-basin, 2017.
     large-scale machine learning. arXiv:1606.04838, 2016.                 [25] M. Li. Scaling Distributed Machine Learning with System
 [5] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Joze-                         and Algorithm Co-design. PhD thesis, Carnegie Mellon Uni-
     fowicz.        Revisiting Distributed Synchronous SGD.                     versity, 2017.
     arXiv:1604.00981, 2016.                                               [26] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and
 [6] K. Chen and Q. Huo. Scalable training of deep learning ma-                 S. Belongie. Feature pyramid networks for object detection.
     chines by incremental block training with intra-block par-                 In CVPR, 2017.
     allel optimization and blockwise model-update filtering. In           [27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
     ICASSP, 2016.                                                              manan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Com-
 [7] R. Collobert, J. Weston, L. Bottou, M. Karlen,                             mon objects in context. In ECCV. 2014.
     K. Kavukcuoglu, and P. Kuksa. Natural language pro-                   [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
     cessing (almost) from scratch. JMLR, 2011.                                 networks for semantic segmentation. In CVPR, 2015.
 [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang,                 [29] Y. Nesterov. Introductory lectures on convex optimization: A
     E. Tzeng, and T. Darrell. Decaf: A deep convolutional acti-                basic course. Springer, 2004.
     vation feature for generic visual recognition. In ICML, 2014.         [30] R. Rabenseifner. Optimization of collective reduction oper-
                                                                                ations. In ICCS. Springer, 2004.
 [9] R. Girshick. Fast R-CNN. In ICCV, 2015.
                                                                           [31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: To-
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea-
                                                                                wards real-time object detection with region proposal net-
     ture hierarchies for accurate object detection and semantic
                                                                                works. In NIPS, 2015.
     segmentation. In CVPR, 2014.
                                                                           [32] H. Robbins and S. Monro. A stochastic approximation
[11] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable                    method. The annals of mathematical statistics, 1951.
     Parallel Programming with the Message-Passing Interface.              [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
     MIT Press, Cambridge, MA, 1999.                                            S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
[12] S. Gross and M. Wilber. Training and investigating Resid-                  A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual
     ual Nets.      https://github.com/facebook/fb.                             Recognition Challenge. IJCV, 2015.
     resnet.torch, 2016.                                                   [34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus,
[13] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why                       and Y. LeCun. Overfeat: Integrated recognition, localization
     random reshuffling beats stochastic gradient descent.                      and detection using convolutional networks. In ICLR, 2014.
     arXiv:1510.08560, 2015.                                               [35] K. Simonyan and A. Zisserman. Very deep convolutional
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-                   networks for large-scale image recognition. In ICLR, 2015.
     CNN. arXiv:1703.06870, 2017.                                          [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
[15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into                     D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
     rectifiers: Surpassing human-level performance on imagenet                 Going deeper with convolutions. In CVPR, 2015.
     classification. In ICCV, 2015.                                        [37] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning                collective comm. operations in MPICH. IJHPCA, 2005.
     for image recognition. In CVPR, 2016.                                 [38] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
[17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed,                      W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey,
     N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath,              et al. Google’s neural machine translation system: Bridg-
     et al. Deep neural networks for acoustic modeling in speech                ing the gap between human and machine translation.
     recognition: The shared views of four research groups. IEEE                arXiv:1609.08144, 2016.
     Signal Processing Magazine, 2012.                                     [39] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated
[18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and                     residual transformations for deep neural networks. In CVPR,
     Y. Bengio. Quantized neural networks: Training neu-                        2017.
     ral networks with low precision weights and activations.              [40] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stol-
     arXiv:1510.08560, 2016.                                                    cke, D. Yu, and G. Zweig. The Microsoft 2016 Conversa-
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating                 tional Speech Recognition System. arXiv:1609.03528, 2016.
     deep network training by reducing internal covariate shift. In        [41] M. D. Zeiler and R. Fergus. Visualizing and understanding
     ICML, 2015.                                                                convolutional neural networks. In ECCV, 2014.


                                                                      12