Sources — Compute-Structure AI History

Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour

Priya Goyal Piotr Dollár Ross Girshick Pieter Noordhuis
Lukasz Wesolowski Aapo Kyrola Andrew Tulloch Yangqing Jia Kaiming He

Facebook

arXiv:1706.02677v2 [cs.CV] 30 Apr 2018
40
Abstract

ImageNet top-1 validation error
35
Deep learning thrives with large neural networks and
large datasets. However, larger networks and larger
datasets result in longer training times that impede re- 30

search and development progress. Distributed synchronous
SGD offers a potential solution to this problem by dividing 25
SGD minibatches over a pool of parallel workers. Yet to
make this scheme efficient, the per-worker workload must 20
64 128 256 512 1k 2k 4k 8k 16k 32k 64k
be large, which implies nontrivial growth in the SGD mini-
mini-batch size
batch size. In this paper, we empirically show that on the
Figure 1. ImageNet top-1 validation error vs. minibatch size.
ImageNet dataset large minibatches cause optimization dif-
Error range of plus/minus two standard deviations is shown. We
ficulties, but when these are addressed the trained networks
present a simple and general technique for scaling distributed syn-
exhibit good generalization. Specifically, we show no loss chronous SGD to minibatches of up to 8k images while maintain-
of accuracy when training with large minibatch sizes up to ing the top-1 error of small minibatch training. For all minibatch
8192 images. To achieve this result, we adopt a hyper- sizes we set the learning rate as a linear function of the minibatch
parameter-free linear scaling rule for adjusting learning size and apply a simple warmup phase for the first few epochs of
rates as a function of minibatch size and develop a new training. All other hyper-parameters are kept fixed. Using this
warmup scheme that overcomes optimization challenges simple approach, accuracy of our models is invariant to minibatch
early in training. With these simple techniques, our Caffe2- size (up to an 8k minibatch size). Our techniques enable a lin-
based system trains ResNet-50 with a minibatch size of 8192 ear reduction in training time with ∼90% efficiency as we scale
on 256 GPUs in one hour, while matching small minibatch to large minibatch sizes, allowing us to train an accurate 8k mini-
batch ResNet-50 model in 1 hour on 256 GPUs.
accuracy. Using commodity hardware, our implementation
achieves ∼90% scaling efficiency when moving from 8 to
256 GPUs. Our findings enable training visual recognition tation [8, 10, 28]. Moreover, this pattern generalizes: larger
models on internet-scale data with high efficiency. datasets and neural network architectures consistently yield
improved accuracy across all tasks that benefit from pre-
training [22, 41, 34, 35, 36, 16]. But as model and data
1. Introduction scale grow, so does training time; discovering the potential
and limits of large-scale deep learning requires developing
Scale matters. We are in an unprecedented era in AI novel techniques to keep training time manageable.
research history in which the increasing data and model The goal of this report is to demonstrate the feasibility of,
scale is rapidly improving accuracy in computer vision and to communicate a practical guide to, large-scale train-
[22, 41, 34, 35, 36, 16], speech [17, 40], and natural lan- ing with distributed synchronous stochastic gradient descent
guage processing [7, 38]. Take the profound impact in com- (SGD). As an example, we scale ResNet-50 [16] training,
puter vision as an example: visual representations learned originally performed with a minibatch size of 256 images
by deep convolutional neural networks [23, 22] show excel- (using 8 Tesla P100 GPUs, training time is 29 hours), to
lent performance on previously challenging tasks like Ima- larger minibatches (see Figure 1). In particular, we show
geNet classification [33] and can be transferred to difficult that with a large minibatch size of 8192, we can train
perception problems such as object detection and segmen- ResNet-50 in 1 hour using 256 GPUs while maintaining

1
the same level of accuracy as the 256 minibatch baseline. 2. Large Minibatch SGD
While distributed synchronous SGD is now commonplace,
no existing results show that generalization accuracy can be We start by reviewing the formulation of Stochastic Gra-
maintained with minibatches as large as 8192 or that such dient Descent (SGD), which will be the foundation of our
high-accuracy models can be trained in such short time. discussions in the following sections. We consider super-
vised learning by minimizing a loss L(w) of the form:
To tackle this unusually large minibatch size, we employ
a simple and hyper-parameter-free linear scaling rule to ad- 1 X
L(w) = l(x, w). (1)
just the learning rate. While this guideline is found in ear- |X|
x∈X
lier work [21, 4], its empirical limits are not well under-
stood and informally we have found that it is not widely Here w are the weights of a network, X is a labeled training
known to the research community. To successfully apply set, and l(x, w) is the loss computed from samples x ∈ X
this rule, we present a new warmup strategy, i.e., a strategy and their labels y. Typically l is the sum of a classification
of using lower learning rates at the start of training [16], to loss (e.g., cross-entropy) and a regularization loss on w.
overcome early optimization difficulties. Importantly, not Minibatch Stochastic Gradient Descent [32], usually re-
only does our approach match the baseline validation error, ferred to as simply as SGD in recent literature even though
but also yields training error curves that closely match the it operates on minibatches, performs the following update:
small minibatch baseline. Details are presented in §2. 1X
wt+1 = wt − η ∇l(x, wt ). (2)
Our comprehensive experiments in §5 show that opti- n
x∈B
mization difficulty is the main issue with large minibatches,
rather than poor generalization (at least on ImageNet), in Here B is a minibatch sampled from X and n = |B| is the
contrast to some recent studies [20]. Additionally, we show minibatch size, η is the learning rate, and t is the iteration
that the linear scaling rule and warmup generalize to more index. Note that in practice we use momentum SGD; we
complex tasks including object detection and instance seg- return to a discussion of momentum in §3.
mentation [9, 31, 14, 28], which we demonstrate via the 2.1. Learning Rates for Large Minibatches
recently developed Mask R-CNN [14]. We note that a ro-
bust and successful guideline for addressing a wide range Our goal is to use large minibatches in place of small
of minibatch sizes has not been presented in previous work. minibatches while maintaining training and generalization
While the strategy we deliver is simple, its successful accuracy. This is of particular interest in distributed learn-
application requires correct implementation with respect to ing, because it can allow us to scale to multiple workers2 us-
seemingly minor and often not well understood implemen- ing simple data parallelism without reducing the per-worker
tation details within deep learning libraries. Subtleties in the workload and without sacrificing model accuracy.
implementation of SGD can lead to incorrect solutions that As we will show in comprehensive experiments, we
are difficult to discover. To provide more helpful guidance found that the following learning rate scaling rule is sur-
we describe common pitfalls and the relevant implementa- prisingly effective for a broad range of minibatch sizes:
tion details that can trigger these traps in §3.
Linear Scaling Rule: When the minibatch size is
Our strategy applies regardless of framework, but
multiplied by k, multiply the learning rate by k.
achieving efficient linear scaling requires nontrivial com-
munication algorithms. We use the open-source Caffe21 All other hyper-parameters (weight decay, etc.) are kept un-
deep learning framework and Big Basin GPU servers [24], changed. As we will show in §5, the linear scaling rule can
which operates efficiently using standard Ethernet network- help us to not only match the accuracy between using small
ing (as opposed to specialized network interfaces). We de- and large minibatches, but equally importantly, to largely
scribe the systems algorithms that enable our approach to match their training curves, which enables rapid debugging
operate near its full potential in §4. and comparison of experiments prior to convergence.
The practical advances described in this report are help-
Interpretation. We present an informal discussion of the
ful across a range of domains. In an industrial domain,
linear scaling rule and why it may be effective. Consider
our system unleashes the potential of training visual models
a network at iteration t with weights wt , and a sequence
from internet-scale data, enabling training with billions of
of k minibatches Bj for 0 ≤ j < k each of size n. We
images per day. Of equal importance, in a research domain,
compare the effect of executing k SGD iterations with small
we have found it to simplify migrating algorithms from a
minibatches Bj and learning rate η versus a single iteration
single-GPU to a multi-GPU implementation without requir-
with a large minibatch ∪j Bj of size kn and learning rate η̂.
ing hyper-parameter search, e.g. in our experience migrat-
2 We use the terms ‘worker’ and ‘GPU’ interchangeably in this work, al-
ing Faster R-CNN [31] and ResNets [16] from 1 to 8 GPUs.
though other implementations of a ‘worker’ are possible. ‘Server’ denotes
1 http://www.caffe2.ai a set of 8 GPUs that does not require communication over a network.

2
According to (2), after k iterations of SGD with learning 2.2. Warmup
rate η and a minibatch size of n we have:
As we discussed, for large minibatches (e.g., 8k) the lin-
1XX ear scaling rule breaks down when the network is changing
wt+k = wt − η ∇l(x, wt+j ). (3)
n rapidly, which commonly occurs in early stages of train-
j<k x∈Bj
ing. We find that this issue can be alleviated by a properly
On the other hand, taking a single step with the large mini-
designed warmup [16], namely, a strategy of using less ag-
batch ∪j Bj of size kn and learning rate η̂ yields:
gressive learning rates at the start of training.
1 XX
ŵt+1 = wt − η̂ ∇l(x, wt ). (4) Constant warmup. The warmup strategy presented in [16]
kn uses a low constant learning rate for the first few epochs of
j<k x∈Bj

As expected, the updates differ, and it is unlikely that training. As we will show in §5, we have found constant
ŵt+1 = wt+k . However, if we could assume ∇l(x, wt ) ≈ warmup particularly helpful for prototyping object detec-
∇l(x, wt+j ) for j < k, then setting η̂ = kη would yield tion and segmentation methods [9, 31, 26, 14] that fine-tune
ŵt+1 ≈ wt+k , and the updates from small and large mini- pre-trained layers together with newly initialized layers.
batch SGD would be similar. Although this is a strong as- In our ImageNet experiments with a large minibatch of
sumption, we emphasize that if it were true the two updates size kn, we have tried to train with the low learning rate of
are similar only if we set η̂ = kη. η for the first 5 epochs and then return to the target learn-
The above interpretation gives intuition for one case ing rate of η̂ = kη. However, given a large k, we find that
where we may hope the linear scaling rule to apply. In our this constant warmup is not sufficient to solve the optimiza-
experiments with η̂ = kη (and warmup), small and large tion problem, and a transition out of the low learning rate
minibatch SGD not only result in models with the same fi- warmup phase can cause the training error to spike. This
nal accuracy, but also, the training curves match closely. leads us to propose the following gradual warmup.
Our empirical results suggest that the above approximation Gradual warmup. We present an alternative warmup that
might be valid in large-scale, real-world data. gradually ramps up the learning rate from a small to a large
However, there are at least two cases when the condition value. This ramp avoids a sudden increase of the learning
∇l(x, wt ) ≈ ∇l(x, wt+j ) will clearly not hold. First, in ini- rate, allowing healthy convergence at the start of training.
tial training when the network is changing rapidly, it does In practice, with a large minibatch of size kn, we start from
not hold. We address this by using a warmup phase, dis- a learning rate of η and increment it by a constant amount at
cussed in §2.2. Second, minibatch size cannot be scaled in- each iteration such that it reaches η̂ = kη after 5 epochs (re-
definitely: while results are stable for a large range of sizes, sults are robust to the exact duration of warmup). After the
beyond a certain point accuracy degrades rapidly. Interest- warmup, we go back to the original learning rate schedule.
ingly, this point is as large as ∼8k in ImageNet experiments.
2.3. Batch Normalization with Large Minibatches
Discussion. The above linear scaling rule was adopted by
Krizhevsky [21], if not earlier. However, Krizhevsky re- Batch Normalization (BN) [19] computes statistics along
ported a 1% increase of error when increasing the minibatch the minibatch dimension: this breaks the independence of
size from 128 to 1024, whereas we show how to maintain each sample’s loss, and changes in minibatch size change
accuracy across a much broader regime of minibatch sizes. the underlying definition of the loss function being opti-
Chen et al. [5] presented a comparison of numerous dis- mized. In the following we will show that a commonly used
tributed SGD variants, and although their work also em- ‘shortcut’, which may appear to be a practical consideration
ployed the linear scaling rule, it did not establish a small to avoid communication overhead, is actually necessary for
minibatch baseline. Li [25] (§4.6) showed distributed Ima- preserving the loss function when changing minibatch size.
geNet training with minibatches up to 5120 without a loss We note that (1) and (2) assume the per-sample loss
in accuracy after convergence. However, their work did not l(x, w) is independent of all other samples. This is not the
demonstrate a hyper-parameter search-free rule for adjust- case when BN is performed and activations are computed
ing the learning rate as a function of minibatch size, which across samples. We write lB (x, w) to denote that the loss of
is a central contribution of our work. a single sample x depends on the statistics of all samples in
In recent work, Bottou et al. [4] (§4.2) review theoretical its minibatch B. We denote theP loss over a single minibatch
tradeoffs of minibatching and show that with the linear scal- B of size n as L(B, w) = n1 x∈B lB (x, w). With BN, the
ing rule, solvers follow the same training curve as a function training set can be thought of as containing all distinct sub-
of number of examples seen, and suggest the learning rate sets of size n drawn from the original training set X, which
should not exceed a maximum rate independent of mini- we denote as X n . The training loss L(w) then becomes:
batch size (which justifies warmup). Our work empirically 1 X
L(w) = L(B, w). (5)
tests these theories with unprecedented minibatch sizes. |X n | n
B∈X

3
If we view B as a ‘single sample’ in X n , then the loss of contributed by ε(x, wt ). If there is no weight decay term,
each single sample B is computed independently. there are many equivalent ways of scaling the learning rate,
Note that the minibatch size n over which the BN statis- including scaling the term ε(x, wt ). However, as can be
tics are computed is a key component of the loss: if the per- seen from (8), in general this is not the case. We summarize
worker minibatch sample size n is changed, it changes the these observations in the following remark:
underlying loss function L that is optimized. More specif- Remark 1: Scaling the cross-entropy loss is
ically, the mean/variance statistics computed by BN with not equivalent to scaling the learning rate.
different n exhibit different levels of random variation.
Momentum correction. Momentum SGD is a commonly
In the case of distributed (and multi-GPU) training, if the
adopted modification to the vanilla SGD in (2). A reference
per-worker sample size n is kept fixed and the total mini-
implementation of momentum SGD has the following form:
batch size is kn, it can be viewed a minibatch of k samples
with each sample Bj independently selected from X n , so 1X
ut+1 = mut + ∇l(x, wt )
the underlying loss function is unchanged and is still de- n (9)
x∈B
fined in X n . Under this point of view, in the BN setting wt+1 = wt − ηut+1 .
after seeing k minibatches Bj , (3) and (4) become:
X Here m is the momentum decay factor and u is the update
wt+k = wt − η ∇L(Bj , wt+j ), (6) tensor. A popular variant absorbs the learning rate η into
j<k the update tensor. Substituting vt for ηut in (9) yields:
1X 1X
ŵt+1 = wt − η̂ ∇L(Bj , wt ). (7) vt+1 = mvt + η ∇l(x, wt )
k n (10)
j<k x∈B
Following similar logic as in §2.1, we set η̂ = kη and we wt+1 = wt − vt+1 .
keep the per-worker sample size n constant when we change For a fixed η, the two are equivalent. However, we note that
the number of workers k. while u only depends on the gradients and is independent
In this work, we use n = 32 which has performed well of η, v is entangled with η. When η changes, to maintain
for a wide range of datasets and networks [19, 16]. If n is equivalence with the reference variant in P (9), the update for
adjusted, it should be viewed as a hyper-parameter of BN, v should be: vt+1 = m ηt+1 ηt v t + η 1
t+1 n ∇l(x, wt ). We
not of distributed training. We also note that the BN statis- ηt+1
refer to the factor ηt as the momentum correction. We
tics should not be computed across all workers, not only for found that this is especially important for stabilizing train-
the sake of reducing communication, but also for maintain- ing when ηt+1 ηt , otherwise the history term vt is too
ing the same underlying loss function being optimized. small which leads to instability (for ηt+1 < ηt momentum
correction is less critical). This leads to our second remark:
3. Subtleties and Pitfalls of Distributed SGD Remark 2: Apply momentum correction
In practice a distributed implementation has many sub- after changing learning rate if using (10).
tleties. Many common implementation errors change the Gradient aggregation. For k workers each with a per-
definitions of hyper-parameters, leading to models that train worker minibatch of size n, following (4), gradient aggre-
but whose error may be higher than expected, and such is-
P P over the entire set of kn examples
gation must be performed
1
sues can be difficult to discover. While the remarks below according to kn j x∈Bj l(x, wt ). Loss layers are typi-
are straightforward, they are important to consider explic- cally implemented to compute an average loss over their lo-
itly to faithfully implement the underlying solver.
P input, which amounts to computing a per-worker loss of
cal
Weight decay. Weight decay is actually the outcome of the l(x, wt )/n. Given this, correct aggregation requires av-
gradient of an L2-regularization term in the loss function. eraging the k gradients in order to recover the missing 1/k
More formally, the per-sample loss in (1) can be written as factor. However, standard communication primitives like
l(x, w) = λ2 kwk2 + ε(x, w). Here λ2 kwk2 is the sample- allreduce [11] perform summing, not averaging. Therefore,
independent L2 regularization on the weights and ε(x, w) it is more efficient to absorb the 1/k scaling into the loss,
is a sample-dependent term such as the cross-entropy loss. in which case only the loss’s gradient with respect to its in-
The SGD update in (2) can be written as: put needs to be scaled, removing the need to scale the entire
gradient vector. We summarize this as follows:
1X
wt+1 = wt − ηλwt − η ∇ε(x, wt ). (8) Remark 3: Normalize the per-worker loss by
n
x∈B total minibatch size kn, not per-worker size n.
In
P practice, usually only the sample-dependent term We also note that it may be incorrect to ‘cancel k’ by setting
∇ε(x, wt ) is computed by backprop; the term λwt is η̂ = η (not kη) and normalizing the loss by 1/n (not 1/kn),
computed separately and added to the aggregated gradients which can lead to incorrect weight decay (see Remark 1).

4
Data shuffling. SGD is typically analyzed as a process that number of GPU-to-host memory copies and a CPU reduc-
samples data randomly with replacement. In practice, com- tion otherwise. NCCL uses GPU kernels to accelerate in-
mon SGD implementations apply random shuffling of the traserver collectives, so this approach dedicates more time
training set during each SGD epoch, which can give better on the GPU to backprop while using the CPU resources that
results [3, 13]. To provide fair comparisons with baselines would otherwise have been idle to improve throughput.
that use shuffling (e.g., [16]), we ensure the samples in one For interserver allreduce, we implemented two of the
epoch done by k workers are from a single consistent ran- best algorithms for bandwidth-limited scenarios: the re-
dom shuffling of the training set. To achieve this, for each cursive halving and doubling algorithm [30, 37] and the
epoch we use a random shuffling that is partitioned into k bucket algorithm (also known as the ring algorithm) [2].
parts, each of which is processed by one of the k workers. For both, each server sends and receives 2 p−1 p b bytes of
Failing to correctly implement random shuffling in multiple data, where b is the buffer size in bytes and p is the num-
workers may lead to noticeably different behavior, which ber of servers. While the halving/doubling algorithm con-
may contaminate results and conclusions. In summary: sists of 2 log2 (p) communication steps, the ring algorithm
consists of 2(p − 1) steps. This generally makes the halv-
Remark 4: Use a single random shuffling of the training ing/doubling algorithm faster in latency-limited scenarios
data (per epoch) that is divided amongst all k workers. (i.e., for small buffer sizes and/or large server counts). In
practice, we found the halving/doubling algorithm to per-
4. Communication form much better than the ring algorithm for buffer sizes
In order to scale beyond the 8 GPUs in a single Big Basin up to a million elements (and even higher on large server
server [24], gradient aggregation has to span across servers counts). On 32 servers (256 GPUs), using halving/doubling
on a network. To allow for near perfect linear scaling, the led to a speedup of 3× over the ring algorithm.
aggregation must be performed in parallel with backprop. The halving/doubling algorithm consists of a reduce-
This is possible because there is no data dependency be- scatter collective followed by an allgather. In the first step
tween gradients across layers. Therefore, as soon as the gra- of reduce-scatter, servers communicate in pairs (rank 0 with
dient for a layer is computed, it is aggregated across work- 1, 2 with 3, etc.), sending and receiving for different halves
ers, while gradient computation for the next layer continues of their input buffers. For example, rank 0 sends the second
(as discussed in [5]). We give full details next. half of its buffer to 1 and receives the first half of the buffer
from 1. A reduction over the received data is performed be-
4.1. Gradient Aggregation fore proceeding to the next step, where the distance to the
destination rank is doubled while the data sent and received
For every gradient, aggregation is done using an allre- is halved. After the reduce-scatter phase is finished, each
duce operation (similar to the MPI collective operation server has a portion of the final reduced vector.
MPI Allreduce [11]). Before allreduce starts every GPU has
This is followed by the allgather phase, which retraces
its locally computed gradients and after allreduce completes
the communication pattern from the reduce-scatter in re-
every GPU has the sum of all k gradients. As the number
verse, this time simply concatenating portions of the final
of parameters grows and compute performance of GPUs in-
reduced vector. At each server, the portion of the buffer that
creases, it becomes harder to hide the cost of aggregation in
was being sent in the reduce-scatter is received in the all-
the backprop phase. Training techniques to overcome these
gather, and the portion that was being received is now sent.
effects are beyond the scope of this work (e.g., quantized
To support non-power-of-two number of servers, we
gradients [18], Block-Momentum SGD [6]). However, at
used the binary blocks algorithm [30]. This is a generalized
the scale of this work, collective communication was not
version of the halving/doubling algorithm where servers
a bottleneck, as we were able to achieve near-linear SGD
are partitioned into power-of-two blocks and two additional
scaling by using an optimized allreduce implementation.
communication steps are used, one immediately after the
Our implementation of allreduce consists of three phases
intrablock reduce-scatter and one before the intrablock all-
for communication within and across servers: (1) buffers
gather. Non-power-of-two cases have some degree of load
from the 8 GPUs within a server are summed into a sin-
imbalance compared to power-of-two, though in our runs
gle buffer for each server, (2) the results buffers are shared
we did not see significant performance degradation.
and summed across all servers, and finally (3) the results
are broadcast onto each GPU. For the local reduction and 4.2. Software
broadcast in phases (1) and (3) we used NVIDIA Collective
Communication Library (NCCL)3 for buffers of size 256 The allreduce algorithms described are implemented in
KB or more and a simple implementation consisting of a Gloo4 , a library for collective communication. It supports
3 https://developer.nvidia.com/nccl 4 https://github.com/facebookincubator/gloo

5
multiple communication contexts, which means no addi- 5. Main Results and Analysis
tional synchronization is needed to execute multiple allre-
duce instances in parallel. Local reduction and broadcast Our main result is that we can train ResNet-50 [16] on
(described as phases (1) and (3)) are pipelined with inter- ImageNet [33] using 256 workers in one hour, while match-
server allreduce where possible. ing the accuracy of small minibatch training. Applying the
linear scaling rule along with a warmup strategy allows us to
Caffe2 supports multi-threaded execution of the compute seamlessly scale between small and large minibatches (up
graph that represents a training iteration. Whenever there is to 8k images) without tuning additional hyper-parameters
no data dependency between subgraphs, multiple threads or impacting accuracy. In the following subsections we:
can execute those subgraphs in parallel. Applying this to (1) describe experimental settings, (2) establish the effec-
backprop, local gradients can be computed in sequence, tiveness of large minibatch training, (3) perform a deeper
without dealing with allreduce or weight updates. This experimental analysis, (4) show our findings generalize to
means that during backprop, the set of runnable subgraphs object detection/segmentation, and (5) provide timings.
may grow faster than we can execute them. For subgraphs
that contain an allreduce run, all servers must choose to exe- 5.1. Experimental Settings
cute the same subgraph from the set of runnable subgraphs. The 1000-way ImageNet classification task [33] serves
Otherwise, we risk distributed deadlock where servers are as our main experimental benchmark. Models are trained
attempting to execute non-intersecting sets of subgraphs. on the ∼1.28 million training images and evaluated by top-
With allreduce being a collective operation, servers would 1 error on the 50,000 validation images.
time out waiting. To ensure correct execution we impose a We use the ResNet-50 [16] variant from [12], noting that
partial order on these subgraphs. This is implemented using the stride-2 convolutions are on 3×3 layers instead of on
a cyclical control input, where completion of the n-th allre- 1×1 layers as in [16]. We use Nesterov momentum [29]
duce unblocks execution of the (n + c)-th allreduce, with with m of 0.9 following [12] but note that standard mo-
c being the maximum number of concurrent allreduce runs. mentum as was used in [16] is equally effective. We use a
Note that this number should be chosen to be lower than the weight decay λ of 0.0001 and following [16] we do not ap-
number of threads used to execute the full compute graph. ply weight decay on the learnable BN coefficients (namely,
γ and β in [19]). In order to keep the training objective
fixed, which depends on the BN batch size n as described
4.3. Hardware in §2.3, we use n = 32 throughout, regardless of the overall
minibatch size. As in [12], we compute the BN statistics
We used Facebook’s Big Basin [24] GPU servers for using running average (with momentum 0.9).
our experiments. Each server contains 8 NVIDIA Tesla All models are trained for 90 epochs regardless of mini-
P100 GPUs that are interconnected with NVIDIA NVLink. batch sizes. We apply the linear scaling rule from §2.1 and
kn
For local storage, each server has 3.2TB of NVMe SSDs. use a learning rate of η = 0.1 · 256 that is linear in the mini-
For network connectivity, the servers have a Mellanox batch size kn. With k = 8 workers (GPUs) and n = 32
ConnectX-4 50Gbit Ethernet network card and are con- samples per worker, η = 0.1 as in [16]. We call this num-
kn
nected to Wedge100 [1] Ethernet switches. ber (0.1 · 256 ) the reference learning rate, and reduce it by
1/10 at the 30-th, 60-th, and 80-th epoch, similar to [16].
We have found 50Gbit of network bandwidth sufficient
We adopt the initialization of [15] for all convolutional
for distributed synchronous SGD for ResNet-50, per the
layers. The 1000-way fully-connected layer is initialized by
following analysis. ResNet-50 has approximately 25 mil-
drawing weights from a zero-mean Gaussian with standard
lion parameters. This means the total size of parameters is
deviation of 0.01. We have found that although SGD with a
25 · 106 · sizeof(float) = 100MB. Backprop for ResNet-50
small minibatch is not sensitive to initialization due to BN,
on a single NVIDIA Tesla P100 GPU takes 120 ms. Given
this is not the case for a substantially large minibatch. Addi-
that allreduce requires ∼2× bytes on the network compared
tionally we require an appropriate warmup strategy to avoid
to the value it operates on, this leads to a peak bandwidth re-
optimization difficulties in early training.
quirement of 200MB/0.125s = 1600MB/s, or 12.8 Gbit/s,
For BN layers, the learnable scaling coefficient γ is ini-
not taking into account communication overhead. When we
tialized to be 1, except for each residual block’s last BN
add a smudge factor for network overhead, we reach a peak
where γ is initialized to be 0. Setting γ = 0 in the last BN of
bandwidth requirement for ResNet-50 of ∼15 Gbit/s.
each residual block causes the forward/backward signal ini-
As this peak bandwidth requirement only holds during tially to propagate through the identity shortcut of ResNets,
backprop, the network is free to be used for different tasks which we found to ease optimization at the start of training.
that are less latency sensitive then aggregation (e.g. reading This initialization improves all models but is particularly
data or saving network snapshots) during the forward pass. helpful for large minibatch training as we will show.

6
We use scale and aspect ratio data augmentation [36] as k n kn η top-1 error (%)
in [12]. The network input image is a 224×224 pixel ran- baseline (single server) 8 32 256 0.1 23.60 ±0.12
dom crop from an augmented image or its horizontal flip. no warmup, Figure 2a 256 32 8k 3.2 24.84 ±0.37
The input image is normalized by the per-color mean and constant warmup, Figure 2b 256 32 8k 3.2 25.88 ±0.56
standard deviation, as in [12]. gradual warmup, Figure 2c 256 32 8k 3.2 23.74 ±0.09

Handling random variation. As models are subject to Table 1. Validation error on ImageNet using ResNet-50 (mean
random variation in training, we compute a model’s error and std computed over 5 trials). We compare the small minibatch
rate as the median error of the final 5 epochs. Moreover, model (kn=256) with large minibatch models (kn=8k) with vari-
ous warmup strategies. Observe that the top-1 validation error for
we report the mean and standard deviation (std) of the error
small and large minibatch training (with gradual warmup) is quite
from 5 independent runs. This gives us more confidence in close: 23.60% ±0.12 vs. 23.74% ±0.09, respectively.
our results and also provides a measure of model stability.
The random variation of ImageNet models has generally
not been reported in previous work (largely due to resource
Training error. Training curves are shown in Figure 2.
limitations). We emphasize that ignoring random variation
With no warmup (2a), the training curve for large minibatch
may cause unreliable conclusions, especially if results are
of kn = 8k is inferior to training with a small minibatch of
from a single trial, or the best of many.
kn = 256 across all epochs. A constant warmup strategy
Baseline. Under these settings, we establish a ResNet-50 (2b) actually degrades results: although the small constant
baseline using k = 8 (8 GPUs in one server) and n = 32 learning rate can decrease error during warmup, the error
images per worker (minibatch size of kn = 256), as in [16]. spikes immediately after and training never fully recovers.
Our baseline has a top-1 validation error of 23.60% ±0.12. Our main result is that with gradual warmup, large mini-
As a reference, ResNet-50 from fb.resnet.torch [12] batch training error matches the baseline training curve ob-
has 24.01% error, and that of the original ResNet paper [16] tained with small minibatches, see Figure 2c. Although
has 24.7% under weaker data augmentation. the large minibatch curve starts higher due to the low η
in the warmup phase, it catches up shortly thereafter. Af-
5.2. Optimization or Generalization Issues? ter about 20 epochs, the small and large minibatch training
We establish our main results on large minibatch train- curves match closely. The comparison between no warmup
ing by exploring optimization and generalization behaviors. and gradual warmup suggests that large minibatch sizes are
We will demonstrate that with a proper warmup strategy, challenged by optimization difficulties in early training and
large minibatch SGD can both match the training curves of if these difficulties are addressed, the training error and its
small minibatch SGD and also match the validation error. curve can match a small minibatch baseline closely.
In other words, in our experiments both optimization and
generalization of large minibatch training matches that of
small minibatch training. Moreover, in §5.4 we will show Validation error. Table 1 shows the validation error for
that these models exhibit good generalization behavior to the three warmup strategies. The no-warmup variant has
∼1.2% higher validation error than the baseline which is
the object detection/segmentation transfer tasks, matching
the transfer quality of small minibatch models. likely caused by the ∼2.1% increase in training error (Fig-
For the following results, we use k = 256 and n = 32, ure 2a), rather than overfitting or other causes for poor gen-
which results in a minibatch size kn = 8k (we use ‘1k’ eralization. This argument is further supported by our grad-
to denote 1024). As discussed, our baseline has a mini- ual warmup experiment. The gradual warmup variant has
batch size of kn = 256 and a reference learning rate of a validation error within 0.14% of the baseline (noting that
η = 0.1. Applying the linear scaling rule gives η = 3.2 std of these estimates is ∼0.1%). Given that the final train-
as the reference learning rate for our large minibatch runs. ing errors (Figure 2c) match nicely in this case, it shows that
We test three warmup strategies as discussed in §2.2: no if the optimization issues are addressed, there is no apparent
warmup, constant warmup with η = 0.1 for 5 epochs, generalization degradation observed using large minibatch
and gradual warmup which starts with η = 0.1 and is training, even if the minibatch size goes from 256 to 8k.
linearly increased to η = 3.2 over 5 epochs. All models Finally, Figure 4 shows both the training and valida-
are trained from scratch and all other hyper-parameters are tion curves for the large minibatch training with gradual
kept fixed. We emphasize that while better results for any warmup. As can be seen, validation error starts to match
particular minibatch size could be obtained by optimizing the baseline closely after the second learning rate drop; ac-
hyper-parameters for that case; our goal is to match er- tually, the validation curves can match earlier if BN statis-
rors across minibatch sizes by using a general strategy that tics are recomputed prior to evaluating the error instead of
avoids hyper-parameter tuning for each minibatch size. using the running average (see also caption in Figure 4).

7
100
kn=256, = 0.1, 23.60% 0.12 kn=256, = 0.1, 23.60% 0.12 kn=256, = 0.1, 23.60% 0.12
90
kn= 8k, = 3.2, 24.84% 0.37 kn= 8k, = 3.2, 25.88% 0.56 kn= 8k, = 3.2, 23.74% 0.09
80

training error %
70

20
0 20 40 60 80 0 20 40 60 80 0 20 40 60 80
epochs epochs epochs
(a) no warmup (b) constant warmup (c) gradual warmup

Figure 2. Warmup. Training error curves for minibatch size 8192 using various warmup strategies compared to minibatch size 256.
Validation error (mean±std of 5 runs) is shown in the legend, along with minibatch size kn and reference learning rate η.

100
kn=256, = 0.1, 23.60% 0.12 kn=256, = 0.1, 23.60% 0.12 kn=256, = 0.1, 23.60% 0.12
90 kn=512, = 0.2, 23.48% 0.09 kn= 1k, = 0.4, 23.53% 0.08
kn=128, = 0.05 23.49% 0.12
80

training error %
70

20
0 20 40 60 80 0 20 40 60 80 0 20 40 60 80
100
kn=256, = 0.1, 23.60% 0.12 kn=256, = 0.1, 23.60% 0.12 kn=256, = 0.1, 23.60% 0.12
90 kn= 4k, = 1.6, 23.56% 0.12 kn= 8k, = 3.2, 23.74% 0.09
kn= 2k, = 0.8, 23.49% 0.11
80

training error %
70

20
0 20 40 60 80 0 20 40 60 80 0 20 40 60 80
100
kn=256, = 0.1, 23.60% 0.12 kn=256, = 0.1, 23.60% 0.12 kn=256, = 0.1, 23.60% 0.12
90
kn=16k, = 6.4, 24.79% 0.27 kn=32k, =12.8, 27.55% 0.28 kn=64k, =25.6, 33.96% 0.80
80

training error %
70

20
0 20 40 60 80 0 20 40 60 80 0 20 40 60 80
epochs epochs epochs
Figure 3. Training error vs. minibatch size. Training error curves for the 256 minibatch baseline and larger minibatches using gradual
warmup and the linear scaling rule. Note how the training curves closely match the baseline (aside from the warmup period) up through 8k
minibatches. Validation error (mean±std of 5 runs) is shown in the legend, along with minibatch size kn and reference learning rate η.

8
100 100
kn=256, =0.1 [train] kn=256, = 0.1, 23.60% 0.12
kn=256, =0.1 [val] kn=256, = 0.2, 23.68% 0.09
80 80

training error %
kn=8k, =3.2 [train]
kn=8k, =3.2 [val]

error %
60 60

40 40

20 20
0 20 40 60 80 0 20 40 60 80
epochs epochs
Figure 4. Training and validation curves for large minibatch Figure 5. Training curves for small minibatches with different
SGD with gradual warmup vs. small minibatch SGD. Both sets learning rates η. As expected, changing η results in curves that do
of curves match closely after training for sufficient epochs. We not match. This is in contrast to changing batch-size (and linearly
note that the BN statistics (for inference only) are computed us- scaling η), which results in curves that do match, e.g. see Figure 3.
ing running average, which is updated less frequently with a large
minibatch and thus is noisier in early training (this explains the
larger variation of the validation error in early epochs). η = 0.1 gives best error but slightly smaller or larger η also
work well. When applying the linear scaling rule with a
minibatch of 8k images, the optimum error is also achieved
5.3. Analysis Experiments with η = 0.1 · 32, showing the successful application of the
Minibatch size vs. error. Figure 1 (page 1) shows top- linear scaling rule. However, in this case results are more
1 validation error for models trained with minibatch sizes sensitive to changing η. In practice we suggest to use a
ranging from of 64 to 65536 (64k). For all models we used minibatch size that is not close to the breaking point.
the linear scaling rule and set the reference learning rate Figure 5 shows the training curves of a 256 minibatch
kn
as η = 0.1 · 256 . For models with kn > 256, we used using η = 0.1 or 0.2. It shows that changing the learning
the gradual warmup strategy always starting with η = 0.1 rate η in general changes the overall shapes of the train-
and increasing linearly to the reference learning rate after ing curves, even if the final error is similar. Contrasting
5 epochs. Figure 1 illustrates that validation error remains this result with the success of the linear scaling rule (that
stable across a broad range of minibatch sizes, from 64 to can match both the final error and the training curves when
8k, after which it begins to increase. Beyond 64k training minibatch sizes change) may reveal some underlying invari-
diverges when using the linear learning rate scaling rule.5 ance maintained between small and large minibatches.

Training curves for various minibatch sizes. Each of the
We also show two √ alternative strategies: keeping η fixed
at 0.1 or using 0.1· 32 according to the square root scaling
nine plots in Figure 3 shows the top-1 training error curve rule that was justified theoretically in [21] on grounds that it
for the 256 minibatch baseline (orange) and a second curve scales η by the inverse amount of the reduction in the gradi-
corresponding to different size minibatch (blue). Valida- ent estimator’s standard deviation.√ For fair comparisons we
tion errors are shown in the plot legends. As minibatch size also use gradual warmup for 0.1 · 32. Both policies work
increases, all training curves show some divergence from poorly in practice as the results show.
the baseline at the start of training. However, in the cases
where the final validation error closely matches the base- Batch Normalization γ initialization. Table 2b controls
line (kn ≤ 8k), the training curves also closely match after for the impact of the new BN γ initialization introduced in
the initial epochs. When the validation errors do not match §5.1. We show results for minibatch sizes 256 and 8k with
(kn ≥ 16k), there is a noticeable gap in the training curves the standard BN initialization (γ = 1 for all BN layers)
for all epochs. This suggests that when comparing a new and with our initialization (γ = 0 for the final BN layer
setting, the training curves can be used as a reliable proxy of each residual block). The results show improved per-
for success well before training finishes. formance with γ = 0 for both minibatch sizes, and the
improvement is slightly larger for the 8k minibatch size.
Alternative learning rate rules. Table 2a shows results for This behavior also suggests that large minibatches are more
multiple learning rates. For small minibatches (kn = 256), easily affected by optimization difficulties. We expect that
5 We note that because of the availability of hardware, we simulated dis- improved optimization and initialization methods will help
tributed training of very large minibatches (≥12k) on a single server by us- push the boundary of large minibatch training.
ing multiple gradient accumulation steps between SGD updates. We have
thoroughly verified that gradient accumulation on a single server yields ResNet-101. Results for ResNet-101 [16] are shown in Ta-
equivalent results relative to distributed training. ble 2c. Training ResNet-101 with a batch-size of kn = 8k

9
kn η top-1 error (%) 40

ImageNet top-1 validation error
256 0.05 23.92 ±0.10
256 0.10 23.60 ±0.12 35
256 0.20 23.68 ±0.09
8k 0.05 · 32 24.27 ±0.08
8k 0.10 · 32 23.74 ±0.09 30
8k 0.20 · 32 24.05 ±0.18
8k 0.10√ 41.67 ±0.10
25
8k 0.10 · 32 26.22 ±0.03
(a) Comparison of learning rate scaling rules. A reference learning rate
of η = 0.1 works best for kn = 256 (23.68% error). The linear scal- 20
256 512 1k 2k 4k 8k 16k 32k 64k
ing rule suggests η = 0.1 · 32 when kn = 8k, which again gives best
mini-batch size
performance (23.74% error). Other ways of scaling η give worse results.
Figure 6. ImageNet-5k top-1 validation error vs. minibatch size
kn η γ-init top-1 error (%) with a fixed 90 epoch training schedule. The curve is qualitatively
256 0.1 1.0 23.84 ±0.18 similar to results on ImageNet-1k (Figure 1) showing that a 5×
256 0.1 0.0 23.60 ±0.12 increase in training data does not lead to a significant change in
8k 3.2 1.0 24.11 ±0.07 the maximum effective minibatch size.
8k 3.2 0.0 23.74 ±0.09
(b) Batch normalization γ initialization. Initializing γ = 0 in the last ImageNet pre-training COCO
BN layer of each residual block improves results for both small and large kn η top-1 error (%) box AP (%) mask AP (%)
minibatches. This initialization leads to better optimization behavior which 256 0.1 23.60 ±0.12 35.9 ±0.1 33.9 ±0.1
has a larger positive impact when training with large minibatches.
512 0.2 23.48 ±0.09 35.8 ±0.1 33.8 ±0.2
1k 0.4 23.53 ±0.08 35.9 ±0.2 33.9 ±0.2
model type kn η top-1 error (%) 2k 0.8 23.49 ±0.11 35.9 ±0.1 33.9 ±0.1
ResNet-101 256 0.1 22.08 ±0.06 4k 1.6 23.56 ±0.12 35.8 ±0.1 33.8 ±0.1
ResNet-101 8k 3.2 22.36 ±0.09 8k 3.2 23.74 ±0.09 35.8 ±0.1 33.9 ±0.2
(c) The linear scaling rule applied to ResNet-101. The difference in error 16k 6.4 24.79 ±0.27 35.1 ±0.3 33.2 ±0.3
is about 0.3% between small and large minibatch training. (a) Transfer learning of large minibatch pre-training to Mask R-CNN.
Box and mask AP (on COCO minival) are nearly identical for ResNet-
Table 2. ImageNet classification experiments. Unless noted all 50 models pre-trained with minibatches from 256 to 8k examples. With
experiments use ResNet-50 and are averaged over 5 trials. a minibatch pre-training size of 16k both ImageNet validation error and
COCO AP deteriorate. This indicates that as long as ImageNet error is
matched, large minibatches do not degrade transfer learning performance.
and a linearly scaled η = 3.2 results in an error of 22.36%
# GPUs kn η · 1000 iterations box AP (%) mask AP (%)
vs. the kn = 256 baseline which achieves 22.08% with
1 2 2.5 1,280,000 35.7 33.6
η = 0.1. In other words, ResNet-101 trained with mini-
2 4 5.0 640,000 35.7 33.7
batch 8k has a small 0.28% increase in error vs. the baseline. 4 8 10.0 320,000 35.7 33.5
It is likely that the minibatch size of 8k lies on the edge of 8 16 20.0 160,000 35.6 33.6
the useful minibatch training regime for ResNet-101, simi- (b) Linear learning rate scaling applied to Mask R-CNN. Using the sin-
larly to ResNet-50 (see Figure 1). gle ResNet-50 model from [16] (thus no std is reported), we train Mask
The training time of ResNet-101 is 92.5 minutes in our R-CNN using using from 1 to 8 GPUs following the linear learning rate
scaling rule. Box and mask AP are nearly identical across all configurations
implementation using 256 Tesla P100 GPUs and a mini- showing the successful generalization of the rule beyond classification.
batch size of 8k. We believe this is a compelling result if
the speed-accuracy tradeoff of ResNet-101 is preferred. Table 3. Object detection on COCO with Mask R-CNN [14].

ImageNet-5k. Observing the sharp increase in validation is very similar to the ImageNet-1k curve, showing that for
error between minibatch sizes of 8k and 16k on ImageNet- practitioners it is unlikely that even a 5× increase in dataset
1k (Figure 1), a natural question is if the position of this size will automatically lead to a meaningful increase in use-
‘elbow’ in the error curve is a function of dataset infor- able minibatch size. Quantitatively, using an 8k minibatch
mation content. To investigate this question, we adopt increases the validation error by 0.26% from 25.83% for a
the ImageNet-5k dataset suggested by Xie et al. [39] that 256 minibatch to 26.09%. An understanding of the precise
extends ImageNet-1k to 6.8 million images (roughly 5× relationship between generalization error, minibatch size,
larger) by adding 4k additional categories from ImageNet- and dataset information content is open for future work.
22k [33]. We evaluate the 1k-way classification error on the
original ImageNet-1k validation set as in [39]. 5.4. Generalization to Detection and Segmentation
The minibatch size vs. validation error curve for A low error rate on ImageNet is not typically an end
ImageNet-5k is shown in Figure 6. Qualitatively, the curve goal. Instead, the utility of ImageNet training lies in learn-

10
0.3
16
32k ideal

time per iteration (secs) time per epoch (mins)
0.28 actual

images / second
8
16k
0.26 4
8k
0.24 2 4k

0.22 1 2k

0.2 0.5
256 512 1k 2k 4k 8k 11k 8 16 32 64 128 256 352
mini-batch size # GPUs
Figure 7. Distributed synchronous SGD timing. Time per itera- Figure 8. Distributed synchronous SGD throughput. The small
tion (seconds) and time per ImageNet epoch (minutes) for training overhead when moving from a single server with 8 GPUs to multi-
with different minibatch sizes. The baseline (kn = 256) uses 8 server distributed training (Figure 7, blue curve) results in linear
GPUs in a single server , while all other training runs distribute throughput scaling that is marginally below ideal scaling (∼90%
training over (kn/256) server. With 352 GPUs (44 servers) our efficiency). Most of the allreduce communication time is hid-
implementation completes one pass over all ∼1.28 million Ima- den by pipelining allreduce operations with gradient computation.
geNet training images in about 30 seconds. Moreover, this is achieved with commodity Ethernet hardware.

ing good features that transfer, or generalize well, to re- tection matches the AP of the small minibatch baseline. We
lated tasks. A question of key importance is if the features emphasize that we observed no generalization issues when
learned with large minibatches generalize as well as the fea- transferring across datasets (from ImageNet to COCO) and
tures learned with small minibatches? across tasks (from classification to detection/segmentation)
To test this, we adopt the object detection and in- using models trained with large minibatches.
stance segmentation tasks on COCO [27] as these advanced Linear scaling rule applied to Mask R-CNN. We also
perception tasks benefit substantially from ImageNet pre- show evidence of the generality of the linear scaling rule us-
training [10]. We use the recently developed Mask R-CNN ing Mask R-CNN. In fact, this rule was already used with-
[14] system that is capable of learning to detect and segment out explicit discussion in [16] and was applied effectively
object instances. We follow all of the hyper-parameter set- as the default Mask R-CNN training scheme when using 8
tings used in [14] and only change the ResNet-50 model GPUs. Table 3b provides experimental results showing that
used to initialize Mask R-CNN training. We train Mask R- when training with 1, 2, 4, or 8 GPUs the linear learning rate
CNN on the COCO trainval35k split and report results rule results in constant box and mask AP. For these experi-
on the 5k image minival split used in [14]. ments, we initialize Mask R-CNN from the released MSRA
It is interesting to note that the concept of minibatch ResNet-50 model, as was done in [14].
size in Mask R-CNN is different from the classification
setting. As an extension of the image-centric Fast/Faster 5.5. Run Time
R-CNN [9, 31], Mask R-CNN exhibits different minibatch Figure 7 shows two visualizations of the run time char-
sizes for different layers: the network backbone uses two acteristics of our system. The blue curve is the time per
images (per GPU), but each image contributes 512 Regions- iteration as minibatch size varies from 256 to 11264 (11k).
of-Interest for computing classification (multinomial cross- Notably this curve is relatively flat and the time per itera-
entropy), bounding-box regression (smooth-L1/Huber), and tion increases only 12% while scaling the minibatch size by
pixel-wise mask (28 × 28 binomial cross-entropy) losses. 44×. Visualized another way, the orange curve shows the
This diverse set of minibatch sizes and loss functions pro- approximately linear decrease in time per epoch from over
vides a good test case to the robustness of our approach. 16 minutes to just 30 seconds. Run time performance can
Transfer learning from large minibatch pre-training. also be viewed in terms of throughput (images / second), as
To test how large minibatch pre-training effects Mask R- shown in Figure 8. Relative to a perfectly efficient extrapo-
CNN, we take ResNet-50 models trained on ImageNet-1k lation of the 8 GPU baseline, our implementation achieves
∼90% scaling efficiency.
with 256 to 16k minibatches and use them to initialize Mask
R-CNN training. For each minibatch size we pre-train 5 Acknowledgements. We would like to thank Leon Bottou for
models and then train Mask R-CNN using all 5 models on helpful discussions on theoretical background, Jerry Pan and
COCO (35 models total). We report the mean box and mask Christian Puhrsch for discussions on efficient data loading, An-
APs, averaged over the 5 trials, in Table 3a. The results drew Dye for help with debugging distributed training, and Kevin
show that as long as ImageNet validation error is kept low, Lee, Brian Dodds, Jia Ning, Koh Yew Thoon, Micah Harris, and
which is true up to 8k batch size, generalization to object de- John Volk for Big Basin and hardware support.

11
References [20] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and
P. T. P. Tang. On large-batch training for deep learning: Gen-
[1] J. Bagga, H. Morsy, and Z. Yao. Opening eralization gap and sharp minima. ICLR, 2017.
designs for 6-pack and Wedge 100. https: [21] A. Krizhevsky. One weird trick for parallelizing convolu-
//code.facebook.com/posts/203733993317833/ tional neural networks. arXiv:1404.5997, 2014.
opening-designs-for-6-pack-and-wedge-100, 2016. [22] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classi-
[2] M. Barnett, L. Shuler, R. van De Geijn, S. Gupta, D. G. fication with deep convolutional neural nets. In NIPS, 2012.
Payne, and J. Watts. Interprocessor collective communica- [23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E.
tion library (intercom). In Scalable High-Performance Com- Howard, W. Hubbard, and L. D. Jackel. Backpropagation
puting Conference, 1994. applied to handwritten zip code recognition. Neural compu-
[3] L. Bottou. Curiously fast convergence of some stochastic tation, 1989.
gradient descent algorithms. Unpublished open problem of- [24] K. Lee. Introducing Big Basin: Our next-generation
fered to the attendance of the SLDS 2009 conference, 2009. AI hardware. https://code.facebook.com/posts/
[4] L. Bottou, F. E. Curtis, and J. Nocedal. Opt. methods for 1835166200089399/introducing-big-basin, 2017.
large-scale machine learning. arXiv:1606.04838, 2016. [25] M. Li. Scaling Distributed Machine Learning with System
[5] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Joze- and Algorithm Co-design. PhD thesis, Carnegie Mellon Uni-
fowicz. Revisiting Distributed Synchronous SGD. versity, 2017.
arXiv:1604.00981, 2016. [26] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and
[6] K. Chen and Q. Huo. Scalable training of deep learning ma- S. Belongie. Feature pyramid networks for object detection.
chines by incremental block training with intra-block par- In CVPR, 2017.
allel optimization and blockwise model-update filtering. In [27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
ICASSP, 2016. manan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Com-
[7] R. Collobert, J. Weston, L. Bottou, M. Karlen, mon objects in context. In ECCV. 2014.
K. Kavukcuoglu, and P. Kuksa. Natural language pro- [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
cessing (almost) from scratch. JMLR, 2011. networks for semantic segmentation. In CVPR, 2015.
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, [29] Y. Nesterov. Introductory lectures on convex optimization: A
E. Tzeng, and T. Darrell. Decaf: A deep convolutional acti- basic course. Springer, 2004.
vation feature for generic visual recognition. In ICML, 2014. [30] R. Rabenseifner. Optimization of collective reduction oper-
ations. In ICCS. Springer, 2004.
[9] R. Girshick. Fast R-CNN. In ICCV, 2015.
[31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: To-
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea-
wards real-time object detection with region proposal net-
ture hierarchies for accurate object detection and semantic
works. In NIPS, 2015.
segmentation. In CVPR, 2014.
[32] H. Robbins and S. Monro. A stochastic approximation
[11] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable method. The annals of mathematical statistics, 1951.
Parallel Programming with the Message-Passing Interface. [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
MIT Press, Cambridge, MA, 1999. S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
[12] S. Gross and M. Wilber. Training and investigating Resid- A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual
ual Nets. https://github.com/facebook/fb. Recognition Challenge. IJCV, 2015.
resnet.torch, 2016. [34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus,
[13] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why and Y. LeCun. Overfeat: Integrated recognition, localization
random reshuffling beats stochastic gradient descent. and detection using convolutional networks. In ICLR, 2014.
arXiv:1510.08560, 2015. [35] K. Simonyan and A. Zisserman. Very deep convolutional
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R- networks for large-scale image recognition. In ICLR, 2015.
CNN. arXiv:1703.06870, 2017. [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
[15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
rectifiers: Surpassing human-level performance on imagenet Going deeper with convolutions. In CVPR, 2015.
classification. In ICCV, 2015. [37] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning collective comm. operations in MPICH. IJHPCA, 2005.
for image recognition. In CVPR, 2016. [38] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
[17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey,
N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Google’s neural machine translation system: Bridg-
et al. Deep neural networks for acoustic modeling in speech ing the gap between human and machine translation.
recognition: The shared views of four research groups. IEEE arXiv:1609.08144, 2016.
Signal Processing Magazine, 2012. [39] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated
[18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and residual transformations for deep neural networks. In CVPR,
Y. Bengio. Quantized neural networks: Training neu- 2017.
ral networks with low precision weights and activations. [40] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stol-
arXiv:1510.08560, 2016. cke, D. Yu, and G. Zweig. The Microsoft 2016 Conversa-
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating tional Speech Recognition System. arXiv:1609.03528, 2016.
deep network training by reducing internal covariate shift. In [41] M. D. Zeiler and R. Fergus. Visualizing and understanding
ICML, 2015. convolutional neural networks. In ECCV, 2014.