Group Normalization
Yuxin Wu Kaiming He
Facebook AI Research (FAIR)
{yuxinwu,kaiminghe}@fb.com
arXiv:1803.08494v3 [cs.CV] 11 Jun 2018

Abstract

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN’s usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO,¹ and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.

Figure 1. ImageNet classification error vs. batch sizes. This is a ResNet-50 model trained in the ImageNet training set using 8 workers (GPUs), evaluated in the validation set. [Plot: error (%) vs. batch size (images per worker) of 32, 16, 8, 4, 2, for Batch Norm and Group Norm.]

1. Introduction

Batch Normalization (Batch Norm or BN) [26] has been established as a very effective component in deep learning, largely helping push the frontier in computer vision [59, 20] and beyond [54]. BN normalizes the features by the mean and variance computed within a (mini-)batch. This has been shown by many practices to ease optimization and enable very deep networks to converge. The stochastic uncertainty of the batch statistics also acts as a regularizer that can benefit generalization. BN has been a foundation of many state-of-the-art computer vision algorithms.

Despite its great success, BN exhibits drawbacks that are also caused by its distinct behavior of normalizing along the batch dimension. In particular, it is required for BN to work with a sufficiently large batch size (e.g., 32 per worker² [26, 59, 20]). A small batch leads to inaccurate estimation of the batch statistics, and reducing BN’s batch size increases the model error dramatically (Figure 1). As a result, many recent models [59, 20, 57, 24, 63] are trained with non-trivial batch sizes that are memory-consuming. The heavy reliance on BN’s effectiveness to train models in turn prohibits people from exploring higher-capacity models that would be limited by memory.

The restriction on batch sizes is more demanding in computer vision tasks including detection [12, 47, 18], segmentation [38, 18], video recognition [60, 6], and other high-level systems built on them. For example, the Fast/er and Mask R-CNN frameworks [12, 47, 18] use a batch size of 1 or 2 images because of higher resolution, where BN is “frozen” by transforming to a linear layer [20]; in video classification with 3D convolutions [60, 6], the presence of spatial-temporal features introduces a trade-off between the temporal length and batch size. The usage of BN often requires these systems to compromise between the model design and batch sizes.

¹ https://github.com/facebookresearch/Detectron/blob/master/projects/GN.
² In the context of this paper, we use “batch size” to refer to the number of samples per worker (e.g., GPU). BN’s statistics are computed for each worker, but not broadcast across workers, as is standard in many libraries.
This paper presents Group Normalization (GN) as a simple alternative to BN. We notice that many classical features like SIFT [39] and HOG [9] are group-wise features and involve group-wise normalization. For example, a HOG vector is the outcome of several spatial cells where each cell is represented by a normalized orientation histogram. Analogously, we propose GN as a layer that divides channels into groups and normalizes the features within each group (Figure 2). GN does not exploit the batch dimension, and its computation is independent of batch sizes.

GN behaves very stably over a wide range of batch sizes (Figure 1). With a batch size of 2 samples, GN has 10.6% lower error than its BN counterpart for ResNet-50 [20] in ImageNet [50]. With a regular batch size, GN is comparably good as BN (with a gap of ∼0.5%) and outperforms other normalization variants [3, 61, 51]. Moreover, although the batch size may change, GN can naturally transfer from pre-training to fine-tuning. GN shows improved results vs. its BN counterpart on Mask R-CNN for COCO object detection and segmentation [37], and on 3D convolutional networks for Kinetics video classification [30]. The effectiveness of GN in ImageNet, COCO, and Kinetics demonstrates that GN is a competitive alternative to BN that has been dominant in these tasks.

There have been existing methods, such as Layer Normalization (LN) [3] and Instance Normalization (IN) [61] (Figure 2), that also avoid normalizing along the batch dimension. These methods are effective for training sequential models (RNN/LSTM [49, 22]) or generative models (GANs [15, 27]). But as we will show by experiments, both LN and IN have limited success in visual recognition, for which GN presents better results. Conversely, GN could be used in place of LN and IN and thus is applicable for sequential or generative models. This is beyond the focus of this paper, but it is suggestive for future research.

2. Related Work

Normalization. It is well-known that normalizing the input data makes training faster [33]. To normalize hidden features, initialization methods [33, 14, 19] have been derived based on strong assumptions of feature distributions, which can become invalid when training evolves.

Normalization layers in deep networks had been widely used before the development of BN. Local Response Normalization (LRN) [40, 28, 32] was a component in AlexNet [32] and following models [64, 53, 58]. Unlike recent methods [26, 3, 61], LRN computes the statistics in a small neighborhood for each pixel.

Batch Normalization [26] performs more global normalization along the batch dimension (and as importantly, it suggests to do this for all layers). But the concept of “batch” is not always present, or it may change from time to time. For example, batch-wise normalization is not legitimate at inference time, so the mean and variance are pre-computed from the training set [26], often by running average; consequently, there is no normalization performed when testing. The pre-computed statistics may also change when the target data distribution changes [45]. These issues lead to inconsistency at training, transferring, and testing time. In addition, as aforementioned, reducing the batch size can have dramatic impact on the estimated batch statistics.

Several normalization methods [3, 61, 51, 2, 46] have been proposed to avoid exploiting the batch dimension. Layer Normalization (LN) [3] operates along the channel dimension, and Instance Normalization (IN) [61] performs BN-like computation but only for each sample (Figure 2). Instead of operating on features, Weight Normalization (WN) [51] proposes to normalize the filter weights. These methods do not suffer from the issues caused by the batch dimension, but they have not been able to approach BN’s accuracy in many visual recognition tasks. We provide comparisons with these methods in context of the remaining sections.

Addressing small batches. Ioffe [25] proposes Batch Renormalization (BR) that alleviates BN’s issue involving small batches. BR introduces two extra parameters that constrain the estimated mean and variance of BN within a certain range, reducing their drift when the batch size is small. BR has better accuracy than BN in the small-batch regime. But BR is also batch-dependent, and when the batch size decreases its accuracy still degrades [25].

There are also attempts to avoid using small batches. The object detector in [43] performs synchronized BN whose mean and variance are computed across multiple GPUs. However, this method does not solve the problem of small batches; instead, it migrates the algorithm problem to engineering and hardware demands, using a number of GPUs proportional to BN’s requirements. Moreover, the synchronized BN computation prevents using asynchronous solvers (ASGD [10]), a practical solution to large-scale training widely used in industry. These issues can limit the scope of using synchronized BN.

Instead of addressing the batch statistics computation (e.g., [25, 43]), our normalization method inherently avoids this computation.

Figure 2. Normalization methods (panels: Batch Norm, Layer Norm, Instance Norm, Group Norm). Each subplot shows a feature map tensor, with N as the batch axis, C as the channel axis, and (H, W) as the spatial axes. The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels.

Group-wise computation. Group convolutions have been presented by AlexNet [32] for distributing a model into two GPUs. The concept of groups as a dimension for model design has been more widely studied recently. The work of ResNeXt [63] investigates the trade-off between depth, width, and groups, and it suggests that a larger number of groups can improve accuracy under similar computational cost. MobileNet [23] and Xception [7] exploit channel-wise (also called “depth-wise”) convolutions, which are group convolutions with a group number equal to the channel
number. ShuffleNet [65] proposes a channel shuffle operation that permutes the axes of grouped features. These methods all involve dividing the channel dimension into groups. Despite the relation to these methods, GN does not require group convolutions. GN is a generic layer, as we evaluate in standard ResNets [20].

3. Group Normalization

The channels of visual representations are not entirely independent. Classical features of SIFT [39], HOG [9], and GIST [41] are group-wise representations by design, where each group of channels is constructed by some kind of histogram. These features are often processed by group-wise normalization over each histogram or each orientation. Higher-level features such as VLAD [29] and Fisher Vectors (FV) [44] are also group-wise features where a group can be thought of as the sub-vector computed with respect to a cluster.

Analogously, it is not necessary to think of deep neural network features as unstructured vectors. For example, for conv1 (the first convolutional layer) of a network, it is reasonable to expect a filter and its horizontal flipping to exhibit similar distributions of filter responses on natural images. If conv1 happens to approximately learn this pair of filters, or if the horizontal flipping (or other transformations) is made into the architectures by design [11, 8], then the corresponding channels of these filters can be normalized together.

The higher-level layers are more abstract and their behaviors are not as intuitive. However, in addition to orientations (SIFT [39], HOG [9], or [11, 8]), there are many factors that could lead to grouping, e.g., frequency, shapes, illumination, textures. Their coefficients can be interdependent. In fact, a well-accepted computational model in neuroscience is to normalize across the cell responses [21, 52, 55, 5], “with various receptive-field centers (covering the visual field) and with various spatiotemporal frequency tunings” (p183, [21]); this can happen not only in the primary visual cortex, but also “throughout the visual system” [5]. Motivated by these works, we propose new generic group-wise normalization for deep neural networks.

3.1. Formulation

We first describe a general formulation of feature normalization, and then present GN in this formulation. A family of feature normalization methods, including BN, LN, IN, and GN, perform the following computation:

$$\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i). \tag{1}$$

Here $x$ is the feature computed by a layer, and $i$ is an index. In the case of 2D images, $i = (i_N, i_C, i_H, i_W)$ is a 4D vector indexing the features in $(N, C, H, W)$ order, where $N$ is the batch axis, $C$ is the channel axis, and $H$ and $W$ are the spatial height and width axes.

$\mu$ and $\sigma$ in (1) are the mean and standard deviation (std) computed by:

$$\mu_i = \frac{1}{m} \sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m} \sum_{k \in S_i} (x_k - \mu_i)^2 + \epsilon}, \tag{2}$$

with $\epsilon$ as a small constant. $S_i$ is the set of pixels in which the mean and std are computed, and $m$ is the size of this set. Many types of feature normalization methods mainly differ in how the set $S_i$ is defined (Figure 2), discussed as follows.

In Batch Norm [26], the set $S_i$ is defined as:

$$S_i = \{k \mid k_C = i_C\}, \tag{3}$$

where $i_C$ (and $k_C$) denotes the sub-index of $i$ (and $k$) along the $C$ axis. This means that the pixels sharing the same channel index are normalized together, i.e., for each channel, BN computes $\mu$ and $\sigma$ along the $(N, H, W)$ axes. In Layer Norm [3], the set is:

$$S_i = \{k \mid k_N = i_N\}, \tag{4}$$

meaning that LN computes $\mu$ and $\sigma$ along the $(C, H, W)$ axes for each sample. In Instance Norm [61], the set is:

$$S_i = \{k \mid k_N = i_N,\ k_C = i_C\}, \tag{5}$$

meaning that IN computes $\mu$ and $\sigma$ along the $(H, W)$ axes for each sample and each channel. The relations among BN, LN, and IN are in Figure 2.
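As a rough illustration of Eqns. (1)-(5), the sketch below applies the set-based formulation literally in plain Python, so that the three methods differ only in the membership test for $S_i$. The helper names (`normalize`, `bn_set`, `ln_set`, `in_set`), the dict-of-indices tensor layout, and the toy values are assumptions made for readability; this is not the authors' code, and a practical implementation would use vectorized tensor operations instead.

```python
import math


def normalize(x, same_set, eps=1e-5):
    """Eqns. (1)-(2) applied literally.

    x maps an index tuple (n, c, h, w) to a float; same_set(i, k) is a
    hypothetical predicate that is True when k belongs to S_i.
    """
    out = {}
    for i, x_i in x.items():
        members = [x_k for k, x_k in x.items() if same_set(i, k)]
        m = len(members)
        mu = sum(members) / m
        var = sum((v - mu) ** 2 for v in members) / m
        out[i] = (x_i - mu) / math.sqrt(var + eps)
    return out


def bn_set(i, k):
    # Eqn. (3): same channel index, any sample and position.
    return k[1] == i[1]


def ln_set(i, k):
    # Eqn. (4): same sample index, any channel and position.
    return k[0] == i[0]


def in_set(i, k):
    # Eqn. (5): same sample index and same channel index.
    return k[0] == i[0] and k[1] == i[1]


# Toy feature tensor with N=2, C=2, H=1, W=2 and arbitrary values.
x = {
    (n, c, 0, w): float(8 * n + 3 * c + w * w)
    for n in range(2) for c in range(2) for w in range(2)
}
x_bn = normalize(x, bn_set)
x_ln = normalize(x, ln_set)
x_in = normalize(x, in_set)
```

After normalization, the values that share a set sum to approximately zero: per channel for `x_bn`, per sample for `x_ln`, and per (sample, channel) pair for `x_in`.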
As in [26], all methods of BN, LN, and IN learn a per-channel linear transform to compensate for the possible loss of representational ability:

$$y_i = \gamma \hat{x}_i + \beta, \tag{6}$$

where $\gamma$ and $\beta$ are trainable scale and shift (indexed by $i_C$ in all cases, which we omit to simplify notation).

Group Norm. Formally, a Group Norm layer computes $\mu$ and $\sigma$ in a set $S_i$ defined as:

$$S_i = \left\{k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}. \tag{7}$$

Here $G$ is the number of groups, which is a pre-defined hyper-parameter ($G = 32$ by default). $C/G$ is the number of channels per group. $\lfloor \cdot \rfloor$ is the floor operation, and “$\lfloor \frac{k_C}{C/G} \rfloor = \lfloor \frac{i_C}{C/G} \rfloor$” means that the indexes $i$ and $k$ are in the same group of channels, assuming each group of channels is stored in a sequential order along the $C$ axis. GN computes $\mu$ and $\sigma$ along the $(H, W)$ axes and along a group of $C/G$ channels. The computation of GN is illustrated in Figure 2 (rightmost), which is a simple case of 2 groups ($G = 2$) each having 3 channels.

Given $S_i$ in Eqn.(7), a GN layer is defined by Eqn.(1), (2), and (6). Specifically, the pixels in the same group are normalized together by the same $\mu$ and $\sigma$. GN also learns the per-channel $\gamma$ and $\beta$.

Relation to Prior Work. LN, IN, and GN all perform independent computations along the batch axis. The two extreme cases of GN are equivalent to LN and IN (Figure 2).

Relation to Layer Normalization [3]. GN becomes LN if we set the group number as $G = 1$. LN assumes all channels in a layer make “similar contributions” [3]. Unlike the case of fully-connected layers studied in [3], this assumption can be less valid with the presence of convolutions, as discussed in [3]. GN is less restricted than LN, because each group of channels (instead of all of them) is assumed to be subject to the shared mean and variance; the model still has the flexibility of learning a different distribution for each group. This leads to improved representational power of GN over LN, as shown by the lower training and validation error in experiments (Figure 4).

Relation to Instance Normalization [61]. GN becomes IN if we set the group number as $G = C$ (i.e., one channel per group). But IN can only rely on the spatial dimension for computing the mean and variance and it misses the opportunity of exploiting the channel dependence.

3.2. Implementation

GN can be easily implemented by a few lines of code in PyTorch [42] and TensorFlow [1] where automatic differentiation is supported. Figure 3 shows the code based on TensorFlow. In fact, we only need to specify how the mean and variance (“moments”) are computed, along the appropriate axes as defined by the normalization method.

    import tensorflow as tf

    def GroupNorm(x, gamma, beta, G, eps=1e-5):
        # x: input features with shape [N,C,H,W]
        # gamma, beta: scale and offset, with shape [1,C,1,1]
        # G: number of groups for GN
        N, C, H, W = x.shape
        x = tf.reshape(x, [N, G, C // G, H, W])
        # keepdims was spelled keep_dims in early TensorFlow 1.x releases
        mean, var = tf.nn.moments(x, [2, 3, 4], keepdims=True)
        x = (x - mean) / tf.sqrt(var + eps)
        x = tf.reshape(x, [N, C, H, W])
        return x * gamma + beta

Figure 3. Python code of Group Norm based on TensorFlow.

4. Experiments

4.1. Image Classification in ImageNet

We experiment in the ImageNet classification dataset [50] with 1000 classes. We train on the ∼1.28M training images and evaluate on the 50,000 validation images, using the ResNet models [20].

Implementation details. As standard practice [20, 17], we use 8 GPUs to train all models, and the batch mean and variance of BN are computed within each GPU. We use the method of [19] to initialize all convolutions for all models. We use 1 to initialize all $\gamma$ parameters, except for each residual block’s last normalization layer where we initialize $\gamma$ by 0 following [16] (such that the initial state of a residual block is identity). We use a weight decay of 0.0001 for all weight layers, including $\gamma$ and $\beta$ (following [17] but unlike [20, 16]). We train 100 epochs for all models, and decrease the learning rate by 10× at 30, 60, and 90 epochs. During training, we adopt the data augmentation of [58] as implemented by [17]. We evaluate the top-1 classification error on the center crops of 224×224 pixels in the validation set. To reduce random variations, we report the median error rate of the final 5 epochs [16]. Other implementation details follow [17].

Our baseline is the ResNet trained with BN [20]. To compare with LN, IN, and GN, we replace BN with the specific variant. We use the same hyper-parameters for all models. We set $G = 32$ for GN by default.

Comparison of feature normalization methods. We first experiment with a regular batch size of 32 images (per GPU) [26, 20]. BN works successfully in this regime, so this is a strong baseline to compare with. Figure 4 shows the error curves, and Table 1 shows the final results.

Figure 4 shows that all of these normalization methods are able to converge. LN has a small degradation of 1.7%
comparing with BN. This is an encouraging result, as it suggests that normalizing along all channels (as done by LN) of a convolutional network is reasonably good. IN also makes the model converge, but is 4.8% worse than BN.³

In this regime where BN works well, GN is able to approach BN’s accuracy, with a decent degradation of 0.5% in the validation set. Actually, Figure 4 (left) shows that GN has lower training error than BN, indicating that GN is effective for easing optimization. The slightly higher validation error of GN implies that GN loses some regularization ability of BN. This is understandable, because BN’s mean and variance computation introduces uncertainty caused by the stochastic batch sampling, which helps regularization [26]. This uncertainty is missing in GN (and LN/IN). But it is possible that GN combined with a suitable regularizer will improve results. This can be a future research topic.

³ For completeness, we have also trained ResNet-50 with WN [51], which is filter (instead of feature) normalization. WN’s result is 28.2%.

Figure 4. Comparison of error curves with a batch size of 32 images/GPU. We show the ImageNet training error (left) and validation error (right) vs. numbers of training epochs. The model is ResNet-50. [Plots: error (%) vs. epochs (0 to 100) for BN, LN, IN, and GN.]

              BN     LN     IN     GN
val error    23.6   25.3   28.4   24.1
Δ (vs. BN)    -      1.7    4.8    0.5

Table 1. Comparison of error rates (%) of ResNet-50 in the ImageNet validation set, trained with a batch size of 32 images/GPU. The error curves are in Figure 4.

Small batch sizes. Although BN benefits from the stochasticity under some situations, its error increases when the batch size becomes smaller and the uncertainty gets bigger. We show this in Figure 1, Figure 5, and Table 2.

Figure 5. Sensitivity to batch sizes: ResNet-50’s validation error of BN (left) and GN (right) trained with 32, 16, 8, 4, and 2 images/GPU. [Plots: error (%) vs. epochs (0 to 100).]

batch size    32     16      8      4      2
BN           23.6   23.7   24.8   27.3   34.7
GN           24.1   24.2   24.0   24.2   24.1
Δ             0.5    0.5   -0.8   -3.1  -10.6

Table 2. Sensitivity to batch sizes. We show ResNet-50’s validation error (%) in ImageNet. The last row shows the differences between BN and GN. The error curves are in Figure 5. This table is visualized in Figure 1.

We evaluate batch sizes of 32, 16, 8, 4, 2 images per GPU. In all cases, the BN mean and variance are computed within each GPU and not synchronized. All models are trained in 8 GPUs. In this set of experiments, we adopt the linear learning rate scaling rule [31, 4, 16] to adapt to batch size changes: we use a learning rate of 0.1 [20] for the batch size of 32, and 0.1N/32 for a batch size of N. This linear scaling rule works well for BN if the total batch size changes (by changing the number of GPUs) but the per-GPU batch size does not change [16]. We keep the same number of training epochs for all cases (Figure 5, x-axis). All other hyper-parameters are unchanged.
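The learning rate used in these batch-size experiments can be written out as a short calculation. The sketch below combines the linear scaling rule (0.1N/32 for N images per GPU) with the 10× decay at 30, 60, and 90 epochs from the implementation details. The function name, its signature, and the convention that the decay takes effect at the start of the milestone epoch are assumptions for illustration only.

```python
def learning_rate(epoch, batch_per_gpu, base_lr=0.1, base_batch=32,
                  milestones=(30, 60, 90)):
    """Linear scaling rule followed by step decay.

    base_lr applies to base_batch images per GPU and is scaled to
    base_lr * N / base_batch for N images per GPU. The result is divided
    by 10 for every milestone epoch that has been reached (assumed
    convention: the decay applies from the milestone epoch onward).
    """
    lr = base_lr * batch_per_gpu / base_batch
    passed = sum(1 for m in milestones if epoch >= m)
    return lr / (10 ** passed)


# Initial learning rate for each per-GPU batch size studied in Table 2.
for n in (32, 16, 8, 4, 2):
    print(n, learning_rate(0, n))
```

Halving the per-GPU batch size halves the starting learning rate, while the number of epochs and the decay milestones stay fixed.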
Figure 5 (left) shows that BN’s error becomes considerably higher with small batch sizes. GN’s behavior is more stable and insensitive to the batch size. Actually, Figure 5 (right) shows that GN has very similar curves (subject to random variations) across a wide range of batch sizes from 32 to 2. In the case of a batch size of 2, GN has 10.6% lower error rate than its BN counterpart (24.1% vs. 34.7%).

These results indicate that the batch mean and variance estimation can be overly stochastic and inaccurate, especially when they are computed over 4 or 2 images. However, this stochasticity disappears if the statistics are computed from 1 image, in which case BN becomes similar to IN at training time. We see that IN has a better result (28.4%) than BN with a batch size of 2 (34.7%).

The robust results of GN in Table 2 demonstrate GN’s strength. It allows to remove the batch size constraint imposed by BN, which can give considerably more memory (e.g., 16× or more). This will make it possible to train higher-capacity models that would be otherwise bottlenecked by memory limitation. We hope this will create new opportunities in architecture design.

Comparison with Batch Renorm (BR). BR [25] introduces two extra parameters (r and d in [25]) that constrain the estimated mean and variance of BN. Their values are controlled by $r_{max}$ and $d_{max}$. To apply BR to ResNet-50, we have carefully chosen these hyper-parameters, and found that $r_{max} = 1.5$ and $d_{max} = 0.5$ work best for ResNet-50. With a batch size of 4, ResNet-50 trained with BR has an error rate of 26.3%. This is better than BN’s 27.3%, but still 2.1% higher than GN’s 24.2%.

Group division. Thus far all presented GN models are trained with a group number of G = 32. Next we evaluate different ways of dividing into groups. With a given fixed group number, GN performs reasonably well for all values of G we studied (Table 3, top panel). In the extreme case of G = 1, GN is equivalent to LN, and its error rate is higher than all cases of G > 1 studied.

We also evaluate fixing the number of channels per group (Table 3, bottom panel). Note that because the layers can have different channel numbers, the group number G can change across layers in this setting. In the extreme case of 1 channel per group, GN is equivalent to IN. Even if using as few as 2 channels per group, GN has substantially lower error than IN (25.6% vs. 28.4%). This result shows the effect of grouping channels when performing normalization.

# groups (G)           64     32     16     8      4      2      1 (=LN)
val error              24.6   24.1   24.6   24.4   24.6   24.7   25.3
Δ (vs. best)           0.5    -      0.5    0.3    0.5    0.6    1.2

# channels per group   64     32     16     8      4      2      1 (=IN)
val error              24.4   24.5   24.2   24.3   24.8   25.6   28.4
Δ (vs. best)           0.2    0.3    -      0.1    0.6    1.4    4.2

Table 3. Group division. We show ResNet-50’s validation error (%) in ImageNet, trained with 32 images/GPU. (Top): a given number of groups. (Bottom): a given number of channels per group. The last rows show the differences with the best number.

Deeper models. We have also compared GN with BN on ResNet-101 [20]. With a batch size of 32, our BN baseline of ResNet-101 has 22.0% validation error, and the GN counterpart has 22.4%, slightly worse by 0.4%. With a batch size of 2, GN ResNet-101’s error is 23.0%. This is still a decently stable result considering the very small batch size, and it is 8.9% better than the BN counterpart’s 31.9%.

Results and analysis of VGG models. To study GN/BN compared to no normalization, we consider VGG-16 [56] that can be healthily trained without normalization layers. We apply BN or GN right after each convolutional layer. Figure 6 shows the evolution of the feature distributions of conv5_3 (the last convolutional layer). GN and BN behave qualitatively similar, while being substantially different with the variant that uses no normalization; this phenomenon is also observed for all other convolutional layers. This comparison suggests that performing normalization is essential for controlling the distribution of features.

Figure 6. Evolution of feature distributions of conv5_3’s output (before normalization and ReLU) from VGG-16, shown as the {1, 20, 80, 99} percentile of responses. The table on the right shows the ImageNet validation error (%). Models are trained with 32 images/GPU. [Plots: percentile of responses vs. epochs (0 to 100) for none (w/o norm), Batch Norm, and Group Norm.]

        error
none    29.2
BN      28.0
GN      27.6

For VGG-16, GN is better than BN by 0.4% (Figure 6, right). This possibly implies that VGG-16 benefits less from BN’s regularization effect, and GN (that leads to lower training error) is superior to BN in this case.
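Returning to the group-division study of Table 3, the two schemes differ only in which quantity is held fixed across layers. The following sketch computes the group number G per layer under each scheme. The helper `groups_for_layer`, its argument names, and the list of channel widths are hypothetical and chosen only for illustration; the actual per-layer widths depend on the network.

```python
def groups_for_layer(num_channels, num_groups=None, channels_per_group=None):
    """Return the group number G for one layer.

    Exactly one of num_groups (Table 3, top) or channels_per_group
    (Table 3, bottom) must be given. Hypothetical helper, not from the paper.
    """
    if (num_groups is None) == (channels_per_group is None):
        raise ValueError("specify exactly one of the two options")
    if num_groups is None:
        if num_channels % channels_per_group != 0:
            raise ValueError("channels must split evenly into groups")
        return num_channels // channels_per_group
    if num_channels % num_groups != 0:
        raise ValueError("channels must split evenly into groups")
    return num_groups


# Illustrative channel widths for successive layers.
widths = (64, 128, 256, 512, 1024, 2048)

fixed_g = [groups_for_layer(c, num_groups=32) for c in widths]
fixed_cpg = [groups_for_layer(c, channels_per_group=16) for c in widths]
as_ln = [groups_for_layer(c, num_groups=1) for c in widths]          # G = 1
as_in = [groups_for_layer(c, channels_per_group=1) for c in widths]  # G = C
```

With a fixed channel count per group, G grows with the layer width, which is the behavior the text notes for the bottom panel; the last two lists correspond to the LN and IN extremes.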
4.2. Object Detection and Segmentation in COCO backbone APbbox APbbox
50 APbbox
75 APmask APmask
50 APmask
75
*
BN 37.7 57.9 40.9 32.8 54.3 34.7
Next we evaluate fine-tuning the models for transferring
GN 38.8 59.2 42.2 33.6 55.9 35.4
to object detection and segmentation. These computer vi-
sion tasks in general benefit from higher-resolution input, Table 4. Detection and segmentation ablation results in COCO,
so the batch size tends to be small in common practice (1 or using Mask R-CNN with ResNet-50 C4. BN* means BN is frozen.
2 images/GPU [12, 47, 18, 36]). As a result, BN is turned
into a linear layer y = σγ (x − µ) + β where µ and σ are backbone box head AP
bbox
APbbox
50 APbbox
75 APmask APmask
50 APmask
75
pre-computed from the pre-trained model and frozen [20]. BN *
- 38.6 59.5 41.9 34.2 56.2 36.1
We denote this as BN* , which in fact performs no normal- BN* GN 39.5 60.0 43.2 34.4 56.4 36.3
ization during fine-tuning. We have also tried a variant that GN GN 40.0 61.0 43.3 34.8 57.3 36.3
fine-tunes BN (normalization is performed and not frozen)
Table 5. Detection and segmentation ablation results in COCO,
and found it works poorly (reducing ∼6 AP with a batch
using Mask R-CNN with ResNet-50 FPN and a 4conv1fc bound-
size of 2), so we ignore this variant.
ing box head. BN* means BN is frozen.
We experiment on the Mask R-CNN baselines [18], im-
plemented in the publicly available codebase of Detectron
[13]. We use the end-to-end variant with the same hyper- APbbox APbbox
50 APbbox
75 APmask APmask
50 APmask
75
parameters as in [13]. We replace BN* with GN during fine- R50 BN* 38.6 59.8 42.1 34.5 56.4 36.3
tuning, using the corresponding models pre-trained from R50 GN 40.3 61.0 44.0 35.7 57.9 37.7
ImageNet.4 During fine-tuning, we use a weight decay of 0 R50 GN, long 40.8 61.6 44.4 36.1 58.5 38.2
for the γ and β parameters, which is important for good de- R101 BN* 40.9 61.9 44.8 36.4 58.5 38.7
tection results when γ and β are being tuned. We fine-tune R101 GN 41.8 62.5 45.4 36.8 59.2 39.0
with a batch size of 1 image/GPU and 8 GPUs. R101 GN, long 42.3 62.8 46.2 37.2 59.7 39.5
The models are trained in the COCO train2017 Table 6. Detection and segmentation results in COCO using
set and evaluated in the COCO val2017 set (a.k.a Mask R-CNN and FPN. Here BN* is the default Detectron base-
minival). We report the standard COCO metrics of Av- line [13], and GN is applied to the backbone, box head, and mask
erage Precision (AP), AP50 , and AP75 , for bounding box head. “long” means training with more iterations. Code of these
detection (APbbox ) and instance segmentation (APmask ). results are in https://github.com/facebookresearch/
Detectron/blob/master/projects/GN.
Results of C4 backbone. Table 4 shows the comparison
of GN vs. BN* on Mask R-CNN using a conv4 backbone (“C4” [18]). This C4 variant uses ResNet’s layers of up to conv4 to extract feature maps, and ResNet’s conv5 layers as the Region-of-Interest (RoI) heads for classification and regression. As they are inherited from the pre-trained model, the backbone and head both involve normalization layers. On this baseline, GN improves over BN* by 1.1 box AP and 0.8 mask AP. We note that the pre-trained GN model is slightly worse than BN in ImageNet (24.1% vs. 23.6%), but GN still outperforms BN* for fine-tuning. BN* creates inconsistency between pre-training and fine-tuning (frozen), which may explain the degradation.

We have also experimented with the LN variant, and found it is 1.9 box AP worse than GN and 0.8 worse than BN*. Although LN is also independent of batch sizes, its representational power is weaker than GN’s.

Results of FPN backbone. Next we compare GN and BN* on Mask R-CNN using a Feature Pyramid Network (FPN) backbone [35], currently the state-of-the-art framework in COCO. Unlike the C4 variant, FPN exploits all pre-trained layers to construct a pyramid, and appends randomly initialized layers as the head. In [35], the box head consists of two hidden fully-connected layers (2fc). We find that replacing the 2fc box head with 4conv1fc (similar to [48]) can better leverage GN. The resulting comparisons are in Table 5.

As a baseline, BN* has 38.6 box AP using the 4conv1fc head, on par with its 2fc counterpart using the same pre-trained model (38.5 AP). By adding GN to all convolutional layers of the box head (but still using the BN* backbone), we increase the box AP by 0.9 to 39.5 (2nd row, Table 5). This ablation shows that a substantial portion of GN’s improvement for detection is from normalization in the head (which is also done by the C4 variant). On the contrary, applying BN to the box head (that has 512 RoIs per image) does not provide a satisfactory result and is ∼9 AP worse: in detection, the batch of RoIs is sampled from the same image and their distribution is not i.i.d., and the non-i.i.d. distribution is also an issue that degrades BN’s batch statistics estimation [25]. GN does not suffer from this problem.

Next we replace the FPN backbone with the GN-based counterpart, i.e., the GN pre-trained model is used during fine-tuning (3rd row, Table 5). Applying GN to the backbone alone contributes a 0.5 AP gain (from 39.5 to 40.0), suggesting that GN helps when transferring features.

Footnote 4: Detectron [13] uses pre-trained models provided by the authors of [20]. For fair comparisons, we instead use the models pre-trained in this paper. The object detection and segmentation accuracy is statistically similar between these pre-trained models.
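The batch dependence behind the RoI observation above can be illustrated with a minimal NumPy sketch. It is not the Detectron implementation: the learnable scale and shift are omitted, BN is shown in training mode only, and the tensor shapes (8 RoIs of 64 channels at 7×7 instead of 512 RoIs) are hypothetical choices for illustration. The same RoI feature is placed in two different batches; its GN output is unchanged, while its BN output depends on the other RoIs in the batch.

```python
import numpy as np

def group_norm(x, num_groups=32, eps=1e-5):
    """GN without scale/shift: per sample, normalize over (channels in group, H, W)."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def batch_norm_train(x, eps=1e-5):
    """Training-mode BN without scale/shift: per channel, normalize over (N, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
roi = rng.normal(size=(1, 64, 7, 7))  # one RoI feature (hypothetical shape)

# The same RoI is placed in two batches whose other members differ in distribution.
batch_a = np.concatenate([roi, rng.normal(size=(7, 64, 7, 7))])
batch_b = np.concatenate([roi, rng.normal(loc=3.0, size=(7, 64, 7, 7))])

# GN statistics come from the sample itself, so the RoI's output is the same.
gn_same = np.allclose(group_norm(batch_a)[0], group_norm(batch_b)[0])
# BN statistics come from the whole batch, so the RoI's output changes.
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
print("GN output unchanged:", gn_same)
print("BN output unchanged:", bn_same)
```

The sketch only shows the dependence on batch composition; it does not reproduce the ∼9 AP gap, which is an empirical result of the paper.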
[Figure 7 plot: two panels, Batch Norm (BN) on the left and Group Norm (GN) on the right. x-axis: epochs (0 to 100); y-axis: error (%) (40 to 60). Each panel shows curves for 8 clips/gpu and 4 clips/gpu.]
Figure 7. Error curves in Kinetics with an input length of 32 frames. We show ResNet-50 I3D’s validation error of BN (left) and GN (right) using a batch size of 8 and 4 clips/GPU. The monitored validation error is the 1-clip error under the same data augmentation as the training set, while the final validation accuracy in Table 8 is 10-clip testing without data augmentation.
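The 10-clip testing mentioned in the caption averages softmax scores over 10 regularly sampled clips of a video. A minimal NumPy sketch follows; the helper names, the 300-frame video length, the 64-frame clip span, and the random logits standing in for model outputs are all assumptions for illustration, not the paper's code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def clip_starts(num_frames, clip_span, num_clips=10):
    """Evenly spaced clip start indices (assumes num_frames >= clip_span)."""
    return np.linspace(0, num_frames - clip_span, num_clips).round().astype(int)

def video_scores(clip_logits):
    """Average per-clip softmax scores: (num_clips, num_classes) -> (num_classes,)."""
    return softmax(clip_logits).mean(axis=0)

def in_top_k(scores, label, k):
    """True if `label` is among the k highest-scoring classes."""
    return label in np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 400))  # stand-in for model outputs on 10 clips, 400 classes
logits[:, 7] += 8.0                  # make class 7 the clear prediction in every clip

scores = video_scores(logits)
starts = clip_starts(num_frames=300, clip_span=64)
print("clip starts:", starts)
print("top-1 hit:", in_top_k(scores, label=7, k=1))
print("top-5 hit:", in_top_k(scores, label=7, k=5))
```

Averaging probabilities rather than logits keeps each clip's vote bounded, so a single over-confident clip cannot dominate the video-level prediction.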
from scratch   AP^bbox  AP^bbox_50  AP^bbox_75  AP^mask  AP^mask_50  AP^mask_75
R50 BN [34]    34.5     55.2        37.7        -        -           -
R50 GN         39.5     59.8        43.6        35.2     56.9        37.6
R101 GN        41.0     61.1        44.9        36.4     58.2        38.7
Table 7. Detection and segmentation results trained from scratch in COCO using Mask R-CNN and FPN. Here the BN baseline’s results are from [34], and BN is synced across GPUs [43] and is not frozen. Code for these results is at https://github.com/facebookresearch/Detectron/blob/master/projects/GN.

clip length    32           32           64
batch size     8            4            4
BN             73.3 / 90.7  72.1 / 90.0  73.3 / 90.8
GN             73.0 / 90.6  72.8 / 90.6  74.5 / 91.7
Table 8. Video classification results in Kinetics: ResNet-50 I3D top-1 / top-5 accuracy (%).

Table 6 shows the full results of GN (applied to the backbone, box head, and mask head), compared with the standard Detectron baseline [13] based on BN*. Using the same hyper-parameters as [13], GN increases over BN* by a healthy margin. Moreover, we found that GN is not fully trained with the default schedule in [13], so we also tried increasing the iterations from 180k to 270k (BN* does not benefit from longer training). Our final ResNet-50 GN model (“long”, Table 6) is 2.2 points box AP and 1.6 points mask AP better than its BN* variant.

Training Mask R-CNN from scratch. GN allows us to easily investigate training object detectors from scratch (without any pre-training). We show the results in Table 7, where the GN models are trained for 270k iterations (footnote 5). To our knowledge, our numbers (41.0 box AP and 36.4 mask AP) are the best from-scratch results in COCO reported to date; they can even compete with the ImageNet-pretrained results in Table 6. As a reference, with synchronous BN [43], a concurrent work [34] achieves a from-scratch result of 34.5 box AP using R50 (Table 7), and 36.3 using a specialized backbone.

Footnote 5: For models trained from scratch, we turn off the default StopGrad in Detectron that freezes the first few layers.

4.3. Video Classification in Kinetics

Lastly we evaluate video classification in the Kinetics dataset [30]. Many video classification models [60, 6] extend the features to 3D spatial-temporal dimensions. This is memory-demanding and imposes constraints on the batch sizes and model designs.

We experiment with Inflated 3D (I3D) convolutional networks [6]. We use the ResNet-50 I3D baseline as described in [62]. The models are pre-trained from ImageNet. For both BN and GN, we extend the normalization from over (H, W) to over (T, H, W), where T is the temporal axis. We train in the 400-class Kinetics training set and evaluate in the validation set. We report the top-1 and top-5 classification accuracy, using standard 10-clip testing that averages softmax scores from 10 clips regularly sampled.

We study two different temporal lengths: 32-frame and 64-frame input clips. The 32-frame clip is regularly sampled with a frame interval of 2 from the raw video, and the 64-frame clip is sampled continuously. The model is fully convolutional in spacetime, so the 64-frame variant consumes about 2× more memory. We study a batch size of 8 or 4 clips/GPU for the 32-frame variant, and 4 clips/GPU for the 64-frame variant due to memory limitation.

Results of 32-frame inputs. Table 8 (col. 1, 2) shows the video classification accuracy in Kinetics using 32-frame clips. For the batch size of 8, GN is slightly worse than BN by 0.3% top-1 accuracy and 0.1% top-5. This shows that GN is competitive with BN when BN works well. For
the smaller batch size of 4, GN’s accuracy is kept similar (72.8 / 90.6 vs. 73.0 / 90.6) and is better than BN’s 72.1 / 90.0. BN’s top-1 accuracy is decreased by 1.2% when the batch size decreases from 8 to 4.

Figure 7 shows the error curves. BN’s error curves (left) have a noticeable gap when the batch size decreases from 8 to 4, while GN’s error curves (right) are very similar.

Results of 64-frame inputs. Table 8 (col. 3) shows the results of using 64-frame clips. In this case, BN has a result of 73.3 / 90.8. These appear to be acceptable numbers (vs. 73.3 / 90.7 of 32-frame, batch size 8), but the trade-off between the temporal length (64 vs. 32) and batch size (4 vs. 8) could have been overlooked. Comparing col. 3 and col. 2 in Table 8, we find that the temporal length actually has a positive impact (+1.2%), but it is veiled by BN’s negative effect of the smaller batch size.

GN does not suffer from this trade-off. The 64-frame variant of GN has 74.5 / 91.7 accuracy, showing healthy gains over its BN counterpart and all BN variants. GN helps the model benefit from temporal length, and the longer clip boosts the top-1 accuracy by 1.7% (top-5 1.1%) with the same batch size.

The improvement of GN on detection, segmentation, and video classification demonstrates that GN is a strong alternative to the powerful and currently dominant BN technique in these tasks.

5. Discussion and Future Work

We have presented GN as an effective normalization layer without exploiting the batch dimension. We have evaluated GN’s behaviors in a variety of applications. We note, however, that BN has been so influential that many state-of-the-art systems and their hyper-parameters have been designed for it, which may not be optimal for GN-based models. It is possible that re-designing the systems or searching new hyper-parameters for GN will give better results.

In addition, we have shown that GN is related to LN and IN, two normalization methods that are particularly successful in training recurrent (RNN/LSTM) or generative (GAN) models. This suggests studying GN in those areas in the future. We will also investigate GN’s performance on learning representations for reinforcement learning (RL) tasks, e.g., [54], where BN is playing an important role for training very deep models [20].

Acknowledgement. We would like to thank Piotr Dollár and Ross Girshick for helpful discussions.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In Operating Systems Design and Implementation (OSDI), 2016.
[2] D. Arpit, Y. Zhou, B. Kota, and V. Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In ICML, 2016.
[3] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
[4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.
[5] M. Carandini and D. J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 2012.
[6] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[8] T. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, 2016.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[10] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.
[11] S. Dieleman, J. De Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In ICML, 2016.
[12] R. Girshick. Fast R-CNN. In ICCV, 2015.
[13] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[14] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[16] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
[17] S. Gross and M. Wilber. Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch, 2016.
[18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] D. J. Heeger. Normalization of cell responses in cat striate cortex. Visual neuroscience, 1992.
[22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
[23] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[24] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[25] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In NIPS, 2017.
[26] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[27] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[28] K. Jarrett, K. Kavukcuoglu, Y. LeCun, et al. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[29] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[30] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv:1705.06950, 2017.
[31] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.
[32] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[33] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 1998.
[34] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. DetNet: A backbone network for object detection. arXiv:1804.06215, 2018.
[35] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
[37] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[38] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[39] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[40] S. Lyu and E. P. Simoncelli. Nonlinear image representation using divisive normalization. In CVPR, 2008.
[41] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
[43] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. MegDet: A large mini-batch object detector. In CVPR, 2018.
[44] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[45] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, 2017.
[46] M. Ren, R. Liao, R. Urtasun, F. H. Sinz, and R. S. Zemel. Normalizing the normalizers: Comparing and extending network normalization schemes. In ICLR, 2017.
[47] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[48] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. TPAMI, 2017.
[49] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[51] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
[52] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature neuroscience, 2001.
[53] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[54] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 2017.
[55] E. P. Simoncelli and B. A. Olshausen. Natural image statistics and neural representation. Annual review of neuroscience, 2001.
[56] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[57] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR Workshop, 2016.
[58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[59] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[60] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[61] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
[62] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
[63] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[64] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.
[65] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.