Group Normalization

Metadata

Slug: group_norm_2018
Year: 2018
Venue: ECCV
Authors: Yuxin Wu, Kaiming He
Reading status: read complete
Compute regime: Multi-GPU dense training
Primary sources: PDF, extracted text

Compute Setup

The paper gives the worker scale but not the exact GPU model. For ImageNet classification, the authors use 8 GPUs for all models and define batch size as images per worker; BN mean and variance are computed inside each GPU and are not synchronized across workers. The ImageNet setup trains ResNet models for 100 epochs on about 1.28M training images and evaluates 224x224 center crops on 50,000 validation images.
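A small arithmetic sketch makes the distinction between global batch and BN batch concrete. It uses only the figures stated above (8 GPUs, about 1.28M images, 100 epochs); the per-GPU batch sizes 32 down to 2 are the ones used later in these notes, and the function and key names are illustrative rather than taken from the paper.

```python
# Sketch: what the stated ImageNet setup implies for batch sizes and step counts.
NUM_GPUS = 8              # stated in the notes
TRAIN_IMAGES = 1_280_000  # "about 1.28M", rounded for the arithmetic
EPOCHS = 100              # stated in the notes


def schedule(images_per_gpu: int) -> dict:
    """Return the batch geometry for a given per-GPU batch size."""
    global_batch = NUM_GPUS * images_per_gpu
    steps_per_epoch = TRAIN_IMAGES // global_batch
    return {
        "images_per_gpu": images_per_gpu,
        # BN statistics are not synchronized, so BN only ever sees this many images.
        "bn_sample_count": images_per_gpu,
        # The gradient, by contrast, is averaged over the global batch.
        "global_batch": global_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * EPOCHS,
    }


for n in (32, 16, 8, 4, 2):
    print(schedule(n))
```

At 32 images/GPU the global batch is 256 and an epoch takes 5,000 steps, but the quantity that matters for BN is the per-GPU count, which is what shrinks in the small-batch experiments.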

For transfer tasks the setup is more explicitly memory-constrained. Mask R-CNN on COCO is fine-tuned with 1 image per GPU on 8 GPUs. The paper also studies video classification on Kinetics with ResNet-50 I3D, where the 64-frame variant uses about 2x more memory than 32-frame input and is therefore evaluated at 4 clips/GPU. Since the extracted text does not name the GPU SKU, the hardware can only be inferred as 2018-era multi-GPU NVIDIA training at FAIR; the important stated fact is per-GPU batch pressure, not accelerator type.

Bottleneck

The compute bottleneck is the coupling between normalization quality and per-device batch size. BN works well when each worker has enough samples, for example 32 images/GPU in the ImageNet experiments. Dense vision systems break that assumption: detection and segmentation use high-resolution images and region heads, video uses 3D spatiotemporal activations, and larger backbones compete with the batch for activation memory. Under these conditions, practical batch sizes fall to 1 or 2 images/GPU, so BN's batch statistics become noisy estimates and the function being trained changes with the per-worker batch size.
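The noise argument can be illustrated with a toy simulation. This is a sketch under strong assumptions, not the paper's experiment: each "image" contributes one independent standard-normal value to the batch mean, whereas real BN also pools over spatial positions that are correlated within an image. The function name and trial count are hypothetical.

```python
import random
import statistics


def spread_of_batch_means(batch_size: int, trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of the batch mean across many simulated batches.

    Assumption: one i.i.d. N(0, 1) value per sample, so the true mean is 0 and
    any spread in the estimate is pure sampling noise.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


for n in (32, 16, 8, 4, 2):
    # Theory for i.i.d. samples: the spread shrinks as 1 / sqrt(n).
    print(n, round(spread_of_batch_means(n), 3), round(n ** -0.5, 3))
```

Under this toy model the estimate at 2 samples is roughly four times noisier than at 32, which is the direction of the effect the paper attributes to small per-GPU batches.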

The paper also treats synchronized BN as a compute-structure workaround rather than a full solution. Synchronizing statistics across GPUs turns the small-batch problem into a communication and hardware requirement: the number of GPUs must scale with the batch size BN needs for stable statistics, and the synchronization step limits the use of asynchronous solvers. GN is motivated as a way to remove that cross-device dependency.
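To show where the cross-device dependency comes from, here is a generic sketch of how synchronized statistics are usually formed: each worker contributes a count, a sum, and a sum of squares, and the totals give a global mean and variance. This is an illustration of the general idea, not code from the paper; the plain Python loop stands in for an all-reduce, and the function names are hypothetical.

```python
def local_moments(values):
    """Per-worker partial results: count, sum, and sum of squares."""
    return len(values), sum(values), sum(v * v for v in values)


def synced_mean_var(per_worker_values):
    """Combine per-worker moments into one global mean and (biased) variance.

    The accumulation loop below stands in for an all-reduce: every worker
    must wait for every other worker's partial sums before it can normalize.
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    for values in per_worker_values:
        n, s, sq = local_moments(values)
        count += n
        total += s
        total_sq += sq
    mean = total / count
    var = total_sq / count - mean * mean  # E[x^2] - (E[x])^2
    return mean, var


# Two workers with two activations each behave like one batch of four.
print(synced_mean_var([[1.0, 2.0], [3.0, 4.0]]))
```

The statistics improve only by adding workers, and every normalization layer gains a synchronization point, which is the hardware and solver constraint described above.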

Method Adaptation

Group Normalization makes normalization local to the sample and independent of the batch axis. For an activation tensor with batch, channel, height, and width axes, GN divides the channels into groups and, for each sample and each group, computes the mean and variance over all spatial positions and all channels in that group. The default ImageNet setting is 32 groups. The extreme cases connect GN to prior methods: a single group is equivalent to LayerNorm, and one channel per group is equivalent to InstanceNorm.
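The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the (N, C, H, W) layout, the optional per-channel scale and shift arguments `gamma` and `beta`, and the `eps` value of 1e-5 are assumptions made for the example.

```python
import numpy as np


def group_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Group-normalize an activation tensor of shape (N, C, H, W).

    gamma, beta: optional per-channel scale and shift of shape (C,).
    eps: small constant for numerical stability (value assumed here).
    """
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    # Split the channel axis into (groups, channels_per_group).
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics are per sample and per group: reduce over the channels in
    # the group and over both spatial axes, never over the batch axis.
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    out = ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    if gamma is not None:
        out = out * gamma.reshape(1, c, 1, 1)
    if beta is not None:
        out = out + beta.reshape(1, c, 1, 1)
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64, 7, 7))
y = group_norm(x, num_groups=32)   # the default group count in the notes
y_ln = group_norm(x, num_groups=1)   # one group: LayerNorm-style statistics
y_in = group_norm(x, num_groups=64)  # one channel per group: InstanceNorm-style
print(y.shape, y_ln.shape, y_in.shape)
```

Because no reduction touches the batch axis, the output for one sample does not depend on which other samples share its batch, so the same code path serves training and inference.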

This is a device-aware adaptation because it changes which resource controls normalization quality. BN needs enough examples on the same worker; GN needs enough channel and spatial elements inside a single example. That fits high-resolution detection and video workloads, where activation tensors are large even when the number of images or clips per GPU is small. GN also computes the same function at training and test time, which suits fine-tuning better than frozen BN: the method requires neither running averages inherited from large-batch pretraining nor cross-device statistics at fine-tune time.
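Counting the elements pooled into each mean and variance shows the resource shift directly. The sketch below assumes a hypothetical feature map of 256 channels at 56x56 and 32 groups; these shapes are chosen for illustration and are not taken from the paper.

```python
def bn_elements_per_stat(n: int, c: int, h: int, w: int) -> int:
    """BN keeps one mean/variance per channel, pooled over batch and space."""
    return n * h * w


def gn_elements_per_stat(n: int, c: int, h: int, w: int, groups: int = 32) -> int:
    """GN keeps one mean/variance per (sample, group), pooled over the
    channels in the group and over space. The batch size n plays no role."""
    return (c // groups) * h * w


C, H, W = 256, 56, 56  # hypothetical feature-map shape
for n in (32, 2, 1):
    print(n, bn_elements_per_stat(n, C, H, W), gn_elements_per_stat(n, C, H, W))
```

With these shapes BN pools 100,352 elements per statistic at 32 images/GPU but only 6,272 at 2 images/GPU, all drawn from just two images, while GN pools 25,088 elements at every batch size.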

The benefit is not primarily a reduction in FLOPs. GN adds per-sample reductions and a normalization step, but these are small relative to the convolutions in the backbone. Its compute value is that it lets the model spend memory on resolution, temporal length, or backbone capacity instead of on batch size.
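A back-of-the-envelope estimate supports the claim that the extra reductions are small. The layer shape (256 channels at 56x56) and the figure of 8 operations per activation for GN are rough assumptions made for this sketch, not measurements from the paper.

```python
def conv3x3_flops(c_in: int, c_out: int, h: int, w: int) -> int:
    """FLOPs for one 3x3 convolution with 'same' output size.

    Counts 2 FLOPs (multiply + add) per kernel tap per output element.
    """
    return 2 * 9 * c_in * c_out * h * w


def group_norm_flops(c: int, h: int, w: int, ops_per_element: int = 8) -> int:
    """Rough GN cost: a handful of ops per activation for the mean, the
    variance, the normalization, and the scale and shift.
    ops_per_element is an assumed constant, not a measured one."""
    return ops_per_element * c * h * w


C, H, W = 256, 56, 56  # hypothetical layer shape
conv = conv3x3_flops(C, C, H, W)
gn = group_norm_flops(C, H, W)
print(f"GN / conv FLOP ratio: {gn / conv:.4%}")
```

Under these assumptions GN costs about 0.17% of the FLOPs of the adjacent convolution. FLOP counts understate wall-clock cost for normalization layers, which tend to be limited by memory bandwidth, so the estimate should be read as an order-of-magnitude argument only.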

Evidence

On ImageNet ResNet-50 with 32 images/GPU, BN is still slightly ahead: 23.6% validation error for BN versus 24.1% for GN. LN is 25.3% and IN is 28.4%, so GN is the closest batch-independent method in the comparison. When the batch is reduced, the result flips sharply. With per-GPU batch sizes 32, 16, 8, 4, and 2, GN stays around 24.0-24.2% error, while BN rises from 23.6 to 34.7%. At 2 images/GPU, GN is 24.1% versus BN's 34.7%, a 10.6 point gap.

The deeper ImageNet experiment shows the same structure. ResNet-101 with 32 images/GPU has 22.0% error for BN and 22.4% for GN; with batch size 2, GN has 23.0% while BN rises to 31.9%.
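The gaps quoted above can be recomputed from the reported error rates. The sketch uses only the numbers given in these notes; the dictionary layout and helper names are illustrative.

```python
# Validation error (%) as reported in the notes, keyed by (model, images per GPU).
reported = {
    ("ResNet-50", 32): {"BN": 23.6, "GN": 24.1},
    ("ResNet-50", 2): {"BN": 34.7, "GN": 24.1},
    ("ResNet-101", 32): {"BN": 22.0, "GN": 22.4},
    ("ResNet-101", 2): {"BN": 31.9, "GN": 23.0},
}


def bn_minus_gn(model: str, batch: int) -> float:
    """Positive values mean GN has the lower error."""
    entry = reported[(model, batch)]
    return round(entry["BN"] - entry["GN"], 1)


def degradation(model: str, method: str) -> float:
    """Error increase when going from 32 to 2 images per GPU."""
    return round(reported[(model, 2)][method] - reported[(model, 32)][method], 1)


for model in ("ResNet-50", "ResNet-101"):
    print(model,
          "gap@32:", bn_minus_gn(model, 32),
          "gap@2:", bn_minus_gn(model, 2),
          "BN degradation:", degradation(model, "BN"),
          "GN degradation:", degradation(model, "GN"))
```

For ResNet-50, BN loses 11.1 points going from 32 to 2 images/GPU while GN is unchanged; for ResNet-101, BN loses 9.9 points and GN loses 0.6.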

In COCO Mask R-CNN, the fine-tuning setup is 1 image/GPU on 8 GPUs. For a ResNet-50 C4 backbone, GN improves over frozen BN from 37.7 to 38.8 box AP and from 32.8 to 33.6 mask AP. With FPN, the R50 frozen-BN baseline is 38.6 box AP and 34.5 mask AP; with the longer training schedule, R50 GN reaches 40.8 box AP and 36.1 mask AP, and R101 GN reaches 42.3 and 37.2.

For Kinetics video, results with 32-frame clips at 8 clips/GPU are close: BN 73.3/90.7 top-1/top-5 and GN 73.0/90.6. At 4 clips/GPU, BN drops to 72.1/90.0 while GN stays at 72.8/90.6. With 64-frame clips at 4 clips/GPU, GN reaches 74.5/91.7, beating BN's 73.3/90.8 and showing that GN lets the model benefit from longer temporal context instead of losing that gain to noisier small-batch statistics.

Historical Effect

GN made normalization less dependent on data-parallel batch geometry. It is best read as a response to an activation-memory bottleneck in dense vision: once images, masks, regions, and video clips dominate memory, the old ImageNet batch regime no longer fits. GN helped establish the practice of designing layers around per-device memory and synchronization constraints rather than assuming large, homogeneous minibatches.

Limits

BN remains slightly better in the paper's regular ImageNet large-per-worker-batch setting, probably because stochastic batch statistics add regularization. GN introduces a group-count choice, although the paper shows stable results across several group divisions. It does not reduce the memory of the backbone itself; it removes a batch-size constraint so that memory can be spent elsewhere. The exact GPU model is not reported in the extracted text, so device conclusions should stay at the level of 8-GPU training and per-GPU memory pressure.