GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Metadata

Slug: gpipe_2018
Year: 2018
Venue: arXiv
Authors: Yanping Huang et al.
Reading status: read complete
Compute regime: TPU/GPU model-parallel training
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly lists accelerator settings. AmoebaNet scaling is tested on Cloud TPU v2 with 8GB memory per accelerator, using 224x224 inputs and minibatch size 128. Transformer scaling is tested on Cloud TPU v3 with 16GB memory per accelerator core, vocabulary size 32K, sequence length 1024, and batch size 32. Communication-overhead experiments run on a single host with multiple NVIDIA P100 GPUs without NVLink, forcing slower PCIe device-host-device transfers.

The memory table states that each model parameter needs 12 bytes because RMSProp is used during training. Without GPipe, a single 8GB accelerator fits only an 82M-parameter AmoebaNet-D. With rematerialization and batch splitting on one accelerator, it fits 318M parameters; with 8 pipeline partitions on 8 accelerators, it fits 1.8B parameters. For Transformers on TPU v3, a naive single accelerator fits 282.2M parameters, while 128 partitions fit an 83.9B-parameter Transformer.

The fit calculation depends on activations, not just parameter bytes. For the 82M AmoebaNet baseline, 12 bytes per parameter works out to about 0.98 GB (the paper's table lists 1.05GB of parameter memory), and peak activations add 6.26 GB, for a total of roughly 7.2 to 7.3 GB on an 8GB accelerator. With rematerialization, the 318M one-accelerator case uses about 3.82 GB of parameter state and 3.46 GB of peak activation memory, again landing near 7.28 GB. GPipe's reported limits are therefore actual memory-bound fits, not loose parameter-count estimates.
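The arithmetic above can be reproduced in a few lines. This is only a sketch: the function names are illustrative, and it assumes decimal gigabytes (1 GB = 1e9 bytes), which is why it yields 0.98 GB rather than the table's 1.05GB for the baseline.

```python
BYTES_PER_PARAM = 12   # per the paper's memory table (RMSProp training)
GB = 1e9               # assumption: decimal gigabytes
ACCELERATOR_GB = 8.0   # Cloud TPU v2 memory quoted in these notes


def param_state_gb(num_params: float) -> float:
    """Memory taken by weights plus optimizer state, in GB."""
    return num_params * BYTES_PER_PARAM / GB


def total_gb(num_params: float, peak_activation_gb: float) -> float:
    """Parameter state plus peak activation memory, in GB."""
    return param_state_gb(num_params) + peak_activation_gb


naive_total = total_gb(82e6, 6.26)    # no GPipe, one accelerator
remat_total = total_gb(318e6, 3.46)   # rematerialization + batch splitting

for name, total in [("naive-1", naive_total), ("pipeline-1", remat_total)]:
    print(f"{name}: {total:.2f} GB of {ACCELERATOR_GB:.0f} GB")
```

Both totals sit just under the 8GB limit, which is what makes the reported sizes look like true memory-bound maxima.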

Bottleneck

GPipe targets two coupled bottlenecks: device memory capacity and accelerator utilization. Giant sequential networks do not fit on one GPU/TPU because weights, optimizer state, and activations together exceed its memory. Naive layer-wise model parallelism can place layers on different devices, but because the network is sequential, only one partition runs at a time and the other devices sit idle. Pipelining recovers most of that time; the paper calls the idle time that remains "bubble overhead."

A second bottleneck is the communication pattern. Tensor-splitting SPMD approaches scale individual matrix multiplications but require frequent all-reduce-like communication and high-speed interconnects. GPipe instead communicates only at partition boundaries. The P100-without-NVLink experiment tests whether pipeline partitioning still helps when the interconnect is slow.

Method Adaptation

GPipe represents a network as a sequence of layers and partitions consecutive layer groups into pipeline "cells." Each cell is placed on a separate accelerator, and communication primitives are inserted at partition boundaries. The user specifies the number of partitions K, the number of microbatches M, and the layer sequence. A cost estimator tries to balance partition compute so the pipeline stages take similar time.
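As a rough illustration of balancing partition compute, the sketch below splits a list of per-layer cost estimates into at most K contiguous cells while minimizing the most expensive cell. The min-max objective, the function name, and the integer costs are assumptions chosen for simplicity; this is a stand-in for the idea, not the paper's partitioning algorithm.

```python
from typing import List


def split_balanced(costs: List[int], k: int) -> List[List[int]]:
    """Split layer indices into at most k contiguous cells, minimizing the
    largest per-cell cost. `costs` must be non-empty, non-negative integers."""

    def cells_under(cap: int) -> List[List[int]]:
        # Greedily pack consecutive layers until the next one would exceed cap.
        cells, current, load = [], [], 0
        for i, c in enumerate(costs):
            if current and load + c > cap:
                cells.append(current)
                current, load = [], 0
            current.append(i)
            load += c
        cells.append(current)
        return cells

    # Binary search for the smallest cap that needs no more than k cells.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if len(cells_under(mid)) <= k:
            hi = mid
        else:
            lo = mid + 1
    return cells_under(lo)


layer_costs = [1, 1, 1, 1, 4, 4, 2, 2]   # hypothetical per-layer estimates
cells = split_balanced(layer_costs, k=4)
print(cells)
```

With these hypothetical costs every cell ends up with a total cost of 4, so no stage would hold up the others. Real layer costs rarely divide this evenly, which is where imbalance enters.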

The key scheduling adaptation is batch splitting. A minibatch of size N is split into M equal microbatches that flow through K partitions. Gradients are computed for each microbatch using the same parameters as the forward pass. At the end of the full minibatch, gradients are accumulated and a single synchronous update is applied. This avoids PipeDream-style weight staleness and multiple versioned parameter copies.
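The update semantics can be checked on a toy problem. The sketch below uses a one-parameter least-squares model (an assumption for illustration, as are all function names) and shows that accumulating microbatch gradients under fixed weights, followed by one update, matches an ordinary full-minibatch step.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y)


def grad_sum(w: float, batch: List[Example]) -> float:
    """Sum over examples of d/dw (w*x - y)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def split(batch: List[Example], m: int) -> List[List[Example]]:
    """Split a minibatch into m equal microbatches."""
    size = len(batch) // m
    assert size * m == len(batch), "minibatch must divide evenly"
    return [batch[i * size:(i + 1) * size] for i in range(m)]


def microbatched_step(w: float, minibatch: List[Example], m: int, lr: float) -> float:
    accumulated = 0.0
    for micro in split(minibatch, m):
        accumulated += grad_sum(w, micro)          # same w for every microbatch
    return w - lr * accumulated / len(minibatch)   # one synchronous update


def full_batch_step(w: float, minibatch: List[Example], lr: float) -> float:
    return w - lr * grad_sum(w, minibatch) / len(minibatch)


data = [(float(i), 3.0 * i) for i in range(1, 9)]
w_micro = microbatched_step(0.5, data, m=4, lr=0.01)
w_full = full_batch_step(0.5, data, lr=0.01)
print(w_micro, w_full)
```

Because the weights never change between microbatches, the two results agree up to floating-point rounding, regardless of M.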

The key memory adaptation is rematerialization. During the forward pass, each accelerator stores only partition-boundary activations. During backpropagation it recomputes the partition's internal forward function as needed. This trades extra compute for a much smaller activation footprint, which is exactly the right trade when accelerator memory, not raw FLOPs, blocks model growth.
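A minimal sketch of the idea, assuming a partition made of three scalar layers (the layers, names, and scalar setting are all illustrative, not taken from the paper): the forward pass keeps only the partition's input, and the backward pass recomputes the internal activations before applying the chain rule.

```python
import math
from typing import Callable, List, Tuple

# Each layer is a pair (f, df/dx); both take the layer's input.
Layer = Tuple[Callable[[float], float], Callable[[float], float]]

LAYERS: List[Layer] = [
    (lambda x: 2.0 * x, lambda x: 2.0),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: x * x, lambda x: 2.0 * x),
]


def forward(layers: List[Layer], x: float) -> float:
    """Run the partition without storing any internal activations."""
    for f, _ in layers:
        x = f(x)
    return x


def backward_with_remat(layers: List[Layer], boundary_input: float,
                        grad_out: float) -> float:
    """Recompute internal activations from the saved boundary input,
    then backpropagate grad_out through the partition."""
    inputs = []
    x = boundary_input
    for f, _ in layers:            # the extra forward pass (recomputation)
        inputs.append(x)
        x = f(x)
    grad = grad_out
    for (_, df), x_in in zip(reversed(layers), reversed(inputs)):
        grad *= df(x_in)           # chain rule, last layer first
    return grad


x0 = 0.3
analytic = backward_with_remat(LAYERS, x0, grad_out=1.0)
h = 1e-6
numeric = (forward(LAYERS, x0 + h) - forward(LAYERS, x0 - h)) / (2 * h)
print(analytic, numeric)
```

The stored state between forward and backward is a single value per partition, at the cost of running the partition's forward function a second time.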

The pipeline adaptation is parameterized by bubble overhead. The paper derives the bubble time amortized over microsteps and reports that the overhead is negligible when M >= 4 x K, that is, with at least four microbatches per partition. That rule turns batch size into a utilization knob: enough microbatches keep multiple devices busy even though the model is sequential.
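To get a feel for the rule, the sketch below evaluates the bubble fraction (K - 1) / (M + K - 1). That expression is the form the paper gives for amortized bubble time; it is not quoted in these notes, and it assumes all K stages take equal time per microbatch. The function name is illustrative.

```python
def bubble_fraction(k: int, m: int) -> float:
    """Idle fraction of an idealized K-stage pipeline fed M microbatches,
    assuming every stage takes the same time per microbatch."""
    return (k - 1) / (m + k - 1)


for k in (2, 4, 8):
    row = ", ".join(f"M={m}: {bubble_fraction(k, m):.3f}" for m in (1, k, 4 * k, 32))
    print(f"K={k} -> {row}")
```

At M = 4 x K the idealized fraction stays below 1/5 for any K, and it keeps shrinking as M grows. The paper also notes that recomputation in the backward pass can be scheduled early, which may help explain why the measured overhead is smaller than this simple bound suggests.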

Evidence

The memory evidence is central. On 8GB accelerators, naive AmoebaNet training fits 82M parameters with 1.05GB parameter memory and 6.26GB peak activation memory. GPipe on one accelerator fits 318M parameters and reduces peak activation memory to 3.46GB. With 8 pipeline partitions, AmoebaNet reaches 1.8B parameters, which the paper describes as 25x more than is possible without GPipe (1.8B / 82M is closer to 22x). Scaling is not perfectly linear because AmoebaNet layers are imbalanced.

For Transformer memory scaling on TPU v3, a naive single accelerator fits a 3-layer, 282.2M-parameter Transformer. Pipeline-1 with rematerialization fits 13 layers and 785.8M parameters. Pipeline-128 fits 1663 layers and 83.9B parameters, a 298x increase. Transformer maximum size scales more linearly because layers have similar parameter counts and input sizes.

Throughput evidence supports the microbatch rule. On TPUs, when M is much larger than K, Transformer training scales nearly linearly: at M=32, normalized throughput rises from 1.8 at K=2 to 3.4 at K=4 and 6.3 at K=8. When M=1, there is effectively no pipeline parallelism. On the P100 host without NVLink, increasing partitions from 2 to 8 gives 2.7x speedup for AmoebaNet-D(18,128) and 3.3x for a 24-layer Transformer.
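The "nearly linear" claim can be read off the quoted numbers directly. The sketch below computes the speedup gained per doubling of K and, for the P100 runs, the fraction of the ideal 4x achieved when going from 2 to 8 partitions. The variable and function names are illustrative; the figures are the ones quoted above.

```python
from typing import Dict, Tuple


def doubling_speedups(throughput_by_k: Dict[int, float]) -> Dict[Tuple[int, int], float]:
    """Ratio of throughput between consecutive partition counts."""
    ks = sorted(throughput_by_k)
    return {(a, b): throughput_by_k[b] / throughput_by_k[a]
            for a, b in zip(ks, ks[1:])}


# Transformer on TPU at M=32, normalized throughput by K.
transformer_tpu_m32 = {2: 1.8, 4: 3.4, 8: 6.3}
speedups = doubling_speedups(transformer_tpu_m32)
for (a, b), s in speedups.items():
    print(f"K={a} -> K={b}: {s:.2f}x (ideal 2.00x)")

# P100 without NVLink: speedup from K=2 to K=8, where ideal is 4x.
p100_speedup_2_to_8 = {"AmoebaNet-D(18,128)": 2.7, "Transformer-24": 3.3}
p100_efficiency = {name: s / 4.0 for name, s in p100_speedup_2_to_8.items()}
for name, eff in p100_efficiency.items():
    print(f"{name}: {eff:.0%} of ideal")
```

Each doubling of K on TPU yields close to a 1.9x gain for the Transformer, while the P100 runs retain roughly two thirds to four fifths of ideal scaling despite PCIe-only transfers.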

End-task evidence shows the larger models are not just fit-to-memory demos. A 557M-parameter AmoebaNet-B(18,512), trained on ImageNet 2012 with 480x480 inputs and 4 partitions, reaches 84.4% top-1 and 97% top-5 single-crop validation accuracy. For multilingual NMT, a single 128-layer, 6B-parameter Transformer trained across 103 languages outperforms individually trained 350M-parameter bilingual Transformer Big models on 100 language pairs. The paper also reports that increasing the effective batch size from 260K to 1M to 4M tokens improves German-English BLEU from 30.92 to 31.86 to 32.71.

Historical Effect

GPipe made pipeline parallelism and rematerialization a practical recipe for models that exceed one accelerator's memory. It reframed model scaling as a placement, scheduling, and activation-memory problem. Later large-model systems reuse the same ideas: split layers across devices, split batches into microbatches, overlap pipeline stages, recompute activations, and control gradient update semantics.

The paper bridges vision-era giant CNNs and Transformer-era scale: it demonstrates a 557M-parameter AmoebaNet and a 6B multilingual Transformer, while the memory table constructs far larger partitioned Transformers. That made pipeline parallelism a general accelerator primitive rather than a one-off architecture trick.

Limits

GPipe assumes the model can be expressed as a sequence of layers and that a single layer fits on one accelerator.
Bubble overhead remains significant when the number of microbatches is too small relative to partitions.
Imbalanced layers reduce scaling, as seen in AmoebaNet versus the more uniform Transformer.
BatchNorm and other batch-dependent layers require special handling because training statistics are computed over each microbatch rather than over the full minibatch.
Rematerialization spends extra compute to save memory, so the tradeoff is best when memory is the limiting resource.