ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Metadata

Slug: zero_2019
Year: 2019
Venue: arXiv
Authors: Samyam Rajbhandari et al.
Reading status: read complete
Compute regime: Hyperscale dense LLM training
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly lists a 400-GPU V100 system:

400 V100 GPUs across 25 DGX-2 nodes.
800 Gbps internode communication bandwidth.
It analyzes 32GB current-generation GPUs and 1K V100 GPU settings.

The implementation evaluated in the paper is ZeRO-100B, a PyTorch system combining optimizer-state and gradient partitioning with residual-state memory optimizations. For experiments with model parallelism, ZeRO is combined with Megatron-LM. The paper also gives a concrete contrast to Megatron hardware: the most recent Megatron result cited used 512 32GB V100 GPUs across 32 DGX-2 nodes, while ZeRO's main experiments use the available 400-GPU cluster.

Bottleneck

The bottleneck is memory redundancy in data-parallel training. Standard data parallelism replicates optimizer states, gradients, and parameters on every GPU, so it runs out of memory for models above roughly 1.4B parameters on current 32GB GPUs. This is true even though the fp16 weights of a 1.5B-parameter GPT-2-like model are only about 3GB; mixed-precision Adam also keeps fp32 parameters, momentum, variance, gradients, activations, temporary buffers, and fragmented allocations.

The paper gives concrete memory arithmetic. Adam with mixed precision needs fp16 parameters and gradients plus fp32 parameters, momentum, and variance. For a 1.5B-parameter model, the model states alone require at least 24GB, before activations. At sequence length 1K and batch size 32, activations for the same model require about 60GB, or about 8GB after activation checkpointing. For a 100B-parameter GPT-like model, activations can still be around 60GB even with checkpointing. The limitation is therefore not only parameter count but the entire training state.

Model parallelism avoids some memory pressure but can suffer poor communication granularity and low hardware utilization. The paper reports that Megatron-LM on a 40B model across two DGX-2 nodes achieved about 5 TFLOP/s per V100, less than 5% of peak. ZeRO's premise is that data-parallel usability and model-parallel memory efficiency should not be mutually exclusive.

Method Adaptation

ZeRO adapts dense LLM training by partitioning model states across data-parallel ranks:

Stage 1 partitions optimizer states.
Stage 2 partitions gradients.
Stage 3 partitions parameters.
ZeRO-R handles activation memory, temporary buffers, and fragmentation.

The method keeps much of the usability of data parallelism while removing redundant memory. Stage 1, optimizer-state partitioning, gives a 4x model-state memory reduction with the same communication volume as data parallelism. Stage 2 adds gradient partitioning for an 8x reduction, still with the same communication volume. Stage 3 adds parameter partitioning, yielding a memory reduction proportional to the data-parallel degree with a modest communication increase, described in the paper as about 50%.

ZeRO-R attacks residual state memory after model states are partitioned. It partitions activation checkpoints in model-parallel settings, can offload activation checkpoints to CPU when appropriate, sizes temporary buffers to balance memory and throughput, and proactively manages fragmentation so training does not fail despite apparent free memory. The activation partitioning is especially tied to hardware topology: it can reduce activation memory by the model-parallel degree, allowing larger per-GPU batches and better arithmetic intensity, but CPU offload can hurt unless the extra PCIe movement is outweighed by a larger feasible batch.

This is a bandwidth and batching adaptation. ZeRO avoids moving all model states to CPU because the paper treats CPU memory bandwidth and PCIe transfer as severe constraints. Instead, it uses GPU aggregate memory across the data-parallel group. Lower per-GPU memory lets the same GPUs run larger microbatches, which improves throughput and can reduce data-parallel communication per token.

Evidence

The headline evidence is the 400-GPU V100 evaluation. ZeRO-100B runs models up to 170B parameters on 400 GPUs, while the baseline Megatron-style model parallel system cannot scale efficiently beyond 40B in the paper's comparison. For 8B to 100B models, ZeRO-100B sustains 15 PFLOP/s on average, over 30% of peak. The paper attributes the baseline degradation to communication volume and to leaving a DGX-2 node: bandwidth drops from 300GB/sec per link through NVSwitch to 12.5GB/sec per link over InfiniBand EDR.

The model-configuration table makes the batching effect concrete. A 100B ZeRO model on 400 GPUs uses 16-way model parallelism, 125 layers, hidden size 8192, 64 heads, batch size 32, and total batch size 800. The 100B baseline uses 384 GPUs, 128-way model parallelism, 128 heads, batch size 2, and total batch size 6. For 170B, ZeRO runs 400 GPUs with 16-way model parallelism and total batch size 300, while the baseline listed fit uses 256-way model parallelism and total batch size 2.

ZeRO also reports a data-parallel-only usability result: up to 13B parameters on 128 GPUs without model or pipeline parallelism, averaging over 40 TFLOP/s per GPU. Without ZeRO, data parallelism alone tops out at 1.4B parameters and less than 20 TFLOP/s per GPU. Finally, the paper reports Turing-NLG at over 17B parameters, trained end-to-end with ZeRO-100B, sustaining 41.4 TFLOP/s per GPU and reaching WebText-103 perplexity 10.21.

The trillion-parameter analysis is explicitly a memory feasibility argument, not a full training-cost result. The paper says a 1T-parameter model with Adam in 16-bit precision needs about 16TB for optimizer states, gradients, and parameters. Divided over 1024 GPUs, that is 16GB per GPU, within a 32GB V100 memory budget. The paper also warns that training such a model on then-current compute would take over a year and would require an exaflop system for reasonable time.

Historical Effect

ZeRO reframed LLM scaling as a memory-redundancy problem. It made data-parallel-style training viable for models that previously required more difficult manual model parallelism, and it became a foundation for DeepSpeed-style large-model systems. Its historical place complements Megatron-LM: Megatron partitions tensor computation inside layers, while ZeRO partitions the training state that data parallelism normally replicates.

This card belongs in the hyperscale dense LLM regime because it shows that accelerator count alone is not enough. The effective resource is aggregate GPU memory, usable batch size, interconnect bandwidth, and optimizer-state placement. ZeRO made those resources accessible without rewriting every model into a bespoke model-parallel program.

Limits

Fitting trillion-parameter models is not the same as training them cheaply; compute time remains prohibitive.
Some very large configurations require CPU offload.
Communication and batch-size constraints remain.
Stage 3 trillion-scale claims are partly analytical in this paper; the evaluated ZeRO-100B system uses a subset of the full ZeRO design.
CPU offload is useful only in narrow cases because GPU-CPU-GPU transfers can dominate.
The approach reduces redundancy but does not reduce dense Transformer FLOPs per token.