Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Metadata

Slug: megatron_lm_2019
Year: 2019
Venue: arXiv
Authors: Mohammad Shoeybi et al.
Reading status: read complete
Compute regime: Hyperscale dense LLM training
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly lists NVIDIA hardware:

Single-GPU baseline: NVIDIA V100 32GB GPU in a DGX-2H server.
Full runs use up to 32 DGX-2H servers, totaling 512 Tesla V100 SXM3 32GB GPUs.

The infrastructure details matter: each DGX-2H server provides 300 GB/sec bandwidth between GPUs through NVSwitch, and servers are connected with 100 GB/sec interconnect bandwidth using 8 InfiniBand adapters per server. The largest GPT-2-like configuration uses 8-way tensor model parallelism and 64-way data parallelism, for 512 GPUs total. The paper's strong single-GPU baseline is a 1.2B-parameter model on one V100 32GB GPU, sustaining 39 TFLOP/s during training.

Bottleneck

The bottleneck is memory and communication for dense Transformers. Multi-billion-parameter language models exceed the memory of one V100 once parameters, optimizer state, activations, and temporary buffers are included. Data parallelism alone does not reduce model memory, while naively increasing batch size for throughput can damage convergence. Pipeline parallelism can reduce memory pressure but introduces scheduling bubbles and implementation complexity.

The training memory lower bound is much larger than the raw parameter file. Mixed-precision Adam commonly needs roughly 16 bytes per parameter before activations: half-precision parameters and gradients plus FP32 master parameters and Adam moments. On that accounting, the 8.3B GPT-like model implies about 133 GB of model state, more than four 32GB V100s before storing activations or temporary buffers. Even the 1.2B single-GPU baseline consumes about 19 GB of model state, leaving limited space on a 32GB V100 for sequence activations, dropout, checkpointing buffers, and framework overhead.
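
The accounting above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the byte breakdown follows the 16-bytes-per-parameter convention described in the paragraph, and treating the V100's 32GB as decimal gigabytes is an assumption made for simplicity. The names BYTES_PER_PARAM and model_state_gb are illustrative, not from the paper.

```python
# Bytes per parameter for mixed-precision Adam, before any activations.
BYTES_PER_PARAM = {
    "fp16_params": 2,
    "fp16_grads": 2,
    "fp32_master_params": 4,
    "fp32_adam_first_moment": 4,
    "fp32_adam_second_moment": 4,
}

V100_GB = 32  # nominal capacity; decimal GB is an assumption


def model_state_gb(num_params: float) -> float:
    """Model-state footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


for name, n in [("1.2B", 1.2e9), ("8.3B", 8.3e9)]:
    gb = model_state_gb(n)
    print(f"{name}: {gb:.1f} GB of model state, {gb / V100_GB:.2f} V100s")
```

The 8.3B case lands just above four full V100s, which is why at least some form of model partitioning is required before activations are even considered.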

Megatron-LM builds on a specific hardware fact: inside a DGX-2H, V100s are connected with very high bandwidth, so intra-layer tensor parallelism can split large matrix multiplications while keeping GPUs compute-bound. The method keeps fine-grained model-parallel communication off the slower inter-node links where possible, and combines tensor-parallel groups with data parallelism for full-cluster scale.

Method Adaptation

Megatron-LM adapts Transformers to V100 clusters by:

Splitting Transformer MLP matrix multiplications across GPUs.
Splitting attention heads across GPUs.
Combining tensor model parallelism within layers with data parallelism across replicas.
Using activation checkpointing to reduce memory.
Avoiding pipeline bubbles by using intra-layer parallelism.
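
The combination of tensor and data parallelism in the list above can be pictured as a two-dimensional grid of ranks. The sketch below is one plausible layout, not the paper's code: it assumes ranks are numbered contiguously within each 16-GPU server and that each tensor-parallel group takes 8 consecutive ranks. The helper names build_groups and server_of are hypothetical.

```python
GPUS_PER_SERVER = 16    # one DGX-2H
NUM_SERVERS = 32
TENSOR_PARALLEL = 8     # tensor model parallel degree of the largest config
WORLD_SIZE = GPUS_PER_SERVER * NUM_SERVERS


def build_groups(world_size, tp):
    """Return (tensor-parallel groups, data-parallel groups) as rank lists.

    Assumed layout: consecutive ranks share a tensor-parallel group, and
    ranks at the same offset in each group form a data-parallel group.
    """
    assert world_size % tp == 0
    dp = world_size // tp
    tp_groups = [list(range(i * tp, (i + 1) * tp)) for i in range(dp)]
    dp_groups = [list(range(j, world_size, tp)) for j in range(tp)]
    return tp_groups, dp_groups


def server_of(rank):
    """Server index under the contiguous-numbering assumption."""
    return rank // GPUS_PER_SERVER


tp_groups, dp_groups = build_groups(WORLD_SIZE, TENSOR_PARALLEL)

# Every tensor-parallel group should sit inside one server, so its
# all-reduces stay on the intra-node NVSwitch links.
tp_within_server = all(
    len({server_of(r) for r in group}) == 1 for group in tp_groups
)

print(WORLD_SIZE, len(tp_groups), len(dp_groups), tp_within_server)
```

Under this layout the frequent per-layer collectives never leave a server, while the once-per-step gradient all-reduce of each data-parallel group is the traffic that crosses InfiniBand.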

The MLP split is designed around GEMM and GeLU. The first linear layer is split column-wise so each GPU computes its partition and applies GeLU locally, avoiding a synchronization point before the nonlinearity. The second linear layer is split row-wise and followed by a reduction across GPUs. The paper says this requires one all-reduce in the forward pass and one in the backward pass for the MLP block.
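
The column-then-row split can be checked numerically on one machine. The sketch below simulates the GPUs as list entries and the all-reduce as a sum; it omits biases and dropout, uses the tanh approximation of GeLU, and all sizes are toy values chosen for illustration.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(0)
tokens, hidden, ffn, n_gpus = 4, 8, 32, 4   # toy sizes, not the paper's

X = rng.standard_normal((tokens, hidden))
A = rng.standard_normal((hidden, ffn))      # first linear layer
B = rng.standard_normal((ffn, hidden))      # second linear layer

# Unpartitioned reference.
Z_ref = gelu(X @ A) @ B

# Column split of A, row split of B: each simulated GPU holds one shard of each.
A_shards = np.split(A, n_gpus, axis=1)
B_shards = np.split(B, n_gpus, axis=0)

# Each GPU applies GeLU to its own columns with no communication, then
# produces a partial output of full shape (tokens, hidden).
partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

# Stand-in for the single forward all-reduce.
Z_par = np.sum(partials, axis=0)
print("column/row split matches:", np.allclose(Z_ref, Z_par))

# Counter-example: splitting A by rows needs a sum *before* GeLU,
# because GeLU of a sum is not the sum of GeLUs.
X_cols = np.split(X, n_gpus, axis=1)
A_rows = np.split(A, n_gpus, axis=0)
H_wrong = np.sum([gelu(X_i @ A_i) for X_i, A_i in zip(X_cols, A_rows)], axis=0)
print("row split without sync matches:", np.allclose(H_wrong, gelu(X @ A)))
```

The counter-example is the reason the first layer is split by columns: the nonlinearity can then be applied shard-locally, and the only synchronization is the sum after the second layer.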

Self-attention is split by heads. The key, query, and value GEMMs are partitioned so the matrix multiply for each attention head is local to one GPU. The output projection is then parallelized to take the parallel attention output directly. For a transformer layer, the paper summarizes the result as two all-reduces in the forward path and two in the backward path. That is the core compute adaptation: add a small number of large, structured collectives to keep tensor cores doing large GEMMs.
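
The head-wise split follows the same pattern as the MLP and can be verified the same way. This is a simplified sketch: one sequence, no causal mask, no biases or dropout, toy sizes, and each simulated GPU is assumed to own a contiguous block of heads. The function attention_heads is a hypothetical helper, not an API from the paper's code.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention_heads(X, Wq, Wk, Wv, head_dim):
    """Concatenated outputs of whichever heads are contained in Wq/Wk/Wv."""
    n_heads = Wq.shape[1] // head_dim
    outs = []
    for h in range(n_heads):
        cols = slice(h * head_dim, (h + 1) * head_dim)
        q, k, v = X @ Wq[:, cols], X @ Wk[:, cols], X @ Wv[:, cols]
        probs = softmax(q @ k.T / np.sqrt(head_dim))
        outs.append(probs @ v)
    return np.concatenate(outs, axis=-1)


rng = np.random.default_rng(0)
seq, hidden, n_heads, n_gpus = 5, 16, 4, 2   # toy sizes
head_dim = hidden // n_heads

X = rng.standard_normal((seq, hidden))
Wq, Wk, Wv, Wo = (rng.standard_normal((hidden, hidden)) for _ in range(4))

# Unpartitioned reference.
Y_ref = attention_heads(X, Wq, Wk, Wv, head_dim) @ Wo

# Q/K/V projections split by columns (whole heads per GPU),
# output projection split by rows.
shards = zip(
    np.split(Wq, n_gpus, axis=1),
    np.split(Wk, n_gpus, axis=1),
    np.split(Wv, n_gpus, axis=1),
    np.split(Wo, n_gpus, axis=0),
)
partials = []
for Wq_i, Wk_i, Wv_i, Wo_i in shards:
    local_heads = attention_heads(X, Wq_i, Wk_i, Wv_i, head_dim)  # no communication
    partials.append(local_heads @ Wo_i)

Y_par = np.sum(partials, axis=0)   # stand-in for the forward all-reduce
print("head split matches:", np.allclose(Y_ref, Y_par))
```

Together with the MLP block, this accounts for the two forward all-reduces per layer: one after the attention output projection and one after the second MLP layer.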

The embedding and loss are also communication-aware. Because the GPT-2 vocabulary has tens of thousands of tokens, all-gathering full logits would communicate batch x sequence x vocabulary elements. Megatron parallelizes the embedding over the vocabulary dimension and fuses the parallel output GEMM with the cross-entropy loss, so that communication is over per-token scalar losses (batch x sequence elements) instead of full logits. The paper also pads the vocabulary to 51,200 so that the per-GPU vocabulary size is a multiple of 128 under up to 8-way model parallelism, which better fits efficient GEMM shapes.
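
Both ideas in this paragraph can be sketched in a few lines. The padding helper rounds up to a multiple of 128 x 8; the unpadded GPT-2 vocabulary size of 50,257 is not stated in these notes and is supplied here as background. The loss part is one plausible way to realize a vocabulary-parallel cross-entropy, assuming the reductions are taken over per-token maxima, exponent sums, and target logits; it is not claimed to be the paper's exact kernel, and the toy sizes are illustrative.

```python
import numpy as np


def pad_vocab(vocab_size, multiple=128, tp=8):
    """Round vocab_size up so each of tp shards is a multiple of `multiple`."""
    unit = multiple * tp
    return -(-vocab_size // unit) * unit   # ceiling division


GPT2_VOCAB = 50257   # background fact, not taken from the notes above
padded = pad_vocab(GPT2_VOCAB)
print(padded, padded // 8)

rng = np.random.default_rng(0)
tokens, hidden, vocab, n_gpus = 6, 8, 32, 4   # toy sizes; tokens = batch x sequence
H = rng.standard_normal((tokens, hidden))
E = rng.standard_normal((hidden, vocab))      # output embedding, split over vocab
targets = rng.integers(0, vocab, size=tokens)

# Reference: full logits on one device.
logits = H @ E
m = logits.max(axis=1)
loss_ref = (m + np.log(np.exp(logits - m[:, None]).sum(axis=1))
            - logits[np.arange(tokens), targets])

# Parallel: each simulated GPU keeps only its own vocabulary slice of the logits.
shard = vocab // n_gpus
local_logits = [H @ E_i for E_i in np.split(E, n_gpus, axis=1)]

# Reduction 1 (max): one scalar per token.
g_max = np.max([l.max(axis=1) for l in local_logits], axis=0)
# Reduction 2 (sum): one scalar per token.
sum_exp = np.sum([np.exp(l - g_max[:, None]).sum(axis=1) for l in local_logits], axis=0)
# Reduction 3 (sum): the target logit, contributed only by the owning shard.
tgt = np.zeros(tokens)
for i, l in enumerate(local_logits):
    lo = i * shard
    rows = np.nonzero((targets >= lo) & (targets < lo + shard))[0]
    tgt[rows] += l[rows, targets[rows] - lo]

loss_par = g_max + np.log(sum_exp) - tgt
print("losses match:", np.allclose(loss_ref, loss_par))
print("elements communicated:", 3 * tokens, "vs full logits:", tokens * vocab)
```

In this sketch the communicated volume scales with batch x sequence and is independent of the vocabulary size, which is the property the paragraph describes.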

Training uses mixed precision with dynamic loss scaling, Adam, gradient clipping, dropout, and activation checkpointing after every transformer layer. Mixed precision and checkpointing in particular are not incidental training tricks; they are what let dense Transformer blocks fit into V100 memory while keeping tensor-core throughput high.

Evidence

The scaling evidence is explicit. The paper trains GPT-2-like models up to 8.3B parameters; the 8.3B run uses 72 layers, hidden size 3072, 24 attention heads, and 512 GPUs, with a reported epoch time of about 2.10 days. The full application sustains 15.1 PFLOP/s. Reported scaling efficiency relative to the strong single-GPU baseline is 76% in the paper's summary and 74% in the detailed model+data parallel scaling results.

The single-GPU baseline matters because it is not weak: the 1.2B model on a single V100 32GB GPU sustains 39 TFLOP/s, about 30% of theoretical peak. Scaling the 8.3B model with 8-way model parallelism achieves 77% of linear scaling, and with model plus data parallelism on 512 GPUs still achieves 74% of linear scaling. The fixed global batch size for the model+data parallel cases is 512, corresponding to 64-way data parallelism.
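
The headline figures can be cross-checked against each other. The sketch below uses only numbers quoted in this section; the implied peak is derived from the stated 30% figure rather than taken from a hardware datasheet, so it should be read as a consistency check and not as a specification.

```python
BASELINE_TFLOPS = 39.0      # single-GPU baseline, 1.2B model
BASELINE_FRACTION = 0.30    # stated fraction of theoretical peak
NUM_GPUS = 512
SUSTAINED_PFLOPS = 15.1     # full application, 8.3B model
GLOBAL_BATCH = 512
DATA_PARALLEL = 64

# Perfect linear scaling of the baseline across all GPUs, in PFLOP/s.
ideal_pflops = BASELINE_TFLOPS * NUM_GPUS / 1000.0

# Efficiency implied by the sustained throughput.
efficiency = SUSTAINED_PFLOPS / ideal_pflops

# Per-GPU throughput at scale and the peak implied by the 30% statement.
per_gpu_tflops = SUSTAINED_PFLOPS * 1000.0 / NUM_GPUS
implied_peak_tflops = BASELINE_TFLOPS / BASELINE_FRACTION

# Samples handled by each model replica (one 8-GPU tensor-parallel group) per step.
per_replica_batch = GLOBAL_BATCH // DATA_PARALLEL

print(f"ideal: {ideal_pflops:.2f} PFLOP/s, efficiency: {efficiency:.1%}")
print(f"per GPU at scale: {per_gpu_tflops:.1f} TFLOP/s, implied peak: {implied_peak_tflops:.0f} TFLOP/s")
print(f"per-replica batch: {per_replica_batch}")
```

The 15.1 PFLOP/s figure corresponds to roughly 76% of ideal, which matches the higher of the two efficiency numbers quoted for the 512-GPU run.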

The language-model results show that the hardware scaling is not just synthetic throughput. The 8.3B GPT-2-like model reaches WikiText-103 perplexity 10.81 and LAMBADA cloze accuracy 66.51%, improving over the smaller 355M and 2.5B models. The paper also trains BERT-like models up to 3.9B parameters and reports RACE accuracy 90.9% in the summary. The training corpus is 174GB of deduplicated text built from Wikipedia, CC-Stories, RealNews, and OpenWebText, with leakage filtering for evaluation datasets.

Historical Effect

Megatron-LM made dense Transformer scaling a practical tensor-partitioning problem on V100 clusters. It is a key precursor to the hyperscale LLM training stack where tensor parallelism, activation checkpointing, mixed precision, optimizer memory, and interconnect topology all become part of the model recipe.

The historical effect is also methodological: the paper showed that large Transformer layers have enough internal structure to split with only a few collectives, without requiring a new compiler or a full rewrite of the framework. That made tensor parallelism a standard axis that later systems combined with pipeline parallelism, ZeRO-style optimizer partitioning, and data parallelism.

Limits

Models larger than those studied will require more memory than is available within the 16 GPUs of one DGX-2H server.
The paper points toward hybrid intra-layer, inter-layer, and inter-node parallelism.
It does not solve optimizer-state redundancy by itself; ZeRO addresses that separately.
The method depends on very high intra-node bandwidth; efficiency drops when fine-grained model-parallel communication crosses slower inter-node links.
Communication is minimized but not removed: every model-parallel transformer layer still performs all-reduces.
Dense scaling remains expensive because every parameter participates for every token.
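
The residual communication noted in the limits above can be sized roughly. This sketch rests on several assumptions that are not stated in these notes: each all-reduce is taken to operate on an activation tensor of batch x sequence x hidden elements in FP16, the sequence length of 1024 is a placeholder, and the forward collectives are assumed to repeat when activation checkpointing recomputes a layer. Both function names are hypothetical.

```python
def allreduces_per_iteration(num_layers, forward_per_layer=2,
                             backward_per_layer=2, recompute_forward=False):
    """Count tensor-parallel all-reduces in one training iteration."""
    per_layer = forward_per_layer + backward_per_layer
    if recompute_forward:
        # Assumption: recomputing a checkpointed layer repeats its forward collectives.
        per_layer += forward_per_layer
    return num_layers * per_layer


def allreduce_bytes(batch, seq_len, hidden, bytes_per_element=2):
    """Payload of one all-reduce over a (batch, seq_len, hidden) FP16 tensor."""
    return batch * seq_len * hidden * bytes_per_element


LAYERS, HIDDEN = 72, 3072      # 8.3B configuration from the Evidence section
PER_REPLICA_BATCH = 8          # global batch 512 / 64-way data parallelism
SEQ_LEN = 1024                 # assumed placeholder, not stated in these notes

count = allreduces_per_iteration(LAYERS)
count_ckpt = allreduces_per_iteration(LAYERS, recompute_forward=True)
payload_mb = allreduce_bytes(PER_REPLICA_BATCH, SEQ_LEN, HIDDEN) / 1e6

print(count, count_ckpt, f"{payload_mb:.1f} MB per all-reduce")
```

Hundreds of collectives of this size per iteration are tolerable at NVSwitch bandwidth, which is consistent with the dependence on intra-node links listed above.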