OPT: Open Pre-trained Transformer Language Models

Metadata

Compute Setup

The paper explicitly states that OPT-175B was trained on 992 80GB NVIDIA A100 GPUs with Meta's metaseq codebase. The training stack combines Fully Sharded Data Parallel with Megatron-LM tensor parallelism. The largest model uses 96 layers, 96 attention heads, hidden dimension 12,288, a sequence length of 2,048, and a global batch size of 2M tokens. The final pretraining corpus contains roughly 180B tokens, corresponding to about 800 GB of text, from RoBERTa-style data, selected Pile subsets, and Pushshift.io Reddit.
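As a rough cross-check of the configuration above, the sketch below estimates the non-embedding parameter count with the common 12 * L * d^2 approximation. It assumes a GPT-3-style block with a 4x MLP expansion and ignores biases, layer norms, and embeddings; reading "2M tokens" as 2**21 is also an assumption, and the function name is illustrative.

```python
# Rough decoder-only Transformer sizing from the figures quoted above.
N_LAYERS = 96
N_HEADS = 96
D_MODEL = 12_288
SEQ_LEN = 2_048


def approx_block_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameters as 12 * L * d^2.

    Per layer: 4*d^2 for attention (Q, K, V, and output projections) plus
    8*d^2 for an MLP with an assumed 4x hidden expansion.
    """
    return 12 * n_layers * d_model ** 2


params = approx_block_params(N_LAYERS, D_MODEL)
head_dim = D_MODEL // N_HEADS
# Assumption: "2M tokens" is interpreted as 2**21 tokens.
seqs_per_batch = 2 ** 21 // SEQ_LEN

print(f"approx. non-embedding parameters: {params / 1e9:.1f}B")
print(f"head dimension: {head_dim}")
print(f"sequences per global batch: {seqs_per_batch}")
```

Under these assumptions the estimate lands just under 175B, which is consistent with the nominal model size once embeddings are added.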

OPT's hardware report is less topology-heavy than BLOOM's, but it documents the key memory and throughput choices. Adam state is kept in FP32 but sharded across hosts, while model weights remain in FP16. Dynamic loss scaling is used to avoid underflows. The paper reports up to 147 TFLOP/s per GPU. It also emphasizes that metaseq was, at the time, an open-source implementation for training a decoder-only Transformer at 175B scale on NVIDIA GPUs without pipeline parallelism.

Bottleneck

The bottleneck is not only raw A100 count. OPT exposes the operational fragility of trying to reproduce GPT-3-class dense training openly. At 175B parameters, weights, activations, gradients, and Adam state exceed the memory of any one GPU; the run therefore depends on sharding and tensor-parallel collectives being reliable for months. The training job also depends on a large cluster staying healthy: the paper reports many host failures and restart paths, so cluster churn becomes a first-order part of the method.

The memory arithmetic explains why this is a sharding problem, not just a throughput problem. A mixed-precision Adam lower bound is roughly 175B × 16 bytes, or 2.8 TB of model state, before activations and temporary buffers. For inference, FP16 weights alone are about 350 GB. With 96 layers, 96 attention heads, head dimension 128, and a 2048-token context, a batch-1 FP16 KV cache adds roughly 9-10 GB. Thus even serving at low batch needs at least five 80GB A100s unless quantization reduces the weight footprint; GPTQ later exploits exactly this gap.
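The arithmetic in this paragraph can be reproduced with a few lines. This is a back-of-the-envelope sketch: the 16 bytes per parameter figure is taken from the text and treated as given, decimal gigabytes are assumed, and the helper name is illustrative.

```python
import math

N_PARAMS = 175_000_000_000
BYTES_PER_PARAM_TRAINING = 16      # mixed-precision Adam accounting, as quoted in the text
A100_BYTES = 80 * 10 ** 9          # 80GB card, decimal gigabytes assumed


def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Size of the key/value cache: two tensors (K and V) per layer."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_value


train_state_bytes = N_PARAMS * BYTES_PER_PARAM_TRAINING
fp16_weight_bytes = N_PARAMS * 2
kv_bytes = kv_cache_bytes(n_layers=96, n_heads=96, head_dim=128, seq_len=2048)
min_gpus_for_weights = math.ceil(fp16_weight_bytes / A100_BYTES)

print(f"training model state: {train_state_bytes / 1e12:.1f} TB")
print(f"FP16 weights:         {fp16_weight_bytes / 1e9:.0f} GB")
print(f"batch-1 KV cache:     {kv_bytes / 1e9:.2f} GB")
print(f"80GB A100s for weights alone: {min_gpus_for_weights}")
```

The GPU count is a floor for holding weights only; activations, the KV cache at larger batch sizes, and framework overhead push the practical number higher.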

Numerical stability is the second bottleneck. OPT used FP16 weights and dynamic loss scaling. Loss divergences correlated with the dynamic loss scalar crashing to zero and final-layer activation norms spiking. This made checkpoint selection and restart policy part of the training algorithm. The authors reduced the learning rate mid-flight, restarted from earlier checkpoints, and adjusted gradient clipping after observing divergence behavior.
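To make the "loss scalar crashing to zero" failure mode concrete, here is a minimal sketch of a generic dynamic loss scaler. It is not metaseq's implementation: the class name, the default values, and the `collapsed` check are all hypothetical and chosen only for illustration.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler; every default here is illustrative."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000, min_scale=1e-4):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self._clean_steps = 0

    def update(self, grads):
        """Inspect gradients; return True if the optimizer step should run."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # A run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True

    @property
    def collapsed(self):
        """True once repeated overflows have driven the scale below a floor."""
        return self.scale < self.min_scale


scaler = DynamicLossScaler(growth_interval=3)
for _ in range(40):                      # a long streak of overflowing steps
    scaler.update([float("inf")])
print(scaler.scale, scaler.collapsed)
```

A restart policy like the one described above would roll back to a checkpoint saved before the scale entered this collapsed regime.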

Data and batching are also compute constraints. The authors varied batch size mostly for computational efficiency, settled on a 2M-token batch for 175B, and removed some Pile subsets after small-scale experiments showed gradient-norm spikes or instability risk. The system is therefore shaped by memory capacity, global communication, numerical range, and dataset-induced optimizer behavior.

Method Adaptation

OPT adapts the GPT-3 recipe to A100-era open infrastructure by using FSDP to shard model-related state across ranks and Megatron-LM tensor parallelism to split large matrix operations. Keeping Adam state in FP32 preserves optimizer precision, while sharding it across hosts makes the memory footprint tolerable. FP16 weights reduce GPU memory and communication volume, and dynamic loss scaling tries to keep gradients inside representable range.
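The tensor-parallel split of large matrix operations can be illustrated with a toy MLP in pure Python. This sketch follows the general Megatron-style pattern (first weight split by columns, second by rows, partial outputs summed as an all-reduce would do); it runs all "ranks" in one process, and the function names and the ReLU activation are assumptions made for the example.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def reference_mlp(x, w1, w2):
    """Unsharded computation: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def tensor_parallel_mlp(x, w1, w2, n_ranks):
    """Each simulated rank holds one column shard of w1 and one row shard of w2."""
    w1_shards = split_columns(w1, n_ranks)
    w2_shards = split_rows(w2, n_ranks)
    # The elementwise activation needs no communication under a column split.
    partials = [matmul(relu(matmul(x, a)), b)
                for a, b in zip(w1_shards, w2_shards)]
    # Summing partial outputs stands in for the all-reduce collective.
    out = partials[0]
    for p in partials[1:]:
        out = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(out, p)]
    return out


x = [[1, -2, 3], [0, 1, -1]]
w1 = [[1, 0, -1, 2], [2, 1, 0, -1], [0, -3, 1, 1]]
w2 = [[1, 2], [0, 1], [-1, 0], [3, -2]]
print(tensor_parallel_mlp(x, w1, w2, n_ranks=2) == reference_mlp(x, w1, w2))
```

The only cross-rank communication in this layout is the final sum, which is why the reliability of that collective matters so much over a months-long run.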

Several "method" choices are operational responses rather than clean ablations. The paper uses a gradient predivide factor to reduce over/underflow risk when computing gradients across all ranks. During training, the team lowered the learning rate after divergences, restarted from checkpoints where the dynamic loss scalar was still healthy, and changed gradient clipping from 1.0 to 0.3 early in the run. They also tried SGD, reset the dynamic loss scalar, and upgraded Megatron; SGD plateaued and was reverted, while the newer Megatron version reduced activation-norm pressure and improved throughput.
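Two of these interventions are easy to sketch. The first function pair shows why splitting the division by the world size into two steps can avoid overflow when summing gradients in a narrow format; the second shows global-norm clipping at thresholds of 1.0 and 0.3. The FP16 model here is deliberately crude (it only saturates above the largest finite FP16 value and ignores rounding and underflow), the 1024-rank world size and gradient values are made up, and the square-root predivide factor is an assumption of this sketch.

```python
import math

FP16_MAX = 65504.0  # largest finite FP16 value


def to_fp16_range(x):
    """Crude FP16 model: overflow to inf, no rounding or underflow."""
    return x if abs(x) <= FP16_MAX else math.copysign(math.inf, x)


def allreduce_mean_naive(grads):
    """Sum across ranks first, divide by the world size afterwards."""
    total = 0.0
    for g in grads:
        total = to_fp16_range(total + g)
    return to_fp16_range(total / len(grads))


def allreduce_mean_predivide(grads, predivide=None):
    """Divide by a predivide factor before the sum and by N / factor after."""
    n = len(grads)
    pre = math.sqrt(n) if predivide is None else predivide
    total = 0.0
    for g in grads:
        total = to_fp16_range(total + to_fp16_range(g / pre))
    return to_fp16_range(total / (n / pre))


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads]


per_rank_grads = [100.0] * 1024          # hypothetical: same value on every rank
print(allreduce_mean_naive(per_rank_grads))
print(allreduce_mean_predivide(per_rank_grads))
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
print(clip_by_global_norm([3.0, 4.0], max_norm=0.3))
```

In this toy setting the naive running sum exceeds the FP16 range before the final division, while the predivided version stays finite; lowering the clipping threshold simply shrinks the maximum update size per step.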

The data pipeline is likewise compute-adapted. The corpus is deduplicated with MinHashLSH at a Jaccard similarity threshold of at least 0.95, and the Pile subset is curated partly for stability. Some subsets were eliminated because they increased gradient-norm spikes at 1.3B scale or were otherwise unsuitable. This is an important detail: open dense LLM training is not just "more text"; data mixture affects loss spikes and restart risk.
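The idea behind MinHash-based deduplication can be sketched with the standard library alone. This is not the paper's pipeline: it omits the LSH banding step and compares signatures directly, and the shingle size, the number of hash functions, and all function names are illustrative assumptions.

```python
import hashlib


def shingles(text, k=5):
    """Set of overlapping k-word shingles for a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k])
            for i in range(max(1, len(tokens) - k + 1))}


def minhash_signature(items, num_perm=128):
    """One minimum per salted hash function; salts stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                                salt=salt).digest(), "big")
            for s in items))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=0.95):
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold


doc = "the quick brown fox jumps over the lazy dog near the quiet river bank"
other = "completely unrelated sentence about training very large language models today"
print(is_near_duplicate(doc, doc))
print(is_near_duplicate(doc, other))
```

A production pipeline would add banding so that candidate pairs are found without comparing every pair of documents; the signature comparison above is only the similarity estimate at the core of that process.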

Evidence

The paper reports up to 147 TFLOP/s per A100, which is strong utilization for a 992-GPU dense Transformer job. It also reports at least 35 manual restarts, more than 70 estimated automatic restarts, and more than 100 hosts cycled over two months because of hardware failures. When loss diverged, lowering the learning rate and restarting from earlier checkpoints allowed training to continue. The released logbook is part of the evidence: the paper treats human intervention and infrastructure churn as measurable costs of scale.
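For scale, the per-GPU throughput can be turned into an aggregate figure and a utilization ratio. The 312 TFLOP/s peak used below is NVIDIA's published dense FP16/BF16 tensor-core figure for the A100 and is an assumption brought in from outside this summary; the variable names are illustrative.

```python
N_GPUS = 992
ACHIEVED_TFLOPS_PER_GPU = 147.0
A100_PEAK_TFLOPS = 312.0   # assumed: vendor-quoted dense FP16/BF16 peak

aggregate_pflops = N_GPUS * ACHIEVED_TFLOPS_PER_GPU / 1000.0
utilization = ACHIEVED_TFLOPS_PER_GPU / A100_PEAK_TFLOPS

print(f"aggregate throughput: {aggregate_pflops:.1f} PFLOP/s")
print(f"fraction of assumed peak: {utilization:.0%}")
```

Because the paper reports "up to" 147 TFLOP/s, these are best-case numbers; restarts and host cycling lower the effective average over the whole run.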

Performance evidence is framed against GPT-3. Across 14 zero-shot NLP tasks, average OPT performance follows the GPT-3 trend by parameter count, although task-level results vary. The paper says OPT roughly matched GPT-3 for 10 tasks, underperformed on ARC Challenge and MultiRC, and showed unpredictable behavior on small validation tasks such as CB, BoolQ, and WSC. In dialogue evaluation, unsupervised OPT-175B reports 10.8 perplexity and 0.185 unigram F1 on ConvAI2, competitive with supervised BlenderBot 1's 10.2 and 0.183.

Environmental evidence is also tied to hardware. The paper estimates OPT-175B's direct carbon emissions at about 75 tons CO2e, versus published estimates of roughly 500 tons for GPT-3 and 380 tons for Gopher. It also warns that including ablations, baselines, and downtime makes the total cost roughly 2x higher.

Historical Effect

OPT's historical role is transparency around the real mechanics of 175B training. GPT-3 had established the capability frontier, but OPT exposed the engineering burden: sharded optimizer state, tensor-parallel implementation, host churn, dynamic-loss-scale failures, learning-rate interventions, and data mixture decisions. That made the cost of reproducing frontier dense LLMs easier to audit.

It also shifted the open-model conversation from small public checkpoints to a full family of baselines up to GPT-3 scale, plus code and logs. Even with controlled non-commercial access to the largest weights, OPT gave researchers a concrete artifact for studying scale, bias, toxicity, evaluation reproducibility, and training-process risk.

Limits

OPT is not compute-optimal by later data-scaling standards. The paper itself notes Hoffmann et al. (2022): many LLMs may be undertrained as a function of data used, implying that more tokens could improve performance without increasing parameter count. OPT-175B's roughly 180B-token corpus is small relative to later trillion-token recipes.
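A quick ratio makes the gap visible. The sketch below compares the corpus size with a commonly quoted rule of thumb of about 20 training tokens per parameter, often attributed to Hoffmann et al. (2022); that constant is an approximation assumed here, not a figure from this summary, and the calculation uses corpus size rather than total tokens seen during training.

```python
N_PARAMS = 175e9
CORPUS_TOKENS = 180e9
TOKENS_PER_PARAM_RULE = 20   # assumed rule of thumb, not from the OPT paper

corpus_tokens_per_param = CORPUS_TOKENS / N_PARAMS
rule_of_thumb_tokens = N_PARAMS * TOKENS_PER_PARAM_RULE

print(f"corpus tokens per parameter: {corpus_tokens_per_param:.2f}")
print(f"rule-of-thumb token budget:  {rule_of_thumb_tokens / 1e12:.1f}T")
```

Under that assumed ratio, a 175B-parameter model would call for a token budget in the trillions, which is the sense in which the corpus is small by later standards.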

The release is also controlled and non-commercial, so it is not fully open in the sense of unrestricted deployment. The paper's environmental accounting is useful but incomplete, because embodied hardware emissions, experimentation, downtime, and downstream inference are not standardized. Finally, the model inherits the usual LLM issues: hallucination, repetition, toxicity, bias, English-heavy data, and evaluation sensitivity to prompting and setup.

Links

  • Compute regime: history/compute_regimes/hyperscale_dense_llm_training/README.md
  • Source PDF and extracted text are listed in metadata above.
  • Queue status: read_complete.