Training Compute-Optimal Large Language Models

Metadata

Slug: chinchilla_2022
Year: 2022
Venue: arXiv
Authors: Jordan Hoffmann et al.
Reading status: read complete
Compute regime: Hyperscale dense LLM training
Primary sources: PDF, extracted text

Compute Setup

The paper states that the models in the analysis were trained on TPU v3 and TPU v4 using JAX and Haiku. The extracted text gives no exact chip count or pod layout for Chinchilla, so the device description here stops at the disclosed accelerator family. Under the project rule, this counts as 2021-2022 DeepMind datacenter accelerator training, not commodity GPU or edge inference.

The central setup is a fixed-compute comparison. Chinchilla has 70B parameters and is trained on 1.4T tokens. Gopher has 280B parameters and was trained on about 300B tokens. The paper says both were trained for the same number of FLOPs, with the Gopher budget shown as 5.76e23 FLOPs in the scaling figures. Chinchilla therefore changes the allocation of the same training budget rather than claiming a larger raw budget.
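The equal-budget claim can be sanity-checked with the common rule of thumb that training costs about 6 FLOPs per parameter per token. The sketch below applies only that approximation, not the paper's detailed FLOP accounting, and the function and constant names are illustrative. Under that assumption both configurations land within roughly 15% of the 5.76e23 figure.

```python
# Rule of thumb: about 6 FLOPs per parameter per training token
# (forward plus backward pass). This is an approximation, not the
# paper's exact FLOP accounting.
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

REPORTED_BUDGET = 5.76e23  # Gopher budget as shown in the scaling figures

gopher = approx_train_flops(280e9, 300e9)      # 280B params, ~300B tokens
chinchilla = approx_train_flops(70e9, 1.4e12)  # 70B params, 1.4T tokens

for name, flops in [("Gopher", gopher), ("Chinchilla", chinchilla)]:
    ratio = flops / REPORTED_BUDGET
    print(f"{name}: {flops:.2e} FLOPs ({ratio:.2f}x the reported budget)")
```

The two estimates differ slightly because the token counts are rounded and the 6-FLOP rule ignores terms such as attention and embeddings.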

The training hyperparameter table also exposes a compute-structure difference. Gopher's batch size is listed as 3M tokens doubled to 6M, while Chinchilla uses 1.5M tokens doubled to 3M. Chinchilla's forward and backward passes are computed in bfloat16, while a float32 copy of weights is stored in the optimizer state. That is a memory/precision compromise: accelerator-efficient matmul precision for the step, higher precision state for optimization.
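One way to see why a float32 copy of the weights is worth its memory is to emulate bfloat16 rounding in plain Python. The sketch below is an illustration only, not the paper's training code: the update size, step count, and helper names are hypothetical, and the rounding helper skips NaN and infinity handling. It shows small updates vanishing when the weight itself is stored in bfloat16, while a float32 master copy accumulates them.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bfloat16(x: float) -> float:
    """Round to bfloat16: keep the top 16 bits of the float32 encoding,
    using round-to-nearest-even. NaN/inf are not handled (illustration only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

UPDATE = 1e-3  # hypothetical per-step weight update
STEPS = 100    # hypothetical number of optimizer steps

w_bf16_only = 1.0  # weight stored directly in bfloat16
w_master = 1.0     # float32 master copy kept by the optimizer
for _ in range(STEPS):
    w_bf16_only = to_bfloat16(w_bf16_only + UPDATE)
    w_master = to_float32(w_master + UPDATE)

# Low-precision view of the master weight, as would be used for matmuls.
w_compute = to_bfloat16(w_master)

print("bfloat16-only weight:", w_bf16_only)
print("float32 master weight:", w_master)
print("bfloat16 view of master:", w_compute)
```

Near 1.0 the spacing between adjacent bfloat16 values is 2^-7, so each 1e-3 update rounds away entirely in the bfloat16-only case, whereas the master copy ends close to 1.1.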

Bottleneck

The bottleneck is not just "more compute"; it is allocation under a known accelerator budget. The paper frames the practical question as: if the number of accelerators and training duration are known in advance, what model size and number of training tokens minimize loss? Earlier large language models mostly spent extra compute on parameter count while keeping training tokens around 300B. That produced models that were expensive to serve and, by Chinchilla's fitted frontier, undertrained for their size.

This is a memory and inference bottleneck as well as a training bottleneck. A 280B model carries a much larger parameter memory footprint and higher per-token inference cost than a 70B model. If both consume similar pretraining FLOPs, the smaller compute-optimal model wins twice: it trains on more data and is cheaper to fine-tune and deploy afterward. Chinchilla makes downstream inference cost part of the scaling law argument.
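The deployment gap can be made concrete with two back-of-envelope estimates. The sketch below assumes weights are served at 2 bytes per parameter and that a forward pass costs about 2 FLOPs per parameter per token; both are common approximations introduced here for illustration, not figures from the paper.

```python
BYTES_PER_PARAM = 2  # assumption: weights held in a 16-bit format such as bfloat16

def weight_memory_gb(n_params: float) -> float:
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM / 1e9

def forward_flops_per_token(n_params: float) -> float:
    """Rule of thumb: about 2 FLOPs per parameter per token at inference."""
    return 2.0 * n_params

models = {"Gopher": 280e9, "Chinchilla": 70e9}
for name, n_params in models.items():
    print(
        f"{name}: {weight_memory_gb(n_params):.0f} GB of weights, "
        f"{forward_flops_per_token(n_params):.2e} FLOPs per token"
    )

memory_ratio = weight_memory_gb(models["Gopher"]) / weight_memory_gb(models["Chinchilla"])
```

Under these assumptions both weight memory and per-token inference cost scale linearly with parameter count, so the 280B model is about four times as expensive on both counts, before counting activations or key-value caches.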

Method Adaptation

Chinchilla adapts the dense Transformer recipe by changing the size/token ratio, not by introducing a new architecture. The authors train over 400 models, with sizes from 70M to over 16B parameters and token counts from 5B to 500B, then fit three approaches: fixed model sizes with varying tokens, IsoFLOP profiles, and a parametric loss model. All three approaches predict that, as compute increases, parameters and training tokens should scale in near equal proportions.
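The "near equal proportions" result can be sketched as a simple scaling rule. The code below assumes both exponents are exactly 0.5, which is a simplification of the paper's fitted values, and anchors the rule at the Chinchilla configuration; the function and constant names are illustrative.

```python
REF_PARAMS = 70e9    # Chinchilla parameter count, used as the anchor point
REF_TOKENS = 1.4e12  # Chinchilla training tokens, used as the anchor point
EXPONENT = 0.5       # assumption: parameters and tokens scale exactly equally

def scale_allocation(compute_multiplier: float):
    """Return (params, tokens) for a budget that is `compute_multiplier`
    times the anchor budget, under the equal-scaling assumption."""
    factor = compute_multiplier ** EXPONENT
    return REF_PARAMS * factor, REF_TOKENS * factor

for multiplier in (1.0, 4.0, 16.0):
    n_params, n_tokens = scale_allocation(multiplier)
    print(
        f"{multiplier:>4.0f}x compute: {n_params / 1e9:.0f}B params, "
        f"{n_tokens / 1e12:.1f}T tokens"
    )
```

With equal exponents, quadrupling the compute budget doubles both the parameter count and the token count, and the tokens-per-parameter ratio stays fixed at the anchor's value.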

The method choice is compute-device aware because the output is a prescription for a fixed cluster budget. Instead of building a 280B-parameter model that stresses memory and inference, the paper selects a 70B model near the predicted optimum for the Gopher compute budget and spends the FLOPs saved on parameters on roughly four times as many tokens. The Chinchilla section keeps Gopher's broad architecture and training setup, while adjusting the dataset distribution to cover 1.4T tokens, switching to AdamW, using a slightly modified SentencePiece vocabulary, and keeping a float32 copy of the weights in the optimizer state.

This is also a data-pipeline adaptation. Training for 1.4T tokens requires a dataset distribution that can sustain additional passes over the available sources rather than reusing the 300B-token mixture unchanged. The paper notes that its main scaling analysis assumes an infinite-data regime, and it acknowledges that multiple-epoch behavior remains unresolved.

Evidence

The headline evidence is the compute-matched model comparison: Chinchilla, at 70B parameters and 1.4T tokens, uniformly and significantly outperforms Gopher, at 280B parameters and about 300B tokens, while using the same number of training FLOPs. On MMLU, Chinchilla reaches 67.6% 5-shot accuracy, 7.6 percentage points above Gopher. On BIG-bench, it averages 65.1% versus 54.4% for Gopher. On LAMBADA, it reaches 77.4% zero-shot accuracy, compared with 74.5% for Gopher and 76.6% for MT-NLG 530B in the table.

Language-modeling metrics support the same story. On Wikitext103, Chinchilla reports perplexity 7.16 compared with Gopher's 7.75. On The Pile subsets, the paper reports Chinchilla improving bits-per-byte over Gopher across all subsets. The scaling tables give the broader compute implication: a 175B-parameter model should, by the paper's estimate, be trained on over 4.2T tokens, and a 280B Gopher-like model would require about 6.8T tokens to sit on the compute-optimal frontier.
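The token counts quoted above are easier to compare as tokens per parameter. The sketch below does only that arithmetic on the figures already cited; the dictionary labels and helper name are illustrative, and the 175B entry uses the "over 4.2T" figure as a lower bound.

```python
# (parameters, training tokens) for configurations quoted in this section
configs = {
    "Gopher (as trained)": (280e9, 300e9),
    "Chinchilla (as trained)": (70e9, 1.4e12),
    "175B (paper estimate)": (175e9, 4.2e12),
    "280B (paper estimate)": (280e9, 6.8e12),
}

def tokens_per_param(n_params: float, n_tokens: float) -> float:
    return n_tokens / n_params

ratios = {name: tokens_per_param(*cfg) for name, cfg in configs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")

# How many times more data a 280B model would need than Gopher actually saw.
gopher_shortfall = 6.8e12 / 300e9
print(f"Gopher data shortfall: about {gopher_shortfall:.0f}x")
```

On these figures Gopher saw roughly one token per parameter, while Chinchilla and the paper's estimates for larger models sit at 20 or more.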

Historical Effect

Chinchilla revised the early GPT-3/Kaplan scaling doctrine. In compute-structure terms, it shifted dense LLM scaling from "make the model bigger" toward balanced data/model allocation under a fixed training budget. The important historical point is that it did not need a new accelerator or architecture to change the frontier; it changed the way the accelerator budget was spent.

It also reweighted deployment economics. A compute-optimal 70B model is cheaper to fine-tune and serve than a 280B model trained for the same pretraining FLOPs. That made training-token scale, data quality, and inference memory first-class parts of LLM planning, rather than secondary details after parameter count.

Limits

The paper is candid about the main limit: large-scale validation is expensive, so there are only two comparable frontier runs, Chinchilla and Gopher. The fitted frontier assumes power-law behavior in compute, model size, and token count, and the authors observe some curvature at high compute budgets, which may mean the optimal models are even smaller than their main extrapolation predicts.

The data assumption is another limit. The main analysis is framed in an infinite-data regime, but real systems face finite high-quality corpora, repeated sources, filtering decisions, and changing data quality. Training for trillions of tokens also makes dataset governance and introspection harder. Finally, the extracted text names TPU v3/v4 and the software stack, but not the Chinchilla chip count, pod topology, wall-clock time, or utilization.