Scaling Language Models: Methods, Analysis and Insights from Training Gopher
Metadata
- Reading status: read complete
- Year: 2021
- Compute regime: Hyperscale dense LLM training (hyperscale_dense_llm_training)
- PDF: 2021-gopher_2021.pdf
- Extracted text: 2021-gopher_2021.txt
- PDF URL: https://arxiv.org/pdf/2112.11446.pdf
- OpenAlex:
- Citation count source/date:
- Citation count:
- Reading card created: 2026-06-15
Compute Setup
The paper reports training on TPUv3 chips with a JAX/Haiku codebase, using JAX pmap to express both data and model parallelism. Gopher is a 280B-parameter autoregressive Transformer with 80 layers, 128 heads, model dimension 16,384, key/value size 128, and a 2048-token context window. It is trained on 300B tokens from MassiveText, with the batch size increasing from 3M to 6M tokens during training. The 7.1B and 280B models use bfloat16 for both activations and parameters, and the bfloat16 parameters are updated with stochastic rounding.
The appendix gives the most concrete hardware scale. Table A27 lists the 280B training run on 4096 TPUv3 chips. The compute appendix reports 6.31E+08 train PFLOPs and says Gopher trained for 920 hours in November and December 2020 in Google's Georgia datacenter. The FLOP accounting includes practical costs such as rematerialization, which increases compute by 33%, plus padding and repeated computation; it excludes development and preemption.
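As a rough cross-check of the appendix figures, the sketch below converts the reported totals into an average achieved throughput per chip and compares the reported FLOPs with the common 6 * N * D rule of thumb. The 6 * N * D estimate is an assumption brought in for comparison, not the paper's accounting, and the calculation assumes all 4096 chips were busy for the full 920 hours. Variable names are illustrative.

```python
# Figures quoted from the compute appendix.
train_pflops = 6.31e8   # total training compute, in PFLOPs
train_hours = 920
chips = 4096
params = 280e9
tokens = 300e9

total_flops = train_pflops * 1e15
chip_seconds = train_hours * 3600 * chips

# Average achieved throughput per TPUv3 chip, in TFLOP/s.
tflops_per_chip = total_flops / chip_seconds / 1e12

# Assumption: the common 6 * N * D estimate for dense Transformer training.
rule_of_thumb_flops = 6 * params * tokens
overhead_ratio = total_flops / rule_of_thumb_flops

print(f"{tflops_per_chip:.1f} TFLOP/s per chip")
print(f"reported / 6ND = {overhead_ratio:.2f}")
```

Under these assumptions the run averages roughly 46.5 TFLOP/s per chip, and the reported total is about 1.25 times the 6 * N * D estimate, which fits with the appendix counting rematerialization, padding, and repeated computation.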
Bottleneck
The bottleneck is fitting and moving a dense 280B model across TPUv3 memory and interconnect. The paper states that the half-precision parameters and single-precision Adam state occupy 2.5 TiB, far beyond the 16 GiB available on each TPUv3 core. That makes memory partitioning mandatory before useful training can even begin. Activations must also be controlled, which is why rematerialization appears as both a memory solution and a compute tax.
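The 2.5 TiB figure can be reproduced with a short calculation. The sketch below assumes a breakdown of 2 bytes per bfloat16 parameter plus two 4-byte float32 Adam moments; that breakdown is an inference from the reported total rather than something the card quotes directly, and it ignores activations and gradients entirely.

```python
import math

params = 280e9
GIB = 2**30
TIB = 2**40

# Assumed breakdown: bfloat16 parameter (2 bytes) plus
# float32 Adam first and second moments (4 + 4 bytes).
bytes_per_param = 2 + 4 + 4
state_bytes = params * bytes_per_param
state_tib = state_bytes / TIB

# Lower bound on the number of 16 GiB TPUv3 cores needed to hold this state.
core_memory_bytes = 16 * GIB
min_cores = math.ceil(state_bytes / core_memory_bytes)

print(f"state: {state_tib:.2f} TiB, at least {min_cores} cores")
```

This gives about 2.55 TiB and a floor of 163 cores for parameters and optimizer state alone, before any activation memory is counted.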
This is not a parameter-only bound. A generic mixed-precision Adam estimate would put 280B parameters near 4.48 TB of training state at 16 bytes per parameter, while the paper's own accounting reports 2.5 TiB for half-precision parameters plus single-precision Adam state. Either way, the state alone dwarfs one TPUv3 core. For inference, BF16 weights would be about 560 GB before KV cache; at 80 layers, 128 heads, head dimension 128, and 2048-token context, batch-1 KV cache is roughly another 10.7 GB, so evaluation memory is also a sharded serving problem.
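The inference-side numbers follow from the architecture quoted earlier. The sketch below is a back-of-envelope calculation, assuming bfloat16 storage for both weights and KV cache and a standard cache of one key and one value vector per head, per layer, per token. The 16-byte figure uses one common mixed-precision Adam accounting (half-precision weights and gradients, float32 master weights, and two float32 moments), which is an assumption rather than the paper's setup.

```python
layers, heads, head_dim, context = 80, 128, 128, 2048
params = 280e9
bytes_bf16 = 2

# Weights only, in bfloat16.
weight_bytes = params * bytes_bf16

# Batch-1 KV cache: keys and values (factor 2) for every layer, head, and token.
kv_bytes = 2 * layers * heads * head_dim * context * bytes_bf16

# Generic mixed-precision Adam training state at 16 bytes per parameter (assumed accounting).
generic_state_bytes = params * 16

print(f"weights:  {weight_bytes / 1e9:.0f} GB")
print(f"KV cache: {kv_bytes / 1e9:.1f} GB")
print(f"generic training state: {generic_state_bytes / 1e12:.2f} TB")
```

The KV cache works out to exactly 10 GiB, which is the roughly 10.7 GB quoted above in decimal units.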
At 4096 chips, communication and scheduling become visible in the time breakdown. For the 280B model at the more efficient 6M token batch size, Table A27 attributes 51% of accelerator time to linears, 8% to attention, 3% to optimizer, 7% to model parallelism, 17% to rematerialization, 9% to pipelining, and 5% to other work. The bottleneck is therefore memory capacity, activation recomputation, cross-shard communication, pipeline bubbles, and batch size, not just peak FLOPs.
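The breakdown can be summarized in a few lines. In the sketch below, grouping linears and attention as "core compute" is this card's own convenience, not a category from Table A27, and the ratios are derived from the rounded percentages quoted above.

```python
# Percent of accelerator time for the 280B model at 6M-token batch size.
breakdown = {
    "linears": 51,
    "attention": 8,
    "optimizer": 3,
    "model_parallelism": 7,
    "rematerialization": 17,
    "pipelining": 9,
    "other": 5,
}

total = sum(breakdown.values())

# Assumed grouping: forward/backward math that would exist on an ideal single device.
core_compute = breakdown["linears"] + breakdown["attention"]
overhead = total - core_compute

# Rematerialization relative to core compute, for comparison with the ~33% figure.
remat_vs_core = breakdown["rematerialization"] / core_compute

print(f"total={total}%, core={core_compute}%, overhead={overhead}%")
print(f"remat / core = {remat_vs_core:.2f}")
```

The categories sum to 100%, with 59% in core compute and 41% in everything else. Rematerialization is about 0.29 of core compute, in the same range as the 33% compute increase reported in the FLOP accounting.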
Method Adaptation
The method stack is a direct response to TPUv3 pod constraints. Optimizer-state partitioning, model parallelism, and rematerialization partition state and reduce activations so the model fits in memory. The authors find data and model parallelism on TPUv3s have only about 10% overhead because of fast cross-chip communication, so pipelining is unnecessary until scale exceeds a 1024-chip pod. For Gopher, they use model and data parallelism within TPU pods and pipelining across pods.
The numerical format is also a device adaptation. bfloat16 reduces memory use and increases throughput, and stochastic rounding is used when updating bfloat16 parameters. On the architecture side, RMSNorm replaces LayerNorm, and relative positional encodings allow evaluation on sequences longer than those seen in training.
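To see why stochastic rounding matters for bfloat16 parameters, consider an update smaller than the spacing between adjacent bfloat16 values. The sketch below is one common way to implement stochastic rounding from float32 to bfloat16 using bit manipulation. It is an illustration, not the paper's implementation, and it does not handle infinities, NaNs, or overflow. All function names are hypothetical.

```python
import random
import struct

def float32_bits(x):
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float32(bits):
    """Inverse of float32_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def truncate_to_bf16(x):
    """Deterministic baseline: drop the low 16 bits (round toward zero)."""
    return bits_to_float32(float32_bits(x) & 0xFFFF0000)

def stochastic_round_to_bf16(x, rng):
    """Add uniform noise to the 16 discarded bits, then truncate.

    The chance of rounding away from zero equals the discarded fraction,
    so the result is unbiased in expectation.
    """
    noisy = float32_bits(x) + rng.getrandbits(16)
    return bits_to_float32(noisy & 0xFFFF0000)

rng = random.Random(0)
x = 1.0 + 2**-9  # lies between the bfloat16 neighbours 1.0 and 1.0078125
samples = [stochastic_round_to_bf16(x, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

print("truncated:", truncate_to_bf16(x))
print("stochastic mean:", mean)
```

Truncation always returns 1.0, so the small increment is lost every time. Stochastic rounding returns one of the two neighbouring bfloat16 values, and the average over many draws stays close to the original value, which is the property that lets small parameter updates accumulate.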
Data is treated as a utilization lever. MassiveText contains 2.35B documents and about 10.5TB of text from web pages, books, news, code, C4, and Wikipedia, but Gopher trains on 300B tokens, about 12.8% of the available tokens. The pipeline filters quality, removes repetition, deduplicates similar documents, and removes significant test overlap. Sampling proportions are tuned, with MassiveWeb at 48%, books at 27%, C4 and news at 10% each, GitHub at 3%, and Wikipedia at 2%.
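The sampling proportions translate directly into a per-subset token budget. The sketch below assumes the percentages apply to the 300B training tokens and backs out the implied dataset size from the 12.8% figure. Both are simple arithmetic on the numbers above, not additional data from the paper.

```python
train_tokens = 300e9

# Sampling proportions, in percent of training tokens.
mix_percent = {
    "MassiveWeb": 48,
    "Books": 27,
    "C4": 10,
    "News": 10,
    "GitHub": 3,
    "Wikipedia": 2,
}
mix_total = sum(mix_percent.values())

# Tokens drawn from each subset over the full run.
budget = {name: train_tokens * pct / 100 for name, pct in mix_percent.items()}

# Dataset size implied by training on 12.8% of available tokens.
implied_dataset_tokens = train_tokens / 0.128

for name, n in budget.items():
    print(f"{name:<10} {n / 1e9:6.0f}B tokens")
print(f"implied dataset size: {implied_dataset_tokens / 1e12:.2f}T tokens")
```

Under these assumptions MassiveWeb contributes 144B tokens and books 81B, and the full dataset comes to roughly 2.34T tokens.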
Evidence
The training evidence is unusually detailed. The main text reports 300B training tokens, 2048 context, Adam, batch growth from 3M to 6M tokens, bfloat16 for the 280B model, 2.5 TiB of parameters plus Adam state, and about 10% data/model parallel overhead on TPUv3. The appendix adds 4096 TPUv3 chips, 920 training hours, 6.31E+08 train PFLOPs, and a time breakdown in which rematerialization alone consumes 17% of accelerator time.
The benchmark evidence shows the payoff from this dense scale. Figure 1 reports that Gopher improves over language-model state of the art on 100 of 124 comparable tasks. On RACE reading comprehension, Gopher reaches 71.6 on RACE-high compared with 46.8 for GPT-3 and 47.9 for Megatron-Turing NLG, and 75.1 on RACE-middle compared with 58.1 for GPT-3. The paper also notes that Gopher improves on most language-modeling tasks involving books and articles, which it links to MassiveText's heavier book sampling.
The compute appendix also flags inference inefficiency. It says repeated processing of common prefixes in evaluations raises inference costs, and removing that repetition would reduce FLOPs by 4-100x depending on the evaluation. Evaluation and prompting at 280B scale are themselves nontrivial compute workloads.
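The size of the saving depends on how long the shared prefix is relative to the continuations being scored. The sketch below uses token counts as a crude proxy for FLOPs and made-up prompt shapes. It is not the paper's evaluation harness, and the function names are hypothetical.

```python
def tokens_processed(prefix_len, continuation_lens, share_prefix):
    """Count tokens run through the model for one evaluation item.

    With share_prefix=True the common prefix is processed once and reused;
    otherwise it is reprocessed for every continuation.
    """
    if share_prefix:
        return prefix_len + sum(continuation_lens)
    return sum(prefix_len + c for c in continuation_lens)

def savings_ratio(prefix_len, continuation_lens):
    """Ratio of naive to prefix-sharing token counts."""
    naive = tokens_processed(prefix_len, continuation_lens, share_prefix=False)
    shared = tokens_processed(prefix_len, continuation_lens, share_prefix=True)
    return naive / shared

# Hypothetical shapes: a long few-shot prompt with short answer options.
few_options = savings_ratio(2000, [10] * 4)
many_options = savings_ratio(2000, [10] * 100)

print(f"4 options:   {few_options:.1f}x")
print(f"100 options: {many_options:.1f}x")
```

With these illustrative shapes the saving grows from about 4x to 67x as the number of continuations per prefix increases, which is the same qualitative spread as the 4-100x range the appendix reports.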
Historical Effect
Gopher is a dense-LLM systems paper as much as a model paper. It documents the transition from "train a large Transformer" to "coordinate thousands of accelerator chips while partitioning optimizer state, recomputing activations, and managing data quality." Its 280B run shows that TPU pod-scale training was practical but already memory- and communication-constrained.
Historically, it also set up the Chinchilla critique: a 280B model trained on 300B tokens produced strong results, but later work argued that this allocation was undertrained relative to compute-optimal scaling. Gopher is a baseline for the pre-Chinchilla dense-scaling regime, where maximum parameter count was prioritized and token count was comparatively low.
Limits
The paper is candid about residual limits. Some reasoning-heavy and mathematical tasks improve less or regress; the source mentions underperformance on Ubuntu IRC and DM Mathematics and points to tokenizer representation for numbers as a possible cause. Safety and bias issues remain, including toxicity and subgroup-bias analyses. The model is expensive to train and expensive to evaluate, and the appendix's repeated-prefix discussion shows that naive evaluation can waste large amounts of inference compute.
The previous short card stated that the exact chip count was not specified, but the local source appendix does specify 4096 TPUv3 chips for the 280B training breakdown. What remains unspecified is the full development compute, preemption/waste overhead, and all serving details.
Links
- Compute regime: history/compute_regimes/hyperscale_dense_llm_training/README.md
- Source PDF and extracted text are listed in metadata above.
- Queue status: read_complete.