Scaling Laws for Neural Language Models

Metadata

Slug: scaling_laws_2020
Year: 2020
Venue: arXiv
Authors: Jared Kaplan et al.
Reading status: read complete
Compute regime: Hyperscale dense LLM training
Primary sources: PDF, extracted text

Compute Setup

The extracted paper text does not list a specific training device. By the project rule, the setup is inferred from the research-time environment, roughly 2019: multi-accelerator Transformer training on contemporary datacenter GPUs, most plausibly V100-class clusters. This is inferred, not explicit.

The paper frames compute in petaflop-days (PF-days) and non-embedding FLOPs. It defines one PF-day as 8.64 * 10^19 floating point operations and estimates Transformer training compute as C ~= 6NBS, where N is the number of non-embedding parameters, B is the batch size in tokens, and S is the number of update steps. Embedding parameters and the context-dependent attention term are excluded to give cleaner estimates.
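As a worked illustration of the estimate above, the following Python sketch converts N, B, and S into FLOPs and PF-days. Only the 6NBS approximation and the PF-day definition come from the text; the function names and the example run are hypothetical.

```python
# Sketch of the C ~= 6NBS estimate; names and example values are illustrative.

PF_DAY_FLOPS = 1e15 * 24 * 3600  # one petaflop/s sustained for one day = 8.64e19 FLOPs


def training_flops(n_params: float, batch_tokens: float, steps: float) -> float:
    """Approximate non-embedding training compute, C ~= 6 * N * B * S."""
    return 6.0 * n_params * batch_tokens * steps


def to_pf_days(flops: float) -> float:
    """Convert raw FLOPs to petaflop-days."""
    return flops / PF_DAY_FLOPS


if __name__ == "__main__":
    # Hypothetical run: 1e8 non-embedding parameters, 2**19 tokens per batch, 1e5 steps.
    c = training_flops(1e8, 2**19, 1e5)
    print(f"{c:.3e} FLOPs = {to_pf_days(c):.3f} PF-days")
```

Because the estimate is a plain product, doubling any one of N, B, or S doubles the estimated compute.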

The experimental substrate is decoder-only Transformer language modeling on WebText2 with byte-pair tokenization and mostly 1024-token contexts. Unless otherwise noted, models are trained with Adam for 2.5 * 10^5 steps and batch size 512 sequences of 1024 tokens. The largest models, above 1B parameters, use Adafactor because of memory constraints, which is the paper's clearest direct sign of device memory pressure.
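The default setup above fixes the batch size in tokens and the total number of tokens processed. A small sketch of that arithmetic follows; the constant names are illustrative, and the values are the ones stated in the paragraph.

```python
# Token arithmetic for the default training setup described above.

SEQUENCES_PER_BATCH = 512
TOKENS_PER_SEQUENCE = 1024
STEPS = 250_000  # 2.5 * 10^5 parameter updates

tokens_per_batch = SEQUENCES_PER_BATCH * TOKENS_PER_SEQUENCE  # B, in tokens
tokens_processed = tokens_per_batch * STEPS                   # B * S

print(tokens_per_batch)   # 524288, about 5 * 10^5 tokens per step
print(tokens_processed)   # 131072000000, about 1.3 * 10^11 tokens
```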

Bottleneck

The bottleneck is compute allocation. Given a fixed training budget, should researchers spend it on larger models, more data, or more optimization steps?

The paper argues that loss follows smooth power laws in model size, dataset size, and compute. This turns model scaling into a resource-allocation problem instead of a collection of isolated architecture guesses.
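To make the power-law claim concrete, the sketch below uses the single-variable form L(X) = (X_c / X)^alpha that the paper fits. The exponent 0.076 for model size is the paper's reported value as recalled here and should be treated as approximate; the function names are hypothetical.

```python
def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Loss under a single-variable power law, L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha


def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Factor by which the loss changes when x is multiplied by scale_factor."""
    return scale_factor ** (-alpha)


if __name__ == "__main__":
    ALPHA_N = 0.076  # assumed: approximate model-size exponent reported in the paper
    for factor in (2, 10, 100):
        print(f"{factor:>4}x parameters -> loss multiplied by {loss_ratio(factor, ALPHA_N):.3f}")
```

With an exponent this small, each doubling of model size lowers the loss by only about 5%, which is why the trends have to be followed over many orders of magnitude to matter.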

The important bottleneck is not just total FLOPs but serial training time. Figure 3 argues that, as compute grows, most of the increase should go into model size and batch size, with only a small increase in serial optimization steps. If parallel devices are available, it is better to spend them on a larger model and larger batches than on many more sequential updates for a small model.
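The allocation argument can be sketched numerically. The exponents below (model size 0.73, batch size 0.24, serial steps 0.03) are the paper's fitted values as recalled here, so treat them as approximate; the dictionary and function names are hypothetical. The exponents sum to 1, consistent with compute being proportional to N * B * S.

```python
# Assumed exponents: N ~ C^0.73, B ~ C^0.24, S ~ C^0.03 (approximate fitted values).
EXPONENTS = {"model_size": 0.73, "batch_size": 0.24, "serial_steps": 0.03}


def growth_factors(compute_multiplier: float) -> dict:
    """How each quantity grows when the compute budget grows by compute_multiplier."""
    return {name: compute_multiplier ** p for name, p in EXPONENTS.items()}


if __name__ == "__main__":
    for mult in (10.0, 1e3, 1e9):
        factors = growth_factors(mult)
        summary = ", ".join(f"{k} x{v:.3g}" for k, v in factors.items())
        print(f"compute x{mult:.0e}: {summary}")
```

Under these assumed exponents, a 10x compute increase goes mostly to model size (a bit over 5x), with under 2x to batch size and only a few percent more serial steps.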

The paper also identifies an eventual data bottleneck: compute-efficient training uses surprisingly little data, but the fitted laws imply that overfitting will eventually appear unless dataset size grows.
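The data bottleneck can be illustrated with the paper's combined form L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D. The default constants below are the paper's reported fit as recalled here and should be checked against the paper before being relied on; the function name and the fixed dataset size are hypothetical.

```python
def loss_nd(n_params, n_tokens, n_c=6.4e13, d_c=1.8e13, alpha_n=0.076, alpha_d=0.103):
    """Combined fit L(N, D); the default constants are assumed, approximate values."""
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d


if __name__ == "__main__":
    DATASET_TOKENS = 2e10  # hypothetical fixed dataset size
    for n in (1e7, 1e8, 1e9, 1e10):
        finite = loss_nd(n, DATASET_TOKENS)
        unlimited = loss_nd(n, float("inf"))  # D_c / D vanishes, leaving the pure L(N) law
        penalty = finite / unlimited - 1.0
        print(f"N={n:.0e}: L={finite:.3f}, overfitting penalty {penalty:.1%}")
```

With the dataset held fixed, the penalty relative to the unlimited-data loss grows as the model grows, which is the sense in which data eventually becomes the bottleneck.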

Method Adaptation

The method adapts Transformer training to large accelerator budgets by measuring scaling regularities:

It estimates loss as a function of parameter count, dataset size, and compute.
It models training compute as approximately proportional to parameters times batch size times training steps.
It uses gradient-noise-scale measurements to estimate efficient batch sizes.
It argues that compute-optimal training should use very large models, relatively modest data, and early stopping.
It notes optimizer and memory pressure: the largest models use Adafactor because of memory constraints.

This is a compute-budgeting rule for the dense Transformer regime rather than a new architecture.

The architecture choices are intentionally held secondary. The paper reports that performance depends strongly on scale and only weakly, within reasonable limits, on depth, width, attention heads, and feed-forward dimension. That makes N, D, and C the main knobs for planning accelerator runs.

Batching is treated as a parallelism control. The critical batch size marks the boundary up to which larger batches still buy training speed with little loss of compute efficiency. Near convergence for the largest trainable models, the paper estimates an ideal batch size of roughly 1-2 million tokens, which turns the gradient-noise measurement into a practical limit on data parallelism.
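The speed versus efficiency trade-off around the critical batch size can be sketched with the relation from the gradient-noise-scale work that the paper adopts: (S / S_min - 1) * (E / E_min - 1) = 1, where E = B * S is the number of tokens processed. The sketch assumes a single fixed critical batch size, which is a simplification because the paper's critical batch size changes as the loss falls. The function names and the 1.5 million token value (a midpoint of the range quoted above) are hypothetical.

```python
def step_overhead(batch_tokens: float, critical_batch_tokens: float) -> float:
    """S / S_min: serial steps relative to the large-batch minimum."""
    return 1.0 + critical_batch_tokens / batch_tokens


def data_overhead(batch_tokens: float, critical_batch_tokens: float) -> float:
    """E / E_min: tokens processed (and compute) relative to the small-batch minimum."""
    return 1.0 + batch_tokens / critical_batch_tokens


if __name__ == "__main__":
    B_CRIT = 1.5e6  # hypothetical critical batch size in tokens
    for batch in (2**17, 2**19, B_CRIT, 2**23):
        s = step_overhead(batch, B_CRIT)
        e = data_overhead(batch, B_CRIT)
        print(f"B={batch:>10.0f}: steps x{s:.2f}, compute x{e:.2f}")
```

Training exactly at the critical batch size costs twice the minimum steps and twice the minimum compute; far above it, extra devices add compute cost while saving almost no serial time.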

The compute-optimal recipe is early stopping: training a smaller model to convergence wastes compute relative to training a larger model for fewer steps.

Evidence

The paper reports smooth scaling over several orders of magnitude in model size, data, and compute.
The tested model sizes range from 768 to 1.5B non-embedding parameters; dataset sizes range from 22M to 23B tokens; context length is 1024 for most runs.
WebText2 contains 20.3M documents, 96 GB of text, 1.62 * 10^10 words, and 2.29 * 10^10 tokens, with 6.6 * 10^8 tokens reserved for test.
The paper reports power-law trends in model size, data size, and compute when the other two factors are not bottlenecks, with some trends spanning more than six orders of magnitude.
It finds that every 8x increase in model size needs only roughly 5x more data to avoid the fitted overfitting penalty.
The fitted compute-efficient frontier says optimal model size grows rapidly with compute: about 5x more parameters for each 10x more compute.
The batch-adjusted number of optimization steps grows very slowly, with S_min fit close to C_min^0.03.
The paper explicitly says the largest models over 1B parameters used Adafactor because of memory constraints.
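The "8x model, roughly 5x data" figure in the list above follows from a sublinear data requirement. The sketch below assumes the paper's relation D proportional to N^0.74, with the exponent recalled from the paper and therefore approximate; the names are hypothetical.

```python
DATA_EXPONENT = 0.74  # assumed: approximate exponent in the paper's D ~ N^0.74 relation


def data_multiplier(model_multiplier: float, exponent: float = DATA_EXPONENT) -> float:
    """Data growth needed to hold the overfitting penalty fixed as the model grows."""
    return model_multiplier ** exponent


if __name__ == "__main__":
    for m in (2, 8, 64, 1000):
        print(f"{m:>5}x parameters -> about {data_multiplier(m):.1f}x data")
```

Under this assumption an 8x larger model needs a little under 5x the data, so the required number of tokens per parameter falls as models grow.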

Historical Effect

Scaling Laws gave a quantitative argument for training very large models and stopping well short of convergence, and it helped justify GPT-3-style dense scaling. In the compute-structure spine, it is the planning paper for the hyperscale dense LLM branch: it tells researchers how to spend accelerator budgets once Transformers have become the standard compute-fitting architecture.

The later Chinchilla result revised the data/model allocation, so this card should be read historically: it records the 2020 compute strategy that shaped early LLM scaling, not the final optimal law.

It turned "make the model bigger" into a compute allocation formula. GPT-3's choice to train much larger models on comparatively few tokens follows this analysis, which the GPT-3 paper cites. Even when later work changes the exponents, the same quantities (tokens, parameters, FLOPs, and batch/noise scale) remain central to LLM engineering.

Limits

The paper does not name the exact devices in the available text, so the hardware description is inferred.
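The compute-optimal prescription was later revised by the Chinchilla work (Hoffmann et al., 2022), which found that parameters and training tokens should grow in roughly equal proportion, implying far more data per parameter than recommended here.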
The experiments are language-model pretraining studies, not a complete account of post-training, retrieval, tool use, or inference-time scaling.
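The compute formula drops the context-dependent attention term, which is a small fraction of total compute only when the model width is large relative to the context length (the paper's condition is roughly d_model >> n_ctx / 12); the approximation is less safe for very long contexts.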
The analysis optimizes training loss, not lifecycle cost including serving latency, memory footprint, or deployment volume.
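The attention-term limit noted above can be quantified with a short sketch. It assumes the paper's per-token forward-pass estimate of about 2N + 2 * n_layer * n_ctx * d_model FLOPs, with N approximately 12 * n_layer * d_model^2 for a standard Transformer shape, so the ratio of the context term to the parameter term reduces to n_ctx / (12 * d_model). The function names and the example shape (48 layers, width 1600) are hypothetical.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """N ~= 12 * n_layer * d_model^2 for a standard Transformer block layout."""
    return 12 * n_layer * d_model ** 2


def context_to_param_compute_ratio(n_layer: int, d_model: int, n_ctx: int) -> float:
    """Ratio of the context-dependent attention FLOPs to the 2N parameter FLOPs."""
    param_term = 2 * non_embedding_params(n_layer, d_model)
    context_term = 2 * n_layer * n_ctx * d_model
    return context_term / param_term


if __name__ == "__main__":
    N_LAYER, D_MODEL = 48, 1600  # hypothetical model shape
    for n_ctx in (1024, 8192, 32768):
        ratio = context_to_param_compute_ratio(N_LAYER, D_MODEL, n_ctx)
        print(f"n_ctx={n_ctx:>6}: context term is {ratio:.1%} of the parameter term")
```

For this shape the dropped term is about 5% of the counted compute at a 1024-token context, but it exceeds the counted compute once the context reaches tens of thousands of tokens.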