Language Models are Few-Shot Learners

Metadata

Slug: gpt3_2020
Year: 2020
Venue: NeurIPS
Authors: Tom B. Brown et al.
Reading status: read complete
Compute regime: Hyperscale dense LLM training
Primary sources: PDF, extracted text

Compute Setup

The paper reports training on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. The extracted text gives no GPU count, but it says the larger models required model parallelism both within each matrix multiplication and across layers to avoid running out of memory.

The 175B-parameter model used 96 layers, width 12,288, 96 attention heads, head dimension 128, and context length 2,048. All models trained for 300B tokens. The 175B final batch size was 3.2M tokens, ramped from 32k tokens over the first 4-12B tokens depending on model size. Appendix D estimates the 175B run at 3.64 * 10^3 PF-days, or 3.14 * 10^23 FLOPs.
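The headline numbers above can be cross-checked with two rules of thumb. The sketch below assumes roughly 12 * d_model^2 weights per layer (embeddings, biases, and layer norms ignored) and roughly 6 FLOPs per parameter per training token; both are common approximations, not a reconstruction of the paper's exact accounting, and the function names are illustrative.

```python
# Architecture and training figures quoted above for the largest model.
N_LAYERS = 96
D_MODEL = 12_288
HEADLINE_PARAMS = 175e9
TRAIN_TOKENS = 300e9


def approx_params(n_layers: int, d_model: int) -> int:
    # Assumption: ~4*d^2 attention weights + ~8*d^2 MLP weights per layer.
    return 12 * n_layers * d_model ** 2


def approx_train_flops(params: float, tokens: float) -> float:
    # Assumption: ~6 FLOPs per parameter per token (forward plus backward).
    return 6.0 * params * tokens


def flops_to_pf_days(flops: float) -> float:
    # One PF-day is 1e15 FLOP/s sustained for 86,400 seconds.
    return flops / (1e15 * 86_400)


params = approx_params(N_LAYERS, D_MODEL)
# The FLOPs estimate uses the headline 175B count, not the rough count above.
flops = approx_train_flops(HEADLINE_PARAMS, TRAIN_TOKENS)

print(f"approx non-embedding params: {params / 1e9:.1f}B")
print(f"approx training FLOPs: {flops:.2e}")
print(f"approx PF-days: {flops_to_pf_days(flops):.0f}")
```

Under these assumptions the results land within about 1% of the Appendix D figures, which suggests the reported compute is dominated by the parameter-count term.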

Memory and communication structure are central. Each model is partitioned across GPUs along both depth and width to minimize data transfer between nodes. The paper says the precise architectural parameters were chosen for computational efficiency and load balancing across GPUs, not only for modeling reasons.

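The dataset is also a compute artifact. The training mix samples filtered Common Crawl (410B tokens) at 60%, WebText2 (19B) at 22%, Books1 (12B) at 8%, Books2 (55B) at 8%, and Wikipedia (3B) at 3%. These percentages are sampling weights rather than shares of raw size, and as rounded they sum to 101%.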
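Because the mix weights are not proportional to dataset size, each source is traversed a different number of times over 300B training tokens. The sketch below estimates those passes from the rounded figures quoted above; the results are approximate and may differ from the paper's own table, which presumably uses unrounded weights.

```python
TRAIN_TOKENS_B = 300  # billions of tokens seen during training

# (dataset size in billions of tokens, sampling weight), as quoted above.
MIX = {
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia": (3, 0.03),
}


def approx_passes(size_b: float, weight: float,
                  train_tokens_b: float = TRAIN_TOKENS_B) -> float:
    """Tokens drawn from a source divided by that source's size."""
    return weight * train_tokens_b / size_b


passes = {name: approx_passes(size, w) for name, (size, w) in MIX.items()}
for name, p in passes.items():
    print(f"{name:26s} ~{p:.2f} passes")
```

On these rounded inputs, Common Crawl and Books2 come out below one full pass, while WebText2, Books1, and Wikipedia are each seen more than once.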

Bottleneck

By 2020, the bottleneck was no longer whether Transformers could train on accelerators. The bottleneck was how to spend datacenter-scale compute on dense language-model pretraining while keeping memory, communication, and data quality under control.

Task-specific fine-tuning also became a workflow bottleneck. GPT-3 shifts part of adaptation from weight updates to inference-time context.

The training bottleneck is dense model scale under V100 memory limits. A 175B-parameter model cannot reside on a single GPU, and even the tensor operations inside a layer must be split. Partitioning must balance fitting weights and activations against the communication incurred in each forward/backward pass.
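A back-of-envelope calculation shows why a single GPU is not an option. The sketch below assumes 2 bytes per weight (fp16) and a 32 GB V100; neither figure comes from the extracted text, and the count covers weights only, so gradients, optimizer state, and activations would push the real requirement higher.

```python
import math

PARAMS = 175e9

# Assumptions, not stated in the notes: fp16 weights and the 32 GB V100 variant.
BYTES_PER_PARAM = 2
GPU_MEMORY_BYTES = 32e9


def weight_bytes(params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed to hold the weights alone."""
    return params * bytes_per_param


def min_gpus_for_weights(params: float,
                         gpu_bytes: float = GPU_MEMORY_BYTES,
                         bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Lower bound on GPU count: weights only, no gradients or activations."""
    return math.ceil(weight_bytes(params, bytes_per_param) / gpu_bytes)


print(f"weights: {weight_bytes(PARAMS) / 1e9:.0f} GB")
print(f"min 32 GB GPUs for weights: {min_gpus_for_weights(PARAMS)}")
print(f"min 16 GB GPUs for weights: {min_gpus_for_weights(PARAMS, gpu_bytes=16e9)}")
```

This is a floor, not an estimate of the actual cluster size, which the extracted text does not give.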

The inference bottleneck changes form. Few-shot learning runs forward passes over a 2,048-token context that can hold an instruction plus roughly 10 to 100 demonstrations. This avoids fine-tuning jobs, but spends extra inference tokens and attention compute on every evaluated example.
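The trade-off can be made concrete with a toy token budget. In the sketch below, the instruction, demonstration, and query lengths are hypothetical, and forward cost is approximated as 2 FLOPs per parameter per token with attention ignored; it only illustrates how prompt length scales per-example cost.

```python
PARAMS = 175e9
CONTEXT_LIMIT = 2_048


def prompt_tokens(instruction: int, demo: int, k: int, query: int) -> int:
    """Total prompt length for an instruction, k demonstrations, and one query."""
    return instruction + k * demo + query


def max_demonstrations(instruction: int, demo: int, query: int,
                       limit: int = CONTEXT_LIMIT) -> int:
    """Largest k whose prompt still fits in the context window."""
    return max(0, (limit - instruction - query) // demo)


def approx_forward_flops(tokens: int, params: float = PARAMS) -> float:
    # Assumption: ~2 FLOPs per parameter per token, attention ignored.
    return 2.0 * params * tokens


# Hypothetical lengths: 20-token instruction, 40-token demos, 30-token query.
k_max = max_demonstrations(20, 40, 30)
zero_shot = prompt_tokens(20, 40, 0, 30)
few_shot = prompt_tokens(20, 40, k_max, 30)
ratio = approx_forward_flops(few_shot) / approx_forward_flops(zero_shot)

print(f"max demonstrations: {k_max}")
print(f"few-shot vs zero-shot prompt cost: {ratio:.1f}x")
```

With these made-up lengths, filling the window costs tens of times more forward compute per evaluated example than the zero-shot prompt.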

Method Adaptation

GPT-3 adapts to the V100 cluster in several ways:

It uses a decoder-only Transformer, preserving the dense matrix-multiply structure introduced by earlier Transformer language models.
Model parallelism splits the model so the 175B parameters and activations fit across GPU memory.
Large token batches are chosen using gradient-noise-scale reasoning, improving utilization at scale.
The context window is fixed at 2,048 tokens, making in-context learning a bounded inference-time compute mechanism.
The dataset is filtered and mixed from Common Crawl, WebText2, Books, and Wikipedia so that high-capacity pretraining has enough token diversity.
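To illustrate what splitting a matrix multiplication across devices means, the sketch below partitions a weight matrix by columns, computes each block separately, and concatenates the results. This is one common scheme chosen for illustration; the notes do not say which partitioning the authors used, and the "devices" here are just list entries.

```python
def matmul(x, w):
    """Plain matrix product of x (n x k) and w (k x m) as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, n_shards):
    """Split w into contiguous column blocks, one per hypothetical device."""
    n_cols = len(w[0])
    assert n_cols % n_shards == 0, "columns must divide evenly across shards"
    step = n_cols // n_shards
    return [[row[i * step:(i + 1) * step] for row in w]
            for i in range(n_shards)]


def column_parallel_matmul(x, w, n_shards):
    """Compute x @ w shard by shard, then concatenate the column blocks."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    # On real hardware this concatenation is the communication step.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, n_shards=2))
```

Each shard holds only a fraction of the weights, which is what lets a layer too large for one GPU's memory run at all; the price is the gather step after every split operation.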

The few-shot interface is also compute-structured: adaptation happens by spending extra forward-pass context rather than by running a new fine-tuning job. For multiple-choice tasks, the model scores candidate completions by likelihood. For free-form completion, the paper uses beam search with width 4 and length penalty 0.6.
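Likelihood scoring for multiple-choice tasks can be sketched as follows. The per-token log-probabilities are hypothetical stand-ins for model output, and averaging per token is used here as a simple length normalization; this is an illustration, not the paper's evaluation code.

```python
def per_token_score(token_logprobs):
    """Mean log-probability, so longer completions are not penalized for length."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_completion(candidates):
    """Return the candidate whose completion has the highest per-token score.

    `candidates` maps each answer string to the list of log-probabilities
    the model assigned to its tokens, given the shared prompt.
    """
    return max(candidates, key=lambda c: per_token_score(candidates[c]))


# Hypothetical log-probs for three candidate answers to one prompt.
toy_candidates = {
    "Paris": [-0.2],
    "the city of Lyon": [-1.5, -0.9, -0.4, -2.0],
    "Berlin": [-2.3],
}

print(pick_completion(toy_candidates))
```

Each candidate needs its own forward pass over the full prompt, so the cost of one multiple-choice question grows with both the number of options and the number of demonstrations in context.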

The training procedure is tuned for long dense runs. GPT-3 uses Adam with beta1 0.9 and beta2 0.95, global gradient-norm clipping at 1.0, a learning-rate warmup over the first 375M tokens, cosine decay of the learning rate to 10% of its peak value over 260B tokens, and weight decay 0.1. Shorter documents are packed into full 2,048-token sequences separated by end-of-text delimiters.
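The learning-rate schedule described above can be written as a function of tokens seen. The sketch assumes a linear warmup, a cosine decay that starts when warmup ends, and a constant floor after 260B tokens; the peak value is the 175B model's learning rate from Table 2.1, and the exact hand-off between phases is an assumption.

```python
import math

PEAK_LR = 0.6e-4        # 175B model learning rate from Table 2.1
WARMUP_TOKENS = 375e6
DECAY_TOKENS = 260e9
FLOOR_FRACTION = 0.10   # decay ends at 10% of the peak


def learning_rate(tokens_seen: float, peak: float = PEAK_LR) -> float:
    """Warmup, then cosine decay to a floor, then constant at the floor."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: warmup is linear from zero.
        return peak * tokens_seen / WARMUP_TOKENS
    if tokens_seen >= DECAY_TOKENS:
        return peak * FLOOR_FRACTION
    # Assumption: the cosine phase spans end of warmup to the 260B mark.
    progress = (tokens_seen - WARMUP_TOKENS) / (DECAY_TOKENS - WARMUP_TOKENS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (FLOOR_FRACTION + (1.0 - FLOOR_FRACTION) * cosine)


for tokens in (0, 375e6, 130e9, 260e9, 300e9):
    print(f"{tokens:.3g} tokens -> lr {learning_rate(tokens):.3e}")
```

Since training runs to 300B tokens, the last 40B tokens in this sketch are trained at the constant floor rate.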

Evidence

The paper reports smooth performance improvement with model size across many tasks.
Few-shot and one-shot performance improve more strongly with scale than zero-shot performance.
The largest model has 175B parameters and was trained on 300B tokens.
The training setup required high-bandwidth multi-GPU partitioning, not a single-machine experiment.
Table 2.1 lists the model ladder from 125M parameters to 175B parameters; the 175B model uses 96 layers, 12,288 width, 96 heads, 128-dimensional heads, 3.2M-token batch size, and learning rate 0.6 * 10^-4.
The paper estimates GPT-3 175B training compute at 3,640 PF-days, compared with tens of PF-days for a 1.5B GPT-2-scale model in its energy discussion.
Appendix D gives the corresponding total FLOPs estimate as 3.14 * 10^23, ignoring attention as typically less than 10% of total compute for the analyzed models.
Filtered Common Crawl is 45TB compressed before filtering and 570GB after filtering.
GPT-3 reaches 64.3% zero-shot, 68.0% one-shot, and 71.2% few-shot accuracy on TriviaQA; it reaches 86.4% few-shot accuracy on LAMBADA, over 18 percentage points above the prior state of the art reported in the paper.
The paper estimates 100 generated pages from the trained 175B model at about 0.4 kWh.
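The Appendix D note that attention is a small share of compute can be sanity-checked for the 175B configuration. The sketch below assumes forward cost per token of about 2 FLOPs per weight, with about 12 * d_model^2 weights per layer, plus about 2 * n_ctx * d_model FLOPs per layer for context-dependent attention. These are rough approximations labeled as assumptions, not the paper's accounting.

```python
def attention_share(n_layers: int, d_model: int, n_ctx: int) -> float:
    """Approximate fraction of forward FLOPs per token spent on attention."""
    # Assumption: ~12 * d_model^2 weights per layer, 2 FLOPs per weight.
    weight_flops = 2 * 12 * n_layers * d_model ** 2
    # Assumption: ~2 * n_ctx * d_model FLOPs per layer for attention over context.
    attn_flops = 2 * n_layers * n_ctx * d_model
    return attn_flops / (weight_flops + attn_flops)


share_175b = attention_share(n_layers=96, d_model=12_288, n_ctx=2_048)
print(f"approx attention share at 175B: {share_175b:.1%}")
```

Under these assumptions the share is a function of n_ctx relative to d_model only, so it shrinks as width grows at a fixed 2,048-token context.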

Historical Effect

GPT-3 made dense LLM scaling a mainstream branch of AI research. It connected three earlier compute trends:

Transformer architectures fit accelerator matrix multiplication.
Scaling-law work justified spending far more compute on larger models.
Datacenter GPU clusters made model-parallel language training feasible.

Historically, this paper also reframed the product interface of language models: prompts became a way to allocate inference compute to new tasks without changing weights.

In the compute spine, GPT-3 is the proof point for hyperscale dense pretraining. It expands the single-machine P100 Transformer pattern into high-bandwidth clustered V100 training with explicit model parallelism, large token batches, and web-scale data filtering. It also shifts some adaptation cost from training-time gradient updates to inference-time context length.

Limits

The paper scales a dense decoder-only model, in which every parameter is active for every token; it does not reduce the cost of training or inference.
Training data is web-scale and imperfectly filtered.
In-context learning is powerful but unreliable compared with later instruction tuning and RLHF-style post-training.
The extracted text does not give GPU count or interconnect topology, so cluster scale cannot be reconstructed beyond V100s on a high-bandwidth Microsoft cluster.
Few-shot prompting increases per-example inference cost because demonstrations occupy context and attention compute.