Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Metadata

Compute Setup

The paper is explicit that this is inference-only compute: no finetuning was done. The appendix reports LaMDA-137B inference on TPU v3 in an 8x8 configuration (64 chips, 128 cores) and PaLM-540B inference on TPU v4 in a 4x4x12 configuration (192 chips, 384 cores). GPT-3 experiments used the public API across text-ada-001, text-babbage-001, text-curie-001, and text-davinci-002, which the paper maps to model scales of approximately 350M, 1.3B, 6.7B, and 175B parameters. Codex used code-davinci-002 through the OpenAI API. The other evaluated model families are LaMDA at 422M, 2B, 8B, 68B, and 137B; PaLM at 8B, 62B, and 540B; and UL2 at 20B.
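
The chip and core counts follow directly from the slice topologies. A minimal sketch of that arithmetic, assuming two cores per chip (which matches the 64/128 and 192/384 figures reported); `slice_size` is a hypothetical helper, not something from the paper:

```python
from math import prod

CORES_PER_CHIP = 2  # assumption, consistent with the reported chip/core counts


def slice_size(topology):
    """Return (chips, cores) for a TPU slice given its topology dimensions."""
    chips = prod(topology)
    return chips, chips * CORES_PER_CHIP


print(slice_size((8, 8)))      # LaMDA-137B on TPU v3 -> (64, 128)
print(slice_size((4, 4, 12)))  # PaLM-540B on TPU v4 -> (192, 384)
```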

The prompt format is as much a part of the compute setup as the hardware. For arithmetic tasks, the authors manually composed eight few-shot exemplars, each containing an input, a chain of thought, and a final output; AQuA used four exemplars drawn from its training set because it is multiple choice. Decoding is greedy. For LaMDA, results are averaged over five random exemplar orders; for the other systems, the paper uses a single exemplar order to save compute. The experimental resource is therefore not a training cluster but repeated large-model forward passes with longer prompts and longer generated outputs.
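
The prompt construction described above can be sketched roughly as follows. This is an illustration only: the exemplar text is made up, the `Q:`/`A:` layout and the "The answer is" suffix are formatting assumptions, and `format_exemplar` and `build_prompt` are hypothetical names. Passing a seed reshuffles the exemplar order, which is how a five-order average could be set up.

```python
import random

# Made-up exemplars for illustration; these are not the paper's prompts.
EXEMPLARS = [
    {"question": "Tom has 3 apples and buys 4 more. How many apples does he have?",
     "chain": "Tom started with 3 apples. 3 + 4 = 7.",
     "answer": "7"},
    {"question": "A box holds 6 pens. How many pens are in 2 boxes?",
     "chain": "Each box holds 6 pens. 2 * 6 = 12.",
     "answer": "12"},
]


def format_exemplar(ex, use_chain=True):
    """Render one exemplar, with or without its chain of thought."""
    rationale = ex["chain"] + " " if use_chain else ""
    return f"Q: {ex['question']}\nA: {rationale}The answer is {ex['answer']}."


def build_prompt(exemplars, test_question, use_chain=True, seed=None):
    """Join exemplars (shuffled when a seed is given) with the test question."""
    ordered = list(exemplars)
    if seed is not None:
        random.Random(seed).shuffle(ordered)
    blocks = [format_exemplar(ex, use_chain) for ex in ordered]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


# Five exemplar orders for the same test question, one per seed.
prompts = [build_prompt(EXEMPLARS, "Ann has 2 cats and adopts 1 more. How many cats?", seed=s)
           for s in range(5)]
print(prompts[0])
```

Setting `use_chain=False` yields the standard few-shot baseline from the same exemplars, so the two conditions differ only in the rationale text.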

Bottleneck

The bottleneck is the structure of inference. Standard few-shot prompting asks a giant frozen model to map from question directly to answer, which gives the model little opportunity to spend sequential computation on intermediate state. For multi-step arithmetic, symbolic manipulation, and commonsense reasoning, that direct-answer format underuses the autoregressive computation already available in the decoder.

Chain-of-thought prompting moves the constraint from training a new model to paying for more and better-organized inference tokens. The paper explicitly frames chains as a way to allocate additional computation to problems requiring more reasoning steps. That extra computation is not free: longer prompts consume more context, longer completions consume more decode time, and decode is sequential because each output token depends on prior tokens. The hardware consequence is that a TPU/API-hosted 100B+ model must be served for more generated tokens per query.
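
A rough way to see the cost shift is to count tokens per query. The sketch below uses hypothetical token counts chosen only for illustration, and `QueryCost` is an invented helper; the one structural fact it encodes is that autoregressive decoding takes one sequential step per generated token.

```python
from dataclasses import dataclass


@dataclass
class QueryCost:
    prompt_tokens: int      # few-shot exemplars plus the test question
    completion_tokens: int  # tokens the model generates

    @property
    def sequential_decode_steps(self):
        # Each output token depends on the ones before it, so decoding
        # needs one sequential step per generated token.
        return self.completion_tokens

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens


# Hypothetical counts: chain-of-thought exemplars lengthen the prompt,
# and the generated rationale lengthens the completion.
standard = QueryCost(prompt_tokens=400, completion_tokens=5)
chain = QueryCost(prompt_tokens=900, completion_tokens=80)

print(chain.sequential_decode_steps / standard.sequential_decode_steps)
print(chain.total_tokens / standard.total_tokens)
```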

Scale is also a bottleneck. The paper repeatedly finds that chain-of-thought reasoning is an emergent ability: small and medium models often generate fluent but illogical traces, and the gains appear reliably only at roughly 100B parameters and above. The method therefore depends on expensive model replicas and inference infrastructure before it becomes useful.

Method Adaptation

The adaptation is deliberately minimal: keep the model fixed and change the shape of the prompt. Each exemplar becomes a triple of input, natural-language intermediate steps, and final output. At test time, the model is expected to continue that pattern by generating a rationale-like sequence before the answer. This turns autoregressive decoding into a scratchpad, using output tokens as a temporary workspace for decomposition, arithmetic substeps, state tracking, and semantic interpretation.
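
Because the rationale and the answer arrive in one generated string, scoring needs a step that pulls the final answer out of the scratchpad. A minimal sketch, assuming completions end with a phrase like "The answer is X" and that answers are numeric; `extract_answer` is a hypothetical helper:

```python
import re

# Assumed answer format: "The answer is <number>", with optional "$" and commas.
ANSWER_PATTERN = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")


def extract_answer(completion):
    """Return the last number stated as 'The answer is X', or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


completion = "Tom started with 3 apples. 3 + 4 = 7. The answer is 7."
print(extract_answer(completion))
```

Taking the last match rather than the first guards against a completion that runs on into another generated question.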

The paper's ablations show that the adaptation is not merely "more tokens." Equation-only prompting helps on some easier tasks but not on GSM8K, where semantic parsing of the story is hard. A "variable compute only" condition, where the model emits dots to match the length of an equation, performs about the same as the baseline. Putting the reasoning after the answer also performs about the same as the baseline. The useful compute structure is therefore sequential natural-language reasoning before the answer, not just extra decode length or latent activation of relevant knowledge.
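
The ablation conditions differ only in what each exemplar's completion contains. The sketch below builds those completions for one made-up exemplar; the condition names, the `make_target` helper, and measuring equation length in characters are assumptions made for illustration.

```python
def make_target(chain, equation, answer, condition):
    """Build the exemplar completion for one prompting condition."""
    final = f"The answer is {answer}."
    if condition == "standard":
        return final
    if condition == "chain_of_thought":
        return f"{chain} {final}"
    if condition == "equation_only":
        return f"{equation} {final}"
    if condition == "variable_compute":
        # Dots only, matching the equation's length (characters, by assumption).
        return f"{'.' * len(equation)} {final}"
    if condition == "reasoning_after_answer":
        return f"{final} {chain}"
    raise ValueError(f"unknown condition: {condition}")


chain = "Tom started with 3 apples. 3 + 4 = 7."
for condition in ("standard", "chain_of_thought", "equation_only",
                  "variable_compute", "reasoning_after_answer"):
    print(f"{condition}: {make_target(chain, '3 + 4 = 7', '7', condition)}")
```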

The method also adapts to inference economics by avoiding gradient updates. A few human-written exemplars can be reused across datasets, and one checkpoint can cover arithmetic, commonsense, and symbolic tasks. Compared with rationale-supervised finetuning, the training-data cost is small; compared with standard prompting, the serving bill rises.

Evidence

The headline arithmetic result is PaLM-540B on GSM8K: standard prompting gets 17.9% accuracy and chain-of-thought reaches 56.9%; with an external calculator applied post hoc to generated equations, it reaches 58.6%. GPT-3 175B improves from 15.6% to 46.9% on GSM8K, and Codex improves from 19.7% to 63.1%. On MAWPS, PaLM-540B moves from 79.2% to 93.3%, and Codex moves from 78.7% to 92.6%. The same table shows that UL2-20B and sub-10B LaMDA/GPT/PaLM models receive little benefit or can get worse, supporting the scale-threshold claim.
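
The post hoc calculator step can be pictured as recomputing each equation that appears in a generated chain. The following is a simplified sketch, not the paper's implementation: it handles only single binary equations of the form `a op b = c`, and it does not propagate a corrected value into later steps or into the final answer. `fix_equations` is a hypothetical helper.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
NUM = r"-?\d+(?:\.\d+)?"
EQUATION = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")


def fix_equations(chain):
    """Recompute the right-hand side of each 'a op b = c' found in the text."""
    def recompute(match):
        left, op, right, _stated = match.groups()
        if op == "/" and float(right) == 0:
            return match.group(0)  # leave division by zero untouched
        value = OPS[op](float(left), float(right))
        shown = int(value) if value == int(value) else round(value, 4)
        return f"{left} {op} {right} = {shown}"
    return EQUATION.sub(recompute, chain)


print(fix_equations("There are 5 + 6 = 12 balls."))
```

The small gain from 56.9% to 58.6% is consistent with this kind of fix helping only when the reasoning is right and the arithmetic alone is wrong.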

The gains are not limited to grade-school arithmetic. On commonsense tasks, PaLM-540B improves from 68.6% to 77.8% on StrategyQA, from 49.0% to 65.3% on Date Understanding, from 80.5% to 95.4% on Sports Understanding, and from 80.8% to 91.7% on SayCan. On symbolic length generalization, PaLM-540B rises from 0.2% to 94.8% on out-of-distribution three-name last-letter concatenation and from 0.0% to 63.0% on four-name concatenation. LaMDA-137B shows a similar qualitative pattern, such as 5.8% to 77.5% on two-name last-letter concatenation.
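
The last-letter concatenation task itself is trivial to specify in code, which is what makes the near-zero standard-prompting scores striking. A minimal reference function, with invented names used as test inputs:

```python
def last_letter_concat(name):
    """Concatenate the last letter of each whitespace-separated word."""
    return "".join(word[-1] for word in name.split())


print(last_letter_concat("Amy Brown"))           # two names -> "yn"
print(last_letter_concat("Ada Grace Alan"))      # three names -> "aen"
print(last_letter_concat("Ada Grace Alan Kim"))  # four names -> "aenm"
```

Length generalization here means the exemplars show two-name inputs while the test inputs have three or four names.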

The paper also inspects generated traces. For 50 random GSM8K examples where LaMDA-137B answered correctly, all but two chains were logically and mathematically correct. For 50 wrong answers, 46% of chains were "almost correct" in the authors' categories, typically needing only a calculator fix, a symbol correction, or one missing reasoning step. Scaling PaLM from 62B to 540B fixed a substantial portion of missing-step and semantic-understanding errors.

Historical Effect

This paper made scratchpad tokens a mainstream compute structure. Before it, reasoning improvements were often framed as training objectives, symbolic modules, verifiers, or specialized datasets. Chain-of-thought showed that an already-trained large model could expose additional ability when the prompt requested intermediate text, making inference-time compute a first-class axis alongside parameter count and pretraining tokens.

Historically, this directly opened the path to self-consistency, verifier-guided reasoning, tool use, program-of-thought prompting, tree-of-thought search, and later agentic workflows. All of those methods elaborate the same basic idea: spend more structured inference compute per problem, then select or use the result.

Limits

The paper is careful to note that a chain of thought is not proof that the network is faithfully reasoning. Generated paths can be wrong, can reach a correct answer by accident, and can be misleading on tasks where correctness of the trace is hard to verify. The method also makes serving more expensive because it requires large models and longer completions, and the authors explicitly flag real-world serving cost as a limitation.

The annotation cost is low for few-shot prompting but would become expensive if large rationale datasets had to be written for finetuning. Prompt sensitivity remains nonzero even though the paper reports robustness to several exemplar sources. Finally, because most of the strongest systems are proprietary or API-hosted, reproducibility depends on provided prompts, logged outputs, and the limited hardware details in the appendix rather than full model access.

Links

  • Compute regime: history/compute_regimes/inference_time_compute_post_training/README.md
  • Source PDF and extracted text are listed in metadata above.
  • Queue status: read_complete.
  • Method index: inference_time_reasoning
  • Ledger updates: compute bottlenecks