Self-Consistency Improves Chain of Thought Reasoning in Language Models

Metadata

Reading status: read complete
Year: 2022
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2022-self_consistency_2022.pdf
Extracted text: 2022-self_consistency_2022.txt
PDF URL: https://arxiv.org/pdf/2203.11171.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The compute in this paper is deliberately inference-only. The authors state that "for all four language models we perform prompting-based inference only"; there is no model training or fine-tuning in the self-consistency experiments. The local text gives specific serving setups: UL2-20B runs on TPU v3 in a 2x2 configuration, meaning 4 chips and 8 cores; LaMDA-137B runs on TPU v3 in an 8x8 configuration, meaning 64 chips and 128 cores; PaLM-540B runs on TPU v4 in a 4x4x12 configuration, meaning 192 chips and 384 cores. The GPT-3/Codex experiments are done through the public API, so the underlying device and serving topology are not reported.
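The chip and core counts above follow directly from the topology strings. A minimal sketch of that arithmetic, assuming two cores per chip on both TPU v3 and TPU v4 (consistent with the figures quoted above); the helper name chips_and_cores and the dictionary layout are illustrative, not from the paper:

```python
from math import prod

# Assumption: two cores per chip on both TPU v3 and TPU v4,
# which matches the chip/core pairs quoted in the card.
CORES_PER_CHIP = 2

def chips_and_cores(topology):
    """Return (chips, cores) for a topology given as a tuple of mesh dimensions."""
    chips = prod(topology)
    return chips, chips * CORES_PER_CHIP

# Serving setups as listed in the card (labels are illustrative).
setups = {
    "UL2-20B (TPU v3 2x2)": (2, 2),
    "LaMDA-137B (TPU v3 8x8)": (8, 8),
    "PaLM-540B (TPU v4 4x4x12)": (4, 4, 12),
}

for name, topology in setups.items():
    chips, cores = chips_and_cores(topology)
    print(f"{name}: {chips} chips, {cores} cores")
```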

The paper also gives wall-clock scale. Over about 1,000 examples, most jobs take 1 to 4 hours on UL2 and LaMDA-137B and 2 to 12 hours on PaLM-540B; commonsense tasks can run longer but stay under 2 days. The default setting samples 40 outputs per question independently from the decoder, and reported results are averaged over 10 runs. UL2 and LaMDA use temperature 0.5 with top-k 40; PaLM uses temperature 0.7 with top-k 40; GPT-3 uses temperature 0.7 without top-k truncation and a 128-token output limit.
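The decoding settings can be written down as a small configuration table, together with a generic temperature and top-k sampling step. This is a sketch of the standard technique in plain Python, not the serving code used in the paper; the SAMPLING dictionary and the sample_token function are hypothetical names introduced here:

```python
import math
import random

# Decoding settings as reported in the card; the layout is illustrative.
SAMPLING = {
    "UL2-20B": {"temperature": 0.5, "top_k": 40},
    "LaMDA-137B": {"temperature": 0.5, "top_k": 40},
    "PaLM-540B": {"temperature": 0.7, "top_k": 40},
    "GPT-3": {"temperature": 0.7, "top_k": None, "max_tokens": 128},
}

def sample_token(logits, temperature, top_k=None, rng=random):
    """Sample one token id from raw logits with temperature and optional top-k."""
    # Rank token ids by logit, highest first.
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]  # keep only the k most likely tokens
    scaled = [logits[i] / temperature for i in ids]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Lower temperatures concentrate probability on the top tokens, so the 0.5 setting yields less diverse paths than 0.7 for the same logits.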

Bottleneck

The bottleneck is the brittleness of single-path decoding. Chain-of-thought prompting had made large language models expose intermediate reasoning, but greedy decoding still commits the answer to one trajectory through the token distribution. A single arithmetic slip, false premise, or locally likely but globally wrong derivation becomes the final answer. Larger models reduce this failure rate, but size alone leaves avoidable accuracy on the table.

Self-consistency reframes reasoning accuracy as a sampling and aggregation problem. The expensive resource is repeated forward decoding through models ranging from 20B to 540B parameters. That changes the practical bottleneck to batchable inference throughput, output token budget, sample diversity, and answer normalization. On PaLM-540B, 40 chains over about 1,000 examples is already a multi-hour TPU v4 job, so the method is cheap relative to retraining but not cheap in absolute terms.

Method Adaptation

The method adapts chain-of-thought to the accelerator inference regime by using the model as a self-ensemble. It samples diverse reasoning paths, parses the answer from each path, and chooses the most consistent final answer. In engineering terms, this replaces a single greedy decode with many independent sampled decodes that can be parallelized across accelerator cores, batched across prompts, or traded off against latency.
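The sample, parse, and vote loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: sample_path is a hypothetical callable standing in for one sampled model decode, and the "The answer is ..." pattern is an assumed output format, not a guaranteed one:

```python
import re
from collections import Counter

def parse_answer(text):
    """Hypothetical parser: extract the number following 'The answer is'."""
    match = re.search(r"The answer is\s*\$?(-?[\d,]*\.?\d+)", text)
    return match.group(1).replace(",", "") if match else None

def self_consistency(sample_path, prompt, num_paths=40):
    """Sample num_paths reasoning paths and return the most frequent parsed answer.

    sample_path is an assumed callable: prompt -> generated text (one sampled decode).
    """
    answers = []
    for _ in range(num_paths):
        answer = parse_answer(sample_path(prompt))
        if answer is not None:  # skip paths with no parsable answer
            answers.append(answer)
    if not answers:
        return None
    # Majority vote; ties resolve to the answer that appeared first.
    return Counter(answers).most_common(1)[0][0]
```

In a real serving setup the loop body would be issued as one batched request rather than sequential calls, since the paths are independent until the vote.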

This fits TPU/GPU serving hardware because each sampled path is independent until aggregation. The final majority vote is CPU-cheap compared with the forward passes. The sampling parameters are simple and architecture-agnostic, so the same recipe runs on API-hosted GPT-3, TPU v3 LaMDA/UL2, and TPU v4 PaLM without retraining. The price is an inference cost that grows near-linearly with sample count and generated tokens.

The method also shifts part of the bottleneck into parsing. Arithmetic tasks can often normalize a numeric final answer, while commonsense tasks require string or option normalization. The gains depend on a task format where many independent chains can be reduced to the same answer space.
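A small sketch of what answer normalization might look like for the two task families mentioned above. Both functions are hypothetical helpers, and they assume the answer span has already been extracted from the reasoning text; the option set "abcde" is an assumption, not a property of any specific benchmark:

```python
import re

def normalize_numeric(raw):
    """Map strings such as '$1,200.00' or '18.' to a canonical number string."""
    cleaned = raw.replace(",", "").replace("$", "").strip().rstrip(".")
    try:
        value = float(cleaned)
    except ValueError:
        return None  # not a number; the caller decides how to handle it
    return str(int(value)) if value.is_integer() else str(value)

def normalize_choice(raw, options="abcde"):
    """Map a short answer span such as '(B)' or ' b. ' to a lowercase option letter."""
    match = re.search(r"[A-Za-z]", raw)
    if not match:
        return None
    letter = match.group(0).lower()
    return letter if letter in options else None
```

Without this step, "18", "18.0", and "$18" would be counted as three different votes, which dilutes the majority.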

Evidence

The main experiments report self-consistency averaged over 10 runs with 40 sampled outputs per question. On PaLM-540B, Table 2 reports GSM8K accuracy rising from 56.5 with greedy chain-of-thought to 74.4 with self-consistency, AQuA from 35.8 to 48.3, and SVAMP from 79.0 to 86.6. On GPT-3 code-davinci-002, GSM8K rises from 60.1 to 78.0, AQuA from 39.8 to 52.0, and SVAMP from 75.8 to 86.8. PaLM-540B also improves ARC-Challenge from 85.2 to 88.7 and StrategyQA from 65.8 to 70.8.

Table 1 compares aggregation rules on PaLM-540B. Direct majority vote reaches 74.4 on GSM8K, 99.3 on MultiArith, 48.3 on AQuA, and 86.6 on SVAMP, close to normalized weighted summation and much better than unnormalized variants. The expensive part is producing enough diverse samples; complex probability weighting is not the main source of gain.
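To make the aggregation rules concrete, here is a hedged sketch of three of them: plain majority vote, a weighted sum using the raw sequence probability, and a weighted sum using the length-normalized probability. The function name aggregate, the rule labels, and the toy log-probabilities are assumptions for illustration, not data from the paper:

```python
import math
from collections import defaultdict

def aggregate(paths, rule="majority"):
    """Pick an answer from (answer, token_logprobs) pairs under one aggregation rule.

    Assumes every path has at least one token log-probability.
    """
    scores = defaultdict(float)
    for answer, logprobs in paths:
        if rule == "majority":
            weight = 1.0  # every path counts equally
        elif rule == "weighted_sum_unnormalized":
            weight = math.exp(sum(logprobs))  # raw sequence probability
        elif rule == "weighted_sum_normalized":
            weight = math.exp(sum(logprobs) / len(logprobs))  # per-token average
        else:
            raise ValueError(f"unknown rule: {rule}")
        scores[answer] += weight
    return max(scores, key=scores.get)

# Toy example: two long chains agree on "18", one short chain says "26".
toy_paths = [("18", [-0.5] * 40), ("18", [-0.5] * 40), ("26", [-0.4] * 10)]
```

On this toy input, majority vote and the normalized rule both pick "18", while the unnormalized rule picks "26" because raw sequence probability penalizes longer chains. That is one plausible reading of why the unnormalized variants trail in Table 1.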

The appendix also compares this inference compute knob against prompt ensembles. On GSM8K, 40 different prompt sets score 58.9 and 40 prompt permutations score 59.6, while self-consistency with 40 paths scores 74.4. The result supports the paper's central claim: multiple sampled reasoning paths from one strong model are more valuable than merely varying the prompt around one greedy decode.

Historical Effect

Self-consistency is an early, clean example of inference-time compute as a method variable for LLM reasoning. It did not require gradient updates, task-specific labels, verifier training, or architecture changes. Instead, it turned decode sample count into a reproducible accuracy knob: a model can be held fixed while inference budget, sampling policy, and aggregation policy vary.

Historically, the paper sits between chain-of-thought prompting and later search/verifier systems. It showed that a large dense model already contains many plausible solution trajectories, and that serving systems can recover reliability by sampling and voting. Once model training is too expensive to repeat, inference fleets become a second compute surface for capability gains.

Limits

The compute cost scales roughly linearly with the number of sampled paths and generated tokens. The reported PaLM-540B runs already take hours per task, and commonsense workloads can approach two days. This makes self-consistency attractive for benchmarks and high-value queries, but less attractive for low-latency serving unless batching and early stopping are engineered carefully.
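The linear cost and one possible early-stopping rule can be sketched as follows. The early-stop criterion (halt once the leading answer cannot be overtaken by the remaining samples) is an illustrative design introduced here, not a technique from the paper, and both function names are hypothetical:

```python
from collections import Counter

def decode_token_budget(num_examples, num_paths, max_tokens):
    """Upper bound on generated tokens; linear in every argument."""
    return num_examples * num_paths * max_tokens

def vote_with_early_stop(answer_stream, max_paths=40):
    """Consume parsed answers one at a time; stop once the leader cannot be caught.

    Returns (winning_answer, paths_used).
    """
    counts = Counter()
    used = 0
    for answer in answer_stream:
        used += 1
        counts[answer] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = max_paths - used
        # Stop when the margin exceeds the votes still outstanding.
        if lead - runner_up > remaining or used >= max_paths:
            break
    winner = counts.most_common(1)[0][0] if counts else None
    return winner, used

# About 1,000 examples, 40 paths, and a 128-token cap give at most 5,120,000 tokens.
budget = decode_token_budget(1000, 40, 128)
```

When all paths agree, this rule stops after 21 of 40 samples. The trade-off is that sequential stopping gives up some of the batch parallelism that makes independent paths cheap to serve, so a practical system would likely sample in small batches and check the margin between batches.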

The method also depends on a clean answer space. It works best when final answers can be parsed, normalized, and compared. Multiple wrong chains can agree, especially on biased prompts or ambiguous questions, and the model probabilities are not calibrated enough to reliably distinguish correct from incorrect chains. GPT-3 hardware is hidden behind the public API, so only the TPU-side compute setups are transparent in the paper.