Let's Verify Step by Step

Metadata

Reading status: read complete
Year: 2023
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2023-verify_step_by_step_2023.pdf
Extracted text: 2023-verify_step_by_step_2023.txt
PDF URL: https://arxiv.org/pdf/2305.20050.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper does not disclose the accelerator type, node count, memory size, or wall-clock time for training or inference. Under the project rule, the compute device is therefore inferred from the publication date and institutional context: OpenAI internal accelerator infrastructure of the GPT-4 era, not a publicly specified GPU/TPU setup. The paper does give concrete compute proxies. The large-scale models are finetuned from a base GPT-4 model that was pretrained only on next-token prediction, without RLHF, while the small-scale base models are similar in design but pretrained with roughly 200 times less compute.

All models receive a math-focused finetuning stage on MathMix. Appendix A states that MathMix contains roughly 1.5B math-relevant tokens; large-model experiments train on roughly 3B MathMix tokens over two epochs, while small-model experiments use a 1B-token variant and train for six epochs, roughly 6.6B tokens. PRM800K contains 800K training step labels over 75K solutions to 12K problems, after filtering a larger collection of 1,085,590 labels over 101,599 sampled solutions. The headline inference budget is best-of-1860 search on the held-out MATH subset.
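The dataset figures above imply a few ratios that are handy as compute proxies. The sketch below only does that arithmetic: the inputs are the rounded figures quoted in this card, the names are illustrative, and the derived ratios are not values reported by the paper.

```python
# Rounded figures quoted in this card (approximate, not exact counts).
MATHMIX_TOKENS = 1.5e9        # full MathMix size
LARGE_EPOCHS = 2              # large-scale finetuning epochs
SMALL_TOKENS_SEEN = 6.6e9     # small-scale finetuning tokens over six epochs
SMALL_EPOCHS = 6
PRM_STEP_LABELS = 800_000     # step labels used for training
PRM_SOLUTIONS = 75_000
PRM_PROBLEMS = 12_000
RAW_STEP_LABELS = 1_085_590   # labels collected before filtering


def derived_ratios() -> dict:
    """Return ratios implied by the card's figures (derived, not paper-reported)."""
    return {
        "large_tokens_seen": MATHMIX_TOKENS * LARGE_EPOCHS,
        "small_tokens_per_epoch": SMALL_TOKENS_SEEN / SMALL_EPOCHS,
        "labels_per_solution": PRM_STEP_LABELS / PRM_SOLUTIONS,
        "solutions_per_problem": PRM_SOLUTIONS / PRM_PROBLEMS,
        "label_retention": PRM_STEP_LABELS / RAW_STEP_LABELS,
    }


if __name__ == "__main__":
    for name, value in derived_ratios().items():
        print(f"{name}: {value:.4g}")
```

Under these rounded inputs, each training solution carries between 10 and 11 step labels on average, and 6.6B tokens over six epochs implies about 1.1B tokens per epoch, slightly above the nominal 1B-token size of the small variant.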

Bottleneck

The core bottleneck is reliable selection when many solutions are sampled at inference time. MATH problems often have automatically checkable final answers, so outcome supervision can be collected cheaply, but final-answer labels are a low-bandwidth signal. A model can reach the right answer through invalid reasoning, and an answer checker can misgrade a solution whose reasoning is correct but whose formatting is awkward. As the number of sampled solutions grows, selection becomes a search problem over many plausible trajectories.

That creates an inference compute tradeoff. Majority voting can exploit many samples, but it collapses each solution to the final answer and ignores whether the reasoning is trustworthy. An outcome reward model can score whole solutions, but because it is trained only on answer correctness, it may learn shortcuts that rank convincing-looking wrong derivations too highly. The bottleneck is therefore bandwidth between human supervision and search: how much information must the labeler provide so that a verifier remains useful when the generator is sampled hundreds or thousands of times?
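The two selection rules contrasted above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the toy answers, and the toy scores are all assumptions, and the scores stand in for whatever a reward model would output.

```python
from collections import Counter
from typing import Sequence


def majority_vote(final_answers: Sequence[str]) -> str:
    """Pick the most frequent final answer; the reasoning text is never inspected."""
    return Counter(final_answers).most_common(1)[0][0]


def reward_best_of_n(final_answers: Sequence[str], scores: Sequence[float]) -> str:
    """Pick the final answer of the single highest-scored sampled solution."""
    if len(final_answers) != len(scores):
        raise ValueError("need exactly one score per sampled solution")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return final_answers[best_index]


# Toy data: five sampled solutions, their final answers, and made-up verifier scores.
answers = ["12", "12", "15", "12", "15"]
scores = [0.20, 0.30, 0.90, 0.10, 0.40]

print(majority_vote(answers))             # answer backed by the most samples
print(reward_best_of_n(answers, scores))  # answer backed by the top-scored sample
```

In this toy case the two rules disagree: voting follows the most common answer, while reward-based selection follows one highly scored solution, so its quality depends entirely on how trustworthy the scores are.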

Method Adaptation

The method adapts the post-training pipeline around that inference bottleneck. The generator is trained to emit newline-delimited step-by-step solutions, not because this is claimed to teach new mathematics, but because it makes each step parseable for both labelers and the reward model. Human labelers mark each step positive, negative, or neutral. To maximize the value of limited labeling time, solutions are sampled from the large-scale generator, and active selection is biased toward convincing wrong-answer solutions, which are the cases most likely to expose verifier failures.
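The active-selection idea can be sketched as a simple filter and sort. This is a simplified illustration under stated assumptions, not the paper's exact procedure: `Candidate`, `select_for_labeling`, and the use of a current verifier score as the measure of "convincing" are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    solution_id: str
    reaches_correct_answer: bool  # from an automatic final-answer check
    verifier_score: float         # score from the current verifier (assumed available)


def select_for_labeling(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Return the k wrong-answer solutions that the current verifier rates highest."""
    wrong = [c for c in candidates if not c.reaches_correct_answer]
    wrong.sort(key=lambda c: c.verifier_score, reverse=True)
    return wrong[:k]


pool = [
    Candidate("s1", False, 0.90),
    Candidate("s2", True, 0.95),
    Candidate("s3", False, 0.20),
    Candidate("s4", False, 0.70),
]
print([c.solution_id for c in select_for_labeling(pool, k=2)])
```

The design intent is that labeler time goes to solutions the verifier currently gets wrong with high confidence, since each such solution must contain at least one mistaken step worth labeling.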

The process reward model (PRM) is cheap relative to repeated generation. It predicts the correctness of each step after the last token in that step, using ordinary language-model log-likelihood targets; the paper says this can be trained in a standard language-model pipeline. At test time, one PRM forward pass over the whole solution yields step probabilities. The PRM score for a candidate is the product of the step correctness probabilities, fitting a generator-then-verifier inference pattern.
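The scoring rule described above can be sketched as follows. The step probabilities are made-up inputs standing in for PRM outputs, and `prm_solution_score` and `pick_best` are hypothetical helper names, not functions from the paper's code.

```python
import math
from typing import Dict, Sequence


def prm_solution_score(step_probs: Sequence[float]) -> float:
    """Score a solution as the product of its per-step correctness probabilities."""
    if not step_probs:
        raise ValueError("a solution needs at least one scored step")
    return math.prod(step_probs)


def pick_best(candidates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the candidate with the highest product score."""
    return max(candidates, key=lambda name: prm_solution_score(candidates[name]))


# Toy candidates: one uniformly plausible solution, one with a single doubtful step.
candidates = {
    "steady": [0.90, 0.90, 0.90],
    "one_bad_step": [0.99, 0.40, 0.99],
}
print(pick_best(candidates))
```

Because the score is a product, a single low-probability step drags the whole solution down, which is the behavior a step-level verifier is meant to provide.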

The paper is also careful not to present this as a reinforcement-learning result. The generator is fixed for the headline comparison; the experiment isolates verifier supervision and search. Cost therefore shifts to two places: higher-resolution human labels during post-training and larger best-of-N sampling at inference.

Evidence

On the held-out 500-problem MATH test subset, the large-scale PRM solves 78.2% of problems with best-of-1860 search. The strongest outcome-supervised reward model reaches 72.4%, and majority voting reaches 69.6%. The plotted best-of-N curves matter as compute evidence: the PRM is not merely better at one fixed sample count; the performance gap widens as N increases, showing that step supervision is more compatible with spending more inference compute on search.
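A best-of-N curve like the one described above can be computed from scored samples. The sketch below is one simple way to do it, assuming each problem comes with a list of (verifier score, is-correct) pairs and that the first N samples form the pool; the function name and the toy data are illustrative, and the paper's own estimation procedure is not specified in this card.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, bool]  # (verifier score, final answer is correct)


def best_of_n_accuracy(problems: Sequence[Sequence[Sample]], n: int) -> float:
    """Fraction of problems solved when the top-scored of the first n samples is chosen."""
    if n < 1:
        raise ValueError("n must be at least 1")
    solved = 0
    for samples in problems:
        pool = samples[:n]
        top_score, top_is_correct = max(pool, key=lambda s: s[0])
        solved += int(top_is_correct)
    return solved / len(problems)


# Toy data: two problems with three scored samples each (made-up values).
toy: List[List[Sample]] = [
    [(0.20, False), (0.90, True), (0.50, False)],
    [(0.70, False), (0.60, True), (0.95, True)],
]
for n in (1, 2, 3):
    print(n, best_of_n_accuracy(toy, n))
```

Evaluating this function at increasing N for each selection rule gives the curves being compared; a verifier whose accuracy keeps rising with N is the one that makes extra sampling worthwhile.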

The ORM baseline is not weak by construction. The paper trains it on 100 uniform samples per problem from the generator, making the ORM dataset an order of magnitude larger than PRM800K and non-overlapping with it. This makes the comparison informative for the compute framing: a larger number of coarse labels does not substitute cleanly for fewer, higher-bandwidth step labels when the inference loop is searching over many generated trajectories. The PRM also generalizes beyond the MATH subset: on a held-out evaluation of 224 recent STEM questions, the authors report the same qualitative pattern, with PRM best-of-100 outperforming both the ORM and majority voting.

The MathMix numbers provide the training-side evidence. Rather than using the larger Minerva-style 38.5B-token math corpus, this work uses a filtered 1.5B-token mixture and then spends compute on process labels plus search. The system illustrates a shift from "train a better generator only" toward "train a math-adapted generator, then use a verifier to make inference compute productive."

Historical Effect

The paper made process supervision a concrete compute structure for reasoning systems. It reframed scaling as a division of labor: pretraining and math finetuning produce candidate reasoning, human step labels train a verifier, and test-time sampling and search turn extra inference compute into higher accuracy. This matters because the improvement is largest at high N, exactly where a deployment can choose to spend more latency or accelerator budget on harder questions.

The card therefore sits in the inference-time compute and post-training regime. It does not introduce a new accelerator or architecture; it changes what the accelerator is asked to do. Instead of one decode and one answer, the device runs many decodes and verifier passes, and process labels make those passes more selective.

Limits

The main hardware limit is non-disclosure: the paper does not state GPUs, TPUs, memory, parallelism, training wall time, or serving latency. The statement that the small-scale models use roughly 200 times less pretraining compute than GPT-4 is only a relative marker, not a reproducible device setup. The method also depends on a domain where step labels are meaningful and final answers are often checkable.

The generator is fixed rather than trained end-to-end with RL against the PRM, so the paper does not measure the compute or stability cost of closing that loop. Best-of-1860 is a large inference budget and may be inappropriate for low-latency settings. PRM scoring by a product of step probabilities can penalize long solutions, and step labels are expensive relative to final-answer labels. The result is best read as evidence that process supervision makes search scale better, not as a full accounting of production serving cost.
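The length penalty from product scoring is easy to see with a small calculation. The per-step probability of 0.95 and the step counts below are arbitrary illustrative values, not numbers from the paper, and `product_score` is a hypothetical helper.

```python
def product_score(per_step_prob: float, num_steps: int) -> float:
    """Product score of a solution whose steps all share the same correctness probability."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return per_step_prob ** num_steps


short_solution = product_score(0.95, 5)   # about 0.774
long_solution = product_score(0.95, 20)   # about 0.358

print(f"5 steps:  {short_solution:.3f}")
print(f"20 steps: {long_solution:.3f}")
```

Both toy solutions have equally plausible individual steps, yet the longer one scores less than half as high, so a verbose correct solution can lose to a short one purely because of its length.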