s1: Simple test-time scaling

Compute Setup

s1 is unusually explicit about research-time hardware. The authors fine-tune Qwen2.5-32B-Instruct on 1,000 reasoning examples for 26 minutes on 16 NVIDIA H100 GPUs using PyTorch FSDP. Appendix D gives the training details: 5 epochs, batch size 16, 315 gradient steps, bfloat16, AdamW, learning rate 1e-5 with 5% warmup and cosine decay, weight decay 1e-4, and loss only on reasoning traces and answers, not on the questions. The selected 32K training sequence length avoids cutting off samples.
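The step count and schedule above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' training code: it assumes the last partial batch of each epoch is kept, that the 5% warmup is rounded up to whole steps, and that the cosine decays to zero. The function names are illustrative.

```python
import math

def gradient_steps(num_examples, batch_size, epochs):
    # Assumption: each epoch keeps its final partial batch.
    return math.ceil(num_examples / batch_size) * epochs

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.05):
    # Linear warmup followed by cosine decay toward zero.
    warmup = math.ceil(warmup_frac * total_steps)  # assumption: round up
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = gradient_steps(num_examples=1000, batch_size=16, epochs=5)
print(total)  # 63 steps per epoch * 5 epochs
```

Under these assumptions, 1,000 examples at batch size 16 give 63 steps per epoch, which matches the 315 gradient steps reported for 5 epochs.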

The larger compute story is the contrast between small post-training and controllable inference. Fine-tuning on the full 59K distilled data pool would cost 394 H100 GPU-hours, while s1-32B requires about 7 (26 minutes on 16 GPUs). The data are generated with Gemini 2.0 Flash Thinking, yielding 59K triplets of question, reasoning trace, and solution with 3.6K average thinking tokens. At serving time, the compute knob is the decoder itself: extra "thinking" tokens are the test-time compute budget, and budget forcing changes how many of those tokens are emitted.
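The GPU-hour figures follow from the wall-clock numbers quoted earlier. A small sketch of the arithmetic, using only values stated in this card (the helper name is illustrative):

```python
def gpu_hours(wall_minutes, num_gpus):
    # GPU-hours = wall-clock hours multiplied by the number of GPUs.
    return wall_minutes * num_gpus / 60.0

s1k_hours = gpu_hours(wall_minutes=26, num_gpus=16)  # about 6.9
full_pool_hours = 394                                # reported for the 59K pool
ratio = full_pool_hours / s1k_hours                  # roughly 57x

print(round(s1k_hours, 2), round(ratio, 1))
```

So the 1K subset costs roughly one fifty-seventh of the full-pool fine-tune, in line with a pool about 59 times larger.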

Bottleneck

The bottleneck is not training a new frontier model. It is activating latent reasoning behavior in an already pretrained 32B model and then controlling how much inference compute it spends. Prompt-only length control is unreliable because models cannot reliably count tokens. Rejection sampling is wasteful because it repeatedly generates full traces until one happens to meet a length constraint, and in this paper it even shows inverse scaling on AIME24.

s1 also exposes a training-sequence bottleneck. If the fine-tuning sequence length is too short, the answer portion is often cut off, so the model learns long thinking without reliably learning when to stop and answer. Table 8 shows that 4,096-token training cuts off 74% of samples and produces much longer inference traces with worse accuracy than 32,768-token training.
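A toy sketch of why truncation hides the answer during training. It assumes each example is laid out as question, then reasoning trace, then answer, with loss computed only on the last two parts; the function and field names are hypothetical, not from the paper's code.

```python
def build_example(question_ids, trace_ids, answer_ids, max_len):
    # Concatenate the three parts and mask the question out of the loss.
    ids = question_ids + trace_ids + answer_ids
    loss_mask = [0] * len(question_ids) + [1] * (len(trace_ids) + len(answer_ids))
    answer_start = len(question_ids) + len(trace_ids)
    kept = min(len(ids), max_len)
    return {
        "input_ids": ids[:max_len],
        "loss_mask": loss_mask[:max_len],
        "cut_off": len(ids) > max_len,
        # How many answer tokens survive truncation and contribute to the loss.
        "answer_tokens_seen": max(0, kept - answer_start),
    }

question, trace, answer = [1, 2, 3], list(range(10, 20)), [99, 100]
short = build_example(question, trace, answer, max_len=8)
full = build_example(question, trace, answer, max_len=32)
print(short["answer_tokens_seen"], full["answer_tokens_seen"])
```

With the short limit the model is trained only on the beginning of the trace and never sees a transition into the answer, which is the failure mode the sequence-length ablation describes.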

Method Adaptation

The method is deliberately small: a short supervised fine-tune on H100s plus decoder-side control. The authors curate s1K from a 59K pool using three criteria: quality, difficulty, and diversity. Difficulty is estimated in two ways: by whether Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct fail the problem, and by reasoning-trace length measured with the Qwen2.5 tokenizer. Diversity is enforced by classifying questions into domains and sampling across 50 domains. The final set includes 1,000 questions; the authors tolerate some incorrect distilled traces because the goal is to teach a reasoning process, not only final-answer correctness.
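The difficulty and diversity criteria can be sketched as a filter followed by domain-balanced sampling. This is an illustrative approximation, not the authors' selection code: it assumes a problem is kept only if both reference models failed it, picks a domain uniformly at random at each step, and weights problems within a domain by raw trace length as a simple stand-in for "favor longer traces". All field names are hypothetical, and the quality filter is omitted.

```python
import random

def select_subset(pool, k, seed=0):
    rng = random.Random(seed)
    # Difficulty filter (assumption): keep problems both models got wrong.
    hard = [p for p in pool if not p["solved_7b"] and not p["solved_32b"]]
    by_domain = {}
    for p in hard:
        by_domain.setdefault(p["domain"], []).append(p)
    chosen = []
    while len(chosen) < k and by_domain:
        # Diversity: pick a domain uniformly at random.
        domain = rng.choice(sorted(by_domain))
        items = by_domain[domain]
        # Within the domain, longer traces are more likely to be drawn.
        # Trace lengths must be positive for the weights to be valid.
        weights = [p["trace_tokens"] for p in items]
        idx = rng.choices(range(len(items)), weights=weights, k=1)[0]
        chosen.append(items.pop(idx))
        if not items:
            del by_domain[domain]
    return chosen
```

Sampling the domain first keeps small domains represented even when a few large domains dominate the pool.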

At inference time, budget forcing manipulates the delimiters between thinking and answering. To enforce a maximum, the system appends an end-of-thinking delimiter, optionally followed by "Final Answer:", so the model must answer with its current state. To enforce a minimum, it suppresses the end-of-thinking delimiter and appends "Wait", encouraging continued reflection. This is sequential test-time scaling: later tokens condition on earlier reasoning, unlike majority voting where independent trajectories are aggregated afterward.
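The two interventions can be sketched as a small control loop around any decoder. This is a minimal sketch under stated assumptions, not the released implementation: `generate` is a hypothetical callable that returns at most `max_new` new tokens and ends with the end-of-thinking delimiter if the model chose to stop; the delimiter string, the `max_waits` cap, and counting "Wait" as a thinking token are all illustrative choices.

```python
END_THINK = "<|end_of_thinking|>"  # hypothetical delimiter string
WAIT = "Wait"
FINAL = "Final Answer:"

def budget_force(generate, min_tokens, max_tokens, max_waits=2, add_final=True):
    """Return (thinking_tokens, forced_suffix) for the answer phase."""
    thinking, waits = [], 0
    while len(thinking) < max_tokens:
        new = generate(thinking, max_tokens - len(thinking))
        stopped = bool(new) and new[-1] == END_THINK
        thinking.extend(new[:-1] if stopped else new)
        if not stopped:
            break  # cap reached, or the model produced nothing more
        if len(thinking) >= min_tokens or waits >= max_waits:
            break  # minimum satisfied, or enough nudges already
        thinking.append(WAIT)  # suppress the stop and encourage reflection
        waits += 1
    thinking = thinking[:max_tokens]  # enforce the maximum
    # Forcing the delimiter makes the model answer from its current state.
    suffix = [END_THINK] + ([FINAL] if add_final else [])
    return thinking, suffix

def fake_generate(thinking, max_new):
    # Stand-in model: thinks for three tokens, then tries to stop.
    return ([" step"] * 3 + [END_THINK])[:max_new]

tokens, suffix = budget_force(fake_generate, min_tokens=5, max_tokens=100)
print(len(tokens), tokens.count(WAIT), suffix)
```

Because each call to `generate` sees all earlier thinking tokens, the extra compute is sequential: the continuation after "Wait" conditions on the reasoning that came before it.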

The paper defines three metrics around this compute knob: Control measures adherence to the requested thinking-token range, Scaling measures the slope of accuracy versus thinking tokens, and Performance measures best achieved accuracy. The implementation therefore treats inference tokens as a budgeted resource rather than a side effect of generation.
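One way to compute the three metrics from a set of evaluation runs is sketched below. It assumes Control is the fraction of runs whose thinking-token count falls in the requested range, Scaling is the average slope over all pairs of (tokens, accuracy) points, and Performance is the maximum accuracy. The function names and sample numbers are illustrative and are not taken from the paper's tables.

```python
from itertools import combinations

def control(thinking_tokens, a_min, a_max):
    # Fraction of runs whose thinking-token count is within [a_min, a_max].
    inside = sum(1 for a in thinking_tokens if a_min <= a <= a_max)
    return inside / len(thinking_tokens)

def scaling(points):
    # Average slope of accuracy vs. thinking tokens over all pairs of runs.
    pts = sorted(points)
    slopes = [(fb - fa) / (b - a)
              for (a, fa), (b, fb) in combinations(pts, 2) if b > a]
    return sum(slopes) / len(slopes)

def performance(points):
    # Best accuracy achieved at any budget.
    return max(acc for _, acc in points)

runs = [(1000, 40.0), (2000, 50.0), (4000, 55.0)]  # made-up (tokens, accuracy)
print(control([500, 1500, 2500, 5000], 1000, 4000), scaling(runs), performance(runs))
```

A positive scaling value means accuracy tends to rise as more thinking tokens are spent; a negative value corresponds to the inverse scaling reported for rejection sampling.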

Evidence

The main result is sample efficiency plus budget control. Table 1 reports Qwen2.5-32B-Instruct at 26.7 on AIME 2024, 84.0 on MATH-500, and 49.0 on GPQA Diamond. s1 without budget forcing reaches 50.0, 92.6, and 56.6. s1-32B with budget forcing reaches 56.7, 93.0, and 59.6, exceeding o1-preview on AIME and MATH in this table while using 1K training examples.

Table 3 shows why the chosen inference control matters. Budget forcing achieves 100% control, a scaling score of 15, and the best AIME24 performance at 56.7. Token-conditional control reaches only 40% control with negative scaling; adding budget forcing to it restores 100% control but still yields lower performance than budget forcing alone. Rejection sampling has 100% control but a scaling score of -35. Appendix E shows the mechanism: without intervention, token instructions such as 1,024, 2,048, or 16,384 tokens all produce roughly 7K-8K thinking tokens; forcing the end at the budget makes the token count obey the cap.

The training sequence ablation is a second compute result. With 4,096 training sequence length, AIME24 is 30.0% with 20,721 average thinking tokens; with 32,768, AIME24 rises to 50.0% and thinking drops to 6,984. MATH-500 similarly improves from 90.0%/5,324 tokens to 91.0%/3,268 tokens. Longer training context improves both accuracy and inference efficiency because the answer section remains visible during fine-tuning.

Historical Effect

s1 is historically useful because it separates three ingredients that were often fused in o1-like systems: distilled long traces, small SFT, and explicit test-time compute control. It shows that a 32B open model can reproduce visible test-time scaling behavior without RL, process reward models, tree search, or millions of training samples. That made "reasoning effort" measurable as thinking tokens and made budget adherence a first-class evaluation target.

s1 also marks a low-cost path for open reasoning research. The paper contrasts 1K examples and 7 H100 GPU-hours with 59K examples and 394 H100 GPU-hours, and with RL-based contemporaries such as DeepSeek-R1 and Kimi k1.5. It does not replace those methods, but it gives a clean baseline for how far careful data selection and decoding-time control can go.

Limits

The method depends on distilled Gemini traces and Claude-based grading, so it is not a pure self-improvement recipe. The training traces are only partially correct by the authors' grader: 53.6% of s1K and 63.0% of the follow-up s1K-1.1 are judged correct. Decontamination uses 8-gram filtering against MATH500, GPQA Diamond, and AIME24, which helps but cannot rule out all semantic overlap.

Budget forcing is also a blunt decoder intervention. Appending "Wait" can induce useful self-correction, but further scaling can flatten, hit context limits, or produce repetitive loops. The model still does not reliably follow token or step instructions by itself. GPQA remains far below o1 in Table 1, and AIME24 has only 30 questions, so benchmark variance is high.
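The variance point can be made concrete with a rough calculation. This sketch treats each of the 30 AIME24 questions as an independent Bernoulli trial, which is a simplifying assumption rather than a claim from the paper, and reads 56.7% as 17 of 30 correct.

```python
import math

def binomial_se(accuracy, n):
    # Standard error of a proportion estimated from n independent trials.
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

n_questions = 30
points_per_question = 100.0 / n_questions               # about 3.3 points
se_points = 100.0 * binomial_se(17 / 30, n_questions)   # about 9 points

print(round(points_per_question, 2), round(se_points, 2))
```

Under this assumption a single question moves the score by about 3.3 points and the standard error is roughly 9 points, so gaps of a few points between AIME24 configurations should be read cautiously.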

Links