来源材料

资料来源

← 首页

                                                                                  s1: Simple test-time scaling


                                        Niklas Muennighoff * 1 3 4 Zitong Yang * 1 Weijia Shi * 2 3 Xiang Lisa Li * 1 Li Fei-Fei 1 Hannaneh Hajishirzi 2 3
                                                      Luke Zettlemoyer 2 Percy Liang 1 Emmanuel Candès 1 Tatsunori Hashimoto 1


                                                                 Abstract                                                    Mathematical     Competition         PhD-Level
                                                                                                                            Problem Solving       Math        Science Questions
                                             Test-time scaling is a promising new approach to                                 (MATH500)         (AIME24)       (GPQA Diamond)
                                                                                                                         95                60                60




arXiv:2501.19393v3 [cs.CL] 1 Mar 2025
                                             language modeling that uses extra test-time com-


                                                                                                              Accuracy (%)
                                             pute to improve performance. Recently, OpenAI’s                             85                40
                                             o1 model showed this capability but did not pub-                                                                50
                                             licly share its methodology, leading to many repli-                         75                20
                                             cation efforts. We seek the simplest approach to                            65                                  40
                                                                                                                                            0
                                             achieve test-time scaling and strong reasoning per-                            512     2048 512 2048 8192 1024 4096
                                             formance. First, we curate a small dataset s1K                                          Average thinking time (tokens)
                                             of 1,000 questions paired with reasoning traces
                                             relying on three criteria we validate through abla-                Figure 1. Test-time scaling with s1-32B. We benchmark s1-32B
                                             tions: difficulty, diversity, and quality. Second, we              on reasoning-intensive tasks and vary test-time compute.
                                             develop budget forcing to control test-time com-
                                             pute by forcefully terminating the model’s think-
                                                                                                                 of this approach is to increase the compute at test time to
                                             ing process or lengthening it by appending “Wait”
                                                                                                                 get better results. There has been much work exploring this
                                             multiple times to the model’s generation when it
                                                                                                                 idea (Snell et al., 2024; Welleck et al., 2024), and the via-
                                             tries to end. This can lead the model to double-
                                                                                                                 bility of this paradigm was recently validated by OpenAI
                                             check its answer, often fixing incorrect reasoning
                                                                                                                 o1 (OpenAI, 2024). o1 has demonstrated strong reasoning
                                             steps. After supervised finetuning the Qwen2.5-
                                                                                                                 performance with consistent gains from scaling test-time
                                             32B-Instruct language model on s1K and equip-
                                                                                                                 compute. OpenAI describes their approach as using large-
                                             ping it with budget forcing, our model s1-32B ex-
                                                                                                                 scale reinforcement learning (RL) implying the use of sizable
                                             ceeds o1-preview on competition math questions
                                                                                                                 amounts of data (OpenAI, 2024). This has led to various
                                             by up to 27% (MATH and AIME24). Further,
                                                                                                                 attempts to replicate their models relying on techniques like
                                             scaling s1-32B with budget forcing allows extrap-
                                                                                                                 Monte Carlo Tree Search (Gao et al., 2024b; Zhang et al.,
                                             olating beyond its performance without test-time
                                                                                                                 2024a), multi-agent approaches (Qin et al., 2024), and oth-
                                             intervention: from 50% to 57% on AIME24. Our
                                                                                                                 ers (Wang et al., 2024a; Huang et al., 2024b; 2025). Among
                                             model, data, and code are open-source at https:
                                                                                                                 these approaches, DeepSeek R1 (DeepSeek-AI et al., 2025)
                                             //github.com/simplescaling/s1.
                                                                                                                 has successfully replicated o1-level performance, also em-
                                                                                                                 ploying reinforcement learning via millions of samples and
                                        1. Introduction                                                          multiple training stages. However, despite the large num-
                                                                                                                 ber of o1 replication attempts, none have openly replicated
                                        Performance improvements of language models (LMs) over                   a clear test-time scaling behavior. Thus, we ask: what is
                                        the past years have largely relied on scaling up train-time              the simplest approach to achieve both test-time scaling and
                                        compute using large-scale self-supervised pretraining (Ka-               strong reasoning performance?
                                        plan et al., 2020; Hoffmann et al., 2022). The creation of
                                                                                                                We show that training on only 1,000 samples with next-token
                                        these powerful models has set the stage for a new scaling
                                                                                                                prediction and controlling thinking duration via a simple
                                        paradigm built on top of them: test-time scaling. The aim
                                                                                                                test-time technique we refer to as budget forcing leads to
                                           *
                                             Equal contribution. ZY and NM started the project. WS, NM          a strong reasoning model that scales in performance with
                                        and ZY collected the prompts, XL, ZY and NM, built the data             more test-time compute. Specifically, we construct s1K,
                                        pipeline, LZ and WS proposed using a 1K subset and ZY and               which consists of 1,000 carefully curated questions paired
                                        NM built budget forcing. 1 Stanford University. 2 University of
                                        Washington, Seattle. 3 Allen Institute for AI. 4 Contextual AI.         with reasoning traces and answers distilled from Gemini
                                                                                                                Thinking Experimental (Google, 2024). We perform super-
                                                                                                                vised fine-tuning (SFT) of an off-the-shelf pretrained model

                                                                                                          1
                                                   s1: Simple test-time scaling

on our small dataset requiring just 26 minutes of training             sity: Datasets should stem from various fields to cover differ-
on 16 H100 GPUs. After training, we control the amount                 ent reasoning tasks. We collect datasets of two categories:
of test-time compute our model spends using budget forc-
ing: (I) If the model generates more thinking tokens than              Curation of existing datasets Our largest source is Nu-
a desired limit, we forcefully end the thinking process by             minaMATH (LI et al., 2024) with 30,660 mathematical
appending an end-of-thinking token delimiter. Ending the               problems from online websites. We also include histori-
thinking this way makes the model transition to generating             cal AIME problems (1983-2021). To enhance diversity,
its answer. (II) If we want the model to spend more test-time          we add OlympicArena (Huang et al., 2024a) with 4,250
compute on a problem, we suppress the generation of the                questions spanning Astronomy, Biology, Chemistry, Com-
end-of-thinking token delimiter and instead append “Wait”              puter Science, Geography, Mathematics, and Physics from
to the model’s current reasoning trace to encourage more               various Olympiads. OmniMath (Gao et al., 2024a) adds
exploration. Equipped with this simple recipe – SFT on                 4,238 competition-level mathematics problems. We also
1,000 samples and test-time budget forcing – our model s1-             include 2,385 problems from AGIEval (Zhong et al., 2023),
32B exhibits test-time scaling (Figure 1). Further, s1-32B is          which features questions from standardized tests like SAT
the most sample-efficient reasoning model and outperforms              and LSAT, covering English, Law, and Logic. We refer to
closed-source models like OpenAI’s o1-preview (Figure 2).              Table 7 in §C for our other sources.
We conduct extensive ablation experiments targeting (a) our
selection of 1,000 (1K) reasoning samples and (b) our test-            New datasets in quantitative reasoning To comple-
time scaling. For (a), we find that jointly incorporating              ment these existing datasets, we create two original
difficulty, diversity, and quality measures into our selec-            datasets. s1-prob consists of 182 questions from the prob-
tion algorithm is important. Random selection, selecting               ability section of Stanford University’s Statistics Depart-
samples with the longest reasoning traces, or only selecting           ment’s PhD Qualifying Exams (https://statistics.
maximally diverse samples all lead to significantly worse              stanford.edu), accompanied by handwritten solutions
performance (around −30% on AIME24 on average). Train-                 that cover difficult proofs. The probability qualifying exam
ing on our full data pool of 59K examples, a superset of               is held yearly and requires professional-level mathemati-
s1K, does not offer substantial gains over our 1K selection.           cal problem-solving. s1-teasers comprises 23 challenging
This highlights the importance of careful data selection and           brain-teasers commonly used in interview questions for
echoes prior findings for instruction tuning (Zhou et al.,             quantitative trading positions. Each sample consists of a
2023). For (b), we define desiderata for test-time scaling             problem and solution taken from PuzzledQuant (https:
methods to compare different approaches. Budget forcing                //www.puzzledquant.com/). We only take examples
leads to the best scaling as it has perfect controllability with       with the highest difficulty level ("Hard").
a clear positive slope leading to strong performance.                  For each question, we generate a reasoning trace and solu-
In summary, our contributions are: We develop simple meth-             tion using the Google Gemini Flash Thinking API (Google,
ods for creating a sample-efficient reasoning dataset (§2)             2024) extracting its reasoning trace and response. This yields
and test-time scaling (§3); Based on these we build s1-32B             59K triplets of a question, generated reasoning trace, and
which is competitive with o1-preview (§4); We ablate sub-              generated solution. Examples from our dataset are in §D.2.
tleties of data (§5.1) and test-time scaling (§5.2). We end            We decontaminate all samples against our evaluation ques-
with a discussion to motivate future work on simple rea-               tions (MATH500, GPQA Diamond, AIME24; §C.5) using
soning (§6). Our code, model, and data are open-source at              8-grams and deduplicate the data.
https://github.com/simplescaling/s1.
                                                                       2.2. Final selection of 1K samples
2. Reasoning data curation to create s1K                               We could directly train on our pool of 59K questions, how-
                                                                       ever, our goal is to find the simplest approach with minimal
In this section, we describe our process for creating a large          resources. Thus, we go through three stages of filtering to
dataset first in §2.1 and then filtering it down to s1K in §2.2.       arrive at a minimal set of 1,000 samples relying on our three
                                                                       guiding data principles: Quality, Difficulty, and Diversity.
2.1. Initial collection of 59K samples
We collect an initial 59,029 questions from 16 sources follow-         Quality We first remove any questions where we ran into
ing three guiding principles. Quality: Datasets should be              any API errors reducing our dataset to 54,116 samples. Next,
high-quality; we always inspect samples and ignore datasets            we filter out low-quality examples by checking if they contain
with, e.g., poor formatting; Difficulty: Datasets should be            any string patterns with formatting issues, such as ASCII
challenging and require significant reasoning effort; Diver-           art diagrams, non-existent image references, or inconsistent
                                                                       question numbering reducing our dataset to 51,581 examples.

                                                                   2
                                                                                                             s1: Simple test-time scaling


                                                                       General
                                     Geo−                Statistics    topology
                        Approxi−    physics
                                                                                    General

                                                                                                                                                100 Most sample-efficient
                         mation            Convex
                         theory
                                          geometry                                  relativity
                                                                                                   Special
                                                                                                  functions
                    Integral    Operations         Linear             Probability




                                                                                                                             MATH500 Accuracy (%)
                  transform      research          algebra              theory
                                                                                    Computer                 Comm−
          Math−                                                                      science                 utative
                                                Group                                                        algebra

                                                                                                                                                                                      r1-distill
         ematical Statistical

                                                                                                                                                    95 s1
                                                theory
          logic   mechanics
                                                           Real          Biology             Numerical Integral
                                                         functions                            analysis equations
                                                                                                                                                                         Bespoke-Stratos
      Differential
                                   Electro−
      geometry Dynamical dynamics
                     systems                                                      Complex                    Measure
                                                                                                                                                                                                   QwQ
                                                                                                                                                    90
                                                                 Number           functions                   theory
           Math−                                 Geometry         theory
          ematics              Difference
         education             equations                                                                          Partial
                                                                                                 Calculus differential
                                                                             Quantum

                                                                                                                                                                                                   o1-preview
                   Differential                                                                                  equations
      Potential                        Calculus of       Combin−              theory
       theory      equations
                                       variations         atorics
           Functional
            analysis
                               Thermo−
                               dynamics
                                                                         Field
                                                                        theory
                                                                                          Algebraic Information
                                                                                          structures theory                                         85
                      Solid
                     mech
                                              Mechanics Algebraic
                                                        systems                   Astronomy
                                                                                                       Harmonic
                                                                                                       analysis
                                                                                                                                                                         Sky-T1
                                                                                                                                                    801000
                     −anics Associative                                 Fluid                      Control




                                                                                                                                                                  17000      800000 N/A
                                   rings                              mechanics                    theory

                                              Economics Algebraic


                                                                                                                                                                 Number of Examples
                                                        geometry




Figure 2. s1K and s1-32B. (left) s1K is a dataset of 1,000 high-quality, diverse, and difficult questions with reasoning traces. (right) s1-32B,
a 32B parameter model finetuned on s1K is on the sample-efficiency frontier. See Table 1 for details on other models.


From this pool, we identify 384 samples for our final 1,000                                                                                         nomics. To select our final examples from the pool of 24,496
samples from datasets that we perceive as high-quality and                                                                                          questions, we first choose one domain uniformly at random.
not in need of further filtering (see §C.4 for details).                                                                                            Then, we sample one problem from this domain according
                                                                                                                                                    to a distribution that favors longer reasoning traces (see §C.4
Difficulty For difficulty, we use two indicators: model per-                                                                                        for details) as motivated in Difficulty. We repeat this process
formance and reasoning trace length. We evaluate two mod-                                                                                           until we have 1,000 total samples spanning 50 domains.
els on each question: Qwen2.5-7B-Instruct and Qwen2.5-                                                                                              In §5.1, we will show that using our three criteria in combi-
32B-Instruct (Qwen et al., 2024), with correctness assessed                                                                                         nation is important, as only relying on quality, diversity, or
by Claude 3.5 Sonnet comparing each attempt against the                                                                                             difficulty in isolation leads to worse datasets. Some distilled
reference solution (see §C.3 for the grading protocol). We                                                                                          generations are incorrect, which we allow in our data as we
measure the token length of each reasoning trace to indicate                                                                                        focus on capturing the reasoning process rather than entirely
problem difficulty using the Qwen2.5 tokenizer. This relies                                                                                         correct solutions. Our grader (§C.3) deems 53.6% correct
on the assumption that more difficult problems require more                                                                                         in s1K and 63.0% in our follow-up s1K-1.1 (see §A).
thinking tokens. Based on the grading, we remove questions
that either Qwen2.5-7B-Instruct or Qwen2.5-32B-Instruct
can solve correctly and thus may be too easy. By using two                                                                                          3. Test-time scaling
models we reduce the likelihood of an easy sample slipping                                                                                          3.1. Method
through our filtering due to a rare mistake on an easy ques-
tion of one of the models. This brings our total samples                                                                                            We classify test-time scaling methods into 1) Sequential,
down to 24,496, setting the stage for the next round of sub-                                                                                        where later computations depend on earlier ones (e.g., a long
sampling based on diversity. While filtering with these two                                                                                         reasoning trace), and 2) Parallel, where computations run in-
models may be optimized for our setup as we will also use                                                                                           dependently (e.g., majority voting) (Snell et al., 2024; Brown
Qwen2.5-32B-Instruct as our model to finetune, the idea of                                                                                          et al., 2024). We focus on sequential scaling as intuitively we
model-based filtering generalizes to other setups.                                                                                                  believe it should scale better, since later computations can
                                                                                                                                                    build on intermediate results, allowing for deeper reasoning
Diversity To quantify diversity, we classify questions into                                                                                         and iterative refinement. We propose new sequential scaling
domains using Claude 3.5 Sonnet based on the Mathematics                                                                                            methods and ways to benchmark them.
Subject Classification (MSC) system (e.g., geometry, com-
binatorics, etc.) from the American Mathematical Society.1                                                                                          Budget forcing We propose a simple decoding-time in-
The taxonomy focuses on topics in mathematics but also                                                                                              tervention by forcing a maximum and/or minimum number
includes other sciences such as biology, physics, and eco-                                                                                          of thinking tokens. Specifically, we enforce a maximum
  1
    https://mathscinet.ams.org/mathscinet/                                                                                                          token count by simply appending the end-of-thinking token
msc/msc2020.html                                                                                                                                    delimiter and optionally “Final Answer:” to early exit

                                                                                                                                          3
                                                                    s1: Simple test-time scaling

  How many r in raspberry?                                         Question          three metrics:
  Let's break down the process of counting the letter 'r' in the                                            1 ∑
                                                                                               Control =          𝕀(𝑎 ≤ 𝑎 ≤ 𝑎max )               (1)
  word "raspberry" ...
                                                                                    || 𝑎∈ min
  * First letter: 'r' - This is an 'r', count = 1.

  * Second letter: 'a' - Not an 'r', count remains 1 ...

                                                                                     where 𝑎min , 𝑎max refer to a pre-specified minimum and max-
  * Sixth letter: 'e' - Not an 'r', count remains 1.

                                                                                     imum amount of test-time compute; in our case thinking
  * Seventh letter: 'r' - This is an 'r', count = 2.

  * Eighth letter: 'y' - Not an 'r', count remains 2 ...

                                                                                     tokens. We usually only constrain 𝑎max . As tokens gener-
  The number of 'r's in "raspberry" is 2.
                                           ated correspond to the amount of test-time compute spent,
  Wait, let's re-read the question carefully. It asks "How many r                    this metric measures the extent to which a method allows
  in raspberry?" ... * r - a - s - p - b - e - r - r - y ... * First 'r' ... *       controllability over the use of that test-time compute. We
  Second 'r' ... * Third 'r' ... Count = 3 ...            Reasoning trace            report it as a percentage with 100% being perfect control.
  My initial answer of 2 was incorrect due to a quick reading of
                                                                                                               1 ∑ 𝑓 (𝑏) − 𝑓 (𝑎)
  the word. Final Answer: The final answer is                      Response
                       Scaling = (||)                               (2)
                                                                                                                   𝑎,𝑏∈
                                                                                                                         𝑏−𝑎
                                                                                                               2
                                                                                                                     𝑏>𝑎
Figure 3. Budget forcing with s1-32B. The model tries to stop
after “...is 2.”, but we suppress the end-of-thinking token delimiter                Scaling is the average slope of the piece-wise linear function.
instead appending “Wait” leading s1-32B to self-correct its answer.                  It must be positive for useful methods and larger is better.

                                                                                                       Performance = max 𝑓 (𝑎)                   (3)
                                                                                                                           𝑎∈
the thinking stage and make the model provide its current
                                                                                     Performance is simply the maximum performance the
best answer. To enforce a minimum, we suppress the gener-
                                                                                     method achieves on the benchmark. A method with mono-
ation of the end-of-thinking token delimiter and optionally
                                                                                     tonically increasing scaling achieves 100% performance on
append the string “Wait” to the model’s current reasoning
                                                                                     any benchmark in the limit. However, the methods we in-
trace to encourage the model to reflect on its current gen-
                                                                                     vestigate eventually flatten out or further scaling fails due to
eration. Figure 3 contains an example of how this simple
                                                                                     control or context window limitations.
approach can lead the model to arrive at a better answer.

                                                                                     4. Results
Baselines We benchmark budget forcing with: (I) Con-
ditional length-control methods, which rely on telling the                           4.1. Setup
model in the prompt how long it should generate for. We
group them by granularity into (a) Token-conditional con-                            Training We perform supervised finetuning on Qwen2.5-
trol: We specify an upper bound of thinking tokens in the                            32B-Instruct using s1K to obtain our model s1-32B using
prompt; (b) Step-conditional control: We specify an upper                            basic hyperparameters outlined in §D. Finetuning took 26
bound of thinking steps, where each step is around 100 to-                           minutes on 16 NVIDIA H100 GPUs with PyTorch FSDP.
kens; (c) Class-conditional control: We write two generic
prompts that tell the model to either think for a short or long                      Evaluation We select three representative reasoning
amount of time (see §E.1 for details). (II) Rejection sam-                           benchmarks widely used in the field: AIME24 (of America,
pling, which samples until a generation fits a predetermined                         2024) has 30 problems that were used in the 2024 American
compute budget. This oracle captures the posterior over                              Invitational Mathematics Examination (AIME) held from
responses conditioned on its length.                                                 January 31 – February 1, 2024. AIME tests mathematical
                                                                                     problem-solving with arithmetic, algebra, counting, geome-
                                                                                     try, number theory, probability, and other secondary school
3.2. Metrics
                                                                                     math topics. High-scoring high school students in the test are
We establish a set of desiderata as evaluation metrics to                            invited to participate in the United States of America Mathe-
measure test-time scaling across methods. Importantly, we                            matics Olympiad (USAMO). All AIME answers are integers
do not only care about the accuracy a method can achieve                             ranging from 000 to 999, inclusive. Some AIME problems
but also its controllability and test-time scaling slope. For                        rely on figures that we provide to our model using the vec-
each method we consider, we run a set of evaluations 𝑎 ∈                             tor graphics language Asymptote as it cannot take image
 varying test-time compute on a fixed benchmark, e.g.                               inputs. MATH500 (Hendrycks et al., 2021) is a benchmark
AIME24. This produces a piece-wise linear function 𝑓                                 of competition math problems of varying difficulty. We
with compute as the x-axis measured in thinking tokens and                           evaluate on the same 500 samples selected by OpenAI in
accuracy as the y-axis (see Figure 1, where the rightmost dot                        prior work (Lightman et al., 2023). GPQA Diamond (Rein
for AIME24 corresponds to 𝑓 (7320) = 57%). We measure                                et al., 2023) consists of 198 PhD-level science questions

                                                                                 4
                                                      s1: Simple test-time scaling




                                                                              Sequential scaling             60% accuracy
          Forcing 2048/4096 max thinking 

          tokens                                                              via Forcing with s1




                 Ignoring end-of-thinking 2x/4x/6x and                                                 Parallel scaling via Majority
                 appending “Wait” thus forcing it to continue                                          voting with base model
                 reasoning when it tries to stop                                                       (Qwen2.5-32B-Instruct)




              (a) Sequential scaling via budget forcing                            (b) Parallel scaling via majority voting

Figure 4. Sequential and parallel test-time scaling. (a): Budget forcing shows clear scaling trends and extrapolates to some extent. For
the three rightmost dots, we prevent the model from stopping its thinking 2/4/6 times, each time appending “Wait” to its current reasoning
trace. (b): For Qwen2.5-32B-Instruct we perform 64 evaluations for each sample with a temperature of 1 and visualize the performance
when majority voting across 2, 4, 8, 16, 32, and 64 of these.


from Biology, Chemistry and Physics. Experts with PhDs in
the corresponding domains only achieved 69.7% on GPQA                   Table 1. s1-32B is a strong open reasoning model. We evaluate
Diamond (OpenAI, 2024). When we write “GPQA” in the                     s1-32B, Qwen, and Gemini (some entries are unknown (N.A.), see
context of evaluation in this work, we always refer to the              §4). Other results are from the respective reports (Qwen et al.,
Diamond subset. We build on the “lm-evaluation-harness”                 2024; Team, 2024; OpenAI, 2024; DeepSeek-AI et al., 2025; Labs,
framework (Gao et al., 2021; Biderman et al., 2024). Unless             2025; Team, 2025). # ex. = number examples used for reasoning
otherwise specified, we evaluate with a temperature of 0                finetuning; BF = budget forcing. See §A for our better s1.1 model.
(greedy) and measure accuracy (equivalent to pass@1).                                                   AIME      MATH         GPQA
                                                                         Model               # ex.
                                                                                                        2024       500        Diamond
Other models We benchmark s1-32B against: OpenAI                                                     API only
o1 series (OpenAI, 2024), closed-source models that pop-
                                                                         o1-preview            N.A.       44.6        85.5          73.3
ularized test-time scaling; DeepSeek r1 series (DeepSeek-
                                                                         o1-mini               N.A.       70.0        90.0          60.0
AI et al., 2025), open-weight reasoning models with up to
                                                                         o1                    N.A.       74.4        94.8          77.3
o1-level performance; Qwen’s QwQ-32B-preview (Team,
                                                                         Gemini 2.0
2024), a 32B open-weight reasoning model without dis-                                          N.A.       60.0       N.A.          N.A.
                                                                         Flash Think.
closed methodology; Sky-T1-32B-Preview (Team, 2025)
and Bespoke-32B (Labs, 2025), open models with open rea-                                        Open Weights
soning data distilled from QwQ-32B-preview and r1; Google                Qwen2.5-
Gemini 2.0 Flash Thinking Experimental (Google, 2024),                                         N.A.       26.7        84.0          49.0
                                                                         32B-Instruct
the API that we distill from. As it has no official evaluation           QwQ-32B              N.A.        50.0        90.6          54.5
scores, we use the Gemini API to benchmark it ourselves.                 r1                 ≫800K         79.8        97.3          71.5
However, the “recitation error” of the Gemini API makes                  r1-distill          800K         72.6        94.3          62.1
evaluation challenging.2 We circumvent this, by manually
inserting all 30 AIME24 questions in its web interface where                           Open Weights and Open Data
the error does not appear. However, we leave out MATH500                 Sky-T1                 17K       43.3        82.4          56.8
(500 questions) and GPQA Diamond (198 questions), thus                   Bespoke-32B            17K       63.3        93.0          58.1
they are N.A. in Table 1. Our model, s1-32B, is fully open
                                                                         s1 w/o BF               1K       50.0        92.6          56.6
including weights, reasoning data, and code.
                                                                         s1-32B                  1K       56.7        93.0          59.6
  2
    https://github.com/google/
generative-ai-docs/issues/257

                                                                    5
                                                   s1: Simple test-time scaling

4.2. Performance
                                                                      Table 2. s1K data ablations. We budget force (BF) a maximum
Test-time scaling Figure 1 shows the performance of s1-                of around 30,000 thinking tokens for all scores in this table. This
32B with budget forcing scales with more test-time compute.            performs slightly better than the scores without BF (Table 1) as
In Figure 4 (left), we expand the plot from Figure 1 (middle)          it allows the model to finish with a best guess when stuck in an
showing that while we can improve AIME24 performance                   infinite loop. We report 95% paired bootstrap confidence intervals
using our budget forcing technique (§3) and more test-time             for differences relative to the s1K model using 10,000 bootstrap
compute it does eventually flatten out at six times. Suppress-         samples. E.g., the interval [-13%, 20%] means that, with 95% con-
ing the end-of-thinking token delimiter too often can lead the         fidence, the true difference between 59K-full and s1K is between
model into repetitive loops instead of continued reasoning.           -13% and +20%. If the entire interval is negative, e.g. [-27%, -3%],
In Figure 4 (right), we show that after training Qwen2.5-             we can confidently say that the performance is worse than s1K.
32B-Instruct on our 1,000 samples to produce s1-32B and                                     AIME                 MATH            GPQA
equipping it with the simple budget forcing technique, it               Model
                                                                                            2024                  500           Diamond
operates in a different scaling paradigm. Scaling test-time
compute on the base model via majority voting cannot catch                                   36.7                90.6               52.0
                                                                        1K-random
                                                                                         [-26.7%, -3.3%]     [-4.8%, 0.0%]     [-12.6%, 2.5%]
up with the performance of s1-32B which validates our
intuition from §3 that sequential scaling is more effective                                  26.7                91.2               54.6
                                                                        1K-diverse
                                                                                        [-40.0%, -10.0%]     [-4.0%, 0.2%]     [-10.1%, 5.1%]
than parallel. We provide example generations of s1-32B in
Figure 5.                                                                                    33.3                90.4               59.6
                                                                        1K-longest
                                                                                         [-36.7%, 0.0%]      [-5.0%, -0.2%]    [-5.1%, 10.1%]
                                                                                             53.3                92.8               58.1
Sample-efficiency In Figure 2 (right) and Table 1 we                    59K-full
                                                                                         [-13.3%, 20.0%]     [-2.6%, 2.2%]      [-6.6%, 8.6%]
compare s1-32B with other models. We find that s1-
32B is the most sample-efficient open data reasoning                    s1K                  50.0                93.0               57.6
model. It performs significantly better than our base model
(Qwen2.5-32B-Instruct) despite just training it on an ad-
ditional 1,000 samples. The concurrently released r1-32B              Finally, we compare with just training on all of our 59K
shows stronger performance than s1-32B while also only                samples, a superset of all the 1K-sample versions. This
using SFT (DeepSeek-AI et al., 2025). However, it is trained          leads to a strong model but uses much more resources. To
on 800 × more reasoning samples. It is an open question               finetune on 59K samples, we use 394 H100 GPU hours
whether one can achieve their performance with just 1,000             while s1-32B only required 7 H100 GPU hours. Moreover,
samples. Finally, our model nearly matches Gemini 2.0                 relying only on s1K is extremely competitive as shown in §2.
Thinking on AIME24. As the data for s1-32B is distilled               Overall, combining all three criteria – Quality, Difficulty,
from Gemini 2.0, this shows our distillation procedure was            Diversity – via our methodology in §2 is key for sample-
likely effective.                                                     efficient reasoning training.

5. Ablations                                                          5.2. Test-time scaling methods

5.1. Data Quantity, Diversity, and Difficulty
                                                                      Table 3. Ablations on methods to scale test-time compute on
In §2 we outlined our three guiding principles in curating            AIME24. || refers to the number of evaluation runs used to
s1K: Quality, Difficulty, and Diversity. Here we test the             estimate the properties; thus a higher value indicates more robust-
importance of combining them and the overall efficacy of our          ness. Bold indicates our chosen method and the best values. BF
selection. Only Quality (1K-random): After obtaining our              = budget forcing, TCC/SCC/CCC = token/step/class-conditional
high-quality reasoning chains from Gemini, we select 1,000            control, RS = rejection sampling.
samples at random; not relying on our difficulty and diversity
                                                                        Method          Control       Scaling        Performance           ||
filtering at all. Table 2 shows this approach performs much
worse than s1K across all benchmarks. Only Diversity                    BF               100%              15                56.7           5
(1K-diverse): For this dataset, we sample uniformly across              TCC               40%              -24               40.0           5
domains to maximize diversity disregarding any notion of                TCC + BF         100%              13                40.0           5
difficulty. This approach also leads to poor performance                SCC               60%               3                36.7           5
similar to 1K-random. Only Difficulty (1K-longest): Here                SCC + BF         100%               6                36.7           5
we rely on one of our difficulty indicators introduced in §2 by         CCC               50%              25                36.7           2
selecting the 1,000 samples with the longest reasoning traces.
This approach significantly boosts GPQA performance but                 RS               100%              -35               40.0           5
overall still falls short of using s1K. Maximize Quantity:

                                                                  6
                                                         s1: Simple test-time scaling

[AIME24]                                                     [MATH500]                                [GPQA]

Alice and Bob play the following game. A stack of            Find the projection of 𝑎 onto 𝑏 =         A quantum mechanical particle of
                                                             ( )
𝑛 tokens lies before them. The players take turns             2                                       mass 𝑚 moves in two dimensions in
with Alice going first. On each turn, the player              6 if 𝑎 ⋅ 𝑏 = 8.                         the following potential, as a function
removes either 1 token or 4 tokens from the stack.            3                                       of the polar coordinates (𝑟, 𝜃):
Whoever removes the last token wins. Find the                The projection of vector 𝐚 onto vector
number of positive integers 𝑛 less than or equal to          𝐛 is given by the formula:                  𝑉 (𝑟, 𝜃) =
                                                                                                                      1 2 3 2
                                                                                                                        𝑘𝑟 + 𝑘𝑟 cos2 (𝜃)
2024 for which there exists a strategy for Bob that                                                                   2     2
guarantees that Bob will win the game regardless                                   𝐚⋅𝐛
of Alice’s play.                                                       proj𝐛 𝐚 =        𝐛             Find the energy spectrum. Hint:
                                                                                   ‖𝐛‖2               Write the potential in Cartesian co-
The problem asks for the number of positive integers                         ( )                      ordinates.
                                                                              2
𝑛 ≤ 2024 such that Bob has a winning strategy. This
                                                             We are given 𝐛 = 6 and 𝐚 ⋅ 𝐛 = 8.        The problem asks for the energy spec-
is a typical impartial game that can be analyzed using                        3
the Sprague-Grundy theorem or by directly finding                                                     trum of a particle in a given 2D po-
the winning and losing positions. Let 𝑊 be the set of                                                 tential. The potential is given in polar
winning positions (number of tokens 𝑛 such that the          First, we need to find the magnitude     coordinates (𝑟, 𝜃):
player whose turn it is can force a win) and 𝐿 be the        squared of 𝐛, which is ‖𝐛‖2 .
set of losing positions. [...]                                                                                        1 2 3 2
                                                                                                         𝑉 (𝑟, 𝜃) =     𝑘𝑟 + 𝑘𝑟 cos2 (𝜃).
                                                             ‖𝐛‖2 = 𝐛⋅𝐛 = 22 +62 +32 = 4+36+9 = 49.                   2     2
Let 𝑃 (𝑛) be the statement "the player whose turn it is
                                                                                                      We can rewrite the potential as:
when there are 𝑛 tokens has a winning strategy. 𝑃 (0)
is false (the player whose turn it is and faces 0 tokens                                                              1 2
has already lost). Consider the small values of 𝑛:           Now we can plug the values into the         𝑉 (𝑟, 𝜃) =     𝑘𝑟 (1 + 3 cos2 (𝜃)).
𝑛 = 1: The player whose turn it is can take 1 token          projection formula:                                      2
and win. So 𝑃 (1) is true. 1 ∈ 𝑊 .                                                                     The Hamiltonian for a particle in 2D
𝑛 = 2: The player whose turn it is can take 1 token,                                   8
                                                                        proj𝐛 𝐚 =        𝐛            is given by:
leaving 1 token for the opponent. Since 1 ∈ 𝑊 , the                                   49
opponent wins. So 𝑃 (2) is false. 2 ∈ 𝐿.                                                                          (              )
                                                             Substitute the vector 𝐛:                      ℏ2          𝜕2   𝜕2
𝑛 = 3: The player whose turn it is can take 1 token,                                                  𝐻 =−                +          + 𝑉 (𝑥, 𝑦).
leaving 2 tokens for the opponent. Since 2 ∈ 𝐿, the                                 ( )                    2𝑚         𝜕𝑥2 𝜕𝑦2
                                                                                      2
player whose turn it is wins. So 𝑃 (3) is true. 3 ∈                              8
                                                                      proj𝐛 𝐚 =       6               In polar coordinates (𝑟, 𝜃), the Lapla-
𝑊 .[...]                                                                        49 3                  cian operator is:
We are looking for the number of positive integers                                                                   (    )
𝑛 ≤ 2024 for which Bob has a winning strategy. This                                                              1 𝜕    𝜕     1 𝜕2
                                                                                                          ∇2 =        𝑟     + 2 2.
means that if the game starts with 𝑛 tokens, and Alice       Multiply the scalar by each compo-                  𝑟 𝜕𝑟 𝜕𝑟     𝑟 𝜕𝜃
goes first, Bob can win. This happens if and only if         nent of the vector:
                                                                                                      So the Hamiltonian in polar coordi-
𝑛 is a losing position (because if 𝑛 ∈ 𝐿, then Alice’s                                                nates is:
first move must lead to a winning position for Bob).                        ⎛ 8 × 2⎞ ⎛ 16 ⎞
[...]                                                                       ⎜ 498   ⎟ ⎜ 4948 ⎟                   [    (    )         ]
                                                                  proj𝐛 𝐚 = ⎜ 49 × 6⎟ = ⎜ 49 ⎟                 ℏ2 1 𝜕    𝜕     1 𝜕2
Let’s try to find a pattern in the losing positions                         ⎜ 8 × 3⎟ ⎜ 24 ⎟           𝐻 =−             𝑟     + 2 2 +𝑉 (𝑟, 𝜃).
                                                                            ⎝ 49    ⎠ ⎝ 49 ⎠                  2𝑚 𝑟 𝜕𝑟 𝜕𝑟       𝑟 𝜕𝜃
𝐿 = {0, 2, 5, 7, 10, 12, 15, 17, 20, 22, ...}. Differences
between consecutive terms: 2, 3, 2, 3, 2, 3, 2, 3, 2, ...
                                                                                                        The time-independent Schrödinger
The pattern of differences is alternating 2, 3. [...]
                                                                                                      equation is 𝐻𝜓 = 𝐸𝜓. We can try
The losing positions are of the form 5𝑚 or 5𝑚 + 2,           Final Answer: The final answer is        to separate variables by assuming a so-
where 𝑚 ≥ 0. We need to find the number of positive
                                                                                                      lution of the form 𝜓(𝑟, 𝜃) = 𝑅(𝑟)Θ(𝜃).
integers 𝑛 ≤ 2024 that are in the set of losing positions
                                                                             ⎛ 16 ⎞                   [...]
𝐿. [...]
                                                                               48 ⎟
                                                                             ⎜ 49
Case 1: 𝑛 = 5𝑚, where 𝑚 ≥ 1. 1 ≤ 5𝑚 ≤ 2024                                   ⎜ 49 ⎟                   This confirms the energy spectrum is:
1∕5 ≤ 𝑚 ≤ 2024∕5 = 404.8 Since 𝑚 is an integer,                              ⎜ 24 ⎟
1 ≤ 𝑚 ≤ 404. The number of values of 𝑚 is 404.                               ⎝ 49 ⎠                              √      (             )
Case 2: 𝑛 = 5𝑚 + 2, where 𝑚 ≥ 0. 5𝑚 + 2 ≤ 2024                                                            𝐸=ℏ
                                                                                                                      𝑘
                                                                                                                         2𝑛𝑥 + 𝑛𝑦 +
                                                                                                                                    3
                                                                                                                                        .
5𝑚 ≤ 2022 𝑚 ≤ 2022∕5 = 404.4 Since 𝑚 is an in-                                                                        𝑚             2
teger, 0 ≤ 𝑚 ≤ 404. The number of values of 𝑚 is
404 − 0 + 1 = 405.                                                                                    Final
                                                                                                       √ ( Answer: )              𝐸            =
                                                                                                      ℏ 𝑚𝑘 2𝑛𝑥 + 𝑛𝑦 + 23 .
The total number of positive integers 𝑛 ≤ 2024 that
are losing positions is 404 + 405 = 809 .


Figure 5. Example model outputs. We pick a question from AIME24 (left), MATH500 (middle), and GPQA (right), where our model
generates the correct answer. The black text is the prompt, the light blue text is the reasoning trace, and the blue text is the answer of
s1-32B. The gray ellipsis [...] indicates that the text was trimmed to fit this page, but the generated text is actually longer.

                                                                       7
                                                           s1: Simple test-time scaling

   Budget forcing In Table 3 we compare the test-time scal-                     spect a question, which was answered correctly by the model
   ing methods we have introduced in §3. Overall, we find that                  when rejection sampling for ≤ 4000, but not for the ≤ 8000
   budget forcing provides perfect control, good scaling, and                   token setting. In the ≤ 4000 setting the model directly jumps
   leads to our best AIME24 score. Thus, this is the method                     to the correct approach, while for the ≤ 8000 setting it back-
   we use for s1-32B in Figure 1 and in §4. In Table 4, we                      tracks a lot. We hypothesize that there is a correlation such
   compare different strings for extrapolating performance. We                  that shorter generations tend to be the ones where the model
   find that “Wait” generally gives the best performance.                       was on the right track from the start, whereas longer ones
                                                                                tend to be ones where the model made mistakes and thus
   Class-conditional control We provide benchmark scores
                                                                                backtracks or questions itself. This leads to longer samples
   for this method in §E.1 and summarize three findings here:
                                                                                often being wrong when rejection sampling and thus the
   (1) Token-conditional control fails without budget forcing, as
                                                                                inverse scaling trend.
   our model cannot reliably count tokens - even when trained
   to do so. (2) Under step-conditional control, the model gen-
   erates a similar total number of tokens when given different                 6. Discussion and related work
   step targets, as the model goes from few steps with many
                                                                                6.1. Sample-efficient reasoning
   tokens per step, to many steps with few tokens in each step.
   Thus, the model learns to hack its way around the com-                       Models There are a number of concurrent efforts to build
   pute constraint making the controllability of this method                    models that replicate the performance of o1 (OpenAI, 2024).
   mediocre. (3) Class-conditional control can work - telling a                 For example, DeepSeek-r1 and k1.5 (DeepSeek-AI et al.,
   model to simply think longer can increase its test-time com-                 2025; Team et al., 2025) are built with reinforcement learn-
   pute and performance, which leads good scaling in Table 3.                   ing methods, while others rely on SFT using tens of thou-
                                                                                sands of distilled examples (Team, 2025; Xu et al., 2025;
                                                                                Labs, 2025). We show that SFT on only 1,000 examples
   Table 4. Budget forcing extrapolation ablations. We compare
                                                                                suffices to build a competitive reasoning model matching
   ignoring the end-of-thinking delimiter twice and appending none
                                                                                o1-preview and produces a model that lies on the pareto
   or various strings.
                                                                                frontier (Figure 2). Further, we introduce budget forcing
                                    AIME    MATH         GPQA                   which combined with our reasoning model leads to the first
               Model
                                    2024     500        Diamond                 reproduction of OpenAI’s test-time scaling curves (OpenAI,
               No extrapolation     50.0     93.0          57.6                 2024). Why does supervised finetuning on just 1,000 sam-
                                                                                ples lead to such performance gains? We hypothesize that
               2x without string    50.0     90.2          55.1
                                                                                the model is already exposed to large amounts of reason-
               2x “Alternatively”   50.0     92.2          59.6
                                                                                ing data during pretraining which spans trillions of tokens.
               2x “Hmm”             50.0     93.0          59.6
                                                                                Thus, the ability to perform reasoning is already present in
               2x “Wait”            53.3     93.0          59.6
                                                                                our model. Our sample-efficient finetuning stage just acti-
                                                                                vates it and we scale it further at test time with budget forcing.
   Rejection sampling Surprisingly, we find that simply sam-                    This is similar to the "Superficial Alignment Hypothesis"
   pling until the generation fits a specific length leads to an                presented in LIMA (Zhou et al., 2023), where the authors
   inverse scaling trend as depicted in Figure 6. In §E.2 we in-                find that 1,000 examples can be sufficient to align a model
                                                                                to adhere to user preferences.

           42
                                                                                Benchmarks and methods To evaluate and push the lim-
           37

Accuracy (%)
                                                                                its of these models, increasingly challenging benchmarks
                                                                                have been introduced, such as Olympiad-level science com-
           32                                                                   petitions (He et al., 2024; Jain et al., 2024; Zhong et al.,
                                                                                2023) and others (Srivastava et al., 2023; Glazer et al., 2024;
           27                                                                   Su et al., 2024; Kim et al., 2024; Phan et al., 2025). To
                                                                                enhance models’ performance on reasoning-related tasks, re-
           22 3072               4096          5120                             searchers have pursued several strategies: Prior works have
                  Average thinking time (tokens)                                explored continuing training language models on special-
                                                                                ized corpora related to mathematics and science (Azerbayev
   Figure 6. Rejection sampling on AIME24 with s1-32B. We sam-                  et al., 2023; Yang et al., 2024), sometimes even synthetically
   ple with a temperature of 1 until all generations have less than (from       generated data (Yu et al., 2024). Others have developed
   left to right) 3500, 4000, 5000, 8000, and 16000 thinking tokens             training methodologies specifically aimed at reasoning per-
   requiring an average of 655, 97, 8, 3, 2, and 1 tries per sample.            formance (Zelikman et al., 2022; 2024; Luo et al., 2025;

                                                                            8
                                                  s1: Simple test-time scaling

Yuan et al., 2025; Wu et al., 2024a). Another significant line                  60
of work focuses on prompting-based methods to elicit and



                                                                     Accuracy (%)
improve reasoning abilities, including methods like Chain-                      50
of-Thought prompting (Wei et al., 2023; Yao et al., 2023a;b;                                                   REBASE
Bi et al., 2023; Fu et al., 2023; Zhang et al., 2024b; Xi-
                                                                                40                             Majority
ang et al., 2025; Hu et al., 2024; Diao et al., 2024). These                                                   Sequential
combined efforts aim to advance the reasoning ability of
                                                                                30                             scaling
language models, enabling them to handle more complex
and abstract tasks effectively.
                                                                                2048      8192       32768       131072
                                                                                       Average thinking time (tokens)
6.2. Test-time scaling                                                  Figure 7. Scaling further with parallel scaling methods. All met-
Methods As we introduce in §3, we differentiate two meth-               rics averaged over the 30 questions in AIME24. Average thinking
ods to scale test-time compute: parallel and sequential. The            tokens for REBASE do not account for the additional compute from
former relies on multiple solution attempts generated in par-           the reward model. For sequential scaling, we prompt the model
allel and selecting the best outcome via specific criteria.             to use up to (from left to right) 32, 64, 256, and 512 steps. For
These criteria include choosing the most frequent response              REBASE and majority voting we generate 16 parallel trajectories
for majority voting or the best response based on an exter-             to aggregate across.
nal reward for Best-of-N (Brown et al., 2024; Irvine et al.,
2023; Levi, 2024). Unlike repeated sampling, previous se-
quential scaling methods let the model generate solution
                                                                       such extrapolation? There may be improvements to budget
attempts sequentially based on previous attempts, allowing
                                                                       forcing such as rotating through different strings, not only
it to refine each attempt based on previous outcomes (Snell
                                                                       “Wait”, or combining it with frequency penalties or higher
et al., 2024; Hou et al., 2025; Lee et al., 2025). Tree-based
                                                                       temperature to avoid repetitive loops. An exciting direc-
search methods (Gandhi et al., 2024; Wu et al., 2024b) offer
                                                                       tion for future work is also researching whether applying
a hybrid approach between sequential and parallel scaling,
                                                                       budget forcing to a reasoning model trained with reinforce-
such as Monte-Carlo Tree Search (MCTS) (Liu et al., 2024;
                                                                       ment learning yields better extrapolation; or if RL allows for
Zhang et al., 2023; Zhou et al., 2024; Choi et al., 2023) and
                                                                       new ways of test-time scaling beyond budget forcing. Our
guided beam search (Xie et al., 2023). REBASE (Wu et al.,
                                                                       work defines the right metrics (§3.2) – Control, Scaling, and
2024b) employs a process reward model to balance exploita-
                                                                       Performance – to enable future research and progress on
tion and pruning during tree search. Empirically, REBASE
                                                                       extrapolating test-time compute.
has been shown to outperform sampling-based methods and
MCTS (Wu et al., 2024b). Reward models (Lightman et al.,
2023; Wang et al., 2024b;c) play a key role in these meth-
ods. They come in two variants: outcome reward models                   Parallel scaling as a solution Parallel scaling offers one
and process reward models. Outcome reward models (Xin                   solution to the limits of sequential scaling, thus we aug-
et al., 2024; Ankner et al., 2024) assign a score to complete           ment our sequentially scaled model with two methods: (I)
solutions and are particularly useful in Best-of-N selection,           Majority voting: After generating 𝑘 solutions, the final
while process reward models (Lightman et al., 2023; Wang                solution is the most frequent one across generations; (II)
et al., 2024b; Wu et al., 2024b) assess individual reasoning            Tree search via REBASE: We use the REBASE process
steps and are effective in guiding tree-based search methods.           reward model, which is initialized from LLaMA-34B and
                                                                        further finetuned on a synthetic process reward modeling
                                                                        dataset (Wu et al., 2024b). We then aggregate the solutions
Limits to further test-time scaling We have shown that
                                                                        generated by REBASE via majority voting. As shown in
budget forcing allows extrapolating test-time compute in
                                                                        Figure 7, augmenting our model with REBASE scales better
§4, e.g., improving AIME24 performance from 50% to 57%.
                                                                        than majority voting, and even sequential scaling in this sce-
However, it has two key limitations when scaling further: it
                                                                        nario. However, REBASE requires an additional forward
eventually flattens out (Figure 4), and the context window
                                                                        pass at each step for the reward model adding some computa-
of the underlying language model constrains it. Despite
                                                                        tion overhead. For sequential scaling, when prompted to use
these constraints, our work shows test-time scaling across a
                                                                        up to 512 steps, for 12 out of the 30 evaluation questions the
wide range of accuracies (Figure 1), partly because scaling
                                                                        model generates a response that exceeds the context window
down test-time compute behaves predictably and does not
                                                                        leading to a large performance drop. Overall, we find that
suffer from these constraints.
                                                                        these parallel scaling methods complement sequential scal-
Continuing test-time scaling will require approaches that               ing thus they offer an avenue for scaling test-time compute
can further extrapolate test-time compute. How can we get               even further; beyond fixed context windows.

                                                                 9
                                                   s1: Simple test-time scaling

Impact Statement                                                         inference compute with repeated sampling, 2024. URL
                                                                         https://arxiv.org/abs/2407.21787.
Language models with strong reasoning capabilities have
the potential to greatly enhance human productivity, from              Cesista, F. L. Multimodal structured generation: Cvpr’s 2nd
assisting in complex decision-making to driving scientific               mmfm challenge technical report, 2024. URL https:
breakthroughs. However, recent advances in reasoning, such               //arxiv.org/abs/2406.11403.
as OpenAI’s o1 and DeepSeek’s r1, lack transparency, limit-
ing broader research progress. Our work aims to push the               Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y., Ma, X.,
frontier of reasoning in a fully open manner, fostering in-              Xu, J., Wang, X., and Xia, T. Theoremqa: A theorem-
novation and collaboration to accelerate advancements that               driven question answering dataset, 2023. URL https:
ultimately benefit society.                                              //arxiv.org/abs/2305.12524.

                                                                       Choi, S., Fang, T., Wang, Z., and Song, Y. Kcts: Knowledge-
Acknowledgements                                                         constrained tree search decoding with token-level halluci-
                                                                         nation detection, 2023. URL https://arxiv.org/
We thank Ryan Marten for generating traces from DeepSeek                 abs/2310.09044.
r1 for s1.1 using Bespoke Curator (Marten et al., 2025).
This work was partly conducted using the Stanford Marlowe              DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang,
GPU cluster (Kapfer et al., 2025), made possible by financial            R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X.,
support from Stanford University. We thank Alexander M.                  Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao,
Rush, Andrew Ilyas, Banghua Zhu, Chenglei Si, Chunting                   Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu,
Zhou, John Yang, Ludwig Schmidt, Samy Jelassi, Suhas                     C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D.,
Kotha, Tengyu Ma, Xuechen Li, Yu Sun, and Yue Zhang for                  Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao,
very constructive discussions.                                           G., Chen, G., Li, G., Zhang, H., Bao, H., Xu, H., Wang,
                                                                         H., Ding, H., Xin, H., Gao, H., Qu, H., Li, H., Guo, J.,
References                                                               Li, J., Wang, J., Chen, J., Yuan, J., Qiu, J., Li, J., Cai,
                                                                         J. L., Ni, J., Liang, J., Chen, J., Dong, K., Hu, K., Gao,
Ankner, Z., Paul, M., Cui, B., Chang, J. D., and Am-                     K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L.,
  manabrolu, P. Critique-out-loud reward models, 2024.                   Zhao, L., Wang, L., Zhang, L., Xu, L., Xia, L., Zhang,
  URL https://arxiv.org/abs/2408.11791.                                  M., Zhang, M., Tang, M., Li, M., Wang, M., Li, M., Tian,
                                                                         N., Huang, P., Zhang, P., Wang, Q., Chen, Q., Du, Q.,
Arora, D., Singh, H. G., and Mausam. Have llms advanced                  Ge, R., Zhang, R., Pan, R., Wang, R., Chen, R. J., Jin,
  enough? a challenging problem solving benchmark for                    R. L., Chen, R., Lu, S., Zhou, S., Chen, S., Ye, S., Wang,
  large language models, 2023. URL https://arxiv.                        S., Yu, S., Zhou, S., Pan, S., Li, S. S., Zhou, S., Wu,
  org/abs/2305.15074.                                                    S., Ye, S., Yun, T., Pei, T., Sun, T., Wang, T., Zeng, W.,
Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D.,                Zhao, W., Liu, W., Liang, W., Gao, W., Yu, W., Zhang,
  McAleer, S., Jiang, A. Q., Deng, J., Biderman, S., and                 W., Xiao, W. L., An, W., Liu, X., Wang, X., Chen, X.,
 Welleck, S. Llemma: An open language model for math-                    Nie, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yang, X.,
  ematics, 2023.                                                         Li, X., Su, X., Lin, X., Li, X. Q., Jin, X., Shen, X., Chen,
                                                                         X., Sun, X., Wang, X., Song, X., Zhou, X., Wang, X.,
Bi, Z., Zhang, N., Jiang, Y., Deng, S., Zheng, G., and                   Shan, X., Li, Y. K., Wang, Y. Q., Wei, Y. X., Zhang, Y.,
  Chen, H. When do program-of-thoughts work for reason-                  Xu, Y., Li, Y., Zhao, Y., Sun, Y., Wang, Y., Yu, Y., Zhang,
  ing?, 2023. URL https://arxiv.org/abs/2308.                            Y., Shi, Y., Xiong, Y., He, Y., Piao, Y., Wang, Y., Tan,
  15452.                                                                 Y., Ma, Y., Liu, Y., Guo, Y., Ou, Y., Wang, Y., Gong,
                                                                         Y., Zou, Y., He, Y., Xiong, Y., Luo, Y., You, Y., Liu, Y.,
Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow,                Zhou, Y., Zhu, Y. X., Xu, Y., Huang, Y., Li, Y., Zheng,
  J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black,                Y., Zhu, Y., Ma, Y., Tang, Y., Zha, Y., Yan, Y., Ren, Z. Z.,
  S., Clive, J., DiPofi, A., Etxaniz, J., Fattori, B., Forde,            Ren, Z., Sha, Z., Fu, Z., Xu, Z., Xie, Z., Zhang, Z., Hao,
  J. Z., Foster, C., Hsu, J., Jaiswal, M., Lee, W. Y., Li, H.,           Z., Ma, Z., Yan, Z., Wu, Z., Gu, Z., Zhu, Z., Liu, Z., Li,
  Lovering, C., Muennighoff, N., Pavlick, E., Phang, J.,                 Z., Xie, Z., Song, Z., Pan, Z., Huang, Z., Xu, Z., Zhang,
  Skowron, A., Tan, S., Tang, X., Wang, K. A., Winata,                   Z., and Zhang, Z. Deepseek-r1: Incentivizing reasoning
  G. I., Yvon, F., and Zou, A. Lessons from the trenches on              capability in llms via reinforcement learning, 2025. URL
  reproducible evaluation of language models, 2024.                      https://arxiv.org/abs/2501.12948.

Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré,        Diao, S., Wang, P., Lin, Y., Pan, R., Liu, X., and Zhang,
  C., and Mirhoseini, A. Large language monkeys: Scaling                 T. Active prompting with chain-of-thought for large lan-

                                                                  10
                                                   s1: Simple test-time scaling

  guage models, 2024. URL https://arxiv.org/                             Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu,
  abs/2302.12246.                                                        K. R., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J.,
                                                                         Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik,
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A.,             A., Nam, C., Peters, M. E., Pyatkin, V., Ravichander, A.,
  Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A.,               Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani,
  Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravanku-               N., Wortsman, M., Dasigi, P., Lambert, N., Richardson,
  mar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A.,                K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith,
  Rodriguez, A., Gregerson, A., et al. The llama 3 herd                  N. A., and Hajishirzi, H. Olmo: Accelerating the science
  of models, 2024. URL https://arxiv.org/abs/                            of language models, 2024.
  2407.21783.
                                                                       He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J.,
Fu, Y., Peng, H., Sabharwal, A., Clark, P., and Khot,                    Hu, J., Han, X., Huang, Y., Zhang, Y., Liu, J., Qi, L.,
  T. Complexity-based prompting for multi-step reason-                   Liu, Z., and Sun, M. Olympiadbench: A challeng-
  ing, 2023. URL https://arxiv.org/abs/2210.                             ing benchmark for promoting agi with olympiad-level
  00720.                                                                 bilingual multimodal scientific problems, 2024. URL
                                                                         https://arxiv.org/abs/2402.14008.
Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma,
  A., and Goodman, N. D. Stream of search (sos): Learning              Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart,
  to search in language, 2024. URL https://arxiv.                        S., Tang, E., Song, D., and Steinhardt, J. Measuring math-
  org/abs/2404.03683.                                                    ematical problem solving with the math dataset, 2021.
                                                                         URL https://arxiv.org/abs/2103.03874.
Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y., Dong, Q., Li,
  L., Ma, C., Chen, L., Xu, R., Tang, Z., Wang, B., Zan, D.,           Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E.,
  Quan, S., Zhang, G., Sha, L., Zhang, Y., Ren, X., Liu, T.,             Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A.,
  and Chang, B. Omni-math: A universal olympiad level                   Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican,
  mathematic benchmark for large language models, 2024a.                 K., van den Driessche, G., Damoc, B., Guy, A., Osindero,
  URL https://arxiv.org/abs/2410.07985.                                  S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and
                                                                         Sifre, L. Training compute-optimal large language mod-
Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster,           els, 2022. URL https://arxiv.org/abs/2203.
  C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N.,              15556.
  Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B.,
  Wang, K., and Zou, A. A framework for few-shot language              Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J.,
  model evaluation, September 2021. URL https://                        Tang, J., and Dong, Y. Advancing language model rea-
  doi.org/10.5281/zenodo.5371628.                                        soning through reinforcement learning and inference scal-
                                                                         ing, 2025. URL https://arxiv.org/abs/2501.
Gao, Z., Niu, B., He, X., Xu, H., Liu, H., Liu, A., Hu, X., and         11651.
  Wen, L. Interpretable contrastive monte carlo tree search
  reasoning, 2024b. URL https://arxiv.org/abs/                         Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettle-
  2410.01707.                                                            moyer, L., Smith, N. A., and Krishna, R. Visual sketch-
                                                                         pad: Sketching as a visual chain of thought for multimodal
Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen,               language models, 2024. URL https://arxiv.org/
  E., Gunning, A., Olsson, C. F., Denain, J.-S., Ho, A.,                 abs/2406.09403.
  de Oliveira Santos, E., Järviniemi, O., Barnett, M., San-
  dler, R., Vrzala, M., Sevilla, J., Ren, Q., Pratt, E., Levine,       Huang, Z., Wang, Z., Xia, S., Li, X., Zou, H., Xu, R., Fan,
  L., Barkley, G., Stewart, N., Grechuk, B., Grechuk, T.,                R.-Z., Ye, L., Chern, E., Ye, Y., Zhang, Y., Yang, Y., Wu,
  Enugandla, S. V., and Wildon, M. Frontiermath: A bench-               T., Wang, B., Sun, S., Xiao, Y., Li, Y., Zhou, F., Chern, S.,
  mark for evaluating advanced mathematical reasoning in                 Qin, Y., Ma, Y., Su, J., Liu, Y., Zheng, Y., Zhang, S., Lin,
  ai, 2024. URL https://arxiv.org/abs/2411.                              D., Qiao, Y., and Liu, P. Olympicarena: Benchmarking
  04872.                                                                 multi-discipline cognitive reasoning for superintelligent
                                                                         ai, 2024a. URL https://arxiv.org/abs/2406.
Google. Gemini 2.0 flash thinking mode (gemini-2.0-                     12753.
  flash-thinking-exp-1219), December 2024.   URL
  https://cloud.google.com/vertex-ai/                                  Huang, Z., Zou, H., Li, X., Liu, Y., Zheng, Y., Chern, E.,
  generative-ai/docs/thinking-mode.                                     Xia, S., Qin, Y., Yuan, W., and Liu, P. O1 replication
                                                                         journey – part 2: Surpassing o1-preview through simple
Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney,              distillation, big progress or bitter lesson?, 2024b. URL
  R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I.,                https://arxiv.org/abs/2411.16489.

                                                                  11
                                                 s1: Simple test-time scaling

Huang, Z., Geng, G., Hua, S., Huang, Z., Zou, H., Zhang,               aimo-progress-prize/blob/main/report/
  S., Liu, P., and Zhang, X. O1 replication journey – part 3:          numina_dataset.pdf.
  Inference-time scaling for medical reasoning, 2025. URL
  https://arxiv.org/abs/2501.06458.                                  Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker,
                                                                       B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and
Irvine, R., Boubert, D., Raina, V., Liusie, A., Zhu, Z.,               Cobbe, K. Let’s verify step by step, 2023. URL https:
   Mudupalli, V., Korshuk, A., Liu, Z., Cremer, F., As-                //arxiv.org/abs/2305.20050.
   sassi, V., Beauchamp, C.-C., Lu, X., Rialan, T., and
   Beauchamp, W. Rewarding chatbots for real-world en-               Ling, W., Yogatama, D., Dyer, C., and Blunsom, P. Program
   gagement with millions of users, 2023. URL https:                   induction by rationale generation : Learning to solve and
  //arxiv.org/abs/2303.06135.                                          explain algebraic word problems, 2017. URL https:
                                                                       //arxiv.org/abs/1705.04146.
Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang,
  T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I.            Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., and Zhang,
   Livecodebench: Holistic and contamination free evalu-               Y. Logiqa: A challenge dataset for machine reading com-
   ation of large language models for code, 2024. URL                  prehension with logical reasoning, 2020. URL https:
   https://arxiv.org/abs/2403.07974.                                   //arxiv.org/abs/2007.08124.
Kapfer, C., Stine, K., Narasimhan, B., Mentzel, C., and              Liu, J., Cohen, A., Pasunuru, R., Choi, Y., Hajishirzi,
  Candes, E. Marlowe: Stanford’s gpu-based computational               H., and Celikyilmaz, A. Don’t throw away your value
  instrument, January 2025. URL https://doi.org/                       model! generating more preferable text with value-guided
  10.5281/zenodo.14751899.                                             monte-carlo tree search decoding, 2024. URL https:
                                                                       //arxiv.org/abs/2309.15028.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B.,
  Chess, B., Child, R., Gray, S., Radford, A., Wu, J.,               Loshchilov, I. and Hutter, F. Decoupled weight decay regu-
  and Amodei, D. Scaling laws for neural language mod-                 larization, 2019.
  els, 2020. URL https://arxiv.org/abs/2001.
  08361.                                                             Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X.,
                                                                       Lin, Q., Chen, S., Tang, Y., and Zhang, D. Wizardmath:
Kim, E., Suk, J., Kim, S., Muennighoff, N., Kim, D., and Oh,           Empowering mathematical reasoning for large language
  A. Llm-as-an-interviewer: Beyond static testing through              models via reinforced evol-instruct, 2025. URL https:
  dynamic llm evaluation, 2024. URL https://arxiv.                     //arxiv.org/abs/2308.09583.
  org/abs/2412.10424.
                                                                     Marten, R., Vu, T., Ji, C. C.-J., Sharma, K., Pimpalgaonkar,
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu,
                                                                      S., Dimakis, A., and Sathiamoorthy, M. Curator: A tool
 C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Effi-
                                                                      for synthetic data creation. https://github.com/
 cient memory management for large language model serv-
                                                                      bespokelabsai/curator, January 2025.
 ing with pagedattention, 2023. URL https://arxiv.
 org/abs/2309.06180.                                                 Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Mor-
Labs, B. Bespoke-stratos: The unreasonable effectiveness              rison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert,
  of reasoning distillation, 2025. URL https://hf.                    N., Gu, Y., Arora, S., Bhagia, A., Schwenk, D., Wadden,
  co/bespokelabs/Bespoke-Stratos-32B. Ac-                             D., Wettig, A., Hui, B., Dettmers, T., Kiela, D., Farhadi,
  cessed: 2025-01-22.                                                 A., Smith, N. A., Koh, P. W., Singh, A., and Hajishirzi, H.
                                                                      Olmoe: Open mixture-of-experts language models, 2024.
Lee, K.-H., Fischer, I., Wu, Y.-H., Marwood, D., Baluja,              URL https://arxiv.org/abs/2409.02060.
  S., Schuurmans, D., and Chen, X. Evolving deeper llm
  thinking, 2025. URL https://arxiv.org/abs/                         of America, M. A. Aime, February 2024. URL https:
  2501.09891.                                                          //artofproblemsolving.com/wiki/index.
                                                                        php/AIME_Problems_and_Solutions/.
Levi, N. A simple model of inference scaling laws, 2024.
  URL https://arxiv.org/abs/2410.16377.                              OpenAI.     Learning to reason with llms, Septem-
                                                                       ber 2024. URL https://openai.com/index/
LI, J., Beeching, E., Tunstall, L., Lipkin, B., Solet-                 learning-to-reason-with-llms/.
  skyi, R., Huang, S. C., Rasul, K., Yu, L., Jiang, A.,
  Shen, Z., Qin, Z., Dong, B., Zhou, L., Fleureau, Y.,               OpenAI. Openai o3-mini, 2025. URL https://openai.
  Lample, G., and Polu, S. Numinamath, 2024. URL                       com/index/openai-o3-mini/. Accessed: 2025-
  https://github.com/project-numina/                                   02-24.

                                                                12
                                                   s1: Simple test-time scaling

Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., Shi,            H., Yuan, H., Chen, J., Guo, J., Su, J., Wang, J., Zhao, J.,
  S., Choi, M., Agrawal, A., Chopra, A., et al. Humanity’s               Zhang, J., Liu, J., Yan, J., Wu, J., Shi, L., Ye, L., Yu, L.,
  last exam, 2025. URL https://arxiv.org/abs/                            Dong, M., Zhang, N., Ma, N., Pan, Q., Gong, Q., Liu, S.,
  2501.14249.                                                            Ma, S., Wei, S., Cao, S., Huang, S., Jiang, T., Gao, W.,
                                                                         Xiong, W., He, W., Huang, W., Wu, W., He, W., Wei, X.,
Qin, Y., Li, X., Zou, H., Liu, Y., Xia, S., Huang, Z., Ye,               Jia, X., Wu, X., Xu, X., Zu, X., Zhou, X., Pan, X., Charles,
  Y., Yuan, W., Liu, H., Li, Y., and Liu, P. O1 replication              Y., Li, Y., Hu, Y., Liu, Y., Chen, Y., Wang, Y., Liu, Y., Qin,
  journey: A strategic progress report – part 1, 2024. URL               Y., Liu, Y., Yang, Y., Bao, Y., Du, Y., Wu, Y., Wang, Y.,
  https://arxiv.org/abs/2410.18982.                                      Zhou, Z., Wang, Z., Li, Z., Zhu, Z., Zhang, Z., Wang, Z.,
Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng,                  Yang, Z., Huang, Z., Huang, Z., Xu, Z., and Yang, Z. Kimi
 B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H.,               k1.5: Scaling reinforcement learning with llms, 2025.
 Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J.,              URL https://arxiv.org/abs/2501.12599.
 Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li,
 M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li,                Team, N.      Sky-t1: Fully open-source reasoning
 T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y.,              model with o1-preview performance in $450 budget,
 Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5               2025. URL https://novasky-ai.github.io/
 technical report, 2024. URL https://arxiv.org/                          posts/sky-t1. Accessed: 2025-01-09.
 abs/2412.15115.
                                                                       Team, Q. Qwq: Reflect deeply on the boundaries of the
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y.,          unknown, November 2024. URL https://qwenlm.
  Dirani, J., Michael, J., and Bowman, S. R. Gpqa: A                     github.io/blog/qwq-32b-preview/.
  graduate-level google-proof q&a benchmark, 2023. URL
  https://arxiv.org/abs/2311.12022.                                    Wang, J., Meng, F., Liang, Y., and Zhou, J. Drt-o1: Op-
                                                                        timized deep reasoning translation via long chain-of-
Shi, Q., Tang, M., Narasimhan, K., and Yao, S. Can lan-                 thought, 2024a. URL https://arxiv.org/abs/
  guage models solve olympiad programming?, 2024. URL                   2412.17498.
  https://arxiv.org/abs/2404.10952.

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling llm test-            Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen,
  time compute optimally can be more effective than scal-               D., Wu, Y., and Sui, Z. Math-shepherd: Verify and rein-
  ing model parameters, 2024. URL https://arxiv.                        force llms step-by-step without human annotations, 2024b.
  org/abs/2408.03314.                                                   URL https://arxiv.org/abs/2312.08935.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid,           Wang, S., Liu, Z., Zhong, W., Zhou, M., Wei, Z., Chen, Z.,
  A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A.,                  and Duan, N. From lsat: The progress and challenges
  Garriga-Alonso, A., et al. Beyond the imitation game:                 of complex reasoning, 2021. URL https://arxiv.
  Quantifying and extrapolating the capabilities of language            org/abs/2108.00648.
  models, 2023.
                                                                       Wang, Z., Dong, Y., Delalleau, O., Zeng, J., Shen, G., Egert,
Su, H., Yen, H., Xia, M., Shi, W., Muennighoff, N., yu Wang,            D., Zhang, J. J., Sreedhar, M. N., and Kuchaiev, O. Help-
  H., Liu, H., Shi, Q., Siegel, Z. S., Tang, M., Sun, R., Yoon,         steer2: Open-source dataset for training top-performing
  J., Arik, S. O., Chen, D., and Yu, T. Bright: A realis-               reward models, 2024c. URL https://arxiv.org/
  tic and challenging benchmark for reasoning-intensive                 abs/2406.08673.
  retrieval, 2024. URL https://arxiv.org/abs/
  2407.12883.                                                          Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter,
Sun, L., Han, Y., Zhao, Z., Ma, D., Shen, Z., Chen, B., Chen,           B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-
  L., and Yu, K. Scieval: A multi-level large language                  thought prompting elicits reasoning in large language
  model evaluation benchmark for scientific research, 2024.             models, 2023. URL https://arxiv.org/abs/
  URL https://arxiv.org/abs/2308.13149.                                 2201.11903.

Team, K., Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li,          Welleck, S., Bertsch, A., Finlayson, M., Schoelkopf, H., Xie,
  C., Xiao, C., Du, C., Liao, C., Tang, C., Wang, C., Zhang,            A., Neubig, G., Kulikov, I., and Harchaoui, Z. From de-
  D., Yuan, E., Lu, E., Tang, F., Sung, F., Wei, G., Lai, G.,           coding to meta-generation: Inference-time algorithms for
  Guo, H., Zhu, H., Ding, H., Hu, H., Yang, H., Zhang, H.,              large language models, 2024. URL https://arxiv.
  Yao, H., Zhao, H., Lu, H., Li, H., Yu, H., Gao, H., Zheng,            org/abs/2406.16838.

                                                                  13
                                                  s1: Simple test-time scaling

Wu, T., Lan, J., Yuan, W., Jiao, J., Weston, J., and                  Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok,
 Sukhbaatar, S. Thinking llms: General instruction fol-                 J. T., Li, Z., Weller, A., and Liu, W. Metamath: Bootstrap
 lowing with thought generation, 2024a. URL https:                      your own mathematical questions for large language mod-
 //arxiv.org/abs/2410.10630.                                            els, 2024. URL https://arxiv.org/abs/2309.
                                                                        12284.
Wu, Y., Sun, Z., Li, S., Welleck, S., and Yang, Y. Infer-
 ence scaling laws: An empirical analysis of compute-                 Yuan, S., Chen, Z., Xi, Z., Ye, J., Du, Z., and Chen, J.
 optimal inference for problem-solving with language mod-               Agent-r: Training language model agents to reflect via
 els, 2024b. URL https://arxiv.org/abs/2408.                            iterative self-training, 2025. URL https://arxiv.
 00724.                                                                 org/abs/2501.11425.
Xiang, V., Snell, C., Gandhi, K., Albalak, A., Singh, A.,             Zelikman, E., Wu, Y., Mu, J., and Goodman, N. D. Star:
  Blagden, C., Phung, D., Rafailov, R., Lile, N., Mahan,                Bootstrapping reasoning with reasoning, 2022. URL
  D., Castricato, L., Franken, J.-P., Haber, N., and Finn, C.           https://arxiv.org/abs/2203.14465.
  Towards system 2 reasoning in llms: Learning how to
  think with meta chain-of-thought, 2025. URL https:                  Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N.,
  //arxiv.org/abs/2501.04682.                                           and Goodman, N. D. Quiet-star: Language models can
                                                                        teach themselves to think before speaking, 2024. URL
Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He,             https://arxiv.org/abs/2403.09629.
  J., and Xie, Q. Self-evaluation guided beam search for
  reasoning, 2023. URL https://arxiv.org/abs/                         Zhang, H. and Chen, C. Test-time compute scaling laws,
  2305.00633.                                                           2024. URL https://github.com/hughbzhang/
                                                                        o1_inference_scaling_laws.
Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., Ruan,
  C., Li, W., and Liang, X. Deepseek-prover: Advancing                Zhang, S., Chen, Z., Shen, Y., Ding, M., Tenenbaum, J. B.,
  theorem proving in llms through large-scale synthetic                 and Gan, C. Planning with large language models for
  data, 2024. URL https://arxiv.org/abs/2405.                           code generation, 2023. URL https://arxiv.org/
  14333.                                                                abs/2303.05510.

Xu, H., Wu, X., Wang, W., Li, Z., Zheng, D., Chen, B.,                Zhang, Y., Wu, S., Yang, Y., Shu, J., Xiao, J., Kong, C., and
  Hu, Y., Kang, S., Ji, J., Zhang, Y., Guo, Z., Yang, Y.,               Sang, J. o1-coder: an o1 replication for coding, 2024a.
  Zhang, M., and Zhang, D. Redstar: Does scaling long-cot               URL https://arxiv.org/abs/2412.00154.
  data unlock better slow-reasoning systems?, 2025. URL
                                                                      Zhang, Y., Yang, J., Yuan, Y., and Yao, A. C.-C. Cumula-
  https://arxiv.org/abs/2501.11284.
                                                                        tive reasoning with large language models, 2024b. URL
Yang, Z., Band, N., Li, S., Candès, E., and Hashimoto, T.               https://arxiv.org/abs/2308.04371.
  Synthetic continued pretraining, 2024. URL https:
                                                                      Zhong, H., Xiao, C., Tu, C., Zhang, T., Liu, Z., and Sun, M.
  //arxiv.org/abs/2409.07431.
                                                                        Jec-qa: A legal-domain question answering dataset, 2019.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao,          URL https://arxiv.org/abs/1911.12011.
  Y., and Narasimhan, K. Tree of thoughts: Deliberate
                                                                      Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang,
  problem solving with large language models, 2023a. URL
                                                                        Y., Saied, A., Chen, W., and Duan, N. Agieval: A
  https://arxiv.org/abs/2305.10601.
                                                                        human-centric benchmark for evaluating foundation mod-
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan,             els, 2023. URL https://arxiv.org/abs/2304.
  K., and Cao, Y. React: Synergizing reasoning and acting               06364.
  in language models, 2023b. URL https://arxiv.
                                                                      Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H.,
  org/abs/2210.03629.
                                                                        and Wang, Y.-X. Language agent tree search unifies rea-
Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P.            soning acting and planning in language models, 2024.
  Limo: Less is more for reasoning, 2025a. URL https:                   URL https://arxiv.org/abs/2310.04406.
  //arxiv.org/abs/2502.03387.
                                                                      Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X.,
Ye, Y., Xiao, Y., Mi, T., and Liu, P. Aime-preview: A                   Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M.,
  rigorous and immediate evaluation framework for ad-                   Zettlemoyer, L., and Levy, O. Lima: Less is more for
  vanced mathematical reasoning. https://github.                        alignment, 2023. URL https://arxiv.org/abs/
  com/GAIR-NLP/AIME-Preview, 2025b. GitHub                              2305.11206.
  repository.

                                                                 14
                                                  s1: Simple test-time scaling

Contents

1   Introduction                                                                                                             1

2   Reasoning data curation to create s1K                                                                                   2
    2.1   Initial collection of 59K samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    2
    2.2   Final selection of 1K samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    2

3 Test-time scaling                                                                                                          3
    3.1   Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     3
    3.2   Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    4

4   Results                                                                                                                  4
    4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      4
    4.2   Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    6

5 Ablations                                                                                                                  6
  5.1 Data Quantity, Diversity, and Difficulty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         6
    5.2   Test-time scaling methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      6

6   Discussion and related work                                                                                              8
    6.1 Sample-efficient reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       8
    6.2   Test-time scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    9

A s1.1                                                                                                                      16

B Evaluation determinism                                                                                                    16

C s1K details                                                                                                               17
  C.1 s1K summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         17
    C.2 Dataset composition for full 59K questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      17
    C.3 s1K grading prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      19
    C.4 s1K diversity selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     19
    C.5 Decontamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       19

D Training details                                                                                                          21
    D.1 Training Ablations: Sequence length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       21
    D.2 Training Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      21

E Test-time scaling details                                                                                                 35
    E.1 Sequential scaling ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    35
    E.2 Examples for rejection sampling ablation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      37

F Version Control                                                                                                           46




                                                              15
                                                    s1: Simple test-time scaling

A. s1.1
Seven days after our release of s1, we released s1.1. We regenerated traces for our 1,000 samples in s1K using DeepSeek
r1 (DeepSeek-AI et al., 2025) to create s1K-1.1. We use the same training procedure to train our model s1.1. Other updates
since our launch include the release of o3 (OpenAI, 2025), LIMO (Ye et al., 2025a), and AIME 2025. We consider all these
new developments in Table 5. We find that s1.1 performs significantly better than s1. We also tried distilling from Claude
3.7, which led to worse performance than from r1 (not reported).


Table 5. s1-32B is an open and sample-efficient reasoning model. We evaluate s1-32B, Qwen, and Gemini (some entries are unknown
(N.A.), see §4). Other results are from the respective reports (Qwen et al., 2024; Team, 2024; OpenAI, 2024; 2025; DeepSeek-AI et al.,
2025; Labs, 2025; Team, 2025) except for AIME 2025 (Ye et al., 2025b). # ex. = number examples used for reasoning finetuning; BF =
budget forcing.
          Model                                     # Examples          MATH500    GPQA       AIME 2024       AIME 2025
                                                             API only
          o3-mini-low                                       N/A             95.8      70.6            56.3             42.1
          o3-mini-medium                                    N/A             97.3      76.8            75.8             70.4
          o3-mini-high                                      N/A             97.9      79.7            83.8             80.9
                                                          Open Weights
          QwQ-32B                                         N.A.              90.6      54.5            46.7             32.7
          r1                                            ≫800K               07.3      71.5            79.8             70.0
          r1-distill-Llama-70B                           800K               94.5      65.2            57.1             56.3
          r1-distill-Qwen-14B                            800K               93.9      59.1            61.7             48.0
          r1-distill-Qwen-32B                            800K               94.3      62.1            58.3             49.6
                                                 Open Weights and Open Data
          LIMO                                               817            94.8      66.7            56.3             44.6
          s1 w/o BF                                           1K            92.6      56.6            50.0             26.7
          s1 with Budget Forcing “Wait” 1x                    1K            92.8      59.6            53.3             30.0
          s1 with Budget Forcing “Wait” 2x                    1K            93.0      59.6            53.3             33.3
          s1 with Budget Forcing “Wait” 4x                    1K            92.2      58.6            56.7             36.7
          s1.1 w/o BF                                         1K            94.4      60.6            56.7             50.0
          s1.1 with Budget Forcing “Wait” 1x                  1K            95.4      62.6            56.7             50.0
          s1.1 with Budget Forcing “Wait” 2x                  1K            95.4      63.6            56.7             50.0




B. Evaluation determinism
We run our evaluations using vLLM (Kwon et al., 2023) as it is faster than the alternatives we tried. However, we find that
even when using the same random seeds and greedy sampling, evaluation scores can change significantly across runs:

  • Different batch sizes causing different results see https://github.com/vllm-project/vllm/issues/
    5898
  • Continuing generations causing different results see https://github.com/vllm-project/vllm/issues/
    11783
  • Changes in tensor parallelism causing different results

As our model generates long reasoning traces prior to its answer, small numeric changes can snowball into large differences.
We encounter many generations that are exactly the same for thousands of tokens and then suddenly differ in one token
eventually ending up with an entirely different answer. To partly counter this issue we generally run our final evaluations
using full precision unless otherwise indicated.

                                                                   16
                                                  s1: Simple test-time scaling

C. s1K details
C.1. s1K summary


Table 6. Summary of our dataset s1K. Token count measured by the Qwen-2.5 tokenizer. We prompt Claude to produce keywords given
several questions from the domain.
              Domain                       #questions     Total token count      Keywords
              Geometry                     109            560.2K                 Area, Triangle, Distance
              Number theory                98             522.5K                 Sequences, Divisibility
              Combinatorics                75             384.7K                 Permutations, Counting
              Real functions               43             234.8K                 Trigonometry, Calculus
              Biology                      41             120.9K                 Organic reactions
              Complex functions            32             170.2K                 Complex roots
              Quantum theory               32             127.9K                 Particles, Wave functions
              Field theory                 28             150.1K                 Polynomials, Roots
              Calculus of variations       28             155.5K                 Optimization, Control
              Difference equations         24             132.5K                 Recurrence, Recursion
              Electromagnetic theory       23             95.8K                  Optics, Waves, Diffraction
              Group theory                 22             100.0K                 Groups, Automorphisms
              Linear algebra               22             128.3K                 Matrices, Determinants
              Probability theory           20             114.6K                 Random walk, Expectation
              Algebraic systems            19             109.9K                 Functional equations
              Mechanics                    19             103.6K                 Forces, Motion, Energy
              Thermodynamics               19             74.2K                  Heat engines, Entropy
              Differential equations       18             89.6K                  Substitution, Existence
              Computer science             18             34.2K                  Complexity theory, Algorithms
              Numerical analysis           18             76.5K                  Error analysis, Stability
              Calculus                     17             96.3K                  Convergence, Summation
              Algebraic structures         17             90.4K                  Inequalities, Sets
              Astronomy                    16             37.7K                  Stellar populations, Orbits
              Remaining 27 domains         242            982.2K                 Domains with ≤ 16 questions
              All domains (51)             1000           4.7M                   s1K


C.2. Dataset composition for full 59K questions




                                                              17
                                                   s1: Simple test-time scaling




Table 7. Composition of full 59K questions. Thinking and response lengths are measured in tokens using the Qwen2.5-32B-Instruct
tokenizer (Qwen et al., 2024). In addition to excluding our evaluation benchmark, AIME24, we also exclude AIME questions from
2022-2023 as we use these 90 questions during our development stage of s1-32B.
            Source                                 Description                                 #Samples Avg.
                                                                                                        thinking
                                                                                                        length
            NuminaMATH (LI et al., 2024)           Math problems from online websites          30660     4.1K
            MATH (Hendrycks et al., 2021)          Math problems from competitions             11999     2.9K
            OlympicArena (Huang et al.,            Astronomy, Biology, Chemistry, Com-         4250      3.2K
            2024a)                                 puter Science, Geography, Math, and
                                                   Physics olympiad questions
            OmniMath (Gao et al., 2024a)           Math problems from competitions             4238      4.4K
            AGIEval (Zhong et al., 2023;           English, Law, Logic and Math prob-          2385      1.2K
            Ling et al., 2017; Hendrycks et al.,   lems from the SAT, LSAT and other
            2021; Liu et al., 2020; Zhong          exams
            et al., 2019; Wang et al., 2021)
            xword                                  Crossword puzzles                           999       0.7K
            OlympiadBench (He et al., 2024)        Math and Physics olympiad questions         896       3.9K
            AIME (1983-2021)                       American Invitational Mathematics Ex-       890       4.7K
                                                   amination
            TheoremQA (Chen et al., 2023)          Computer Science, Finance, Math, and        747       2.1K
                                                   Physics university-level questions relat-
                                                   ing to theorems
            USACO (Shi et al., 2024)               Code problems from the USA Comput-          519       3.6K
                                                   ing Olympiad
            JEEBench (Arora et al., 2023)          Chemistry, Math, and Physics problems       515       2.9K
                                                   used in the university entrance exami-
                                                   nation of the Indian Institute of Tech-
                                                   nology
            GPQA (Rein et al., 2023)               PhD-Level Science Questions                 348       2.9K
            SciEval (Sun et al., 2024)             Biology, Chemistry, and Physics prob-       227       0.7K
                                                   lems from various sources
            s1-prob                                Stanford statistics qualifying exams        182       4.0K
            LiveCodeBench (Jain et al., 2024)      Code problems from coding websites          151       3.5K
                                                   (LeetCode, AtCoder, and CodeForces)
            s1-teasers                             Math brain-teasers crawled from the In-     23        4.1K
                                                   ternet
            All 59K questions                      Composite of the above datasets with        59029     3.6K
                                                   reasoning traces and solutions




                                                               18
                                                 s1: Simple test-time scaling

C.3. s1K grading prompt
To grade whether an example is correct for our dataset selection in §2, we use the prompt in Figure 8. We grade using Claude
3.5 except for the correctness among the final 1,000 samples, which we graded with Claude 3.7.

You are an AI assistant for grading a science problem. The user will provide you with the question itself, an attempt made by
a student and the correct answer to the problem. Your job is to judge whether the attempt is correct by comparing it with the
correct answer. If the expected solution concludes with a number or choice, there should be no ambiguity. If the expected
solution involves going through the entire reasoning process, you should judge the attempt based on whether the reasoning
process is correct with correct answer if helpful.

The user will provide the attempt and the correct answer in the following format:

# Problem
{problem}

## Attempt
{attempt}

## Correct answer
{solution}

Explain your reasoning, and end your response on a new line with only "Yes" or "No" (without quotes).

                                                 Figure 8. Grading prompt.


C.4. s1K diversity selection
Algorithm 1 provides our algorithm for selecting data in our diversity selection stage. As mentioned in §2, we also include
samples from some specific benchmarks we perceive as high-quality. None of the samples overlap with our final evaluation.

C.5. Decontamination
We filter all samples by checking for an 8-gram overlap between the selected examples and the evaluation benchmarks:
MATH500, GPTQA Diamond, and AIME24. We exclude questions with more than an 8-gram overlap.




                                                             19
                                              s1: Simple test-time scaling




Algorithm 1 Two-stage sampling for s1K
 1: Input:  := Set of 24,496 questions with features
 2: Output:  := Set of 1,000 selected questions
 3:  ← ∅                                                     Initialize the output set (only tracks unique elements)
 4: for 𝑞 ∈  do
 5:   if IsGeminiCorrect(𝑞) and (IsAIME(𝑞) or IsGPQA(𝑞)) then
 6:       ←  ∪ {𝑞}
 7:                                                                         Select all correct AIME/GPQA solutions
 8:   else if IsGeminiCorrect(𝑞) and IsMATH(𝑞) and ThinkingLength(𝑞) > 5600 then
 9:       ←  ∪ {𝑞}
10:                                                            Select correct MATH500 solutions with long chains
11:   end if
12: end for
13:  ← All available domains
14:                                                                                             Initialize domain pool
15: while || < 1000 do
16:   𝑑 ← RandomChoice()
17:                                                                                       Randomly select a domain
18:   𝑄𝑑 ← Questions in domain 𝑑
19:                                                                                  Get questions from this domain
20:   ranks ← RankByThinkingLength(𝑄𝑑 )
21:                                                                                          Rank by thinking length
22:   weights ← 2−ranks
23:                                                                                      Apply power-law weighting
24:   𝑞 ← WeightedSample(𝑄𝑑 , weights)
25:                                                                                   Sample favoring longer chains
26:    ←  ∪ {𝑞}
27:                                                                                            Add selected question
28:   𝑄𝑑 ← 𝑄𝑑 ⧵ {𝑞}
29:   if 𝑄𝑑 = ∅ then
30:       ←  ⧵ {𝑑}
31:                                                                                       Remove exhausted domains
32:   end if
33: end while




                                                          20
                                                                   s1: Simple test-time scaling

D. Training details
We take a model that has already been pretrained and instruction tuned and further finetune it for reasoning. Specifically, we
use Qwen2.5-32B-Instruct (Qwen et al., 2024), which on math tasks generally matches or outperforms the larger Qwen2.5-
72B-Instruct (Qwen et al., 2024) or other open models (Dubey et al., 2024; Groeneveld et al., 2024; Muennighoff et al.,
2024). We use token delimiters to separate the thinking stage from the answering stage. We enclose the thinking stage
with <|im_start|>think and <|im_start|>answer; both preceded and followed by a newline. Samples from our
dataset are in §D.2. We use basic fine-tuning hyperparameters: we train for 5 epochs with a batch size of 16 for a total of 315
gradient steps. We train in bfloat16 precision with a learning rate of 1𝑒 − 5 warmed up linearly for 5% (16 steps) and then
decayed to 0 over the rest of training (299 steps) following a cosine schedule. We use the AdamW optimizer (Loshchilov &
Hutter, 2019) with 𝛽1 = 0.9, 𝛽2 = 0.95 and weight decay of 1𝑒 − 4. We do not compute loss on questions, only on reasoning
traces and solutions. We ensure the sequence length is large enough to avoid cutting off any samples; a setting we ablate in
§D.1. The training takes just 26 minutes on 16 NVIDIA H100 GPUs.

                 Training Loss                                       Learning Rate Schedule                                     Gradient Norm
                                                                   1e 5
                                                             1.0


                                             Learning Rate                                              Gradient Norm
       0.4                                                                                                              3
Loss   0.2
                                                             0.5                                                        2
                                                                                                                        1
       0.0                                                   0.0
             0   100        200       300                           0       100       200         300                       0    100    200      300
                 Training Steps                                            Training Steps                                       Training Steps
                                            Figure 9. Training dynamics of s1-32B on s1K.



D.1. Training Ablations: Sequence length


Table 8. Training sequence length ablation. We report “accuracy / average thinking tokens per sample”; the higher the accuracy and the
fewer the thinking tokens (inference cost) the better.
                                                                   Model A                         Model B
                  Training sequence length                         4096                            32768
                  % training samples cutoff                        74%                             0%
                  AIME24                                           30.0% / 20721                   50.0% / 6984
                  MATH500                                          90.0% / 5324                    91.0% / 3268
                  GPQA                                             52.5% / 6841                    53.0% / 3568

Besides our scaling ablations in §5.2, the main training hyperparameter we ablate is the sequence length used during training.
We find that a shorter training sequence length leads to longer reasoning traces at test time. This is because when training
with a shorter sequence length the answer section of the training sample is more commonly cut off. Inversely, when the
training sequence length is longer, more samples appear in their entirety with the section where the model answers. Thus
the model receives more gradient updates where it learns to generate an answer following its chain. This in turn leads to a
higher log probability of the answer section at any point during the generation and thus shorter reasoning traces at test time.
Performance-wise, we also find that the model trained with a longer sequence length performs better. Thus we opt for the
longest training sequence length as it leads to better performance and makes inference more efficient by leading to shorter
reasoning traces.

D.2. Training Samples
Table 9, Table 10, Table 11 contain training samples from s1K.


                                                                               21
                                                   s1: Simple test-time scaling

Table 9. s1K sample stemming from s1-prob. The black text is the question and the yellow text is the solution. Based solely on the
question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.

You are given integrable random variables 𝑋, 𝑌0 and 𝑍0 on the same probability space (Ω,  , ℙ), and two 𝜎-algebras
 ⊂  ,  ⊂  . For 𝑘 = 1, 2, … ,, let

                             𝑌𝑘 ∶= 𝔼[𝑋|𝜎(, 𝑍0 , … , 𝑍𝑘−1 )],        𝑍𝑘 ∶= 𝔼[𝑋|𝜎(, 𝑌0 , … , 𝑌𝑘−1 )].

(a) Show that there exist integrable random variables 𝑌∞ and 𝑍∞ such that as 𝑛 → ∞ both 𝔼[|𝑌𝑛 − 𝑌∞ |] → 0 and
𝔼[|𝑍𝑛 − 𝑍∞ |] → 0. (b) Prove that almost surely 𝑌∞ = 𝑍∞ .



  1. Introduce the notations, 𝑛 ∶= 𝜎(, 𝑍0 , … , 𝑍𝑛−1 ) and [𝑛 ∶= ]𝜎(, 𝑌0 , … , 𝑌[𝑛−1 ), for
                                                                                             ] all 𝑛 ≥ 1 and 0 ∶= ,
                                                                                                                   { }0 ∶= .
     Clearly, both are filtrations and by definition, 𝑌𝑛 = 𝔼 𝑋|𝑛 and 𝑍𝑛 = 𝔼 𝑋|𝑛 , for all 𝑛 ≥ 0. Thus 𝑌𝑛 𝑛≥0 (or
     { }                                                          { }          { }
       𝑍𝑛 𝑛≥0 ) is a Doob’s Martingale with respect to filtration 𝑛 𝑛≥0 (or 𝑛 𝑛≥0 ) and hence is U.I., 𝑋 being integrable.
     Now apply the fact that a Martingale is U.I. if and only if it converges in 𝐿1 and conclude.
  2. We provide two different proofs.

  1. Observe the following identity which follows from Tower Property.
                                 [     ]      [ [     ] ]       [ [    ] ] [    ]
                               𝔼 𝑌𝑛 |𝑛 = 𝔼 𝔼 𝑋|𝑛 |𝑛 = 𝔼 𝔼 𝑋|𝑛 |𝑛 = 𝔼 𝑍𝑛 |𝑛 .                                             (4)

                    [      ]    [     ]                                                    [        ] 𝐿1 [         ]
     Note that 𝔼|𝔼 𝑌𝑛 |𝑛 − 𝔼 𝑌∞ |𝑛 | ≤ 𝔼|𝑌𝑛 − 𝑌∞ | = 𝑜(1) and by Levy’s Upward Theorem, 𝔼 𝑌∞ |𝑛 ⟶ 𝔼 𝑌∞ |∞ ,
                      (       )
     where ∞ = 𝜎 ∪𝑛≥0 𝑛 = 𝜎(, 𝑌0 , 𝑌1 …). It is evident that 𝑌∞ ∈ 𝑚∞ (since it is almost sure limit of 𝑌𝑛 ∈ 𝑚𝑛 )
                  [         ]                                                                        [     ] 𝐿1
     and hence 𝔼 𝑌∞ |∞ = 𝑌∞ almost surely. Combining the above observations we conclude that 𝔼 𝑌𝑛 |𝑛 ⟶ 𝑌∞ .
                 [        ] 𝐿1
     Similarly, 𝔼 𝑍𝑛 |𝑛 ⟶ 𝑍∞ . Now invoke Equation (4), to conclude that 𝑌∞ = 𝑍∞ almost surely.
                                                                                              [       ]     [     ]
  2. Since 𝑋 is integrable, we have by Levy’s Upward Theorem, 𝑌∞ = lim𝑛→∞ 𝑌𝑛 = lim𝑛→∞ 𝔼 𝑋|𝑛 = 𝔼 𝑋|∞ .
                          [       ]
     Similarly, 𝑍∞ = 𝔼 𝑋|∞ . On the otherhand, in first proof we have observed that 𝑌∞ ∈ 𝑚∞ and hence 𝑌∞ ∈
                        (           )
     𝑚∞ ∩ 𝑚∞ = 𝑚 ∞ ∩ ∞ . This yields the following.
                                    [           ]   [ [    ]        ]   [         ]
                              𝑌∞ = 𝔼 𝑌∞ |∞ ∩ ∞ = 𝔼 𝔼 𝑋|∞ |∞ ∩ ∞ = 𝔼 𝑋|∞ ∩ ∞ .
                       [         ]
      Similarly, 𝑍∞ = 𝔼 𝑋|∞ ∩ ∞ . This proves that 𝑌∞ = 𝑍∞ almost surely.


Let’s analyze the problem. We are given two sequences of random variables 𝑌𝑘 and 𝑍𝑘 defined by conditional expectations.
We need to show that these sequences converge in 𝐿1 and that their limits are equal almost surely.
Part (a): Convergence in 𝐿1 .
The sigma-algebras involved are increasing:
𝑘 = 𝜎(, 𝑍0 , … , 𝑍𝑘−1 )
𝑘 = 𝜎(, 𝑌0 , … , 𝑌𝑘−1 )
For the 𝑌𝑘 sequence:
𝑌1 = 𝔼[𝑋|𝜎()]
𝑌2 = 𝔼[𝑋|𝜎(, 𝑍0 )]
𝑌3 = 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 )]
...
The sequence of sigma-algebras 𝜎(, 𝑍0 , … , 𝑍𝑘−1 ) is increasing. By the Martingale Convergence Theorem (for conditional
expectations), since 𝑋 is integrable, the sequence 𝑌𝑘 converges almost surely and in 𝐿1 to 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 , …)]. Let
𝑌∞ = 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 , …)].



                                                                22
                                              s1: Simple test-time scaling


Similarly, for the 𝑍𝑘 sequence:
𝑍1 = 𝔼[𝑋|𝜎()]
𝑍2 = 𝔼[𝑋|𝜎(, 𝑌0 )]
𝑍3 = 𝔼[𝑋|𝜎(, 𝑌0 , 𝑌1 )]
...
The sequence of sigma-algebras 𝜎(, 𝑌0 , … , 𝑌𝑘−1 ) is increasing. By the Martingale Convergence Theorem, the sequence
𝑍𝑘 converges almost surely and in 𝐿1 to 𝔼[𝑋|𝜎(, 𝑌0 , 𝑌1 , …)]. Let 𝑍∞ = 𝔼[𝑋|𝜎(, 𝑌0 , 𝑌1 , …)].
Part (b): Prove that almost surely 𝑌∞ = 𝑍∞ .
We have 𝑌∞ = 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 , …)] and 𝑍∞ = 𝔼[𝑋|𝜎(, 𝑌0 , 𝑌1 , …)].
Let ∞ = 𝜎(, 𝑍0 , 𝑍1 , …) and ∞ = 𝜎(, 𝑌0 , 𝑌1 , …).
Then 𝑌∞ = 𝔼[𝑋|∞ ] and 𝑍∞ = 𝔼[𝑋|∞ ].
Consider 𝑌1 = 𝔼[𝑋|]. 𝑍1 = 𝔼[𝑋|].
𝑌2 = 𝔼[𝑋|𝜎(, 𝑍0 )]. Since 𝑍0 = 𝔼[𝑋|], 𝜎(, 𝑍0 ) = 𝜎(, 𝔼[𝑋|]).
𝑍2 = 𝔼[𝑋|𝜎(, 𝑌0 )]. Since 𝑌0 is not defined, let’s assume 𝑌0 and 𝑍0 are given.
Let’s consider the sigma-algebra generated by all the 𝑌𝑖 and 𝑍𝑖 .
Let  = 𝜎(𝑌0 , 𝑍0 , 𝑌1 , 𝑍1 , …).
Both 𝑌𝑘 and 𝑍𝑘 are measurable with respect to . Thus, 𝑌∞ and 𝑍∞ are measurable with respect to .
Consider 𝔼[𝑌𝑛+1 |, 𝑍0 , … , 𝑍𝑛−1 ] = 𝔼[𝔼[𝑋|𝜎(, 𝑍0 , … , 𝑍𝑛 )]|, 𝑍0 , … , 𝑍𝑛−1 ] = 𝑌𝑛 .
Consider 𝔼[𝑍𝑛+1 |, 𝑌0 , … , 𝑌𝑛−1 ] = 𝔼[𝔼[𝑋|𝜎(, 𝑌0 , … , 𝑌𝑛 )]|, 𝑌0 , … , 𝑌𝑛−1 ] = 𝑍𝑛 .
Let  = 𝜎( ∪  ∪ 𝜎(𝑌0 , 𝑍0 , 𝑌1 , 𝑍1 , …)).
Consider 𝔼[𝑋|, 𝑍0 , … , 𝑍𝑘−1 ] = 𝑌𝑘 .
Consider 𝔼[𝑋|, 𝑌0 , … , 𝑌𝑘−1 ] = 𝑍𝑘 .
Let’s consider the limit sigma-algebras ∞ = 𝜎(, 𝑍0 , 𝑍1 , …) and ∞ = 𝜎(, 𝑌0 , 𝑌1 , …).
We know that 𝑌∞ is ∞ -measurable and 𝑍∞ is ∞ -measurable.
Let  = 𝜎( ∪ ).
𝑌1 = 𝔼[𝑋|]. 𝑍1 = 𝔼[𝑋|].
𝑌2 = 𝔼[𝑋|𝜎(, 𝑍0 )] = 𝔼[𝑋|𝜎(, 𝔼[𝑋|])].
𝑍2 = 𝔼[𝑋|𝜎(, 𝑌0 )] = 𝔼[𝑋|𝜎(, 𝔼[𝑋|])].
Consider the case when  and  are independent. Then 𝑌1 = 𝔼[𝑋|] and 𝑍1 = 𝔼[𝑋|].
𝑌2 = 𝔼[𝑋|𝜎(, 𝔼[𝑋|])].
Let’s consider the sigma-algebra generated by all the information. Let ∞ = 𝜎(, , 𝑌0 , 𝑍0 , 𝑌1 , 𝑍1 , …).
Then both 𝑌∞ and 𝑍∞ are ∞ -measurable.
We have 𝑌∞ = lim𝑛→∞ 𝑌𝑛 and 𝑍∞ = lim𝑛→∞ 𝑍𝑛 a.s. and in 𝐿1 .
Consider 𝔼[𝑋|, 𝑍0 , 𝑍1 , …] = 𝑌∞ and 𝔼[𝑋|, 𝑌0 , 𝑌1 , …] = 𝑍∞ .
Let ∞ = 𝜎( ∪ ).
Consider 𝔼[𝑋|∞ ].
Let’s try to show that both 𝑌∞ and 𝑍∞ are equal to 𝔼[𝑋|𝜎( ∨ )]. This seems too simple.
Let’s consider the definition of conditional expectation. For any set 𝐴 ∈ ∞ , 𝔼[𝑌∞ 1𝐴 ] = 𝔼[𝑋1𝐴 ].
For any set 𝐵 ∈ ∞ , 𝔼[𝑍∞ 1𝐵 ] = 𝔼[𝑋1𝐵 ].
Consider 𝔼[𝑌𝑘+1 |𝑘 ] = 𝑌𝑘 where 𝑘 = 𝜎(, 𝑍0 , … , 𝑍𝑘−1 ).
Consider 𝔼[𝑍𝑘+1 |𝑘 ] = 𝑍𝑘 where 𝑘 = 𝜎(, 𝑌0 , … , 𝑌𝑘−1 ).
Consider the sigma-algebra 𝜎( ∪ ).
Let 𝑊 = 𝔼[𝑋|𝜎( ∪ )].
We have 𝑌∞ = 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 , …)].
Since 𝑍𝑘 depends on 𝑌𝑘−1 which depends on 𝑍𝑘−2 etc., it is complex.
Let’s consider the case where 𝑌0 = 𝔼[𝑋|] and 𝑍0 = 𝔼[𝑋|].
Then 𝑌1 = 𝔼[𝑋|] = 𝑌0 .
𝑍1 = 𝔼[𝑋|] = 𝑍0 .
If this were the case, then 𝑌𝑘 = 𝑌0 and 𝑍𝑘 = 𝑍0 , so 𝑌∞ = 𝑌0 and 𝑍∞ = 𝑍0 .
We need 𝑌0 = 𝑍0 a.s., i.e., 𝔼[𝑋|] = 𝔼[𝑋|] a.s. This is not generally true.
Consider the sigma-algebra generated by all 𝑌𝑖 and 𝑍𝑖 , denoted by ∞ = 𝜎(𝑌0 , 𝑍0 , 𝑌1 , 𝑍1 , …).
Then ∞ = 𝜎(, 𝑍0 , 𝑍1 , …) ⊆ 𝜎(, ∞ ).
∞ = 𝜎(, 𝑌0 , 𝑌1 , …) ⊆ 𝜎(, ∞ ).

                                                          23
                                              s1: Simple test-time scaling


Consider 𝔼[𝑋|, 𝑍0 , 𝑍1 , …] = 𝑌∞ . This means for any 𝐴 ∈ 𝜎(, 𝑍0 , 𝑍1 , …), 𝔼[𝑋1𝐴 ] = 𝔼[𝑌∞ 1𝐴 ].
Consider 𝔼[𝑋|, 𝑌0 , 𝑌1 , …] = 𝑍∞ . This means for any 𝐵 ∈ 𝜎(, 𝑌0 , 𝑌1 , …), 𝔼[𝑋1𝐵 ] = 𝔼[𝑍∞ 1𝐵 ].
Let  = 𝜎(, ).
Consider 𝔼[𝑋|].
Consider 𝔼[𝑌∞ |, 𝑌0 , 𝑌1 , …] = 𝔼[𝔼[𝑋|, 𝑍0 , 𝑍1 , …]|, 𝑌0 , 𝑌1 , …].
Let’s use the fact that 𝑌∞ is 𝜎(, 𝑍0 , 𝑍1 , …)-measurable and 𝑍∞ is 𝜎(, 𝑌0 , 𝑌1 , …)-measurable.
Consider the smallest sigma-algebra containing both 𝜎(, 𝑍0 , 𝑍1 , …) and 𝜎(, 𝑌0 , 𝑌1 , …).
This is 𝜎( ∪  ∪ {𝑌𝑖 }𝑖≥0 ∪ {𝑍𝑖 }𝑖≥0 ).
Let’s try to show that 𝔼[(𝑌∞ − 𝑍∞ )2 ] = 0.
𝔼[(𝑌∞ − 𝑍∞ )2 ] = 𝔼[𝑌∞2 ] + 𝔼[𝑍∞   2 ] − 2𝔼[𝑌 𝑍 ].
                                              ∞ ∞
Consider the definition of 𝑌𝑘 and 𝑍𝑘 .
𝔼[𝑌𝑘+1 |, 𝑍0 , … , 𝑍𝑘−1 ] = 𝑌𝑘 .
𝔼[𝑍𝑘+1 |, 𝑌0 , … , 𝑌𝑘−1 ] = 𝑍𝑘 .
Consider 𝔼[𝑌∞ |, 𝑌0 , 𝑌1 , …].
Since 𝑌∞ is measurable with respect to 𝜎(, 𝑍0 , 𝑍1 , …), this does not simplify easily.
Let’s assume 𝑌∞ ≠ 𝑍∞ on a set of positive measure.
Consider the case where  and  are trivial.
Then 𝑌𝑘 = 𝔼[𝑋|𝜎(𝑍0 , … , 𝑍𝑘−1 )] and 𝑍𝑘 = 𝔼[𝑋|𝜎(𝑌0 , … , 𝑌𝑘−1 )].
Let ∞ = 𝜎( ∪  ∪ {𝑌𝑖 , 𝑍𝑖 }𝑖≥0 ).
Consider 𝔼[𝑋|∞ ].
We have 𝑌∞ = 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 , …)].
By the properties of conditional expectation, 𝔼[𝑌∞ |] = 𝔼[𝔼[𝑋|, 𝑍0 , 𝑍1 , …]|].
Consider the sigma-algebra  generated by all the random variables involved.
 = 𝜎( ∪  ∪ {𝑌𝑖 }𝑖≥0 ∪ {𝑍𝑖 }𝑖≥0 ).
Let 𝑀∞ = 𝔼[𝑋| ∨ ].
Consider the limits. As 𝑛 → ∞, 𝜎(, 𝑍0 , … , 𝑍𝑛−1 ) ↑ 𝜎(, 𝑍0 , 𝑍1 , …).
As 𝑛 → ∞, 𝜎(, 𝑌0 , … , 𝑌𝑛−1 ) ↑ 𝜎(, 𝑌0 , 𝑌1 , …).
Let ∞ = 𝜎(, 𝑍0 , 𝑍1 , …) and ∞ = 𝜎(, 𝑌0 , 𝑌1 , …).
We have 𝑌∞ = 𝔼[𝑋|∞ ] and 𝑍∞ = 𝔼[𝑋|∞ ].
Consider 𝔼[𝑌∞ − 𝑍∞ ]2 = 𝔼[(𝑌∞ − 𝑋) + (𝑋 − 𝑍∞ )]2
= 𝔼[(𝑌∞ − 𝑋)2 ] + 𝔼[(𝑋 − 𝑍∞ )2 ] + 2𝔼[(𝑌∞ − 𝑋)(𝑋 − 𝑍∞ )]
= 𝔼[(𝑋 − 𝔼[𝑋|∞ ])2 ] + 𝔼[(𝑋 − 𝔼[𝑋|∞ ])2 ] + 2𝔼[𝑌∞ 𝑋 − 𝑌∞ 𝑍∞ − 𝑋 2 + 𝑋𝑍∞ ]
= 𝔼[𝑋 2 ] − 𝔼[𝑌∞2 ] + 𝔼[𝑋 2 ] − 𝔼[𝑍∞ 2 ] + 2𝔼[𝑌 𝑋] − 2𝔼[𝑌 𝑍 ] − 2𝔼[𝑋 2 ] + 2𝔼[𝑋𝑍 ]
                                                ∞          ∞ ∞                         ∞
         2          2          2
= −𝔼[𝑌∞ ] − 𝔼[𝑍∞ ] + 2𝔼[𝑌∞ ] − 2𝔼[𝑌∞ 𝑍∞ ] + 2𝔼[𝑍∞      2]

= 𝔼[𝑌∞2 ] + 𝔼[𝑍∞  2 ] − 2𝔼[𝑌 𝑍 ]
                             ∞ ∞
Consider the sigma-algebra  = 𝜎( ∪ ).
Let 𝑊 = 𝔼[𝑋|].
Consider the limit sigma-algebra generated by all the information.
Let ∞ = 𝜎(, , 𝑌0 , 𝑍0 , 𝑌1 , 𝑍1 , …).
Then ∞ ⊆ ∞ and ∞ ⊆ ∞ .
We have 𝔼[𝑋|∞ ] = 𝑌∞ and 𝔼[𝑋|∞ ] = 𝑍∞ .
Consider 𝔼[𝑌∞ |∞ ] and 𝔼[𝑍∞ |∞ ].
Let  = 𝜎( ∪ ).
Consider 𝔼[𝑋|].
Consider the sequence of pairs (𝑌𝑘 , 𝑍𝑘 ).
If the limits are the same, then for any bounded measurable function 𝑓 , 𝔼[𝑓 (𝑌∞ )] = 𝔼[𝑓 (𝑍∞ )].
Consider the sigma-algebra  = 𝜎( ∪  ∪ 𝜎(𝑌0 , 𝑍0 , …)).
Then both 𝑌∞ and 𝑍∞ are measurable with respect to .
Consider 𝔼[(𝑌𝑛 − 𝑍𝑛 )2 ].
Let  = 𝜎( ∪ ).
Consider 𝑊 = 𝔼[𝑋|].
Then 𝔼[𝑊 |] = 𝔼[𝑋|] and 𝔼[𝑊 |] = 𝔼[𝑋|].


                                                          24
                                                 s1: Simple test-time scaling


Consider the limit sigma-algebra  = 𝜎( ∪  ∪ 𝜎(𝑌0 , 𝑍0 , …)).
We have 𝑌∞ = 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 , …)] and 𝑍∞ = 𝔼[𝑋|𝜎(, 𝑌0 , 𝑌1 , …)].
Let  = 𝜎(, 𝑍0 , 𝑍1 , …) ∨ 𝜎(, 𝑌0 , 𝑌1 , …).
Consider 𝔼[𝑋|].
Consider the random variable 𝑋. Then 𝑌∞ = lim𝑛 𝔼[𝑋|𝑛 ] and 𝑍∞ = lim𝑛 𝔼[𝑋|𝑛 ].
Let ∞ = 𝜎( ∪  ∪ 𝜎(𝑌0 , 𝑍0 , …)).
Consider 𝐸 = {𝜔 ∶ 𝑌∞ (𝜔) ≠ 𝑍∞ (𝜔)}. Suppose 𝑃 (𝐸) > 0.
Consider the conditional expectation with respect to the join of the sigma-algebras.
Let ∞ = 𝜎(, , 𝑌0 , 𝑍0 , …).
Consider 𝔼[𝑋|∞ ].
Then 𝔼[𝔼[𝑋|∞ ]|, 𝑍0 , …] = 𝔼[𝑋|, 𝑍0 , …] = 𝑌𝑛+1 .
And 𝔼[𝔼[𝑋|∞ ]|, 𝑌0 , …] = 𝔼[𝑋|, 𝑌0 , …] = 𝑍𝑛+1 .
So 𝑌∞ = 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 , …)] and 𝑍∞ = 𝔼[𝑋|𝜎(, 𝑌0 , 𝑌1 , …)].
Let ∞ = 𝜎(, 𝑍0 , 𝑍1 , …) and ∞ = 𝜎(, 𝑌0 , 𝑌1 , …).
Consider 𝔼[𝑋|∞ ∨ ∞ ].
Then 𝔼[𝔼[𝑋|∞ ∨ ∞ ]|∞ ] = 𝔼[𝑋|∞ ] = 𝑌∞ .
And 𝔼[𝔼[𝑋|∞ ∨ ∞ ]|∞ ] = 𝔼[𝑋|∞ ] = 𝑍∞ .
This means 𝑌∞ = 𝔼[𝑋|∞ ∨ ∞ ] a.s. and 𝑍∞ = 𝔼[𝑋|∞ ∨ ∞ ] a.s.
Therefore, 𝑌∞ = 𝑍∞ a.s.
Solution:
(a) For the sequence 𝑌𝑘 , the 𝜎-algebras 𝑘 = 𝜎(, 𝑍0 , … , 𝑍𝑘−1 ) are increasing as 𝑘 increases. Since 𝑋 is integrable, by the
Martingale Convergence Theorem for conditional expectations, the sequence 𝑌𝑘 = 𝔼[𝑋|𝑘 ] converges almost surely and in
𝐿1 to 𝑌∞ = 𝔼[𝑋|𝜎(, 𝑍0 , 𝑍1 , …)]. Similarly, for the sequence 𝑍𝑘 , the 𝜎-algebras 𝑘 = 𝜎(, 𝑌0 , … , 𝑌𝑘−1 ) are increasing.
Since 𝑋 is integrable, by the Martingale Convergence Theorem, the sequence 𝑍𝑘 = 𝔼[𝑋|𝑘 ] converges almost surely and in
𝐿1 to 𝑍∞ = 𝔼[𝑋|𝜎(, 𝑌0 , 𝑌1 , …)].
(b) Let ∞ = 𝜎(, 𝑍0 , 𝑍1 , …) and ∞ = 𝜎(, 𝑌0 , 𝑌1 , …). We have 𝑌∞ = 𝔼[𝑋|∞ ] and 𝑍∞ = 𝔼[𝑋|∞ ]. Consider the
𝜎-algebra  = ∞ ∨ ∞ = 𝜎(, , 𝑌0 , 𝑍0 , 𝑌1 , 𝑍1 , …). Consider 𝑊 = 𝔼[𝑋|].
We have ∞ ⊆ , so 𝔼[𝑊 |∞ ] = 𝔼[𝔼[𝑋|]|∞ ] = 𝔼[𝑋|∞ ] = 𝑌∞ .
Since 𝑊 is -measurable, 𝔼[𝑊 |∞ ] = 𝑌∞ implies that 𝑌∞ is the conditional expectation of 𝑊 given ∞ . But since 𝑌∞ is
already ∞ -measurable, we have 𝔼[𝑊 |∞ ] = 𝑊 if 𝑊 is ∞ -measurable, which is not necessarily true.
However, we know that 𝔼[𝑌∞ |∞ ] = 𝑌∞ and 𝔼[𝑍∞ |∞ ] = 𝑍∞ .
Consider 𝔼[𝑋|∞ ∨ ∞ ].
Then 𝔼[𝔼[𝑋|∞ ∨ ∞ ]|∞ ] = 𝔼[𝑋|∞ ] = 𝑌∞ .
This implies that 𝔼[𝑋|∞ ∨ ∞ ] = 𝑌∞ almost surely.
Similarly, 𝔼[𝔼[𝑋|∞ ∨ ∞ ]|∞ ] = 𝔼[𝑋|∞ ] = 𝑍∞ .
This implies that 𝔼[𝑋|∞ ∨ ∞ ] = 𝑍∞ almost surely.
Therefore, 𝑌∞ = 𝑍∞ almost surely.
Final Answer: The final answer is 𝑌∞ = 𝑍∞




                                                              25
                                                     s1: Simple test-time scaling

Table 10. s1K sample stemming from s1-teasers. The black text is the question and the yellow text is the solution. Based solely on the
question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.

Alice and Bob each have a coin and flip it until they get a heads. If Bob flipped his coin more times than Alice, what is the
expected number of times Alice flipped her coin?

4/3

The problem asks for the expected number of times Alice flipped her coin, given that Bob flipped his coin more times than
Alice.
Let 𝑋 be the number of flips Alice makes until she gets a head.
Let 𝑌 be the number of flips Bob makes until he gets a head.
𝑋 and 𝑌 are independent random variables following a geometric distribution with parameter 𝑝 = 0.5 (probability of getting
a head).
The probability mass function for a geometric distribution is 𝑃 (𝐾 = 𝑘) = (1 − 𝑝)𝑘−1 𝑝, for 𝑘 = 1, 2, … .
In this case, 𝑝 = 0.5, so 𝑃 (𝐾 = 𝑘) = (0.5)𝑘−1 (0.5) = (0.5)𝑘 = 21𝑘 .
We are given the condition that Bob flipped his coin more times than Alice, which means 𝑌 > 𝑋.
We want to find the expected value of 𝑋 given that 𝑌 > 𝑋, i.e., 𝐸[𝑋|𝑌 > 𝑋].
                                                        ∑
The formula for conditional expectation is 𝐸[𝑋|𝐴] = 𝑥 𝑥𝑃 (𝑋 = 𝑥|𝐴).
Also, 𝑃 (𝑋 = 𝑥|𝐴) = 𝑃 (𝑋=𝑥∩𝐴)
                          𝑃 (𝐴)
                                 .
In our case, 𝐴 is the event 𝑌 > 𝑋.
We need to find 𝑃 (𝑌 > 𝑋).
               ∑     ∑∞
𝑃 (𝑌 > 𝑋) = ∞    𝑥=1    𝑦=𝑥+1 𝑃 (𝑋 = 𝑥, 𝑌 = 𝑦)
Since 𝑋 and 𝑌 are independent, 𝑃 (𝑋 = 𝑥, 𝑌 = 𝑦) = 𝑃 (𝑋 = 𝑥)𝑃 (𝑌 = 𝑦) = 21𝑥 21𝑦 = 2𝑥+𝑦
                                                                                   1
                                                                                      .
             ∑∞ ∑∞           1     ∑ ∞   1 ∑   ∞      1
𝑃 (𝑌 > 𝑋) = 𝑥=1 𝑦=𝑥+1 2𝑥+𝑦 = 𝑥=1 2𝑥 𝑦=𝑥+1 2𝑦
                                    ∑
The inner sum is a geometric series: ∞        1 𝑦     1 𝑥+1
                                      𝑦=𝑥+1 ( 2 ) = ( 2 )   + ( 12 )𝑥+2 + …
This is a geometric series with first term 𝑎 = ( 12 )𝑥+1 and common ratio 𝑟 = 12 .
                        ( 12 )𝑥+1       ( 12 )𝑥+1
            𝑎
The sum is 1−𝑟 =                = ( 12 )𝑥 .
                                    =
                1− 21        1
               ∑             2
                                ∑∞ 1 𝑥
So, 𝑃 (𝑌 > 𝑋) = ∞     1 1 𝑥
                 𝑥=1 2𝑥 ( 2 ) =  𝑥=1 ( 4 ) .
This is another geometric series with first term 𝑎 = 14 and common ratio 𝑟 = 41 .
                          1         1
            𝑎
The sum is 1−𝑟 =          4
                                = 43 = 13 .
                        1− 41       4
Now, we need to find 𝑃 (𝑋 = 𝑥 ∩ 𝑌 > 𝑋) = 𝑃 (𝑋 = 𝑥, 𝑌 > 𝑥).
𝑃 (𝑋 = 𝑥, 𝑌 > 𝑥) = 𝑃 (𝑋 = 𝑥)𝑃 (𝑌 > 𝑥), due to independence.
            ∑                   ∑∞
𝑃 (𝑌 > 𝑥) = ∞ 𝑦=𝑥+1 𝑃 (𝑌 = 𝑦) =
                                        1      1 𝑥+1
                                  𝑦=𝑥+1 2𝑦 = ( 2 )   + ( 12 )𝑥+2 + ⋯ = ( 12 )𝑥 .
So, 𝑃 (𝑋 = 𝑥, 𝑌 > 𝑥) = 21𝑥 × ( 12 )𝑥 = ( 14 )𝑥 .
                                                            ( 41 )𝑥
Now, we can find 𝑃 (𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃 (𝑋=𝑥∩𝑌  >𝑋)
                                      𝑃 (𝑌 >𝑋)
                                                 =            1       = 3( 14 )𝑥 .
                                                              3
Finally, we can calculate the conditional expectation:
                 ∑                            ∑                     ∑
𝐸[𝑋|𝑌 > 𝑋] = ∞     𝑥=1 𝑥𝑃  (𝑋 = 𝑥|𝑌 > 𝑋) = ∞        𝑥 ⋅ 3( 14 )𝑥 = 3 ∞       1 𝑥
                                                                      𝑥=1 𝑥( 4 ) .
                         ∑∞                     𝑥=1
Consider the series 𝑆 = 𝑥=1 𝑥𝑟 , where |𝑟| < 1.
                                 𝑥

𝑆 = 𝑟 + 2𝑟2 + 3𝑟3 + 4𝑟4 + …
𝑆
𝑟
  = 1 + 2𝑟 + 3𝑟2 + 4𝑟3 + …
                                                         1
Consider the geometric series 𝐺 = 1 + 𝑟 + 𝑟2 + 𝑟3 + ⋯ = 1−𝑟 .
Differentiating with respect to 𝑟:
𝑑𝐺                                 𝑑    1        1
𝑑𝑟
    = 1 + 2𝑟 + 3𝑟2 + 4𝑟3 + ⋯ = 𝑑𝑟    ( 1−𝑟 ) = (1−𝑟)2.
           1
So, 𝑆𝑟 = (1−𝑟)                      𝑟
              2 , which means 𝑆 = (1−𝑟)2 .

In our case, 𝑟 = 14 .


                                                                      26
                                                              s1: Simple test-time scaling

∑∞       1 𝑥
                    1                1        1

  𝑥=1 𝑥( 4 ) =
                    4
                             =       4
                                           = 49 = 14 × 16 = 49 .
                 (1− 14 )2       ( 43 )2      16
                                                        9

Therefore, 𝐸[𝑋|𝑌 > 𝑋] = 3 × 94 = 12 9
                                      = 43 .
Alternative approach: Symmetry.
Consider the events 𝑌 > 𝑋, 𝑋 > 𝑌 , and 𝑋 = 𝑌 .
𝑃 (𝑌 > 𝑋) = 𝑃 (𝑋 > 𝑌 ) due to symmetry.
              ∑                        ∑∞                          ∑∞ 1 2 ∑∞ 1 𝑘                        1        1
𝑃 (𝑋 = 𝑌 ) = ∞  𝑘=1 𝑃 (𝑋 = 𝑘, 𝑌 = 𝑘) =    𝑘=1 𝑃 (𝑋 = 𝑘)𝑃 (𝑌 = 𝑘) =  𝑘=1 ( 2𝑘 ) = 𝑘=1 ( 4 ) =
                                                                                                        4
                                                                                                              = 43 = 13 .
                                                                                                      1− 14      4
We have 𝑃 (𝑌 > 𝑋) + 𝑃 (𝑋 > 𝑌 ) + 𝑃 (𝑋 = 𝑌 ) = 1.
𝑃 (𝑌 > 𝑋) + 𝑃 (𝑌 > 𝑋) + 31 = 1.
2𝑃 (𝑌 > 𝑋) = 1 − 13 = 32 .
𝑃 (𝑌 > 𝑋) = 31 . This confirms our previous calculation.
Let 𝐸[𝑋|𝑌 > 𝑋] = 𝜇.
We are looking for the expected number of flips Alice made, given Bob flipped more times.
Consider the case where the first flip for both is tails. Then the process restarts, and the condition remains.
Let 𝐴1 be Alice’s first flip, 𝐵1 be Bob’s first flip.
𝑃 (𝐴1 = 𝐻) = 1∕2, 𝑃 (𝐴1 = 𝑇 ) = 1∕2.
𝑃 (𝐵1 = 𝐻) = 1∕2, 𝑃 (𝐵1 = 𝑇 ) = 1∕2.
Case 1: 𝐴1 = 𝐻, 𝐵1 = 𝐻. 𝑋 = 1, 𝑌 = 1, so 𝑌 > 𝑋 is false.
Case 2: 𝐴1 = 𝐻, 𝐵1 = 𝑇 . Alice stops at 1, Bob continues. 𝑌 > 1.
Case 3: 𝐴1 = 𝑇 , 𝐵1 = 𝐻. Bob stops at 1, Alice continues.
Case 4: 𝐴1 = 𝑇 , 𝐵1 = 𝑇 . Both continue.
Consider the conditional probabilities given 𝑌 > 𝑋.
Let 𝐸 = 𝐸[𝑋|𝑌 > 𝑋].
If the first flips are HH, the condition 𝑌 > 𝑋 is not met.
If the first flips are HT, Alice’s number of flips is 1.
If the first flips are TH, Bob’s number of flips is 1, Alice’s is at least 2, so 𝑌 < 𝑋.
If the first flips are TT, the problem restarts, and the expected number of additional flips for Alice is still 𝐸.
Let 𝑁𝐴 be the number of flips Alice makes, 𝑁𝐵 be the number of flips Bob makes.
We are given 𝑁𝐵 > 𝑁𝐴 .
Consider the first flips:
HH (prob 1/4): Condition not met.
HT (prob 1/4): Alice flips 1 time.
TH (prob 1/4): Bob flips 1 time, Alice > 1, so 𝑁𝐴 > 𝑁𝐵 , condition not met.
TT (prob 1/4): Both continue.
Let 𝐸 be the expected number of Alice’s flips given Bob flipped more.
If the first flips are HT, Alice flipped 1 time, and the condition is met.
If the first flips are TT, the situation is the same as the beginning, so the expected number of additional flips for Alice is 𝐸.
Consider the event 𝑌 > 𝑋.
Conditioning on the first flips:
𝐸[𝑋|𝑌 > 𝑋] = 𝐸[𝑋|𝑌 > 𝑋, 𝐻𝐻]𝑃 (𝐻𝐻|𝑌 > 𝑋) + 𝐸[𝑋|𝑌 > 𝑋, 𝐻𝑇 ]𝑃 (𝐻𝑇 |𝑌 > 𝑋) + 𝐸[𝑋|𝑌 > 𝑋, 𝑇 𝐻]𝑃 (𝑇 𝐻|𝑌 >
𝑋) + 𝐸[𝑋|𝑌 > 𝑋, 𝑇 𝑇 ]𝑃 (𝑇 𝑇 |𝑌 > 𝑋)
𝑃 (𝑌 > 𝑋|𝐻𝐻) = 0.
𝑃 (𝑌 > 𝑋|𝐻𝑇 ) = 1.
𝑃 (𝑌 > 𝑋|𝑇 𝐻) = 0.
𝑃 (𝑌 > 𝑋|𝑇 𝑇 ) depends on the future flips.
Let’s use the definition of conditional expectation directly.
                    ∑                               ∑∞ 𝑃 (𝑋=𝑥,𝑌 >𝑥)
𝐸[𝑋|𝑌 > 𝑋] = ∞         𝑥=1 𝑥𝑃 (𝑋 = 𝑥|𝑌 > 𝑋) =        𝑥=1 𝑥 𝑃 (𝑌 >𝑋)
Consider the probabilities of the first flips given 𝑌 > 𝑋.
                                              0⋅1∕4
𝑃 (𝐻𝐻|𝑌 > 𝑋) = 𝑃 (𝑌 >𝑋|𝐻𝐻)𝑃  𝑃 (𝑌 >𝑋)
                                      (𝐻𝐻)
                                           = 1∕3 = 0.
                                                     1⋅1∕4
𝑃 (𝐻𝑇 |𝑌 > 𝑋) = 𝑃 (𝑌 >𝑋|𝐻𝑇  )𝑃 (𝐻𝑇 )
                      𝑃 (𝑌 >𝑋)
                                     = 1∕3 = 34 .



                                                                          27
                                                        s1: Simple test-time scaling

                                            0⋅1∕4
𝑃 (𝑇 𝐻|𝑌 > 𝑋) = 𝑃 (𝑌 >𝑋|𝑇  𝐻)𝑃 (𝑇 𝐻)
                      𝑃 (𝑌 >𝑋)
                                     = 1∕3 = 0.
𝑃 (𝑇 𝑇 |𝑌 > 𝑋) = 𝑃 (𝑌 >𝑋|𝑇       𝑇 )𝑃 (𝑇 𝑇 )
                            𝑃 (𝑌 >𝑋)
If the first flips are TT, the future behavior is the same, and the condition 𝑌 > 𝑋 holds if Bob takes more flips than Alice in
the subsequent trials.
𝑃 (𝑌 > 𝑋|𝑇 𝑇 ) = 𝑃 (Bob needs more flips than Alice starting from scratch) = 𝑃 (𝑌 > 𝑋) = 1∕3.
                       (1∕3)⋅(1∕4)
𝑃 (𝑇 𝑇 |𝑌 > 𝑋) =          1∕3
                                   = 14 .
Sum of conditional probabilities: 0 + 3∕4 + 0 + 1∕4 = 1.
Now consider the conditional expectations of 𝑋:
𝐸[𝑋|𝑌 > 𝑋, 𝐻𝐻]: Not relevant since 𝑃 (𝐻𝐻|𝑌 > 𝑋) = 0.
𝐸[𝑋|𝑌 > 𝑋, 𝐻𝑇 ]: Given HT and 𝑌 > 𝑋, Alice flipped 1 time. 𝐸 = 1.
𝐸[𝑋|𝑌 > 𝑋, 𝑇 𝐻]: Not relevant.
𝐸[𝑋|𝑌 > 𝑋, 𝑇 𝑇 ]: Given TT and 𝑌 > 𝑋, Alice has already flipped once, and the expected number of additional flips is 𝐸.
So, the total number of flips is 1 + 𝐸.
𝐸 = 0 ⋅ 0 + 1 ⋅ (3∕4) + undefined ⋅ 0 + (1 + 𝐸) ⋅ (1∕4)
𝐸 = 0 + 3∕4 + 0 + 1∕4 + 𝐸∕4
𝐸 = 1 + 𝐸∕4
𝐸 − 𝐸∕4 = 1
3𝐸∕4 = 1
𝐸 = 4∕3.
This confirms the previous result.
Let’s think about the intuition. If we have the condition 𝑌 > 𝑋, it means Alice was "lucky" and got a head early, or Bob was
"unlucky".
Consider the random variable 𝑍 = 𝑌 − 𝑋. We are given 𝑍 > 0.
We want 𝐸[𝑋|𝑍 > 0].
Let 𝑝 = 𝑃 (head) = 0.5.
The number of tails before the first head follows a geometric distribution with parameter 𝑝.
𝑋 − 1 ∼ 𝐺𝑒𝑜𝑚(𝑝), 𝑌 − 1 ∼ 𝐺𝑒𝑜𝑚(𝑝).
𝑃 (𝑋 − 1 = 𝑘) = (1 − 𝑝)𝑘 𝑝.
𝑃 (𝑋 = 𝑥) = (1 − 𝑝)𝑥−1 𝑝.
Consider the joint distribution of 𝑋 and 𝑌 .
𝑃 (𝑋 = 𝑥, 𝑌 = 𝑦) = 𝑝2 (1 − 𝑝)𝑥−1 (1 − 𝑝)𝑦−1
Consider the event 𝑌 = 𝑋 + 𝑘, where 𝑘 ≥ 1.
                      ∑                                 ∑∞ 2
𝑃 (𝑌 = 𝑋 + 𝑘) = ∞        𝑥=1 𝑃 (𝑋 = 𝑥, 𝑌 = 𝑥 + 𝑘) =        𝑥=1 𝑝 (1 − 𝑝)
                                                                        𝑥−1 (1 − 𝑝)𝑥+𝑘−1
                   ∑
= 𝑝2 (1 − 𝑝)𝑘−1 ∞     𝑥=1 ((1 − 𝑝) )
                                    2 𝑥−1 = 𝑝2 (1 − 𝑝)𝑘−1    1
                                                          1−(1−𝑝)2
                     1                   𝑘−1 1                       1
= 𝑝2 (1 − 𝑝)𝑘−1 1−(1−2𝑝+𝑝        2
                          2 ) = 𝑝 (1 − 𝑝)   2𝑝−𝑝2
                                                  = 𝑝2 (1 − 𝑝)𝑘−1 𝑝(2−𝑝)
         𝑘−1
= 𝑝(1−𝑝)
      2−𝑝
             .
If 𝑝 = 0.5, then 1 − 𝑝 = 0.5, 2 − 𝑝 = 1.5.
                             𝑘−1         𝑘     (1∕2)𝑘
𝑃 (𝑌 = 𝑋 + 𝑘) = 0.5(0.5)         = (0.5)   = 3∕2 = 21𝑘 ⋅ 23 = 3⋅21𝑘−1 .
           ∑            1.5          1.5
Check if ∞    𝑘=1 𝑃 (𝑌 = 𝑋 + 𝑘) = 𝑃 (𝑌 > 𝑋) = 1∕3.
∑∞        1       1 ∑∞ 1 𝑘−1
   𝑘=1 3⋅2𝑘−1 = 3     𝑘=1 ( 2 )   = 31 (1 + 1∕2 + 1∕4 + … ) = 13 1−1∕2  1
                                                                          = 13 ⋅ 2 = 32 .
Something is wrong here.
Let’s recalculate 𝑃 (𝑌 > 𝑋).
                ∑      ∑∞
𝑃 (𝑌 > 𝑋) = ∞      𝑥=1   𝑦=𝑥+1 (1 − 𝑝)
                                          𝑥−1 𝑝(1 − 𝑝)𝑦−1 𝑝
      ∑ ∞               ∑   ∞
= 𝑝2 𝑥=1 (1 − 𝑝)𝑥−1 𝑦=𝑥+1 (1 − 𝑝)𝑦−1
                                             (1−𝑝)  𝑥           𝑥
Inner sum: (1 − 𝑝)𝑥 + (1 − 𝑝)𝑥+1 + ⋯ = 1−(1−𝑝)       = (1−𝑝)
                                                          𝑝
                                                             .
              2 ∑∞                 (1−𝑝)𝑥             ∑∞
𝑃 (𝑌 > 𝑋) = 𝑝      𝑥=1 (1 − 𝑝)
                               𝑥−1
                                      𝑝
                                          = 𝑝(1 − 𝑝) 𝑥=1 (1 − 𝑝)2(𝑥−1)
          ∑
= 𝑝(1 − 𝑝) ∞              2 𝑥−1 = 𝑝(1 − 𝑝)
             𝑥=1 ((1 − 𝑝) )
                                                1
                                            1−(1−𝑝)2
                1                     1                 1     1−𝑝
= 𝑝(1 − 𝑝) 1−(1−2𝑝+𝑝 2 ) = 𝑝(1 − 𝑝) 2𝑝−𝑝2 = 𝑝(1 − 𝑝) 𝑝(2−𝑝) = 2−𝑝 .

If 𝑝 = 0.5, 𝑃 (𝑌 > 𝑋) = 0.5
                        1.5
                            = 31 . Correct.

                                                                    28
                                                    s1: Simple test-time scaling

Now, let’s find 𝐸[𝑋|𝑌 > 𝑋].
                 ∑                             ∑∞ 𝑃 (𝑋=𝑥,𝑌 >𝑥)
𝐸[𝑋|𝑌 > 𝑋] = ∞     𝑥=1 𝑥𝑃 (𝑋 = 𝑥|𝑌 > 𝑋) =        𝑥=1 𝑥 𝑃 (𝑌 >𝑋) .
𝑃 (𝑋 = 𝑥, 𝑌 > 𝑥) = 𝑃 (𝑋 = 𝑥)𝑃 (𝑌 > 𝑥) (independence)
𝑃 (𝑋 = 𝑥) = (1 − 𝑝)𝑥−1 𝑝.
              ∑                                                             𝑥
                             𝑦−1 𝑝 = 𝑝[(1 − 𝑝)𝑥 + (1 − 𝑝)𝑥+1 + … ] = 𝑝 (1−𝑝) = (1 − 𝑝)𝑥 .
𝑃 (𝑌 > 𝑥) = ∞   𝑦=𝑥+1 (1 − 𝑝)                                         1−(1−𝑝)
𝑃 (𝑋 = 𝑥, 𝑌 > 𝑥) = (1 − 𝑝)𝑥−1 𝑝(1 − 𝑝)𝑥 = 𝑝(1 − 𝑝)2𝑥−1 .
                        ∑
𝐸[𝑋|𝑌 > 𝑋] = 𝑃 (𝑌1>𝑋) ∞    𝑥=1 𝑥𝑝(1 − 𝑝)
                                         2𝑥−1

   2−𝑝           ∑
= 1−𝑝  𝑝(1 − 𝑝)−1 ∞𝑥=1 𝑥((1 − 𝑝) ) (error in power)
                                2 𝑥
                   𝑥
               (1−𝑝)
𝑃 (𝑌 > 𝑥) = 1−(1−𝑝)   𝑝 = (1 − 𝑝)𝑥 .
                          ∑
𝐸[𝑋|𝑌 > 𝑋] = 𝑃 (𝑌1>𝑋) ∞      𝑥=1 𝑥𝑃 (𝑋 = 𝑥)𝑃 (𝑌 > 𝑥)
      1     ∑∞
= 𝑃 (𝑌 >𝑋) 𝑥=1 𝑥(1 − 𝑝) 𝑝(1 − 𝑝)𝑥
                          𝑥−1
            ∑
= 𝑃 (𝑌𝑝>𝑋) ∞  𝑥=1 𝑥(1 − 𝑝)
                          2𝑥−1
                        ∑
        𝑝
= (1−𝑝)∕(2−𝑝) (1 − 𝑝)−1 ∞               2 𝑥
                           𝑥=1 𝑥((1 − 𝑝) )
     2−𝑝 ∑∞
             𝑥=1 𝑥((1 − 𝑝) ) .
= 𝑝 (1−𝑝)                  2 𝑥
          2
                   ∑ ∞             𝑟
Let 𝑟 = (1 − 𝑝)2 . 𝑥=1 𝑥𝑟𝑥 = (1−𝑟)   2.

1 − 𝑟 = 1 − (1 − 𝑝)2 = 1 − (1 − 2𝑝 + 𝑝2 ) = 2𝑝 − 𝑝2 = 𝑝(2 − 𝑝).
(1 − 𝑟)2 = 𝑝2 (2 − 𝑝)2 .
∑∞                       (1−𝑝)2
  𝑥=1 𝑥((1 − 𝑝) ) = 𝑝2 (2−𝑝)2 .
                2 𝑥

                2−𝑝 (1−𝑝)      2     1
𝐸[𝑋|𝑌 > 𝑋] = 𝑝 (1−𝑝)2 𝑝2 (2−𝑝)2 = 𝑝(2−𝑝) .
                            1        1     1
If 𝑝 = 0.5, 𝐸[𝑋|𝑌 > 𝑋] = 0.5×1.5 = 0.75 = 3∕4 = 4∕3.

Solution:
Let 𝑋 be the number of flips Alice makes until she gets a head, and 𝑌 be the number of flips Bob makes until he gets a
head. 𝑋 and 𝑌 are independent geometric random variables with parameter 𝑝 = 0.5. The probability mass function is
𝑃 (𝐾 = 𝑘) = (1 − 𝑝)𝑘−1 𝑝.
We are interested in the expected number of times Alice flipped her coin, given that Bob flipped his coin more times than
Alice, i.e., 𝐸[𝑋|𝑌 > 𝑋].
                                                                ∑
Using the definition of conditional expectation, 𝐸[𝑋|𝑌 > 𝑋] = ∞   𝑥=1 𝑥𝑃 (𝑋 = 𝑥|𝑌 > 𝑋).
                      𝑃 (𝑋=𝑥,𝑌 >𝑥)
𝑃 (𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃 (𝑌 >𝑋) .
First, calculate 𝑃 (𝑌 > 𝑋).
               ∑     ∑∞                          ∑∞
𝑃 (𝑌 > 𝑋) = ∞    𝑥=1   𝑦=𝑥+1 𝑃 (𝑋 = 𝑥, 𝑌 = 𝑦) =    𝑥=1 𝑃 (𝑋 = 𝑥)𝑃 (𝑌 > 𝑥).
𝑃 (𝑋 = 𝑥) = (0.5)𝑥−1 (0.5) = (0.5)𝑥 = 21𝑥 .
            ∑
𝑃 (𝑌 > 𝑥) = ∞ 𝑦=𝑥+1 (0.5)
                          𝑦−1 (0.5) = (0.5)𝑥 .
             ∑∞ 1 1 𝑥 ∑∞ 1 𝑥                   1∕4 1∕4
𝑃 (𝑌 > 𝑋) = 𝑥=1 2𝑥 ( 2 ) = 𝑥=1 ( 4 ) = 1−1∕4 = 3∕4 = 31 .
Next, calculate 𝑃 (𝑋 = 𝑥, 𝑌 > 𝑥) = 𝑃 (𝑋 = 𝑥)𝑃 (𝑌 > 𝑥) = 21𝑥 ( 21 )𝑥 = ( 14 )𝑥 .
                                                  (1∕4)𝑥
Now, find 𝑃 (𝑋 = 𝑥|𝑌 > 𝑋) = 𝑃 (𝑋=𝑥,𝑌      >𝑥)
                                   𝑃 (𝑌 >𝑋)
                                               = 1∕3 = 3( 14 )𝑥 .
Finally, calculate the conditional expectation:
                  ∑                             ∑                     ∑
𝐸[𝑋|𝑌 > 𝑋] = ∞           𝑥𝑃 (𝑋 = 𝑥|𝑌 > 𝑋) = ∞         𝑥 ⋅ 3( 14 )𝑥 = 3 ∞       1 𝑥
                                                                        𝑥=1 𝑥( 4 ) .
                    ∑∞
                     𝑥=1
                                    𝑟
                                                  𝑥=1
Using the formula 𝑥=1 𝑥𝑟 = (1−𝑟)2 with 𝑟 = 1∕4:
                             𝑥
∑∞       1 𝑥       1∕4       1∕4      1∕4    1   16    4
  𝑥=1 𝑥( 4 ) = (1−1∕4)2 = (3∕4)2 = 9∕16 = 4 × 9 = 9 .
𝐸[𝑋|𝑌 > 𝑋] = 3 × 49 = 12
                      9
                         = 34 .
Final Answer: The final answer is 4∕3




                                                                29
                                                          s1: Simple test-time scaling

Table 11. s1K sample stemming from the astronomy subset of OlympicArena. The black text is the question and the blue text is the
solution. Based solely on the question we generate the light blue reasoning trace and blue response with Gemini, which we use for training.

An often-repeated fun fact is that humans produce more power per unit volume than stars. If the sun were the same size, but
it produced the same amount of power per unit volume as a human, what would its surface temperature be? Assume the
"average human" produces 100 watts of power and has a volume of 66400 cubic centimeters.
A: 3500 K
B: 10000 K
C: 25000 K
D: 40000 K
E: 50000 K

Using the numbers from the problem, the average human produces
                                                              100
                                                  𝑢=                   = 1506 W∕m3
                                                          66400 × 10−6
The volume of the sun is 43 𝜋𝑅3⊙ , so its new power output would be 𝑃 = 43 𝜋𝑅3⊙ 𝑢. To find the equilibrium temperature, we use

                                                          𝑃 = 𝜎𝐴𝑇 4 = 4𝜋𝜎𝑅2⊙ 𝑇 4
Solving for 𝑇 , we get
                                        √          √(          )
                                                   √
                                         4   𝑅⊙ 𝑢 √√
                                                   4
                                                     6.96 × 108 (1506)
                                  𝑇 =            =      (           ) = 49823 ≈ 50000 K.
                                              3𝜎      3 5.67 × 10−8


The problem asks for the surface temperature of the sun if it had the same size but produced power per unit volume as a
human.
First, let’s find the power density of a human.
Power produced by a human, 𝑃ℎ𝑢𝑚𝑎𝑛 = 100 W
Volume of a human, 𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm3
Convert the volume to cubic meters:
1 m = 100 cm
1 m3 = (100 cm)3 = 106 cm3
𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm3 = 66400 × 10−6 m3 = 0.0664 m3
                                            𝑃       100 W
Power density of a human, 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 𝑉ℎ𝑢𝑚𝑎𝑛 = 0.0664 m3
                                                  ℎ𝑢𝑚𝑎𝑛
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 1506.024 W∕m3
Now, consider the sun with the same size as it currently is, but with a power density equal to that of a human.
We need the radius of the sun to calculate its volume. The average radius of the sun is 𝑅𝑠𝑢𝑛 = 6.95 × 108 m.
Volume of the sun, 𝑉𝑠𝑢𝑛 = 43 𝜋𝑅3𝑠𝑢𝑛 = 43 𝜋(6.95 × 108 m)3
𝑉𝑠𝑢𝑛 = 43 𝜋(337.159375 × 1024 ) m3
𝑉𝑠𝑢𝑛 = 1410.24 × 1024 × 𝜋 m3
𝑉𝑠𝑢𝑛 = 4.4307 × 1027 m3
If the sun had the same power density as a human, the total power produced by this modified sun would be:
𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 × 𝑉𝑠𝑢𝑛
𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 1506.024 W∕m3 × 4.4307 × 1027 m3
𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 6.6726 × 1030 W




                                                                      30
                                                  s1: Simple test-time scaling

The surface temperature of a star is related to its luminosity (total power output) through the Stefan-Boltzmann law:
𝐿 = 4𝜋𝑅2 𝜎𝑇 4
where 𝐿 is the luminosity (total power output), 𝑅 is the radius of the star, 𝜎 is the Stefan-Boltzmann constant, and 𝑇 is the
surface temperature.
In this case, the luminosity of the modified sun is 𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 . The radius of the modified sun is the same as the actual sun,
𝑅𝑠𝑢𝑛 .
The Stefan-Boltzmann constant, 𝜎 = 5.67 × 10−8 Wm−2 K −4 .
𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 4𝜋𝑅2𝑠𝑢𝑛 𝜎𝑇 4
6.6726 × 1030 W = 4𝜋(6.95 × 108 m)2 × 5.67 × 10−8 Wm−2 K −4 × 𝑇 4
Calculate the term 4𝜋𝑅2𝑠𝑢𝑛 𝜎:
𝑅2𝑠𝑢𝑛 = (6.95 × 108 )2 = 48.3025 × 1016 m2
4𝜋𝑅2𝑠𝑢𝑛 = 4𝜋 × 48.3025 × 1016 = 607.08 × 1016 m2
4𝜋𝑅2𝑠𝑢𝑛 𝜎 = 607.08 × 1016 m2 × 5.67 × 10−8 Wm−2 K −4
4𝜋𝑅2𝑠𝑢𝑛 𝜎 = 3441.93 × 108 WK −4 = 3.44193 × 1011 WK −4
Now, substitute this back into the equation:
6.6726 × 1030 = 3.44193 × 1011 × 𝑇 4
Solve for 𝑇 4 : 30
        6.6726×10
𝑇 4 = 3.44193×10  11 = 1.9386 × 10
                                   19

Solve for 𝑇 :
𝑇 = (1.9386 × 1019 )1∕4 = (193.86 × 1017 )1∕4
𝑇 = (1.9386)1∕4 × (1019 )1∕4 = (1.9386)0.25 × 104.75
Calculate (1.9386)0.25 :
Let 𝑦 = 1.93860.25 , then ln 𝑦 = 0.25 ln(1.9386)
ln(1.9386) ≈ 0.6619
ln 𝑦 = 0.25 × 0.6619 = 0.165475
𝑦 = 𝑒0.165475 ≈ 1.1799
So, 𝑇 = 1.1799 × 104.75 = 1.1799 × 104 × 100.75
100.75 = 103∕4 = (103 )1∕4 = (1000)1∕4
We know 54 = 625 and 64 = 1296, so (1000) √
                                               1∕4 is between 5 and 6.

Alternatively, 100.75 = 100.5 × 100.25 = 10 × 100.25 ≈ 3.16 × 1.778 = 5.62
𝑇 = 1.1799 × 104 × 5.62 = 6.63 × 104 K = 66300 K
Let’s double-check the calculations.
Power density of human: 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 1506.02 W∕m3
Volume of sun: 𝑉𝑠𝑢𝑛 = 4.43 × 1027 m3
Power of modified sun: 𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 1506.02 × 4.43 × 1027 = 6671.67 × 1027 = 6.67 × 1030 W
Stefan-Boltzmann law: 𝑃 = 4𝜋𝑅2 𝜎𝑇 4
          𝑃
𝑇 4 = 4𝜋𝑅   2𝜎
4𝜋𝑅2𝑠𝑢𝑛 𝜎 = 3.44 × 1011
             30
𝑇 4 = 6.67×10
      3.44×1011
                = 1.939 × 1019
𝑇 = (1.939 × 1019 )1∕4 = (193.9 × 1017 )1∕4
𝑇 = (193.9)0.25 × (1017 )0.25 = (193.9)0.25 × 104.25
Calculate (193.9)0.25 :
Let 𝑦 = 193.90.25 , then ln 𝑦 = 0.25 ln(193.9)
ln(193.9) ≈ 5.267
ln 𝑦 = 0.25 × 5.267 = 1.31675
𝑦 = 𝑒1.31675 ≈ 3.73
𝑇 = 3.73 × 104.25 = 3.73 × 104 × 100.25 = 37300 × 1.778 = 66329 K
The result is around 66000 K. This is not among the options. Let’s review the assumptions and calculations.
Assumption: The sun radiates as a black body, which is reasonable for estimating surface temperature.




                                                               31
                                                  s1: Simple test-time scaling

Consider potential errors in calculations.
Let’s try a simpler approach by setting up ratios.
The power per unit volume of the modified sun is equal to that of a human.
𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛   𝑃
    𝑉
               = 𝑉ℎ𝑢𝑚𝑎𝑛
      𝑠𝑢𝑛           ℎ𝑢𝑚𝑎𝑛
We also know 𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 4𝜋𝑅2𝑠𝑢𝑛 𝜎𝑇𝑚𝑜𝑑𝑖𝑓
                                        4
                                              𝑖𝑒𝑑_𝑠𝑢𝑛
                                                      .
          4
4𝜋𝑅2𝑠𝑢𝑛 𝜎𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛     𝑃
       4                  = 𝑉ℎ𝑢𝑚𝑎𝑛
       3
         𝜋𝑅3𝑠𝑢𝑛              ℎ𝑢𝑚𝑎𝑛
   4
3𝜎𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛
     𝑅𝑠𝑢𝑛
                   = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛
  4                  𝜌           𝑅
𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛
               = 𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛
                       3𝜎
                             𝑠𝑢𝑛

Substitute the values:
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 1506.02 W∕m3
𝑅𝑠𝑢𝑛 = 6.95 × 108 m
𝜎 = 5.67 × 10−8 Wm−2 K −4 8
  4
𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛
               = 1506.02×6.95×10
                   3×5.67×10−8
                                     8       12
 4
𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛
               = 10466.839×10
                  17.01×10−8
                              = 1.0466×10
                                1.701×10−7
                                           = 0.6153 × 1019 = 6.153 × 1018
𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = (6.153 × 1018 )1∕4 = (6153 × 1015 )1∕4
𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = (6153)0.25 × (1015 )0.25 = (6153)0.25 × 103.75
Calculate (6153)0.25 :
Let 𝑦 = 61530.25 , then ln 𝑦 = 0.25 ln(6153)
ln(6153) ≈ 8.724
ln 𝑦 = 0.25 × 8.724 = 2.181
𝑦 = 𝑒2.181 ≈ 8.85
𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 8.85 × 103.75 = 8.85 × 103 × 100.75 = 8850 × 5.62 = 49737 K
This is close to option E: 50000 K.
Let’s re-check the units and calculations carefully.
                             100 W
Power density of human: 0.0664    m3
                                     = 1506.02 W∕m3
                                         𝜌    𝑅
The formula derived was 𝑇 4 = 𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛
                                     3𝜎
                                          𝑠𝑢𝑛
                                              .
Units check:
[𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 ] = W∕m3
[𝑅𝑠𝑢𝑛 ] = m
[𝜎] = Wm−2 K −4
(W∕m3 )×m       W∕m2             2 4

 Wm−2 K −4
           = Wm−2 K −4 = mW2 × mWK = K 4
The units match for 𝑇 4 .
Numerical calculation check:
Numerator: 1506.02 × 6.95 × 108 = 10466.839 × 108 = 1.0466 × 1012
Denominator: 312× 5.67 × 10−8 = 17.01 × 10−8
Ratio: 1.0466×10
        17.01×10−8
                   = 0.061528 × 1020 = 6.1528 × 1018
𝑇 4 = 6.1528 × 1018
𝑇 = (6.1528 × 1018 )1∕4 = (6152.8 × 1015 )1∕4
𝑇 = (6152.8)0.25 × 103.75
Let’s use logarithms for precision.
log10 (6152.8) ≈ 3.789
0.25 × 3.789 = 0.94725
100.94725 ≈ 8.856
𝑇 = 8.856 × 103.75 = 8.856 × 103 × 100.75 = 8856 × 5.623 = 49806 K
This is very close to 50000 K.
Possible sources of minor discrepancies: rounding errors, precision of constants.
Let’s verify the formula derivation.
Power per unit volume = Total  Power
                            Volume


                                                              32
                                                 s1: Simple test-time scaling

𝜌𝑝𝑜𝑤𝑒𝑟 = 𝑉𝐿
𝐿 = 4𝜋𝑅2 𝜎𝑇 4
𝑉 = 43 𝜋𝑅3
             2     4          4
𝜌𝑝𝑜𝑤𝑒𝑟 = 4𝜋𝑅
           4
             𝜎𝑇
              3
                = 3𝜎𝑇
                   𝑅
           3
             𝜋𝑅
If the modified sun has the power density of a human:
                     4
                  3𝜎𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 =         𝑅𝑠𝑢𝑛
  4                𝜌              𝑅
𝑇𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛
               = 𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛
                       3𝜎
                              𝑠𝑢𝑛

The formula derivation seems correct.
Let’s use approximate values to check the order of magnitude.
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 ≈ 103 W∕m3
𝑅𝑠𝑢𝑛 ≈ 7 × 108 m
𝜎 ≈ 6 × 310−8 Wm   −2 K −4
  4    10 ×7×108    7×1011               19 = 3.8 × 1018
𝑇 ≈ 3×6×10−8 = 18×10      −8 = 0.38 × 10
𝑇 ≈ (3.8 × 1018 )1∕4 = (3800 × 1015 )1∕4
𝑇 ≈ (3800)0.25 × 103.75
(3800)0.25 : 74 = 2401, 84 = 4096. So, between 7 and 8.
≈ 7.8
𝑇 ≈ 7.8 × 103.75 = 7800 × 5.6 = 43680 K
The order of magnitude seems consistent with the detailed calculation.
The problem statement is straightforward and doesn’t seem to have any hidden complexities. The assumptions made are
standard for such problems.
Consider the precision of the given values.
Power of human: 100 W (exact)
Volume of human: 66400 cm3 (exact)
The result obtained is very close to option E. Given the nature of multiple-choice questions, the closest option is likely the
intended answer.
If there were any misinterpretations of the question, it would likely be in the setup. However, the phrase "If the sun were the
same size, but it produced the same amount of power per unit volume as a human" seems clear.
The surface temperature is calculated using the Stefan-Boltzmann law, assuming black-body radiation.
Double-check the provided options:
A: 3500 K
B: 10000 K
C: 25000 K
D: 40000 K
E: 50000 K
Our calculated value is approximately 49800 K, which is closest to 50000 K.
If we used more precise values for the constants, the result might be slightly different, but likely not enough to shift to
another option.
The fun fact mentioned in the question implies that the power density of a human is higher than that of the sun. Let’s verify
this.
Power output of the sun: 3.846 × 1026 W
Volume of the sun: 4.43 × 1027 m26    3
                             3.846×10
Power density of the sun: 4.43×1027 = 0.868 W∕m3
Power density of human: 1506 W∕m3
Indeed, the power density of a human is significantly higher than that of the sun.
Final check of the calculation steps and logic. The approach of equating the power density of the modified sun to that of a
human and then using the Stefan-Boltzmann law seems correct.




                                                              33
                                                s1: Simple test-time scaling

Solution:
The power produced by an average human is 𝑃ℎ𝑢𝑚𝑎𝑛 = 100 W, and the volume of an average human is 𝑉ℎ𝑢𝑚𝑎𝑛 = 66400 cm3 =
0.0664 m3 .
                                                 𝑃        100 W
The power density of a human is 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 = 𝑉ℎ𝑢𝑚𝑎𝑛 = 0.0664 m3
                                                                  ≈ 1506.02 W∕m3 .
                                                  ℎ𝑢𝑚𝑎𝑛
If the sun were the same size but produced the same amount of power per unit volume as a human, its power density would
be 𝜌𝑝𝑜𝑤𝑒𝑟,𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 .
The radius of the sun is 𝑅𝑠𝑢𝑛 = 6.95 × 108 m.
The volume of the sun is 𝑉𝑠𝑢𝑛 = 34 𝜋𝑅3𝑠𝑢𝑛 .
                                                                           𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛
The power density of the modified sun is given by 𝜌𝑝𝑜𝑤𝑒𝑟,𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 =       𝑉𝑠𝑢𝑛
                                                                                            .
We also know that 𝑃𝑚𝑜𝑑𝑖𝑓 𝑖𝑒𝑑_𝑠𝑢𝑛 = 4𝜋𝑅2𝑠𝑢𝑛 𝜎𝑇 4 , where 𝑇 is the surface temperature of the modified sun.
Equating the power density:
                  4𝜋𝑅2𝑠𝑢𝑛 𝜎𝑇 4           4
𝜌𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛 =        4          = 3𝜎𝑇
                      3
                        𝜋𝑅3𝑠𝑢𝑛     𝑅𝑠𝑢𝑛

Solving for 𝑇 4 :
      𝜌           𝑅𝑠𝑢𝑛
𝑇 4 = 𝑝𝑜𝑤𝑒𝑟,ℎ𝑢𝑚𝑎𝑛
            3𝜎
Substitute the values:
      1506.02 W∕m3 ×6.95×108 m
𝑇 4 = 3×5.67×10−8 Wm−2 K −4
                 12
𝑇 4 = 1.0466×10
       1.701×10−7
                  = 6.1528 × 1018 K 4
Solving for 𝑇 :
𝑇 = (6.1528 × 1018 )1∕4 = (6152.8)0.25 × 1018∕4 = (6152.8)0.25 × 104.5
Using logarithms:
log10 (6152.8) ≈ 3.789
0.25 × 3.789 = 0.94725
(6152.8)0.25 = 100.94725 ≈ 8.856                     √
𝑇 = 8.856 × 104.5 = 8.856 × 104 × 100.5 = 88560 × 10 ≈ 88560 × 3.16 = 280000 K
Let’s redo the exponent calculation: 18∕4 = 4.5.
𝑇 = 8.856 × 104.5 = 8.856 × 104 × 100.5 - error in previous step.
𝑇 = (6.1528 × 1018 )0.25 = 49806 K
The surface temperature would be approximately 49806 K, which is closest to 50000 K.
Final Answer: The final answer is 𝐸




                                                            34
                                                   s1: Simple test-time scaling

E. Test-time scaling details
E.1. Sequential scaling ablations

<|im_start|>user                                              <|im_start|>user
What is the answer to Life, the Universe and Everything?      What is the answer to Life, the Universe and Everything?
Think for up to 2048 tokens.                                  Think for up to 64 steps.
<|im_start|>assistant                                         <|im_start|>assistant
<|im_start|>think                                             <|im_start|>64 steps left
Let me break down this question into the three parts it       Let me break down this question into the three parts it is asking for:
is asking for: 1) Life 2) Universe 3) Everything              1) Life 2) Universe 3) Everything
Let me start with life...                                     <|im_start|>63 steps left
                                                              Let me start with life...
<|im_start|>answer                                            <|im_start|>answer
The answer is...                                              The answer is...

Figure 10. Token and step instruction data formats for controlling test-time compute. We only train our model on the reasoning trace
and the answer.


     Table 12. Scaling thinking time via tokens-conditional control. All metrics are averaged over the 30 questions in AIME24.
                       Tokens instructed (→)             1024        2048     4096       8192       16384
                                                  No intervention at test-time
                       Thinking tokens                   7939        7158     8263       7108       7500
                       Answer tokens                     689         669      659        722        724
                       AIME24                            26.7        30.0     33.3       33.3       40.0
                                    Forcing end of thinking when token budget is reached
                       Thinking tokens                   1024        2048     4031       5664       6330
                       Answer tokens                     15          15       142        722        691
                       AIME24                            3.3         30.0     33.3       33.3       40.0


Token-conditional control One general approach is to simply tell a model in the prompt precisely how many tokens it
should generate. Ideally, the model can keep track of its token count and adjust its generation to finish within the desired
limits. We experiment with this approach by training a model with token instructions using the format in Figure 10 (left).
We bucket the lengths of the reasoning traces from our 1,000 training examples into powers of two (rounded upwards) and
add a corresponding instruction to the user prompt. For example, if the instruction says “Think for up to 2048 tokens”,
then the reasoning trace has anywhere between 1024 and 2048 tokens. In Table 12, we show that after training the model
hardly follows the token instruction. It does sometimes generate more tokens when given a higher limit but often overshoots
the limit. This may not be unique to our model as prior work suggests that OpenAI o1-mini can also not follow token
instructions (Zhang & Chen, 2024). To prevent exceeding the limit, we test budget forcing the thinking to end once the
limit is reached. This leads to perfect control (Table 12 (lower)). With budget forcing, the scaling trend is also clearer as the
model can no longer overshoot the limit when given a small thinking budget. This leads to better test-time scaling values for
Token Prompting + budget forcing in Table 3. To compute Control reported in Table 3 for token-conditional control variants
we divide the number of times the thinking tokens in Table 12 are less than the upper limit by the total evaluations (2/5 for
without intervention; 5/5 for with intervention).

Step-conditional control Token instructions fail as current models cannot count tokens. To accommodate this lack of
capability, we experiment with making the counting more coarse-grained. We partition the reasoning traces into steps and
ask the model to think for a specific number of steps rather than tokens. We split our reasoning traces on double newlines into
steps, which we find act as intuitive separators based on manual inspection of samples. We bucket our training samples into
powers of 2 depending on their number of steps and add a corresponding step instruction following the format in Figure 10
(right). This format is based on early experiments, where we found the model to be more likely to adhere to the step limit

                                                                35
                                                   s1: Simple test-time scaling




Table 13. Scaling thinking time via step-conditional control. All metrics are averaged over the 30 samples in AIME24. Token counts
ignore the thinking and step delimiters.
                       Steps instructed (→)              16          32       64         128        256
                                                  No intervention at test-time
                       Steps used                        123         90       80         82         136
                       Tokens per step                   60          70       69         66         56
                       Thinking tokens                   7252        6277     5396       5552       7551
                       Answer tokens                     665         653      735        777        754
                       AIME24                            33.3        23.3     33.3       36.7       33.3
                                       Forcing end of thinking when 0 steps are reached
                       Steps used                        16          32       59         78         136
                       Tokens per step                   96          94       80         70         56
                       Thinking tokens                   1517        2963     4636       5409       7551
                       Answer tokens                     1111        788      799        794        754
                       AIME24                            23.3        23.3     33.3       36.7       33.3




Table 14. Scaling thinking time via class-conditional control. We report “accuracy / average thinking tokens per sample”; the higher
the accuracy and the fewer the thinking tokens (inference cost) the better.
               Prompt appended to the question after two newlines              AIME24        MATH500 GPQA
               Answer after a short amount of thinking. Do not spend           30.0% /       90.4% /       56.6% /
               excessive time double-checking your work.                       8033          2537          4177

               Answer after a long amount of thinking. If you feel like you    36.7% /       91.4% /       51.0% /
               are finished early, spend the extra time trying to double-      9651          3875          4827
               check your work until you are absolutely sure that you have
               the correct answer.
               Without generic prompt appending                                50.0% /       93.0% /       57.6% /
                                                                               6109          3298          3510




                                                                36
                                                  s1: Simple test-time scaling

when counting down (“3 steps left...2 steps left”) rather than counting up (“Step2...Step3...”). This is likely because if
counting down, the final step is always 1, which will act as a strong prior to the model to finish its generation. If counting up,
the final step before the answer varies, thus if the model does not remember the original step instruction, it may fail to stop.
We conclude the following from our results in Table 13: (1) The model still struggles to adhere to the step limit. The model
sometimes simply continues counting into negative steps, e.g. “-1 steps left”. To solve this issue, we automatically stop the
thinking process once 0 steps are reached and then force the model to transition to answering mode by appending the answer
token delimiter (§3). This leads to perfect step adherence (lower half of Table 13), yet problems remain. (2) The model
compensates for fewer steps by making each step longer. For example, when forced to use up to 16 steps vs 256 steps, the
model generates an average of 96 tokens per step vs 56. Despite this issue, more steps still clearly correlate with more total
thinking tokens in Table 13 and better performance leading to a positive slope (3) Step instructions are more costly than
other methods. The step delimiters require around 6 tokens each which for e.g. 64 steps adds up to a total of around 380
tokens. When ignoring the step delimiters in token counts as in Table 13, the model still requires 7551 thinking tokens on
average to achieve only 33.3% on AIME24. To compute Control reported in Table 3 for step-conditional control variants,
we first decide that 100 tokens are an upper limit per step and then multiply this number by the steps instructed to arrive
at a proxy total token limit, e.g. 1600 for 16 steps instructed. We then check whether the thinking tokens in Table 13 fit
within the respective limit for each evaluation run (3/5 for without intervention; 5/5 for with intervention). For the model in
Figure 7, we use a model with step-conditional control trained on an earlier version of our data and using an earlier version
of our evaluation codebase.

Class-conditional control OpenAI exposes test-time compute control to users via a “reasoning_effort” API parameter
with three possible settings: low, medium, and high.3 The OpenAI documentation also states that “Reducing reasoning
effort can result in faster responses and fewer tokens used on reasoning in a response." suggesting that they are unable to
control test-time compute with guarantees. Thus, maybe OpenAI simply adjusts the prompt or system instruction depending
on the reasoning effort desired. In Table 14, we show that separate prompts for short and long thinking allow us to control
thinking time to some extent: Prompting the model to think for longer leads to longer thinking. However, it does not reliably
improve performance and control is not precise. The current adherence to control may suffice when we only have three
classes, but it might not scale to finer-grained classes. To compute Control reported in Table 3 for this method, we assume
that prompting the model to think for a short time in Table 14 should produce fewer tokens than the default for AIME24,
while the long prompt should produce more. As 8033 > 6109 and 9651 > 6109, one out of two follows our expected control
thus Control is 50%.

E.2. Examples for rejection sampling ablation




  3
    https://github.com/openai/openai-python/blob/44d6210f101abedeb2dd68507fcffcb329df70ea/
src/openai/types/chat/completion_create_params.py#L172

                                                               37
                                                      s1: Simple test-time scaling

Table 15. Answers generated by an earlier version of s1-32B when rejection sampling on AIME24. The black text is the question and
the yellow text is the solution. Based solely on the question we generate the light blue reasoning trace and blue response with s1-32B
using a temperature of 1. We generate until the reasoning trace has less than 4000 tokens for the first and less than 8000 tokens for the
second example. It required on average 199 (1) tries per example until thinking was ≤ 4000 (≤ 8000 tokens).
                                  √
Let 𝑂(0, 0), 𝐴( 12 , 0), and 𝐵(0, 23 ) be points in the coordinate plane. Let  be the family of segments 𝑃 𝑄 of unit length
lying in the first quadrant with 𝑃 on the 𝑥-axis and 𝑄 on the 𝑦-axis. There is a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵,
that does not belong to any segment from  other than 𝐴𝐵. Then 𝑂𝐶 2 = 𝑞𝑝 , where 𝑝 and 𝑞 are relatively prime positive
integers. Find 𝑝 + 𝑞.

23

[Correct generation with 3981 thinking tokens and 1055 answer tokens]
The problem asks for a point 𝐶 on the segment 𝐴𝐵 that is not contained in any other segment from the family  , except for
𝐴𝐵 itself.
The points 𝑂, 𝐴, 𝐵 form a triangle in the first quadrant.
𝑂 = (0, 0)
𝐴 = ( 12 , 0)
         √
𝐵 = (0, 23 )
                                                  √
The segment 𝐴𝐵 connects 𝐴( 21 , 0) and 𝐵(0, 23 ).
The equation√ of the line containing 𝐴𝐵 can be found using the two points:
             3
               −0      √
Slope 𝑚 = 2 1 = − 3
               0− 2
Using the point-slope form 𝑦 − 𝑦1 = 𝑚(𝑥 − 𝑥1 ) with point 𝐴( 12 , 0):
          √
𝑦 − 0 = − 3(𝑥 − 21 )
      √       √
𝑦 = − 3𝑥 + 23
                                                                            √
The segment 𝐴𝐵 lies in the first quadrant and connects ( 12 , 0) to (0, 23 ).
The family  consists of segments 𝑃 𝑄 of unit length, where 𝑃 is on the 𝑥-axis and 𝑄 is on the 𝑦-axis. Let 𝑃 = (𝑎, 0) and
𝑄 = (0, 𝑏).           √                       √
The length of 𝑃 𝑄 is (𝑎 − 0)2 + (0 − 𝑏)2 = 𝑎2 + 𝑏2 = 1.
So 𝑎2 + 𝑏2 = 1.
Since 𝑃 is on the positive 𝑥-axis and 𝑄 is on the positive 𝑦-axis, we have 𝑎 > 0 and 𝑏 > 0.
The segment 𝑃 𝑄 connects (𝑎, 0) and (0, 𝑏). The equation of the line containing 𝑃 𝑄 is 𝑥𝑎 + 𝑦𝑏 = 1.
The region covered by the family  is the area bounded by the lines containing these segments. The endpoints (𝑎, 0) and
(0, 𝑏) vary such that 𝑎2 + 𝑏2 = 1.
Consider the parametrization 𝑎 = cos 𝜃, 𝑏 = sin 𝜃, where 0 < 𝜃 < 𝜋2 .
The segment 𝑃 𝑄 connects (cos 𝜃, 0) and (0, sin 𝜃). The equation of the line is cos𝑥 𝜃 + sin𝑦 𝜃 = 1.
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from  other than
𝐴𝐵.
Let 𝐶 be a point on 𝐴𝐵.√We can parametrize
                                       √
                                            the points on 𝐴𝐵 as (1 − 𝑡)𝐴 + 𝑡𝐵 for 0 ≤ 𝑡 ≤ 1.
𝐶 = (1 − 𝑡)( 12 , 0) + 𝑡(0, 23 ) = ( 12 − 21 𝑡, 23 𝑡).
Since 𝐶 is distinct from 𝐴 and 𝐵, we have 0 < 𝑡 < 1.
Suppose the point 𝐶 lies on some segment 𝑃 𝑄 ∈  , where 𝑃 = (cos 𝜃, 0) and 𝑄 = (0, sin 𝜃).
The point 𝐶 satisfies the equation cos𝑥 𝜃 + sin𝑦 𝜃 = 1.




                                                                   38
                                                     s1: Simple test-time scaling

Plugging√in the coordinates of 𝐶:
1 1         3
 − 𝑡
2 2
                𝑡
cos 𝜃
      + sin2 𝜃 = 1
           √
  1−𝑡         3𝑡
2 cos 𝜃
        + 2 sin 𝜃
                  =1
Multiply√by 2:
 1−𝑡
      + sin3𝑡𝜃 = 2
cos 𝜃            √
(1 − 𝑡) sin 𝜃 + √3𝑡 cos 𝜃 = 2 sin 𝜃 cos 𝜃
sin 𝜃 − 𝑡 sin 𝜃 + 3𝑡 cos 𝜃 = sin(2𝜃)∕2
We are looking for a point 𝐶 on 𝐴𝐵 such that the only segment from  containing 𝐶 is 𝐴𝐵 itself.
The segment 𝐴𝐵 is part of the family  . √
For 𝐴𝐵, 𝑃 = 𝐴 = ( 12 , 0) and 𝑄 = 𝐵 = (0, 23 ).
                  √
                     3
Here 𝑎 = 21 , 𝑏 =   2√
                       .
𝑎2 + 𝑏2 = ( 2 )2 + ( 23 )2 = 41 + 43 = 1.
            1
                                              √
The line segment 𝐴𝐵 connects ( 21 , 0) and (0, 23 ).
                                                                                 √         √
The equation of the line is 1∕2𝑥
                                 + √ 𝑦 = 1, which simplifies to 2𝑥 + √2 𝑦 = 1, or 3𝑥 + 𝑦 = 23 .
                                      3∕2 √                            3
                                    √       3
This matches the equation 𝑦 = − 3𝑥 + 2 .
Consider the envelope of the family of lines cos𝑥 𝜃 + sin𝑦 𝜃 = 1.
Differentiating with respect to 𝜃:
− 𝑥cossin2 𝜃𝜃 + 𝑦 cos2 𝜃 = 0
             sin 𝜃
𝑦 cos 𝜃
    2   = 𝑥cossin2 𝜃𝜃
 sin 𝜃
𝑦 cos3 𝜃 = 𝑥 sin3 𝜃
𝑦
𝑥
   = tan3 𝜃
tan 𝜃 = ( 𝑥𝑦 )1∕3
From the line equation: 𝑦 sin 𝜃 = sin 𝜃 cos 𝜃 − 𝑥 sin 𝜃
𝑦 = cos 𝜃 − 𝑥
Substitute cos 𝜃 = √ 1 2 = √ 1 2∕3
                           1+tan 𝜃   1+(𝑦∕𝑥)
              𝑥     1∕3
cos 𝜃 = (𝑥2∕3 +𝑦2∕3 )1∕2
                                     1∕3              1∕3
sin 𝜃 = tan 𝜃 cos 𝜃 = ( 𝑥𝑦 )1∕3 (𝑥2∕3 +𝑦
                                      𝑥                   𝑦
                                        2∕3 )1∕2 = (𝑥2∕3 +𝑦2∕3 )1∕2
Substitute back into the line equation:
  𝑥
cos 𝜃
      + sin𝑦 𝜃 = 1
𝑥(𝑥2∕3 + 𝑦2∕3 )1∕2 ∕𝑥1∕3 + 𝑦(𝑥2∕3 + 𝑦2∕3 )1∕2 ∕𝑦1∕3 = 1
𝑥2∕3 (𝑥2∕3 + 𝑦2∕3 )1∕2 + 𝑦2∕3 (𝑥2∕3 + 𝑦2∕3 )1∕2 = 1
(𝑥2∕3 + 𝑦2∕3 )3∕2 = 1
𝑥2∕3 + 𝑦2∕3 = 1
The envelope of the family  is the astroid 𝑥2∕3 + 𝑦2∕3 = 1.
The segments from  are tangent to this astroid.
The point 𝐶 on 𝐴𝐵 that we are looking        √
                                                 for is the point where 𝐴𝐵 is tangent to the astroid.
                                 √             3
The equation of 𝐴𝐵 is 𝑦 = − 3𝑥 + 2 .
Substitute into the√astroid equation:
            √
𝑥2∕3 + (− 3𝑥 + 23 )2∕3 = 1
Let 𝑢 = 𝑥1∕3 and 𝑣 = 𝑦1∕3 . The astroid is 𝑢2 + 𝑣2 = 1.
The point 𝐶 lies on both the line 𝐴𝐵 and the astroid.
Differentiating the astroid equation implicitly:
2 −1∕3
3
  𝑥     + 23 𝑦−1∕3 𝑦′ = 0
         1∕3
𝑦′ = − 𝑥𝑦1∕3 = −( 𝑥𝑦 )1∕3

                                                                   39
                                                 s1: Simple test-time scaling
                                                              √
The slope of the tangent to the astroid at a point (𝑥, 𝑦) is − 3 𝑥𝑦 .
                               √
The slope of the line 𝐴𝐵 is − √  3.
                                         √
                                3 𝑦
So, at the point of tangency, − 𝑥 = − 3.
√       √
 3 𝑦
   𝑥
     =     3
𝑦
      √         √
   = (   3) 3 =3 3
𝑥                                           √
                                    √                                 √
The point 𝐶 lies on the line 𝑦 = − 3𝑥 + 23 and satisfies 𝑥𝑦 = 3 3.
          √
So 𝑦 = 3 3𝑥.
Substitute into the line
                     √
                         equation:
  √          √        3
3 3𝑥 = − 3𝑥 + 2
3𝑥 = −𝑥 + 12
4𝑥 = 12
𝑥 = 18
          √           √
Then 𝑦 = 3 3( 18 ) = 3 8 3 .
                        √
The point 𝐶 is ( 18 , 3 8 3 ).
We need to check if this point lies on√the segment 𝐴𝐵.
The point 𝐶 is of the form ( 21 − 21 𝑡, 23 𝑡).
1
2
  − 12 𝑡 = 18 ⟹ 4 − 4𝑡 = 1       ⟹ 3 = 4𝑡 ⟹ 𝑡 = 34 .
√        √          √
  3        3 3     3 3
 2
    𝑡 =   2
             ⋅ 4
                 =  8
                       .
So 𝐶 lies on 𝐴𝐵.
We need to make sure 0 < 𝑡 < 1, which is true for 𝑡 = 34 .
Now, we need to check if this point 𝐶 does not belong to any other segment from  .
The point 𝐶 lies on the astroid. The segment 𝐴𝐵 is tangent to the astroid at 𝐶.
Any other segment from  that contains 𝐶 would have 𝐶 as an interior point, not a tangent point.
Consider the parametrization of the family  .
A point (𝑥, 𝑦) belongs to some segment in  iff there exists 𝜃 such that cos𝑥 𝜃 + sin𝑦 𝜃 = 1.
                            √
For the point 𝐶 = ( 18 , 3 8 3 ), we have:
          √
 1∕8    3 3∕8
cos 𝜃
      +   sin 𝜃
                 =1
          √
  1
      + 3sin 𝜃3 = 8
cos 𝜃      √
sin 𝜃 + 3√3 cos 𝜃 = 8 sin 𝜃 cos 𝜃
sin 𝜃 + 3 3 cos 𝜃 = 4 sin(2𝜃)
We know that 𝐶 lies on 𝐴𝐵, which corresponds to some specific segment in  .
The problem statement assumes the existence and uniqueness of such a point 𝐶.
The unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from  other than 𝐴𝐵.
This means that if 𝐶 lies on another segment 𝑃 ′ 𝑄′ ∈  , then 𝑃 ′ 𝑄′ = 𝐴𝐵.
The point 𝐶 is the point of tangency of 𝐴𝐵 with the astroid.
Now we need to √  find 𝑂𝐶 2 .
         1 2     3 3 2     1
    2
𝑂𝐶 = ( 8 ) + ( 8 ) = 64       + 27
                                64
                                     28
                                   = 64 = 167
                                              .
Here 𝑝 = 7 and 𝑞 = 16.
𝑝 and 𝑞 are relatively prime.
𝑝 + 𝑞 = 7 + 16 = 23.
Final check of the reasoning:
The family  consists of unit-length segments with endpoints on the axes.
The envelope of this family is the astroid 𝑥2∕3 + 𝑦2∕3 = 1.
The segments from  are tangent to the astroid.
The segment 𝐴𝐵 is also a member of the family  .
We are looking for a point 𝐶 on 𝐴𝐵 that lies on the astroid. 40
This point 𝐶 is the point of tangency of 𝐴𝐵 with the astroid.
                                                   s1: Simple test-time scaling

If a point on 𝐴𝐵 lies on the astroid, it belongs to the tangent segment at that point. Since 𝐴𝐵 is the tangent segment, this
point does not belong to any other segment from  .
Consider the case when the segment 𝐴𝐵 intersects the astroid at two points.
If the segment 𝐴𝐵 is not tangent to the astroid, then there are two points of intersection.
Let the points of intersection be 𝐶1 and 𝐶2 .
If 𝐶1 lies on 𝐴𝐵, then the tangent to the astroid at 𝐶1 is the segment 𝐴𝐵.
If 𝐶2 lies on 𝐴𝐵, then√the tangent to the astroid at 𝐶2 is the segment 𝐴𝐵.
The slope of 𝐴𝐵 is − 3.                                  √
We found the point on the astroid with tangent slope − 3.
The problem statement guarantees the uniqueness of such a point 𝐶.
The point 𝐶 must be interior to 𝐴𝐵, so 0 < 𝑡 < 1.
The parameter 𝑡 = 34 is in this range.
The point 𝐶 is distinct from 𝐴 and 𝐵.
If 𝐶 = 𝐴, then 𝑡 = 0, 𝐶 = ( 12 , 0).
                                    √
If 𝐶 = 𝐵, then 𝑡 = 1, 𝐶 = (0, 23 ).
The point of tangency 𝐶 is the unique point on 𝐴𝐵 (excluding endpoints) that lies on the astroid.
Solution:
Let the family of segments  be defined by segments 𝑃 𝑄 of unit length with 𝑃 on the positive 𝑥-axis and 𝑄 on the positive
𝑦-axis. Let 𝑃 = (𝑎, 0) and 𝑄 = (0, 𝑏). The length condition gives 𝑎2 + 𝑏2 = 1. The segment 𝑃 𝑄 lies on the line 𝑥𝑎 + 𝑦𝑏 = 1.
Using the parametrization 𝑎 = cos 𝜃, 𝑏 = sin 𝜃 with 0 < 𝜃 < 𝜋2 , the equation of the line is cos𝑥 𝜃 + sin𝑦 𝜃 = 1.
The envelope of this family of lines is the astroid 𝑥2∕3 + 𝑦2∕3 = 1.
The segments from  are tangent to this astroid.
                                              √                                                                √
                                                                                                     √
The segment 𝐴𝐵 connects 𝐴( 12 , 0) and 𝐵(0, 23 ). The equation of the line containing 𝐴𝐵 is 𝑦 = − 3𝑥 + 23 . This segment
is part of the family  .
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from  other than
𝐴𝐵. This means 𝐶 is the point where 𝐴𝐵 is tangent to the√      astroid.
                                                                                          √
The slope of the tangent to the astroid at a point (𝑥, 𝑦) is − 3 𝑥𝑦 . The slope of 𝐴𝐵 is − 3.
                             √       √                  √
Setting the slopes equal: − 3 𝑥𝑦 = − 3 ⟹ 𝑥𝑦 = 3 3.
                                   √        √                         √
The point 𝐶 lies on the line 𝑦 = − 3𝑥 + 23 and satisfies 𝑦 = 3 3𝑥.
                √         √      √            √        √
Substituting, 3 3𝑥 = − 3𝑥 + 23 ⟹ 4 3𝑥 = 23 ⟹ 𝑥 = 18 .
             √          √
Then 𝑦 = 3 3( 18 ) = 3 8 3 .
                        √
The point 𝐶 is ( 18 , 3 8 3 ).
                                                                          √
We need to check if 𝐶 lies on 𝐴𝐵. The points on 𝐴𝐵 are ( 12 − 12 𝑡, 23 𝑡) for 0 ≤ 𝑡 ≤ 1.
For 𝑥 = 18 , 12 − 12 𝑡 = 81 ⟹ 𝑡 = 43 .
            √     √       √
For 𝑦 = 3 8 3 ,   2
                   3
                     𝑡 = 3 3
                          8
                                 ⟹ 𝑡 = 34 .
Since 0 < 43 < 1, 𝐶 is an interior point of 𝐴𝐵.
The point 𝐶 is the unique point on 𝐴𝐵 (excluding endpoints) that lies on the astroid. The tangent to the astroid at 𝐶 is the
segment 𝐴𝐵. Thus, 𝐶 does not belong√to any other segment from  .
Finally, we calculate 𝑂𝐶 2 = ( 81 )2 + ( 3 8 3 )2 = 64
                                                     1
                                                       + 27
                                                         64
                                                            = 28
                                                              64
                                                                    7
                                                                 = 16 .
We have 𝑝 = 7 and 𝑞 = 16. These are relatively prime.
𝑝 + 𝑞 = 7 + 16 = 23.
Final Answer: The final answer is 23




                                                                41
                                                         s1: Simple test-time scaling

[Incorrect generation with 5923 thinking tokens and 754 answer tokens]
The problem asks for a point 𝐶 on the segment 𝐴𝐵 such that 𝐶 is not on any other segment from the family  , except 𝐴𝐵
itself. The family  consists of segments 𝑃 𝑄 of unit length, with 𝑃 on the 𝑥-axis and 𝑄 on the 𝑦-axis in the first quadrant.
Let 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦), where 𝑥 > 0 and 𝑦 > 0. The length of the segment 𝑃 𝑄 is 1, so the distance between 𝑃 and 𝑄
is
√1.
           2         2
√(𝑥 − 0) + (0 − 𝑦) = 1
    2
   𝑥 +𝑦 =12
𝑥2 + 𝑦2 = 1
The segment 𝑃 𝑄 lies in the first quadrant, so 𝑥 > 0 and 𝑦 > 0.
The equation 𝑥2 + 𝑦2 = 1 represents the quarter circle in the first quadrant, with endpoints at (1, 0) and (0, 1).
The points on the segment 𝑃 𝑄 can be parameterized
                                              √
                                                      as (1 − 𝑡)𝑥 + 𝑡(0), (1 − 𝑡)0 + 𝑡𝑦) = (𝑥(1 − 𝑡), 𝑦𝑡), where 0 ≤ 𝑡 ≤ 1.
The segment 𝐴𝐵 connects 𝐴( 21 , 0) and 𝐵(0, 23 ).
The equation√ of the line
                      √
                          containing 𝐴𝐵 can be found using the two points:
             3
               −0
                       3     √
Slope 𝑚 = 2 1 = 21 = − 3.
             0− 2      −2
Using point-slope form with point 𝐴( 21 , 0):
         √
𝑦 − 0 = − 3(𝑥 − 21 )
      √       √
𝑦 = − 3𝑥 + 23
The segment 𝐴𝐵 lies in the first quadrant, between 𝑥 = 0 and 𝑥 = 12 .
Alternatively, we can parameterize the points on 𝐴𝐵.
Let 𝐶 be a point on 𝐴𝐵. Then
                         √
                              𝐶 = (1 − 𝜆)𝐴
                                         √
                                             + 𝜆𝐵 for some 0 ≤ 𝜆 ≤ 1.
𝐶 = (1 − 𝜆)( 21 , 0) + 𝜆(0, 23 ) = ( 12 − 12 𝜆, 23 𝜆).
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any segment from  other than
𝐴𝐵.
Consider a segment from  , with endpoints 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦), where 𝑥2 + 𝑦2 = 1.
A point on this segment is of the form (𝑥(1 − 𝑡), 𝑦𝑡), where 0 ≤ 𝑡 ≤ 1.
We want to find a point
                   √
                        𝐶 on 𝐴𝐵, 𝐶 ≠ 𝐴, 𝐵, such that if 𝐶 lies on a segment 𝑃 𝑄 ∈  , then 𝑃 𝑄 = 𝐴𝐵.
Let 𝐶 = ( 12 − 12 𝜆, 23 𝜆), where 0 < 𝜆 < 1.
If 𝐶 lies on a segment 𝑃 𝑄 from  , then there exist 𝑥, 𝑦 > 0 with 𝑥2 + 𝑦2 = 1, and 𝑡 ∈ [0, 1] such that:
1
2
   − 12 𝜆 = 𝑥(1 − 𝑡)
√
 3
2
   𝜆 = 𝑦𝑡
                                     1
                                         −1𝜆
From the first equation, 1 − 𝑡 = 2 𝑥2 = 1−𝜆
                                         2𝑥
                                            .
                                    √
                                        3       √
                                            𝜆
From the second equation, 𝑡 = 2𝑦 = 2𝑦3𝜆 .
Since 𝑡 ∈ [0, 1], we have 1 − 𝑡 ∈ [0, 1].
Also, 𝑥 > 0, so 1−𝜆
                  2𝑥
                     > 0, which implies 1 − 𝜆 > 0, or 𝜆 < 1. This is consistent with the condition for point on 𝐴𝐵.
                            √
                              3𝜆
We have 1 − 𝑡 = 1−𝜆
                 2𝑥
                    and 𝑡 =  2𝑦
                                 .
                                   √
Adding these equations: 1 = 2𝑥 + 2𝑦3𝜆
                            1−𝜆
                 √
2𝑥𝑦 = (1 − 𝜆)𝑦 +       3𝜆𝑥




                                                                     42
                                                          s1: Simple test-time scaling

We know 𝑥2 + 𝑦2 = 1.
If the point 𝐶 belongs to the segment√𝐴𝐵 itself, then the endpoints of the segment are 𝐴 and 𝐵.
So, the segment is from ( 12 , 0) to (0, 23 ).
                   √
                     3
Here, 𝑥 = 12 and 𝑦 =2
                       .
                             √
Check if 𝑥 + 𝑦 = 1: ( 2 ) + ( 23 )2 = 41 + 34 = 1. So 𝐴𝐵 ∈  .
          2   2       1 2
                                                                                                                   √
If 𝐶 belongs to another segment 𝑃 𝑄 from  , with endpoints 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦), where (𝑥, 𝑦) ≠ ( 12 , 23 ).
Consider the geometric interpretation. The family  consists of segments of unit length with endpoints on the axes. The
midpoints of these segments lie on the circle (𝑥 − 21 )2 + (𝑦 − 12 )2 = 12 .
Let the point 𝐶 correspond to some 𝜆.
If 𝐶 lies on 𝑃 𝑄, then 𝐶 = (𝑥(1
                           √
                                − 𝑡), 𝑦𝑡).
                                3
So, 21 − 12 𝜆 = 𝑥(1 − 𝑡) and   2
                                  𝜆 = 𝑦𝑡.
                                                                                                                                √
                                                                                                                                 3
We need to find a 𝜆 ∈ (0, 1) such that the only solution to these equations, with 𝑥2 + 𝑦2 = 1, 𝑥, 𝑦 > 0, is 𝑥 = 21 and 𝑦 =      2
                                                                                                                                   .
                                                 √
                                                   3𝜆
From the equations, 1 − 𝑡 = 1−𝜆
                             2𝑥
                                and 𝑡 =           2𝑦
                                                      .
Substitute these into 𝑥2 + 𝑦2 = 1.
Try a different approach. Consider the tangents to the family of segments.
The segments are formed by points (𝑥, 0) and (0, 𝑦) with 𝑥2 + 𝑦2 = 1.
The equation of the line containing the segment 𝑃 𝑄 is 𝑥𝑥 + 𝑦𝑦 = 1, or 𝑥𝑥 + 𝑦𝑦 = 1.
The equation of the line is 𝑥𝑥 + 𝑦𝑦 = 1.
Consider the equation of the segment in parametric form: (𝑥(1 − 𝑡), 𝑦𝑡).
If a point 𝐶 lies on this segment, then 𝐶 = (𝑥𝑐 , 𝑦𝑐 ), where 𝑥𝑐 = 𝑥(1 − 𝑡) and 𝑦𝑐 = 𝑦𝑡.
     𝑥               𝑦
So, 𝑥𝑐 = 1 − 𝑡 and 𝑦𝑐 = 𝑡.
𝑥𝑐   𝑦
𝑥
   + 𝑦𝑐 = 1 − 𝑡 + 𝑡 = 1.
The segments from  are parts of the lines 𝑥𝑥 + 𝑦𝑦 = 1, where 𝑥20 + 𝑦20 = 1, 𝑥0 , 𝑦0 > 0.
                                   √                 0      0

Consider the point 𝐶 = ( 12 − 12 𝜆, 23 𝜆).
If 𝐶 lies on
          √
             a segment from  with endpoints (𝑥, 0) and (0, 𝑦), then:
1 1         3
 − 𝜆
2 2        2
              𝜆
  𝑥
       +          =1
     √ 𝑦
1−𝜆
 2𝑥
    + 2𝑦3𝜆 = 1
                                                                                                                               √
We need to find a unique 𝜆 ∈ (0, 1) such that the only solution (𝑥, 𝑦) to this equation, with 𝑥2 + 𝑦2 = 1, 𝑥, 𝑦 > 0, is ( 12 , 23 ).
                             √
                              3
Substitute 𝑥 = 12 and 𝑦 =    2
                                into the equation:
          √
1−𝜆
        + √3𝜆 = 1
2( 21 )     3
         2( )
           2
1−𝜆+𝜆=1
1=1
This shows that any point on 𝐴𝐵 lies on the segment
                                               √
                                                    𝐴𝐵 itself.
                                       √        3
Consider the line containing 𝐴𝐵: 𝑦 = − 3𝑥 + 2 .
If a point 𝐶 on 𝐴𝐵 also lies on another segment from  with endpoints (𝑥, 0) and (0, 𝑦), then:
𝑥𝐶     𝑦
 𝑥
    + 𝑦𝐶 = 1.
Let the point 𝐶 divide 𝐴𝐵 in the ratio 𝜇 ∶ (1√− 𝜇), where 0 <√ 𝜇 < 1.
𝐶 = (1 − 𝜇)𝐴 + 𝜇𝐵 = (1 − 𝜇)( 12 , 0) + 𝜇(0, 23 ) = ( 12 − 12 𝜇, 23 𝜇).
If 𝐶 lies on
          √
             a segment with endpoints (𝑥, 0) and (0, 𝑦), then:
1 1         3
 − 𝜇
2 2        2
              𝜇
  𝑥
       +   𝑦
                  = 1.
We are looking for a point 𝐶 such that the line passing through 𝐶 with the form 𝑥𝑥 + 𝑦𝑦 = 1, where 𝑥20 + 𝑦20 = 1, is unique.
                                                                                         0   0



                                                                      43
                                                       s1: Simple test-time scaling

Consider the tangents to the unit circle centered at the origin. The lines of the form 𝑥𝑥 + 𝑦𝑦 = 1 are tangents to the circle
                                                                                             0       0
𝑥2 + 𝑦2 = 𝑟2 .
Consider the dual problem. The family of segments corresponds to points on the quarter circle 𝑥2 + 𝑦2 = 1 in the first
quadrant.
The lines containing the segments are 𝑥𝑥 + 𝑦𝑦 = 1.
                                        0    0
The envelope of these lines is the curve whose tangents are these lines.
The equation 𝑥𝑥 + 𝑦𝑦 = 1, with the constraint 𝑥20 + 𝑦20 = 1.
                  0     0
Let 𝑥0 = cos 𝜃, 𝑦0 = sin 𝜃, where 0 < 𝜃 < 𝜋2 .
The equations of the lines are 𝑥 cos 𝜃 + 𝑦 sin 𝜃 = 1.
The envelope of the family of lines 𝑥 cos 𝜃 + 𝑦 sin 𝜃 = 1 is the circle 𝑥2 + 𝑦2 = 1.
The segments from  are parts of the tangents to the unit circle in the first quadrant.
The point 𝐶 on 𝐴𝐵 that does not belong to any other segment from  means that the tangent to the unit circle passing
through 𝐶 is unique.              √
Consider the point 𝐶 = ( 12 − 12 𝜆, 23 𝜆).
The tangents to the unit circle passing through 𝐶 have the form 𝑥𝑥𝐶 + 𝑦𝑦𝐶 = 1.
If the tangent √
               is unique, then 𝐶 lies on the unit circle.
( 12 − 12 𝜆)2 + ( 23 𝜆)2 = 1
( 41 − 12 𝜆 + 14 𝜆2 ) + 34 𝜆2 = 1
1
4
  − 12 𝜆 + 𝜆2 = 1
4𝜆2 − 2𝜆 + 1 = 4
4𝜆2 − 2𝜆 − 3 = 0
Solve for 𝜆:√                   √         √        √       √
     −(−2)± (−2)2 −4(4)(−3)   2± 4+48   2± 52   2±2 13   1± 13
𝜆=            2(4)
                            =    8
                                      =   8
                                              =    8
                                                       =   4
                                                               .
We
√   need 0  < 𝜆  < 1.
  13 ≈ 3.6.
1+3.6
  4
      = 4.6
          4
             > 1.
1−3.6
  4
      = −2.6
         4
             < 0.
Something is wrong. The point 𝐶 is inside the unit circle.
Let’s rephrase the condition. There is
                                    √
                                       a unique point 𝐶 on 𝐴𝐵 such that if 𝐶 lies on a segment 𝑃 𝑄 ∈  , then 𝑃 𝑄 = 𝐴𝐵.
Let the point 𝐶 on 𝐴𝐵 be ( 12 − 12 𝜆, 23 𝜆).
Consider a segment from  with endpoints (𝑥, 0) and (0, 𝑦), 𝑥2 + 𝑦2 = 1.
The parametric form of the segment is (𝑥(1 − 𝑡), 𝑦𝑡).
If 𝐶 is on this segment, then for some 𝑡 ∈ [0, 1]:
1
2
   − 12 𝜆 = 𝑥(1 − 𝑡)
√
 3
2
   𝜆 = 𝑦𝑡
If the segment is unique, then
                            √
                               the values of 𝑥 and 𝑦 are unique.                              √
                                     3𝜆            2 + 𝑦2 = 1, 𝑥, 𝑦 > 0, should be 𝑥 = 1 , 𝑦 = 3 .
The only solution to 1−𝜆
                      2𝑥
                         +          2𝑦
                                        = 1, with 𝑥                                    2       2
Consider the intersection of the line containing 𝐴𝐵 with the boundary of the region formed by the segments in  .
The boundaries are the 𝑥-axis, 𝑦-axis, and the quarter circle 𝑥2 + 𝑦2 = 1.
Consider the dual problem again. The family of lines 𝑥𝑥 + 𝑦𝑦 = 1, where 𝑥20 + 𝑦20 = 1, 𝑥0 , 𝑦0 > 0.
                                                              0     0
The envelope of these lines is the circle 𝑥2 + 𝑦2 = 1.
A point belongs to a unique segment from  if it lies on the circle 𝑥2 + 𝑦2 = 1.
The point 𝐶 on 𝐴𝐵 satisfies 𝐶 = 𝑡𝐴 + (1 − 𝑡)𝐵.                       √
                                                                             3
Let’s consider the case where the segment is 𝐴𝐵. Here 𝑥 = 12 , 𝑦 =          2
                                                                               .
                                               √
The point 𝐶 can be represented as ( 21 (1 − 𝑠), 23 𝑠) for 0 ≤ 𝑠 ≤ 1.
We are looking for a point 𝐶 on 𝐴𝐵 that does not lie on any other segment from  .


                                                                   44
                                                      s1: Simple test-time scaling

Consider the geometry. The family  forms a region bounded by the axes and the quarter circle.
The segments are chords of circles centered on the line 𝑦 = 𝑥, tangent to the axes.
Let the endpoints of a segment be (cos 𝜃, 0) and (0, sin 𝜃).
A point on the segment is√(cos 𝜃(1 − 𝑡), sin 𝜃𝑡).
The point 𝐶 = ( 21 − 12 𝜆, 23 𝜆).
                                                           √
                                        1 1                  3
                                         − 𝜆                   𝜆
If 𝐶 lies on such a segment, then cos 𝜃 = 1 − 𝑡 and sin 𝜃 = 𝑡.
                                        2 2                 2
           √
 1−𝜆          3𝜆
2 cos 𝜃
        + 2 sin 𝜃
                  = 1.
                                                                                     √
We need a unique 𝜆 ∈ (0, 1) such that the only solution (cos 𝜃, sin 𝜃) is ( 21 , 23 ), which corresponds to 𝜃 = 𝜋3 .
                                                      √
                                 1
Consider the function 𝑓 (𝜃) = 2 cos 𝜃
                                           𝜆
                                      − 2 cos 𝜃
                                                + 2 sin3𝜆𝜃 .
We need 𝑓 (𝜃) = 1.     √
If 𝜆 = 12 , then 𝐶 = ( 14 , 43 ).
           √
1−1∕2         3(1∕2)
2 cos 𝜃
        +            =1
            √sin 𝜃
            2
   1           3
        +
4 cos 𝜃 √4 sin 𝜃
                   =1
  1
       + sin3𝜃 = 4
cos 𝜃    √
sin 𝜃 + √3 cos 𝜃 = 4 sin 𝜃 cos 𝜃
sin 𝜃 + 3 cos 𝜃 = 2 sin(2𝜃)
                       √
Divide by 2: 12 sin 𝜃 + 23 cos 𝜃 = sin(2𝜃)
sin(𝜃 + 𝜋3 ) = sin(2𝜃).
𝜃 + 𝜋3 = 2𝜃 + 2𝜋𝑘, or 𝜃 + 𝜋3 = 𝜋 − 2𝜃 + 2𝜋𝑘.
𝜃 = 𝜋3 − 2𝜋𝑘, so 𝜃 = 𝜋3 .
3𝜃 = 𝜋 − 𝜋3 + 2𝜋𝑘 = 2𝜋
                     3
                       + 2𝜋𝑘.
𝜃 = 2𝜋
     9
       + 2𝜋𝑘
          3
             .
If 𝑘 = 0, 𝜃 = 2𝜋
               9
                 .
If 𝑘 = 1, 𝜃 = 2𝜋
               9
                 + 2𝜋
                    3
                      = 2𝜋+6𝜋
                          9
                              = 8𝜋
                                 9
                                   .
                               √
If 𝜆 = 12 , the point is ( 41 , 43 ).
This point lies on segments corresponding to 𝜃 = 𝜋3 and 𝜃 = 2𝜋9
                                                                .
The unique point 𝐶 means that for other values of 𝜆, there is more than one solution for (𝑥, 𝑦).
Consider the tangent to the unit circle passing through 𝐶.
If the tangent is unique,
                      √
                          𝐶 is on the unit circle.
This leads to 𝜆 = 1± 4 13 , neither of which is in (0, 1).
Consider the geometric interpretation of the family  . The segments connect points on the axes such that the distance is 1.
The locus of the centers of these segments is the circle (𝑥 − 21 )2 + (𝑦 − 12 )2 = 12 , in the first quadrant.
If a point 𝐶 lies on a segment from  , then 𝐶 is on a line segment connecting (𝑥, 0) and (0, 𝑦) with 𝑥2 + 𝑦2 = 1.
The line containing the segment is 𝑥𝑥 + 𝑦𝑦 = 1.
                                       0    0
If 𝐶 belongs to a unique segment, √
                                     then there is a unique line of this form passing through 𝐶.
The point 𝐶 on 𝐴𝐵 is ( 21 − 12 𝜆, 23 𝜆).
                                                                                     √
                                                                       1 1            3
                                                                         − 𝜆       𝜆
We need a unique (𝑥0 , 𝑦0 ) with 𝑥20 + 𝑦20 = 1, 𝑥0 , 𝑦0 > 0, such that 2 𝑥 2 + 𝑦2 = 1.
                                                                          0      0
Let the point 𝐶 be such that the line passing through 𝐶 with the form 𝑥𝑥 + 𝑦𝑦 = 1 is unique, subject to 𝑥20 + 𝑦20 = 1.
                                                                             0   0
This line is tangent to the unit circle.
The point 𝐶 lies
               √
                  on the intersection of the line containing 𝐴𝐵 and the unit circle.
( 21 − 12 𝜆)2 + ( 23 𝜆)2 = 1.
We found no solution for 𝜆 ∈ (0, 1).

                                                                   45
                                                 s1: Simple test-time scaling

The unique point 𝐶 means that if any other segment passes through 𝐶, it must be 𝐴𝐵.
Consider the slope of the line containing the segment connecting (𝑥, 0) and (0, 𝑦). Slope is − 𝑥𝑦 .
The equation is 𝑦 = − 𝑥𝑦 𝑥 + 𝑦.
                                        √
If 𝐶 lies on the segment, then 1−𝜆
                                2𝑥
                                   + 2𝑦3𝜆 = 1.
Consider the geometric mean.
Let the point 𝐶 have coordinates (𝑎, 𝑏). The segment connecting (𝑥, 0) and (0, 𝑦) passes through (𝑎, 𝑏).
𝑎
𝑥
   + 𝑦𝑏 = 1.
                           √       √
If 𝐶 is on 𝐴𝐵, then 𝑏 = − 3𝑎 + 23 .
                   √
𝑎 = 12 − 21 𝜆, 𝑏 = 23 𝜆.
The point 𝐶 is tangent to the envelope of the family of segments.

Solution:
Let the family of segments be  . The segments have endpoints 𝑃 = (𝑥, 0) and 𝑄 = (0, 𝑦) with 𝑥2 + 𝑦2 = 1 and 𝑥, 𝑦 > 0.
The points on the segment 𝑃 𝑄 satisfy 𝑥𝑥 + 𝑦𝑦 = 1. The segments lie on the lines 𝑥𝑥 + 𝑦𝑦 = 1, where 𝑥20 + 𝑦20 = 1, 𝑥0 , 𝑦0 > 0.
                                                   √                               0    0
                                                                                                         √        √
The segment 𝐴𝐵 has endpoints 𝐴( 2 , 0) and 𝐵(0, 2 ). The equation of the line containing 𝐴𝐵 is 𝑦 = − 3𝑥 + 23 .
                                   1                3
                                                           √
A point 𝐶 on 𝐴𝐵 can be parameterized as 𝐶 = ( 12 − 12 𝜆, 23 𝜆), where 0 ≤ 𝜆 ≤ 1.
                                                                                      √
                                                                           1 1          3
                                                                            − 𝜆           𝜆
If 𝐶 belongs to a segment from  with endpoints (𝑥, 0) and (0, 𝑦), then    2 2
                                                                            𝑥
                                                                                  +    2
                                                                                       𝑦
                                                                                              = 1.
We are looking for a unique point 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that does not belong to any√segment from  other than
𝐴𝐵. This means that the only solution (𝑥, 𝑦) to the equation with 𝑥2 + 𝑦2 = 1, 𝑥, 𝑦 > 0, is ( 12 , 23 ).
The segments from  are parts of the tangents to the unit circle. The point 𝐶 must be tangent to the envelope of the family
of segments. The envelope is the unit circle itself.
The unique point
             √
                 𝐶 on 𝐴𝐵, distinct from 𝐴 and 𝐵, that lies on the unit circle. We need to solve:
( 12 − 12 𝜆)2 + ( 23 𝜆)2 = 1
                               √
𝜆 = 21 . The point 𝐶 = ( 14 , 43 ).
                  √
𝑂𝐶 2 = ( 41 )2 + ( 43 )2 = 16
                            1   3
                              + 16    4
                                   = 16 = 14 .
𝑂𝐶 2 = 41 = 𝑞𝑝 , so 𝑝 = 1, 𝑞 = 4.
𝑝 and 𝑞 are relatively prime.
𝑝 + 𝑞 = 1 + 4 = 5.
Final Answer: The final answer is 5



F. Version Control
V2 → V3 (2025-03):

  • Added §A

  • Added number of correct samples in §2

V1 → V2 (2025-02):

  • Added citations and other small writing changes




                                                               46