Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2

Metadata

Reading status: read complete
Year: 2025
Compute regime: Search, simulation, and science compute (search_simulation_science_compute)
PDF: 2025-alphageometry2_2025.pdf
Extracted text: 2025-alphageometry2_2025.txt
PDF URL: https://arxiv.org/pdf/2502.03544.pdf
OpenAlex:
Citation count source/date: Frontier provisional 2026-06-15
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper gives a partly explicit Google DeepMind setup. AlphaGeometry2 keeps the neuro-symbolic division from AlphaGeometry: a language model proposes auxiliary constructions, and the DDAR symbolic engine tries to complete a formal geometry proof. The language model is a sparse mixture-of-experts Transformer built on the Gemini training pipeline. The paper states that AG2 models are trained with the largest batch size allowed by TPUv4 hardware, with the exact number of TPUs depending on model size. It also says proof search serves multiple model replicas on TPUv4, while DDAR workers run asynchronously and are shared across problems.

The research device is a TPUv4-backed generator plus a Python/C++ symbolic search stack. The training set is about 300 million synthetic theorems, and the model sizes shown include 51M, 176M, and 3.3B parameters. Inference uses top-k sampling with temperature 1.0 and k = 32, multiple language models, and multiple search-tree configurations. The symbolic engine core is moved into C++ via pybind11 because every LM-proposed construction is immediately tested by deduction.

Bottleneck

The bottleneck is the coupling between proof-search branching and symbolic verification throughput. Geometry proofs often require auxiliary points, but the space of possible constructions is huge. A greedy model plus one deduction run solves very little: on problems requiring auxiliary constructions, greedy decoding with no tree search solves only 2 of 26, while top-k sampling without a search tree solves 9 of 26. The expensive unit is a candidate construction followed by a DDAR run, repeated across a branching search.

AG1 also had representational and systems bottlenecks. Its language covered only 66% of IMO geometry problems from 2000 to 2024, missing locus-style motion, linear equations over angles/ratios/distances, non-constructive statements, and "find the angle/ratio" computations. Its DDAR implementation was too slow for broad search: faster symbolic deduction directly buys more construction attempts within a fixed time budget.

Method Adaptation

AG2 adapts the method to the TPUv4 plus symbolic-engine setup in four places. First, it expands the formal language so more problems can enter the pipeline. New predicates cover angle and ratio computation, linear equations, locus cases, and diagram/topology checks. Coverage rises from 66% to 88% of 2000-2024 IMO geometry problems.

Second, DDAR is re-engineered for throughput. The paper identifies searches for similar triangles and angle-ratio facts as major costs, reduces rule application overhead, hard-codes essential rule searches, improves double-point handling, and implements Gaussian elimination in C++. The new C++ library is reported as more than 300x faster than DDAR1; on 25 unsolved-by-DDAR1 benchmark problems, DDAR1 averages about 1179.57 seconds while DDAR2 is much faster.

Third, the LM is trained as a construction generator rather than a standalone prover. Perplexity is tracked on synthetic and IMO-derived proof sets, but the downstream metric is solve rate after DDAR verifies the proposed auxiliary points. The inference prompt is also changed: AG2 feeds the LM an "analysis string" summarizing facts DDAR can derive from the premises, facts derivable when assuming the goal, and facts numerically true in the diagram. That makes model tokens carry symbolic-search state.

Fourth, AG2 replaces a single beam search with Shared Knowledge Ensemble of Search Trees. SKEST's technical value is maintaining several construction priors at once rather than betting on one beam. It runs classic, multi-auxiliary, operator, deep narrow, and shallow wide trees in parallel. Nodes are one auxiliary-construction attempt plus one DDAR run; useful facts move across trees, and asynchronous workers recycle deduction capacity when problems finish.

Evidence

The central result is IMO-AG-50, built from 45 IMO geometry problems from 2000 to 2024 translated into 50 AlphaGeometry problems. AG2 solves 42 of 50, or 84%, compared with 27 solved by AG1 and 40.9 for the paper's average gold medalist reference. On the older IMO-AG-30 subset, AG2 full setting solves all 30; AG2 with the AG1 single-tree setup solves 28; AG1 solves 25. DDAR alone improves too, from 14 to 16 on IMO-AG-50.

The ablations make the compute structure visible. One language model plus classical tree search reaches 27 solved problems after only 250 training steps at batch size 256, about 200 million tokens. For a single search tree, the best reported setting is beam size 128, depth 4, and 32 samples; increasing beam or sample count further does not keep helping. The full multi-tree SKEST setting lifts the result from 38 solved with AG2 in an AG1-style setup to 42 solved.

The paper also reports that AG2 was part of the system reaching silver-medal standard at IMO 2024. Its solution to IMO 2024 P4 was obtained within 30 seconds during IMO 2024 and received full marks from an IMO evaluator.

Historical Effect

AG2 is historically important because it makes neuro-symbolic theorem proving a modular method: the LM invents constructions, DDAR verifies, and SKEST coordinates search. The TPUv4-served Gemini-family model spends samples on creative auxiliary constructions; the C++ symbolic engine spends deterministic compute verifying consequences. The result is a compute structure closer to AlphaGo-style search plus learned priors than to pure chain-of-thought prompting.

It also marks a shift in math AI evaluation. General frontier "thinking" models in the paper's table score 0 on the formalized geometry benchmark, while the specialized neuro-symbolic stack exceeds the average gold-medalist threshold. The lesson is not simply that bigger LMs solve geometry. It is that formal language coverage, symbolic-engine latency, sample diversity, and cross-tree knowledge sharing determine how effectively accelerator inference can be converted into proofs.

Limits

The strongest implementation is not fully reproducible. The paper says DDAR2 examples will be shared, but Gemini training code and the full multi-model tree search are not provided because they rely on internal Google infrastructure and many parallel workers. Hardware is explicit at the TPUv4 level, but exact TPU counts, wall-clock training time, and total inference cost are absent.

AG2 still leaves gaps in the formal language and proof engine. The remaining unformalizable IMO problems involve inequalities and variable numbers of points; the attempted-but-unsolved cases include techniques such as inversion, projective geometry, and radical axis that are not implemented in DDAR. The authors argue these might be solved by longer proofs and more auxiliary constructions, but that would require more inference time and deeper search.