Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling - Summary

Summary (Overview)

Simple Unified Recipe: Introduces a compact, domain-unified pipeline (SU-01) to convert a broadly capable post-trained reasoning backbone (P1-30B-A3B) into a rigorous olympiad-level solver, supporting a "specializable-generalist" view.
Three-Part Recipe: The recipe consists of: 1) SFT with a reverse-perplexity curriculum to instill proof-search and self-checking behaviors, 2) two-stage RL (coarse RL with verifiable rewards, then refined RL with proof-level rewards and experience replay) to scale these behaviors, and 3) Test-Time Scaling (TTS) via a self-verification-and-refinement loop at inference.
Gold-Medal Performance: The resulting model, SU-01, achieves gold-medal-level performance on mathematics (IMO 2025, USAMO 2026) and physics (IPhO 2024/2025) olympiads, matching the highest reported human total on USAMO 2026 (35 points).
Strong Generalization: Demonstrates strong transfer of scientific reasoning to untrained domains (Chemistry, Biology) beyond the main math/physics training signals, as shown on FrontierScience benchmarks.
Long-Context Reasoning: Can sustain coherent reasoning trajectories exceeding 100K tokens during inference, effectively utilizing the TTS budget for proof repair.

Introduction and Theoretical Foundation

Recent systems like AlphaGeometry and Gemini Deep Think have reached olympiad-level performance, often combining neural guidance with symbolic search or extensive inference-time compute. This paper explores whether a compact reasoning backbone can be pushed to similar performance with a simple, domain-unified recipe that applies the same pipeline across mathematical and scientific problems.

The core design follows a specializable-generalist perspective: rather than building a narrow olympiad solver, the method specializes a broadly capable post-trained model (P1-30B-A3B) toward expert-level proof reasoning while preserving transfer across scientific domains. The pipeline is modular: SFT reshapes behavior, RL scales capability, and TTS allocates additional inference compute to the hardest proof-search problems.

Methodology

The pipeline has four components (SFT, coarse RL, refined RL, and TTS, counting the two RL phases separately), visualized in Figure 2 of the paper.

1. Supervised Fine-Tuning (SFT) with Reverse-Perplexity Curriculum

Goal: Reshape the model's reasoning behavior toward explicit, disciplined, proof-oriented long-form reasoning.
Data: Curated from math (AoPS, olympiad materials), STEM (NaturalReasoning), code, and instruction-following sources. Includes direct solutions and self-verification/refinement trajectories. Final mixture: 338K trajectories (<8K tokens).
Key Technique: Reverse-Perplexity Curriculum. Data is sorted by descending perplexity under the initial policy $\pi_0$:
$\text{PPL}(x_i, y_i) = \exp\left(-\frac{1}{T_i} \sum_{t=1}^{T_i} \log \pi_0(y_{i,t} | x_i, y_{i,<t})\right)$
where $T_i$ is the number of tokens in the response $y_i$. Training starts with high-PPL (unfamiliar) examples to drive behavioral adaptation and then consolidates on familiar ones, which prevents capability degradation.
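The ordering step can be illustrated with a minimal sketch. It assumes per-token log-probabilities under $\pi_0$ are already available for each trajectory; the example records, their `id` and `logprobs` fields, and the numbers are hypothetical and are not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/T) * sum_t log pi_0(y_t | x, y_<t)) for one response."""
    if not token_logprobs:
        raise ValueError("need at least one response token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reverse_perplexity_order(examples):
    """Sort examples from highest to lowest perplexity (unfamiliar first)."""
    return sorted(examples, key=lambda ex: perplexity(ex["logprobs"]), reverse=True)

# Hypothetical log-probabilities under the initial policy (made-up values).
examples = [
    {"id": "familiar", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "unfamiliar", "logprobs": [-2.0, -1.5, -2.5]},
    {"id": "medium", "logprobs": [-0.8, -1.0, -0.6]},
]
curriculum = [ex["id"] for ex in reverse_perplexity_order(examples)]
print(curriculum)
```

In this toy input the trajectory with the lowest average log-probability is scheduled first, which is the behavior the curriculum description calls for.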

2. Coarse Reinforcement Learning (RL)

Goal: Scale the SFT reasoning pattern into stronger answer-seeking behavior using reliable, verifiable rewards.
Formulation: Uses Group Sequence Policy Optimization (GSPO) with verifiable rewards (RLVR). For a prompt $q$, a group of $K$ candidate solutions $G_q = \{o_i\}_{i=1}^K$ is sampled. The reward $r(q, o)$ is 1 if the final answer is correct (verified via layered checks: text matching → Math-Verify → generative model), else 0.
Objective: The policy is updated to maximize the clipped sequence-level surrogate objective: $J_{\text{GSPO}}(\theta) = \mathbb{E}_{q, \{o_i\}} \left[ \frac{1}{K} \sum_{i=1}^{K} \min\left( s_i(\theta) \hat{A}_i, \text{clip}(s_i(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_i \right) \right]$ where $s_i(\theta)$ is the length-normalized sequence-level importance ratio and $\hat{A}_i = r(q, o_i) - \mu_{G_q}$ is the within-group advantage, with $\mu_{G_q}$ the mean reward over the group $G_q$.
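A small numeric sketch of this objective for a single group may help. It assumes $s_i(\theta)$ is the exponential of the mean per-token log-ratio between the new and old policies, which is one standard reading of "length-normalized sequence-level importance ratio"; the clipping value `eps=0.2` and all inputs are placeholders, not values reported in the paper.

```python
import math

def sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level ratio: exp(mean_t(log p_new - log p_old))."""
    diff = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(diff / len(new_logprobs))

def gspo_objective(rewards, new_logprobs, old_logprobs, eps=0.2):
    """Clipped surrogate for one group; eps is an assumed placeholder value."""
    mean_reward = sum(rewards) / len(rewards)  # mu_{G_q}
    total = 0.0
    for r, new, old in zip(rewards, new_logprobs, old_logprobs):
        adv = r - mean_reward                  # within-group advantage
        s = sequence_ratio(new, old)
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * adv, clipped * adv)
    return total / len(rewards)

# Toy group of K = 2 responses: the correct one has become twice as likely
# per token under the new policy, the incorrect one is unchanged.
old = [[-1.0, -1.0], [-1.0, -1.0]]
new = [[-1.0 + math.log(2), -1.0 + math.log(2)], [-1.0, -1.0]]
value = gspo_objective([1, 0], new, old)
print(round(value, 4))
```

Because the ratio for the correct response (2.0) exceeds the upper clip bound, its contribution is capped, which is what keeps a single update from moving too far from the sampling policy.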

3. Refined Reinforcement Learning

Goal: Shift optimization from answer correctness to proof quality, encouraging rigor and self-refinement.
Components:
- Generative Proof Reward: Uses DeepSeekMath-V2 as a reward model $r_{\text{proof}}(q, o) \in \{0, 1\}$ to score the complete reasoning path.
- Self-Refinement: Failed responses (average group reward below $\tau_{\text{ref}} = 0.5$) are converted into refinement prompts and mixed into training (ratio $\eta_{\text{ref}} = 0.2$).
- Experience Replay: Stores rare successful trajectories from hard problems ($0 < n^+(q) < 2$) in a buffer. Replays the lowest-entropy trajectory at a controlled ratio ($\rho = 0.25$) to preserve valuable signals.
Combined Objective: $J_{\text{refined}}(\theta) = (1-\rho) \mathbb{E}_{B_{\text{fresh}}}[J_{\text{GSPO}}(q, G_q; \theta, \pi_{\theta_{\text{old}}})] + \rho \mathbb{E}_{B_{\text{exp}}}[J_{\text{GSPO}}(q^*, \{o^*\} \cup G_{q^*}; \theta, \pi_{\theta_{\text{src}}})]$
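The replay mechanism and the $\rho$-weighted mixture can be sketched as follows. This is an illustration under assumptions: $n^+(q)$ is read as the number of successful samples in a group, each trajectory is assumed to carry a precomputed scalar entropy, and the `ReplayBuffer` class and its method names are hypothetical rather than the paper's implementation.

```python
class ReplayBuffer:
    """Keeps successful trajectories for prompts that are rarely solved."""

    def __init__(self):
        self._store = {}  # prompt -> list of (entropy, trajectory)

    def maybe_add(self, prompt, group, rewards, entropies):
        # Assumed reading of the rule 0 < n+(q) < 2: store only when the
        # group contains at least one but fewer than two successes.
        n_pos = sum(rewards)
        if 0 < n_pos < 2:
            for traj, r, h in zip(group, rewards, entropies):
                if r == 1:
                    self._store.setdefault(prompt, []).append((h, traj))

    def prompts(self):
        return list(self._store)

    def lowest_entropy(self, prompt):
        """Return the stored trajectory with the smallest entropy."""
        return min(self._store[prompt], key=lambda pair: pair[0])[1]

def mixed_objective(j_fresh, j_replay, rho=0.25):
    """(1 - rho) * fresh-batch objective + rho * replay-batch objective."""
    return (1.0 - rho) * j_fresh + rho * j_replay

buf = ReplayBuffer()
buf.maybe_add("q1", ["a", "b", "c"], [0, 1, 0], [0.9, 0.7, 0.8])
buf.maybe_add("q1", ["d", "e", "f"], [1, 0, 0], [0.4, 0.5, 0.6])
buf.maybe_add("q2", ["x", "y"], [1, 1], [0.1, 0.2])  # too easy, not stored
print(buf.lowest_entropy("q1"), mixed_objective(0.4, 0.8))
```

In the combined objective the replayed trajectory $o^*$ is scored together with a fresh group $G_{q^*}$; the sketch only shows the storage rule and the final weighting, not that regrouping.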

4. Test-Time Scaling (TTS)

Goal: Allocate additional inference compute to the hardest problems via iterative self-verification and refinement.
Procedure: An iterative solve–verify–refine loop. The model produces an initial solution, self-verifies it (generating a bug report), and refines it based on the critique. The loop repeats until the solution passes verification in consecutive rounds or the inference budget is exhausted.
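The control flow of such a loop can be sketched as below. The callables `solve`, `verify`, and `refine` stand in for model calls and are hypothetical, as are the `required_passes` and `max_rounds` parameters; the summary does not state how many consecutive passes are required or how the budget is measured.

```python
def solve_verify_refine(problem, solve, verify, refine,
                        required_passes=2, max_rounds=8):
    """Return (solution, accepted). Both thresholds are assumed placeholders."""
    solution = solve(problem)
    passes = 0
    for _ in range(max_rounds):
        ok, report = verify(problem, solution)
        if ok:
            passes += 1
            if passes >= required_passes:
                return solution, True
        else:
            passes = 0  # a failed check resets the consecutive-pass count
            solution = refine(problem, solution, report)
    return solution, False  # budget exhausted before acceptance

# Toy stand-ins for the model: the third draft ("v2") is the first to pass.
def toy_solve(problem):
    return "v0"

def toy_verify(problem, solution):
    ok = solution == "v2"
    report = "" if ok else "gap in step 1"
    return ok, report

def toy_refine(problem, solution, report):
    return "v" + str(int(solution[1:]) + 1)

accepted = solve_verify_refine("p", toy_solve, toy_verify, toy_refine)
truncated = solve_verify_refine("p", toy_solve, toy_verify, toy_refine, max_rounds=2)
print(accepted, truncated)
```

Requiring more than one consecutive pass guards against a single lenient verification; the trade-off is extra verification compute on solutions that were already correct.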

Empirical Validation / Results

Performance is evaluated across three benchmark families: answer-verifiable tasks, proof-oriented tasks, and official olympiad competitions.

1. Verifiable Problems

SU-01 averages 77.3%, within 0.1 points of the strongest similar-size baseline (Qwen3.6-35B-A3B, 77.4%) and 8.3 points above its P1-30B-A3B backbone (69.0%).

Model	AnswerBench	AMO-Bench	AIME 25/26	FrontierScience-Olympiad	Avg.
P1-30B-A3B	69.3%	41.3%	90.4% / 89.6%	54.5%	69.0%
Qwen3.6-35B-A3B	78.0%	58.8%	92.5% / 92.9%	65.0%	77.4%
SU-01	77.5%	59.8%	94.6% / 93.3%	61.5%	77.3%
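The Avg. column can be reproduced with a short check, assuming it is the unweighted mean of the five reported scores with AIME 25 and AIME 26 counted separately. That reading is an inference from the numbers, not something the summary states.

```python
# Scores copied from the table above, in column order:
# AnswerBench, AMO-Bench, AIME 25, AIME 26, FrontierScience-Olympiad.
scores = {
    "P1-30B-A3B": [69.3, 41.3, 90.4, 89.6, 54.5],
    "Qwen3.6-35B-A3B": [78.0, 58.8, 92.5, 92.9, 65.0],
    "SU-01": [77.5, 59.8, 94.6, 93.3, 61.5],
}

# Assumed definition: unweighted mean, rounded to one decimal place.
averages = {model: round(sum(v) / len(v), 1) for model, v in scores.items()}
gain_over_backbone = round(averages["SU-01"] - averages["P1-30B-A3B"], 1)
print(averages, gain_over_backbone)
```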

2. Non-verifiable / Proof-Oriented Problems

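SU-01 shows substantial gains over the similar-size Qwen3.6-35B-A3B (57.6% vs. 23.1% overall on IMO-ProofBench) and, with TTS, reaches 70.2%, approaching larger frontier models such as Gemini 3.1 Pro Thinking (72.6%).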

Model	IMO-ProofBench (Basic/Advanced/Overall)	FrontierScience-Research (Overall)
Gemini 3.1 Pro Thinking	95.2% / 50.0% / 72.6%	13.3%
GPT-5.5-High	96.7% / 64.8% / 80.7%	36.7%
Qwen3.6-35B-A3B	39.1% / 7.1% / 23.1%	5.0%
SU-01 (direct)	77.1% / 38.1% / 57.6%	11.7%
SU-01 (w/ TTS)	91.0% / 49.5% / 70.2%	-

3. Olympiad Competition Problems

SU-01 exceeds the gold-medal lines on both physics olympiads even without TTS and, with TTS, reaches or exceeds the gold-medal lines on both mathematical olympiads.

Table: Performance on Physics Olympiad Problems (IPhO)

Model	IPhO 2024	IPhO 2025
Gold Line	20.8	19.7
SU-01 (direct)	23.5	20.3
SU-01 (w/ TTS)	25.3	21.7

Table: Performance on Mathematical Olympiad Problems

Model	IMO 2025 Total	USAMO 2026 Total
Medal Lines (G/S/B)	35/28/19	25/18/11
SU-01 (direct)	21 (Bronze)	15 (Bronze)
SU-01 (w/ TTS)	35 (Gold)	35 (Gold)

The USAMO 2026 score of 35 matches the highest reported human total among 340 competitors.
Case Studies: The model solutions (provided in Appendix H) show strengths in formal reformulation (e.g., using complex numbers for geometry) but limitations on problems requiring subtle structural preservation (failing on IMO P6 and USAMO P2).
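The medal labels in the mathematical olympiad table follow from comparing each total against the listed cutoffs. The sketch below assumes a score at or above a line earns that medal, which is consistent with the IMO 2025 total of 35 being labeled Gold at a gold line of 35; the function name and the "No medal" label are illustrative.

```python
def medal(score, gold, silver, bronze):
    """Map a total score to a medal, assuming cutoffs are inclusive."""
    if score >= gold:
        return "Gold"
    if score >= silver:
        return "Silver"
    if score >= bronze:
        return "Bronze"
    return "No medal"

# Medal lines (gold, silver, bronze) and SU-01 totals copied from the table.
lines = {"IMO 2025": (35, 28, 19), "USAMO 2026": (25, 18, 11)}
totals = {
    ("IMO 2025", "direct"): 21,
    ("IMO 2025", "w/ TTS"): 35,
    ("USAMO 2026", "direct"): 15,
    ("USAMO 2026", "w/ TTS"): 35,
}
labels = {key: medal(score, *lines[key[0]]) for key, score in totals.items()}
print(labels)
```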

4. Progressive Reasoning Analysis

Figure 4 shows the contribution of each pipeline stage:

SFT lowers AnswerBench score but drastically improves ProofBench scores, confirming its role in behavior shaping toward proof rigor.
Coarse RL recovers and improves direct solving ability (AnswerBench) while further boosting proof performance.
Refined RL provides the largest gains on the hardest IMO-ProofBench Advanced problems, highlighting the importance of proof-level rewards and experience replay.

5. Inference Scaling Characterization

TTS traces on USAMO 2026 (Figure 5) show a clear allocation of compute:

Initial Solution: Median length 106K tokens (broad proof search).
Refinement: Median length 83K tokens (substantial proof repair).
Verification: Median length 28.7K tokens (auditing).

This demonstrates the model's ability to sustain complex, conditioned reasoning beyond 100K tokens.

Theoretical and Practical Implications

Specializable-Generalist Models: Demonstrates that a compact, broadly capable backbone can be driven toward expert-level proof reasoning while retaining meaningful scientific transfer, challenging the need for overly narrow specialization.
Efficient Olympiad Training: Provides a simple and unified recipe that achieves top-level performance with substantially lower training cost compared to broader, multi-stage post-training pipelines used by similar-size models (e.g., Nemotron-Cascade-2).
Importance of Proof-Level Feedback: Highlights that optimizing for final-answer correctness (RLVR) is insufficient for olympiad reasoning; proof-level generative rewards and self-refinement are critical for achieving rigor.
Test-Time Scaling as a Capability Amplifier: Shows that trained self-verification and refinement behaviors can productively utilize additional inference compute, a model-agnostic strategy for boosting performance on the hardest problems.

Conclusion

The paper presents a simple and unified recipe for eliciting gold-medal-level olympiad reasoning from a compact 30B-A3B model. The recipe decomposes rigorous reasoning improvement into:

Behavior shaping via reverse-perplexity SFT.
Scalable feedback via coarse and refined RL.
Proof-level specialization via generative rewards, self-refinement, and experience replay.
Inference-time repair via test-time scaling.

The resulting model, SU-01, validates a specializable-generalist view: it reaches expert-level proof reasoning on mathematics and physics olympiads while showing strong generalization to other scientific domains. The work suggests that with the right training and inference recipe, compact reasoning models can reach gold-medal-level olympiad performance.