Summary (Overview)

  • Three atomic proof capabilities: M3 trains proof generation (via verifier-guided RL), proof verification (via aligned error finding), and critique-conditioned proof repair (via rejection-sampling fine‑tune). These are merged into a single released model.
  • Defense‑in‑depth verifier: A four‑layer generative verifier (bad‑case filtering, solution normalization, multi‑judge parallel scoring, pessimistic min aggregation) designed to minimize the false‑positive rate during long‑horizon RL, directly addressing reward‑hacking patterns observed in an earlier M2 cycle.
  • MaxProof framework: A model‑agnostic population‑level test‑time scaling framework that searches over a population of candidate proofs, applies dual PATCH/REWRITE refinement, and selects a final proof through a pairwise tournament. It converts best@K capability into stable pass@1.
  • State‑of‑the‑art results: With MaxProof, the M3 model achieves 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold‑medal threshold on both contests. Standalone benchmark scores reach 67.40 on IMOProof Bench and 81.56 on IMOAnswer Bench.
  • Reward‑hacking case study: The M2 cycle documented four canonical hacking patterns (length bias, format hacking, semantic shortcut, judge‑specific preference) and directly shaped the conservative verifier design that enabled stable RL training.

Introduction and Theoretical Foundation

Mathematical proof is a stress‑test for reliable reasoning in language models because it demands a tightly coupled chain of constraints with very low tolerance for hand‑waving. The paper identifies three atomic capabilities required for competition‑level proof:

  1. Proof generation – producing a candidate proof that is sometimes close to correct (best@K). Long‑horizon RL with a reward from a generative verifier is the natural route, but this introduces challenges of reward noise, false positives, and reward hacking.
  2. Proof verification – reliably pointing to errors in a proof and explaining why. This capability (error finding rather than score prediction) underwrites self‑checking, error correction, and population‑level search.
  3. Proof refinement – fixing a flawed proof given a critique. Structurally different from one‑shot generation: the model must preserve correct parts and patch targeted defects.

The paper builds on prior work: AlphaGeometry, AlphaProof, Gemini Deep Thinking, DeepSeek‑Math‑V2, and earlier M2 series models. The key theoretical basis is that for proof tasks there is no executable ground truth—the reward must come from a generative verifier, which introduces a much harder class of issues than outcome‑based RL on code or unit‑test tasks.


Methodology

Proof Expert: Long‑Horizon RL under a Defense‑in‑Depth Verifier

The Proof Expert is trained using a variant of GRPO adapted to the M‑series CISPO objective (clipped importance‑sampling policy gradient). For each problem, a group of (G) candidate proofs is sampled, each scored by the verifier. The group‑normalized advantage is:

[ A_i = \frac{R_i - \mu_R}{\sigma_R + \epsilon}, \quad \mu_R = \frac{1}{G}\sum_{j=1}^G R_j, \quad \sigma_R = \sqrt{\frac{1}{G}\sum_{j=1}^G (R_j - \mu_R)^2} \tag{1} ]

The policy ratio and clipped ratio are:

[ \rho_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid p, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid p, y_{i,<t})}, \quad \bar{\rho}{i,t}(\theta) = \operatorname{clip}\bigl(\rho{i,t}(\theta), 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}\bigr) \tag{2} ]

The CISPO policy objective is:

[ \mathcal{J}{\text{CISPO}}(\theta) = \mathbb{E}!\left[ \frac{1}{\sum{i=1}^G T_i} \sum_{i=1}^G \sum_{t=1}^{T_i} \operatorname{sg}\bigl(\bar{\rho}{i,t}(\theta)\bigr) , A_i \log \pi\theta(y_{i,t} \mid p, y_{i,<t}) \right] \tag{3} ]

A std‑threshold filter suppresses updates when the group reward variance is too small:

[ \mathcal{J}{\text{Proof}}(\theta) = \mathbb{E}!\left[ \mathbf{1}{\sigma_R > \tau{\text{std}}} , \mathcal{J}{\text{CISPO}}(\theta; p, y{1:G}) \right] \tag{5} ]

Defense‑in‑Depth Verifier – four layers:

  1. Bad‑case filtering – removes empty proofs, unclosed thinking blocks, boilerplate loops, length‑budget violations.
  2. Solution normalization – strips fixed openings, step headers, verification sections.
  3. Multi‑judge scoring – three parallel judges (two rubric‑based, one no‑rubric) for calibration and direct error finding.
  4. Pessimistic aggregation – final reward is the minimum judge score, favoring false negatives over false positives.

Verifier Expert: Aligned Error Finding

Formulated as a joint error‑finding and classification task (not score regression). The Verifier Expert outputs:

  • <assessment> – step‑by‑step analysis
  • <errors> – concrete error descriptions
  • <verdict> – one of no_errors, minor_gaps, has_errors, fundamentally_wrong

Training data is harvested from the Proof Expert’s RL run. The reward function is:

[ R = 0.7 \cdot R_{\text{error}} + 0.3 \cdot R_{\text{verdict}} \tag{6} ]

where (R_{\text{error}}) is a semantic‑alignment score and (R_{\text{verdict}}) is an order‑aware distance against the four‑class verdict.

Fixer Expert: Proof Repair by Rejection‑Sampling Fine‑Tune

Input: (problem, flawed_proof, verification_analysis). The Fixer Expert is fine‑tuned on corrections that pass the same pessimistic‑min verifier (strict no_errors verdict). Training data is a free byproduct of the Proof Expert’s RL run.

MaxProof: Population‑Level Test‑Time Scaling

MaxProof assumes four interfaces: generator, verifier, refiner, ranker – all served by the merged M3 model under different prompts in this work. The algorithm:

  1. Initialize archive of (N) candidate proofs.
  2. Score each with (K_{\text{verify}}) verifier samples, retain pessimistic‑min score and critique.
  3. For (R) rounds:
    • Select (M) diverse parents (lexical‑distance filter) among candidates not already at maximum fitness.
    • For each parent, generate two offspring: PATCH (exploitation, address concrete errors) and REWRITE (exploration, try a different proof path).
    • Score offspring and add to archive.
    • Early stop if ≥2 candidates reach maximum fitness (redundancy against verifier noise).
  4. Final selection: pairwise tournament over top‑(K) archive candidates, decided by (K_{\text{ranker}}) ranker votes per match.

Key design choices: conservative fitness (minimum verifier score), diverse parent selection (lexical‑distance), PATCH/REWRITE refinement, population‑level early stop, and pairwise tournament final selection.


Empirical Validation / Results

Standalone Benchmark Results

BenchmarkOpus 4.7GPT‑5.5Gemini 3.1 ProM3
IMOProof Bench65.8590.8575.7167.40
IMOAnswer Bench79.9090.6090.0081.56

Table 1: Standalone scores on mathematical proof‑style benchmarks. MaxProof not used.

Contest Results with MaxProof

SystemIMO 2025USAMO 2026
M3 (one‑shot)2726
M3 + MaxProof3536
Gain from MaxProof+8+10

Table 2: M3 with and without MaxProof on IMO 2025 and USAMO 2026 (0–7 per problem, 42 total). Both scores exceed the human gold‑medal threshold.

Per‑Problem Analysis

ProblemIMO 2025USAMO 2026
SelfOracle
P177
P277
P377
P477
P577
P600

Table 3: Per‑problem scores. Self = system self‑pick; Oracle = highest in final population. Gap is selection loss.

Key findings:

  • Selection loss concentrated on one problem (USAMO 2026 P2): oracle 6/7 but tournament selected a 2/7 candidate.
  • 9 of 12 problems reached oracle 7/7 by round 4; three problems (IMO P6, USAMO P2, USAMO P3) never reached 7/7 within 10 rounds.
  • Human expert review confirmed all self‑picked 7/7 solutions are complete and correct, validating the low false‑positive design.
  • M3’s solutions tend to be conservative (exhaustive case analysis) and depend on multi‑round search, even for routine problems.

Theoretical and Practical Implications

  • RL with generative verifiers: The paper demonstrates that reliable long‑horizon RL for mathematical proof requires a verifier engineered to minimize false positives rather than maximize static accuracy. The four‑layer defense‑in‑depth pipeline and the std‑threshold filter are concrete contributions to the design of reward signals for domains without executable oracles.
  • Test‑time scaling: MaxProof shows that converting best@K into pass@1 via population‑level search, guided refinement, and tournament selection can yield substantial gains (8–10 points on IMO/USAMO) without changing the base model architecture. This suggests that system‑level design can meaningfully close the gap to larger frontier models.
  • Capability decomposition: Separating proof generation, verification, and repair into three aligned experts, then merging them, provides a blueprint for building general‑purpose models that can self‑improve on hard reasoning tasks.
  • Limitations: The framework still suffers from selection loss (one case of 4‑point gap) and cannot overcome missing core ideas (e.g., IMO 2025 P6). Verifier miscalibration at the edge of correctness remains an open challenge.

Conclusion

The paper presents M3 and MaxProof as a record of sustained pursuit toward competition‑level mathematical proof in language models. M3 trains three atomic capabilities (proof generation, verification, repair) using a defense‑in‑depth generative verifier, and MaxProof scales them at test time through population‑level search and tournament selection. The released model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold‑medal threshold. Key lessons include the centrality of a conservative, low‑false‑positive verifier, the detectability of reward‑hacking patterns through multi‑signal monitoring, and the effectiveness of turning single‑shot generation into an evolving search process. The authors acknowledge that strongest closed‑source systems still remain ahead, and future work should address selection loss, verifier calibration, and closing the gap in conciseness and essential reasoning.

Related papers