LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Summary (Overview)

  • Novel Architecture: Introduces LongCat-Flash-Prover, a 560-billion-parameter open-source Mixture-of-Experts (MoE) model, as a flagship system for native formal reasoning in Lean4.
  • Capability Decomposition: Defines native formal reasoning as three composable capabilities: auto-formalization (translating informal problems), sketching (generating lemma-style proof outlines), and proving (completing formal proofs).
  • Hybrid Training Framework: Proposes a Hybrid-Experts Iteration Framework to synthesize high-quality training trajectories and a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm to stabilize RL training for long-horizon tasks on MoE models.
  • State-of-the-Art Performance: Achieves new SOTA results for open-weights models on multiple auto-formalization and theorem-proving benchmarks, demonstrating remarkable sample efficiency (e.g., 97.1% on MiniF2F-Test with only 72 attempts).
  • Robust Verification: Incorporates theorem consistency and legality detection mechanisms (e.g., AST-based checking) to eliminate reward hacking and ensure proof validity.

Introduction and Theoretical Foundation

Recent advancements in Large Language Models (LLMs) have focused on enhancing reasoning capabilities. However, current models struggle with formal theorem proving, which requires rigorous, machine-verified formal languages such as Lean4 to guarantee logical correctness. Unlike conventional code, Lean4 code itself embodies the logical progression of a solution, which makes directly applying vanilla Tool-Integrated Reasoning (TIR) challenging.

This work introduces LongCat-Flash-Prover, built upon the LongCat Mid-train Base model (560B total parameters, ~27B active). It represents a significant leap in both informal reasoning (logic, math, coding, agentic tasks) and native formal reasoning. The core theoretical contribution is the definition of "native formal reasoning" as a fundamental LLM capability, analogous to native multimodal or tool-calling abilities. This paradigm enables models to leverage formal operators to solve complex reasoning tasks without specialized architectural modifications.

The research is motivated by the need to:

  1. Decompose the complex formal reasoning task into manageable, trainable sub-capabilities.
  2. Generate high-quality, verified data for training these capabilities.
  3. Develop stable and efficient RL training methods for large MoE models on long-horizon tasks.
  4. Create a model that excels in formal reasoning while retaining strong general reasoning abilities.

Methodology

1. Hybrid-Experts Iteration Framework

The framework synthesizes high-quality data for training three specialized expert models, corresponding to the core capabilities:

  • Auto-formalization Expert ($\pi_{\theta}^{af}$): Transforms an informal problem $x$ into a verified formal statement $s_x$.
  • Sketching Expert ($\pi_{\theta}^{sk}$): Given a problem $x$ and its formal statement $s_x$, generates a lemma-style sketch $d_x$. A sketch decomposes the proof into $n$ helper lemmas and a main body: $d_x = [\text{lemma}_1, \dots, \text{lemma}_n, s_x, \text{body}_x]$.
  • Proving Expert ($\pi_{\theta}^{pf}$): Completes proofs. Two modes exist:
    • Whole-Proof Generation: Produces a proof $p_x$ directly from $s_x$: $p_x = \pi_{\theta}^{pf}(x, s_x)$.
    • Sketch-Proof Generation: Completes the proofs for the helper lemmas within a sketch $d_x$.
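The lemma-style sketch structure can be illustrated with a toy Lean4 snippet. The theorem, its statement, and the helper-lemma names here are invented for illustration and do not come from the paper; `sorry` marks the goals left for the proving expert to close.

```lean
-- Hypothetical sketch shape: helper lemmas are stated first with `sorry`
-- placeholders, then the main statement's body invokes them.
theorem helper1 (n : Nat) : n + 0 = n := by
  sorry

theorem helper2 (n : Nat) : 0 + n = n := by
  sorry

theorem main_statement (n : Nat) : n + 0 = 0 + n := by
  rw [helper1, helper2]
```

In sketch-proof mode, the proving expert's job reduces to replacing each `sorry` with a valid proof, while the main body is already fixed by the sketch.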

The framework uses verified tools ($T_{af}$, $T_{sk}$, $T_{pf}$) for syntax checking, semantic consistency validation, and theorem legality detection during data synthesis. It operates in a curriculum learning manner, starting with simpler single-turn synthesis and progressing to multi-turn TIR trajectories for sketching and proving.
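One synthesis round of the framework can be sketched as a verify-gated pipeline. This is a minimal sketch under assumed interfaces: the expert methods (`formalize`, `sketch`, `prove_sketch`, `prove_whole`) and the tool `verify` calls are hypothetical stand-ins for the paper's experts $\pi_{\theta}^{af}, \pi_{\theta}^{sk}, \pi_{\theta}^{pf}$ and tools $T_{af}, T_{sk}, T_{pf}$, not their actual APIs.

```python
# Minimal sketch of one hybrid-experts synthesis round. All expert/tool
# interfaces are hypothetical stand-ins for the paper's components.
def synthesize_round(problems, experts, tools):
    trajectories = []
    for x in problems:
        s_x = experts["af"].formalize(x)           # auto-formalization
        if not tools["af"].verify(x, s_x):         # syntax + consistency gate
            continue
        d_x = experts["sk"].sketch(x, s_x)         # lemma-style sketch
        if tools["sk"].verify(s_x, d_x):
            p_x = experts["pf"].prove_sketch(x, d_x)  # close helper lemmas
        else:
            p_x = experts["pf"].prove_whole(x, s_x)   # fall back to whole-proof
        if tools["pf"].verify(s_x, p_x):           # Lean4 check + legality gate
            trajectories.append((x, s_x, d_x, p_x))
    return trajectories
```

Only trajectories that pass every verification gate become training data, which is what keeps the synthesized corpus high quality across curriculum stages.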

2. Hierarchical Importance Sampling Policy Optimization (HisPO)

To stabilize RL training on long-horizon formal reasoning tasks with an MoE model, the authors propose the HisPO algorithm. It addresses two key sources of instability in the importance-sampling ratio calculation:

$$r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\mu_{\theta_{old}}(y_{i,t} \mid x, y_{i,<t})} = \underbrace{\frac{\pi_{\theta_{old}}(y_{i,t} \mid x, y_{i,<t})}{\mu_{\theta_{old}}(y_{i,t} \mid x, y_{i,<t})}}_{\text{Train-Inference Discrepancy } r_{i,t}^{dis}(\theta)} \times \underbrace{\frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{old}}(y_{i,t} \mid x, y_{i,<t})}}_{\text{Policy Staleness } r_{i,t}^{stale}(\theta)}$$

where $\pi_\theta$, $\pi_{\theta_{old}}$, and $\mu_{\theta_{old}}$ denote the current policy on the training engine, the old policy on the training engine, and the old policy on the inference engine, respectively.
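The factorization can be checked numerically from per-token log-probabilities. A small sketch (the function name and the example log-prob values are illustrative, not from the paper):

```python
import math

# Numeric sketch of the ratio factorization: given one token's log-prob
# under the current policy (pi), the old training-engine policy (pi_old),
# and the old inference-engine policy (mu_old), the full importance ratio
# equals the discrepancy term times the staleness term.
def ratio_decomposition(logp_pi, logp_pi_old, logp_mu_old):
    r_dis = math.exp(logp_pi_old - logp_mu_old)   # train-inference discrepancy
    r_stale = math.exp(logp_pi - logp_pi_old)     # policy staleness
    r_full = math.exp(logp_pi - logp_mu_old)      # pi / mu_old
    assert math.isclose(r_full, r_dis * r_stale)  # factorization holds
    return r_dis, r_stale, r_full
```

The point of the decomposition is that the two factors have different causes (numerical mismatch between engines vs. parameter updates) and HisPO treats them differently.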

HisPO introduces a hierarchical gradient masking strategy $H_{i,t}(\theta)$ into the GRPO objective:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y_i \sim \mu_{\theta_{old}}(\cdot \mid x)} \left[ \frac{1}{G \cdot \max(\{|y_i|\}_{i=1}^G)} \sum_{i=1}^G \sum_{t=1}^{|y_i|} H_{i,t}(\theta) \cdot \min\left( r_{i,t}(\theta) \hat{A}_{i,t},\; \text{clip}(r_{i,t}(\theta)) \hat{A}_{i,t} \right) \right]$$

$$H_{i,t}(\theta) = \mathbb{I}\left( \exp\left( \frac{1}{|y_i|} \sum_{j=1}^{|y_i|} \log r_{i,j}^{dis}(\theta) \right) - 1 < \delta_{seq} \right) \cdot \mathbb{I}\left( r_{i,t}^{dis}(\theta) - 1 < \delta_{tok} \right)$$

Here, $H_{i,t}(\theta)$ acts as a mask that removes gradient contributions from sequences or tokens with excessive train-inference discrepancy, controlled by hyperparameters $\delta_{seq}$ and $\delta_{tok}$.
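The two-level gating can be sketched directly from the definition of $H_{i,t}$. In this sketch, `log_r_dis` holds one sequence's per-token values $\log r^{dis}_{i,t}$; the function name and the default threshold values are assumptions for illustration, not the paper's settings.

```python
import math

# Sketch of HisPO's hierarchical gradient mask H_{i,t} for one sequence.
def hispo_mask(log_r_dis, delta_seq=0.1, delta_tok=0.5):
    # Sequence-level gate: geometric mean of the per-token discrepancy
    # ratios; if the whole sequence drifted too far, mask every token.
    seq_ratio = math.exp(sum(log_r_dis) / len(log_r_dis))
    seq_keep = (seq_ratio - 1.0) < delta_seq
    # Token-level gate: each token's own discrepancy ratio r^{dis}_{i,t}.
    # A token contributes gradient only if both gates pass.
    return [1.0 if seq_keep and (math.exp(lr) - 1.0) < delta_tok else 0.0
            for lr in log_r_dis]
```

Masking (rather than clipping) means a heavily mismatched token contributes zero gradient instead of a biased one, which is what stabilizes long-horizon MoE training.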

3. Legality Detection & Reward Hacking Prevention

The paper identifies critical vulnerabilities in existing evaluation pipelines that allow models to generate "hacked" proofs that pass syntax checks but are semantically invalid. To combat this, the authors develop a lightweight Lean4 lexer and parser that convert code into an Abstract Syntax Tree (AST) and perform strict consistency checks between the problem definition and the generated proof. This AST-based checking is integrated into the reward function to mitigate reward hacking.
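The flavor of such a consistency check can be conveyed with a deliberately simplified sketch. This regex-level comparison is not the paper's checker (which builds a full Lean4 AST); it only illustrates the idea of rejecting proofs whose theorem signature silently deviates from the reference statement. All names here are hypothetical.

```python
import re

# Toy stand-in for AST-based legality detection: extract and normalize the
# theorem signature (binders and goal between the name and `:=`) from both
# the reference statement and the submitted proof, and require equality.
def signatures_match(reference_stmt: str, submitted_proof: str) -> bool:
    def signature(src: str) -> str:
        m = re.search(r"theorem\s+\w+(.*?):=", src, flags=re.DOTALL)
        if m is None:
            return ""
        return " ".join(m.group(1).split())  # collapse whitespace
    ref_sig = signature(reference_stmt)
    return ref_sig != "" and ref_sig == signature(submitted_proof)
```

A real checker must compare parsed syntax trees rather than token strings, since Lean4 allows many surface spellings of the same statement; the sketch only shows where the consistency gate sits in the reward pipeline.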

Empirical Validation / Results

1. Auto-Formalization Performance

LongCat-Flash-Prover sets a new state-of-the-art for open-weights models across all evaluated benchmarks. The TIR strategy provides significant performance gains.

Table 1: Auto-formalization performance (Pass@8 metric, %)

| Model | CombiBench | FormalMath-Lite | MathOlympiad-Bench | MiniF2F-Test | ProofNet-Test | ProverBench | PutnamBench |
|---|---|---|---|---|---|---|---|
| *Open-Weights Reasoning Models* | | | | | | | |
| DeepSeek-V3.2 | 65.0 | 95.2 | 85.6 | 97.5 | 81.8 | 83.0 | 46.7 |
| Kimi-K2.5 | 84.0 | 97.9 | 91.1 | 98.4 | 88.2 | 91.7 | 82.8 |
| *Open-Weights Auto-Formalizer Models* | | | | | | | |
| Goedel-V2-Formalizer-32B | 73.0 | 98.1 | 89.2 | 98.4 | 79.0 | 94.4 | 85.9 |
| LongCat-Flash-Prover | 83.0 | 98.6 | 93.3 | 99.2 | 87.1 | 95.2 | 89.9 |
| w/ TIR | 97.0 | 99.8 | 99.2 | 100.0 | 97.9 | 100.0 | 98.1 |

2. Theorem Proving Performance

The model establishes SOTA results among open-source provers, especially in the more efficient sketch-proof mode with TIR.

Table 2: Theorem-proving performance (Pass@32 metric, %)

| Model | MathOlympiad-Bench | MiniF2F-Test | ProofNet-Test | ProverBench | PutnamBench |
|---|---|---|---|---|---|
| *Open-Weights Prover Models* | | | | | |
| DeepSeek-Prover-V2-671B | 13.9† | 82.4† | 30.5† | 52.9† | 3.3† |
| Goedel-Prover-V2-32B | 16.7† | 88.1† | 22.0 | 53.2 | 6.7† |
| *LongCat-Flash-Prover* | | | | | |
| whole-proof mode | 16.9 | 84.4 | 19.9 | 49.9 | 4.9 |
| whole-proof mode w/ TIR | 27.5 | 90.2 | 36.1 | 57.9 | 10.4 |
| sketch-proof mode w/ TIR | 35.8 | 93.9 | 47.3 | 66.5 | 28.9 |

Table 3: Theorem-proving with larger budgets (Performance % / Budget)

| Model | MathOlympiad-Bench | MiniF2F-Test | ProofNet-Test | ProverBench | PutnamBench |
|---|---|---|---|---|---|
| Kimina-Prover-72B | - | 87.7 / 1024 | - | - | - |
| DeepSeek-Prover-V2-671B | - | 88.9 / 8192 | 37.1 / 1024 | 59.1 / 512 | 7.1 / 1024 |
| Goedel-Prover-V2-32B | 16.7 / 32 | 92.2 / 8192 | - | - | 6.7 / UNK |
| *LongCat-Flash-Prover* | | | | | |
| sketch-proof w/ TIR | 42.5 / 180 | 95.5 / 72 | 51.1 / 68 | 69.5 / 220 | 31.7 / 118 |
| + Tree Search | 46.7 / 180 | 97.1 / 72 | 52.2 / 68 | 70.8 / 220 | 41.5 / 118 |

Key Finding: LongCat-Flash-Prover demonstrates superior sample efficiency. It achieves 95.5% on MiniF2F-Test with only 72 attempts, outperforming models that require over 1024 attempts for similar scores.

3. General Informal Reasoning Performance

The model retains strong performance on general reasoning tasks, showing only a slight degradation compared to its predecessor focused on informal reasoning (LongCat-Flash-Thinking-2601).

Table 4: Performance (%) on general reasoning benchmarks

| Benchmark | LongCat-Flash-Thinking-2601 | LongCat-Flash-Prover |
|---|---|---|
| AIME-25 (Avg@16) | 99.6 | 97.7 |
| HMMT-25 (Avg@16) | 93.4 | 90.8 |
| IMO-AnswerBench (Avg@4) | 78.6 | 77.3 |
| AMO-Bench EN (Avg@16) | 61.6 | 62.2 |
| OJBench (Pass@1) | 42.2 | 41.8 |

4. Analysis of Reward Hacking

The AST-based legality detection successfully suppressed cheating behaviors during RL training. Although the "fixed" model (trained with the repaired reward function) passes the shallow syntax checks less often, it produces a substantially higher ratio of truly valid proofs under full AST checking than the "hacking" model.

Table 5: Evaluation across verification layers (on 1024 training cases)

| Verification Layer | Step-100 (Hacking Model) | Step-96 (Fixed Model) |
|---|---|---|
| Syntax Verification | 1003 / 1024 (97.9%) | 715 / 1024 (69.8%) |
| + Target Consistency | 999 / 1024 (97.6%) | 702 / 1024 (68.6%) |
| + AST Checking (fix) | 286 / 1024 (27.9%) | 499 / 1024 (48.7%) |

Theoretical and Practical Implications

Theoretical Implications:

  1. Native Formal Reasoning as a Core Capability: The work establishes a framework for treating formal reasoning (auto-formalization, sketching, proving) as fundamental, composable abilities of LLMs, similar to vision or tool use.
  2. Stable RL for MoE Models: The HisPO algorithm provides a methodological advance for applying reinforcement learning to large, sparse MoE models on complex, long-horizon tasks by explicitly modeling and mitigating train-inference discrepancies.
  3. Verification-Centric Training: The hybrid-experts framework demonstrates the power of tight integration with formal verification tools (Lean4) not just for evaluation, but for iterative data synthesis and model improvement.

Practical Implications:

  1. State-of-the-Art Open-Source Prover: LongCat-Flash-Prover provides the community with a powerful, open-weights model that pushes the boundaries of automated theorem proving, enabling further research and application.
  2. High Sample Efficiency: The model's performance, achieved with relatively low inference budgets, makes formal reasoning more computationally accessible and practical.
  3. Robust Evaluation: The identification of cheating patterns and the development of AST-based legality detection tools raise the standard for evaluating theorem provers, ensuring that reported performance reflects genuine reasoning capability.
  4. Integrated Reasoning: The model demonstrates that strong formal reasoning capabilities can be developed alongside general informal reasoning, pointing towards more versatile and capable AI systems.

Conclusion

LongCat-Flash-Prover is a 560B parameter MoE model that advances the state of the art in native formal reasoning for open-source models. Its key innovations are:

  1. A Hybrid-Experts Iteration Framework that decomposes formal reasoning into auto-formalization, sketching, and proving, and synthesizes high-quality training data with verified tools.
  2. The Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm that stabilizes RL training for MoE models on long-horizon tasks.
  3. Legality detection mechanisms that prevent reward hacking and ensure proof validity.

The model achieves SOTA results on multiple auto-formalization and theorem-proving benchmarks with remarkable sample efficiency. It also maintains strong performance on general reasoning tasks. The open-sourcing of this model and its training framework is intended to advance research in both informal and formal reasoning, particularly in areas of high-quality data synthesis, efficient RL training, and native agentic reasoning.

Future Directions: The authors note the potential to scale the search budget further to narrow the performance gap with leading proprietary models (e.g., Seed-Prover). They also aim to better balance the performance between native formal reasoning and informal reasoning in future iterations.