Summary of "Scaling Research-Level Mathematics via Agents"

Summary (Overview)

Introduces RESEARCHMATH-14K: A novel, large-scale dataset of 14,056 research-level mathematical problems curated from academic literature (arXiv, zbMATH, workshops) via a multi-agent pipeline, making it the largest public collection of its kind.
Highlights a Concerning Trend: Analysis of reasoning traces from eight open-weight models reveals that newer model generations (e.g., DeepSeek V4-Pro, Kimi K2.6) produce 5.6× more reference-like mentions and 5.0× more fake references per trace when tackling research-level problems, indicating a regression in factuality despite increased citation behavior.
Demonstrates Utility of Imperfect Data: Fine-tuning Qwen3 models (4B to 30B parameters) on a filtered subset of incorrect reasoning traces (RESEARCHMATH-REASONING-FILTERED) yields an average improvement of +9.2 percentage points over the base models. This suggests that wrong-but-reasonable attempts can provide valuable supervision even without correct solutions.
Provides Comprehensive Behavioral Analysis: The study employs rule-based counters and agent-judges to quantify failure modes (non-attempts, fabricated references, lack of lemma decomposition) across models and benchmarks, isolating issues specific to research-level reasoning.
Releases Open Resources: The dataset and 220K reasoning trajectories are released under the MIT license to support future work on advancing mathematical reasoning in AI.

Introduction and Theoretical Foundation

The paper addresses a critical gap in AI for mathematics: the lack of large-scale, research-level training data. While frontier proprietary models show advanced mathematical capabilities, the open-source landscape is limited to contest-style or undergraduate-level problems. Research-level problems, characterized by genuine uncertainty, the need for decomposition into lemmas, and exploration of novel approaches, are scarce in public datasets, and those that exist are often kept as gated evaluation benchmarks.

The core insight is that the mathematical literature itself is a vast repository of unsolved problems, conjectures, and open questions. The bottleneck is the labor-intensive process of extracting these problems from their context (papers, workshop notes, surveys) and rewriting them into self-contained, standalone questions. The paper's theoretical foundation is that leveraging LLM-powered agents can automate this curation at scale, enabling the creation of a large corpus suitable for training and analyzing model behavior on frontier mathematics.

Methodology

Dataset Curation Pipeline (RESEARCHMATH-14K): A two-stage agentic pipeline processes source documents (1,233 total from arXiv, open-problem web pages, and curated lists):

Extractor Agent: Driven by Codex with GPT-5.5. For each source, it:
- Confirms the document contains open problems.
- Reads the paper end-to-end.
- Extracts verbatim quotes of candidate problems.
- Performs a first-level rewrite, pulling in necessary definitions and statements from across the document.
Refiner Agent: Driven by Claude Code with Opus 4.7. It:
- Re-reads the source to inline all implicit definitions and hypotheses, creating a fully self-contained problem statement.
- Searches citing literature to determine the problem's status (open, partially solved, solved, unknown).
- Produces a final JSON record with statement, status, domain metadata, source, and solution fields.
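
To make the final record concrete, here is a minimal Python sketch of what one refined problem could look like. The summary names only the five fields and the four status labels; the field types, the class name ProblemRecord, the helper to_json, and the example values are assumptions made for illustration, not the paper's actual schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# The four status labels named in the summary.
ALLOWED_STATUSES = {"open", "partially solved", "solved", "unknown"}


@dataclass
class ProblemRecord:
    """Hypothetical shape of one refined problem; field types are illustrative."""
    statement: str                  # fully self-contained problem text
    status: str                     # one of ALLOWED_STATUSES
    domain: str                     # domain metadata (assumed to be a single label)
    source: str                     # identifier of the source document
    solution: Optional[str] = None  # filled only when a solution is known

    def __post_init__(self) -> None:
        # Reject status labels outside the four listed in the summary.
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def to_json(record: ProblemRecord) -> str:
    """Serialize a record to a JSON string with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values only; this is not a real entry from the dataset.
example = ProblemRecord(
    statement="Placeholder statement of an open problem.",
    status="open",
    domain="Discrete Mathematics/Combinatorics",
    source="placeholder-source-id",
)
example_json = to_json(example)
```

Validating the status at construction time is one simple way to keep agent-produced records within the fixed label set.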

Filtering: Problems are embedded using Qwen3-Embedding-8B. Two problems are treated as duplicates when the cosine similarity of their embeddings exceeds 0.9 on either the original or the rewritten statement; one of the pair is removed, with arXiv sources given priority. This reduces the initial 20,835 candidates to the final 14,056-problem set.
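
The deduplication rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: embeddings are taken as precomputed lists of floats, the greedy keep-first strategy is a choice made here, and "prioritizing arXiv sources" is read as keeping the arXiv-sourced copy. The keys orig_emb, rewrite_emb, and source are hypothetical names.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_duplicate(a: Dict, b: Dict, threshold: float = 0.9) -> bool:
    """Duplicate if either the original or the rewritten embeddings exceed the threshold."""
    return (cosine(a["orig_emb"], b["orig_emb"]) > threshold
            or cosine(a["rewrite_emb"], b["rewrite_emb"]) > threshold)


def deduplicate(problems: List[Dict], threshold: float = 0.9) -> List[Dict]:
    """Greedy pass: visit arXiv-sourced problems first and keep a problem only
    if it is not a duplicate of anything already kept (assumed strategy)."""
    # sorted() is stable, and False sorts before True, so arXiv items come first.
    ordered = sorted(problems, key=lambda p: p["source"] != "arxiv")
    kept: List[Dict] = []
    for candidate in ordered:
        if not any(is_duplicate(candidate, k, threshold) for k in kept):
            kept.append(candidate)
    return kept


# Toy two-dimensional embeddings; real embeddings would be much longer vectors.
problems = [
    {"id": "web-1", "source": "web", "orig_emb": [1.0, 0.0], "rewrite_emb": [1.0, 0.0]},
    {"id": "arxiv-1", "source": "arxiv", "orig_emb": [0.99, 0.05], "rewrite_emb": [0.0, 1.0]},
    {"id": "web-2", "source": "web", "orig_emb": [0.0, 1.0], "rewrite_emb": [-1.0, 0.0]},
]
kept_ids = [p["id"] for p in deduplicate(problems)]
```

With these toy vectors, web-1 is dropped because its original-statement embedding is nearly parallel to that of arxiv-1, so kept_ids is ['arxiv-1', 'web-2']. The pairwise loop is quadratic in the number of problems; at the scale of tens of thousands of candidates an approximate nearest-neighbour index would likely be preferable.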

Generating Reasoning Traces (RESEARCHMATH-REASONING): Two teacher models (GPT-OSS-120B and Qwen3-30B-A3B) generate reasoning trajectories for the 14K problems, resulting in approximately 220K responses (16 per prompt). The goal is not to produce correct solutions but to capture attempts.

Behavioral and Factuality Analysis:

Models: Eight open-weight models, grouped into four older→newer pairs: DeepSeek R1→V4-Pro, Kimi K2→K2.6, Qwen3 30B→Qwen3.5 35B, Qwen3 235B→Qwen3.5 397B.
Benchmarks: Five benchmarks to isolate properties:
- Research-Level & AI-Refined: RESEARCHMATH-14K
- Research-Level & Human-Authored: SOOHAK (graduate+), Leipzig Tier-4
- Easier Difficulty: HLE-Verified (math subset), AIME (2024-2026, olympiad level)
Metrics:
- Rule-Based Counters: Three curated phrase lists matched against lowercased traces. Each counter is reported as a row-hit-rate, the fraction of traces with at least one match:
  - cite: Citation-like nouns (e.g., "paper").
  - abandon: Abandonment phrases (e.g., "cannot solve").
  - assume: Unjustified claims (e.g., "known result").
- Agent-Judge:
  - Lemma Decomposition: GPT-5.5 judges whether the solver breaks the problem into provable subgoals within the first 30% of the trace.
  - Reference Verification: Two-stage pipeline: (1) GPT-5.4-nano extracts reference-like spans, (2) a search-enabled Codex agent verifies each span against web search, labeling it as genuine or fake.
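
As a minimal sketch of the rule-based counters described above, the following Python computes a row-hit-rate for each phrase list. The phrase lists are hypothetical: only "paper", "cannot solve", and "known result" come from the summary, the other phrases are placeholders, and plain substring matching is an assumption about how phrases are matched.

```python
from typing import Dict, Iterable, List

# Hypothetical phrase lists: the first phrase in each list is an example given
# in the text; the second is a placeholder added for illustration.
PHRASES: Dict[str, List[str]] = {
    "cite": ["paper", "theorem of"],
    "abandon": ["cannot solve", "give up"],
    "assume": ["known result", "it is well known"],
}


def has_match(trace: str, phrases: Iterable[str]) -> bool:
    """True if any phrase occurs as a substring of the lowercased trace."""
    lowered = trace.lower()
    return any(phrase in lowered for phrase in phrases)


def row_hit_rate(traces: List[str], phrases: Iterable[str]) -> float:
    """Fraction of traces containing at least one phrase from the list."""
    if not traces:
        return 0.0
    phrase_list = list(phrases)
    hits = sum(has_match(trace, phrase_list) for trace in traces)
    return hits / len(traces)


# Toy traces written for this example.
traces = [
    "By a known result, the bound follows.",
    "This Paper proves the case n = 2; the rest is a Known Result.",
    "I cannot solve this in general.",
    "Try induction on n.",
]
rates = {name: row_hit_rate(traces, plist) for name, plist in PHRASES.items()}
```

On the four toy traces, rates is {'cite': 0.25, 'abandon': 0.25, 'assume': 0.5}. A trace counts once no matter how many phrases it contains, which is what separates a row-hit-rate from a raw match count. Substring matching also fires inside longer words, so a production phrase list would need word-boundary handling.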

Training Setup:

Filtering: RESEARCHMATH-REASONING is filtered using the Agent-Judge reference verification. Traces containing any fake reference are removed, creating RESEARCHMATH-REASONING-FILTERED (5,000 traces).
Baseline: 5,000 randomly sampled traces from DASD-Thinking (an olympiad-level dataset).
Fine-tuning: Qwen3-4B, 8B, and 30B-A3B base models are fine-tuned with LoRA on each training set.
Evaluation: Models are evaluated on AIME (n=90), HLE (n=315), and SOOHAK (n=501) using math-verify for scoring.
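
The filtering rule in this setup (drop every trace that contains at least one fake reference) can be sketched as follows. The record layout is hypothetical: each trace is assumed to carry a references list whose items hold a span and a label of "genuine" or "fake", mirroring the two labels produced by the reference-verification judge.

```python
from typing import Dict, List


def has_fake_reference(trace: Dict) -> bool:
    """True if any verified span in the trace was labeled 'fake'."""
    return any(ref["label"] == "fake" for ref in trace["references"])


def filter_traces(traces: List[Dict]) -> List[Dict]:
    """Keep only traces with no fake reference; traces without references pass."""
    return [trace for trace in traces if not has_fake_reference(trace)]


# Toy records with placeholder spans; these are not real judge outputs.
traces = [
    {"id": "t1", "references": [{"span": "placeholder span A", "label": "genuine"}]},
    {"id": "t2", "references": [{"span": "placeholder span B", "label": "genuine"},
                                {"span": "placeholder span C", "label": "fake"}]},
    {"id": "t3", "references": []},
]
kept_ids = [trace["id"] for trace in filter_traces(traces)]
```

Here kept_ids is ['t1', 't3']: a single fake span is enough to remove t2, even though its other reference is genuine. Whether traces with no references at all should pass the filter is a design choice this sketch makes by default; the summary does not say how the paper treats them.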

Empirical Validation / Results

Dataset Characteristics (RESEARCHMATH-14K):

Size: 14,056 problems after deduplication (from an initial 20,835).
Domains: Broad coverage across 11 level-one groups. The largest areas are Analysis/PDEs/Dynamics, Mathematical Physics, Discrete Mathematics/Combinatorics, and Geometry/Topology (63.82% combined).
Status: Majority are open (59.14%), followed by unknown (17.71%), partially solved (14.82%), and solved (8.33%).
Difficulty: In Elo rating comparisons against datasets such as AceMath, AIME, HLE-Verified, and NuminaMath, RESEARCHMATH-14K is rated roughly 400 Elo points higher on the axes of knowledge, novelty, and procedural difficulty.
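
To give a sense of what a gap of about 400 Elo points means, the worked equation below uses the standard Elo expected-score formula with a scale of 400. This is an assumption: the summary does not state which Elo variant or scale the paper uses. Here R_A and R_B are the ratings of the two datasets being compared, and E_A is the expected share of pairwise comparisons in which A is judged harder.

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}},
\qquad
R_A - R_B = 400 \;\Rightarrow\; E_A = \frac{1}{1 + 10^{-1}} = \frac{10}{11} \approx 0.909
```

Under that assumption, a 400-point gap corresponds to the higher-rated dataset's problem being judged harder in roughly 91% of head-to-head comparisons.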

Behavioral Analysis Findings:

Finding: Across model families, the newer generation cites more often but also produces more fake references.

Citation Increase: Row-hit-rate for the cite counter increased by 30-80 percentage points on research-level benchmarks (RESEARCHMATH-14K, Leipzig, SOOHAK) for newer models. The effect diminished on easier benchmarks (HLE, AIME).
Fabrication Epidemic: Agent-Judge analysis of 720 RESEARCHMATH-14K traces found:
- 87.4% (629) contained at least one reference-like mention.
- 54.0% (389) contained at least one fake reference.
- In total, 17.6% of 19,864 extracted mentions were judged fake.
Per-Trace Growth: Newer models produced dramatically more mentions and fakes:
- DeepSeek R1→V4-Pro: 4.9→57.8 mentions (0.5→11.6 fakes)
- Kimi K2→K2.6: 1.9→60.0 mentions (0.1→8.3 fakes)
- Qwen3 30B→Qwen3.5 35B: 6.5→36.7 mentions (1.4→7.7 fakes)
Aggregate: Newer models produce 5.6× more mentions and 5.0× more fakes per trace.
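
The sample-level figures for the 720 judged traces can be re-derived from the raw counts with a few lines of arithmetic. The sketch below uses only numbers quoted above; the per-trace averages it derives are pooled over all eight models and are not stated in the summary itself.

```python
# Counts quoted in the text for the 720 judged RESEARCHMATH-14K traces.
TRACES = 720
WITH_MENTION = 629   # traces with at least one reference-like mention
WITH_FAKE = 389      # traces with at least one fake reference
MENTIONS = 19_864    # total extracted mentions
FAKE_SHARE = 0.176   # share of extracted mentions judged fake

share_with_mention = WITH_MENTION / TRACES   # fraction of traces with a mention
share_with_fake = WITH_FAKE / TRACES         # fraction of traces with a fake
mentions_per_trace = MENTIONS / TRACES       # pooled mean mentions per trace
fake_mentions = FAKE_SHARE * MENTIONS        # approximate number of fake mentions
fakes_per_trace = fake_mentions / TRACES     # pooled mean fakes per trace

summary = {
    "share_with_mention_pct": round(100 * share_with_mention, 1),
    "share_with_fake_pct": round(100 * share_with_fake, 1),
    "mentions_per_trace": round(mentions_per_trace, 1),
    "fake_mentions": round(fake_mentions),
    "fakes_per_trace": round(fakes_per_trace, 1),
}
```

The two shares reproduce the quoted 87.4% and 54.0%. The pooled means come out at about 27.6 mentions and 4.9 fakes per trace, with roughly 3,496 fake mentions in total, which sits between the older-model and newer-model figures listed above.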

Finding: Models parrot the style of research mathematics without engaging its underlying reasoning.

Surface Imitation: The assume counter matched 94.0% of traces, indicating heavy reliance on compressed, unjustified claims. The abandon counter matched only 17.4%.
Lack of Decomposition: The lemma-decomposition Agent-Judge found this behavior almost entirely absent: only 11 of 720 RESEARCHMATH-14K traces (about 1.5%) were marked positive (see Table 2 of the paper).

Training Results:

Finding: Open-problem trajectories teach models more about research-level reasoning than olympiad data does, even without ever solving the problem.

Table: Fine-tuning results overview (scores in %).

Model Size	Training Data	AIME	HLE	SOOHAK	Avg. Gain vs Base
4B	Base	24.4	7.0	24.6	-
4B	DASD-Thinking	11.9	5.6	22.2	-6.1
4B	RESEARCHMATH-FILTERED	13.3	7.9	25.0	+0.7
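8B	Base	(not shown)	(not shown)	(not shown)	-
8B	DASD-Thinking	(not shown)	(not shown)	(not shown)	(Inferior)
8B	RESEARCHMATH-FILTERED	(not shown)	(not shown)	(not shown)	(not shown)
30B-A3B	Base	(not shown)	(not shown)	(not shown)	-
30B-A3B	DASD-Thinking	(not shown)	(not shown)	(not shown)	(Mixed)
30B-A3B	RESEARCHMATH-FILTERED	(not shown)	(not shown)	(not shown)	(not shown)

Note: Per-model scores for the 8B and 30B-A3B rows are not shown in detail. The +9.2 figure is the mean gain of RESEARCHMATH-FILTERED over base across all 9 (model × benchmark) cells, not a per-model value.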

Overall Improvement: Fine-tuning on RESEARCHMATH-REASONING-FILTERED improved over base models in all 9 (model × benchmark) cells, with a mean gain of +9.2 percentage points.
Superior to Olympiad Data: It outperformed training on DASD-Thinking in 8/9 cells. The clearest gains were on HLE and SOOHAK, where it averaged +2.6 percentage points above DASD-Thinking.
Exception: On the easiest benchmark (AIME), the 30B model trained on DASD performed better (+11.1 points), suggesting domain-specific benefits for olympiad-style problems.

Theoretical and Practical Implications

Theoretical Implications:

Scaling Research-Level Data is Feasible: The paper demonstrates that agentic pipelines can effectively mine and refine the vast, untapped resource of open problems in the academic literature, providing a scalable alternative to expensive expert annotation.
The "Citation-Fabrication" Trade-off: The analysis points to a potentially alarming side effect, which the authors consider a likely byproduct of training models with agentic or retrieval-augmented pipelines. Models may learn the form of authoritative, citation-backed reasoning without the corresponding fact-checking capability, leading to increased hallucination when tools are unavailable.
Value of Process Over Product: The success of fine-tuning on filtered, incorrect attempts challenges the assumption that only verified-correct solutions are useful for training. A knowledgeable attempt (exploring reductions, testing examples, introducing relevant objects) contains transferable structural knowledge that benefits reasoning.

Practical Implications:

Resource for the Community: The release of RESEARCHMATH-14K and RESEARCHMATH-REASONING provides a much-needed large-scale benchmark and training resource for advancing mathematical reasoning in open models.
Training Strategy: For organizations aiming to improve model performance on hard, open-ended problems, curating and filtering "wrong-but-reasonable" attempts may be a cost-effective training strategy compared to requiring perfect solutions.
Evaluation and Monitoring: The behavioral metrics (rule-based counters, agent-judges) offer tools for developers to diagnose failure modes in their models' reasoning, particularly concerning factuality and decomposition skills.

Conclusion

This work makes three primary contributions: (1) the creation and release of RESEARCHMATH-14K, the largest public dataset of research-level mathematical problems; (2) a detailed analysis revealing that newer LLM generations exhibit a troubling increase in citation fabrication when tackling such problems, likely a byproduct of agentic training; and (3) an empirical demonstration that filtered, incorrect reasoning traces can serve as an effective supervisory signal, improving model performance by an average of 9.2 percentage points.

The findings suggest a path forward where scaling research-level mathematical reasoning does not strictly depend on scaling expert-written solutions. Future work should explore this "wrong-but-reasonable" supervision signal at larger scales, investigate methods to curb citation hallucination, and further clarify the contexts where verified-correct reasoning remains indispensable for reliable proof generation. The released datasets aim to catalyze progress in this direction.