Summary (Overview)

  • SOOHAK Benchmark: Introduces a large-scale, contamination-resistant benchmark for evaluating research-level mathematical reasoning in LLMs, consisting of 439 newly authored problems (340 Challenge, 99 Refusal) and a companion 702-problem SOOHAK-Mini set.
  • Key Findings: Frontier LLMs (Gemini-3-Pro, GPT-5, Claude-Opus-4.5) achieve low accuracy on the Challenge subset (30.4%, 26.4%, 10.4% Avg@3), leaving substantial headroom. Open-weight models perform significantly worse (<15%). The Refusal subset, probing the ability to recognize ill-posed problems, reveals a critical weakness, with no model exceeding 50% accuracy.
  • Human Baseline: A human evaluation on a subset shows aggregated team coverage of 50.6%, with contest-trained undergraduates outperforming PhD-level researchers, highlighting a task-format mismatch. Gemini-3-Pro (60.8%) is the only model to exceed combined human coverage.
  • Scaling Patterns: Performance on the Challenge subset scales roughly linearly with both training and test-time compute. Performance on the Refusal subset does not show clear scaling, indicating it is a distinct and poorly optimized capability.
  • Data Collection: A rigorous, multi-stage pipeline involving 105 contributors (including faculty, PhD students, and IMO medalists) ensures originality and quality, with problems gated against increasingly capable models to target specific difficulty levels.

Introduction and Theoretical Foundation

Following the achievement of gold-medal performance on the International Mathematical Olympiad (IMO) by frontier LLMs, the community is seeking a more challenging target for measuring LLM reasoning. While olympiad-style problems test step-by-step reasoning, research-level problems require such reasoning to advance the frontier of mathematical knowledge itself, making them a compelling next benchmark.

However, existing research-level math benchmarks are scarce (e.g., Riemann Bench and FrontierMath-Tier 4 contain only 25 and 50 problems, respectively) because original, high-quality problems are hard to source. Common construction methods, such as scraping publicly available competitions, lead to training-data overlap and rapid benchmark saturation. Alternative approaches, such as authoring fresh problems by hand, are often confined to a single area or kept small to remain tractable, limiting breadth and comparability. Some benchmarks withhold problems behind access controls to prevent leakage, but this sacrifices transparency and reproducibility.

These issues compound when benchmarks must guide high-stakes model development, where integrity, breadth, and accountability are paramount. The paper introduces SOOHAK to address these gaps: a large-scale, contamination-resistant benchmark newly authored from scratch by expert mathematicians, designed to reliably evaluate next-generation frontier models.


Methodology

Data Collection Pipeline

A rigorous five-stage pipeline (Figure 1) was employed to collect and filter problems:

  1. Submission & Consent: Contributors submit original problems under an agreement affirming no AI use and granting copyright.
  2. Automated LLM-based Checks: LLMs screen each submission for difficulty and for similarity to existing problems (a minimal sketch follows this list).
  3. Manual Review & Investigation: Two human reviewers audit model-generated solutions against contributor-written references, requesting clarifications and banning contributors who submit AI-generated questions.
  4. Contributor Opt-in: Contributors can review feedback and opt in to final submission.
  5. Final Submission Pool: Verified dataset.
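
The paper does not detail how the stage-2 automated checks are implemented. As a minimal, hypothetical sketch, a similarity screen against the existing problem pool could look like the following; the TF-IDF representation, the `flag_near_duplicates` helper, and the 0.8 threshold are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of the stage-2 similarity check: flag a submission if it is
# too close to any problem already in the pool. TF-IDF cosine similarity stands in
# for whatever measure the authors actually used; the 0.8 threshold is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def flag_near_duplicates(submission: str, pool: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices of pool problems whose similarity to `submission` exceeds the threshold."""
    if not pool:
        return []
    vectorizer = TfidfVectorizer().fit(pool + [submission])
    pool_vecs = vectorizer.transform(pool)
    sub_vec = vectorizer.transform([submission])
    sims = cosine_similarity(sub_vec, pool_vecs)[0]
    return [i for i, s in enumerate(sims) if s >= threshold]


# Example: an exact resubmission is flagged, a fresh problem is not.
pool = ["Compute the number of automorphisms of the group Z/12Z.",
        "Find all primes p such that p^2 + 2 is prime."]
print(flag_near_duplicates("Compute the number of automorphisms of the group Z/12Z.", pool))
```

In practice an embedding- or LLM-based comparison is likely stronger; the point of the sketch is only the gating structure, where flagged submissions are routed to the stage-3 manual review.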

Contributor Details & Incentives

  • 105 contributors participated, including 48% faculty, 23% graduate students/postdocs, 25% undergraduates, and 5% undisclosed.
  • Contributors were recruited via direct outreach to mathematics departments.
  • Compensation options: monetary payment (total pool: USD 260,000, paid per accepted question, capped at USD 20,000 per contributor) or authorship on the dataset paper.
  • Problems must be written in English or Korean, typeset in text-only LaTeX, and accompanied by a complete solution and an explicit final answer.

Dataset Splits and Gates

Problems were routed through three model-gated collection gates to target specific difficulty levels:

  • Gate 1: Requires failure of small open models (e.g., Qwen3-7B, OpenThinker3-7B).
  • Gate 2: Requires failure of mid-size open models (e.g., gpt-oss-20B, Qwen3-32B). Problems passing Gates 1 & 2 are merged into SOOHAK-Mini.
  • Gate 3: Requires failure of all large open models in the panel (e.g., gpt-oss-120B, Qwen3-235B, DeepSeek-R1). Problems passing Gate 3 contribute to SOOHAK Challenge. Submission for this gate was limited to selected experts (faculty, postdocs, PhD students, IMO medalists) and supplemented with bulk-purchased problems from ScienceBench.
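
To make the gate routing concrete, below is a minimal sketch under the assumption that the gates nest: a problem clears a gate only if every model in that gate's panel fails it, and it is assigned to the hardest gate it clears. The panel dictionary, the `model_solves` callable, and the routing labels are placeholders for the actual evaluation harness, not the authors' code.

```python
# Hypothetical sketch of the model-gated routing described above.
from typing import Callable

GATE_PANELS = {
    1: ["Qwen3-7B", "OpenThinker3-7B"],
    2: ["gpt-oss-20B", "Qwen3-32B"],
    3: ["gpt-oss-120B", "Qwen3-235B", "DeepSeek-R1"],
}


def highest_gate_cleared(problem: str, model_solves: Callable[[str, str], bool]) -> int:
    """Return the hardest gate (0 if none) at which every panel model fails the problem."""
    cleared = 0
    for gate in (1, 2, 3):
        if any(model_solves(model, problem) for model in GATE_PANELS[gate]):
            break  # some model at this gate solved it, so no harder gate can be cleared
        cleared = gate
    return cleared


def route(problem: str, model_solves: Callable[[str, str], bool]) -> str:
    gate = highest_gate_cleared(problem, model_solves)
    if gate >= 3:
        return "SOOHAK Challenge candidate"
    if gate >= 1:
        return "SOOHAK-Mini candidate"  # cleared Gate 1 (and possibly Gate 2)
    return "rejected (too easy)"
```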

Refusal Subset Construction

SOOHAK Refusal contains 99 items sourced from submissions rejected during quality control because they were ill-posed (containing contradictions, missing assumptions, or having no unique answer). A model is marked correct only if it diagnoses the flaw instead of producing a confident numeric answer.
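
The grading implementation for this subset is not spelled out in the paper. As a hedged sketch of the stated criterion, assuming each item carries a flaw annotation from quality control and that an LLM judge returns a yes/no verdict, grading might look like:

```python
# Hypothetical grading sketch for SOOHAK Refusal items: a response counts as
# correct only if it diagnoses why the problem is ill-posed rather than committing
# to a confident final answer. `ask_judge` is a placeholder for an LLM-judge call.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RefusalItem:
    problem: str
    flaw: str  # e.g. "contradiction", "missing assumption", "no unique answer"


def grade_refusal(item: RefusalItem, response: str, ask_judge: Callable[[str], str]) -> bool:
    prompt = (
        f"Problem: {item.problem}\n"
        f"Known flaw: {item.flaw}\n"
        f"Model response: {response}\n"
        "Does the response identify this flaw instead of committing to a final answer? "
        "Answer yes or no."
    )
    return ask_judge(prompt).strip().lower().startswith("yes")
```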

Dataset Composition and Annotation

  • SOOHAK Challenge: 340 problems, graduate-level and research-adjacent.
  • SOOHAK Refusal: 99 problems.
  • SOOHAK-Mini: 702 problems, spanning high-school olympiad through early graduate material.
  • Each question is annotated with:
    1. Contributor-provided keywords (e.g., for Challenge: "automorphism", "abelian variety", "Fano variety").
    2. An LLM-assigned Mathematics Subject Classification (MSC) subject area.
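
For concreteness, one annotated question might be represented as below. The field names and values are hypothetical; only the two annotation types (contributor keywords and an LLM-assigned MSC subject area) come from the paper.

```python
# Hypothetical record for one annotated problem. Field names and values are
# illustrative, not the paper's schema. MSC class 14 is Algebraic geometry.
example_problem = {
    "split": "Challenge",
    "statement_latex": r"Let $X$ be a smooth Fano variety over $\mathbb{C}$ ...",
    "final_answer": "42",  # placeholder; submissions require an explicit final answer
    "keywords": ["automorphism", "abelian variety", "Fano variety"],
    "msc_subject_area": "14 (Algebraic geometry)",
}
```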

Table 1: Distribution of mathematical subject areas grouped by macro-category.

Macro-category | Subject areas (count) | Total #
Algebra & Discrete | Number theory (269), Combinatorics (131), Algebraic geometry (76), Group theory (67), Field theory (54), Linear algebra (33), Rings and algebras (24), Category theory (11), Nonassociative algebra (9), Commutative algebra (6) | 680
Analysis | Real analysis (115), Series and summability (36), Functional equations (17), Harmonic analysis (11), Measure theory (9), Complex analysis (9), Partial differential equations (8), Potential theory (3), Ordinary differential equations (3), Global analysis (3), Special functions (2), Functional analysis (1), Operator theory (1), Calculus of variations (1) | 233
Geometry & Topology | Geometry (95), Convex and discrete geometry (30), Algebraic topology (25), Manifolds (14), Differential geometry (6), General topology (5) | 175
Probability & Statistics | Probability theory (24), Statistics (1) | 25
Applied / CS / OR | Numerical analysis (9), Information and communication theory (7), Game theory and economics (5), Computer science (4), Operations research (2) | 27
Logic | Mathematical logic (1) | 1

Evaluation Setup

  • Models Evaluated: Eleven models spanning closed and open-weight systems:
    • Closed: Gemini-3-Pro, Gemini-3-Flash, GPT-5, GPT-5-Mini, Claude-Opus-4.5, Claude-Sonnet-4.5, Grok-4.1-Fast.
    • Open-weight: Qwen3-235B-A22B-thinking-2507, GPT-OSS-120B, Kimi-2.5, GLM-5.
  • Metrics: For each model-question pair, three independent responses are sampled.
    • Avg@3: Average accuracy across the three samples.
    • Pass@3: Proportion of questions where at least one of the three samples is correct. Let $c_{i,j} \in \{0, 1\}$ indicate correctness for question $i$ and sample $j \in \{1, 2, 3\}$, with $N$ total questions (a computation sketch follows this list):
    $$\text{avg@3} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{3} \sum_{j=1}^{3} c_{i,j} \right), \qquad \text{pass@3} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left[\max_{j} c_{i,j} = 1\right].$$
  • Answer Judging: GPT-5-Mini is used as an LLM judge to compare the parsed model answer to the gold answer via mathematical equivalence, outputting a binary correctness label.
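
A minimal computation sketch for these two metrics, taking the binary judge labels as given:

```python
# Minimal sketch of Avg@3 / Pass@3: `correct` holds the binary judge labels
# c[i, j] for question i and sample j (three samples per question).
import numpy as np


def avg_and_pass_at_3(correct: np.ndarray) -> tuple[float, float]:
    """Return (Avg@3, Pass@3) in percent for an (N, 3) array of 0/1 labels."""
    assert correct.ndim == 2 and correct.shape[1] == 3
    avg_at_3 = correct.mean()                      # equals the mean of per-question means
    pass_at_3 = (correct.max(axis=1) == 1).mean()  # at least one correct sample per question
    return 100.0 * float(avg_at_3), 100.0 * float(pass_at_3)


# Tiny example: 3 questions, 3 samples each.
labels = np.array([[1, 0, 1],
                   [0, 0, 0],
                   [0, 1, 1]])
print(avg_and_pass_at_3(labels))  # (44.4..., 66.6...)
```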

Human Baseline Setup

  • Participants: 25 participants across five teams with varying mathematical backgrounds (Table 3).
  • Evaluation Set: 79 prompts sampled across benchmark splits (49 Calibration, 30 Challenge).
  • Conditions: 4.5-hour time budget. Participants could use any non-AI tools (programming environments, computer algebra systems, internet search) but were prohibited from using LLMs.
  • Scoring: Pure outcome-based; a prompt is correct only if the final answer is correct.

Table 3: Human participant team profiles.

Team Name | Size | Key Credentials
A | 5 | CS Major (IMO exp.)
B | 5 | Math Major (IMO exp.)
C | 5 | Math Major (IMO Gold)
D | 5 | Math Major
E | 5 | Math Researchers

Empirical Validation / Results

Overall Model Performance

Table 2: Avg@3 and Pass@3 (%) on SOOHAK-Mini and Avg@3 (%) on SOOHAK Challenge.

Model | SOOHAK-Mini (n=702) Avg@3 | SOOHAK-Mini Pass@3 | SOOHAK Challenge (n=340) Avg@3
Closed frontier
Gemini-3-Pro | 71.70 | 80.63 | 30.39
Gemini-3-Flash | 61.40 | 68.80 | 15.69
GPT-5 | 72.22 | 81.48 | 26.37
GPT-5-Mini | 67.14 | 78.49 | 18.82
Claude-Opus-4.5 | 51.38 | 61.40 | 10.39
Claude-Sonnet-4.5 | 40.88 | 51.57 | 5.69
Grok-4.1-Fast | 70.66 | 77.92 | 18.43
Open-weight (largest in each family)
Qwen3-235B-A22B-thinking-2507 | 56.22 | 67.66 | 8.04
GPT-OSS-120B | 61.02 | 75.21 | 11.27
Kimi-2.5 | 66.07 | 74.93 | 13.87
GLM-5 | 63.11 | 71.08 | 9.61
  • Challenge Subset: Frontier models show substantial headroom. Gemini-3-Pro leads with 30.39% Avg@3. Open-weight models trail significantly, with Kimi-2.5 being the best at 13.87%. 124 Challenge items were unsolved by any evaluated model.
  • Refusal Subset: No model exceeds 50% Avg@3. GLM-5 leads at 49.49%, exceeding all closed models. The Qwen3 family performs worst.
  • SOOHAK-Mini: GPT-5 leads at 72.22% Avg@3. Open-weight models remain competitive (Kimi-2.5: 66.07%).

Scaling Analysis

  • Challenge scales roughly linearly with both training compute (within the Qwen3 family) and test-time compute (extended context/token budget).
  • Refusal does not show the same scaling patterns, indicating that refusal/hallucination behavior is governed by different factors and is not directly optimized by current training.
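
As an illustration of how such a trend can be checked, one can fit a line to Challenge accuracy against compute and inspect the residuals. Every number below is a hypothetical placeholder rather than a value from the paper, and the choice of raw versus log compute on the x-axis is an assumption.

```python
# Illustrative linear-fit check for the scaling claim. All data points are
# hypothetical placeholders, NOT results from the paper.
import numpy as np

compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0])        # hypothetical relative training compute
challenge_acc = np.array([2.6, 3.3, 4.5, 6.9, 11.5])  # hypothetical Challenge Avg@3 (%)

slope, intercept = np.polyfit(compute, challenge_acc, deg=1)
residuals = challenge_acc - (slope * compute + intercept)
print(f"slope = {slope:.2f}, max |residual| = {np.abs(residuals).max():.2f}")
```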

Human Baseline Results

  • Aggregate Human Coverage: The combined coverage of all five human teams on the 79-problem evaluation set is 50.6%.
  • Model vs. Human: Gemini-3-Pro (60.8%) is the only model that exceeds combined human coverage.
  • Team Performance Ordering: Contest-trained undergraduates outperform PhD-level researchers.
    • Best single team: Math Major with IMO experience (38.0%).
    • Math Researchers (24.1%) scored lower than top undergraduate teams, indicating a task-format mismatch likely due to the time-pressured, contest-style format favoring short-path solutions and breadth over deep specialization.

Theoretical and Practical Implications

  1. A New Benchmark for Frontier Evaluation: SOOHAK provides a large-scale, contamination-resistant benchmark that remains challenging for frontier LLMs, addressing the scarcity of research-level math benchmarks. Its low saturation (best model: 30.4% on Challenge) offers meaningful headroom for tracking future progress.
  2. Identifying a Critical Weakness: The Refusal subset exposes a fundamental limitation in current LLMs: the inability to reliably recognize and refuse ill-posed problems. This capability is intrinsic to research mathematics (pausing rather than producing unjustified answers) and represents a new optimization target not directly addressed by current models.
  3. Gap Between Closed and Open-weight Models: The significant performance gap on the Challenge subset between closed frontier models and leading open-weight models suggests that open-weight systems transfer less reliably to unpublished, research-adjacent mathematics. This aligns with observations from attempts to apply LLMs to unresolved mathematical problems.
  4. Insights into Scaling and Capabilities: The linear scaling of Challenge performance with compute contrasts with the non-scaling of Refusal performance. This indicates that research-level problem-solving and refusal are distinct capabilities, with the latter not being a simple byproduct of increased model scale or reasoning budget.
  5. Human Performance as a Reference: The human baseline provides an interpretable reference point, revealing that the benchmark primarily rewards contest-style reasoning under time pressure rather than deep research expertise. This highlights the importance of evaluation format and suggests that future benchmarks may need formats better aligned with research practice.

Conclusion

SOOHAK introduces a comprehensive benchmark for evaluating graduate-level and research-adjacent mathematical reasoning in LLMs. Its Challenge subset (340 problems) demonstrates substantial headroom for frontier models, while its Refusal subset (99 problems) identifies a critical failure mode—the inability to recognize ill-posed problems—as a new optimization target. The companion SOOHAK-Mini (702 problems) supports broader tracking across difficulty levels.

The benchmark was constructed via a rigorous, expert-driven pipeline to ensure originality and prevent contamination. Evaluation results show:

  • Frontier closed models achieve low accuracy on Challenge, leaving room for improvement.
  • Open-weight models lag significantly on Challenge.
  • No model excels at Refusal.
  • Aggregated human team coverage is surpassed only by the top model, and the ordering of teams reveals a format mismatch favoring contest-trained solvers.

Future Directions: The authors note limitations in the collection process (time constraints, noisy difficulty labels) and recommend improvements for future benchmark builders: early review infrastructure with explicit rubrics, incentive schemes that reward more than raw difficulty, globally scoped recruitment for broader subfield coverage, and evaluation formats beyond unique-integer answers.

The full dataset will be open-sourced in late 2026, with model evaluations available upon request in the interim, providing a valuable tool for guiding the development of next-generation LLMs in advanced mathematical reasoning.