Visual Summary | NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Summary (Overview)

NatureBench is a cross-discipline benchmark of 90 tasks distilled from Nature-family publications (2022–2025), designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems, using the published SOTA as the scoring anchor.
The benchmark is built via NatureGym, an automated pipeline that constructs standardized, containerized per-task environments from source papers, addressing the environment-fragmentation problem that has limited prior agent-on-research benchmarks.
Under a strict web-search-disabled protocol, the strongest agent (Claude Opus 4.7) surpasses SOTA ($g > 0.1$) on only 17.8% of tasks and matches or surpasses it ($g \geq 0$) on 47.8%.
Success is driven primarily by methodological translation (45.5% of validated successes) — converting scientific tasks into familiar supervised-prediction problems — rather than genuine scientific invention.
Failures are dominated by wrong method choice (45.1%) and insufficient compute budget (24.4%), not by task misunderstanding.

Introduction and Theoretical Foundation

Background and Motivation

AI coding agents are rapidly moving toward autonomous scientific research — from reproducing published implementations to conducting end-to-end research workflows. However, existing benchmarks for evaluating agent capabilities on scientific research have several limitations:

Paper-based benchmarks (PaperBench, CORE-Bench, ReplicationBench) measure whether an agent can re-implement a published method, but stop short of asking whether an agent can discover a competitive method on its own.
Engineering-optimization benchmarks (MLE-bench, PostTrainBench) target Kaggle competitions or post-training tasks, which do not require the domain reasoning, specialized tooling, or cross-discipline knowledge that characterize natural-science research, and suffer from environment fragmentation.

Theoretical Basis

The paper defines a Discovery-oriented evaluation protocol: rather than reproducing a known method, agents must independently solve the same scientific problem, using the source paper's reported SOTA as the scoring anchor to match or surpass.

The authors argue that existing AI-for-Science systems (AlphaFold, GNoME, etc.) share a structural limitation: humans specify the research programme, curate the data, and fix the success criterion, while AI acts as a more capable instrument inside that programme. NatureBench tests the missing horizontal capability: whether contemporary coding agents can solve tasks across six scientific domains using published SOTA as a unified scoring anchor.

Methodology

NatureGym Pipeline

NatureGym converts a published Nature-family paper into a containerized task package through three review-gated stages:

Stage 1: Paper Filtering

Three-level cascade filter examining: task extractability, evaluation automatability, and data completeness
Adversarial review to catch false positives

Stage 2: Dataset Acquisition & Verification

Download data and determine the algorithm boundary: keep the inputs to algorithm $A$, drop $A$'s outputs
Verify decomposability (whether $D_{\text{dev}}$ separates from $D_{\text{eval}}$ ) and instance validity

Stage 3: Task Package Construction

Imposes an information firewall that removes the source method from each package
Builds containerized environments with:
- Agent-visible: problem/ (README, data description, input data)
- Hidden from agent: evaluation/ (evaluator, ground truth)
- Infrastructure: Dockerfile, metadata

The pipeline refines a per-paper record $T = (A, D, M, S, B)$ representing algorithm, dataset, metric, SOTA score, and optional baseline.
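The record can be pictured as a small typed container. The sketch below is illustrative only: the class name `TaskRecord`, its field names, the `higher_is_better` flag, and the helper `redact_method` are assumptions made here, not identifiers from NatureGym.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical container for the per-paper record T = (A, D, M, S, B)."""

    algorithm: Optional[str]          # A: the source paper's method
    dataset: str                      # D: identifier or path of the task data
    metric: str                       # M: name of the primary metric
    sota_score: float                 # S: paper-reported SOTA value of M
    baseline: Optional[float] = None  # B: optional baseline value of M
    higher_is_better: bool = True     # assumption: direction stored with the record


def redact_method(record: TaskRecord) -> TaskRecord:
    """Return a copy with the source method removed, mimicking the firewall idea."""
    return replace(record, algorithm=None)


full = TaskRecord(algorithm="paper-method", dataset="task_data/", metric="auroc", sota_score=0.90)
packaged = redact_method(full)
```

Keeping the record immutable (`frozen=True`) means the redacted copy cannot accidentally share a mutated method field with the original.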

SOTA-Normalized Relative Gap

To compare agents across tasks with heterogeneous metrics, each task is scored by a single normalized quantity:

$$g_i = \text{dir}_i \cdot \frac{m_i - m^{\text{sota}}_i}{|m^{\text{sota}}_i|}$$

where $m_i$ is the agent's primary metric value, $m^{\text{sota}}_i$ is the paper-reported SOTA, and $\text{dir}_i \in \{+1, -1\}$ encodes metric direction ($+1$ when higher is better, $-1$ when lower is better). $g_i \geq 0$ means the agent matches or surpasses the published result. The task-level score averages $g_i$ across the task's evaluation instances, with $g^{\text{fail}}_i = -1.0$ assigned when there is no valid submission.
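A minimal sketch of this scoring rule follows. The function names, the use of `None` to mark a missing submission, and the choice to raise an error when the SOTA value is zero are assumptions of the sketch; the summary does not say how the benchmark handles that edge case or whether SOTA values differ per instance.

```python
from statistics import mean
from typing import Optional, Sequence

FAIL_GAP = -1.0  # gap assigned when there is no valid submission


def relative_gap(agent: Optional[float], sota: float, higher_is_better: bool) -> float:
    """SOTA-normalized relative gap g = dir * (m - m_sota) / |m_sota|."""
    if agent is None:
        return FAIL_GAP
    if sota == 0:
        # Assumption: the normalization is undefined here, so fail loudly.
        raise ValueError("relative gap is undefined when the SOTA value is zero")
    direction = 1.0 if higher_is_better else -1.0
    return direction * (agent - sota) / abs(sota)


def task_gap(agent_values: Sequence[Optional[float]],
             sota_values: Sequence[float],
             higher_is_better: bool) -> float:
    """Average the per-instance gaps of one task."""
    return mean([relative_gap(a, s, higher_is_better)
                 for a, s in zip(agent_values, sota_values)])


acc_gap = relative_gap(0.92, 0.90, higher_is_better=True)    # small positive gap
rmse_gap = relative_gap(0.80, 1.00, higher_is_better=False)  # lower error, positive gap
missing = relative_gap(None, 0.90, higher_is_better=True)    # no valid submission
```

Because the sign is folded into `direction`, a positive gap always means "better than the paper", whichever way the underlying metric runs.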

Evaluation Protocol

Each agent operates in an isolated Docker container with a 4-hour wall-clock budget
Web search disabled
Iterative submission via three endpoints: /evaluate, /best_score, /time_remaining
Post-hoc validity judge (Claude Sonnet 4.6) screens for shortcut behaviors
3 tasks CPU-only, 70 tasks on RTX 3090/4090, 17 compute-intensive tasks on A800
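The iterative-submission part of the protocol can be sketched as a budget-aware loop. Only the three endpoint names come from the text; the `EvalClient` interface, the in-memory `FakeClient`, the fixed per-submission cost, and the `reserve_s` safety margin are hypothetical and say nothing about the real request or response formats.

```python
from typing import Dict, Iterable, Optional, Protocol, Tuple


class EvalClient(Protocol):
    """Hypothetical wrapper: one method per endpoint named in the protocol."""

    def evaluate(self, submission: str) -> float: ...   # /evaluate
    def best_score(self) -> Optional[float]: ...        # /best_score
    def time_remaining(self) -> float: ...              # /time_remaining, in seconds


class FakeClient:
    """In-memory stand-in for local testing; not the benchmark's server."""

    def __init__(self, budget_s: float, cost_s: float, scores: Dict[str, float]):
        self._remaining = budget_s
        self._cost = cost_s  # assumed fixed wall-clock cost per submission
        self._scores = scores
        self._best: Optional[float] = None

    def evaluate(self, submission: str) -> float:
        self._remaining -= self._cost
        score = self._scores[submission]
        if self._best is None or score > self._best:  # assumes higher is better
            self._best = score
        return score

    def best_score(self) -> Optional[float]:
        return self._best

    def time_remaining(self) -> float:
        return max(self._remaining, 0.0)


def run_loop(client: EvalClient, candidates: Iterable[str],
             reserve_s: float = 0.0) -> Tuple[int, Optional[float]]:
    """Submit candidates in order until the remaining time drops to the reserve."""
    submitted = 0
    for candidate in candidates:
        if client.time_remaining() <= reserve_s:
            break
        client.evaluate(candidate)
        submitted += 1
    return submitted, client.best_score()


client = FakeClient(budget_s=4 * 3600, cost_s=6000, scores={"a": 0.5, "b": 0.7, "c": 0.9})
result = run_loop(client, ["a", "b", "c"], reserve_s=3000)
```

In this toy run the third candidate is never submitted: after two submissions 2,400 seconds remain, which is below the 3,000-second reserve.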

Benchmark Composition

The final 90 tasks span 6 scientific domains, sourced from 6 Nature-family journals (primarily Nature Machine Intelligence, Nature Methods, and Nature Computational Science). Tasks cover 8 ML task types (prediction/regression 29, classification 19, clustering 14, generation 9, and tail tasks), with 81 distinct primary metrics across 333 evaluation instances.

Empirical Validation / Results

Main Results

Table 4: Main results on NatureBench, sorted by overall Surpass-SOTA rate (S, $g > 0.1$) and then by Match-SOTA rate (M, $g \geq 0$). All values are percentages of tasks.

Model	All S↑	All M↑	Protein S↑	Protein M↑	Cellular S↑	Cellular M↑	Physical S↑	Physical M↑	Molecular S↑	Molecular M↑	Relational S↑	Relational M↑	Biomedical S↑	Biomedical M↑
Claude Opus 4.7	17.8	47.8	12.5	56.2	22.6	54.8	30.8	46.2	18.2	45.5	0.0	60.0	7.1	21.4
Gemini 3.5 Flash	15.6	37.8	6.2	43.8	25.8	51.6	30.8	30.8	0.0	18.2	0.0	60.0	7.1	14.3
GPT-5.5	14.4	44.4	6.2	50.0	25.8	54.8	23.1	38.5	0.0	18.2	0.0	60.0	7.1	35.7
Claude Opus 4.6	12.2	36.7	12.5	31.2	19.4	41.9	23.1	30.8	0.0	36.4	0.0	60.0	0.0	28.6
Qwen 3.7 Max	10.0	28.9	12.5	37.5	16.1	35.5	15.4	23.1	0.0	18.2	0.0	40.0	0.0	14.3
Kimi K2.6	8.9	30.0	12.5	37.5	12.9	29.0	15.4	15.4	0.0	27.3	0.0	60.0	0.0	28.6
GPT-5.4	8.9	27.8	6.2	37.5	12.9	29.0	23.1	30.8	0.0	18.2	0.0	60.0	0.0	7.1
GLM-5.1	7.8	28.9	6.2	25.0	12.9	35.5	7.7	23.1	0.0	18.2	0.0	60.0	7.1	21.4
DeepSeek-V4-Pro	4.4	26.7	6.2	37.5	9.7	32.3	0.0	15.4	0.0	18.2	0.0	60.0	0.0	7.1
MiniMax-M2.7	1.1	13.3	0.0	18.8	3.2	16.1	0.0	7.7	0.0	0.0	0.0	20.0	0.0	14.3
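The two rates reported in Table 4 follow from thresholding task-level gaps at $g > 0.1$ and $g \geq 0$. The sketch below assumes a flat list of task-level gaps per agent; the function name and the example gaps are hypothetical, with counts chosen only so that the rates reproduce the top row (16 and 43 of 90 tasks), since the per-task gaps themselves are not given in this summary.

```python
from typing import Iterable, Tuple

SURPASS_THRESHOLD = 0.1  # Surpass-SOTA: g > 0.1
MATCH_THRESHOLD = 0.0    # Match-SOTA:   g >= 0


def sota_rates(task_gaps: Iterable[float]) -> Tuple[float, float]:
    """Return (Surpass-SOTA %, Match-SOTA %) over a collection of task-level gaps."""
    gaps = list(task_gaps)
    if not gaps:
        raise ValueError("need at least one task gap")
    surpass = sum(g > SURPASS_THRESHOLD for g in gaps)
    match = sum(g >= MATCH_THRESHOLD for g in gaps)
    return 100.0 * surpass / len(gaps), 100.0 * match / len(gaps)


# Hypothetical gaps for 90 tasks: 16 clear wins, 27 near-ties, 47 below SOTA.
example_gaps = [0.25] * 16 + [0.03] * 27 + [-0.40] * 47
surpass_pct, match_pct = sota_rates(example_gaps)
rounded = (round(surpass_pct, 1), round(match_pct, 1))
```

Every task that surpasses SOTA also counts as matching it, which is why the S value never exceeds the M value in any row of the table.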

Score Distribution

Table 5: Gap summary and submission rates

Model	$\tilde{g}_{\text{all}}$	$\bar{g}_{\text{all}}$	$\tilde{g}_{\text{valid}}$	$\bar{g}_{\text{valid}}$	CR%	SR%
Claude Opus 4.7	−0.007	−4.54	−0.007	−4.54	100.0	100.0
Gemini 3.5 Flash	−0.083	−5.71	−0.041	−5.98	94.4	98.9
GPT-5.5	−0.055	−2.81	+0.001	−3.14	84.4	98.9
Claude Opus 4.6	−0.061	−2.02	−0.061	−2.02	100.0	100.0
Qwen 3.7 Max	−0.121	−2.94	−0.105	−3.03	95.6	98.9
Kimi K2.6	−0.142	−10.11	−0.087	−10.88	92.2	94.4
GPT-5.4	−0.123	−3.72	−0.113	−3.88	94.4	100.0
GLM-5.1	−0.150	−8.44	−0.131	−8.98	93.3	93.3
DeepSeek-V4-Pro	−0.242	−8.57	−0.239	−8.66	98.9	98.9
MiniMax-M2.7	−0.401	−11.76	−0.347	−12.53	93.3	98.9
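The large distance between the two kinds of summary in Table 5 (for example −0.007 against −4.54 for Claude Opus 4.7) is what the gap definition would produce if a few runs miss by a wide margin. Reading $\tilde{g}$ as a median, as the text does for cross-discipline tasks, and $\bar{g}$ as a mean, which is an assumption here, the sketch below shows the effect on hypothetical values: a non-negative higher-is-better metric cannot fall below a gap of −1, whereas a lower-is-better error metric has no lower bound.

```python
from statistics import mean, median


def relative_gap(agent: float, sota: float, higher_is_better: bool) -> float:
    """g = dir * (m - m_sota) / |m_sota|, as defined for the benchmark."""
    direction = 1.0 if higher_is_better else -1.0
    return direction * (agent - sota) / abs(sota)


# Hypothetical tasks: (agent value, SOTA value, higher_is_better)
tasks = [
    (0.92, 0.90, True),
    (0.88, 0.90, True),
    (0.00, 0.90, True),    # worst case for a non-negative higher-is-better metric
    (1.05, 1.00, False),
    (201.0, 1.00, False),  # diverged model: error 201 times the SOTA value
]
gaps = [relative_gap(a, s, h) for a, s, h in tasks]

gap_median = median(gaps)  # robust to the single diverged run
gap_mean = mean(gaps)      # dominated by the single diverged run
```

Here the median stays near zero (about −0.05) while the mean is pulled to roughly −40 by one run, which mirrors the pattern in the table without claiming to reproduce its numbers.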

Solution Mechanisms (from 900 runs: 90 tasks × 10 agents)

Success modes (among Match-SOTA runs):

Supervised proxy prediction: 45.5%
Search/tuning: 17.6%
Engineering pipeline: 11.0%
Pretraining/scaling: 8.6%
Domain-reasoned alternatives: 8.3%
Method-aligned solutions: 9.0%

Failure modes (among below-SOTA/invalid runs):

Method-layer failures: 61.1% (wrong method choice 45.1%)
Execution-layer: 28.7% (insufficient budget/time 24.4%)
Strategy: 7.0%
Understanding: 3.1%

Domain Performance

A stable difficulty gradient emerges, shared across agents ($\rho \geq 0.71$). The rates below are Match-SOTA rates and agree with the median across the ten agents in Table 4:

Easier tier: Relational Reasoning (60.0%), Protein Biology (37.5%), Cellular Omics (35.5%)
Harder tier: Physical Modeling (26.9%), Molecular Design (18.2%), Biomedical Modeling (17.9%)
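The tier figures can be checked against Table 4: they are consistent with the median Match-SOTA rate across the ten agents in each domain. The sketch below copies the Match-SOTA columns from the table in row order; because the table values are already rounded to one decimal, the recomputed medians for Physical Modeling and Biomedical Modeling land on 26.95 and 17.85 and agree with the text only to within rounding.

```python
from statistics import median

# Match-SOTA rates (%) per domain, one value per agent, in Table 4 row order.
MATCH_RATES = {
    "Relational Reasoning": [60.0, 60.0, 60.0, 60.0, 40.0, 60.0, 60.0, 60.0, 60.0, 20.0],
    "Protein Biology": [56.2, 43.8, 50.0, 31.2, 37.5, 37.5, 37.5, 25.0, 37.5, 18.8],
    "Cellular Omics": [54.8, 51.6, 54.8, 41.9, 35.5, 29.0, 29.0, 35.5, 32.3, 16.1],
    "Physical Modeling": [46.2, 30.8, 38.5, 30.8, 23.1, 15.4, 30.8, 23.1, 15.4, 7.7],
    "Molecular Design": [45.5, 18.2, 18.2, 36.4, 18.2, 27.3, 18.2, 18.2, 18.2, 0.0],
    "Biomedical Modeling": [21.4, 14.3, 35.7, 28.6, 14.3, 28.6, 7.1, 21.4, 7.1, 14.3],
}

# Median across agents for each domain.
domain_median = {name: median(rates) for name, rates in MATCH_RATES.items()}

# Domains ordered from easiest to hardest by that median.
ranking = sorted(domain_median, key=domain_median.get, reverse=True)
```

Sorting by the median reproduces the order in which the two tiers list the domains.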

Cross-discipline tasks (15 of 90) show wider gaps: median $\tilde{g}_{\text{all}}$ drops from −0.13 to −0.21.

Theoretical and Practical Implications

Theoretical Implications

Current coding agents excel at methodological translation, not scientific invention. The dominant success pathway is converting scientific tasks into familiar supervised-prediction problems, suggesting agents lack deep scientific reasoning capabilities.
Method selection and implementation depth are the primary bottlenecks, not code generation or task understanding. This suggests that future improvements in agent architecture should focus on method selection and resource management rather than language model capabilities alone.
Cross-discipline integration remains a distinct challenge, as evidenced by the wider performance gap on interdisciplinary tasks. This aligns with the observation that contemporary scientists face increasingly restrictive information cocoons.

Practical Implications

NatureGym provides a reusable pipeline for converting published papers into reproducible, containerized benchmark tasks, addressing the long-standing environment-fragmentation problem in AI-for-Science evaluation.
The benchmark establishes a credible Discovery-oriented evaluation protocol that separates genuine algorithmic progress from engineering optimization and shortcut-taking, providing a more rigorous standard for evaluating scientific coding agents.
The finding that agents succeed primarily through "supervised proxy prediction" suggests that current agent capabilities are best suited for tasks that can be reformulated as standard ML problems, and poorly suited for tasks requiring novel scientific insight.

Conclusion

NatureBench demonstrates that current frontier coding agents remain far from consistently matching published SOTA on genuine scientific problems from Nature-family papers: the strongest agent (Claude Opus 4.7) matches SOTA on fewer than half of the tasks (47.8%) and surpasses it on only 17.8%. The dominant success pathway is methodological translation, that is, converting scientific tasks into familiar supervised-prediction problems, rather than genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding.

The authors release:

NatureBench: 90 Nature-sourced tasks across six scientific domains
NatureGym: The automated pipeline for constructing task packages from papers
A public leaderboard with maintainer-side reproduction

Future directions include turning the same benchmark substrate into training data for future scientific-discovery agents, enabling AI systems to learn directly from the task packages how to discover methods that advance the state of the art across disciplines.