# NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

> The strongest AI coding agent surpasses published SOTA on only 17.8% of 90 Nature-sourced benchmark tasks, and it succeeds mainly through methodological translation rather than invention.

- **Source:** [arXiv](https://arxiv.org/abs/2606.24530)
- **Published:** 2026-06-25
- **Permalink:** https://picx.dev/p/fuFjtc
- **Whiteboard:** https://picx.dev/p/fuFjtc/image

## Summary

- **NatureBench is a cross-discipline benchmark of 90 tasks distilled from Nature-family publications** (2022–2025), designed to evaluate whether AI coding agents can move beyond reproduction toward *discovery* on real scientific problems, using the published SOTA as the scoring anchor.
- **The benchmark is built via NatureGym**, an automated pipeline that constructs standardized, containerized per-task environments from source papers, addressing the environment-fragmentation problem that has limited prior agent-on-research benchmarks.
- **Under a strict web-search-disabled protocol, the strongest agent (Claude Opus 4.7) surpasses SOTA ($g > 0.1$) on only 17.8% of tasks** and matches or surpasses it ($g \geq 0$) on 47.8%.
- **Success is driven primarily by methodological translation** (45.5% of validated successes) — converting scientific tasks into familiar supervised-prediction problems — rather than genuine scientific invention.
- **Failures are dominated by wrong method choice (45.1%) and insufficient compute budget (24.4%)**, not by task misunderstanding.

## Introduction and Theoretical Foundation

### Background and Motivation

AI coding agents are rapidly moving toward autonomous scientific research — from reproducing published implementations to conducting end-to-end research workflows. However, existing benchmarks for evaluating agent capabilities on scientific research have several limitations:

- **Paper-based benchmarks** (PaperBench, CORE-Bench, ReplicationBench) measure whether an agent can *re-implement* a published method, but stop short of asking whether an agent can *discover* a competitive method on its own.
- **Engineering-optimization benchmarks** (MLE-bench, PostTrainBench) target Kaggle competitions or post-training tasks, which do not require the domain reasoning, specialized tooling, or cross-discipline knowledge that characterize natural-science research, and suffer from environment fragmentation.

### Theoretical Basis

The paper defines a **Discovery-oriented evaluation** protocol: rather than reproducing a known method, agents must independently solve the same scientific problem, using the source paper's reported SOTA as the scoring anchor to match or surpass.

The authors argue that existing AI-for-Science systems (AlphaFold, GNoME, etc.) share a structural limitation: humans specify the research programme, curate the data, and fix the success criterion, while AI acts as a more capable instrument inside that programme. NatureBench tests the **missing horizontal capability**: whether contemporary coding agents can solve tasks across six scientific domains using published SOTA as a unified scoring anchor.

## Methodology

### NatureGym Pipeline

NatureGym converts a published Nature-family paper into a containerized task package through three review-gated stages:

**Stage 1: Paper Filtering**
- Three-level cascade filter examining: task extractability, evaluation automatability, and data completeness
- Adversarial review to catch false positives

**Stage 2: Dataset Acquisition & Verification**
- Download data and determine the **algorithm boundary** — keep inputs to algorithm $A$, drop $A$'s outputs
- Verify decomposability (whether $D_{\text{dev}}$ separates from $D_{\text{eval}}$) and instance validity

**Stage 3: Task Package Construction**
- Imposes an **information firewall** that removes the source method from each package
- Builds containerized environments with:
  - Agent-visible: `problem/` (README, data description, input data)
  - Hidden from agent: `evaluation/` (evaluator, ground truth)
  - Infrastructure: Dockerfile, metadata

The pipeline refines a per-paper record $T = (A, D, M, S, B)$ representing algorithm, dataset, metric, SOTA score, and optional baseline.
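As a minimal sketch of how the record $T = (A, D, M, S, B)$ and the visible/hidden package split might be represented, the snippet below uses a plain dataclass. The class name `TaskRecord`, its field names, the helper `agent_visible`, and all example values are illustrative assumptions, not NatureGym's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskRecord:
    """Per-paper record T = (A, D, M, S, B); field names are hypothetical."""
    algorithm: str                    # A: the source paper's method (withheld from the agent)
    dataset: str                      # D: identifier of the task data
    metric: str                       # M: name of the primary metric
    sota_score: float                 # S: paper-reported SOTA value
    baseline: Optional[float] = None  # B: optional baseline score


def agent_visible(paths: list[str]) -> list[str]:
    """Keep only files under problem/, mirroring the visible/hidden split."""
    return [p for p in paths if p.startswith("problem/")]


# Hypothetical example values, for illustration only.
record = TaskRecord("source-method", "example-dataset", "accuracy", 0.91)
files = [
    "problem/README.md",
    "problem/data/train.csv",
    "evaluation/evaluator.py",
    "evaluation/ground_truth.csv",
    "Dockerfile",
]
print(record.baseline)       # None
print(agent_visible(files))  # ['problem/README.md', 'problem/data/train.csv']
```

Freezing the dataclass reflects the idea that the record is fixed once the review gates have passed; whether the real pipeline treats it as immutable is not stated in the source.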

### SOTA-Normalized Relative Gap

To compare agents across tasks with heterogeneous metrics, each task is scored by a single normalized quantity:

$$g_i = \text{dir}_i \cdot \frac{m_i - m^{\text{sota}}_i}{|m^{\text{sota}}_i|}$$

where $m_i$ is the agent's primary metric value, $m^{\text{sota}}_i$ is the paper-reported SOTA, and $\text{dir}_i \in \{+1, -1\}$ encodes metric direction. $g_i \geq 0$ means the agent matches or surpasses the published result. The task-level score averages $g_i$ across instances, with $g^{\text{fail}}_i = -1.0$ for no valid submission.
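The gap formula can be sketched in a few lines of Python. The function names, the use of `None` to mark an instance with no valid submission, and the error raised for a zero SOTA value are assumptions made for this illustration; the source does not say how a zero denominator is handled.

```python
from typing import Optional, Sequence

FAIL_GAP = -1.0  # g_fail from the text: assigned when there is no valid submission


def relative_gap(m: float, m_sota: float, higher_is_better: bool) -> float:
    """SOTA-normalized gap g = dir * (m - m_sota) / |m_sota|."""
    if m_sota == 0:
        # Assumption: the source does not define the gap for a zero SOTA value.
        raise ValueError("gap is undefined when the SOTA metric is zero")
    direction = 1.0 if higher_is_better else -1.0
    return direction * (m - m_sota) / abs(m_sota)


def task_score(instance_gaps: Sequence[Optional[float]]) -> float:
    """Average the gaps over instances, substituting FAIL_GAP for missing ones."""
    filled = [FAIL_GAP if g is None else g for g in instance_gaps]
    return sum(filled) / len(filled)


# Higher-is-better metric: 0.75 against a SOTA of 0.5 is a 50% relative gain.
print(relative_gap(0.75, 0.5, higher_is_better=True))   # 0.5
# Lower-is-better metric: an error of 1.5 against a SOTA of 2.0 is a 25% gain.
print(relative_gap(1.5, 2.0, higher_is_better=False))   # 0.25
# One solved instance and one with no valid submission.
print(task_score([0.5, None]))                          # -0.25
```

Because the gap is divided by $|m^{\text{sota}}_i|$, it is bounded below by $-1$ only for non-negative higher-is-better metrics; for lower-is-better metrics it can be arbitrarily negative.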

### Evaluation Protocol

- Each agent runs in an isolated Docker container with a 4-hour wall-clock budget
- Web search is disabled
- Submission is iterative, via three endpoints: `/evaluate`, `/best_score`, `/time_remaining`
- A post-hoc validity judge (Claude Sonnet 4.6) screens for shortcut behaviors
- Hardware: 3 tasks run CPU-only, 70 on RTX 3090/4090, and 17 compute-intensive tasks on A800
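The iterative-submission pattern might look roughly like the loop below. The source names only the three endpoints and gives no request or response format, so the sketch takes three plain callables standing in for them; `submission_loop`, `FakeEvaluator`, the 60-second reserve, and the higher-is-better scoring are all hypothetical.

```python
from typing import Callable, Iterable, List, Optional


def submission_loop(
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
    best_score: Callable[[], Optional[float]],
    time_remaining: Callable[[], float],
    reserve_seconds: float = 60.0,
) -> Optional[float]:
    """Submit candidates until the budget is nearly spent, then report the best score."""
    for candidate in candidates:
        if time_remaining() <= reserve_seconds:
            break  # keep a small reserve rather than risk a timed-out submission
        evaluate(candidate)
    return best_score()


class FakeEvaluator:
    """In-memory stand-in for the three endpoints; every detail is hypothetical."""

    def __init__(self, budget_seconds: float, cost_per_eval: float, scores: dict):
        self.remaining = budget_seconds
        self.cost = cost_per_eval
        self.scores = scores
        self.calls: List[str] = []
        self.best: Optional[float] = None

    def evaluate(self, candidate: str) -> float:    # stands in for /evaluate
        self.remaining -= self.cost
        self.calls.append(candidate)
        score = self.scores[candidate]
        if self.best is None or score > self.best:  # assumes higher is better
            self.best = score
        return score

    def best_score(self) -> Optional[float]:        # stands in for /best_score
        return self.best

    def time_remaining(self) -> float:              # stands in for /time_remaining
        return self.remaining


# Simulated 4-hour budget in which each evaluation costs one hour.
fake = FakeEvaluator(budget_seconds=4 * 3600, cost_per_eval=3600,
                     scores={"a": 0.2, "b": 0.7, "c": 0.5, "d": 0.6, "e": 0.9})
result = submission_loop("abcde", fake.evaluate, fake.best_score, fake.time_remaining)
print(result, fake.calls)  # 0.7 ['a', 'b', 'c', 'd']
```

In the simulated run the fifth candidate is never evaluated because the budget is exhausted first, which illustrates how a fixed wall-clock limit can cap the score an agent reaches.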

### Benchmark Composition

The final 90 tasks span 6 scientific domains, sourced from 6 Nature-family journals (primarily *Nature Machine Intelligence*, *Nature Methods*, and *Nature Computational Science*). Tasks cover 8 ML task types (prediction/regression 29, classification 19, clustering 14, generation 9, and four rarer tail types), with 81 distinct primary metrics across 333 evaluation instances.

## Empirical Validation / Results

### Main Results

**Table 4: Main results on NatureBench.** Rows are sorted by overall Surpass-SOTA rate. S = Surpass-SOTA ($g > 0.1$) and M = Match-SOTA ($g \geq 0$), both as percentages of tasks.

| Model | All S↑ | All M↑ | Protein S↑ | Protein M↑ | Cellular S↑ | Cellular M↑ | Physical S↑ | Physical M↑ | Molec. S↑ | Molec. M↑ | Relat. S↑ | Relat. M↑ | Biomed. S↑ | Biomed. M↑ |
|-------|--------|--------|------------|------------|-------------|-------------|-------------|-------------|-----------|-----------|-----------|-----------|------------|------------|
| **Claude Opus 4.7** | **17.8** | **47.8** | 12.5 | 56.2 | 22.6 | 54.8 | 30.8 | 46.2 | 18.2 | 45.5 | 0.0 | 60.0 | 7.1 | 21.4 |
| Gemini 3.5 Flash | 15.6 | 37.8 | 6.2 | 43.8 | 25.8 | 51.6 | 30.8 | 30.8 | 0.0 | 18.2 | 0.0 | 60.0 | 7.1 | 14.3 |
| GPT-5.5 | 14.4 | 44.4 | 6.2 | 50.0 | 25.8 | 54.8 | 23.1 | 38.5 | 0.0 | 18.2 | 0.0 | 60.0 | 7.1 | 35.7 |
| Claude Opus 4.6 | 12.2 | 36.7 | 12.5 | 31.2 | 19.4 | 41.9 | 23.1 | 30.8 | 0.0 | 36.4 | 0.0 | 60.0 | 0.0 | 28.6 |
| Qwen 3.7 Max | 10.0 | 28.9 | 12.5 | 37.5 | 16.1 | 35.5 | 15.4 | 23.1 | 0.0 | 18.2 | 0.0 | 40.0 | 0.0 | 14.3 |
| Kimi K2.6 | 8.9 | 30.0 | 12.5 | 37.5 | 12.9 | 29.0 | 15.4 | 15.4 | 0.0 | 27.3 | 0.0 | 60.0 | 0.0 | 28.6 |
| GPT-5.4 | 8.9 | 27.8 | 6.2 | 37.5 | 12.9 | 29.0 | 23.1 | 30.8 | 0.0 | 18.2 | 0.0 | 60.0 | 0.0 | 7.1 |
| GLM-5.1 | 7.8 | 28.9 | 6.2 | 25.0 | 12.9 | 35.5 | 7.7 | 23.1 | 0.0 | 18.2 | 0.0 | 60.0 | 7.1 | 21.4 |
| DeepSeek-V4-Pro | 4.4 | 26.7 | 6.2 | 37.5 | 9.7 | 32.3 | 0.0 | 15.4 | 0.0 | 18.2 | 0.0 | 60.0 | 0.0 | 7.1 |
| MiniMax-M2.7 | 1.1 | 13.3 | 0.0 | 18.8 | 3.2 | 16.1 | 0.0 | 7.7 | 0.0 | 0.0 | 0.0 | 20.0 | 0.0 | 14.3 |
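The two headline rates follow directly from the per-task gaps and the thresholds $g > 0.1$ and $g \geq 0$. The sketch below shows the computation on made-up gap values; the function name `sota_rates` and the example list are assumptions for illustration.

```python
from typing import Sequence, Tuple


def sota_rates(task_gaps: Sequence[float],
               surpass_threshold: float = 0.1) -> Tuple[float, float]:
    """Return (Surpass-SOTA %, Match-SOTA %) over a list of task-level gaps."""
    n = len(task_gaps)
    surpass = sum(g > surpass_threshold for g in task_gaps)  # strictly above 0.1
    match = sum(g >= 0 for g in task_gaps)                   # includes surpassing tasks
    return round(100 * surpass / n, 1), round(100 * match / n, 1)


# Hypothetical task-level gaps; -1.0 marks a task with no valid submission.
example_gaps = [0.25, 0.05, 0.0, -0.02, -1.0]
print(sota_rates(example_gaps))  # (20.0, 60.0)
```

Every task that surpasses SOTA also counts as matching it, which is why the M column is never smaller than the S column in Table 4.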

### Score Distribution

**Table 5: Gap summary and submission rates** ($\tilde{g}$ = median gap, $\bar{g}$ = mean gap)

| Model | $\tilde{g}_{\text{all}}$ | $\bar{g}_{\text{all}}$ | $\tilde{g}_{\text{valid}}$ | $\bar{g}_{\text{valid}}$ | CR% | SR% |
|-------|-------------------------|----------------------|---------------------------|-------------------------|-----|-----|
| Claude Opus 4.7 | **−0.007** | −4.54 | **−0.007** | −4.54 | **100.0** | **100.0** |
| Gemini 3.5 Flash | −0.083 | −5.71 | −0.041 | −5.98 | 94.4 | 98.9 |
| GPT-5.5 | −0.055 | −2.81 | +0.001 | −3.14 | 84.4 | 98.9 |
| Claude Opus 4.6 | −0.061 | −2.02 | −0.061 | −2.02 | **100.0** | **100.0** |
| Qwen 3.7 Max | −0.121 | −2.94 | −0.105 | −3.03 | 95.6 | 98.9 |
| Kimi K2.6 | −0.142 | −10.11 | −0.087 | −10.88 | 92.2 | 94.4 |
| GPT-5.4 | −0.123 | −3.72 | −0.113 | −3.88 | 94.4 | **100.0** |
| GLM-5.1 | −0.150 | −8.44 | −0.131 | −8.98 | 93.3 | 93.3 |
| DeepSeek-V4-Pro | −0.242 | −8.57 | −0.239 | −8.66 | 98.9 | 98.9 |
| MiniMax-M2.7 | −0.401 | −11.76 | −0.347 | −12.53 | 93.3 | 98.9 |
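In Table 5 the median gaps sit close to zero while the mean gaps are strongly negative, which is the pattern expected when a few runs have very large negative gaps. The sketch below reproduces that effect on invented numbers; the values are hypothetical and are not taken from the benchmark.

```python
from statistics import mean, median

# Hypothetical gaps: four near-SOTA runs and one badly failing run.
# A lower-is-better error metric at 41x the SOTA value gives g = -40.
gaps = [0.02, -0.01, -0.05, -0.10, -40.0]

print(median(gaps))          # -0.05
print(round(mean(gaps), 3))  # -8.028
```

The median is the more robust summary here: a single extreme run shifts the mean by several units and leaves the median unchanged.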

### Solution Mechanisms (from 900 runs: 90 tasks × 10 agents)

**Success modes** (among Match-SOTA runs):
- Supervised proxy prediction: 45.5%
- Search/tuning: 17.6%
- Engineering pipeline: 11.0%
- Pretraining/scaling: 8.6%
- Domain-reasoned alternatives: 8.3%
- Method-aligned solutions: 9.0%

**Failure modes** (among below-SOTA/invalid runs):
- Method-layer failures: 61.1% (wrong method choice 45.1%)
- Execution-layer: 28.7% (insufficient budget/time 24.4%)
- Strategy: 7.0%
- Understanding: 3.1%

### Domain Performance

A stable difficulty gradient emerges, shared across agents ($\rho \geq 0.71$). The percentages below are the median Match-SOTA rate across the ten agents in Table 4:
- **Easier tier**: Relational Reasoning (60.0%), Protein Biology (37.5%), Cellular Omics (35.5%)
- **Harder tier**: Physical Modeling (26.9%), Molecular Design (18.2%), Biomedical Modeling (17.9%)
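The tier percentages appear to coincide with the per-domain median of the Match-SOTA column across the ten agents in Table 4. This is an observation from the table rather than a definition stated in the source, and the short check below recomputes it from the table values (listed in row order).

```python
from statistics import median

# Match-SOTA (%) per domain, one value per agent, copied from Table 4.
match_sota = {
    "Relational Reasoning": [60.0, 60.0, 60.0, 60.0, 40.0, 60.0, 60.0, 60.0, 60.0, 20.0],
    "Protein Biology":      [56.2, 43.8, 50.0, 31.2, 37.5, 37.5, 37.5, 25.0, 37.5, 18.8],
    "Cellular Omics":       [54.8, 51.6, 54.8, 41.9, 35.5, 29.0, 29.0, 35.5, 32.3, 16.1],
    "Physical Modeling":    [46.2, 30.8, 38.5, 30.8, 23.1, 15.4, 30.8, 23.1, 15.4, 7.7],
    "Molecular Design":     [45.5, 18.2, 18.2, 36.4, 18.2, 27.3, 18.2, 18.2, 18.2, 0.0],
    "Biomedical Modeling":  [21.4, 14.3, 35.7, 28.6, 14.3, 28.6, 7.1, 21.4, 7.1, 14.3],
}

# Median across agents, then domains ordered from easiest to hardest.
domain_medians = {domain: median(values) for domain, values in match_sota.items()}
for domain, value in sorted(domain_medians.items(), key=lambda kv: -kv[1]):
    print(f"{domain}: {value:.2f}")
```

The medians come out to 60.0, 37.5, 35.5, 26.95, 18.2, and 17.85. The two midpoint values differ from the quoted 26.9 and 17.9 only by rounding, presumably because the paper works from unrounded rates.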

Cross-discipline tasks (15 of 90) show wider gaps: median $\tilde{g}_{\text{all}}$ drops from −0.13 to −0.21.

## Theoretical and Practical Implications

### Theoretical Implications

1. **Current coding agents excel at methodological translation, not scientific invention.** The dominant success pathway is converting scientific tasks into familiar supervised-prediction problems, suggesting agents lack deep scientific reasoning capabilities.
2. **Method selection and implementation depth are the primary bottlenecks**, not code generation or task understanding. This suggests that future improvements in agent architecture should focus on method selection and resource management rather than language model capabilities alone.
3. **Cross-discipline integration remains a distinct challenge**, as evidenced by the wider performance gap on interdisciplinary tasks. This aligns with the observation that contemporary scientists face increasingly restrictive information cocoons.

### Practical Implications

1. **NatureGym provides a reusable pipeline** for converting published papers into reproducible, containerized benchmark tasks, addressing the long-standing environment-fragmentation problem in AI-for-Science evaluation.
2. **The benchmark establishes a credible Discovery-oriented evaluation protocol** that separates genuine algorithmic progress from engineering optimization and shortcut-taking, providing a more rigorous standard for evaluating scientific coding agents.
3. **The finding that agents succeed primarily through "supervised proxy prediction"** suggests that current agent capabilities are best suited for tasks that can be reformulated as standard ML problems, and poorly suited for tasks requiring novel scientific insight.

## Conclusion

NatureBench demonstrates that **current frontier coding agents remain far from reliably matching published SOTA on genuine scientific problems from Nature-family papers**: the strongest agent (Claude Opus 4.7) surpasses SOTA on only 17.8% of tasks and matches it on 47.8%. The dominant success pathway is **methodological translation**, that is, converting scientific tasks into familiar supervised-prediction problems, rather than genuine scientific invention. **Failures are dominated by wrong method choice and insufficient compute budget**, not by task misunderstanding.

The authors release:
- **NatureBench**: 90 Nature-sourced tasks across six scientific domains
- **NatureGym**: The automated pipeline for constructing task packages from papers
- **A public leaderboard** with maintainer-side reproduction

**Future directions** include turning the same benchmark substrate into training data for future scientific-discovery agents, enabling AI systems to learn directly from the task packages how to discover methods that advance the state of the art across disciplines.

