Summary (Overview)
- Arbor is a general framework for Autonomous Optimization (AO) that enables an AI agent to iteratively improve a research artifact (codebase, pipeline, etc.) through long-horizon experimentation without step-level human supervision.
- The core innovation is Hypothesis Tree Refinement (HTR): a persistent tree that links hypotheses, artifact versions, experimental evidence, and distilled insights, serving simultaneously as search frontier, long-term memory, and auditable research record.
- A long-lived coordinator manages global research strategy over the tree, while short-lived executors test individual hypotheses in isolated git worktrees, returning structured evidence.
- Arbor achieves the best held-out result on all six real research tasks (model training, harness engineering, data synthesis), with more than 2.5× the average relative held-out gain of Codex and Claude Code under the same budget.
- On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in the comparison; ablations confirm that both the hypothesis tree and insight feedback are critical for performance.
Introduction and Theoretical Foundation
Scientific research is a long-horizon process of human intelligence: researchers form hypotheses, test them via experiments, interpret successes and failures, and let those lessons reshape future exploration. The paper formalizes this as Autonomous Optimization (AO), defined as a tuple:
$$(A_0,\ O,\ E_{\mathrm{dev}},\ E_{\mathrm{test}})$$
where $A_0$ is the initial mutable artifact (codebase + data), $O$ is the objective specifying the improvement direction, $E_{\mathrm{dev}}$ returns feedback for free use during search, and the held-out $E_{\mathrm{test}}$ measures whether dev-driven improvement transfers. The agent aims to return:
$$A^{*} = \operatorname*{arg\,opt}_{A}\ E_{\mathrm{test}}(A), \quad \text{with the optimum taken in the direction given by } O,$$
subject to not using $E_{\mathrm{test}}$ as an exploration oracle.
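As a rough illustration of this formalization, the sketch below encodes the four components as a Python dataclass and runs a search that consults only the dev evaluator, calling the held-out evaluator once at the end. The names `AOTask` and `search`, and the toy learning-rate artifact, are assumptions made for this example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable

Artifact = dict  # hypothetical stand-in for a codebase + data snapshot


@dataclass
class AOTask:
    """Illustrative container for the four components of an AO task."""
    initial_artifact: Artifact
    higher_is_better: bool                   # improvement direction of the objective
    dev_eval: Callable[[Artifact], float]    # free to call during search
    test_eval: Callable[[Artifact], float]   # held out; not an exploration oracle

    def better(self, a: float, b: float) -> bool:
        """True if score `a` improves on score `b` under the objective."""
        return a > b if self.higher_is_better else a < b


def search(task: AOTask, candidates: list[Artifact]) -> Artifact:
    """Pick the best candidate using dev feedback only."""
    best = task.initial_artifact
    best_score = task.dev_eval(best)
    for cand in candidates:
        score = task.dev_eval(cand)
        if task.better(score, best_score):
            best, best_score = cand, score
    return best


# Toy task: lower is better, and dev and test optima differ slightly.
task = AOTask(
    initial_artifact={"lr": 0.1},
    higher_is_better=False,
    dev_eval=lambda a: abs(a["lr"] - 0.03),
    test_eval=lambda a: abs(a["lr"] - 0.025),
)
chosen = search(task, [{"lr": 0.05}, {"lr": 0.03}, {"lr": 0.2}])
final_test_score = task.test_eval(chosen)  # evaluated once, after search ends
```

The point of the separation is that `test_eval` never appears inside the search loop, so the final held-out score reflects transfer rather than selection on the test signal.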
Current systems (Codex, Claude Code) can execute long trajectories but lack a mechanism to make research cumulative: they treat each trial as an independent local attempt, losing the structure of competing hypotheses, evidence interpretation, and constraint accumulation. Arbor addresses this by making the research state persistent and operational through a hypothesis tree.
Methodology
Hypothesis Tree as Research State
Let $T$ be a rooted hypothesis tree. Each node $v = (h_v, i_v, m_v)$ is a research unit:
- Hypothesis $h_v$: a verifiable/falsifiable claim about how to improve the artifact (coarse near the root, concrete at the leaves).
- Insight $i_v$: a reusable interpretation of evidence; for leaves, it summarizes what was tried and why; for internal nodes, it abstracts over the children's insights.
- Metadata $m_v$: connects the node to executable evidence: node status, dev score, factual result, and git branch/commit reference.
The tree separates internal direction nodes from executable leaf nodes. After a leaf is executed, its score, result, artifact reference, and distilled insight are written back, and the insight is propagated upward via abstraction.
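The node structure and the write-back step described above might be represented as follows. This is a minimal sketch, not the paper's implementation: the field names, the status labels, the commit strings, and the string-joining `abstract` function (which stands in for whatever abstraction mechanism the system actually uses) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    hypothesis: str                      # falsifiable claim
    insight: str = ""                    # reusable interpretation of evidence
    status: str = "pending"              # assumed labels: pending / done
    dev_score: Optional[float] = None    # metadata: dev score
    result: str = ""                     # metadata: factual result
    commit: str = ""                     # metadata: git reference
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list["Node"] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> "Node":
        child = Node(hypothesis=hypothesis, parent=self)
        self.children.append(child)
        return child


def abstract(insights: list[str]) -> str:
    """Placeholder abstraction: joins child insights into one string."""
    return " | ".join(insights)


def write_back(leaf: Node, dev_score: float, result: str,
               commit: str, insight: str) -> None:
    """Record evidence on an executed leaf, then refresh ancestor insights."""
    leaf.status, leaf.dev_score = "done", dev_score
    leaf.result, leaf.commit, leaf.insight = result, commit, insight
    node = leaf.parent
    while node is not None:
        node.insight = abstract([c.insight for c in node.children if c.insight])
        node = node.parent


# Hypothetical example tree with one direction node and two leaves.
root = Node("Improve the search harness")
direction = root.add_child("Better query formulation helps")
leaf_a = direction.add_child("Rewrite queries into keyword form")
leaf_b = direction.add_child("Issue two paraphrased queries per step")
write_back(leaf_a, 0.55, "accuracy rose", "abc123", "keyword queries help")
write_back(leaf_b, 0.50, "no change", "def456", "paraphrases add noise")
```

After both write-backs, `direction.insight` and `root.insight` hold the combined lesson from the two leaves, which is the information a later ideation step would condition on.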
Hypothesis Tree Refinement (HTR)
A coordinator (long-lived) owns the tree and executes a six-step loop: Observe, Ideate, Select, Dispatch, Backpropagate, Decide. Short-lived executors test individual hypotheses in isolated worktrees.
Key steps:
- Observe: re-grounds the coordinator in the current tree state (frontier, insights, constraints, best artifact).
- Ideate: proposes child hypotheses under a chosen parent, conditioned on accumulated evidence.
- Select: chooses pending leaves to execute, balancing expected utility with evidence from ancestors/siblings.
- Dispatch: sends selected hypotheses to parallel executors, each materializing the intervention, evaluating it on the development evaluator $E_{\mathrm{dev}}$, and returning a compact report: dev score, factual result, distilled insight, branch reference.
- Backpropagate: writes evidence into leaf nodes, then updates insights along the path to the root via abstraction.
- Decide: chooses whether to continue, prune, or merge. Promotion is guarded by a held-out merge gate: the candidate is evaluated on the held-out evaluator $E_{\mathrm{test}}$ in a fresh worktree and merged only if it improves over the current best under the objective $O$.
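The six steps above can be sketched as a toy loop. Everything in this example is an assumption made for illustration: the "artifact" is a single number `x`, the executor's dev score is a made-up function that peaks at `x = 3`, selection simply takes the first proposals, and the Decide step promotes on dev score alone (the held-out merge gate is omitted for brevity).

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_executor(hypothesis: dict) -> dict:
    """Short-lived worker: tests one hypothesis and returns a compact report."""
    x = hypothesis["x"]
    score = -(x - 3.0) ** 2  # toy dev score, highest at x = 3
    return {"x": x, "dev_score": score, "evidence": f"x={x:.2f} -> dev {score:.2f}"}


def htr_loop(rounds: int = 5, width: int = 3, seed: int = 0) -> dict:
    rng = random.Random(seed)
    root = {"x": 0.0, "dev_score": -9.0, "evidence": "x=0.00 -> dev -9.00",
            "insight": "", "parent": None, "children": []}
    best, nodes = root, [root]
    for _ in range(rounds):
        # Observe: re-read the tree state; here, just the current best node.
        parent = best
        # Ideate: propose child hypotheses near the chosen parent.
        proposals = [{"x": parent["x"] + rng.uniform(-2.0, 2.0)}
                     for _ in range(2 * width)]
        # Select: placeholder policy that keeps the first `width` proposals.
        selected = proposals[:width]
        # Dispatch: run the selected hypotheses in parallel executors.
        with ThreadPoolExecutor(max_workers=width) as pool:
            reports = list(pool.map(run_executor, selected))
        # Backpropagate: attach evidence to leaves, refresh insights up to the root.
        for report in reports:
            report.update(insight=report["evidence"], parent=parent, children=[])
            parent["children"].append(report)
            nodes.append(report)
        node = parent
        while node is not None:
            node["insight"] = "; ".join(
                [node["evidence"]] + [c["insight"] for c in node["children"]])
            node = node["parent"]
        # Decide: promote the best new leaf if it beats the current best on dev.
        top = max(reports, key=lambda r: r["dev_score"])
        if top["dev_score"] > best["dev_score"]:
            best = top
    return {"best": best, "num_nodes": len(nodes), "root": root}


outcome = htr_loop()
```

Even in this toy form, the tree keeps every tested hypothesis and its evidence, so later rounds could condition on failures as well as on the current best.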
The executor contract ensures each experiment is bound to a single hypothesis, keeping local flexibility while preserving the semantic meaning of tree updates.
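One way to picture the executor contract is as a typed report bound to exactly one hypothesis, plus an isolated worktree per experiment. The sketch below is an assumption-laden illustration: `ExecutorReport`, the `hyp/<id>` branch naming, and the `worktrees/` directory are hypothetical, and the function only builds the `git worktree add` command rather than running it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorReport:
    hypothesis_id: str  # the single hypothesis this experiment was bound to
    dev_score: float    # score on the development evaluator
    result: str         # factual outcome of the experiment
    insight: str        # distilled, reusable lesson
    branch: str         # git branch holding the modified artifact


def worktree_command(hypothesis_id: str, base_commit: str) -> list[str]:
    """Build (not run) a git command that creates an isolated worktree
    on a fresh branch for one hypothesis."""
    branch = f"hyp/{hypothesis_id}"        # hypothetical naming scheme
    path = f"worktrees/{hypothesis_id}"    # hypothetical directory layout
    return ["git", "worktree", "add", "-b", branch, path, base_commit]


cmd = worktree_command("h42", "abc123")
report = ExecutorReport("h42", 0.61, "dev accuracy 0.61", "caching helps", "hyp/h42")
```

Passing `cmd` to `subprocess.run` inside a git repository would create the worktree; keeping the report immutable (`frozen=True`) reflects the idea that evidence, once returned, is recorded rather than edited.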
Empirical Validation / Results
AO Task Suite
Six tasks across three types:
| Type | Task | Initial Material | Metric & Split |
|---|---|---|---|
| Model Training | Optimizer Design | NanoGPT-Bench; tuned Muon baseline | Steps to target loss (↓); test averages two seeds |
| Model Training | Architecture Design | autoresearch LLM codebase | Final loss (↓); test averages two seeds |
| Harness Engineering | Terminal-Bench 2.0 | Official terminal-agent codebase | Pass rate (↑); 36 dev / 53 test |
| Harness Engineering | BrowseComp | Minimal ReAct-style search harness | Accuracy (↑); 50 dev / 300 test |
| Data Synthesis | Search-Agent Data Synth. | Hand-designed search-data pipeline | Mean pass gap (↑); 50 dev / 100 test seeds |
| Data Synthesis | Math-Reasoning Data Synth. | Hand-designed math-data pipeline | Mean pass gap (↑); 50 dev / 96 test problems |
Main Results
Table 2: Main results on real research tasks. Each cell shows Dev / Test scores; the parenthesized ∆ values give the relative improvement over the initial artifact (model training) or the absolute change (other tasks).
| Type | Task | Initial | Codex | Claude Code | Arbor (Ours) |
|---|---|---|---|---|---|
| Model Training | Optimizer Design (steps ↓) | 3325 / 3325 | 3325 / 3325 (+0.00% / +0.00%) | 3275 / 3287.5 (+1.50% / +1.13%) | 3225 / 3237.5 (+3.01% / +2.63%) |
| Model Training | Architecture Design (loss ↓) | 1.096 / 1.098 | 1.089 / 1.083 (+0.64% / +1.37%) | 1.033 / 1.033 (+5.75% / +5.92%) | 1.029 / 1.028 (+6.11% / +6.38%) |
| Harness Engineering | Terminal-Bench 2.0 (pass ↑) | 58.33 / 69.81 | 63.89 / 73.59 (+5.56 / +3.78) | 75.00 / 71.70 (+16.67 / +1.89) | 72.22 / 77.36 (+13.89 / +7.55) |
| Harness Engineering | BrowseComp (acc. ↑) | 52.50 / 45.33 | 57.50 / 50.00 (+5.00 / +4.67) | 55.00 / 53.33 (+2.50 / +8.00) | 72.50 / 67.67 (+20.00 / +22.34) |
| Data Synthesis | Search-Agent (gap ↑) | 4.00 / 5.00 | 12.00 / 9.00 (+8.00 / +4.00) | 12.00 / 12.00 (+8.00 / +7.00) | 16.00 / 18.00 (+12.00 / +13.00) |
| Data Synthesis | Math-Reasoning (gap ↑) | 2.00 / 1.04 | 6.00 / 6.25 (+4.00 / +5.21) | 8.00 / 8.33 (+6.00 / +7.29) | 24.00 / 20.83 (+22.00 / +19.79) |
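The parenthesized ∆ values in Table 2 are consistent with the two small formulas below, checked here against three Arbor entries. The function names are assumptions for this example; the relative-improvement formula is inferred from the reported numbers rather than quoted from the paper.

```python
def relative_improvement(initial: float, final: float) -> float:
    """Percent reduction relative to the initial value (lower is better)."""
    return (initial - final) / initial * 100.0


def absolute_change(initial: float, final: float) -> float:
    """Difference in score points (higher is better)."""
    return final - initial


# Optimizer Design, Arbor: steps to target loss, dev and test.
opt_dev = round(relative_improvement(3325, 3225), 2)
opt_test = round(relative_improvement(3325, 3237.5), 2)
# Terminal-Bench 2.0, Arbor: test pass rate.
tb_test = round(absolute_change(69.81, 77.36), 2)
```

These evaluate to the +3.01%, +2.63%, and +7.55 entries reported in the table.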
- Arbor achieves the best held-out test score on all six tasks.
- The dev/test split exposes overfitting: for example, Claude Code has the highest dev score on Terminal-Bench (75.00) but a lower test score (71.70) than both Codex (73.59) and Arbor (77.36); Arbor's held-out gate is designed to prevent such overfitting.
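The held-out gate referred to above can be sketched as a comparison on the test evaluator alone. This is a minimal illustration under stated assumptions: artifacts are identified by hypothetical branch names, the scores are invented for the example, and a real system would presumably cache the incumbent's test score instead of re-evaluating it.

```python
from typing import Callable

Artifact = str  # hypothetical: an artifact is identified by its branch name


def merge_gate(candidate: Artifact, current_best: Artifact,
               test_eval: Callable[[Artifact], float],
               higher_is_better: bool = True) -> Artifact:
    """Return the artifact to keep: the candidate only if it improves on test."""
    cand_score = test_eval(candidate)
    best_score = test_eval(current_best)
    if higher_is_better:
        improved = cand_score > best_score
    else:
        improved = cand_score < best_score
    return candidate if improved else current_best


# Invented (dev, test) scores: hyp/a looks best on dev but does not transfer.
SCORES = {"main": (58.0, 70.0), "hyp/a": (75.0, 69.0), "hyp/b": (66.0, 74.0)}


def toy_test_eval(name: Artifact) -> float:
    return SCORES[name][1]


best = "main"
for cand in ["hyp/a", "hyp/b"]:
    best = merge_gate(cand, best, toy_test_eval)
```

In this toy run the dev-strongest candidate `hyp/a` is rejected and `hyp/b` is merged, mirroring the pattern where a high dev score alone does not earn promotion.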
MLE-Bench Lite Results
Table 3: MLE-Bench Lite results (percentages).
| Method | Model | Valid sub. | Above median | Bronze | Silver | Gold | Any medal |
|---|---|---|---|---|---|---|---|
| InternAgent | DeepSeek-R1 | 100.00 | 78.79 | 10.61 | 16.67 | 34.85 | 62.12 |
| ML-Master 2.0 | DeepSeekV3.2-Spe | 100.00 | 84.85 | 13.64 | 31.82 | 30.30 | 75.76 |
| MARS | Gemini-3-Pro | 100.00 | 89.39 | 6.06 | 15.15 | 53.03 | 74.24 |
| AIBuildAI | Claude-Opus-4.6 | 100.00 | 81.82 | 13.64 | 25.76 | 37.88 | 77.27 |
| AI-Scientist | Gemini-3-Flash | 100.00 | 86.36 | 18.18 | 31.82 | 31.82 | 81.82 |
| Arbor | Gemini-3-Flash | 100.00 | 86.36 | 13.64 | 27.27 | 40.90 | 81.82 |
| Arbor | GPT-5.5 | 100.00 | 95.45 | 0.00 | 9.09 | 77.27 | 86.36 |
Arbor with GPT-5.5 achieves the highest Any Medal (86.36%) and Gold (77.27%) among all compared methods.
Ablations
Table 4: Component ablations on MLE-Bench Lite (Claude Opus 4.6 backbone).
| Variant | Valid sub. | Above median | Bronze | Silver | Gold | Any medal |
|---|---|---|---|---|---|---|
| Full Arbor | 100.00 | 90.91 | 4.55 | 27.27 | 50.00 | 81.82 |
| w/o tree | 100.00 | 72.72 | 9.09 | 22.73 | 31.82 | 63.64 |
| w/o insight feedback | 100.00 | 77.27 | 4.55 | 13.64 | 36.36 | 54.54 |
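- Removing either the tree or insight feedback lowers performance substantially. On Any Medal, removing insight feedback costs more (81.82 → 54.54) than removing the tree (81.82 → 63.64), although the no-tree variant has the lower Above-median and Gold rates.
- The two components are complementary: the tree organizes competing hypotheses, while insight feedback carries reusable information forward.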
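The size of each ablation's effect on Any Medal follows directly from Table 4. The short computation below is only arithmetic on the reported percentages; the dictionary and function names are assumptions for this example.

```python
# Any Medal percentages copied from Table 4.
ANY_MEDAL = {
    "Full Arbor": 81.82,
    "w/o tree": 63.64,
    "w/o insight feedback": 54.54,
}


def drop_from_full(table: dict[str, float]) -> dict[str, float]:
    """Percentage-point drop of each ablated variant relative to Full Arbor."""
    full = table["Full Arbor"]
    return {name: round(full - value, 2)
            for name, value in table.items() if name != "Full Arbor"}


drops = drop_from_full(ANY_MEDAL)
```

This gives a drop of 18.18 points without the tree and 27.28 points without insight feedback.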
Additional Findings
- Backbone generality: Arbor works with Gemini-3-Flash, Claude Opus 4.6, and GPT-5.5; gains are model-agnostic.
- Cross-task transfer: A BrowseComp-evolved harness improves unseen tasks (HLE: +6.0%, DeepSearchQA: +8.0%), showing generalizable improvements.
- Token consumption: Arbor uses 20–43M tokens, comparable to the baselines, but achieves larger gains through structured search rather than more sampling.
- Node statistics (Table 5): Many nodes improve dev, but only a subset are merged, confirming the held-out gate's utility.
Theoretical and Practical Implications
- Theoretical: The paper formalizes Autonomous Optimization (AO) as a distinct class of long-horizon research tasks. The hypothesis tree provides a principled representation of research state that separates hypotheses, evidence, and insights, turning trial-and-error into a cumulative, auditable process.
- Practical: Arbor demonstrates that persistent hypothesis management enables stronger and more general held-out gains than flat trial-and-error, even with comparable token budgets. The framework is model-agnostic and transfers across task types (training, harness, data synthesis) and benchmarks (AO tasks, MLE-Bench Lite).
- Design insight: The held-out merge gate is critical to prevent overfitting to development feedback. The separation of coordinator (strategic) and executor (local) roles allows scalable parallel experimentation while maintaining a coherent research state.
Conclusion
Arbor provides a general framework for autonomous research by organizing exploration through Hypothesis Tree Refinement. The persistent tree binds hypotheses, artifact versions, evidence, and insights, enabling a coordinator to manage strategic search while executors test individual ideas. Across six real research tasks and MLE-Bench Lite, Arbor consistently outperforms strong coding-agent baselines, with gains driven by the tree structure and insight propagation rather than increased token budgets. The framework's ability to transfer learned improvements across tasks and its model-agnostic nature suggest that persistent hypothesis management is a promising abstraction for autonomous research. Future work includes expanding the task suite, handling multi-objective optimization, and reducing search cost.