Summary (Overview)

  • Arbor is a general framework for Autonomous Optimization (AO) that enables an AI agent to iteratively improve a research artifact (codebase, pipeline, etc.) through long-horizon experimentation without step-level human supervision.
  • The core innovation is Hypothesis Tree Refinement (HTR): a persistent tree that links hypotheses, artifact versions, experimental evidence, and distilled insights, serving simultaneously as search frontier, long-term memory, and auditable research record.
  • A long-lived coordinator manages global research strategy over the tree, while short-lived executors test individual hypotheses in isolated git worktrees, returning structured evidence.
  • Arbor achieves the best held-out result on all six real research tasks (model training, harness engineering, data synthesis), with more than 2.5× the average relative held-out gain of Codex and Claude Code under the same budget.
  • On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in the comparison; ablations confirm that both the hypothesis tree and insight feedback are critical for performance.

Introduction and Theoretical Foundation

Scientific research is a long-horizon process of human intelligence: researchers form hypotheses, test them experimentally, interpret successes and failures, and let those lessons reshape future exploration. The paper formalizes this as Autonomous Optimization (AO), defined as a tuple:

$$\mathcal{P} = (M_0, \mathcal{O}, \mathcal{E}_{\text{dev}}, \mathcal{E}_{\text{test}})$$

where $M_0$ is the initial mutable artifact (codebase + data), $\mathcal{O}$ is the objective specifying improvement direction, $\mathcal{E}_{\text{dev}}$ returns feedback for free use during search, and the held-out $\mathcal{E}_{\text{test}}$ measures whether dev-driven improvement transfers. The agent aims to return:

$$M^\star = \arg\max_{M' \in \mathcal{A}} S_{\text{test}}(M')$$

subject to not using $\mathcal{E}_{\text{test}}$ as an exploration oracle.
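To make the tuple concrete, the following is a minimal sketch in Python, not the paper's implementation. The names `AOProblem`, `better`, and `search` are hypothetical, the artifact is reduced to a dictionary of numbers, and the objective is reduced to a single direction flag. The toy evaluators are invented purely for illustration. The point is only that the search loop queries the dev evaluator and never touches the held-out one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Artifact = Dict[str, float]  # hypothetical stand-in for "codebase + data"


@dataclass(frozen=True)
class AOProblem:
    """The tuple P = (M_0, O, E_dev, E_test), in simplified form."""
    initial: Artifact                        # M_0
    maximize: bool                           # O, reduced to a direction (assumption)
    eval_dev: Callable[[Artifact], float]    # E_dev: free to query during search
    eval_test: Callable[[Artifact], float]   # E_test: held out


def better(problem: AOProblem, a: float, b: float) -> bool:
    """True if score a beats score b under the objective direction."""
    return a > b if problem.maximize else a < b


def search(problem: AOProblem, candidates: List[Artifact]) -> Artifact:
    """Pick the best candidate using dev feedback only; E_test is never called here."""
    best, best_dev = problem.initial, problem.eval_dev(problem.initial)
    for cand in candidates:
        score = problem.eval_dev(cand)
        if better(problem, score, best_dev):
            best, best_dev = cand, score
    return best


# Toy instance (invented): minimise a loss whose dev and test optima differ slightly.
problem = AOProblem(
    initial={"lr": 1.0},
    maximize=False,
    eval_dev=lambda m: (m["lr"] - 0.30) ** 2,
    eval_test=lambda m: (m["lr"] - 0.35) ** 2,
)
chosen = search(problem, [{"lr": 0.1}, {"lr": 0.3}, {"lr": 0.5}])
print(chosen, problem.eval_test(chosen))  # the test score is read once, after search
```

Because the dev and test optima differ in this toy, the artifact that is best on dev is close to, but not exactly, the best on test, which is the gap the held-out evaluator is meant to expose.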

Current systems (Codex, Claude Code) can execute long trajectories but lack a mechanism to make research cumulative: they treat each trial as an independent local attempt, losing the structure of competing hypotheses, evidence interpretation, and constraint accumulation. Arbor addresses this by making the research state persistent and operational through a hypothesis tree.

Methodology

Hypothesis Tree as Research State

Let $\mathcal{T} = (V, E)$ be a rooted hypothesis tree. Each node $n \in V$ is a research unit:

$$n = \langle h_n, \iota_n, \mu_n \rangle$$

  • Hypothesis $h_n$: a verifiable/falsifiable claim about how to improve the artifact (coarse near root, concrete at leaves).
  • Insight $\iota_n$: reusable interpretation of evidence; for leaves, summarizes what was tried and why; for internal nodes, abstracts over children's insights.
  • Metadata $\mu_n$: connects to executable evidence (node status, dev score, factual result, git branch/commit reference).

The tree separates internal direction nodes from executable leaf nodes. After a leaf is executed, its score, result, artifact reference, and distilled insight are written back, and the insight is propagated upward via abstraction.
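The node triple and the write-back step can be sketched as a small data structure. This is an illustrative sketch under assumptions, not Arbor's actual schema: the class and field names, the status labels, and the example strings and score are all hypothetical. The "abstraction" over children's insights is replaced here by plain concatenation, whereas the system described above would have the agent produce a summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metadata:                      # mu_n
    status: str = "pending"          # status labels are assumed, not from the paper
    dev_score: Optional[float] = None
    result: str = ""                 # factual result of the run
    git_ref: str = ""                # branch or commit reference


@dataclass(eq=False)
class Node:
    hypothesis: str                                            # h_n
    insight: str = ""                                          # iota_n
    meta: Metadata = field(default_factory=Metadata)
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> "Node":
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child


def write_back(leaf: Node, dev_score: float, result: str, insight: str, git_ref: str) -> None:
    """Record evidence on an executed leaf, then refresh insights up to the root."""
    leaf.meta = Metadata("done", dev_score, result, git_ref)
    leaf.insight = insight
    node = leaf.parent
    while node is not None:
        # Placeholder abstraction: join the children's insights.
        node.insight = " | ".join(c.insight for c in node.children if c.insight)
        node = node.parent


# Hypothetical example values.
root = Node("Improve the search harness")
direction = root.add_child("Better query reformulation helps")
leaf = direction.add_child("Rewrite queries after each failed search")
write_back(leaf, 0.61, "dev accuracy 0.61", "Rewrites help on multi-hop questions", "exp/rewrite-1")
print(root.insight)
```

Keeping the git reference in the metadata is what lets a node point back to the exact artifact version that produced its evidence.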

Hypothesis Tree Refinement (HTR)

A coordinator (long-lived) owns the tree and executes a six-step loop: Observe, Ideate, Select, Dispatch, Backpropagate, Decide. Short-lived executors test individual hypotheses in isolated worktrees.

Key steps:

  • Observe: re-grounds coordinator in current tree state (frontier, insights, constraints, best artifact).
  • Ideate: proposes child hypotheses under a chosen parent, conditioned on accumulated evidence.
  • Select: chooses pending leaves to execute, balancing expected utility with evidence from ancestors/siblings.
  • Dispatch: sends selected hypotheses to parallel executors, each materializing the intervention, evaluating on $\mathcal{E}_{\text{dev}}$, and returning a compact report: dev score, factual result, distilled insight, branch reference.
  • Backpropagate: writes evidence into leaf nodes, then updates insights along the path to the root via abstraction.
  • Decide: chooses whether to continue, prune, or merge. Promotion is guarded by a held-out merge gate: the candidate is evaluated on $\mathcal{E}_{\text{test}}$ in a fresh worktree and merged only if it improves over the current best under $\mathcal{O}$.

The executor contract ensures each experiment is bound to a single hypothesis, keeping local flexibility while preserving the semantic meaning of tree updates.
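The six steps can be traced in a toy loop. The sketch below is an assumption-laden illustration rather than Arbor's coordinator: the artifact is a single number, a hypothesis is "shift it by some amount", the evaluators `eval_dev` and `eval_test` are invented quadratic functions, and threads stand in for executors working in isolated worktrees. All function and field names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List, Optional


def eval_dev(x: float) -> float:     # toy E_dev (higher is better)
    return -(x - 3.0) ** 2


def eval_test(x: float) -> float:    # toy E_test, used only by the merge gate
    return -(x - 3.2) ** 2


@dataclass(eq=False)
class Node:
    hypothesis: str
    artifact: float                  # toy artifact version produced by this hypothesis
    insight: str = ""
    dev_score: Optional[float] = None
    status: str = "pending"
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)


def execute(node: Node) -> dict:
    """Short-lived executor: test one hypothesis and return a compact report."""
    score = eval_dev(node.artifact)
    return {"dev_score": score, "insight": f"x={node.artifact:g} scored {score:.2f} on dev"}


def leaves(node: Node) -> List[Node]:
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]


def htr(x0: float, rounds: int = 4, width: int = 2) -> float:
    root = Node("improve x", x0, status="done", dev_score=eval_dev(x0))
    best, step = root, 1.0
    for _ in range(rounds):
        # Observe: read the current best artifact from the tree.
        parent = best
        # Ideate: propose child hypotheses under the chosen parent.
        for delta in (+step, -step):
            child = Node(f"shift x by {delta:+g}", parent.artifact + delta, parent=parent)
            parent.children.append(child)
        # Select: choose pending leaves to run.
        selected = [n for n in leaves(root) if n.status == "pending"][:width]
        # Dispatch: run the executors in parallel.
        with ThreadPoolExecutor(max_workers=width) as pool:
            reports = list(pool.map(execute, selected))
        # Backpropagate: write evidence into leaves, refresh ancestor insights.
        for node, rep in zip(selected, reports):
            node.dev_score, node.insight, node.status = rep["dev_score"], rep["insight"], "done"
            anc = node.parent
            while anc is not None:
                anc.insight = "; ".join(c.insight for c in anc.children if c.insight)
                anc = anc.parent
        # Decide: promote the best dev candidate only if it passes the held-out gate.
        cand = max(selected, key=lambda n: n.dev_score)
        if eval_test(cand.artifact) > eval_test(best.artifact):
            best = cand
        else:
            cand.status = "pruned"
            step /= 2                # toy reaction to a rejected promotion
    return best.artifact


print(htr(0.0))
```

Each executor call receives exactly one node, which mirrors the contract that an experiment is bound to a single hypothesis. In this toy, the final round proposes a candidate that does not improve the held-out score, so the gate rejects it and the previous best is kept.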

Empirical Validation / Results

AO Task Suite

Six tasks across three types:

| Type | Task | Initial Material | Metric & Split |
|---|---|---|---|
| Model Training | Optimizer Design | NanoGPT-Bench; tuned Muon baseline | Steps to target loss (↓); test averages two seeds |
| Model Training | Architecture Design | autoresearch LLM codebase | Final loss (↓); test averages two seeds |
| Harness Engineering | Terminal-Bench 2.0 | Official terminal-agent codebase | Pass rate (↑); 36 dev / 53 test |
| Harness Engineering | BrowseComp | Minimal ReAct-style search harness | Accuracy (↑); 50 dev / 300 test |
| Data Synthesis | Search-Agent Data Synth. | Hand-designed search-data pipeline | Mean pass gap (↑); 50 dev / 100 test seeds |
| Data Synthesis | Math-Reasoning Data Synth. | Hand-designed math-data pipeline | Mean pass gap (↑); 50 dev / 96 test problems |

Main Results

Table 2: Main results on real research tasks. Each cell shows Dev / Test scores. Values in parentheses (∆) give the change from Initial: relative improvement for model training, absolute change for the other tasks.

| Type | Task | Initial | Codex | Claude Code | Arbor (Ours) |
|---|---|---|---|---|---|
| Model Training | Optimizer Design (steps ↓) | 3325 / 3325 | 3325 / 3325 (+0.00% / +0.00%) | 3275 / 3287.5 (+1.50% / +1.13%) | 3225 / 3237.5 (+3.01% / +2.63%) |
| Model Training | Architecture Design (loss ↓) | 1.096 / 1.098 | 1.089 / 1.083 (+0.64% / +1.37%) | 1.033 / 1.033 (+5.75% / +5.92%) | 1.029 / 1.028 (+6.11% / +6.38%) |
| Harness Engineering | Terminal-Bench 2.0 (pass ↑) | 58.33 / 69.81 | 63.89 / 73.59 (+5.56 / +3.78) | 75.00 / 71.70 (+16.67 / +1.89) | 72.22 / 77.36 (+13.89 / +7.55) |
| Harness Engineering | BrowseComp (acc. ↑) | 52.50 / 45.33 | 57.50 / 50.00 (+5.00 / +4.67) | 55.00 / 53.33 (+2.50 / +8.00) | 72.50 / 67.67 (+20.00 / +22.34) |
| Data Synthesis | Search-Agent (gap ↑) | 4.00 / 5.00 | 12.00 / 9.00 (+8.00 / +4.00) | 12.00 / 12.00 (+8.00 / +7.00) | 16.00 / 18.00 (+12.00 / +13.00) |
| Data Synthesis | Math-Reasoning (gap ↑) | 2.00 / 1.04 | 6.00 / 6.25 (+4.00 / +5.21) | 8.00 / 8.33 (+6.00 / +7.29) | 24.00 / 20.83 (+22.00 / +19.79) |

  • Arbor achieves the best held-out test score on all six tasks.
  • The dev/test split exposes overfitting: Claude Code has the highest dev score on Terminal-Bench 2.0 (75.00) but a lower test score (71.70) than both Codex (73.59) and Arbor (77.36). Arbor's held-out merge gate is designed to filter out such dev-only gains.
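The relative improvements in Table 2 and the "more than 2.5×" average cited in the overview can be checked with a few lines of arithmetic. The sketch below copies the held-out scores from Table 2 and assumes one plausible reading of the average: the unweighted mean, over the six tasks, of each system's test gain relative to the initial artifact. The paper's exact averaging procedure is not stated in this summary, so treat the function names and the averaging choice as assumptions.

```python
# Held-out (test) scores copied from Table 2.
# Fields: lower_is_better, initial, Codex, Claude Code, Arbor.
TEST_SCORES = {
    "Optimizer Design":    (True,  3325.0, 3325.0, 3287.5, 3237.5),
    "Architecture Design": (True,  1.098, 1.083, 1.033, 1.028),
    "Terminal-Bench 2.0":  (False, 69.81, 73.59, 71.70, 77.36),
    "BrowseComp":          (False, 45.33, 50.00, 53.33, 67.67),
    "Search-Agent":        (False, 5.00, 9.00, 12.00, 18.00),
    "Math-Reasoning":      (False, 1.04, 6.25, 8.33, 20.83),
}
SYSTEMS = ("Codex", "Claude Code", "Arbor")


def relative_gain(initial: float, final: float, lower_is_better: bool) -> float:
    """Signed relative improvement over the initial artifact (positive = better)."""
    change = initial - final if lower_is_better else final - initial
    return change / initial


def average_relative_gain() -> dict:
    """Unweighted mean of relative test gains across tasks (assumed averaging)."""
    totals = {name: 0.0 for name in SYSTEMS}
    for lower_is_better, initial, *finals in TEST_SCORES.values():
        for name, final in zip(SYSTEMS, finals):
            totals[name] += relative_gain(initial, final, lower_is_better)
    return {name: total / len(TEST_SCORES) for name, total in totals.items()}


avg = average_relative_gain()
for name in SYSTEMS:
    print(f"{name:12s} mean relative test gain = {avg[name]:.2%}")
print(f"Arbor / Codex       = {avg['Arbor'] / avg['Codex']:.2f}x")
print(f"Arbor / Claude Code = {avg['Arbor'] / avg['Claude Code']:.2f}x")
```

Under this reading both ratios come out above 2.5, consistent with the headline claim. Note that the mean is dominated by the two data-synthesis tasks, whose initial scores are very small, so the ratio says less about the model-training tasks than about those two.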

MLE-Bench Lite Results

Table 3: MLE-Bench Lite results (percentages).

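| Method | Model | Valid sub. | Above median | Bronze | Silver | Gold | Any medal |
|---|---|---|---|---|---|---|---|
| InternAgent | DeepSeek-R1 | 100.00 | 78.79 | 10.61 | 16.67 | 34.85 | 62.12 |
| ML-Master 2.0 | DeepSeekV3.2-Spe | 100.00 | 84.85 | 13.64 | 31.82 | 30.30 | 75.76 |
| MARS | Gemini-3-Pro | 100.00 | 89.39 | 6.06 | 15.15 | 53.03 | 74.24 |
| AIBuildAI | Claude-Opus-4.6 | 100.00 | 81.82 | 13.64 | 25.76 | 37.88 | 77.27 |
| AI-Scientist | Gemini-3-Flash | 100.00 | 86.36 | 18.18 | 31.82 | 31.82 | 81.82 |
| Arbor | Gemini-3-Flash | 100.00 | 86.36 | 13.64 | 27.27 | 40.90 | 81.82 |
| Arbor | GPT-5.5 | 100.00 | 95.45 | 0.00 | 9.09 | 77.27 | 86.36 |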

Arbor with GPT-5.5 achieves the highest Any Medal (86.36%) and Gold (77.27%) among all compared methods.

Ablations

Table 4: Component ablations on MLE-Bench Lite (Claude Opus 4.6 backbone).

| Variant | Valid sub. | Above median | Bronze | Silver | Gold | Any medal |
|---|---|---|---|---|---|---|
| Full Arbor | 100.00 | 90.91 | 4.55 | 27.27 | 50.00 | 81.82 |
| w/o tree | 100.00 | 72.72 | 9.09 | 22.73 | 31.82 | 63.64 |
| w/o insight feedback | 100.00 | 77.27 | 4.55 | 13.64 | 36.36 | 54.54 |

  • Removing either component lowers Any Medal substantially: from 81.82% to 63.64% without the tree and to 54.54% without insight feedback. Measured by Any Medal, insight feedback is the more critical of the two.
  • Both components are complementary: the tree organizes competing hypotheses, while insight feedback carries reusable information forward.
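The ablation gaps can be read off per column with a short calculation. The sketch below copies the rows of Table 4 and reports the percentage-point change of each variant relative to Full Arbor; the function name `deltas_vs_full` is a hypothetical helper, and nothing here goes beyond subtraction on the published numbers.

```python
# Rows copied from Table 4 (percentages on MLE-Bench Lite).
COLUMNS = ("Above median", "Bronze", "Silver", "Gold", "Any medal")
ABLATIONS = {
    "Full Arbor":           (90.91, 4.55, 27.27, 50.00, 81.82),
    "w/o tree":             (72.72, 9.09, 22.73, 31.82, 63.64),
    "w/o insight feedback": (77.27, 4.55, 13.64, 36.36, 54.54),
}


def deltas_vs_full(variant: str) -> dict:
    """Percentage-point change of a variant relative to Full Arbor, per column."""
    full = ABLATIONS["Full Arbor"]
    return {col: round(v - f, 2) for col, v, f in zip(COLUMNS, ABLATIONS[variant], full)}


for variant in ("w/o tree", "w/o insight feedback"):
    print(variant, deltas_vs_full(variant))
```

The per-column view adds a nuance to the Any Medal comparison: removing insight feedback costs the most on Any Medal and Silver, while removing the tree costs more on Gold and Above median.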

Additional Findings

  • Backbone generality: Arbor works with Gemini-3-Flash, Claude Opus 4.6, and GPT-5.5; gains are model-agnostic.
  • Cross-task transfer: A BrowseComp-evolved harness improves unseen tasks (HLE: +6.0%, DeepSearchQA: +8.0%), showing generalizable improvements.
  • Token consumption: Arbor uses 20–43M tokens, comparable to the baselines, and achieves its larger gains through structured search rather than heavier sampling.
  • Node statistics (Table 5): Many nodes improve dev, but only a subset are merged, confirming the held-out gate's utility.

Theoretical and Practical Implications

  • Theoretical: The paper formalizes Autonomous Optimization (AO) as a distinct class of long-horizon research tasks. The hypothesis tree provides a principled representation of research state that separates hypotheses, evidence, and insights, turning trial-and-error into a cumulative, auditable process.
  • Practical: Arbor demonstrates that persistent hypothesis management enables stronger and more general held-out gains than flat trial-and-error, even with comparable token budgets. The framework is model-agnostic and transfers across task types (training, harness, data synthesis) and benchmarks (AO tasks, MLE-Bench Lite).
  • Design insight: The held-out merge gate is critical to prevent overfitting to development feedback. The separation of coordinator (strategic) and executor (local) roles allows scalable parallel experimentation while maintaining a coherent research state.

Conclusion

Arbor provides a general framework for autonomous research by organizing exploration through Hypothesis Tree Refinement. The persistent tree binds hypotheses, artifact versions, evidence, and insights, enabling a coordinator to manage strategic search while executors test individual ideas. Across six real research tasks and MLE-Bench Lite, Arbor consistently outperforms strong coding-agent baselines, with gains driven by the tree structure and insight propagation rather than increased token budgets. The framework's ability to transfer learned improvements across tasks and its model-agnostic nature suggest that persistent hypothesis management is a promising abstraction for autonomous research. Future work includes expanding the task suite, handling multi-objective optimization, and reducing search cost.
