Visual Summary | Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Summary (Overview)

Arbor is a general framework for Autonomous Optimization (AO) that enables an AI agent to iteratively improve a research artifact (codebase, pipeline, etc.) through long-horizon experimentation without step-level human supervision.
The core innovation is Hypothesis Tree Refinement (HTR): a persistent tree that links hypotheses, artifact versions, experimental evidence, and distilled insights, serving simultaneously as search frontier, long-term memory, and auditable research record.
A long-lived coordinator manages global research strategy over the tree, while short-lived executors test individual hypotheses in isolated git worktrees, returning structured evidence.
Arbor achieves the best held-out result on all six real research tasks (model training, harness engineering, data synthesis), with more than 2.5× the average relative held-out gain of Codex and Claude Code under the same budget.
On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in the comparison; ablations confirm that both the hypothesis tree and insight feedback are critical for performance.

Introduction and Theoretical Foundation

Scientific research is a long-horizon process: researchers form hypotheses, test them through experiments, interpret successes and failures, and let those lessons reshape future exploration. The paper formalizes this as Autonomous Optimization (AO), defined as a tuple:

$$\mathcal{P} = (M_0, \mathcal{O}, \mathcal{E}_{\text{dev}}, \mathcal{E}_{\text{test}})$$

where $M_0$ is the initial mutable artifact (codebase + data), $\mathcal{O}$ is the objective specifying improvement direction, $\mathcal{E}_{\text{dev}}$ returns feedback for free use during search, and the held-out $\mathcal{E}_{\text{test}}$ measures whether dev-driven improvement transfers. The agent aims to return:

$$M^\star = \arg\max_{M' \in \mathcal{A}} S_{\text{test}}(M')$$

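subject to not using $\mathcal{E}_{\text{test}}$ as an exploration oracle. Here $S_{\text{test}}(M')$ is the score of $M'$ under $\mathcal{E}_{\text{test}}$; this summary does not define $\mathcal{A}$, which presumably denotes the set of artifacts the agent can produce from $M_0$.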
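The AO tuple can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `AOProblem` and `search_on_dev`, the use of a string as a stand-in for an artifact, the reduction of $\mathcal{O}$ to a single "higher is better" flag, and the toy scores. The point it shows is the constraint above: the search loop only ever queries the dev evaluator, and the held-out evaluator is read once at the end.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Artifact = str  # hypothetical stand-in for a codebase/data snapshot
Evaluator = Callable[[Artifact], float]


@dataclass(frozen=True)
class AOProblem:
    initial: Artifact        # M_0
    higher_is_better: bool   # simplified stand-in for the objective O
    eval_dev: Evaluator      # E_dev: free to query during search
    eval_test: Evaluator     # E_test: held out, not an exploration oracle


def search_on_dev(problem: AOProblem, candidates: Iterable[Artifact]) -> Artifact:
    """Pick the best candidate using dev feedback only."""
    sign = 1.0 if problem.higher_is_better else -1.0
    best = problem.initial
    best_score = sign * problem.eval_dev(best)
    for cand in candidates:
        score = sign * problem.eval_dev(cand)
        if score > best_score:
            best, best_score = cand, score
    return best


# Toy scores (made up): a loss-like metric where lower is better.
toy_dev = {"m0": 1.10, "a": 1.05, "b": 1.02}
toy_test = {"m0": 1.11, "a": 1.06, "b": 1.03}
problem = AOProblem("m0", False, toy_dev.__getitem__, toy_test.__getitem__)

chosen = search_on_dev(problem, ["a", "b"])
final_test_score = problem.eval_test(chosen)  # read once, after search ends
print(chosen, final_test_score)
```

The dev/test separation is the whole design: if `search_on_dev` could call `eval_test`, the reported held-out score would no longer measure transfer.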

Current systems (Codex, Claude Code) can execute long trajectories but lack a mechanism to make research cumulative: they treat each trial as an independent local attempt, losing the structure of competing hypotheses, evidence interpretation, and constraint accumulation. Arbor addresses this by making the research state persistent and operational through a hypothesis tree.

Methodology

Hypothesis Tree as Research State

Let $\mathcal{T} = (V, E)$ be a rooted hypothesis tree. Each node $n \in V$ is a research unit:

$$n = \langle h_n, \iota_n, \mu_n \rangle$$

Hypothesis $h_n$: a verifiable/falsifiable claim about how to improve the artifact (coarse near the root, concrete at the leaves).
Insight $\iota_n$: a reusable interpretation of evidence; for leaves, it summarizes what was tried and why; for internal nodes, it abstracts over the children's insights.
Metadata $\mu_n$: connects the node to executable evidence: node status, dev score, factual result, git branch/commit reference.

The tree separates internal direction nodes from executable leaf nodes. After a leaf is executed, its score, result, artifact reference, and distilled insight are written back, and the insight is propagated upward via abstraction.
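The node triple and the write-back step can be sketched as a small Python data structure. This is an illustrative sketch, not the paper's implementation: the class and field names, the status labels, and the `abstract` function are all assumptions. In particular, `abstract` simply joins child insights with a separator, whereas the source describes abstraction as a summarization step whose mechanism is not detailed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metadata:
    status: str = "pending"            # status labels are assumed
    dev_score: Optional[float] = None
    result: str = ""                   # factual result of the experiment
    git_ref: Optional[str] = None      # branch or commit reference


@dataclass
class Node:
    hypothesis: str                    # h_n
    insight: str = ""                  # iota_n
    meta: Metadata = field(default_factory=Metadata)  # mu_n
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> "Node":
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child


def abstract(insights: List[str]) -> str:
    """Placeholder for abstraction over child insights (assumed: plain join)."""
    return " | ".join(insights)


def write_back(leaf: Node, dev_score: float, result: str,
               insight: str, git_ref: str) -> None:
    """Record evidence on an executed leaf, then refresh ancestor insights."""
    leaf.meta = Metadata("done", dev_score, result, git_ref)
    leaf.insight = insight
    node = leaf.parent
    while node is not None:
        node.insight = abstract([c.insight for c in node.children if c.insight])
        node = node.parent


# Toy tree: one direction node with two executable leaves.
root = Node("improve the search harness")
direction = root.add_child("better retrieval")
leaf_a = direction.add_child("rerank results")
leaf_b = direction.add_child("rewrite queries")
write_back(leaf_a, 0.60, "acc up", "reranking helps", "exp/rerank")
write_back(leaf_b, 0.48, "acc down", "rewriting hurts", "exp/rewrite")
print(root.insight)
```

Keeping the insight on internal nodes separate from the per-leaf metadata means a coordinator can read a direction's accumulated lesson without re-reading every experiment under it.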

Hypothesis Tree Refinement (HTR)

A coordinator (long-lived) owns the tree and executes a six-step loop: Observe, Ideate, Select, Dispatch, Backpropagate, Decide. Short-lived executors test individual hypotheses in isolated worktrees.

Key steps:

Observe: re-grounds coordinator in current tree state (frontier, insights, constraints, best artifact).
Ideate: proposes child hypotheses under a chosen parent, conditioned on accumulated evidence.
Select: chooses pending leaves to execute, balancing expected utility with evidence from ancestors/siblings.
Dispatch: sends selected hypotheses to parallel executors, each materializing the intervention, evaluating on $\mathcal{E}_{\text{dev}}$, and returning a compact report: dev score, factual result, distilled insight, branch reference.
Backpropagate: writes evidence into leaf nodes, then updates insights along the path to the root via abstraction.
Decide: chooses whether to continue, prune, or merge. Promotion is guarded by a held-out merge gate: the candidate is evaluated on $\mathcal{E}_{\text{test}}$ in a fresh worktree and merged only if it improves over the current best under $\mathcal{O}$.

The executor contract ensures each experiment is bound to a single hypothesis, keeping local flexibility while preserving the semantic meaning of tree updates.
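The six steps can be sketched as a single loop. This is a deliberately simplified sketch under several assumptions: the tree is flattened to a list of leaves, selection is first-in-first-out instead of a utility estimate, executors run sequentially instead of in parallel, pruning is omitted, higher scores are assumed better, and all names (`htr_loop`, `Report`, `Leaf`, the toy callbacks and scores) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Report:
    """Compact executor report for one hypothesis."""
    dev_score: float
    result: str
    insight: str
    branch: str


@dataclass
class Leaf:
    hypothesis: str
    status: str = "pending"
    report: Optional[Report] = None


def htr_loop(ideate: Callable[[List[str]], List[str]],
             execute: Callable[[str], Report],
             gate_score: Callable[[str], float],
             best_branch: str, rounds: int,
             width: int = 2) -> Tuple[str, List[str]]:
    leaves: List[Leaf] = []
    insights: List[str] = []
    best_gate = gate_score(best_branch)
    for _ in range(rounds):
        known = list(insights)                          # Observe
        leaves.extend(Leaf(h) for h in ideate(known))   # Ideate
        selected = [l for l in leaves if l.status == "pending"][:width]  # Select
        for leaf in selected:                           # Dispatch
            leaf.report = execute(leaf.hypothesis)
        for leaf in selected:                           # Backpropagate
            leaf.status = "done"
            insights.append(leaf.report.insight)
        if selected:                                    # Decide (merge gate)
            top = max(selected, key=lambda l: l.report.dev_score)
            score = gate_score(top.report.branch)
            if score > best_gate:
                best_branch, best_gate = top.report.branch, score
    return best_branch, insights


# Toy setup (made-up scores): the second idea wins on dev but fails the gate.
DEV = {"cache tool output": 0.60, "longer system prompt": 0.75,
       "retry on error": 0.70}
GATE = {"main": 0.50, "exp/cache tool output": 0.58,
        "exp/longer system prompt": 0.48, "exp/retry on error": 0.66}
pool = list(DEV)


def toy_ideate(known: List[str]) -> List[str]:
    return [pool.pop(0)] if pool else []


def toy_execute(h: str) -> Report:
    return Report(DEV[h], f"dev={DEV[h]:.2f}", f"tried '{h}'", f"exp/{h}")


best, notes = htr_loop(toy_ideate, toy_execute, GATE.__getitem__,
                       "main", rounds=3, width=1)
print(best, len(notes))
```

In the toy run the candidate with the highest dev score is not promoted, because its gate score is below the current best; its insight is still recorded, which is the behavior the Backpropagate and Decide steps describe.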

Empirical Validation / Results

AO Task Suite

Six tasks across three types:

Type	Task	Initial Material	Metric & Split
Model Training	Optimizer Design	NanoGPT-Bench; tuned Muon baseline	Steps to target loss (↓); test averages two seeds
Model Training	Architecture Design	autoresearch LLM codebase	Final loss (↓); test averages two seeds
Harness Engineering	Terminal-Bench 2.0	Official terminal-agent codebase	Pass rate (↑); 36 dev / 53 test
Harness Engineering	BrowseComp	Minimal ReAct-style search harness	Accuracy (↑); 50 dev / 300 test
Data Synthesis	Search-Agent Data Synth.	Hand-designed search-data pipeline	Mean pass gap (↑); 50 dev / 100 test seeds
Data Synthesis	Math-Reasoning Data Synth.	Hand-designed math-data pipeline	Mean pass gap (↑); 50 dev / 96 test problems

Main Results

Table 2: Main results on real research tasks. Each cell shows Dev / Test scores; the parenthesized values give the improvement over Initial, relative (%) for model training and absolute (points) for the other tasks.

Type	Task	Initial	Codex	Claude Code	Arbor (Ours)
Model Training	Optimizer Design (steps ↓)	3325 / 3325	3325 / 3325 (+0.00% / +0.00%)	3275 / 3287.5 (+1.50% / +1.13%)	3225 / 3237.5 (+3.01% / +2.63%)
Model Training	Architecture Design (loss ↓)	1.096 / 1.098	1.089 / 1.083 (+0.64% / +1.37%)	1.033 / 1.033 (+5.75% / +5.92%)	1.029 / 1.028 (+6.11% / +6.38%)
Harness Engineering	Terminal-Bench 2.0 (pass ↑)	58.33 / 69.81	63.89 / 73.59 (+5.56 / +3.78)	75.00 / 71.70 (+16.67 / +1.89)	72.22 / 77.36 (+13.89 / +7.55)
Harness Engineering	BrowseComp (acc. ↑)	52.50 / 45.33	57.50 / 50.00 (+5.00 / +4.67)	55.00 / 53.33 (+2.50 / +8.00)	72.50 / 67.67 (+20.00 / +22.34)
Data Synthesis	Search-Agent (gap ↑)	4.00 / 5.00	12.00 / 9.00 (+8.00 / +4.00)	12.00 / 12.00 (+8.00 / +7.00)	16.00 / 18.00 (+12.00 / +13.00)
Data Synthesis	Math-Reasoning (gap ↑)	2.00 / 1.04	6.00 / 6.25 (+4.00 / +5.21)	8.00 / 8.33 (+6.00 / +7.29)	24.00 / 20.83 (+22.00 / +19.79)

Arbor achieves the best held-out test score on all six tasks.
The dev/test split exposes overfitting: Claude Code has the highest dev score on Terminal-Bench (75.00) but a test score (71.70) below both Codex (73.59) and Arbor (77.36); the summary credits Arbor's held-out gate with preventing this kind of overfitting.
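The parenthesized gains in Table 2 can be reproduced with two small formulas. The relative formula below, (initial − final) / initial for lower-is-better metrics, is inferred from the reported numbers and not stated in this summary; the function names are illustrative.

```python
def relative_gain_lower_is_better(initial: float, final: float) -> float:
    """Percent improvement for metrics where smaller is better (steps, loss)."""
    return 100.0 * (initial - final) / initial


def absolute_gain(initial: float, final: float) -> float:
    """Point improvement for metrics where larger is better (pass rate, accuracy, gap)."""
    return final - initial


# Arbor, Optimizer Design, test split: 3325 -> 3237.5 steps.
opt_test = relative_gain_lower_is_better(3325, 3237.5)
# Arbor, BrowseComp, test split: 45.33 -> 67.67 accuracy.
browse_test = absolute_gain(45.33, 67.67)
# Claude Code, Terminal-Bench 2.0: dev 58.33 -> 75.00, test 69.81 -> 71.70.
tb_dev = absolute_gain(58.33, 75.00)
tb_test = absolute_gain(69.81, 71.70)

print(f"{opt_test:.2f}%")     # 2.63%
print(f"{browse_test:.2f}")   # 22.34
print(f"{tb_dev:.2f} dev vs {tb_test:.2f} test")  # 16.67 dev vs 1.89 test
```

The last line makes the overfitting pattern concrete: a large dev gain that mostly fails to transfer to the held-out split.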

MLE-Bench Lite Results

Table 3: MLE-Bench Lite results (percentages).

Method	Model	Valid sub.	Above median	Bronze	Silver	Gold	Any medal
InternAgent	DeepSeek-R1	100.00	78.79	10.61	16.67	34.85	62.12
ML-Master 2.0	DeepSeekV3.2-Spe	100.00	84.85	13.64	31.82	30.30	75.76
MARS	Gemini-3-Pro	100.00	89.39	6.06	15.15	53.03	74.24
AIBuildAI	Claude-Opus-4.6	100.00	81.82	13.64	25.76	37.88	77.27
AI-Scientist	Gemini-3-Flash	100.00	86.36	18.18	31.82	31.82	81.82
Arbor	Gemini-3-Flash	100.00	86.36	13.64	27.27	40.90	81.82
Arbor	GPT-5.5	100.00	95.45	0.00	9.09	77.27	86.36

Arbor with GPT-5.5 achieves the highest Any Medal (86.36%) and Gold (77.27%) among all compared methods.
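As a quick consistency check on Table 3, the sketch below assumes that "Any medal" is the sum of the Bronze, Silver, and Gold columns (each competition earning at most one medal); that relationship is an assumption here, not something the summary states. Under it, every row agrees to within rounding.

```python
# (bronze, silver, gold, any_medal) per row of Table 3, in percent.
ROWS = {
    ("InternAgent", "DeepSeek-R1"): (10.61, 16.67, 34.85, 62.12),
    ("ML-Master 2.0", "DeepSeekV3.2-Spe"): (13.64, 31.82, 30.30, 75.76),
    ("MARS", "Gemini-3-Pro"): (6.06, 15.15, 53.03, 74.24),
    ("AIBuildAI", "Claude-Opus-4.6"): (13.64, 25.76, 37.88, 77.27),
    ("AI-Scientist", "Gemini-3-Flash"): (18.18, 31.82, 31.82, 81.82),
    ("Arbor", "Gemini-3-Flash"): (13.64, 27.27, 40.90, 81.82),
    ("Arbor", "GPT-5.5"): (0.00, 9.09, 77.27, 86.36),
}


def medal_gap(bronze: float, silver: float, gold: float, any_medal: float) -> float:
    """Absolute difference between the summed tiers and the reported total."""
    return abs(bronze + silver + gold - any_medal)


max_gap = max(medal_gap(*row) for row in ROWS.values())
best_any = max(ROWS, key=lambda k: ROWS[k][3])
best_gold = max(ROWS, key=lambda k: ROWS[k][2])
print(f"largest gap: {max_gap:.2f} points")
print(best_any, best_gold)
```

The remaining gaps of about 0.01 are what two-decimal rounding of the individual columns would produce.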

Ablations

Table 4: Component ablations on MLE-Bench Lite (Claude Opus 4.6 backbone).

Variant	Valid sub.	Above median	Bronze	Silver	Gold	Any medal
Full Arbor	100.00	90.91	4.55	27.27	50.00	81.82
w/o tree	100.00	72.72	9.09	22.73	31.82	63.64
w/o insight feedback	100.00	77.27	4.55	13.64	36.36	54.54

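Removing either component hurts: Any Medal falls from 81.82 to 63.64 without the tree and to 54.54 without insight feedback. By Any Medal, insight feedback is therefore the more critical component, although removing the tree costs more on Above median (72.72 vs. 77.27) and Gold (31.82 vs. 36.36).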
Both components are complementary: tree organizes competing hypotheses, insight feedback carries reusable information forward.
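The per-metric effect of each ablation can be read off Table 4 with a short calculation. The sketch below only subtracts the reported values; the names are illustrative.

```python
METRICS = ["Above median", "Bronze", "Silver", "Gold", "Any medal"]
FULL = [90.91, 4.55, 27.27, 50.00, 81.82]
NO_TREE = [72.72, 9.09, 22.73, 31.82, 63.64]
NO_INSIGHT = [77.27, 4.55, 13.64, 36.36, 54.54]


def drops(full, ablated):
    """Points lost per metric when a component is removed (negative = gained)."""
    return {m: round(f - a, 2) for m, f, a in zip(METRICS, full, ablated)}


tree_drop = drops(FULL, NO_TREE)
insight_drop = drops(FULL, NO_INSIGHT)

for metric in METRICS:
    print(f"{metric:>12}: w/o tree {tree_drop[metric]:6.2f}   "
          f"w/o insight {insight_drop[metric]:6.2f}")
```

On Any Medal, removing insight feedback costs 27.28 points against 18.18 for removing the tree, while on Gold the order is reversed (13.64 against 18.18), which is consistent with the two components contributing in different ways.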

Additional Findings

Backbone generality: Arbor works with Gemini-3-Flash, Claude Opus 4.6, and GPT-5.5, so the gains are not tied to a single backbone.
Cross-task transfer: A BrowseComp-evolved harness improves unseen tasks (HLE: +6.0%, DeepSearchQA: +8.0%), showing generalizable improvements.
Token consumption: Arbor uses 20–43M tokens, comparable to the baselines, so its larger gains come from structured search rather than from more sampling.
Node statistics (Table 5): Many nodes improve dev, but only a subset are merged, confirming the held-out gate's utility.

Theoretical and Practical Implications

Theoretical: The paper formalizes Autonomous Optimization (AO) as a distinct class of long-horizon research tasks. The hypothesis tree provides a principled representation of research state that separates hypotheses, evidence, and insights, turning trial-and-error into a cumulative, auditable process.
Practical: Arbor demonstrates that persistent hypothesis management enables stronger and more general held-out gains than flat trial-and-error, even with comparable token budgets. The framework is model-agnostic and transfers across task types (training, harness, data synthesis) and benchmarks (AO tasks, MLE-Bench Lite).
Design insight: The held-out merge gate is critical to prevent overfitting to development feedback. The separation of coordinator (strategic) and executor (local) roles allows scalable parallel experimentation while maintaining a coherent research state.

Conclusion

Arbor provides a general framework for autonomous research by organizing exploration through Hypothesis Tree Refinement. The persistent tree binds hypotheses, artifact versions, evidence, and insights, enabling a coordinator to manage strategic search while executors test individual ideas. Across six real research tasks and MLE-Bench Lite, Arbor consistently outperforms strong coding-agent baselines, with gains driven by the tree structure and insight propagation rather than increased token budgets. The framework's ability to transfer learned improvements across tasks and its model-agnostic nature suggest that persistent hypothesis management is a promising abstraction for autonomous research. Future work includes expanding the task suite, handling multi-objective optimization, and reducing search cost.