Summary (Overview)
- Proposes Retrospective Harness Optimization (RHO), a self-supervised method that improves an LLM agent’s harness (skills, tools, prompts) using only unlabeled past trajectories, without any external validation labels.
- RHO consists of three stages: (1) Coreset Selection via a Determinantal Point Process (DPP) to pick diverse, challenging past tasks; (2) Group Rollout that re-solves each task multiple times and extracts diagnostic signals from self-validation and self-consistency; (3) Best-of-N Harness Proposal that samples candidate harnesses and selects the most preferred one by pairwise self-preference.
- Evaluations across three domains (software engineering, technical work, knowledge work) show consistent improvements: e.g., SWE-Bench Pro pass rate rises from 59% to 78% in a single optimization round without any external grading.
- Analysis reveals that RHO creates targeted skills and tools that address specific failure modes, shifts agent behavior toward more verification or execution, and sustains higher accuracy on long-horizon tasks.
- The self-preference mechanism reliably selects effective harness updates, and both self-validation and self-consistency signals are essential for performance gains.
Introduction and Theoretical Foundation
Background: AI agents rely on a harness — a persistent collection of skills, tools, prompts, and workflows — to solve complex tasks. Continuously improving this harness from deployment experience is crucial for adapting to new tasks.
Problem: Existing harness optimization methods (e.g., Zhou et al., 2022; Yang et al., 2023; Khattab et al., 2023; Lee et al., 2026) require a labeled validation set to guide improvements. In practical deployment, such labeled data is often unavailable or costly to obtain. However, the agent naturally accumulates a rich set of unlabeled trajectories from past tasks.
Central Question: Can we improve the agent harness to enhance future performance when we only have access to past trajectories?
Theoretical Foundation: The paper formalizes harness optimization as maximizing expected utility on future tasks, where the utility function is latent. Because this utility is unobservable, RHO substitutes it with a self-preference estimator: the agent compares multiple trajectories on the same task and produces a ranking with rationales, which guides harness updates without ground-truth labels.
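One way to write this objective and its surrogate is sketched below. All symbols are notation assumed here for illustration and may differ from the paper's: $H$ is a harness, $x$ a future task, $\tau(x; H)$ the trajectory the agent produces on $x$ with $H$, $U$ the latent utility, and $\hat{P}$ the agent's own pairwise preference between two trajectories on the same task.

$$
H^{\star} \;=\; \arg\max_{H} \; \mathbb{E}_{x \sim \mathcal{D}_{\text{future}}} \big[\, U\big(\tau(x; H)\big) \,\big],
\qquad
\hat{P}\big(\tau \succ \tau' \mid x\big) \;\approx\; \mathbb{1}\big[\, U(\tau) > U(\tau') \,\big].
$$

Read this way, the optimizer never evaluates $U$ directly; it only needs the ordering of trajectories that $U$ would induce, which is what the self-preference ranking is meant to approximate.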
Key distinction: RHO is a single retrospective pass over past trajectories, contrasting with validation-feedback methods that iterate against labeled data (see Figure 1 in the paper).
Methodology
RHO operates in three stages as described in Algorithm 1:
Stage 1: Coreset Selection
Given the past trajectories and the initial harness:
- Use an LLM judge to assign a difficulty score and a textual description to each trajectory.
- Compute the cosine similarity between descriptions to form a similarity matrix.
- Construct a DPP kernel matrix from the difficulty scores and the similarity matrix, with a parameter that balances difficulty against diversity.
- Select a coreset of trajectories by greedy DPP selection so that it covers challenging and diverse failure modes.
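The selection step can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the kernel form `L = diag(q) S diag(q)` with `q_i = exp(theta * d_i / 2)` is the common quality-diversity decomposition and is assumed here, as are the names `build_dpp_kernel`, `greedy_dpp`, and `theta`, and the use of toy embedding vectors to stand in for the trajectory descriptions.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def build_dpp_kernel(difficulty: np.ndarray, similarity: np.ndarray,
                     theta: float = 1.0) -> np.ndarray:
    """Assumed kernel form: L = diag(q) @ S @ diag(q), q_i = exp(theta * d_i / 2).

    Larger theta weights difficulty more heavily relative to diversity.
    """
    quality = np.exp(0.5 * theta * difficulty)
    return quality[:, None] * similarity * quality[None, :]


def greedy_dpp(kernel: np.ndarray, k: int) -> list[int]:
    """Naive greedy MAP selection: repeatedly add the item that maximizes
    the log-determinant of the kernel restricted to the selected set."""
    selected: list[int] = []
    remaining = list(range(kernel.shape[0]))
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:  # no candidate keeps the submatrix non-singular
            break
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


# Toy data (hypothetical): items 0 and 1 are near-duplicates and both hard,
# item 2 is unrelated and moderately hard, item 3 is easy.
embeddings = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 1.0]])
difficulty = np.array([0.9, 0.85, 0.6, 0.1])
kernel = build_dpp_kernel(difficulty, cosine_similarity_matrix(embeddings))
print(greedy_dpp(kernel, k=2))
```

On this toy input the greedy pass picks item 0 first (the hardest) and then item 2 rather than the near-duplicate item 1, which is the difficulty-versus-diversity trade-off the kernel is meant to encode.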
Stage 2: Group Rollout
For each task in the coreset:
- Run several agent solves in parallel to produce a group of trajectories.
- Keep one fixed baseline rollout for later comparison.
- Extract two diagnostic signals:
- Self-validation: inspect each trajectory for correctness (incorrect tool invocations, false assumptions, premature stopping) — produces improvement cues for underperforming runs.
- Self-consistency: compare across trajectories to detect contradictions (divergent plans, tool sequences, answers) — generates instructions to encourage consistent behavior.
- The union of these signals forms the improvement instruction for that task; the per-task instructions are then aggregated across the coreset.
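The way the two signals combine can be sketched with a rule-based stand-in. In RHO both diagnoses are produced by the LLM agent itself; the `Trajectory` fields, the function names, and the string format of the cues below are hypothetical and only show the union-then-aggregate structure.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task_id: str
    answer: str
    # Problems found when inspecting this single run (hypothetical field).
    issues: list[str] = field(default_factory=list)


def self_validation_cues(group: list[Trajectory]) -> list[str]:
    """One cue per distinct issue found inside individual runs."""
    cues: list[str] = []
    for traj in group:
        for issue in traj.issues:
            cue = f"[{traj.task_id}] avoid: {issue}"
            if cue not in cues:
                cues.append(cue)
    return cues


def self_consistency_cues(group: list[Trajectory]) -> list[str]:
    """One cue when runs of the same task disagree on the final answer."""
    counts = Counter(traj.answer for traj in group)
    if len(counts) <= 1:
        return []
    answers = ", ".join(sorted(counts))
    return [f"[{group[0].task_id}] runs disagree on the answer ({answers})"]


def diagnose(groups: dict[str, list[Trajectory]]) -> list[str]:
    """Union of both signals per task, aggregated across the coreset."""
    instructions: list[str] = []
    for group in groups.values():
        for cue in self_validation_cues(group) + self_consistency_cues(group):
            if cue not in instructions:
                instructions.append(cue)
    return instructions


groups = {
    "t1": [
        Trajectory("t1", "A", ["stopped before running tests"]),
        Trajectory("t1", "B"),
    ],
    "t2": [Trajectory("t2", "C"), Trajectory("t2", "C")],
}
for line in diagnose(groups):
    print(line)
```

Here task `t1` contributes one self-validation cue and one self-consistency cue, while `t2`, whose runs agree and show no issues, contributes nothing.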
Stage 3: Best-of-N Harness Proposal
- Sample candidate harnesses in parallel from the initial harness and the aggregated improvement instructions.
- For each candidate, re-solve all coreset tasks to obtain new trajectories.
- Compute a relative advantage score for each candidate from pairwise self-preference, which outputs +1 if the new trajectory is preferred over the baseline, -1 if it is worse, and 0 if the two are equal.
- Select the highest-scoring candidate and accept it only if it passes the acceptance check.
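The scoring and selection logic can be sketched as below, under stated assumptions: the score is taken to be the mean of the +1/-1/0 outcomes over coreset tasks, and a candidate is accepted only if its score exceeds `threshold=0.0`. Neither the normalization nor the threshold is given in this summary. The `prefer_by_score` function is a toy stand-in for the agent's pairwise self-preference, and trajectories are reduced to numeric quality values for illustration.

```python
from typing import Callable, Optional, Sequence


def prefer_by_score(new: float, baseline: float) -> int:
    """Toy preference: +1 if new is better, -1 if worse, 0 if equal."""
    return (new > baseline) - (new < baseline)


def relative_advantage(new_trajs: Sequence[float],
                       baseline_trajs: Sequence[float],
                       prefer: Callable[[float, float], int]) -> float:
    """Mean pairwise preference of new trajectories over the baselines (assumed form)."""
    if len(new_trajs) != len(baseline_trajs) or not new_trajs:
        raise ValueError("need one new trajectory per baseline trajectory")
    outcomes = [prefer(n, b) for n, b in zip(new_trajs, baseline_trajs)]
    return sum(outcomes) / len(outcomes)


def select_harness(candidates: dict[str, Sequence[float]],
                   baseline_trajs: Sequence[float],
                   prefer: Callable[[float, float], int],
                   threshold: float = 0.0) -> tuple[Optional[str], dict[str, float]]:
    """Pick the best-scoring candidate; return None if it does not beat the threshold."""
    scores = {name: relative_advantage(trajs, baseline_trajs, prefer)
              for name, trajs in candidates.items()}
    best = max(scores, key=scores.get)
    accepted = best if scores[best] > threshold else None
    return accepted, scores


baseline = [0.5, 0.5, 0.5]
candidates = {
    "harness_a": [0.6, 0.4, 0.5],  # one win, one loss, one tie
    "harness_b": [0.7, 0.6, 0.5],  # two wins, one tie
}
print(select_harness(candidates, baseline, prefer_by_score))
```

Returning `None` when no candidate clears the threshold keeps the current harness in place, so a round of proposals cannot make the harness worse by the agent's own judgment.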
Implementation details: All operators (judge, solve, optimize, rank) are instantiated by the same backbone GPT-5.5 agent, differing only in inputs. The harness is a configurable workspace folder containing markdown files (instructions, skills) and executable scripts (tools).
Empirical Validation / Results
Datasets
- SWE-Bench Pro: long-horizon software engineering tasks (multi-file repository editing).
- Terminal-Bench 2: command-line tasks with executable graders.
- GAIA-2: dynamic asynchronous knowledge-work environments.
Comparison with Feedback-Free Baselines
Table 1: Held-out pass rate after harness optimization.
| Method | Harness Surface | SWE-Bench Pro Pass (Δ) | Terminal-Bench 2 Pass (Δ) | GAIA-2 Pass (Δ) |
|---|---|---|---|---|
| Vanilla Codex | None | 0.59 (—) | 0.71 (—) | 0.29 (—) |
| Dynamic Cheatsheet | Skills | 0.62 (+0.03) | 0.73 (+0.02) | 0.30 (+0.01) |
| ReasoningBank | Memory | 0.61 (+0.02) | 0.73 (+0.02) | 0.28 (−0.01) |
| Sleep-time Compute | Memory | 0.64 (+0.05) | 0.73 (+0.02) | 0.32 (+0.03) |
| RHO (ours) | Skills+Tools | 0.78 (+0.19) | 0.76 (+0.05) | 0.37 (+0.08) |
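RHO outperforms every feedback-free baseline on all three benchmarks, with the largest margin on SWE-Bench Pro (+0.19 absolute, versus +0.05 for the strongest baseline). The advantage is attributed to optimizing the full harness (skills, tools, instructions) rather than only memory or skills.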
Optimized Harness Contents (Figure 3)
On each benchmark, RHO adds new skills and tools targeting specific failure modes:
- SWE-Bench Pro: Adds a `check_build_and_lint` tool to locate non-standard toolchains and strip Python cache directories from diffs.
- Terminal-Bench 2: Adds skills like "Recover behavior from queries alone" and tools for package fixes and polygon validation.
- GAIA-2: Adds skills for timing, user reply, decomposition, and tools for listing/calling app functions.
Comparison with Validation-Feedback Optimization (Meta-Harness)
Table 2: RHO vs Meta-Harness on SWE-Bench Pro.
| Method | Val. Labels | Agent Calls (×RHO) | Pass Rate |
|---|---|---|---|
| RHO | None | 103 (1.0×) | 0.78 |
| Meta-Harness (1 round) | Required | 41 (0.4×) | 0.62 |
| Meta-Harness (10 rounds) | Required | 320 (3.1×) | 0.80 |
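With a single optimization round each, RHO outperforms Meta-Harness (0.78 vs. 0.62), although that Meta-Harness round uses fewer agent calls (41 vs. 103). Scaling Meta-Harness to 10 rounds reaches 0.80 but takes 3.1× as many agent calls as RHO (320 vs. 103) and still relies on validation labels that RHO does not need.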
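The cost multipliers in Table 2 follow directly from the agent-call counts, as this small check shows. The function name `relative_cost` is introduced here for illustration; the numbers are those reported in the table.

```python
def relative_cost(calls: int, reference_calls: int) -> float:
    """Agent calls expressed as a multiple of the reference method's calls."""
    return round(calls / reference_calls, 1)


RHO_CALLS = 103
meta_harness_calls = {"Meta-Harness (1 round)": 41, "Meta-Harness (10 rounds)": 320}

for name, calls in meta_harness_calls.items():
    print(f"{name}: {relative_cost(calls, RHO_CALLS)}x RHO")
```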
Behavioral Shifts (Figure 4)
RHO shifts agent action patterns. On SWE-Bench Pro, verification actions increase (+61%) and navigation actions decrease (−13%). On Terminal-Bench 2, edit actions decrease (−44%) while navigation actions increase. On GAIA-2, execution actions increase (+25%) while edit actions decrease. Gains concentrate on long-horizon tasks.
Ablation Studies
Coreset Selection (Figure 5): DPP selection, which balances difficulty and diversity, achieves a 0.78 pass rate, compared with 0.62 for difficulty-only, 0.58 for coverage-only, and 0.64 for random selection. Selecting on difficulty alone leads to narrow task coverage.
Best-of-N Consistency (Table 3): Across 3 candidates, the chosen harness scores 0.78 (SWE-Bench Pro), 0.76 (Terminal-Bench 2), and 0.37 (GAIA-2). It beats the lowest candidate on every benchmark and lands within 0.03 of the candidate mean: 0.01 below it on SWE-Bench Pro and above it on the other two.
| Dataset | Mean | Chosen | Std | Lowest |
|---|---|---|---|---|
| SWE-Bench Pro | 0.79 | 0.78 | 0.06 | 0.73 |
| Terminal-Bench 2 | 0.74 | 0.76 | 0.03 | 0.71 |
| GAIA-2 | 0.34 | 0.37 | 0.03 | 0.32 |
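Diagnosis Ablation (Table 4): Removing either self-consistency or self-validation degrades performance on all three benchmarks, and feeding raw trajectories (no diagnosis) also underperforms full diagnosis. The largest drop is on SWE-Bench Pro without self-consistency (0.78 to 0.56).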
| Variant | SWE Pro | TB 2 | GAIA-2 |
|---|---|---|---|
| Full diagnosis | 0.78 | 0.76 | 0.37 |
| − self-consistency | 0.56 | 0.75 | 0.27 |
| − self-validation | 0.70 | 0.73 | 0.30 |
| Raw trajectory | 0.60 | 0.75 | 0.29 |
Theoretical and Practical Implications
Theoretical Significance:
- Demonstrates that self-preference over trajectories can substitute for ground-truth labels when optimizing agent harnesses, opening a new paradigm for self-supervised agent improvement.
- The DPP-based coreset selection shows that both difficulty and diversity are critical for extracting effective improvement signals.
- The two diagnostic signals (self-validation and self-consistency) are complementary and non-redundant, providing richer cues than raw trajectory analysis.
Practical Implications:
- Enables continuous harness improvement in deployment scenarios where labeled validation sets are unavailable or costly.
- The optimized harness produces targeted tools and skills that address specific failure modes from past trajectories, leading to sustained higher accuracy on long-horizon tasks.
- RHO is computationally efficient: a single optimization round on SWE-Bench Pro uses 103 agent calls and yields a 19-percentage-point absolute improvement in pass rate (0.59 to 0.78).
- The method is domain-agnostic, showing consistent gains across software engineering, technical work, and knowledge work.
Limitations:
- Requires environments that can be cleanly reset for group rollout (not suitable for one-shot or irreversible tasks).
- Assumes an editable harness surface; applicability to domains with different harness architectures or rollout budgets is future work.
Conclusion
RHO reframes harness improvement as a retrospective self-supervised process rather than a validation-guided search. By re-solving past tasks in parallel and comparing outcomes via self-validation and self-consistency, the agent extracts targeted improvement signals that are used to generate candidate harness updates. A best-of-N proposal with self-preference selects the most promising harness. Across three diverse domains, RHO yields consistent held-out gains (e.g., +19 percentage points on SWE-Bench Pro) without any external labeling. The optimized harness introduces new skills and tools that reshape agent behavior, particularly improving long-horizon task success. RHO represents a step toward agents that autonomously improve from deployment experience, reducing reliance on scarce labeled feedback. Future work may extend RHO to irreversible tasks, different harness surfaces, and online continuous learning scenarios.