Visual Summary | Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Summary (Overview)

Proposes Retrospective Harness Optimization (RHO), a self-supervised method that improves an LLM agent’s harness (skills, tools, prompts) using only unlabeled past trajectories, without any external validation labels.
RHO consists of three stages: (1) Coreset Selection via a Determinantal Point Process (DPP) to pick diverse, challenging past tasks; (2) Group Rollout that re-solves each task multiple times and extracts diagnostic signals from self-validation and self-consistency; (3) Best-of-N Harness Proposal that samples candidate harnesses and selects the most preferred one by pairwise self-preference.
Evaluations across three domains (software engineering, technical work, knowledge work) show consistent improvements: e.g., SWE-Bench Pro pass rate rises from 59% to 78% in a single optimization round without any external grading.
Analysis reveals that RHO creates targeted skills and tools that address specific failure modes, shifts agent behavior toward more verification or execution, and sustains higher accuracy on long-horizon tasks.
The self-preference mechanism reliably selects effective harness updates, and both self-validation and self-consistency signals are essential for performance gains.

Introduction and Theoretical Foundation

Background: AI agents rely on a harness — a persistent collection of skills, tools, prompts, and workflows — to solve complex tasks. Continuously improving this harness from deployment experience is crucial for adapting to new tasks.

Problem: Existing harness optimization methods (e.g., Zhou et al., 2022; Yang et al., 2023; Khattab et al., 2023; Lee et al., 2026) require a labeled validation set to guide improvements. In practical deployment, such labeled data is often unavailable or costly to obtain. However, the agent naturally accumulates a rich set of unlabeled trajectories from past tasks.

Central Question: Can we improve the agent harness to enhance future performance when we only have access to past trajectories?

Theoretical Foundation: The paper formalizes harness optimization as maximizing expected utility on future tasks:

h^\star = \arg\max_{h'} \mathbb{E}_{t, \tau \sim \text{solve}(h', t)} [U(t, \tau)]

where $U$ is a latent utility function. Since $U$ is unobservable, RHO substitutes it with a self-preference estimator — the agent compares multiple trajectories on the same task and produces a ranking with rationales, guiding harness updates without ground-truth labels.

Key distinction: RHO is a single retrospective pass over past trajectories, contrasting with validation-feedback methods that iterate against labeled data (see Figure 1 in the paper).

Methodology

RHO operates in three stages as described in Algorithm 1:

Stage 1: Coreset Selection

Given past trajectories $\mathcal{D} = \{(t_i, \tau_i)\}$ and initial harness $h_0$:

Use an LLM judge to assign a difficulty score $r_i$ and textual description for each trajectory.
Compute the cosine similarity between descriptions to obtain the similarity matrix $S$ with entries $S_{i,j}$.
Construct a DPP kernel matrix:

K = \operatorname{diag}(e^r) \, S \, \operatorname{diag}(e^r)

where $e^{r_i} = \max(r_i, \epsilon) / \left( \max_j \max(r_j, \epsilon) \right)^\alpha$ and $\alpha = \theta / (2(1-\theta))$. The parameter $\theta$ balances difficulty against diversity.

Select $k=10$ trajectories by greedy DPP selection (with $\theta=0.7$) to form a coreset $\mathcal{D}_{\text{core}}$ that covers challenging and diverse failure modes.
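The selection step above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation. It assumes each description has already been turned into a numeric vector (the `embeddings` argument is hypothetical) and uses an incremental-Cholesky greedy rule to pick the item with the largest determinant gain at each step. One further assumption: the exponent $\alpha$ is applied to the normalized score, because applying it only to the denominator as printed would rescale every weight by the same constant and leave the greedy choice unchanged.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_kernel(scores, embeddings, theta=0.7, eps=1e-6):
    """K = diag(q) S diag(q), with q derived from difficulty scores."""
    alpha = theta / (2.0 * (1.0 - theta))
    clipped = [max(r, eps) for r in scores]
    top = max(clipped)
    # Assumption: exponent applied to the normalized score (see lead-in).
    q = [(c / top) ** alpha for c in clipped]
    n = len(scores)
    return [[q[i] * cosine(embeddings[i], embeddings[j]) * q[j]
             for j in range(n)] for i in range(n)]

def greedy_dpp(K, k, tol=1e-10):
    """Greedily pick up to k indices, maximizing the determinant gain."""
    n = len(K)
    gain = [K[i][i] for i in range(n)]   # current marginal gain per item
    chol = [[] for _ in range(n)]        # incremental Cholesky rows
    selected = []
    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in selected]
        j = max(candidates, key=lambda i: gain[i])
        if gain[j] <= tol:               # nothing left adds volume
            break
        selected.append(j)
        dj = math.sqrt(gain[j])
        for i in range(n):
            if i in selected:
                continue
            e = (K[j][i] - sum(a * b for a, b in zip(chol[j], chol[i]))) / dj
            chol[i].append(e)
            gain[i] -= e * e
    return selected

# Toy data: items 0 and 1 are near-duplicates, item 2 is distinct but easier.
scores = [0.9, 0.85, 0.6]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
K = build_kernel(scores, embeddings)
coreset = greedy_dpp(K, k=2)
print(coreset)  # [0, 2]
```

In the toy data, the second pick skips the near-duplicate of item 0 despite its higher difficulty, which is the difficulty-versus-diversity trade-off that $\theta$ is meant to control.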

Stage 2: Group Rollout

For each task $t$ in the coreset:

Run $G=3$ parallel agent solves to produce trajectories $\{\tau_{t,g}\}$.
Keep one fixed baseline rollout $\tau_t^{(0)} = \tau_{t,1}$.
Extract two diagnostic signals:
- Self-validation: inspect each trajectory for correctness (incorrect tool invocations, false assumptions, premature stopping) — produces improvement cues for underperforming runs.
- Self-consistency: compare across trajectories to detect contradictions (divergent plans, tool sequences, answers) — generates instructions to encourage consistent behavior.
The union of these signals forms the improvement instruction set $I_t$ for task $t$. Aggregating across the coreset gives $I = \bigcup_{t \in \mathcal{D}_{\text{core}}} I_t$.

Stage 3: Best-of-N Harness Proposal

Sample $N=3$ candidate harnesses $h_1, \dots, h_N$ in parallel using $h_0$ and the improvement instructions $I$.
For each candidate, re-solve all coreset tasks to get new trajectories $\{\tau_t^{(j)}\}$.
Compute a relative advantage score for each candidate:

S_j = \frac{1}{|\mathcal{D}_{\text{core}}|} \sum_{t \in \mathcal{D}_{\text{core}}} \operatorname{rank}\!\left(t, \tau_t^{(j)}, \tau_t^{(0)}\right)

where $\operatorname{rank}$ outputs $+1$ if the new trajectory is preferred over the baseline, $-1$ if it is worse, and $0$ if the two are judged equal.

Select candidate $j^\star = \arg\max_j S_j$ and accept it only if $S_{j^\star} > 0$ .
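The scoring and acceptance rule can be sketched in a few lines. The `rank` callable is a hypothetical stand-in for the agent's pairwise self-preference judgment. In this toy version trajectories are plain numbers, so the comparison can be checked by hand.

```python
def select_harness(tasks, baseline, candidates, rank):
    """Score each candidate by its mean pairwise preference against the
    baseline rollouts; accept the best one only if its score is positive.

    baseline:   dict task -> baseline trajectory
    candidates: list of dicts task -> new trajectory (one dict per harness)
    rank:       callable returning +1, 0, or -1
    """
    scores = [
        sum(rank(t, new[t], baseline[t]) for t in tasks) / len(tasks)
        for new in candidates
    ]
    best = max(range(len(scores)), key=lambda j: scores[j])
    accepted = best if scores[best] > 0 else None   # None: no candidate accepted
    return accepted, scores

def toy_rank(task, new, base):
    # Sign of the difference: +1 better, -1 worse, 0 equal.
    return (new > base) - (new < base)

tasks = ["a", "b", "c"]
baseline = {"a": 1, "b": 1, "c": 1}
candidates = [
    {"a": 2, "b": 0, "c": 1},   # one win, one loss, one tie
    {"a": 2, "b": 2, "c": 0},   # two wins, one loss
    {"a": 0, "b": 0, "c": 1},   # two losses, one tie
]
chosen, scores = select_harness(tasks, baseline, candidates, toy_rank)
print(chosen)  # 1
```

The strict `> 0` test means a candidate that only ties the baseline on balance is rejected, matching the acceptance condition stated above.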

Implementation details: All operators (judge, solve, optimize, rank) are instantiated by the same backbone GPT-5.5 agent, differing only in inputs. The harness is a configurable workspace folder containing markdown files (instructions, skills) and executable scripts (tools).

Empirical Validation / Results

Datasets

SWE-Bench Pro: long-horizon software engineering tasks (multi-file repository editing).
Terminal-Bench 2: command-line tasks with executable graders.
GAIA-2: dynamic asynchronous knowledge-work environments.

Comparison with Feedback-Free Baselines

Table 1: Held-out pass rate after harness optimization.

Method	Harness Surface	SWE-Bench Pro Pass (Δ)	Terminal-Bench 2 Pass (Δ)	GAIA-2 Pass (Δ)
Vanilla Codex	None	0.59 (—)	0.71 (—)	0.29 (—)
Dynamic Cheatsheet	Skills	0.62 (+0.03)	0.73 (+0.02)	0.30 (+0.01)
ReasoningBank	Memory	0.61 (+0.02)	0.73 (+0.02)	0.28 (−0.01)
Sleep-time Compute	Memory	0.64 (+0.05)	0.73 (+0.02)	0.32 (+0.03)
RHO (ours)	Skills+Tools	0.78 (+0.19)	0.76 (+0.05)	0.37 (+0.08)

RHO outperforms every baseline on all three benchmarks, with the largest margin on SWE-Bench Pro (+0.19, i.e. 19 percentage points over Vanilla Codex). The advantage comes from optimizing the full harness (skills, tools, instructions) rather than only memory or skills.
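The Δ column in Table 1 is an absolute difference in pass rate. The short calculation below, using only the Vanilla Codex and RHO rows of the table, shows the same gains in percentage points and as relative improvements over the baseline. The relative figures are derived here for illustration and are not reported in the table.

```python
vanilla = {"SWE-Bench Pro": 0.59, "Terminal-Bench 2": 0.71, "GAIA-2": 0.29}
rho = {"SWE-Bench Pro": 0.78, "Terminal-Bench 2": 0.76, "GAIA-2": 0.37}

gains = {}
for name, base in vanilla.items():
    delta = rho[name] - base
    points = round(delta * 100)               # absolute gain, percentage points
    relative = round(delta / base * 100, 1)   # gain relative to the baseline, %
    gains[name] = (points, relative)
    print(f"{name}: +{points} points, +{relative}% relative")
```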

Optimized Harness Contents (Figure 3)

On each benchmark, RHO adds new skills and tools targeting specific failure modes:

SWE-Bench Pro: Adds check_build_and_lint tool to locate non-standard toolchains and strip Python cache directories from diffs.
Terminal-Bench 2: Adds skills like "Recover behavior from queries alone" and tools for package fix, polygon validation.
GAIA-2: Adds skills for timing, user reply, decomposition, and tools for listing/calling app functions.

Comparison with Validation-Feedback Optimization (Meta-Harness)

Table 2: RHO vs Meta-Harness on SWE-Bench Pro.

Method	Val. Labels	Agent Calls (×RHO)	Pass Rate
RHO	None	103 (1.0×)	0.78
Meta-Harness (1 round)	Required	41 (0.4×)	0.62
Meta-Harness (10 rounds)	Required	320 (3.1×)	0.80

With a single round, Meta-Harness uses fewer agent calls than RHO (41 vs 103, or 0.4×) but reaches only 0.62 against RHO's 0.78. Scaling Meta-Harness to 10 rounds reaches 0.80, but it takes about 3.1× as many agent calls as RHO and still depends on validation labels that RHO does not need.
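The cost multipliers in Table 2 follow directly from the agent-call counts. As a quick check, using only the numbers in the table:

```python
rho_calls = 103
meta_harness_calls = {"1 round": 41, "10 rounds": 320}

# Agent calls expressed as a multiple of RHO's budget.
ratios = {name: round(calls / rho_calls, 1)
          for name, calls in meta_harness_calls.items()}
print(ratios)  # {'1 round': 0.4, '10 rounds': 3.1}
```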

Behavioral Shifts (Figure 4)

RHO shifts agent action patterns: on SWE-Bench Pro, verification actions increase (+61%), navigation decreases (−13%); on Terminal-Bench 2, edit decreases (−44%) while navigate increases; on GAIA-2, execution increases (+25%) while edit decreases. Gains concentrate on long-horizon tasks.

Ablation Studies

Coreset Selection (Figure 5): DPP selection, which balances difficulty and diversity, achieves a 0.78 pass rate versus 0.62 for difficulty-only, 0.58 for coverage-only, and 0.64 for random selection. Selecting on difficulty alone leads to narrow task coverage.

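Best-of-N Consistency (Table 3): Across the 3 candidates, the chosen harness scores 0.78 (SWE-Bench Pro), 0.76 (Terminal-Bench 2), and 0.37 (GAIA-2). It beats the lowest-scoring candidate on every benchmark and lands within 0.03 of the candidate mean, slightly below it on SWE-Bench Pro and above it on the other two.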

Dataset	Mean	Chosen	Std	Lowest
SWE-Bench Pro	0.79	0.78	0.06	0.73
Terminal-Bench 2	0.74	0.76	0.03	0.71
GAIA-2	0.34	0.37	0.03	0.32

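Diagnosis Ablation (Table 4): Removing either signal degrades performance on all three benchmarks. Dropping self-consistency costs the most on SWE-Bench Pro (0.78 to 0.56) and GAIA-2 (0.37 to 0.27), while dropping self-validation costs the most on Terminal-Bench 2 (0.76 to 0.73). Passing raw trajectories with no diagnosis also underperforms the full diagnosis (0.60, 0.75, and 0.29).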

Variant	SWE Pro	TB 2	GAIA-2
Full diagnosis	0.78	0.76	0.37
− self-consistency	0.56	0.75	0.27
− self-validation	0.70	0.73	0.30
Raw trajectory	0.60	0.75	0.29

Theoretical and Practical Implications

Theoretical Significance:

Demonstrates that self-preference over trajectories can substitute for ground-truth labels when optimizing agent harnesses, opening a new paradigm for self-supervised agent improvement.
The DPP-based coreset selection shows that both difficulty and diversity are critical for extracting effective improvement signals.
The two diagnostic signals (self-validation and self-consistency) are complementary and non-redundant, providing richer cues than raw trajectory analysis.

Practical Implications:

Enables continuous harness improvement in deployment scenarios where labeled validation sets are unavailable or costly.
The optimized harness produces targeted tools and skills that address specific failure modes from past trajectories, leading to sustained higher accuracy on long-horizon tasks.
RHO is computationally efficient: a single optimization round on SWE-Bench Pro uses only 103 agent calls and yields a 19-percentage-point absolute improvement in pass rate (0.59 to 0.78).
The method is not tied to one domain: it shows consistent gains across software engineering, technical work, and knowledge work.

Limitations:

Requires environments that can be cleanly reset for group rollout (not suitable for one-shot or irreversible tasks).
Assumes an editable harness surface; applicability to domains with different harness architectures or rollout budgets is future work.

Conclusion

RHO reframes harness improvement as a retrospective self-supervised process rather than a validation-guided search. By re-solving past tasks in parallel and diagnosing the rollouts via self-validation and self-consistency, the agent extracts targeted improvement signals that are used to generate candidate harness updates. A best-of-N proposal with self-preference selects the most promising harness. Across three diverse domains, RHO yields consistent held-out gains (e.g., +19 percentage points on SWE-Bench Pro) without any external labeling. The optimized harness introduces new skills and tools that reshape agent behavior, particularly improving long-horizon task success. RHO represents a step toward agents that autonomously improve from deployment experience, reducing reliance on scarce labeled feedback. Future work may extend RHO to irreversible tasks, different harness surfaces, and online continuous learning scenarios.