# HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

> HarnessX evolves the agent harness as a typed, first-class interface, achieving an average gain of +14.5% (up to +44.0%) across benchmarks.

- **Source:** [arXiv](https://arxiv.org/abs/2606.14249)
- **Published:** 2026-06-16
- **Permalink:** https://picx.dev/p/00m7IM
- **Whiteboard:** https://picx.dev/p/00m7IM/image

## Summary (Overview)

- **HarnessX** introduces a **composable, adaptive, and evolvable agent harness foundry** that treats the runtime interface between model and environment as a first-class, typed object.
- It provides a **nine-dimensional processor taxonomy** and a **substitution algebra** for type-safe harness composition (Section 3).
- **AEGIS** is a trace-driven, multi-agent evolution engine that maps harness adaptation onto RL constructs via an **operational mirror** (Section 4), addressing pathologies like reward hacking, catastrophic forgetting, and under-exploration.
- **Harness-model co-evolution** interleaves harness evolution with model training over a shared replay buffer using **cross-harness GRPO**, yielding additional gains (Section 5).
- Across **five benchmarks (ALFWorld, GAIA, WebShop, $\tau^3$-Bench, SWE-bench Verified)** and **three model families**, HarnessX achieves an average gain of **+14.5%** (up to +44.0%), with gains largest where baselines are lowest.

## Introduction and Theoretical Foundation

**Background & Motivation:** Modern AI agent performance depends critically on the **runtime harness**—the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Despite its importance, harness development remains:
- **Hand-crafted and static**: each new model or task demands bespoke scaffolding.
- **Architecturally entangled**: changes to one component silently break others.
- **Decoupled from model training**: execution traces are discarded rather than used to improve either the harness or the model.

**Theoretical Basis:** The paper formalizes harness evolution as an MDP over symbolic artifacts via an **operational mirror** (Section 4.1). Key definitions:

> **Definition 1 (Harness Configuration).** A harness configuration is a tuple $H = (c_1, c_2, \dots, c_9)$, where each $c_i \in \mathcal{C}_i$ instantiates one of the nine behavioral dimensions: model selection, context assembly, memory management, tool ecosystem, execution environment, evaluation and reward, control and safety, observability, and training bridge.

> **Definition 2 (Harness Edit).** A harness edit is a function $e : \mathcal{H} \to \mathcal{H}$ that modifies one or more dimensions while preserving type contracts. The action space $\mathcal{E}$ is discrete but open-ended.

> **Definition 3 (Operational Mirror).** The operational mirror is the tuple $(\mathcal{H}, \mathcal{E}, R, \mathcal{T})$, where $\mathcal{H}$ is the harness-configuration space (states), $\mathcal{E}$ is the code-level edit space (actions), $R : \mathcal{H} \times \mathcal{E} \to \mathbb{R}$ maps a configuration–edit pair to a scalar reward, and $\mathcal{T}$ is the trace store.
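
Definitions 1 and 2 can be illustrated with a small sketch. It assumes each of the nine dimensions is represented by a plain string, which is a simplification; the names `HarnessConfig`, `HarnessEdit`, and `set_dimension` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class HarnessConfig:
    """One field per behavioral dimension named in Definition 1.

    Using strings for every component is an illustrative assumption.
    """
    model_selection: str
    context_assembly: str
    memory_management: str
    tool_ecosystem: str
    execution_environment: str
    evaluation_and_reward: str
    control_and_safety: str
    observability: str
    training_bridge: str


# Definition 2: an edit is a function from configurations to configurations.
HarnessEdit = Callable[[HarnessConfig], HarnessConfig]


def set_dimension(name: str, value: str) -> HarnessEdit:
    """Hypothetical helper: build an edit that replaces a single dimension."""
    def edit(h: HarnessConfig) -> HarnessConfig:
        # dataclasses.replace raises TypeError for an unknown field name,
        # a crude stand-in for "preserving type contracts".
        return replace(h, **{name: value})
    return edit


base = HarnessConfig(*["default"] * 9)
edited = set_dimension("memory_management", "episodic")(base)
```

Because the dataclass is frozen, `base` is left unchanged and `edited` is a new configuration, which matches the view of edits as state transitions in the operational mirror.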

**Table 2: Operational Mirror: RL concepts and their symbolic-space duals in AEGIS.**
| RL concept | Symbolic-space dual | AEGIS realization |
|------------|-------------------|-------------------|
| Policy $\pi$ | Harness-update procedure $\pi_{\text{evo}}$ | Four-stage pipeline (Section 4.3) |
| State $s_t$ | $(H_t, \mathcal{T}_t)$ | Harness configuration + trace store |
| Action $a_t$ | Typed harness edit | Builder operation + change manifest |
| Feedback | Trace $\tau$ + verifier score $r$ | Observability layer |
| Update | $H_{t+1} \leftarrow U(\tilde{H}_t, \mathcal{T}_t, r_t)$ | Deterministic acceptance gate |

This mapping identifies three RL pathologies—**reward hacking, catastrophic forgetting, under-exploration**—that reappear in amplified form in symbolic space and motivate AEGIS’s architectural defenses.

## Methodology

### Harness Composition (Section 3)
HarnessX structures the harness as $H = (\mathcal{M}, \mathcal{C})$, where $\mathcal{M}$ is the model configuration and $\mathcal{C} = (P, S)$ is the harness configuration. $P : \text{Hook} \to \text{List[Processor]}$ maps each of the eight lifecycle hooks (Table 1) to a list of processors.

**Table 1: Hook points and their permitted modifications.**
| Hook | Event type | Permitted modifications |
|------|------------|------------------------|
| `task_start` | TaskStartEvent | system prompt |
| `step_start` | StepStartEvent | structural history edits |
| `before_model` | BeforeModelEvent | last user content; one user-message append |
| `after_model` | ModelResponseEvent | response content, tool calls |
| `before_tool` | ToolCallEvent | tool input, approval flag |
| `after_tool` | ToolResultEvent | tool result |
| `step_end` | StepEndEvent | read-only |
| `task_end` | TaskEndEvent | read-only |

Processors implement a restricted interface, `async def process(self, event: Event) -> AsyncIterator[Event]`, which allows five outcomes: **pass-through, transform, split, intercept, or interrupt**. The **nine-dimensional taxonomy** (Section 3.3) covers the behavioral dimensions listed in Definition 1; AEGIS edits span all nine.
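
The processor interface above can be sketched with async generators. Only the `process` signature is taken from the text; the `Event` class, the three processors, and `run_chain` are hypothetical illustrations. The interrupt outcome is left out because the summary does not specify its semantics.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class Event:
    """Hypothetical stand-in for the paper's event types."""
    hook: str
    content: str


class Redact:
    """Transform: yield exactly one rewritten event."""
    async def process(self, event: Event) -> AsyncIterator[Event]:
        yield Event(event.hook, event.content.replace("secret", "[redacted]"))


class SplitLines:
    """Split: yield one event per line of content."""
    async def process(self, event: Event) -> AsyncIterator[Event]:
        for line in event.content.splitlines():
            yield Event(event.hook, line)


class DropEmpty:
    """Intercept blank content (yield nothing); otherwise pass through."""
    async def process(self, event: Event) -> AsyncIterator[Event]:
        if event.content.strip():
            yield event


async def run_chain(processors, event: Event) -> List[Event]:
    """Feed every event produced by one processor into the next one."""
    events = [event]
    for processor in processors:
        produced = []
        for current in events:
            async for out in processor.process(current):
                produced.append(out)
        events = produced
    return events


result = asyncio.run(
    run_chain([SplitLines(), DropEmpty(), Redact()],
              Event("after_tool", "secret a\n\nb"))
)
```

Returning an iterator rather than a single event is what lets one signature cover zero, one, or many outputs per input.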

### Harness Adaptation: AEGIS (Section 4)
AEGIS is a **multi-agent evolution engine** realized as a four-stage pipeline in which every stage is driven by the same meta-agent LLM:
- **Digester**: compresses raw traces ($\sim$10M tokens per iteration on GAIA) into structured per-task summaries.
- **Planner**: constructs an adaptation landscape to prevent under-exploration.
- **Evolver**: generates typed builder operations with change manifests and smoke tests.
- **Critic & Deterministic Gate**: defends against reward hacking and catastrophic forgetting via the **seesaw constraint** (no regression on previously passing tasks).
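
The seesaw constraint lends itself to a short sketch. The version below encodes only the rule stated above (no regression on previously passing tasks); the function name, the dictionary representation, and the choice to treat a missing task as a failure are assumptions, and the real gate may apply further checks.

```python
from typing import Dict


def seesaw_gate(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Accept a candidate harness only if every task that passed before
    still passes afterwards.

    `before` and `after` map task ids to pass/fail results. A task missing
    from `after` is treated as a failure (an assumption of this sketch).
    """
    return all(after.get(task, False)
               for task, passed in before.items() if passed)


accepted = seesaw_gate({"t1": True, "t2": False}, {"t1": True, "t2": True})
rejected = seesaw_gate({"t1": True, "t2": False}, {"t1": False, "t2": True})
```

In the second call the candidate fixes `t2` but breaks `t1`, so the gate rejects it even though the aggregate pass count is unchanged.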

The loop is formalized in **Algorithm 1** (selective invocation with early stopping after $P$ idle rounds). **Variant isolation via Ensemble routing** (Section 4.5) maintains up to $K$ harness variants, routing tasks to the variant with highest estimated success rate. This prevents cross-task interference on heterogeneous benchmarks.
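
Routing to the variant with the highest estimated success rate could look roughly like the sketch below. The summary does not say how the estimate is computed, so the per-category counts and Laplace smoothing used here are assumptions, as are the class and method names.

```python
from collections import defaultdict
from typing import List


class VariantRouter:
    """Hypothetical router over at most `max_variants` harness variants."""

    def __init__(self, variants: List[str], max_variants: int):
        if len(variants) > max_variants:
            raise ValueError("more variants than the allowed maximum K")
        self.variants = list(variants)
        # (variant, task category) -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, variant: str, category: str, success: bool) -> None:
        entry = self.stats[(variant, category)]
        entry[0] += int(success)
        entry[1] += 1

    def estimate(self, variant: str, category: str) -> float:
        # Laplace-smoothed success rate; unseen pairs default to 0.5.
        successes, attempts = self.stats[(variant, category)]
        return (successes + 1) / (attempts + 2)

    def route(self, category: str) -> str:
        # Ties go to the earliest variant in the list.
        return max(self.variants, key=lambda v: self.estimate(v, category))


router = VariantRouter(["v0", "v1"], max_variants=2)
router.record("v1", "web", True)
router.record("v1", "web", True)
router.record("v0", "web", False)
choice = router.route("web")
```

Keeping statistics per variant and category is one way to limit the cross-task interference that the text attributes to a single global harness.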

### Harness-Model Co-Evolution (Section 5)
The co-evolution iteration (Section 5.1) pairs harness evolution with model training over a **shared replay buffer** $\mathcal{B}$:

1. **Rollout**: Run $(\mathcal{M}_t, H_t)$ on $B_t$, record traces $\tau_i$.
2. **Verification**: Fixed verifier gives scalar rewards $r_i$.
3. **Buffer insertion**: Append scored traces with harness version.
4. **Harness evolution**: $H_{t+1} \leftarrow \text{AEGIS}(H_t, \mathcal{B})$.
5. **Behavior log-probabilities**: Cache $\pi_{\theta_{\text{old}}}(\tau_i)$.
6. **GRPO update**: $\mathcal{M}_{t+1} \leftarrow \text{GRPO}(\mathcal{M}_t, \mathcal{B})$.
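
The six steps can be arranged into a loop skeleton. Every callable passed in (`rollout`, `verify`, `evolve_harness`, `logprob`, `grpo_update`) is a hypothetical placeholder for a component the summary only names, and the record layout is an assumption.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def co_evolve(model: Any, harness: Any, batches: Iterable[list],
              rollout: Callable, verify: Callable, evolve_harness: Callable,
              logprob: Callable, grpo_update: Callable
              ) -> Tuple[Any, Any, List[Dict[str, Any]]]:
    """Interleave harness evolution and model updates over a shared buffer."""
    buffer: List[Dict[str, Any]] = []
    for version, batch in enumerate(batches):
        # Steps 1-3: roll out, score with the fixed verifier, and store the
        # trace together with the harness version that produced it.
        for trace in rollout(model, harness, batch):
            buffer.append({"trace": trace, "reward": verify(trace),
                           "harness_version": version, "old_logprob": None})
        # Step 4: evolve the harness from the buffer.
        harness = evolve_harness(harness, buffer)
        # Step 5: cache behavior log-probabilities under the pre-update model.
        for record in buffer:
            if record["old_logprob"] is None:
                record["old_logprob"] = logprob(model, record["trace"])
        # Step 6: update the model from the same buffer.
        model = grpo_update(model, buffer)
    return model, harness, buffer


# Toy stand-ins so the skeleton can be exercised end to end.
final_model, final_harness, buf = co_evolve(
    model=0, harness=0, batches=[["a", "b"], ["c"]],
    rollout=lambda m, h, batch: [f"h{h}:{task}" for task in batch],
    verify=lambda trace: 1.0,
    evolve_harness=lambda h, b: h + 1,
    logprob=lambda m, trace: -1.0,
    grpo_update=lambda m, b: m + 1,
)
```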

**Cross-harness GRPO** (Section 5.3) groups all trajectories of the same task across harness versions:

$$
\mathcal{G}_x = \{\tau_i \in \mathcal{B} \mid \text{task}(\tau_i) = x\} = \bigcup_k \{\tau \sim \text{Agent}(\mathcal{M}_k, H_k, x)\}.
$$

The group-relative advantage is:

$$
\hat{A}(\tau_i) = \frac{r_i - \mu(\mathcal{G}_x)}{\sigma(\mathcal{G}_x) + \epsilon}.
$$
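
A minimal sketch of the group-relative advantage follows. It assumes the buffer is available as `(task, reward)` pairs and that $\sigma$ is the population standard deviation; neither detail is stated in the summary, and `group_advantages` is a hypothetical name.

```python
from collections import defaultdict
from statistics import fmean, pstdev
from typing import List, Tuple


def group_advantages(records: List[Tuple[str, float]],
                     eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its task group.

    Grouping is by task only, so trajectories produced under different
    harness versions land in the same group.
    """
    groups = defaultdict(list)
    for task, reward in records:
        groups[task].append(reward)
    stats = {task: (fmean(rs), pstdev(rs)) for task, rs in groups.items()}
    return [(reward - stats[task][0]) / (stats[task][1] + eps)
            for task, reward in records]


advantages = group_advantages([("x", 1.0), ("x", 0.0), ("y", 1.0)])
```

A task with a single trajectory, or with identical rewards across all trajectories, gets an advantage of zero and therefore contributes no gradient signal.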

The policy objective:

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{x, \tau_i \sim \mathcal{B}} \left[ \min \left( \rho_i(\theta) \hat{A}(\tau_i),\, \text{clip}(\rho_i(\theta), 1-\epsilon_c, 1+\epsilon_c) \hat{A}(\tau_i) \right) \right] - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}),
$$

where $\rho_i(\theta) = \frac{\pi_\theta(\tau_i | x)}{\pi_{\theta_{\text{old}}}(\tau_i | x)}$.
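
For a single trajectory, the clipped term inside the expectation can be computed as in the sketch below. It works from trajectory-level log-probabilities, omits the KL penalty, and uses `clip_eps=0.2` as an assumed default; the summary does not report the value of $\epsilon_c$.

```python
import math


def clipped_term(logp_new: float, logp_old: float, advantage: float,
                 clip_eps: float = 0.2) -> float:
    """Per-trajectory clipped surrogate: min(rho * A, clip(rho) * A)."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio rho_i(theta)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# With rho = 1.5: a positive advantage is capped at 1.2 * A,
# while a negative advantage keeps the unclipped, more pessimistic value.
positive = clipped_term(math.log(1.5), 0.0, advantage=1.0)
negative = clipped_term(math.log(1.5), 0.0, advantage=-1.0)
```

Taking the minimum of the two products bounds the benefit of moving far from the behavior policy, which matters when the buffer mixes trajectories from several model and harness versions.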

Off-policy training over the **mixed-policy buffer** (Section 5.4) reuses trajectories at no additional rollout cost; FIFO eviction bounds the model-version lag.
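
One simple way to obtain FIFO eviction is a bounded deque, sketched below. The class name, the record fields, and the lag definition (current model version minus the oldest stored version) are assumptions made for illustration.

```python
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity buffer with first-in, first-out eviction."""

    def __init__(self, capacity: int):
        # A deque with maxlen drops the oldest item once capacity is reached.
        self._items = deque(maxlen=capacity)

    def add(self, trace: str, reward: float,
            model_version: int, harness_version: int) -> None:
        self._items.append({"trace": trace, "reward": reward,
                            "model_version": model_version,
                            "harness_version": harness_version})

    def __len__(self) -> int:
        return len(self._items)

    def max_lag(self, current_model_version: int) -> int:
        """Largest gap between the current model and any stored trace."""
        return max((current_model_version - item["model_version"]
                    for item in self._items), default=0)


replay = ReplayBuffer(capacity=2)
for version in range(3):
    replay.add(f"trace-{version}", 1.0,
               model_version=version, harness_version=version)
```

After three insertions into a buffer of capacity two, the version-0 trace has been evicted, so the lag relative to model version 2 is at most one.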

## Empirical Validation / Results

**Experimental Setup.** Five benchmarks (Table 3), three task-agent families (Claude Sonnet 4.6, GPT-5.4, Qwen3.5-9B), up to 15 evolution rounds with early stopping after 3 idle rounds. Meta-agent: Claude Opus 4.6.

**Table 3: Benchmark characteristics.**
| Benchmark | Domain | Sampled Tasks | Verifier |
|-----------|--------|--------------|----------|
| GAIA (Level 1–3) | Multi-step retrieval | 103 | Exact match |
| ALFWorld | Embodied planning | 134 | Goal completion |
| WebShop | Web interaction | 100 | Attribute match |
| $\tau^3$-Bench | Multi-turn dialogue | 3 domains | Rule compliance |
| SWE-bench Verified | Software engineering | 55 | Patch resolution |

**Main Results (Table 4).** Evolution improves 14 of 15 configurations, for an average gain of +14.5% across all 15; the nonzero gains range from +1.1% to +44.0%. **Inverse scaling**: the weakest task agent tends to gain most (Qwen3.5-9B: +44.0% on ALFWorld, +17.1% on GAIA, +18.2% on SWE-bench), although on WebShop it has the lowest baseline and the smallest gain (+13.0%). GPT-5.4 on GAIA stagnates ($\Delta = 0.0$) due to task heterogeneity; variant isolation resolves this (Table 5).

**Table 4: Main results (pass@2 success rate, %). Evolved = peak accuracy.**
| Benchmark | Task agent | Initial | Evolved | $\Delta$ | Best round |
|-----------|------------|---------|---------|----------|------------|
| ALFWorld | Sonnet 4.6 | 83.6 | 94.8 | +11.2 | 7 |
| | GPT-5.4 | 76.9 | 97.8 | +20.9 | 4 |
| | Qwen3.5-9B | 53.0 | 97.0 | +44.0 | 9 |
| WebShop | Sonnet 4.6 | 60.0 | 76.0 | +16.0 | 7 |
| | GPT-5.4 | 55.0 | 73.0 | +18.0 | 8 |
| | Qwen3.5-9B | 36.0 | 49.0 | +13.0 | 7 |
| GAIA | Sonnet 4.6 | 73.8 | 83.5 | +9.7 | 11 |
| | GPT-5.4 | 73.8 | 73.8 | 0.0 | 4 |
| | Qwen3.5-9B | 20.3 | 37.4 | +17.1 | 4 |
| SWE-bench | Sonnet 4.6 | 76.4 | 87.3 | +10.9 | 3 |
| | GPT-5.4 | 45.5 | 63.6 | +18.2 | 3 |
| | Qwen3.5-9B | 23.6 | 41.8 | +18.2 | 2 |
| $\tau^3$-Bench (Avg.) | Sonnet 4.6 | 89.6 | 95.0 | +5.4 | – |
| | GPT-5.4 | 76.2 | 90.7 | +14.5 | – |
| | Qwen3.5-9B | 93.5 | 94.6 | +1.1 | – |
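
The headline figures can be recomputed from the $\Delta$ column of Table 4. The snippet below uses only the values printed in the table, copied in row order.

```python
from statistics import fmean

# Delta column of Table 4, in row order (ALFWorld, WebShop, GAIA,
# SWE-bench, tau^3-Bench; three task agents each).
deltas = [11.2, 20.9, 44.0,
          16.0, 18.0, 13.0,
          9.7, 0.0, 17.1,
          10.9, 18.2, 18.2,
          5.4, 14.5, 1.1]

average_gain = round(fmean(deltas), 1)   # mean over all 15 configurations
improved = sum(d > 0 for d in deltas)    # configurations with a positive gain
largest = max(deltas)

print(average_gain, improved, largest)   # 14.5 14 44.0
```

The +14.5% average therefore includes the one configuration with zero gain.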

**Evolution Strategy Comparison (Table 5).** On GAIA with GPT-5.4, the Global strategy (single harness) peaks at 73.8% and then collapses to 49.5% (Final − Peak: −24.3). **Ensemble routing** (up to $K$ variants) finishes at its peak of 87.4%, with no degradation, while using 25% fewer tokens (107.8M vs. 143.7M).

**Table 5: Evolution strategy comparison (GAIA, GPT-5.4, AEGIS, 15 rounds).**
| Strategy | Final (%) | Peak (%) | Final − Peak | Tokens |
|----------|-----------|----------|--------------|--------|
| Ensemble (up to K variants) | 87.4 | 87.4 | 0.0 | 107.8M |
| Global (single harness) | 49.5 | 73.8 | −24.3 | 143.7M |

**Meta-Agent Effectiveness (Table 6).** Replacing the four-stage AEGIS pipeline with a single-agent evolver (CC SDK) yields comparable accuracy (86.4% vs. 87.4%, within noise) but consumes ~14% more tokens.

**Table 6: Meta-agent architecture comparison (GAIA, GPT-5.4, variant isolation, 15 rounds).**
| Evolver | Accuracy (%) | Best round | Tokens |
|---------|--------------|------------|--------|
| AEGIS | 87.4 | R14 | 107.8M |
| CC SDK | 86.4 | R12 | 123.1M |
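
The token percentages quoted for Tables 5 and 6 can be checked directly from the reported totals (107.8M, 143.7M, and 123.1M tokens):

$$
1 - \frac{107.8}{143.7} \approx 0.250, \qquad \frac{123.1}{107.8} - 1 \approx 0.142.
$$

That is, Ensemble routing uses about 25% fewer tokens than the Global strategy, and the CC SDK evolver uses about 14% more tokens than AEGIS.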

**Co-Evolution (Section 6.5).** Interleaving cross-harness GRPO with harness evolution yields an additional +4.7% average gain over harness-only evolution (GAIA: +4.3%, WebShop: +5.0%). Co-evolution breaks the **scaffolding ceiling** (Figure 5).

**Failure Analysis (Section 6.6).** All three predicted pathologies are empirically confirmed:
- **Reward hacking** (GAIA Sonnet): verifier format exploit co-shipped with genuine improvement; caught at next round via trace analysis.
- **Catastrophic forgetting** ($\tau^3$-Bench Telecom): five consecutive same-type prompt/processor edits accumulated sub-threshold coupling; a sixth edit triggered –14.0% regression. Self-corrected by R9.
- **Under-exploration** (ALFWorld Sonnet): prompt-space exhaustion (ship-prediction accuracy dropped to 0%); pipeline later shifted to structural edits.

## Theoretical and Practical Implications

- **Compositional structure matters for evolution** (Section 7.1). Typed components make the intended scope of each edit explicit, enabling **variant isolation**. Without it, the Global strategy collapses from sub-threshold regression accumulation. The analogy: types don't generate correct programs but make incorrect ones detectable.
- **Trace richness bounds safe evolution** (Section 7.2). Scalar reward alone cannot distinguish reward hacking, catastrophic forgetting, or under-exploration. Structured traces make pathologies diagnosable, but the $\tau^3$-Bench Telecom failure shows that accumulation below per-task detection thresholds can still cause damage.
- **Operational mirror as design heuristic** (Section 7.3). It identifies failure modes to defend against but does not predict ordering, timing, or severity. Convergence guarantees from classical RL are unattainable in symbolic space.
- **Generalization across model families** (Section 7.4). Inverse-scaling holds: gains are larger for weaker models regardless of family similarity to the meta-agent.
- **Cost-performance tradeoffs** (Section 7.5). The upfront cost of evolution amortizes over deployment; for example, the 107.8M tokens spent on GAIA are recovered within ~1300 invocations. Per-task inference cost can decrease (GAIA: −25%) or increase (ALFWorld: +60%) depending on the harness changes.

**Table 7: Evolution cost summary.**
| Experiment | Rounds | Total Tokens | Gain |
|------------|--------|--------------|------|
| GAIA, GPT-5.4 (Global) | 15 | 143.7M | 0.0% (peak = initial) |
| GAIA, GPT-5.4 (Variant isolation) | 15 | 107.8M | +13.6% |
| ALFWorld, Sonnet 4.6 (Global) | 7 | 43.4M | +11.2% |

- **Ethical considerations** (Section 7.6): auditability (every edit carries manifest), deterministic gating (seesaw constraint), human-in-the-loop for high-risk edits. The $\tau^3$-Bench Telecom failure shows a structural limitation of per-edit gating.

## Conclusion

HarnessX demonstrates that **agent progress need not come from model scaling alone**. By treating the harness as a **composable, adaptive, and evolvable** first-class interface, the system achieves significant gains through:
- **Typed composition** enabling variant isolation and stable evolution.
- **Trace-driven multi-agent evolution** (AEGIS) that detects and mitigates RL-derived pathologies.
- **Harness-model co-evolution** via cross-harness GRPO, breaking both the scaffolding ceiling and the training-signal ceiling.

Across five benchmarks and three model families, HarnessX yields an **average gain of +14.5%** (up to +44.0%), with co-evolution adding **+4.7%** beyond harness-only evolution. The results suggest that composing and evolving the runtime interface from execution feedback is a complementary and actionable lever, especially for capability-limited agents.

**Limitations** (Section 7.7): no held-out evaluation, discrete action spaces only, closed-source meta-agent required, joint-control assumption, limited benchmark coverage. Future work should address held-out generalization, continuous action spaces, open-weight meta-agents, and cross-team coordination mechanisms.
