Summary (Overview)

  • HarnessX introduces a composable, adaptive, and evolvable agent harness foundry that treats the runtime interface between model and environment as a first-class, typed object.
  • It provides a nine-dimensional processor taxonomy and a substitution algebra for type-safe harness composition (Section 3).
  • AEGIS is a trace-driven, multi-agent evolution engine that maps harness adaptation onto RL constructs via an operational mirror (Section 4), addressing pathologies like reward hacking, catastrophic forgetting, and under-exploration.
  • Harness-model co-evolution interleaves harness evolution with model training over a shared replay buffer using cross-harness GRPO, yielding additional gains (Section 5).
  • Across five benchmarks (ALFWorld, GAIA, WebShop, $\tau^3$-Bench, SWE-bench Verified) and three model families, HarnessX achieves an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest.

Introduction and Theoretical Foundation

Background & Motivation: Modern AI agent performance depends critically on the runtime harness—the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Despite its importance, harness development remains:

  • Hand-crafted and static: each new model or task demands bespoke scaffolding.
  • Architecturally entangled: changes to one component silently break others.
  • Decoupled from model training: execution traces are discarded rather than used to improve either the harness or the model.

Theoretical Basis: The paper formalizes harness evolution as an MDP over symbolic artifacts via an operational mirror (Section 4.1). Key definitions:

Definition 1 (Harness Configuration). A harness configuration is a tuple $H = (c_1, c_2, \dots, c_9)$, where each $c_i \in \mathcal{C}_i$ instantiates one of the nine behavioral dimensions: model selection, context assembly, memory management, tool ecosystem, execution environment, evaluation and reward, control and safety, observability, and training bridge.

Definition 2 (Harness Edit). A harness edit is a function $e : \mathcal{H} \to \mathcal{H}$ that modifies one or more dimensions while preserving type contracts. The action space $\mathcal{E}$ is discrete but open-ended.

Definition 3 (Operational Mirror). The operational mirror is the tuple $(\mathcal{H}, \mathcal{E}, R, \mathcal{T})$, where $\mathcal{H}$ is the harness-configuration space (states), $\mathcal{E}$ is the code-level edit space (actions), $R : \mathcal{H} \times \mathcal{E} \to \mathbb{R}$ maps a configuration–edit pair to a scalar reward, and $\mathcal{T}$ is the trace store.

Table 2: Operational Mirror: RL concepts and their symbolic-space duals in AEGIS.

| RL concept | Symbolic-space dual | AEGIS realization |
|---|---|---|
| Policy $\pi$ | Harness-update procedure $\pi_{\text{evo}}$ | Four-stage pipeline (Section 4.3) |
| State $s_t$ | $(H_t, \mathcal{T}_t)$ | Harness configuration + trace store |
| Action $a_t$ | Typed harness edit | Builder operation + change manifest |
| Feedback | Trace $\tau$ + verifier score $r$ | Observability layer |
| Update | $H_{t+1} \leftarrow U(\tilde{H}_t, \mathcal{T}_t, r_t)$ | Deterministic acceptance gate |

This mapping identifies three RL pathologies—reward hacking, catastrophic forgetting, under-exploration—that reappear in amplified form in symbolic space and motivate AEGIS’s architectural defenses.
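
To make Definitions 1 and 2 concrete, here is a minimal Python sketch of a nine-component configuration and an edit that maps one configuration to another. The class and function names (`HarnessConfig`, `set_dimension`) and the use of plain strings as components are assumptions for illustration; real type contracts would be richer than a length check.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# The nine behavioral dimensions named in Definition 1.
DIMENSIONS = (
    "model_selection", "context_assembly", "memory_management",
    "tool_ecosystem", "execution_environment", "evaluation_and_reward",
    "control_and_safety", "observability", "training_bridge",
)


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical stand-in for H = (c_1, ..., c_9): one component per dimension."""
    components: Tuple[str, ...]

    def __post_init__(self):
        # Stand-in for a type contract: every dimension must be instantiated.
        if len(self.components) != len(DIMENSIONS):
            raise ValueError("a harness needs exactly nine components")


# Definition 2: an edit is a function from configurations to configurations.
HarnessEdit = Callable[[HarnessConfig], HarnessConfig]


def set_dimension(name: str, value: str) -> HarnessEdit:
    """Build an edit that replaces the component of a single dimension."""
    index = DIMENSIONS.index(name)

    def edit(h: HarnessConfig) -> HarnessConfig:
        parts = list(h.components)
        parts[index] = value
        return HarnessConfig(tuple(parts))  # returns a new, immutable configuration

    return edit


base = HarnessConfig(tuple(f"default_{d}" for d in DIMENSIONS))
edited = set_dimension("memory_management", "sliding_window")(base)
```

Because the configuration is immutable, the pre-edit harness stays available for comparison, which is what an acceptance gate needs.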

Methodology

Harness Composition (Section 3)

HarnessX structures the harness as $H = (\mathcal{M}, \mathcal{C})$, where $\mathcal{M}$ is the model configuration and $\mathcal{C} = (P, S)$ is the harness configuration. $P : \text{Hook} \to \text{List[Processor]}$ is a hook-indexed list of processors attached to eight lifecycle events.

Table 1: Hook points and their permitted modifications.

| Hook | Event type | Permitted modifications |
|---|---|---|
| task_start | TaskStartEvent | system prompt |
| step_start | StepStartEvent | structural history edits |
| before_model | BeforeModelEvent | last user content; one user-message append |
| after_model | ModelResponseEvent | response content, tool calls |
| before_tool | ToolCallEvent | tool input, approval flag |
| after_tool | ToolResultEvent | tool result |
| step_end | StepEndEvent | read-only |
| task_end | TaskEndEvent | read-only |

Processors follow a restricted interface: async def process(self, event: Event) -> AsyncIterator[Event], enabling pass-through, transform, split, intercept, or interrupt outcomes. The nine-dimensional taxonomy (Section 3.3) spans all behavioral dimensions; AEGIS edits span all nine.
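
The processor interface above can be illustrated with a small Python sketch. The `Event` dataclass, the three processor classes, and `run_chain` are hypothetical; only the `process` signature comes from the text. Reading "intercept" as "yield nothing" is one plausible interpretation, not a confirmed one.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class Event:
    """Hypothetical minimal event; the actual event types are listed in Table 1."""
    hook: str
    content: str


class PassThrough:
    async def process(self, event: Event) -> AsyncIterator[Event]:
        yield event  # pass-through: emit the event unchanged


class RedactSecrets:
    async def process(self, event: Event) -> AsyncIterator[Event]:
        # transform: emit one modified copy of the event
        yield Event(event.hook, event.content.replace("SECRET", "[redacted]"))


class DropEmpty:
    async def process(self, event: Event) -> AsyncIterator[Event]:
        # intercept (assumed meaning): emit nothing for blank content
        if event.content.strip():
            yield event


async def run_chain(processors, event: Event) -> List[Event]:
    """Feed every event produced by one processor into the next one."""
    events = [event]
    for proc in processors:
        next_events = []
        for ev in events:
            async for out in proc.process(ev):
                next_events.append(out)
        events = next_events
    return events


chain = [PassThrough(), RedactSecrets(), DropEmpty()]
result = asyncio.run(run_chain(chain, Event("after_tool", "token=SECRET")))
empty = asyncio.run(run_chain(chain, Event("after_tool", "   ")))
```

Returning an async iterator rather than a single event is what lets one interface cover zero outputs (intercept), one output (pass-through, transform), and several outputs (split).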

Harness Adaptation: AEGIS (Section 4)

AEGIS is a multi-agent evolution engine realized as a four-stage pipeline, all driven by the same meta-agent LLM:

  • Digester: compresses raw traces (~10M tokens per iteration on GAIA) into structured per-task summaries.
  • Planner: constructs an adaptation landscape to prevent under-exploration.
  • Evolver: generates typed builder operations with change manifests and smoke tests.
  • Critic & Deterministic Gate: defends against reward hacking and catastrophic forgetting via the seesaw constraint (no regression on previously passing tasks).
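
The seesaw constraint in the last stage can be sketched as a pure function over per-task pass/fail results. The function name and the dictionary layout are assumptions; the sketch encodes only the stated rule (no regression on previously passing tasks) and ignores any further acceptance criteria the real gate may apply.

```python
from typing import Dict


def seesaw_gate(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Accept a candidate harness only if no previously passing task now fails.

    `before` and `after` map task ids to pass/fail under the current and the
    candidate harness. A task missing from `after` is treated as failing.
    """
    regressions = [
        task for task, passed in before.items()
        if passed and not after.get(task, False)
    ]
    return not regressions


current = {"t1": True, "t2": False, "t3": True}
improved = {"t1": True, "t2": True, "t3": True}    # fixes t2, keeps t1 and t3
tradeoff = {"t1": False, "t2": True, "t3": True}   # fixes t2 but breaks t1

accept_improved = seesaw_gate(current, improved)
accept_tradeoff = seesaw_gate(current, tradeoff)
```

The gate is deterministic because it depends only on recorded verifier outcomes, not on a model's judgment.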

The loop is formalized in Algorithm 1 (selective invocation with early stopping after $P$ idle rounds). Variant isolation via Ensemble routing (Section 4.5) maintains up to $K$ harness variants, routing tasks to the variant with the highest estimated success rate. This prevents cross-task interference on heterogeneous benchmarks.
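
A minimal sketch of the routing idea follows. How the success rate is estimated is not specified in this summary, so the per-task-type counts and the Laplace smoothing used here are assumptions, as are the class and method names.

```python
from collections import defaultdict


class EnsembleRouter:
    """Hypothetical router over at most K harness variants."""

    def __init__(self, max_variants: int):
        self.max_variants = max_variants
        self.variants = []
        # (variant, task_type) -> [passes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def add_variant(self, name: str) -> None:
        if len(self.variants) >= self.max_variants:
            raise ValueError("variant budget K exhausted")
        self.variants.append(name)

    def record(self, variant: str, task_type: str, passed: bool) -> None:
        entry = self.stats[(variant, task_type)]
        entry[0] += int(passed)
        entry[1] += 1

    def estimate(self, variant: str, task_type: str) -> float:
        passes, attempts = self.stats[(variant, task_type)]
        return (passes + 1) / (attempts + 2)  # Laplace-smoothed success rate

    def route(self, task_type: str) -> str:
        # Ties resolve to the earliest-added variant.
        return max(self.variants, key=lambda v: self.estimate(v, task_type))


router = EnsembleRouter(max_variants=2)
router.add_variant("base")
router.add_variant("web")
for _ in range(2):
    router.record("web", "browse", True)
    router.record("base", "browse", False)

chosen_for_browse = router.route("browse")
chosen_for_unseen = router.route("file_qa")
```

Keeping statistics per variant and task type means an edit that helps one task family is confined to the variant that serves it.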

Harness-Model Co-Evolution (Section 5)

The co-evolution iteration (Section 5.1) pairs harness evolution with model training over a shared replay buffer $\mathcal{B}$:

  1. Rollout: Run $(\mathcal{M}_t, H_t)$ on $B_t$, record traces $\tau_i$.
  2. Verification: Fixed verifier gives scalar rewards $r_i$.
  3. Buffer insertion: Append scored traces with harness version.
  4. Harness evolution: $H_{t+1} \leftarrow \text{AEGIS}(H_t, \mathcal{B})$.
  5. Behavior log-probabilities: Cache $\pi_{\theta_{\text{old}}}(\tau_i)$.
  6. GRPO update: $\mathcal{M}_{t+1} \leftarrow \text{GRPO}(\mathcal{M}_t, \mathcal{B})$.
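
The six steps above can be sketched as one function that takes each stage as a callable. Everything named here (`ScoredTrace`, `co_evolution_round`, and the five callables) is hypothetical scaffolding; the stand-in lambdas in the usage lines only exercise the control flow and do not represent AEGIS or GRPO.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredTrace:
    task: str
    trace: str
    harness_version: int
    reward: float
    behavior_logprob: Optional[float] = None


def co_evolution_round(model, harness, version, tasks, buffer: List[ScoredTrace],
                       rollout, verify, evolve, logprob, grpo_update):
    fresh = []
    for task in tasks:
        trace = rollout(model, harness, task)                     # 1. rollout
        reward = verify(task, trace)                              # 2. verification
        fresh.append(ScoredTrace(task, trace, version, reward))
    buffer.extend(fresh)                                          # 3. buffer insertion
    new_harness = evolve(harness, buffer)                         # 4. harness evolution
    for item in fresh:
        item.behavior_logprob = logprob(model, item.trace)        # 5. cache log-probs
    new_model = grpo_update(model, buffer)                        # 6. GRPO update
    return new_model, new_harness, version + 1


# Toy stand-ins (assumptions) that only exercise the control flow.
buffer: List[ScoredTrace] = []
model, harness, version = co_evolution_round(
    "M0", "H0", 0, ["a", "b"], buffer,
    rollout=lambda m, h, t: f"{m}|{h}|{t}",
    verify=lambda t, tr: 1.0 if t == "a" else 0.0,
    evolve=lambda h, b: h + "+edit",
    logprob=lambda m, tr: -1.5,
    grpo_update=lambda m, b: m + "'",
)
```

Log-probabilities are cached under the pre-update model, so they remain valid as the behavior policy when the same traces are reused in later rounds.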

Cross-harness GRPO (Section 5.3) groups all trajectories of the same task across harness versions:

$$\mathcal{G}_x = \{\tau_i \in \mathcal{B} \mid \text{task}(\tau_i) = x\} = \bigcup_k \{\tau \sim \text{Agent}(\mathcal{M}_k, H_k, x)\}.$$

The group-relative advantage is:

$$\hat{A}(\tau_i) = \frac{r_i - \mu(\mathcal{G}_x)}{\sigma(\mathcal{G}_x) + \epsilon}.$$
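
As a rough illustration of the grouping and normalization above, the following Python sketch computes group-relative advantages over a toy buffer. The tuple layout, the function name, the choice of the population standard deviation, and `eps = 1e-6` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def group_relative_advantages(buffer, eps=1e-6):
    """buffer: list of (task, harness_version, reward) tuples.

    Trajectories are grouped by task only, so rollouts of the same task under
    different harness versions share one mean and one standard deviation.
    """
    groups = defaultdict(list)
    for task, _version, reward in buffer:
        groups[task].append(reward)
    stats = {task: (fmean(rs), pstdev(rs)) for task, rs in groups.items()}
    return [
        (reward - stats[task][0]) / (stats[task][1] + eps)
        for task, _version, reward in buffer
    ]


toy_buffer = [
    ("x", 0, 0.0),  # task x failed under harness version 0
    ("x", 1, 1.0),  # task x solved under harness version 1
    ("y", 0, 1.0),  # task y has a single trajectory
]
advantages = group_relative_advantages(toy_buffer)
```

In the toy buffer, the version-1 rollout of task `x` gets a positive advantage and the version-0 rollout a negative one, while the lone trajectory of task `y` gets zero: a group needs reward variation to produce a training signal.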

The policy objective:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{x, \tau_i \sim \mathcal{B}} \left[ \min \left( \rho_i(\theta) \hat{A}(\tau_i),\, \text{clip}(\rho_i(\theta), 1-\epsilon_c, 1+\epsilon_c) \hat{A}(\tau_i) \right) \right] - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}),$$

where $\rho_i(\theta) = \frac{\pi_\theta(\tau_i \mid x)}{\pi_{\theta_{\text{old}}}(\tau_i \mid x)}$.
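
The per-trajectory term inside the expectation can be sketched in a few lines of Python. The function name is hypothetical, `eps_c = 0.2` is a commonly used clipping value rather than one reported here, and the KL penalty is deliberately left out.

```python
import math


def clipped_term(logp_new: float, logp_old: float, advantage: float,
                 eps_c: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps_c, 1 + eps_c) * A) for one trajectory."""
    ratio = math.exp(logp_new - logp_old)  # rho = pi_theta / pi_theta_old
    clipped = min(max(ratio, 1.0 - eps_c), 1.0 + eps_c)
    return min(ratio * advantage, clipped * advantage)


on_policy = clipped_term(-2.0, -2.0, 1.0)       # rho = 1, no clipping
capped_gain = clipped_term(-1.0, -2.0, 1.0)     # rho = e, positive advantage
full_penalty = clipped_term(-1.0, -2.0, -1.0)   # rho = e, negative advantage
```

With a positive advantage the term is capped at `(1 + eps_c) * A`, so a stale trajectory cannot pull the policy arbitrarily far; with a negative advantage the unclipped, more pessimistic value is kept.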

Off-policy training over the mixed-policy buffer (Section 5.4) reuses trajectories at no additional rollout cost; FIFO eviction bounds the model-version lag.
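
A minimal sketch of FIFO eviction, assuming a fixed-capacity buffer that tags each trace with the model version that produced it. The class and method names are hypothetical, and the lag bound shown holds only under the assumption that every round inserts at least one trace.

```python
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity buffer with first-in, first-out eviction."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently drops the oldest item when full.
        self._items = deque(maxlen=capacity)

    def add(self, trace: str, model_version: int) -> None:
        self._items.append((trace, model_version))

    def versions(self):
        return [version for _trace, version in self._items]

    def max_lag(self, current_version: int) -> int:
        """Largest gap between the current model and any stored trace's model."""
        return max((current_version - v for v in self.versions()), default=0)


replay = ReplayBuffer(capacity=3)
for version in range(5):  # one trace per model version 0..4
    replay.add(f"trace-{version}", version)

kept_versions = replay.versions()
lag = replay.max_lag(current_version=4)
```

With capacity 3 and one trace per round, traces from versions 0 and 1 are evicted, so the oldest behavior policy in the buffer is two versions behind the current model.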

Empirical Validation / Results

Experimental Setup. Five benchmarks (Table 3), three task-agent families (Claude Sonnet 4.6, GPT-5.4, Qwen3.5-9B), up to 15 evolution rounds with early stopping after 3 idle rounds. Meta-agent: Claude Opus 4.6.

Table 3: Benchmark characteristics.

| Benchmark | Domain | Sampled tasks | Verifier |
|---|---|---|---|
| GAIA (Level 1–3) | Multi-step retrieval | 103 | Exact match |
| ALFWorld | Embodied planning | 134 | Goal completion |
| WebShop | Web interaction | 100 | Attribute match |
| $\tau^3$-Bench | Multi-turn dialogue | 3 domains | Rule compliance |
| SWE-bench Verified | Software engineering | 55 | Patch resolution |

Main Results (Table 4). Evolution improves 14 of 15 configurations with an average gain of +14.5%. Gains range from +1.1% to +44.0%. Inverse scaling: weakest task agents gain most (Qwen3.5-9B on ALFWorld: +44.0%; on GAIA: +17.1%; on SWE-bench: +18.2%). GPT-5.4 on GAIA stagnates ($\Delta = 0.0$) due to task heterogeneity; variant isolation resolves this.

Table 4: Main results (pass@2 success rate, %). Evolved = peak accuracy.

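| Benchmark | Task agent | Initial | Evolved | $\Delta$ | Best round |
|---|---|---|---|---|---|
| ALFWorld | Sonnet 4.6 | 83.6 | 94.8 | +11.2 | 7 |
| ALFWorld | GPT-5.4 | 76.9 | 97.8 | +20.9 | 4 |
| ALFWorld | Qwen3.5-9B | 53.0 | 97.0 | +44.0 | 9 |
| WebShop | Sonnet 4.6 | 60.0 | 76.0 | +16.0 | 7 |
| WebShop | GPT-5.4 | 55.0 | 73.0 | +18.0 | 8 |
| WebShop | Qwen3.5-9B | 36.0 | 49.0 | +13.0 | 7 |
| GAIA | Sonnet 4.6 | 73.8 | 83.5 | +9.7 | 11 |
| GAIA | GPT-5.4 | 73.8 | 73.8 | 0.0 | 4 |
| GAIA | Qwen3.5-9B | 20.3 | 37.4 | +17.1 | 4 |
| SWE-bench | Sonnet 4.6 | 76.4 | 87.3 | +10.9 | 3 |
| SWE-bench | GPT-5.4 | 45.5 | 63.6 | +18.2 | 3 |
| SWE-bench | Qwen3.5-9B | 23.6 | 41.8 | +18.2 | 2 |
| $\tau^3$-Bench (Avg.) | Sonnet 4.6 | 89.6 | 95.0 | +5.4 | |
| $\tau^3$-Bench (Avg.) | GPT-5.4 | 76.2 | 90.7 | +14.5 | |
| $\tau^3$-Bench (Avg.) | Qwen3.5-9B | 93.5 | 94.6 | +1.1 | |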
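
The headline figures can be re-derived from the Δ column of Table 4. The short Python check below only transcribes the reported deltas; the variable names are illustrative.

```python
from statistics import fmean

# Delta column of Table 4, in row order (Sonnet 4.6, GPT-5.4, Qwen3.5-9B).
deltas = {
    "ALFWorld": [11.2, 20.9, 44.0],
    "WebShop": [16.0, 18.0, 13.0],
    "GAIA": [9.7, 0.0, 17.1],
    "SWE-bench": [10.9, 18.2, 18.2],
    "tau3-Bench (Avg.)": [5.4, 14.5, 1.1],
}

all_deltas = [d for row in deltas.values() for d in row]
average_gain = fmean(all_deltas)
improved = sum(d > 0 for d in all_deltas)

print(f"{improved} of {len(all_deltas)} improved, mean gain {average_gain:.1f}")
# prints: 14 of 15 improved, mean gain 14.5
```

The mean includes the one stagnating configuration (GPT-5.4 on GAIA), so +14.5 is an average over all 15 rows rather than over the 14 that improved.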

Evolution Strategy Comparison (Table 5). On GAIA with GPT-5.4, the Global strategy (single harness) peaks at 73.8% then collapses to 49.5% (final − peak: −24.3). Ensemble routing (up to $K$ variants) finishes at its peak of 87.4% with no degradation, using 25% fewer tokens.

Table 5: Evolution strategy comparison (GAIA, GPT-5.4, AEGIS, 15 rounds).

| Strategy | Final (%) | Peak (%) | Final − Peak | Tokens |
|---|---|---|---|---|
| Ensemble (up to K variants) | 87.4 | 87.4 | 0.0 | 107.8M |
| Global (single harness) | 49.5 | 73.8 | −24.3 | 143.7M |
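
As a quick check of the figures in Table 5 (plain arithmetic on the reported values, nothing new is measured):

$$\text{Final} - \text{Peak (Global)} = 49.5 - 73.8 = -24.3, \qquad 1 - \frac{107.8\text{M}}{143.7\text{M}} \approx 0.250.$$

Ensemble routing therefore uses about 25% fewer tokens than the Global strategy, matching the stated saving.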

Meta-Agent Effectiveness (Table 6). Replacing the four-stage AEGIS pipeline with a single-agent evolver (CC SDK) yields comparable accuracy (86.4% vs. 87.4%, within noise) but consumes ~14% more tokens.

Table 6: Meta-agent architecture comparison (GAIA, GPT-5.4, variant isolation, 15 rounds).

| Evolver | Accuracy (%) | Best round | Tokens |
|---|---|---|---|
| AEGIS | 87.4 | R14 | 107.8M |
| CC SDK | 86.4 | R12 | 123.1M |
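
The token overhead and the size of the accuracy gap in Table 6 can be checked by hand:

$$\frac{123.1\text{M}}{107.8\text{M}} - 1 \approx 0.142, \qquad \frac{90}{103} \approx 87.4\%, \qquad \frac{89}{103} \approx 86.4\%.$$

If the 103 sampled GAIA tasks of Table 3 form the denominator (an assumption; the summary does not state it), the two accuracies correspond to 90 and 89 solved tasks. A one-task gap is consistent with reading the difference as within noise.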

Co-Evolution (Section 6.5). Interleaving cross-harness GRPO with harness evolution yields an additional +4.7% average gain over harness-only evolution (GAIA: +4.3%, WebShop: +5.0%). Co-evolution breaks the scaffolding ceiling (Figure 5).

Failure Analysis (Section 6.6). All three predicted pathologies are empirically confirmed:

  • Reward hacking (GAIA Sonnet): verifier format exploit co-shipped with genuine improvement; caught at next round via trace analysis.
  • Catastrophic forgetting ($\tau^3$-Bench Telecom): five consecutive same-type prompt/processor edits accumulated sub-threshold coupling; a sixth edit triggered –14.0% regression. Self-corrected by R9.
  • Under-exploration (ALFWorld Sonnet): prompt-space exhaustion (ship-prediction accuracy dropped to 0%); pipeline later shifted to structural edits.

Theoretical and Practical Implications

  • Compositional structure matters for evolution (Section 7.1). Typed components make the intended scope of each edit explicit, enabling variant isolation. Without it, the Global strategy collapses from sub-threshold regression accumulation. The analogy: types don't generate correct programs but make incorrect ones detectable.
  • Trace richness bounds safe evolution (Section 7.2). Scalar reward alone cannot distinguish reward hacking, catastrophic forgetting, or under-exploration. Structured traces make pathologies diagnosable, but the $\tau^3$-Bench Telecom failure shows that accumulation below per-task detection thresholds can still cause damage.
  • Operational mirror as design heuristic (Section 7.3). It identifies failure modes to defend against but does not predict ordering, timing, or severity. Convergence guarantees from classical RL are unattainable in symbolic space.
  • Generalization across model families (Section 7.4). Inverse-scaling holds: gains are larger for weaker models regardless of family similarity to the meta-agent.
  • Cost-performance tradeoffs (Section 7.5). Evolution amortizes; e.g., 107.8M upfront tokens on GAIA amortize within ~1300 invocations. Per-task inference cost can decrease (GAIA: –25%) or increase (ALFWorld: +60%) depending on harness changes.

Table 7: Evolution cost summary.

| Experiment | Rounds | Total tokens | Gain |
|---|---|---|---|
| GAIA, GPT-5.4 (Global) | 15 | 143.7M | 0.0% (peak = initial) |
| GAIA, GPT-5.4 (Variant isolation) | 15 | 107.8M | +13.6% |
| ALFWorld, Sonnet 4.6 (Global) | 7 | 43.4M | +11.2% |

  • Ethical considerations (Section 7.6): auditability (every edit carries manifest), deterministic gating (seesaw constraint), human-in-the-loop for high-risk edits. The $\tau^3$-Bench Telecom failure shows a structural limitation of per-edit gating.

Conclusion

HarnessX demonstrates that agent progress need not come from model scaling alone. By treating the harness as a composable, adaptive, and evolvable first-class interface, the system achieves significant gains through:

  • Typed composition enabling variant isolation and stable evolution.
  • Trace-driven multi-agent evolution (AEGIS) that detects and mitigates RL-derived pathologies.
  • Harness-model co-evolution via cross-harness GRPO, breaking both the scaffolding ceiling and the training-signal ceiling.

Across five benchmarks and three model families, HarnessX yields an average gain of +14.5% (up to +44.0%), with co-evolution adding +4.7% beyond harness-only evolution. The results suggest that composing and evolving the runtime interface from execution feedback is a complementary and actionable lever, especially for capability-limited agents.

Limitations (Section 7.7): no held-out evaluation, discrete action spaces only, closed-source meta-agent required, joint-control assumption, limited benchmark coverage. Future work should address held-out generalization, continuous action spaces, open-weight meta-agents, and cross-team coordination mechanisms.
