Visual Summary | HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Summary (Overview)

HarnessX introduces a composable, adaptive, and evolvable agent harness foundry that treats the runtime interface between model and environment as a first-class, typed object.
It provides a nine-dimensional processor taxonomy and a substitution algebra for type-safe harness composition (Section 3).
AEGIS is a trace-driven, multi-agent evolution engine that maps harness adaptation onto RL constructs via an operational mirror (Section 4), addressing pathologies like reward hacking, catastrophic forgetting, and under-exploration.
Harness-model co-evolution interleaves harness evolution with model training over a shared replay buffer using cross-harness GRPO, yielding additional gains (Section 5).
Across five benchmarks (ALFWorld, GAIA, WebShop, $\tau^3$-Bench, SWE-bench Verified) and three model families, HarnessX achieves an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest.

Introduction and Theoretical Foundation

Background & Motivation: Modern AI agent performance depends critically on the runtime harness—the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Despite its importance, harness development remains:

Hand-crafted and static: each new model or task demands bespoke scaffolding.
Architecturally entangled: changes to one component silently break others.
Decoupled from model training: execution traces are discarded rather than used to improve either the harness or the model.

Theoretical Basis: The paper formalizes harness evolution as an MDP over symbolic artifacts via an operational mirror (Section 4.1). Key definitions:

Definition 1 (Harness Configuration). A harness configuration is a tuple $H = (c_1, c_2, \dots, c_9)$, where each $c_i \in \mathcal{C}_i$ instantiates one of the nine behavioral dimensions: model selection, context assembly, memory management, tool ecosystem, execution environment, evaluation and reward, control and safety, observability, and training bridge.

Definition 2 (Harness Edit). A harness edit is a function $e : \mathcal{H} \to \mathcal{H}$ that modifies one or more dimensions while preserving type contracts. The action space $\mathcal{E}$ is discrete but open-ended.

Definition 3 (Operational Mirror). The operational mirror is the tuple $(\mathcal{H}, \mathcal{E}, R, \mathcal{T})$, where $\mathcal{H}$ is the harness-configuration space (states), $\mathcal{E}$ is the code-level edit space (actions), $R : \mathcal{H} \times \mathcal{E} \to \mathbb{R}$ maps a configuration–edit pair to a scalar reward, and $\mathcal{T}$ is the trace store.
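Definitions 1 and 2 can be pictured with a minimal Python sketch. The field names, the string-valued dimensions, and the helpers `set_dimension` and `compose` are illustrative assumptions, not the paper's API; the point is only that a configuration is a nine-slot immutable record and an edit is a function from configurations to configurations.

```python
from dataclasses import dataclass, fields, replace
from typing import Callable


@dataclass(frozen=True)
class HarnessConfig:
    """One slot per behavioral dimension of Definition 1 (field names are hypothetical)."""
    model_selection: str
    context_assembly: str
    memory_management: str
    tool_ecosystem: str
    execution_environment: str
    evaluation_and_reward: str
    control_and_safety: str
    observability: str
    training_bridge: str


# A harness edit e : H -> H (Definition 2).
HarnessEdit = Callable[[HarnessConfig], HarnessConfig]

DIMENSIONS = tuple(f.name for f in fields(HarnessConfig))


def set_dimension(name: str, value: str) -> HarnessEdit:
    """Build an edit that replaces one dimension and leaves the other eight untouched."""
    if name not in DIMENSIONS:
        # Stand-in for a type-contract check: unknown dimensions are rejected up front.
        raise ValueError(f"unknown dimension: {name}")

    def edit(h: HarnessConfig) -> HarnessConfig:
        return replace(h, **{name: value})

    return edit


def compose(*edits: HarnessEdit) -> HarnessEdit:
    """Apply several edits left to right; the result is itself an edit."""
    def combined(h: HarnessConfig) -> HarnessConfig:
        for e in edits:
            h = e(h)
        return h

    return combined
```

Because the record is frozen, every edit returns a new configuration, so earlier harness versions remain available for comparison or rollback.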

Table 2: Operational Mirror: RL concepts and their symbolic-space duals in AEGIS.

RL concept	Symbolic-space dual	AEGIS realization
Policy $\pi$	Harness-update procedure $\pi_{\text{evo}}$	Four-stage pipeline (Section 4.3)
State $s_t$	$(H_t, \mathcal{T}_t)$	Harness configuration + trace store
Action $a_t$	Typed harness edit	Builder operation + change manifest
Feedback	Trace $\tau$ + verifier score $r$	Observability layer
Update	$H_{t+1} \leftarrow U(\tilde{H}_t, \mathcal{T}_t, r_t)$	Deterministic acceptance gate

This mapping identifies three RL pathologies—reward hacking, catastrophic forgetting, under-exploration—that reappear in amplified form in symbolic space and motivate AEGIS’s architectural defenses.

Methodology

Harness Composition (Section 3)

HarnessX structures the harness as $H = (\mathcal{M}, \mathcal{C})$, where $\mathcal{M}$ is the model configuration and $\mathcal{C} = (P, S)$ is the harness configuration. $P : \text{Hook} \to \text{List[Processor]}$ maps each of the eight lifecycle hooks to the list of processors attached to it.

Table 1: Hook points and their permitted modifications.

Hook	Event type	Permitted modifications
`task_start`	TaskStartEvent	system prompt
`step_start`	StepStartEvent	structural history edits
`before_model`	BeforeModelEvent	last user content; one user-message append
`after_model`	ModelResponseEvent	response content, tool calls
`before_tool`	ToolCallEvent	tool input, approval flag
`after_tool`	ToolResultEvent	tool result
`step_end`	StepEndEvent	read-only
`task_end`	TaskEndEvent	read-only

Processors follow a restricted interface, `async def process(self, event: Event) -> AsyncIterator[Event]`, which allows five outcomes: pass-through, transform, split, intercept, or interrupt. The nine-dimensional taxonomy (Section 3.3) classifies processors by behavioral dimension, and AEGIS edits span all nine.
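The processor interface can be illustrated with a small sketch using Python async generators. Only the `process` signature comes from the text; the `Event` fields, the three example processors, and the `run_chain` driver are assumptions made for illustration. Yielding the input unchanged is a pass-through, yielding a modified event is a transform, and yielding nothing intercepts the event.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class Event:
    """Hypothetical event payload; the real event types are listed in Table 1."""
    hook: str
    content: str


class PassThrough:
    async def process(self, event: Event) -> AsyncIterator[Event]:
        yield event  # forward the event unchanged


class RedactSecrets:
    async def process(self, event: Event) -> AsyncIterator[Event]:
        # Transform: emit a modified copy of the event.
        yield Event(event.hook, event.content.replace("SECRET", "[redacted]"))


class DropEmpty:
    async def process(self, event: Event) -> AsyncIterator[Event]:
        # Intercept: yield nothing when the content is blank.
        if event.content.strip():
            yield event


async def run_chain(processors, event: Event) -> List[Event]:
    """Feed each processor's outputs into the next one (assumed chaining rule)."""
    events = [event]
    for processor in processors:
        next_events: List[Event] = []
        for e in events:
            async for out in processor.process(e):
                next_events.append(out)
        events = next_events
    return events


if __name__ == "__main__":
    result = asyncio.run(
        run_chain([DropEmpty(), RedactSecrets()], Event("after_tool", "key=SECRET"))
    )
    print(result)
```

A processor that yields two or more events would realize the split outcome under the same driver.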

Harness Adaptation: AEGIS (Section 4)

AEGIS is a multi-agent evolution engine realized as a four-stage pipeline, all driven by the same meta-agent LLM:

Digester: compresses raw traces ($\sim$10M tokens per iteration on GAIA) into structured per-task summaries.
Planner: constructs an adaptation landscape to prevent under-exploration.
Evolver: generates typed builder operations with change manifests and smoke tests.
Critic & Deterministic Gate: defends against reward hacking and catastrophic forgetting via the seesaw constraint (no regression on previously passing tasks).
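The seesaw constraint lends itself to a short sketch. The version here assumes per-task pass/fail results keyed by task id and treats a task missing from the candidate run as failing; both choices, and the function names, are assumptions rather than details given in the text.

```python
from typing import Dict, List


def regressed_tasks(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Tasks that passed under the current harness but fail under the candidate."""
    return sorted(
        task
        for task, passed in baseline.items()
        if passed and not candidate.get(task, False)  # missing result counts as a failure
    )


def seesaw_gate(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> bool:
    """Deterministic acceptance: reject any candidate that regresses a passing task."""
    return not regressed_tasks(baseline, candidate)
```

Under this rule a candidate that fixes two tasks but breaks one previously passing task is rejected, even though its aggregate score is higher.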

The loop is formalized in Algorithm 1 (selective invocation with early stopping after $P$ idle rounds). Variant isolation via Ensemble routing (Section 4.5) maintains up to $K$ harness variants, routing tasks to the variant with highest estimated success rate. This prevents cross-task interference on heterogeneous benchmarks.
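Routing to the variant with the highest estimated success rate might look like the following sketch. Keying the estimate by a task type, the Laplace-smoothed estimator, and the `VariantRouter` class itself are assumptions; the text only states that up to $K$ variants are kept and that tasks go to the best-estimated one.

```python
from typing import Dict, List, Tuple


class VariantRouter:
    def __init__(self, max_variants: int) -> None:
        self.max_variants = max_variants  # K in the text
        self.variants: List[str] = []
        # (variant, task_type) -> (successes, trials)
        self.stats: Dict[Tuple[str, str], Tuple[int, int]] = {}

    def add_variant(self, name: str) -> None:
        if len(self.variants) >= self.max_variants:
            raise ValueError("variant budget K exhausted")
        self.variants.append(name)

    def record(self, variant: str, task_type: str, success: bool) -> None:
        s, n = self.stats.get((variant, task_type), (0, 0))
        self.stats[(variant, task_type)] = (s + int(success), n + 1)

    def estimate(self, variant: str, task_type: str) -> float:
        # Laplace smoothing (an assumed choice) gives unseen pairs a 0.5 prior.
        s, n = self.stats.get((variant, task_type), (0, 0))
        return (s + 1) / (n + 2)

    def route(self, task_type: str) -> str:
        # Ties resolve to the earliest-added variant because max() keeps the first maximum.
        return max(self.variants, key=lambda v: self.estimate(v, task_type))
```

Keeping statistics per variant means an edit that helps one task family cannot lower the estimate, or the behavior, of a variant serving another.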

Harness-Model Co-Evolution (Section 5)

The co-evolution iteration (Section 5.1) pairs harness evolution with model training over a shared replay buffer $\mathcal{B}$:

Rollout: Run $(\mathcal{M}_t, H_t)$ on $B_t$, record traces $\tau_i$.
Verification: Fixed verifier gives scalar rewards $r_i$.
Buffer insertion: Append scored traces with harness version.
Harness evolution: $H_{t+1} \leftarrow \text{AEGIS}(H_t, \mathcal{B})$.
Behavior log-probabilities: Cache $\pi_{\theta_{\text{old}}}(\tau_i)$.
GRPO update: $\mathcal{M}_{t+1} \leftarrow \text{GRPO}(\mathcal{M}_t, \mathcal{B})$.
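The ordering of these six steps can be sketched as a single function. Every callable passed in (`rollout`, `verify`, `evolve_harness`, `behavior_logprob`, `grpo_update`) is a hypothetical stand-in, as are the record fields and the dict-shaped harness; the sketch shows only the control flow described above, not the actual system.

```python
from typing import Any, Callable, Dict, List, Tuple


def co_evolution_iteration(
    model: Any,
    harness: Dict[str, Any],
    batch: List[str],
    buffer: List[Dict[str, Any]],
    *,
    rollout: Callable[[Any, Dict[str, Any], str], Any],
    verify: Callable[[str, Any], float],
    evolve_harness: Callable[[Dict[str, Any], List[Dict[str, Any]]], Dict[str, Any]],
    behavior_logprob: Callable[[Any, Any], float],
    grpo_update: Callable[[Any, List[Dict[str, Any]]], Any],
) -> Tuple[Any, Dict[str, Any]]:
    new_records = []
    for task in batch:
        trace = rollout(model, harness, task)            # rollout
        reward = verify(task, trace)                     # verification
        record = {
            "task": task,
            "trace": trace,
            "reward": reward,
            "harness_version": harness["version"],
        }
        buffer.append(record)                            # buffer insertion
        new_records.append(record)

    next_harness = evolve_harness(harness, buffer)       # harness evolution

    for record in new_records:                           # behavior log-probabilities
        record["logp_old"] = behavior_logprob(model, record["trace"])

    next_model = grpo_update(model, buffer)              # GRPO update
    return next_model, next_harness
```

The log-probabilities are cached under the model that generated the traces, before the weights change, which is what lets later updates form importance ratios for older trajectories.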

Cross-harness GRPO (Section 5.3) groups all trajectories of the same task across harness versions:

$$\mathcal{G}_x = \{\tau_i \in \mathcal{B} \mid \text{task}(\tau_i) = x\} = \bigcup_k \{\tau \sim \text{Agent}(\mathcal{M}_k, H_k, x)\}.$$

The group-relative advantage is:

$$\hat{A}(\tau_i) = \frac{r_i - \mu(\mathcal{G}_x)}{\sigma(\mathcal{G}_x) + \epsilon}.$$
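As a minimal sketch of the grouping and normalization, the function below pools rewards per task across harness versions and standardizes each reward within its group. The tuple layout of the buffer entries, the use of the population standard deviation, and the value of `eps` are assumptions.

```python
from collections import defaultdict
from statistics import fmean, pstdev
from typing import Dict, List, Tuple

# Assumed buffer entry layout: (task_id, harness_version, reward).
Entry = Tuple[str, int, float]


def group_advantages(buffer: List[Entry], eps: float = 1e-8) -> List[float]:
    """Return one group-relative advantage per buffer entry, in buffer order."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for task, _harness_version, reward in buffer:
        groups[task].append(reward)  # harness version is deliberately ignored when grouping

    stats = {task: (fmean(rs), pstdev(rs)) for task, rs in groups.items()}
    return [
        (reward - stats[task][0]) / (stats[task][1] + eps)
        for task, _harness_version, reward in buffer
    ]
```

A task that only one harness version solves gets a nonzero advantage here, whereas grouping within a single version would see identical rewards and produce no training signal.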

The policy objective:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{x, \tau_i \sim \mathcal{B}} \left[ \min \left( \rho_i(\theta) \hat{A}(\tau_i),\, \text{clip}(\rho_i(\theta), 1-\epsilon_c, 1+\epsilon_c) \hat{A}(\tau_i) \right) \right] - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}),$$

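where $\rho_i(\theta) = \frac{\pi_\theta(\tau_i | x)}{\pi_{\theta_{\text{old}}}(\tau_i | x)}$ is the importance ratio against the behavior policy, $\epsilon_c$ is the clipping range, and $\beta$ weights the KL penalty toward the reference policy $\pi_{\text{ref}}$.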
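The clipped term inside the expectation can be checked numerically with a few lines of Python. The sketch works on trajectory-level log-probabilities, omits the KL penalty, and uses `eps_c = 0.2` purely as an assumed default; the summary does not state the value used.

```python
import math


def clipped_term(logp_new: float, logp_old: float, advantage: float, eps_c: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps_c, 1 + eps_c) * A) for a single trajectory."""
    rho = math.exp(logp_new - logp_old)               # importance ratio from cached log-probs
    clipped = min(max(rho, 1.0 - eps_c), 1.0 + eps_c)
    return min(rho * advantage, clipped * advantage)
```

With a ratio of 1.5 and a positive advantage the term is capped at 1.2 times the advantage, while with a negative advantage the unclipped value is kept, so the objective never rewards moving further from the behavior policy than the clip range allows.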

Off-policy training over the mixed-policy buffer (Section 5.4) reuses trajectories at no additional rollout cost; FIFO eviction bounds the model-version lag.
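How FIFO eviction bounds model-version lag can be seen in a small sketch built on `collections.deque`. The `ReplayBuffer` class, its record fields, and the fixed capacity are illustrative assumptions.

```python
from collections import deque
from typing import Deque, Dict, Union

Record = Dict[str, Union[str, int, float]]


class ReplayBuffer:
    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently drops the oldest item when full: FIFO eviction.
        self._items: Deque[Record] = deque(maxlen=capacity)

    def add(self, trace_id: str, model_version: int, harness_version: int, reward: float) -> None:
        self._items.append(
            {
                "trace_id": trace_id,
                "model_version": model_version,
                "harness_version": harness_version,
                "reward": reward,
            }
        )

    def __len__(self) -> int:
        return len(self._items)

    def max_model_lag(self, current_model_version: int) -> int:
        """Largest gap between the current model and any model that produced a stored trace."""
        return max(
            (current_model_version - int(r["model_version"]) for r in self._items),
            default=0,
        )
```

If each iteration inserts a fixed number of traces, a buffer holding `capacity` traces can only contain data from the most recent `capacity / traces_per_iteration` model versions, which caps how stale the cached behavior policy can be.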

Empirical Validation / Results

Experimental Setup. Five benchmarks (Table 3), three task-agent families (Claude Sonnet 4.6, GPT-5.4, Qwen3.5-9B), up to 15 evolution rounds with early stopping after 3 idle rounds. Meta-agent: Claude Opus 4.6.
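The round budget and stopping rule can be sketched as a simple loop. The sketch assumes an idle round is one that does not improve on the best score so far and that the idle counter resets after each improvement; `score_for_round` is a hypothetical stand-in for running and evaluating one evolution round.

```python
from typing import Callable, Tuple


def evolve_with_early_stopping(
    initial_score: float,
    score_for_round: Callable[[int], float],
    max_rounds: int = 15,
    patience: int = 3,
) -> Tuple[float, int, int]:
    """Return (best score, best round, rounds actually run)."""
    best, best_round, idle, rounds_run = initial_score, 0, 0, 0
    for rnd in range(1, max_rounds + 1):
        rounds_run = rnd
        score = score_for_round(rnd)
        if score > best:
            best, best_round, idle = score, rnd, 0   # improvement resets the idle counter
        else:
            idle += 1
            if idle >= patience:                      # stop after `patience` idle rounds in a row
                break
    return best, best_round, rounds_run
```

Under these assumptions a run whose last improvement lands in round 2 stops after round 5, even if a later round would have improved again.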

Table 3: Benchmark characteristics.

Benchmark	Domain	Sampled Tasks	Verifier
GAIA (Level 1–3)	Multi-step retrieval	103	Exact match
ALFWorld	Embodied planning	134	Goal completion
WebShop	Web interaction	100	Attribute match
$\tau^3$-Bench	Multi-turn dialogue	3 domains	Rule compliance
SWE-bench Verified	Software engineering	55	Patch resolution

Main Results (Table 4). Evolution improves 14 of 15 configurations, with an average gain of +14.5% over all 15 (absolute change in pass@2 success rate). Nonzero gains range from +1.1% to +44.0%. Inverse scaling: the weakest task agent often gains most (Qwen3.5-9B on ALFWorld: +44.0%; on GAIA: +17.1%; on SWE-bench: +18.2%). GPT-5.4 on GAIA stagnates ($\Delta = 0.0$) due to task heterogeneity; variant isolation resolves this.

Table 4: Main results (pass@2 success rate, %). Evolved = peak accuracy.

Benchmark	Task agent	Initial	Evolved	$\Delta$	Best round
ALFWorld	Sonnet 4.6	83.6	94.8	+11.2	7
	GPT-5.4	76.9	97.8	+20.9	4
	Qwen3.5-9B	53.0	97.0	+44.0	9
WebShop	Sonnet 4.6	60.0	76.0	+16.0	7
	GPT-5.4	55.0	73.0	+18.0	8
	Qwen3.5-9B	36.0	49.0	+13.0	7
GAIA	Sonnet 4.6	73.8	83.5	+9.7	11
	GPT-5.4	73.8	73.8	0.0	4
	Qwen3.5-9B	20.3	37.4	+17.1	4
SWE-bench	Sonnet 4.6	76.4	87.3	+10.9	3
	GPT-5.4	45.5	63.6	+18.2	3
	Qwen3.5-9B	23.6	41.8	+18.2	2
$\tau^3$-Bench (Avg.)	Sonnet 4.6	89.6	95.0	+5.4	–
	GPT-5.4	76.2	90.7	+14.5	–
	Qwen3.5-9B	93.5	94.6	+1.1	–
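The headline figures can be reproduced from the $\Delta$ column of Table 4 with a short check. The values below are copied from the table in row order; nothing else is assumed.

```python
# Delta column of Table 4, in row order (ALFWorld, WebShop, GAIA, SWE-bench, tau^3-Bench).
deltas = [
    11.2, 20.9, 44.0,   # ALFWorld: Sonnet 4.6, GPT-5.4, Qwen3.5-9B
    16.0, 18.0, 13.0,   # WebShop
    9.7, 0.0, 17.1,     # GAIA
    10.9, 18.2, 18.2,   # SWE-bench
    5.4, 14.5, 1.1,     # tau^3-Bench (Avg.)
]

mean_gain = sum(deltas) / len(deltas)
improved = sum(d > 0 for d in deltas)

print(f"configurations improved: {improved} of {len(deltas)}")
print(f"mean gain: {mean_gain:.1f}")
print(f"largest gain: {max(deltas):.1f}")
```

The mean is taken over all 15 configurations, including the GPT-5.4 GAIA row with no gain.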

Evolution Strategy Comparison (Table 5). On GAIA with GPT-5.4, the Global strategy (single harness) peaks at 73.8% and then collapses to 49.5% (final − peak: −24.3). Ensemble routing (up to $K$ variants) finishes at its peak of 87.4% with no degradation, using 25% fewer tokens (107.8M vs. 143.7M).

Table 5: Evolution strategy comparison (GAIA, GPT-5.4, AEGIS, 15 rounds).

Strategy	Final (%)	Peak (%)	Final − Peak	Tokens
Ensemble (up to K variants)	87.4	87.4	0.0	107.8M
Global (single harness)	49.5	73.8	−24.3	143.7M

Meta-Agent Effectiveness (Table 6). Replacing the four-stage AEGIS pipeline with a single-agent evolver (CC SDK) yields comparable accuracy (86.4% vs. 87.4%, within noise) but consumes ~14% more tokens.

Table 6: Meta-agent architecture comparison (GAIA, GPT-5.4, variant isolation, 15 rounds).

Evolver	Accuracy (%)	Best round	Tokens
AEGIS	87.4	R14	107.8M
CC SDK	86.4	R12	123.1M
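The relative token figures quoted for Tables 5 and 6 follow directly from the token columns, as this short check shows. Only the three token totals from the tables are used.

```python
# Token totals in millions, from Tables 5 and 6 (GAIA, GPT-5.4, 15 rounds).
ensemble_tokens = 107.8   # AEGIS with Ensemble routing
global_tokens = 143.7     # AEGIS with the Global strategy
cc_sdk_tokens = 123.1     # single-agent evolver with variant isolation

ensemble_saving = 1 - ensemble_tokens / global_tokens   # fraction saved vs. Global
cc_sdk_overhead = cc_sdk_tokens / ensemble_tokens - 1   # extra fraction used by CC SDK

print(f"Ensemble vs. Global: {ensemble_saving:.1%} fewer tokens")
print(f"CC SDK vs. AEGIS: {cc_sdk_overhead:.1%} more tokens")
```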

Co-Evolution (Section 6.5). Interleaving cross-harness GRPO with harness evolution yields an additional +4.7% average gain over harness-only evolution (GAIA: +4.3%, WebShop: +5.0%). Co-evolution breaks the scaffolding ceiling (Figure 5).

Failure Analysis (Section 6.6). All three predicted pathologies are empirically confirmed:

Reward hacking (GAIA Sonnet): verifier format exploit co-shipped with genuine improvement; caught at next round via trace analysis.
Catastrophic forgetting ($\tau^3$-Bench Telecom): five consecutive same-type prompt/processor edits accumulated sub-threshold coupling; a sixth edit triggered a −14.0% regression. Self-corrected by R9.
Under-exploration (ALFWorld Sonnet): prompt-space exhaustion (ship-prediction accuracy dropped to 0%); pipeline later shifted to structural edits.

Theoretical and Practical Implications

Compositional structure matters for evolution (Section 7.1). Typed components make the intended scope of each edit explicit, enabling variant isolation. Without it, the Global strategy collapses as sub-threshold regressions accumulate. The analogy: types don't generate correct programs but make incorrect ones detectable.
Trace richness bounds safe evolution (Section 7.2). Scalar reward alone cannot distinguish reward hacking, catastrophic forgetting, or under-exploration. Structured traces make pathologies diagnosable, but the $\tau^3$-Bench Telecom failure shows that accumulation below per-task detection thresholds can still cause damage.
Operational mirror as design heuristic (Section 7.3). It identifies failure modes to defend against but does not predict ordering, timing, or severity. Convergence guarantees from classical RL are unattainable in symbolic space.
Generalization across model families (Section 7.4). Inverse-scaling holds: gains are larger for weaker models regardless of family similarity to the meta-agent.
Cost-performance tradeoffs (Section 7.5). Evolution amortizes; e.g., 107.8M upfront tokens on GAIA amortize within ~1300 invocations. Per-task inference cost can decrease (GAIA: –25%) or increase (ALFWorld: +60%) depending on harness changes.

Table 7: Evolution cost summary.

Experiment	Rounds	Total Tokens	Gain
GAIA, GPT-5.4 (Global)	15	143.7M	0.0% (peak = initial)
GAIA, GPT-5.4 (Variant isolation)	15	107.8M	+13.6%
ALFWorld, Sonnet 4.6 (Global)	7	43.4M	+11.2%
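The amortization claim can be read as a break-even calculation, sketched here under a strong assumption: that evolution cost is recovered through a constant token saving per invocation. The summary does not say how the ~1300 figure was derived, so the implied saving printed by this sketch is an inference from the two quoted numbers, not a reported result.

```python
import math


def break_even_invocations(upfront_tokens: float, saving_per_invocation: float) -> int:
    """Invocations needed before cumulative savings cover the upfront evolution cost."""
    if saving_per_invocation <= 0:
        raise ValueError("no break-even point without a positive per-invocation saving")
    return math.ceil(upfront_tokens / saving_per_invocation)


def implied_saving(upfront_tokens: float, invocations: int) -> float:
    """Per-invocation saving implied by a stated break-even point."""
    return upfront_tokens / invocations


upfront = 107.8e6  # GAIA, GPT-5.4, variant isolation (Table 7)
print(f"implied saving per invocation: {implied_saving(upfront, 1300):,.0f} tokens")
```

Where the evolved harness raises per-task cost, as reported for ALFWorld, this model has no break-even point and the cost has to be justified by accuracy instead.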

Ethical considerations (Section 7.6): auditability (every edit carries a manifest), deterministic gating (seesaw constraint), human-in-the-loop for high-risk edits. The $\tau^3$-Bench Telecom failure shows a structural limitation of per-edit gating.

Conclusion

HarnessX demonstrates that agent progress need not come from model scaling alone. By treating the harness as a composable, adaptive, and evolvable first-class interface, the system achieves significant gains through:

Typed composition enabling variant isolation and stable evolution.
Trace-driven multi-agent evolution (AEGIS) that detects and mitigates RL-derived pathologies.
Harness-model co-evolution via cross-harness GRPO, breaking both the scaffolding ceiling and the training-signal ceiling.

Across five benchmarks and three model families, HarnessX yields an average gain of +14.5% (up to +44.0%), with co-evolution adding +4.7% beyond harness-only evolution. The results suggest that composing and evolving the runtime interface from execution feedback is a complementary and actionable lever, especially for capability-limited agents.

Limitations (Section 7.7): no held-out evaluation, discrete action spaces only, closed-source meta-agent required, joint-control assumption, limited benchmark coverage. Future work should address held-out generalization, continuous action spaces, open-weight meta-agents, and cross-team coordination mechanisms.