Summary (Overview)
- HarnessX introduces a composable, adaptive, and evolvable agent harness foundry that treats the runtime interface between model and environment as a first-class, typed object.
- It provides a nine-dimensional processor taxonomy and a substitution algebra for type-safe harness composition (Section 3).
- AEGIS is a trace-driven, multi-agent evolution engine that maps harness adaptation onto RL constructs via an operational mirror (Section 4), addressing pathologies like reward hacking, catastrophic forgetting, and under-exploration.
- Harness-model co-evolution interleaves harness evolution with model training over a shared replay buffer using cross-harness GRPO, yielding additional gains (Section 5).
- Across five benchmarks (ALFWorld, GAIA, WebShop, -Bench, SWE-bench Verified) and three model families, HarnessX achieves an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest.
Introduction and Theoretical Foundation
Background & Motivation: Modern AI agent performance depends critically on the runtime harness—the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Despite its importance, harness development remains:
- Hand-crafted and static: each new model or task demands bespoke scaffolding.
- Architecturally entangled: changes to one component silently break others.
- Decoupled from model training: execution traces are discarded rather than used to improve either the harness or the model.
Theoretical Basis: The paper formalizes harness evolution as an MDP over symbolic artifacts via an operational mirror (Section 4.1). Key definitions:
Definition 1 (Harness Configuration). A harness configuration is a tuple with one component per behavioral dimension. Each component instantiates one of the nine dimensions: model selection, context assembly, memory management, tool ecosystem, execution environment, evaluation and reward, control and safety, observability, and training bridge.
Definition 2 (Harness Edit). A harness edit is a function that modifies one or more dimensions while preserving type contracts. The action space is discrete but open-ended.
Definition 3 (Operational Mirror). The operational mirror is a four-element tuple: the harness-configuration space (states), the code-level edit space (actions), a reward function that maps a configuration–edit pair to a scalar reward, and the trace store.
Table 2: Operational Mirror: RL concepts and their symbolic-space duals in AEGIS.
| RL concept | Symbolic-space dual | AEGIS realization |
|---|---|---|
| Policy | Harness-update procedure | Four-stage pipeline (Section 4.3) |
| State | Harness configuration + trace store | |
| Action | Typed harness edit | Builder operation + change manifest |
| Feedback | Trace + verifier score | Observability layer |
| Update | Deterministic acceptance gate |
This mapping identifies three RL pathologies—reward hacking, catastrophic forgetting, under-exploration—that reappear in amplified form in symbolic space and motivate AEGIS’s architectural defenses.
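Definitions 1 and 2 can be illustrated with a minimal sketch. Everything named here (`HarnessConfig`, `apply_edit`, the string-valued components) is hypothetical and chosen for illustration; in particular, "preserving type contracts" is approximated by a plain Python type check, which is an assumption rather than the paper's mechanism.

```python
from dataclasses import dataclass

# The nine behavioral dimensions listed in Definition 1.
DIMENSIONS = (
    "model_selection", "context_assembly", "memory_management",
    "tool_ecosystem", "execution_environment", "evaluation_and_reward",
    "control_and_safety", "observability", "training_bridge",
)


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical configuration: one component per dimension."""
    components: dict

    def __post_init__(self):
        missing = set(DIMENSIONS) - set(self.components)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")


def apply_edit(config: HarnessConfig, changes: dict) -> HarnessConfig:
    """Return a new configuration with one or more dimensions replaced.

    Assumption: a "type contract" is modeled as "the new component has the
    same Python type as the one it replaces".
    """
    for dimension, new_value in changes.items():
        if dimension not in DIMENSIONS:
            raise KeyError(f"unknown dimension: {dimension}")
        old_value = config.components[dimension]
        if type(new_value) is not type(old_value):
            raise TypeError(f"{dimension}: contract violated")
    # The original configuration is left untouched.
    return HarnessConfig({**config.components, **changes})
```

Returning a new configuration instead of mutating the old one keeps every earlier harness version available for comparison, which is what a trace store keyed by harness version would need.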
Methodology
Harness Composition (Section 3)
HarnessX structures the harness as a pair consisting of a model configuration and a harness configuration. Processors are organized as a hook-indexed list attached to eight lifecycle events.
Table 1: Hook points and their permitted modifications.
| Hook | Event type | Permitted modifications |
|---|---|---|
| task_start | TaskStartEvent | system prompt |
| step_start | StepStartEvent | structural history edits |
| before_model | BeforeModelEvent | last user content; one user-message append |
| after_model | ModelResponseEvent | response content, tool calls |
| before_tool | ToolCallEvent | tool input, approval flag |
| after_tool | ToolResultEvent | tool result |
| step_end | StepEndEvent | read-only |
| task_end | TaskEndEvent | read-only |
Processors follow a restricted interface: async def process(self, event: Event) -> AsyncIterator[Event], enabling pass-through, transform, split, intercept, or interrupt outcomes. The nine-dimensional taxonomy (Section 3.3) spans all behavioral dimensions; AEGIS edits span all nine.
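The processor interface can be sketched as an async generator: how many events a processor yields determines the outcome. The `Event` class, the three example processors, and the `run_chain` driver below are hypothetical stand-ins, and mapping "intercept" to "yield nothing" is an assumption made for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Event:
    """Hypothetical event type; the real event classes are not shown here."""
    name: str
    payload: str


class PassThrough:
    """Pass-through: yields the event unchanged."""
    async def process(self, event: Event) -> AsyncIterator[Event]:
        yield event


class RedactSecrets:
    """Transform: yields a modified copy of the event."""
    async def process(self, event: Event) -> AsyncIterator[Event]:
        yield Event(event.name, event.payload.replace("SECRET", "[redacted]"))


class DropEmpty:
    """Intercept (assumed semantics): yields nothing for empty payloads."""
    async def process(self, event: Event) -> AsyncIterator[Event]:
        if event.payload:
            yield event


async def run_chain(processors, event: Event) -> list:
    """Feed every event produced by one processor into the next one."""
    events = [event]
    for processor in processors:
        produced = []
        for current in events:
            async for out in processor.process(current):
                produced.append(out)
        events = produced
    return events


def run(processors, event: Event) -> list:
    """Synchronous convenience wrapper around run_chain."""
    return asyncio.run(run_chain(processors, event))
```

A processor that yields two or more events would correspond to the split outcome; the driver handles that case without changes because it collects every yielded event.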
Harness Adaptation: AEGIS (Section 4)
AEGIS is a multi-agent evolution engine realized as a four-stage pipeline whose stages are all driven by the same meta-agent LLM:
- Digester: compresses raw traces (10M tokens per iteration on GAIA) into structured per-task summaries.
- Planner: constructs an adaptation landscape to prevent under-exploration.
- Evolver: generates typed builder operations with change manifests and smoke tests.
- Critic & Deterministic Gate: defends against reward hacking and catastrophic forgetting via the seesaw constraint (no regression on previously passing tasks).
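The seesaw constraint named in the last stage can be sketched as a deterministic check over per-task pass/fail results. The function names and the dictionary representation are hypothetical, and treating a task that is missing from the new results as a failure is an assumption.

```python
def regressions(before: dict, after: dict) -> list:
    """Tasks that passed under the old harness but do not pass under the new one.

    Assumption: a task absent from `after` counts as failing.
    """
    return sorted(
        task for task, passed in before.items()
        if passed and not after.get(task, False)
    )


def seesaw_gate(before: dict, after: dict) -> bool:
    """Accept an edit only if no previously passing task regresses."""
    return not regressions(before, after)
```

Because the gate is a pure function of recorded outcomes, the same inputs always produce the same accept or reject decision, independent of any LLM judgment.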
The loop is formalized in Algorithm 1 (selective invocation with early stopping after a set number of idle rounds). Variant isolation via Ensemble routing (Section 4.5) maintains up to K harness variants and routes each task to the variant with the highest estimated success rate, which prevents cross-task interference on heterogeneous benchmarks.
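Routing to the variant with the highest estimated success rate can be sketched as follows. The `VariantRouter` class is hypothetical; keying the estimates by a task type and using a Laplace-smoothed success frequency are both assumptions, since the estimator is not specified here.

```python
from collections import defaultdict


class VariantRouter:
    """Hypothetical router over a bounded set of harness variants."""

    def __init__(self, max_variants: int):
        self.max_variants = max_variants
        self.stats = {}  # variant -> task_type -> [successes, attempts]

    def add_variant(self, name: str) -> None:
        if len(self.stats) >= self.max_variants:
            raise ValueError("variant limit reached")
        self.stats[name] = defaultdict(lambda: [0, 0])

    def record(self, variant: str, task_type: str, success: bool) -> None:
        counts = self.stats[variant][task_type]
        counts[0] += int(success)
        counts[1] += 1

    def estimate(self, variant: str, task_type: str) -> float:
        successes, attempts = self.stats[variant][task_type]
        # Laplace smoothing: unseen pairs start at 0.5 (an assumption).
        return (successes + 1) / (attempts + 2)

    def route(self, task_type: str) -> str:
        """Pick the variant with the highest estimated success rate."""
        return max(self.stats, key=lambda v: self.estimate(v, task_type))
```

Since each variant keeps its own statistics, an edit that helps one task type only changes routing for that type, which is the isolation property the paragraph above describes.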
Harness-Model Co-Evolution (Section 5)
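The co-evolution iteration (Section 5.1) pairs harness evolution with model training over a shared replay buffer:
- Rollout: run the current model under the current harness on the task set and record the traces.
- Verification: a fixed verifier assigns each trace a scalar reward.
- Buffer insertion: append the scored traces to the buffer, tagged with the harness version.
- Harness evolution: evolve the harness from the recorded traces to obtain the next harness version.
- Behavior log-probabilities: cache the log-probabilities of the policy that generated each trajectory.
- GRPO update: update the model on the buffer with cross-harness GRPO.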
Cross-harness GRPO (Section 5.3) groups all trajectories of the same task across harness versions. Within each group it computes a group-relative advantage from the verifier rewards, and the policy objective is then optimized over these advantages.
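For orientation, the standard GRPO form that this description appears to follow can be written as below. The symbols ($\mathcal{G}(x)$, $r_i$, $h_i$, $\rho_i$, $\epsilon$, $\delta$, $\pi_b$) are chosen here for illustration and are not necessarily the paper's notation; the sequence-level ratio and the absence of a KL term are simplifying assumptions.

```latex
% Group: every buffered trajectory for task x, regardless of harness version h_i
\mathcal{G}(x) = \{ (\tau_i, r_i, h_i) : \tau_i \text{ is a trajectory for task } x \}

% Group-relative advantage (delta is a small constant for numerical stability)
\hat{A}_i = \frac{r_i - \operatorname{mean}\{ r_j : j \in \mathcal{G}(x) \}}
                 {\operatorname{std}\{ r_j : j \in \mathcal{G}(x) \} + \delta}

% Clipped policy objective
J(\theta) = \mathbb{E}_{x} \left[ \frac{1}{|\mathcal{G}(x)|} \sum_{i \in \mathcal{G}(x)}
    \min\!\left( \rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, \hat{A}_i \right) \right]

% Importance ratio against the cached behavior log-probabilities
\rho_i = \frac{\pi_\theta(\tau_i)}{\pi_b(\tau_i)}
```

Under this reading, the only change from single-harness GRPO is the definition of the group: rewards are normalized across harness versions for the same task, so a trajectory is credited relative to what any harness achieved on that task.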
Off-policy training over the mixed-policy buffer (Section 5.4) reuses trajectories at no additional rollout cost; FIFO eviction bounds the model-version lag.
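A minimal sketch of a FIFO-bounded buffer with cross-harness grouping follows. The `ReplayBuffer` class and `group_relative_advantages` function are hypothetical, and the mean/standard-deviation normalization is the standard GRPO choice assumed here rather than a detail confirmed by this summary.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class ReplayBuffer:
    """Hypothetical shared buffer; deque(maxlen=...) evicts the oldest entry."""

    def __init__(self, capacity: int):
        self.items = deque(maxlen=capacity)

    def add(self, task_id: str, harness_version: int, reward: float) -> None:
        self.items.append((task_id, harness_version, reward))


def group_relative_advantages(buffer: ReplayBuffer, eps: float = 1e-6) -> dict:
    """Map buffer index -> advantage, grouping by task across harness versions."""
    groups = defaultdict(list)
    for index, (task_id, _version, reward) in enumerate(buffer.items):
        groups[task_id].append((index, reward))

    advantages = {}
    for members in groups.values():
        rewards = [reward for _, reward in members]
        center, spread = mean(rewards), pstdev(rewards)
        for index, reward in members:
            # eps avoids division by zero when all rewards in a group agree.
            advantages[index] = (reward - center) / (spread + eps)
    return advantages
```

The harness version is stored but deliberately ignored when forming groups, so a success under a newer harness and a failure under an older one on the same task contribute opposite-signed advantages to the same update.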
Empirical Validation / Results
Experimental Setup. Five benchmarks (Table 3), three task-agent families (Claude Sonnet 4.6, GPT-5.4, Qwen3.5-9B), up to 15 evolution rounds with early stopping after 3 idle rounds. Meta-agent: Claude Opus 4.6.
Table 3: Benchmark characteristics.
| Benchmark | Domain | Sampled Tasks | Verifier |
|---|---|---|---|
| GAIA (Level 1–3) | Multi-step retrieval | 103 | Exact match |
| ALFWorld | Embodied planning | 134 | Goal completion |
| WebShop | Web interaction | 100 | Attribute match |
| -Bench | Multi-turn dialogue | 3 domains | Rule compliance |
| SWE-bench Verified | Software engineering | 55 | Patch resolution |
Main Results (Table 4). Evolution improves 14 of 15 configurations, with an average gain of +14.5%. Among the improved configurations, gains range from +1.1% to +44.0%. The results show inverse scaling: the weakest task agent gains the most (Qwen3.5-9B: +44.0% on ALFWorld, +17.1% on GAIA, +18.2% on SWE-bench). GPT-5.4 on GAIA stagnates (0.0) because of task heterogeneity; variant isolation resolves this.
Table 4: Main results (pass@2 success rate, %). Evolved = peak accuracy.
| Benchmark | Task agent | Initial | Evolved | Gain | Best round |
|---|---|---|---|---|---|
| ALFWorld | Sonnet 4.6 | 83.6 | 94.8 | +11.2 | 7 |
| | GPT-5.4 | 76.9 | 97.8 | +20.9 | 4 |
| | Qwen3.5-9B | 53.0 | 97.0 | +44.0 | 9 |
| WebShop | Sonnet 4.6 | 60.0 | 76.0 | +16.0 | 7 |
| | GPT-5.4 | 55.0 | 73.0 | +18.0 | 8 |
| | Qwen3.5-9B | 36.0 | 49.0 | +13.0 | 7 |
| GAIA | Sonnet 4.6 | 73.8 | 83.5 | +9.7 | 11 |
| | GPT-5.4 | 73.8 | 73.8 | 0.0 | 4 |
| | Qwen3.5-9B | 20.3 | 37.4 | +17.1 | 4 |
| SWE-bench | Sonnet 4.6 | 76.4 | 87.3 | +10.9 | 3 |
| | GPT-5.4 | 45.5 | 63.6 | +18.2 | 3 |
| | Qwen3.5-9B | 23.6 | 41.8 | +18.2 | 2 |
| -Bench (Avg.) | Sonnet 4.6 | 89.6 | 95.0 | +5.4 | – |
| | GPT-5.4 | 76.2 | 90.7 | +14.5 | – |
| | Qwen3.5-9B | 93.5 | 94.6 | +1.1 | – |
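The headline figures quoted for Table 4 can be re-derived from the per-row gains. The snippet below copies the fifteen gain values from the table in row order; the variable names are illustrative only.

```python
# Gains from Table 4, in row order (percentage points of pass@2 success rate).
gains = [
    11.2, 20.9, 44.0,   # ALFWorld: Sonnet 4.6, GPT-5.4, Qwen3.5-9B
    16.0, 18.0, 13.0,   # WebShop
    9.7, 0.0, 17.1,     # GAIA
    10.9, 18.2, 18.2,   # SWE-bench
    5.4, 14.5, 1.1,     # -Bench (Avg.)
]

average_gain = sum(gains) / len(gains)
improved = sum(1 for g in gains if g > 0)
positive = [g for g in gains if g > 0]

print(f"average gain: {average_gain:.1f}")
print(f"improved configurations: {improved} of {len(gains)}")
print(f"range among improved: {min(positive)} to {max(positive)}")
```

The average is taken over all fifteen configurations, including the one with no gain, which is consistent with the reported +14.5%.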
Evolution Strategy Comparison (Table 5). On GAIA with GPT-5.4, the Global strategy (single harness) peaks at 73.8% and then collapses to 49.5% (final − peak: −24.3). Ensemble routing (up to K variants) reaches 87.4%, with final accuracy equal to peak accuracy (no degradation), while using 25% fewer tokens.
Table 5: Evolution strategy comparison (GAIA, GPT-5.4, AEGIS, 15 rounds).
| Strategy | Final (%) | Peak (%) | Final − Peak | Tokens |
|---|---|---|---|---|
| Ensemble (up to K variants) | 87.4 | 87.4 | 0.0 | 107.8M |
| Global (single harness) | 49.5 | 73.8 | −24.3 | 143.7M |
Meta-Agent Effectiveness (Table 6). Replacing the four-stage AEGIS pipeline with a single-agent evolver (CC SDK) yields comparable accuracy (86.4% vs. 87.4%, within noise) but consumes ~14% more tokens.
Table 6: Meta-agent architecture comparison (GAIA, GPT-5.4, variant isolation, 15 rounds).
| Evolver | Accuracy (%) | Best round | Tokens |
|---|---|---|---|
| AEGIS | 87.4 | R14 | 107.8M |
| CC SDK | 86.4 | R12 | 123.1M |
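The token comparisons quoted for Tables 5 and 6 follow from simple ratios of the reported totals. The helper name `relative_change` is illustrative; the three token counts are copied from the tables.

```python
def relative_change(new: float, base: float) -> float:
    """Fractional change of `new` relative to `base`."""
    return (new - base) / base


# Total tokens in millions, as reported in Tables 5 and 6.
ensemble_tokens = 107.8   # AEGIS with Ensemble routing
global_tokens = 143.7     # AEGIS with the Global strategy
cc_sdk_tokens = 123.1     # single-agent evolver (CC SDK)

ensemble_vs_global = relative_change(ensemble_tokens, global_tokens)
cc_sdk_vs_aegis = relative_change(cc_sdk_tokens, ensemble_tokens)

print(f"Ensemble vs Global: {ensemble_vs_global:+.1%}")
print(f"CC SDK vs AEGIS:    {cc_sdk_vs_aegis:+.1%}")
```

Note that the two percentages use different baselines: the first is measured against the Global strategy, the second against AEGIS with variant isolation.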
Co-Evolution (Section 6.5). Interleaving cross-harness GRPO with harness evolution yields an additional +4.7% average gain over harness-only evolution (GAIA: +4.3%, WebShop: +5.0%). Co-evolution breaks the scaffolding ceiling (Figure 5).
Failure Analysis (Section 6.6). All three predicted pathologies are empirically confirmed:
- Reward hacking (GAIA Sonnet): verifier format exploit co-shipped with genuine improvement; caught at next round via trace analysis.
- Catastrophic forgetting (-Bench Telecom): five consecutive same-type prompt/processor edits accumulated sub-threshold coupling; a sixth edit triggered a –14.0% regression, which self-corrected by R9.
- Under-exploration (ALFWorld Sonnet): prompt-space exhaustion (ship-prediction accuracy dropped to 0%); pipeline later shifted to structural edits.
Theoretical and Practical Implications
- Compositional structure matters for evolution (Section 7.1). Typed components make the intended scope of each edit explicit, enabling variant isolation. Without it, the Global strategy collapses from sub-threshold regression accumulation. The analogy: types don't generate correct programs but make incorrect ones detectable.
- Trace richness bounds safe evolution (Section 7.2). Scalar reward alone cannot distinguish reward hacking, catastrophic forgetting, or under-exploration. Structured traces make pathologies diagnosable, but the -Bench Telecom failure shows that accumulation below per-task detection thresholds can still cause damage.
- Operational mirror as design heuristic (Section 7.3). It identifies failure modes to defend against but does not predict ordering, timing, or severity. Convergence guarantees from classical RL are unattainable in symbolic space.
- Generalization across model families (Section 7.4). Inverse-scaling holds: gains are larger for weaker models regardless of family similarity to the meta-agent.
- Cost-performance tradeoffs (Section 7.5). Evolution amortizes; e.g., 107.8M upfront tokens on GAIA amortize within ~1300 invocations. Per-task inference cost can decrease (GAIA: –25%) or increase (ALFWorld: +60%) depending on harness changes.
Table 7: Evolution cost summary.
| Experiment | Rounds | Total Tokens | Gain |
|---|---|---|---|
| GAIA, GPT-5.4 (Global) | 15 | 143.7M | 0.0% (peak = initial) |
| GAIA, GPT-5.4 (Variant isolation) | 15 | 107.8M | +13.6% |
| ALFWorld, Sonnet 4.6 (Global) | 7 | 43.4M | +11.2% |
- Ethical considerations (Section 7.6): auditability (every edit carries manifest), deterministic gating (seesaw constraint), human-in-the-loop for high-risk edits. The -Bench Telecom failure shows a structural limitation of per-edit gating.
Conclusion
HarnessX demonstrates that agent progress need not come from model scaling alone. By treating the harness as a composable, adaptive, and evolvable first-class interface, the system achieves significant gains through:
- Typed composition enabling variant isolation and stable evolution.
- Trace-driven multi-agent evolution (AEGIS) that detects and mitigates RL-derived pathologies.
- Harness-model co-evolution via cross-harness GRPO, breaking both the scaffolding ceiling and the training-signal ceiling.
Across five benchmarks and three model families, HarnessX yields an average gain of +14.5% (up to +44.0%), with co-evolution adding +4.7% beyond harness-only evolution. The results suggest that composing and evolving the runtime interface from execution feedback is a complementary and actionable lever, especially for capability-limited agents.
Limitations (Section 7.7): no held-out evaluation, discrete action spaces only, closed-source meta-agent required, joint-control assumption, limited benchmark coverage. Future work should address held-out generalization, continuous action spaces, open-weight meta-agents, and cross-team coordination mechanisms.