Summary (Overview)

  • Harness-1 is a 20B open-source search agent trained with reinforcement learning (RL) inside a stateful harness that externalises routine bookkeeping (candidate pools, curated sets, evidence graphs, verification records, context budget).
  • The policy retains only semantic decisions: what to search, which documents to curate, what to verify, and when to stop; the harness maintains recoverable search state.
  • Across eight retrieval benchmarks (web, finance, patents, multi-hop QA), Harness-1 achieves 0.730 average curated recall, outperforming the next best open subagent (Tongyi DeepResearch 30B) by +11.4 points, and remains competitive with frontier models like Opus-4.6 and GPT-5.4.
  • Gains are especially strong on held-out transfer benchmarks (mean +17.0 pts vs. Context-1), suggesting that RL over explicit search state produces retrieval behaviours that generalise beyond training domains.
  • Three design requirements for stateful search harnesses are identified: warm-started curation, compact derived-state rendering (evidence graph, BM25 compression, deduplication), and diversity-preserving incentives (tool-diversity reward).

Introduction and Theoretical Foundation

Background and Problem.
Search agents are typically trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have been verified. This formulation forces the policy to learn both semantic search decisions and recoverable bookkeeping, even though the environment can maintain the latter more reliably.

Stateful cognitive offloading.
The paper argues for a clear separation: the harness (environment) should maintain working memory (candidate pool, curated set, evidence graph, verification records, context budget), while the policy makes only high-level search decisions. This gives RL a stable interface for improving search behaviour, rather than asking the model to rediscover state from an append-only transcript.

Motivation for RL-compatible harness.
Prior work has shown that RL can improve multi-turn search [14, 18, 16, 45], but training often suffers from:

  • near-identical empty-set rewards on hard queries,
  • collapse to repeated search calls,
  • diffuse cross-document structure in transcripts,
  • reward that does not distinguish search failure from forgotten evidence or poor curation.

Three requirements are identified for a trainable stateful harness:

  1. Warm-started curation – auto-seed the curated set from the first successful search to avoid indistinguishable early rewards.
  2. Compact derived-state rendering – importance tags, evidence-graph summaries, verification records, BM25 compression, and deduplication to keep context usable.
  3. Diversity-preserving incentives – reward shaping that encourages a rhythm of searching, curating, reviewing, and verifying (not just repeated search).

Methodology

Harness Architecture

The harness maintains a per-episode working memory with two tiers:

  • Inner tier (rendered in the prompt): candidate pool $P_t$, curated set $C_t$ with importance tags $I_t$, evidence graph $G_t$, verification records $V_t$, search history $H_t$, and budget marker $B_t$.
  • Outer tier: full-text store $D_t$ for retrieved chunks (not inlined into the prompt).

The state notation is summarised in Table 1:

| State | Harness-maintained bookkeeping | Policy decision preserved |
|---|---|---|
| $P_t$ | Candidate pool after compression/dedup | Which candidates to inspect, read, or curate |
| $C_t, I_t$ | Curated output set and importance tags, warm-started by auto-seed | Which documents to add, remove, promote, or demote |
| $D_t$ | Full-text memory for retrieved documents | Which seen documents to revisit via review_docs |
| $G_t$ | Evidence graph (entities, dates, documents) | Which bridge, singleton, or relation to pursue |
| $V_t$ | Verification records for policy-written claims | Which claim to check and which documents to test |
| $H_t$ | Search history and result summaries | When to diversify, backtrack, or continue |
| $B_t$ | Budget-safe renderer and context marker | When to search, read, summarize, or stop |
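
As a rough illustration, the two-tier working memory in Table 1 could be held in a structure like the one below. This is a minimal sketch, not the paper's implementation: the class name `WorkingMemory`, every field name, and the `render_inner` method are hypothetical, and the rendering format is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Hypothetical per-episode state; all names here are illustrative only."""
    # Inner tier: rendered into the prompt.
    candidate_pool: list[str] = field(default_factory=list)             # P_t (doc ids)
    curated: dict[str, str] = field(default_factory=dict)               # C_t -> I_t tag
    evidence_graph: dict[str, set[str]] = field(default_factory=dict)   # G_t: entity -> doc ids
    verifications: list[tuple[str, bool, str]] = field(default_factory=list)  # V_t
    search_history: list[str] = field(default_factory=list)             # H_t
    tokens_used: int = 0                                                 # B_t
    # Outer tier: full text, kept out of the prompt.
    full_text: dict[str, str] = field(default_factory=dict)             # D_t

    def render_inner(self) -> str:
        """Render only the inner tier; D_t contents are deliberately excluded."""
        lines = [
            "candidates: " + ", ".join(self.candidate_pool),
            "curated: " + ", ".join(f"{d}[{t}]" for d, t in self.curated.items()),
            f"searches so far: {len(self.search_history)}",
            f"tokens used: {self.tokens_used}",
        ]
        return "\n".join(lines)


wm = WorkingMemory()
wm.candidate_pool.append("doc-1")
wm.curated["doc-1"] = "fair"
wm.full_text["doc-1"] = "Full text of the retrieved chunk."
prompt_state = wm.render_inner()
```

The point of the split is that `full_text` can grow with every retrieval while the rendered prompt stays compact.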

Policy Actions

Five action classes:

  • Retrieval: fan_out_search (up to five diverse parallel queries), search_corpus (targeted hybrid search), grep_corpus (exact pattern match), read_document (full text). Outputs are compressed and deduplicated before updating $P_t$ and $D_t$.
  • Curation: curate updates $C_t$ and $I_t$ by adding or removing documents with importance tags (very_high, high, fair, low). Capacity is capped at $M=30$; the lowest-importance documents are evicted first.
  • Verification: verify lets the policy write a claim and select documents to test; the harness checks support against $D_t$ and records yes/no with a rationale in $V_t$.
  • Memory review: review_docs re-renders documents from $D_t$ without a new corpus call.
  • Termination: end_search commits the curated set.
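
The capacity rule for curate can be sketched as follows. This is an illustrative sketch under stated assumptions: the function signature is hypothetical, and the tie-breaking rule (evict the oldest entry among equally low-importance documents) is an assumption, since the summary only says that the lowest-importance documents are evicted.

```python
IMPORTANCE_RANK = {"low": 0, "fair": 1, "high": 2, "very_high": 3}
MAX_CURATED = 30  # capacity M from the text


def curate(curated, add, remove, capacity=MAX_CURATED):
    """Apply one curate action in place; return the ids evicted by the cap.

    `curated` and `add` map doc id -> importance tag. Tie-breaking among
    equally low-importance documents (oldest first) is an assumption.
    """
    for doc_id in remove:
        curated.pop(doc_id, None)
    for doc_id, tag in add.items():
        if tag not in IMPORTANCE_RANK:
            raise ValueError(f"unknown importance tag: {tag}")
        curated[doc_id] = tag  # re-adding an existing doc promotes or demotes it
    evicted = []
    while len(curated) > capacity:
        # min() returns the first minimal key, i.e. the oldest insertion.
        victim = min(curated, key=lambda d: IMPORTANCE_RANK[curated[d]])
        del curated[victim]
        evicted.append(victim)
    return evicted


curated_set = {f"doc-{i}": "fair" for i in range(30)}
evicted_ids = curate(curated_set, add={"doc-new": "very_high", "doc-weak": "low"}, remove=[])
```

Here the set briefly holds 32 documents, so the low-tagged newcomer and the oldest fair-tagged document are evicted to restore the cap.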

Auto-seeding:
The first successful search automatically seeds $C_t$ with the top $k=8$ reranked results, each assigned $I_t(d)=\text{fair}$. This changes the task from construction-from-scratch to refinement.
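
A minimal sketch of auto-seeding, assuming the harness tracks a one-time `seeded` flag (the `CuratedState` class, the flag, and the function name are hypothetical, not taken from the paper):

```python
from dataclasses import dataclass, field


@dataclass
class CuratedState:
    tags: dict[str, str] = field(default_factory=dict)  # doc id -> importance tag
    seeded: bool = False  # hypothetical flag: seeding happens at most once


def auto_seed(state, reranked_ids, k=8):
    """Seed the curated set from the first non-empty search result list.

    Returns True only on the call that performs the seeding.
    """
    if state.seeded or not reranked_ids:
        return False
    for doc_id in reranked_ids[:k]:
        state.tags.setdefault(doc_id, "fair")
    state.seeded = True
    return True


state = CuratedState()
first = auto_seed(state, [])                                # empty search: no seeding
second = auto_seed(state, [f"doc-{i}" for i in range(12)])  # seeds the top 8
third = auto_seed(state, ["doc-99"])                        # later searches do not reseed
```

After seeding, even a trajectory that never calls curate ends with a non-empty set, which is what keeps early rewards distinguishable.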

Derived-State Rendering

  • Evidence graph $G_t$: A lightweight regex extractor identifies multi-word capitalized proper nouns, four-digit years, and numeric dates. The harness renders frequent entities, bridge documents (containing ≥2 frequent entities), and singleton entities (potential hops).
  • Sentence-BM25 compression: Retrieval outputs (from search_corpus, fan_out_search, grep_corpus) are compressed by scoring sentences with BM25 against the query and rendering the top $K=4$ sentences in original order. Explicit read_document calls return full text.
  • Two-level deduplication: Near-duplicates are removed by both chunk ID and content fingerprint (MinHash–LSH, $\theta_{\text{Jaccard}}=0.85$). Reward accounting still credits all relevant evidence encountered.
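
The first two rendering steps above can be sketched in plain Python. Everything here is an assumption made for illustration: the regex patterns, the `min_freq=2` threshold for "frequent", the sentence splitter, and the BM25 parameters (`k1=1.5`, `b=0.75`) are common defaults rather than values reported in the paper. MinHash–LSH deduplication is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical patterns: multi-word capitalized names, numeric dates, 4-digit years.
ENTITY_RE = re.compile(
    r"\b(?:[A-Z][a-z]+(?: [A-Z][a-z]+)+|\d{1,2}/\d{1,2}/\d{4}|\d{4})\b"
)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def compress_bm25(query, text, top_k=4, k1=1.5, b=0.75):
    """Keep the top_k sentences by BM25 score against the query, in original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    if n <= top_k:
        return " ".join(sentences)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    best = sorted(range(n), key=lambda i: -scores[i])[:top_k]
    return " ".join(sentences[i] for i in sorted(best))


def build_evidence_graph(docs, min_freq=2):
    """Return (frequent entities, singleton entities, bridge doc ids)."""
    mentions = {}  # entity -> set of doc ids mentioning it
    for doc_id, text in docs.items():
        for entity in ENTITY_RE.findall(text):
            mentions.setdefault(entity, set()).add(doc_id)
    frequent = {e for e, ids in mentions.items() if len(ids) >= min_freq}
    singletons = {e for e, ids in mentions.items() if len(ids) == 1}
    bridges = [d for d in docs if sum(d in mentions[e] for e in frequent) >= 2]
    return frequent, singletons, bridges


text = ("The company was founded in Ohio. Revenue growth reached ten percent. "
        "The office has a new cafeteria. Growth in revenue was driven by cloud sales. "
        "Employees enjoy free parking. The weather was mild.")
compressed = compress_bm25("revenue growth", text, top_k=2)

corpus = {
    "a": "Marie Curie won in 1903.",
    "b": "Marie Curie worked with Pierre Curie.",
    "c": "Pierre Curie met Ernest Rutherford.",
}
frequent, singletons, bridges = build_evidence_graph(corpus)
```

In the toy corpus, document "b" is the only bridge because it mentions both frequent entities, while "1903" and "Ernest Rutherford" are singletons that a policy might pursue as the next hop.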

Training Pipeline

1. SFT (Supervised Fine-Tuning):

  • Teacher: GPT-5.4 runs as a live agent inside the full Harness-1 harness.
  • Trajectories kept if format-valid, return ≥1 document, and final curated recall ≥0.10.
  • 899 filtered trajectories → one datum per turn.
  • LoRA (rank 32) on gpt-oss-20b for 3 epochs; step-550 checkpoint initializes RL.
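
The trajectory filter can be read as a simple predicate followed by a per-turn expansion. The sketch below is hypothetical: the `Trajectory` fields are invented for illustration, and interpreting "return ≥1 document" as a non-empty final curated set is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical record of one teacher rollout.
    format_valid: bool
    curated_ids: list[str]
    curated_recall: float
    turns: list[str]


def keep_for_sft(traj, min_recall=0.10):
    """Filter from the text: format-valid, >=1 document, curated recall >= 0.10."""
    return (traj.format_valid
            and len(traj.curated_ids) >= 1
            and traj.curated_recall >= min_recall)


def to_turn_data(trajectories):
    """Expand each kept trajectory into one training datum per turn."""
    return [(turn_idx, turn)
            for traj in trajectories if keep_for_sft(traj)
            for turn_idx, turn in enumerate(traj.turns)]


rollouts = [
    Trajectory(True, ["d1"], 0.50, ["t0", "t1", "t2"]),  # kept
    Trajectory(True, [], 0.00, ["t0"]),                   # no documents returned
    Trajectory(False, ["d2"], 0.90, ["t0", "t1"]),        # format-invalid
    Trajectory(True, ["d3"], 0.05, ["t0"]),               # recall below threshold
]
sft_data = to_turn_data(rollouts)
```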

2. RL (Reinforcement Learning):

  • On-policy CISPO with within-group advantage normalization on SEC training queries.
  • Full-trajectory rollouts, terminal-only reward, 40-turn cap, no KL anchor.
  • Groups with identical rewards dropped from gradient.
  • Batch size 128 × 8 rollouts.
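
The group-level bookkeeping in this setup can be sketched as below. This covers only advantage normalization and the dropping of identical-reward groups; it does not implement the CISPO objective itself. Dividing by the within-group standard deviation is an assumption about what "normalization" means here, and the function name is hypothetical.

```python
import statistics


def group_advantages(group_rewards, eps=1e-8):
    """Return per-rollout advantages per group, or None for dropped groups.

    Assumption: advantages are (reward - group mean) / group std.
    """
    advantages = []
    for rewards in group_rewards:
        if max(rewards) - min(rewards) < eps:
            advantages.append(None)  # identical rewards: no gradient signal
            continue
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        advantages.append([(r - mean) / (std + eps) for r in rewards])
    return advantages


groups = [
    [0.2, 0.2, 0.2, 0.2],  # every rollout scored the same: dropped
    [0.0, 1.0],            # informative group
]
adv = group_advantages(groups)
```

This is also why near-identical empty-set rewards on hard queries are costly: such groups contribute nothing to the gradient.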

Reward function:

$$R = w_F F_\beta + w_\tau \rho_\tau + w_A \rho_A + w_{\tau A} \rho_{\tau A} + B_A \mathbb{1}[\rho_A>0] + w_{\text{div}} \min(\nu/\nu_0,1) - w_{\text{miss}}(\rho_{\tau A}-\rho_A)_+ - \pi_{\text{turn}}(t)$$

Where:

  • $F_\beta$: F-score on the curated set with $\beta=2$ (recall weighted 4× precision)
  • $\rho_\tau$: trajectory recall
  • $\rho_A$: curated final-answer recall
  • $\rho_{\tau A}$: trajectory final-answer recall
  • $\nu$: number of distinct tools used; $\nu_0$: diversity target
  • $w_{\text{miss}}$: weight of the penalty for answer evidence that was found but not curated
  • $\pi_{\text{turn}}$: turn penalty

Episodes with an empty curated set short-circuit to $\pi_\emptyset = -0.2$. For non-empty episodes, the reward is clipped from below so that $R \geq 10^{-3}$.
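
The reward can be written out term by term as in the sketch below. The weights in `RewardWeights` are placeholders chosen only so the example runs; the summary does not report the actual values, nor the form of $\pi_{\text{turn}}(t)$, which is therefore passed in as a precomputed number. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Placeholder values only: the actual weights are not given in the summary.
    w_f: float = 1.0         # w_F
    w_traj: float = 0.1      # w_tau
    w_ans: float = 0.5       # w_A
    w_traj_ans: float = 0.1  # w_{tau A}
    bonus_ans: float = 0.1   # B_A
    w_div: float = 0.1       # w_div
    w_miss: float = 0.2      # w_miss
    nu_0: int = 4            # diversity target nu_0


def f_beta(precision, recall, beta=2.0):
    """F-beta score; with beta=2 recall counts 4x as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def episode_reward(precision, recall, rho_traj, rho_ans, rho_traj_ans,
                   n_tools, turn_penalty, curated_empty, weights=None):
    w = weights or RewardWeights()
    if curated_empty:
        return -0.2  # pi_empty short-circuit
    r = (w.w_f * f_beta(precision, recall)
         + w.w_traj * rho_traj
         + w.w_ans * rho_ans
         + w.w_traj_ans * rho_traj_ans
         + w.bonus_ans * (1.0 if rho_ans > 0 else 0.0)
         + w.w_div * min(n_tools / w.nu_0, 1.0)
         - w.w_miss * max(rho_traj_ans - rho_ans, 0.0)  # found but not curated
         - turn_penalty)
    return max(r, 1e-3)  # floor for non-empty episodes


empty = episode_reward(0, 0, 0, 0, 0, n_tools=1, turn_penalty=0.0, curated_empty=True)
floored = episode_reward(0.0, 0.0, 0.0, 0.0, 1.0, n_tools=0,
                         turn_penalty=0.5, curated_empty=False)
```

The floor guarantees that any non-empty curated set scores strictly above the empty-set penalty, so curating something is always preferred to curating nothing.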

Empirical Validation / Results

Benchmarks and Setup

8 benchmarks: BrowseComp+ (BC+), Web synthetic, Patents (USPTO), SEC filings, LongSealQA, Seal0QA, FRAMES, HotpotQA.
Baselines: Context-1, gpt-oss-20b/120b, Qwen3-32B, Search-R1 32B, Tongyi DeepResearch 30B, and frontier models (Opus-4.6, GPT-5.4, Sonnet-4.6, Kimi-K2.5).
Metrics: Curated-Set Recall, Final-Answer Recall, Trajectory Recall.

Main Results (Table 2)

| Model | Metric | BC+ | Web | Patents | SEC | LongSeal | Seal0 | FRAMES | HotpotQA |
|---|---|---|---|---|---|---|---|---|---|
| Harness-1 (20B) | Recall | 0.584 | 0.787 | 0.890 | 0.589 | 0.891 | 0.551 | 0.710 | 0.841 |
| Harness-1 (20B) | Final-Answer Recall | 0.667 | 0.845 | 0.890 | 0.659 | 0.891 | 0.551 | 0.710 | 0.841 |
| Harness-1 (20B) | Trajectory Recall | 0.665 | 0.881 | 0.973 | 0.677 | 0.944 | 0.708 | 0.749 | 0.860 |
| Context-1 (20B) | Recall | 0.500 | 0.689 | 0.857 | 0.467 | 0.563 | 0.367 | 0.638 | 0.746 |
| Tongyi DR 30B | Recall | 0.381 | 0.607 | 0.780 | 0.419 | 0.855 | 0.577 | 0.561 | 0.751 |
| Opus-4.6 (frontier) | Recall | 0.619 | 0.819 | 0.915 | 0.589 | 0.823 | 0.636 | 0.814 | 0.893 |
| GPT-5.4 (frontier) | Recall | 0.443 | 0.800 | 0.909 | 0.452 | 0.824 | 0.555 | 0.815 | 0.870 |

  • Average curated recall: Harness-1 = 0.730, next best open (Tongyi DR) = 0.616 (+11.4 pts).
  • Competitive with frontier models: Harness-1 beats GPT-5.4 (0.659 avg), Sonnet-4.6, Kimi-K2.5, GPT-OSS-120B; only Opus-4.6 (0.775 avg) is ahead.
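
As a worked check, the headline gap over Tongyi DR can be reproduced from the Recall rows of Table 2. The snippet uses only numbers from the table; the variable names are illustrative.

```python
# Curated recall per benchmark, copied from the Recall rows of Table 2
# (order: BC+, Web, Patents, SEC, LongSeal, Seal0, FRAMES, HotpotQA).
curated_recall = {
    "Harness-1": [0.584, 0.787, 0.890, 0.589, 0.891, 0.551, 0.710, 0.841],
    "Tongyi DR 30B": [0.381, 0.607, 0.780, 0.419, 0.855, 0.577, 0.561, 0.751],
}

# Unweighted mean over the eight benchmarks.
averages = {model: sum(vals) / len(vals) for model, vals in curated_recall.items()}

# Gap in recall points (1 point = 0.01 recall).
gap_points = 100 * (averages["Harness-1"] - averages["Tongyi DR 30B"])
```

Rounded, this gives averages of 0.730 and 0.616 and a gap of 11.4 points, matching the figures quoted above.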

Transfer Pattern (Figure 3)

Gains on held-out transfer benchmarks (not used in SFT or RL) are 2.2× larger than on source-family benchmarks:

  • Source-family (BC+, Web, Patents, SEC): mean +7.9 recall pts over Context-1.
  • Held-out (LongSeal, Seal0, FRAMES, HotpotQA): mean +17.0 recall pts.

Inference-Time Component Ablation (Table 3)

| Configuration | Recall Δ (rel%) | FA Recall Δ (rel%) |
|---|---|---|
| Full Harness-1 | 0.584 (baseline) | 0.667 (baseline) |
| – Importance tags | 0.560 (−4.1%) | 0.614 (−7.9%) |
| – Sentence-BM25 compression | 0.585 (+0.2%) | 0.620 (−7.0%) |
| – Auto-seed on first search | 0.582 (−0.3%) | 0.624 (−6.4%) |
| – Evidence graph | 0.569 (−2.6%) | 0.631 (−5.4%) |
| verify returns "unavailable" | 0.566 (−3.1%) | 0.641 (−3.9%) |
| review_docs returns "unavailable" | 0.598 (+2.4%) | 0.641 (−3.9%) |
| – Content-fingerprint dedup | 0.611 (+4.6%) | 0.678 (+1.6%) |
| All mechanisms disabled | 0.513 (−12.2%) | 0.624 (−6.4%) |

Removing state mechanisms causes the trained policy to revert to a wide, shallow, search-dominated mode. Content dedup trades a small recall loss (some gold near-duplicates collapsed) for token efficiency.
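
The relative deltas in Table 3 follow from the usual formula, (value − baseline) / baseline. A short check on two rows, using only numbers from the table (the helper name is illustrative):

```python
def relative_delta(value, baseline):
    """Relative change in percent, as in the 'Δ (rel%)' columns."""
    return 100.0 * (value - baseline) / baseline


# Curated recall with importance tags removed, and with all mechanisms disabled,
# each compared against the full-harness baseline of 0.584.
no_tags = relative_delta(0.560, 0.584)
all_off = relative_delta(0.513, 0.584)
```

Rounded to one decimal, these evaluate to −4.1% and −12.2%.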

Training Dynamics (Figure 5)

Comparison of RL runs with and without tool-diversity reward:

  • Without $w_{\text{div}}$: tool diversity collapses from ~6 to ~3.5, and curated recall plateaus at ~0.53 (the policy finds documents but does not curate them).
  • With $w_{\text{div}}$: tool diversity stabilises at ~4.30, and curated recall reaches ~0.60 (higher final quality).

Modular RAG Accuracy

When curated sets are passed to frozen frontier generators, Harness-1 yields higher downstream answer accuracy than other open subagents.

Theoretical and Practical Implications

  • Theoretical: The paper establishes stateful cognitive offloading as a design principle for retrieval-agent RL. Separating recoverable bookkeeping from semantic decisions makes RL training more stable and sample-efficient. The harness provides a clean interface between policy and environment.
  • Practical: Harness-1 demonstrates that a relatively small 20B model, trained inside a stateful harness with focused RL (4,352 unique training items), can match or exceed several much larger frontier models on average curated recall, with only Opus-4.6 scoring higher. The strong transfer performance suggests that training on explicit search-state operations (refine curated set, use evidence graph, verify before promotion) yields domain-general retrieval strategies.
  • Three design principles (warm-started curation, compact derived-state rendering, diversity-preserving incentives) are directly applicable to any agentic search system being trained with RL.

Conclusion

  • Harness-1 is a 20B open-source search agent trained with RL inside a stateful harness.
  • The harness externalises routine state management (candidate pools, importance-tagged curated sets, evidence graphs, verification records, compression/dedup, budget markers), leaving the policy to make semantic decisions.
  • Achieves 0.730 average curated recall across 8 benchmarks, improving over the next best open agent by +11.4 pts and competitive with frontier models.
  • Gains are especially strong on held-out transfer benchmarks (+17.0 pts vs Context-1), indicating domain-general learning.
  • Component ablations show that disabling all harness mechanisms lowers curated recall by 12.2% (relative) and that most individual mechanisms contribute to final-answer recall; the harness is not just an implementation detail but a core part of what the policy learns to use.
  • Future directions: replace regex-based evidence graph extraction with learned entity linking and relation extraction; explore uncertainty-aware evidence organisation; extend to more complex tool use and multi-step reasoning.
