Visual Summary | Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Summary (Overview)

Harness-1 is a 20B open-source search agent trained with reinforcement learning (RL) inside a stateful harness that externalises routine bookkeeping (candidate pools, curated sets, evidence graphs, verification records, context budget).
The policy retains only semantic decisions: what to search, which documents to curate, what to verify, and when to stop; the harness maintains recoverable search state.
Across eight retrieval benchmarks (web, finance, patents, multi-hop QA), Harness-1 achieves 0.730 average curated recall, outperforming the next best open subagent (Tongyi DeepResearch 30B) by +11.4 points, and remains competitive with frontier models like Opus-4.6 and GPT-5.4.
Gains are especially strong on held-out transfer benchmarks (mean +17.0 pts vs. Context-1), suggesting that RL over explicit search state produces retrieval behaviours that generalise beyond training domains.
Three design requirements for stateful search harnesses are identified: warm-started curation, compact derived-state rendering (evidence graph, BM25 compression, deduplication), and diversity-preserving incentives (tool-diversity reward).

Introduction and Theoretical Foundation

Background and Problem.
Search agents are typically trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have been verified. This formulation forces the policy to learn both semantic search decisions and recoverable bookkeeping, even though the environment can maintain the latter more reliably.

Stateful cognitive offloading.
The paper argues for a clear separation: the harness (environment) should maintain working memory (candidate pool, curated set, evidence graph, verification records, context budget), while the policy makes only high-level search decisions. This gives RL a stable interface for improving search behaviour, rather than asking the model to rediscover state from an append-only transcript.

Motivation for RL-compatible harness.
Prior work has shown that RL can improve multi-turn search [14, 18, 16, 45], but training often suffers from:

near-identical empty-set rewards on hard queries,
collapse to repeated search calls,
diffuse cross-document structure in transcripts,
reward that does not distinguish search failure from forgotten evidence or poor curation.

Three requirements are identified for a trainable stateful harness:

Warm-started curation – auto-seed the curated set from the first successful search to avoid indistinguishable early rewards.
Compact derived-state rendering – importance tags, evidence-graph summaries, verification records, BM25 compression, and deduplication to keep context usable.
Diversity-preserving incentives – reward shaping that encourages a rhythm of searching, curating, reviewing, and verifying (not just repeated search).

Methodology

Harness Architecture

The harness maintains a per-episode WORKING MEMORY with two tiers:

Inner tier (rendered in the prompt): candidate pool $P_t$, curated set $C_t$ with importance tags $I_t$, evidence graph $G_t$, verification records $V_t$, search history $H_t$, and budget marker $B_t$.
Outer tier: full-text store $D_t$ for retrieved chunks (not inlined into the prompt).

The state notation is summarised in Table 1:

State	Harness-maintained bookkeeping	Policy decision preserved
$P_t$	Candidate pool after compression/dedup	Which candidates to inspect, read, or curate
$C_t, I_t$	Curated output set and importance tags, warm-started by auto-seed	Which documents to add, remove, promote, or demote
$D_t$	Full-text memory for retrieved documents	Which seen documents to revisit via `review_docs`
$G_t$	Evidence graph (entities, dates, documents)	Which bridge, singleton, or relation to pursue
$V_t$	Verification records for policy-written claims	Which claim to check and which documents to test
$H_t$	Search history and result summaries	When to diversify, backtrack, or continue
$B_t$	Budget-safe renderer and context marker	When to search, read, summarize, or stop
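The two-tier state in Table 1 can be pictured as a single per-episode object. The following is a minimal sketch, assuming a plain Python dataclass; the class name, field names, the `render_inner` helper, and the character-based budget are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Hypothetical per-episode state; the fields mirror the symbols in Table 1."""
    pool: list[str] = field(default_factory=list)              # P_t: candidate doc ids
    curated: dict[str, str] = field(default_factory=dict)      # C_t and I_t: doc id -> tag
    graph: dict[str, set[str]] = field(default_factory=dict)   # G_t: entity -> doc ids
    verifications: list[tuple[str, bool]] = field(default_factory=list)  # V_t
    history: list[str] = field(default_factory=list)           # H_t: past queries
    full_text: dict[str, str] = field(default_factory=dict)    # D_t: outer tier

    def render_inner(self, budget_chars: int = 2000) -> str:
        """Render the inner tier only; D_t is never inlined into the prompt."""
        curated = ", ".join(f"{d}[{tag}]" for d, tag in self.curated.items())
        supported = sum(ok for _, ok in self.verifications)
        body = "\n".join([
            f"candidates: {', '.join(self.pool)}",
            f"curated: {curated}",
            f"queries so far: {len(self.history)}",
            f"verified claims: {supported}/{len(self.verifications)}",
        ])
        body = body[:budget_chars]  # crude stand-in for a budget-safe renderer
        return f"{body}\n[budget: {len(body)}/{budget_chars} chars]"  # B_t marker

    def review_docs(self, doc_ids: list[str]) -> dict[str, str]:
        """Re-read stored text from D_t without issuing a new corpus call."""
        return {d: self.full_text[d] for d in doc_ids if d in self.full_text}
```

The point of the split is that the prompt grows with the size of the summaries, not with the size of the retrieved text, which stays in `full_text` until the policy asks for it.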

Policy Actions

Five action classes:

Retrieval: `fan_out_search` (up to five diverse parallel queries), `search_corpus` (targeted hybrid search), `grep_corpus` (exact pattern match), `read_document` (full text). Outputs are compressed and deduplicated before updating $P_t$ and $D_t$.
Curation: `curate` updates $C_t$ and $I_t$ by adding or removing documents with importance tags (`very_high`, `high`, `fair`, `low`). Capacity is capped at $M=30$; when the cap is exceeded, the lowest-importance documents are evicted.
Verification: `verify` lets the policy write a claim and select documents to test; the harness checks support against $D_t$ and records a yes/no verdict with rationale in $V_t$.
Memory review: `review_docs` re-renders documents from $D_t$ without a new corpus call.
Termination: `end_search` commits the curated set.

Auto-seeding:
The first successful search automatically seeds $C_t$ with the top $k=8$ reranked results, each assigned $I_t(d)=\text{fair}$. This changes the task from construction-from-scratch to refinement.
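The capacity cap and the auto-seed rule can be sketched together. This is a minimal sketch, assuming a hypothetical `CuratedSet` class; the tie-break used during eviction (oldest entry among the lowest-ranked tags) is an assumption, since the text only says that the lowest-importance documents are evicted.

```python
IMPORTANCE_RANK = {"low": 0, "fair": 1, "high": 2, "very_high": 3}


class CuratedSet:
    """Hypothetical container for C_t with importance tags I_t."""

    def __init__(self, capacity: int = 30, seed_k: int = 8):
        self.capacity = capacity          # M = 30 in the text
        self.seed_k = seed_k              # k = 8 in the text
        self.tags: dict[str, str] = {}    # insertion-ordered: doc id -> tag
        self.seeded = False

    def add(self, doc_id: str, tag: str = "fair") -> None:
        if tag not in IMPORTANCE_RANK:
            raise ValueError(f"unknown importance tag: {tag}")
        self.tags.pop(doc_id, None)       # re-adding moves the doc to the end
        self.tags[doc_id] = tag
        while len(self.tags) > self.capacity:
            # Assumed tie-break: min() returns the first (oldest) lowest-ranked doc.
            victim = min(self.tags, key=lambda d: IMPORTANCE_RANK[self.tags[d]])
            del self.tags[victim]

    def remove(self, doc_id: str) -> None:
        self.tags.pop(doc_id, None)

    def auto_seed(self, reranked_doc_ids: list[str]) -> None:
        """Seed once, from the first search that returns any results."""
        if self.seeded or not reranked_doc_ids:
            return
        for doc_id in reranked_doc_ids[: self.seed_k]:
            self.add(doc_id, "fair")
        self.seeded = True
```

Because seeding happens once, later `curate` calls act as promotions, demotions, and removals against an already non-empty set.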

Derived-State Rendering

Evidence graph $G_t$: A lightweight regex extractor identifies multi-word capitalized proper nouns, four-digit years, and numeric dates. The rendered summary lists frequent entities, bridge documents (those containing ≥2 frequent entities), and singleton entities (potential hops).
Sentence-BM25 compression: Retrieval outputs (from `search_corpus`, `fan_out_search`, `grep_corpus`) are compressed by scoring each sentence with BM25 against the query and rendering the top $K=4$ sentences in their original order. Explicit `read_document` calls return full text.
Two-level deduplication: Near-duplicates are removed by both chunk ID and content fingerprint (MinHash–LSH, $\theta_{\text{Jaccard}}=0.85$). Reward accounting still credits all relevant evidence encountered.
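Two of the rendering steps above, the regex evidence graph and sentence-level BM25 compression, can be sketched in a few lines. This is a minimal sketch under stated assumptions: the regex patterns, the "frequent" threshold (an entity seen in at least two documents), the sentence splitter, and the BM25 parameters `k1=1.5` and `b=0.75` are illustrative choices, and IDF is computed over the sentences of the text being compressed. Numeric dates and MinHash–LSH deduplication are omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed patterns: multi-word capitalized names and four-digit years.
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b|\b(?:1[5-9]|20)\d{2}\b")


def build_evidence_graph(docs: dict[str, str], min_docs: int = 2):
    """Return (frequent entities, bridge doc ids, singleton entities)."""
    entity_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for entity in ENTITY_RE.findall(text):
            entity_docs[entity].add(doc_id)
    frequent = {e for e, ds in entity_docs.items() if len(ds) >= min_docs}
    singletons = {e for e, ds in entity_docs.items() if len(ds) == 1}
    # A bridge document contains at least two frequent entities.
    bridges = [d for d in docs
               if sum(d in entity_docs[e] for e in frequent) >= 2]
    return frequent, bridges, singletons


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_compress(query: str, text: str, top_k: int = 4,
                  k1: float = 1.5, b: float = 0.75) -> str:
    """Keep the top_k sentences by BM25 score, rendered in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)
    if n <= top_k:
        return " ".join(sentences)
    tokenized = [tokenize(s) for s in sentences]
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    # Pick the best-scoring indices, then restore document order.
    keep = sorted(sorted(range(n), key=lambda i: -scores[i])[:top_k])
    return " ".join(sentences[i] for i in keep)
```

Restoring original order after selection keeps the compressed passage readable, which matters because the policy sees only this rendering unless it calls `read_document`.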

Training Pipeline

1. SFT (Supervised Fine-Tuning):

Teacher: GPT-5.4 runs as a live agent inside the full Harness-1 harness.
Trajectories are kept if they are format-valid, return at least one document, and reach a final curated recall ≥0.10.
899 filtered trajectories → one datum per turn.
LoRA (rank 32) on gpt-oss-20b for 3 epochs; step-550 checkpoint initializes RL.
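The trajectory filter and the per-turn flattening can be sketched as follows. This is a minimal sketch, assuming a hypothetical `Trajectory` record; the field names and the prompt/completion layout of each datum are illustrative assumptions, and only the three filter conditions come from the text.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one teacher episode."""
    turns: list[tuple[str, str]]   # (rendered prompt, teacher action) per turn
    format_valid: bool
    num_returned_docs: int
    curated_recall: float


def keep_trajectory(t: Trajectory, min_recall: float = 0.10) -> bool:
    """Apply the three filter conditions stated in the text."""
    return (t.format_valid
            and t.num_returned_docs >= 1
            and t.curated_recall >= min_recall)


def to_sft_data(trajectories: list[Trajectory]) -> list[dict[str, str]]:
    """Flatten kept trajectories into one training datum per turn."""
    return [
        {"prompt": prompt, "completion": action}
        for t in trajectories if keep_trajectory(t)
        for prompt, action in t.turns
    ]
```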

2. RL (Reinforcement Learning):

On-policy CISPO with within-group advantage normalization on SEC training queries.
Full-trajectory rollouts, terminal-only reward, 40-turn cap, no KL anchor.
Groups whose rollouts all receive identical rewards are dropped from the gradient, since their within-group advantages are all zero.
Batch size 128 × 8 rollouts.
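Within-group advantage normalization and the dropping of uniform groups can be sketched as below. This is a minimal sketch: dividing by the group standard deviation is an assumption (the text says only "within-group advantage normalization"), and the CISPO policy-gradient update itself is not shown.

```python
import statistics


def group_advantages(rewards_by_query: dict[str, list[float]],
                     eps: float = 1e-6) -> dict[str, list[float]]:
    """Normalize rewards within each rollout group; skip uniform groups."""
    advantages = {}
    for query_id, rewards in rewards_by_query.items():
        if max(rewards) == min(rewards):
            continue  # identical rewards carry no learning signal
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)
        advantages[query_id] = [(r - mean) / (std + eps) for r in rewards]
    return advantages
```

With a terminal-only reward, every turn of a rollout would share that rollout's advantage, so a group in which all rollouts score the same contributes nothing to the update.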

Reward function:

$$R = w_F F_\beta + w_\tau \rho_\tau + w_A \rho_A + w_{\tau A} \rho_{\tau A} + B_A \mathbb{1}[\rho_A>0] + w_{\text{div}} \min(\nu/\nu_0,1) - w_{\text{miss}}(\rho_{\tau A}-\rho_A)_+ - \pi_{\text{turn}}(t)$$

Where:

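$F_\beta$: $F_\beta$ score of the curated set ($\beta=2$, so recall carries $\beta^2 = 4\times$ the weight of precision)
$\rho_\tau$: trajectory recall
$\rho_A$: curated final-answer recall
$\rho_{\tau A}$: trajectory final-answer recall
$B_A$: bonus applied when the curated set contains any answer evidence ($\rho_A>0$)
$\nu$: number of distinct tools used; $\nu_0$: diversity target
$w_{\text{miss}}$: weight of the penalty on answer evidence that was found during the trajectory but not curated, $(\rho_{\tau A}-\rho_A)_+$
$\pi_{\text{turn}}$: turn penalty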

Episodes with an empty curated set short-circuit to $\pi_\emptyset = -0.2$. For non-empty episodes, the reward is floored at $10^{-3}$ (that is, $R \geq 10^{-3}$).
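The reward can be transcribed term by term. This is a minimal sketch: the weight values in `RewardWeights` are placeholders chosen only so the code runs (the summary does not report them), the turn penalty is passed in as a precomputed number because its functional form is not given, and all function and parameter names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Placeholder values for illustration only; the summary gives no weights."""
    w_f: float = 1.0
    w_traj: float = 0.1
    w_ans: float = 0.5
    w_traj_ans: float = 0.1
    ans_bonus: float = 0.1   # B_A
    w_div: float = 0.1
    w_miss: float = 0.1
    nu_0: int = 4            # diversity target


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta; beta = 2 weights recall beta^2 = 4x precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def episode_reward(precision: float, recall: float, rho_traj: float,
                   rho_ans: float, rho_traj_ans: float, num_tools: int,
                   turn_penalty: float, curated_empty: bool,
                   w: RewardWeights = RewardWeights()) -> float:
    if curated_empty:
        return -0.2  # empty curated set short-circuits to pi_empty
    r = (
        w.w_f * f_beta(precision, recall)
        + w.w_traj * rho_traj
        + w.w_ans * rho_ans
        + w.w_traj_ans * rho_traj_ans
        + w.ans_bonus * (1.0 if rho_ans > 0 else 0.0)
        + w.w_div * min(num_tools / w.nu_0, 1.0)
        - w.w_miss * max(rho_traj_ans - rho_ans, 0.0)
        - turn_penalty
    )
    return max(r, 1e-3)  # floor for non-empty episodes
```

The floor keeps every non-empty episode strictly above the empty-set value of −0.2, so committing any curated set is always preferred to committing none.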

Empirical Validation / Results

Benchmarks and Setup

8 benchmarks: BrowseComp+ (BC+), Web synthetic, Patents (USPTO), SEC filings, LongSealQA, Seal0QA, FRAMES, HotpotQA.
Baselines: Context-1, gpt-oss-20b/120b, Qwen3-32B, Search-R1 32B, Tongyi DeepResearch 30B, and frontier models (Opus-4.6, GPT-5.4, Sonnet-4.6, Kimi-K2.5).
Metrics: Curated-Set Recall, Final-Answer Recall, Trajectory Recall.

Main Results (Table 2)

Model	Metric	BC+	Web	Patents	SEC	LongSeal	Seal0	FRAMES	HotpotQA
Harness-1 (20B)	Recall	0.584	0.787	0.890	0.589	0.891	0.551	0.710	0.841
	Final-Answer Recall	0.667	0.845	0.890	0.659	0.891	0.551	0.710	0.841
	Trajectory Recall	0.665	0.881	0.973	0.677	0.944	0.708	0.749	0.860
Context-1 (20B)	Recall	0.500	0.689	0.857	0.467	0.563	0.367	0.638	0.746
Tongyi DR 30B	Recall	0.381	0.607	0.780	0.419	0.855	0.577	0.561	0.751
Opus-4.6 (frontier)	Recall	0.619	0.819	0.915	0.589	0.823	0.636	0.814	0.893
GPT-5.4 (frontier)	Recall	0.443	0.800	0.909	0.452	0.824	0.555	0.815	0.870

Average curated recall: Harness-1 = 0.730, next best open (Tongyi DR) = 0.616 (+11.4 pts).
Competitive with frontier models: Harness-1 beats GPT-5.4 (0.659 avg), Sonnet-4.6, Kimi-K2.5, and gpt-oss-120b; only Opus-4.6 (0.775 avg) is ahead.

Transfer Pattern (Figure 3)

Gains on held-out transfer benchmarks (not used in SFT or RL) are 2.2× larger than on source-family benchmarks:

Source-family (BC+, Web, Patents, SEC): mean +7.9 recall pts over Context-1.
Held-out (LongSeal, Seal0, FRAMES, HotpotQA): mean +17.0 recall pts.

Inference-Time Component Ablation (Table 3)

Configuration	Curated recall (relative Δ, %)	Final-answer recall (relative Δ, %)
Full Harness-1	0.584 (—)	0.667 (—)
– Importance tags	0.560 (−4.1%)	0.614 (−7.9%)
– Sentence-BM25 compression	0.585 (+0.2%)	0.620 (−7.0%)
– Auto-seed on first search	0.582 (−0.3%)	0.624 (−6.4%)
– Evidence graph	0.569 (−2.6%)	0.631 (−5.4%)
– `verify` returns "unavailable"	0.566 (−3.1%)	0.641 (−3.9%)
– `review_docs` returns "unavailable"	0.598 (+2.4%)	0.641 (−3.9%)
– Content-fingerprint dedup	0.611 (+4.6%)	0.678 (+1.6%)
All mechanisms disabled	0.513 (−12.2%)	0.624 (−6.4%)

Removing state mechanisms causes the trained policy to revert to a wide, shallow, search-dominated mode. Content dedup trades a small recall loss (some gold near-duplicates collapsed) for token efficiency.
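The relative deltas in Table 3 are plain percentage changes against the full system. The short check below reproduces three of the curated-recall entries; the helper name is illustrative.

```python
def relative_delta_pct(full: float, ablated: float) -> float:
    """Relative change of the ablated score against the full system, in percent."""
    return 100.0 * (ablated - full) / full


full_recall = 0.584
for name, ablated in [("importance tags", 0.560),
                      ("content-fingerprint dedup", 0.611),
                      ("all mechanisms", 0.513)]:
    print(f"- {name}: {relative_delta_pct(full_recall, ablated):+.1f}%")
# - importance tags: -4.1%
# - content-fingerprint dedup: +4.6%
# - all mechanisms: -12.2%
```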

Training Dynamics (Figure 5)

Comparison of RL runs with and without tool-diversity reward:

Without $w_{\text{div}}$: tool diversity collapses from ~6 to ~3.5 and curated recall plateaus at ~0.53 (the policy finds documents but does not curate them).
With $w_{\text{div}}$: tool diversity stabilises at ~4.30 and curated recall reaches ~0.60 (higher final quality).

Modular RAG Accuracy

When curated sets are passed to frozen frontier generators, Harness-1 yields higher downstream answer accuracy than other open subagents.

Theoretical and Practical Implications

Theoretical: The paper establishes stateful cognitive offloading as a design principle for retrieval-agent RL. Separating recoverable bookkeeping from semantic decisions makes RL training more stable and sample-efficient. The harness provides a clean interface between policy and environment.
Practical: Harness-1 demonstrates that a relatively small 20B model, when trained with a proper stateful harness and focused RL (4,352 unique training items), can outperform larger open baselines and several frontier models on average curated recall (GPT-5.4, Sonnet-4.6, Kimi-K2.5), trailing only Opus-4.6. The strong transfer performance suggests that training on explicit search-state operations (refine curated set, use evidence graph, verify before promotion) yields domain-general retrieval strategies.
Three design principles (warm-started curation, compact derived-state rendering, diversity-preserving incentives) are directly applicable to any agentic search system being trained with RL.

Conclusion

Harness-1 is a 20B open-source search agent trained with RL inside a stateful harness.
The harness externalises routine state management (candidate pools, importance-tagged curated sets, evidence graphs, verification records, compression/dedup, budget markers), leaving the policy to make semantic decisions.
It achieves 0.730 average curated recall across 8 benchmarks, improving over the next best open agent by 11.4 points while remaining competitive with frontier models.
Gains are especially strong on held-out transfer benchmarks (+17.0 pts vs Context-1), indicating domain-general learning.
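Component ablations indicate that the harness is not just an implementation detail but a core part of what the policy learns to use: disabling all mechanisms lowers curated recall from 0.584 to 0.513 (−12.2% relative), although individual mechanisms contribute unevenly and removing content-fingerprint dedup slightly raises recall.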
Future directions: replace regex-based evidence graph extraction with learned entity linking and relation extraction; explore uncertainty-aware evidence organisation; extend to more complex tool use and multi-step reasoning.