Summary (Overview)
- Harness-1 is a 20B open-source search agent trained with reinforcement learning (RL) inside a stateful harness that externalises routine bookkeeping (candidate pools, curated sets, evidence graphs, verification records, context budget).
- The policy retains only semantic decisions: what to search, which documents to curate, what to verify, and when to stop; the harness maintains recoverable search state.
- Across eight retrieval benchmarks (web, finance, patents, multi-hop QA), Harness-1 achieves 0.730 average curated recall, outperforming the next best open subagent (Tongyi DeepResearch 30B) by +11.4 points, and remains competitive with frontier models like Opus-4.6 and GPT-5.4.
- Gains are especially strong on held-out transfer benchmarks (mean +17.0 pts vs. Context-1), suggesting that RL over explicit search state produces retrieval behaviours that generalise beyond training domains.
- Three design requirements for stateful search harnesses are identified: warm-started curation, compact derived-state rendering (evidence graph, BM25 compression, deduplication), and diversity-preserving incentives (tool-diversity reward).
Introduction and Theoretical Foundation
Background and Problem.
Search agents are typically trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have been verified. This formulation forces the policy to learn both semantic search decisions and recoverable bookkeeping – the latter can be maintained more reliably by the environment.
Stateful cognitive offloading.
The paper argues for a clear separation: the harness (environment) should maintain working memory (candidate pool, curated set, evidence graph, verification records, context budget), while the policy makes only high-level search decisions. This gives RL a stable interface for improving search behaviour, rather than asking the model to rediscover state from an append-only transcript.
Motivation for RL-compatible harness.
Prior work has shown that RL can improve multi-turn search [14, 18, 16, 45], but training often suffers from:
- near-identical empty-set rewards on hard queries,
- collapse to repeated search calls,
- diffuse cross-document structure in transcripts,
- reward that does not distinguish search failure from forgotten evidence or poor curation.
Three requirements are identified for a trainable stateful harness:
- Warm-started curation – auto-seed the curated set from the first successful search to avoid indistinguishable early rewards.
- Compact derived-state rendering – importance tags, evidence-graph summaries, verification records, BM25 compression, and deduplication to keep context usable.
- Diversity-preserving incentives – reward shaping that encourages a rhythm of searching, curating, reviewing, and verifying (not just repeated search).
Methodology
Harness Architecture
The harness maintains a per-episode working memory with two tiers:
- Inner tier (rendered in the prompt): candidate pool, curated set with importance tags, evidence graph, verification records, search history, and budget marker.
- Outer tier: full-text store for retrieved chunks (not inlined into the prompt).
The state notation is summarised in Table 1:
| State | Harness-maintained bookkeeping | Policy decision preserved |
|---|---|---|
| Candidate pool | Candidate pool after compression/dedup | Which candidates to inspect, read, or curate |
| Curated set | Curated output set and importance tags, warm-started by auto-seed | Which documents to add, remove, promote, or demote |
| Full-text store | Full-text memory for retrieved documents | Which seen documents to revisit via `review_docs` |
| Evidence graph | Evidence graph (entities, dates, documents) | Which bridge, singleton, or relation to pursue |
| Verification records | Verification records for policy-written claims | Which claim to check and which documents to test |
| Search history | Search history and result summaries | When to diversify, backtrack, or continue |
| Budget marker | Budget-safe renderer and context marker | When to search, read, summarize, or stop |
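The two-tier layout can be pictured as a small data structure. The sketch below is illustrative only: the class name, field names, field types, and the `token_budget` default are assumptions made for this example, not the harness's actual schema. It shows the one property the text does state, namely that the inner tier is rendered into the prompt while retrieved full text stays in the outer store.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Hypothetical per-episode state; all names and types are assumptions."""

    # Inner tier: rendered into the prompt every turn.
    candidate_pool: list[str] = field(default_factory=list)        # doc ids after compression/dedup
    curated: dict[str, str] = field(default_factory=dict)          # doc id -> importance tag
    evidence_graph: dict[str, set[str]] = field(default_factory=dict)  # entity -> doc ids
    verifications: list[dict] = field(default_factory=list)        # claim, docs, verdict, rationale
    search_history: list[str] = field(default_factory=list)        # past queries
    tokens_used: int = 0
    token_budget: int = 32_000                                     # placeholder value

    # Outer tier: full text of retrieved chunks, never inlined by render_inner().
    full_text: dict[str, str] = field(default_factory=dict)

    def render_inner(self) -> str:
        """Render a compact view of the inner tier only."""
        curated = ", ".join(f"{d}[{tag}]" for d, tag in self.curated.items())
        lines = [
            "candidates: " + (", ".join(self.candidate_pool) or "(none)"),
            "curated: " + (curated or "(none)"),
            "entities: " + (", ".join(sorted(self.evidence_graph)) or "(none)"),
            f"verified claims: {len(self.verifications)}",
            f"searches so far: {len(self.search_history)}",
            f"budget: {self.tokens_used}/{self.token_budget} tokens",
        ]
        return "\n".join(lines)


memory = WorkingMemory()
memory.full_text["d1"] = "Full text of a long retrieved filing ..."
memory.candidate_pool.append("d1")
memory.curated["d1"] = "high"
prompt_state = memory.render_inner()
print(prompt_state)
```

The rendered string mentions `d1` and its tag but not the stored text, which is the separation between tiers described above.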
Policy Actions
Five action classes:
- Retrieval: `fan_out_search` (up to five diverse parallel queries), `search_corpus` (targeted hybrid search), `grep_corpus` (exact pattern match), `read_document` (full text). Outputs are compressed and deduplicated before the harness state is updated.
- Curation: `curate` updates the curated set and its importance tags, adding or removing documents tagged `very_high`, `high`, `fair`, or `low`. Capacity is capped; the lowest-importance documents are evicted first.
- Verification: `verify` lets the policy write a claim and select documents to test; the harness checks whether those documents support the claim and records a yes/no verdict with rationale in the verification records.
- Memory review: `review_docs` re-renders documents from the full-text store without a new corpus call.
- Termination: `end_search` commits the curated set.
Auto-seeding:
The first successful search automatically seeds the curated set with the top reranked results, each assigned an initial importance tag. This changes the task from construction-from-scratch to refinement.
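A minimal sketch of capped curation and auto-seeding, under stated assumptions: the function names, the dict-based curated set, the tie-breaking rule (oldest entry evicted first), and the values used for `top_k`, `seed_importance`, and `capacity` are all hypothetical. Only the four importance tags and the lowest-importance-evicted rule come from the text.

```python
# Importance tags from the text, ordered lowest to highest.
IMPORTANCE_ORDER = ["low", "fair", "high", "very_high"]


def curate(curated, doc_id, importance, capacity):
    """Add or re-tag a document, then evict lowest-importance entries over capacity."""
    if importance not in IMPORTANCE_ORDER:
        raise ValueError(f"unknown importance tag: {importance}")
    curated[doc_id] = importance
    while len(curated) > capacity:
        # min() scans in insertion order, so ties evict the oldest entry (an assumption).
        victim = min(curated, key=lambda d: IMPORTANCE_ORDER.index(curated[d]))
        del curated[victim]
    return curated


def auto_seed(curated, reranked_doc_ids, top_k, seed_importance, capacity):
    """Seed an empty curated set from the first successful search; otherwise do nothing."""
    if curated or not reranked_doc_ids:
        return curated
    for doc_id in reranked_doc_ids[:top_k]:
        curate(curated, doc_id, seed_importance, capacity)
    return curated


curated = {}
# Placeholder settings: seed three documents as "fair" into a set capped at three.
auto_seed(curated, ["d3", "d7", "d1", "d9"], top_k=3, seed_importance="fair", capacity=3)
seeded = dict(curated)
# The policy then promotes a better document; the oldest "fair" entry is evicted.
curate(curated, "d9", "very_high", capacity=3)
print(seeded, curated)
```

After seeding, every later `curate` call is a refinement of an existing set rather than a first insertion, which is the shift the text describes.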
Derived-State Rendering
- Evidence graph: A lightweight regex extractor identifies multi-word capitalized proper nouns, four-digit years, and numeric dates. It renders frequent entities, bridge documents (containing ≥2 frequent entities), and singleton entities (potential hops).
- Sentence-BM25 compression: Retrieval outputs (from `search_corpus`, `fan_out_search`, `grep_corpus`) are compressed by scoring sentences with BM25 against the query and rendering the top sentences in original order. Explicit `read_document` calls return full text.
- Two-level deduplication: Near-duplicates are removed by both chunk ID and content fingerprint (MinHash–LSH). Reward accounting still credits all relevant evidence encountered.
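Two of the rendering steps above, the regex evidence graph and sentence-level BM25 compression, can be sketched in a few lines. This is an illustration under assumptions, not the harness's implementation: the regex patterns, the `min_docs` frequency threshold, the sentence splitter, `top_k`, and the BM25 constants `k1` and `b` are all choices made for this example. Deduplication is not shown.

```python
import math
import re
from collections import Counter

# Assumed patterns approximating the extractor described above.
PROPER_NOUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")  # multi-word capitalized names
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")                          # four-digit years
NUMERIC_DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")         # e.g. 3/14/2019


def extract_entities(text):
    found = set()
    for pattern in (PROPER_NOUN, YEAR, NUMERIC_DATE):
        found.update(pattern.findall(text))
    return found


def build_evidence_graph(docs, min_docs=2):
    """Return (frequent entities, bridge doc ids, singleton entities)."""
    per_doc = {doc_id: extract_entities(text) for doc_id, text in docs.items()}
    doc_freq = Counter(e for ents in per_doc.values() for e in ents)
    frequent = {e for e, n in doc_freq.items() if n >= min_docs}
    bridges = sorted(d for d, ents in per_doc.items() if len(ents & frequent) >= 2)
    singletons = {e for e, n in doc_freq.items() if n == 1}
    return frequent, bridges, singletons


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(text, query, top_k=2, k1=1.5, b=0.75):
    """Keep the top_k sentences by BM25 score against the query, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    avg_len = sum(len(t) for t in tokens) / max(n, 1)
    df = Counter(term for t in tokens for term in set(t))  # sentence-level document frequency
    query_terms = set(tokenize(query))

    def score(sent_tokens):
        tf = Counter(sent_tokens)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(sent_tokens) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    ranked = sorted(range(n), key=lambda i: score(tokens[i]), reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(ranked))


docs = {
    "a": "Marie Curie won a prize in 1903.",
    "b": "Pierre Curie and Marie Curie shared work in 1903.",
    "c": "Ernest Rutherford moved in 1907.",
}
frequent, bridges, singletons = build_evidence_graph(docs)

passage = (
    "Acme Corp filed its annual report in 2021. The weather was mild that spring. "
    "Revenue at Acme Corp grew after the 2021 merger. Staff enjoyed a picnic."
)
compressed = compress(passage, "Acme Corp revenue 2021", top_k=2)
print(frequent, bridges, singletons)
print(compressed)
```

A regex extractor of this kind will also pick up capitalized sentence openers such as "The Nobel Prize"; the text's future-work note about replacing regex extraction with learned entity linking addresses that sort of noise.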
Training Pipeline
1. SFT (Supervised Fine-Tuning):
- Teacher: GPT-5.4 runs as a live agent inside the full Harness-1 harness.
- Trajectories are kept if they are format-valid, return ≥1 document, and reach final curated recall ≥0.10.
- 899 filtered trajectories → one datum per turn.
- LoRA (rank 32) on gpt-oss-20b for 3 epochs; step-550 checkpoint initializes RL.
2. RL (Reinforcement Learning):
- On-policy CISPO with within-group advantage normalization on SEC training queries.
- Full-trajectory rollouts, terminal-only reward, 40-turn cap, no KL anchor.
- Groups with identical rewards dropped from gradient.
- Batch size 128 × 8 rollouts.
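Two of the pipeline rules above are simple enough to sketch: the SFT trajectory filter and within-group advantage normalization with identical-reward groups dropped. The function names are hypothetical, and dividing by the group standard deviation is an assumption; the text says only that advantages are normalized within each group. The CISPO objective itself is not shown.

```python
from statistics import mean, pstdev


def keep_trajectory(format_valid, num_docs_returned, curated_recall, min_recall=0.10):
    """SFT filter from the text: format-valid, at least one document, recall >= 0.10."""
    return bool(format_valid) and num_docs_returned >= 1 and curated_recall >= min_recall


def group_advantages(rewards, eps=1e-8):
    """Normalize terminal rewards within one rollout group.

    Returns None when every rollout got the same reward, signalling that the
    group should be dropped from the gradient.
    """
    if max(rewards) == min(rewards):
        return None
    mu, sigma = mean(rewards), pstdev(rewards)
    # Assumption: mean-and-std normalization; the text does not give the exact form.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 8 rollouts for the same query (reward values are made up).
mixed_group = [0.0, 0.2, 0.2, 0.5, 0.5, 0.7, 0.9, 1.0]
flat_group = [0.0] * 8  # e.g. every rollout ended with an empty curated set

advantages = group_advantages(mixed_group)
dropped = group_advantages(flat_group)
print(advantages, dropped)
```

Dropping flat groups matters for the failure mode listed earlier: on hard queries where every rollout earns the same reward, the normalized advantages would all be zero and contribute nothing but noise.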
Reward function:
The terminal reward combines the following terms:
- Curated-set score: a precision-recall score on the curated set, with recall weighted 4× precision
- Trajectory recall
- Curated final-answer recall
- Trajectory final-answer recall
- Tool diversity: the number of distinct tools used, measured against a diversity target
- A penalty for answer evidence that was found but not curated
- A turn penalty
Episodes with an empty curated set short-circuit to a fixed reward. For non-empty episodes, the reward is clipped to a bounded range.
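The structure of such a reward (weighted score, shaping terms, empty-set short-circuit, clipping) can be sketched as below. This is a sketch under heavy assumptions and should not be read as the paper's formula: every coefficient in `RewardConfig` is a placeholder, the additive form is assumed, "recall weighted 4×" is interpreted as a weighted harmonic mean with weight 4 on recall, and the "found but not curated" penalty is approximated by the gap between trajectory and curated final-answer recall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    """Every value here is a hypothetical placeholder, not a reported coefficient."""

    w_trajectory_recall: float = 0.1
    w_curated_answer: float = 0.3
    w_trajectory_answer: float = 0.1
    w_diversity: float = 0.1
    w_uncurated_answer: float = 0.2
    turn_penalty: float = 0.005
    diversity_target: int = 4
    empty_set_reward: float = -1.0
    clip_low: float = -1.0
    clip_high: float = 2.0


def curated_score(precision, recall, recall_weight=4.0):
    """Weighted harmonic mean of precision and recall (assumed form)."""
    denom = recall_weight * precision + recall
    return 0.0 if denom == 0 else (1 + recall_weight) * precision * recall / denom


def terminal_reward(cfg, *, precision, recall, trajectory_recall, curated_answer_recall,
                    trajectory_answer_recall, distinct_tools, turns, curated_size):
    if curated_size == 0:
        return cfg.empty_set_reward  # short-circuit for an empty curated set
    diversity = min(distinct_tools, cfg.diversity_target) / cfg.diversity_target
    # Assumption: evidence found but not curated ~ gap between the two answer recalls.
    uncurated = max(trajectory_answer_recall - curated_answer_recall, 0.0)
    raw = (
        curated_score(precision, recall)
        + cfg.w_trajectory_recall * trajectory_recall
        + cfg.w_curated_answer * curated_answer_recall
        + cfg.w_trajectory_answer * trajectory_answer_recall
        + cfg.w_diversity * diversity
        - cfg.w_uncurated_answer * uncurated
        - cfg.turn_penalty * turns
    )
    return min(max(raw, cfg.clip_low), cfg.clip_high)


cfg = RewardConfig()
episode = dict(precision=0.5, recall=0.8, trajectory_recall=0.9, curated_answer_recall=0.5,
               trajectory_answer_recall=1.0, distinct_tools=3, turns=12)
reward = terminal_reward(cfg, curated_size=6, **episode)
empty = terminal_reward(cfg, curated_size=0, **episode)
print(reward, empty)
```

With the weighted score, a curated set with perfect recall and 0.5 precision scores higher than one with perfect precision and 0.5 recall, which matches the stated emphasis on recall.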
Empirical Validation / Results
Benchmarks and Setup
8 benchmarks: BrowseComp+ (BC+), Web synthetic, Patents (USPTO), SEC filings, LongSealQA, Seal0QA, FRAMES, HotpotQA.
Baselines: Context-1, gpt-oss-20b/120b, Qwen3-32B, Search-R1 32B, Tongyi DeepResearch 30B, and frontier models (Opus-4.6, GPT-5.4, Sonnet-4.6, Kimi-K2.5).
Metrics: Curated-Set Recall, Final-Answer Recall, Trajectory Recall.
Main Results (Table 2)
| Model | Metric | BC+ | Web | Patents | SEC | LongSeal | Seal0 | FRAMES | HotpotQA |
|---|---|---|---|---|---|---|---|---|---|
| Harness-1 (20B) | Recall | 0.584 | 0.787 | 0.890 | 0.589 | 0.891 | 0.551 | 0.710 | 0.841 |
| | Final-Answer Recall | 0.667 | 0.845 | 0.890 | 0.659 | 0.891 | 0.551 | 0.710 | 0.841 |
| | Trajectory Recall | 0.665 | 0.881 | 0.973 | 0.677 | 0.944 | 0.708 | 0.749 | 0.860 |
| Context-1 (20B) | Recall | 0.500 | 0.689 | 0.857 | 0.467 | 0.563 | 0.367 | 0.638 | 0.746 |
| Tongyi DR 30B | Recall | 0.381 | 0.607 | 0.780 | 0.419 | 0.855 | 0.577 | 0.561 | 0.751 |
| Opus-4.6 (frontier) | Recall | 0.619 | 0.819 | 0.915 | 0.589 | 0.823 | 0.636 | 0.814 | 0.893 |
| GPT-5.4 (frontier) | Recall | 0.443 | 0.800 | 0.909 | 0.452 | 0.824 | 0.555 | 0.815 | 0.870 |
- Average curated recall: Harness-1 = 0.730, next best open (Tongyi DR) = 0.616 (+11.4 pts).
- Competitive with frontier models: Harness-1 beats GPT-5.4 (0.659 avg), Sonnet-4.6, Kimi-K2.5, GPT-OSS-120B; only Opus-4.6 (0.775 avg) is ahead.
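The headline average and the +11.4-point gap follow directly from the per-benchmark curated recall rows in Table 2. The short check below recomputes them for Harness-1 and Tongyi DR; the variable names are illustrative, and the values are copied from the table.

```python
# Curated recall per benchmark, in Table 2 order:
# BC+, Web, Patents, SEC, LongSeal, Seal0, FRAMES, HotpotQA.
curated_recall = {
    "Harness-1 (20B)": [0.584, 0.787, 0.890, 0.589, 0.891, 0.551, 0.710, 0.841],
    "Tongyi DR 30B": [0.381, 0.607, 0.780, 0.419, 0.855, 0.577, 0.561, 0.751],
}

averages = {model: sum(vals) / len(vals) for model, vals in curated_recall.items()}
gap_points = 100 * (averages["Harness-1 (20B)"] - averages["Tongyi DR 30B"])

for model, avg in averages.items():
    print(f"{model}: {avg:.3f}")
print(f"gap: {gap_points:+.1f} pts")
```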
Transfer Pattern (Figure 3)
Gains on held-out transfer benchmarks (not used in SFT or RL) are 2.2× larger than on source-family benchmarks:
- Source-family (BC+, Web, Patents, SEC): mean +7.9 recall pts over Context-1.
- Held-out (LongSeal, Seal0, FRAMES, HotpotQA): mean +17.0 recall pts.
Inference-Time Component Ablation (Table 3)
| Configuration | Recall Δ (rel%) | FA Recall Δ (rel%) |
|---|---|---|
| Full Harness-1 | 0.584 (—) | 0.667 (—) |
| – Importance tags | 0.560 (−4.1%) | 0.614 (−7.9%) |
| – Sentence-BM25 compression | 0.585 (+0.2%) | 0.620 (−7.0%) |
| – Auto-seed on first search | 0.582 (−0.3%) | 0.624 (−6.4%) |
| – Evidence graph | 0.569 (−2.6%) | 0.631 (−5.4%) |
| – `verify` returns "unavailable" | 0.566 (−3.1%) | 0.641 (−3.9%) |
| – `review_docs` returns "unavailable" | 0.598 (+2.4%) | 0.641 (−3.9%) |
| – Content-fingerprint dedup | 0.611 (+4.6%) | 0.678 (+1.6%) |
| All mechanisms disabled | 0.513 (−12.2%) | 0.624 (−6.4%) |
Removing state mechanisms causes the trained policy to revert to a wide, shallow, search-dominated mode. Content dedup trades a small recall loss (some gold near-duplicates collapsed) for token efficiency.
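The parenthesised percentages in Table 3 are relative changes against the full configuration. A small check, with three rows copied from the table (the helper name and dictionary keys are illustrative):

```python
FULL = {"recall": 0.584, "fa_recall": 0.667}

# (curated recall, final-answer recall) for a few ablation rows from Table 3.
ablations = {
    "no importance tags": (0.560, 0.614),
    "no content-fingerprint dedup": (0.611, 0.678),
    "all mechanisms disabled": (0.513, 0.624),
}


def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


deltas = {
    name: (
        round(relative_change(recall, FULL["recall"]), 1),
        round(relative_change(fa_recall, FULL["fa_recall"]), 1),
    )
    for name, (recall, fa_recall) in ablations.items()
}

for name, (d_recall, d_fa) in deltas.items():
    print(f"{name}: recall {d_recall:+.1f}%, FA recall {d_fa:+.1f}%")
```

Because these are relative rather than absolute changes, the −12.2% for the fully disabled harness corresponds to a drop of about 7 recall points (0.584 to 0.513).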
Training Dynamics (Figure 5)
Comparison of RL runs with and without tool-diversity reward:
- Without the tool-diversity reward: tool diversity collapses from ~6 to ~3.5 and curated recall plateaus at ~0.53 (the policy finds documents but does not curate them).
- With the tool-diversity reward: tool diversity stabilises at ~4.30 and curated recall reaches ~0.60 (higher final quality).
Modular RAG Accuracy
When curated sets are passed to frozen frontier generators, Harness-1 yields higher downstream answer accuracy than other open subagents.
Theoretical and Practical Implications
- Theoretical: The paper establishes stateful cognitive offloading as a design principle for retrieval-agent RL. Separating recoverable bookkeeping from semantic decisions makes RL training more stable and sample-efficient. The harness provides a clean interface between policy and environment.
- Practical: Harness-1 demonstrates that a relatively small 20B model, when trained with a proper stateful harness and focused RL (4,352 unique training items), can match or exceed much larger frontier models. The strong transfer performance suggests that training on explicit search-state operations (refine curated set, use evidence graph, verify before promotion) yields domain-general retrieval strategies.
- Three design principles (warm-started curation, compact derived-state rendering, diversity-preserving incentives) are directly applicable to any agentic search system being trained with RL.
Conclusion
- Harness-1 is a 20B open-source search agent trained with RL inside a stateful harness.
- The harness externalises routine state management (candidate pools, importance-tagged curated sets, evidence graphs, verification records, compression/dedup, budget markers), leaving the policy to make semantic decisions.
- Harness-1 achieves 0.730 average curated recall across 8 benchmarks, improving over the next best open agent by +11.4 pts and remaining competitive with frontier models.
- Gains are especially strong on held-out transfer benchmarks (+17.0 pts vs Context-1), indicating domain-general learning.
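- Component ablations show that the trained policy depends on the harness: disabling all mechanisms lowers curated recall by 12.2% (relative), and removing any single mechanism except content-fingerprint dedup lowers final-answer recall. The harness is not just an implementation detail but a core part of what the policy learns to use.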
- Future directions: replace regex-based evidence graph extraction with learned entity linking and relation extraction; explore uncertainty-aware evidence organisation; extend to more complex tool use and multi-step reasoning.