Summary (Overview)
- Proposes Direct Corpus Interaction (DCI): A new retrieval paradigm in which an agent uses general-purpose terminal tools (e.g., grep, bash, file reads) to search a raw corpus directly, bypassing conventional embedding models, vector indexes, and fixed top-k retrieval APIs.
- Demonstrates Strong Performance: DCI agents outperform competitive sparse, dense, and reranking baselines across diverse tasks: end-to-end agentic search (BrowseComp-Plus), multi-hop QA, and IR ranking benchmarks (BRIGHT/BEIR), often while reducing cost.
- Introduces "Retrieval Interface Resolution": A conceptual lens explaining DCI's gains. Analysis shows its advantage stems not from higher gold-document recall but from finer-grained, localized evidence use—converting surfaced documents into high-value inspection, verification, and compositional search steps.
Introduction and Theoretical Foundation
Modern retrieval-augmented systems expose a corpus through a fixed similarity interface (lexical like BM25 or semantic/dense embeddings), compressing access into a single top-k retrieval step before reasoning. While efficient, this abstraction becomes a bottleneck for agentic search, where tasks require orchestrating multiple steps: discovering intermediate entities, combining weak clues, enforcing exact lexical constraints, checking local context, and refining hypotheses after observing partial evidence. Evidence filtered out early by the retriever cannot be recovered by downstream reasoning.
The paper argues that as language agents become stronger and more strategic, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus. To overcome this bottleneck, the authors propose Direct Corpus Interaction (DCI). In DCI, an agent searches the raw corpus directly using terminal tools, delegating semantic interpretation to the agent itself. This requires no offline indexing and adapts naturally to evolving local corpora. The core theoretical shift is reframing retrieval for capable agents as an interface-design problem (whose granularity determines what the agent can observe, verify, and act upon) rather than solely a retriever-design problem.
Methodology
The paper compares two broad paradigms for corpus access during agentic search (visualized in Figure 2):
- Retriever-mediated access: The agent queries a conventional retriever (sparse/dense) and receives a ranked top-k list of documents/snippets.
- Direct Corpus Interaction (DCI): The agent bypasses any embedding model or retrieval API, interacting with the raw corpus via a command-line interface using tools like grep, rg (ripgrep), find, glob, and targeted file reads.
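The raw-corpus access pattern can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the corpus is stored as plain-text files under a directory, and the function names (`grep_corpus`, `read_span`) and flags are illustrative choices.

```python
import pathlib
import subprocess

def grep_corpus(corpus_dir: str, pattern: str, context: int = 0) -> list[str]:
    """DCI search primitive: regex search over raw files via grep, no index.

    `context` adds -C<n> lines of surrounding text for local context peeking.
    Returns the matching output lines (empty list if nothing matches).
    """
    cmd = ["grep", "-rn", f"-C{context}", "-E", pattern, corpus_dir]
    out = subprocess.run(cmd, capture_output=True, text=True)
    return out.stdout.splitlines()

def read_span(path: str, start: int, end: int) -> str:
    """DCI read primitive: targeted file read of lines start..end (1-indexed)."""
    lines = pathlib.Path(path).read_text().splitlines()
    return "\n".join(lines[start - 1:end])
```

An agent loop would interleave these calls with reasoning: search, inspect a narrow span, then refine the next query based on what was observed.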
DCI Agent Implementations
Two agent scaffolds instantiate DCI, differing in runtime support to isolate the interface change:
- DCI-Agent-Lite: A minimal terminal coding agent adapted from Pi, restricted to bash and file reads. It uses GPT-5.4 nano as its base model and includes a lightweight runtime context-management layer.
- DCI-Agent-CC (Claude Code): Built on the off-the-shelf CLI agent Claude Code, using Claude Sonnet 4.6 as its base model. It provides stronger prompting and tool orchestration but still operates purely through terminal tools over the raw corpus.
Runtime Context Management
Repeated tool calls can return large amounts of text, risking context window overflow. DCI-Agent-Lite employs a lightweight runtime layer with three mechanisms (visualized in Figure 3):
- Truncation: Caps text from each tool call before inserting into context.
- Compaction: Clears contents of older tool-result turns once accumulated output exceeds a threshold, replacing them with short placeholders.
- Summarization: Replaces compacted history with a model-generated summary under high context pressure.
Five context-management policies (L0 to L4) enable different subsets of these mechanisms with varying aggressiveness, as defined in Table 1.
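The first two mechanisms can be sketched as follows. This is an illustrative sketch only: the thresholds, the placeholder string, and the turn structure are assumptions, not the paper's exact policy definitions from Table 1.

```python
def truncate(text: str, cap: int = 2000) -> str:
    """Truncation: cap the text kept from each tool call before it
    enters the context."""
    return text if len(text) <= cap else text[:cap] + "\n[...truncated]"

def compact(history: list[dict], budget: int = 8000) -> list[dict]:
    """Compaction: once accumulated tool output exceeds the budget,
    clear the oldest tool-result turns, replacing each with a short
    placeholder, until the remaining tool output fits the budget."""
    total = sum(len(t["content"]) for t in history if t["role"] == "tool")
    out = []
    for turn in history:
        if turn["role"] == "tool" and total > budget:
            total -= len(turn["content"])
            out.append({"role": "tool", "content": "[tool output compacted]"})
        else:
            out.append(turn)
    return out
```

Summarization would go one step further, replacing the compacted placeholders with a single model-generated summary when context pressure is high.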
Evaluation Metrics
Beyond answer accuracy, the paper introduces trajectory-level metrics to characterize qualitative differences between DCI and retriever-mediated access:
- Coverage: Measures whether a trajectory surfaces the relevant (gold) documents:

  $$\mathrm{Cov}(\tau) = \frac{|G_q \cap D_\tau|}{|G_q|},$$

  where $G_q$ are the gold documents for question $q$, and $D_\tau$ are those surfaced in trajectory $\tau$.
- Localization: Measures how efficiently a trajectory narrows to a small, usable evidence span within each surfaced gold document. It builds on a segment normalization

  $$\mathrm{seg}(\ell) = \lceil \ell / c \rceil,$$

  with segment size $c$. Here $\mathrm{seg}(\cdot)$ maps a character length $\ell$ to a segment count, and the score assigns a higher value when the snippet is small relative to its document. For a candidate snippet $s$ of length $\ell_s$ from gold document $d$ of length $\ell_d$, the segment score is

  $$\mathrm{loc}(s, d) = 1 - \frac{\mathrm{seg}(\ell_s)}{\mathrm{seg}(\ell_d)}.$$

  The best localization for document $d$ is $\mathrm{loc}(d) = \max_{s} \mathrm{loc}(s, d)$, and the trajectory-level average is

  $$\mathrm{Loc}(\tau) = \frac{1}{|G_q \cap D_\tau|} \sum_{d \in G_q \cap D_\tau} \mathrm{loc}(d).$$
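As a concrete (and explicitly assumed) instantiation of the prose definitions above: coverage as the fraction of gold documents surfaced, and localization built from a segment count with an assumed segment size and an assumed ratio-based score. The helper names and the constant `c = 512` are illustrative, not the paper's exact parameters.

```python
import math

def coverage(gold: set[str], surfaced: set[str]) -> float:
    """Fraction of gold documents surfaced in the trajectory."""
    return len(gold & surfaced) / len(gold) if gold else 0.0

def seg(length: int, c: int = 512) -> int:
    """Map a character length to a segment count (segment size c is assumed)."""
    return max(1, math.ceil(length / c))

def loc_score(snippet_len: int, doc_len: int) -> float:
    """Segment score: higher when the snippet is small relative to its doc."""
    return 1.0 - seg(snippet_len) / seg(doc_len)

def localization(snippets_per_doc: dict[str, list[tuple[int, int]]]) -> float:
    """Take the best snippet score per surfaced gold document, then average
    over documents. Maps doc id -> [(snippet_len, doc_len), ...]."""
    if not snippets_per_doc:
        return 0.0
    best = [max(loc_score(s, d) for s, d in spans)
            for spans in snippets_per_doc.values()]
    return sum(best) / len(best)
```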
Empirical Validation / Results
Experiments evaluate DCI across three benchmark families: Agentic Search (BrowseComp-Plus), Knowledge-Intensive QA (NQ, TriviaQA, Bamboogle, HotpotQA, 2WikiMultiHopQA, MuSiQue), and IR Ranking (BRIGHT and BEIR datasets).
Main Results (RQ1)
Agentic Search: On BrowseComp-Plus, replacing a Qwen3-Embedding-8B retriever with DCI under the same Claude Sonnet 4.6 backbone improves accuracy from 69.0% to 80.0% (+11.0 points) while reducing cost from $1,440 to $1,016 (-29.4%). DCI-Agent-Lite (GPT-5.4 nano) achieves 62.9% accuracy at only $93 cost.
Knowledge-Intensive QA: As shown in Table 2, DCI agents consistently surpass retrieval-agent baselines. DCI-Agent-CC attains 83.0% average accuracy, exceeding the strongest baseline (ASearcher-Local-14B at 52.3%) by 30.7 points.
Table 2: Accuracy on multi-hop QA benchmarks.
| Model | NQ | Trivia | Bam. | Hotpot | 2Wiki | MuSiQue | Avg. | ∆ Avg. |
|---|---|---|---|---|---|---|---|---|
| Retrieval Agents | ||||||||
| R1-Searcher-7B | 58 | 50 | 54 | 46 | 40 | 24 | 45.3 | ↓ 7.0 |
| Search-R1-32B | 56 | 46 | 52 | 44 | 50 | 32 | 46.7 | ↓ 5.6 |
| ZeroSearch-7B | 26 | 30 | 18 | 10 | 18 | 4 | 17.7 | ↓ 34.6 |
| Verl-Tool-Search-7B-DAPO | 56 | 44 | 32 | 50 | 32 | 12 | 37.7 | ↓ 14.6 |
| ASearcher-Local-14B | 56 | 58 | 62 | 58 | 56 | 24 | 52.3 | – |
| DCI Agents | ||||||||
| DCI-Agent-Lite (GPT-5.4 nano) | 72 | 84 | 72 | 72 | 68 | 40 | 68.0 | ↑ 15.7 |
| DCI-Agent-CC (Sonnet 4.6) | 78 | 96 | 80 | 88 | 82 | 74 | 83.0 | ↑ 30.7 |
IR Ranking: As shown in Table 3, DCI-Agent-CC achieves the best NDCG@10 score on all six datasets, with an average of 68.5%, exceeding the strongest retrieval baseline (ReasonRank-32B at 47.0%) by 21.5 points. DCI-Agent-Lite ranks second overall with an average of 56.7%.
Table 3: NDCG@10 on IR ranking benchmarks.
| Method | Bio. | Earth. | Econ. | Robotics | ArguAna | SciFact | Avg. | ∆ Avg. |
|---|---|---|---|---|---|---|---|---|
| Sparse & Dense Retrieval | ||||||||
| BM25 | 18.9 | 27.2 | 14.9 | 13.6 | 31.5 | 15.8 | 20.3 | ↓ 26.7 |
| OpenAI-text-emb-3-large | 23.3 | 26.7 | 19.5 | 12.8 | 58.1 | 58.1 | 33.1 | ↓ 13.9 |
| GTE-Qwen2-7B-Instruct | 30.6 | 36.4 | 17.8 | 13.2 | 62.7 | 75.3 | 39.3 | ↓ 7.7 |
| Rank-R1-14B | 31.2 | 38.5 | 21.2 | 22.6 | 31.3 | 72.2 | 36.2 | ↓ 10.8 |
| Rank1-32B | 49.7 | 35.8 | 22.0 | 22.5 | 57.6 | 74.8 | 43.7 | ↓ 3.3 |
| ReasonRank-32B | 58.2 | 48.9 | 36.6 | 33.9 | 28.7 | 75.5 | 47.0 | – |
| DCI Agents | ||||||||
| DCI-Agent-Lite (GPT-5.4 nano) | 60.0 | 50.8 | 32.3 | 42.4 | 81.9 | 72.7 | 56.7 | ↑ 9.7 |
| DCI-Agent-CC (Sonnet 4.6) | 77.1 | 69.0 | 46.8 | 56.8 | 85.3 | 75.7 | 68.5 | ↑ 21.5 |
Controlled Ablations and Mechanism Analysis
RQ2 (Why does DCI help?): Analysis of BrowseComp-Plus trajectories shows DCI's advantage arises less from higher gold-document recall and more from fine-grained discovery, composition, and use of evidence through flexible bash commands. Among the cases that DCI-Agent-CC wins, only 34 involved the retriever failing to surface any gold document; in 142 cases the retriever surfaced at least one gold document yet the retrieval agent still failed. Tool usage concentrates on chained search, local context peeking, regex matching, and file localization, not full-document reads.
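The "chained search" behavior can be illustrated with a two-hop composition: one search surfaces an intermediate entity, which seeds the next search. This is a toy sketch, not the agent's actual command sequence; the function name, regexes, and grep flags are illustrative.

```python
import re
import subprocess

def chained_search(corpus_dir: str, seed_pattern: str,
                   entity_regex: str) -> list[str]:
    """Two-hop chained search over a plain-text corpus directory.

    Hop 1: grep for the seed pattern, printing only the matched spans
    (-o), without filenames (-h). Hop 2: extract intermediate entities
    from those spans, then grep for each entity as a fixed string (-F),
    listing the files that contain it (-l).
    """
    hits = subprocess.run(
        ["grep", "-rhoE", seed_pattern + ".*", corpus_dir],
        capture_output=True, text=True).stdout
    entities = set(re.findall(entity_regex, hits))
    results = []
    for ent in sorted(entities):
        out = subprocess.run(["grep", "-rlF", ent, corpus_dir],
                             capture_output=True, text=True)
        results.extend(out.stdout.splitlines())
    return results
```

The key point is compositionality: the second query is constructed from evidence the first query surfaced, something a single fixed top-k retrieval call cannot express.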
RQ3 (Behavioral tradeoffs): As shown in Table 4, DCI-Agent-Lite does not win by exhaustive gold-chain recovery. Its mean gold-document coverage (28.0) is much lower than Qwen3-Embedding-8B's (56.7), but its localization score is dramatically higher (48.4 vs. 21.7). This indicates DCI trades exhaustive recovery for high-resolution local progress, extracting more value from documents it reaches.
Table 4: Trajectory analysis on a BrowseComp-Plus subset (n=100).
| Method | Avg. tools ↓ | Cost / q ($) ↓ | Cov. (any) ↑ | Cov. (mean recall) ↑ | Cov. (all) ↑ | Avg. localization ↑ |
|---|---|---|---|---|---|---|
| Retrieval Agents (GPT-5.4-nano) |||||||
| BM25 | 19.07 | 0.0527 | 63.0 | 42.8 | 17.0 | 23.5 |
| Qwen3-Embedding-8B | 17.55 | 0.0498 | 74.0 | 56.7 | 28.0 | 21.7 |
| DCI-Agent-Lite (GPT-5.4-nano) |||||||
| Direct interaction (L4) | 35.35 | 0.1021 | 70.0 | 28.0 | 1.0 | 48.4 |
RQ4 (Corpus scaling): DCI scales well in search depth but incurs rapidly rising costs in search breadth. Expanding the corpus from 100K to 200K documents causes tool calls per question to more than double (38.5 → 86.9), latency and cost to more than double, and accuracy to drop by 13.6 points.
RQ5 (Context management): Ablation of context-management policies (Table 6) shows a non-monotonic pattern. L3 achieves the best answer accuracy (77) but not the highest retained gold evidence coverage, indicating that preserving verbatim evidence is not the same as maintaining the right working state for continued search.
Table 6: DCI-Agent-Lite context-management ablation on a BrowseComp-Plus subset (n=100).
| Level | Avg. tools ↓ | Latency (s) ↓ | Cost / q ($) ↓ | Retained cov. ↑ | Acc. ↑ |
|---|---|---|---|---|---|
| L0 | 28.54 | 2226.22 | 0.0716 | 26.9 | 72 |
| L1 | 29.00 | 1819.78 | 0.0720 | 31.3 (+4.4) | 75 (+3) |
| L2 | 29.95 | 4412.73 | 0.0590 | 27.2 (+0.3) | 69 (-3) |
| L3 | 36.89 | 8711.81 | 0.1109 | 27.0 (+0.1) | 77 (+5) |
| L4 | 35.35 | 4531.11 | 0.1021 | 28.0 (+1.1) | 73 (+1) |
RQ6 (Tool-set expressivity): Ablation in Table 5 shows the benefit appears even under a highly constrained interface (read + grep only), which achieves 61% accuracy, outperforming the Qwen3-Embedding-8B retrieval baseline (45%) by 16 points. Enabling the full bash command set adds a further 12-point gain but with substantially higher tool usage and cost.
Table 5: Tool-profile ablation on a BrowseComp-Plus subset (n=100).
| Method | Avg. tools ↓ | Cost / q ($) ↓ | Acc. ↑ |
|---|---|---|---|
| Retrieval Agents (GPT-5.4-nano) | |||
| BM25 | 19 | 0.0527 | 32 |
| Qwen3-Embedding-8B | 18 | 0.0498 | 45 |
| DCI-Agent-Lite (GPT-5.4-nano) | |||
| read + grep (L4) | 19 | 0.0355 | 61 |
| Open bash (L4) | 35 | 0.1021 | 73 |
Theoretical and Practical Implications
The results support a broader view of retrieval in agentic systems: the central question is not just which retriever to use, but which interface best aligns with the agent's reasoning. When models can search strategically, compressed similarity indexes become a bottleneck, making higher-resolution interfaces like DCI more valuable.
Theoretical Implication: DCI is evidence that retrieval for capable agents should be reframed as an interface-design problem (whose granularity determines what the agent can observe, verify, and act upon) rather than solely a retriever-design problem. The paper introduces retrieval interface resolution as a conceptual lens to explain DCI's effectiveness.
Practical Implications:
- No Offline Indexing: DCI requires no offline embedding or indexing, reducing setup complexity.
- Adaptability: It adapts naturally to evolving local corpora, since there is no index to rebuild as documents are added or changed.