Summary (Overview)
- Proposes Direct Corpus Interaction (DCI): A new retrieval paradigm in which an agent uses general-purpose terminal tools (e.g., grep, bash, file reads) to search a raw corpus directly, bypassing conventional embedding models, vector indexes, and fixed top-k retrieval APIs.
- Demonstrates Strong Performance: DCI agents outperform competitive sparse, dense, and reranking baselines across diverse tasks: end-to-end agentic search (BrowseComp-Plus), multi-hop QA, and IR ranking benchmarks (BRIGHT/BEIR), often while reducing cost.
- Introduces "Retrieval Interface Resolution": A conceptual lens explaining DCI's gains. Analysis shows its advantage stems not from higher gold-document recall but from finer-grained, localized evidence use—converting surfaced documents into high-value inspection, verification, and compositional search steps.
Introduction and Theoretical Foundation
Modern retrieval-augmented systems expose a corpus through a fixed similarity interface (lexical like BM25 or semantic/dense embeddings), compressing access into a single top-k retrieval step before reasoning. While efficient, this abstraction becomes a bottleneck for agentic search, where tasks require orchestrating multiple steps: discovering intermediate entities, combining weak clues, enforcing exact lexical constraints, checking local context, and refining hypotheses after observing partial evidence. Evidence filtered out early by the retriever cannot be recovered by downstream reasoning.
The paper argues that as language agents become stronger and more strategic, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus. To overcome this bottleneck, the authors propose Direct Corpus Interaction (DCI). In DCI, an agent searches the raw corpus directly using terminal tools, delegating semantic interpretation to the agent itself. This requires no offline indexing and adapts naturally to evolving local corpora. The core theoretical shift is reframing retrieval for capable agents as an interface-design problem (whose granularity determines what the agent can observe, verify, and act upon) rather than solely a retriever-design problem.
Methodology
The paper compares two broad paradigms for corpus access during agentic search (visualized in Figure 2):
- Retriever-mediated access: The agent queries a conventional retriever (sparse/dense) and receives a ranked top-k list of documents/snippets.
- Direct Corpus Interaction (DCI): The agent bypasses any embedding model or retrieval API, interacting with the raw corpus via a command-line interface using tools like grep, rg (ripgrep), find, glob, and targeted file reads.
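The raw-corpus access pattern can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the corpus is stored as plain-text files under a directory, and the function names (`grep_corpus`, `read_span`) and flags are illustrative choices.

```python
import pathlib
import subprocess

def grep_corpus(corpus_dir: str, pattern: str, context: int = 0) -> list[str]:
    """DCI search primitive: regex search over raw files via grep, no index.

    `context` adds -C<n> lines of surrounding text for local context peeking.
    Returns the matching output lines (empty list if nothing matches).
    """
    cmd = ["grep", "-rn", f"-C{context}", "-E", pattern, corpus_dir]
    out = subprocess.run(cmd, capture_output=True, text=True)
    return out.stdout.splitlines()

def read_span(path: str, start: int, end: int) -> str:
    """DCI read primitive: targeted file read of lines start..end (1-indexed)."""
    lines = pathlib.Path(path).read_text().splitlines()
    return "\n".join(lines[start - 1:end])
```

An agent loop would interleave these calls with reasoning: search, inspect a narrow span, then refine the next query based on what was observed.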
DCI Agent Implementations
Two agent scaffolds instantiate DCI, differing in runtime support to isolate the interface change:
- DCI-Agent-Lite: A minimal terminal coding agent adapted from Pi, restricted to bash and file reads. It uses GPT-5.4 nano as its base model and includes a lightweight runtime context-management layer.
- DCI-Agent-CC (Claude Code): Built on the off-the-shelf CLI agent Claude Code, using Claude Sonnet 4.6 as its base model. It provides stronger prompting and tool orchestration but still operates purely through terminal tools over the raw corpus.
Runtime Context Management
Repeated tool calls can return large amounts of text, risking context window overflow. DCI-Agent-Lite employs a lightweight runtime layer with three mechanisms (visualized in Figure 3):
- Truncation: Caps text from each tool call before inserting into context.
- Compaction: Clears contents of older tool-result turns once accumulated output exceeds a threshold, replacing them with short placeholders.
- Summarization: Replaces compacted history with a model-generated summary under high context pressure.
Five context-management policies (L0 to L4) enable different subsets of these mechanisms with varying aggressiveness, as defined in Table 1.
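The first two mechanisms can be sketched as follows. This is an illustrative sketch only: the thresholds, the placeholder string, and the turn structure are assumptions, not the paper's exact policy definitions from Table 1.

```python
def truncate(text: str, cap: int = 2000) -> str:
    """Truncation: cap the text kept from each tool call before it
    enters the context."""
    return text if len(text) <= cap else text[:cap] + "\n[...truncated]"

def compact(history: list[dict], budget: int = 8000) -> list[dict]:
    """Compaction: once accumulated tool output exceeds the budget,
    clear the oldest tool-result turns, replacing each with a short
    placeholder, until the remaining tool output fits the budget."""
    total = sum(len(t["content"]) for t in history if t["role"] == "tool")
    out = []
    for turn in history:
        if turn["role"] == "tool" and total > budget:
            total -= len(turn["content"])
            out.append({"role": "tool", "content": "[tool output compacted]"})
        else:
            out.append(turn)
    return out
```

Summarization would go one step further, replacing the compacted placeholders with a single model-generated summary when context pressure is high.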
Evaluation Metrics
Beyond answer accuracy, the paper introduces trajectory-level metrics to characterize qualitative differences between DCI and retriever-mediated access:
- Coverage: Measures whether a trajectory surfaces the relevant (gold) documents:

  $$\mathrm{Cov}(\tau) = \frac{|G_q \cap D_\tau|}{|G_q|},$$

  where $G_q$ are the gold documents for question $q$, and $D_\tau$ are those surfaced in trajectory $\tau$.
- Localization: Measures how efficiently a trajectory narrows to a small, usable evidence span within each surfaced gold document. It builds on a segment normalization

  $$\mathrm{seg}(\ell) = \lceil \ell / c \rceil,$$

  with segment size $c$. Here $\mathrm{seg}(\cdot)$ maps a character length $\ell$ to a segment count, and the score assigns a higher value when the snippet is small relative to its document. For a candidate snippet $s$ of length $\ell_s$ from gold document $d$ of length $\ell_d$, the segment score is

  $$\mathrm{loc}(s, d) = 1 - \frac{\mathrm{seg}(\ell_s)}{\mathrm{seg}(\ell_d)}.$$

  The best localization for document $d$ is $\mathrm{loc}(d) = \max_{s} \mathrm{loc}(s, d)$, and the trajectory-level average is

  $$\mathrm{Loc}(\tau) = \frac{1}{|G_q \cap D_\tau|} \sum_{d \in G_q \cap D_\tau} \mathrm{loc}(d).$$
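As a concrete (and explicitly assumed) instantiation of the prose definitions above: coverage as the fraction of gold documents surfaced, and localization built from a segment count with an assumed segment size and an assumed ratio-based score. The helper names and the constant `c = 512` are illustrative, not the paper's exact parameters.

```python
import math

def coverage(gold: set[str], surfaced: set[str]) -> float:
    """Fraction of gold documents surfaced in the trajectory."""
    return len(gold & surfaced) / len(gold) if gold else 0.0

def seg(length: int, c: int = 512) -> int:
    """Map a character length to a segment count (segment size c is assumed)."""
    return max(1, math.ceil(length / c))

def loc_score(snippet_len: int, doc_len: int) -> float:
    """Segment score: higher when the snippet is small relative to its doc."""
    return 1.0 - seg(snippet_len) / seg(doc_len)

def localization(snippets_per_doc: dict[str, list[tuple[int, int]]]) -> float:
    """Take the best snippet score per surfaced gold document, then average
    over documents. Maps doc id -> [(snippet_len, doc_len), ...]."""
    if not snippets_per_doc:
        return 0.0
    best = [max(loc_score(s, d) for s, d in spans)
            for spans in snippets_per_doc.values()]
    return sum(best) / len(best)
```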
Empirical Validation / Results
Experiments evaluate DCI across three benchmark families: Agentic Search (BrowseComp-Plus), Knowledge-Intensive QA (NQ, TriviaQA, Bamboogle, HotpotQA, 2WikiMultiHopQA, MuSiQue), and IR Ranking (BRIGHT and BEIR datasets).
Main Results (RQ1)
Agentic Search: On BrowseComp-Plus, replacing a Qwen3-Embedding-8B retriever with DCI under the same Claude Sonnet 4.6 backbone improves accuracy from 69.0% to 80.0% (+11.0 points) while reducing cost from $1,440 to $1,016 (-29.4%). DCI-Agent-Lite (GPT-5.4 nano) achieves 62.9% accuracy at only $93 cost.
Knowledge-Intensive QA: As shown in Table 2, DCI agents consistently surpass retrieval-agent baselines. DCI-Agent-CC attains 83.0% average accuracy, exceeding the strongest baseline (ASearcher-Local-14B at 52.3%) by 30.7 points.
Table 2: Accuracy on multi-hop QA benchmarks.
| Model | NQ | Trivia | Bam. | Hotpot | 2Wiki | MuSiQue | Avg. | ∆ Avg. |
|---|---|---|---|---|---|---|---|---|
| Retrieval Agents | ||||||||
| R1-Searcher-7B | 58 | 50 | 54 | 46 | 40 | 24 | 45.3 | ↓ 7.0 |
| Search-R1-32B | 56 | 46 | 52 | 44 | 50 | 32 | 46.7 | ↓ 5.6 |
| ZeroSearch-7B | 26 | 30 | 18 | 10 | 18 | 4 | 17.7 | ↓ 34.6 |
| Verl-Tool-Search-7B-DAPO | 56 | 44 | 32 | 50 | 32 | 12 | 37.7 | ↓ 14.6 |
| ASearcher-Local-14B | 56 | 58 | 62 | 58 | 56 | 24 | 52.3 | – |
| DCI Agents | ||||||||
| DCI-Agent-Lite (GPT-5.4 nano) | 72 | 84 | 72 | 72 | 68 | 40 | 68.0 | ↑ 15.7 |
| DCI-Agent-CC (Sonnet 4.6) | 78 | 96 | 80 | 88 | 82 | 74 | 83.0 | ↑ 30.7 |
IR Ranking: As shown in Table 3, DCI-Agent-CC achieves the best NDCG@10 score on all six datasets, with an average of 68.5%, exceeding the strongest retrieval baseline (ReasonRank-32B at 47.0%) by 21.5 points. DCI-Agent-Lite ranks second overall with an average of 56.7%.
Table 3: NDCG@10 on IR ranking benchmarks.
| Method | Bio. | Earth. | Econ. | Robotics | ArguAna | SciFact | Avg. | ∆ Avg. |
|---|---|---|---|---|---|---|---|---|
| Sparse & Dense Retrieval | ||||||||
| BM25 | 18.9 | 27.2 | 14.9 | 13.6 | 31.5 | 15.8 | 20.3 | ↓ 26.7 |
| OpenAI-text-emb-3-large | 23.3 | 26.7 | 19.5 | 12.8 | 58.1 | 58.1 | 33.1 | ↓ 13.9 |
| GTE-Qwen2-7B-Instruct | 30.6 | 36.4 | 17.8 | 13.2 | 62.7 | 75.3 | 39.3 | ↓ 7.7 |
| Rank-R1-14B | 31.2 | 38.5 | 21.2 | 22.6 | 31.3 | 72.2 | 36.2 | ↓ 10.8 |
| Rank1-32B | 49.7 | 35.8 | 22.0 | 22.5 | 57.6 | 74.8 | 43.7 | ↓ 3.3 |
| ReasonRank-32B | 58.2 | 48.9 | 36.6 | 33.9 | 28.7 | 75.5 | 47.0 | – |
| DCI Agents | ||||||||
| DCI-Agent-Lite (GPT-5.4 nano) | 60.0 | 50.8 | 32.3 | 42.4 | 81.9 | 72.7 | 56.7 | ↑ 9.7 |
| DCI-Agent-CC (Sonnet 4.6) | 77.1 | 69.0 | 46.8 | 56.8 | 85.3 | 75.7 | 68.5 | ↑ 21.5 |
Controlled Ablations and Mechanism Analysis
RQ2 (Why does DCI help?): Analysis of BrowseComp-Plus trajectories shows DCI's advantage arises less from higher gold-document recall and more from fine-grained discovery, composition, and use of evidence through flexible bash commands. Among the cases that DCI-Agent-CC wins, only 34 involved the retriever failing to surface any gold document; in 142 cases the retriever surfaced at least one gold document yet the retrieval agent still failed. Tool usage concentrates on chained search, local context peeking, regex matching, and file localization, not full-document reads.
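The "chained search" behavior can be illustrated with a two-hop composition: one search surfaces an intermediate entity, which seeds the next search. This is a toy sketch, not the agent's actual command sequence; the function name, regexes, and grep flags are illustrative.

```python
import re
import subprocess

def chained_search(corpus_dir: str, seed_pattern: str,
                   entity_regex: str) -> list[str]:
    """Two-hop chained search over a plain-text corpus directory.

    Hop 1: grep for the seed pattern, printing only the matched spans
    (-o), without filenames (-h). Hop 2: extract intermediate entities
    from those spans, then grep for each entity as a fixed string (-F),
    listing the files that contain it (-l).
    """
    hits = subprocess.run(
        ["grep", "-rhoE", seed_pattern + ".*", corpus_dir],
        capture_output=True, text=True).stdout
    entities = set(re.findall(entity_regex, hits))
    results = []
    for ent in sorted(entities):
        out = subprocess.run(["grep", "-rlF", ent, corpus_dir],
                             capture_output=True, text=True)
        results.extend(out.stdout.splitlines())
    return results
```

The key point is compositionality: the second query is constructed from evidence the first query surfaced, something a single fixed top-k retrieval call cannot express.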
RQ3 (Behavioral tradeoffs): As shown in Table 4, DCI-Agent-Lite does not win by exhaustive gold-chain recovery. Its mean gold-document coverage (28.0) is much lower than Qwen3-Embedding-8B's (56.7), but its localization score is dramatically higher (48.4 vs. 21.7). This indicates DCI trades exhaustive recovery for high-resolution local progress, extracting more value from documents it reaches.
Table 4: Trajectory analysis on a BrowseComp-Plus subset (n=100).
| Method | Avg. tools ↓ | Cost / q ($) ↓ | Cov. (any) ↑ | Cov. (mean recall) ↑ | Cov. (all) ↑ | Avg. localization ↑ |
|---|---|---|---|---|---|---|
| Retrieval Agents (GPT-5.4-nano) |||||||
| BM25 | 19.07 | 0.0527 | 63.0 | 42.8 | 17.0 | 23.5 |
| Qwen3-Embedding-8B | 17.55 | 0.0498 | 74.0 | 56.7 | 28.0 | 21.7 |
| DCI-Agent-Lite (GPT-5.4-nano) |||||||
| Direct interaction (L4) | 35.35 | 0.1021 | 70.0 | 28.0 | 1.0 | 48.4 |
RQ4 (Corpus scaling): DCI scales well in search depth but incurs rapidly rising costs in search breadth. Expanding the corpus from 100K to 200K documents causes tool calls per question to more than double (38.5 → 86.9), latency and cost to more than double, and accuracy to drop by 13.6 points.
RQ5 (Context management): Ablation of context-management policies (Table 6) shows a non-monotonic pattern. L3 achieves the best answer accuracy (77) but not the highest retained gold evidence coverage, indicating that preserving verbatim evidence is not the same as maintaining the right working state for continued search.
Table 6: DCI-Agent-Lite context-management ablation on a BrowseComp-Plus subset (n=100).
| Level | Avg. tools ↓ | Latency (s) ↓ | Cost / q ($) ↓ | Retained cov. ↑ | Acc. ↑ |
|---|---|---|---|---|---|
| L0 | 28.54 | 2226.22 | 0.0716 | 26.9 | 72 |
| L1 | 29.00 | 1819.78 | 0.0720 | 31.3 (+4.4) | 75 (+3) |
| L2 | 29.95 | 4412.73 | 0.0590 | 27.2 (+0.3) | 69 (-3) |
| L3 | 36.89 | 8711.81 | 0.1109 | 27.0 (+0.1) | 77 (+5) |
| L4 | 35.35 | 4531.11 | 0.1021 | 28.0 (+1.1) | 73 (+1) |
RQ6 (Tool-set expressivity): Ablation in Table 5 shows the benefit appears even under a highly constrained interface (read + grep only), which achieves 61% accuracy, outperforming the Qwen3-Embedding-8B retrieval baseline (45%) by 16 points. Enabling the full bash command set adds a further 12-point gain but with substantially higher tool usage and cost.
Table 5: Tool-profile ablation on a BrowseComp-Plus subset (n=100).
| Method | Avg. tools ↓ | Cost / q ($) ↓ | Acc. ↑ |
|---|---|---|---|
| Retrieval Agents (GPT-5.4-nano) | |||
| BM25 | 19 | 0.0527 | 32 |
| Qwen3-Embedding-8B | 18 | 0.0498 | 45 |
| DCI-Agent-Lite (GPT-5.4-nano) | |||
| read + grep (L4) | 19 | 0.0355 | 61 |
| Open bash (L4) | 35 | 0.1021 | 73 |
Theoretical and Practical Implications
The results support a broader view of retrieval in agentic systems: the central question is not just which retriever to use, but which interface best aligns with the agent's reasoning. When models can search strategically, compressed similarity indexes become a bottleneck, making higher-resolution interfaces like DCI more valuable.
Theoretical Implication: DCI is evidence that retrieval for capable agents should be reframed as an interface-design problem (whose granularity determines what the agent can observe, verify, and act upon) rather than solely a retriever-design problem. The paper introduces retrieval interface resolution as a conceptual lens to explain DCI's effectiveness.
Practical Implications:
- No Offline Indexing: DCI requires no offline embedding or indexing, reducing setup complexity.
- Adaptability: It adapts naturally to evolving local corpora, since there is no index to rebuild as documents are added or changed.