GrepSeek: Training Search Agents for Direct Corpus Interaction

Summary (Overview)

Introduces Direct Corpus Interaction (DCI): A novel paradigm where search agents interact directly with a raw text corpus via executable shell commands (e.g., grep, rg), bypassing traditional pre-computed retrieval indices.
Proposes GrepSeek: A two-stage training pipeline for a compact LLM agent (Qwen3.5-9B) that learns effective, interpretable, and lexically precise retrieval behavior. The pipeline consists of:
1. Cold-start SFT: Generates a synthetic dataset using an answer-aware Tutor and answer-blind Planner to create verified, causally grounded search trajectories.
2. Policy Refinement: Uses Group Relative Policy Optimization (GRPO) to improve task-oriented search through direct RL on the corpus.
Achieves Strong Performance: Outperforms index-based RAG and agentic search baselines on 4 out of 7 open-domain QA benchmarks (NQ, HotpotQA, 2WikiMultihopQA, MuSiQue), with the largest gains on the multi-hop benchmarks HotpotQA and 2WikiMultihopQA.
Enables Efficient Execution: Develops a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to 7.6× (0.71s vs. 5.39s) while maintaining byte-exact equivalence with sequential execution, reducing end-to-end latency to ~8.7 seconds per query.

Introduction and Theoretical Foundation

Large Language Model (LLM) search agents typically access information through a retriever that queries a pre-computed index of document representations. This paper explores a complementary perspective: Direct Corpus Interaction (DCI). Here, the agent treats the corpus as a search environment and finds evidence by issuing executable shell commands (e.g., rg -F "keyword"). This enables:

Surgical retrieval: Access to text at any granularity, not just pre-chunked documents.
Explicit, controllable operations: Shifts from a black-box ranking procedure to a sequence of inspectable corpus operations.
Effective for exact matching and compositional reasoning: Particularly useful for tasks requiring precise entity matching, lexical filtering, and following bridge entities across documents.

Contemporary work (Li et al., 2026; Sen et al., 2026) uses DCI as an inference-time prompting strategy with large, proprietary models, which is computationally expensive and inefficient. This paper instead trains compact models to learn DCI as a capability, making it practical for real-world use.

Methodology

GrepSeek Agent Framework

The DCI search agent $π_θ$ operates within a ReAct framework. Given a question $q$ and corpus $C$ (a file with one document per line), it produces a trajectory $τ = \{(t_i, a_i, o_i)\}_{i=1}^T$ , where:

$t_i$ : Reasoning trace.
$a_i$ : Action (shell command or termination).
$o_i$ : Observation (command output).

The agent uses a specific interaction format with <think>, <tool_call>, <tool_response>, and <answer> tags.
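
The interaction format can be illustrated with a small parser. This is a minimal sketch, not the authors' implementation: the tag names come from the text above, while the assumptions that the reasoning tag is written `<think>`, that a `<tool_call>` body is a raw shell command, and the names `Step`, `parse_step`, and `wrap_observation` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Tag names follow the text; treating tag bodies as plain text is an assumption.
_TAG = {
    name: re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    for name in ("think", "tool_call", "answer")
}


@dataclass
class Step:
    thought: str            # t_i: reasoning trace
    command: Optional[str]  # a_i when the action is a shell command
    answer: Optional[str]   # a_i when the action is termination


def _first(name: str, text: str) -> Optional[str]:
    """Return the stripped body of the first <name>...</name> pair, if any."""
    match = _TAG[name].search(text)
    return match.group(1).strip() if match else None


def parse_step(model_output: str) -> Step:
    """Split one model turn into reasoning, shell command, and final answer."""
    step = Step(
        thought=_first("think", model_output) or "",
        command=_first("tool_call", model_output),
        answer=_first("answer", model_output),
    )
    # Each step carries exactly one action: a command or a termination.
    if (step.command is None) == (step.answer is None):
        raise ValueError("expected exactly one of <tool_call> or <answer>")
    return step


def wrap_observation(output: str) -> str:
    """Format command output o_i before appending it to the history."""
    return f"<tool_response>{output}</tool_response>"
```

A rollout loop would call `parse_step` on each model turn, execute `command` against the corpus when it is set, append `wrap_observation(...)` to the history, and stop once `answer` is set.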

Two-Stage Training Pipeline

1. Cold-Start Data Generation (Algorithm 1)

A pipeline with two LLM roles, a Tutor $M_T$ and a Planner $M_P$ (both Qwen3.5-27B), generates verified training trajectories.

Phase A (Backward Verification): The answer-aware Tutor, given the gold answer $y$ , decomposes the question into sub-queries $(q_1,..., q_N)$ and constructs a retrieval chain backwards ( $N → 1$ ). For each step, it proposes a target-masked shell command $c_i$ (forbidding the use of the target answer or its aliases) to retrieve a document $d_i$ that supports the current target answer $a$ . A bridge extraction step identifies the entity in $d_i$ that answers the preceding sub-query, which becomes the target for the next hop.
Phase B (Forward Assembly): The verified chain is reversed into chronological order. The answer-blind Planner drafts a reasoning trace and action based only on the observable history $H$ . The Tutor then aligns this draft to logically motivate the verified command $c_i$ while remaining causally grounded, producing the final trajectory $T_{train}$ .
Phase C (Quality Filtering): Trajectories are filtered for answer quality ( $F_1(\hat{y}, y) > 0$ ) and judged for causal/logical consistency to prevent information leakage.
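
Two mechanical pieces of the phases above lend themselves to a short sketch: the target-masking constraint from Phase A and the chain reversal that opens Phase B. The code below is illustrative only. The case-insensitive substring test is an assumed way to enforce masking (the source does not say how it is checked), and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def is_target_masked(command: str, target: str, aliases: Iterable[str] = ()) -> bool:
    """True if the command never mentions the target answer or any of its aliases.

    Assumption: masking is checked by case-insensitive substring matching.
    """
    lowered = command.lower()
    return all(name.lower() not in lowered for name in (target, *aliases) if name)


def forward_order(backward_chain: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Reverse a chain of (command, document) pairs built from hop N down to hop 1."""
    return list(reversed(backward_chain))
```

For example, a command that searches for a film title and the phrase "directed by" passes the check when the target is the director's name, whereas a command that greps for the director directly is rejected.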

2. Policy Optimization

Supervised Fine-Tuning (SFT): The policy model (Qwen3.5-9B) is first fine-tuned on the 10k-sample cold-start dataset to learn stable retrieval behavior.
Reinforcement Learning with GRPO: The policy is further optimized using Group Relative Policy Optimization (GRPO). For a query $q$ , the policy samples a group of $n=5$ trajectories $τ^{(1)}, ..., τ^{(n)} \sim π_θ(· | q)$ . Each trajectory receives a reward: $R(τ^{(i)}) = ϕ(τ^{(i)}) \cdot R_{ans}(τ^{(i)})$ where $ϕ(τ^{(i)}) \in \{0,1\}$ is a binary format indicator, and $R_{ans}(τ^{(i)})$ is the token-level $F_1$ score between the predicted answer $\hat{y}^{(i)}$ and the gold set $Y$ . The advantage is computed as a relative score within the group: $A^{(i)} = \frac{R(τ^{(i)}) - \text{mean}(\{R(τ^{(j)})\}_{j=1}^n)}{\text{std}(\{R(τ^{(j)})\}_{j=1}^n) + \epsilon}$
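
The reward and the group-relative advantage can be written out directly. This is a hedged sketch: the answer normalization (lowercasing, stripping punctuation and articles) follows common QA evaluation practice and is assumed rather than confirmed by the source, as are taking the maximum $F_1$ over the gold set, the population standard deviation, and the value of $\epsilon$.

```python
import statistics
import string
from collections import Counter
from typing import Iterable, List


def _tokens(text: str) -> List[str]:
    # Assumed normalization: lowercase, drop punctuation and English articles.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer."""
    pred, ref = _tokens(prediction), _tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(prediction: str, gold_set: Iterable[str], format_ok: bool) -> float:
    """R = phi * R_ans, with R_ans taken as the best F1 over the gold set (assumed)."""
    return float(format_ok) * max(token_f1(prediction, g) for g in gold_set)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """A_i = (R_i - mean) / (std + eps), computed within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the source does not specify
    return [(r - mean) / (std + eps) for r in rewards]
```

Note that when all $n$ trajectories in a group receive the same reward, every advantage is zero, so that group contributes no learning signal.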

Efficient Corpus Interaction Engine

To make DCI practical over large corpora (e.g., 21M documents, ~14GB), a semantics-preserving sharded-parallel execution engine is developed (Algorithm 2).

Sharded-Parallel Search: The corpus is split into $S$ line-aligned shards. Compatible shell pipelines are executed in parallel across shards.
Semantics Preservation: The engine classifies pipelines and applies deterministic merge strategies (CONCAT, HEAD, COUNT, SORTHEAD) to reconstruct output byte-exact to sequential execution. Incompatible or stateful pipelines fall back to sequential execution.
Performance: This optimization reduces average retrieval latency from 5.39s (sequential) to 0.71s (32 shards), a 7.6× speedup.
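
The sharded execution and merge step can be simulated in pure Python to show why the merged output matches sequential execution. This is a sketch under stated assumptions, not Algorithm 2 itself: the strategy names come from the text, but the pipeline shape attached to each one (shown in the comments) is an interpretation, an in-memory substring filter stands in for `rg -F`, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge
from itertools import chain, islice
from typing import List, Sequence


def split_shards(lines: Sequence[str], num_shards: int) -> List[Sequence[str]]:
    """Contiguous, line-aligned shards; shard order preserves corpus order."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    return [lines[i:i + size] for i in range(0, max(len(lines), 1), size)]


def grep_fixed(lines: Sequence[str], pattern: str) -> List[str]:
    """Stand-in for running `rg -F pattern` on one shard."""
    return [line for line in lines if pattern in line]


def run_sharded(shards: List[Sequence[str]], pattern: str, strategy: str, k: int = 10):
    """Run the filter on every shard in parallel, then merge deterministically."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # pool.map yields results in shard order, regardless of finish order.
        parts = list(pool.map(lambda shard: grep_fixed(shard, pattern), shards))
    if strategy == "CONCAT":    # assumed shape: rg -F pat
        return list(chain.from_iterable(parts))
    if strategy == "HEAD":      # assumed shape: rg -F pat | head -n k
        return list(islice(chain.from_iterable(parts), k))
    if strategy == "COUNT":     # assumed shape: rg -F pat | wc -l
        return sum(len(part) for part in parts)
    if strategy == "SORTHEAD":  # assumed shape: rg -F pat | sort | head -n k
        # Code-point ordering, comparable to `sort` under LC_ALL=C.
        return list(islice(merge(*(sorted(part) for part in parts)), k))
    raise ValueError(f"no merge strategy for {strategy!r}; fall back to sequential")
```

The key design point is that each merge rule is chosen so that combining per-shard outputs in shard order reproduces what the same pipeline would print over the unsplit file; anything outside these patterns raises here, mirroring the sequential fallback.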

Empirical Validation / Results

Experimental Setup

Datasets: Seven open-domain QA benchmarks, three single-hop (Natural Questions (NQ), TriviaQA, PopQA) and four multi-hop (HotpotQA, 2WikiMultihopQA (2Wiki), MuSiQue, Bamboogle).
Corpus: 2018 Wikipedia dump (21M documents).
Baselines: Include direct LLM, RAG, IRCoT, Search-O1, Rejection Sampling, and Search-R1 (GRPO-optimized) with three retrievers: BM25 (sparse), E5-110M (dense), and Qwen3-4B (dense).
Primary Metric: Token-level $F_1$ score.

Main Findings

Table 1: Model performance ( $F_1$ scores) across QA datasets.

Method	Retriever	NQ*	TriviaQA	PopQA	HotpotQA*	2Wiki	MuSiQue	Bamboogle	Average (micro)
Direct	—	0.2733	0.5565	0.2364	0.2837	0.3353	0.1151	0.1648	0.3340
RAG	BM25	0.3329	0.6660	0.3239	0.4434	0.3469	0.1305	0.2841	0.4129
RAG	Qwen3-4B	0.5002	0.7212	0.5046	0.4548	0.3498	0.1609	0.3484	0.4905
Search-R1	Qwen3-4B	0.5067	0.7693	0.5101	0.5591	0.4299	0.2878	0.6989	0.5441
GrepSeek	—	0.5223↑	0.7673	0.4861↓	0.6231↑	0.5178↑	0.3006	0.6212	0.5691↑

↑/↓: statistically significant improvement/degradation vs. best baseline (p<0.05). Bold: best per column.

Overall Performance: GrepSeek achieves the best overall micro-average $F_1$ score (0.5691), significantly outperforming the best dense retrieval baseline (Search-R1 with Qwen3-4B, 0.5441).
Multi-hop Strength: Gains are most pronounced on multi-hop benchmarks (HotpotQA, 2Wiki), where DCI's lexical precision helps avoid semantic conflation and entity ambiguity common with dense retrievers.
Limitations: Performance degrades on datasets with substantial surface-form variation (PopQA, where the drop is statistically significant) or semantically broad phrasing, highlighting the brittleness of purely lexical search compared to semantic embedding generalization.

Efficiency Analysis (Figure 3):

Inference Latency: GrepSeek has higher end-to-end latency (8.67s) than dense baselines (E5: 4.77s, Qwen3-4B: 6.07s), primarily due to longer reasoning and decoding.
Memory & Preprocessing Cost: GrepSeek requires only 14 GB RAM (raw corpus size), eliminating the massive memory footprint of embedding indices (E5: 70 GB, Qwen3-4B: 221 GB) and expensive offline indexing (Qwen3-4B: 62.4 A100-hours).

Ablation Study (Table 2):

Variant	Average $F_1$ (micro)
GrepSeek (Full)	0.5691↑
- w/o GRPO	0.4249
- w/o SFT	0.3314
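
Removing either stage lowers the micro-average $F_1$ substantially (from 0.5691 to 0.4249 without GRPO and to 0.3314 without SFT), so both SFT initialization and RL optimization are critical for strong performance.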

Training Dynamics (Figure 5): GrepSeek achieves higher rewards during training but generates longer sequences. Interestingly, it learns to reduce the number of commands over time by composing more expressive multi-stage shell pipelines.

Retrieval Behavior Analysis (Table 3):

The agent consistently uses | head -n to limit output and -F for exact-string matching.
~70% of commands use cascaded filtering (e.g., rg ... | rg ...).
SFT establishes low-level syntactic "primitives," while RL refines higher-level search efficiency and reasoning depth.
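
The cascaded-filtering pattern can be emulated without a shell to make its semantics concrete. The following is a minimal sketch assuming each stage is a fixed-string (`-F`) filter and the pipeline ends in `head -n`; the function names and the example corpus are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fixed_filter(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Emulates one `rg -F needle` stage: keep lines containing the literal string."""
    return (line for line in lines if needle in line)


def cascade(lines: Iterable[str], needles: List[str], limit: int) -> List[str]:
    """Emulates `rg -F n1 corpus | rg -F n2 | ... | head -n limit`."""
    stream: Iterable[str] = lines
    for needle in needles:
        stream = fixed_filter(stream, needle)
    # islice stops pulling lines once `limit` matches are found, like head.
    return list(islice(stream, limit))


corpus = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie was born in Warsaw.",
    "Albert Einstein won the Nobel Prize in Physics in 1921.",
]
matches = cascade(corpus, ["Marie Curie", "Nobel Prize"], limit=5)
```

Because matching is literal, characters such as `.`, `(`, or `$` in a needle are not interpreted as regular-expression syntax, which is the property that makes `-F` suited to exact entity lookup.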

Theoretical and Practical Implications

DCI as a Competitive Paradigm: Establishes direct corpus interaction via learned shell commands as a practical and highly effective alternative to index-based retrieval, especially for tasks requiring precision and multi-hop reasoning.
Interpretability & Control: Shifts retrieval from an opaque ranking to an interpretable sequence of operations, offering greater transparency and user control.
Efficiency Trade-offs: Demonstrates a favorable trade-off: while inference latency increases, DCI eliminates costly offline indexing and drastically reduces memory requirements, simplifying deployment.
Limitations Highlight Research Directions: The sensitivity to surface-form variation underscores the need for hybrid approaches combining lexical precision with semantic robustness.

Conclusion

GrepSeek demonstrates that compact LLMs can be effectively trained to perform direct, surgical retrieval over large text corpora using shell commands. The two-stage training pipeline (cold-start SFT + GRPO) stabilizes learning and yields an agent that excels at multi-hop reasoning through lexical precision. The optimized execution engine makes this approach practical at scale. While purely lexical interaction has limitations on semantically broad queries, GrepSeek establishes DCI as a highly competitive and practical paradigm for agentic search, complementary to existing retrieval methods. Future work will explore hybrid retrieval architectures, richer matching primitives, and improved inference efficiency.