GrepSeek: Training Search Agents for Direct Corpus Interaction
Summary (Overview)
- Introduces Direct Corpus Interaction (DCI): A novel paradigm where search agents interact directly with a raw text corpus via executable shell commands (e.g., `grep`, `rg`), bypassing traditional pre-computed retrieval indices.
- Proposes GrepSeek: A two-stage training pipeline for a compact LLM agent (Qwen3.5-9B) that learns effective, interpretable, and lexically precise retrieval behavior. The pipeline consists of:
  - Cold-start SFT: Generates a synthetic dataset using an answer-aware Tutor and answer-blind Planner to create verified, causally grounded search trajectories.
  - Policy Refinement: Uses Group Relative Policy Optimization (GRPO) to improve task-oriented search through direct RL on the corpus.
- Achieves Strong Performance: Outperforms index-based RAG and agentic search baselines on 4 out of 7 open-domain QA benchmarks (NQ, HotpotQA, 2WikiMultihopQA, MuSiQue), with the largest gains on multi-hop reasoning tasks.
- Enables Efficient Execution: Develops a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to 7.6× (0.71s vs. 5.39s) while maintaining byte-exact equivalence with sequential execution, reducing end-to-end latency to ~8.6 seconds per query.
Introduction and Theoretical Foundation
Large Language Model (LLM) search agents typically access information through a retriever that queries a pre-computed index of document representations. This paper explores a complementary perspective: Direct Corpus Interaction (DCI). Here, the agent treats the corpus as a search environment and finds evidence by issuing executable shell commands (e.g., `rg -F "keyword"`). This enables:
- Surgical retrieval: Access to text at any granularity, not just pre-chunked documents.
- Explicit, controllable operations: Shifts from a black-box ranking procedure to a sequence of inspectable corpus operations.
- Effective for exact matching and compositional reasoning: Particularly useful for tasks requiring precise entity matching, lexical filtering, and following bridge entities across documents.
Contemporary work (Li et al., 2026; Sen et al., 2026) uses DCI as an inference-time prompting strategy with large, proprietary models, leading to computational expense and inefficiency. This paper focuses on training compact models to learn DCI as a capability, making it practical for real-world use.
Methodology
GrepSeek Agent Framework
The DCI search agent operates within a ReAct framework. Given a question $q$ and a corpus $\mathcal{C}$ (a file with one document per line), it produces a trajectory $\tau = \{(r_t, a_t, o_t)\}_{t=1}^{T}$, where:
- $r_t$: Reasoning trace.
- $a_t$: Action (shell command or termination).
- $o_t$: Observation (command output).
The agent uses a specific interaction format with `<think>`, `<tool_call>`, `<tool_response>`, and `<answer>` tags.
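The interaction loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `parse_action` and `run_episode` are hypothetical, the policy and the command executor are passed in as plain callables, and the sketch assumes that the body of `<tool_call>` is the raw shell command string (the actual payload format is not specified in this summary).

```python
import re
from typing import Callable, List, Optional, Tuple

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_action(model_output: str) -> Tuple[str, str]:
    """Classify one model turn as ("answer", text) or ("tool_call", command)."""
    answer = ANSWER.search(model_output)
    if answer:
        return "answer", answer.group(1).strip()
    call = TOOL_CALL.search(model_output)
    if call:
        # ASSUMPTION: the tag body is the shell command itself.
        return "tool_call", call.group(1).strip()
    raise ValueError("turn contains neither <tool_call> nor <answer>")


def run_episode(
    policy: Callable[[str], str],
    execute: Callable[[str], str],
    question: str,
    max_turns: int = 8,
) -> Optional[str]:
    """Alternate reasoning/action turns with observations until an answer appears."""
    history: List[str] = [f"Question: {question}"]
    for _ in range(max_turns):
        turn = policy("\n".join(history))  # reasoning trace + action
        history.append(turn)
        kind, payload = parse_action(turn)
        if kind == "answer":
            return payload  # termination action
        observation = execute(payload)  # command output
        history.append(f"<tool_response>{observation}</tool_response>")
    return None  # turn budget exhausted without an answer
```

In practice `policy` would wrap the language model and `execute` would run the command against the corpus; both are stubbed here so the control flow can be tested in isolation.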
Two-Stage Training Pipeline
1. Cold-Start Data Generation (Algorithm 1): A pipeline using two LLMs (Qwen3.5-27B as Tutor and Planner) generates verified training trajectories.
- Phase A (Backward Verification): The answer-aware Tutor, given the gold answer, decomposes the question into sub-queries and constructs a retrieval chain backwards, from the final hop to the first. For each step, it proposes a target-masked shell command (one that may not contain the current target answer or its aliases) to retrieve a document that supports the current target answer. A bridge extraction step then identifies the entity in that document that answers the preceding sub-query; this entity becomes the target for the next hop.
- Phase B (Forward Assembly): The verified chain is reversed into chronological order. At each step, the answer-blind Planner drafts a reasoning trace and action based only on the observable history. The Tutor then aligns this draft so that it logically motivates the verified command while remaining causally grounded, producing the final trajectory.
- Phase C (Quality Filtering): Trajectories are filtered for answer quality and judged for causal/logical consistency to prevent information leakage.
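The target-masking constraint in Phase A can be illustrated with a small check. This is a sketch under stated assumptions: the function name `is_target_masked`, the lowercase/alphanumeric normalization, and whole-phrase matching are all illustrative choices, not details taken from the paper.

```python
import re
from typing import Iterable


def _normalize(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_target_masked(command: str, target: str, aliases: Iterable[str] = ()) -> bool:
    """Return True if the command mentions neither the target answer nor any alias.

    ASSUMPTION: a "mention" is a whole-phrase match after normalization.
    """
    haystack = f" {_normalize(command)} "
    for name in (target, *aliases):
        needle = _normalize(name)
        if needle and f" {needle} " in haystack:
            return False  # the command leaks the answer it is supposed to find
    return True
```

A command such as `rg -F "capital of France" corpus.txt` would pass this check for the target "Paris", while `rg -F "Paris" corpus.txt` would be rejected because it searches for the answer directly.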
2. Policy Optimization
- Supervised Fine-Tuning (SFT): The policy model (Qwen3.5-9B) is first fine-tuned on the 10k-sample cold-start dataset to learn stable retrieval behavior.
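- Reinforcement Learning with GRPO: The policy is further optimized using Group Relative Policy Optimization (GRPO). For a query $q$, the policy samples a group of $G$ trajectories $\{\tau_1, \dots, \tau_G\}$. Each trajectory receives a reward that combines a binary format indicator with the token-level $F_1$ score between the predicted answer and the gold answer set. The advantage is computed as a relative score within the group; in the standard GRPO formulation, $\hat{A}_i = \dfrac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}$, where $R_i$ is the reward of trajectory $\tau_i$.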
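The reward and group-relative advantage can be sketched as below. Several details are assumptions rather than facts from the paper: the reward is taken to be the product of the format indicator and the best token-level F1 over the gold set, tokens are lowercase word characters, and the group standard deviation is the population form with a small epsilon for stability.

```python
import re
import statistics
from collections import Counter
from typing import List, Sequence


def _tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer."""
    pred, ref = _tokens(prediction), _tokens(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(prediction: str, gold_set: Sequence[str], format_ok: bool) -> float:
    """ASSUMPTION: reward = format indicator * max F1 over a non-empty gold set."""
    return float(format_ok) * max(token_f1(prediction, g) for g in gold_set)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style advantage: reward centered and scaled within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # ASSUMPTION: population std plus eps
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage is relative, a group in which every trajectory earns the same reward yields zero advantage for all members, so such groups contribute no learning signal.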
Efficient Corpus Interaction Engine
To make DCI practical over large corpora (e.g., 21M documents, ~14GB), a semantics-preserving sharded-parallel execution engine is developed (Algorithm 2).
- Sharded-Parallel Search: The corpus is split into line-aligned shards. Compatible shell pipelines are executed in parallel across shards.
- Semantics Preservation: The engine classifies pipelines and applies deterministic merge strategies (CONCAT, HEAD, COUNT, SORTHEAD) to reconstruct output that is byte-identical to that of sequential execution. Incompatible or stateful pipelines fall back to sequential execution.
- Performance: This optimization reduces average retrieval latency from 5.39s (sequential) to 0.71s (32 shards), a 7.6× speedup.
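The idea behind the merge strategies can be illustrated with an in-memory Python analogy. This is not the paper's engine, which runs real shell pipelines over corpus shards; here a substring filter stands in for the search stage, and the mapping of each strategy name to a pipeline shape (plain filter, filter then head, filter then count, filter then sort then head) is an assumption made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import List, Union

Result = Union[List[str], int]


def split_shards(lines: List[str], n_shards: int) -> List[List[str]]:
    """Split into contiguous, line-aligned shards; original order is preserved."""
    size = max(1, -(-len(lines) // n_shards))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def _stage(lines: List[str], needle: str, strategy: str, k: int) -> Result:
    """Run the whole (toy) pipeline on one list of lines."""
    hits = [line for line in lines if needle in line]
    if strategy == "CONCAT":
        return hits
    if strategy == "HEAD":
        return hits[:k]
    if strategy == "COUNT":
        return len(hits)
    if strategy == "SORTHEAD":
        return sorted(hits)[:k]
    raise ValueError(f"unknown strategy: {strategy}")


def run_sequential(lines: List[str], needle: str, strategy: str, k: int = 10) -> Result:
    return _stage(lines, needle, strategy, k)


def run_sharded(lines: List[str], needle: str, strategy: str,
                k: int = 10, n_shards: int = 4) -> Result:
    shards = split_shards(lines, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        # pool.map returns results in shard order, which the merges rely on.
        parts = list(pool.map(lambda s: _stage(s, needle, strategy, k), shards))
    if strategy == "COUNT":
        return sum(parts)  # per-shard counts add up
    if strategy == "SORTHEAD":
        return list(heapq.merge(*parts))[:k]  # merge sorted per-shard top-k lists
    merged = [line for part in parts for line in part]  # shard-order concatenation
    return merged[:k] if strategy == "HEAD" else merged
```

Each merge is deterministic because shard boundaries fall on line boundaries and shard outputs are combined in shard order, so the sharded result equals the sequential one for these four pipeline shapes. Pipelines that do not fit one of these shapes would need the sequential fallback described above.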
Empirical Validation / Results
Experimental Setup
- Datasets: Seven open-domain QA benchmarks: Single-hop: Natural Questions (NQ), TriviaQA, PopQA. Multi-hop: HotpotQA, 2WikiMultihopQA (2Wiki), MuSiQue, Bamboogle.
- Corpus: 2018 Wikipedia dump (21M documents).
- Baselines: Include direct LLM, RAG, IRCoT, Search-O1, Rejection Sampling, and Search-R1 (GRPO-optimized) with three retrievers: BM25 (sparse), E5-110M (dense), and Qwen3-4B (dense).
- Primary Metric: Token-level $F_1$ score.
Main Findings
Table 1: Model performance ($F_1$ scores) across QA datasets.
| Method | Retriever | NQ* | TriviaQA | PopQA | HotpotQA* | 2Wiki | MuSiQue | Bamboogle | Average (micro) |
|---|---|---|---|---|---|---|---|---|---|
| Direct | — | 0.2733 | 0.5565 | 0.2364 | 0.2837 | 0.3353 | 0.1151 | 0.1648 | 0.3340 |
| RAG | BM25 | 0.3329 | 0.6660 | 0.3239 | 0.4434 | 0.3469 | 0.1305 | 0.2841 | 0.4129 |
| RAG | Qwen3-4B | 0.5002 | 0.7212 | 0.5046 | 0.4548 | 0.3498 | 0.1609 | 0.3484 | 0.4905 |
| Search-R1 | Qwen3-4B | 0.5067 | 0.7693 | 0.5101 | 0.5591 | 0.4299 | 0.2878 | 0.6989 | 0.5441 |
| GrepSeek | — | 0.5223↑ | 0.7673 | 0.4861↓ | 0.6231↑ | 0.5178↑ | 0.3006 | 0.6212 | 0.5691↑ |
↑/↓: statistically significant improvement/degradation vs. best baseline (p<0.05). Bold: best per column.
- Overall Performance: GrepSeek achieves the best overall micro-average score (0.5691), significantly outperforming the best dense retrieval baseline (Search-R1 with Qwen3-4B, 0.5441).
- Multi-hop Strength: Gains are most pronounced on multi-hop benchmarks, notably HotpotQA (0.6231 vs. 0.5591 for Search-R1 with Qwen3-4B) and 2Wiki (0.5178 vs. 0.4299), where DCI's lexical precision helps avoid the semantic conflation and entity ambiguity common with dense retrievers.
- Limitations: GrepSeek shows a small but statistically significant degradation on datasets with substantial surface-form variation (PopQA: 0.4861 vs. 0.5101) or semantically broad phrasing, highlighting the brittleness of purely lexical search compared with the generalization of semantic embeddings.
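The "Average (micro)" column in Table 1 is presumably weighted by the number of questions in each benchmark, unlike a plain mean over datasets. The sketch below contrasts the two; the helper names are hypothetical, and because this summary does not list the per-dataset question counts, the weighted example uses made-up sizes.

```python
from typing import Sequence


def macro_average(scores: Sequence[float]) -> float:
    """Unweighted mean over datasets."""
    return sum(scores) / len(scores)


def micro_average(scores: Sequence[float], sizes: Sequence[int]) -> float:
    """Mean over all questions: each dataset score is weighted by its size."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


# GrepSeek row of Table 1 (NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, MuSiQue, Bamboogle).
grepseek_row = [0.5223, 0.7673, 0.4861, 0.6231, 0.5178, 0.3006, 0.6212]
unweighted = macro_average(grepseek_row)

# Hypothetical sizes, only to show how weighting shifts the average.
toy_micro = micro_average([1.0, 0.0], [3, 1])
```

The unweighted mean of the GrepSeek row is about 0.548, which differs from the reported 0.5691 and is consistent with the reported figure being weighted toward the larger benchmarks.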
Efficiency Analysis (Figure 3):
- Inference Latency: GrepSeek has higher end-to-end latency (8.67s) than dense baselines (E5: 4.77s, Qwen3-4B: 6.07s), primarily due to longer reasoning and decoding.
- Memory & Preprocessing Cost: GrepSeek requires only 14 GB RAM (raw corpus size), eliminating the massive memory footprint of embedding indices (E5: 70 GB, Qwen3-4B: 221 GB) and expensive offline indexing (Qwen3-4B: 62.4 A100-hours).
Ablation Study (Table 2):
| Variant | Average (micro) |
|---|---|
| GrepSeek (Full) | 0.5691↑ |
| - w/o GRPO | 0.4249 |
| - w/o SFT | 0.3314 |

Both SFT initialization and RL optimization are critical for strong performance.
Training Dynamics (Figure 5): GrepSeek achieves higher rewards during training but generates longer sequences. Interestingly, it learns to reduce the number of commands over time by composing more expressive multi-stage shell pipelines.
Retrieval Behavior Analysis (Table 3):
- The agent consistently uses `| head -n` to limit output and `-F` for exact-string matching.
- ~70% of commands use cascaded filtering (e.g., `rg ... | rg ...`).
- SFT establishes low-level syntactic "primitives," while RL refines higher-level search efficiency and reasoning depth.
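A cascaded-filter command of the kind described above can be assembled as follows. This is a sketch: `cascaded_rg` is a hypothetical helper, `corpus.txt` is a placeholder path, and the exact commands the agent emits are not reproduced here. It relies only on `rg -F` (fixed-string matching), `rg -e` (explicit pattern), and `head -n`.

```python
import shlex
from typing import Sequence


def cascaded_rg(terms: Sequence[str], corpus_path: str, limit: int = 5) -> str:
    """Build `rg -F ... | rg -F ... | head -n N` with shell-safe quoting."""
    if not terms:
        raise ValueError("at least one search term is required")
    # First stage reads the corpus file; later stages filter the piped output.
    first = shlex.join(["rg", "-F", "-e", terms[0], corpus_path])
    rest = [shlex.join(["rg", "-F", "-e", term]) for term in terms[1:]]
    return " | ".join([first, *rest, f"head -n {int(limit)}"])


command = cascaded_rg(["Marie Curie", "Nobel"], "corpus.txt", limit=5)
```

Each additional `rg -F` stage narrows the surviving lines to those that also contain the next term, and the final `head -n` caps the amount of text returned to the model.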
Theoretical and Practical Implications
- DCI as a Competitive Paradigm: Establishes direct corpus interaction via learned shell commands as a practical and highly effective alternative to index-based retrieval, especially for tasks requiring precision and multi-hop reasoning.
- Interpretability & Control: Shifts retrieval from an opaque ranking to an interpretable sequence of operations, offering greater transparency and user control.
- Efficiency Trade-offs: Demonstrates a favorable trade-off: while inference latency increases, DCI eliminates costly offline indexing and drastically reduces memory requirements, simplifying deployment.
- Limitations Highlight Research Directions: The sensitivity to surface-form variation underscores the need for hybrid approaches combining lexical precision with semantic robustness.
Conclusion
GrepSeek demonstrates that compact LLMs can be effectively trained to perform direct, surgical retrieval over large text corpora using shell commands. The two-stage training pipeline (cold-start SFT + GRPO) stabilizes learning and yields an agent that excels at multi-hop reasoning through lexical precision. The optimized execution engine makes this approach practical at scale. While purely lexical interaction has limitations on semantically broad queries, GrepSeek establishes DCI as a highly competitive and practical paradigm for agentic search, complementary to existing retrieval methods. Future work will explore hybrid retrieval architectures, richer matching primitives, and improved inference efficiency.