FastContext: Training Efficient Repository Explorer for Coding Agents
Summary (Overview)
- FastContext is a dedicated exploration subagent that separates repository search from the main coding agent's solving process, performing parallel, read-only tool calls and returning compact file-path and line-range evidence.
- Significant efficiency gains: Integrating FastContext into Mini-SWE-Agent reduces main-agent token consumption by up to 60% while improving end-to-end resolution rates by up to 5.5 points across the SWE-bench Multilingual, SWE-bench Pro, and SWE-QA benchmarks.
- Specialized trained explorers: A family of models (4B–30B parameters) is introduced, bootstrapped from reference-model trajectories via supervised fine-tuning (SFT) and refined with task-grounded reinforcement learning (RL) using GRPO.
- Standalone exploration quality: FastContext trained checkpoints achieve 73.71 file-level F1 and 60.35 module-level F1 on SWE-bench Verified, outperforming prior localization methods.
- Modular design: Repository exploration is treated as a first-class, trainable component, enabling smaller specialized models (e.g., 4B) to collaborate with stronger main agents (GPT-5.4, GLM-5.1, Kimi-K2.6) with clearer context boundaries.
Introduction and Theoretical Foundation
Background and Motivation
Coding agents (e.g., Claude Code, Codex, GitHub Copilot CLI, Cursor) have advanced automated software engineering, but repository exploration remains a major bottleneck. In existing systems, the same model that solves the task also explores the repository, causing:
- High token consumption: Reading and searching account for 56.2% of all tool-use turns and 46.5% of main-agent total tokens on average (analysis of GPT-5.4 trajectories on SWE-bench Multilingual).
- Long sequential prelude: Before the first edit, agents execute a median of six sequential exploration turns and 15.5 exploration tool calls. Unresolved trajectories involve more pre-edit exploration turns (8.34 vs. 6.67) than resolved ones.
- Noisy context: Irrelevant snippets accumulated during navigation pollute the solver's history, leading to mistaken hypotheses and wasted later turns.
Theoretical Basis
The paper argues that repository exploration should be separated from solving and delegated to a dedicated, lightweight subagent. This follows recent observations from SWE-Pruner that coding-agent context can be pruned (Wang et al., 2026b). Prior work includes graph-/structure-guided localization (AutoCodeRover, LocAgent, CoSIL), retrieval/compression methods (RepoCoder, LongCodeZip), and RL-trained search agents (CodeScout, SWE-grep), but none provides a lightweight, trained explorer that coexists with a standard main agent.
Key insight: Exploration is structured (read-only, parallel tool calls) and expensive enough to motivate delegation, yet can be handled by a small, task-optimized model that returns only the evidence the solver truly needs.
Methodology
FastContext Subagent Architecture
FastContext is a delegation mechanism with a simple runtime harness and three language-agnostic tools:
- READ: Read line-numbered file contents
- GLOB: Discover file paths
- GREP: Regex search over repository text (using ripgrep)
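A minimal sketch of what three read-only tools of this kind might look like, together with one way to run a turn's calls concurrently. The function names, signatures, and output formats are assumptions for illustration, and Python's `re` module stands in for ripgrep; this is not the paper's harness.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def read_tool(root: Path, path: str, start: int = 1, end: Optional[int] = None) -> str:
    """READ: return line-numbered contents of a file (1-indexed, inclusive)."""
    lines = (root / path).read_text(encoding="utf-8", errors="replace").splitlines()
    start = max(1, start)
    end = len(lines) if end is None else min(end, len(lines))
    return "\n".join(f"{i}: {lines[i - 1]}" for i in range(start, end + 1))


def glob_tool(root: Path, pattern: str) -> List[str]:
    """GLOB: list repository-relative file paths matching a glob pattern."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file())


def grep_tool(root: Path, regex: str, pattern: str = "**/*") -> List[str]:
    """GREP: regex search over files; returns 'path:line: text' hits.

    Uses Python's re module as a stand-in for ripgrep (an assumption).
    """
    compiled = re.compile(regex)
    hits = []
    for rel in glob_tool(root, pattern):
        text = (root / rel).read_text(encoding="utf-8", errors="replace")
        for n, line in enumerate(text.splitlines(), start=1):
            if compiled.search(line):
                hits.append(f"{rel}:{n}: {line.strip()}")
    return hits


# Hypothetical registry mapping tool names to implementations.
TOOLS = {"READ": read_tool, "GLOB": glob_tool, "GREP": grep_tool}


def run_parallel(root: Path, calls: List[Tuple[str, Dict]]) -> list:
    """Execute one turn's tool calls concurrently, preserving call order."""
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        futures = [pool.submit(TOOLS[name], root, **args) for name, args in calls]
        return [f.result() for f in futures]
```

Because all three tools are read-only, the calls within a turn have no ordering dependencies, which is what makes concurrent execution safe in this sketch.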
The subagent operates in a loop: at each turn, it issues parallel tool calls (multiple calls executed concurrently) or outputs a final evidence list in a compact format:
<final_answer>
/src/router.py:42-58 (Router definition)
/tests/test_router.py:101-119
</final_answer>
This output is directly consumable by the main agent as focused context, avoiding the long exploratory trajectory.
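As an illustration of how a main agent might consume this output, the sketch below parses a `<final_answer>` block into structured citations. The line grammar (`path:start-end` with an optional parenthesized note) is inferred from the example above, and the `Citation` type and `parse_final_answer` name are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed grammar for one citation line: <path>:<start>-<end> [optional (note)]
_CITATION = re.compile(
    r"^(?P<path>\S+?):(?P<start>\d+)-(?P<end>\d+)(?:\s+\((?P<note>.*)\))?$"
)
_BLOCK = re.compile(r"<final_answer>\s*(.*?)\s*</final_answer>", re.DOTALL)


@dataclass(frozen=True)
class Citation:
    path: str
    start: int
    end: int
    note: str = ""


def parse_final_answer(text: str) -> List[Citation]:
    """Extract citations from the first <final_answer> block; skip malformed lines."""
    block = _BLOCK.search(text)
    if block is None:
        return []
    citations = []
    for raw in block.group(1).splitlines():
        m = _CITATION.match(raw.strip())
        # Keep only well-formed lines whose range is not inverted.
        if m and int(m["start"]) <= int(m["end"]):
            citations.append(
                Citation(m["path"], int(m["start"]), int(m["end"]), m["note"] or "")
            )
    return citations
```

Applied to the block above, this yields two citations: `/src/router.py` lines 42-58 with the note "Router definition", and `/tests/test_router.py` lines 101-119 with an empty note.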
Policy Initialization with Supervised Fine-Tuning (SFT)
Training data (2,954 filtered examples) is constructed from Sonnet 4.6 exploration traces with three sources:
- parallel_toolcalls: Broad first-turn search. The reference model issues nonredundant parallel tool calls covering complementary signals.
- multiturn_traj: Multi-turn evidence gathering. Full reference-model trajectories are preserved.
- linerange: Precise citation generation. The model produces only a narrow <final_answer> block from retrieved contents.
The SFT loss is an assistant-token-only objective: a next-token negative log-likelihood in which a mask zeroes out the contribution of non-assistant tokens. The model is fine-tuned with this objective from an initial checkpoint.
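A minimal sketch of an assistant-token-only loss, written in plain Python to make the masking explicit. The role labels, the helper names, and the choice to average (rather than sum) over assistant tokens are assumptions; the summary does not specify the normalization.

```python
import math
from typing import List, Sequence


def build_assistant_mask(roles: Sequence[str]) -> List[int]:
    """Return 1 for tokens emitted by the assistant, 0 for everything else."""
    return [1 if role == "assistant" else 0 for role in roles]


def masked_nll(token_log_probs: Sequence[float], mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions where mask == 1.

    Averaging over assistant tokens is an assumed normalization.
    """
    if len(token_log_probs) != len(mask):
        raise ValueError("token_log_probs and mask must have the same length")
    kept = [lp for lp, m in zip(token_log_probs, mask) if m]
    if not kept:
        return 0.0
    return -sum(kept) / len(kept)


# Per-token roles for a toy trajectory: only assistant tokens contribute.
roles = ["system", "user", "assistant", "assistant", "tool", "assistant"]
log_probs = [math.log(p) for p in (0.1, 0.2, 0.5, 0.25, 0.9, 0.5)]
loss = masked_nll(log_probs, build_assistant_mask(roles))
```

In this toy example the system, user, and tool tokens are ignored, so the loss depends only on the three assistant-token probabilities (0.5, 0.25, 0.5).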
Policy Refinement with Reinforcement Learning (RL)
SFT imitation does not directly optimize whether final citations cover the code locations needed to solve the issue. Therefore, RL is used with a 400-prompt set from issue-resolution tasks with reference patches.
For each instance, the reference patch is parsed into two target sets: the files it touches and the lines it touches. The model rolls out as the actual FastContext subagent, interacting with tools and finally producing a <final_answer> block. The predicted file and line sets are parsed from the model's output.
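One plausible way to derive target sets from a reference patch is sketched below. It assumes a unified diff and treats the old-side range of every hunk (context lines included) as target lines; the paper's exact parsing rules are not given in this summary, so the function name and these choices are illustrative only.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Unified-diff hunk header: @@ -old_start[,old_len] +new_start[,new_len] @@
_HUNK = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def patch_targets(patch: str) -> Dict[str, Set[int]]:
    """Map each pre-patch file path to the old-side line numbers its hunks cover.

    Simplification: a removed line whose content begins with '-- ' would be
    mistaken for a file header; a production parser should track hunk lengths.
    """
    targets: Dict[str, Set[int]] = defaultdict(set)
    current = None
    for line in patch.splitlines():
        if line.startswith("--- "):
            path = line[4:].strip()
            # Newly created files have no pre-patch location to cite.
            current = None if path == "/dev/null" else re.sub(r"^a/", "", path)
        elif line.startswith("@@") and current is not None:
            m = _HUNK.match(line)
            if m:
                start = int(m.group(1))
                length = int(m.group(2)) if m.group(2) is not None else 1
                targets[current].update(range(start, start + length))
    return dict(targets)
```

The keys of the returned mapping serve as the target file set, and the values as the target line set for each file.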
The reward function combines three terms: task outcome, a parallel bonus, and a format penalty:
- Task outcome: Sum of file-level and line-level F1 after path normalization (zero for empty sets).
- Parallel bonus: Small bonus for bounded multi-call exploration.
- Format penalty: Penalty for empty, overly long, malformed, or excessive-fan-out outputs.
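The three reward terms can be sketched as follows. The bonus and penalty magnitudes, the fan-out bound, and the `malformed` flag are hypothetical placeholders, since the summary does not give the actual values or the exact form in which the terms are combined.

```python
import posixpath
from typing import Dict, Set


def normalize_path(path: str) -> str:
    """Normalize separators and strip a leading './' or '/' (assumed convention)."""
    return posixpath.normpath(path.replace("\\", "/")).lstrip("/")


def f1(pred: Set, gold: Set) -> float:
    """Set-level F1; zero when either set is empty or they do not overlap."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def reward(
    pred: Dict[str, Set[int]],
    gold: Dict[str, Set[int]],
    n_parallel_calls: int,
    malformed: bool = False,
    bonus: float = 0.1,    # hypothetical magnitude
    penalty: float = 0.5,  # hypothetical magnitude
    max_fanout: int = 8,   # hypothetical bound on calls per turn
) -> float:
    pred = {normalize_path(p): lines for p, lines in pred.items()}
    gold = {normalize_path(p): lines for p, lines in gold.items()}

    # Task outcome: file-level F1 plus line-level F1.
    file_f1 = f1(set(pred), set(gold))
    pred_lines = {(p, n) for p, lines in pred.items() for n in lines}
    gold_lines = {(p, n) for p, lines in gold.items() for n in lines}
    line_f1 = f1(pred_lines, gold_lines)

    # Parallel bonus: reward multi-call turns, but only within the bound.
    parallel_bonus = bonus if 2 <= n_parallel_calls <= max_fanout else 0.0

    # Format penalty: empty, malformed, or excessive-fan-out outputs.
    bad_format = malformed or not pred or n_parallel_calls > max_fanout
    format_penalty = penalty if bad_format else 0.0

    return file_f1 + line_f1 + parallel_bonus - format_penalty
```

For example, predicting lines 42-45 of the right file against a target of lines 42-44, with three parallel calls, gives file F1 of 1, line F1 of 6/7, and the bonus, for a total of about 1.96 under these placeholder constants.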
The model is optimized with GRPO (Shao et al., 2024), sampling multiple trajectories per prompt from the SFT checkpoint. This stage aligns the explorer with the practical goal of returning a compact citation set covering the code regions most likely to matter.
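The core of GRPO is that each sampled trajectory is scored relative to the other trajectories drawn for the same prompt, so no separate value model is needed. A minimal sketch of that group-relative advantage follows; the use of population standard deviation and the `eps` constant are implementation assumptions, not details from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's group of sampled trajectories."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all trajectories receive the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for three rollouts of the same prompt (illustrative numbers).
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Trajectories that beat their group's average receive a positive advantage and are reinforced; when every rollout earns the same reward, all advantages are zero and the prompt contributes no gradient signal.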
Model Variants
- FC-30B-SFT: 30B parameter model trained only with SFT (scaling reference)
- FC-4B-SFT: 4B model trained with SFT (compact deployment target)
- FC-4B-RL: 4B model additionally refined with RL (test of task-grounded optimization)
Empirical Validation / Results
End-to-End Performance (Table 1)
Three benchmarks are used: SWE-bench Multilingual (300 instances), SWE-bench Pro (a random 200-instance subset), and SWE-QA (repository-level QA). The main agents are GPT-5.4, GLM-5.1, and Kimi-K2.6. Three configurations are compared: direct solving (w/o Explore), same-model exploration, and the FastContext variants.
Table 1: End-to-end performance and efficiency across three benchmarks.
| Main Agent | Subagent | SWE-bench Multilingual Score / Tokens / Turns | SWE-bench Pro Score / Tokens / Turns | SWE-QA Score / Tokens / Turns |
|---|---|---|---|---|
| GPT-5.4 | w/o Explore | 71.7 / 457k / 17.7 | 46.0 / 818k / 20.7 | 81.3 / 418k / 15.7 |
| GPT-5.4 | Same model | 73.3 / 379k / 18.3 | 51.5 / 703k / 23.7 | 81.4 / 166k / 9.8 |
| GPT-5.4 | FC-30B-SFT | 75.0 / 356k / 18.2 | 49.0 / 688k / 23.5 | 82.0 / 206k / 11.2 |
| GPT-5.4 | FC-4B-SFT | 73.3 / 364k / 18.3 | 47.0 / 689k / 23.2 | 81.9 / 213k / 11.6 |
| GPT-5.4 | FC-4B-RL | 74.7 / 338k / 18.3 | 48.5 / 701k / 23.5 | 82.0 / 210k / 11.4 |
| GLM-5.1 | w/o Explore | 72.3 / 2514k / 73.9 | 17.5 / 2692k / 67.4 | 72.7 / 401k / 27.7 |
| GLM-5.1 | Same model | 73.3 / 1994k / 55.9 | 18.0 / 2356k / 63.9 | 73.4 / 249k / 20.4 |
| GLM-5.1 | FC-30B-SFT | 73.7 / 1797k / 55.0 | 20.0 / 2370k / 64.2 | 73.3 / 292k / 23.0 |
| GLM-5.1 | FC-4B-SFT | 73.3 / 1919k / 56.9 | 18.0 / 2279k / 64.0 | 73.4 / 306k / 23.8 |
| GLM-5.1 | FC-4B-RL | 73.7 / 1971k / 56.6 | 22.5 / 2210k / 64.3 | 73.5 / 302k / 23.2 |
| Kimi-K2.6 | w/o Explore | 76.3 / 1553k / 55.7 | 31.0 / 2383k / 68.0 | 71.6 / 510k / 32.5 |
| Kimi-K2.6 | Same model | 76.3 / 1367k / 50.5 | 32.0 / 2060k / 58.0 | 73.0 / 361k / 24.4 |
| Kimi-K2.6 | FC-30B-SFT | 76.7 / 1360k / 49.9 | 33.0 / 2150k / 58.8 | 72.8 / 373k / 26.4 |
| Kimi-K2.6 | FC-4B-SFT | 75.3 / 1306k / 49.3 | 32.5 / 2159k / 61.6 | 72.6 / 402k / 28.0 |
| Kimi-K2.6 | FC-4B-RL | 78.3 / 1384k / 52.1 | 33.5 / 2158k / 61.1 | 72.6 / 378k / 27.5 |
In the original table, score deltas and token reductions are reported relative to w/o Explore; bold marks the best and underline the second-best result per benchmark.
Key observations:
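- Delegated exploration improves accuracy over direct solving in nearly every setting: for each main agent and benchmark, at least one FastContext variant beats the w/o Explore score. Largest gain: GPT-5.4 on SWE-bench Pro (+5.5 points, 46.0 → 51.5, with same-model exploration).
- Token savings are substantial: up to 60.3% for GPT-5.4 on SWE-QA (418k → 166k with same-model exploration); 14–26% on issue-resolution tasks.
- FC-4B-RL often matches or beats FC-30B-SFT in score (e.g., GLM-5.1 SWE-bench Pro: 22.5 vs. 20.0; Kimi-K2.6 Multilingual: 78.3 vs. 76.7), and in some settings also in tokens (GLM-5.1 SWE-bench Pro: 2210k vs. 2370k).
- On the compact 4B model, RL matches or improves on SFT in score in all nine settings (the one tie is Kimi-K2.6 on SWE-QA, 72.6 for both).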
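The headline percentages can be reproduced directly from Table 1. The short sketch below recomputes a few of them; the helper names are illustrative, and the inputs are the GPT-5.4 rows of the table (tokens in thousands).

```python
def token_reduction(baseline_k: float, with_explorer_k: float) -> float:
    """Percentage reduction in tokens relative to the w/o Explore baseline."""
    return 100.0 * (baseline_k - with_explorer_k) / baseline_k


def score_delta(baseline: float, with_explorer: float) -> float:
    """Absolute change in score, in points."""
    return with_explorer - baseline


# GPT-5.4 on SWE-QA: w/o Explore 418k tokens vs. Same model 166k.
print(round(token_reduction(418, 166), 1))  # 60.3
# GPT-5.4 on SWE-QA: w/o Explore 418k tokens vs. FC-30B-SFT 206k.
print(round(token_reduction(418, 206), 1))  # 50.7
# GPT-5.4 on SWE-bench Multilingual: w/o Explore 457k vs. FC-4B-RL 338k.
print(round(token_reduction(457, 338), 1))  # 26.0
# GPT-5.4 on SWE-bench Pro: w/o Explore 46.0 vs. Same model 51.5.
print(round(score_delta(46.0, 51.5), 1))    # 5.5
```

Both the 60.3% token reduction and the +5.5-point gain come from the Same model rows for GPT-5.4, while the trained checkpoints reach roughly 50% on SWE-QA and 26% on SWE-bench Multilingual for the same main agent.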
Standalone Exploration Quality (Table 2)
On SWE-bench Verified, patch-derived reference locations are used to compute F1 at file, module, and function granularity. FastContext is compared to baselines: RepoSearcher, LocAgent, Agentless, OrcaLoca, CoSIL, OpenHands-Bash, CodeScout.
Table 2: Standalone exploration quality on SWE-bench Verified.
| Scaffold | LLM | File-level F1 / Prec. / Rec. | Module-level F1 / Prec. / Rec. | Function-level F1 / Prec. / Rec. |
|---|---|---|---|---|
| RepoSearcher | Qwen3-4B | 55.83 / 59.32 / 54.48 | 38.66 / 35.39 / 53.53 | 20.76 / 15.26 / 46.71 |
| RepoSearcher | Qwen3-30B | 67.14 / 70.94 / 65.61 | 25.18 / 22.24 / 35.93 | 15.40 / 11.82 / 33.11 |
| LocAgent | Qwen3-4B | 61.78 / 65.13 / 60.42 | 43.88 / 44.80 / 47.27 | 28.04 / 24.66 / 41.73 |
| ... (other baselines) | ... | ... | ... | ... |
| FastContext | GPT-5.4 | 72.34 / 76.55 / 70.69 | 55.16 / 51.63 / 71.76 | 35.91 / 29.15 / 70.81 |
| FastContext | GLM-5.1 | 73.88 / 77.96 / 72.29 | 59.31 / 56.28 / 73.51 | 43.50 / 37.46 / 72.02 |
| FastContext | Kimi-K2.6 | 71.34 / 75.15 / 69.86 | 59.34 / 56.85 / 70.80 | 43.87 / 38.68 / 68.22 |
| FastContext | Qwen3-4B | 62.57 / 66.13 / 61.19 | 51.25 / 53.05 / 53.79 | 37.80 / 37.37 / 47.22 |
| FastContext | Qwen3-30B | 65.29 / 69.14 / 63.78 | 57.04 / 56.48 / 64.04 | 42.90 / 39.98 / 60.62 |
| FastContext | FC-30B-SFT | 73.71 / 77.76 / 72.13 | 60.35 / 58.43 / 71.17 | 40.74 / 35.32 / 67.97 |
| FastContext | FC-4B-SFT | 70.55 / 74.75 / 68.85 | 55.26 / 53.00 / 69.25 | 37.48 / 32.83 / 66.82 |
| FastContext | FC-4B-RL | 71.48 / 75.35 / 69.92 | 56.26 / 53.80 / 70.49 | 38.45 / 32.79 / 68.05 |
Bold = best per column excluding frontier-model rows; underline = second best.
Key observations:
- Among non-frontier rows, FastContext trained checkpoints form the strongest group at file and module granularity, reaching 73.71 file-level F1 and 60.35 module-level F1 (FC-30B-SFT).
- SFT substantially improves the 4B explorer (file F1: 62.57 → 70.55; module F1: 51.25 → 55.26).
- RL further improves the 4B model (file F1: 70.55 → 71.48; module F1: 55.26 → 56.26), mainly via higher recall.
- The advantage is clearest at module and function level, indicating that FastContext narrows evidence toward code regions most likely to matter.
Ablation and Analysis
- Figure 4 breaks down main-agent total tokens by action category (File Reading, Code Search, File Editing, Testing, Other, FastContext overhead). Adding FastContext drastically reduces File Reading and Code Search tokens while adding only small FastContext invocation overhead.
- Figure 5 shows per-instance token distributions shift leftward (lower usage) for all FastContext variants.
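- Same-model exploration (the frontier model also acting as the explorer) usually scores at or below the best trained FastContext checkpoint in Table 1; the exceptions are GPT-5.4 on SWE-bench Pro and Kimi-K2.6 on SWE-QA. Its token usage is mixed: on SWE-QA it is the lowest of all configurations for all three main agents.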
Theoretical and Practical Implications
- Separation of concerns: The paper demonstrates that repository exploration can be decoupled from the solving agent and handled effectively by a much smaller, trained subagent. This modular view contrasts with monolithic agent trajectories where exploration and solving are interleaved.
- Efficiency without sacrifice: Reducing main-agent token consumption by up to 60% while improving accuracy challenges the assumption that better results require more context. Focused, grounded evidence is more valuable than exhaustive exploration.
- **Scal