Summary (Overview)
- SWE-Explore is a new benchmark that isolates repository exploration from end-to-end patch generation, formulating it as a ranked, line-level context selection task under a fixed line budget. It covers 848 issues across 10 programming languages and 203 open-source repositories.
- Ground truth is derived from successful agent trajectories (≥2 per instance) by intersecting read actions across independent runs, followed by LLM-based refinement and human audit—enabling trajectory-grounded supervision without manual line-level annotation for every instance.
- The benchmark evaluates explorers along coverage, ranking, and efficiency dimensions, and a controlled downstream protocol confirms that these upstream metrics strongly predict repair success (e.g., Context Efficiency correlates with resolve rate at Pearson r = 0.950).
- Experiments across 12 explorers (sparse retrievers, dense retrievers, general coding agents, and specialized localizers) show that agentic explorers form a clear tier above classical retrieval, but most remain recall-limited at the line level despite strong file-level hit rates.
- Controlled context degradation reveals that missing core evidence is the dominant failure mode; redundant irrelevant context is less damaging once essential regions are present.
Introduction and Theoretical Foundation
Background. Repository-level coding benchmarks like SWE-bench have driven rapid progress in coding agents, but they reduce each repair attempt to a single pass/fail outcome (resolved or unresolved). This holistic metric obscures why an agent succeeds or fails because it conflates repository exploration, bug localization, patch synthesis, and validation. Two distinct failure modes are possible: (1) the agent fails to explore the relevant code, or (2) it retrieves sufficient evidence but fails to generate a correct patch. The former is largely hidden by binary resolution rates.
Motivation. Determining which specific lines carry evidence for a given issue is a daunting challenge, even for agents that ultimately solve it. Existing evaluations of localization and retrieval [5, 13, 22, 26, 31, 37] lack a common, precise target at line granularity—they measure file- or function-level hits but not the exact spans consulted during successful issue resolution.
Theoretical basis. SWE-Explore formalizes repository exploration as a standalone functionality. Given an issue $q$ and repository snapshot $R$, an explorer returns a ranked list of regions:
$$\hat{C} = (r_1, r_2, \dots, r_K)$$
where each region $r_i = (f_i, [s_i, e_i])$ consists of a file path $f_i$ and a line range $[s_i, e_i]$. The output is scored against trajectory-grounded supervision derived from independent successful agent runs. The benchmark deliberately keeps the output format simple so that sparse retrievers, interactive agents, and long-context selectors can all be compared as producers of the same ranked region list under a fixed line budget.
Comparison with prior work. As shown in Table 1, no existing benchmark jointly provides executable validation, multilingual coverage, line-level ground truth, trajectory-grounded supervision, joint exploration+repair evaluation, and ranked region evaluation.
Table 1: Comparison with existing repository-level coding and exploration benchmarks.
| Benchmark | Exec. Based | Multi-Lingual | Line-Level GT | Trajectory-Grounded GT | Joint Expl. + Repair Eval | Ranked Region Eval |
|---|---|---|---|---|---|---|
| Loc-Bench [5] | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| SWE-bench Verified [6,11] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| SWE-bench Multilingual [30] | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| SWE-bench-Pro [7] | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| ContextBench [13] | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ |
| SWE-ContextBench [37] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| SWE-Explore (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Methodology
Task Formulation
SWE-Explore evaluates repository exploration as a standalone functionality: given issue $q$ and repository $R$, an explorer returns a ranked list of code regions $\hat{C} = (r_1, \dots, r_K)$, each $r_i = (f_i, [s_i, e_i])$ a file path with a line range. The explorer does not generate a patch and does not access ground truth.
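The output format described above can be pictured as a small data structure. The following is a minimal sketch, not the benchmark's actual schema: the class name `Region`, its field names, and the choice of 1-indexed inclusive line ranges are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One predicted code region (hypothetical schema).

    Assumption: line ranges are 1-indexed and inclusive on both ends.
    """
    path: str
    start: int
    end: int

    def __post_init__(self):
        if self.start < 1 or self.end < self.start:
            raise ValueError(f"invalid line range {self.start}-{self.end}")

    def lines(self):
        """Set of (file, line) pairs covered by this region."""
        return {(self.path, n) for n in range(self.start, self.end + 1)}

    def __len__(self):
        """Number of lines the region spends from the line budget."""
        return self.end - self.start + 1


# An explorer's answer is an ordered list, most relevant region first.
ranked = [
    Region("src/parser.py", 60, 80),
    Region("src/lexer.py", 10, 25),
]
total_lines = sum(len(r) for r in ranked)
```

Because every explorer family emits the same list shape, a retriever and an interactive agent can be scored by identical code.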
Data Sources
SWE-Explore is built from three public sources: SWE-bench Verified [6,11], SWE-bench-Pro [7], and SWE-bench Multilingual [30]. After the solution-verification filter (at least two successful trajectories per instance), 848 instances are retained across 10 programming languages and 203 open-source repositories.
Table 2: Per-instance averages of ground-truth core context.
| Statistic | Mean | Max |
|---|---|---|
| Issue Text Length (Words) | 191.2 | 1,892 |
| Ground-Truth Files | 4.3 | 15 |
| Regions | 4.7 | 15 |
| Lines | 1,578 | 16,136 |
| Source Trajectories | 2.9 | 4 |
| Modified-by-Patch Files | 1.4 | 66 |
| Codebase Files (non-test) | 759 | 7,649 |
| Codebase Lines (non-test) | 179.6K | 1.4M |
Ground-Truth Annotation
SWE-Explore derives line-level supervision from solution-verified agent trajectories produced by GPT-5.4, Gemini-3-Pro, Sonnet-4.6, GLM-5.1, and Kimi-K2.6. Only instances with at least two successful trajectories ($n \geq 2$) are retained.
Extracting reads. From each trajectory, all read actions (editor view, cat, head, tail, grep -n) that resolve to explicit file–interval pairs are collected and normalized into regions $r = (f, [s, e])$.
Generating regions. Let $R_t$ be the set of regions extracted from trajectory $t$. The core context $C^{*}$ is derived by:
- Computing the file-wise line-level intersection across trajectories: $C_{\cap} = \bigcap_{t} R_t$.
- LLM-based refinement promotes a small subset of model-specific optional reads when they are load-bearing.
- Final human audit removes unsupported regions.
The intersection and union are taken file-wise at the line level (e.g., overlapping reads of parser.py:40–80 and parser.py:60–100 contribute parser.py:60–80 to the core context $C^{*}$). The final $C^{*}$ is the only scoring target in the main experiments.
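The intersection step can be sketched in a few lines. This is an illustrative reading of the description above, covering only the intersection (LLM refinement and human audit are not modeled); the function names and the `(path, start, end)` tuple layout with inclusive ranges are assumptions.

```python
from collections import defaultdict


def lines_by_file(regions):
    """Map each file path to the set of line numbers read in it."""
    out = defaultdict(set)
    for path, start, end in regions:
        out[path].update(range(start, end + 1))
    return out


def to_intervals(line_numbers):
    """Collapse a set of line numbers into sorted inclusive (start, end) runs."""
    runs = []
    for n in sorted(line_numbers):
        if runs and n == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], n)
        else:
            runs.append((n, n))
    return runs


def core_intersection(trajectories):
    """Lines read in every trajectory, grouped per file as intervals."""
    per_traj = [lines_by_file(t) for t in trajectories]
    shared_files = set.intersection(*(set(m) for m in per_traj))
    core = {}
    for path in sorted(shared_files):
        common = set.intersection(*(m[path] for m in per_traj))
        if common:
            core[path] = to_intervals(common)
    return core


# The example from the text: two runs with overlapping reads of parser.py.
run_a = [("parser.py", 40, 80), ("utils.py", 1, 20)]
run_b = [("parser.py", 60, 100)]
core = core_intersection([run_a, run_b])
print(core)  # {'parser.py': [(60, 80)]}
```

Working on line sets rather than interval endpoints keeps the logic simple and handles a trajectory that reads the same file in several disjoint chunks.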
Metrics
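Let $L(r)$ be the set of (file, line) pairs covered by region $r$. For budget $B$, let $\hat{C}_B$ be the longest prefix of the ranked prediction $\hat{C}$ whose cumulative $|L(r)|$ does not exceed $B$. Write $P_B = \bigcup_{r \in \hat{C}_B} L(r)$ for the predicted lines and $G = \bigcup_{r \in C^{*}} L(r)$ for the core lines, where $C^{*}$ is the core context.
Coverage and accuracy:
- Precision: $|P_B \cap G| \,/\, |P_B|$
- Recall: $|P_B \cap G| \,/\, |G|$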
- F1: harmonic mean of precision and recall.
- HitFile: fraction of core files reached by at least one predicted region.
- HitRegion: fraction of core regions overlapped by at least one prediction.
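The coverage metrics above can be computed directly from line sets. The sketch below is one possible implementation under stated assumptions: regions are `(path, start, end)` tuples with inclusive ranges, the budget prefix stops at the first region that would overflow the budget, and all function names are hypothetical.

```python
def covered(regions):
    """Union of (file, line) pairs over (path, start, end) regions."""
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def budget_prefix(ranked, budget):
    """Longest prefix of the ranked list whose total line count fits the budget."""
    kept, used = [], 0
    for p, s, e in ranked:
        size = e - s + 1
        if used + size > budget:
            break
        kept.append((p, s, e))
        used += size
    return kept


def coverage_metrics(ranked, core, budget):
    pred = budget_prefix(ranked, budget)
    P, G = covered(pred), covered(core)
    hit = len(P & G)
    precision = hit / len(P) if P else 0.0
    recall = hit / len(G) if G else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hit else 0.0
    core_files = {p for p, _, _ in core}
    pred_files = {p for p, _, _ in pred}
    hit_file = len(core_files & pred_files) / len(core_files) if core_files else 0.0
    hit_region = sum(1 for r in core if covered([r]) & P) / len(core) if core else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "hit_file": hit_file, "hit_region": hit_region}


core = [("parser.py", 60, 80), ("lexer.py", 5, 14)]
ranked = [("parser.py", 70, 89), ("utils.py", 1, 10), ("lexer.py", 1, 400)]
metrics = coverage_metrics(ranked, core, budget=100)
# The 400-line lexer.py region overflows the budget and is dropped, so only
# 11 of 30 predicted lines are core lines, and one of two core files is reached.
```

The example shows how a file-level hit can coexist with low line-level recall: `parser.py` is reached, yet only part of its core span is covered.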
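Ranking under budget (nDCG@B): Each predicted region is assigned a gain $g_i$ equal to the number of core lines it covers. Discounted Cumulative Gain under budget $B$:
$$\mathrm{DCG}@B = \sum_{i=1}^{k_B} \frac{g_i}{\log_2(i + 1)}$$
where $k_B$ is the number of predicted regions that fit within the budget. $\mathrm{nDCG}@B = \mathrm{DCG}@B \,/\, \mathrm{IDCG}@B$ normalizes against the best possible DCG under the same line budget.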
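A budgeted nDCG of this kind might be computed as follows. This is a sketch under several assumptions that the text does not settle: the discount is the standard $\log_2(i+1)$, each region's gain is counted independently of earlier regions, and the ideal ranking is taken to be the core regions ordered largest first and clipped to the remaining budget. The function names are hypothetical.

```python
import math


def n_lines(region):
    _, start, end = region
    return end - start + 1


def covered(regions):
    """Union of (file, line) pairs over (path, start, end) regions."""
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def dcg(gains):
    """Standard DCG with a log2(rank + 1) discount (assumed form)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg_at_budget(ranked, core, budget):
    G = covered(core)
    gains, used = [], 0
    for r in ranked:
        if used + n_lines(r) > budget:
            break  # regions past the budget are not visible
        used += n_lines(r)
        gains.append(len(covered([r]) & G))
    # ASSUMPTION: the ideal list is the core regions, largest first,
    # each clipped to whatever budget remains.
    ideal, left = [], budget
    for size in sorted((n_lines(r) for r in core), reverse=True):
        if left <= 0:
            break
        take = min(size, left)
        ideal.append(take)
        left -= take
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0


core = [("a.py", 1, 10), ("b.py", 1, 5)]
good = ndcg_at_budget([("a.py", 1, 10), ("b.py", 1, 5)], core, budget=100)
bad = ndcg_at_budget([("b.py", 1, 5), ("a.py", 1, 10)], core, budget=100)
```

Both rankings cover the same lines, so precision and recall are identical; only nDCG distinguishes the ordering that surfaces the larger core region first.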
First useful hit (FUH): $\mathrm{FUH} = 1 / k^{*}$, where $k^{*}$ is the smallest rank whose visible lines intersect the core context (FUH is 0 if none). Higher FUH means earlier surfacing of evidence.
Efficiency and noise:
- Context Efficiency: fraction of predicted visible lines inside .
- Noise Rate (region-level): fraction of predicted regions overlapping neither the core context nor the optional context.
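The first-useful-hit and efficiency metrics can be sketched together. Two assumptions are made here and should be checked against the paper: FUH is treated as the reciprocal of the first rank that touches a core line, and the noise rate is measured against a core set plus an optional set of non-core reads. Function names and the `(path, start, end)` layout are hypothetical.

```python
def covered(regions):
    """Union of (file, line) pairs over (path, start, end) regions."""
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def first_useful_hit(ranked, core):
    """ASSUMPTION: reciprocal of the first rank touching a core line, else 0."""
    G = covered(core)
    for rank, region in enumerate(ranked, start=1):
        if covered([region]) & G:
            return 1.0 / rank
    return 0.0


def context_efficiency(ranked, core):
    """Fraction of predicted visible lines that fall inside the core context."""
    P, G = covered(ranked), covered(core)
    return len(P & G) / len(P) if P else 0.0


def noise_rate(ranked, core, optional=()):
    """Fraction of predicted regions touching neither core nor optional lines."""
    if not ranked:
        return 0.0
    known = covered(core) | covered(optional)
    noisy = sum(1 for r in ranked if not (covered([r]) & known))
    return noisy / len(ranked)


core = [("parser.py", 60, 80)]
optional = [("utils.py", 1, 20)]
ranked = [("README.md", 1, 10), ("parser.py", 70, 89), ("utils.py", 5, 9)]

fuh = first_useful_hit(ranked, core)    # core evidence first appears at rank 2
eff = context_efficiency(ranked, core)  # 11 core lines out of 35 predicted
noise = noise_rate(ranked, core, optional)  # only README.md counts as noise
```

Note that the `utils.py` region lowers Context Efficiency but does not count as noise, since it overlaps the optional set.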
Validation by downstream repair. A one-time restricted-context protocol: hide everything outside the explorer’s selected regions, feed only that context to a fixed coding agent (Mini-SWE-Agent backed by GPT-5.4 and Gemini-3-Pro), and measure resolve rate on the original SWE-bench harness. This checks whether exploration metrics track actual repair success.
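The hiding step of the restricted-context protocol could look like the sketch below. The source does not describe how hidden code is presented to the repair agent, so the placeholder line, the line-number prefix, and the function names are all assumptions for illustration.

```python
def restrict_file(text, visible_ranges, placeholder="# [hidden]"):
    """Keep only lines inside the visible 1-indexed inclusive ranges.

    Each run of hidden lines is collapsed into a single placeholder line.
    """
    keep = set()
    for start, end in visible_ranges:
        keep.update(range(start, end + 1))
    out, hiding = [], False
    for n, line in enumerate(text.splitlines(), start=1):
        if n in keep:
            out.append(f"{n}: {line}")
            hiding = False
        elif not hiding:
            out.append(placeholder)
            hiding = True
    return "\n".join(out)


def restricted_context(repo_files, regions):
    """Build the only context the repair agent sees: selected regions per file."""
    by_file = {}
    for path, start, end in regions:
        by_file.setdefault(path, []).append((start, end))
    return {
        path: restrict_file(repo_files[path], ranges)
        for path, ranges in by_file.items()
        if path in repo_files
    }


repo = {"a.py": "l1\nl2\nl3\nl4\nl5", "b.py": "x"}
view = restricted_context(repo, [("a.py", 2, 3)])
print(view["a.py"])
```

Files with no selected region are omitted entirely, which is what makes a missed core file costly for the downstream agent.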
Empirical Validation / Results
Setup
Evaluated explorers span four families: baselines (Oracle, Random), sparse retrievers (BM25, TF–IDF), dense retriever (Potion/RAG pipeline), and agentic explorers (general-purpose: Claude Code, Codex, OpenHands, Mini-SWE-Agent, AweAgent; specialized localizers: AutoCodeRover, LocAgent, OrcaLoca, CoSIL). Every explorer returns $K = 5$ regions (aligning with the average 4.7 core regions per instance).
Downstream Validation (Tables 3 and 4)
Table 3: Downstream resolve rate under restricted-context validation (GPT-5.4 with Mini-SWE-Agent, K=5).
| Explorer | Resolve Rate (%) |
|---|---|
| Oracle | 59.7 |
| Random | 4.7 |
| TF-IDF | 26.0 |
| RAG | 23.3 |
| BM25 | 12.7 |
| CoSIL | 59.3 |
| Mini-SWE-Agent | 50.0 |
| OpenHands | 47.7 |
| OrcaLoca | 45.3 |
| AutoCodeRover | 44.7 |
| LocAgent | 44.7 |
| AweAgent | 41.3 |
| Codex | 50.3 |
| Claude Code | 48.0 |
Table 4: Explorer-level Pearson/Spearman correlation between upstream metrics and downstream resolve rate (↓ marks lower-is-better).
| Metric | Pearson | Spearman |
|---|---|---|
| CtxEff | +0.950 | +0.739 |
| FUH | +0.928 | +0.675 |
| Rec@100 | +0.926 | +0.845 |
| HitFile | +0.925 | +0.695 |
| nDCG@500 | +0.921 | +0.460 |
| nDCG@300 | +0.920 | +0.458 |
| nDCG@100 | +0.917 | +0.480 |
| HitReg | +0.901 | +0.695 |
| Prec | +0.890 | +0.671 |
| NoiseReg ↓ | –0.812 | –0.562 |
| NoiseFile ↓ | –0.808 | –0.590 |
| Rec@300 | +0.769 | +0.796 |
| Rec@500 | +0.710 | +0.796 |
| F1 | +0.673 | +0.810 |
| Rec | +0.617 | +0.796 |
Findings. Context Efficiency has the strongest Pearson correlation ($r = 0.950$); Rec@100 is the strongest rank-correlated signal (Spearman $\rho = 0.845$). Noise metrics show the expected negative correlation. The results justify reporting a mixed metric set.
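The explorer-level correlations in Table 4 pair one upstream metric value with one resolve rate per explorer. The sketch below shows the two statistics on a five-explorer subset copied from Tables 3 and 6 (CtxEff and resolve rate for Random, BM25, TF-IDF, OpenHands, and Claude Code). It is illustrative only: a subset will not reproduce the Table 4 values, and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Subset of rows: Random, BM25, TF-IDF, OpenHands, Claude Code.
ctx_eff = [0.002, 0.087, 0.190, 0.737, 0.829]   # Table 6, CtxEff
resolve = [4.7, 12.7, 26.0, 47.7, 48.0]         # Table 3, resolve rate (%)

r = pearson(ctx_eff, resolve)
rho = spearman(ctx_eff, resolve)
```

Pearson rewards a linear relationship while Spearman only checks that the ordering agrees, which is why a metric can rank high on one column of Table 4 and lower on the other.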
Exploration Quality (Tables 5 and 6)
Table 5: Exploration quality at $K = 5$ across different LLMs powering the same Mini-SWE-Agent scaffold (bold best, underline second best).
| Model | HitReg | Prec | Rec | F1 | HitFile | nDCG@500 | Rec@500 | FUH | CtxEff | NoiseReg↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 0.516 | 0.542 | 0.154 | 0.194 | 0.655 | 0.905 | 0.154 | 0.927 | 0.771 | 0.258 |
| GPT-5.4-mini | 0.531 | 0.509 | 0.185 | 0.215 | 0.649 | 0.924 | 0.183 | 0.956 | 0.754 | 0.265 |
| Kimi-K2.6 | 0.413 | 0.475 | 0.117 | 0.149 | 0.509 | 0.739 | 0.115 | 0.759 | 0.676 | 0.316 |
| Sonnet-4.5 | 0.428 | 0.519 | 0.118 | 0.154 | 0.535 | 0.779 | 0.116 | 0.802 | 0.715 | 0.279 |
| GLM-4.7 | 0.289 | 0.414 | 0.122 | 0.148 | 0.343 | 0.557 | 0.105 | 0.572 | 0.536 | 0.465 |
| Gemini-3-Pro | 0.268 | 0.420 | 0.052 | 0.079 | 0.369 | 0.605 | 0.052 | 0.620 | 0.540 | 0.467 |
Table 6: Exploration quality at $K = 5$ across all explorer families (bold best non-oracle, underline second best). All agentic explorers are driven by GPT-5.4.
| Explorer | HitReg | Prec | Rec | F1 | HitFile | nDCG@500 | Rec@500 | FUH | CtxEff | NoiseReg↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Oracle | 0.915 | 1.000 | 0.953 | 0.964 | 0.923 | 0.858 | 0.576 | 1.000 | 1.000 | 0.000 |
| Random | 0.003 | 0.002 | 0.004 | 0.002 | 0.004 | 0.004 | 0.001 | 0.006 | 0.002 | 0.997 |
| BM25 | 0.065 | 0.055 | 0.021 | 0.024 | 0.079 | 0.132 | 0.021 | 0.141 | 0.087 | 0.910 |
| TF-IDF | 0.121 | 0.117 | 0.049 | 0.054 | 0.140 | 0.223 | 0.049 | 0.240 | 0.190 | 0.821 |
| Potion | 0.069 | 0.055 | 0.025 | 0.026 | 0.088 | 0.136 | 0.025 | 0.146 | 0.100 | 0.897 |
| OpenHands | 0.514 | 0.489 | 0.179 | 0.209 | 0.645 | 0.867 | 0.177 | 0.895 | 0.737 | 0.245 |
| Mini-SWE-Agent | 0.505 | 0.530 | 0.151 | 0.190 | 0.640 | 0.885 | 0.151 | 0.907 | 0.754 | 0.253 |
| AweAgent | 0.534 | 0.577 | 0.140 | 0.182 | 0.682 | 0.954 | 0.140 | 0.975 | 0.829 | 0.191 |
| AutoCodeRover | 0.272 | 0.680 | 0.233 | 0.291 | 0.280 | 0.720 | 0.165 | 0.730 | 0.738 | 0.034 |
| LocAgent | 0.472 | 0.642 | 0.191 | 0.241 | 0.540 | 0.950 | 0.173 | 0.977 | 0.799 | 0.195 |
| OrcaLoca | 0.126 | 0.295 | 0.033 | 0.049 | 0.129 | 0.311 | 0.030 | 0.313 | 0.317 | 0.003 |
| CoSIL | 0.544 | 0.581 | 0.788 | 0.602 | 0.544 | 0.824 | 0.412 | 0.920 | 0.898 | 0.471 |
| Claude Code | 0.531 | 0.598 | 0.154 | 0.202 | 0.667 | 0.938 | 0.154 | 0.963 | 0.829 | 0.186 |
| Codex | 0.516 | 0.523 | 0.194 | 0.223 | 0.649 | 0.901 | 0.190 | 0.936 | 0.762 | 0.249 |