# SWE-Explore: Benchmarking How Coding Agents Explore Repositories

> SWE-Explore benchmarks repository exploration and finds that even strong agents are recall-limited at line level, where missing core evidence dominates failures.

- **Source:** [arXiv](https://arxiv.org/abs/2606.07297)
- **Published:** 2026-06-10
- **Permalink:** https://picx.dev/p/MvFdES
- **Whiteboard:** https://picx.dev/p/MvFdES/image

## Summary (Overview)

- SWE-Explore is a new benchmark that **isolates repository exploration** from end-to-end patch generation, formulating it as a **ranked, line-level context selection task** under a fixed line budget. It covers 848 issues across 10 programming languages and 203 open-source repositories.
- Ground truth is derived **from successful agent trajectories** (≥2 per instance) by intersecting read actions across independent runs, followed by LLM-based refinement and human audit. This enables trajectory-grounded supervision without manual line-level annotation for every instance.
- The benchmark evaluates explorers along **coverage, ranking, and efficiency dimensions**, and a controlled downstream protocol confirms that these upstream metrics strongly predict repair success (e.g., Context Efficiency correlates with resolve rate at Pearson r = 0.950).
- Experiments across 12 explorers (sparse retrievers, dense retrievers, general coding agents, and specialized localizers) show that **agentic explorers form a clear tier above classical retrieval**, but most remain **recall-limited at the line level** despite strong file-level hit rates.
- Controlled context degradation reveals that **missing core evidence is the dominant failure mode**; redundant irrelevant context is less damaging once essential regions are present.

## Introduction and Theoretical Foundation

**Background.** Repository-level coding benchmarks like SWE-bench have driven rapid progress in coding agents, but they reduce each repair attempt to a **single pass/fail prediction** (resolved or unresolved). This holistic metric obscures *why* an agent succeeds or fails: it conflates repository exploration, bug localization, patch synthesis, and validation. Two distinct failure modes emerge: (1) the agent fails to explore the relevant code, or (2) it retrieves sufficient evidence but fails to generate a correct patch. The former is largely hidden by binary resolution rates.

**Motivation.** Determining which specific lines carry evidence for a given issue is difficult, even for agents that ultimately solve it. Existing evaluations of localization and retrieval [5, 13, 22, 26, 31, 37] lack a common, precise target at line granularity: they measure file- or function-level hits but not the exact spans consulted during successful issue resolution.

**Theoretical basis.** SWE-Explore formalizes repository exploration as a standalone functionality. Given an issue $q$ and repository snapshot $R$, an explorer returns:

$$f: (q, R) \mapsto P = (r_1, r_2, \ldots, r_K)$$

where each region $r_i = (p_i, s_i, e_i)$ consists of a file path $p_i$ and a line range $[s_i, e_i]$. The output is scored against **trajectory-grounded supervision** derived from independent successful agent runs. The benchmark deliberately keeps the output format simple so that sparse retrievers, interactive agents, and long-context selectors can all be compared as producers of the same ranked region list under a fixed line budget.

**Comparison with prior work.** As shown in Table 1, no existing benchmark jointly provides executable validation, multilingual coverage, line-level ground truth, trajectory-grounded supervision, joint exploration+repair evaluation, and ranked region evaluation.

**Table 1:** Comparison with existing repository-level coding and exploration benchmarks.

| Benchmark | Exec. Based | Multi-Lingual | Line-Level GT | Trajectory-Grounded GT | Joint Expl. + Repair Eval | Ranked Region Eval |
|---|---|---|---|---|---|---|
| Loc-Bench [5] | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| SWE-bench Verified [6,11] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| SWE-bench Multilingual [30] | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| SWE-bench-Pro [7] | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| ContextBench [13] | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ |
| SWE-ContextBench [37] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| **SWE-Explore (Ours)** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

## Methodology

### Task Formulation

SWE-Explore evaluates repository exploration as a standalone functionality: given issue $q$ and repository $R$, an explorer returns a ranked list of code regions $P = (r_1, \ldots, r_K)$, each $r_i = (p_i, s_i, e_i)$. The explorer does not generate a patch and does not access ground truth.
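
A minimal sketch of the explorer output format may help make the formulation concrete. The `Region` class and its method names are hypothetical, and the 1-based inclusive line convention is an assumption; the source only specifies that each region is a triple $(p_i, s_i, e_i)$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One predicted region r = (p, s, e). Hypothetical helper type."""
    path: str   # file path p
    start: int  # first line s (assumed 1-based, inclusive)
    end: int    # last line e (assumed inclusive)

    def __post_init__(self):
        if self.start < 1 or self.end < self.start:
            raise ValueError(f"invalid line range {self.start}..{self.end}")

    def lines(self):
        """Return the (file, line) pairs covered by this region."""
        return {(self.path, n) for n in range(self.start, self.end + 1)}


# A ranked prediction P is simply an ordered list of regions.
prediction = [
    Region("parser.py", 40, 80),
    Region("lexer.py", 1, 20),
]
total_lines = sum(len(r.lines()) for r in prediction)
```

Representing a region as an immutable value keeps the ranked list easy to compare across very different explorers, since each one only has to emit paths and line ranges.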

### Data Sources

Built from three public sources: SWE-bench Verified [6,11], SWE-bench-Pro [7], and SWE-bench Multilingual [30]. After the solution-verification filter (at least two successful trajectories per instance), **848 instances** are retained across **10 programming languages** and **203 open-source repositories**.

**Table 2:** Per-instance averages of ground-truth core context.

| | Mean | Max |
|---|---|---|
| Issue Text Length (Words) | 191.2 | 1,892 |
| Ground-Truth Files | 4.3 | 15 |
| Regions | 4.7 | 15 |
| Lines | 1,578 | 16,136 |
| Source Trajectories | 2.9 | 4 |
| Modified-by-Patch Files | 1.4 | 66 |
| Codebase Files (non-test) | 759 | 7,649 |
| Codebase Lines (non-test) | 179.6K | 1.4M |

### Ground-Truth Annotation

SWE-Explore **derives line-level supervision from solution-verified agent trajectories** (GPT-5.4, Gemini-3-Pro, Sonnet-4.6, GLM-5.1, Kimi-K2.6). Only instances with at least two successful trajectories ($|T| \ge 2$) are retained.

**Extracting reads.** From each trajectory, all read actions (editor view, cat, head, tail, grep -n) that resolve to explicit file–interval pairs are collected and normalized into regions $(p, s, e)$.
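
The normalization step can be sketched as follows. The source does not describe how reads are normalized, so this example assumes one plausible approach: overlapping or adjacent intervals within the same file are merged into a single region. The function name `normalize_reads` is hypothetical.

```python
from collections import defaultdict


def normalize_reads(reads):
    """Merge overlapping or adjacent (path, start, end) reads per file.

    Assumption: line ranges are 1-based and inclusive, and merging is the
    intended normalization (the source does not specify this).
    """
    by_file = defaultdict(list)
    for path, start, end in reads:
        # Tolerate reversed bounds by ordering them.
        by_file[path].append((min(start, end), max(start, end)))

    regions = []
    for path in sorted(by_file):
        merged = []
        for start, end in sorted(by_file[path]):
            if merged and start <= merged[-1][1] + 1:
                # Extend the previous interval when the new one touches it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        regions.extend((path, s, e) for s, e in merged)
    return regions


reads = [("a.py", 10, 20), ("a.py", 15, 30), ("a.py", 31, 40), ("b.py", 1, 5)]
regions = normalize_reads(reads)
# regions == [("a.py", 10, 40), ("b.py", 1, 5)]
```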

**Generating regions.** Let $R(\tau)$ be the set of regions extracted from trajectory $\tau$. The **core context** $R_{\text{core}}$ is derived in three steps:
1. Compute the file-wise line-level intersection: $R_{\text{int}} = \bigcap_{\tau \in T} R(\tau)$.
2. Apply LLM-based refinement, which promotes a small subset of the model-specific optional reads $R^{(m)}_{\text{opt}} = \left(\bigcup_{\tau \in T_m} R(\tau)\right) \setminus R_{\text{int}}$ into the core when they are load-bearing. Here $T_m \subseteq T$ denotes the trajectories produced by model $m$.
3. Run a final human audit that removes unsupported regions.

The intersection/union is taken file-wise at the line level (e.g., overlapping reads of `parser.py:40–80` and `parser.py:60–100` contribute `parser.py:60–80` to $R_{\text{int}}$). The final $R_{\text{core}}$ is the only scoring target in main experiments.
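
The intersection and optional-read computations can be sketched at the line level. This is an illustration under assumptions, not the benchmark's implementation: regions are plain `(path, start, end)` tuples with 1-based inclusive bounds, and the helper names are hypothetical. The LLM refinement and human audit steps are not modeled.

```python
def to_lines(regions):
    """Expand regions into a set of (path, line) pairs."""
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def to_regions(lines):
    """Collapse (path, line) pairs back into maximal contiguous regions."""
    regions = []
    for path, n in sorted(lines):
        if regions and regions[-1][0] == path and regions[-1][2] == n - 1:
            regions[-1] = (path, regions[-1][1], n)
        else:
            regions.append((path, n, n))
    return regions


def core_intersection(trajectories):
    """Lines read by every trajectory (R_int), as regions."""
    line_sets = [to_lines(t) for t in trajectories]
    if not line_sets:
        return []
    return to_regions(set.intersection(*line_sets))


def optional_reads(model_trajectories, core):
    """Lines read by one model's trajectories but absent from the core."""
    union = set().union(*(to_lines(t) for t in model_trajectories))
    return to_regions(union - to_lines(core))


run_a = [("parser.py", 40, 80), ("lexer.py", 1, 20)]
run_b = [("parser.py", 60, 100)]
core = core_intersection([run_a, run_b])
# core == [("parser.py", 60, 80)]
optional = optional_reads([run_a], core)
# optional == [("lexer.py", 1, 20), ("parser.py", 40, 59)]
```

Working on sets of `(path, line)` pairs makes the file-wise behavior automatic: lines from different files can never intersect, so no per-file bookkeeping is needed.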

### Metrics

Let $L(r) \subseteq \{(p, \ell)\}$ be the set of (file, line) pairs covered by region $r$. For budget $B$, let $P_{\le B}$ be the longest prefix of $P$ whose cumulative $|L(\cdot)|$ does not exceed $B$. Write $L(P) = \bigcup_i L(r_i)$ and $Y = L(R_{\text{core}})$.
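
The budgeted prefix $P_{\le B}$ can be sketched as below. One detail is an assumption: the cumulative size is measured as the number of distinct lines covered so far, so overlapping regions are not charged twice. The source does not say whether overlaps are deduplicated. The function names are hypothetical.

```python
def region_lines(region):
    """Return the (path, line) pairs of a (path, start, end) region."""
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def budget_prefix(ranked, budget):
    """Longest prefix of `ranked` whose covered lines fit within `budget`."""
    prefix, covered = [], set()
    for region in ranked:
        candidate = covered | region_lines(region)
        if len(candidate) > budget:
            break  # a prefix cannot skip a region, so stop at the first overflow
        covered = candidate
        prefix.append(region)
    return prefix


ranked = [("a.py", 1, 50), ("b.py", 1, 40), ("c.py", 1, 30)]
prefix = budget_prefix(ranked, budget=100)
# prefix == [("a.py", 1, 50), ("b.py", 1, 40)]
```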

**Coverage and accuracy:**
- **Precision:** $\text{PREC} = |L(P) \cap Y| / |L(P)|$
- **Recall:** $\text{REC} = |L(P) \cap Y| / |Y|$
- **F1:** harmonic mean of precision and recall.
- **HitFile:** fraction of core files reached by at least one predicted region.
- **HitRegion:** fraction of core regions overlapped by at least one prediction.
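
The coverage metrics above can be computed directly from line sets. The sketch below assumes regions are `(path, start, end)` tuples with 1-based inclusive bounds and returns 0 when a denominator is empty; both choices are assumptions, and `coverage_metrics` is a hypothetical name.

```python
def covered(regions):
    """Expand regions into a set of (path, line) pairs."""
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def coverage_metrics(predicted, core):
    pred_lines, core_lines = covered(predicted), covered(core)
    overlap = len(pred_lines & core_lines)

    prec = overlap / len(pred_lines) if pred_lines else 0.0
    rec = overlap / len(core_lines) if core_lines else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    # HitFile: share of core files touched by any predicted region.
    core_files = {p for p, _, _ in core}
    pred_files = {p for p, _, _ in predicted}
    hit_file = len(core_files & pred_files) / len(core_files) if core_files else 0.0

    # HitRegion: share of core regions overlapped by any prediction.
    hits = sum(1 for r in core if covered([r]) & pred_lines)
    hit_region = hits / len(core) if core else 0.0

    return {"prec": prec, "rec": rec, "f1": f1,
            "hit_file": hit_file, "hit_region": hit_region}


predicted = [("a.py", 1, 20), ("b.py", 1, 10)]
core = [("a.py", 11, 30), ("c.py", 1, 10)]
metrics = coverage_metrics(predicted, core)
# 10 of 30 predicted lines are core lines, and 10 of 30 core lines are found.
```

In this toy case the prediction reaches one of two core files but only a third of the core lines, which mirrors the gap between file-level and line-level scores that the benchmark is designed to expose.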

**Ranking under budget (nDCG@B):**
Each predicted region $r_i$ is assigned gain $g_i$ equal to the number of core lines it covers. Discounted Cumulative Gain under budget $B$:

$$\text{DCG@}B = \sum_{i \in P_{\le B}} \frac{g_i}{\log_2(i+2)}$$

$\text{nDCG@}B$ normalizes against the best possible DCG under the same line budget.
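
A sketch of the budgeted ranking score follows. Two points are assumptions rather than statements from the source: ranks are counted from zero (so the top region has discount $\log_2 2 = 1$), and the ideal ranking is approximated by sorting the core regions by size in descending order. The source only says the normalizer is the best possible DCG under the same budget, so the ideal used here is a stand-in. All function names are hypothetical.

```python
import math


def region_lines(region):
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def budget_prefix(ranked, budget):
    """Longest prefix whose distinct covered lines fit within the budget."""
    prefix, seen = [], set()
    for region in ranked:
        candidate = seen | region_lines(region)
        if len(candidate) > budget:
            break
        seen = candidate
        prefix.append(region)
    return prefix


def dcg_at_budget(ranked, core_lines, budget):
    total = 0.0
    for i, region in enumerate(budget_prefix(ranked, budget)):
        gain = len(region_lines(region) & core_lines)  # core lines covered
        total += gain / math.log2(i + 2)               # zero-based rank i
    return total


def ndcg_at_budget(ranked, core, budget):
    core_lines = set().union(*(region_lines(r) for r in core))
    # Assumed ideal: largest core regions first.
    ideal = sorted(core, key=lambda r: r[2] - r[1], reverse=True)
    ideal_dcg = dcg_at_budget(ideal, core_lines, budget)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_budget(ranked, core_lines, budget) / ideal_dcg


core = [("a.py", 1, 40), ("b.py", 1, 10)]
best = ndcg_at_budget(core, core, budget=100)
swapped = ndcg_at_budget(list(reversed(core)), core, budget=100)
```

Placing the smaller core region first lowers the score even though the same lines are returned, which is the ordering sensitivity that a discounted gain is meant to capture.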

**First useful hit (FUH):** $\text{FUH} = 1 - i^\star / |P|$, where $i^\star$ is the smallest rank whose visible lines intersect $Y$; FUH is 0 if no predicted region intersects $Y$. Higher FUH means evidence surfaces earlier in the ranking.
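
FUH can be sketched in a few lines. The example assumes ranks are counted from zero, which is consistent with the Oracle row in Table 6 scoring exactly 1.000, and it returns 0 when nothing useful is found. The function name is hypothetical.

```python
def first_useful_hit(ranked, core_lines):
    """FUH = 1 - i*/|P| with zero-based rank i*; 0.0 when there is no hit."""
    for i, (path, start, end) in enumerate(ranked):
        region = {(path, n) for n in range(start, end + 1)}
        if region & core_lines:
            return 1 - i / len(ranked)
    return 0.0


core_lines = {("a.py", n) for n in range(10, 21)}
ranked = [
    ("x.py", 1, 5),
    ("y.py", 1, 5),
    ("a.py", 15, 30),  # first region that touches the core, at index 2
    ("z.py", 1, 5),
    ("w.py", 1, 5),
]
fuh = first_useful_hit(ranked, core_lines)
# fuh == 1 - 2/5 == 0.6
```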

**Efficiency and noise:**
- **Context Efficiency:** fraction of predicted visible lines inside $L(R_{\text{core}}) \cup L(R^{(m)}_{\text{opt}})$.
- **Noise Rate (region-level):** fraction of predicted regions overlapping neither $R_{\text{core}}$ nor $R^{(m)}_{\text{opt}}$.
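
The two efficiency measures differ in granularity: Context Efficiency counts lines, while Noise Rate counts whole regions. A minimal sketch, assuming `(path, start, end)` tuples with inclusive bounds and hypothetical function names:

```python
def covered(regions):
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def context_efficiency(predicted, core, optional):
    """Share of predicted lines that fall inside core or optional context."""
    pred_lines = covered(predicted)
    useful = covered(core) | covered(optional)
    return len(pred_lines & useful) / len(pred_lines) if pred_lines else 0.0


def noise_rate(predicted, core, optional):
    """Share of predicted regions that touch neither core nor optional."""
    useful = covered(core) | covered(optional)
    noisy = sum(1 for r in predicted if not (covered([r]) & useful))
    return noisy / len(predicted) if predicted else 0.0


predicted = [("a.py", 1, 10), ("b.py", 1, 10), ("z.py", 1, 20)]
core = [("a.py", 1, 10)]
optional = [("b.py", 6, 15)]

eff = context_efficiency(predicted, core, optional)
# 15 useful lines out of 40 predicted lines -> 0.375
noise = noise_rate(predicted, core, optional)
# one of three regions touches nothing useful -> 1/3
```

A region that only partially overlaps useful context, such as `b.py` here, lowers Context Efficiency without counting as noise.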

**Validation by downstream repair.** A one-time restricted-context protocol checks whether exploration metrics track actual repair success: everything outside the explorer’s selected regions is hidden, only that context is fed to a fixed coding agent (Mini-SWE-Agent backed by GPT-5.4 and Gemini-3-Pro), and the resolve rate is measured on the original SWE-bench harness.
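
The hiding step could look roughly like the sketch below. The source does not describe how hidden code is presented to the repair agent, so this example assumes that hidden lines are blanked in place, which keeps the original line numbers stable. The function name `mask_file` is hypothetical.

```python
def mask_file(text, visible):
    """Blank every line outside the visible (start, end) ranges.

    Assumptions: ranges are 1-based and inclusive, and hidden lines are
    replaced by empty lines so that line numbering is preserved.
    """
    masked = []
    for number, line in enumerate(text.splitlines(), start=1):
        keep = any(start <= number <= end for start, end in visible)
        masked.append(line if keep else "")
    return "\n".join(masked)


source = "import os\ndef f():\n    return 1\nprint(f())"
restricted = mask_file(source, visible=[(2, 3)])
# Only lines 2 and 3 survive; lines 1 and 4 become empty.
```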

## Empirical Validation / Results

### Setup

Evaluated explorers span four families: **baselines** (Oracle, Random), **sparse retrievers** (BM25, TF–IDF), **dense retriever** (Potion/RAG pipeline), and **agentic explorers** (general-purpose: Claude Code, Codex, OpenHands, Mini-SWE-Agent, AweAgent; specialized localizers: AutoCodeRover, LocAgent, OrcaLoca, CoSIL). Every explorer returns $K=5$ regions (aligning with the average 4.7 core regions per instance).

### Downstream Validation (Table 3 & 4)

**Table 3:** Downstream resolve rate under restricted-context validation (GPT-5.4 with Mini-SWE-Agent, K=5).

| Explorer | Resolve Rate (%) |
|---|---|
| Oracle | 59.7 |
| Random | 4.7 |
| TF-IDF | 26.0 |
| RAG | 23.3 |
| BM25 | 12.7 |
| CoSIL | 59.3 |
| Mini-SWE-Agent | 50.0 |
| OpenHands | 47.7 |
| OrcaLoca | 45.3 |
| AutoCodeRover | 44.7 |
| LocAgent | 44.7 |
| AweAgent | 41.3 |
| Codex | 50.3 |
| Claude Code | 48.0 |

**Table 4:** Explorer-level Pearson/Spearman correlation between upstream metrics and downstream resolve rate (↓ marks lower-is-better).

| Metric | Pearson $r$ | Spearman $\rho$ |
|---|---|---|
| CtxEff | **+0.950** | +0.739 |
| FUH | +0.928 | +0.675 |
| Rec@100 | +0.926 | **+0.845** |
| HitFile | +0.925 | +0.695 |
| nDCG@500 | +0.921 | +0.460 |
| nDCG@300 | +0.920 | +0.458 |
| nDCG@100 | +0.917 | +0.480 |
| HitReg | +0.901 | +0.695 |
| Prec | +0.890 | +0.671 |
| NoiseReg ↓ | -0.812 | -0.562 |
| NoiseFile ↓ | -0.808 | -0.590 |
| Rec@300 | +0.769 | +0.796 |
| Rec@500 | +0.710 | +0.796 |
| F1 | +0.673 | +0.810 |
| Rec$\ell$ | +0.617 | +0.796 |

**Findings.** Context Efficiency has the strongest Pearson correlation ($r=0.950$), while Rec@100 has the strongest Spearman rank correlation ($\rho=0.845$). Both noise metrics correlate negatively with resolve rate, as expected for lower-is-better measures. Because no single metric leads on both correlation measures, the results justify reporting a mixed metric set.
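
The explorer-level correlations can be illustrated with a small pure-Python sketch. It uses only six explorers, pairing each CtxEff value from Table 6 with the resolve rate from Table 3, so it will not reproduce the reported coefficients, which are computed over the full explorer set. The rank helper assumes no tied values, and all function names are hypothetical.

```python
import math


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; assumes there are no ties in `values`."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(ranks(xs), ranks(ys))


# (explorer, CtxEff from Table 6, resolve rate % from Table 3)
rows = [
    ("Oracle", 1.000, 59.7),
    ("Random", 0.002, 4.7),
    ("BM25", 0.087, 12.7),
    ("TF-IDF", 0.190, 26.0),
    ("CoSIL", 0.898, 59.3),
    ("Codex", 0.762, 50.3),
]
ctx_eff = [row[1] for row in rows]
resolve = [row[2] for row in rows]

r = pearson(ctx_eff, resolve)
rho = spearman(ctx_eff, resolve)
```

Pearson rewards a linear relationship between the metric and resolve rate, while Spearman only checks that the ordering of explorers agrees, which is why the two columns in Table 4 can favor different metrics.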

### Exploration Quality (Table 5 & 6)

**Table 5:** Exploration quality at $K=5$ across different LLMs powering the same Mini-SWE-Agent scaffold (bold best, underline second best).

| Model | HitReg | Prec | Rec$\ell$ | F1 | HitFile | nDCG@500 | Rec@500 | FUH | CtxEff | NoiseReg↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 0.516 | 0.542 | 0.154 | 0.194 | 0.655 | 0.905 | 0.154 | 0.927 | 0.771 | 0.258 |
| GPT-5.4-mini | **0.531** | 0.509 | **0.185** | **0.215** | 0.649 | **0.924** | **0.183** | **0.956** | 0.754 | 0.265 |
| Kimi-K2.6 | 0.413 | 0.475 | 0.117 | 0.149 | 0.509 | 0.739 | 0.115 | 0.759 | 0.676 | 0.316 |
| Sonnet-4.5 | 0.428 | **0.519** | 0.118 | 0.154 | 0.535 | 0.779 | 0.116 | 0.802 | 0.715 | 0.279 |
| GLM-4.7 | 0.289 | 0.414 | 0.122 | 0.148 | 0.343 | 0.557 | 0.105 | 0.572 | 0.536 | 0.465 |
| Gemini-3-Pro | 0.268 | 0.420 | 0.052 | 0.079 | 0.369 | 0.605 | 0.052 | 0.620 | 0.540 | **0.467** |

**Table 6:** Exploration quality at $K=5$ across all explorer families (bold best non-oracle, underline second best). All agentic explorers driven by GPT-5.4.

| Explorer | HitReg | Prec | Rec$\ell$ | F1 | HitFile | nDCG@500 | Rec@500 | FUH | CtxEff | NoiseReg↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Oracle** | 0.915 | 1.000 | 0.953 | 0.964 | 0.923 | 0.858 | 0.576 | 1.000 | 1.000 | 0.000 |
| Random | 0.003 | 0.002 | 0.004 | 0.002 | 0.004 | 0.004 | 0.001 | 0.006 | 0.002 | 0.997 |
| BM25 | 0.065 | 0.055 | 0.021 | 0.024 | 0.079 | 0.132 | 0.021 | 0.141 | 0.087 | 0.910 |
| TF-IDF | 0.121 | 0.117 | 0.049 | 0.054 | 0.140 | 0.223 | 0.049 | 0.240 | 0.190 | 0.821 |
| Potion | 0.069 | 0.055 | 0.025 | 0.026 | 0.088 | 0.136 | 0.025 | 0.146 | 0.100 | 0.897 |
| OpenHands | 0.514 | 0.489 | 0.179 | 0.209 | 0.645 | 0.867 | 0.177 | 0.895 | 0.737 | 0.245 |
| Mini-SWE-Agent | 0.505 | 0.530 | 0.151 | 0.190 | 0.640 | 0.885 | 0.151 | 0.907 | 0.754 | 0.253 |
| AweAgent | 0.534 | 0.577 | 0.140 | 0.182 | 0.682 | 0.954 | 0.140 | 0.975 | 0.829 | 0.191 |
| **AutoCodeRover** | 0.272 | **0.680** | **0.233** | **0.291** | 0.280 | 0.720 | 0.165 | 0.730 | 0.738 | **0.034** |
| **LocAgent** | 0.472 | **0.642** | 0.191 | 0.241 | 0.540 | **0.950** | 0.173 | **0.977** | 0.799 | 0.195 |
| OrcaLoca | 0.126 | 0.295 | 0.033 | 0.049 | 0.129 | 0.311 | 0.030 | 0.313 | 0.317 | 0.003 |
| **CoSIL** | **0.544** | 0.581 | **0.788** | **0.602** | 0.544 | 0.824 | **0.412** | 0.920 | **0.898** | 0.471 |
| Claude Code | 0.531 | 0.598 | 0.154 | 0.202 | **0.667** | 0.938 | 0.154 | 0.963 | 0.829 | 0.186 |
| Codex | 0.516 | 0.523 | 0.194 | 0.223 | 0.649 | 0.901 | 0.190 | 0.936 | 0.762 | 0.249 |

---

_Markdown view of https://picx.dev/p/MvFdES, served by PicX — AI-generated visual whiteboard summaries of research papers._
