Visual Summary | SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Summary (Overview)

SWE-Explore is a new benchmark that isolates repository exploration from end-to-end patch generation, formulating it as a ranked, line-level context selection task under a fixed line budget. It covers 848 issues across 10 programming languages and 203 open-source repositories.
Ground truth is derived from successful agent trajectories (≥2 per instance) by intersecting read actions across independent runs, followed by LLM-based refinement and human audit—enabling trajectory-grounded supervision without manual line-level annotation for every instance.
The benchmark evaluates explorers along coverage, ranking, and efficiency dimensions, and a controlled downstream protocol confirms that these upstream metrics strongly predict repair success (e.g., Context Efficiency correlates with resolve rate at Pearson r = 0.950).
Experiments across 12 explorers (sparse retrievers, dense retrievers, general coding agents, and specialized localizers) show that agentic explorers form a clear tier above classical retrieval, but most remain recall-limited at the line level despite strong file-level hit rates.
Controlled context degradation reveals that missing core evidence is the dominant failure mode; redundant irrelevant context is less damaging once essential regions are present.

Introduction and Theoretical Foundation

Background. Repository-level coding benchmarks like SWE-bench have driven rapid progress in coding agents, but they reduce each repair attempt to a single pass/fail outcome (resolved or unresolved). This holistic metric obscures why an agent succeeds or fails: it conflates repository exploration, bug localization, patch synthesis, and validation. Two distinct failure modes emerge: (1) the agent fails to explore the relevant code, or (2) it retrieves sufficient evidence but fails to generate a correct patch. The former is largely hidden by binary resolution rates.

Motivation. Determining which specific lines carry evidence for a given issue is a daunting challenge, even for agents that ultimately solve it. Existing evaluations of localization and retrieval [5, 13, 22, 26, 31, 37] lack a common, precise target at line granularity—they measure file- or function-level hits but not the exact spans consulted during successful issue resolution.

Theoretical basis. SWE-Explore formalizes repository exploration as a standalone functionality. Given an issue $q$ and repository snapshot $R$, an explorer returns:

$$f: (q, R) \mapsto P = (r_1, r_2, \ldots, r_K)$$

where each region $r_i = (p_i, s_i, e_i)$ consists of a file path $p_i$ and a line range $[s_i, e_i]$. The output is scored against trajectory-grounded supervision derived from independent successful agent runs. The benchmark deliberately keeps the output format simple so that sparse retrievers, interactive agents, and long-context selectors can all be compared as producers of the same ranked region list under a fixed line budget.

Comparison with prior work. As shown in Table 1, no existing benchmark jointly provides executable validation, multilingual coverage, line-level ground truth, trajectory-grounded supervision, joint exploration+repair evaluation, and ranked region evaluation.

Table 1: Comparison with existing repository-level coding and exploration benchmarks.

Benchmark	Exec. Based	Multi-Lingual	Line-Level GT	Trajectory-Grounded GT	Joint Expl. + Repair Eval	Ranked Region Eval
Loc-Bench [5]	✗	✗	✗	✗	✗	✗
SWE-bench Verified [6,11]	✓	✗	✗	✗	✗	✗
SWE-bench Multilingual [30]	✓	✓	✗	✗	✗	✗
SWE-bench-Pro [7]	✓	✓	✗	✗	✗	✗
ContextBench [13]	✓	✓	✗	✗	✓	✗
SWE-ContextBench [37]	✓	✗	✗	✗	✗	✗
SWE-Explore (Ours)	✓	✓	✓	✓	✓	✓

Methodology

Task Formulation

SWE-Explore evaluates repository exploration as a standalone functionality: given issue $q$ and repository $R$, an explorer returns a ranked list of code regions $P = (r_1, \ldots, r_K)$, each $r_i = (p_i, s_i, e_i)$. The explorer does not generate a patch and does not access ground truth.
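The interface above can be sketched as a small data structure. This is a minimal illustration, not the benchmark's actual code: the names `Region`, `Explorer`, and `toy_explorer` are hypothetical, and the line range is treated as inclusive on both ends, following the $[s_i, e_i]$ notation.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class Region:
    """One predicted region r_i = (p_i, s_i, e_i); the range is inclusive (assumed)."""
    path: str
    start: int
    end: int

    def lines(self) -> Set[Tuple[str, int]]:
        # Expand the region into the (file, line) pairs it covers.
        return {(self.path, n) for n in range(self.start, self.end + 1)}


# An explorer maps (issue text, repository root) to a ranked list of regions.
Explorer = Callable[[str, str], List[Region]]


def toy_explorer(issue: str, repo_root: str) -> List[Region]:
    # Hypothetical stand-in: a real explorer would search the repository here.
    return [Region("src/parser.py", 40, 80), Region("src/lexer.py", 1, 20)]


ranked = toy_explorer("Parser crashes on empty input", "/path/to/repo")
print([(r.path, r.start, r.end) for r in ranked])
```

Because every explorer family only has to emit this list, retrievers and interactive agents can be scored by the same code path.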

Data Sources

SWE-Explore is built from three public sources: SWE-bench Verified [6,11], SWE-bench-Pro [7], and SWE-bench Multilingual [30]. After the solution-verification filter (at least two successful trajectories per instance), 848 instances are retained across 10 programming languages and 203 open-source repositories.

Table 2: Per-instance averages of ground-truth core context.

Statistic	Mean	Max
Issue Text Length (Words)	191.2	1,892
Ground-Truth Files	4.3	15
Regions	4.7	15
Lines	1,578	16,136
Source Trajectories	2.9	4
Modified-by-Patch Files	1.4	66
Codebase Files (non-test)	759	7,649
Codebase Lines (non-test)	179.6K	1.4M

Ground-Truth Annotation

SWE-Explore derives line-level supervision from solution-verified agent trajectories (GPT-5.4, Gemini-3-Pro, Sonnet-4.6, GLM-5.1, Kimi-K2.6). Only instances with at least two successful trajectories ($|T| \ge 2$) are retained.

Extracting reads. From each trajectory, all read actions (editor view, cat, head, tail, grep -n) that resolve to explicit file–interval pairs are collected and normalized into regions $(p, s, e)$ .

Generating regions. Let $R(\tau)$ be the set of regions extracted from trajectory $\tau$. The core context $R_{\text{core}}$ is derived in three steps:

Compute the file-wise, line-level intersection: $R_{\text{int}} = \bigcap_{\tau \in T} R(\tau)$.
Apply LLM-based refinement, which promotes a small subset of the model-specific optional reads $R^{(m)}_{\text{opt}} = \bigcup_{\tau \in T_m} R(\tau) \setminus R_{\text{int}}$ (with $T_m \subseteq T$ the trajectories produced by model $m$) when they are load-bearing.
Run a final human audit that removes unsupported regions.

The intersection/union is taken file-wise at the line level (e.g., overlapping reads of parser.py:40–80 and parser.py:60–100 contribute parser.py:60–80 to $R_{\text{int}}$). The final $R_{\text{core}}$ is the only scoring target in main experiments.
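The intersection step can be sketched as follows. This is an illustrative reading of the description, not the authors' pipeline: regions are plain `(path, start, end)` tuples with inclusive ranges, and the helper names are hypothetical. The LLM refinement and human audit are not modeled.

```python
from collections import defaultdict


def to_lines(regions):
    """Expand (path, start, end) regions into a {path: set_of_line_numbers} map."""
    by_file = defaultdict(set)
    for path, start, end in regions:
        by_file[path].update(range(start, end + 1))
    return by_file


def to_regions(by_file):
    """Collapse per-file line sets back into maximal contiguous regions."""
    regions = []
    for path in sorted(by_file):
        run_start = prev = None
        for n in sorted(by_file[path]):
            if prev is None or n != prev + 1:
                if prev is not None:
                    regions.append((path, run_start, prev))
                run_start = n
            prev = n
        if prev is not None:
            regions.append((path, run_start, prev))
    return regions


def intersect_trajectories(trajectories):
    """File-wise, line-level intersection of the reads of every trajectory."""
    maps = [to_lines(t) for t in trajectories]
    common = {}
    # Only files read by every trajectory can contribute lines.
    for path in set.intersection(*(set(m) for m in maps)):
        shared = set.intersection(*(m[path] for m in maps))
        if shared:
            common[path] = shared
    return to_regions(common)


run_a = [("parser.py", 40, 80), ("utils.py", 1, 10)]
run_b = [("parser.py", 60, 100)]
print(intersect_trajectories([run_a, run_b]))  # [('parser.py', 60, 80)]
```

The toy input reproduces the parser.py example from the text; utils.py drops out because only one run read it.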

Metrics

Let $L(r) \subseteq \{(p, \ell)\}$ be the set of (file, line) pairs covered by region $r$. For budget $B$, let $P_{\le B}$ be the longest prefix of $P$ whose cumulative $|L(\cdot)|$ does not exceed $B$. Write $L(P) = \bigcup_i L(r_i)$ and $Y = L(R_{\text{core}})$.
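A minimal sketch of the budgeted prefix $P_{\le B}$ is shown below. It assumes that the cumulative size is the sum of per-region line counts (overlapping regions are counted twice), which the text does not spell out; function names are hypothetical.

```python
def region_lines(region):
    """L(r): the (file, line) pairs covered by an inclusive (path, start, end) region."""
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def prefix_under_budget(ranked, budget):
    """Longest prefix of `ranked` whose summed line count stays within `budget`."""
    kept, used = [], 0
    for region in ranked:
        size = len(region_lines(region))
        if used + size > budget:
            break  # a prefix cannot skip a region, so stop at the first overflow
        kept.append(region)
        used += size
    return kept


def covered_lines(regions):
    """L(P): union of the lines of all regions."""
    out = set()
    for region in regions:
        out |= region_lines(region)
    return out


ranked = [("a.py", 1, 50), ("b.py", 1, 40), ("c.py", 1, 30)]
visible = prefix_under_budget(ranked, budget=100)
print(len(visible), len(covered_lines(visible)))
```

With a budget of 100 lines, the third region would push the total to 120, so only the first two regions (90 lines) remain visible.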

Coverage and accuracy:

Precision: $\text{PREC} = |L(P) \cap Y| / |L(P)|$
Recall: $\text{REC} = |L(P) \cap Y| / |Y|$
F1: harmonic mean of precision and recall.
HitFile: fraction of core files reached by at least one predicted region.
HitRegion: fraction of core regions overlapped by at least one prediction.
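The coverage metrics above can be computed directly from line sets. The sketch below is illustrative only: it assumes inclusive `(path, start, end)` tuples, treats a core file as "reached" when any predicted region lies in that file, and returns 0 for empty denominators. None of these conventions are confirmed by the source.

```python
def region_lines(region):
    """L(r) for an inclusive (path, start, end) region."""
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def union_lines(regions):
    out = set()
    for region in regions:
        out |= region_lines(region)
    return out


def coverage_metrics(predicted, core):
    pred_lines, core_lines = union_lines(predicted), union_lines(core)
    hit = pred_lines & core_lines
    prec = len(hit) / len(pred_lines) if pred_lines else 0.0
    rec = len(hit) / len(core_lines) if core_lines else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    core_files = {path for path, _, _ in core}
    pred_files = {path for path, _, _ in predicted}
    # HitFile: share of core files that contain at least one predicted region.
    hit_file = len(core_files & pred_files) / len(core_files) if core_files else 0.0
    # HitRegion: share of core regions that overlap at least one predicted line.
    overlapped = sum(1 for c in core if region_lines(c) & pred_lines)
    hit_region = overlapped / len(core) if core else 0.0
    return {"prec": prec, "rec": rec, "f1": f1,
            "hit_file": hit_file, "hit_region": hit_region}


core = [("parser.py", 60, 79), ("lexer.py", 1, 20)]
predicted = [("parser.py", 50, 89)]
print(coverage_metrics(predicted, core))
```

In this toy case the prediction covers 20 of its 40 lines usefully and reaches one of two core files, so every metric evaluates to 0.5. It also shows how a file-level hit can coexist with low line-level recall.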

Ranking under budget (nDCG@B): Each predicted region $r_i$ is assigned gain $g_i$ equal to the number of core lines it covers. The Discounted Cumulative Gain under budget $B$ is:

$$\text{DCG@}B = \sum_{i \in P_{\le B}} \frac{g_i}{\log_2(i+2)}$$

$\text{nDCG@}B$ normalizes against the best possible DCG under the same line budget.
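A hedged sketch of the budgeted ranking metric follows. Two assumptions are made that the text does not confirm: the index $i$ in $\log_2(i+2)$ is taken as a 0-based rank, and the ideal ranking used for normalization is supplied by the caller rather than constructed here. Function names are hypothetical.

```python
import math


def region_lines(region):
    """L(r) for an inclusive (path, start, end) region."""
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def dcg_at_budget(ranked, core_lines, budget):
    """DCG@B with gain = core lines covered and discount log2(i + 2), i 0-based (assumed)."""
    total, used = 0.0, 0
    for i, region in enumerate(ranked):
        lines = region_lines(region)
        if used + len(lines) > budget:
            break  # only the longest prefix within the budget is scored
        used += len(lines)
        total += len(lines & core_lines) / math.log2(i + 2)
    return total


def ndcg_at_budget(ranked, ideal_ranked, core, budget):
    """Normalize by the DCG of a caller-supplied ideal ranking (assumption)."""
    core_lines = set()
    for region in core:
        core_lines |= region_lines(region)
    ideal = dcg_at_budget(ideal_ranked, core_lines, budget)
    return dcg_at_budget(ranked, core_lines, budget) / ideal if ideal > 0 else 0.0


core = [("a.py", 1, 10), ("b.py", 1, 5)]
ideal = [("a.py", 1, 10), ("b.py", 1, 5)]   # larger gain first
ranked = [("b.py", 1, 5), ("a.py", 1, 10)]  # same regions, worse order
print(round(ndcg_at_budget(ranked, ideal, core, budget=100), 3))
```

Both rankings cover every core line, yet the one that surfaces the larger region later scores below 1, which is the ordering sensitivity the metric is meant to capture.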

First useful hit (FUH): $1 - i^\star / |P|$, where $i^\star$ is the smallest rank whose visible lines intersect $Y$; FUH is 0 if no rank does. Higher FUH means evidence surfaces earlier.
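The FUH definition can be sketched in a few lines. The code assumes a 0-based rank, so that a hit in the top position yields 1.0, and it ignores the line budget for simplicity; both choices are assumptions made for illustration.

```python
def region_lines(region):
    """L(r) for an inclusive (path, start, end) region."""
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def first_useful_hit(ranked, core_lines):
    """FUH = 1 - i*/|P|, with i* the 0-based rank of the first region touching Y."""
    for i, region in enumerate(ranked):
        if region_lines(region) & core_lines:
            return 1.0 - i / len(ranked)
    return 0.0  # no predicted region touches the core context


core_lines = region_lines(("parser.py", 60, 80))
ranked = [
    ("docs.py", 1, 10),      # miss
    ("parser.py", 70, 90),   # first useful hit, 0-based rank 1
    ("lexer.py", 1, 10),
    ("parser.py", 1, 10),
]
print(first_useful_hit(ranked, core_lines))
```

Here the first overlapping region sits at 0-based rank 1 out of 4 regions, giving 1 - 1/4.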

Efficiency and noise:

Context Efficiency: fraction of predicted visible lines inside $L(R_{\text{core}}) \cup L(R^{(m)}_{\text{opt}})$.
Noise Rate (region-level): fraction of predicted regions overlapping neither $R_{\text{core}}$ nor $R^{(m)}_{\text{opt}}$.
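The two efficiency metrics differ in granularity: one counts lines, the other counts whole regions. The sketch below illustrates this under the assumption of inclusive `(path, start, end)` tuples and without applying a line budget; the function name is hypothetical.

```python
def region_lines(region):
    """L(r) for an inclusive (path, start, end) region."""
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def union_lines(regions):
    out = set()
    for region in regions:
        out |= region_lines(region)
    return out


def efficiency_and_noise(predicted, core, optional):
    """Line-level Context Efficiency and region-level Noise Rate."""
    useful = union_lines(core) | union_lines(optional)
    pred_lines = union_lines(predicted)
    ctx_eff = len(pred_lines & useful) / len(pred_lines) if pred_lines else 0.0
    # A region is noise only if it touches neither core nor optional lines.
    noisy = [r for r in predicted if not (region_lines(r) & useful)]
    noise_rate = len(noisy) / len(predicted) if predicted else 0.0
    return ctx_eff, noise_rate


core = [("a.py", 1, 10)]
optional = [("b.py", 6, 15)]
predicted = [("a.py", 1, 10), ("b.py", 1, 10), ("c.py", 1, 20)]
ctx_eff, noise_rate = efficiency_and_noise(predicted, core, optional)
print(ctx_eff, noise_rate)
```

In the toy case, 15 of 40 predicted lines fall in core or optional context, while one of three regions overlaps neither.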

Validation by downstream repair. A one-time restricted-context protocol hides everything outside the explorer’s selected regions, feeds only that context to a fixed coding agent (Mini-SWE-Agent backed by GPT-5.4 and Gemini-3-Pro), and measures the resolve rate on the original SWE-bench harness. This checks whether exploration metrics track actual repair success.
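One way to picture the restricted-context step is a function that renders only the selected line ranges. The sketch below is purely illustrative: the output format, the function name `render_restricted_context`, and the in-memory `files` mapping are assumptions, and the actual protocol's mechanism for hiding code is not described in this summary.

```python
def render_restricted_context(files, regions):
    """Return only the selected line ranges, in rank order, as one text blob.

    `files` maps a path to its full text; `regions` are inclusive,
    1-indexed (path, start, end) tuples. Everything else stays hidden.
    """
    chunks = []
    for path, start, end in regions:
        lines = files[path].splitlines()
        body = lines[start - 1:end]  # convert the 1-indexed inclusive range to a slice
        numbered = [f"{n}: {text}" for n, text in enumerate(body, start=start)]
        chunks.append(f"# {path}:{start}-{end}\n" + "\n".join(numbered))
    return "\n\n".join(chunks)


files = {"m.py": "a\nb\nc\nd\n"}
context = render_restricted_context(files, [("m.py", 2, 3)])
print(context)
```

The downstream agent would then receive `context` in place of repository access, so any evidence the explorer missed is unavailable to it.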

Empirical Validation / Results

Setup

Evaluated explorers span four families: baselines (Oracle, Random), sparse retrievers (BM25, TF-IDF), a dense retriever (Potion/RAG pipeline), and agentic explorers (general-purpose: Claude Code, Codex, OpenHands, Mini-SWE-Agent, AweAgent; specialized localizers: AutoCodeRover, LocAgent, OrcaLoca, CoSIL). Every explorer returns $K=5$ regions, in line with the average of 4.7 core regions per instance.

Downstream Validation (Tables 3 and 4)

Table 3: Downstream resolve rate under restricted-context validation (GPT-5.4 with Mini-SWE-Agent, K=5).

Explorer	Resolve Rate (%)
Oracle	59.7
Random	4.7
TF-IDF	26.0
RAG	23.3
BM25	12.7
CoSIL	59.3
Mini-SWE-Agent	50.0
OpenHands	47.7
OrcaLoca	45.3
AutoCodeRover	44.7
LocAgent	44.7
AweAgent	41.3
Codex	50.3
Claude Code	48.0

Table 4: Explorer-level Pearson/Spearman correlation between upstream metrics and downstream resolve rate (↓ marks lower-is-better).

Metric	Pearson $r$	Spearman $\rho$
CtxEff	+0.950	+0.739
FUH	+0.928	+0.675
Rec@100	+0.926	+0.845
HitFile	+0.925	+0.695
nDCG@500	+0.921	+0.460
nDCG@300	+0.920	+0.458
nDCG@100	+0.917	+0.480
HitReg	+0.901	+0.695
Prec	+0.890	+0.671
NoiseReg ↓	-0.812	-0.562
NoiseFile ↓	-0.808	-0.590
Rec@300	+0.769	+0.796
Rec@500	+0.710	+0.796
F1	+0.673	+0.810
Rec $\ell$	+0.617	+0.796

Findings. Context Efficiency has the strongest Pearson correlation ($r=0.950$), while Rec@100 is the strongest rank-correlated signal ($\rho=0.845$). The noise metrics show the expected negative correlation. Because no single metric leads under both correlation measures, the results justify reporting a mixed metric set.
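The gap between the two columns of Table 4 comes from what each coefficient measures: Pearson tracks linear agreement in magnitude, Spearman only agreement in ordering. The sketch below computes both in pure Python on made-up numbers; the `metric` and `resolve` lists are hypothetical and are not taken from the paper's tables.

```python
import math


def pearson(xs, ys):
    """Pearson r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical explorer-level scores (illustrative only).
metric = [0.1, 0.2, 0.7, 0.8]
resolve = [10.0, 30.0, 20.0, 60.0]
print(round(pearson(metric, resolve), 3), round(spearman(metric, resolve), 3))
```

One swapped pair in the ordering is enough to pull Spearman down to 0.8 on four points, which is why a metric can rank high on one coefficient and lower on the other.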

Exploration Quality (Tables 5 and 6)

Table 5: Exploration quality at $K=5$ across different LLMs powering the same Mini-SWE-Agent scaffold (bold best, underline second best).

Model	HitReg	Prec	Rec $\ell$	F1	HitFile	nDCG@500	Rec@500	FUH	CtxEff	NoiseReg↓
GPT-5.4	0.516	0.542	0.154	0.194	0.655	0.905	0.154	0.927	0.771	0.258
GPT-5.4-mini	0.531	0.509	0.185	0.215	0.649	0.924	0.183	0.956	0.754	0.265
Kimi-K2.6	0.413	0.475	0.117	0.149	0.509	0.739	0.115	0.759	0.676	0.316
Sonnet-4.5	0.428	0.519	0.118	0.154	0.535	0.779	0.116	0.802	0.715	0.279
GLM-4.7	0.289	0.414	0.122	0.148	0.343	0.557	0.105	0.572	0.536	0.465
Gemini-3-Pro	0.268	0.420	0.052	0.079	0.369	0.605	0.052	0.620	0.540	0.467

Table 6: Exploration quality at $K=5$ across all explorer families (bold best non-oracle, underline second best). All agentic explorers driven by GPT-5.4.

Explorer	HitReg	Prec	Rec $\ell$	F1	HitFile	nDCG@500	Rec@500	FUH	CtxEff	NoiseReg↓
Oracle	0.915	1.000	0.953	0.964	0.923	0.858	0.576	1.000	1.000	0.000
Random	0.003	0.002	0.004	0.002	0.004	0.004	0.001	0.006	0.002	0.997
BM25	0.065	0.055	0.021	0.024	0.079	0.132	0.021	0.141	0.087	0.910
TF-IDF	0.121	0.117	0.049	0.054	0.140	0.223	0.049	0.240	0.190	0.821
Potion	0.069	0.055	0.025	0.026	0.088	0.136	0.025	0.146	0.100	0.897
OpenHands	0.514	0.489	0.179	0.209	0.645	0.867	0.177	0.895	0.737	0.245
Mini-SWE-Agent	0.505	0.530	0.151	0.190	0.640	0.885	0.151	0.907	0.754	0.253
AweAgent	0.534	0.577	0.140	0.182	0.682	0.954	0.140	0.975	0.829	0.191
AutoCodeRover	0.272	0.680	0.233	0.291	0.280	0.720	0.165	0.730	0.738	0.034
LocAgent	0.472	0.642	0.191	0.241	0.540	0.950	0.173	0.977	0.799	0.195
OrcaLoca	0.126	0.295	0.033	0.049	0.129	0.311	0.030	0.313	0.317	0.003
CoSIL	0.544	0.581	0.788	0.602	0.544	0.824	0.412	0.920	0.898	0.471
Claude Code	0.531	0.598	0.154	0.202	0.667	0.938	0.154	0.963	0.829	0.186
Codex	0.516	0.523	0.194	0.223	0.649	0.901	0.190	0.936	0.762	0.249

**Key findings