Summary (Overview)

  • SWE-Explore is a new benchmark that isolates repository exploration from end-to-end patch generation, formulating it as a ranked, line-level context selection task under a fixed line budget. It covers 848 issues across 10 programming languages and 203 open-source repositories.
  • Ground truth is derived from successful agent trajectories (≥2 per instance) by intersecting read actions across independent runs, followed by LLM-based refinement and human audit—enabling trajectory-grounded supervision without manual line-level annotation for every instance.
  • The benchmark evaluates explorers along coverage, ranking, and efficiency dimensions, and a controlled downstream protocol confirms that these upstream metrics strongly predict repair success (e.g., Context Efficiency correlates with resolve rate at Pearson r = 0.950).
  • Experiments across 12 explorers (sparse retrievers, dense retrievers, general coding agents, and specialized localizers) show that agentic explorers form a clear tier above classical retrieval, but most remain recall-limited at the line level despite strong file-level hit rates.
  • Controlled context degradation reveals that missing core evidence is the dominant failure mode; redundant irrelevant context is less damaging once essential regions are present.

Introduction and Theoretical Foundation

Background. Repository-level coding benchmarks like SWE-bench have driven rapid progress in coding agents, but they reduce each repair attempt to a single pass/fail prediction (resolved or unresolved). This holistic metric obscures why an agent succeeds or fails: it conflates repository exploration, bug localization, patch synthesis, and validation. Two distinct failure modes emerge: (1) the agent fails to explore the relevant code, or (2) it retrieves sufficient evidence but fails to generate a correct patch. The former is largely hidden by binary resolution rates.

Motivation. Determining which specific lines carry evidence for a given issue is a daunting challenge, even for agents that ultimately solve it. Existing evaluations of localization and retrieval [5, 13, 22, 26, 31, 37] lack a common, precise target at line granularity—they measure file- or function-level hits but not the exact spans consulted during successful issue resolution.

Theoretical basis. SWE-Explore formalizes repository exploration as a standalone functionality. Given an issue $q$ and repository snapshot $R$, an explorer returns:

$$f: (q, R) \mapsto P = (r_1, r_2, \ldots, r_K)$$

where each region $r_i = (p_i, s_i, e_i)$ consists of a file path $p_i$ and a line range $[s_i, e_i]$. The output is scored against trajectory-grounded supervision derived from independent successful agent runs. The benchmark deliberately keeps the output format simple so that sparse retrievers, interactive agents, and long-context selectors can all be compared as producers of the same ranked region list under a fixed line budget.

Comparison with prior work. As shown in Table 1, no existing benchmark jointly provides executable validation, multilingual coverage, line-level ground truth, trajectory-grounded supervision, joint exploration+repair evaluation, and ranked region evaluation.

Table 1: Comparison with existing repository-level coding and exploration benchmarks.

Benchmark | Exec. Based | Multi-Lingual | Line-Level GT | Trajectory-Grounded GT | Joint Expl. + Repair Eval | Ranked Region Eval
Loc-Bench [5]
SWE-bench Verified [6,11]
SWE-bench Multilingual [30]
SWE-bench-Pro [7]
ContextBench [13]
SWE-ContextBench [37]
SWE-Explore (Ours)

Methodology

Task Formulation

SWE-Explore evaluates repository exploration as a standalone functionality: given issue $q$ and repository $R$, an explorer returns a ranked list of code regions $P = (r_1, \ldots, r_K)$, each $r_i = (p_i, s_i, e_i)$. The explorer does not generate a patch and does not access ground truth.
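To make the output format concrete, here is a minimal Python sketch of a region and of the explorer signature. The names `Region`, `Explorer`, and `total_lines` are illustrative and not part of the benchmark; the 1-indexed, inclusive line convention is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class Region:
    """One ranked region r_i = (p_i, s_i, e_i)."""
    path: str   # file path p_i
    start: int  # first line s_i (assumed 1-indexed, inclusive)
    end: int    # last line e_i (assumed inclusive)

    def __post_init__(self) -> None:
        if self.start < 1 or self.end < self.start:
            raise ValueError(f"invalid line range {self.start}-{self.end}")

    def lines(self) -> Set[Tuple[str, int]]:
        """All (file, line) pairs covered by this region."""
        return {(self.path, n) for n in range(self.start, self.end + 1)}


# Hypothetical explorer type: (issue text, repository root) -> ranked regions.
Explorer = Callable[[str, str], List[Region]]


def total_lines(ranked: List[Region]) -> int:
    """Number of distinct (file, line) pairs covered by a ranked list."""
    covered: Set[Tuple[str, int]] = set()
    for region in ranked:
        covered |= region.lines()
    return len(covered)
```

Any retriever or agent can be wrapped as an `Explorer` as long as it returns such a list, which is what lets different explorer families be scored in the same way.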

Data Sources

Built from three public sources: SWE-bench Verified [6,11], SWE-bench-Pro [7], and SWE-bench Multilingual [30]. After the solution-verification filter (at least two successful trajectories per instance), 848 instances are retained across 10 programming languages and 203 open-source repositories.

Table 2: Per-instance averages of ground-truth core context.

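| Statistic | Mean | Max |
|---|---|---|
| Issue Text Length (Words) | 191.2 | 1,892 |
| Ground-Truth Files | 4.3 | 15 |
| Regions | 4.7 | 15 |
| Lines | 1,578 | 16,136 |
| Source Trajectories | 2.9 | 4 |
| Modified-by-Patch Files | 1.4 | 66 |
| Codebase Files (non-test) | 759 | 7,649 |
| Codebase Lines (non-test) | 179.6K | 1.4M |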

Ground-Truth Annotation

SWE-Explore derives line-level supervision from solution-verified agent trajectories (GPT-5.4, Gemini-3-Pro, Sonnet-4.6, GLM-5.1, Kimi-K2.6). Only instances with at least two successful trajectories ($|T| \ge 2$) are retained.

Extracting reads. From each trajectory, all read actions (editor view, cat, head, tail, grep -n) that resolve to explicit file–interval pairs are collected and normalized into regions $(p, s, e)$.

Generating regions. Let $R(\tau)$ be the set of regions extracted from trajectory $\tau$. The core context $R_{\text{core}}$ is derived by:

  1. Computing the file-wise line-level intersection: $R_{\text{int}} = \bigcap_{\tau \in T} R(\tau)$.
  2. LLM-based refinement promotes a small subset of model-specific optional reads $R^{(m)}_{\text{opt}} = \bigcup_{\tau \in T_m} R(\tau) \setminus R_{\text{int}}$ when they are load-bearing.
  3. Final human audit removes unsupported regions.

The intersection/union is taken file-wise at the line level (e.g., overlapping reads of parser.py:40–80 and parser.py:60–100 contribute parser.py:60–80 to $R_{\text{int}}$). The final $R_{\text{core}}$ is the only scoring target in main experiments.
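The file-wise, line-level intersection and the optional-read set can be sketched in Python as follows. This is only an illustration of the set operations, assuming inclusive line ranges; the function names are hypothetical, and the LLM refinement and human audit steps are not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Region = Tuple[str, int, int]  # (path, start, end), inclusive line numbers


def lines_by_file(regions: List[Region]) -> Dict[str, Set[int]]:
    """Expand regions into a per-file set of line numbers."""
    covered: Dict[str, Set[int]] = defaultdict(set)
    for path, start, end in regions:
        covered[path].update(range(start, end + 1))
    return covered


def to_regions(covered: Dict[str, Set[int]]) -> List[Region]:
    """Collapse per-file line sets back into maximal contiguous regions."""
    out: List[Region] = []
    for path in sorted(covered):
        run_start = prev = None
        for line in sorted(covered[path]):
            if prev is not None and line == prev + 1:
                prev = line
                continue
            if run_start is not None:
                out.append((path, run_start, prev))
            run_start = prev = line
        if run_start is not None:
            out.append((path, run_start, prev))
    return out


def intersect_trajectories(trajectories: List[List[Region]]) -> List[Region]:
    """R_int: lines read by every trajectory, computed file by file."""
    per_traj = [lines_by_file(t) for t in trajectories]
    if not per_traj:
        return []
    common: Dict[str, Set[int]] = {}
    for path in set.intersection(*(set(c) for c in per_traj)):
        lines = set.intersection(*(c[path] for c in per_traj))
        if lines:
            common[path] = lines
    return to_regions(common)


def optional_reads(model_trajs: List[List[Region]], r_int: List[Region]) -> List[Region]:
    """R_opt^(m): lines read by model m's trajectories but not in R_int."""
    union: Dict[str, Set[int]] = defaultdict(set)
    for traj in model_trajs:
        for path, lines in lines_by_file(traj).items():
            union[path] |= lines
    core = lines_by_file(r_int)
    leftover = {p: ls - core.get(p, set()) for p, ls in union.items()}
    return to_regions({p: ls for p, ls in leftover.items() if ls})
```

On the example from the text, `intersect_trajectories([[("parser.py", 40, 80)], [("parser.py", 60, 100)]])` yields `[("parser.py", 60, 80)]`.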

Metrics

Let $L(r) \subseteq \{(p, \ell)\}$ be the set of (file, line) pairs covered by region $r$. For budget $B$, let $P_{\le B}$ be the longest prefix of $P$ whose cumulative $|L(\cdot)|$ does not exceed $B$. Write $L(P) = \bigcup_i L(r_i)$ and $Y = L(R_{\text{core}})$.

Coverage and accuracy:

  • Precision: $\text{PREC} = |L(P) \cap Y| / |L(P)|$
  • Recall: $\text{REC} = |L(P) \cap Y| / |Y|$
  • F1: harmonic mean of precision and recall.
  • HitFile: fraction of core files reached by at least one predicted region.
  • HitRegion: fraction of core regions overlapped by at least one prediction.
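The coverage metrics above can be sketched directly from their definitions. The code below is an illustrative implementation, not the benchmark's scorer; it assumes inclusive line ranges, treats a core file as "reached" when any predicted region lies in that file, and returns 0 for empty denominators.

```python
from typing import Dict, List, Set, Tuple

Region = Tuple[str, int, int]  # (path, start, end), inclusive line numbers


def covered_lines(regions: List[Region]) -> Set[Tuple[str, int]]:
    """L(.): the set of (file, line) pairs covered by a list of regions."""
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions share at least one line of the same file."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def coverage_metrics(pred: List[Region], core: List[Region]) -> Dict[str, float]:
    lp, y = covered_lines(pred), covered_lines(core)
    hit = len(lp & y)
    prec = hit / len(lp) if lp else 0.0
    rec = hit / len(y) if y else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    core_files = {p for p, _, _ in core}
    pred_files = {p for p, _, _ in pred}
    hit_file = len(core_files & pred_files) / len(core_files) if core_files else 0.0
    hit_region = (
        sum(any(overlaps(c, r) for r in pred) for c in core) / len(core)
        if core else 0.0
    )
    return {"precision": prec, "recall": rec, "f1": f1,
            "hit_file": hit_file, "hit_region": hit_region}
```

For instance, with predictions `[("parser.py", 50, 69), ("util.py", 1, 10)]` and core `[("parser.py", 60, 80), ("lexer.py", 5, 9)]`, 10 of the 30 predicted lines are core lines, so precision is 10/30 and recall is 10/26, while HitFile and HitRegion are both 0.5.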

Ranking under budget (nDCG@B): Each predicted region $r_i$ is assigned gain $g_i$ equal to the number of core lines it covers. Discounted Cumulative Gain under budget $B$:

$$\text{DCG@}B = \sum_{i \in P_{\le B}} \frac{g_i}{\log_2(i+2)}$$

$\text{nDCG@}B$ normalizes against the best possible DCG under the same line budget.
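A minimal sketch of the budgeted DCG follows. It assumes 0-indexed ranks (so the first region is discounted by $\log_2 2 = 1$), sums region lengths to apply the budget, and does not deduplicate core lines covered by several regions. How the ideal ranking is constructed is not spelled out in this summary, so the sketch takes it as an argument (`ideal_ranked`, a hypothetical parameter) rather than guessing.

```python
import math
from typing import List, Set, Tuple

Region = Tuple[str, int, int]  # (path, start, end), inclusive line numbers


def region_lines(region: Region) -> Set[Tuple[str, int]]:
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def budget_prefix(ranked: List[Region], budget: int) -> List[Region]:
    """Longest prefix whose summed region lengths stay within the budget."""
    kept, used = [], 0
    for region in ranked:
        size = region[2] - region[1] + 1
        if used + size > budget:
            break
        kept.append(region)
        used += size
    return kept


def dcg_at_budget(ranked: List[Region], core: List[Region], budget: int) -> float:
    """Sum of g_i / log2(i + 2) over the budgeted prefix, i starting at 0."""
    core_lines: Set[Tuple[str, int]] = set()
    for region in core:
        core_lines |= region_lines(region)
    total = 0.0
    for i, region in enumerate(budget_prefix(ranked, budget)):
        gain = len(region_lines(region) & core_lines)  # core lines covered by r_i
        total += gain / math.log2(i + 2)
    return total


def ndcg_at_budget(ranked: List[Region], ideal_ranked: List[Region],
                   core: List[Region], budget: int) -> float:
    """Normalize by the DCG of a caller-supplied ideal ranking."""
    ideal = dcg_at_budget(ideal_ranked, core, budget)
    return dcg_at_budget(ranked, core, budget) / ideal if ideal > 0 else 0.0
```

Because the prefix is cut at the first region that would exceed the budget, a very long region ranked early can hide everything after it, which is one reason ranking and region size both matter under this metric.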

First useful hit (FUH): $1 - i^\star / |P|$ where $i^\star$ is the smallest rank whose visible lines intersect $Y$ (0 if none). Higher FUH means earlier surfacing of evidence.
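As a small illustration, FUH can be computed as below. The sketch assumes ranks start at 0, which makes a hit at the first position score 1.0 (consistent with the Oracle row in Table 6), and it returns 0 when no region touches the core context. The function name is hypothetical.

```python
from typing import List, Set, Tuple

Region = Tuple[str, int, int]  # (path, start, end), inclusive line numbers


def first_useful_hit(ranked: List[Region], core: List[Region]) -> float:
    """1 - i*/|P| for the first region (0-indexed rank i*) that touches Y."""
    core_lines: Set[Tuple[str, int]] = {
        (p, n) for p, s, e in core for n in range(s, e + 1)
    }
    for i, (path, start, end) in enumerate(ranked):
        if any((path, n) in core_lines for n in range(start, end + 1)):
            return 1.0 - i / len(ranked)
    return 0.0  # no predicted region intersects the core context
```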

Efficiency and noise:

  • Context Efficiency: fraction of predicted visible lines inside $L(R_{\text{core}}) \cup L(R^{(m)}_{\text{opt}})$.
  • Noise Rate (region-level): fraction of predicted regions overlapping neither $R_{\text{core}}$ nor $R^{(m)}_{\text{opt}}$.
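Both efficiency metrics reduce to simple set operations. The following sketch assumes inclusive line ranges and returns 0 for an empty prediction; the function names are illustrative, not the benchmark's API.

```python
from typing import List, Set, Tuple

Region = Tuple[str, int, int]  # (path, start, end), inclusive line numbers


def covered_lines(regions: List[Region]) -> Set[Tuple[str, int]]:
    return {(p, n) for p, s, e in regions for n in range(s, e + 1)}


def overlaps(a: Region, b: Region) -> bool:
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def context_efficiency(pred: List[Region], core: List[Region],
                       optional: List[Region]) -> float:
    """Share of predicted lines that fall inside core or optional context."""
    visible = covered_lines(pred)
    useful = covered_lines(core) | covered_lines(optional)
    return len(visible & useful) / len(visible) if visible else 0.0


def noise_rate(pred: List[Region], core: List[Region],
               optional: List[Region]) -> float:
    """Share of predicted regions that overlap neither core nor optional."""
    reference = list(core) + list(optional)
    noisy = sum(1 for r in pred if not any(overlaps(r, g) for g in reference))
    return noisy / len(pred) if pred else 0.0
```

Note that Context Efficiency is counted per line while Noise Rate is counted per region, so a single large off-target region lowers the former far more than the latter.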

Validation by downstream repair. A one-time restricted-context protocol hides everything outside the explorer’s selected regions, feeds only that context to a fixed coding agent (Mini-SWE-Agent backed by GPT-5.4 and Gemini-3-Pro), and measures resolve rate on the original SWE-bench harness. This checks whether exploration metrics track actual repair success.
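The masking step of such a protocol might look like the sketch below. How the benchmark actually presents the restricted context to the agent is not described here, so the line-number prefixes, the `[hidden]` placeholder, and the function name `restrict_context` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Region = Tuple[str, int, int]  # (path, start, end), inclusive line numbers


def restrict_context(files: Dict[str, str], regions: List[Region]) -> Dict[str, str]:
    """Return a view of the repository showing only lines inside the regions.

    Files with no selected region are dropped entirely; each run of hidden
    lines inside a selected file collapses to one "[hidden]" marker.
    """
    keep: Dict[str, Set[int]] = defaultdict(set)
    for path, start, end in regions:
        keep[path].update(range(start, end + 1))

    view: Dict[str, str] = {}
    for path, wanted in keep.items():
        if path not in files:
            continue  # region points at a file that does not exist
        shown: List[str] = []
        in_hidden_run = False
        for number, text in enumerate(files[path].splitlines(), start=1):
            if number in wanted:
                shown.append(f"{number}: {text}")
                in_hidden_run = False
            elif not in_hidden_run:
                shown.append("[hidden]")
                in_hidden_run = True
        view[path] = "\n".join(shown)
    return view
```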

Empirical Validation / Results

Setup

Evaluated explorers span four families: baselines (Oracle, Random), sparse retrievers (BM25, TF–IDF), a dense retriever (Potion/RAG pipeline), and agentic explorers (general-purpose: Claude Code, Codex, OpenHands, Mini-SWE-Agent, AweAgent; specialized localizers: AutoCodeRover, LocAgent, OrcaLoca, CoSIL). Every explorer returns $K=5$ regions (aligning with the average of 4.7 core regions per instance).

Downstream Validation (Table 3 & 4)

Table 3: Downstream resolve rate under restricted-context validation (GPT-5.4 with Mini-SWE-Agent, K=5).

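| Explorer | Resolve Rate (%) |
|---|---|
| Oracle | 59.7 |
| Random | 4.7 |
| TF-IDF | 26.0 |
| RAG | 23.3 |
| BM25 | 12.7 |
| CoSIL | 59.3 |
| Mini-SWE-Agent | 50.0 |
| OpenHands | 47.7 |
| OrcaLoca | 45.3 |
| AutoCodeRover | 44.7 |
| LocAgent | 44.7 |
| AweAgent | 41.3 |
| Codex | 50.3 |
| Claude Code | 48.0 |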

Table 4: Explorer-level Pearson/Spearman correlation between upstream metrics and downstream resolve rate (↓ marks lower-is-better).

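| Metric | Pearson $r$ | Spearman $\rho$ |
|---|---|---|
| CtxEff | +0.950 | +0.739 |
| FUH | +0.928 | +0.675 |
| Rec@100 | +0.926 | +0.845 |
| HitFile | +0.925 | +0.695 |
| nDCG@500 | +0.921 | +0.460 |
| nDCG@300 | +0.920 | +0.458 |
| nDCG@100 | +0.917 | +0.480 |
| HitReg | +0.901 | +0.695 |
| Prec | +0.890 | +0.671 |
| NoiseReg ↓ | -0.812 | -0.562 |
| NoiseFile ↓ | -0.808 | -0.590 |
| Rec@300 | +0.769 | +0.796 |
| Rec@500 | +0.710 | +0.796 |
| F1 | +0.673 | +0.810 |
| Rec$_\ell$ | +0.617 | +0.796 |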

Findings. Context Efficiency has the strongest Pearson correlation ($r=0.950$); Rec@100 is the strongest rank-correlated signal ($\rho=0.845$). Noise metrics show expected negative correlation. The results justify reporting a mixed metric set.
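For readers who want to run this kind of explorer-level analysis on their own results, the two coefficients can be computed with plain Python as sketched below. Spearman's $\rho$ is taken as Pearson's $r$ on average ranks (ties share the mean rank). The numbers in the usage line are made-up toy values, not benchmark data.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy usage: one upstream metric value and one resolve rate per explorer.
toy_metric = [0.10, 0.50, 0.70, 0.90]
toy_resolve = [5.0, 40.0, 45.0, 60.0]
r, rho = pearson(toy_metric, toy_resolve), spearman(toy_metric, toy_resolve)
```

Pearson rewards a linear relationship in the raw values, while Spearman only looks at ordering, which is why a metric can rank high on one and lower on the other, as in Table 4.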

Exploration Quality (Table 5 & 6)

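Table 5: Exploration quality at $K=5$ across different LLMs powering the same Mini-SWE-Agent scaffold (bold best, underline second best).

| Model | HitReg | Prec | Rec$_\ell$ | F1 | HitFile | nDCG@500 | Rec@500 | FUH | CtxEff | NoiseReg ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.4 | 0.516 | 0.542 | 0.154 | 0.194 | 0.655 | 0.905 | 0.154 | 0.927 | 0.771 | 0.258 |
| GPT-5.4-mini | 0.531 | 0.509 | 0.185 | 0.215 | 0.649 | 0.924 | 0.183 | 0.956 | 0.754 | 0.265 |
| Kimi-K2.6 | 0.413 | 0.475 | 0.117 | 0.149 | 0.509 | 0.739 | 0.115 | 0.759 | 0.676 | 0.316 |
| Sonnet-4.5 | 0.428 | 0.519 | 0.118 | 0.154 | 0.535 | 0.779 | 0.116 | 0.802 | 0.715 | 0.279 |
| GLM-4.7 | 0.289 | 0.414 | 0.122 | 0.148 | 0.343 | 0.557 | 0.105 | 0.572 | 0.536 | 0.465 |
| Gemini-3-Pro | 0.268 | 0.420 | 0.052 | 0.079 | 0.369 | 0.605 | 0.052 | 0.620 | 0.540 | 0.467 |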

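Table 6: Exploration quality at $K=5$ across all explorer families (bold best non-oracle, underline second best). All agentic explorers driven by GPT-5.4.

| Explorer | HitReg | Prec | Rec$_\ell$ | F1 | HitFile | nDCG@500 | Rec@500 | FUH | CtxEff | NoiseReg ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Oracle | 0.915 | 1.000 | 0.953 | 0.964 | 0.923 | 0.858 | 0.576 | 1.000 | 1.000 | 0.000 |
| Random | 0.003 | 0.002 | 0.004 | 0.002 | 0.004 | 0.004 | 0.001 | 0.006 | 0.002 | 0.997 |
| BM25 | 0.065 | 0.055 | 0.021 | 0.024 | 0.079 | 0.132 | 0.021 | 0.141 | 0.087 | 0.910 |
| TF-IDF | 0.121 | 0.117 | 0.049 | 0.054 | 0.140 | 0.223 | 0.049 | 0.240 | 0.190 | 0.821 |
| Potion | 0.069 | 0.055 | 0.025 | 0.026 | 0.088 | 0.136 | 0.025 | 0.146 | 0.100 | 0.897 |
| OpenHands | 0.514 | 0.489 | 0.179 | 0.209 | 0.645 | 0.867 | 0.177 | 0.895 | 0.737 | 0.245 |
| Mini-SWE-Agent | 0.505 | 0.530 | 0.151 | 0.190 | 0.640 | 0.885 | 0.151 | 0.907 | 0.754 | 0.253 |
| AweAgent | 0.534 | 0.577 | 0.140 | 0.182 | 0.682 | 0.954 | 0.140 | 0.975 | 0.829 | 0.191 |
| AutoCodeRover | 0.272 | 0.680 | 0.233 | 0.291 | 0.280 | 0.720 | 0.165 | 0.730 | 0.738 | 0.034 |
| LocAgent | 0.472 | 0.642 | 0.191 | 0.241 | 0.540 | 0.950 | 0.173 | 0.977 | 0.799 | 0.195 |
| OrcaLoca | 0.126 | 0.295 | 0.033 | 0.049 | 0.129 | 0.311 | 0.030 | 0.313 | 0.317 | 0.003 |
| CoSIL | 0.544 | 0.581 | 0.788 | 0.602 | 0.544 | 0.824 | 0.412 | 0.920 | 0.898 | 0.471 |
| Claude Code | 0.531 | 0.598 | 0.154 | 0.202 | 0.667 | 0.938 | 0.154 | 0.963 | 0.829 | 0.186 |
| Codex | 0.516 | 0.523 | 0.194 | 0.223 | 0.649 | 0.901 | 0.190 | 0.936 | 0.762 | 0.249 |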

Key findings
