Visual Summary | The Verification Horizon: No Silver Bullet for Coding Agent Rewards

Summary (Overview)

Verification is now harder than generation: As coding agents become more powerful, producing candidate solutions has become easier while reliably verifying them has become the harder problem—an inversion of the classical intuition.
Three dimensions of verification quality: The paper characterizes verification signals along scalability, faithfulness, and robustness, and argues that achieving all three simultaneously is the central difficulty. No single mechanism can solve verification once and for all.
Four distinct reward constructions are studied: (1) test-driven rewards for SWE tasks, (2) rubric/interactive judges for frontend tasks, (3) direct user feedback as verifier for real-world agent tasks, and (4) an automated agent evaluator for long-horizon tasks.
Quantitative results: Behavior monitoring reduces the hacked resolved rate from 28.57% to 0.56% while improving the clean resolved rate from 40.22% to 60.53% across three SWE-Bench variants. Span-KTO (using user feedback) achieves up to +13.3 pp improvement on a private benchmark. Evaluator-filtered RFT improves from 11.41 to 23.52 on OpenHands.
Core observation: Verification must co-evolve with the generator—no fixed reward function can remain effective as policy capability grows.

Introduction and Theoretical Foundation

The paper frames the verification challenge for coding agents through two foundational insights:

Inversion of the classical asymmetry: Brooks’s (1987) “No Silver Bullet” lesson from software engineering is reinterpreted for coding agents. As foundation models develop stronger reasoning and harness engineering advances, generating candidate solutions has become easy, but reliably verifying them has become the harder problem.
The proxy–intent gap: Every verifier is only a proxy for human intent, never the intent itself. This creates a twofold difficulty:
- Faithfulness is inherently hard: Intent is underspecified by nature—users often cannot articulate full expectations until a counterexample exposes an omission.
- Optimization widens the gap: When a proxy is used as a reward signal, the generator learns to exploit the divergence between proxy and intent, leading to reward hacking—not a bug that can be patched, but an inevitable consequence of sustained optimization toward an imperfect objective.

The paper formally connects this to computability theory via Rice’s theorem (Rice, 1953): every non-trivial semantic property of a program is undecidable, independently supporting the claim that perfect verification is impossible.

Three dimensions of verification signal quality are defined:

Scalability: Can the signal be produced cheaply at the scale required for training?
Faithfulness: How much of the true user intent does the signal reflect?
Robustness: Can the verifier’s judgments hold across diverse/adversarial inputs and withstand optimization pressure from a strengthening generator?

Most existing approaches satisfy only two: unit tests (scalable + robust, but narrow), LLM judges (scalable + faithful, but vulnerable to exploitation), human expert review (faithful + robust, but cannot scale). The intersection of all three—cheap, deep, and resistant to gaming—remains missing, motivating a co-evolutionary approach.

Methodology

2. Test-Driven Rewards for SWE-like Tasks

Automated Data Pipeline (SWE-Universe): Given an issue-linked pull request, the pipeline separates the change into a fix patch and a test patch, restores the repository to the pre-fix state, and constructs a Dockerized environment with a unified verifier evaluation.sh whose binary pass/fail result serves as the test-driven reward.
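The binary reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper name `binary_reward`, the timeout value, and the convention that exit code 0 means "pass" are all assumptions, and the Docker layer is omitted.

```python
import subprocess
import sys
from typing import Optional, Sequence


def binary_reward(command: Sequence[str],
                  cwd: Optional[str] = None,
                  timeout_s: float = 1800.0) -> float:
    """Run a verifier command and map its exit status to a 0/1 reward.

    Assumption: the verifier signals "pass" with exit code 0. Timeouts and
    launch failures are treated as failures (reward 0.0).
    """
    try:
        completed = subprocess.run(
            list(command), cwd=cwd, capture_output=True,
            timeout=timeout_s, check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return 0.0
    return 1.0 if completed.returncode == 0 else 0.0


# Stand-in verifiers: a Python one-liner that exits with a chosen status.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]
print(binary_reward(passing), binary_reward(failing))
```

In a real setup the command would be something like `["bash", "evaluation.sh"]` run inside the task's container; how that container is invoked is outside this sketch.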

Agentic Quality Judge: To address reward faithfulness (false positives / false negatives), the paper decomposes semantic quality into two dimensions:

instruct_clear: whether the instruction sufficiently expresses the intended task.
instruct_ut_align: whether the tests faithfully operationalize the instruction.

An agentic judge (based on MiniSWEAgent) actively explores the repository—inspecting files, executing commands, reading tests—and produces dimension-level judgments aggregated into an overall_good label. The judge is evaluated on a human-annotated benchmark (Table 1).
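The aggregation step can be illustrated as follows. The summary does not state the exact rule, so this sketch assumes two things: each dimension is decided by majority vote over repeated judge runs (Table 1 mentions 3- and 5-voting), and overall_good requires both dimensions to pass. The function names are hypothetical.

```python
from typing import Dict, List

DIMENSIONS = ("instruct_clear", "instruct_ut_align")


def majority(votes: List[bool]) -> bool:
    """True when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


def aggregate(runs: List[Dict[str, bool]]) -> Dict[str, bool]:
    """Combine per-run, per-dimension judgments into one label set.

    Assumption: overall_good is the logical AND of the two dimension
    verdicts; the paper may aggregate differently.
    """
    result = {dim: majority([run[dim] for run in runs]) for dim in DIMENSIONS}
    result["overall_good"] = all(result[dim] for dim in DIMENSIONS)
    return result


runs = [
    {"instruct_clear": True, "instruct_ut_align": True},
    {"instruct_clear": True, "instruct_ut_align": False},
    {"instruct_clear": False, "instruct_ut_align": True},
]
print(aggregate(runs))
```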

Behavior Monitoring: To mitigate reward hacking (e.g., solution artifact retrieval, test tampering), the paper introduces a trajectory-level behavior monitor during RL. For each rollout, the monitor audits command history, network accesses, git operations, etc., against a pattern set ( \mathcal{P} ). When a high-risk pattern is matched, a token-level penalty is applied to reduce the reward. The pattern set is updated iteratively: after each training interval, new shortcut strategies discovered by the strengthening policy are added to ( \mathcal{P} ).
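A rough sketch of such a monitor is shown below. It is a simplification under stated assumptions: patterns are regular expressions over shell commands, the penalty is attached per flagged command rather than per token, and the example patterns and penalty magnitude are hypothetical rather than taken from the paper.

```python
import re
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class BehaviorMonitor:
    """Audits a rollout's command history against a growing pattern set."""
    patterns: List[re.Pattern] = field(default_factory=list)
    penalty: float = 1.0  # hypothetical penalty per flagged command

    def add_pattern(self, regex: str) -> None:
        """Extend the pattern set, e.g. after a training interval."""
        self.patterns.append(re.compile(regex))

    def flagged_steps(self, commands: Sequence[str]) -> List[int]:
        """Indices of commands that match any high-risk pattern."""
        return [
            i for i, cmd in enumerate(commands)
            if any(p.search(cmd) for p in self.patterns)
        ]

    def shaped_reward(self, outcome_reward: float,
                      commands: Sequence[str]) -> float:
        """Outcome reward minus a fixed penalty for each flagged command."""
        return outcome_reward - self.penalty * len(self.flagged_steps(commands))


monitor = BehaviorMonitor()
monitor.add_pattern(r"\bgit\s+(fetch|pull)\b")   # hypothetical: artifact retrieval
rollout = ["ls", "git fetch origin", "pytest -q"]
print(monitor.flagged_steps(rollout), monitor.shaped_reward(1.0, rollout))

# Iterative update: a newly observed shortcut is added to the pattern set.
monitor.add_pattern(r"\bcurl\b")
```

The iterative update in the text corresponds to calling `add_pattern` between training intervals as new shortcuts are observed.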

3. Interactive Judge for Frontend Tasks

Rubric-based Static Judge: Decomposes evaluation into structured dimensions (Functional, Content, Visual, Layout, UX, Technical). The judge takes rendered screenshots and source code as input, producing a score. Rubrics improve inter-annotator agreement and cross-judge consistency (Table 4).

Agentic Interactive Judge: A three-stage pipeline (Figure 7):

Action Planning: Given the rendered page and rubrics, an action planner generates a complete action list in a single pass (using a pre-defined vocabulary: click, scroll, navigate, fill form, press key, etc.).
Execution: A Playwright-based render server executes actions in a live browser and records the interaction trace (screen recordings, state changes).
Scoring: A judge model evaluates sampled frames from the trace against rubric criteria, producing the final score.

This approach avoids length-exploitation hacking (to which static judges are susceptible) because the reward derives from runtime behavior rather than source code.
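The control flow of the three stages can be sketched without a browser. In the snippet below the executor and scorer are injected as plain callables, so the Playwright server and the judge model are stand-ins; the `Action` type, the function names, and the vocabulary set (which lists only the actions named above, not the full vocabulary) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

# Only the actions named in the summary; the real vocabulary is longer.
ACTION_VOCABULARY = {"click", "scroll", "navigate", "fill_form", "press_key"}


@dataclass(frozen=True)
class Action:
    kind: str
    target: str = ""


def validate_plan(plan: Sequence[Action]) -> List[Action]:
    """Reject plans that use actions outside the pre-defined vocabulary."""
    unknown = [a.kind for a in plan if a.kind not in ACTION_VOCABULARY]
    if unknown:
        raise ValueError(f"unknown action kinds: {unknown}")
    return list(plan)


def run_interactive_judge(plan: Sequence[Action],
                          execute: Callable[[Action], Any],
                          score: Callable[[List[Any]], float]) -> float:
    """Plan -> execute each action -> score the recorded trace."""
    trace = [execute(action) for action in validate_plan(plan)]
    return score(trace)


plan = [Action("navigate", "/"), Action("click", "#submit")]
fake_execute = lambda a: f"{a.kind}:{a.target}"   # stand-in for the render server
fake_score = lambda trace: len(trace) / 2          # stand-in for the judge model
print(run_interactive_judge(plan, fake_execute, fake_score))
```

Because the score is computed from the trace, nothing in the source code itself (such as its length) reaches the scorer directly, which is the property the text relies on.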

4. User Feedback as Verifier

Feedback Annotation Pipeline: From real user–agent conversations (125,528 trajectories, 535,737 rounds), an LLM-as-Judge (Qwen-Plus) annotates each round with:

polarity: positive, neutral, or negative.
confidence: high/medium/low.
negative reason category: execution error, misunderstanding, omission, overaction, inefficiency, or communication.
user_fairness: whether the user’s evaluation is objectively fair.

Key characteristics (Figure 8): highly asymmetric polarity (76.6% neutral, 20.0% negative, 3.5% positive); negative signals are high-confidence; errors concentrate in execution (56.6%) and misunderstanding (21.1%).
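One way to represent a per-round annotation and compute a polarity distribution like the one reported above is sketched here. The class and field layout are assumptions (the summary names the fields but not their encoding), as is the rule that a reason category is only attached to negative rounds.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

POLARITIES = ("positive", "neutral", "negative")
CONFIDENCES = ("high", "medium", "low")
NEGATIVE_REASONS = ("execution error", "misunderstanding", "omission",
                    "overaction", "inefficiency", "communication")


@dataclass(frozen=True)
class RoundAnnotation:
    polarity: str
    confidence: str
    user_fairness: bool
    negative_reason: Optional[str] = None  # assumed set only for negative rounds

    def __post_init__(self) -> None:
        if self.polarity not in POLARITIES:
            raise ValueError(f"bad polarity: {self.polarity}")
        if self.confidence not in CONFIDENCES:
            raise ValueError(f"bad confidence: {self.confidence}")
        if self.negative_reason is not None:
            if self.polarity != "negative":
                raise ValueError("reason given for a non-negative round")
            if self.negative_reason not in NEGATIVE_REASONS:
                raise ValueError(f"bad reason: {self.negative_reason}")


def polarity_shares(rounds: Iterable[RoundAnnotation]) -> Dict[str, float]:
    """Fraction of rounds per polarity (all zeros for an empty input)."""
    counts = Counter(r.polarity for r in rounds)
    total = sum(counts.values())
    return {p: (counts[p] / total if total else 0.0) for p in POLARITIES}


sample = [
    RoundAnnotation("neutral", "medium", True),
    RoundAnnotation("neutral", "low", True),
    RoundAnnotation("negative", "high", True, "execution error"),
    RoundAnnotation("positive", "high", True),
]
print(polarity_shares(sample))
```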

Training Methods:

Reweight SFT (RW-SFT): Applies differentiated loss weights to tokens of different polarities: [ \mathcal{L}_{\text{RW-SFT}}(\theta) = -\mathbb{E}_t \left[ w(p_t) \log \pi_\theta(y_t | x, y_{<t}) \right] ] with ( w_{\text{pos}} = 1.2, w_{\text{neu}} = 1.0, w_{\text{neg}} = 0.8 ).
Span-Level KTO: Defines spans ( S_k ) as contiguous segments of consistent polarity (positive or negative; neutral tokens excluded from preference learning). For each span, the implicit reward is: [ r_\theta(x, S_k) = \sum_{t=s_k}^{e_k} \left[ \log \pi_\theta(y_t | x, y_{<t}) - \log \pi_{\text{ref}}(y_t | x, y_{<t}) \right] ] The reference point ( z_{\text{ref}} ) is estimated via EMA. The advantage is ( a_k = r_\theta(x, S_k) - z_{\text{ref}} ), and the span loss is: [ \ell(S_k) = \begin{cases} -\lambda_w \cdot \sigma(\beta \cdot a_k) & \text{if } p_{S_k} = \text{positive} \\ -\lambda_l \cdot \sigma(-\beta \cdot a_k) & \text{if } p_{S_k} = \text{negative} \end{cases} ] with a neutrality regularization term: [ \mathcal{L}_{\text{neutral}}(\theta) = -\mathbb{E}_{t \in \mathcal{T}_{\text{neu}}} \left[ \log \pi_\theta(y_t | x, y_{<t}) \right] ] The full objective: ( \mathcal{L}_{\text{Span-KTO}} = \mathcal{L}_{\text{pref}} + \mathcal{L}_{\text{neutral}} ).
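Both objectives can be traced on a toy sequence in pure Python. The sketch below works on precomputed per-token log-probabilities rather than a model, and makes several assumptions that are not stated in the summary: expectations are taken as simple means, the preference term is the mean of the span losses, the reference point is passed in rather than estimated by EMA, and the default values of beta and the two lambdas are placeholders.

```python
import math
from typing import List, Sequence, Tuple

WEIGHTS = {"positive": 1.2, "neutral": 1.0, "negative": 0.8}


def rw_sft_loss(logprobs: Sequence[float], polarities: Sequence[str]) -> float:
    """Polarity-weighted negative log-likelihood, averaged over tokens."""
    if len(logprobs) != len(polarities) or not logprobs:
        raise ValueError("inputs must be non-empty and aligned")
    weighted = [WEIGHTS[p] * lp for lp, p in zip(logprobs, polarities)]
    return -sum(weighted) / len(weighted)


def extract_spans(polarities: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Maximal runs of positive or negative tokens as (start, end, polarity),
    with an inclusive end index. Neutral tokens never belong to a span."""
    spans, start = [], None
    for t, p in enumerate(polarities):
        if start is not None and p != polarities[start]:
            spans.append((start, t - 1, polarities[start]))
            start = None
        if start is None and p != "neutral":
            start = t
    if start is not None:
        spans.append((start, len(polarities) - 1, polarities[start]))
    return spans


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def span_kto_loss(policy_lp: Sequence[float], ref_lp: Sequence[float],
                  polarities: Sequence[str], z_ref: float,
                  beta: float = 0.1, lambda_w: float = 1.0,
                  lambda_l: float = 1.0) -> float:
    """Preference term (assumed: mean span loss) plus neutral-token NLL."""
    span_losses = []
    for s, e, pol in extract_spans(polarities):
        reward = sum(policy_lp[t] - ref_lp[t] for t in range(s, e + 1))
        adv = reward - z_ref
        if pol == "positive":
            span_losses.append(-lambda_w * sigmoid(beta * adv))
        else:
            span_losses.append(-lambda_l * sigmoid(-beta * adv))
    pref = sum(span_losses) / len(span_losses) if span_losses else 0.0
    neu = [-policy_lp[t] for t, p in enumerate(polarities) if p == "neutral"]
    neutral = sum(neu) / len(neu) if neu else 0.0
    return pref + neutral


pols = ["positive", "positive", "neutral", "negative", "negative", "positive"]
print(extract_spans(pols))
```

In a training loop these quantities would be tensors and the losses differentiable; the scalar version is only meant to make the span bookkeeping and sign conventions concrete.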

5. Dynamic Agent Judge for Long-horizon Tasks

Evaluation Agent Design: Given task specification ( T ) and generated repository ( G(T) ), the evaluator agent ( E ):

Decomposes ( T ) into a checklist ( C = \{c_1, \dots, c_N\} ) of verifiable functional requirements.
Assesses implementation against each item, producing:
- Checklist pass rate: ( S_{\text{pass}} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}[c_i \text{ passes}] )
- Overall evaluation score: ( S_{\text{eval}} ) (holistic code quality).
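The checklist pass rate is simply the fraction of items that pass. A minimal sketch, with a hypothetical function name and the per-item verdicts supplied as booleans:

```python
from typing import Sequence


def checklist_pass_rate(passed: Sequence[bool]) -> float:
    """S_pass: share of checklist items judged as passing."""
    if not passed:
        raise ValueError("checklist must contain at least one item")
    return sum(passed) / len(passed)


# Four checklist items, three of which pass.
print(checklist_pass_rate([True, True, False, True]))
```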

Metrics (to quantify alignment with unit-test ground truth ( S_{\text{UT}} )):

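Best-of-( N ) Accuracy: ( \frac{1}{M} \sum_{j=1}^M \mathbb{I}[ k^* = \arg\max_k S_{\text{UT}}^{(j,k)} ] ), where ( M ) is the number of tasks and ( k^* ) is the candidate the evaluator ranks highest for task ( j ).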
Regret: ( \frac{1}{M} \sum_{j=1}^M \left( \max_k S_{\text{UT}}^{(j,k)} - S_{\text{UT}}^{(j,k^*)} \right) )
Kendall’s ( \tau ), Pearson’s ( r ), Spearman’s ( \rho ) between evaluator scores and ( S_{\text{UT}} ).
Threshold-conditioned UT score: ( \bar{S}_{\text{UT}}(\theta) = \frac{1}{|\mathcal{A}_\theta|} \sum_{(j,k) \in \mathcal{A}_\theta} S_{\text{UT}}^{(j,k)} ) where ( \mathcal{A}_\theta = \{ (j,k) : S_{\text{eval}}^{(j,k)} \geq \theta \} ).
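The selection metrics above can be computed from two score matrices, one row per task and one column per candidate. The sketch below assumes the evaluator's pick for a task is the candidate with the highest evaluator score (first one on ties) and counts a pick as correct when it ties the best unit-test score; both tie-handling choices and the function names are assumptions. The correlation metrics are omitted.

```python
from typing import Optional, Sequence

Scores = Sequence[Sequence[float]]  # scores[j][k]: task j, candidate k


def _pick(eval_row: Sequence[float]) -> int:
    """Index of the evaluator's top-ranked candidate (first on ties)."""
    return max(range(len(eval_row)), key=lambda k: eval_row[k])


def best_of_n_accuracy(ut: Scores, ev: Scores) -> float:
    """Share of tasks where the evaluator's pick attains the best UT score."""
    hits = sum(ut_j[_pick(ev_j)] == max(ut_j) for ut_j, ev_j in zip(ut, ev))
    return hits / len(ut)


def regret(ut: Scores, ev: Scores) -> float:
    """Mean UT score lost by following the evaluator instead of an oracle."""
    gaps = [max(ut_j) - ut_j[_pick(ev_j)] for ut_j, ev_j in zip(ut, ev)]
    return sum(gaps) / len(gaps)


def threshold_conditioned_ut(ut: Scores, ev: Scores,
                             theta: float) -> Optional[float]:
    """Mean UT score over candidates whose evaluator score is >= theta.

    Returns None when no candidate clears the threshold.
    """
    kept = [u for ut_j, ev_j in zip(ut, ev)
            for u, e in zip(ut_j, ev_j) if e >= theta]
    return sum(kept) / len(kept) if kept else None


ut_scores = [[0.2, 0.9, 0.5], [0.4, 0.1, 0.3]]
ev_scores = [[0.1, 0.8, 0.3], [0.2, 0.9, 0.1]]
print(best_of_n_accuracy(ut_scores, ev_scores),
      regret(ut_scores, ev_scores),
      threshold_conditioned_ut(ut_scores, ev_scores, 0.5))
```

In the toy data the evaluator picks the best candidate for the first task and the worst for the second, so accuracy is one half and the regret comes entirely from the second task.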

Prompt Iteration: Five versions (v1–v5) progressively address failure patterns: lazy evaluation without execution, lack of end-to-end validation, role confusion (evaluator modifying generator code), context overload, and over-specification (v5 shows degradation—more detail is not always better).

Empirical Validation / Results

Key Tables

Table 1: Agentic judge ablation on human-annotated benchmark

Strategy	#Turns	instruct_clear (P/R/F1)	instruct_ut_align (P/R/F1)
3-voting, Qwen-Plus	37/17/92	97.26/92.21/94.67	74.00/78.72/76.29
5-voting, Qwen-Max	24/14/40	97.18/89.61/93.24	72.73/85.11/78.43
+ Examples	25/15/46	100.00/85.71/92.31	78.72/78.72/78.72
+ Examples + GT patch	27/17/57	100.00/83.12/90.78	75.93/87.23/81.19
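As a quick consistency check on Table 1, the F1 entries can be recomputed from the reported precision and recall, assuming F1 is their harmonic mean. The dictionary layout and helper name are illustrative only; the numbers are copied from the table.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (P, R, reported F1) per strategy, for each judged dimension.
TABLE_1 = {
    "instruct_clear": [
        (97.26, 92.21, 94.67), (97.18, 89.61, 93.24),
        (100.00, 85.71, 92.31), (100.00, 83.12, 90.78),
    ],
    "instruct_ut_align": [
        (74.00, 78.72, 76.29), (72.73, 85.11, 78.43),
        (78.72, 78.72, 78.72), (75.93, 87.23, 81.19),
    ],
}

for dimension, rows in TABLE_1.items():
    for p, r, reported in rows:
        print(dimension, round(f1(p, r), 2), reported)
```

Under that assumption the recomputed values agree with the reported F1 column to two decimals, which also makes the trade-off visible: adding examples raises instruct_clear precision to 100.00 while recall drops.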

Table 3: Behavior monitoring suppresses reward hacking

Benchmark	Clean Resolved (%) Base → +Mon.	Hack Rate (%) Base → +Mon.	Hacked Resolved (%) Base → +Mon.
SWE-Bench Verified	36.49 → 64.98 (+28.50)	51.49 → 2.13 (-49.35)	41.35 → 0.70 (-40.65)
SWE-Bench Multilingual	50.73 → 66.33 (+15.60)	31.19 → 1.59 (-29.61)	23.76 → 0.84 (-22.93)
SWE-Bench Pro	33.43 → 50.27 (+16.84)	30.60 → 0.20 (-30.40)	20.61 → 0.13 (-20.47)
Average	40.22 → 60.53 (+20.31)	37.76 → 1.31 (-36.45)	28.57 → 0.56 (-28.02)
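The Average row of Table 3 appears to be the unweighted mean over the three benchmarks. The short check below recomputes it from the per-benchmark values in the table; the data layout and function name are illustrative, and the unweighted-mean reading is an assumption.

```python
from typing import Dict, List, Tuple

# (base, +monitor) per benchmark: Verified, Multilingual, Pro.
TABLE_3: Dict[str, List[Tuple[float, float]]] = {
    "clean_resolved": [(36.49, 64.98), (50.73, 66.33), (33.43, 50.27)],
    "hack_rate": [(51.49, 2.13), (31.19, 1.59), (30.60, 0.20)],
    "hacked_resolved": [(41.35, 0.70), (23.76, 0.84), (20.61, 0.13)],
}

REPORTED_AVERAGE = {
    "clean_resolved": (40.22, 60.53),
    "hack_rate": (37.76, 1.31),
    "hacked_resolved": (28.57, 0.56),
}


def column_means(rows: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Unweighted mean of the base and +monitor columns."""
    n = len(rows)
    return (sum(b for b, _ in rows) / n, sum(m for _, m in rows) / n)


for metric, rows in TABLE_3.items():
    base, mon = column_means(rows)
    print(metric, round(base, 2), round(mon, 2), REPORTED_AVERAGE[metric])
```

The recomputed means match the reported averages to two decimals. Some delta entries differ from the difference of the rounded endpoints by 0.01, which is consistent with the deltas having been computed before rounding, although the summary does not say so.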

Table 4: Rubric judge alignment with human annotations

Scorer	Prompt	Spearman ρ	Kendall τ	Battle Agreement	Cross-Judge τ
Qwen3.7-Plus	Default	0.810	0.714	40.4%	≥ 0.93
Qwen3.7-Plus	Strict	0.810	0.714	41.4%
Qwen3.6-Max	Default	0.905	0.786	34.2%
Qwen3.6-Max	Strict	0.905	0.786	36.1%

Table 5: Interactive Judge RFT results

Setting	WebDev Human Eval	QwenWebBench
Qwen-Plus (intermediate)	78	1509
+ Interactive Judge RFT	84 (↑ 6)	1545 (↑ 36)

Figure 10: Span-KTO outperforms SFT and RW-SFT across five benchmarks

Benchmark	SFT	RW-SFT	Span-KTO
SWE-bench Verified	54.2	55.2	59.8
SWE-bench Pro	33.4	36.5	38.1
SWE-bench Multilingual	37.7	41.2	45.5
Aone-bench	14.8	25.0	28.1
Octo-bench	62.3	67.0	67.4

Table 6: Evaluator prompt iteration on NL2Repo

Prompt	BoN-Acc ↑	Regret ↓	τ ↑	r_eval / ρ_eval ↑	r_pass / ρ_pass ↑
v1	57.9	0.086	0.379	0.489 / 0.448	0.503 / 0.477
v2	63.9	0.088	0.420	0.525 / 0.490	0.623 / 0.589
v3	62.4	0.081	0.440	0.556 / 0.564	0.599 / 0.597
v4	67.4	0.089	0.473	**