Summary (Overview)

  • Verification is now harder than generation: As coding agents become more powerful, producing candidate solutions has become easier while reliably verifying them has become the harder problem, inverting the classical intuition that checking a solution is easier than producing one.
  • Three dimensions of verification quality: The paper characterizes verification signals along scalability, faithfulness, and robustness, and argues that achieving all three simultaneously is the central difficulty. No single mechanism can solve verification once and for all.
  • Four distinct reward constructions are studied: (1) test-driven rewards for SWE tasks, (2) rubric/interactive judges for frontend tasks, (3) direct user feedback as verifier for real-world agent tasks, and (4) an automated agent evaluator for long-horizon tasks.
  • Quantitative results: Averaged over three SWE-Bench variants, behavior monitoring reduces the hacked resolved rate from 28.57% to 0.56% while improving the clean resolved rate from 40.22% to 60.53%. Span-KTO (using user feedback) achieves up to +13.3 pp improvement on a private benchmark. Evaluator-filtered RFT improves from 11.41 to 23.52 on OpenHands.
  • Core observation: Verification must co-evolve with the generator—no fixed reward function can remain effective as policy capability grows.

Introduction and Theoretical Foundation

The paper frames the verification challenge for coding agents through two foundational insights:

  1. Inversion of the classical asymmetry: Brooks’s (1987) “No Silver Bullet” lesson from software engineering is reinterpreted for coding agents. As foundation models develop stronger reasoning and harness engineering advances, generating candidate solutions has become easy, but reliably verifying them has become the harder problem.

  2. The proxy–intent gap: Every verifier is only a proxy for human intent, never the intent itself. This creates a twofold difficulty:

    • Faithfulness is inherently hard: Intent is underspecified by nature—users often cannot articulate full expectations until a counterexample exposes an omission.
    • Optimization widens the gap: When a proxy is used as a reward signal, the generator learns to exploit the divergence between proxy and intent, leading to reward hacking—not a bug that can be patched, but an inevitable consequence of sustained optimization toward an imperfect objective.

The paper formally connects this to computability theory via Rice’s theorem (Rice, 1953): every non-trivial semantic property of a program is undecidable, independently supporting the claim that perfect verification is impossible.

Three dimensions of verification signal quality are defined:

  • Scalability: Can the signal be produced cheaply at the scale required for training?
  • Faithfulness: How much of the true user intent does the signal reflect?
  • Robustness: Can the verifier’s judgments hold across diverse/adversarial inputs and withstand optimization pressure from a strengthening generator?

Most existing approaches satisfy only two: unit tests (scalable + robust, but narrow), LLM judges (scalable + faithful, but vulnerable to exploitation), human expert review (faithful + robust, but cannot scale). The intersection of all three—cheap, deep, and resistant to gaming—remains missing, motivating a co-evolutionary approach.

Methodology

2. Test-Driven Rewards for SWE-like Tasks

Automated Data Pipeline (SWE-Universe): Given an issue-linked pull request, the pipeline separates the change into a fix patch and a test patch, restores the repository to the pre-fix state, and constructs a Dockerized environment with a unified verifier evaluation.sh whose binary pass/fail result serves as the test-driven reward.
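The separation step and the binary reward can be illustrated with a minimal Python sketch. The path heuristics, the function names, and the convention that exit code 0 from evaluation.sh means "pass" are assumptions made for illustration; the summary does not say how the pipeline actually decides which files belong to the test patch.

```python
from pathlib import PurePosixPath
from typing import List, Tuple

# Hypothetical directory names that mark a changed file as test code.
TEST_DIR_NAMES = {"test", "tests", "testing", "__tests__", "spec"}


def is_test_path(path: str) -> bool:
    """Heuristically decide whether a changed file belongs to the test patch."""
    p = PurePosixPath(path)
    # Any parent directory named like a test directory counts as test code.
    if any(part.lower() in TEST_DIR_NAMES for part in p.parts[:-1]):
        return True
    stem = p.stem.lower()
    return (
        stem.startswith("test_")
        or stem.endswith("_test")
        or stem.endswith(".test")
        or stem.endswith(".spec")
    )


def split_changed_files(paths: List[str]) -> Tuple[List[str], List[str]]:
    """Split a pull request's changed files into (fix files, test files)."""
    fix_files: List[str] = []
    test_files: List[str] = []
    for path in paths:
        (test_files if is_test_path(path) else fix_files).append(path)
    return fix_files, test_files


def binary_reward(exit_code: int) -> float:
    """Map the verifier's exit code to a pass/fail reward (assumes 0 = pass)."""
    return 1.0 if exit_code == 0 else 0.0
```

In a real pipeline the exit code would come from running evaluation.sh inside the task's Docker container after the agent's patch is applied.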

Agentic Quality Judge: To address reward faithfulness (false positives / false negatives), the paper decomposes semantic quality into two dimensions:

  • instruct_clear: whether the instruction sufficiently expresses the intended task.
  • instruct_ut_align: whether the tests faithfully operationalize the instruction.

An agentic judge (based on MiniSWEAgent) actively explores the repository—inspecting files, executing commands, reading tests—and produces dimension-level judgments aggregated into an overall_good label. The judge is evaluated on a human-annotated benchmark (Table 1).
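One plausible way to aggregate the dimension-level judgments is sketched below. It assumes several independent judge runs (Table 1 mentions 3-voting and 5-voting), a majority vote per dimension, and an overall_good label that requires both dimensions to pass. The aggregation rule itself is an assumption; the summary only states that the judgments are aggregated.

```python
from typing import Dict, List

DIMENSIONS = ("instruct_clear", "instruct_ut_align")


def majority(votes: List[bool]) -> bool:
    """Return True when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


def aggregate(judgments: List[Dict[str, bool]]) -> Dict[str, bool]:
    """Combine per-run judgments into per-dimension labels plus overall_good.

    Each element of `judgments` is one judge run, mapping dimension -> verdict.
    overall_good is assumed to be the conjunction of the two dimensions.
    """
    result = {dim: majority([run[dim] for run in judgments]) for dim in DIMENSIONS}
    result["overall_good"] = all(result[dim] for dim in DIMENSIONS)
    return result
```

With three runs in which instruct_ut_align fails twice, the task would be labeled overall_good = False even if the instruction is judged clear in every run.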

Behavior Monitoring: To mitigate reward hacking (e.g., solution artifact retrieval, test tampering), the paper introduces a trajectory-level behavior monitor during RL. For each rollout, the monitor audits command history, network accesses, git operations, etc., against a pattern set ( \mathcal{P} ). When a high-risk pattern is matched, a token-level penalty is applied to reduce the reward. The pattern set is updated iteratively: after each training interval, new shortcut strategies discovered by the strengthening policy are added to ( \mathcal{P} ).
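The monitoring loop can be sketched as a regex audit over a rollout's command history. This is a simplified illustration: the patterns are hypothetical, and the penalty is applied as a single scalar per trajectory rather than at the token level described above.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class BehaviorMonitor:
    """Audits a rollout's commands against a growing pattern set P."""

    patterns: List[re.Pattern] = field(default_factory=list)
    penalty: float = 1.0  # hypothetical penalty magnitude

    def add_pattern(self, regex: str) -> None:
        """Extend P, e.g. after a training interval reveals a new shortcut."""
        self.patterns.append(re.compile(regex))

    def matches(self, commands: List[str]) -> List[str]:
        """Return the commands that hit at least one high-risk pattern."""
        return [c for c in commands if any(p.search(c) for p in self.patterns)]

    def shaped_reward(self, raw_reward: float, commands: List[str]) -> float:
        """Subtract the penalty when any high-risk command was issued."""
        if self.matches(commands):
            return raw_reward - self.penalty
        return raw_reward


# Hypothetical initial pattern: pulling upstream history that may contain the fix.
monitor = BehaviorMonitor()
monitor.add_pattern(r"\bgit\s+(fetch|pull)\b")
```

After each training interval, newly observed shortcuts would be appended with add_pattern, so that a rollout which passes the tests by retrieving the upstream fix no longer receives the full reward.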

3. Interactive Judge for Frontend Tasks

Rubric-based Static Judge: Decomposes evaluation into structured dimensions (Functional, Content, Visual, Layout, UX, Technical). The judge takes rendered screenshots and source code as input, producing a score. Rubrics improve inter-annotator agreement and cross-judge consistency (Table 4).

Agentic Interactive Judge: A three-stage pipeline (Figure 7):

  1. Action Planning: Given the rendered page and rubrics, an action planner generates a complete action list in a single pass (using a pre-defined vocabulary: click, scroll, navigate, fill form, press key, etc.).
  2. Execution: A Playwright-based render server executes actions in a live browser and records the interaction trace (screen recordings, state changes).
  3. Scoring: A judge model evaluates sampled frames from the trace against rubric criteria, producing the final score.
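The execution stage can be sketched as a dispatcher from the planner's action list to browser calls. The Action type, the action names, and the one-screenshot-per-action trace are illustrative assumptions. The `page` argument is assumed to behave like a Playwright sync-API Page (click, fill, goto, screenshot, keyboard.press, mouse.wheel); the planner and the judge model are not shown.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Action:
    """One planned step; `kind` comes from a fixed, pre-defined vocabulary."""

    kind: str  # "click", "fill", "press", "scroll", or "navigate"
    target: Optional[str] = None  # CSS selector, when the action needs one
    value: Optional[str] = None  # text, key name, URL, or scroll distance


def execute_actions(page: Any, actions: List[Action]) -> List[bytes]:
    """Run the planned actions in order and capture one frame after each."""
    frames: List[bytes] = []
    for action in actions:
        if action.kind == "click":
            page.click(action.target)
        elif action.kind == "fill":
            page.fill(action.target, action.value)
        elif action.kind == "press":
            page.keyboard.press(action.value)
        elif action.kind == "scroll":
            page.mouse.wheel(0, int(action.value))  # vertical scroll in pixels
        elif action.kind == "navigate":
            page.goto(action.value)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
        frames.append(page.screenshot())  # frame later shown to the judge
    return frames
```

Because the whole action list is produced in a single pass, execution needs no model calls; the judge only sees the recorded frames afterwards.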

This approach avoids length-exploitation hacking (to which static judges are susceptible) because the reward derives from runtime behavior rather than source code.

4. User Feedback as Verifier

Feedback Annotation Pipeline: From real user–agent conversations (125,528 trajectories, 535,737 rounds), an LLM-as-Judge (Qwen-Plus) annotates each round with:

  • polarity: positive, neutral, or negative.
  • confidence: high/medium/low.
  • negative reason category (execution error, misunderstanding, omission, overaction, inefficiency, communication).
  • user_fairness: whether the user’s evaluation is objectively fair.

Key characteristics (Figure 8): highly asymmetric polarity (76.6% neutral, 20.0% negative, 3.5% positive); negative signals are high-confidence; errors concentrate in execution (56.6%) and misunderstanding (21.1%).

Training Methods:

  • Reweight SFT (RW-SFT): Applies differentiated loss weights to tokens of different polarities: [ \mathcal{L}_{\text{RW-SFT}}(\theta) = -\mathbb{E}_t \left[ w(p_t) \log \pi_\theta(y_t | x, y_{<t}) \right] ] with ( w_{\text{pos}} = 1.2, w_{\text{neu}} = 1.0, w_{\text{neg}} = 0.8 ).
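As a worked illustration of the RW-SFT objective, the sketch below computes the weighted negative log-likelihood for one response. It assumes per-token log-probabilities are already available and approximates the expectation over t by a mean over the response's tokens; the function name and the polarity labels "pos", "neu", "neg" are illustrative.

```python
from typing import Sequence

# Polarity weights as given in the text.
WEIGHTS = {"pos": 1.2, "neu": 1.0, "neg": 0.8}


def rw_sft_loss(token_logprobs: Sequence[float], polarities: Sequence[str]) -> float:
    """Weighted negative log-likelihood, averaged over tokens.

    token_logprobs[t] is log pi_theta(y_t | x, y_<t); polarities[t] is the
    polarity label inherited from the round that token belongs to.
    """
    if len(token_logprobs) != len(polarities):
        raise ValueError("one polarity label is required per token")
    if not token_logprobs:
        raise ValueError("empty sequence")
    weighted = sum(WEIGHTS[p] * lp for lp, p in zip(token_logprobs, polarities))
    return -weighted / len(token_logprobs)
```

With all weights set to 1.0 this reduces to the ordinary SFT loss, which makes the effect of the reweighting easy to isolate.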

  • Span-Level KTO: Defines spans ( S_k ) as contiguous segments of consistent polarity (positive or negative; neutral tokens excluded from preference learning). For each span, the implicit reward is: [ r_\theta(x, S_k) = \sum_{t=s_k}^{e_k} \left[ \log \pi_\theta(y_t | x, y_{<t}) - \log \pi_{\text{ref}}(y_t | x, y_{<t}) \right] ] The reference point ( z_{\text{ref}} ) is estimated via EMA. The advantage is ( a_k = r_\theta(x, S_k) - z_{\text{ref}} ), and the span loss is: [ \ell(S_k) = \begin{cases} -\lambda_w \cdot \sigma(\beta \cdot a_k) & \text{if } p_{S_k} = \text{positive} \\ -\lambda_l \cdot \sigma(-\beta \cdot a_k) & \text{if } p_{S_k} = \text{negative} \end{cases} ] with a neutrality regularization term: [ \mathcal{L}_{\text{neutral}}(\theta) = -\mathbb{E}_{t \in \mathcal{T}_{\text{neu}}} \left[ \log \pi_\theta(y_t | x, y_{<t}) \right] ] The full objective: ( \mathcal{L}_{\text{Span-KTO}} = \mathcal{L}_{\text{pref}} + \mathcal{L}_{\text{neutral}} ).
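The span construction and loss can be sketched in plain Python for a single response. Several choices here are assumptions: the preference term is taken as the mean of the span losses (the text does not specify the aggregation), the default values of beta, lam_w and lam_l are placeholders, and z_ref is passed in rather than tracked by an EMA.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def find_spans(polarities: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Return maximal runs of identical non-neutral polarity.

    Each span is (start, end_inclusive, polarity) with polarity "pos" or "neg".
    """
    spans: List[Tuple[int, int, str]] = []
    start = None
    for t, p in enumerate(polarities):
        if start is not None and p != polarities[start]:
            spans.append((start, t - 1, polarities[start]))
            start = None
        if start is None and p != "neu":
            start = t
    if start is not None:
        spans.append((start, len(polarities) - 1, polarities[start]))
    return spans


def span_kto_loss(
    policy_lp: Sequence[float],
    ref_lp: Sequence[float],
    polarities: Sequence[str],
    z_ref: float,
    beta: float = 0.1,   # placeholder value
    lam_w: float = 1.0,  # placeholder value
    lam_l: float = 1.0,  # placeholder value
) -> float:
    """Preference term over spans plus neutrality regularizer on neutral tokens."""
    spans = find_spans(polarities)
    pref = 0.0
    for s, e, pol in spans:
        # Implicit reward: summed log-ratio between policy and reference.
        reward = sum(policy_lp[t] - ref_lp[t] for t in range(s, e + 1))
        adv = reward - z_ref
        if pol == "pos":
            pref += -lam_w * sigmoid(beta * adv)
        else:
            pref += -lam_l * sigmoid(-beta * adv)
    if spans:
        pref /= len(spans)  # assumed aggregation: mean over spans
    neutral_lp = [policy_lp[t] for t, p in enumerate(polarities) if p == "neu"]
    neutral = -sum(neutral_lp) / len(neutral_lp) if neutral_lp else 0.0
    return pref + neutral
```

Raising the policy's log-probabilities on a positive span increases its advantage and lowers the loss, while the same change on a negative span raises it. The neutral term keeps ordinary likelihood training on tokens that carry no feedback.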

5. Dynamic Agent Judge for Long-horizon Tasks

Evaluation Agent Design: Given task specification ( T ) and generated repository ( G(T) ), the evaluator agent ( E ):

  • Decomposes ( T ) into a checklist ( C = \{c_1, \dots, c_N\} ) of verifiable functional requirements.
  • Assesses implementation against each item, producing:
    • Checklist pass rate: ( S_{\text{pass}} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}[c_i \text{ passes}] )
    • Overall evaluation score: ( S_{\text{eval}} ) (holistic code quality).

Metrics (to quantify alignment with unit-test ground truth ( S_{\text{UT}} )):

  • Best-of-( N ) Accuracy: ( \frac{1}{M} \sum_{j=1}^M \mathbb{I}[ k^* = \arg\max_k S_{\text{UT}}^{(j,k)} ] ), where ( k^* ) is the candidate the evaluator ranks highest for task ( j ).
  • Regret: ( \frac{1}{M} \sum_{j=1}^M \left( \max_k S_{\text{UT}}^{(j,k)} - S_{\text{UT}}^{(j,k^*)} \right) )
  • Kendall’s ( \tau ), Pearson’s ( r ), Spearman’s ( \rho ) between evaluator scores and ( S_{\text{UT}} ).
  • Threshold-conditioned UT score: ( \bar{S}_{\text{UT}}(\theta) = \frac{1}{|\mathcal{A}_\theta|} \sum_{(j,k) \in \mathcal{A}_\theta} S_{\text{UT}}^{(j,k)} ) where ( \mathcal{A}_\theta = \{ (j,k) : S_{\text{eval}}^{(j,k)} \geq \theta \} ).
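The checklist pass rate and the selection metrics above can be computed as in the following sketch. It assumes scores are stored as nested lists indexed by task j and candidate k, takes k* to be the candidate with the highest evaluator score, and counts a best-of-N hit whenever the selected candidate attains the maximum unit-test score (one way to handle ties). All function names are illustrative.

```python
from typing import List, Optional, Sequence

Scores = List[List[float]]  # scores[j][k]: task j, candidate k


def checklist_pass_rate(passed: Sequence[bool]) -> float:
    """S_pass: fraction of checklist items that pass."""
    return sum(passed) / len(passed)


def _selected(eval_row: Sequence[float]) -> int:
    """k*: index of the candidate the evaluator scores highest."""
    return max(range(len(eval_row)), key=lambda k: eval_row[k])


def bon_accuracy(s_eval: Scores, s_ut: Scores) -> float:
    """Fraction of tasks where the evaluator's pick attains the best UT score."""
    hits = sum(ut[_selected(ev)] == max(ut) for ev, ut in zip(s_eval, s_ut))
    return hits / len(s_eval)


def regret(s_eval: Scores, s_ut: Scores) -> float:
    """Mean UT score lost by trusting the evaluator's pick over the best one."""
    total = sum(max(ut) - ut[_selected(ev)] for ev, ut in zip(s_eval, s_ut))
    return total / len(s_eval)


def threshold_conditioned_ut(s_eval: Scores, s_ut: Scores, theta: float) -> Optional[float]:
    """Mean UT score over candidates whose evaluator score is at least theta."""
    kept = [u for ev, ut in zip(s_eval, s_ut) for e, u in zip(ev, ut) if e >= theta]
    return sum(kept) / len(kept) if kept else None
```

For example, with two tasks and two candidates each, if the evaluator picks the best candidate on the first task and the worst on the second, best-of-N accuracy is 0.5 and regret is half the unit-test gap on the second task.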

Prompt Iteration: Five versions (v1–v5) progressively address failure patterns: lazy evaluation without execution, lack of end-to-end validation, role confusion (evaluator modifying generator code), context overload, and over-specification (v5 shows degradation—more detail is not always better).

Empirical Validation / Results

Key Tables

Table 1: Agentic judge ablation on human-annotated benchmark

| Strategy | #Turns | instruct_clear (P/R/F1) | instruct_ut_align (P/R/F1) |
|---|---|---|---|
| 3-voting, Qwen-Plus | 37/17/92 | 97.26/92.21/94.67 | 74.00/78.72/76.29 |
| 5-voting, Qwen-Max | 24/14/40 | 97.18/89.61/93.24 | 72.73/85.11/78.43 |
| + Examples | 25/15/46 | 100.00/85.71/92.31 | 78.72/78.72/78.72 |
| + Examples + GT patch | 27/17/57 | 100.00/83.12/90.78 | 75.93/87.23/81.19 |

Table 3: Behavior monitoring suppresses reward hacking

| Benchmark | Clean Resolved (%), Base → +Mon. | Hack Rate (%), Base → +Mon. | Hacked Resolved (%), Base → +Mon. |
|---|---|---|---|
| SWE-Bench Verified | 36.49 → 64.98 (+28.50) | 51.49 → 2.13 (-49.35) | 41.35 → 0.70 (-40.65) |
| SWE-Bench Multilingual | 50.73 → 66.33 (+15.60) | 31.19 → 1.59 (-29.61) | 23.76 → 0.84 (-22.93) |
| SWE-Bench Pro | 33.43 → 50.27 (+16.84) | 30.60 → 0.20 (-30.40) | 20.61 → 0.13 (-20.47) |
| Average | 40.22 → 60.53 (+20.31) | 37.76 → 1.31 (-36.45) | 28.57 → 0.56 (-28.02) |

Table 4: Rubric judge alignment with human annotations

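| Scorer | Prompt | Spearman ρ | Kendall τ | Battle Agreement | Cross-Judge τ |
|---|---|---|---|---|---|
| Qwen3.7-Plus | Default | 0.810 | 0.714 | 40.4% | ≥ 0.93 |
| Qwen3.7-Plus | Strict | 0.810 | 0.714 | 41.4% | |
| Qwen3.6-Max | Default | 0.905 | 0.786 | 34.2% | |
| Qwen3.6-Max | Strict | 0.905 | 0.786 | 36.1% | |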

Table 5: Interactive Judge RFT results

| Setting | WebDev Human Eval | QwenWebBench |
|---|---|---|
| Qwen-Plus (intermediate) | 78 | 1509 |
| + Interactive Judge RFT | 84 (↑ 6) | 1545 (↑ 36) |

Figure 10: Span-KTO outperforms SFT and RW-SFT across five benchmarks

| Benchmark | SFT | RW-SFT | Span-KTO |
|---|---|---|---|
| SWE-bench Verified | 54.2 | 55.2 | 59.8 |
| SWE-bench Pro | 33.4 | 36.5 | 38.1 |
| SWE-bench Multilingual | 37.7 | 41.2 | 45.5 |
| Aone-bench | 14.8 | 25.0 | 28.1 |
| Octo-bench | 62.3 | 67.0 | 67.4 |

Table 6: Evaluator prompt iteration on NL2Repo

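| Prompt | BoN-Acc ↑ | Regret ↓ | τ ↑ | r_eval / ρ_eval ↑ | r_pass / ρ_pass ↑ |
|---|---|---|---|---|---|
| v1 | 57.9 | 0.086 | 0.379 | 0.489 / 0.448 | 0.503 / 0.477 |
| v2 | 63.9 | 0.088 | 0.420 | 0.525 / 0.490 | 0.623 / 0.589 |
| v3 | 62.4 | 0.081 | 0.440 | 0.556 / 0.564 | 0.599 / 0.597 |
| v4 | 67.4 | 0.089 | 0.473 | | |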
