Summary (Overview)
- Proposes Dockerless, an environment-free agentic verifier that evaluates code patches by actively exploring the repository rather than executing tests.
- Outperforms the strongest open-source verifier by 14.3 AUC points on a trajectory-level verifier evaluation benchmark.
- Enables a fully environment-free post-training pipeline: Dockerless serves as both the filter for supervised fine-tuning (SFT) trajectories and the reward signal for reinforcement learning (RL), requiring zero per-repository Docker setup.
- The resulting model (Dockerless-RL-9B) achieves 62.0%, 50.0%, and 35.2% resolve rate on SWE-bench Verified, Multilingual, and Pro, respectively, improving over the Qwen3.5-9B baseline by +2.4, +8.7, and +2.9 points.
- Matches the performance of standard environment-based post-training, demonstrating the viability of scalable, environment-free verification for real-world repositories.
Introduction and Theoretical Foundation
Background. Program verifiers play a central role in training automated coding agents. They are used to curate high-quality trajectories for supervised fine-tuning (SFT) and to provide rewards for reinforcement learning (RL). The gold standard is execution-based verification, which runs unit tests inside isolated, per-repository Docker environments. This imposes substantial engineering overhead (building Docker images, resolving dependencies, writing test scripts) and is infeasible for many private, enterprise, or legacy codebases that lack reproducible environments or comprehensive test suites.
Motivation. Prior environment-free verifiers score patches using only surface-level information (e.g., textual similarity or LLM judging from a fixed prompt) without ever inspecting the repository. Such shallow approaches are insufficient for complex software engineering (SWE) tasks, where determining functional equivalence requires deep repository context—e.g., whether a modified function is actually called by the failing behavior, or whether an alternative implementation correctly integrates with surrounding modules.
Proposal. To close this gap, the paper introduces Dockerless, an environment-free agentic verifier that actively explores the repository to judge patch correctness. Instead of blindly matching textual diffs, Dockerless grounds its verification in the actual codebase through parallel sub-agents that gather evidence. It then aggregates that evidence into a continuous correctness score, enabling reliable verification without any per-repository environment.
Methodology
Problem Setting
Given an issue (x) and a candidate patch (y), a verifier assigns a correctness score (r(x, y) \in [0,1]). The environment-based gold standard is: [ r_{\text{env}}(x, y) = \mathbb{1}[\text{tests in } \mathcal{E}_x \text{ pass under } y], ] where (\mathcal{E}_x) is the repository-specific Docker environment. The goal is to train an environment-free verifier (r_\phi(x, y)) that replaces (r_{\text{env}}).
Architecture of Dockerless
Dockerless operates in two stages (see Figure 2):
- Question Generation and Exploration. Given an issue (x), a reference patch (y_{\text{ref}}), and a candidate patch (y), the model first generates (K) verification questions (\{Q_1, \dots, Q_K\}). These probe where the fix should take effect, what the patched code does, what tests confirm correctness, and whether other parts of the repository could break. For each question, a dedicated sub-agent explores the repository using read-only shell tools (e.g., find, grep, rg) and returns a short evidence-backed answer (A_k). The (K) sub-agents run in parallel for efficiency.
- Judgment. Given ((x, y_{\text{ref}}, y, \{(Q_k, A_k)\}_{k=1}^K)), the verdict model outputs a binary token in (\{0,1\}) (1 = correct patch). At inference, the logits of the two verdict tokens are converted into a continuous score: [ r_\phi(x, y) = \frac{\exp(\ell_1)}{\exp(\ell_0) + \exp(\ell_1)}, ] where (\ell_0) and (\ell_1) are the logits for tokens 0 and 1.
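The two stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: `gather_evidence`, `verdict_score`, and the `explore` callback are hypothetical names, and the stub explorer stands in for a real sub-agent with read-only shell access. The scoring function is the two-way softmax from the formula above, written in a numerically stable form.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def gather_evidence(
    questions: List[str], explore: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Stage 1: answer each verification question with its own sub-agent.

    `explore` is a hypothetical stand-in for a read-only repository explorer.
    """
    if not questions:
        return []
    # One worker per question, mirroring the K parallel sub-agents.
    with ThreadPoolExecutor(max_workers=len(questions)) as pool:
        answers = list(pool.map(explore, questions))  # map preserves input order
    return list(zip(questions, answers))


def verdict_score(logit_0: float, logit_1: float) -> float:
    """Stage 2: two-way softmax over the verdict-token logits, P(token == 1)."""
    # Shift by the max logit so exp() cannot overflow; the ratio is unchanged.
    shift = max(logit_0, logit_1)
    e0 = math.exp(logit_0 - shift)
    e1 = math.exp(logit_1 - shift)
    return e1 / (e0 + e1)


if __name__ == "__main__":
    evidence = gather_evidence(
        ["Where should the fix take effect?", "Which tests confirm correctness?"],
        explore=lambda q: f"(stub answer to: {q})",
    )
    print(evidence)
    print(verdict_score(0.0, 0.0))  # equal logits give 0.5
```

Because only two logits enter the softmax, the score equals a sigmoid of the logit gap (\ell_1 - \ell_0), so it depends on how strongly the model prefers one verdict token over the other rather than on the absolute logit values.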
Training
Dockerless is trained via rejection sampling on execution-labeled candidate patches. Each training example is a tuple ((x, y_{\text{ref}}, y, r^*)) with ground-truth verdict (r^* \in \{0,1\}) from unit test execution. A teacher model generates question-answer-judge trajectories (\tau) and a predicted verdict (\hat{r}). Trajectories are kept only if (\hat{r} = r^*), forming dataset (\mathcal{D}_{\text{rej}}). The negative-to-positive ratio is capped at (\rho) to mitigate class imbalance.
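A minimal sketch of this filtering step follows. The `Trajectory` record and `build_rejection_set` are hypothetical names, and the source does not say how the cap (\rho) is enforced; random subsampling of the surplus negatives is an assumption made here for illustration.

```python
import random
from typing import List, NamedTuple


class Trajectory(NamedTuple):
    text: str       # placeholder for the question-answer-judge trajectory
    predicted: int  # teacher verdict r_hat in {0, 1}
    label: int      # execution-based verdict r* in {0, 1}


def build_rejection_set(
    trajectories: List[Trajectory], rho: float, seed: int = 0
) -> List[Trajectory]:
    """Keep trajectories whose verdict matches execution, then cap negatives."""
    kept = [t for t in trajectories if t.predicted == t.label]
    positives = [t for t in kept if t.label == 1]
    negatives = [t for t in kept if t.label == 0]
    # Assumption: enforce the negative-to-positive cap by random subsampling.
    max_negatives = int(rho * len(positives))
    if len(negatives) > max_negatives:
        negatives = random.Random(seed).sample(negatives, max_negatives)
    return positives + negatives


if __name__ == "__main__":
    data = (
        [Trajectory(f"pos{i}", 1, 1) for i in range(2)]
        + [Trajectory(f"neg{i}", 0, 0) for i in range(5)]
        + [Trajectory(f"wrong{i}", 1, 0) for i in range(3)]
    )
    print(len(build_rejection_set(data, rho=1.5)))  # 2 positives + 3 negatives = 5
```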
The verifier is trained with standard next-token cross-entropy: [ \mathcal{L}_\phi = -\mathbb{E}_{\mathcal{D}_{\text{rej}}}\left[ \sum_{t=1}^T \log p_\phi(z_t \mid x, y_{\text{ref}}, y, z_{<t}) \right], ] where (z = (z_1, \dots, z_T)) is the token sequence of the trajectory. A single backbone is shared across question generation, sub-agent exploration, and final judgment.
Environment-Free Post-training
With Dockerless trained, it is applied in two pipelines (see Figure 4):
Environment-free RFT (Rejection Sampling Fine-Tuning). Rollouts are collected from an agent running in a minimal Linux image (no per-repository environment). The final patch of each rollout is scored by Dockerless, and the top-(K) rollouts are kept to fine-tune the base model.
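The filtering step can be illustrated as follows. This is a sketch under assumptions: `select_top_k` and the `score` callback are hypothetical names, and the ranking is done globally over all rollouts, which the source does not specify (a per-issue ranking would be an equally plausible reading).

```python
from typing import Callable, List, Tuple

Rollout = Tuple[str, str]  # (issue, final_patch); a simplified, hypothetical record


def select_top_k(
    rollouts: List[Rollout], score: Callable[[str, str], float], k: int
) -> List[Rollout]:
    """Rank rollouts by verifier score and keep the k highest-scoring ones."""
    # sorted() evaluates the key once per rollout, so each patch is scored once.
    ranked = sorted(rollouts, key=lambda r: score(r[0], r[1]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Stub scores standing in for Dockerless outputs.
    fake_scores = {"patch_a": 0.91, "patch_b": 0.12, "patch_c": 0.77}
    rollouts = [("issue-1", p) for p in fake_scores]
    print(select_top_k(rollouts, lambda _issue, patch: fake_scores[patch], k=2))
```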
Environment-free RL. Dockerless serves as the per-rollout reward model. For a group of (G) rollouts on issue (x), let (\{y_1, \dots, y_G\}) be the final patches. Group-normalized advantages are: [ A_i = \frac{r_\phi(x, y_i) - \bar{r}}{\hat{\sigma}_r}, \quad \bar{r} = \frac{1}{G} \sum_{j=1}^G r_\phi(x, y_j), ] where (\hat{\sigma}_r) is the standard deviation. These advantages are used in the standard GRPO objective. Each reward is computed by averaging (M) independent Dockerless evaluations for stability.
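The reward averaging and group normalization can be sketched as below. Several details are assumptions not stated in the source: the function names are hypothetical, the population standard deviation is used for (\hat{\sigma}_r), and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
import statistics
from typing import Callable, List


def averaged_reward(
    score_once: Callable[[str, str], float], issue: str, patch: str, m: int
) -> float:
    """Average M independent verifier evaluations of one patch."""
    return sum(score_once(issue, patch) for _ in range(m)) / m


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    # Assumption: population standard deviation over the G rewards in the group.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rewards = [0.9, 0.1, 0.5, 0.5]  # stub Dockerless scores for G = 4 rollouts
    print(group_advantages(rewards))
```

Rollouts scored above the group mean receive positive advantages and those below receive negative ones, so the policy update depends only on relative scores within the group.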
Empirical Validation / Results
Main Results
Fully environment-free post-training reaches the strongest open-source performance (Table 1). Starting from Qwen3.5-9B, Dockerless-RL-9B achieves 62.0%, 50.0%, and 35.2% resolve rate on SWE-bench Verified, Multilingual, and Pro, respectively. This improves over the base model by +2.4, +8.7, and +2.9 points, and over the next-best open-source SWE specialist (SWE-Lego-8B) by +20.8, +31.0, and +19.1 points.
Table 1: Resolve rate (%) on SWE-bench Verified, Multilingual, and Pro under env-based evaluation.
| Model | Base | Training | Env-free | Verified | Multilingual | Pro |
|---|---|---|---|---|---|---|
| Qwen3.5-9B | – | – | – | 59.6 | 41.3 | 32.3 |
| Env-SFT-9B | Qwen3.5-9B | SFT | No | 60.0 | 48.3 | 33.9 |
| Dockerless-SFT-9B | Qwen3.5-9B | SFT | Yes | 60.6 | 47.7 | 35.3 |
| + DeepSWE-Verifier RL | Dockerless-SFT-9B | RL | Yes | 60.6 | 47.3 | 34.1 |
| + Test-Execution RL | Dockerless-SFT-9B | RL | No | 62.4 | 51.3 | 35.7 |
| Dockerless-RL-9B | Dockerless-SFT-9B | RL | Yes | 62.0 | 50.0 | 35.2 |
Env-free SFT matches env-based SFT. Dockerless-SFT-9B achieves comparable performance to Env-SFT-9B (60.6 vs. 60.0 on Verified, 47.7 vs. 48.3 on Multilingual, 35.3 vs. 33.9 on Pro).
Env-free RL approaches env-based RL. Dockerless-RL-9B is close to Test-Execution RL (62.0 vs. 62.4 on Verified, 50.0 vs. 51.3 on Multilingual, 35.2 vs. 35.7 on Pro), while outperforming DeepSWE-Verifier RL by +1.4, +2.7, and +1.1 points.
Verifier Evaluation
Table 2: Verifier AUC on the trajectory-level verifier evaluation benchmark.
| Model | Verified | Multi-SWE |
|---|---|---|
| DeepSeek-V3.2 | 69.4 | 58.5 |
| Kimi-K2.5 | 70.7 | 63.9 |
| GLM-5 | 73.2 | 62.5 |
| GPT-5.4 | 75.9 | 59.5 |
| SWE-Gym Verifier | 61.0 | 53.7 |
| R2E-Gym Verifier | 64.3 | 55.1 |
| OpenHands Critic | 48.6 | 52.2 |
| DeepSWE Verifier | 66.7 | 62.9 |
| Dockerless | 81.0 | 72.1 |
Dockerless outperforms all baselines, improving AUC by 14.3 points over the strongest trained verifier (DeepSWE) and by 5.1 points over the strongest LLM judge (GPT-5.4) on the Verified split.
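For readers unfamiliar with the metric, the sketch below computes AUC in its pairwise-ranking form, assuming that "AUC" here means the area under the ROC curve (the source does not spell this out). Under that reading, AUC is the probability that a randomly chosen correct patch receives a higher verifier score than a randomly chosen incorrect one, with ties counted as one half. The function name and the toy data are illustrative only.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Each pair contributes 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy verifier scores with execution labels (1 = tests pass).
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

This quadratic-time form is meant for clarity; a sort-based computation gives the same value on large evaluation sets.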
Effect of the SFT Data Filter
Table 3: Effect of SFT data filter on downstream resolve rate (%).
| Training Data | Verified | Multilingual | Pro |
|---|---|---|---|
| None (base) | 59.6 | 41.3 | 32.3 |
| All 16K | 58.8 | 41.3 | 31.9 |
| Random 4K | 58.2 | 44.3 | 32.0 |
| Env-based 4K | 60.0 | 48.3 | 33.9 |
| Dockerless 4K | 60.6 | 47.7 | 35.3 |
Dockerless 4K substantially outperforms Random 4K and All 16K, and matches Env-based 4K, showing effective trajectory filtering without per-repository environments.
Effect of Number of Verification Questions
Dockerless performance improves as (K) increases from 0 to 4 (AUC from 78.3 to 81.0 on Verified). Beyond 4, AUC does not improve further and dips slightly (79.6 at (K=6), 80.3 at (K=8)), so 2–4 questions are used at inference to balance accuracy and cost.
Latency Analysis
During RL training, agent rollout time dominates (2308s per rollout), while reward evaluation adds only 41–180s. Dockerless adds 180s (7.2% of total), making its additional cost small compared to rollout generation.
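The 7.2% figure is consistent with taking the total as rollout time plus reward-evaluation time, as the short calculation below shows. The function name is illustrative, and the interpretation of "total" is an inference from the reported numbers rather than something the source states.

```python
def overhead_share(rollout_s: float, reward_s: float) -> float:
    """Fraction of total per-rollout time spent on reward evaluation."""
    return reward_s / (rollout_s + reward_s)


if __name__ == "__main__":
    # Reported values: 2308 s per rollout, 41-180 s for reward evaluation.
    print(round(100 * overhead_share(2308, 180), 1))  # 7.2
    print(round(100 * overhead_share(2308, 41), 1))   # 1.7
```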
Case Study
A candidate patch for a Matplotlib issue uses an inline conditional instead of the helper-variable refactor in the reference patch. Dockerless dispatches sub-agents that confirm the fix is applied to both the XAxis and YAxis paths and that the inherit/explicit labelcolor semantics are preserved. It scores the patch 0.996 (matching the execution result), while text similarity scores it 0.468 and DeepSWE Verifier scores it 0.035.
Theoretical and Practical Implications
Theoretical. Dockerless demonstrates that agentic, evidence-grounded verification can replace execution-based verification for post-training. By actively exploring the repository, the verifier obtains deep context about code semantics and integration, enabling correct judgments even when candidate patches differ substantially from reference patches in surface form.
Practical. Dockerless unlocks a fully environment-free post-training pipeline. This is crucial for scaling to the long tail of real-world repositories where reproducible execution environments or comprehensive test suites are unavailable. The pipeline achieves performance comparable to standard environment-based post-training, making it a scalable and viable path for training coding agents on diverse codebases. The approach reduces engineering overhead and opens the door to post-training on private, enterprise, or legacy repositories.
Conclusion
This paper proposes Dockerless, an environment-free agentic verifier that scores patches by actively exploring the repository through parallel sub-agents. Dockerless outperforms prior open-source verifiers by a large margin (14.3 AUC points). When used as both the SFT trajectory filter and the RL reward signal, Dockerless enables a fully environment-free post-training pipeline that matches the performance of traditional environment-based post-training. The resulting model (Dockerless-RL-9B) achieves state-of-the-art open-source results on three SWE-bench variants. The authors believe that agentic, evidence-grounded verification provides a new perspective on reward modeling for code and opens a scalable path toward post-training on the long tail of real-world repositories without reproducible execution environments.