Summary of: EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

Summary (Overview)

  • Formalizes Autonomous Policy Evolution as a controlled evaluation setting where a coding agent repeatedly edits an executable policy system under a fixed interaction budget, receiving only visible training feedback.
  • Introduces EvoPolicyGym, a benchmark built from 16 compact interactive RL environments (Gym/Box2D, MuJoCo, MiniGrid, Robotics/Driving) with strict visibility boundaries: train feedback is visible; validation and held-out evaluation remain server-side.
  • Evaluates four harness–model agents (GPT-5.5, Claude Opus 4.7, MiniMax-M3, DeepSeek-V4-Pro) on a 128-episode budget. GPT-5.5 achieves the highest Core16 aggregate rank score (0.891) and top-two performance on all 16 environments.
  • Provides trajectory-level diagnostics beyond final scores, analyzing budget allocation, structural synthesis vs. parametric tuning, and how agents translate feedback into policy revisions.
  • Demonstrates that strong autonomous policy evolution depends on discovering task-appropriate mechanisms, not just isolated task wins; weaker agents churn structure without traction on synthesis-dominant tasks.

Introduction and Theoretical Foundation

Autonomous agents are increasingly expected to improve iteratively through feedback rather than produce single fixed outputs. Modern coding agents can call tools, observe failures, and revise artifacts over long horizons, while self-improvement systems show that language models can use reflection to refine answers across attempts. However, evaluating this broader capability is hard because improvement is both an outcome and a process: final scores can hide blind retries, overfitting, brittle special cases, and missing verification. Fully open-ended engineering tasks add further confounders, including evolving specifications and software-maintenance quality.

Autonomous Policy Evolution is formalized as a problem where an agent repeatedly revises an executable decision policy using feedback from prior deployments. The observable object is the sequence of submitted policy systems and train-feedback records, while the outcome is the held-out return of the checkpoint selected on hidden validation. The bounded budget is part of the capability being measured: agents must choose what information to acquire, when to explore or exploit, and how efficiently to convert sparse behavioral evidence into robust policy improvement.

EvoPolicyGym instantiates this problem in a controlled benchmark built from compact interactive environments, making policy evolution itself the evaluated object rather than direct task execution or open-ended engineering progress.

Methodology

Framework: Autonomous Policy Evolution Protocol

EvoPolicyGym frames autonomous policy evolution as an agent-driven optimization loop. A coding agent maintains a persistent policy workspace, submits candidate revisions for visible train episodes, receives server-generated rollout summaries and trajectories, and revises the policy under a fixed episode budget. The primary evaluation unit is a complete budget-constrained run, scored by the held-out return of its best validation checkpoint.

Environment, Policy, and Episode:

  • Environment: Interactive task with a reset/step interface; observations may be pixels, vectors, or symbolic grids; actions may be discrete or continuous.
  • Policy: A decision rule mapping observations to actions, possibly with internal state. Deterministic: $(a_t, h_{t+1}) = \mu(o_t, h_t)$. Stochastic: $(a_t, h_{t+1}) \sim \pi(\cdot \mid o_t, h_t)$. Implemented as an executable Python artifact with a reset() and act(obs) interface.
  • Episode: Complete policy–environment interaction from reset to termination; return is cumulative reward.
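
To make the policy interface and the notion of an episode concrete, here is a minimal sketch. The reset() and act(obs) method names come from the description above; the toy environment `CountdownEnv`, the policy `ParityPolicy`, and the helper `run_episode` are hypothetical illustrations and are not part of the benchmark.

```python
class CountdownEnv:
    """Hypothetical toy environment with a reset/step interface."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # observation: current step index

    def step(self, action):
        # Reward 1.0 when the action matches the parity of the step index.
        reward = 1.0 if action == self.t % 2 else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


class ParityPolicy:
    """Deterministic policy with internal state (a simple step counter)."""

    def reset(self):
        self.steps_seen = 0  # internal state h_t, cleared at episode start

    def act(self, obs):
        self.steps_seen += 1  # state update h_t -> h_{t+1}
        return obs % 2        # action a_t


def run_episode(env, policy):
    """Run one episode from reset to termination; return cumulative reward."""
    obs = env.reset()
    policy.reset()
    total, done = 0.0, False
    while not done:
        action = policy.act(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total
```

Calling `run_episode(CountdownEnv(5), ParityPolicy())` returns the episode return for this toy setup; real benchmark environments would expose richer observations and termination signals.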

Agent Loop Transition: At observed revision $i$, the agent observes $(W_i, F_i, B_i)$ with history $H_i$ and produces:

$$\pi_\theta(W_i, F_i, B_i, H_i) \rightarrow (u_i, s_i, H_{i+1}),\quad s_i \in \{\bot\} \cup \mathcal{C}(B_i)$$

$$W_{i+1} = \text{apply}(W_i, u_i),\quad P_{i+1} = \Phi(W_{i+1})$$

$$(\Delta F_i, c_i) = \mathcal{S}(B_i, P_{i+1}, s_i),\quad B_{i+1} = B_i - c_i,\quad F_{i+1} = F_i \cup \Delta F_i$$

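Here $W_i$ is the policy workspace, $F_i$ the accumulated train feedback, and $B_i$ the remaining episode budget. $u_i$ is a workspace patch, $s_i$ is a submit command (or $\bot$ for no evaluation), and $\mathcal{S}$ returns feedback and episode cost $c_i$.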
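
The transition can be sketched as a small state update. This is an illustration under stated assumptions, not the benchmark's implementation: the names `LoopState`, `apply_patch`, `submit`, and `step` are hypothetical, a submit command is modeled as a number of train episodes (with `None` standing in for the no-evaluation command), the cost is assumed to equal the number of episodes run, and feedback is kept as a list rather than a set.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    workspace: dict                               # W_i: file name -> source text
    feedback: list = field(default_factory=list)  # F_i: accumulated train feedback
    budget: int = 128                             # B_i: remaining episodes


def apply_patch(workspace, patch):
    """W_{i+1} = apply(W_i, u_i); a patch maps file names to new contents."""
    updated = dict(workspace)
    updated.update(patch)
    return updated


def submit(budget, policy, episodes, evaluate):
    """Return (delta_F, cost) for one submit command (assumed cost model)."""
    if episodes is None:  # no-evaluation command: no feedback, no cost
        return [], 0
    if not 0 < episodes <= budget:  # assumed feasibility check on the command
        raise ValueError("submit command exceeds remaining budget")
    return [evaluate(policy) for _ in range(episodes)], episodes


def step(state, patch, episodes, build, evaluate):
    """One agent-loop transition: patch, rebuild policy, submit, update state."""
    workspace = apply_patch(state.workspace, patch)
    policy = build(workspace)  # P_{i+1} built from W_{i+1}
    delta_f, cost = submit(state.budget, policy, episodes, evaluate)
    return LoopState(workspace, state.feedback + delta_f, state.budget - cost)
```

In this sketch, `build` and `evaluate` are caller-supplied stand-ins for policy construction and server-side rollouts; a revision that only edits the workspace leaves the budget untouched.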

Core16 Environment Suite

The benchmark uses 16 tasks across four families: Gym/Box2D (Acrobot, ContinuousCar, BipedalWalker, CarRacing), MuJoCo (Reacher, HalfCheetah, Ant, Pusher), MiniGrid (DoorKey, KeyCorridor, FourRooms, ObstructedMaze), and Robotics/Driving (Parking, Roundabout, FetchPush, FetchPickAndPlace).

Experimental Protocol

  • Budget: 128 training episodes per environment run.
  • Evaluation: Hidden validation (16 cases) selects the best checkpoint; final score is mean return on hidden held-out (32 cases).
  • Agents: GPT-5.5 (Codex harness), Claude Opus 4.7 (Claude Code), MiniMax-M3 (Claude Code), DeepSeek-V4-Pro (Claude Code). The harness is part of the evaluated system.
  • Scoring: Per-environment rank scores are macro-averaged to produce family and Core16 scores.
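
The evaluation rule (validation selects the checkpoint, held-out determines the score) can be sketched as follows. The function `select_and_score` and the dictionary layout are hypothetical, and ranking checkpoints by mean validation return is an assumption; the summary only states that hidden validation selects the best checkpoint and that the final score is the mean held-out return.

```python
from statistics import mean


def select_and_score(checkpoints):
    """Pick the checkpoint with the best validation mean; report its held-out mean.

    checkpoints: list of dicts with keys 'name', 'val_returns', 'heldout_returns'.
    """
    # Selection uses validation returns only (assumed: compared by mean).
    best = max(checkpoints, key=lambda c: mean(c["val_returns"]))
    # The reported score uses held-out returns of the selected checkpoint only.
    return best["name"], mean(best["heldout_returns"])
```

Note that a checkpoint with a higher held-out mean can go unreported if it was not the validation winner, which is what keeps the held-out score free of selection bias.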

Empirical Validation / Results

Leaderboard Results (Tables 1 & 2)

Table 1 shows validation-selected held-out returns on native reward scales. Table 2 summarizes aggregate rank scores:

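| Model           | Harness     | Core16 Score | Wins | Top-2 |
|-----------------|-------------|--------------|------|-------|
| GPT-5.5         | Codex       | 0.891        | 9    | 16    |
| Claude Opus 4.7 | Claude Code | 0.750        | 5    | 12    |
| MiniMax-M3      | Claude Code | 0.531        | 1    | 3     |
| DeepSeek-V4-Pro | Claude Code | 0.359        | 1    | 1     |
| Random policy   | -           | 0.109        | 0    | 0     |
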
  • GPT-5.5 leads Gym/Box2D (0.938), MuJoCo (0.875), and Robotics/Driving (0.938); Claude Opus 4.7 leads MiniGrid (0.938).
  • GPT-5.5 is top-two on all 16 environments, while weaker agents win isolated tasks but lack coverage.
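
As a rough illustration of macro-averaged rank scores, the sketch below assumes a particular normalization that the summary does not specify: an entry's rank score in one environment is the fraction of other entries it strictly beats. The function names and the example data are hypothetical.

```python
from statistics import mean


def rank_scores(returns):
    """returns: agent -> held-out return for one environment (higher is better).

    Assumed normalization: share of the other entries that are strictly worse.
    """
    n = len(returns)
    return {
        agent: sum(r > other for other in returns.values()) / (n - 1)
        for agent, r in returns.items()
    }


def macro_average(per_env_returns):
    """per_env_returns: env -> {agent -> return}; mean rank score over envs."""
    per_env = [rank_scores(r) for r in per_env_returns.values()]
    agents = per_env[0].keys()
    return {a: mean(s[a] for s in per_env) for a in agents}
```

Because each environment contributes one rank score regardless of its native reward scale, an agent with a few large wins cannot outscore one that places consistently well.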

Post-Hoc Score Trajectories (Figure 3)

Figure 3 plots the evolution of hidden validation best-so-far score over consumed budget. Key findings:

  • MiniGrid tasks show sparse but sharp jumps; MuJoCo shows more incremental gains; robotics/driving show delayed improvements.
  • Similar final scores can arise either from an early jump followed by a plateau or from a late improvement after much of the budget has been consumed.
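
A best-so-far curve of this kind is a running maximum over validation scores in submission order. The sketch below is illustrative only; `best_so_far` is a hypothetical helper, and the example scores are made up.

```python
from itertools import accumulate


def best_so_far(validation_scores):
    """Running maximum of validation scores, one entry per submitted revision."""
    return list(accumulate(validation_scores, max))
```

For example, the sequences `[1.0, 7.0, 3.0, 2.0]` and `[1.0, 2.0, 3.0, 7.0]` end at the same best score, but the first reaches it at the second revision and the second only at the last, which is the distinction the trajectories are meant to expose.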

Mechanisms of Policy Evolution

Structural Synthesis vs. Parametric Tuning:

  • Synthesis-dominant tasks (pixel-perception, symbolic-planning) need task-specific machinery (e.g., visual state extraction, memory, search). Stronger agents build richer code bundles (Table 3).
  • Tuning-dominant tasks (low-dimensional control) allow improvement by adjusting gains and thresholds within an existing controller family.

Figure 4 normalizes held-out scores per environment to a random-to-best scale:

$$\text{norm}_{m,e} = \text{clip}_{[0,1]} \left( \frac{R^{\text{heldout}}_{m,e} - R^{\text{random}}_e}{R^{\text{best}}_e - R^{\text{random}}_e} \right)$$

  • GPT-5.5 (0.98) and Claude Opus 4.7 (1.00) nearly reach best on synthesis tasks; MiniMax-M3 (0.19) and DeepSeek-V4-Pro (0.03) remain near random.
  • On tuning tasks, agents cluster more tightly (0.67–0.99), showing that the gap opens on synthesis tasks.
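
The random-to-best normalization is simple to compute directly. The helper below is a minimal sketch; the function name `normalize` and the guard against a zero-width range are additions for illustration.

```python
def normalize(heldout, random_ret, best_ret):
    """Map a held-out return onto [0, 1], where 0 is random and 1 is best."""
    span = best_ret - random_ret
    if span == 0:
        raise ValueError("best and random returns coincide; scale is undefined")
    # Clip so that below-random results map to 0 and nothing exceeds 1.
    return min(1.0, max(0.0, (heldout - random_ret) / span))
```

The clipping means an agent that scores below the random policy is reported as 0 rather than as a negative value, so this scale does not distinguish degrees of below-random performance.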

Edit Success Analysis (Table 4): On synthesis tasks, GPT-5.5 and Claude Opus 4.7 convert synthesis edits into new validation bests at 41% and 48% hit rates respectively, while weaker agents achieve only 3–10%. On tuning tasks, parametric edits become useful once the controller family is plausible (GPT-5.5: 61% hit rate).
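
One way to compute such hit rates is sketched below, under an assumed definition: an edit counts as a hit when its validation score strictly exceeds the best score seen before it, and the first edit in a run always counts. The function name `hit_rates` and the log format are hypothetical.

```python
def hit_rates(edits):
    """edits: list of (kind, validation_score) tuples in submission order.

    Returns kind -> fraction of edits of that kind that set a new validation best.
    """
    best = float("-inf")
    hits, totals = {}, {}
    for kind, score in edits:
        totals[kind] = totals.get(kind, 0) + 1
        if score > best:  # strictly better than every earlier edit
            hits[kind] = hits.get(kind, 0) + 1
            best = score
    return {kind: hits.get(kind, 0) / totals[kind] for kind in totals}
```

Under this definition the best score is shared across edit kinds, so a tuning edit that raises the bar also makes later synthesis edits harder to count as hits.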

Trajectory Case Studies (Figures 5–7):

  • CarRacing (synthesis-dominant): Successful agents attribute visible failures to perception/control, edit accordingly, and use feedback to select/roll back candidates. Weaker agents churn structure without traction.
  • BipedalWalker (tuning-dominant): Tuning succeeds only after a viable gait topology exists. GPT-5.5 reaches a positive gait (timeline best 271); others remain negative.

Theoretical and Practical Implications

  • Theoretical contribution: Formalizes autonomous policy evolution as a distinct, benchmarkable problem separating policy improvement from open-ended engineering or one-shot task completion.
  • Evaluation design: Shows that leaderboard scores alone are insufficient; trajectory-level diagnostics (budget allocation, synthesis vs. tuning, feedback trace analysis) reveal how agents achieve outcomes and where they fail.
  • Practical implications: Strong autonomous improvement requires agents that can (a) infer task-appropriate abstractions, (b) translate feedback into mechanism-level code changes, and (c) preserve useful candidates under budget pressure. The benchmark provides a controlled protocol for measuring stable, feedback-driven policy evolution.
  • Diagnostic value: The structural synthesis vs. parametric tuning lens explains why some agents fail on complex perceptual tasks despite reasonable performance on low-dimensional control. Failing to discover effective structure (not just failing to tune) is a primary weakness.
  • Limitations: AST topology and source-bundle analysis are conservative proxies; two topologies can implement similar behavior, and one topology can mix useful and harmful ideas. The synthesis/tuning split is a lens rather than a taxonomy.

Conclusion

EvoPolicyGym casts autonomous policy improvement as a controlled evaluation of the systems agents build over time. The Core16 results show that high scores require more than isolated task wins: strong agents infer task-appropriate abstractions, translate feedback into mechanism-level code changes, and preserve useful candidates under budget pressure. By pairing leaderboard scores with trajectory-level diagnostics, EvoPolicyGym provides a concrete protocol for measuring stable, feedback-driven autonomous policy evolution. Future work can extend the environment suite, study larger budgets, and investigate how different coding harnesses influence policy evolution dynamics.
