Summary (Overview)
- Role-Agent introduces a bootstrapped agent-environment co-evolution framework using a single LLM that switches between dual roles: as an agent (World-In-Agent, WIA) and as an environment (Agent-In-World, AIW).
- WIA prompts the agent to predict future states after each action; alignment between predicted and actual states provides a process reward, encouraging environment-aware reasoning and fine-grained credit assignment with state-grouped advantages.
- AIW analyzes failure modes from unsuccessful trajectories, then retrieves tasks with similar failure patterns to dynamically reshape the training data distribution, focusing practice on historical deficiencies.
- Extensive experiments on ALFWorld, WebShop, and search-augmented QA benchmarks show Role-Agent consistently outperforms strong baselines (e.g., GiGPO, GRPO) with average gains of over 4%, while incurring only about 5.2% extra computation.
- Ablation studies confirm both WIA and AIW contribute complementary gains; hyper-parameter sensitivity analysis shows optimal performance with moderate prediction horizon and balanced advantage scaling.
Introduction and Theoretical Foundation
Large Language Model (LLM) agents have demonstrated strong performance on complex tasks through multi-turn tool-use and long-horizon reasoning. To further enhance their capabilities, Agentic Reinforcement Learning (ARL) incorporates full interaction rollouts into RL frameworks (e.g., GRPO, PPO), allowing agents to optimize via environment feedback beyond supervised fine-tuning. However, most existing methods treat the environment as a static source of tasks, observations, and rewards, which fails to expose hidden weaknesses or provide targeted feedback. Synthetic environments that adapt to the agent’s deficiencies often require separate models, increasing complexity.
The key insight is to use a single LLM to act as both the agent and the environment, enabling bootstrapped co-evolution without additional models. This leads to Role-Agent, which consists of:
- World-In-Agent (WIA): The agent predicts future states after each action, and the discrepancy between predicted and actual states serves as a predictive reward, promoting reliable decision-making.
- Agent-In-World (AIW): The same LLM analyzes failure trajectories to extract failure modes, then retrieves similar tasks to adjust the training distribution, providing targeted practice.
Methodology
Preliminaries
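Multi-step agent-environment interaction is formalized as follows: given a task prompt, the agent policy generates an action at each step, and the environment returns the next state and a reward, yielding a trajectory. The policy is optimized with the Group Relative Policy Optimization (GRPO) objective (Guo et al., 2025), a clipped policy-gradient objective whose advantages are normalized within each group of rollouts sampled for the same task.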
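The two ingredients of GRPO mentioned here, group-normalized advantages and a clipped importance-weighted term, can be illustrated with a minimal sketch. The function names, the clipping range of 0.2, and the use of the population standard deviation are assumptions made for illustration; this summary does not state the exact form used in the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Normalize each rollout's return against the other rollouts of the same task."""
    mu = mean(returns)
    sigma = pstdev(returns)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in returns]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped importance-weighted term for a single sample."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts of one task: two succeed (return 1.0) and two fail (return 0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
print(clipped_surrogate(ratio=1.5, advantage=advantages[0]))
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, and the clip keeps a large policy ratio from inflating the update.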
World-In-Agent (WIA)
During rollout, after generating each action, the agent is prompted to predict the future states at each prediction horizon (Eq. (2) in the paper). At the end of the rollout, a predictive reward matrix is computed via Longest Matching Subsequence (LMS) between the predicted and ground-truth states.
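As a rough illustration of an LMS-style predictive reward, the sketch below scores a predicted state against the observed one with a token-level longest common subsequence, normalized by the length of the observed state. Reading LMS as a longest common subsequence over whitespace tokens, the normalization, and the matrix layout (rows are steps, columns are horizons) are all assumptions; the paper's exact definition is not reproduced in this summary.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def predictive_reward(predicted_state, actual_state):
    """Score in [0, 1]: matched tokens divided by the length of the actual state."""
    pred = predicted_state.lower().split()
    actual = actual_state.lower().split()
    if not actual:
        return 0.0
    return lcs_length(pred, actual) / len(actual)


def reward_matrix(predictions, actual_states):
    """predictions[t][k] is the state predicted at step t for k + 1 steps ahead."""
    matrix = []
    for t, row in enumerate(predictions):
        scores = []
        for k, pred in enumerate(row):
            target = t + k + 1  # index of the ground-truth state being predicted
            if target < len(actual_states):
                scores.append(predictive_reward(pred, actual_states[target]))
            else:
                scores.append(0.0)  # prediction runs past the end of the rollout
        matrix.append(scores)
    return matrix


actual = ["start", "you open the drawer", "you see a plate"]
predicted = [["you open the drawer", "you see a mug"], ["you see a plate", "anything"]]
print(reward_matrix(predicted, actual))
```

In this toy rollout an exact prediction scores 1.0, a prediction that gets three of four tokens right scores 0.75, and a prediction beyond the final state scores 0.0.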
The task reward and predictive reward are combined by multiplication, which ensures that the predictive reward only modulates actions with non-zero task reward. State grouping collects actions taken under identical states (matched via hash) into groups, and a state-level advantage is computed for each action within its group.
The final advantage combines the trajectory-level and state-level advantages, with a weight that balances the two.
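The reward combination, state grouping, and final advantage described above can be sketched as follows. The specific multiplicative form `task_reward * (1 + pred_reward)`, the SHA-256 state key, the within-group mean/std normalization, and the weighted sum with a hypothetical weight `omega` are assumptions chosen to match the prose, not the paper's stated equations.

```python
import hashlib
from collections import defaultdict
from statistics import mean, pstdev


def state_key(state_text):
    """Hash the textual state so identical states fall into the same group."""
    return hashlib.sha256(state_text.encode("utf-8")).hexdigest()


def combined_reward(task_reward, pred_reward):
    """Assumed multiplicative form: a zero task reward leaves nothing to modulate."""
    return task_reward * (1.0 + pred_reward)


def state_level_advantages(steps, eps=1e-8):
    """steps: list of (state_text, reward); normalize rewards within each state group."""
    groups = defaultdict(list)
    for idx, (state, reward) in enumerate(steps):
        groups[state_key(state)].append((idx, reward))
    advantages = [0.0] * len(steps)
    for members in groups.values():
        rewards = [r for _, r in members]
        mu, sigma = mean(rewards), pstdev(rewards)
        for idx, r in members:
            advantages[idx] = (r - mu) / (sigma + eps)
    return advantages


def final_advantage(traj_adv, state_adv, omega=1.0):
    """Assumed weighted sum; omega = 1.0 weights both levels equally."""
    return traj_adv + omega * state_adv


steps = [
    ("kitchen", combined_reward(1.0, 0.5)),
    ("kitchen", combined_reward(1.0, 0.0)),
    ("hall", combined_reward(0.0, 0.9)),
]
state_adv = state_level_advantages(steps)
print(state_adv)
print(final_advantage(0.5, state_adv[0]))
```

Two actions taken from the same state are compared only with each other, so the one with the better prediction gets the higher state-level advantage, while the failed step receives no credit from its prediction.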
Agent-In-World (AIW)
For each failed trajectory, the LLM is prompted to identify failure modes (e.g., entity confusion, wrong target location) and generate reflections (failure type, core lesson, retrieval query). These failure modes are stored in an offline interaction history. The LLM then retrieves tasks with similar failure patterns and adds them back to the training set, dynamically adjusting the data distribution to focus on the agent’s weaknesses.
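The reflect-then-retrieve loop can be pictured with a small sketch. In the paper the LLM itself performs both the failure analysis and the retrieval; here a token-overlap (Jaccard) score stands in for the retrieval step, and the `Reflection` fields simply mirror the three reflection items listed above. All names, the similarity measure, and `top_k` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reflection:
    task_id: str
    failure_type: str
    core_lesson: str
    retrieval_query: str


def jaccard(a, b):
    """Token-set overlap between two strings (stand-in for LLM-based retrieval)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_similar_tasks(reflection, task_pool, top_k=2):
    """Rank candidate tasks by overlap with the reflection's retrieval query."""
    scored = sorted(
        task_pool.items(),
        key=lambda kv: jaccard(reflection.retrieval_query, kv[1]),
        reverse=True,
    )
    return [task_id for task_id, _ in scored[:top_k]]


def reshape_training_set(training_ids, reflections, task_pool, top_k=2):
    """Append retrieved tasks so failure-prone patterns are sampled more often."""
    reshaped = list(training_ids)
    for reflection in reflections:
        reshaped.extend(retrieve_similar_tasks(reflection, task_pool, top_k))
    return reshaped


task_pool = {
    "t1": "put a clean mug in the cabinet",
    "t2": "find the red mug on the desk",
    "t3": "buy running shoes size 10",
}
reflection = Reflection(
    task_id="t0",
    failure_type="wrong target location",
    core_lesson="check the receptacle named in the task before placing the object",
    retrieval_query="mug cabinet location",
)
print(reshape_training_set(["t3"], [reflection], task_pool))
```

Appending the retrieved task IDs raises their sampling frequency in later training batches, which is one simple way to shift the data distribution toward observed weaknesses.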
Empirical Validation / Results
Experiments are conducted on ALFWorld (household tasks), WebShop (e-commerce), and search-augmented QA (single-hop: NQ, TriviaQA, PopQA; multi-hop: HotpotQA, 2Wiki, MuSiQue, Bamboogle). Backbone models: Qwen2.5-1.5B/3B/7B-Instruct.
Table 1: Performance on ALFWorld and WebShop (with Qwen2.5-1.5B and 7B)
| Backbone | Method | ALFWorld (Avg) | WebShop Score | WebShop Succ. |
|---|---|---|---|---|
| Qwen2.5-1.5B | GiGPO | 86.7 | 83.1 | 65.0 |
| Qwen2.5-1.5B | Role-Agent | 90.9 | 87.7 | 71.9 |
| Qwen2.5-7B | GiGPO | 90.8 | 84.4 | 72.8 |
| Qwen2.5-7B | Role-Agent | 93.8 | 88.0 | 77.1 |
Role-Agent consistently outperforms GiGPO, with absolute gains of 4.2 points on ALFWorld and 6.9 points on WebShop success rate (1.5B). Gains are larger on complex compositional tasks (e.g., +11.0% on the Look task).
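The 1.5B gains quoted above can be checked directly against Table 1. The short sketch below recomputes them as absolute differences and, for comparison, as relative improvements over GiGPO; the helper name `gains` is illustrative only.

```python
def gains(baseline, method):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = method - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# (GiGPO, Role-Agent) scores for Qwen2.5-1.5B, copied from Table 1.
table1_1p5b = {
    "ALFWorld (Avg)": (86.7, 90.9),
    "WebShop Succ.": (65.0, 71.9),
}
for name, (gigpo, role_agent) in table1_1p5b.items():
    print(name, gains(gigpo, role_agent))
```

The quoted 4.2 and 6.9 correspond to absolute point differences; expressed relative to the GiGPO baseline they are about 4.8% and 10.6%.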
Table 2: Search-Augmented QA (Qwen2.5-3B)
| Method | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| GiGPO | 42.0 | 59.5 | 42.4 | 36.9 | 37.0 | 12.6 | 64.1 | 42.1 |
| Role-Agent | 40.1 | 60.4 | 49.8 | 38.8 | 45.2 | 17.8 | 68.4 | 45.8 |
Role-Agent shows its strongest gains on multi-hop tasks (+8.2 points on 2Wiki, +5.2 points on MuSiQue), while GiGPO remains ahead on NQ (42.0 vs. 40.1).
Ablation (Table 3) with Qwen2.5-1.5B:
| Method | ALFWorld | WebShop | Average |
|---|---|---|---|
| Role-Agent | 90.9 | 71.9 | 81.4 |
| - w/o AIW | 87.5 | 66.9 | 77.2 |
| - w/o Predictive Reward | 88.0 | 68.3 | 78.2 |
| GiGPO | 86.7 | 65.0 | 75.9 |
Both components matter, with AIW removal causing the larger drop (3.4 points on ALFWorld and 5.0 points on WebShop). Both ablated variants still outperform GiGPO, indicating that the two components contribute complementary gains.
Hyper-parameter Sensitivity (Table 4):
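| Hyper-param | Value | ALFWorld | WebShop | Average |
|---|---|---|---|---|
| Advantage weight | 0.5 | 89.5 | 71.0 | 80.2 |
| Advantage weight | 1.0 | 90.9 | 71.9 | 81.4 |
| Advantage weight | 2.0 | 86.0 | 65.4 | 75.7 |
| Prediction horizon (% of max steps) | 5% | 90.9 | 71.9 | 81.4 |
| Prediction horizon (% of max steps) | 10% | 90.2 | 68.5 | 79.3 |
| Prediction horizon (% of max steps) | 20% | 75.6 | 62.3 | 69.0 |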
Optimal balance: an advantage weight of 1.0 (equal weight to trajectory- and state-level advantages) and a prediction horizon of 5% of the maximum steps.
Running Dynamics (Figure 3): Role-Agent reaches a higher performance ceiling (90.9%) and converges faster than GiGPO, and it reduces train-inference mismatch.
Efficiency (Figure 6): Extra costs from predictions, predictive reward, and AIW are minor (18.63s + 0.14s + 8.92s), only about 5.2% extra computation over GiGPO.
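The overhead figures can be put together with simple arithmetic. The sketch below sums the three reported costs and backs out the GiGPO time they would imply at 5.2% overhead; that implied baseline is derived here for illustration and is not a number reported in this summary, and the mapping of each cost to a component follows the order in which they are listed.

```python
# Reported extra costs in seconds, in the order listed above.
extra_seconds = {
    "future-state prediction": 18.63,
    "predictive reward": 0.14,
    "AIW": 8.92,
}
reported_overhead = 0.052  # "about 5.2%" extra computation over GiGPO

total_extra = sum(extra_seconds.values())
implied_baseline = total_extra / reported_overhead  # derived, not reported

print(f"total extra cost: {total_extra:.2f} s")
print(f"implied GiGPO time: {implied_baseline:.0f} s")
```

Under these assumptions the extra cost totals about 27.69 s, most of it from generating the future-state predictions, against an implied baseline of roughly 530 s.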
Theoretical and Practical Implications
- Theoretical: Demonstrates that a single LLM can serve dual roles (agent and environment) to achieve bootstrapped co-evolution without separate models. The predictive reward mechanism acts as a form of internal world model that strengthens environment-aware reasoning, while failure-mode-driven reshaping of the training data enables adaptive, targeted learning.
- Practical: Role-Agent yields consistent, substantial improvements across diverse benchmarks (text-based interactive environments) with minimal extra computational overhead (~5.2%). The method is especially effective on complex compositional tasks requiring multi-step planning and memory, and on multi-hop QA. The framework is model-agnostic and can be applied to any LLM backbone, offering a practical path for post-training improvements without external models.
Conclusion
Role-Agent introduces a bootstrapped framework for agent-environment co-evolution using a single LLM in dual roles. The World-In-Agent (WIA) component encourages environment-aware reasoning by rewarding accurate future-state prediction and enabling state-grouped credit assignment. The Agent-In-World (AIW) component dynamically reshapes training data by analyzing failure modes and retrieving analogous tasks, providing targeted practice. Extensive experiments on ALFWorld, WebShop, and search-augmented QA demonstrate consistent performance improvements over strong baselines (GRPO, GiGPO) with only marginal extra cost.
Future work includes extending to multi-modal or real-time embodied settings, which would require vision-language state descriptions or latent-state matching, and exploring the use of stronger, possibly frozen, environment LLMs for the AIW component.