Visual Summary | Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Summary (Overview)

Role-Agent introduces a bootstrapped agent-environment co-evolution framework using a single LLM that switches between dual roles: as an agent (World-In-Agent, WIA) and as an environment (Agent-In-World, AIW).
WIA prompts the agent to predict future states after each action; alignment between predicted and actual states provides a process reward, encouraging environment-aware reasoning and fine-grained credit assignment with state-grouped advantages.
AIW analyzes failure modes from unsuccessful trajectories, then retrieves tasks with similar failure patterns to dynamically reshape the training data distribution, focusing practice on historical deficiencies.
Extensive experiments on ALFWorld, WebShop, and search-augmented QA benchmarks show Role-Agent consistently outperforms strong baselines (e.g., GiGPO, GRPO) with average gains of over 4%, while incurring only about 5.2% extra computation.
Ablation studies confirm that WIA and AIW contribute complementary gains; the hyper-parameter sensitivity analysis finds the best performance with a short prediction horizon (5% of the maximum number of steps) and equal weighting of trajectory-level and state-level advantages.

Introduction and Theoretical Foundation

Large Language Model (LLM) agents have demonstrated strong performance on complex tasks through multi-turn tool-use and long-horizon reasoning. To further enhance their capabilities, Agentic Reinforcement Learning (ARL) incorporates full interaction rollouts into RL frameworks (e.g., GRPO, PPO), allowing agents to optimize via environment feedback beyond supervised fine-tuning. However, most existing methods treat the environment as a static source of tasks, observations, and rewards, which fails to expose hidden weaknesses or provide targeted feedback. Synthetic environments that adapt to the agent’s deficiencies often require separate models, increasing complexity.

The key insight is to use a single LLM to act as both the agent and the environment, enabling bootstrapped co-evolution without additional models. This leads to Role-Agent, which consists of:

World-In-Agent (WIA): The agent predicts future states after each action, and the agreement between predicted and actual states serves as a predictive reward, promoting reliable decision-making.
Agent-In-World (AIW): The same LLM analyzes failure trajectories to extract failure modes, then retrieves similar tasks to adjust the training distribution, providing targeted practice.

Methodology

Preliminaries

Multi-step agent–environment interaction is formalized as follows: given task prompt $x$, the agent policy $\pi_\theta(a_t|s_t,x)$ generates action $a_t$ at step $t$, and the environment returns the next state $s_{t+1}$ and reward $r_t$, yielding the trajectory $\tau = \{(s_t,a_t,r_t)\}_{t=1}^T$. The Group Relative Policy Optimization (GRPO) objective (Guo et al., 2025) is:

J(\theta) = \frac{1}{N}\sum_{i=1}^N \frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|} \min\left(\rho_{\theta,t}^{(i)} A_E(\tau_i), \operatorname{clip}(\rho_{\theta,t}^{(i)}, 1\pm\epsilon) A_E(\tau_i)\right) - \beta D_{\text{KL}}[\pi_\theta \|\pi_{\text{ref}}]

with $A_E(\tau_i) = \frac{R_E(\tau_i) - \operatorname{avg}(\{R_E(\tau_j)\}_{j=1}^N)}{\operatorname{std}(\{R_E(\tau_j)\}_{j=1}^N)}$ and $\rho_{\theta,t}^{(i)} = \pi_\theta(y_t^{(i)}|x,y_{<t}^{(i)})/\pi_{\text{old}}(y_t^{(i)}|x,y_{<t}^{(i)})$.
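The two ingredients of this objective, the group-normalized advantage and the clipped ratio term, can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the function names, the population standard deviation, the stabilizing `eps`, and the default `clip_eps=0.2` are all assumptions, and the KL penalty is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(returns, eps=1e-8):
    """A_E for each trajectory: its return normalized within the rollout group."""
    mu = mean(returns)
    sigma = pstdev(returns)  # assumption: population std; the source only says "std"
    return [(r - mu) / (sigma + eps) for r in returns]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """One token's surrogate: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Four rollouts of the same task: two succeed (return 1), two fail (return 0).
returns = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(returns)
```

With these returns the successful rollouts receive an advantage close to +1 and the failed ones close to -1, and every token in a trajectory shares that trajectory's value.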

World-In-Agent (WIA)

During rollout, after generating action $a_t$, the agent is prompted to predict future states $\hat{s}_{t,h}$ for horizons $h=1,\dots,H$ (Eq. (2)). At the end of the rollout, the predictive reward matrix $\tilde{r} \in \mathbb{R}^{T\times H}$ is computed via Longest Matching Subsequence (LMS) between predicted and ground-truth states:

\tilde{r}_{t,h} = \operatorname{LMS}(\hat{s}_{t,h}, s_{t+h}) \in [0,1] \qquad (4)
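One way to obtain a score in $[0,1]$ from two state descriptions is sketched below. It reads LMS as a longest-common-subsequence match over whitespace tokens, normalized by the longer sequence; the tokenization, the normalization, and the function names are assumptions, since this summary does not specify them.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def predictive_reward(predicted_state, actual_state):
    """Hypothetical stand-in for LMS(s_hat, s): matched tokens / longer length."""
    pred, actual = predicted_state.split(), actual_state.split()
    if not pred or not actual:
        return 0.0
    return lcs_length(pred, actual) / max(len(pred), len(actual))

exact = predictive_reward("you open the drawer", "you open the drawer")
partial = predictive_reward("you open the drawer", "you open the cabinet")
```

Here `exact` is 1.0 and `partial` is 0.75, because three of the four tokens match in order.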

Task reward and predictive reward are combined as:

R_{\text{task}}(a_t) = \sum_{k=t}^T \gamma^{k-t} r_k, \quad R_{\text{pre}}(a_t) = \sum_{h=1}^H \gamma^{h-1} \tilde{r}_{t,h}, \quad R_t = R_{\text{task}}(a_t)(1 + R_{\text{pre}}(a_t)) \qquad (5,6)
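The combination in Eqs. (5,6) can be traced on a small example. The sketch below uses 0-based step indices and takes the predictive reward matrix as given; the function names and the toy values are illustrative assumptions.

```python
def discounted_task_return(rewards, t, gamma):
    """R_task(a_t): discounted sum of task rewards from step t to the end."""
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))

def predictive_return(pred_row, gamma):
    """R_pre(a_t): sum over h = 1..H of gamma^(h-1) * r~_{t,h}."""
    # enumerate starts at 0, which matches the exponent h - 1
    return sum(gamma ** i * r for i, r in enumerate(pred_row))

def combined_reward(rewards, pred_matrix, t, gamma):
    """R_t = R_task(a_t) * (1 + R_pre(a_t))."""
    return discounted_task_return(rewards, t, gamma) * (
        1.0 + predictive_return(pred_matrix[t], gamma)
    )

# Toy rollout: T = 3 steps, H = 2 horizons, success only at the last step.
rewards = [0.0, 0.0, 1.0]
pred_matrix = [[1.0, 0.5], [0.0, 0.0], [1.0, 1.0]]
r0 = combined_reward(rewards, pred_matrix, t=0, gamma=0.5)  # 0.25 * (1 + 1.25)
r1 = combined_reward(rewards, pred_matrix, t=1, gamma=0.5)  # 0.5 * (1 + 0.0)
failed = combined_reward([0.0, 0.0, 0.0], pred_matrix, t=0, gamma=0.5)
```

In this example `r0` is 0.5625 and `r1` is 0.5, while `failed` is 0.0 however accurate the predictions were.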

Because the combination is multiplicative, the predictive reward only scales actions whose task return is non-zero; accurate predictions alone cannot produce reward. State grouping collects actions taken under identical states (via hash) into groups $\mathcal{G}$, and the state-level advantage for action $a_t^{(o)}$ in group $o$ is:

A_S(a_t^{(o)}) = \frac{R_t^{(o)} - \operatorname{avg}(\{R_t^{(o)} | (s_t^{(o)},a_t^{(o)}) \in \mathcal{G}^{(o)}\})}{\operatorname{std}(\{R_t^{(o)} | (s_t^{(o)},a_t^{(o)}) \in \mathcal{G}^{(o)}\})} \qquad (8)

The final advantage combines the state-level and trajectory-level advantages: $A(a_t^{(i)}) = A_S(a_t^{(o)}) + \alpha \cdot A_E(\tau_i)$, where $\alpha$ weights the trajectory-level term.
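State grouping and the final combination can be sketched as follows. This is an illustration under assumptions: states are plain strings keyed by a SHA-256 digest, the standard deviation is the population form with a small `eps`, and singleton groups therefore get a state-level advantage of zero. None of these details are confirmed by the summary.

```python
import hashlib
from collections import defaultdict
from statistics import mean, pstdev

def state_key(state_text):
    """Hash a state description so identical states share one group."""
    return hashlib.sha256(state_text.encode("utf-8")).hexdigest()

def state_level_advantages(steps, eps=1e-8):
    """steps: list of (state_text, R_t). Returns A_S per step, in input order."""
    groups = defaultdict(list)
    for state, reward in steps:
        groups[state_key(state)].append(reward)
    advantages = []
    for state, reward in steps:
        group = groups[state_key(state)]
        advantages.append((reward - mean(group)) / (pstdev(group) + eps))
    return advantages

def final_advantage(a_state, a_traj, alpha=1.0):
    """A = A_S + alpha * A_E."""
    return a_state + alpha * a_traj

steps = [("kitchen", 1.0), ("kitchen", 0.0), ("hall", 0.5)]
a_s = state_level_advantages(steps)
combined = final_advantage(a_s[0], a_traj=0.5, alpha=1.0)
```

The two actions taken in the shared "kitchen" state are compared against each other, which is what gives step-level credit assignment beyond the single trajectory-level value.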

Agent-In-World (AIW)

For each failed trajectory, the LLM is prompted to identify failure modes (e.g., entity confusion, wrong target location) and generate reflections (failure type, core lesson, retrieval query). These failure modes are stored in an offline interaction history. The LLM then retrieves tasks with similar failure patterns and adds them back to the training set, dynamically adjusting the data distribution to focus on the agent’s weaknesses.
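The AIW loop can be sketched with a small data structure and a retrieval step. In the method described above the LLM itself produces the reflection and performs the retrieval; the lexical Jaccard similarity below is only a stand-in, and `Reflection`, `retrieve_similar_tasks`, `reshape_training_set`, and the example tasks are hypothetical names and data.

```python
from dataclasses import dataclass

@dataclass
class Reflection:
    failure_type: str      # e.g. "wrong target location"
    core_lesson: str       # short lesson extracted from the failed trajectory
    retrieval_query: str   # query used to look up related tasks

def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def retrieve_similar_tasks(reflection, task_pool, k=2):
    """Rank pool tasks by similarity to the reflection's retrieval query."""
    ranked = sorted(
        task_pool,
        key=lambda task: jaccard(reflection.retrieval_query, task),
        reverse=True,
    )
    return ranked[:k]

def reshape_training_set(training_set, reflections, task_pool, k=2):
    """Add retrieved tasks back, so weak spots are sampled more often."""
    augmented = list(training_set)
    for reflection in reflections:
        augmented.extend(retrieve_similar_tasks(reflection, task_pool, k))
    return augmented

pool = [
    "put a clean mug in the cabinet",
    "find the red mug on the desk",
    "heat an egg and put it in the fridge",
]
reflection = Reflection("wrong target location", "check the cabinet first",
                        "mug cabinet location")
top = retrieve_similar_tasks(reflection, pool, k=1)
new_set = reshape_training_set([pool[2]], [reflection], pool, k=2)
```

Duplicating retrieved tasks in the training list is one simple way to shift the sampling distribution; a weighted sampler would serve the same purpose.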

Empirical Validation / Results

Experiments are conducted on ALFWorld (household tasks), WebShop (e-commerce), and search-augmented QA (single-hop: NQ, TriviaQA, PopQA; multi-hop: HotpotQA, 2Wiki, MuSiQue, Bamboogle). Backbone models: Qwen2.5-1.5B/3B/7B-Instruct.

Table 1: Performance on ALFWorld and WebShop (with Qwen2.5-1.5B and 7B)

Backbone	Method	ALFWorld (Avg)	WebShop Score	WebShop Succ.
Qwen2.5-1.5B	GiGPO	86.7	83.1	65.0
	Role-Agent	90.9	87.7	71.9
Qwen2.5-7B	GiGPO	90.8	84.4	72.8
	Role-Agent	93.8	88.0	77.1

Role-Agent consistently outperforms GiGPO. With the 1.5B backbone, the absolute gains are 4.2 points on ALFWorld (86.7 to 90.9) and 6.9 points on WebShop success rate (65.0 to 71.9). Gains are larger on complex compositional tasks (e.g., +11.0% on the Look task).
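The 1.5B gains can be recomputed directly from Table 1. The helper below, a hypothetical `gains` function, reports both the absolute difference in points and the relative improvement over the baseline, which makes clear which of the two the 4.2 and 6.9 figures correspond to.

```python
def gains(baseline, ours):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

alfworld = gains(86.7, 90.9)       # GiGPO vs. Role-Agent, ALFWorld (Avg)
webshop_succ = gains(65.0, 71.9)   # GiGPO vs. Role-Agent, WebShop Succ.
print(alfworld)      # (4.2, 4.8)
print(webshop_succ)  # (6.9, 10.6)
```

The 4.2 and 6.9 figures match the absolute differences; the corresponding relative improvements are about 4.8% and 10.6%.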

Table 2: Search-Augmented QA (Qwen2.5-3B)

Method	NQ	TriviaQA	PopQA	HotpotQA	2Wiki	MuSiQue	Bamboogle	Avg.
GiGPO	42.0	59.5	42.4	36.9	37.0	12.6	64.1	42.1
Role-Agent	40.1	60.4	49.8	38.8	45.2	17.8	68.4	45.8

Role-Agent shows its strongest gains on multi-hop tasks (+8.2 points on 2Wiki, +5.2 points on MuSiQue); NQ is the only dataset where it trails GiGPO (40.1 vs. 42.0).

Ablation (Table 3) with Qwen2.5-1.5B:

Method	ALFWorld	WebShop	Average
Role-Agent	90.9	71.9	81.4
- w/o AIW	87.5	66.9	77.2
- w/o Predictive Reward	88.0	68.3	78.2
GiGPO	86.7	65.0	75.9

Both components matter, with AIW removal causing the larger drop (especially on WebShop, -5.0 points). Both ablated variants still outperform GiGPO, indicating that the two components contribute complementary gains.

Hyper-parameter Sensitivity (Table 4):

Hyper-param	Value	ALFWorld	WebShop	Average
$\alpha$	0.5	89.5	71.0	80.2
	1.0	90.9	71.9	81.4
	2.0	86.0	65.4	75.7
$H$	5% $T_{\max}$	90.9	71.9	81.4
	10% $T_{\max}$	90.2	68.5	79.3
	20% $T_{\max}$	75.6	62.3	69.0

Optimal balance: $\alpha=1.0$ (equal weight to trajectory- and state-level advantages), $H=5\%\cdot T_{\max}$ (prediction horizon 5% of max steps).

Running Dynamics (Figure 3): Role-Agent reaches a higher performance ceiling (90.9%) and converges faster than GiGPO, and it reduces train-inference mismatch.

Efficiency (Figure 6): The extra costs from predictions, predictive reward computation, and AIW are minor (18.63s + 0.14s + 8.92s respectively, about 27.7s in total), amounting to only about 5.2% extra computation over GiGPO.

Theoretical and Practical Implications

Theoretical: Demonstrates that a single LLM can serve dual roles (agent and environment) to achieve bootstrapped co-evolution without separate models. The predictive reward mechanism provides a form of internal world model, enhancing environment-aware reasoning, while failure-mode-driven data reshuffling enables adaptive, targeted learning.
Practical: Role-Agent yields consistent improvements across diverse text-based interactive benchmarks with small extra computational overhead (~5.2%). The method is especially effective on complex compositional tasks requiring multi-step planning and memory, and on multi-hop QA. The framework is model-agnostic in design (it was evaluated here on Qwen2.5 backbones), offering a practical path for post-training improvements without external models.

Conclusion

Role-Agent introduces a bootstrapped framework for agent-environment co-evolution using a single LLM in dual roles. The World-In-Agent (WIA) component encourages environment-aware reasoning by rewarding accurate future-state prediction and enabling state-grouped credit assignment. The Agent-In-World (AIW) component dynamically reshapes training data by analyzing failure modes and retrieving analogous tasks, providing targeted practice. Extensive experiments on ALFWorld, WebShop, and search-augmented QA demonstrate consistent performance improvements over strong baselines (GRPO, GiGPO) with only marginal extra cost.

Future work includes extending to multi-modal or real-time embodied settings, which would require vision-language state descriptions or latent-state matching, and exploring the use of stronger, possibly frozen, environment LLMs for the AIW component.