Visual Summary | Qwen-AgentWorld: Language World Models for General Agents

Summary (Overview)

Qwen‑AgentWorld is the first family of native language world models (LWM) that simulate seven agent‑interaction domains (MCP, Search, Terminal, SWE, Android, Web, and OS) using long chain‑of‑thought reasoning, at two scales: 35B‑A3B and 397B‑A17B.
Three‑stage training (“CPT injects, SFT activates, RL sharpens”) progressively injects environment world knowledge, activates next‑state‑prediction reasoning, and refines simulation fidelity via a hybrid rubric‑and‑rule reward.
AgentWorldBench is a 2,170‑sample benchmark built from real environment interactions of five frontier models on nine established agent benchmarks. Predictions are evaluated across five dimensions (Format, Factuality, Consistency, Realism, Quality) by a reference‑grounded LLM judge.
Qwen‑AgentWorld‑397B‑A17B achieves the highest overall score (58.71) on AgentWorldBench, surpassing GPT‑5.4 (58.25), Claude Opus 4.6 (57.80), and all other frontier and open‑weight models.
Two complementary paradigms demonstrate how world modeling improves general agents: (i) as a decoupled environment simulator enabling scalable and controllable simulation that surpasses real‑environment training; (ii) as a unified agent foundation model where LWM warm‑up consistently improves downstream agent performance across 7 diverse benchmarks, including out‑of‑domain gains of up to +11.3 points.

Introduction and Theoretical Foundation

World models, which predict environment dynamics given current observations and actions, are widely recognized as a cornerstone of general intelligence (LeCun et al., 2022; Hafner et al., 2023). Richens et al. (2025) strengthen this claim by proving that any agent capable of generalizing across a sufficiently broad range of tasks must contain a world model. However, current LLM‑agent research has focused almost exclusively on the policy side (state → action). The paper argues that world modeling is a crucial missing piece on the path to general agents.

The authors formalize a Language World Model (LWM) as a conditional text generator:

[ \hat{o}_{t+1} = f_\theta(c, o_{\leq t}, a_{\leq t}) ]

where (c) is the system prompt, (o_t) the environment observation at turn (t), and (a_t) the agent’s action. The objective is to predict the ground‑truth observation (o_{t+1}).
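
The formulation above can be sketched as expanding one multi‑turn trajectory into turn‑level (context, target) pairs. This is a minimal illustration, not the authors' data pipeline: the `Turn` class, the `expand_trajectory` helper, and the dictionary layout of the context are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    action: str       # a_t: the agent's action at turn t
    observation: str  # o_{t+1}: what the real environment returned


def expand_trajectory(system_prompt, initial_observation, turns):
    """Return one (context, target) pair per turn.

    The context holds c plus every observation and action up to and
    including a_t; the target is the ground-truth observation o_{t+1}.
    The dict layout is an assumption made for illustration.
    """
    samples = []
    history = [("observation", initial_observation)]
    for turn in turns:
        history.append(("action", turn.action))
        context = {"system": system_prompt, "history": list(history)}
        samples.append((context, turn.observation))
        # The real observation becomes part of the context for later turns.
        history.append(("observation", turn.observation))
    return samples
```

Each pair conditions on the real history rather than on earlier predictions, which matches the stated objective of predicting the ground‑truth observation at every turn.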

Seven domains are unified under a shared trajectory schema, with observations ranging from terminal outputs to UI view hierarchies. The diversity of domains ensures that a single world‑modeling objective simultaneously exercises reasoning, factual knowledge, and long‑context understanding—capabilities that are foundational for general agents.

The motivation is not cost reduction, but rather a complementary axis for scaling agents:

Decoupling – using the LWM as a simulator provides turn‑level scalability (infinite environments without sandbox infrastructure) and controllability (targeted perturbations that expose agent weaknesses).
Unifying – a capable general agent should possess both decision‑making and world‑modeling abilities; next‑state prediction can be internalised as a meta‑level thinking pattern analogous to “reflection” but oriented toward the future.

Methodology

Training Recipe: Three‑Stage Pipeline

Stage 1 – Continual Pre‑Training (CPT)

Data: >10M environment trajectories from dedicated agent infrastructure, open interaction traces, and in‑house agentic logs, plus specialised‑domain world knowledge corpora (industrial control, cybersecurity, law, medicine, finance, current affairs).
Objective: Standard next‑token prediction. Multi‑turn trajectories are expanded into turn‑level prediction samples; the language‑modeling objective maps directly onto (p(o_{t+1} \mid o_{\leq t}, a_{\leq t})).
Turn‑Level Information‑Theoretic Loss Masking: To avoid learning from low‑information turns (e.g., echo, boilerplate), four statistics (Overlap, Novelty, Jaccard, length ratio) classify turns into seven categories (Table 3) and mask low‑value turns from the loss while retaining them as context.

Category	Keep ratio	Example
retrieval	100%	read_file → contents
expansion	100%	fetch → page + metadata
action	100%	send_email → “sent”
transform	50%	long input → status
boilerplate	10%	API echo
echo	5%	think(x) → {thought:x}
other	100%	–
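
The masking step can be sketched as follows. The keep ratios are copied from the table above; everything else is an assumption for illustration. In particular, the whitespace‑token definitions of the four statistics and the helper names `turn_statistics` and `sample_loss_mask` are hypothetical, and the rules that map statistics to the seven categories are not given in this summary, so they are left out.

```python
import random

# Keep ratios copied from the table above.
KEEP_RATIO = {
    "retrieval": 1.00, "expansion": 1.00, "action": 1.00,
    "transform": 0.50, "boilerplate": 0.10, "echo": 0.05, "other": 1.00,
}


def turn_statistics(action: str, observation: str) -> dict:
    """Four turn-level statistics under assumed token-set definitions."""
    a, o = set(action.split()), set(observation.split())
    shared = a & o
    return {
        # Share of observation tokens that also appear in the action.
        "overlap": len(shared) / len(o) if o else 0.0,
        # Share of observation tokens that are new relative to the action.
        "novelty": len(o - a) / len(o) if o else 0.0,
        "jaccard": len(shared) / len(a | o) if a | o else 0.0,
        "length_ratio": len(observation.split()) / max(len(action.split()), 1),
    }


def sample_loss_mask(categories, rng):
    """Return one bool per turn: True means the turn contributes to the loss.

    Masked turns are still kept in the context; only their loss is dropped.
    """
    return [rng.random() < KEEP_RATIO[c] for c in categories]
```

Sampling per turn (rather than dropping a fixed subset) keeps roughly the stated fraction of each category in expectation while still exposing the model to every turn as context.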

Stage 2 – Supervised Fine‑Tuning (SFT)

Objective: Activate next‑state prediction as an explicit reasoning pattern. Standard next‑token prediction with a 256k‑token context window.
Data curation: Prompt template diversification (10 variants) and rejection sampling. Starting from 10,250 queries, 7,094 trajectories survive (69.2% retention). Average turns per trajectory: 8.5.

Stage 3 – Reinforcement Learning (RL)

Algorithm: GSPO (Zheng et al., 2025). Prompt capped at 128k tokens to handle extreme prompt–output asymmetry.
Reward design: A hybrid of a five‑dimensional rubric (an LLM judge scores Format, Factuality, Consistency, Realism, and Quality on a 1–5 scale; the mean is multiplied by 5, giving a value in [5, 25]) and a rule‑based verifier (a binary pass/fail outcome scaled to 0 or 25), combined at a 9:1 ratio.
Training stability: Addressed via three mitigations: (i) restrict expansion to one turn per trajectory (avoid reward collapse from shared prefixes); (ii) reward shaping (rubric+rule outperforms reference‑reward and Turing‑test reward); (iii) prevent reward hacking through self‑praise by using strict tag extraction and content‑type classification.
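
The reward arithmetic can be sketched as below. The rubric and rule scalings follow the description above. How the 9:1 ratio is applied is not spelled out in this summary, so the weighted sum with weights 0.9 (rubric) and 0.1 (rule) is one possible reading and should be treated as an assumption, as are the function names.

```python
DIMENSIONS = ("format", "factuality", "consistency", "realism", "quality")


def rubric_reward(scores: dict) -> float:
    """Mean of the five 1-5 judge scores, multiplied by 5: range [5, 25]."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} score must be in [1, 5]")
    return 5.0 * sum(scores[dim] for dim in DIMENSIONS) / len(DIMENSIONS)


def rule_reward(passed: bool) -> float:
    """Binary verifier outcome scaled to 0 or 25."""
    return 25.0 if passed else 0.0


def hybrid_reward(scores: dict, passed: bool, rubric_weight: float = 0.9) -> float:
    """Assumed combination: a 9:1 weighted sum of rubric and rule rewards."""
    return (rubric_weight * rubric_reward(scores)
            + (1.0 - rubric_weight) * rule_reward(passed))
```

Under this reading, a prediction scored 4, 3, 5, 4, 4 by the judge has a rubric reward of 20; if it fails the rule check, the combined reward is 0.9 × 20 = 18.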

AgentWorldBench

Construction:

Drawn from 9 established benchmarks (Terminal‑Bench 1.0 & 2.0, MCPMark, Tool Decathlon, WideSearch, in‑house SWE, WebArena Verified, AndroidWorld, OSWorld‑Verified, etc.).
Generated by 5 frontier agents (Claude Opus 4.6/4.8, GPT‑5.4, Gemini 3.1 Pro, DeepSeek‑V4‑Pro, Qwen‑family) executing real environments.
2,170 turn‑level evaluation samples; asymmetric turn sampling (first, last, three intermediate turns) for text domains, selective sampling for GUI domains.
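
A minimal sketch of the asymmetric turn sampling for text domains: always keep the first and last turn, plus three intermediate turns. The summary does not say how the intermediate turns are chosen, so the even spacing used here is an assumption, and `sample_turn_indices` is a hypothetical helper name.

```python
def sample_turn_indices(num_turns: int, num_intermediate: int = 3) -> list[int]:
    """Pick the first turn, the last turn, and a few intermediate turns.

    Intermediate turns are spaced evenly between the two ends; this
    spacing rule is an assumption, not taken from the source.
    """
    if num_turns <= 0:
        return []
    if num_turns <= num_intermediate + 2:
        # Short trajectory: every turn is evaluated.
        return list(range(num_turns))
    last = num_turns - 1
    step = last / (num_intermediate + 1)
    middle = [round(step * k) for k in range(1, num_intermediate + 1)]
    return sorted({0, last, *middle})
```

For a nine‑turn trajectory this selects turns 0, 2, 4, 6, and 8 (zero‑based), so long trajectories contribute the same number of samples as moderately long ones.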

Evaluation Protocol:

Open‑ended rubric judged by an LLM (GPT‑5.2 selected via Turing‑test calibration, with cross‑judge Spearman ρ=0.92–0.99).
Reference‑grounded: judge compares predicted vs. real observation, reducing subjective bias.
Differentiated matching: deterministic content → exact match; pre‑existing content → plausibility; runtime metadata → format/range verification.
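
Cross‑judge agreement of the kind reported above is measured with Spearman's rank correlation. The sketch below is a plain‑Python version using average ranks for ties; it illustrates the statistic only and is not the authors' evaluation code. Any scores passed to it in an example are toy values, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank vectors of x and y.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because the statistic depends only on ranks, two judges that order models the same way agree perfectly even if their absolute scores differ.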

Empirical Validation / Results

Main Results (AgentWorldBench)

Model	MCP	Search	Term.	SWE	Android	Web	OS	Avg.
Claude Opus 4.8	54.93	35.14	59.18	64.10	61.50	54.66	66.62	56.59
Claude Opus 4.6	69.90	29.30	57.51	64.55	61.74	51.42	70.20	57.80
GPT‑5.4	70.10	37.26	53.69	66.29	60.00	51.80	68.58	58.25
Gemini 3.1 Pro	59.07	30.21	52.47	59.07	61.40	52.83	66.92	54.57
Qwen‑AgentWorld‑397B‑A17B	68.24	37.82	57.73	68.49	60.20	50.98	67.89	58.71

Qwen‑AgentWorld‑397B‑A17B achieves the highest overall average (58.71), surpassing GPT‑5.4 (58.25).
Its advantage over GPT‑5.4 is most pronounced on Terminal (+4.04) and SWE (+2.20).
Search remains the hardest domain (best 37.82).
World‑model training lifts the 35B model by +8.66 points (47.73→56.39) and the 397B model by +3.97 points (54.74→58.71).
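
The per‑domain gaps quoted above can be reproduced directly from the table. The snippet below copies the two relevant rows and computes the difference per domain; the names `SCORES` and `domain_gaps` are chosen for this example only.

```python
DOMAINS = ("MCP", "Search", "Terminal", "SWE", "Android", "Web", "OS")

# Rows copied from the AgentWorldBench results table above.
SCORES = {
    "GPT-5.4": (70.10, 37.26, 53.69, 66.29, 60.00, 51.80, 68.58),
    "Qwen-AgentWorld-397B-A17B": (68.24, 37.82, 57.73, 68.49, 60.20, 50.98, 67.89),
}


def domain_gaps(model: str, baseline: str) -> dict:
    """Per-domain score difference (model minus baseline), rounded to 2 dp."""
    return {
        domain: round(m - b, 2)
        for domain, m, b in zip(DOMAINS, SCORES[model], SCORES[baseline])
    }


gaps = domain_gaps("Qwen-AgentWorld-397B-A17B", "GPT-5.4")
# The two largest positive gaps, largest first.
largest = sorted(gaps, key=gaps.get, reverse=True)[:2]
```

The gaps are positive on Search, Terminal, SWE, and Android and negative on MCP, Web, and OS, so the overall lead comes from a few domains rather than a uniform margin.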

Cross‑Domain Generalization

RL training on Terminal data alone improves the in‑domain Terminal score and all three held‑out text domains (SWE, Search, MCP):

Terminal: +14.2 points; SWE: +11.5; Search: +11.8; MCP: +5.0 (Figure 8).
Indicates that RL reinforces generalizable world knowledge rather than domain‑specific formats.

Applications

Application I: Environment Simulator

Generalizable Environment Scaling (OpenClaw – zero‑shot):

Model	Claw‑Eval	QwenClawBench
Qwen3.5‑35B‑A3B (base)	65.4	47.9
Sim RL (w/ Qwen‑AgentWorld‑397B)	69.7	55.0
Δ	+4.3	+7.1

Simulating 4k synthetic OpenClaw environments yields gains of +4.3 on Claw‑Eval and +7.1 on QwenClawBench without any domain‑specific adaptation.

Controllable Simulation (MCP):

Model	Tool Decathlon	MCPMark (Avg)
Base SFT	32.4	21.5
Sim RL (uncontrolled)	31.5	24.6
Sim RL (controlled)	36.1	33.8
Δ	+3.7	+12.3

Controllable perturbations (partial results, intermittent errors) turn ineffective Sim RL into a strong training signal.

Fictional‑World Construction (Search):

Model	WideSearch F1 Item	WideSearch F1 Row
Qwen3.5‑35B‑A3B‑SFT	34.02	13.72
Sim RL (controlled, 35B)	50.31	24.21
Δ	+16.29	+10.49
Qwen3.5‑397B‑A17B‑SFT	70.11	45.69
Sim RL (controlled, 397B)	73.98	51.74
Δ	+3.87	+6.05

Agents trained entirely in fictional, self‑consistent worlds generalize to real search tasks. Controllable Sim RL surpasses Real RL (50.3% vs 45.6% F1 Item) by forcing agents to issue more web_extractor calls (Figure 9).

Application II: Agent Foundation Model

Single‑turn, non‑agentic LWM RL warm‑up (no tool calls) transfers to multi‑turn, tool‑calling agentic tasks:

Model	Terminal‑Bench 2.0	SWE‑Bench Verified	SWE‑Bench Pro	WideSearch (F1 Item / F1 Row)	Claw‑Eval	QwenClawBench	BFCL v4	Avg
Base SFT	33.25	64.47	42.18	33.38 / 13.27	53.60	39.76	62.29	62.29
+ LWM RL	39.55	67.86	47.42	46.17 / 20.14	64.88	49.43	71.25	71.25
Δ	+6.30	+3.39	+5.24	+12.79 / +6.87	+11.28	+9.67	+8.96	+8.96

Both in‑domain benchmarks and out‑of‑domain benchmarks (Claw‑Eval, QwenClawBench, BFCL v4) benefit.
Prediction accuracy on Terminal‑Bench 2.0 trajectories improves from 69.9% to 78.3% (+8.4 percentage points, Figure 10).
Case study (mailman task, Figure 11) shows the model uses internal world‑model predictions to refine actions before execution.

Theoretical and Practical Implications

Theoretical: The work provides empirical support for the claim (Richens et al., 2025) that world models are necessary for general agents. The unified view of world modeling as both a standalone simulator and an intrinsic component of an agent aligns with frameworks like LeCun et al. (2022) and DreamerV3, but extends them to text‑based agentic environments.
Practical Applications:
- Scalable training: LWMs eliminate the need for costly sandbox infrastructure by simulating thousands of environments (including rare edge cases) with controllable perturbations. This enables agentic RL in domains where real execution is infeasible (proprietary, irreversible, or absent).
- Beyond‑real training: Controllable simulation can systematically expose agent weaknesses (e.g., partial results, intermittent errors) that real environments rarely produce, leading to agents that outperform those trained only in real settings.
- Warm‑up for agents: LWM training serves as a highly effective pre‑training stage for building stronger agent foundation models, transferring across tasks and domains without additional fine‑tuning.
Methodological: The three‑stage training pipeline (“CPT injects, SFT activates, RL sharpens”) and the design of hybrid rewards (rubric + rule) provide a reusable recipe for training language world models. The cross‑domain generalization result (RL on Terminal alone improves all text domains) indicates that world knowledge is transferable, reducing the need for per‑domain data.

Conclusion

Qwen‑AgentWorld is the first native language world model covering seven agent interaction domains within a single model at two scales (35B‑A3B and 397B‑A17B). The three‑stage pipeline—CPT (injection of world knowledge), SFT (activation of next‑state‑prediction reasoning), and RL (sharpening of fidelity via hybrid rewards)—establishes a new paradigm for environment simulation in language models. AgentWorldBench, a comprehensive benchmark with 2,170 samples from real environment interactions of frontier agents, provides a robust evaluation framework.

The work demonstrates two complementary ways to leverage world modeling for general agents:

As a decoupled simulator, enabling scalable and controllable simulation that surpasses real‑environment training, with gains of up to +12.3 on MCPMark and +16.3 on WideSearch.
As a unified agent foundation model, where LWM warm‑up consistently improves downstream agent performance across 7 benchmarks by an average of +8.96 points, including out‑of‑domain gains of +11.3, +9.7, and +9.0.

Future directions include agent–LWM co‑evolution (self‑play between agent and world model), multimodal extension (fusing GUI screenshots with text), adaptive sim‑to‑real routing, and dynamic tool synthesis. By establishing next‑state prediction as a transferable agent foundation, language world modeling opens a new axis for scaling general agents beyond what real‑environment interaction alone can provide.