MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Summary (Overview)

  • Key Contribution: Introduces MultiWorld, a unified framework for scalable multi-agent, multi-view video world modeling that addresses three core challenges: multi-agent controllability, multi-view consistency, and framework scalability.
  • Core Modules: Proposes the Multi-Agent Condition Module (MACM) for precise control of multiple agents via Agent Identity Embedding and Adaptive Action Weighting, and the Global State Encoder (GSE) using a VGGT backbone to ensure multi-view consistency by encoding a shared 3D-aware environment state.
  • Scalability: Framework supports arbitrary numbers of agents and camera views without architectural changes, enabling parallel view generation and stable long-horizon autoregressive simulation.
  • Empirical Validation: Demonstrates superior performance on multi-player game (It Takes Two) and multi-robot manipulation (RoboFactory) datasets, outperforming baselines in video fidelity (FVD), action-following ability, and multi-view consistency (RPE).
  • Practical Application: Shows capability to simulate realistic failure trajectories and long-horizon scenarios (up to 4× training context) for collaborative robotics, highlighting potential for downstream tasks like planning and simulation.

Introduction and Theoretical Foundation

Video world models, typically formulated as action-conditioned video generation models, have succeeded in simulating environmental dynamics. However, existing models implicitly assume a single-agent scenario, ignoring the complex interactions and interdependencies inherent in real-world multi-agent systems (e.g., collaborative robotics, multi-player games). Furthermore, multi-agent simulation inherently requires generating consistent observations across multiple views, as each agent perceives the shared environment from its own viewpoint. Previous models fail to preserve this multi-view consistency.

Extending world models to multi-agent, multi-view scenarios introduces three key challenges:

  1. Multi-Agent Controllability: Controlling multiple agents requires associating specific actions with corresponding agents and synchronizing their executions. Simple stacking of actions leads to identity ambiguity (e.g., "mirror actions").
  2. Multi-View Consistency: The model must synthesize visually coherent videos across diverse perspectives, ensuring observations from multiple agents remain geometrically consistent within a shared environment.
  3. Framework Scalability: Real-world environments involve variable numbers of agents and views. Previous works assume fixed counts or predefined setups, limiting applicability.

MultiWorld addresses these challenges by proposing a scalable framework that enables flexible scaling of agent and view counts. The theoretical foundation combines:

  • Flow Matching (FM) as the generative backbone for conditional video distribution.
  • Transformer architecture for sequential modeling with causal masking to prevent future information leakage.
  • Rotary position embeddings (RoPE), used for relative agent-identity encoding.
  • 3D-aware latent representations from a pre-trained reconstruction model (VGGT) to encode a global environment state.

Methodology

3.1 Backbone and Notation

The problem is formulated as a collection of $C$ image-action-conditioned video generation tasks, where $C$ is the number of camera views. Videos from different views are synthesized in parallel, conditioned on a shared global environment state.

  • Notation: $K$ agents, $C$ camera views (independent). Let $a_i = (a_i^1, \ldots, a_i^K)$ denote the joint action of all $K$ agents at frame $i$, and $a = \{a_0, \ldots, a_I\}$ the full action sequence. The video from camera $c$ is $x^c$. The environment observation is $o = \{o^c\}_{c=1}^C$, where $o^c$ is the initial frame of $x^c$.
  • Flow Matching (FM): For each view $c$, the conditional distribution of future frames is modeled. Sampling $t \sim U(0,1)$ and noise $\epsilon \sim \mathcal{N}(0, I)$, the noisy observation and target velocity are:

$$x_t^c = (1 - t)\,x^c + t\,\epsilon, \qquad u = \epsilon - x^c.$$

The velocity network is parameterized as $v_\theta(x_t^c, t, a, o)$ and trained to predict $u$.
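The Flow Matching interpolation and target above can be sketched numerically; the snippet below uses toy shapes (8 frames, 16-dim latents) as stand-ins for a real video latent, and the final check confirms that moving the noisy sample along the target velocity for the remaining time recovers the noise endpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: one view's video latent with I frames of D-dim tokens.
I, D = 8, 16
x_c = rng.normal(size=(I, D))          # clean video latent x^c for view c

# Sample a flow-matching time t ~ U(0, 1) and Gaussian noise eps ~ N(0, I).
t = rng.uniform()
eps = rng.normal(size=(I, D))

# Noisy observation and target velocity, as in the formulation above:
#   x_t^c = (1 - t) * x^c + t * eps,   u = eps - x^c
x_t = (1.0 - t) * x_c + t * eps
u = eps - x_c

# Sanity check: moving x_t along u for the remaining time (1 - t) reaches eps.
assert np.allclose(x_t + (1.0 - t) * u, eps)
```

In training, $v_\theta$ would be regressed onto `u` with a mean-squared-error loss; the check here only validates the interpolation algebra.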

  • Causal Masking: A frame-wise causal mask is applied to the action cross-attention so that tokens at frame $i$ attend only to actions from frames $\{0, \ldots, i\}$, supporting stable autoregressive generation.
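A frame-wise causal mask for this cross-attention can be built in a few lines. The sketch below assumes one action token per frame as key and several video tokens per frame as queries (shapes are illustrative, not from the paper):

```python
import numpy as np

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask where entry [q, k] is True iff a video token of frame i
    may attend to the action token of frame j <= i (frame-wise causality)."""
    # Frame index of each video (query) token.
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # Keys: one action token per frame (simplifying assumption).
    return frame_id[:, None] >= np.arange(num_frames)[None, :]

mask = frame_causal_mask(num_frames=4, tokens_per_frame=2)
# Tokens of frame 0 see only action 0; tokens of frame 3 see actions 0..3.
assert mask[0].tolist() == [True, False, False, False]
assert mask[-1].all()
```

In an attention layer, positions where the mask is `False` would receive $-\infty$ logits before the softmax, which is what prevents future-action leakage during autoregressive rollout.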

3.2 Multi-Agent Condition Module (MACM)

MACM processes multi-agent actions to resolve identity ambiguity and prioritize dynamic agents.

Agent Identity Embedding (AIE). To support variable agent counts and eliminate identity confusion, AIE uses Rotary Position Embedding (RoPE). Given the action latent at frame $f$, $a_f \in \mathbb{R}^{K \times D}$ ($D$ is the latent dimension), a rotation matrix $R_{\Theta,i}$ is applied to each agent $i \in \{1, \ldots, K\}$:

$$\text{AIE}(a^i, i) = R_{\Theta,i}\, a^i$$

where $R_{\Theta,i}$ is defined by pre-computed frequencies $\theta_j = b^{-2j/D}$, with $b$ the base frequency constant. For each pair of dimensions $(2j, 2j+1)$:

$$\begin{bmatrix} a^{(2j)} \\ a^{(2j+1)} \end{bmatrix}_{\text{out}} = \begin{bmatrix} \cos(i\theta_j) & -\sin(i\theta_j) \\ \sin(i\theta_j) & \cos(i\theta_j) \end{bmatrix} \begin{bmatrix} a^{(2j)} \\ a^{(2j+1)} \end{bmatrix}_{\text{in}}$$

The attention between agents $m$ and $n$ then becomes:

$$(R_m a_m)^\top (R_n a_n) = a_m^\top R_m^\top R_n\, a_n = a_m^\top R_{n-m}\, a_n.$$

This provides relative identity encoding.
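The relative-identity property can be verified numerically. The sketch below rotates dimension pairs of two agents' action latents (using base $b = 20$, following the ablation) and checks that the inner product depends only on $n - m$; the helper name `rope` and the latent size are ours.

```python
import numpy as np

def rope(a: np.ndarray, pos: int, base: float = 20.0) -> np.ndarray:
    """Rotate dimension pairs (2j, 2j+1) of an action latent by pos * theta_j,
    with theta_j = base**(-2j/D), as in the AIE formulation."""
    D = a.shape[-1]
    j = np.arange(D // 2)
    theta = base ** (-2.0 * j / D)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    a_even, a_odd = a[..., 0::2], a[..., 1::2]
    out = np.empty_like(a)
    out[..., 0::2] = cos * a_even - sin * a_odd   # first row of 2x2 rotation
    out[..., 1::2] = sin * a_even + cos * a_odd   # second row
    return out

rng = np.random.default_rng(0)
a_m, a_n = rng.normal(size=16), rng.normal(size=16)
m, n = 1, 3

# Relative identity: <R_m a_m, R_n a_n> equals <a_m, R_{n-m} a_n>.
lhs = rope(a_m, m) @ rope(a_n, n)
rhs = a_m @ rope(a_n, n - m)
assert np.isclose(lhs, rhs)
```

Because only the identity difference $n - m$ enters the attention score, the same embedding extrapolates to agent counts not seen during training.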

Adaptive Action Weighting (AAW) An MLP predicts adaptive weighting factors for each action token. Tokens are multiplied by their weights and summed into a unified representation per frame, allowing the model to focus on active agents driving environmental change.
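A minimal AAW sketch, assuming a two-layer MLP scoring head and softmax-normalized weights (the normalization choice and all sizes are our assumptions, not specified above):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, H = 3, 16, 32                      # agents, latent dim, hidden dim

# Hypothetical 2-layer MLP mapping each agent's action token to a scalar score.
W1, b1 = rng.normal(size=(D, H)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(H, 1)) * 0.1, np.zeros(1)

def adaptive_action_weighting(a_f: np.ndarray) -> np.ndarray:
    """a_f: (K, D) action tokens at one frame -> (D,) unified representation."""
    h = np.maximum(a_f @ W1 + b1, 0.0)           # ReLU hidden layer
    scores = (h @ W2 + b2).squeeze(-1)           # (K,) one score per agent
    w = np.exp(scores) / np.exp(scores).sum()    # softmax weights (assumption)
    return (w[:, None] * a_f).sum(axis=0)        # weighted sum over agents

fused = adaptive_action_weighting(rng.normal(size=(K, D)))
assert fused.shape == (D,)
```

The weighted sum collapses $K$ agent tokens into one representation per frame, letting tokens of agents that are actively changing the scene dominate the condition signal.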

3.3 Global State Encoder (GSE)

GSE aggregates multi-view observations into a compact, 3D-aware global environment state to ensure consistency.

  • Process: Given the multi-view observation set $O = \{O^c\}_{c=1}^C$ of $C$ images $O^c \in \mathbb{R}^{3 \times H \times W}$, a frozen VGGT backbone (an end-to-end 3D reconstruction model) encodes them:

$$H_{\text{vggt}} = \text{VGGT}(O), \qquad H_{\text{vggt}} \in \mathbb{R}^{C \times n \times d},$$

where $n$ is the number of tokens per image and $d$ is the latent dimension.

  • Alignment: An MLP aligns the features with the DiT backbone: $H = \text{MLP}(H_{\text{vggt}})$. $H$ is injected via cross-attention to condition generation.
  • Advantages: Improves multi-view spatial consistency, supports arbitrary view counts by compression, and enables parallel view generation.
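The alignment step alone can be sketched as follows. The VGGT features here are random stand-ins (the real frozen backbone is omitted), the alignment "MLP" is reduced to a single projection for brevity, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
C, n, d_vggt, d_dit = 4, 64, 1024, 768   # views, tokens/image, VGGT dim, DiT dim

# Stand-in for frozen VGGT features over C views (real model omitted here).
H_vggt = rng.normal(size=(C, n, d_vggt))

# Alignment projection into the DiT latent dimension (single layer for brevity;
# the actual module is an MLP).
W = rng.normal(size=(d_vggt, d_dit)) * 0.02

# Flatten views into one global-state token sequence used as cross-attention
# context: the DiT architecture itself never depends on C, which is what
# allows arbitrary view counts.
H = (H_vggt @ W).reshape(C * n, d_dit)
assert H.shape == (C * n, d_dit)
```

Because every view's generation cross-attends to the same global state $H$, the views are tied to one shared 3D-aware description of the scene rather than being conditioned independently.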

3.4 Scalable Framework

  • Agent Scalability: AIE's relative embeddings extrapolate to unseen positions, accommodating an arbitrary number of agents $K$ without architectural changes.
  • View Scalability: GSE compresses variable-length observations into a unified global state, decoupling computation from the view count $C$. Generating different views in parallel keeps inference latency nearly constant as resources scale (~1.5× speedup over sequential generation in the double-view setting).
  • Autoregressive Long-Horizon Simulation: Videos are generated chunk-wise. The last frames of a chunk update the global state via GSE for the next chunk, enabling stable generation beyond training context length (up to 2× with minimal degradation, extendable to 4×).
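The chunk-wise rollout can be sketched with a toy dynamics function standing in for the video model (names `generate` and `rollout` are ours; in the real system the state update would pass the chunk's last frames through GSE):

```python
import numpy as np

def generate(state, actions_chunk):
    """Stand-in for the conditional video model: returns one chunk of frames
    as a toy deterministic integration of the actions."""
    return state + np.cumsum(actions_chunk, axis=0)

def rollout(init_state, actions, chunk_len=4):
    """Chunk-wise autoregression: the last frame of each chunk re-seeds the
    global state (via GSE in the real model) for the next chunk."""
    frames, state = [], init_state
    for start in range(0, len(actions), chunk_len):
        chunk = generate(state, actions[start:start + chunk_len])
        frames.append(chunk)
        state = chunk[-1]                              # update global state
    return np.concatenate(frames, axis=0)

actions = np.ones((16, 2))              # 4x a 4-frame "training context"
video = rollout(np.zeros(2), actions)
assert video.shape == (16, 2) and np.allclose(video[-1], 16.0)
```

The loop structure is the point: each iteration only ever conditions on the most recent state, so horizon length is bounded by error accumulation rather than by the model's context window.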

Empirical Validation / Results

Datasets & Metrics

  • Multi-player game: 500 hours (100 hours cleaned) of real-player data from It Takes Two, 2560×1440 resolution, over 21M frames.
  • Multi-robot manipulation: Constructed using RoboFactory, tasks with 2-4 agents, variable camera viewpoints.
  • Metrics:
    • Visual quality: FVD, PSNR, SSIM, LPIPS.
    • Multi-view consistency: Reprojection Error (RPE).
    • Action-following: Inverse Dynamics Model (IDM) accuracy.

Baselines

  1. Standard Image-Action-to-Video Model: Treats each view independently.
  2. Concatenated-View Video World Model: Combines views into a single video (fixed view count, memory-intensive).
  3. COMBO: Compositional multi-agent model (two-stage: train single-agent models, then combine). Neglects inter-agent interactions.

Implementation Details

  • Backbone: Wan2.2-5B model.
  • Resolution: 320×320 (game), 320×256 (robotics).
  • Training: 40,000 iterations, learning rate 5e-5, cosine scheduler, batch size 64 on 8 NVIDIA A800 GPUs (~4 days).

Main Results

Table 1: Quantitative comparison across scenarios

| Method | FVD ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ | Action ↑ | RPE ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| **Multi-Player Video Game** | | | | | | |
| Standard | 245 | 0.36 | 0.50 | 17.48 | 88.4 | 0.75 |
| Concat-View | 215 | 0.36 | 0.49 | 17.54 | 89.1 | 0.74 |
| COMBO | 207 | 0.34 | 0.51 | 17.82 | 89.3 | 0.72 |
| Ours | 179 | 0.35 | 0.51 | 17.72 | 89.8 | 0.67 |
| **Multi-Robot Manipulation** | | | | | | |
| Standard | 100 | 0.07 | 0.90 | 26.39 | 88.2 | 1.60 |
| Concat-View* | 106 | 0.06 | 0.90 | 27.44 | 92.0 | 0.82 |
| COMBO | 99 | 0.08 | 0.90 | 26.49 | 88.5 | 1.54 |
| Ours | 96 | 0.07 | 0.90 | 26.60 | 88.7 | 1.52 |

*Concat-View trained only on two camera views per episode, not directly comparable.

MultiWorld achieves best or second-best performance across most metrics in both domains, demonstrating effectiveness and generalization.

Ablation Studies

Table 2: Ablation of main architectural components

| Config | FVD ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ | Action ↑ | RPE ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Standard | 245 | 0.36 | 0.50 | 17.48 | 88.4 | 0.75 |
| + MACM | 228 | 0.36 | 0.51 | 17.56 | 89.7 | 0.76 |
| + MACM + GSE | 179 | 0.35 | 0.51 | 17.72 | 89.8 | 0.67 |

MACM improves action controllability; GSE improves multi-view consistency; together they enhance visual quality.

Table 3: Ablation on Agent Identity Embedding base frequency

| Config (base) | FVD ↓ | PSNR ↑ | Action ↑ |
| --- | --- | --- | --- |
| base = 10000 | 234 | 17.53 | 89.2 |
| base = 20 | 228 | 17.56 | 89.7 |

A lower base frequency (20 vs. the default 10000) yields larger rotation angles between adjacent agent indices, separating agent identities more clearly and improving performance.

Table 4: Ablation on Adaptive Action Weighting

| Config | FVD ↓ | PSNR ↑ | Action ↑ |
| --- | --- | --- | --- |
| w/o AAW | 245 | 17.48 | 88.4 |
| w/ AAW | 236 | 17.52 | 88.6 |

AAW enhances visual quality and action-following by prioritizing active agents.

Table 5: Ablation on Global State Encoder backbone

| Global State Encoder | FVD ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ | RPE ↓ |
| --- | --- | --- | --- | --- | --- |
| w/o global state | 228 | 0.36 | 0.51 | 17.56 | 0.75 |
| Wan VAE | 256 | 0.36 | 0.50 | 17.38 | 0.71 |
| DINOv2 | 232 | 0.36 | 0.50 | 17.48 | 0.72 |
| VGGT (Ours) | 179 | 0.35 | 0.51 | 17.72 | 0.67 |

VGGT, as a 3D reconstruction model, best captures shared 3D environment state, leading to superior multi-view consistency and visual quality.

Qualitative Results

  • Multi-player game: MultiWorld achieves more accurate action following and better multi-view consistency compared to baselines, mitigating failure modes like inaccurate action execution, agent disappearance, and view inconsistency.
  • Multi-robot failure trajectory simulation: MultiWorld can synthesize realistic failure modes (e.g., inter-robot collisions), augmenting scarce failure-trajectory data for safety-aware robot training.
  • Long-horizon generation: Autoregressive simulation of three robots stacking cubes remains stable up to 2× the training context without significant degradation, and extends to 4×.

Theoretical and Practical Implications

  • Theoretical: Advances video world modeling by formally addressing multi-agent controllability and multi-view consistency, introducing scalable architectures (MACM, GSE) that generalize beyond fixed configurations. Provides a framework for modeling inter-agent interactions and shared 3D environment states.
  • Practical:
    • Robotics: Enables simulation of multi-robot collaborative tasks, including failure trajectories, reducing physical risk and data collection burden.
    • Gaming: Supports realistic multi-player environment simulation with consistent viewpoints.
    • Planning & Embodied AI: Can serve as a generalizable virtual environment for training multi-agent reinforcement learning or vision-language-action models.
    • Scalability: The parallel view generation and autoregressive capabilities allow efficient simulation of complex, long-horizon multi-agent scenarios.

Conclusion

MultiWorld presents a scalable framework for multi-agent, multi-view video world modeling, successfully addressing controllability, consistency, and scalability challenges through MACM and GSE. Extensive experiments on diverse datasets demonstrate superior performance in video quality, action adherence, and multi-view coherence.

Limitations: Current scale is limited by computational constraints; large-scale training remains unexplored.

Future Directions:

  • Investigate real-time multi-agent generation for downstream tasks.
  • Explore memory mechanisms for ultra-long multi-agent simulation to handle spatial/temporal demands.
  • Extend to even larger numbers of agents and views with optimized architectures.