# MultiWorld: Scalable Multi-Agent Multi-View Video World Models

> MultiWorld introduces a scalable framework for multi-agent, multi-view video generation with precise agent control, multi-view consistency, and flexible scaling of agents and views.

- **Source:** [arXiv](https://arxiv.org/abs/2604.18564)
- **Published:** 2026-04-22
- **Permalink:** https://picx.dev/p/UFJpA8
- **Whiteboard:** https://picx.dev/p/UFJpA8/image

## Summary (Overview)
- **Key Contribution**: Introduces MultiWorld, a unified framework for scalable multi-agent, multi-view video world modeling that addresses three core challenges: multi-agent controllability, multi-view consistency, and framework scalability.
- **Core Modules**: Proposes the **Multi-Agent Condition Module (MACM)** for precise control of multiple agents via Agent Identity Embedding and Adaptive Action Weighting, and the **Global State Encoder (GSE)** using a VGGT backbone to ensure multi-view consistency by encoding a shared 3D-aware environment state.
- **Scalability**: Framework supports arbitrary numbers of agents and camera views without architectural changes, enabling parallel view generation and stable long-horizon autoregressive simulation.
- **Empirical Validation**: Demonstrates superior performance on multi-player game (It Takes Two) and multi-robot manipulation (RoboFactory) datasets, outperforming baselines in video fidelity (FVD), action-following ability, and multi-view consistency (RPE).
- **Practical Application**: Shows capability to simulate realistic failure trajectories and long-horizon scenarios (up to 4× training context) for collaborative robotics, highlighting potential for downstream tasks like planning and simulation.

## Introduction and Theoretical Foundation
Video world models, typically formulated as action-conditioned video generation models, have succeeded in simulating environmental dynamics. However, existing models implicitly assume a **single-agent** scenario, ignoring the complex interactions and interdependencies inherent in real-world **multi-agent systems** (e.g., collaborative robotics, multi-player games). Furthermore, multi-agent simulation inherently requires generating **consistent observations across multiple views**, as each agent perceives the shared environment from its own viewpoint. Previous models fail to preserve this **multi-view consistency**.

Extending world models to multi-agent, multi-view scenarios introduces three key challenges:
1. **Multi-Agent Controllability**: Controlling multiple agents requires associating specific actions with corresponding agents and synchronizing their executions. Simple stacking of actions leads to identity ambiguity (e.g., "mirror actions").
2. **Multi-View Consistency**: The model must synthesize visually coherent videos across diverse perspectives, ensuring observations from multiple agents remain geometrically consistent within a shared environment.
3. **Framework Scalability**: Real-world environments involve variable numbers of agents and views. Previous works assume fixed counts or predefined setups, limiting applicability.

**MultiWorld** addresses these challenges by proposing a scalable framework that enables flexible scaling of agent and view counts. The theoretical foundation combines:
- **Flow Matching (FM)** as the generative backbone for conditional video distribution.
- **Transformer architecture** for sequential modeling with causal masking to prevent future information leakage.
- **Rotary Position Embedding (RoPE)**, a relative positional encoding, for agent identity distinction.
- **3D-aware latent representations** from a pre-trained reconstruction model (VGGT) to encode a global environment state.

## Methodology

### 3.1 Backbone and Notation
The problem is formulated as a collection of $C$ image-action-conditioned video generation tasks, where $C$ is the number of camera views. Videos from different views are synthesized in parallel conditioned on a shared global environment state.

- **Notation**: There are $K$ agents and $C$ camera views, and the two counts are independent of each other. Let $a_i = (a_i^1, ..., a_i^K)$ denote the joint action of all $K$ agents at frame $i$, and $a = \{a_0, ..., a_I\}$ the full action sequence. The video from camera $c$ is $x^c$. The environment observation is $o = \{o^c\}_{c=1}^C$, where $o^c$ is the initial frame of $x^c$.
- **Flow Matching (FM)**: For each view $c$, the conditional distribution of future frames is modeled. Sampling $t \sim U(0,1)$ and noise $\epsilon \sim N(0,I)$, the noisy observation and target velocity are:
$$x_t^c = (1 - t) x^c + t \epsilon,$$
$$u = \epsilon - x^c.$$
The velocity network is parameterized as $v_\theta(x_t^c, t, a, o)$, trained to predict $u$.
- **Causal Masking**: A frame-wise causal mask is applied to action cross-attention to ensure tokens at frame $i$ attend only to actions from frames $\{0, ..., i\}$, supporting stable autoregressive generation.
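
The flow-matching target and the frame-wise causal mask described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the toy latent shape, and the `tokens_per_frame` parameter are assumptions made for the example.

```python
import numpy as np

def flow_matching_pair(x, rng):
    """Build one training pair: x_t = (1 - t) x + t eps, target u = eps - x."""
    t = rng.uniform(0.0, 1.0)              # t ~ U(0, 1)
    eps = rng.standard_normal(x.shape)     # eps ~ N(0, I)
    x_t = (1.0 - t) * x + t * eps
    u = eps - x
    return x_t, t, u

def frame_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask of shape (num_frames * tokens_per_frame, num_frames).

    Entry [q, j] is True when video token q, which belongs to frame
    q // tokens_per_frame, may attend to the action of frame j.
    """
    frame_of_token = np.arange(num_frames * tokens_per_frame) // tokens_per_frame
    action_frames = np.arange(num_frames)
    return action_frames[None, :] <= frame_of_token[:, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))            # hypothetical latent: 4 frames, 8 features
x_t, t, u = flow_matching_pair(x, rng)
mask = frame_causal_mask(num_frames=4, tokens_per_frame=2)
```

Subtracting `t * u` from `x_t` recovers the clean sample `x`, which is a quick sanity check that the interpolation and the target velocity agree with the two equations above.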

### 3.2 Multi-Agent Condition Module (MACM)
MACM processes multi-agent actions to resolve identity ambiguity and prioritize dynamic agents.

**Agent Identity Embedding (AIE)**
To support variable agent counts and eliminate identity confusion, AIE uses Rotary Position Embedding (RoPE). Given an action latent at frame $f$, $a_f \in \mathbb{R}^{K \times D}$ ($D$ is latent dimension), for each agent $i \in \{1, ..., K\}$, a rotation matrix $R_{\Theta,i}$ is applied:
$$\text{AIE}(a^i, i) = R_{\Theta,i} a^i$$
where $R_{\Theta,i}$ is defined by pre-computed frequencies $\theta_j = b^{-2j/D}$, with $b$ as the base frequency constant. For each pair of dimensions $(2j, 2j+1)$:
$$
\begin{bmatrix}
a^{(2j)} \\
a^{(2j+1)}
\end{bmatrix}_{\text{out}} =
\begin{bmatrix}
\cos(i\theta_j) & -\sin(i\theta_j) \\
\sin(i\theta_j) & \cos(i\theta_j)
\end{bmatrix}
\begin{bmatrix}
a^{(2j)} \\
a^{(2j+1)}
\end{bmatrix}_{\text{in}}
$$
The attention between agents $m$ and $n$ becomes:
$$(R_m a_m)^\top (R_n a_n) = a_m^\top R_m^\top R_n a_n = a_m^\top R_{n-m} a_n.$$
This provides relative identity encoding.
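
The rotation and its relative property can be checked numerically. The sketch below follows the pairwise rotation formula above; the function name, the toy dimension `D = 8`, and the default `base=20.0` (the value reported in the base ablation) are choices made for this example only.

```python
import numpy as np

def agent_identity_embedding(a, agent_index, base=20.0):
    """Rotate each (2j, 2j+1) pair of `a` by the angle agent_index * theta_j."""
    D = a.shape[-1]                        # latent dimension, assumed even
    j = np.arange(D // 2)
    theta = base ** (-2.0 * j / D)         # theta_j = b^(-2j/D)
    angle = agent_index * theta
    cos, sin = np.cos(angle), np.sin(angle)
    even, odd = a[..., 0::2], a[..., 1::2]
    out = np.empty_like(a)
    out[..., 0::2] = cos * even - sin * odd
    out[..., 1::2] = sin * even + cos * odd
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal(8)
b = rng.standard_normal(8)

# Agents (1, 3) and (4, 6) share the same offset n - m = 2.
score_13 = agent_identity_embedding(a, 1) @ agent_identity_embedding(b, 3)
score_46 = agent_identity_embedding(a, 4) @ agent_identity_embedding(b, 6)
score_rel = a @ agent_identity_embedding(b, 2)
```

All three scores agree up to floating-point error, so the dot product depends only on the index offset between the two agents, not on their absolute indices.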

**Adaptive Action Weighting (AAW)**
An MLP predicts adaptive weighting factors for each action token. Tokens are multiplied by their weights and summed into a unified representation per frame, allowing the model to focus on active agents driving environmental change.
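
A minimal sketch of this weighting step is shown below. The summary does not specify the MLP size or whether the weights are normalized, so the two-layer ReLU MLP, the softmax over agents, and all parameter names here are assumptions for illustration.

```python
import numpy as np

def adaptive_action_weighting(actions, w1, b1, w2, b2):
    """Fuse per-agent action tokens (K, D) into one per-frame vector (D,).

    A small MLP scores each token; a softmax over agents (an assumption,
    not stated in the source) turns the scores into weights.
    """
    hidden = np.maximum(actions @ w1 + b1, 0.0)      # (K, H), ReLU
    logits = (hidden @ w2 + b2).squeeze(-1)          # (K,)
    logits = logits - logits.max()                   # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()  # (K,)
    fused = (weights[:, None] * actions).sum(axis=0) # weighted sum over agents
    return fused, weights

rng = np.random.default_rng(0)
D, H = 6, 8                                          # hypothetical sizes
w1, b1 = rng.standard_normal((D, H)), np.zeros(H)
w2, b2 = rng.standard_normal((H, 1)), np.zeros(1)

for K in (2, 4):                                     # same parameters, different agent counts
    actions = rng.standard_normal((K, D))
    fused, weights = adaptive_action_weighting(actions, w1, b1, w2, b2)
```

Because the MLP is applied per token and the sum runs over the agent axis, the same parameters work for any number of agents, which matches the scalability claim.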

### 3.3 Global State Encoder (GSE)
GSE aggregates multi-view observations into a compact, 3D-aware global environment state to ensure consistency.

- **Process**: Given multi-view observation set $O = \{O^c\}_{c=1}^C$ with $C$ images $O^c \in \mathbb{R}^{3 \times H \times W}$, a frozen **VGGT** backbone (an end-to-end 3D reconstruction model) encodes them:
$$H_{\text{vggt}} = \text{VGGT}(O), \quad H_{\text{vggt}} \in \mathbb{R}^{C \times n \times d}$$
where $n$ is tokens per image, $d$ is latent dimension.
- **Alignment**: An MLP aligns features with the DiT backbone: $H = \text{MLP}(H_{\text{vggt}})$. $H$ is injected via cross-attention to condition generation.
- **Advantages**: Improves multi-view spatial consistency, supports arbitrary view counts by compression, and enables parallel view generation.
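
The data flow of the encoder can be illustrated with a shape-level sketch. Everything here is a stand-in: `fake_vggt` returns random features instead of calling the real VGGT model, a single linear layer replaces the alignment MLP, tokens from all views are simply concatenated (the summary does not say how views are pooled), and the attention omits learned query, key, and value projections.

```python
import numpy as np

def fake_vggt(images, n=16, d=32, seed=0):
    """Placeholder for the frozen VGGT encoder: random features of shape (C, n, d)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((images.shape[0], n, d))

def global_state(images, w, b):
    """Encode C views and align them to the generator width d_model."""
    h_vggt = fake_vggt(images)                 # (C, n, d)
    h = h_vggt @ w + b                         # (C, n, d_model), linear stand-in for the MLP
    return h.reshape(-1, h.shape[-1])          # (C * n, d_model), views concatenated

def cross_attention(queries, context):
    """Plain softmax cross-attention of video tokens over the global state."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ context

rng = np.random.default_rng(1)
d, d_model = 32, 24                            # hypothetical widths
w, b = rng.standard_normal((d, d_model)), np.zeros(d_model)
queries = rng.standard_normal((10, d_model))   # 10 video tokens of one view

out_two_views = cross_attention(queries, global_state(np.zeros((2, 3, 8, 8)), w, b))
out_three_views = cross_attention(queries, global_state(np.zeros((3, 3, 8, 8)), w, b))
```

The output has the same shape for two and three input views, since only the length of the attended context changes with $C$.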

### 3.4 Scalable Framework
- **Agent Scalability**: AIE's relative embeddings can be extrapolated, accommodating arbitrary $K$ without architectural changes.
- **View Scalability**: GSE compresses variable-length observations into a unified global state, decoupling computation from $C$. Because views are generated in parallel, inference latency stays nearly constant as long as compute scales with the number of views (about 1.5× faster than sequential generation in the two-view case).
- **Autoregressive Long-Horizon Simulation**: Videos are generated chunk-wise. The last frames of a chunk update the global state via GSE for the next chunk, enabling stable generation beyond training context length (up to 2× with minimal degradation, extendable to 4×).
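
The chunk-wise rollout can be sketched as a short loop. The generator and encoder below are toy stand-ins (`toy_generate`, `toy_encode`) with hypothetical shapes, and the loop refreshes the state from only the final frame of each chunk, which is a simplification of "the last frames" mentioned above.

```python
import numpy as np

def rollout(first_frames, actions, chunk_len, generate_chunk, encode_state):
    """Generate a long video chunk by chunk.

    first_frames: (C, F) one initial frame per view.
    actions:      (T, K, A) joint actions for T frames.
    """
    state = encode_state(first_frames)
    chunks = []
    for start in range(0, actions.shape[0], chunk_len):
        chunk_actions = actions[start:start + chunk_len]
        video = generate_chunk(state, chunk_actions)   # (C, L, F)
        chunks.append(video)
        state = encode_state(video[:, -1])             # refresh global state from last frames
    return np.concatenate(chunks, axis=1)

C, F = 2, 3                                            # toy sizes: 2 views, 3 features per frame

def toy_encode(frames):
    """Stand-in for the Global State Encoder: average over views."""
    return frames.mean(axis=0)

def toy_generate(state, acts):
    """Stand-in for the video model: frames drift by the summed actions."""
    drift = np.cumsum(acts.sum(axis=(1, 2)))           # (L,)
    frames = state[None, :] + drift[:, None]           # (L, F)
    return np.repeat(frames[None], C, axis=0)          # (C, L, F)

actions = np.ones((8, 2, 1))                           # 8 frames, 2 agents, 1-dim actions
video = rollout(np.zeros((C, F)), actions, 4, toy_generate, toy_encode)
```

With a chunk length of 4, the 8-frame rollout spans twice the chunk length, and the second chunk continues from the state left by the first one rather than restarting from the initial frames.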

## Empirical Validation / Results

### Datasets & Metrics
- **Multi-player game**: 500 hours (100 hours cleaned) of real-player data from *It Takes Two*, 2560×1440 resolution, over 21M frames.
- **Multi-robot manipulation**: Constructed using **RoboFactory**, tasks with 2-4 agents, variable camera viewpoints.
- **Metrics**: 
  - Visual quality: **FVD**, **PSNR**, **SSIM**, **LPIPS**.
  - Multi-view consistency: **Reprojection Error (RPE)**.
  - Action-following: **Inverse Dynamics Model (IDM)** accuracy.
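
Of these metrics, PSNR is the simplest to state: $\text{PSNR} = 10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below assumes pixel values in $[0, 1]$ (so `max_value=1.0`); the paper's exact evaluation pipeline is not described in this summary.

```python
import numpy as np

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two frames or videos."""
    ref = np.asarray(reference, dtype=np.float64)
    gen = np.asarray(generated, dtype=np.float64)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0:
        return float("inf")                    # identical inputs
    return float(10.0 * np.log10(max_value ** 2 / mse))

ref = np.zeros((4, 4, 3))
gen = np.full((4, 4, 3), 0.1)                  # uniform error of 0.1 per pixel
value = psnr(ref, gen)                         # MSE = 0.01, so about 20 dB
```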

### Baselines
1. **Standard Image-Action-to-Video Model**: Treats each view independently.
2. **Concatenated-View Video World Model**: Combines views into a single video (fixed view count, memory-intensive).
3. **COMBO**: Compositional multi-agent model (two-stage: train single-agent models, then combine). Neglects inter-agent interactions.

### Implementation Details
- Backbone: **Wan2.2-5B** model.
- Resolution: 320×320 (game), 320×256 (robotics).
- Training: 40,000 iterations, learning rate 5e-5, cosine scheduler, batch size 64 on 8 NVIDIA A800 GPUs (~4 days).
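
As a rough illustration of the schedule, the sketch below implements a standard cosine decay from the stated peak learning rate over the stated 40,000 iterations. The absence of warmup, the decay to zero, and the reading of 64 as the global batch size are assumptions; the summary does not give these details.

```python
import math

def cosine_lr(step, total_steps=40_000, peak_lr=5e-5, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

lr_start = cosine_lr(0)
lr_mid = cosine_lr(20_000)
lr_end = cosine_lr(40_000)

# If 64 is the global batch size, training processes this many samples in total.
samples_seen = 40_000 * 64
```

Under these assumptions the learning rate is halved at the midpoint of training and the run covers 2.56 million training samples.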

### Main Results

**Table 1: Quantitative comparison across scenarios**

| Method | FVD ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ | Action ↑ | RPE ↓ |
|---------|--------|----------|---------|---------|----------|--------|
| **Multi-Player Video Game** | | | | | | |
| Standard | 245 | 0.36 | 0.50 | 17.48 | 88.4 | 0.75 |
| Concat-View | 215 | 0.36 | 0.49 | 17.54 | 89.1 | 0.74 |
| COMBO | 207 | 0.34 | 0.51 | 17.82 | 89.3 | 0.72 |
| **Ours** | **179** | **0.35** | **0.51** | **17.72** | **89.8** | **0.67** |
| **Multi-Robot Manipulation** | | | | | | |
| Standard | 100 | 0.07 | 0.90 | 26.39 | 88.2 | 1.60 |
| Concat-View* | 106 | 0.06 | 0.90 | 27.44 | 92.0 | 0.82 |
| COMBO | 99 | 0.08 | 0.90 | 26.49 | 88.5 | 1.54 |
| **Ours** | **96** | **0.07** | **0.90** | **26.60** | **88.7** | **1.52** |

*Concat-View was trained on only two camera views per episode, so its results are not directly comparable.

MultiWorld achieves the **best or second-best** score on most metrics in both domains. It has the lowest FVD in both settings and the best action accuracy and RPE in the game setting, while COMBO is slightly ahead on LPIPS and PSNR there.
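
To put the FVD gap in relative terms, the short calculation below uses the game-domain FVD values from Table 1. The helper name is made up for this example; the numbers are copied from the table.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

ours_fvd = 179
game_fvd = {"Standard": 245, "Concat-View": 215, "COMBO": 207}

reductions = {name: relative_reduction(fvd, ours_fvd) for name, fvd in game_fvd.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower FVD")
```

On the game data this works out to roughly 27% lower FVD than the Standard baseline and roughly 14% lower than the strongest baseline in the table.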

### Ablation Studies

**Table 2: Ablation of main architectural components**

| Config | FVD ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ | Action ↑ | RPE ↓ |
|--------|--------|----------|---------|---------|----------|--------|
| Standard | 245 | 0.36 | 0.50 | 17.48 | 88.4 | 0.75 |
| + MACM | 228 | 0.36 | 0.51 | 17.56 | 89.7 | 0.76 |
| + MACM + GSE (full) | 179 | 0.35 | 0.51 | 17.72 | 89.8 | 0.67 |

Adding MACM raises action accuracy from 88.4 to 89.7. Adding GSE on top of MACM lowers RPE from 0.76 to 0.67 and FVD from 228 to 179, so the two modules together give the best visual quality.

**Table 3: Ablation on Agent Identity Embedding base frequency**

| Config (base) | FVD ↓ | PSNR ↑ | Action ↑ |
|---------------|--------|---------|----------|
| base=10k | 234 | 17.53 | 89.2 |
| base=20 | 228 | 17.56 | 89.7 |

A smaller RoPE base (20 instead of the default 10000) yields larger rotation angles in the higher-index dimension pairs, which gives better angular separation between agent identities and improves all three metrics.
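
The effect of the base on angular separation can be seen by computing $i\,\theta_j$ with $\theta_j = b^{-2j/D}$ for neighbouring agent indices. The latent dimension `D = 8` below is a toy value chosen for readability, not the model's actual dimension.

```python
import numpy as np

def rotation_angles(agent_index, base, D=8):
    """Angles agent_index * theta_j for j = 0 .. D/2 - 1, with theta_j = base^(-2j/D)."""
    j = np.arange(D // 2)
    return agent_index * base ** (-2.0 * j / D)

# Angular gap between agent 1 and agent 2 in each dimension pair.
gap_10k = rotation_angles(2, 10_000.0) - rotation_angles(1, 10_000.0)
gap_20 = rotation_angles(2, 20.0) - rotation_angles(1, 20.0)

for name, gap in (("base=10k", gap_10k), ("base=20", gap_20)):
    print(name, np.round(gap, 4))
```

With base 10000 the gaps shrink quickly (1, 0.1, 0.01, 0.001 radians for this toy `D`), so most dimension pairs barely distinguish adjacent agents. With base 20 every pair beyond the first has a clearly larger gap, which is one plausible reading of why the smaller base helps when only a handful of agent indices need to be told apart.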

**Table 4: Ablation on Adaptive Action Weighting**

| Config | FVD ↓ | PSNR ↑ | Action ↑ |
|--------|--------|---------|----------|
| w/o AAW | 245 | 17.48 | 88.4 |
| w/ AAW | 236 | 17.52 | 88.6 |

AAW gives a modest gain in visual quality (FVD 245 → 236) and action-following (88.4 → 88.6) by prioritizing active agents.

**Table 5: Ablation on Global State Encoder backbone**

| Global State Encoder | FVD ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ | RPE ↓ |
|---------------------|--------|----------|---------|---------|--------|
| w/o Global State | 228 | 0.36 | 0.51 | 17.56 | 0.75 |
| Wan VAE | 256 | 0.36 | 0.50 | 17.38 | 0.71 |
| DINOv2 | 232 | 0.36 | 0.50 | 17.48 | 0.72 |
| **VGGT (Ours)** | **179** | **0.35** | **0.51** | **17.72** | **0.67** |

VGGT, as a 3D reconstruction model, best captures shared 3D environment state, leading to superior multi-view consistency and visual quality.

### Qualitative Results
- **Multi-player game**: MultiWorld achieves more accurate action following and better multi-view consistency compared to baselines, mitigating failure modes like inaccurate action execution, agent disappearance, and view inconsistency.
- **Multi-robot failure trajectory simulation**: By augmenting the failure trajectory data, MultiWorld can simulate realistic failure modes (e.g., inter-robot collisions), which is useful for safe robotics training.
- **Long-horizon generation**: Autoregressive simulation of three robots stacking cubes stays fluent up to 2× the training context without significant degradation and can be extended to 4×.

## Theoretical and Practical Implications
- **Theoretical**: Advances video world modeling by formally addressing multi-agent controllability and multi-view consistency, introducing scalable architectures (MACM, GSE) that generalize beyond fixed configurations. Provides a framework for modeling inter-agent interactions and shared 3D environment states.
- **Practical**: 
  - **Robotics**: Enables simulation of multi-robot collaborative tasks, including failure trajectories, reducing physical risk and data collection burden.
  - **Gaming**: Supports realistic multi-player environment simulation with consistent viewpoints.
  - **Planning & Embodied AI**: Can serve as a generalizable virtual environment for training multi-agent reinforcement learning or vision-language-action models.
  - **Scalability**: The parallel view generation and autoregressive capabilities allow efficient simulation of complex, long-horizon multi-agent scenarios.

## Conclusion
MultiWorld presents a **scalable framework** for multi-agent, multi-view video world modeling, successfully addressing controllability, consistency, and scalability challenges through **MACM** and **GSE**. Extensive experiments on diverse datasets demonstrate superior performance in video quality, action adherence, and multi-view coherence.

**Limitations**: Current scale is limited by computational constraints; large-scale training remains unexplored.

**Future Directions**:
- Investigate **real-time multi-agent generation** for downstream tasks.
- Explore **memory mechanisms** for ultra-long multi-agent simulation to handle spatial/temporal demands.
- Extend to even larger numbers of agents and views with optimized architectures.

---

_Markdown view of https://picx.dev/p/UFJpA8, served by PicX — AI-generated visual whiteboard summaries of research papers._
