# WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

> WorldCam uses camera pose as a unifying geometric representation to jointly enable precise action control and enforce long-term 3D consistency in interactive gaming world generation.

- **Source:** [arXiv](https://arxiv.org/abs/2603.16871)
- **Published:** 2026-03-19
- **Permalink:** https://picx.dev/p/PDQ9KY
- **Whiteboard:** https://picx.dev/p/PDQ9KY/image

## Summary (Overview)
- **Core Contribution**: Establishes **camera pose** as a unifying geometric representation to jointly ground **immediate action control** and **long-term 3D consistency** in interactive gaming world models.
- **Key Method**: Introduces a **physics-based continuous action space** defined in the **Lie algebra** $se(3)$ to derive precise 6-DoF camera poses from user inputs, and a **pose-indexed long-term memory retrieval** mechanism to enforce spatial coherence.
- **New Dataset**: Publishes **WorldCam-50h**, a large-scale dataset of 3,000 minutes (50 hours) of authentic human gameplay from three games (one closed-licensed, two open-licensed), annotated with camera trajectories and textual descriptions.
- **Performance**: Demonstrates substantial improvements over state-of-the-art models in **action controllability**, **long-horizon visual quality**, and **3D spatial consistency** through extensive quantitative and human evaluations.
- **Architecture**: Builds on a progressive autoregressive Video Diffusion Transformer (DiT) backbone, enhanced with a camera embedder, attention sink, and short-/long-term memory mechanisms for stable long-horizon generation.

## Introduction and Theoretical Foundation
Recent advances in Video Diffusion Transformers (DiTs) have enabled interactive gaming world models. However, existing approaches struggle with **precise action control** and **long-horizon 3D consistency**. The fundamental issue is that prior works treat user actions (keyboard/mouse) as abstract conditioning signals, overlooking the **geometric coupling** between actions and the 3D world. In a 3D environment, user actions induce relative camera motions that accumulate into a **global camera pose**, which dictates the 2D projection of the world. Therefore, accurate action control and 3D consistency are inherently coupled through the camera pose.

**WorldCam** addresses this by establishing the **camera pose** as the core geometric representation. This serves a dual purpose:
1.  **For Action Control**: User inputs are translated into geometrically accurate camera poses.
2.  **For 3D Consistency**: Global camera poses act as spatial indices to retrieve past observations, ensuring consistency when revisiting locations.

The paper positions WorldCam against prior works (see Table 1), highlighting its unique ability to combine action control, 3D consistency, and long-horizon inference.

## Methodology

### 3.1 Baseline: Video Diffusion Transformer
WorldCam builds on a pretrained video DiT (Wan2.1-T2V). Given an input video $V \in \mathbb{R}^{F \times H \times W \times 3}$, a VAE encoder maps it to a latent sequence $z_0 \in \mathbb{R}^{f \times h \times w \times c}$. The DiT $v_{\theta}$ is trained with a flow-matching objective to predict a velocity field from the noised latent $z_t$ at timestep $t$ and the text condition $c_{text}$:

$$
L_{FM} = \mathbb{E}_{z_0, c_{text}, t} \left[ \left\| v_{\theta}(z_t, c_{text}, t) - \frac{z_0 - z_t}{1 - t} \right\|_2^2 \right]
$$
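
The regression target in this objective can be checked numerically. The sketch below is illustrative only: the summary does not state how $z_t$ is constructed, so the linear interpolation path `zt = t * z0 + (1 - t) * eps` is an assumption, and the function name `flow_matching_loss` is hypothetical. Under that assumed path, the target $(z_0 - z_t)/(1 - t)$ reduces to $z_0 - \epsilon$.

```python
import numpy as np

def flow_matching_loss(v_pred, z0, zt, t):
    """Squared L2 distance between a predicted velocity and (z0 - zt) / (1 - t)."""
    target = (z0 - zt) / (1.0 - t)
    return float(np.sum((v_pred - target) ** 2))

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 8, 8, 16))   # clean latent with shape (f, h, w, c)
eps = rng.normal(size=z0.shape)       # Gaussian noise
t = 0.3                               # timestep in [0, 1)

# ASSUMPTION: linear interpolation between noise and data (not given in the summary).
zt = t * z0 + (1.0 - t) * eps

# Under the assumed path the target simplifies to z0 - eps.
target = (z0 - zt) / (1.0 - t)
loss_at_optimum = flow_matching_loss(target, z0, zt, t)
```

A perfect velocity prediction gives zero loss, which is what `loss_at_optimum` evaluates.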

### 3.2 Action-to-Camera Mapping
To ensure physically accurate control, the action space is defined in the **Lie algebra** $se(3)$. At each transition from frame $I_{i-1}$ to $I_i$, the user action $A_i$ is a twist vector:

$$
A_i = [v_i; \omega_i] \in \mathbb{R}^6
$$

where $v_i = [v_x, v_y, v_z]^\top \in \mathbb{R}^3$ and $\omega_i = [\omega_x, \omega_y, \omega_z]^\top \in \mathbb{R}^3$ denote linear and angular velocities. The corresponding relative camera pose $\Delta P_i \in SE(3)$ is derived via the matrix exponential map:

$$
\Delta P_i = \exp(\hat{A}_i) = \begin{bmatrix} \Delta R_i & \Delta t_i \\ 0^\top & 1 \end{bmatrix}
$$

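where $\hat{A}_i \in se(3)$ is the $4 \times 4$ matrix form of the twist $A_i$, $\hat{A}_i = \begin{bmatrix} [\omega_i]_\times & v_i \\ 0^\top & 0 \end{bmatrix}$, with $[\omega_i]_\times$ the skew-symmetric matrix of $\omega_i$. This formulation jointly integrates linear and angular velocities on the $SE(3)$ manifold, so it captures coupled dynamics such as screw motion that decoupled linear approximations miss.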
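
To make the exponential map concrete, here is a minimal NumPy sketch using the standard closed-form (Rodrigues-style) expression for $\exp$ on $se(3)$. The function names `hat` and `se3_exp` are illustrative and not taken from the paper, and the axis convention in the example (forward along $z$, yaw about $y$) is an assumption.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its 3x3 skew-symmetric matrix."""
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def se3_exp(twist):
    """Closed-form exponential map from a twist [v; w] in R^6 to a 4x4 pose."""
    v = np.asarray(twist[:3], dtype=float)
    w = np.asarray(twist[3:], dtype=float)
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        # Leading Taylor terms near zero rotation
        A, B, C = 1.0, 0.5, 1.0 / 6.0
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
    R = np.eye(3) + A * K + B * (K @ K)   # rotation block
    V = np.eye(3) + B * K + C * (K @ K)   # couples rotation into translation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

# Unit forward velocity along z combined with a 90-degree yaw about y.
twist = np.array([0.0, 0.0, 1.0, 0.0, np.pi / 2, 0.0])
dP = se3_exp(twist)
```

In this example the translation of `dP` is $(2/\pi, 0, 2/\pi)$ rather than $(0, 0, 1)$: the camera travels along a quarter-circle arc instead of a straight line, which is the coupling that a decoupled linear update would ignore.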

### 3.3 Camera-Controlled Video Generation
Relative poses $\{\Delta P_i\}_{i=1}^F$ are accumulated into global camera poses aligned with the first frame. These poses are converted into **Plücker embeddings** $\hat{P} \in \mathbb{R}^{F \times 6}$. A lightweight camera embedding module $c_{\phi}$ (two MLP layers) injects camera control into the DiT. Since the VAE compresses time by a factor $r$, $r$ consecutive Plücker embeddings are concatenated for each latent frame, resulting in $\hat{p} \in \mathbb{R}^{f \times (6r)}$. The embeddings are added to DiT features $d$ after each self-attention layer:

$$
d \gets d + c_{\phi}(\hat{p})
$$
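
The temporal grouping and additive injection can be sketched as follows. This is a shape-level illustration under several assumptions that are not in the summary: $F$ is a multiple of $r$, the MLP uses a ReLU and random weights, the hidden and model sizes are arbitrary, and the DiT features are reduced to one vector per latent frame. `CameraEmbedder` and `group_by_latent_frame` are hypothetical names.

```python
import numpy as np

def group_by_latent_frame(plucker, r):
    """Concatenate r consecutive per-frame embeddings: (F, 6) -> (F // r, 6 * r)."""
    F, D = plucker.shape
    assert F % r == 0, "sketch assumes F is a multiple of r"
    return plucker.reshape(F // r, r * D)

class CameraEmbedder:
    """Two-layer MLP standing in for c_phi (random weights, hypothetical sizes)."""
    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.02, size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(scale=0.02, size=(hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, p):
        h = np.maximum(p @ self.W1 + self.b1, 0.0)  # ReLU is an assumption
        return h @ self.W2 + self.b2

F, r, model_dim = 16, 4, 32
plucker = np.random.default_rng(1).normal(size=(F, 6))  # placeholder embeddings
p_hat = group_by_latent_frame(plucker, r)                # shape (4, 24)
c_phi = CameraEmbedder(6 * r, 64, model_dim)

d = np.zeros((F // r, model_dim))   # placeholder DiT features, one row per latent frame
d = d + c_phi(p_hat)                # d <- d + c_phi(p_hat)
```

In a real DiT the camera feature would presumably be broadcast over the spatial tokens of each latent frame; the sketch omits that dimension for brevity.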

### 3.4 Pose-Anchored Long-Term Memory
**Global Pose Accumulation**: Relative motions are accumulated into global camera poses $P^{global}_j$ via pose composition:

$$
P^{global}_j = P^{global}_{j-1} \circ \Delta P_j, \quad P^{global}_0 = I
$$
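
A minimal sketch of this recursion with $4 \times 4$ homogeneous matrices, where composition is matrix multiplication. The helper `step` and its axis convention (forward along local $z$, yaw about $y$) are hypothetical and only used to build a test trajectory.

```python
import numpy as np

def accumulate_poses(relative_poses):
    """Right-compose 4x4 relative poses: P_j = P_{j-1} @ dP_j, with P_0 = I."""
    P = np.eye(4)
    global_poses = [P.copy()]
    for dP in relative_poses:
        P = P @ dP
        global_poses.append(P.copy())
    return global_poses

def step(yaw_deg, forward):
    """Hypothetical relative pose: advance `forward` along local z and yaw by `yaw_deg` about y."""
    a = np.deg2rad(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), 0.0, np.sin(a)],
                 [0.0, 1.0, 0.0],
                 [-np.sin(a), 0.0, np.cos(a)]]
    T[:3, 3] = [0.0, 0.0, forward]
    return T

# Four identical "advance one unit, turn 90 degrees" steps trace a square.
poses = accumulate_poses([step(90.0, 1.0)] * 4)
```

After four steps the accumulated pose returns to the identity, which is the kind of revisit that a pose-indexed memory can detect.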

**Pose-Indexed Memory Retrieval**: A long-term memory pool $M$ stores previously generated latents with their global poses. A hierarchical retrieval strategy uses the global pose as a spatial index:
1.  Select top-$K$ candidates $M_{trans}$ whose camera positions $t_j$ are closest to the current position $t_i$:
    $$M_{trans} = \text{TopK}_K\left(-\|t_j - t_i\|_2; (P^{global}_j, z_j) \in M\right)$$
2.  From $M_{trans}$, select $L$ entries ($L \leq K$) whose viewing directions (rotation matrices $R_j$) are most aligned with the current orientation $R_i$, measured by the trace of the relative rotation matrix:
    $$M_{rot} = \text{TopK}_L\left(\text{tr}(R_j^\top R_i); (P^{global}_j, z_j) \in M_{trans}\right)$$
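
The two retrieval stages above can be sketched as a pair of top-k selections. The memory layout (a list of `(pose, latent)` tuples), the function name `retrieve_memory`, and the helper `make_pose` are assumptions made for illustration.

```python
import numpy as np

def retrieve_memory(memory, current_pose, K, L):
    """Two-stage pose-indexed retrieval.

    memory: list of (4x4 global pose, latent) tuples.
    Returns the indices into `memory` of the L selected entries.
    """
    t_i, R_i = current_pose[:3, 3], current_pose[:3, :3]

    # Stage 1: keep the K entries with the smallest translation distance.
    dists = np.array([np.linalg.norm(P[:3, 3] - t_i) for P, _ in memory])
    trans_idx = np.argsort(dists)[:K]

    # Stage 2: among those, keep the L entries with the largest tr(R_j^T R_i).
    traces = np.array([np.trace(memory[j][0][:3, :3].T @ R_i) for j in trans_idx])
    rot_order = np.argsort(-traces)[:L]
    return trans_idx[rot_order].tolist()

def make_pose(yaw_deg, position):
    """Hypothetical helper: pose with a yaw about y and a given camera position."""
    a = np.deg2rad(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), 0.0, np.sin(a)],
                 [0.0, 1.0, 0.0],
                 [-np.sin(a), 0.0, np.cos(a)]]
    T[:3, 3] = position
    return T

memory = [
    (make_pose(0.0,   [0.0, 0.0, 0.0]), "z0"),  # same place, same view
    (make_pose(180.0, [0.1, 0.0, 0.0]), "z1"),  # nearby, facing backwards
    (make_pose(10.0,  [0.2, 0.0, 0.0]), "z2"),  # nearby, similar view
    (make_pose(0.0,   [5.0, 0.0, 0.0]), "z3"),  # same view, far away
]
selected = retrieve_memory(memory, make_pose(0.0, [0.0, 0.0, 0.0]), K=3, L=2)
```

Here `selected` is `[0, 2]`: entry 3 is dropped in the translation stage despite its matching orientation, and entry 1 is dropped in the rotation stage because it faces the opposite direction.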

**Long-Term Memory Conditioning**: Retrieved latents are concatenated with the current input. Their associated camera poses are realigned, embedded via $c_{\phi}$, and injected into the DiT, establishing geometric correspondences.

### 3.5 Progressive Autoregressive Inference
- **Progressive Noise Scheduling**: Adopts a progressive per-frame noise schedule with monotonically increasing noise levels across latent frames within each denoising window. This provides a low-noise anchor in early frames while keeping future frames correctable. The diffusion process is discretized into $N$ inference steps partitioned into $S$ stages.
- **Attention Sink**: Incorporates an attention sink mechanism (akin to StreamingLLM) by retaining global initial frames as attention anchors to preserve frame fidelity, scene style, and UI consistency.
- **Short-Term Memory**: Provides recently generated latents as context to reduce error drift. The number of short-term memory latents is set empirically to match the number of latents generated per window.
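
One way to picture a monotonically increasing per-frame noise assignment is sketched below. The summary does not specify how the $N$ steps map onto the $S$ stages, so everything here is an assumption: noise levels are taken as evenly spaced in $(0, 1]$, each stage is assumed to span $N/S$ denoising steps, and both function names are hypothetical.

```python
import numpy as np

def stage_noise_levels(num_stages):
    """Assumed noise level per stage: rises monotonically up to 1.0 (pure noise)."""
    return np.arange(1, num_stages + 1) / num_stages

def timesteps_within_stage(stage, num_steps, num_stages):
    """Assumed noise levels visited while a frame moves from `stage` down one stage."""
    steps_per_stage = num_steps // num_stages
    hi = (stage + 1) / num_stages
    lo = stage / num_stages
    return np.linspace(hi, lo, steps_per_stage + 1)

N, S = 64, 8                       # values reported in the implementation details
levels = stage_noise_levels(S)     # earliest stage is the least noisy
first_stage = timesteps_within_stage(0, N, S)   # ends at 0.0, a clean latent
last_stage = timesteps_within_stage(S - 1, N, S)  # starts at 1.0, pure noise
```

Under these assumptions, each pass of $N/S = 8$ steps would move every frame in the window down by one stage, so the earliest frames become clean first while later frames stay noisy and correctable.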

**Overall Architecture** (Figure 2): The system converts user actions, expressed as twists in the Lie algebra $se(3)$, into camera poses, conditions a progressive autoregressive video transformer on these poses, and uses retrieved long-term memory latents and their poses to enforce 3D consistency.

## Empirical Validation / Results

### 4. WorldCam-50h Dataset
A new large-scale dataset collected to address the lack of authentic human gameplay data.
- **Sources**: 1 closed-licensed game (Counter-Strike) and 2 open-licensed games (Xonotic, Unvanquished). All figures/videos in the paper are from the open-licensed games.
- **Content**: Over 100 videos per game, averaging 8 minutes, totaling ~1,000 minutes (≈17 hours) per game (3,000 minutes overall).
- **Annotations**: Each video chunk is annotated with detailed textual descriptions generated by Qwen2.5-VL-7B and pseudo ground-truth camera poses extracted using ViPE (with filtering for erroneous estimates).
- **Statistics**: Captures diverse human behaviors including navigation, combined inputs, rapid camera movements, and revisiting locations (see Figure 3 for distributions).

### 5. Experiments

**Implementation Details**:
- Backbone: Wan2.1-1.3B-T2V video DiT. Spatial resolution: $480 \times 832$.
- Training: Three-stage training on 8 NVIDIA H100 GPUs.
    1.  Camera-controlled generation with short-term memory (10k iterations, batch size 64).
    2.  Progressive autoregressive training with short-term memory (10k iterations, batch size 48).
    3.  Progressive autoregressive training with both short- and long-term memory (10k iterations, batch size 16).
- Progressive autoregressive training: $N=64$ sampling timesteps, $S=8$ stages.

**Evaluation Settings & Metrics**:
- **Baselines**: Interactive gaming world models (Yume, Matrix-Game 2.0, GameCraft) and camera-controlled models (CameraCtrl, MotionCtrl).
- **Action Controllability**: Measured via average **Relative Pose Errors** in translation (RPE_trans), rotation (RPE_rot), and camera extrinsics (RPE_camera) between estimated and ground-truth camera trajectories (with Sim(3) Umeyama alignment).
- **Visual Quality**: Evaluated using **VBench++** metrics: Aesthetic Quality, Subject Consistency, Background Consistency, Imaging Quality, Temporal Flickering, Motion Smoothness, and their average.
- **3D Consistency**: Measured via:
    - **PSNR** and **LPIPS** between palindromic frame pairs in closed-loop trajectories.
    - **Sharpness** (variance of Laplacian) to account for blur.
    - **MEt3R**: Geometric multi-view consistency via DUSt3R reconstruction and DINO feature warping.
    - **DINO Similarity**: Cosine similarity between DINOv2 features of palindrome-corresponding frames.
- **Human Evaluation**: 30 participants rated models on action controllability, visual quality, and 3D consistency (scale 1-5).
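
Some of the simpler 3D-consistency measures can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's evaluation code: frames are assumed to be float arrays in $[0, 1]$, a palindromic trajectory is assumed to pair frame $i$ with frame $T - 1 - i$, the Laplacian uses a plain 4-neighbour stencil, and all function names are hypothetical.

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images with values in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val**2 / mse))

def laplacian_variance(gray):
    """Sharpness proxy: variance of a 4-neighbour Laplacian over interior pixels."""
    lap = (gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
           - 4.0 * gray[1:-1, 1:-1])
    return float(np.var(lap))

def palindrome_pairs(num_frames):
    """Pair frame i with frame (num_frames - 1 - i), assuming the path is retraced."""
    return [(i, num_frames - 1 - i) for i in range(num_frames // 2)]

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (e.g. image embeddings)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy video: a forward pass followed by its exact reverse gives perfect consistency.
rng = np.random.default_rng(0)
forward = rng.random(size=(3, 16, 16))
video = np.concatenate([forward, forward[::-1]], axis=0)
pair_psnr = [psnr(video[i], video[j]) for i, j in palindrome_pairs(len(video))]
```

A blurry model can score well on PSNR simply by averaging out detail, which is why a sharpness measure is reported alongside it.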

**Key Results**:

**Table 2: Quantitative comparison with interactive gaming world models (200-frame generation)**
| Method | RPE_trans ↓ | RPE_rot (°) ↓ | RPE_camera ↓ | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Yume | 0.111 | 2.222 | 0.137 | 0.774 | 0.476 | 0.741 | 0.892 | 0.600 | 0.955 | 0.986 |
| Matrix-Game 2.0 | 0.098 | 1.656 | 0.119 | 0.766 | 0.457 | 0.741 | 0.843 | 0.633 | 0.937 | 0.981 |
| GameCraft | 0.086 | 1.146 | 0.100 | 0.781 | 0.464 | 0.804 | 0.850 | 0.626 | 0.958 | 0.986 |
| **WorldCam** | **0.080** | **0.696** | **0.086** | **0.844** | **0.508** | **0.896** | **0.959** | **0.752** | **0.964** | 0.984 |

**Table 3: Quantitative results for long-horizon 3D consistency**
| Method | PSNR ↑ | LPIPS ↓ | MEt3R ↓ | DINO Sim. ↑ | Sharpness ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Real Videos | - | - | - | - | 577 |
| Yume | 16.03 | 0.5629 | 0.0905 | 0.4545 | 95 |
| Matrix-Game 2.0 | 13.66 | 0.4997 | 0.0662 | 0.6153 | 179 |
| GameCraft | 14.27 | 0.5749 | 0.0489 | 0.5960 | 201 |
| **WorldCam** | **16.69** | **0.3277** | **0.0342** | **0.8884** | **656** |

**Table 4: Comparison with camera-controlled models (16-frame generation)**
| Method | RPE_trans ↓ | RPE_rot (°) ↓ | RPE_camera ↓ |
| :--- | :---: | :---: | :---: |
| CameraCtrl | 0.071 | 0.943 | 0.083 |
| MotionCtrl | 0.080 | 1.271 | 0.102 |
| **WorldCam** | **0.026** | **0.386** | **0.030** |

**Table 5: Human evaluation results (average user ratings, 1-5)**
| Method | Action Controllability ↑ | Visual Quality ↑ | 3D Consistency ↑ |
| :--- | :---: | :---: | :---: |
| Yume | 2.47 | 2.83 | 1.44 |
| Matrix-Game 2.0 | 3.78 | 3.42 | 2.75 |
| GameCraft | 2.55 | 3.34 | 3.36 |
| **WorldCam** | **4.31** | **4.44** | **4.36** |

**Qualitative Results** (Figures 4, 5, 6):
- WorldCam accurately follows complex coupled keyboard/mouse inputs (Figure 5a).
- It generates plausible long-horizon outputs (>10 seconds) without error drift (Figure 5b).
- It preserves consistent 3D geometry when revisiting locations beyond the denoising window (Figures 4, 6).

**Ablation Studies**:

**Table 6: Ablation on number of long-term memory latents**
| # | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ | PSNR ↑ | LPIPS ↓ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 | 12.163 | 0.591 |
| 1 | 0.840 | 0.511 | 0.879 | 0.955 | 0.752 | 0.961 | 0.984 | 12.624 | 0.573 |
| 4 | 0.841 | 0.510 | 0.881 | 0.956 | 0.750 | 0.961 | 0.984 | 12.950 | 0.554 |

**Table 7: Ablation on number of short-term memory latents**
| # | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 0.749 | 0.317 | 0.873 | 0.947 | 0.414 | 0.959 | 0.983 |
| 4 | 0.836 | 0.514 | 0.873 | 0.947 | 0.737 | 0.959 | 0.983 |
| 8 | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 |

**Table 8: Ablation on attention sink**
| Method | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| w/o attention sink | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 |
| with attention sink | 0.841 | 0.502 | 0.883 | 0.956 | 0.753 | 0.964 | 0.984 |

**Table 9: Ablation on action-to-camera mapping**
| Method | RPE_trans ↓ | RPE_rot (°) ↓ | RPE_camera ↓ |
| :--- | :---: | :---: | :---: |
| Yume | 0.111 | 2.222 | 0.137 |
| Matrix-Game 2.0 | 0.098 | 1.656 | 0.119 |
| GameCraft | 0.086 | 1.146 | 0.100 |
| WorldCam (Linear) | 0.093 | 0.962 | 0.102 |
| **WorldCam (Lie)** | **0.080** | **0.696** | **0.086** |

**Table 10: Ablation on long-term memory retrieval strategy**
| Method | PSNR ↑ | LPIPS ↓ | MEt3R ↓ |
| :--- | :---: | :---: | :---: |
| Random | 15
