WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Summary (Overview)

  • Core Contribution: Establishes camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency in interactive gaming world models.
  • Key Method: Introduces a physics-based continuous action space defined in the Lie algebra $se(3)$ to derive precise 6-DoF camera poses from user inputs, and a pose-indexed long-term memory retrieval mechanism to enforce spatial coherence.
  • New Dataset: Publishes WorldCam-50h, a large-scale dataset of 3,000 minutes (50 hours) of authentic human gameplay from open-licensed games, annotated with camera trajectories and textual descriptions.
  • Performance: Demonstrates substantial improvements over state-of-the-art models in action controllability, long-horizon visual quality, and 3D spatial consistency through extensive quantitative and human evaluations.
  • Architecture: Builds on a progressive autoregressive Video Diffusion Transformer (DiT) backbone, enhanced with a camera embedder, attention sink, and short-/long-term memory mechanisms for stable long-horizon generation.

Introduction and Theoretical Foundation

Recent advances in Video Diffusion Transformers (DiTs) have enabled interactive gaming world models. However, existing approaches struggle with precise action control and long-horizon 3D consistency. The fundamental issue is that prior works treat user actions (keyboard/mouse) as abstract conditioning signals, overlooking the geometric coupling between actions and the 3D world. In a 3D environment, user actions induce relative camera motions that accumulate into a global camera pose, which dictates the 2D projection of the world. Therefore, accurate action control and 3D consistency are inherently coupled through the camera pose.

WorldCam addresses this by establishing the camera pose as the core geometric representation. This serves a dual purpose:

  1. For Action Control: User inputs are translated into geometrically accurate camera poses.
  2. For 3D Consistency: Global camera poses act as spatial indices to retrieve past observations, ensuring consistency when revisiting locations.

The paper positions WorldCam against prior works (see Table 1), highlighting its unique ability to combine action control, 3D consistency, and long-horizon inference.

Methodology

3.1 Baseline: Video Diffusion Transformer

WorldCam builds on a pretrained video DiT (Wan-2.1-T2V). Given an input video $V \in \mathbb{R}^{F \times H \times W \times 3}$, a VAE encoder maps it to a latent sequence $z_0 \in \mathbb{R}^{f \times h \times w \times c}$. The model learns to predict a velocity field. The training objective is:

L_{FM} = \mathbb{E}_{z_0, c_{text}, t} \left[ \left\| v_{\theta}(z_t, c_{text}, t) - \frac{z_0 - z_t}{1 - t} \right\|_2^2 \right]
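To make the objective concrete, the sketch below assumes a rectified-flow interpolation convention consistent with the $1 - t$ denominator above (pure noise at $t = 0$, clean data at $t = 1$); under that assumption the regression target reduces to the constant direction $z_0 - \epsilon$:

```python
import numpy as np

def flow_matching_pair(z0, eps, t):
    """Assumed rectified-flow path: z_t = t * z0 + (1 - t) * eps
    (t=0 pure noise, t=1 clean data). The velocity target
    (z0 - z_t) / (1 - t) then simplifies to z0 - eps."""
    zt = t * z0 + (1.0 - t) * eps
    target = (z0 - zt) / (1.0 - t)
    return zt, target

rng = np.random.default_rng(0)
z0, eps = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
zt, target = flow_matching_pair(z0, eps, t=0.3)
assert np.allclose(target, z0 - eps)  # constant direction along the linear path
```

This is only a reading of the loss under the stated interpolation convention; the paper's exact noise schedule may differ.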

3.2 Action-to-Camera Mapping

To ensure physically accurate control, the action space is defined in the Lie algebra $se(3)$. At each transition from frame $I_{i-1}$ to $I_i$, the user action $A_i$ is a twist vector:

A_i = [v_i; \omega_i] \in \mathbb{R}^6

where $v_i = [v_x, v_y, v_z]^\top \in \mathbb{R}^3$ and $\omega_i = [\omega_x, \omega_y, \omega_z]^\top \in \mathbb{R}^3$ denote linear and angular velocities. The corresponding relative camera pose $\Delta P_i \in SE(3)$ is derived via the matrix exponential map:

\Delta P_i = \exp(\hat{A}_i) = \begin{bmatrix} \Delta R_i & \Delta t_i \\ 0^\top & 1 \end{bmatrix}

where $\hat{A}_i \in se(3)$ is the $4 \times 4$ matrix form of the twist $A_i$. This formulation integrates linear and angular velocities jointly on the $SE(3)$ manifold, capturing coupled dynamics such as screw motion, unlike decoupled linear approximations.
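The exponential map above has a standard closed form (Rodrigues' formula plus the $SO(3)$ left Jacobian for the translation part). A minimal sketch, using only NumPy and the textbook formulas rather than any code from the paper:

```python
import numpy as np

def twist_to_pose(v, omega):
    """Map a twist A = [v; omega] in se(3) to Delta_P in SE(3) via the
    closed-form exponential; equivalent to expm of the 4x4 hat matrix."""
    v, omega = np.asarray(v, float), np.asarray(omega, float)
    theta = np.linalg.norm(omega)
    W = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])  # hat(omega), skew-symmetric
    if theta < 1e-9:                  # near-zero rotation: first-order expansion
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W
    else:
        a, b = np.sin(theta) / theta, (1 - np.cos(theta)) / theta ** 2
        c = (theta - np.sin(theta)) / theta ** 3
        R = np.eye(3) + a * W + b * (W @ W)   # Rodrigues rotation
        V = np.eye(3) + b * W + c * (W @ W)   # left Jacobian of SO(3)
    P = np.eye(4)
    P[:3, :3], P[:3, 3] = R, V @ v            # [[dR, dt], [0, 1]]
    return P

# Pure forward translation: no rotation accrues.
dP = twist_to_pose([0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
assert np.allclose(dP[:3, :3], np.eye(3)) and np.allclose(dP[:3, 3], [0, 0, 1])
```

Passing a nonzero $\omega$ together with $v$ yields the coupled screw motion the text contrasts with decoupled linear approximations.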

3.3 Camera-Controlled Video Generation

Relative poses $\{\Delta P_i\}_{i=1}^F$ are accumulated into global camera poses aligned with the first frame. These poses are converted into Plücker embeddings $\hat{P} \in \mathbb{R}^{F \times 6}$. A lightweight camera embedding module $c_{\phi}$ (two MLP layers) injects camera control into the DiT. Since the VAE compresses time by a factor $r$, $r$ consecutive Plücker embeddings are concatenated for each latent frame, resulting in $\hat{p} \in \mathbb{R}^{f \times (6r)}$. The embeddings are added to the DiT features $d$ after each self-attention layer:

d \gets d + c_{\phi}(\hat{p})
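The shapes of this injection step can be sketched as follows. The ReLU, the weight names, and the per-latent-frame feature layout are illustrative assumptions; only the grouping by $r$, the two-layer MLP $c_{\phi}$, and the residual add come from the description above:

```python
import numpy as np

def inject_camera(d, plucker, W1, b1, W2, b2, r):
    """Group r consecutive per-frame Plücker embeddings per latent frame,
    pass them through a two-layer MLP (c_phi), and add the result to the
    DiT features d after self-attention. Shapes/activation are assumptions."""
    F = plucker.shape[0]
    p_hat = plucker.reshape(F // r, 6 * r)    # (f, 6r): r frames per latent
    h = np.maximum(p_hat @ W1 + b1, 0.0)      # first MLP layer + ReLU
    return d + (h @ W2 + b2)                  # residual injection into features

rng = np.random.default_rng(0)
F, r, c = 8, 4, 16                            # 8 frames, temporal compression 4
plucker = rng.normal(size=(F, 6))
d = rng.normal(size=(F // r, c))              # one feature row per latent frame
W1, b1 = rng.normal(size=(6 * r, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, c)), np.zeros(c)
out = inject_camera(d, plucker, W1, b1, W2, b2, r)
assert out.shape == d.shape
```

In the real model $d$ is a full token grid rather than one row per latent frame; the sketch only tracks the frame axis.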

3.4 Pose-Anchored Long-Term Memory

Global Pose Accumulation: Relative motions are accumulated into global camera poses $P^{global}_j$ via pose composition:

P^{global}_j = P^{global}_{j-1} \circ \Delta P_j, \quad P^{global}_0 = I

Pose-Indexed Memory Retrieval: A long-term memory pool $M$ stores previously generated latents with their global poses. A hierarchical retrieval strategy uses the global pose as a spatial index:

  1. Select the top-$K$ candidates $M_{trans}$ whose camera positions $t_j$ are closest to the current position $t_i$: $M_{trans} = \text{TopK}_K\left(-\|t_j - t_i\|_2 \,;\, (P^{global}_j, z_j) \in M\right)$
  2. From $M_{trans}$, select the $L$ entries ($L \leq K$) whose viewing directions (rotation matrices $R_j$) are most aligned with the current orientation $R_i$, measured by the trace of the relative rotation matrix: $M_{rot} = \text{TopK}_L\left(\text{tr}(R_j^\top R_i) \,;\, (P^{global}_j, z_j) \in M_{trans}\right)$
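The accumulation and the two-stage retrieval above can be sketched compactly. The memory layout (a list of pose/latent pairs) is an assumption made for illustration:

```python
import numpy as np

def accumulate(delta_poses):
    """Compose relative 4x4 poses into global poses, with P_0 = I."""
    P, out = np.eye(4), []
    for dP in delta_poses:
        P = P @ dP
        out.append(P.copy())
    return out

def retrieve(memory, P_cur, K=4, L=2):
    """Hierarchical pose-indexed retrieval: (1) top-K entries nearest in
    camera position; (2) among those, top-L entries whose rotation best
    aligns with the current one (larger tr(R_j^T R_i) = smaller angle).
    `memory` is assumed to be a list of (P_global, latent) pairs."""
    t_i, R_i = P_cur[:3, 3], P_cur[:3, :3]
    m_trans = sorted(memory, key=lambda e: np.linalg.norm(e[0][:3, 3] - t_i))[:K]
    m_rot = sorted(m_trans, key=lambda e: -np.trace(e[0][:3, :3].T @ R_i))[:L]
    return m_rot

# Toy memory: identity rotations at increasing distances along z.
mem = []
for k in range(6):
    P = np.eye(4); P[2, 3] = float(k)
    mem.append((P, f"latent_{k}"))
query = np.eye(4); query[2, 3] = 0.2
picked = retrieve(mem, query, K=3, L=2)
assert picked[0][1] == "latent_0"  # the nearest stored view comes back first
```

Because Python's `sorted` is stable, ties in the rotation score preserve the distance ordering from stage one.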

Long-Term Memory Conditioning: Retrieved latents are concatenated with the current input. Their associated camera poses are realigned, embedded via $c_{\phi}$, and injected into the DiT, establishing geometric correspondences.

3.5 Progressive Autoregressive Inference

  • Progressive Noise Scheduling: Adopts a progressive per-frame noise schedule with monotonically increasing noise levels across latent frames within each denoising window. This provides a low-noise anchor in early frames while keeping future frames correctable. The diffusion process is discretized into $N$ inference steps partitioned into $S$ stages.
  • Attention Sink: Incorporates an attention sink mechanism (akin to StreamingLLM) by retaining global initial frames as attention anchors to preserve frame fidelity, scene style, and UI consistency.
  • Short-Term Memory: Conditions generation on recently generated latents to reduce error drift. The number of short-term latents is empirically set to match the number of newly generated latents.
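The progressive per-frame scheduling idea can be illustrated with a minimal stage assignment. This is a sketch of the monotone-noise principle stated above, not the paper's exact mapping of $N$ steps to $S$ stages:

```python
def progressive_noise_stages(num_latent_frames, num_stages):
    """Assign each latent frame in the denoising window a noise stage that
    increases monotonically with temporal index: early frames act as
    near-clean anchors, later frames stay noisier and remain correctable.
    (Illustrative schedule; the paper's exact mapping may differ.)"""
    return [min(num_stages - 1, (i * num_stages) // num_latent_frames)
            for i in range(num_latent_frames)]

stages = progressive_noise_stages(num_latent_frames=8, num_stages=8)
assert stages == [0, 1, 2, 3, 4, 5, 6, 7]  # monotone across the window
```

With more frames than stages, neighboring frames share a stage while the sequence stays non-decreasing.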

Overall Architecture (Figure 2): The system converts user actions to camera poses in Lie algebra, conditions a progressive autoregressive video transformer on these poses, and uses retrieved long-term memory latents and poses to enforce 3D consistency.

Empirical Validation / Results

4. WorldCam-50h Dataset

A new large-scale dataset collected to address the lack of authentic human gameplay data.

  • Sources: one closed-license game (Counter-Strike) and two open-license games (Xonotic, Unvanquished). All figures/videos in the paper are from the open-license games.
  • Content: Over 100 videos per game, averaging 8 minutes, totaling ~1,000 minutes (≈17 hours) per game (3,000 minutes overall).
  • Annotations: Each video chunk is annotated with detailed textual descriptions generated by Qwen2.5-VL-7B and pseudo ground-truth camera poses extracted using ViPE (with filtering for erroneous estimates).
  • Statistics: Captures diverse human behaviors including navigation, combined inputs, rapid camera movements, and revisiting locations (see Figure 3 for distributions).

5. Experiments

Implementation Details:

  • Backbone: Wan2.1-1.3B-T2V video DiT. Spatial resolution: $480 \times 832$.
  • Training: Three-stage training on 8 NVIDIA H100 GPUs.
    1. Camera-controlled generation with short-term memory (10k iterations, batch size 64).
    2. Progressive autoregressive training with short-term memory (10k iterations, batch size 48).
    3. Progressive autoregressive training with both short- and long-term memory (10k iterations, batch size 16).
  • Progressive autoregressive training: $N = 64$ sampling timesteps, $S = 8$ stages.

Evaluation Settings & Metrics:

  • Baselines: Interactive gaming world models (Yume, Matrix-Game 2.0, GameCraft) and camera-controlled models (CameraCtrl, MotionCtrl).
  • Action Controllability: Measured via average Relative Pose Errors in translation (RPE_trans), rotation (RPE_rot), and camera extrinsics (RPE_camera) between estimated and ground-truth camera trajectories (with Sim(3) Umeyama alignment).
  • Visual Quality: Evaluated using VBench++ metrics: Aesthetic Quality, Subject Consistency, Background Consistency, Imaging Quality, Temporal Flickering, Motion Smoothness, and their average.
  • 3D Consistency: Measured via:
    • PSNR and LPIPS between palindromic frame pairs in closed-loop trajectories.
    • Sharpness (variance of Laplacian) to account for blur.
    • MEt3R: Geometric multi-view consistency via DUSt3R reconstruction and DINO feature warping.
    • DINO Similarity: Cosine similarity between DINOv2 features of palindrome-corresponding frames.
  • Human Evaluation: 30 participants rated models on action controllability, visual quality, and 3D consistency (scale 1-5).
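The palindrome-based consistency metrics can be sketched for PSNR. Frame pairing and the pixel range are assumptions for illustration; the paper additionally applies LPIPS, sharpness, MEt3R, and DINO similarity to the same pairs:

```python
import numpy as np

def palindrome_psnr(frames):
    """For a closed-loop (palindromic) trajectory, frame i and frame
    len(frames)-1-i should view the same scene; average PSNR over such
    pairs probes 3D consistency. Assumes uint8 frames in [0, 255]."""
    n = len(frames)
    scores = []
    for i in range(n // 2):
        a = frames[i].astype(np.float64)
        b = frames[n - 1 - i].astype(np.float64)
        mse = np.mean((a - b) ** 2)
        scores.append(10 * np.log10(255.0 ** 2 / max(mse, 1e-12)))
    return float(np.mean(scores))

# Identical mirrored frames give a very high PSNR (bounded by the epsilon).
clip = [np.full((4, 4), 10, dtype=np.uint8)] * 4
assert palindrome_psnr(clip) > 100
```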

Key Results:

Table 2: Quantitative comparison with interactive gaming world models (200-frame generation)

| Method | RPE_trans ↓ | RPE_rot (º) ↓ | RPE_camera ↓ | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yume | 0.111 | 2.222 | 0.137 | 0.774 | 0.476 | 0.741 | 0.892 | 0.600 | 0.955 | 0.986 |
| Matrix-Game 2.0 | 0.098 | 1.656 | 0.119 | 0.766 | 0.457 | 0.741 | 0.843 | 0.633 | 0.937 | 0.981 |
| GameCraft | 0.086 | 1.146 | 0.100 | 0.781 | 0.464 | 0.804 | 0.850 | 0.626 | 0.958 | 0.986 |
| WorldCam | 0.080 | 0.696 | 0.086 | 0.844 | 0.508 | 0.896 | 0.959 | 0.752 | 0.964 | 0.984 |

Table 3: Quantitative results for long-horizon 3D consistency

| Method | PSNR ↑ | LPIPS ↓ | MEt3R ↓ | DINO Sim. ↑ | Sharpness ↑ |
| --- | --- | --- | --- | --- | --- |
| Real Videos | – | – | – | – | 577 |
| Yume | 16.03 | 0.5629 | 0.0905 | 0.4545 | 95 |
| Matrix-Game 2.0 | 13.66 | 0.4997 | 0.0662 | 0.6153 | 179 |
| GameCraft | 14.27 | 0.5749 | 0.0489 | 0.5960 | 201 |
| WorldCam | 16.69 | 0.3277 | 0.0342 | 0.8884 | 656 |

Table 4: Comparison with camera-controlled models (16-frame generation)

| Method | RPE_trans ↓ | RPE_rot (º) ↓ | RPE_camera ↓ |
| --- | --- | --- | --- |
| CameraCtrl | 0.071 | 0.943 | 0.083 |
| MotionCtrl | 0.080 | 1.271 | 0.102 |
| WorldCam | 0.026 | 0.386 | 0.030 |

Table 5: Human evaluation results (average user ratings, 1-5)

| Method | Action Controllability ↑ | Visual Quality ↑ | 3D Consistency ↑ |
| --- | --- | --- | --- |
| Yume | 2.47 | 2.83 | 1.44 |
| Matrix-Game 2.0 | 3.78 | 3.42 | 2.75 |
| GameCraft | 2.55 | 3.34 | 3.36 |
| WorldCam | 4.31 | 4.44 | 4.36 |

Qualitative Results (Figures 4, 5, 6):

  • WorldCam accurately follows complex coupled keyboard/mouse inputs (Figure 5a).
  • It generates plausible long-horizon outputs (>10 seconds) without error drift (Figure 5b).
  • It preserves consistent 3D geometry when revisiting locations beyond the denoising window (Figures 4, 6).

Ablation Studies:

Table 6: Ablation on number of long-term memory latents

| # | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ | PSNR ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 | 12.163 | 0.591 |
| 1 | 0.840 | 0.511 | 0.879 | 0.955 | 0.752 | 0.961 | 0.984 | 12.624 | 0.573 |
| 4 | 0.841 | 0.510 | 0.881 | 0.956 | 0.750 | 0.961 | 0.984 | 12.950 | 0.554 |

Table 7: Ablation on number of short-term memory latents

| # | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.749 | 0.317 | 0.873 | 0.947 | 0.414 | 0.959 | 0.983 |
| 4 | 0.836 | 0.514 | 0.873 | 0.947 | 0.737 | 0.959 | 0.983 |
| 8 | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 |

Table 8: Ablation on attention sink

| Method | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o attention sink | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 |
| with attention sink | 0.841 | 0.502 | 0.883 | 0.956 | 0.753 | 0.964 | 0.984 |

Table 9: Ablation on action-to-camera mapping

| Method | RPE_trans ↓ | RPE_rot (º) ↓ | RPE_camera ↓ |
| --- | --- | --- | --- |
| Yume | 0.111 | 2.222 | 0.137 |
| Matrix-Game 2.0 | 0.098 | 1.656 | 0.119 |
| GameCraft | 0.086 | 1.146 | 0.100 |
| WorldCam (Linear) | 0.093 | 0.962 | 0.102 |
| WorldCam (Lie) | 0.080 | 0.696 | 0.086 |

Table 10: Ablation on long-term memory retrieval strategy

| Method | PSNR ↑ | LPIPS ↓ | MEt3R ↓ |
| --- | --- | --- | --- |
| Random | 15