WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Summary (Overview)

  • Core Contribution: Establishes camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency in interactive gaming world models.
  • Key Method: Introduces a physics-based continuous action space defined in the Lie algebra $se(3)$ to derive precise 6-DoF camera poses from user inputs, and a pose-indexed long-term memory retrieval mechanism to enforce spatial coherence.
  • New Dataset: Publishes WorldCam-50h, a large-scale dataset of 3,000 minutes (50 hours) of authentic human gameplay from open-licensed games, annotated with camera trajectories and textual descriptions.
  • Performance: Demonstrates substantial improvements over state-of-the-art models in action controllability, long-horizon visual quality, and 3D spatial consistency through extensive quantitative and human evaluations.
  • Architecture: Builds on a progressive autoregressive Video Diffusion Transformer (DiT) backbone, enhanced with a camera embedder, attention sink, and short-/long-term memory mechanisms for stable long-horizon generation.

Introduction and Theoretical Foundation

Recent advances in Video Diffusion Transformers (DiTs) have enabled interactive gaming world models. However, existing approaches struggle with precise action control and long-horizon 3D consistency. The fundamental issue is that prior works treat user actions (keyboard/mouse) as abstract conditioning signals, overlooking the geometric coupling between actions and the 3D world. In a 3D environment, user actions induce relative camera motions that accumulate into a global camera pose, which dictates the 2D projection of the world. Therefore, accurate action control and 3D consistency are inherently coupled through the camera pose.

WorldCam addresses this by establishing the camera pose as the core geometric representation. This serves a dual purpose:

  1. For Action Control: User inputs are translated into geometrically accurate camera poses.
  2. For 3D Consistency: Global camera poses act as spatial indices to retrieve past observations, ensuring consistency when revisiting locations.

The paper positions WorldCam against prior works (see Table 1), highlighting its unique ability to combine action control, 3D consistency, and long-horizon inference.

Methodology

3.1 Baseline: Video Diffusion Transformer

WorldCam builds on a pretrained video DiT (Wan-2.1-T2V). Given an input video $V \in \mathbb{R}^{F \times H \times W \times 3}$, a VAE encoder maps it to a latent sequence $z_0 \in \mathbb{R}^{f \times h \times w \times c}$. The model learns to predict a velocity field. The training objective is:

L_{FM} = \mathbb{E}_{z_0, c_{text}, t} \left[ \left\| v_{\theta}(z_t, c_{text}, t) - \frac{z_0 - z_t}{1 - t} \right\|_2^2 \right]
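To make the objective concrete, the sketch below assumes a rectified-flow interpolation convention consistent with the $1 - t$ denominator above (pure noise at $t = 0$, clean data at $t = 1$); under that assumption the regression target reduces to the constant direction $z_0 - \epsilon$:

```python
import numpy as np

def flow_matching_pair(z0, eps, t):
    """Assumed rectified-flow path: z_t = t * z0 + (1 - t) * eps
    (t=0 pure noise, t=1 clean data). The velocity target
    (z0 - z_t) / (1 - t) then simplifies to z0 - eps."""
    zt = t * z0 + (1.0 - t) * eps
    target = (z0 - zt) / (1.0 - t)
    return zt, target

rng = np.random.default_rng(0)
z0, eps = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
zt, target = flow_matching_pair(z0, eps, t=0.3)
assert np.allclose(target, z0 - eps)  # constant direction along the linear path
```

This is only a reading of the loss under the stated interpolation convention; the paper's exact noise schedule may differ.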

3.2 Action-to-Camera Mapping

To ensure physically accurate control, the action space is defined in the Lie algebra $se(3)$. At each transition from frame $I_{i-1}$ to $I_i$, the user action $A_i$ is a twist vector:

A_i = [v_i; \omega_i] \in \mathbb{R}^6

where $v_i = [v_x, v_y, v_z]^\top \in \mathbb{R}^3$ and $\omega_i = [\omega_x, \omega_y, \omega_z]^\top \in \mathbb{R}^3$ denote linear and angular velocities. The corresponding relative camera pose $\Delta P_i \in SE(3)$ is derived via the matrix exponential map:

\Delta P_i = \exp(\hat{A}_i) = \begin{bmatrix} \Delta R_i & \Delta t_i \\ 0^\top & 1 \end{bmatrix}

where $\hat{A}_i \in se(3)$ is the $4 \times 4$ matrix form of the twist $A_i$. This formulation integrates linear and angular velocities jointly on the $SE(3)$ manifold, capturing coupled dynamics such as screw motion, unlike decoupled linear approximations.
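The exponential map above has a standard closed form (Rodrigues' formula plus the $SO(3)$ left Jacobian for the translation part). A minimal sketch, using only NumPy and the textbook formulas rather than any code from the paper:

```python
import numpy as np

def twist_to_pose(v, omega):
    """Map a twist A = [v; omega] in se(3) to Delta_P in SE(3) via the
    closed-form exponential; equivalent to expm of the 4x4 hat matrix."""
    v, omega = np.asarray(v, float), np.asarray(omega, float)
    theta = np.linalg.norm(omega)
    W = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])  # hat(omega), skew-symmetric
    if theta < 1e-9:                  # near-zero rotation: first-order expansion
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W
    else:
        a, b = np.sin(theta) / theta, (1 - np.cos(theta)) / theta ** 2
        c = (theta - np.sin(theta)) / theta ** 3
        R = np.eye(3) + a * W + b * (W @ W)   # Rodrigues rotation
        V = np.eye(3) + b * W + c * (W @ W)   # left Jacobian of SO(3)
    P = np.eye(4)
    P[:3, :3], P[:3, 3] = R, V @ v            # [[dR, dt], [0, 1]]
    return P

# Pure forward translation: no rotation accrues.
dP = twist_to_pose([0.0, 0.0, 1.0], [0.0, 0.0, 0.0])
assert np.allclose(dP[:3, :3], np.eye(3)) and np.allclose(dP[:3, 3], [0, 0, 1])
```

Passing a nonzero $\omega$ together with $v$ yields the coupled screw motion the text contrasts with decoupled linear approximations.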

3.3 Camera-Controlled Video Generation

Relative poses $\{\Delta P_i\}_{i=1}^F$ are accumulated into global camera poses aligned with the first frame. These poses are converted into Plücker embeddings $\hat{P} \in \mathbb{R}^{F \times 6}$. A lightweight camera embedding module $c_{\phi}$ (two MLP layers) injects camera control into the DiT. Since the VAE compresses time by a factor $r$, $r$ consecutive Plücker embeddings are concatenated for each latent frame, resulting in $\hat{p} \in \mathbb{R}^{f \times (6r)}$. The embeddings are added to the DiT features $d$ after each self-attention layer:

d \gets d + c_{\phi}(\hat{p})
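The shapes of this injection step can be sketched as follows. The ReLU, the weight names, and the per-latent-frame feature layout are illustrative assumptions; only the grouping by $r$, the two-layer MLP $c_{\phi}$, and the residual add come from the description above:

```python
import numpy as np

def inject_camera(d, plucker, W1, b1, W2, b2, r):
    """Group r consecutive per-frame Plücker embeddings per latent frame,
    pass them through a two-layer MLP (c_phi), and add the result to the
    DiT features d after self-attention. Shapes/activation are assumptions."""
    F = plucker.shape[0]
    p_hat = plucker.reshape(F // r, 6 * r)    # (f, 6r): r frames per latent
    h = np.maximum(p_hat @ W1 + b1, 0.0)      # first MLP layer + ReLU
    return d + (h @ W2 + b2)                  # residual injection into features

rng = np.random.default_rng(0)
F, r, c = 8, 4, 16                            # 8 frames, temporal compression 4
plucker = rng.normal(size=(F, 6))
d = rng.normal(size=(F // r, c))              # one feature row per latent frame
W1, b1 = rng.normal(size=(6 * r, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, c)), np.zeros(c)
out = inject_camera(d, plucker, W1, b1, W2, b2, r)
assert out.shape == d.shape
```

In the real model $d$ is a full token grid rather than one row per latent frame; the sketch only tracks the frame axis.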

3.4 Pose-Anchored Long-Term Memory

Global Pose Accumulation: Relative motions are accumulated into global camera poses $P^{global}_j$ via pose composition:

P^{global}_j = P^{global}_{j-1} \circ \Delta P_j, \quad P^{global}_0 = I

Pose-Indexed Memory Retrieval: A long-term memory pool $M$ stores previously generated latents with their global poses. A hierarchical retrieval strategy uses the global pose as a spatial index:

  1. Select the top-$K$ candidates $M_{trans}$ whose camera positions $t_j$ are closest to the current position $t_i$: $M_{trans} = \text{TopK}_K\left(-\|t_j - t_i\|_2 \,;\, (P^{global}_j, z_j) \in M\right)$
  2. From $M_{trans}$, select the $L$ entries ($L \leq K$) whose viewing directions (rotation matrices $R_j$) are most aligned with the current orientation $R_i$, measured by the trace of the relative rotation matrix: $M_{rot} = \text{TopK}_L\left(\text{tr}(R_j^\top R_i) \,;\, (P^{global}_j, z_j) \in M_{trans}\right)$
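The accumulation and the two-stage retrieval above can be sketched compactly. The memory layout (a list of pose/latent pairs) is an assumption made for illustration:

```python
import numpy as np

def accumulate(delta_poses):
    """Compose relative 4x4 poses into global poses, with P_0 = I."""
    P, out = np.eye(4), []
    for dP in delta_poses:
        P = P @ dP
        out.append(P.copy())
    return out

def retrieve(memory, P_cur, K=4, L=2):
    """Hierarchical pose-indexed retrieval: (1) top-K entries nearest in
    camera position; (2) among those, top-L entries whose rotation best
    aligns with the current one (larger tr(R_j^T R_i) = smaller angle).
    `memory` is assumed to be a list of (P_global, latent) pairs."""
    t_i, R_i = P_cur[:3, 3], P_cur[:3, :3]
    m_trans = sorted(memory, key=lambda e: np.linalg.norm(e[0][:3, 3] - t_i))[:K]
    m_rot = sorted(m_trans, key=lambda e: -np.trace(e[0][:3, :3].T @ R_i))[:L]
    return m_rot

# Toy memory: identity rotations at increasing distances along z.
mem = []
for k in range(6):
    P = np.eye(4); P[2, 3] = float(k)
    mem.append((P, f"latent_{k}"))
query = np.eye(4); query[2, 3] = 0.2
picked = retrieve(mem, query, K=3, L=2)
assert picked[0][1] == "latent_0"  # the nearest stored view comes back first
```

Because Python's `sorted` is stable, ties in the rotation score preserve the distance ordering from stage one.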

Long-Term Memory Conditioning: Retrieved latents are concatenated with the current input. Their associated camera poses are realigned, embedded via $c_{\phi}$, and injected into the DiT, establishing geometric correspondences.

3.5 Progressive Autoregressive Inference

  • Progressive Noise Scheduling: Adopts a progressive per-frame noise schedule with monotonically increasing noise levels across latent frames within each denoising window. This provides a low-noise anchor in early frames while keeping future frames correctable. The diffusion process is discretized into $N$ inference steps partitioned into $S$ stages.
  • Attention Sink: Incorporates an attention sink mechanism (akin to StreamingLLM) by retaining global initial frames as attention anchors to preserve frame fidelity, scene style, and UI consistency.
  • Short-Term Memory: Conditions generation on recently generated latents to reduce error drift. The number of short-term latents is empirically set to match the number of newly generated latents.
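The progressive per-frame scheduling idea can be illustrated with a minimal stage assignment. This is a sketch of the monotone-noise principle stated above, not the paper's exact mapping of $N$ steps to $S$ stages:

```python
def progressive_noise_stages(num_latent_frames, num_stages):
    """Assign each latent frame in the denoising window a noise stage that
    increases monotonically with temporal index: early frames act as
    near-clean anchors, later frames stay noisier and remain correctable.
    (Illustrative schedule; the paper's exact mapping may differ.)"""
    return [min(num_stages - 1, (i * num_stages) // num_latent_frames)
            for i in range(num_latent_frames)]

stages = progressive_noise_stages(num_latent_frames=8, num_stages=8)
assert stages == [0, 1, 2, 3, 4, 5, 6, 7]  # monotone across the window
```

With more frames than stages, neighboring frames share a stage while the sequence stays non-decreasing.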

Overall Architecture (Figure 2): The system converts user actions to camera poses in Lie algebra, conditions a progressive autoregressive video transformer on these poses, and uses retrieved long-term memory latents and poses to enforce 3D consistency.

Empirical Validation / Results

4. WorldCam-50h Dataset

A new large-scale dataset collected to address the lack of authentic human gameplay data.

  • Sources: one closed-license game (Counter-Strike) and two open-license games (Xonotic, Unvanquished). All figures/videos in the paper are from the open-license games.
  • Content: Over 100 videos per game, averaging 8 minutes, totaling ~1,000 minutes (≈17 hours) per game (3,000 minutes overall).
  • Annotations: Each video chunk is annotated with detailed textual descriptions generated by Qwen2.5-VL-7B and pseudo ground-truth camera poses extracted using ViPE (with filtering for erroneous estimates).
  • Statistics: Captures diverse human behaviors including navigation, combined inputs, rapid camera movements, and revisiting locations (see Figure 3 for distributions).

5. Experiments

Implementation Details:

  • Backbone: Wan2.1-1.3B-T2V video DiT. Spatial resolution: $480 \times 832$.
  • Training: Three-stage training on 8 NVIDIA H100 GPUs.
    1. Camera-controlled generation with short-term memory (10k iterations, batch size 64).
    2. Progressive autoregressive training with short-term memory (10k iterations, batch size 48).
    3. Progressive autoregressive training with both short- and long-term memory (10k iterations, batch size 16).
  • Progressive autoregressive training: $N = 64$ sampling timesteps, $S = 8$ stages.

Evaluation Settings & Metrics:

  • Baselines: Interactive gaming world models (Yume, Matrix-Game 2.0, GameCraft) and camera-controlled models (CameraCtrl, MotionCtrl).
  • Action Controllability: Measured via average Relative Pose Errors in translation (RPE_trans), rotation (RPE_rot), and camera extrinsics (RPE_camera) between estimated and ground-truth camera trajectories (with Sim(3) Umeyama alignment).
  • Visual Quality: Evaluated using VBench++ metrics: Aesthetic Quality, Subject Consistency, Background Consistency, Imaging Quality, Temporal Flickering, Motion Smoothness, and their average.
  • 3D Consistency: Measured via:
    • PSNR and LPIPS between palindromic frame pairs in closed-loop trajectories.
    • Sharpness (variance of Laplacian) to account for blur.
    • MEt3R: Geometric multi-view consistency via DUSt3R reconstruction and DINO feature warping.
    • DINO Similarity: Cosine similarity between DINOv2 features of palindrome-corresponding frames.
  • Human Evaluation: 30 participants rated models on action controllability, visual quality, and 3D consistency (scale 1-5).
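The palindrome-based consistency metrics can be sketched for PSNR. Frame pairing and the pixel range are assumptions for illustration; the paper additionally applies LPIPS, sharpness, MEt3R, and DINO similarity to the same pairs:

```python
import numpy as np

def palindrome_psnr(frames):
    """For a closed-loop (palindromic) trajectory, frame i and frame
    len(frames)-1-i should view the same scene; average PSNR over such
    pairs probes 3D consistency. Assumes uint8 frames in [0, 255]."""
    n = len(frames)
    scores = []
    for i in range(n // 2):
        a = frames[i].astype(np.float64)
        b = frames[n - 1 - i].astype(np.float64)
        mse = np.mean((a - b) ** 2)
        scores.append(10 * np.log10(255.0 ** 2 / max(mse, 1e-12)))
    return float(np.mean(scores))

# Identical mirrored frames give a very high PSNR (bounded by the epsilon).
clip = [np.full((4, 4), 10, dtype=np.uint8)] * 4
assert palindrome_psnr(clip) > 100
```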

Key Results:

Table 2: Quantitative comparison with interactive gaming world models (200-frame generation)

| Method | RPE_trans ↓ | RPE_rot (º) ↓ | RPE_camera ↓ | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yume | 0.111 | 2.222 | 0.137 | 0.774 | 0.476 | 0.741 | 0.892 | 0.600 | 0.955 | 0.986 |
| Matrix-Game 2.0 | 0.098 | 1.656 | 0.119 | 0.766 | 0.457 | 0.741 | 0.843 | 0.633 | 0.937 | 0.981 |
| GameCraft | 0.086 | 1.146 | 0.100 | 0.781 | 0.464 | 0.804 | 0.850 | 0.626 | 0.958 | 0.986 |
| WorldCam | 0.080 | 0.696 | 0.086 | 0.844 | 0.508 | 0.896 | 0.959 | 0.752 | 0.964 | 0.984 |

Table 3: Quantitative results for long-horizon 3D consistency

| Method | PSNR ↑ | LPIPS ↓ | MEt3R ↓ | DINO Sim. ↑ | Sharpness ↑ |
| --- | --- | --- | --- | --- | --- |
| Real Videos | – | – | – | – | 577 |
| Yume | 16.03 | 0.5629 | 0.0905 | 0.4545 | 95 |
| Matrix-Game 2.0 | 13.66 | 0.4997 | 0.0662 | 0.6153 | 179 |
| GameCraft | 14.27 | 0.5749 | 0.0489 | 0.5960 | 201 |
| WorldCam | 16.69 | 0.3277 | 0.0342 | 0.8884 | 656 |

Table 4: Comparison with camera-controlled models (16-frame generation)

| Method | RPE_trans ↓ | RPE_rot (º) ↓ | RPE_camera ↓ |
| --- | --- | --- | --- |
| CameraCtrl | 0.071 | 0.943 | 0.083 |
| MotionCtrl | 0.080 | 1.271 | 0.102 |
| WorldCam | 0.026 | 0.386 | 0.030 |

Table 5: Human evaluation results (average user ratings, 1-5)

| Method | Action Controllability ↑ | Visual Quality ↑ | 3D Consistency ↑ |
| --- | --- | --- | --- |
| Yume | 2.47 | 2.83 | 1.44 |
| Matrix-Game 2.0 | 3.78 | 3.42 | 2.75 |
| GameCraft | 2.55 | 3.34 | 3.36 |
| WorldCam | 4.31 | 4.44 | 4.36 |

Qualitative Results (Figures 4, 5, 6):

  • WorldCam accurately follows complex coupled keyboard/mouse inputs (Figure 5a).
  • It generates plausible long-horizon outputs (>10 seconds) without error drift (Figure 5b).
  • It preserves consistent 3D geometry when revisiting locations beyond the denoising window (Figures 4, 6).

Ablation Studies:

Table 6: Ablation on number of long-term memory latents

| # | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ | PSNR ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 | 12.163 | 0.591 |
| 1 | 0.840 | 0.511 | 0.879 | 0.955 | 0.752 | 0.961 | 0.984 | 12.624 | 0.573 |
| 4 | 0.841 | 0.510 | 0.881 | 0.956 | 0.750 | 0.961 | 0.984 | 12.950 | 0.554 |

Table 7: Ablation on number of short-term memory latents

| # | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.749 | 0.317 | 0.873 | 0.947 | 0.414 | 0.959 | 0.983 |
| 4 | 0.836 | 0.514 | 0.873 | 0.947 | 0.737 | 0.959 | 0.983 |
| 8 | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 |

Table 8: Ablation on attention sink

| Method | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| w/o attention sink | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 |
| with attention sink | 0.841 | 0.502 | 0.883 | 0.956 | 0.753 | 0.964 | 0.984 |

Table 9: Ablation on action-to-camera mapping

| Method | RPE_trans ↓ | RPE_rot (º) ↓ | RPE_camera ↓ |
| --- | --- | --- | --- |
| Yume | 0.111 | 2.222 | 0.137 |
| Matrix-Game 2.0 | 0.098 | 1.656 | 0.119 |
| GameCraft | 0.086 | 1.146 | 0.100 |
| WorldCam (Linear) | 0.093 | 0.962 | 0.102 |
| WorldCam (Lie) | 0.080 | 0.696 | 0.086 |

Table 10: Ablation on long-term memory retrieval strategy

| Method | PSNR ↑ | LPIPS ↓ | MEt3R ↓ |
| --- | --- | --- | --- |
| Random | 15