WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
Summary (Overview)
- Core Contribution: Establishes camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency in interactive gaming world models.
- Key Method: Introduces a physics-based continuous action space defined in the Lie algebra to derive precise 6-DoF camera poses from user inputs, and a pose-indexed long-term memory retrieval mechanism to enforce spatial coherence.
- New Dataset: Publishes WorldCam-50h, a large-scale dataset of 3,000 minutes (50 hours) of authentic human gameplay from open-licensed games, annotated with camera trajectories and textual descriptions.
- Performance: Demonstrates substantial improvements over state-of-the-art models in action controllability, long-horizon visual quality, and 3D spatial consistency through extensive quantitative and human evaluations.
- Architecture: Builds on a progressive autoregressive Video Diffusion Transformer (DiT) backbone, enhanced with a camera embedder, attention sink, and short-/long-term memory mechanisms for stable long-horizon generation.
Introduction and Theoretical Foundation
Recent advances in Video Diffusion Transformers (DiTs) have enabled interactive gaming world models. However, existing approaches struggle with precise action control and long-horizon 3D consistency. The fundamental issue is that prior works treat user actions (keyboard/mouse) as abstract conditioning signals, overlooking the geometric coupling between actions and the 3D world. In a 3D environment, user actions induce relative camera motions that accumulate into a global camera pose, which dictates the 2D projection of the world. Therefore, accurate action control and 3D consistency are inherently coupled through the camera pose.
WorldCam addresses this by establishing the camera pose as the core geometric representation. This serves a dual purpose:
- For Action Control: User inputs are translated into geometrically accurate camera poses.
- For 3D Consistency: Global camera poses act as spatial indices to retrieve past observations, ensuring consistency when revisiting locations.
The paper positions WorldCam against prior works (see Table 1), highlighting its unique ability to combine action control, 3D consistency, and long-horizon inference.
Methodology
3.1 Baseline: Video Diffusion Transformer
WorldCam builds on a pretrained video DiT (Wan-2.1-T2V). Given an input video $x$, a VAE encoder maps it to a latent sequence $z_0 = \mathcal{E}(x)$. The model learns to predict a velocity field along the noising path $z_t = (1-t)\,z_0 + t\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The flow-matching training objective is
$$\mathcal{L} = \mathbb{E}_{z_0,\,\epsilon,\,t}\,\big\|\,v_\theta(z_t, t, c) - (\epsilon - z_0)\,\big\|_2^2,$$
where $c$ denotes the conditioning signals.
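As a concrete illustration, here is a minimal NumPy sketch of a flow-matching objective of this kind, assuming the standard rectified-flow parameterization (linear noising path, velocity target $\epsilon - z_0$); the perturbed `v_pred` below is only a stand-in for the DiT's output:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_target(z0, eps, t):
    # Linear interpolation path z_t = (1 - t) z0 + t eps;
    # the velocity target along this path is dz_t/dt = eps - z0.
    zt = (1 - t) * z0 + t * eps
    return zt, eps - z0

z0 = rng.standard_normal((4, 8))   # clean latent frames (toy shape)
eps = rng.standard_normal((4, 8))  # Gaussian noise
t = 0.3                            # sampled diffusion time in [0, 1]

zt, v_target = flow_matching_target(z0, eps, t)

# Stand-in for the model's prediction; in training, v_pred = DiT(zt, t, cond).
v_pred = v_target + 0.01 * rng.standard_normal(v_target.shape)
loss = np.mean((v_pred - v_target) ** 2)  # MSE on the velocity field
```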
3.2 Action-to-Camera Mapping
To ensure physically accurate control, the action space is defined in the Lie algebra $\mathfrak{se}(3)$. At each transition from frame $i$ to frame $i{+}1$, the user action is a twist vector $\xi_i = (v_i, \omega_i) \in \mathbb{R}^6$,
where $v_i \in \mathbb{R}^3$ and $\omega_i \in \mathbb{R}^3$ denote linear and angular velocities. The corresponding relative camera pose is derived via the matrix exponential map $T_i = \exp(\hat{\xi}_i) \in SE(3)$,
where $\hat{\xi}_i \in \mathfrak{se}(3)$ is the $4 \times 4$ matrix form of the twist $\xi_i$. This formulation jointly integrates linear and angular velocities on the $SE(3)$ manifold, capturing coupled dynamics such as screw motion, unlike decoupled linear approximations.
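The twist-to-pose mapping can be sketched with the standard closed-form $SE(3)$ exponential (Rodrigues' formula for the rotation plus the left Jacobian for the translation); this is a generic textbook implementation, not the paper's code:

```python
import numpy as np

def hat(w):
    """Skew-symmetric (hat) matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(v, w):
    """Closed-form exp of the twist xi = (v, w): returns a 4x4 SE(3) pose."""
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # Small-angle limit: first-order expansion of the series.
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        # Rodrigues' rotation formula.
        R = (np.eye(3) + np.sin(theta) / theta * W
             + (1 - np.cos(theta)) / theta**2 * W @ W)
        # Left Jacobian V couples rotation and translation (screw motion).
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * W
             + (theta - np.sin(theta)) / theta**3 * W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T
```

Because the translation passes through $V$ rather than being applied independently, a nonzero angular velocity bends the translational path, which is exactly the coupled behavior a decoupled linear approximation misses.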
3.3 Camera-Controlled Video Generation
Relative poses are accumulated into global camera poses aligned with the first frame. Each pose is converted into a per-pixel Plücker embedding $P \in \mathbb{R}^{6 \times H \times W}$. A lightweight camera embedding module (two MLP layers) injects camera control into the DiT. Since the VAE compresses time by a factor $r$, the $r$ consecutive Plücker embeddings covered by each latent frame are concatenated channel-wise, giving $P' \in \mathbb{R}^{6r \times H \times W}$. The resulting embeddings are added to the DiT features after each self-attention layer.
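A sketch of the per-pixel Plücker parameterization (unit ray direction plus moment for every pixel); the intrinsics `K` and the world-to-camera extrinsics convention here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plücker coordinates (d, o x d) for one camera.

    K: 3x3 intrinsics; (R, t): world-to-camera extrinsics, so the
    camera center in world coordinates is o = -R^T t.
    Returns an (H, W, 6) array: unit ray direction and moment.
    """
    o = -R.T @ t
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Pixel centers in homogeneous image coordinates.
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], axis=-1)
    d_cam = pix @ np.linalg.inv(K).T      # back-project to camera-frame rays
    d = d_cam @ R                          # row-vector form of R^T @ d_cam
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    m = np.cross(np.broadcast_to(o, d.shape), d)   # moment o x d
    return np.concatenate([d, m], axis=-1)
```

The direction-plus-moment pair identifies each pixel's viewing ray independently of where along the ray a point lies, which is why it works as a dense camera-conditioning signal.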
3.4 Pose-Anchored Long-Term Memory
Global Pose Accumulation: Relative motions are accumulated into global camera poses via pose composition $T^{g}_{i} = T^{g}_{i-1}\,T_{i}$, with $T^{g}_{0} = I$ anchored to the first frame.
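The composition above amounts to a running matrix product; a minimal sketch (right-multiplication convention assumed, matching the recurrence as written):

```python
import numpy as np

def accumulate_poses(relative_poses):
    """Chain 4x4 relative camera poses into global poses expressed
    in the first frame's coordinate system (T_0 = identity)."""
    T = np.eye(4)
    global_poses = [T.copy()]
    for T_rel in relative_poses:
        T = T @ T_rel          # compose frame i -> i+1 on the right
        global_poses.append(T.copy())
    return global_poses
```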
Pose-Indexed Memory Retrieval: A long-term memory pool stores previously generated latents with their global poses. A hierarchical retrieval strategy uses the global pose as a spatial index:
- Select the top-$K$ candidates $\mathcal{C}$ whose camera positions $p_i$ are closest to the current position $p_t$, i.e. those with the $K$ smallest distances $\|p_i - p_t\|_2$.
- From $\mathcal{C}$, select the $k$ entries ($k \le K$) whose viewing directions (rotation matrices $R_i$) are most aligned with the current orientation $R_t$, measured by the trace of the relative rotation matrix, $\operatorname{tr}(R_i^{\top} R_t)$.
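The two-stage retrieval can be sketched as below; the defaults for `K` and `k` are illustrative, and the alignment score `tr(R_i^T R_t)` is maximal (equal to 3) when the two orientations coincide:

```python
import numpy as np

def retrieve(memory_p, memory_R, p_cur, R_cur, K=8, k=2):
    """Hierarchical pose-indexed retrieval (sketch).

    memory_p: (N, 3) stored camera positions; memory_R: (N, 3, 3)
    stored rotations. Stage 1 keeps the K nearest positions; stage 2
    keeps the k of those best aligned in orientation.
    """
    dists = np.linalg.norm(memory_p - p_cur, axis=1)
    cand = np.argsort(dists)[:K]                       # stage 1: position
    align = np.array([np.trace(memory_R[i].T @ R_cur) for i in cand])
    return cand[np.argsort(-align)[:k]]                # stage 2: orientation
```

Filtering by position first keeps the orientation comparison cheap, since the trace test only runs on the $K$ spatial neighbors rather than the whole memory pool.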
Long-Term Memory Conditioning: Retrieved latents are concatenated with the current input. Their associated camera poses are realigned to the current coordinate frame, embedded by the camera embedding module, and injected into the DiT, establishing explicit geometric correspondences between the memory and the current window.
3.5 Progressive Autoregressive Inference
- Progressive Noise Scheduling: Adopts a progressive per-frame noise schedule with monotonically increasing noise levels across latent frames within each denoising window. This keeps a low-noise anchor in the earliest frames while leaving future frames correctable. The diffusion process is discretized into $T$ inference steps partitioned into $S$ stages.
- Attention Sink: Incorporates an attention sink mechanism (akin to StreamingLLM) by retaining global initial frames as attention anchors to preserve frame fidelity, scene style, and UI consistency.
- Short-Term Memory: Conditions each window on the most recently generated latents to reduce error drift. The number of short-term latents is set empirically to match the number of latents generated per window.
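The progressive schedule can be sketched as below; the linear per-frame ramp and the 0.5 floor are hypothetical choices, not the paper's exact schedule, but they preserve the two stated properties (noise increases monotonically across frames, and early frames stay as low-noise anchors):

```python
import numpy as np

def progressive_noise_levels(num_frames, step, total_steps):
    """Hypothetical progressive per-frame noise schedule.

    At each inference step, earlier latent frames sit at lower noise
    (serving as anchors) and later frames at higher noise; every
    frame's level shrinks to zero as denoising finishes.
    """
    base = 1.0 - step / total_steps             # global denoising progress
    ramp = np.linspace(0.0, 1.0, num_frames)    # per-frame offset (assumed linear)
    return np.clip(base * (0.5 + 0.5 * ramp), 0.0, 1.0)
```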
Overall Architecture (Figure 2): The system converts user actions to camera poses in Lie algebra, conditions a progressive autoregressive video transformer on these poses, and uses retrieved long-term memory latents and poses to enforce 3D consistency.
Empirical Validation / Results
4. WorldCam-50h Dataset
A new large-scale dataset collected to address the lack of authentic human gameplay data.
- Sources: 1 closed-licensed game (Counter-Strike) and 2 open-licensed games (Xonotic, Unvanquished). All figures/videos in the paper are from the open-licensed games.
- Content: Over 100 videos per game, averaging 8 minutes, totaling ~1,000 minutes (≈17 hours) per game (3,000 minutes overall).
- Annotations: Each video chunk is annotated with detailed textual descriptions generated by Qwen2.5-VL-7B and pseudo ground-truth camera poses extracted using ViPE (with filtering for erroneous estimates).
- Statistics: Captures diverse human behaviors including navigation, combined inputs, rapid camera movements, and revisiting locations (see Figure 3 for distributions).
5. Experiments
Implementation Details:
- Backbone: Wan2.1-1.3B-T2V video DiT. Spatial resolution: .
- Training: Three-stage schedule on 8 NVIDIA H100 GPUs:
  - Stage 1: camera-controlled generation with short-term memory (10k iterations, batch size 64).
  - Stage 2: progressive autoregressive training with short-term memory (10k iterations, batch size 48).
  - Stage 3: progressive autoregressive training with both short- and long-term memory (10k iterations, batch size 16).
- Progressive autoregressive sampling uses $T$ timesteps partitioned into $S$ stages.
Evaluation Settings & Metrics:
- Baselines: Interactive gaming world models (Yume, Matrix-Game 2.0, GameCraft) and camera-controlled models (CameraCtrl, MotionCtrl).
- Action Controllability: Measured via average Relative Pose Errors in translation (RPE_trans), rotation (RPE_rot), and camera extrinsics (RPE_camera) between estimated and ground-truth camera trajectories (with Sim(3) Umeyama alignment).
- Visual Quality: Evaluated using VBench++ metrics: Aesthetic Quality, Subject Consistency, Background Consistency, Imaging Quality, Temporal Flickering, Motion Smoothness, and their average.
- 3D Consistency: Measured via:
- PSNR and LPIPS between palindromic frame pairs in closed-loop trajectories.
- Sharpness (variance of Laplacian) to account for blur.
- MEt3R: Geometric multi-view consistency via DUSt3R reconstruction and DINO feature warping.
- DINO Similarity: Cosine similarity between DINOv2 features of palindrome-corresponding frames.
- Human Evaluation: 30 participants rated models on action controllability, visual quality, and 3D consistency (scale 1-5).
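Two of these metrics are straightforward to reproduce; below are minimal NumPy versions of PSNR (for palindromic frame pairs) and variance-of-Laplacian sharpness, assuming grayscale input scaled to [0, 1]:

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio between two same-shape images."""
    mse = np.mean((a - b) ** 2)
    return float(10 * np.log10(peak**2 / mse)) if mse > 0 else float("inf")

def sharpness(gray):
    """Variance-of-Laplacian sharpness of a 2D grayscale image,
    using the standard 4-neighbour Laplacian stencil."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())
```

Reporting sharpness alongside PSNR matters because a model can inflate palindromic PSNR simply by generating blurry frames; the Laplacian variance penalizes exactly that failure mode.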
Key Results:
Table 2: Quantitative comparison with interactive gaming world models (200-frame generation)
| Method | RPE_trans ↓ | RPE_rot (°) ↓ | RPE_camera ↓ | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Yume | 0.111 | 2.222 | 0.137 | 0.774 | 0.476 | 0.741 | 0.892 | 0.600 | 0.955 | 0.986 |
| Matrix-Game 2.0 | 0.098 | 1.656 | 0.119 | 0.766 | 0.457 | 0.741 | 0.843 | 0.633 | 0.937 | 0.981 |
| GameCraft | 0.086 | 1.146 | 0.100 | 0.781 | 0.464 | 0.804 | 0.850 | 0.626 | 0.958 | 0.986 |
| WorldCam | 0.080 | 0.696 | 0.086 | 0.844 | 0.508 | 0.896 | 0.959 | 0.752 | 0.964 | 0.984 |
Table 3: Quantitative results for long-horizon 3D consistency
| Method | PSNR ↑ | LPIPS ↓ | MEt3R ↓ | DINO Sim. ↑ | Sharpness ↑ |
|---|---|---|---|---|---|
| Real Videos | - | - | - | - | 577 |
| Yume | 16.03 | 0.5629 | 0.0905 | 0.4545 | 95 |
| Matrix-Game 2.0 | 13.66 | 0.4997 | 0.0662 | 0.6153 | 179 |
| GameCraft | 14.27 | 0.5749 | 0.0489 | 0.5960 | 201 |
| WorldCam | 16.69 | 0.3277 | 0.0342 | 0.8884 | 656 |
Table 4: Comparison with camera-controlled models (16-frame generation)
| Method | RPE_trans ↓ | RPE_rot (°) ↓ | RPE_camera ↓ |
|---|---|---|---|
| CameraCtrl | 0.071 | 0.943 | 0.083 |
| MotionCtrl | 0.080 | 1.271 | 0.102 |
| WorldCam | 0.026 | 0.386 | 0.030 |
Table 5: Human evaluation results (average user ratings, 1-5)
| Method | Action Controllability ↑ | Visual Quality ↑ | 3D Consistency ↑ |
|---|---|---|---|
| Yume | 2.47 | 2.83 | 1.44 |
| Matrix-Game 2.0 | 3.78 | 3.42 | 2.75 |
| GameCraft | 2.55 | 3.34 | 3.36 |
| WorldCam | 4.31 | 4.44 | 4.36 |
Qualitative Results (Figures 4, 5, 6):
- WorldCam accurately follows complex coupled keyboard/mouse inputs (Figure 5a).
- It generates plausible long-horizon outputs (>10 seconds) without error drift (Figure 5b).
- It preserves consistent 3D geometry when revisiting locations beyond the denoising window (Figures 4, 6).
Ablation Studies:
Table 6: Ablation on number of long-term memory latents
| # | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ | PSNR ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 | 12.163 | 0.591 |
| 1 | 0.840 | 0.511 | 0.879 | 0.955 | 0.752 | 0.961 | 0.984 | 12.624 | 0.573 |
| 4 | 0.841 | 0.510 | 0.881 | 0.956 | 0.750 | 0.961 | 0.984 | 12.950 | 0.554 |
Table 7: Ablation on number of short-term memory latents
| # | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
|---|---|---|---|---|---|---|---|
| 1 | 0.749 | 0.317 | 0.873 | 0.947 | 0.414 | 0.959 | 0.983 |
| 4 | 0.836 | 0.514 | 0.873 | 0.947 | 0.737 | 0.959 | 0.983 |
| 8 | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 |
Table 8: Ablation on attention sink
| Method | Avg. ↑ | Aesth. ↑ | Subj. Cons. ↑ | Bg. Cons. ↑ | Img. ↑ | Temp. ↑ | Motion. ↑ |
|---|---|---|---|---|---|---|---|
| w/o attention sink | 0.840 | 0.511 | 0.876 | 0.955 | 0.751 | 0.961 | 0.984 |
| with attention sink | 0.841 | 0.502 | 0.883 | 0.956 | 0.753 | 0.964 | 0.984 |
Table 9: Ablation on action-to-camera mapping
| Method | RPE_trans ↓ | RPE_rot (°) ↓ | RPE_camera ↓ |
|---|---|---|---|
| Yume | 0.111 | 2.222 | 0.137 |
| Matrix-Game 2.0 | 0.098 | 1.656 | 0.119 |
| GameCraft | 0.086 | 1.146 | 0.100 |
| WorldCam (Linear) | 0.093 | 0.962 | 0.102 |
| WorldCam (Lie) | 0.080 | 0.696 | 0.086 |
Table 10: Ablation on long-term memory retrieval strategy
| Method | PSNR ↑ | LPIPS ↓ | MEt3R ↓ |
|---|---|---|---|
| Random | 15 |