# DreamX-World 1.0: A General-Purpose Interactive World Model

> DreamX-World 1.0 outperforms larger baselines in interactive world modeling via efficient camera control, memory persistence, and RL-aligned distillation.

- **Source:** [arXiv](https://arxiv.org/abs/2606.16993)
- **Published:** 2026-06-17
- **Permalink:** https://picx.dev/p/hMMQnj
- **Whiteboard:** https://picx.dev/p/hMMQnj/image

## Summary (Overview)

- **DreamX-World 1.0** is a general-purpose interactive text/image-to-video world model supporting camera navigation, scene revisits, and promptable events across photorealistic, game-style, and stylized domains.
- Introduces **E-PRoPE**, a lightweight variant of projective positional encoding that applies camera-aware attention to spatially reduced tokens, achieving comparable trajectory-following to PRoPE with roughly 26% lower inference latency (59 s vs. 80 s in the PRoPE comparison table).
- Proposes **Memory-Conditioned Scene Persistence** using camera-geometry-based retrieval and residual recycling to maintain scene identity over long horizons, mitigating style and color drift.
- Combines **autoregressive distillation** (causal forcing, DMD, long-rollout training) with **reinforcement learning** post-distillation to recover camera control and visual quality while preserving few-step inference.
- Achieves **84.76 overall score** on 5-second basic evaluation, outperforming HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45); reaches **16 FPS** on eight RTX 5090 GPUs through mixed-precision DiT, residual reuse, pruned VAE decoding, and asynchronous pipeline parallelism.

## Introduction and Theoretical Foundation

Interactive world models extend video generation from passive visual synthesis to responsive simulation. Unlike offline generation, they must:
- Follow user-specified camera trajectories revealing a consistent scene.
- Preserve scene content when revisiting locations after the local context window has passed.
- Support prompted events that modify world state across multiple objects.
- Run with low latency for continuous interaction.

Building such a model requires diverse video data spanning visual domains with reliable camera, action, and event annotations. No single data source provides this coverage, motivating the combination of synthetic (Unreal Engine), game, and real-world data.

The paper identifies three coupled technical challenges:
1. **Camera control**: translating prescribed trajectories into consistent viewpoint changes across varying scene scales and motion distributions.
2. **Scene persistence**: preserving content beyond the local context window to avoid appearance, style, and color drift during autoregressive generation.
3. **Latency**: reducing diffusion steps and decoding costs while retaining quality, controllability, and rollout stability.

The model is initialized from Wan2.2-TI2V and progressively adapted through camera conditioning, memory conditioning, event interaction, and autoregressive long-video generation, followed by post-distillation alignment via reinforcement learning.

## Methodology

### Data Engine

The data pipeline combines three sources:
- **UE-generated data**: First-person free-camera, third-person character-driven, and event subsets with per-frame ground-truth camera pose and discrete action vectors (WASD for translation, IJKL for rotation). Two-stage pipeline: trajectory discovery → offline rendering.
- **Real-world data**: Videos from SpatialVID, RealEstate10K, Sekai, and DL3DV with sparse camera poses estimated via MegaSaM and interpolated.
- **Game data**: From Sekai-Game and OmniWorld-Game with engine-exported poses.

Quality control runs in three stages:
1. **Basic filtering**: duration, text, and static content.
2. **Geometric camera-pose cleaning**: translation spikes, rapid rotations, and jitter.
3. **Video captioning and attribute tagging**: aesthetic score, motion intensity, scene/style/subject category, and 3D/4D motion distinction.

For event instruction, clips with visible state changes are annotated with structured global and per-entity event descriptions.

### Camera Control: E-PRoPE

E-PRoPE modifies PRoPE by:
1. Downsampling the PRoPE attention input tokens spatially (e.g., from 18480 to 4096 tokens for 5-second 720P video – 4.5× reduction).
2. Projecting into lower-dimensional query/key/value space: $X^{EPRoPE} \in \mathbb{R}^{N \times d'}, d' < d$.
3. Omitting the RoPE submatrix $D^{RoPE}_s$, relying on the DiT backbone's existing spatiotemporal inductive bias.
4. Upsampling the PRoPE attention output and adding it to the original DiT attention output.
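
The downsample, attend, upsample, and residual-add flow above can be sketched on a toy 2D token grid. This is a minimal illustration rather than the paper's implementation: average pooling, nearest-neighbor upsampling, scalar tokens, and the `camera_attention` callback are all assumptions, and the low-dimensional projection of step 2 is left out.

```python
from typing import Callable, List

Grid = List[List[float]]  # toy stand-in: one scalar per spatial token


def avg_pool(grid: Grid, f: int) -> Grid:
    """Average non-overlapping f x f blocks (assumed downsampling choice)."""
    h, w = len(grid), len(grid[0])
    if h % f or w % f:
        raise ValueError("grid size must be divisible by the pooling factor")
    return [
        [
            sum(grid[y + dy][x + dx] for dy in range(f) for dx in range(f)) / (f * f)
            for x in range(0, w, f)
        ]
        for y in range(0, h, f)
    ]


def upsample_nearest(grid: Grid, f: int) -> Grid:
    """Repeat every token f times along both axes (assumed upsampling choice)."""
    h, w = len(grid), len(grid[0])
    return [[grid[y // f][x // f] for x in range(w * f)] for y in range(h * f)]


def add_camera_branch(
    dit_out: Grid, tokens: Grid, f: int, camera_attention: Callable[[Grid], Grid]
) -> Grid:
    """Run the camera branch on reduced tokens and add it to the DiT output."""
    reduced = avg_pool(tokens, f)
    attended = camera_attention(reduced)  # placeholder for camera-aware attention
    restored = upsample_nearest(attended, f)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(dit_out, restored)]


def identity(grid: Grid) -> Grid:
    """Hypothetical attention stand-in that returns its input unchanged."""
    return grid


tokens = [[float(4 * y + x) for x in range(4)] for y in range(4)]
dit_out = [[0.0] * 4 for _ in range(4)]
result = add_camera_branch(dit_out, tokens, 2, identity)

# Token reduction quoted in step 1 for a 5-second clip.
token_ratio = 18480 / 4096
```

Because self-attention cost grows roughly with the square of the token count, a 4.5× reduction in tokens shrinks the camera branch's attention cost by considerably more than 4.5×.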

The per-token matrix is:
$$
D^{EPRoPE}_s = \begin{bmatrix} D^{Proj}_s & 0 \\ 0 & 0 \end{bmatrix}
$$
where $D^{Proj}_s$ encodes the full projective camera geometry.
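
To make the block structure concrete, the sketch below embeds a small matrix in the top-left corner of a zero matrix and applies it to a feature vector. The 2×2 block and the 4-dimensional feature are hypothetical stand-ins; the real size and contents of $D^{Proj}_s$ are not given in this summary.

```python
from typing import List

Matrix = List[List[float]]


def eprope_matrix(d_proj: Matrix, dim: int) -> Matrix:
    """Place d_proj in the top-left block of a dim x dim zero matrix."""
    k = len(d_proj)
    if k > dim:
        raise ValueError("projective block is larger than the feature dimension")
    out = [[0.0] * dim for _ in range(dim)]
    for r in range(k):
        for c in range(k):
            out[r][c] = d_proj[r][c]
    return out


def apply(matrix: Matrix, vec: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


# Hypothetical 2x2 stand-in for the projective block D^Proj_s.
d_proj = [[2.0, 0.0], [1.0, 1.0]]
D = eprope_matrix(d_proj, 4)
transformed = apply(D, [1.0, 2.0, 3.0, 4.0])
```

Only the channels covered by the projective block survive the transform. The channels that PRoPE would route through $D^{RoPE}_s$ map to zero here, which matches the statement that positional structure is left to the DiT backbone.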

Training freezes the DiT backbone and backpropagates only into the PRoPE parameters. At inference, the PRoPE component is plug-and-play: it can be attached even to a model that was not trained with it.

### Memory-Conditioned Scene Persistence

The model uses two conditioning sources: recent history latents $z_H$ and memory latents $z_M$ retrieved from earlier history via camera-geometry-based retrieval (camera pose and view overlap). The input sequence is:
$$
z^{pack} = [z_M | z_H | z^\tau_C]
$$
where $z^\tau_C$ are target latents at noise level $\tau$. Training uses standard rectified flow objective with loss only on target latent frames.
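
The packing order and the target-only loss can be illustrated as follows. This is a sketch under simplifying assumptions: each latent frame is reduced to a single float, and a plain mean squared error stands in for the rectified flow regression loss.

```python
from typing import List, Tuple


def pack(
    z_mem: List[float], z_hist: List[float], z_target: List[float]
) -> Tuple[List[float], List[bool]]:
    """Concatenate [memory | history | target] and mark which frames carry loss."""
    packed = z_mem + z_hist + z_target
    loss_mask = [False] * (len(z_mem) + len(z_hist)) + [True] * len(z_target)
    return packed, loss_mask


def masked_mse(pred: List[float], ref: List[float], mask: List[bool]) -> float:
    """Mean squared error over the masked (target) frames only."""
    terms = [(p - r) ** 2 for p, r, m in zip(pred, ref, mask) if m]
    return sum(terms) / len(terms)


packed, loss_mask = pack([0.1, 0.2], [0.3], [0.5, 0.7])

# Large errors on the three conditioning frames are ignored by the mask.
pred = [9.0, 9.0, 9.0, 1.0, 1.0]
ref = [0.0, 0.0, 0.0, 0.5, 0.5]
loss = masked_mse(pred, ref, loss_mask)
```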

Memory frames receive RoPE embeddings corresponding to their original temporal positions to prevent treating them as temporally adjacent. For large time gaps, NTK-aware RoPE scaling, YaRN, or randomized positional encodings are used.
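
A small sketch of the position assignment, assuming integer frame indices and a contiguous history and target window (the function and argument names are hypothetical):

```python
from typing import List


def temporal_positions(
    memory_indices: List[int], history_start: int, n_history: int, n_target: int
) -> List[int]:
    """Temporal RoPE positions for the packed [memory | history | target] sequence.

    Memory frames keep the indices at which they originally appeared;
    history and target frames follow contiguously from history_start.
    """
    history = list(range(history_start, history_start + n_history))
    target_start = history_start + n_history
    target = list(range(target_start, target_start + n_target))
    return list(memory_indices) + history + target


# Two memory frames retrieved from frames 3 and 17, history starting at frame 40.
positions = temporal_positions([3, 17], history_start=40, n_history=4, n_target=2)

# Naive alternative that would make memory look temporally adjacent to history.
naive_positions = list(range(len(positions)))
```

With the original indices, the gap between the last memory frame and the first history frame stays visible to the model (23 frames in this example) instead of collapsing to 1.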

To mitigate exposure bias (clean frames during training vs. generated noisy frames during inference), error injection perturbs conditioning tokens while keeping target latents clean, following Stable Video Infinity.

### Event Instruction Tuning

Structured event descriptions are rendered as natural-language prompts covering global scene and per-entity dynamics. The model supports **composable events**: multiple objects with distinct actions and interactions in one generation (unlike previous systems, see Table 2). Training mixes event-instruction samples with non-event clips using conservative updates and strict gradient clipping.

### Autoregressive Distillation and Reinforcement Learning

A bidirectional E-PRoPE model is distilled into a few-step autoregressive generator using:
- **Causal forcing**: training on model-generated history.
- **DMD (Distribution Matching Distillation)**: matching student rollouts to teacher over local temporal windows from long videos.
- **Long-rollout training**: with Infinity-RoPE for extended context.
- **I2V DMD distillation**: first latent frame decoded and fed as image condition to teacher.

Post-distillation, **reinforcement learning** (DiffusionNFT-style) with two reward models (camera translation/rotation accuracy, visual quality) and KL regularization recovers camera control and quality while preserving few-step inference. Long-horizon rollouts provide temporal context; short sampled clips serve as reward-bearing units.

### Inference Acceleration

- **DiT denoising**: INT8 SageAttention for attention, FP8 AngelSlim for FFN, sequence parallelism, fused Triton kernels, TeaCache residual reuse.
- **VAE decoding**: Matrix-Game 3.0 VAE with 75% pruning ratio (~0.25s per chunk); torch.compile; ParVAE splitting latent video along height across GPUs.
- **Serving**: asynchronous pipeline parallelism overlapping VAE decoding of chunk $k$ with DiT denoising of chunk $k+1$.
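
The benefit of overlapping the two stages can be estimated with a simple two-stage pipeline model. The 0.25 s VAE time comes from the text above; the 0.5 s DiT time per chunk is an assumed value for illustration, and the model ignores any contention between stages that share GPUs.

```python
def sequential_time(n_chunks: int, dit_s: float, vae_s: float) -> float:
    """Total time when each chunk is denoised and then decoded before the next starts."""
    return n_chunks * (dit_s + vae_s)


def pipelined_time(n_chunks: int, dit_s: float, vae_s: float) -> float:
    """Total time when VAE decoding of chunk k overlaps DiT denoising of chunk k+1."""
    if n_chunks == 0:
        return 0.0
    # First DiT pass, then (n - 1) overlapped steps bounded by the slower stage,
    # then the final VAE decode, which has nothing left to overlap with.
    return dit_s + (n_chunks - 1) * max(dit_s, vae_s) + vae_s


DIT_S = 0.5   # assumed per-chunk denoising time, not from the source
VAE_S = 0.25  # per-chunk decode time quoted above

seq = sequential_time(10, DIT_S, VAE_S)
pipe = pipelined_time(10, DIT_S, VAE_S)
```

Under these assumptions the steady-state cost per chunk drops from the sum of the two stages to the slower of the two, so decoding is effectively hidden whenever it is faster than denoising.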

Camera controls are chunk-relative: first chunk uses poses relative to its first frame; later chunks use poses relative to last frame of previous chunk.
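
One way to express chunk-relative poses is to re-anchor each chunk to a reference pose. The sketch below assumes camera-to-world poses stored as a rotation matrix and a translation vector; the convention and all helper names are assumptions, not details from the paper.

```python
import math
from typing import List, Sequence, Tuple

Mat3 = List[List[float]]
Vec3 = List[float]
Pose = Tuple[Mat3, Vec3]  # assumed camera-to-world rotation and translation


def yaw_pose(yaw_deg: float, t: Sequence[float]) -> Pose:
    """Helper: pose with a rotation about the vertical axis."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]], list(t)


def relative_to(ref: Pose, pose: Pose) -> Pose:
    """Express pose in the frame of ref: inverse(ref) composed with pose."""
    r_ref, t_ref = ref
    r, t = pose
    rt = [[r_ref[j][i] for j in range(3)] for i in range(3)]  # transpose = inverse
    r_rel = [[sum(rt[i][k] * r[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    diff = [a - b for a, b in zip(t, t_ref)]
    t_rel = [sum(rt[i][k] * diff[k] for k in range(3)) for i in range(3)]
    return r_rel, t_rel


def chunk_relative(poses: List[Pose], chunk_len: int) -> List[List[Pose]]:
    """First chunk is relative to its first frame; later chunks are
    relative to the last frame of the previous chunk."""
    chunks = []
    for start in range(0, len(poses), chunk_len):
        ref = poses[start] if start == 0 else poses[start - 1]
        chunks.append([relative_to(ref, p) for p in poses[start:start + chunk_len]])
    return chunks


# Camera moving one unit forward per frame with no rotation.
poses = [yaw_pose(0.0, [0.0, 0.0, float(i)]) for i in range(6)]
chunks = chunk_relative(poses, chunk_len=3)
forward = [[p[1][2] for p in chunk] for chunk in chunks]
```

In this example the second chunk starts at a forward offset of 1 rather than 0, because its reference is the last frame of the first chunk rather than its own first frame.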

## Empirical Validation / Results

### Basic Evaluation (5-second clips)

Camera control error:
$$
e_{\text{camera}} = \sqrt{e_\theta \cdot e_t}
$$
where $e_\theta$ and $e_t$ are scale-invariant rotation and translation errors.
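
As a quick numeric illustration of the formula, the combined error is the geometric mean of the two components. The input values below are hypothetical, and how this error is mapped to the 0 to 100 score in the tables is not described here.

```python
import math


def camera_error(e_theta: float, e_t: float) -> float:
    """Geometric mean of the rotation and translation errors."""
    if e_theta < 0 or e_t < 0:
        raise ValueError("errors must be non-negative")
    return math.sqrt(e_theta * e_t)


# Hypothetical error values for one clip.
combined = camera_error(0.04, 0.09)

# Quadrupling one component doubles the combined error.
doubled = camera_error(0.16, 0.09)
```

A geometric mean is less sensitive than an arithmetic mean to one component being much larger than the other, and it evaluates to zero whenever either component is zero.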

Metrics: camera control, image quality, transition detect, temporal flicker, motion smoothness, dynamic degree, artifact detection (Gemini-3.1-Pro VLM).

**Table 3: Basic evaluation** (all scores normalized to [0,100])

| Model | Params | Camera↑ | Quality↑ | Trans.↑ | Flicker↑ | Smooth.↑ | Dynamic↑ | Artifact↑ | Overall↑ |
|---|---|---|---|---|---|---|---|---|---|
| HY-WorldPlay 1.5 | 8B | 65.12 | 68.23 | 98.33 | 96.45 | 99.05 | 66.67 | 71.66 | 80.79 |
| LingBot-World | 14B | 71.73 | 67.76 | 85.00 | 94.94 | 97.06 | 88.33 | 58.33 | 80.45 |
| **DreamX-World-1.0-5B** | **5B** | **73.75** | 66.75 | **98.33** | 96.17 | 98.79 | 85.83 | **73.75** | **84.76** |

DreamX-World achieves the highest camera control and overall score with competitive visual quality.
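
The Overall column in Table 3 is numerically consistent with an unweighted mean of the seven metric columns. The summary does not state the aggregation rule, so the check below is an observation about the published numbers, not a confirmed definition.

```python
from typing import Dict, List, Tuple

# (seven metric scores, reported overall score), copied from Table 3.
table3: Dict[str, Tuple[List[float], float]] = {
    "HY-WorldPlay 1.5": ([65.12, 68.23, 98.33, 96.45, 99.05, 66.67, 71.66], 80.79),
    "LingBot-World": ([71.73, 67.76, 85.00, 94.94, 97.06, 88.33, 58.33], 80.45),
    "DreamX-World-1.0-5B": ([73.75, 66.75, 98.33, 96.17, 98.79, 85.83, 73.75], 84.76),
}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Absolute gap between the recomputed mean and the reported overall score.
gaps = {name: abs(mean(metrics) - overall) for name, (metrics, overall) in table3.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.4f}")
```

All three gaps fall within rounding distance of the two-decimal table values.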

### Comparison of PRoPE vs. E-PRoPE (Table 1, Omni-WorldBench)

| Model | Camera Control↑ | Image Quality↑ | Dynamic Degree↑ | Transition Detect↑ | Temporal Flicker↑ | Motion Smooth.↑ | Latency↓ (s) |
|---|---|---|---|---|---|---|---|
| PRoPE | 73.89 | 66.15 | 87.5 | 96.67 | 96.02 | 98.65 | 80 |
| E-PRoPE | 73.75 | 66.75 | 85.83 | 98.33 | 96.17 | 98.79 | **59** |

E-PRoPE achieves comparable camera control with ~26% lower latency.

### Long-Horizon Evaluation (~30 seconds)

**Table 4: Long-horizon evaluation**

| Model | Params | Camera↑ | Quality↑ | Trans.↑ | Flicker↑ | Smooth.↑ | Dynamic↑ | Artifact↑ | Overall↑ |
|---|---|---|---|---|---|---|---|---|---|
| HY-WorldPlay 1.5 | 8B | 65.86 | 63.02 | 91.00 | 97.00 | 99.11 | 52.00 | 14.00 | 68.85 |
| LingBot-World | 14B | 63.76 | 60.81 | 54.00 | 96.59 | 97.86 | 87.00 | 12.00 | 67.43 |
| **DreamX-World-1.0-5B** | **5B** | 62.03 | **64.11** | 80.00 | 96.35 | 98.41 | 75.00 | **17.00** | **70.41** |

DreamX-World achieves the highest overall, image quality, and artifact scores, although it trails both baselines on camera control (62.03 vs. 65.86 and 63.76) at this horizon.

### Memory Consistency Evaluation (revisit protocols)

Revisit pair detection condition:
$$
|\theta_i - \theta_j| \leq \tau_\theta, \quad \|t_i - t_j\|_2 \leq \tau_t, \quad \tau_\theta=2^\circ, \tau_t=0.1
$$
with minimum temporal gap $|j-i| \geq \lfloor 0.2T \rfloor$.
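
The detection rule can be sketched as a brute-force pair search. This version assumes each frame has a single viewing angle in degrees and a 3D position, does not handle angle wraparound at 360°, and clamps the minimum gap to at least one frame so that a frame is never paired with itself on very short clips; those choices are assumptions beyond the stated condition.

```python
import math
from typing import List, Sequence, Tuple


def revisit_pairs(
    angles_deg: Sequence[float],
    positions: Sequence[Sequence[float]],
    tau_theta: float = 2.0,
    tau_t: float = 0.1,
    gap_frac: float = 0.2,
) -> List[Tuple[int, int]]:
    """Return (i, j) frame pairs that satisfy the revisit condition."""
    n_frames = len(angles_deg)
    min_gap = max(1, math.floor(gap_frac * n_frames))
    pairs = []
    for i in range(n_frames):
        for j in range(i + min_gap, n_frames):
            close_angle = abs(angles_deg[i] - angles_deg[j]) <= tau_theta
            close_position = math.dist(positions[i], positions[j]) <= tau_t
            if close_angle and close_position:
                pairs.append((i, j))
    return pairs


# Out-and-back trajectory along x with a fixed viewing direction.
xs = [0, 1, 2, 3, 4, 4, 3, 2, 1, 0]
positions = [(float(x), 0.0, 0.0) for x in xs]
angles = [0.0] * len(xs)
pairs = revisit_pairs(angles, positions)
```

Frames 4 and 5 share a position but are only one frame apart, so the temporal-gap constraint excludes them; the remaining matches are genuine revisits after the camera has moved away.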

Metrics: ∆PSNR, ∆SSIM, ∆LPIPS, ∆DINO-Sim, ∆VPR-Sim, ∆SP-Match (all gains over non-revisit baselines), and CLIP-Video (absolute).

**Table 5: Memory consistency evaluation**

| Model | ∆PSNR | ∆SSIM | ∆LPIPS | ∆DINO-Sim | ∆VPR-Sim | ∆SP-Match | CLIP-V |
|---|---|---|---|---|---|---|---|
| LingBot-World | 0.61 | 0.019 | 0.039 | 0.090 | 0.100 | 0.088 | 0.987 |
| HY-WorldPlay 1.5 | 3.19 | 0.079 | 0.202 | 0.200 | 0.110 | 0.251 | 0.992 |
| **DreamX-World-1.0-5B** | **3.92** | **0.098** | **0.232** | **0.246** | **0.142** | 0.216 | 0.991 |

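DreamX-World achieves the highest gains on the pixel-level (∆PSNR, ∆SSIM), perceptual (∆LPIPS), semantic (∆DINO-Sim), and place-recognition (∆VPR-Sim) metrics, indicating stronger scene memory. HY-WorldPlay 1.5 leads on ∆SP-Match (0.251 vs. 0.216) and is marginally ahead on CLIP-Video (0.992 vs. 0.991).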

### Human Preference Study

Blind side-by-side comparison (Figure 12):
- **Overall**: DreamX-World wins 57.5% vs. HY-WorldPlay 1.5, 61.9% vs. LingBot-World.
- **Visual quality**: Wins 57.5% and 61.3%.
- **Artifact detection**: Wins 59.4% and 56.2%.
- **Camera control**: Higher tie rates, comparable perceived controllability.

## Theoretical and Practical Implications

**Theoretical contributions**:
- **E-PRoPE** demonstrates that projective camera conditioning can be efficiently applied on spatially reduced tokens without sacrificing trajectory-following performance, decoupling geometric conditioning from full-resolution semantics.
- **Memory-conditioned scene persistence** with error injection provides a principled way to handle exposure bias in autoregressive world models, showing that geometry-guided retrieval plus residual recycling improves robustness to imperfect memory latents.
- **Autoregressive distillation + RL alignment** reveals that aggressive distillation degrades controllability and quality, but these can be recovered through conservative reward optimization while retaining few-step inference.

**Practical implications**:
- The multi-source data pipeline (UE + real-world + game) with unified annotation and geometric cleaning enables training interactive world models that generalize across photorealistic, game-style, and stylized domains.
- Real-time streaming at 16 FPS on 8 RTX 5090 GPUs makes deployment feasible for interactive applications (gaming, simulation, virtual worlds).
- Composable event control (multi-entity, inter-object interactions) advances world models toward realistic simulation where multiple agents act concurrently.
- The evaluation framework (basic + long-horizon + revisit consistency + human preference) provides a more comprehensive assessment than existing benchmarks that only measure short-term quality and camera control.

## Conclusion

DreamX-World 1.0 establishes a full-stack solution for interactive world modeling, integrating data curation, efficient camera control (E-PRoPE), memory-conditioned scene persistence, event instruction tuning, autoregressive distillation with RL alignment, and real-time inference optimizations. It outperforms larger baselines (HY-WorldPlay 1.5 8B, LingBot-World 14B) on overall quality, long-horizon stability, and memory consistency while achieving competitive camera control.

**Future work**:
1. **Character-centric world models**: maintaining persistent character identity, coordinating character actions with free cameras, supporting multi-character interactions over long horizons.
2. **Native audio-visual world models**: jointly generating synchronized speech, ambient sound, and action-dependent audio while using sound as an interactive signal for events and scene dynamics.

Together with stronger memory and physical reasoning, these extensions would move world models toward more embodied, expressive, and immersive simulation.

---

_Markdown view of https://picx.dev/p/hMMQnj, served by PicX — AI-generated visual whiteboard summaries of research papers._
