minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Summary (Overview)

Framework: minWM is an end-to-end, open-source pipeline for converting high-quality bidirectional Text-to-Video (T2V) or Text-and-Image-to-Video (TI2V) diffusion foundation models into camera-controllable, few-step autoregressive video world models suitable for real-time interactive applications.
Key Contributions: Provides a modular, reproducible framework covering data construction, controllable fine-tuning (via PRoPE), autoregressive training, distillation (Causal Forcing/Causal Forcing++), and low-latency inference. It is instantiated on two representative open backbones: Wan2.1-T2V-1.3B and HY1.5-TI2V-8B.
Performance: The distilled few-step autoregressive models reduce first-frame latency by more than 223× relative to their multi-step bidirectional counterparts (223.75× for HY1.5 and 236.64× for Wan2.1), enabling real-time interactive rollout.
Practical Guidance: Includes actionable ablations on critical training factors: the importance of ground-truth camera trajectories, the number of training steps needed for controllability, and minimal batch-size requirements.
Extensibility: Supports adapting existing world models (e.g., HY-WorldPlay) to new data, recipes, or latency targets, and is designed to be extensible to other control signals and model architectures.

Introduction and Theoretical Foundation

Recent video diffusion models have achieved remarkable quality in offline video generation. However, an interactive video world model requires causal rollout, responsiveness to user actions (like camera control), and low-latency generation for real-time interaction. While techniques like autoregressive diffusion distillation exist, building such a model involves a complex, scattered pipeline: data construction, controllable fine-tuning, AR training, few-step distillation, and inference optimization.

minWM addresses this by providing a unified, open-source framework that converts existing T2V/TI2V foundation models into real-time interactive world models. The core idea is a two-phase recipe:

Camera Control Training: Fine-tune a bidirectional diffusion model to follow prescribed camera trajectories, preserving its generative quality.
AR Diffusion Distillation: Transform the controllable bidirectional model into a few-step autoregressive generator using established distillation pipelines (Causal Forcing or Causal Forcing++), drastically reducing inference steps and latency.

The framework is motivated by the need for a reproducible and extensible baseline for the community to build and adapt interactive video models.

Methodology

The minWM pipeline consists of two major phases.

Phase 1: Camera-Controllable Training for Bidirectional Diffusion Models

The goal is to equip a pre-trained bidirectional T2V/TI2V model with camera controllability. The method PRoPE (Projective RoPE) [26] is used to inject camera parameters into the model's self-attention mechanism.

Given a video clip with camera parameters $\{(K_i, T_i^{cw})\}_{i=1}^N$, where $K_i$ is the intrinsic matrix and $T_i^{cw} \in SE(3)$ is the world-to-camera extrinsic transformation for frame $i$, PRoPE represents each camera by its lifted projective matrix:

\tilde{P}_i = \begin{bmatrix} [K_i \ 0] T_i^{cw} \\ e_4^\top \end{bmatrix} \in \mathbb{R}^{4\times4}, \quad e_4 = (0,0,0,1)^\top.
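As a minimal sketch of the construction above, the following builds $\tilde{P}_i$ from a pinhole intrinsic matrix and a world-to-camera transform using plain nested lists. The helper names (`matmul`, `lifted_projective`) and the example camera values are illustrative assumptions, not part of minWM.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lifted_projective(K: Matrix, T_cw: Matrix) -> Matrix:
    """Stack [K 0] T_cw (3x4) on top of e_4^T to obtain the 4x4 lifted matrix."""
    K0 = [list(row) + [0.0] for row in K]      # [K 0], shape 3x4
    top = matmul(K0, T_cw)                     # 3x4 projection rows
    return top + [[0.0, 0.0, 0.0, 1.0]]        # append e_4^T as the last row

# Hypothetical camera: focal length 500 px, principal point (320, 240),
# no rotation, world origin placed 2 units in front of the camera.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
T_cw = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 2.0],
        [0.0, 0.0, 0.0, 1.0]]

P = lifted_projective(K, T_cw)

# Project the world origin (homogeneous coordinates) and dehomogenize.
x = matmul(P, [[0.0], [0.0], [0.0], [1.0]])
u, v = x[0][0] / x[2][0], x[1][0] / x[2][0]
print(u, v)  # 320.0 240.0, the principal point
```

Appending $e_4^\top$ keeps the matrix square and invertible, which is what allows relative transformations between two cameras to be formed by a matrix product with an inverse.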

For a token $t$ belonging to frame $i(t)$ with spatial coordinate $(x_t, y_t)$, PRoPE constructs a block-diagonal transformation $D_t^{PRoPE}$. This transformation is injected into self-attention in a GTA-style (Geometric Transform Attention) form:

\text{Attn}^{PRoPE}(Q, K, V) = D^{PRoPE} \odot \text{Attn}\left( (D^{PRoPE})^\top \odot Q, (D^{PRoPE})^{-1} \odot K, (D^{PRoPE})^{-1} \odot V \right).

This causes the attention interaction between tokens $t_1$ and $t_2$ to depend explicitly on the relative projective transformation $\tilde{P}_{i(t_1)} \tilde{P}_{i(t_2)}^{-1}$, which jointly encodes relative camera intrinsics and poses, enabling the model to condition on camera trajectories.
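The dependence on the relative transformation can be made explicit with a short derivation. It is a sketch that assumes $\odot$ applies $D_t$ to token $t$ individually; the symbols $q_t$, $k_t$, $v_t$ (per-token query, key, value), $\alpha_{t_1 t_2}$ (attention weights), $o_{t_1}$ (output), and $G$ (an arbitrary change of world frame) are introduced here for illustration.

```latex
% Attention logit between tokens t_1 and t_2
\left( D_{t_1}^\top q_{t_1} \right)^\top \left( D_{t_2}^{-1} k_{t_2} \right)
  = q_{t_1}^\top \, D_{t_1} D_{t_2}^{-1} \, k_{t_2}

% Output for token t_1
o_{t_1} = D_{t_1} \sum_{t_2} \alpha_{t_1 t_2} \, D_{t_2}^{-1} v_{t_2}
        = \sum_{t_2} \alpha_{t_1 t_2} \, D_{t_1} D_{t_2}^{-1} \, v_{t_2}

% Change of world frame by G \in SE(3): T_i^{cw} \mapsto T_i^{cw} G.
% Since \tilde{P}_i = \mathrm{diag}(K_i, 1) \, T_i^{cw}, this gives \tilde{P}_i \mapsto \tilde{P}_i G, and
\left( \tilde{P}_{i(t_1)} G \right) \left( \tilde{P}_{i(t_2)} G \right)^{-1}
  = \tilde{P}_{i(t_1)} \tilde{P}_{i(t_2)}^{-1}
```

Both the logits and the aggregated values see only the product $D_{t_1} D_{t_2}^{-1}$, and the last identity shows that its camera block does not change when the world coordinate frame is redefined, so only relative camera geometry enters the attention.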

Phase 2: AR Diffusion Distillation for Real-Time Interactive Models

This phase uses either Causal Forcing [23] or Causal Forcing++ [24] to distill the camera-controllable bidirectional model into a camera-controllable few-step autoregressive model. It consists of three stages:

Stage 1: AR Diffusion Training

The bidirectional model is fine-tuned into an autoregressive diffusion model via teacher forcing [27]. This is done by concatenating clean video with its noisy counterpart and training under a causal attention mask. The resulting model can generate autoregressively but still requires many diffusion steps.
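One common way to realize a teacher-forcing mask over a concatenated clean and noisy sequence is sketched below at chunk granularity. The layout is an assumption rather than something the source specifies: clean chunks come first, each clean chunk attends to clean chunks up to and including itself, and each noisy chunk attends to strictly earlier clean chunks plus itself. The function name is hypothetical.

```python
from typing import List

def teacher_forcing_mask(num_chunks: int) -> List[List[bool]]:
    """Chunk-level attention mask for the sequence [clean_0..clean_{n-1}, noisy_0..noisy_{n-1}].

    mask[q][k] is True when query chunk q may attend to key chunk k.
    """
    n = num_chunks
    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(i + 1):
            mask[i][j] = True          # clean chunk i sees clean chunks <= i
        for j in range(i):
            mask[n + i][j] = True      # noisy chunk i sees clean chunks < i
        mask[n + i][n + i] = True      # noisy chunk i sees itself
    return mask

mask = teacher_forcing_mask(3)
for row in mask:
    print("".join("X" if allowed else "." for allowed in row))
```

Under this layout every noisy chunk is denoised from a ground-truth history, so all chunks can be trained in a single forward pass while still matching the information available at autoregressive inference time.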

Stage 2: Initialization for Few-Step Generation

Option A (Causal Forcing): Causal ODE Initialization. The AR diffusion model generates PF-ODE trajectories [32]. Over a predefined few-step timestep set $S$, a timestep $t$ is sampled, and the few-step model $G_\theta$ is trained to regress from the noisy intermediate frame $x_t^i$ to the clean frame $x_0^i$: $\theta^* = \arg\min_\theta \mathbb{E}_{x_{gt}^{<i}, t, i, x_t^i} \left[ \| G_\theta(x_t^i, x_{gt}^{<i}, t) - x_0^i \|^2 \right],$ where $x_{gt}^{<i}$ denotes the historical prefix of real data.
Option B (Causal Forcing++): Causal Consistency Distillation (Causal CD). This eliminates the need for storing ODE trajectories. The model is trained via: $\theta^* = \arg\min_\theta \mathbb{E}_{x_{gt}, \epsilon, t, i} \left[ w(t) d\left( G_\theta(x_t^i, x_{gt}^{<i}, t), G_{\theta^-}(\hat{x}_{t-\Delta t}^i, x_{gt}^{<i}, t-\Delta t) \right) \right],$ where $\hat{x}_{t-\Delta t}^i$ is obtained by a single ODE step from $x_t^i$ using the AR teacher, $\theta^-$ is an EMA of $\theta$ with stop-gradient, $w(\cdot)$ is a timestep weight, and $d(\cdot,\cdot)$ is a distance function.
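The Causal CD objective in Option B can be illustrated with a toy sketch. Several things here are assumptions made only for the example: a rectified-flow interpolation $x_t = (1-t)x_0 + t\epsilon$, a single Euler step for the teacher ODE, $w(t) = 1$, squared L2 as $d$, and omission of the history prefix $x_{gt}^{<i}$. All function names are hypothetical, and the callables stand in for real networks.

```python
from typing import Callable, List

Vec = List[float]

def interpolate(x0: Vec, eps: Vec, t: float) -> Vec:
    """Assumed noising path: x_t = (1 - t) * x0 + t * eps."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]

def euler_step(x_t: Vec, velocity: Vec, dt: float) -> Vec:
    """One ODE step from time t to t - dt along the teacher's velocity field."""
    return [x - dt * v for x, v in zip(x_t, velocity)]

def causal_cd_loss(student: Callable[[Vec, float], Vec],
                   ema_student: Callable[[Vec, float], Vec],
                   teacher_velocity: Callable[[Vec, float], Vec],
                   x0: Vec, eps: Vec, t: float, dt: float) -> float:
    """Squared-L2 consistency loss between adjacent points on one ODE trajectory."""
    x_t = interpolate(x0, eps, t)
    x_prev = euler_step(x_t, teacher_velocity(x_t, t), dt)
    pred = student(x_t, t)
    target = ema_student(x_prev, t - dt)   # treated as a constant (stop-gradient)
    return sum((p - q) ** 2 for p, q in zip(pred, target))

# Toy check with an oracle teacher whose velocity is exactly eps - x0.
x0, eps = [1.0, -2.0], [0.5, 0.25]
oracle_velocity = lambda x, t: [e - a for a, e in zip(x0, eps)]
x_t = interpolate(x0, eps, 0.75)
x_prev = euler_step(x_t, oracle_velocity(x_t, 0.75), 0.25)
print(x_prev, interpolate(x0, eps, 0.5))   # the Euler step lands on the t = 0.5 point

perfect = lambda x, t: list(x0)            # a consistency function that always returns x0
print(causal_cd_loss(perfect, perfect, oracle_velocity, x0, eps, 0.75, 0.25))  # 0.0
```

Because the target is computed on the fly from a single teacher step, no ODE trajectories have to be generated and stored ahead of time, which is the practical difference from Option A.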

Stage 3: Asymmetric DMD

The few-step AR model from Stage 2 is aligned with the high-quality distribution of the original bidirectional teacher via asymmetric Distribution Matching Distillation (DMD) [20, 22, 28, 30]. The student model self-rolls out to generate a full sequence $\tilde{x}$ and is optimized with the DMD gradient:

\nabla_\theta \mathbb{E}_t \left[ D_{KL}(p_{\theta,t}(\tilde{x}_t) \ || \ p_{data,t}(\tilde{x}_t)) \right] = -\mathbb{E}_{\tilde{x}, t, \tilde{x}_t} \left[ (s_{real}(\tilde{x}_t, t) - s_{fake}(\tilde{x}_t, t)) \frac{\partial \tilde{x}}{\partial \theta} \right].

Here, $\tilde{x}_t$ is a perturbed version of $\tilde{x}$, $s_{real}$ is the score from a frozen diffusion model (the bidirectional teacher), and $s_{fake}$ is the score from an online-trained model.
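The sign and direction of the DMD gradient can be checked on a one-dimensional toy problem. This is only a sketch under strong assumptions: the generator is $\tilde{x} = \theta + z$ with $z \sim \mathcal{N}(0, 1)$, the data distribution is $\mathcal{N}(\mu_{real}, 1)$, and both scores are written in closed form instead of coming from a frozen teacher and an online-trained network. All names are hypothetical.

```python
def gaussian_score(x: float, mean: float, var: float) -> float:
    """Score (gradient of the log-density) of N(mean, var) evaluated at x."""
    return -(x - mean) / var

def dmd_gradient(theta: float, mu_real: float, sigma_t: float) -> float:
    """DMD gradient for the toy generator, averaged over a few fixed probe points."""
    var_t = 1.0 + sigma_t ** 2                    # variance after adding noise of std sigma_t
    probes = [theta - 1.0, theta, theta + 1.0]    # stand-ins for perturbed samples x_t
    total = 0.0
    for x_t in probes:
        s_real = gaussian_score(x_t, mu_real, var_t)  # plays the role of the frozen teacher
        s_fake = gaussian_score(x_t, theta, var_t)    # plays the role of the online score model
        total += -(s_real - s_fake) * 1.0             # d x / d theta = 1 for this generator
    return total / len(probes)

theta, mu_real = -3.0, 2.0
first_grad = dmd_gradient(theta, mu_real, sigma_t=1.0)
for _ in range(100):
    theta -= 0.5 * dmd_gradient(theta, mu_real, sigma_t=1.0)   # plain gradient descent
print(first_grad, theta)
```

Gradient descent on this quantity moves the generator mean toward the data mean, which is the behavior the score difference $s_{real} - s_{fake}$ is meant to induce in the full model.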

Camera-Controllable Distillation: Throughout all three stages, all models (AR teacher, few-step student, and score estimators $s_{real}$ / $s_{fake}$) are initialized from or conditioned on the camera-controllable bidirectional model from Phase 1, ensuring the final distilled model retains camera controllability.

Empirical Validation / Results

Experiments were conducted on two models: Wan2.1-T2V-1.3B [6] and HY1.5-TI2V-8B [7], generating 77-frame videos at 480×832 resolution with an autoregressive chunk size of 4 latent frames. Few-step distillation uses 4 steps.
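To relate the 77-frame clips to the chunk size of 4 latent frames, the arithmetic below assumes a video VAE with 4× temporal compression that encodes the first frame on its own, as in the Wan2.1 VAE. The compression factor is not stated in the text above, so treat it as an assumption; the function names are hypothetical.

```python
def latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """First frame maps to one latent, then one latent per `temporal_stride` frames (assumed)."""
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError("num_frames - 1 must be divisible by the temporal stride")
    return 1 + (num_frames - 1) // temporal_stride

def num_chunks(latents: int, chunk_size: int = 4) -> int:
    """Number of autoregressive chunks, rounding up if the last chunk is partial."""
    return -(-latents // chunk_size)

latents = latent_frames(77)
chunks = num_chunks(latents)
print(latents, chunks)  # 20 5
```

Under that assumption a 77-frame clip corresponds to 20 latent frames, which the autoregressive model would produce as 5 chunks of 4.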

Key Results

Latency Reduction: The framework drastically reduces first-frame latency, making models suitable for real-time interaction.

Table 1: First-frame latency of different HY1.5 and Wan2.1 models. We report the first-frame latency on a single A800 GPU. VAE-related time is excluded.

Base model	Model type	First-frame latency (s)	Speedup over multi-step bidirectional
HY1.5 [7]	Multi-step bidirectional	771.041	1.00 ×
HY1.5	Multi-step AR	81.014	9.52 ×
HY1.5	Few-step AR	3.446	223.75 ×
Wan2.1 [6]	Multi-step bidirectional	269.055	1.00 ×
Wan2.1	Multi-step AR	28.651	9.39 ×
Wan2.1	Few-step AR	1.137	236.64 ×
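The speedup column of Table 1 can be re-derived from the latency column as a quick consistency check. The dictionary layout and function name below are illustrative; the numbers are copied from the table.

```python
from typing import Dict

# First-frame latency in seconds, copied from Table 1.
latencies: Dict[str, Dict[str, float]] = {
    "HY1.5": {"Multi-step bidirectional": 771.041, "Multi-step AR": 81.014, "Few-step AR": 3.446},
    "Wan2.1": {"Multi-step bidirectional": 269.055, "Multi-step AR": 28.651, "Few-step AR": 1.137},
}

def speedups(rows: Dict[str, float]) -> Dict[str, float]:
    """Speedup of each model type over the multi-step bidirectional baseline."""
    base = rows["Multi-step bidirectional"]
    return {model_type: round(base / latency, 2) for model_type, latency in rows.items()}

table = {backbone: speedups(rows) for backbone, rows in latencies.items()}
for backbone, row in table.items():
    print(backbone, row)
```

The recomputed ratios match the reported 9.52× and 223.75× for HY1.5 and 9.39× and 236.64× for Wan2.1, so the ">223×" figure quoted elsewhere is the smaller of the two few-step speedups.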

Camera Controllability Preservation: The distilled few-step AR models successfully retain the camera-controllable generation capability of the base models, as shown in generated samples (Fig. 2 in the paper).

Ablation Studies

Training Data: Direct training on SpatialVid [34] data (with perception-estimated camera poses) under the current setup did not yield reliable camera controllability. The ablation indicates that ground-truth camera trajectories are crucial. Successful training was achieved using two data sources:

3D Reconstruction & Re-rendering: Reconstruct scenes from DL3DV [35] and render videos along prescribed trajectories.
WorldPlay Generation: Sample images from OpenVid [36] and use WorldPlay [8] to generate videos with specified trajectories.

Training Steps (HY1.5 Example):

~1-2K steps: Model is completely uncontrollable.
~5K steps: Controllability begins to emerge but is unstable.
~8K steps: Model achieves strong and reliable controllability.

Minimal Batch Size (Wan2.1 Example):

Batch size < 4: Often fails to learn controllability.
Batch size = 8: Controllability improves substantially but remains somewhat unstable.
Batch size = 16: Enables successful training with high controllability.

Theoretical and Practical Implications

Reproducibility and Standardization: minWM provides a complete, open-source pipeline that demystifies and standardizes the process of building interactive video world models, which previously required integrating scattered techniques.
Architectural Generalization: The framework's modular design and its instantiation on two distinct backbones (cross-attention-based Wan2.1 and MMDiT-style HY1.5) suggest that it extends across architectures and can be applied to other video foundation models.
Practical Engineering Guidance: The included ablation studies on data quality, training steps, and batch size provide actionable insights for researchers and practitioners, reducing trial-and-error effort.
Foundation for Future Research: By releasing intermediate checkpoints and a modular pipeline, minWM serves as a baseline and starting point for future work on extending interactive world models to new control conditions (e.g., pose), datasets, and architectures.

Conclusion

minWM is a full-stack, open-source framework that converts high-quality bidirectional video diffusion models into real-time, camera-controllable, few-step autoregressive video world models. Its two-phase pipeline (camera-control fine-tuning followed by AR distillation) preserves visual quality and controllability while reducing first-frame latency by more than 223×. The framework is reproducible and extensible, and it provides practical guidance for training. Future work will focus on supporting additional control conditions beyond camera trajectories and extending the framework to more model architectures.