Summary (Overview)

Core Innovation: Proposes Causal Forcing++, a new pipeline that uses Causal Consistency Distillation (Causal CD) to efficiently initialize a few-step autoregressive (AR) student model for real-time interactive video generation, replacing the costly Causal ODE distillation method.
Key Problem Solved: Addresses the scalability bottleneck in aggressive low-latency regimes (frame-wise AR with 1-2 sampling steps), where existing initialization methods are either architecturally misaligned (Self Forcing), lack few-step capability (multi-step AR), or are too expensive (Causal Forcing).
Main Findings:
- Causal CD is theoretically equivalent to Causal ODE distillation, as both learn the AR-conditional flow map (consistency function) of the teacher.
- Causal CD is more efficient and effective: It reduces Stage 2 training cost by ~4x (from ~11,600 to ~2,900 A800 GPU-hours), eliminates auxiliary data storage, and yields a stronger initialization due to a smaller per-step optimization gap.
- Under the frame-wise 2-step setting, Causal Forcing++ outperforms the prior SOTA (Causal Forcing): +0.1 in VBench Total, +0.3 in VBench Quality, and +0.335 in VisionReward, while cutting first-frame latency by more than 50% (0.60 s to 0.27 s).
Extended Application: Demonstrates the pipeline's applicability to action-conditioned world model generation (e.g., camera-pose control) in the spirit of Genie 3.

Introduction and Theoretical Foundation

Real-time interactive video generation, essential for applications like world models, demands low latency, streaming rollout, and user controllability. Autoregressive (AR) diffusion models are a natural fit, as they generate content causally across frames while using diffusion within each segment.

Recent AR diffusion distillation methods (e.g., CausVid, Self Forcing, Causal Forcing) have achieved strong results by distilling high-quality bidirectional diffusion models into few-step AR students. However, they typically operate in a chunk-wise 4-step regime, which still incurs non-negligible latency and coarse response granularity.
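
The rollout structure described above (causal across frames, few-step diffusion within each segment) can be sketched as follows. This is a minimal illustration rather than any of the cited methods' actual samplers: `toy_denoiser`, the re-noising schedule, the chunk size of 3, and the timestep values are all assumptions made for the example.

```python
import numpy as np

def toy_denoiser(x_noisy, history, t):
    """Hypothetical stand-in for a few-step AR student: it ignores the noise and
    predicts 'last clean frame + 1' (all ones when there is no history yet)."""
    base = history[-1] if history else np.zeros(x_noisy.shape[1:])
    return np.broadcast_to(base + 1.0, x_noisy.shape).copy()

def ar_rollout(denoise_fn, num_frames, chunk_size, timesteps, frame_shape, seed=0):
    """Causal across chunks, few-step diffusion within each chunk.

    Returns the generated frames and the number of denoiser calls made.
    """
    rng = np.random.default_rng(seed)
    history = []  # clean frames generated so far (the causal context)
    calls = 0
    for start in range(0, num_frames, chunk_size):
        n = min(chunk_size, num_frames - start)
        x = rng.standard_normal((n, *frame_shape))      # pure noise at t = 1
        for k, t in enumerate(timesteps):
            x0 = denoise_fn(x, history, t)              # predicted clean chunk
            calls += 1
            if k + 1 < len(timesteps):
                t_next = timesteps[k + 1]
                eps = rng.standard_normal(x0.shape)
                x = (1.0 - t_next) * x0 + t_next * eps  # re-noise to the next level
            else:
                x = x0
        history.extend(x)                               # finished chunk becomes context
    return np.stack(history), calls

# Chunk-wise 4-step (assumed chunk size 3) versus frame-wise 2-step.
chunk_frames, chunk_calls = ar_rollout(
    toy_denoiser, 6, chunk_size=3, timesteps=[1.0, 0.75, 0.5, 0.25], frame_shape=(2, 2))
frame_frames, frame_calls = ar_rollout(
    toy_denoiser, 6, chunk_size=1, timesteps=[1.0, 0.5], frame_shape=(2, 2))
```

In the chunk-wise call, nothing can be shown until four denoiser passes over a three-frame chunk have finished, and new conditioning can only take effect at chunk boundaries. In the frame-wise call, each frame is available after two passes over a single frame.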

This paper pushes into a more aggressive, low-latency regime: frame-wise autoregression with only 1–2 sampling steps. In this regime, the initialization of the few-step AR student before the final asymmetric Distribution Matching Distillation (DMD) stage is identified as the critical bottleneck. Existing strategies fail in complementary ways:

ODE initialization with a bidirectional teacher (CausVid, Self Forcing): Architecturally misaligned. The teacher's trajectory depends on future frames that the AR student cannot see, so the student's regression target degrades into a blurred conditional expectation.
Direct use of a multi-step AR diffusion model (LiveAvatar, WorldPlay): Lacks few-step generation capability. Approximation error is severely amplified during self-rollout in the 1-2 step setting.
Causal ODE initialization with an AR teacher (Causal Forcing): Corrects the target but is not scalable. It requires precomputing and storing full Probability Flow ODE (PF-ODE) trajectories for every training sample, which is prohibitively expensive.

Therefore, a satisfactory initialization must be simultaneously AR-aligned, few-step capable, and scalable. The paper's theoretical foundation builds on the equivalence between Causal ODE distillation and Causal Consistency Distillation (CD): both aim to learn the AR-conditional flow map (or consistency function) of the teacher model. The key insight is that Causal CD can achieve this target more efficiently by using local supervision between adjacent timesteps on real data, rather than regressing to the endpoint of a full precomputed trajectory.
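
The equivalence can be written out in standard consistency-model notation. This is a sketch under assumptions, not the paper's own derivation: $G_\theta$ denotes the few-step student, while the teacher velocity $v_\phi$, the flow map $f_\phi$, and the convention that time runs from data at $0$ to noise at $1$ are symbols and conventions introduced here.

```latex
% Teacher PF-ODE for frame i, conditioned on the clean history, and its flow map to time 0:
\frac{\mathrm{d}x^i_s}{\mathrm{d}s} = v_\phi(x^i_s, x^{<i}_{gt}, s), \qquad
f_\phi(x^i_t, x^{<i}_{gt}, t) = x^i_t - \int_0^t v_\phi(x^i_s, x^{<i}_{gt}, s)\,\mathrm{d}s

% Causal ODE distillation: regress to the endpoint of a full precomputed trajectory.
\min_\theta \; \mathbb{E}\left[ d\left( G_\theta(x^i_t, x^{<i}_{gt}, t),\; f_\phi(x^i_t, x^{<i}_{gt}, t) \right) \right]

% Causal CD: impose the same map locally, between adjacent timesteps, plus a boundary condition.
G_\theta(x^i_t, x^{<i}_{gt}, t) = G_\theta(\hat{x}^i_{t-\Delta t}, x^{<i}_{gt}, t-\Delta t), \qquad
\hat{x}^i_{t-\Delta t} = x^i_t - \Delta t \, v_\phi(x^i_t, x^{<i}_{gt}, t), \qquad
G_\theta(x^i_0, x^{<i}_{gt}, 0) = x^i_0
```

Chaining the local constraint from $t$ down to $0$ and applying the boundary condition gives $G_\theta = f_\phi$ as $\Delta t \to 0$, which is the same solution the endpoint regression targets. The difference lies in what each objective needs at training time: a stored multi-step trajectory in the first case, a single teacher step in the second.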

Methodology

The proposed Causal Forcing++ pipeline retains the established three-stage framework but replaces the costly Stage 2:

Stage 1: Multi-step AR Diffusion Training. Train an AR diffusion teacher model via teacher forcing on ground-truth data.
Stage 2: Causal Consistency Distillation (Causal CD) for Initialization. This is the novel core. Instead of Causal ODE distillation, the few-step AR student is initialized by enforcing consistency between adjacent timesteps using the AR teacher.
- Objective: The student $G_\theta$ is trained so that its output on a noisy frame at timestep $t$ matches the output of its EMA copy $G_{\theta^-}$ at the adjacent timestep $t-\Delta t$, where the input at $t-\Delta t$ is obtained by taking a single teacher ODE step from the noisy frame:
$\theta^* = \arg\min_{\theta} \mathbb{E}_{x_{gt}, \epsilon, t, i} \left[ w(t) d\left( G_\theta(x^i_t, x^{<i}_{gt}, t), G_{\theta^-}(\hat{x}^i_{t-\Delta t}, x^{<i}_{gt}, t-\Delta t) \right) \right]$ where:
- $x^i_t = \alpha(t)x^i_{gt} + \sigma(t)\epsilon$ is the noisy frame from ground-truth $x^i_{gt}$ .
- $\hat{x}^i_{t-\Delta t}$ is obtained via one teacher ODE step from $x^i_t$ .
- $\theta^-$ is an exponential moving average (EMA) of $\theta$ .
- $w(t)$ is a timestep weight, $d(\cdot,\cdot)$ is a distance metric.
- Under the flow-matching parameterization, $G_\theta(x^i_t, x^{<i}_{gt}, t) = x^i_t - t v_\theta(x^i_t, x^{<i}_{gt}, t)$ , where $v_\theta$ is a velocity prediction network.
- Key Advantage: This requires only one online teacher ODE step per training iteration, eliminating the need to pre-generate and store full multi-step trajectories.
Stage 3: Asymmetric DMD with Self-Rollout. The initialized student is further refined using asymmetric Distribution Matching Distillation, where the teacher and critic are bidirectional models, but the student performs self-rollout (conditioning on its own previously generated frames) to align training and inference.
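
The Stage 2 objective can be sketched for a single sample as below. This is an illustrative NumPy version under stated assumptions, not the paper's training code: the schedule $\alpha(t) = 1 - t$, $\sigma(t) = t$, the squared-L2 distance, the Euler teacher step, and the toy velocity fields (`oracle_velocity`, `zero_velocity`) are all choices made for the example.

```python
import numpy as np

def add_noise(x_gt, eps, t):
    """x_t = alpha(t) x_gt + sigma(t) eps, with alpha(t) = 1 - t and sigma(t) = t (assumed)."""
    return (1.0 - t) * x_gt + t * eps

def consistency_fn(v_fn, x_t, history, t):
    """Flow-matching parameterization: G(x_t, history, t) = x_t - t * v(x_t, history, t)."""
    return x_t - t * v_fn(x_t, history, t)

def teacher_ode_step(v_teacher, x_t, history, t, dt):
    """One Euler step of the teacher ODE from t to t - dt."""
    return x_t - dt * v_teacher(x_t, history, t)

def causal_cd_loss(v_student, v_ema, v_teacher, x_gt, history, eps, t, dt, w=1.0):
    """Single-sample Causal CD loss with squared L2 as the distance d."""
    x_t = add_noise(x_gt, eps, t)
    x_prev = teacher_ode_step(v_teacher, x_t, history, t, dt)
    pred = consistency_fn(v_student, x_t, history, t)
    # The EMA branch is a fixed target; real training would block gradients here.
    target = consistency_fn(v_ema, x_prev, history, t - dt)
    return w * float(np.mean((pred - target) ** 2))

def ema_update(ema_params, params, decay=0.99):
    """theta^- <- decay * theta^- + (1 - decay) * theta, applied per parameter array."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy velocity fields (hypothetical): the next frame is assumed to equal the last history frame,
# so the exact velocity at (x_t, t) is (x_t - history[-1]) / t.
def oracle_velocity(x_t, history, t):
    return (x_t - history[-1]) / t

def zero_velocity(x_t, history, t):
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
history = [rng.standard_normal((4, 4))]   # clean ground-truth context x^{<i}_{gt}
x_gt = history[-1].copy()                 # toy ground-truth frame x^i_{gt}
eps = rng.standard_normal((4, 4))

loss_oracle = causal_cd_loss(oracle_velocity, oracle_velocity, oracle_velocity,
                             x_gt, history, eps, t=0.6, dt=0.1)
loss_untrained = causal_cd_loss(zero_velocity, oracle_velocity, oracle_velocity,
                                x_gt, history, eps, t=0.6, dt=0.1)
```

With the exact velocity in all three roles the loss vanishes up to floating-point error, since both branches land on the clean frame. An untrained student incurs a positive loss. Each evaluation uses one teacher call and no stored trajectory.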

The paper also explores and rejects Causal Score Distillation (Causal DMD) as an initialization alternative. While DMD often outperforms CD in bidirectional settings, its mode-seeking behavior (optimizing reverse KL divergence) makes it overly sensitive to accumulated history errors during AR rollout, leading to severe exposure bias and poorer final performance compared to the mode-covering Causal CD (which optimizes forward KL).
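
The mode-covering versus mode-seeking distinction can be illustrated with a small discrete example. This is a generic property of the two KL directions, not a reproduction of the paper's analysis, and it does not model AR rollout or exposure bias; the three distributions below are made up for the illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

target = [0.495, 0.005, 0.005, 0.495]   # bimodal reference distribution
covering = [0.25, 0.25, 0.25, 0.25]     # spreads mass over both modes
seeking = [0.97, 0.01, 0.01, 0.01]      # commits to a single mode

# Forward KL(target || model) penalizes missing a mode of the target.
forward = {"covering": kl(target, covering), "seeking": kl(target, seeking)}
# Reverse KL(model || target) penalizes putting mass where the target has little.
reverse = {"covering": kl(covering, target), "seeking": kl(seeking, target)}

print("forward KL prefers:", min(forward, key=forward.get))
print("reverse KL prefers:", min(reverse, key=reverse.get))
```

Forward KL ranks the covering candidate lower (better), while reverse KL ranks the single-mode candidate lower, which is the sense in which the two objectives are called mode-covering and mode-seeking.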

Empirical Validation / Results

Experiments are conducted using the Wan2.1-1.3B model as a base, generating videos at 480x832 resolution.

Main Comparison with SOTA Methods (Frame-wise 2-step): Causal Forcing++ (2-step) achieves the best overall performance among existing AR diffusion distillation methods.

Table 1: Quantitative Comparison with Prior Methods

Model	Throughput (FPS) ↑	Latency (s) ↓	VBench Total ↑	VBench Quality ↑	VBench Semantic ↑	Dynamic Degree ↑	VisionReward ↑	Instruct. Follow ↑
CausVid	10.4	0.60	81.33	83.98	70.72	62	5.741	12
Self Forcing	10.4	0.60	83.74	84.48	80.77	57	5.820	48
Causal Forcing	10.4	0.60	84.04	84.59	81.84	68	6.326	56
Causal Forcing++ (1-step)	20.7	0.27	83.35	84.50	78.75	66	5.412	38
Causal Forcing++ (2-step)	14.1	0.27	84.14	84.89	81.13	64	6.661	51
Causal Forcing++ (4-step)	8.69	0.27	84.10	84.94	80.75	71	6.798	47

Key Takeaways:

Superior Performance: Causal Forcing++ (2-step) surpasses the previous SOTA (Causal Forcing) in VBench Total (+0.1), VBench Quality (+0.3), and VisionReward (+0.335).
Dramatically Lower Latency: First-frame latency drops by more than half (0.27 s vs. 0.60 s, a 55% reduction).
Higher Throughput: 2-step Causal Forcing++ reaches ~1.36x the throughput of the 4-step chunk-wise methods (14.1 vs. 10.4 FPS).
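
The headline numbers can be rechecked directly from Table 1. The snippet below only redoes the arithmetic on the two relevant rows; the dictionary keys are names chosen here for convenience.

```python
# Rows copied from Table 1.
causal_forcing = {"fps": 10.4, "latency_s": 0.60, "total": 84.04,
                  "quality": 84.59, "vision_reward": 6.326}
cfpp_2step = {"fps": 14.1, "latency_s": 0.27, "total": 84.14,
              "quality": 84.89, "vision_reward": 6.661}

def score_gains(new, old, keys=("total", "quality", "vision_reward")):
    """Absolute score differences, rounded to the table's precision."""
    return {k: round(new[k] - old[k], 3) for k in keys}

gains = score_gains(cfpp_2step, causal_forcing)
latency_reduction = 1.0 - cfpp_2step["latency_s"] / causal_forcing["latency_s"]
throughput_ratio = cfpp_2step["fps"] / causal_forcing["fps"]

print(gains)
print(f"latency reduction: {latency_reduction:.0%}")
print(f"throughput ratio: {throughput_ratio:.2f}x")
```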

Ablation Study on Initialization Methods: The paper ablates different Stage 2 initialization strategies, evaluating each after Stage 3 under frame-wise 1-, 2-, and 4-step settings.

Table 2: Ablation Study on Initialization Methods (Excerpt for 2-Step Setting)

Initialization Method	VBench Total ↑	VBench Quality ↑	Stage 2 Time Cost (GPU-h) ↓	Extra Storage (GiB) ↓
Self Forcing ODE (Bidir. Teacher)	79.44	80.43	5000	1500
Multi-step AR Diffusion	82.43	83.04	-	0
Causal ODE (AR Teacher)	83.77	84.42	11600	1900
Causal DMD (AR Teacher)	83.73	84.56	2900	0
Causal CD (AR Teacher) - Ours	84.14	84.89	2900	0

Key Findings from Ablations:

Self Forcing Initialization fails in frame-wise settings, producing low-quality results.
Multi-step AR Initialization is insufficient, especially in aggressive low-step settings (e.g., near collapse in 1-step).
Causal CD matches or outperforms Causal ODE across all step settings while being ~4x more efficient in training time and requiring zero extra storage.
Causal DMD is suboptimal despite its efficiency, suffering from stronger exposure bias and lower final scores than Causal CD.
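
The efficiency claims follow from the Table 2 rows. The short check below uses only the tabulated values; the tuple layout and variable names are chosen here for the example.

```python
# Rows copied from Table 2 (2-step setting):
# (method, VBench Total, VBench Quality, Stage 2 GPU-hours, extra storage in GiB).
# None marks the "-" entry, where no Stage 2 cost is reported.
rows = [
    ("Self Forcing ODE (Bidir. Teacher)", 79.44, 80.43, 5000, 1500),
    ("Multi-step AR Diffusion", 82.43, 83.04, None, 0),
    ("Causal ODE (AR Teacher)", 83.77, 84.42, 11600, 1900),
    ("Causal DMD (AR Teacher)", 83.73, 84.56, 2900, 0),
    ("Causal CD (AR Teacher)", 84.14, 84.89, 2900, 0),
]

cost = {name: gpu_h for name, _, _, gpu_h, _ in rows}
speedup = cost["Causal ODE (AR Teacher)"] / cost["Causal CD (AR Teacher)"]
best_total = max(rows, key=lambda r: r[1])[0]
zero_storage = [name for name, *_, storage in rows if storage == 0]

print(f"Stage 2 cost ratio (Causal ODE / Causal CD): {speedup:.1f}x")
print("Best VBench Total:", best_total)
print("No extra storage:", zero_storage)
```

Causal DMD and Causal CD share the same Stage 2 cost and storage footprint in this table, so the choice between them rests on the quality columns.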

Visual Results: Qualitative comparisons show that Causal Forcing++ produces videos with strong dynamics and visual quality comparable to or better than Causal Forcing, while avoiding artifacts present in other methods (e.g., blurring, object inconsistency).

Theoretical and Practical Implications

Theoretical: Establishes the equivalence between Causal ODE distillation and Causal Consistency Distillation for learning the AR-conditional flow map. Provides analysis on why the mode-covering behavior of CD (forward KL) is more robust to exposure bias in AR rollout compared to the mode-seeking behavior of DMD (reverse KL).
Practical:
- Enables Scalable Low-Latency Generation: Causal Forcing++ provides a practical pathway to real-time interactive video generation by making aggressive few-step, frame-wise distillation feasible and efficient.
- Reduces Training Cost: The ~4x reduction in Stage 2 training cost and elimination of massive trajectory storage lower the barrier for research and development in this area.
- Improves Performance: Demonstrates that more efficient training can also lead to better model quality, achieving new SOTA results in the target regime.
- Broad Applicability: The pipeline is successfully extended to action-conditioned world model generation, showcasing its potential for building interactive AI agents and simulators.

Conclusion

Causal Forcing++ addresses the training inefficiency of existing AR diffusion distillation methods by replacing costly causal ODE distillation with causal consistency distillation (Causal CD) for few-step student initialization. This approach is principled (targeting the same AR flow map), scalable (avoiding offline trajectory generation), and effective (yielding stronger initialization).

The method, for the first time, achieves performance comparable to or better than prior SOTA methods under the challenging frame-wise 2-step setting, while cutting first-frame latency by more than half. This represents a significant step towards realizing the promise of real-time interactive video generation for applications like world models. Future work may focus on further reducing steps for action-conditioned models and exploring other conditional signals.