YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Summary (Overview)

Key Contribution: Introduces YoCausal, a novel benchmark for evaluating causal cognition in Video Diffusion Models (VDMs) by leveraging the Violation of Expectation (VoE) paradigm from cognitive science.
Scalable Methodology: Uses temporally reversed real-world videos as natural, zero-cost counterfactual samples, eliminating the sim-to-real gap and enabling arbitrarily extensible evaluation.
Two-Level Framework: Proposes the Reverse Surprise Index (RSI) to measure arrow-of-time perception and the Causality Cognition Index (CCI) to disentangle genuine causal understanding from statistical temporal biases.
Key Findings: Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
Correlations: Shows that causal cognition correlates with intuitive physics but not with aesthetic quality, and that it improves with model size and architectural evolution, consistent with scaling laws.

Introduction and Theoretical Foundation

A long-standing aspiration of AI is to build machines that truly model the world. Video generation models, trained on vast real-world data, are often regarded as promising candidates for world modeling. However, a fundamental question remains: do these models actually understand causality, or do they merely learn statistical temporal patterns?

Previous research on "world knowledge" has focused on adherence to physical laws, often using synthetic data, which creates a sim-to-real gap. YoCausal avoids this gap by using real-world videos, and it assesses broader causal comprehension. The benchmark is inspired by the Violation of Expectation (VoE) paradigm from cognitive science. In seminal infant studies, surprise elicited by temporally reversed videos is taken as evidence of causal perception. YoCausal transfers this principle to VDMs by treating a model's learned distribution as its "expectation": a causally aware model should assign lower probability (higher denoising loss) to reversed, counterfactual videos than to their forward counterparts.

The paper focuses on intuitively observable cause-and-effect mechanisms (event A leads to event B), rather than formal structural causal models.

Methodology

The YoCausal framework consists of dataset construction, a formal link between model "surprise" and denoising loss, and two-level evaluation metrics.

Dataset Construction

The benchmark uses real-world videos from four thematic subsets: General (daily-life events), Physics (mechanics, optics, thermodynamics), Human Action, and Animal Action. Any video can be temporally reversed at zero cost to produce a counterfactual pair $(x_f, x_r)$, making the dataset arbitrarily extensible and bridging the sim-to-real gap.
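The pair construction described above can be sketched in a few lines. This is a minimal illustration, assuming a clip is already decoded into an ordered sequence of frames; the function name `make_counterfactual_pair` and the string placeholders for frames are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def make_counterfactual_pair(frames: Sequence[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Return (x_f, x_r): the forward clip and its temporally reversed copy."""
    x_f = list(frames)
    x_r = x_f[::-1]  # same frames, opposite temporal order
    return x_f, x_r


# Hypothetical four-frame clip; strings stand in for image arrays.
x_f, x_r = make_counterfactual_pair(["f0", "f1", "f2", "f3"])
```

Because reversal only reorders existing frames, the two clips share identical per-frame content, and only the temporal order differs.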

Table 1: Comparison with existing physics-law evaluation benchmarks.

Benchmark	Video type	# Video	# Video scene
PhyWorld [55]	Synthetic (2D)	3M	70
LikePhys [128]	Synthetic	120	12
Physion [12]	Synthetic	10400	260
IntPhys2 [15]	Synthetic	1416	344
Phys101 [118]	Real-World (Controlled)	2500	101
Physics IQ [85]	Real-World (Controlled)	396	132
Ours	Real-World	1232 ↑	1232 ↑

Formulating Surprise via Denoising Loss

Under the VoE paradigm, "surprise" corresponds to low probability. For a diffusion model, the denoising loss serves as a proxy for negative log-likelihood (NLL). The denoising loss is formulated as:

\mathcal{L}_{\text{denoise}}(\theta; x_t) = \mathbb{E}_{t \sim \mathcal{U}(1, T), \epsilon \sim \mathcal{N}(0, \mathbf{I})} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|_2^2 \right] \gtrsim \mathbb{E}_{x_0} \left[ -\log p_\theta(x_0) \right].

(Equation 1)

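Here $x_t$ is the noised version of a clean video $x_0$ at timestep $t$, $\epsilon$ is the injected Gaussian noise, and $\epsilon_\theta$ is the model's noise prediction. A higher denoising loss indicates lower model-assigned probability (greater surprise).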
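As a rough illustration of Equation 1, the expectation can be approximated by Monte Carlo sampling over timesteps and noise draws. The sketch below is an assumption-laden toy: it uses the standard DDPM forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, a made-up schedule `ALPHA_BARS`, flat Python lists in place of video tensors, and two hypothetical noise predictors. None of these details are specified in the source, and a real VDM may use a different parameterization.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]
NoisePredictor = Callable[[Vector, int], Vector]


def denoising_loss(x0: Vector, eps_model: NoisePredictor,
                   alpha_bars: Sequence[float],
                   n_samples: int = 64, seed: int = 0) -> float:
    """Monte Carlo estimate of E_{t, eps} ||eps - eps_model(x_t, t)||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.randrange(len(alpha_bars))  # uniform timestep (0-based index)
        a = alpha_bars[t]
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        # Assumed DDPM-style forward noising of the clean sample.
        x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
        eps_hat = eps_model(x_t, t)
        total += sum((e - h) ** 2 for e, h in zip(eps, eps_hat))
    return total / n_samples


ALPHA_BARS = [0.9, 0.7, 0.5, 0.3, 0.1]  # hypothetical cumulative schedule
x0 = [0.5, -1.0, 2.0]                   # hypothetical "video" flattened to 3 values


def zero_predictor(x_t: Vector, t: int) -> Vector:
    """A predictor that knows nothing: always guesses zero noise."""
    return [0.0] * len(x_t)


def oracle_predictor(x_t: Vector, t: int) -> Vector:
    """A predictor that cheats by knowing x0, so it recovers eps exactly."""
    a = ALPHA_BARS[t]
    return [(xt - math.sqrt(a) * x) / math.sqrt(1.0 - a) for xt, x in zip(x_t, x0)]


loss_zero = denoising_loss(x0, zero_predictor, ALPHA_BARS)
loss_oracle = denoising_loss(x0, oracle_predictor, ALPHA_BARS)
```

The oracle's loss is essentially zero and the uninformed predictor's loss is large, mirroring the reading of low loss as "expected" and high loss as "surprising".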

Level 1: Measuring Arrow-of-Time Perception via RSI

A model that has internalized causal cognition should assign higher likelihood to a forward video $x_f$ than to its reversed counterpart $x_r$. This is expressed as:

\mathcal{L}_{\text{denoise}}(\theta; x_r) > \mathcal{L}_{\text{denoise}}(\theta; x_f).

(Equation 2)

The Reverse Surprise Index (RSI) measures the proportion of videos for which the model correctly assigns a lower denoising loss to the forward sequence. For a dataset $\mathcal{D}$ composed of sub-datasets $\{\mathcal{D}_i\}$:

\text{RSI}(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{\mathcal{D}_i \in \mathcal{D}} \frac{1}{|\mathcal{D}_i|} \sum_{x_{i,j} \in \mathcal{D}_i} \mathbb{1}\left[ \mathcal{L}_{\text{denoise}}(\theta; x_{i,j}^r) > \mathcal{L}_{\text{denoise}}(\theta; x_{i,j}^f) \right],

(Equation 3)

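where $\mathbb{1}[\cdot]$ is the indicator function and $\text{RSI} \in [0, 1]$. Because the outer sum runs over sub-datasets, $|\mathcal{D}|$ here counts sub-datasets, so RSI is a macro-average of per-subset win rates. Higher values indicate stronger perception of the arrow of time. Crucially, RSI alone cannot isolate causal cognition, because the surprise it registers may stem from either the reversed arrow of time or the reversed causality.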
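Equation 3 can be sketched as follows, assuming the forward and reversed denoising losses have already been computed for each video. The data layout (a dictionary from subset name to `(forward_loss, reversed_loss)` pairs) and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

LossPair = Tuple[float, float]  # (forward loss, reversed loss) for one video


def rsi(subsets: Dict[str, List[LossPair]]) -> float:
    """Macro-averaged share of videos whose reversed loss exceeds the forward loss."""
    per_subset = []
    for name, pairs in subsets.items():
        if not pairs:
            raise ValueError(f"subset {name!r} is empty")
        hits = sum(1 for loss_f, loss_r in pairs if loss_r > loss_f)
        per_subset.append(hits / len(pairs))
    # Average the per-subset rates so each subset counts equally.
    return sum(per_subset) / len(per_subset)


# Hypothetical losses: 1 of 2 "general" videos and 3 of 4 "physics" videos are hits.
losses = {
    "general": [(0.10, 0.12), (0.20, 0.18)],
    "physics": [(0.30, 0.35), (0.25, 0.40), (0.22, 0.21), (0.50, 0.60)],
}
score = rsi(losses)  # (0.5 + 0.75) / 2
```

Note that the macro-average here (0.625) differs from the pooled proportion (4 of 6 videos), which is why the normalization in Equation 3 matters when subsets differ in size.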

Level 2: Disentangling Causality via CCI

To disentangle genuine causal understanding, the dataset is partitioned into a causal subset $\mathcal{D}_c$ and a non-causal subset $\mathcal{D}_{nc}$ according to whether obvious cause-effect interactions are present; the partition is automated with a Vision-Language Model (VLM) and a carefully designed prompt. Reversing a causal video introduces two sources of abnormality, reversed temporal direction and reversed causality, whereas reversing a non-causal video introduces only the first.

The Causality Cognition Index (CCI) is defined as:

\text{CCI}(\mathcal{D}) = \text{RSI}(\mathcal{D}_c) - \text{RSI}(\mathcal{D}_{nc}).

(Equation 4)

A higher CCI indicates that a model captures reversed-causality cues beyond statistical temporal patterns. The VLM-based partitioning is validated in two ways: it agrees closely with human annotations (Kendall $\tau = 0.7613$, F1-score 82.76%), and the optical flow magnitude distributions of $\mathcal{D}_c$ and $\mathcal{D}_{nc}$ differ negligibly (Cohen's $d = 0.057 < 0.2$), which suggests the split reflects semantic content rather than motion intensity.
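Equation 4 reduces to a difference of two win rates. The sketch below assumes each partition is a flat list of `(forward_loss, reversed_loss)` pairs and computes RSI on it as a simple proportion; whether the paper keeps per-theme macro-averaging inside $\mathcal{D}_c$ and $\mathcal{D}_{nc}$ is not stated here, and the helper names and numbers are hypothetical.

```python
from typing import List, Tuple

LossPair = Tuple[float, float]  # (forward loss, reversed loss) for one video


def win_rate(pairs: List[LossPair]) -> float:
    """Share of videos whose reversed clip has the higher denoising loss."""
    if not pairs:
        raise ValueError("partition is empty")
    return sum(1 for loss_f, loss_r in pairs if loss_r > loss_f) / len(pairs)


def cci(causal: List[LossPair], non_causal: List[LossPair]) -> float:
    """CCI = RSI(D_c) - RSI(D_nc), with RSI taken as a flat win rate (assumption)."""
    return win_rate(causal) - win_rate(non_causal)


# Hypothetical losses: 3 of 4 causal videos and 2 of 4 non-causal videos are hits.
causal_pairs = [(0.10, 0.20), (0.15, 0.30), (0.20, 0.25), (0.30, 0.28)]
non_causal_pairs = [(0.10, 0.11), (0.20, 0.19), (0.25, 0.26), (0.30, 0.29)]
score = cci(causal_pairs, non_causal_pairs)  # 0.75 - 0.50
```

A positive value means reversal is detected more often when it also breaks a cause-effect relation, which is the extra signal CCI is designed to isolate.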

Empirical Validation / Results

The benchmark evaluates 13 state-of-the-art open-source VDMs (e.g., AnimateDiff, CogVideoX, Wan, LTX-Video, HunyuanVideo, Mochi).

Level 1 RSI Results

Human annotators achieve the highest RSI across most subsets, serving as an upper bound. Several models surpass the 50% random-guess baseline with 90% confidence (bootstrap test), but a significant gap remains relative to human performance. Higher-fidelity models tend to score higher. Per-subset results reveal cross-domain variation due to differing cue strength and training data biases.
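The source mentions a bootstrap test at 90% confidence but does not describe its details. One plausible reading, sketched below under that assumption, is a one-sided percentile bootstrap: resample the per-video hit indicators with replacement and check whether the lower confidence bound on the mean exceeds 0.5. The function names, the number of resamples, and the one-sided choice are all assumptions.

```python
import random
from typing import Sequence


def bootstrap_lower_bound(outcomes: Sequence[int], confidence: float = 0.90,
                          n_resamples: int = 2000, seed: int = 0) -> float:
    """Approximate one-sided lower percentile bound on the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("no outcomes to resample")
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample videos with replacement and record the resampled hit rate.
        sample_sum = sum(outcomes[rng.randrange(n)] for _ in range(n))
        means.append(sample_sum / n)
    means.sort()
    index = int((1.0 - confidence) * n_resamples)
    return means[min(index, n_resamples - 1)]


def beats_chance(outcomes: Sequence[int], chance: float = 0.5) -> bool:
    """True if the bootstrap lower bound lies strictly above the chance level."""
    return bootstrap_lower_bound(outcomes) > chance
```

For example, `beats_chance([1] * 50)` holds because every resample has a hit rate of 1.0, while a model that is right on exactly half of the videos would not pass.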

Figure 6: RSI scores show models lag behind human performance. Some models score below 50%, suggesting they capture local visual smoothness without internalizing the arrow of time.

Level 2 CCI Results

Humans achieve the highest CCI. Several models attain positive CCI with 90% confidence, demonstrating preliminary causal perception; the top performers are in the Wan and CogVideo families. Crucially, some models that rank high on RSI (e.g., LTX-Video, HunyuanVideo) score poorly on CCI, confirming that the framework disentangles causal cognition from mere arrow-of-time perception. Some models show negative CCI, indicating that they lack an internalized notion of causality.

Figure 7: Left: RSI scores on $\mathcal{D}_c$ (dark blue) and $\mathcal{D}_{nc}$ (light blue). Right: The resulting CCI values.

Aggregate Ranking

An aggregate causality score combines RSI and CCI ranks (summing ranks, lower is better). Ties are broken by RSI rank. This provides a holistic view of each model's causal cognition capability.
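The rank-sum rule above can be sketched as follows. The model names and scores are hypothetical, and the handling of exactly equal scores (ordinal ranks in sort order) is an assumption, since the source only specifies that rank sums are tie-broken by RSI rank.

```python
from typing import Dict, List


def ordinal_ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 goes to the highest score; equal scores keep their input order."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def aggregate_ranking(rsi: Dict[str, float], cci: Dict[str, float]) -> List[str]:
    """Order models by RSI rank + CCI rank (lower is better), ties broken by RSI rank."""
    rsi_rank = ordinal_ranks(rsi)
    cci_rank = ordinal_ranks(cci)
    return sorted(rsi, key=lambda m: (rsi_rank[m] + cci_rank[m], rsi_rank[m]))


# Hypothetical scores: model_a and model_b both have rank sum 3.
rsi_scores = {"model_a": 0.70, "model_b": 0.60, "model_c": 0.50}
cci_scores = {"model_a": 0.05, "model_b": 0.10, "model_c": -0.02}
ranking = aggregate_ranking(rsi_scores, cci_scores)
```

Here `model_a` and `model_b` tie on rank sum, and `model_a` is placed first because it has the better RSI rank.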

Figure 8: Aggregate ranking (lower is better). Wan2.2-T2V-A14B and Wan2.1-T2V-14B rank highest, close to human performance.

Cross-Metric Analysis

Kendall's rank correlation $\tau$ between the aggregate rank and external metrics/model properties reveals:

Table 2: Cross-metric analysis.

Metric / Feature	Kendall's $\tau$	p-value
Human Preference	0.3333	0.4694
LikePhys [128] (Intuitive Physics)	0.5111	0.0466
Aesthetic Quality	0.0000	1.0000
Subject Consistency	0.3333	0.1289
Background Consistency	0.0256	0.9524
Motion Smoothness	0.2821	0.2044
Temporal Flickering	0.2564	0.2519
Release Date	0.5958	0.0316
# of Parameters	0.6880	0.0093

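The correlation with human preference is moderate ($\tau = 0.3333$) but not statistically significant ($p = 0.4694$), so it offers only tentative support for the benchmark's alignment with human judgment.
Positive correlation with LikePhys ($\tau = 0.5111$, $p = 0.0466$) implies that causal cognition is related to, but not reducible to, physical intuition.
Zero correlation with aesthetic quality ($\tau = 0.0000$) indicates that the benchmark is not confounded by visual appeal.
Strong correlation with release date ($\tau = 0.5958$, $p = 0.0316$) and parameter count ($\tau = 0.6880$, $p = 0.0093$) supports the view that scaling laws and architectural evolution extend to causal cognition.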
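For readers unfamiliar with the statistic in Table 2, the sketch below computes Kendall's rank correlation from pairwise concordance. It implements the tau-b variant, which corrects for ties; the source does not say which variant the paper uses, so that choice is an assumption, and the example ranks are made up. The p-values in the table would require an additional significance test that is not shown here.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (concordant - discordant) pairs, normalized with tie correction."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom > 0 else float("nan")


# Hypothetical ranks: two of three pairs agree in order, one disagrees.
tau = kendall_tau_b([1, 2, 3], [1, 3, 2])  # (2 - 1) / 3
```

Values near 1 mean two rankings order the models almost identically, values near 0 mean the orderings are unrelated, and values near -1 mean they are reversed.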

Entropy-Controlled Subset Analysis

To address concerns that models exploit low-level entropy dynamics, a subset of videos with symmetric optical-flow magnitude trajectories (low asymmetry score $a = \| M_f - \text{reverse}(M_f) \|_2 / \| M_f \|_2$) is analyzed. RSI scores on this motion-symmetric subset closely track those on the full dataset, indicating that RSI captures event-level temporal structure rather than low-level entropy cues.

Figure 9: RSI results on the full dataset vs. motion-symmetric subset show close agreement.
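The asymmetry score used to select the symmetric subset can be sketched directly from its formula. The sketch assumes $M_f$ is a per-frame sequence of optical-flow magnitudes for the forward video, which is an interpretation of the notation; the example trajectories are hypothetical, and the threshold for "low" asymmetry is not given in the source.

```python
import math
from typing import Sequence


def asymmetry_score(m_f: Sequence[float]) -> float:
    """a = ||M_f - reverse(M_f)||_2 / ||M_f||_2 for a flow-magnitude trajectory."""
    norm = math.sqrt(sum(v * v for v in m_f))
    if norm == 0.0:
        raise ValueError("trajectory has zero norm")
    # Compare each value with the one at the mirrored position in time.
    diff = [a - b for a, b in zip(m_f, reversed(m_f))]
    return math.sqrt(sum(d * d for d in diff)) / norm


# Hypothetical trajectories: one palindromic (symmetric), one steadily increasing.
symmetric = asymmetry_score([1.0, 2.0, 3.0, 2.0, 1.0])
ramp = asymmetry_score([1.0, 2.0, 3.0])
```

A palindromic trajectory scores exactly zero, so forward and reversed clips are indistinguishable by motion magnitude alone, which is what makes such videos a useful control.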

Theoretical and Practical Implications

Unique Evaluation Dimension: YoCausal provides the first causality benchmark for VDMs free from sim-to-real gaps, capturing a capability distinct from intuitive physics and visual quality.
Disentanglement of Cognition: The two-level framework (RSI & CCI) successfully disentangles arrow-of-time perception from genuine causal reasoning, revealing that high temporal perception does not guarantee causal understanding.
Guidance for Model Development: The findings indicate that scaling parameters and advancing architectures (e.g., UNet to DiT) improve causal cognition, suggesting scaling laws apply to higher-order reasoning. This motivates treating causal cognition as a distinct objective for future model improvement.
Cognitive Science Transfer: The successful adaptation of the VoE paradigm from infant studies to generative models demonstrates a valuable interdisciplinary approach for probing AI cognition.

Conclusion

YoCausal establishes a scalable, real-world benchmark for evaluating causal cognition in VDMs. Key takeaways are:

Perceiving the arrow of time is not equivalent to understanding causality. Models can exhibit temporal perception without causal reasoning.
A significant human-model gap persists. Even the best models lag behind human-level causal cognition.
Causal cognition is a distinct capability. It correlates with intuitive physics but not with aesthetic quality, and benefits from model scaling.
The benchmark is extensible and validated. The use of reversed real-world videos and the two-level metric framework provide a robust and scalable evaluation protocol.

Limitations: The method struggles with temporally symmetric events (e.g., Newton's cradle). Computing denoising losses requires access to model weights, limiting evaluation of closed-source models. Future work will address these limitations.

YoCausal motivates the community to treat causal cognition as a critical objective in the pursuit of world models.