YoCausal: How Far is Video Generation from World Model? A Causality Perspective
Summary (Overview)
- Key Contribution: Introduces YoCausal, a novel benchmark for evaluating causal cognition in Video Diffusion Models (VDMs) by leveraging the Violation of Expectation (VoE) paradigm from cognitive science.
- Scalable Methodology: Uses temporally reversed real-world videos as natural, zero-cost counterfactual samples, eliminating the sim-to-real gap and enabling arbitrarily extensible evaluation.
- Two-Level Framework: Proposes the Reverse Surprise Index (RSI) to measure arrow-of-time perception and the Causality Cognition Index (CCI) to disentangle genuine causal understanding from statistical temporal biases.
- Key Findings: Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
- Correlations: Causal cognition correlates with intuitive physics but not with aesthetic quality, and it improves with model size and architectural evolution, consistent with scaling laws.
Introduction and Theoretical Foundation
A long-standing aspiration of AI is to build machines that truly model the world. Video generation models, trained on vast real-world data, are often regarded as promising candidates for world modeling. However, a fundamental question remains: do these models actually understand causality, or do they merely learn statistical temporal patterns?
Previous research on "world knowledge" has focused on adherence to physical laws, often using synthetic data, which creates a sim-to-real gap. YoCausal aims to bridge this gap by assessing broader causal comprehension. The benchmark is inspired by the Violation of Expectation (VoE) paradigm from cognitive science. In seminal infant studies, surprise elicited by temporally reversed videos indicates causal perception. This principle is transferred to VDMs: a causally-aware model should assign lower probability (higher denoising loss) to reversed counterfactual videos than to forward ones, treating its learned distribution as its "expectation."
The paper focuses on intuitively observable cause-and-effect mechanisms (event A leads to event B), rather than formal structural causal models.
Methodology
The YoCausal framework consists of dataset construction, a formal link between model "surprise" and denoising loss, and two-level evaluation metrics.
Dataset Construction
The benchmark uses real-world videos from four thematic subsets: General (daily-life events), Physics (mechanics, optics, thermodynamics), Human Action, and Animal Action. Any video can be temporally reversed at zero cost to produce a counterfactual pair (forward video, reversed video), making the dataset arbitrarily extensible and bridging the sim-to-real gap.
Table 1: Comparison with existing physics-law evaluation benchmarks.
| Benchmark | Video type | # Videos | # Video scenes |
| --- | --- | --- | --- |
| PhyWorld [55] | Synthetic (2D) | 3M | 70 |
| LikePhys [128] | Synthetic | 120 | 12 |
| Physion [12] | Synthetic | 10400 | 260 |
| IntPhys2 [15] | Synthetic | 1416 | 344 |
| Phys101 [118] | Real-World (Controlled) | 2500 | 101 |
| Physics IQ [85] | Real-World (Controlled) | 396 | 132 |
| Ours | Real-World | 1232 ↑ | 1232 ↑ |
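The zero-cost reversal described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Frame` type and the `make_counterfactual_pair` helper are hypothetical names, and each frame is stood in for by a short list of floats rather than decoded pixels.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for one decoded frame (a flat list of pixel values).
Frame = Sequence[float]


def make_counterfactual_pair(frames: List[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Return (forward, reversed) clips built from the same frames.

    The reversed clip plays the identical frames in the opposite order,
    so no new content has to be rendered or simulated.
    """
    forward = list(frames)
    reversed_clip = forward[::-1]
    return forward, reversed_clip


# Toy clip with four one-pixel frames.
clip = [[0.0], [0.1], [0.2], [0.3]]
fwd, rev = make_counterfactual_pair(clip)
```

Because the pair shares every frame, any difference in how a model scores the two clips can only come from frame order.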
Formulating Surprise via Denoising Loss
Under the VoE paradigm, "surprise" corresponds to low probability. For a diffusion model, the denoising loss serves as a proxy for negative log-likelihood (NLL). The denoising loss is formulated as:
(Equation 1)
A higher denoising loss indicates lower model-assigned probability (greater surprise).
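To make the link between loss and surprise concrete, the following sketch estimates a noise-prediction loss by Monte Carlo sampling. Everything here is an assumption for illustration: the variance-preserving noise schedule, the `Denoiser` signature, and the toy denoiser (which simply "expects" one fixed sequence) are hypothetical and are not taken from the paper or from any real VDM.

```python
import math
import random
from typing import Callable, List

Vector = List[float]
# Hypothetical denoiser signature: (noisy input, noise level t) -> predicted noise.
Denoiser = Callable[[Vector, float], Vector]


def denoising_loss(x: Vector, denoiser: Denoiser,
                   num_samples: int = 256, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean squared noise-prediction error on x."""
    rng = random.Random(seed)  # fixed seed: paired inputs see the same noise draws
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(0.05, 0.95)      # toy noise level in (0, 1)
        alpha = math.sqrt(1.0 - t)       # signal scale (assumed schedule)
        sigma = math.sqrt(t)             # noise scale (assumed schedule)
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        x_t = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
        pred = denoiser(x_t, t)
        total += sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(x)
    return total / num_samples


def make_toy_denoiser(expected: Vector) -> Denoiser:
    """Toy model whose 'learned distribution' is a single expected sequence."""
    def denoiser(x_t: Vector, t: float) -> Vector:
        alpha, sigma = math.sqrt(1.0 - t), math.sqrt(t)
        return [(xt - alpha * m) / sigma for xt, m in zip(x_t, expected)]
    return denoiser


forward = [0.0, 0.2, 0.4, 0.6]            # per-frame scalar, increasing over time
toy = make_toy_denoiser(expected=forward)
loss_forward = denoising_loss(forward, toy)
loss_reversed = denoising_loss(forward[::-1], toy)
```

In this toy setting the forward sequence matches the model's expectation and receives a near-zero loss, while the reversed sequence receives a strictly larger one. Reusing the same seed for both inputs is a design choice of this sketch, so the comparison is not dominated by sampling noise.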
Level 1: Measuring Arrow-of-Time Perception via RSI
A model that has internalized causal cognition should assign higher likelihood to a forward video than to its reversed counterpart. This is expressed as:
(Equation 2)
The Reverse Surprise Index (RSI) measures the proportion of videos for which the model correctly assigns a lower denoising loss to the forward sequence. For a dataset composed of multiple sub-datasets:
(Equation 3)
where the indicator function equals 1 when its condition holds (the forward video receives the lower denoising loss) and 0 otherwise. Higher values indicate stronger perception of the arrow of time. Crucially, RSI alone cannot isolate causal cognition, because surprise may stem from either a reversed arrow of time or reversed causality.
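The RSI described above reduces to counting how often the forward loss is the smaller of the two. The sketch below assumes the losses have already been computed and that videos from all sub-datasets are pooled with equal weight; the exact aggregation across sub-datasets is not reproduced in this summary, so the pooling is an assumption, and the loss values are made up for illustration.

```python
from typing import Dict, List, Tuple

# (forward loss, reversed loss) for one video.
LossPair = Tuple[float, float]


def rsi(pairs: List[LossPair]) -> float:
    """Fraction of videos whose forward loss is strictly lower than the reversed loss."""
    if not pairs:
        raise ValueError("RSI is undefined for an empty set of videos")
    hits = sum(1 for forward_loss, reversed_loss in pairs if forward_loss < reversed_loss)
    return hits / len(pairs)


def pooled_rsi(subsets: Dict[str, List[LossPair]]) -> float:
    """RSI over all videos pooled together (assumed aggregation)."""
    all_pairs = [pair for pairs in subsets.values() for pair in pairs]
    return rsi(all_pairs)


# Illustrative, made-up losses for two sub-datasets.
losses = {
    "general": [(0.10, 0.14), (0.20, 0.18), (0.30, 0.35), (0.25, 0.40)],
    "physics": [(0.50, 0.60), (0.40, 0.30)],
}
per_subset = {name: rsi(pairs) for name, pairs in losses.items()}
overall = pooled_rsi(losses)
```

Here three of the four "general" videos and one of the two "physics" videos are scored in the expected direction. A value of 0.5 corresponds to random guessing.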
Level 2: Disentangling Causality via CCI
To disentangle genuine causal understanding, the dataset is partitioned into a causal subset and a non-causal subset according to whether obvious cause-effect interactions are present. The partition is automated with a Vision-Language Model (VLM) using a carefully designed prompt. Reversing a causal video introduces two sources of abnormality, reversed temporal direction and reversed causality, whereas reversing a non-causal video introduces only the first.
The Causality Cognition Index (CCI) is defined as:
(Equation 4)
A higher CCI indicates that a model captures reversed-causality cues beyond statistical temporal patterns. The VLM-based partitioning is validated by high agreement with human annotations (measured by Kendall's τ and an F1-score of 82.76%) and by a negligible difference in optical-flow magnitude distributions between the causal and non-causal subsets (measured by Cohen's d), suggesting that the partition reflects semantic reasoning rather than low-level motion statistics.
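The following sketch shows one way the CCI could be computed once the dataset has been partitioned. Equation 4 is not reproduced in this summary, so the form used here, RSI on the causal subset minus RSI on the non-causal subset, is an assumption inferred from the text (CCI can be positive or negative and is derived from the two subset RSIs). The loss values are made up.

```python
from typing import List, Tuple

LossPair = Tuple[float, float]  # (forward loss, reversed loss)


def rsi(pairs: List[LossPair]) -> float:
    """Fraction of videos whose forward loss is strictly lower than the reversed loss."""
    if not pairs:
        raise ValueError("RSI is undefined for an empty set of videos")
    return sum(1 for fwd, rev in pairs if fwd < rev) / len(pairs)


def cci(causal: List[LossPair], non_causal: List[LossPair]) -> float:
    """Assumed form: extra arrow-of-time accuracy attributable to causal content."""
    return rsi(causal) - rsi(non_causal)


# Illustrative, made-up losses.
causal_pairs = [(0.10, 0.20), (0.15, 0.30), (0.40, 0.35), (0.22, 0.28)]      # RSI 0.75
non_causal_pairs = [(0.10, 0.12), (0.30, 0.25), (0.20, 0.26), (0.18, 0.17)]  # RSI 0.50
score = cci(causal_pairs, non_causal_pairs)
```

Under this reading, a model that relies only on temporal statistics would score similarly on both subsets and land near zero, while reversed causality adds an extra cue only in the causal subset.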
Empirical Validation / Results
The benchmark evaluates 13 state-of-the-art open-source VDMs (e.g., AnimateDiff, CogVideoX, Wan, LTX-Video, HunyuanVideo, Mochi).
Level 1 RSI Results
Human annotators achieve the highest RSI across most subsets, serving as an upper bound. Several models surpass the 50% random-guess baseline with 90% confidence (bootstrap test), but a significant gap remains relative to human performance. Higher-fidelity models tend to score higher. Per-subset results reveal cross-domain variation due to differing cue strength and training data biases.
Figure 6: RSI scores show models lag behind human performance. Some models score below 50%, suggesting they capture local visual smoothness without internalizing the arrow of time.
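The bootstrap test mentioned above can be sketched as follows. The exact procedure is not described in this summary, so this is one plausible variant under stated assumptions: per-video outcomes are resampled with replacement, and a model is said to beat the 50% baseline with 90% confidence when the one-sided 10th-percentile bound of the resampled RSI exceeds 0.5. The function name and the outcome data are hypothetical.

```python
import random
from typing import List


def bootstrap_lower_bound(outcomes: List[int], confidence: float = 0.90,
                          num_resamples: int = 2000, seed: int = 0) -> float:
    """One-sided percentile-bootstrap lower bound on RSI.

    outcomes[i] is 1 if the model gave video i a lower loss in the forward
    direction, and 0 otherwise.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    resampled = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(num_resamples)
    )
    index = int((1.0 - confidence) * num_resamples)
    return resampled[index]


# Made-up outcomes: one model at 70% RSI, one at exactly 50%.
strong_model = [1] * 70 + [0] * 30
chance_model = [1] * 50 + [0] * 50
strong_bound = bootstrap_lower_bound(strong_model)
chance_bound = bootstrap_lower_bound(chance_model)
beats_chance = strong_bound > 0.5
```

With 100 videos, the 70% model clears the baseline while the 50% model does not, which illustrates why a raw RSI slightly above 50% is not sufficient evidence on its own.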
Level 2 CCI Results
Humans achieve the highest CCI. Several models attain positive CCI with 90% confidence, demonstrating preliminary causal perception; top performers are in the Wan and CogVideo families. Crucially, models ranking high on RSI (e.g., LTX-Video, HunyuanVideo) score poorly on CCI, confirming that the framework disentangles causal cognition from mere arrow-of-time perception. Some models show negative CCI, indicating that they have not internalized causality.
Figure 7: Left: RSI scores on the two subsets, causal and non-causal (one in dark blue, one in light blue). Right: The resulting CCI values.
Aggregate Ranking
An aggregate causality score combines RSI and CCI ranks (summing ranks, lower is better). Ties are broken by RSI rank. This provides a holistic view of each model's causal cognition capability.
Figure 8: Aggregate ranking (lower is better). Wan2.2-T2V-A14B and Wan2.1-T2V-14B rank highest, close to human performance.
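The rank-sum aggregation described above can be sketched as below. The model names and scores are hypothetical, and how the paper handles ties within a single metric is not stated, so this sketch simply uses sort order for per-metric ranks.

```python
from typing import Dict, List, Tuple


def ranks_descending(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 goes to the highest score."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def aggregate_ranking(rsi: Dict[str, float], cci: Dict[str, float]) -> List[Tuple[str, int]]:
    """Sum the RSI rank and the CCI rank (lower is better); break ties by RSI rank."""
    rsi_rank = ranks_descending(rsi)
    cci_rank = ranks_descending(cci)
    totals = {name: rsi_rank[name] + cci_rank[name] for name in rsi}
    ordered = sorted(totals, key=lambda name: (totals[name], rsi_rank[name]))
    return [(name, totals[name]) for name in ordered]


# Hypothetical models and scores.
rsi_scores = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.55}
cci_scores = {"model_a": 0.05, "model_b": 0.08, "model_c": 0.01}
ranking = aggregate_ranking(rsi_scores, cci_scores)
```

In this example `model_a` and `model_b` both have a rank sum of 3, and `model_a` is placed first because it has the better RSI rank.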
Cross-Metric Analysis
Kendall's rank correlation between the aggregate rank and external metrics/model properties reveals:
Table 2: Cross-metric analysis.
| Metric / Feature | Kendall's τ | p-value |
| --- | --- | --- |
| Human Preference | 0.3333 | 0.4694 |
| LikePhys [128] (Intuitive Physics) | 0.5111 | 0.0466 |
| Aesthetic Quality | 0.0000 | 1.0000 |
| Subject Consistency | 0.3333 | 0.1289 |
| Background Consistency | 0.0256 | 0.9524 |
| Motion Smoothness | 0.2821 | 0.2044 |
| Temporal Flickering | 0.2564 | 0.2519 |
| Release Date | 0.5958 | 0.0316 |
| # of Parameters | 0.6880 | 0.0093 |
- The moderate positive correlation with human preference (τ = 0.3333) is consistent with the benchmark assessing causal understanding, although it is not statistically significant (p = 0.4694).
- Positive correlation with LikePhys implies causal cognition relates to but is not reducible to physical intuition.
- Zero correlation with aesthetic quality confirms the benchmark is not confounded by visual appeal.
- Strong correlation with release date and parameter count supports the view that scaling laws and architectural evolution extend to causal cognition.
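For readers unfamiliar with the statistic used in this analysis, the sketch below computes a basic Kendall rank correlation from concordant and discordant pairs. It implements the tau-a variant, which does not correct for ties; which variant the paper uses and how its p-values are obtained are not stated in this summary, so this is only an illustration on made-up ranks.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant - discordant) / total pairs, with no tie correction."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up example: aggregate ranks of five models vs. their ranks on another metric.
aggregate_rank = [1, 2, 3, 4, 5]
other_metric_rank = [1, 3, 2, 5, 4]
tau = kendall_tau_a(aggregate_rank, other_metric_rank)
```

Of the ten model pairs in this example, eight are ordered the same way by both rankings and two are swapped, giving a value of 0.6. A value of 0 means the two rankings are unrelated, as reported for aesthetic quality.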
Entropy-Controlled Subset Analysis
To address concerns that models exploit low-level entropy dynamics, a subset of videos with symmetric optical-flow magnitude trajectories (low asymmetry score) is analyzed. RSI scores on this entropy-symmetric subset closely track those on the full dataset, indicating RSI captures event-level temporal structure rather than low-level entropy cues.
Figure 9: RSI results on the full dataset vs. motion-symmetric subset show close agreement.
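The asymmetry score itself is not defined in this summary. As a purely hypothetical stand-in, the sketch below compares a per-frame optical-flow magnitude trajectory with its own time reversal and normalizes by total magnitude, then keeps videos under an arbitrary threshold. The formula, the threshold, and the video names are all assumptions made for illustration and should not be read as the paper's definition.

```python
from typing import Dict, List


def asymmetry_score(flow_magnitudes: List[float]) -> float:
    """Hypothetical score: 0 when the trajectory equals its own time reversal."""
    if not flow_magnitudes:
        raise ValueError("trajectory must contain at least one value")
    reversed_trajectory = flow_magnitudes[::-1]
    difference = sum(abs(a - b) for a, b in zip(flow_magnitudes, reversed_trajectory))
    scale = sum(abs(a) for a in flow_magnitudes)
    return 0.0 if scale == 0 else difference / scale


def symmetric_subset(trajectories: Dict[str, List[float]], threshold: float = 0.1) -> List[str]:
    """Names of videos whose score falls below the (arbitrary) threshold."""
    return [name for name, traj in trajectories.items() if asymmetry_score(traj) < threshold]


# Made-up trajectories: one symmetric in time, one steadily accelerating.
videos = {
    "pendulum_swing": [1.0, 2.0, 3.0, 2.0, 1.0],
    "falling_object": [1.0, 2.0, 3.0, 4.0, 5.0],
}
kept = symmetric_subset(videos)
```

The point of such a filter is that motion magnitude alone gives no hint of playback direction for the retained videos, so any remaining RSI signal has to come from something other than that cue.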
Theoretical and Practical Implications
- Unique Evaluation Dimension: YoCausal provides the first causality benchmark for VDMs free from sim-to-real gaps, capturing a capability distinct from intuitive physics and visual quality.
- Disentanglement of Cognition: The two-level framework (RSI & CCI) successfully disentangles arrow-of-time perception from genuine causal reasoning, revealing that high temporal perception does not guarantee causal understanding.
- Guidance for Model Development: The findings indicate that scaling parameters and advancing architectures (e.g., UNet to DiT) improve causal cognition, suggesting scaling laws apply to higher-order reasoning. This motivates treating causal cognition as a distinct objective for future model improvement.
- Cognitive Science Transfer: The successful adaptation of the VoE paradigm from infant studies to generative models demonstrates a valuable interdisciplinary approach for probing AI cognition.
Conclusion
YoCausal establishes a scalable, real-world benchmark for evaluating causal cognition in VDMs. Key takeaways are:
- Perceiving the arrow of time is not equivalent to understanding causality. Models can exhibit temporal perception without causal reasoning.
- A significant human-model gap persists. Even the best models lag behind human-level causal cognition.
- Causal cognition is a distinct capability. It correlates with intuitive physics but not with aesthetic quality, and benefits from model scaling.
- The benchmark is extensible and validated. The use of reversed real-world videos and the two-level metric framework provide a robust and scalable evaluation protocol.
Limitations: The method struggles with temporally symmetric events (e.g., Newton's cradle). Computing denoising losses requires access to model weights, limiting evaluation of closed-source models. Future work will address these limitations.
YoCausal motivates the community to treat causal cognition as a critical objective in the pursuit of world models.