Summary (Overview)

  • TailOR introduces a benchmark to evaluate whether visual generative world models (image and video generation) truly internalize physical principles or rely on statistical regularities from common training data.
  • The benchmark defines three progressively challenging scenario modes – Regular, Unconventional, and Impossible – based on tool–task pairings and object attribute compatibility.
  • Two complementary evaluation settings – Predictive generation (infer outcome) and Descriptive generation (realize specified outcome) – disentangle implicit physical reasoning from instruction following.
  • Experimental results across eight state-of-the-art models (image and video) reveal a consistent long-tail performance gap: scores degrade from Regular to Unconventional to Impossible scenarios, with the largest drops in Interaction Accuracy and Physical Realism.
  • Failure analysis shows models rely on memorized interaction templates rather than compositional physical reasoning; video models additionally suffer from temporal inconsistencies and cascading errors.

Introduction and Theoretical Foundation

Physical interactions in the real world follow a long-tailed distribution: a small set of common interactions (e.g., cutting bread with a knife, hammering a nail with a hammer) dominates human experience and training data (“head scenarios”), while a vast space of rare but physically valid interactions (“long-tail scenarios”) remains underrepresented. Current world models (image/video generation) achieve impressive realism on head-distribution benchmarks (e.g., Sora-2 reaches 86% physical accuracy on MMGR), raising the central question:

Do world models truly internalize the underlying physical principles of object interactions, or do they primarily rely on statistical regularities observed in training data?

TailOR is designed to answer this question by focusing on tool-use tasks. A regular tool–task pair (head) is compared against two types of long-tail pairs:

  • Unconventional: a substitute tool that satisfies the required physical attributes (e.g., using a coin to loosen a screw)
  • Impossible: a tool that violates critical attributes (e.g., using dry spaghetti to loosen a screw)

Success on long-tail scenarios requires reasoning about object affordances, rigidity, geometry, and force transmission – capabilities that go beyond pattern matching.

Methodology

Problem Formulation

Each evaluation instance is defined as a tuple $x = (g, r, \mathcal{A}, \mathcal{U}, \mathcal{I})$, where:

  • $g$ = task goal (e.g., “crack a walnut”),
  • $r$ = canonical tool,
  • $\mathcal{A}$ = required functional attributes,
  • $\mathcal{U}$ = set of attribute-compatible unconventional substitutes,
  • $\mathcal{I}$ = set of attribute-violating impossible tools.

Scenario Modes (progressive difficulty):

  • Regular: uses the canonical tool $r$ (head distribution).
  • Unconventional: uses $u \in \mathcal{U}$ (attribute-compatible but atypical).
  • Impossible: uses $i \in \mathcal{I}$ (attribute-violating, should fail).
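
As a rough illustration of the formulation above, one instance and its three scenario modes could be represented as follows. This is a minimal sketch, not the benchmark's actual data format: the class and field names are hypothetical, and apart from the coin and dry-spaghetti examples taken from the text, the tools and attribute strings (including "screwdriver" as the canonical tool) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterator, Tuple


class Mode(Enum):
    REGULAR = "regular"
    UNCONVENTIONAL = "unconventional"
    IMPOSSIBLE = "impossible"


@dataclass(frozen=True)
class Instance:
    """Hypothetical container for one tuple x = (g, r, A, U, I)."""
    goal: str                            # g: task goal
    canonical_tool: str                  # r: canonical tool
    required_attributes: FrozenSet[str]  # A: required functional attributes
    unconventional: Tuple[str, ...]      # U: attribute-compatible substitutes
    impossible: Tuple[str, ...]          # I: attribute-violating tools

    def pairs(self) -> Iterator[Tuple[Mode, str]]:
        """Yield one (mode, tool) pair per tool-task pairing."""
        yield Mode.REGULAR, self.canonical_tool
        for tool in self.unconventional:
            yield Mode.UNCONVENTIONAL, tool
        for tool in self.impossible:
            yield Mode.IMPOSSIBLE, tool


# Illustrative values only; "coin" and "dry spaghetti" come from the text,
# everything else here is assumed for the sake of the example.
example = Instance(
    goal="loosen a screw",
    canonical_tool="screwdriver",
    required_attributes=frozenset({"rigid", "thin flat edge", "transmits torque"}),
    unconventional=("coin", "butter knife"),
    impossible=("dry spaghetti", "rubber band"),
)

pairs = list(example.pairs())
for mode, tool in pairs:
    print(f"{mode.value:>14}: {example.goal} with {tool}")
```

Keeping the attribute set on the instance makes the distinction explicit: an Unconventional tool is expected to satisfy every required attribute, while an Impossible tool violates at least one critical attribute.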

Evaluation Settings:

  • Predictive Generation: outcome is withheld; model must infer the result from implicit physical knowledge.
  • Descriptive Generation: outcome is explicitly specified; model must faithfully realize the instructed target state.
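
The difference between the two settings comes down to whether the outcome appears in the prompt. The sketch below shows that contrast with made-up prompt templates; the wording, the function name `build_prompt`, and its parameters are assumptions for illustration and are not the benchmark's actual prompts.

```python
from typing import Optional


def build_prompt(goal: str, tool: str, setting: str,
                 outcome: Optional[str] = None) -> str:
    """Build a hypothetical generation prompt for one task-tool pair.

    predictive:  the outcome is withheld, so the model must infer it.
    descriptive: the outcome is stated, so the model must realize it.
    """
    base = f"A person tries to {goal} using {tool}."
    if setting == "predictive":
        return f"{base} Show what happens."
    if setting == "descriptive":
        if outcome is None:
            raise ValueError("descriptive prompts need an explicit outcome")
        return f"{base} {outcome}"
    raise ValueError(f"unknown setting: {setting!r}")


predictive = build_prompt("loosen a screw", "a coin", "predictive")
descriptive = build_prompt(
    "loosen a screw",
    "a strand of dry spaghetti",
    "descriptive",
    outcome="The spaghetti snaps and the screw does not move.",
)
print(predictive)
print(descriptive)
```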

Data Curation Pipeline (7 Steps)

  1. Action-driven task generation: 18 actions are selected from the HICO-DET verb classes; LLMs then generate tasks with goals and required attributes.
  2. Unconventional tool generation: LLMs propose attribute-compatible substitutes.
  3. Impossible tool generation: tools with opposite affordances are proposed using LLMs together with an object–affordance graph (ConceptNet).
  4. Human verification & filtering: annotators select or revise tools, so that each task ends up with 1 regular, 2 unconventional, and 2 impossible tools.
  5. Prompt generation: predictive and descriptive prompts for each task–tool pair.
  6. Evaluation rubric generation: checklist-based questions for objective scoring.
  7. Human finalization: final quality control.

Dataset Statistics

Scenario category   Tasks   Generation prompts
Regular                80                  320
Unconventional        160                  640
Impossible            160                  640
Total                 400                 1600
Table 1: Dataset statistics of TailOR, organized by scenario type. Each evaluation task yields four generation prompts (predictive/descriptive × image/video).
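
The counts in Table 1 follow from simple arithmetic. The sketch below reproduces them under the assumption that there are 80 base tasks, each paired with 1 regular, 2 unconventional, and 2 impossible tools (as in step 4 of the curation pipeline); the variable names are illustrative.

```python
BASE_TASKS = 80  # assumption: one regular task per base task, per Table 1
TOOLS_PER_TASK = {"regular": 1, "unconventional": 2, "impossible": 2}
SETTINGS = ("predictive", "descriptive")
MODALITIES = ("image", "video")

# Each evaluation task yields one prompt per (setting, modality) combination.
prompts_per_task = len(SETTINGS) * len(MODALITIES)

tasks = {mode: BASE_TASKS * n for mode, n in TOOLS_PER_TASK.items()}
prompts = {mode: n * prompts_per_task for mode, n in tasks.items()}

total_tasks = sum(tasks.values())
total_prompts = sum(prompts.values())

for mode in TOOLS_PER_TASK:
    print(f"{mode:>14}: {tasks[mode]:4d} tasks, {prompts[mode]:4d} prompts")
print(f"{'total':>14}: {total_tasks:4d} tasks, {total_prompts:4d} prompts")
```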

Evaluation Metrics

Four complementary dimensions; IA and IntAcc are reported on a 0–100% scale, Phys and Perc on a 0–5 scale:

  • Instruction Adherence (IA): Entity Completeness, Attribute Fidelity, Scene Validity.
  • Interaction Accuracy (IntAcc): State Change Correctness, Affordance Grounding, Motion Plausibility (video only).
  • Physical Realism (Phys): open-ended 0–5 rating of adherence to physics laws.
  • Perceptual Quality (Perc): 0–5 rating of visual fidelity and temporal consistency.
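
One plausible way to turn checklist answers into a 0–100% score is to take the share of questions answered "yes". The summary does not spell out the aggregation rule, so the sketch below is an assumption; the function name, the sub-dimension keys, and the sample answers are all hypothetical.

```python
from typing import Dict, List


def checklist_score(answers: List[bool]) -> float:
    """Percentage of checklist questions answered 'yes' (True)."""
    if not answers:
        raise ValueError("at least one checklist answer is required")
    return 100.0 * sum(answers) / len(answers)


# Hypothetical judge answers for the three Instruction Adherence sub-dimensions.
ia_answers: Dict[str, List[bool]] = {
    "entity_completeness": [True, True],
    "attribute_fidelity": [True, False],
    "scene_validity": [True],
}

# Pool every question across sub-dimensions, then score the pooled list.
flat = [answer for group in ia_answers.values() for answer in group]
ia_score = checklist_score(flat)
print(f"IA = {ia_score:.0f}%")  # 4 of 5 questions passed
```

Pooling questions weights each question equally; averaging per sub-dimension first would weight each sub-dimension equally instead, and the two can differ when sub-dimensions have different numbers of questions.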

Automatic evaluation uses gemini-2.5-pro as a judge, with strong alignment to human annotations (9 trained annotators, 3 per sample).
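
Agreement between the automatic judge and human annotators can be checked with a rank correlation over model scores. The summary does not say which statistic the authors use, so the sketch below picks Spearman's rho as an assumption and applies it to Regular-scenario scores reported in Table 2; the helper names are illustrative.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values from 1 (lowest) to n (highest); ties are not handled."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch does not handle tied values")
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Image models, Regular IA: Z-Image, Qwen-Image, GPT-Image-1, Nano-Banana-2.
image_auto, image_human = [52, 63, 67, 69], [55, 74, 79, 81]
# Video models, Regular IntAcc: HunyuanVideo-1.5, Wan-2.2, Sora-2, Veo-3.1.
video_auto, video_human = [45, 49, 63, 61], [47, 51, 72, 80]

rho_image = spearman(image_auto, image_human)
rho_video = spearman(video_auto, video_human)
print(f"image IA rho = {rho_image:.2f}, video IntAcc rho = {rho_video:.2f}")
```

With only four models per modality, a single swapped pair moves rho noticeably, so such values are better read as a sanity check than as a precise agreement estimate.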

Evaluated Models

  • Image generation: Z-Image, Qwen-Image, GPT-Image-1, Nano-Banana-2
  • Video generation: HunyuanVideo-1.5, Wan-2.2, Sora-2, Veo-3.1

Empirical Validation / Results

Main Quantitative Results (Table 2)

The table below shows automatic/human scores for the Predictive setting (results for the Descriptive setting follow the same trend; see the full Table 2 in the paper).

               Regular (A/H)                       Unconventional (A/H)                Impossible (A/H)
Model          IA       IntAcc   Phys     Perc     IA       IntAcc   Phys     Perc     IA       IntAcc   Phys     Perc
Image (Predictive)
Z-Image        52/55    44/50    2.5/2.7  2.4/2.9  33/39    29/27    1.7/2.8  2.2/2.5  26/24    21/28    1.5/1.4  1.9/2.1
Qwen-Img       63/74    60/70    3.3/3.5  3.8/4.0  41/52    37/48    2.2/2.9  2.4/3.2  36/47    33/43    1.9/2.7  2.4/3.0
GPT-Img-1      67/79    63/75    3.4/4.2  3.7/4.4  44/58    41/52    2.7/3.3  2.8/3.6  40/52    36/48    2.5/3.3  2.7/3.4
Nano-Banana-2  69/81    65/78    3.6/4.4  3.8/4.4  46/60    43/55    2.8/3.6  3.0/3.7  42/52    38/51    2.6/3.4  2.8/3.6
Video (Predictive)
HunyuanVideo   48/52    45/47    2.4/2.3  2.8/3.2  31/29    23/34    1.6/1.8  2.1/2.6  18/25    22/20    1.5/1.6  1.6/2.4
Wan-2.2        44/57    49/51    2.0/2.6  3.0/2.9  29/37    25/28    1.8/2.0  2.5/2.4  23/21    17/30    1.3/1.9  2.2/2.0
Sora-2         66/83    63/72    3.1/3.5  3.4/3.9  44/53    40/49    2.3/2.8  2.7/3.2  39/48    36/45    2.2/2.6  2.5/3.1
Veo-3.1        64/75    61/80    2.9/3.4  3.3/3.8  42/61    48/47    2.2/2.6  2.2/3.1  37/45    34/42    2.1/2.4  2.2/2.9
Table 2 (abbreviated to the Predictive setting; Descriptive results are in the paper). Each cell reads Automatic (gemini-2.5-pro) / Human. The original highlights the highest value per column in color; that highlighting is not reproduced here.
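
Because IA and IntAcc use a 0–100% scale while Phys and Perc use 0–5, drops are easier to compare as relative changes. The sketch below computes the relative drop from Regular to Impossible for one row of Table 2 (Nano-Banana-2, automatic scores); the helper name `relative_drop` is illustrative, and comparing relative drops is one possible reading of the table, not necessarily the authors' analysis.

```python
from typing import Dict, Tuple

# Nano-Banana-2, automatic scores from Table 2:
# (Regular, Unconventional, Impossible) per metric.
nano_banana_auto: Dict[str, Tuple[float, float, float]] = {
    "IA": (69, 46, 42),
    "IntAcc": (65, 43, 38),
    "Phys": (3.6, 2.8, 2.6),
    "Perc": (3.8, 3.0, 2.8),
}


def relative_drop(regular: float, other: float) -> float:
    """Percentage decrease of `other` relative to the Regular score."""
    if regular <= 0:
        raise ValueError("the Regular score must be positive")
    return 100.0 * (regular - other) / regular


drops = {
    metric: relative_drop(regular, impossible)
    for metric, (regular, _unconventional, impossible) in nano_banana_auto.items()
}
for metric, drop in drops.items():
    print(f"{metric:>6}: {drop:.1f}% lower in Impossible than in Regular")
```

For this particular row, Interaction Accuracy shows the largest relative drop and Perceptual Quality the smallest; other rows need not follow the same ordering.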

Key Observations

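  1. Consistent long-tail degradation: All models show declining performance from Regular → Unconventional → Impossible on nearly every metric; a few individual cells in Table 2 are flat or rise slightly (e.g., Z-Image's human-rated IntAcc goes from 27 to 28 between Unconventional and Impossible). The drop is largest in Interaction Accuracy and Physical Realism, indicating that failures stem from weak affordance-level understanding rather than perceptual quality.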

  2. Video models are more brittle: Video generation models consistently underperform image models, especially on Interaction Accuracy and Motion Plausibility. Video models exhibit strong bias toward familiar training patterns, and small errors in early frames propagate into cascading temporal failures.

  3. Strong automatic–human alignment: Model rankings under automatic evaluation (gemini-2.5-pro) closely match human rankings. Nano-Banana-2 is the top image model; Sora-2 is the top video model across scenarios.

Failure Mode Analysis

Image generation failures:

  • Regular: incorrect outcomes, inaccurate attributes, physical violations (e.g., hammer scaled incorrectly).
  • Unconventional: incorrect outcomes, affordance misgeneralization (model recognizes tool property but applies it wrongly), physical violations.
  • Impossible: instruction adherence failure (model ignores constraint and succeeds with impossible tool), incorrect outcomes, physical violations.

Video generation failures:

  • Regular: implausible dynamics (e.g., hammer contacts without visible swing), interaction misexecution, temporal inconsistency (jittery objects).
  • Unconventional: affordance misgeneralization (tool switches role mid-scene), implausible dynamics, interaction misexecution.
  • Impossible: temporal inconsistency (abrupt state changes without force buildup), physical violations, interaction misexecution.

Sensitivity Analysis: Predictive vs. Descriptive

  • Descriptive generation improves image models on entity completeness and scene validity, but video models benefit less – they often revert to familiar patterns even when the outcome is explicitly specified.
  • The gap between predictive and descriptive performance is largest in Unconventional scenarios, confirming that models struggle to reconcile explicit instructions with learned priors when interactions deviate from common distributions.

Theoretical and Practical Implications

The findings strongly suggest that current world models memorize holistic interaction templates rather than learning transferable physical primitives (rigidity, sharpness, force transmission, leverage). This has key implications:

  • For model development: Training pipelines that maximize perceptual realism and distributional fidelity are insufficient for generalization to long-tail scenarios. Models need stronger inductive biases for object attributes, affordances, and causal dynamics.
  • For evaluation: Benchmarks must include long-tail scenarios that probe attribute-level reasoning, not just perceptual quality. TailOR provides a structured protocol for such diagnostics.
  • For video generation: Additional mechanisms for long-horizon state tracking, force-consistent motion modeling, and constraint-aware temporal planning are necessary to avoid cascading errors.

The work highlights a fundamental limitation: even when provided with explicit outcome descriptions, models fall back on high-frequency visual patterns, indicating that they lack causal grounding and compositional physical reasoning.

Conclusion

TailOR systematically evaluates visual world models on long-tail physical interactions, revealing a pronounced performance gap that cannot be explained by perceptual quality alone. The benchmark’s three scenario modes (Regular, Unconventional, Impossible) and two generation settings (Predictive, Descriptive) provide a principled framework for diagnosing whether models truly understand physical principles or simply memorize statistical regularities. Key findings include:

  • All models degrade from Regular to Unconventional to Impossible scenarios.
  • Failures are dominated by weak affordance understanding and, for video, temporal inconsistency.
  • Even descriptive prompts cannot overcome the bias toward familiar interaction patterns.

Future directions include incorporating stronger inductive biases for object attributes and causal dynamics during training, improving video generation with force-consistent motion modeling, and extending benchmarks to multi-step manipulation and multi-object causal chains. TailOR serves as a testbed to drive progress toward world models that can reason about the long tail of physical interactions rather than merely reproducing common visual statistics.
