WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Summary (Overview)

Unified Benchmark: Introduces WBench, a comprehensive benchmark for evaluating interactive video world models across five key dimensions: Video Quality, Setting Adherence, Interaction Adherence, Consistency, and Physics Compliance.
Multi-turn Dataset: Contains 289 test cases with 1,058 interaction turns, covering diverse scenes, styles, subjects, perspectives (first- and third-person), and four interaction types (Navigation, Subject Action, Event Editing, Perspective Switching).
Unified Navigation Control: Supports fair cross-paradigm comparison by representing navigation actions in three aligned forms: text, camera pose (6-DoF), and discrete keyboard actions.
Diagnostic Evaluation: Uses 22 fine-grained automatic sub-metrics combining specialist vision models and large multimodal models (LMMs), validated against human judgments. Evaluation of 20 state-of-the-art models reveals no single model excels across all dimensions.
Key Findings: Navigation capability is largely independent of other dimensions; camera control does not guarantee perspective consistency; physical correctness correlates more with rendering quality than control; and navigation performance degrades most severely over multiple turns.

Introduction and Theoretical Foundation

Recent advances in video generation have enabled interactive world models that simulate environment evolution in response to user actions. These models function as conditional generators that predict the next observation $o_{t+1}$ given the observation history $o_{\le t}$ and the actions $a_{\le t}$:

$$o_{t+1} \sim f_\theta(o_{t+1} \mid o_{\le t}, a_{\le t})$$
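
The conditional formulation above can be read as a closed-loop rollout: each new observation is produced from the full history of observations and actions so far. The following is a minimal sketch of that loop, not the benchmark's implementation. The names `rollout`, `step`, and `toy_step` are hypothetical, and the toy model uses integers in place of video clips.

```python
from typing import Callable, List, Sequence, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")


def rollout(
    step: Callable[[List[Obs], List[Act]], Obs],
    o0: Obs,
    actions: Sequence[Act],
) -> List[Obs]:
    """Generate o_1..o_T by repeatedly calling step(o_{<=t}, a_{<=t})."""
    observations: List[Obs] = [o0]
    for t in range(len(actions)):
        # The model sees every observation and action up to turn t.
        next_obs = step(list(observations), list(actions[: t + 1]))
        observations.append(next_obs)
    return observations


def toy_step(history: List[int], actions: List[int]) -> int:
    # Hypothetical stand-in for f_theta: add the latest action to the latest state.
    return history[-1] + actions[-1]


trajectory = rollout(toy_step, 0, [1, 2, 3])
```

In a multi-turn evaluation, `step` would wrap one call to the video model per interaction turn, so errors in earlier turns carry over into later ones.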

A capable interactive world model must fulfill five complementary roles analogous to a game engine: Renderer (visual quality), Director (world initialization), Controller (interaction execution), Memory (state preservation), and Engine (physical compliance).

Existing benchmarks are fragmented, focusing on isolated aspects. Non-interactive suites (e.g., VBench) assess video quality without control. World model benchmarks (e.g., WorldMark, MIND, Omni-WorldBench) cover navigation and memory but lack semantic interactions or are restricted to specific domains (e.g., autonomous driving in WorldLens). No existing benchmark jointly covers diverse open-domain scenes, both perspectives, a comprehensive interaction taxonomy, and multi-turn closed-loop evaluation.

WBench fills this gap by providing a unified framework that decomposes evaluation into explicit World Settings $W$ (defining the initial state $o_0$) and Interaction sequences $I = (a_0, a_1, ..., a_{T-1})$ (specifying user controls over $T$ turns). This separation makes failure modes easier to diagnose.

Methodology

3.1 Dataset Construction

Each test case is defined by a World Setting $W$ and a multi-turn Interaction sequence $I$.

World Setting Attributes:
1. Scene: Environment type, layout, and dynamics (e.g., terrain, buildings).
2. Style: Rendering appearance (e.g., realistic, cartoon, oil painting).
3. Perspective: First-person (FPP) or third-person (TPP).
4. Subject: Primary entity (e.g., human, animal, vehicle). Applies to all TPP and relevant FPP cases.
Interaction Types:
1. Navigation: Camera/ego-agent motion via unified controls (W/S/A/D for translation, ←/→/↑/↓ for rotation).
2. Subject Action: Actions performed by the primary subject (manipulation, locomotion, tool use, combat, gesture).
3. Event Editing: Externally imposed environment changes (weather, time-of-day, object appearances).
4. Perspective Switching: Transitions between FPP and TPP views.
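
To illustrate how a single navigation action could be held in the three aligned forms (discrete key, text, and camera pose), here is a small sketch. It is an assumption-laden example rather than the benchmark's actual encoding. The step size, rotation angle, axis convention, text phrasing, and key-to-direction assignment (for example, W as forward) are all hypothetical, as are the names `NavAction`, `NAV_ACTIONS`, and `encode`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

STEP = 1.0    # assumed translation per key press (arbitrary units)
ANGLE = 15.0  # assumed rotation per key press (degrees)


@dataclass(frozen=True)
class NavAction:
    key: str                                  # discrete keyboard form
    text: str                                 # text form
    translation: Tuple[float, float, float]   # (right, up, forward), assumed axes
    rotation_deg: Tuple[float, float, float]  # (pitch, yaw, roll), assumed order


NAV_ACTIONS: Dict[str, NavAction] = {
    "W": NavAction("W", "move forward", (0.0, 0.0, STEP), (0.0, 0.0, 0.0)),
    "S": NavAction("S", "move backward", (0.0, 0.0, -STEP), (0.0, 0.0, 0.0)),
    "A": NavAction("A", "move left", (-STEP, 0.0, 0.0), (0.0, 0.0, 0.0)),
    "D": NavAction("D", "move right", (STEP, 0.0, 0.0), (0.0, 0.0, 0.0)),
    "LEFT": NavAction("LEFT", "turn left", (0.0, 0.0, 0.0), (0.0, -ANGLE, 0.0)),
    "RIGHT": NavAction("RIGHT", "turn right", (0.0, 0.0, 0.0), (0.0, ANGLE, 0.0)),
    "UP": NavAction("UP", "look up", (0.0, 0.0, 0.0), (ANGLE, 0.0, 0.0)),
    "DOWN": NavAction("DOWN", "look down", (0.0, 0.0, 0.0), (-ANGLE, 0.0, 0.0)),
}


def encode(keys: List[str]) -> List[NavAction]:
    """Look up the aligned text and pose forms for a sequence of key presses."""
    return [NAV_ACTIONS[k] for k in keys]


plan = encode(["W", "W", "LEFT"])
```

Keeping all three forms in one record means a text-conditioned model, a pose-conditioned model, and a keyboard-conditioned model can each be given the same intended motion.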
Construction: Dataset construction follows a setting-first principle. Annotators first design a coherent world setting and then derive interaction sequences that are physically executable and semantically coherent within it. Every case undergoes manual review.
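
One way to picture a test case is as a small record that pairs a World Setting with an ordered list of interaction turns. The sketch below is illustrative only. The class and field names (`WorldSetting`, `Turn`, `BenchCase`, `kind`, `instruction`) and the example values are hypothetical and are not taken from the released dataset schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Perspective(Enum):
    FPP = "first-person"
    TPP = "third-person"


class InteractionType(Enum):
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


@dataclass
class WorldSetting:
    scene: str
    style: str
    perspective: Perspective
    subject: Optional[str] = None  # optional for first-person cases

    def __post_init__(self) -> None:
        # The text states that a subject applies to all third-person cases.
        if self.perspective is Perspective.TPP and self.subject is None:
            raise ValueError("third-person settings need a subject")


@dataclass
class Turn:
    kind: InteractionType
    instruction: str


@dataclass
class BenchCase:
    setting: WorldSetting
    turns: List[Turn]


example = BenchCase(
    setting=WorldSetting(
        scene="forest trail", style="realistic", perspective=Perspective.FPP
    ),
    turns=[
        Turn(InteractionType.NAVIGATION, "W"),
        Turn(InteractionType.EVENT_EDITING, "it starts to rain"),
    ],
)
```

Separating the setting from the turns mirrors the setting-first principle: the setting is fixed once, and each turn can then be scored on its own.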

3.2 Dataset Statistics

WBench comprises 289 cases spanning 1,058 interaction turns.

Perspective: 62% FPP, 38% TPP.
Interaction Distribution: Navigation (57%), Subject Action (20%), Event Editing (17%), Perspective Switching (6%).
Scene Diversity: Nature (31%), Urban (21%), Indoor (17%), Workspace (13%), Fantasy (10%), Sports (8%).
Subject Diversity: Human (64%), Animal (9%), Robot (9%), Vehicle (7%), Other (10%).
Style: Photorealistic (52%), Styled (48%; e.g., anime, cartoon, oil painting).
Turn Depth: Average 3.7 turns per case (range 2-9).
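
As a quick consistency check on the figures above, the reported average turn depth can be recomputed from the case and turn totals. The per-type turn counts in this sketch are only rough estimates: they assume the interaction percentages are measured over turns, which the text does not state. The variable names are hypothetical.

```python
cases = 289
turns = 1058

# Average turns per case, to compare with the reported 3.7.
avg_turns = turns / cases

# Reported interaction distribution, in percent.
interaction_share = {
    "Navigation": 57,
    "Subject Action": 20,
    "Event Editing": 17,
    "Perspective Switching": 6,
}

# Assumption: shares are over turns, so counts are turns * share / 100.
approx_turn_counts = {
    name: round(turns * pct / 100) for name, pct in interaction_share.items()
}
```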

4. WBench Evaluation Suite

Evaluation is decomposed into five dimensions with 22 fine-grained sub-metrics. All scores are linearly rescaled to $[0, 100]$.
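
A linear rescaling to $[0, 100]$ can be sketched as a min-max mapping. The text does not give per-metric bounds, so the `lo` and `hi` arguments, the clipping of out-of-range values, and the `higher_is_better` flag for error-style metrics are all assumptions. The function name `rescale_to_100` is hypothetical.

```python
def rescale_to_100(
    value: float, lo: float, hi: float, higher_is_better: bool = True
) -> float:
    """Map value from [lo, hi] linearly onto [0, 100], clipping out-of-range input."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)
    score = 100.0 * (clipped - lo) / (hi - lo)
    # For error-style metrics (lower is better), flip the scale.
    return score if higher_is_better else 100.0 - score


quality_score = rescale_to_100(0.5, lo=0.0, hi=1.0)
error_score = rescale_to_100(0.25, lo=0.0, hi=1.0, higher_is_better=False)
```

Putting every sub-metric on the same scale lets scores be averaged within a dimension without one metric's native range dominating the others.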

Video Quality (6 metrics): Aesthetic Quality, Imaging Quality, Temporal Flickering, Dynamic Degree, Motion Smoothness (from VBench), and HPSv3-Norm (human-preference score).
Setting Adherence (2 metrics):
- S.1 Scene Adherence: VLM evaluates consistency of initially visible elements and appearance of described offscreen elements.
- S.2 Subject Adherence: VLM evaluates match of subject's visual attributes and movement style to description.
Interaction Adherence (4 metrics):
- I.1 Navigation Score: Compares MegaSaM-estimated camera poses against a synthetic ground-truth trajectory. Computes normalized Absolute Trajectory Error (nATE) and cross-turn consistency.
- I.2 Event Editing & I.3 Subject Action Adherence: Turn-level VLM protocol with five binary checks per turn (change detection, event occurrence, completion, detail accuracy, anomaly absence). Score is average of $[0,