WorldMark: A Unified Benchmark Suite for Interactive Video World Models
Summary (Overview)
- Standardized Benchmark: Introduces WorldMark, the first benchmark providing standardized test conditions (identical scenes, identical action sequences, and a unified control interface) for fair, apples-to-apples comparison of interactive Image-to-Video (I2V) world models.
- Key Infrastructure: Features a unified action-mapping layer that translates a shared WASD-style action vocabulary into the native control formats of six major models (YUME 1.5, HY-World 1.5, Genie 3, Matrix-Game 2.0, Open-Oasis, HY-GameCraft), enabling semantically identical inputs.
- Comprehensive Test Suite: Provides a hierarchical test suite of 500 evaluation cases covering first-/third-person viewpoints, photorealistic/stylized scenes, and three difficulty tiers (Easy: 20s, Medium: 40s, Hard: 60s).
- Modular Evaluation Toolkit: Offers a default, modular evaluation toolkit covering Visual Quality, Control Alignment, and World Consistency using both geometric (e.g., SLAM) and Vision-Language Model (VLM)-based metrics, allowing researchers to plug in their own metrics.
- Key Findings: Reveals that visual quality and world consistency are largely uncorrelated across current models, that third-person generation remains a significant challenge, and that precise control alignment does not guarantee overall generation quality.
Introduction and Theoretical Foundation
Interactive video generation models (e.g., Genie, YUME, HY-World, Matrix-Game) are advancing rapidly toward functioning not just as video synthesizers but as interactive environments that respond faithfully to user actions while preserving long-term scene memory. However, a critical problem hinders progress: each model is evaluated on its own private benchmark with bespoke scenes, trajectories, and metrics, making fair cross-model comparison impossible.
Existing public benchmarks (VBench, WorldScore, MIND, PhyGenBench) provide useful metrics but lack the standardized test conditions necessary for interactive models. They either omit action control, rely on ground-truth camera trajectories most interactive models cannot accept, or fail to provide a common set of scenes and action sequences. This creates a disconnect: no benchmark jointly evaluates action controllability, visual quality, and long-horizon world consistency under the interactive paradigm.
WorldMark addresses this fundamental gap by establishing a common playing field. Its core thesis is that the root cause of the evaluation problem is not a lack of metrics, but the absence of standardized test conditions. By providing identical scenes, identical action sequences, and a transparent translation layer to feed each model semantically equivalent inputs, any metric can then yield comparable numbers across models.
Methodology
WorldMark is a standardized benchmarking toolkit comprising five key components:
1. Evaluation Dimension Suite: Eight metrics organized into three categories (see Table 2):
- Visual Quality: Assesses per-frame fidelity.
- Aesthetic Quality: Uses the LAION aesthetic predictor for human-perceived appeal.
- Imaging Quality: Uses MUSIQ to quantify low-level distortions (noise, blur, etc.).
- Control Alignment: Measures faithfulness to input actions via reconstructed camera poses.
- Translation Error: Scale-invariant Euclidean distance between ground-truth and estimated camera positions, $E_{\mathrm{trans}} = \frac{1}{N}\sum_{i=1}^{N} \lVert \mathbf{t}_i - s\,\hat{\mathbf{t}}_i \rVert_2$, where $\mathbf{t}_i, \hat{\mathbf{t}}_i$ are the ground-truth and estimated positions and $s$ is the least-squares scale factor.
- Rotation Error: Geodesic angular deviation between rotation matrices, $E_{\mathrm{rot}} = \frac{1}{N}\sum_{i=1}^{N} \arccos\!\left(\frac{\operatorname{tr}(\Delta R_i) - 1}{2}\right)$, where $\Delta R_i = R_i^{\top}\hat{R}_i$.
- World Consistency: Evaluates temporal coherence and 3D plausibility.
- Reprojection Error: Measures 3D geometric coherence via DROID-SLAM's Dense Bundle Adjustment, $E_{\mathrm{reproj}} = \frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}} \lVert \mathbf{u}_{ij} - \pi(\mathbf{X}_i) \rVert$, where $\mathcal{P}$ is the set of co-visible pixel pairs, $\mathbf{u}_{ij}$ is the observed 2D coordinate, $\mathbf{X}_i$ is the 3D point, and $\pi$ is the camera projection.
- VLM-based Metrics (State, Content, Style Consistency): Use a VLM (Gemini-3.1-Pro) to track object stability, detect hallucinations ("popping"), and assess stylistic uniformity.
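Under the definitions above, the two control-alignment errors reduce to a few lines of NumPy. A minimal sketch, assuming pose trajectories are given as arrays of positions and rotation matrices; the function names and array layout are illustrative, not the benchmark's code:

```python
import numpy as np

def translation_error(t_gt, t_est):
    """Scale-invariant Euclidean distance between ground-truth and
    estimated camera positions.  A single least-squares scale factor
    s = <t_gt, t_est> / <t_est, t_est> aligns the estimated trajectory
    before measuring the mean Euclidean deviation."""
    t_gt = np.asarray(t_gt, dtype=float)
    t_est = np.asarray(t_est, dtype=float)
    s = np.sum(t_gt * t_est) / np.sum(t_est * t_est)
    return float(np.mean(np.linalg.norm(t_gt - s * t_est, axis=1)))

def rotation_error_deg(R_gt, R_est):
    """Mean geodesic angle (degrees) between paired rotation matrices:
    theta = arccos((trace(R_gt^T R_est) - 1) / 2)."""
    angles = []
    for Rg, Re in zip(R_gt, R_est):
        cos_theta = (np.trace(Rg.T @ Re) - 1.0) / 2.0
        # Clip guards against arccos domain errors from float round-off.
        angles.append(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))
    return float(np.mean(angles))
```

Because the scale factor is fit jointly over the whole trajectory, a model whose motion is correct up to a global scale incurs zero translation error, which is the appropriate behavior for monocular generation where metric scale is unobservable.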
2. Image Suite: A curated set of 50 reference images from the WorldScore dataset, expanded to 100 images by synthesizing corresponding third-person views for each. It spans categories (Nature, City, Indoor), styles (Real, Stylized), and viewpoints (first-/third-person).
3. Action Suite: 15 standardized action sequences (see Figure 3) composed from a shared vocabulary of six primitives: forward (W), backward (S), strafe-left (A), strafe-right (D), yaw-left (L), yaw-right (R). Sequences range from simple translations to complex patrols. A VLM filters actions for each image to ensure physical plausibility (e.g., no strafing into a wall). The suite is organized into three difficulty tiers (Easy/20s, Medium/40s, Hard/60s), creating ~500 evaluation cases.
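The relation between a sequence of primitives and a difficulty tier's duration can be illustrated with a small scheduling sketch; the equal per-action time split and the `schedule` helper are assumptions for illustration, not the benchmark's actual timing:

```python
# Hypothetical layout of a shared-vocabulary action sequence over a
# difficulty tier's duration (Easy: 20s, Medium: 40s, Hard: 60s).
TIER_SECONDS = {"easy": 20, "medium": 40, "hard": 60}
PRIMITIVES = {"W", "S", "A", "D", "L", "R"}

def schedule(sequence, tier):
    """Assign each primitive an equal share of the tier's duration
    (an assumed uniform split), returning (action, seconds) pairs."""
    assert set(sequence) <= PRIMITIVES, "unknown action primitive"
    hold = TIER_SECONDS[tier] / len(sequence)
    return [(action, hold) for action in sequence]
```

For example, `schedule("WWLD", "easy")` spreads a four-step patrol evenly over the 20-second Easy tier.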
4. Unified Action Interface: The core innovation that enables cross-model comparison. It translates the shared WASD+L/R vocabulary into each model's native, heterogeneous control format via per-model action-mapping adapters (see Table 3).
| Model | Native Format | Mapping Strategy |
|---|---|---|
| YUME 1.5 | Caption prompts | Directional keywords in text |
| HY-World 1.5 | 6-DoF pose params | Latent timescale matching |
| HY-GameCraft | 6-DoF pose params | Pose → Plücker ray embeddings |
| Genie 3 | Gamepad controls | Directional button presses |
| Matrix-Game | Action functions | Corresponding action API calls |
| Open-Oasis | 25-dim action vectors | Set movement dimensions |
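A sketch of how such per-model adapters might look, fanning the shared token vocabulary out into three of the native formats; the concrete keywords, step sizes, and vector indices below are illustrative assumptions, not the benchmark's actual mappings:

```python
# Shared vocabulary -> per-model native formats (illustrative only).

CAPTION_WORDS = {          # caption-prompt models (YUME 1.5 style)
    "W": "move forward", "S": "move backward",
    "A": "strafe left",  "D": "strafe right",
    "L": "turn left",    "R": "turn right",
}

def to_caption(actions):
    """Render actions as directional keywords in a text prompt."""
    return ", then ".join(CAPTION_WORDS[a] for a in actions)

def to_pose_deltas(actions, step=1.0, yaw_deg=15.0):
    """6-DoF (dx, dy, dz, roll, pitch, yaw) deltas for pose-conditioned
    models; step and yaw magnitudes are assumed values."""
    table = {
        "W": (0, 0,  step, 0, 0, 0), "S": (0, 0, -step, 0, 0, 0),
        "A": (-step, 0, 0, 0, 0, 0), "D": (step, 0, 0, 0, 0, 0),
        "L": (0, 0, 0, 0, 0,  yaw_deg), "R": (0, 0, 0, 0, 0, -yaw_deg),
    }
    return [table[a] for a in actions]

def to_action_vector(action, dim=25):
    """One-hot slot in a fixed-size action vector (Open-Oasis style);
    the index assignment is a made-up example."""
    index = {"W": 0, "S": 1, "A": 2, "D": 3, "L": 4, "R": 5}
    v = [0.0] * dim
    v[index[action]] = 1.0
    return v
```

The point of the layer is that all adapters consume the same token sequence, so differences in generated videos can be attributed to the models rather than to input phrasing.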
5. Evaluation Workflow: A modular four-stage pipeline: (1) Image Selection, (2) Action Mapping, (3) Video Generation, (4) Metric Evaluation. Researchers can plug in custom metrics at the final stage.
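The four stages can be sketched as a single loop with pluggable metrics; `adapter`, `generate_fn`, and the metric callables are placeholders a researcher would supply, not the benchmark's actual API:

```python
def evaluate(images, action_sequences, adapter, generate_fn, metrics):
    """Run the four-stage pipeline over every (image, sequence) pair.
    images:           reference frames (stage 1: image selection)
    adapter:          model-specific action mapping (stage 2)
    generate_fn:      callable(image, native_actions) -> video (stage 3)
    metrics:          dict of name -> callable(video, actions) (stage 4)
    """
    results = []
    for image in images:                        # (1) image selection
        for actions in action_sequences:
            native = adapter(actions)           # (2) action mapping
            video = generate_fn(image, native)  # (3) video generation
            scores = {name: fn(video, actions)  # (4) metric evaluation
                      for name, fn in metrics.items()}
            results.append({"image": image, "actions": actions, **scores})
    return results
```

Because metrics are passed in as callables, swapping in a custom metric touches only the final stage, which is the modularity the workflow is designed around.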
Empirical Validation / Results
Six models were evaluated: YUME 1.5, Matrix-Game 2.0, HY-World 1.5, HY-GameCraft (HY-Game), Open-Oasis, and Genie 3.
First-Person View Evaluation
Key quantitative results for First-Person Real and Stylized scenarios are shown in Tables 4 and 5.
| Metric | YUME 1.5 | Matrix-Game 2.0 | HY-World 1.5 | HY-Game | Oasis | Genie 3 |
|---|---|---|---|---|---|---|
| Visual Quality | ||||||
| Aesthetic Quality ↑ | 56.94 | 49.40 | 54.79 | 46.59 | 29.31 | 45.58 |
| Imaging Quality ↑ | 74.36 | 68.11 | 69.37 | 49.31 | 28.08 | 64.14 |
| Control Alignment | ||||||
| Translation Error ↓ | 0.199 | 0.222 | 0.191 | 0.159 | 0.376 | 0.498 |
| Rotation Error ↓ | 2.107 | 1.324 | 2.079 | 6.019 | 4.892 | 4.247 |
| World Consistency | ||||||
| Reprojection Error ↓ | 0.549 | 0.688 | 0.702 | 0.447 | 1.938 | 0.441 |
| State Consistency ↑ | 5.344 | 4.151 | 5.913 | 4.073 | 2.585 | 6.416 |
| Content Consistency ↑ | 3.820 | 7.415 | 6.352 | 5.814 | 3.748 | 6.914 |
| Style Consistency ↑ | 7.119 | 3.181 | 5.142 | 3.726 | 1.797 | 8.158 |
Table 4: Quantitative comparison on the First-Person Real scenario. Best in bold, second best underlined.
Findings:
- Visual Quality: YUME 1.5 and HY-World 1.5 lead in producing aesthetically pleasing frames.
- Control Alignment: HY-Game shows the most precise translation control, while Matrix-Game 2.0 excels in rotation alignment in real scenes.
- World Consistency: The closed-source Genie 3 dominates across almost all consistency metrics, demonstrating superior long-horizon coherence.
Third-Person View Evaluation
Key quantitative results are shown in Table 6. Only Matrix-Game 2.0, HY-World 1.5, and Genie 3 support third-person.
| Metric | Matrix-Game 2.0 | HY-World 1.5 | Genie 3 | Matrix-Game 2.0 | HY-World 1.5 | Genie 3 |
|---|---|---|---|---|---|---|
| | *Real* | *Real* | *Real* | *Stylized* | *Stylized* | *Stylized* |
| Visual Quality | | | | | | |
| Aesthetic Quality ↑ | 52.78 | 57.69 | 51.04 | 51.60 | 60.57 | 53.76 |
| Imaging Quality ↑ | 67.26 | 70.76 | 60.20 | 65.24 | 66.45 | 63.98 |
| Control Alignment | | | | | | |
| Translation Error ↓ | 0.284 | 0.206 | 0.212 | 0.230 | 0.220 | 0.129 |
| Rotation Error ↓ | 27.606 | 2.137 | 14.905 | 9.211 | 5.285 | 8.823 |
| World Consistency | | | | | | |
| Reprojection Error ↓ | 0.814 | 0.640 | 0.584 | 0.744 | 0.713 | 1.148 |
| State Consistency ↑ | 5.136 | 6.628 | 7.082 | 3.625 | 5.274 | 7.565 |
| Content Consistency ↑ | 3.405 | 5.707 | 7.424 | 2.083 | 5.147 | 7.109 |
| Style Consistency ↑ | 1.659 | 4.491 | 8.247 | 2.942 | 7.236 | 8.541 |

Table 6: Quantitative comparison on the Third-Person Real and Stylized scenarios. Best in bold, second best underlined.
Findings:
- HY-World 1.5 leads in visual quality and shows strong control alignment.
- Genie 3 again dominates world consistency metrics.
- Third-person is a pronounced failure mode: Matrix-Game 2.0's rotation error inflates dramatically (~20x in the Real scenario) compared to first-person, highlighting the difficulty of maintaining camera control around a visible character.
Qualitative Evaluation & Human Alignment
Qualitative examples (Figure 5) illustrate successes and failures across the three evaluation dimensions. A human preference study with 20 volunteers showed a strong correlation between human rankings and automated WorldMark scores (Figure 6), validating the benchmark's metrics.
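A rank correlation of this kind can be computed in a few lines; the `spearman_rho` helper and the sample data below are illustrative, not the study's actual numbers:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no-ties case): rank both lists, then
    apply rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical per-model mean scores, both on a higher-is-better scale.
human_scores = [9.0, 8.0, 7.0, 5.0, 3.0, 2.0]
bench_scores = [9.1, 8.4, 7.7, 6.0, 4.2, 3.5]
```

Here the two lists induce identical rankings, so `spearman_rho` returns 1.0; any swap of adjacent ranks lowers the value toward 0.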
Theoretical and Practical Implications
Theoretical Implications:
- Decouples Model Strengths: The benchmark demonstrates that current models excel along different, largely uncorrelated axes: the model with the highest visual fidelity (YUME 1.5) lags in world consistency, while the most consistent model (Genie 3) is not the most visually refined. This challenges the notion of a single "best" model and emphasizes the need for multi-dimensional evaluation.
- Highlights Fundamental Challenges: The severe degradation in third-person control, especially rotation, points to a core unsolved problem in interactive world modeling related to character-scene interaction and viewpoint stability.
Practical Implications:
- Enables Fair Comparison: For the first time, researchers and developers can directly and fairly compare interactive world models on equal footing, accelerating progress by identifying true strengths and weaknesses.
- Provides Actionable Insights: The results provide clear guidance for model development: improving world consistency is a major frontier, third-person generation needs dedicated architectural attention, and domain-specific training (e.g., Open-Oasis on Minecraft) does not generalize well.
- Offers Modular, Future-Proof Infrastructure: By separating standardized inputs from the evaluation metrics, WorldMark allows the community to reuse its test suite while integrating new, improved metrics as the field evolves.
- Democratizes Evaluation: The release of all data, code, and the online World Model Arena (warena.ai) allows anyone to compare models interactively, fostering transparency and community engagement.
Conclusion
WorldMark establishes the first standardized benchmark for interactive I2V world models, resolving the critical fragmentation in their evaluation. By providing a unified action interface, a diverse and hierarchical test suite, and a modular evaluation toolkit, it enables rigorous, apples-to-apples comparison.
The benchmark reveals key insights about the current state of the field: visual quality and world consistency are largely independent, precise low-level control does not guarantee a coherent world, and third-person generation remains an open and significant challenge. The release of this benchmark, along with all associated data and tools, is intended to provide a common foundation for measuring and driving future progress in interactive world modeling.