# WorldMark: A Unified Benchmark Suite for Interactive Video World Models

> WorldMark introduces the first standardized benchmark for interactive video world models, enabling fair comparisons by providing identical scenes, action sequences, and a unified control interface across six major models.

- **Source:** [arXiv](https://arxiv.org/abs/2604.21686)
- **Published:** 2026-04-25
- **Permalink:** https://picx.dev/p/sopctd
- **Whiteboard:** https://picx.dev/p/sopctd/image

## Summary (Overview)
*   **Standardized Benchmark:** Introduces **WorldMark**, the first benchmark providing standardized test conditions (identical scenes, identical action sequences, and a unified control interface) for fair, apples-to-apples comparison of interactive Image-to-Video (I2V) world models.
*   **Key Infrastructure:** Features a **unified action-mapping layer** that translates a shared WASD-style action vocabulary into the native control formats of six major models (YUME 1.5, HY-World 1.5, Genie 3, Matrix-Game 2.0, Open-Oasis, HY-GameCraft), enabling semantically identical inputs.
*   **Comprehensive Test Suite:** Provides a **hierarchical test suite of 500 evaluation cases** covering first-/third-person viewpoints, photorealistic/stylized scenes, and three difficulty tiers (Easy: 20s, Medium: 40s, Hard: 60s).
*   **Modular Evaluation Toolkit:** Offers a default, modular evaluation toolkit covering **Visual Quality, Control Alignment, and World Consistency** using both geometric (e.g., SLAM) and Vision-Language Model (VLM)-based metrics, allowing researchers to plug in their own metrics.
*   **Key Findings:** Reveals that **visual quality and world consistency are largely uncorrelated** across current models, that **third-person generation remains a significant challenge**, and that precise control alignment does not guarantee overall generation quality.

## Introduction and Theoretical Foundation
Interactive video generation models (e.g., Genie, YUME, HY-World, Matrix-Game) are advancing rapidly. Their goal is to function not just as video synthesizers but as **interactive environments** that respond faithfully to user actions while preserving long-term scene memory. However, a critical problem hinders progress: **each model is evaluated on its own private benchmark** with bespoke scenes, trajectories, and metrics, making fair cross-model comparison impossible.

Existing public benchmarks (VBench, WorldScore, MIND, PhyGenBench) provide useful metrics but lack the **standardized test conditions** necessary for interactive models. They either omit action control, rely on ground-truth camera trajectories most interactive models cannot accept, or fail to provide a common set of scenes and action sequences. This creates a disconnect: no benchmark jointly evaluates **action controllability, visual quality, and long-horizon world consistency** under the interactive paradigm.

WorldMark addresses this gap by establishing a **common playing field**. Its core thesis is that the root cause of the evaluation problem is not a lack of metrics but the **absence of standardized test conditions**. Once every model receives identical scenes, identical action sequences, and semantically equivalent inputs through a transparent translation layer, any metric yields comparable numbers across models.

## Methodology
WorldMark is a standardized benchmarking toolkit comprising five key components:

**1. Evaluation Dimension Suite:** Eight metrics organized into three categories (see Table 2):
*   **Visual Quality:** Assesses per-frame fidelity.
    *   *Aesthetic Quality:* Uses the LAION aesthetic predictor for human-perceived appeal.
    *   *Imaging Quality:* Uses MUSIQ to quantify low-level distortions (noise, blur, etc.).
*   **Control Alignment:** Measures faithfulness to input actions via reconstructed camera poses.
    *   *Translation Error:* Scale-invariant Euclidean distance between ground-truth and estimated camera positions.
        $$e_t = \| \mathbf{t}_{gt} - s\mathbf{t} \|_2$$
        where $\mathbf{t}_{gt}, \mathbf{t} \in \mathbb{R}^3$ are positions and $s$ is the least-squares scale factor.
    *   *Rotation Error:* Geodesic angular deviation between rotation matrices.
        $$e_r = \arccos\left( \frac{\text{tr}(\mathbf{R}_{gt} \mathbf{R}^T) - 1}{2} \right) \cdot \frac{180}{\pi}$$
        where $\mathbf{R}_{gt}, \mathbf{R} \in \text{SO}(3)$.
*   **World Consistency:** Evaluates temporal coherence and 3D plausibility.
    *   *Reprojection Error:* Measures 3D geometric coherence via DROID-SLAM's Dense Bundle Adjustment.
        $$e_{\text{reproj}} = \frac{1}{|\mathcal{V}|} \sum_{(i,j)\in\mathcal{V}} \| \mathbf{p}^*_{ij} - \Pi(\mathbf{P}_{ij}) \|_2$$
        where $\mathcal{V}$ is co-visible pixel pairs, $\mathbf{p}^*_{ij}$ is the observed 2D coordinate, $\mathbf{P}_{ij}$ is the 3D point, and $\Pi(\cdot)$ is the camera projection.
    *   *VLM-based Metrics (State, Content, Style Consistency):* Use a VLM (Gemini-3.1-Pro) to track object stability, detect hallucinations ("popping"), and assess stylistic uniformity.
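
The three geometric metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the benchmark's implementation. It assumes the scale $s$ is fitted once over a whole trajectory, and it replaces DROID-SLAM's Dense Bundle Adjustment with a plain pinhole projection using an assumed intrinsics matrix `K`. All function names are hypothetical.

```python
import numpy as np

def least_squares_scale(t_gt, t_est):
    """Scale s minimizing sum_i ||t_gt[i] - s * t_est[i]||^2 (assumed: one s per trajectory)."""
    denom = float(np.sum(t_est * t_est))
    return float(np.sum(t_gt * t_est)) / denom if denom > 0 else 1.0

def translation_error(t_gt, t_est):
    """Per-frame e_t = ||t_gt - s * t_est||_2 for (N, 3) arrays of camera positions."""
    s = least_squares_scale(t_gt, t_est)
    return np.linalg.norm(t_gt - s * t_est, axis=1)

def rotation_error_deg(R_gt, R_est):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_theta = (np.trace(R_gt @ R_est.T) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def reprojection_error(observed_px, points_cam, K):
    """Mean pixel distance between observed 2D points (N, 2) and projected 3D points (N, 3).

    Assumes points are already in the camera frame with z > 0 and that
    K is a 3x3 pinhole intrinsics matrix (an illustrative stand-in for Pi).
    """
    proj = (K @ points_cam.T).T
    proj_px = proj[:, :2] / proj[:, 2:3]
    return float(np.mean(np.linalg.norm(observed_px - proj_px, axis=1)))

# Toy inputs: the estimate is the ground truth at half scale, so e_t vanishes.
t_gt = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 4.0]])
t_est = t_gt / 2.0
# A 90-degree yaw about the z axis.
R_yaw90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
```

Because the scale is fitted before the distance is taken, a model whose reconstructed trajectory differs from the ground truth only by a global scale is not penalized on translation.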

**2. Image Suite:** A curated set of **50 reference images** from the WorldScore dataset, expanded to **100 images** by synthesizing corresponding third-person views for each. It spans categories (Nature, City, Indoor), styles (Real, Stylized), and viewpoints (first-/third-person).

**3. Action Suite:** **15 standardized action sequences** (see Figure 3) composed from a shared vocabulary of six primitives: forward (W), backward (S), strafe-left (A), strafe-right (D), yaw-left (L), yaw-right (R). Sequences range from simple translations to complex patrols. A **VLM filters actions** for each image to ensure physical plausibility (e.g., no strafing into a wall). The suite is organized into three difficulty tiers (Easy/20s, Medium/40s, Hard/60s), creating ~500 evaluation cases.
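
One way to picture the action suite is as a small validated data structure. The sketch below is illustrative only: the six primitives and the tier durations come from the description above, while the class name, the one-second segment granularity, and the rule that a sequence fills its tier exactly are assumptions.

```python
from dataclasses import dataclass

# Shared six-primitive vocabulary described in the text.
PRIMITIVES = {
    "W": "forward", "S": "backward",
    "A": "strafe-left", "D": "strafe-right",
    "L": "yaw-left", "R": "yaw-right",
}

# Tier durations in seconds, as stated in the text.
TIER_SECONDS = {"easy": 20, "medium": 40, "hard": 60}

@dataclass(frozen=True)
class ActionSequence:
    """Hypothetical container: a named list of (primitive, seconds) segments."""
    name: str
    tier: str
    segments: tuple

    def __post_init__(self):
        for key, seconds in self.segments:
            if key not in PRIMITIVES:
                raise ValueError(f"unknown primitive: {key!r}")
            if seconds <= 0:
                raise ValueError("segment duration must be positive")
        # Assumption: a sequence fills its tier's duration exactly.
        if self.total_seconds() != TIER_SECONDS[self.tier]:
            raise ValueError("sequence length does not match its tier")

    def total_seconds(self):
        return sum(seconds for _, seconds in self.segments)

    def per_second(self):
        """Expand segments into one primitive per second."""
        return [key for key, seconds in self.segments for _ in range(seconds)]

# Hypothetical example sequence: walk forward, then yaw right.
forward_then_turn = ActionSequence("forward_then_turn", "easy", (("W", 12), ("R", 8)))
```

Validating at construction time means a malformed sequence fails before any video is generated, which matters when the same sequence is replayed against several models.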

**4. Unified Action Interface:** The core innovation that enables cross-model comparison. It translates the shared WASD+L/R vocabulary into each model's native, heterogeneous control format via per-model **action-mapping adapters** (see Table 3).

| Model | Native Format | Mapping Strategy |
| :--- | :--- | :--- |
| YUME 1.5 | Caption prompts | Directional keywords in text |
| HY-World 1.5 | 6-DoF pose params | Latent timescale matching |
| HY-GameCraft | 6-DoF pose params | Pose → Plücker ray embeddings |
| Genie 3 | Gamepad controls | Directional button presses |
| Matrix-Game 2.0 | Action functions | Corresponding action API calls |
| Open-Oasis | 25-dim action vectors | Set movement dimensions |
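
The adapter idea in the table can be sketched as a registry of functions that each turn the shared vocabulary into one native format. The code below illustrates two of the styles (caption prompts and fixed-length action vectors). The phrase wording, the vector index layout, and every function name are hypothetical and do not reflect any model's real interface.

```python
from typing import Callable, Dict, List

# Hypothetical phrasing for a caption-style adapter.
DIRECTION_WORDS = {
    "W": "moves forward", "S": "moves backward",
    "A": "moves left", "D": "moves right",
    "L": "turns left", "R": "turns right",
}

def to_caption_prompt(actions: List[str]) -> str:
    """Collapse consecutive repeats and join the phrases into one sentence."""
    phrases: List[str] = []
    for action in actions:
        phrase = DIRECTION_WORDS[action]
        if not phrases or phrases[-1] != phrase:
            phrases.append(phrase)
    return "The camera " + ", then ".join(phrases) + "."

# HYPOTHETICAL index layout; the real meaning of each dimension is not given in the text.
HYPOTHETICAL_VECTOR_INDEX = {"W": 0, "S": 1, "A": 2, "D": 3, "L": 4, "R": 5}

def to_action_vectors(actions: List[str], dim: int = 25) -> List[List[float]]:
    """Emit one dim-length vector per step with a single movement dimension set."""
    vectors = []
    for action in actions:
        vector = [0.0] * dim
        vector[HYPOTHETICAL_VECTOR_INDEX[action]] = 1.0
        vectors.append(vector)
    return vectors

ADAPTERS: Dict[str, Callable[[List[str]], object]] = {
    "caption": to_caption_prompt,
    "vector": to_action_vectors,
}

def map_actions(adapter_name: str, actions: List[str]):
    """Dispatch the shared action list to the chosen adapter."""
    return ADAPTERS[adapter_name](actions)

caption = map_actions("caption", ["W", "W", "L"])
vectors = map_actions("vector", ["W", "W", "L"])
```

Keeping every adapter behind the same call signature is what lets the rest of the pipeline stay model-agnostic: the evaluation code only ever sees the shared vocabulary.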

**5. Evaluation Workflow:** A modular four-stage pipeline: (1) **Image Selection**, (2) **Action Mapping**, (3) **Video Generation**, (4) **Metric Evaluation**. Researchers can plug in custom metrics at the final stage.
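
The plug-in point at the final stage can be sketched with a small metric registry. This is an assumed design for illustration, not the toolkit's actual API: `MetricRegistry`, `run_case`, and the stub generator are all hypothetical names, and the "video" is just a list of placeholder frames.

```python
from typing import Any, Callable, Dict, List

MetricFn = Callable[[Any], float]

class MetricRegistry:
    """Holds named metric functions so new ones can be plugged in at stage 4."""

    def __init__(self) -> None:
        self._metrics: Dict[str, MetricFn] = {}

    def register(self, name: str) -> Callable[[MetricFn], MetricFn]:
        def decorator(fn: MetricFn) -> MetricFn:
            self._metrics[name] = fn
            return fn
        return decorator

    def evaluate(self, video: Any) -> Dict[str, float]:
        return {name: fn(video) for name, fn in self._metrics.items()}

def run_case(image, actions, map_actions, generate_video, registry):
    """Run stages 2 to 4 for one (image, action sequence) pair chosen in stage 1."""
    native_controls = map_actions(actions)           # stage 2: action mapping
    video = generate_video(image, native_controls)   # stage 3: video generation
    return registry.evaluate(video)                  # stage 4: metric evaluation

registry = MetricRegistry()

@registry.register("frame_count")
def frame_count(video: List[Any]) -> float:
    """Toy custom metric: number of generated frames."""
    return float(len(video))

def stub_generate(image, native_controls):
    """Stand-in for a real model: one placeholder frame per control step."""
    return [image for _ in native_controls]

scores = run_case("scene_001.png", ["W", "W", "L"], lambda a: list(a), stub_generate, registry)
```

Adding a metric is then a matter of decorating one more function; the image, action, and generation stages are untouched.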

## Empirical Validation / Results
Six models were evaluated: YUME 1.5, Matrix-Game 2.0, HY-World 1.5, HY-GameCraft (HY-Game), Open-Oasis, and Genie 3.

### First-Person View Evaluation
**Quantitative results for the First-Person Real and Stylized scenarios appear in Tables 4 and 5 of the paper; Table 4 (Real) is reproduced below.**

| Metric | YUME 1.5 | Matrix-Game 2.0 | HY-World 1.5 | HY-Game | Oasis | Genie 3 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Visual Quality** | | | | | | |
| Aesthetic Quality ↑ | **56.94** | 49.40 | 54.79 | 46.59 | 29.31 | 45.58 |
| Imaging Quality ↑ | **74.36** | 68.11 | 69.37 | 49.31 | 28.08 | 64.14 |
| **Control Alignment** | | | | | | |
| Translation Error ↓ | 0.199 | 0.222 | 0.191 | **0.159** | 0.376 | 0.498 |
| Rotation Error ↓ | 2.107 | **1.324** | 2.079 | 6.019 | 4.892 | 4.247 |
| **World Consistency** | | | | | | |
| Reprojection Error ↓ | 0.549 | 0.688 | 0.702 | 0.447 | 1.938 | **0.441** |
| State Consistency ↑ | 5.344 | 4.151 | 5.913 | 4.073 | 2.585 | **6.416** |
| Content Consistency ↑ | 3.820 | 7.415 | 6.352 | 5.814 | 3.748 | **6.914** |
| Style Consistency ↑ | 7.119 | 3.181 | 5.142 | 3.726 | 1.797 | **8.158** |
*Table 4: Quantitative comparison on the First-Person Real scenario. Best in **bold**.*

**Findings:**
*   **Visual Quality:** YUME 1.5 and HY-World 1.5 lead in producing aesthetically pleasing frames.
*   **Control Alignment:** HY-Game shows the most precise translation control, while Matrix-Game 2.0 excels in rotation alignment in real scenes.
*   **World Consistency:** The closed-source **Genie 3 dominates** across almost all consistency metrics, demonstrating superior long-horizon coherence.

### Third-Person View Evaluation
**Key quantitative results are shown in Table 6.** Only Matrix-Game 2.0, HY-World 1.5, and Genie 3 support third-person.

| Metric | Matrix-Game 2.0 (Real) | HY-World 1.5 (Real) | Genie 3 (Real) | Matrix-Game 2.0 (Stylized) | HY-World 1.5 (Stylized) | Genie 3 (Stylized) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Visual Quality** | | | | | | |
| Aesthetic Quality ↑ | 52.78 | **57.69** | 51.04 | 51.60 | **60.57** | 53.76 |
| Imaging Quality ↑ | 67.26 | **70.76** | 60.20 | 65.24 | **66.45** | 63.98 |
| **Control Alignment** | | | | | | |
| Translation Error ↓ | 0.284 | **0.206** | 0.212 | 0.230 | 0.220 | **0.129** |
| Rotation Error ↓ | 27.606 | **2.137** | 14.905 | 9.211 | **5.285** | 8.823 |
| **World Consistency** | | | | | | |
| Reprojection Error ↓ | 0.814 | 0.640 | **0.584** | 0.744 | **0.713** | 1.148 |
| State Consistency ↑ | 5.136 | 6.628 | **7.082** | 3.625 | 5.274 | **7.565** |
| Content Consistency ↑ | 3.405 | 5.707 | **7.424** | 2.083 | 5.147 | **7.109** |
| Style Consistency ↑ | 1.659 | 4.491 | **8.247** | 2.942 | 7.236 | **8.541** |
*Table 6: Quantitative comparison on the Third-Person scenarios. Best in **bold**.*

**Findings:**
*   HY-World 1.5 leads in visual quality and shows strong control alignment.
*   Genie 3 again dominates world consistency metrics.
*   **Third-person is a pronounced failure mode:** Matrix-Game 2.0's rotation error on Real scenes rises from 1.324 in first-person (Table 4) to 27.606 in third-person, roughly a 20x increase, highlighting the difficulty of maintaining camera control around a visible character.
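
The size of the third-person degradation can be checked directly from the Real-scene rotation errors in the two tables above. The short script below only divides the reported numbers; the dictionary names are illustrative.

```python
# Rotation error on Real scenes, copied from the first-person and third-person tables.
first_person = {"Matrix-Game 2.0": 1.324, "HY-World 1.5": 2.079, "Genie 3": 4.247}
third_person = {"Matrix-Game 2.0": 27.606, "HY-World 1.5": 2.137, "Genie 3": 14.905}

# Ratio of third-person to first-person error for each model.
ratios = {model: third_person[model] / first_person[model] for model in first_person}

for model, ratio in ratios.items():
    print(f"{model}: {ratio:.2f}x")
```

On these numbers the ratio is roughly 21x for Matrix-Game 2.0, about 3.5x for Genie 3, and close to 1x for HY-World 1.5, so the degradation is real but far from uniform across models.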

### Qualitative Evaluation & Human Alignment
Qualitative examples (Figure 5) illustrate successes and failures across the three evaluation dimensions. A human preference study with 20 volunteers showed **strong correlation ($\rho > 0.9$)** between human rankings and automated WorldMark scores (Figure 6), validating the benchmark's metrics.
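
A rank correlation of this kind can be computed as follows. The sketch assumes that $\rho$ denotes Spearman's rank correlation, which the summary does not state explicitly, and the two rankings are made-up placeholders, not data from the study.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for sequences without ties.

    Ranks each sequence, then takes the Pearson correlation of the ranks.
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# PLACEHOLDER rankings of six models (1 = best); not taken from the paper.
human_ranking = [1, 2, 3, 4, 5, 6]
automated_ranking = [1, 2, 4, 3, 5, 6]

rho = spearman_rho(human_ranking, automated_ranking)
```

With six items, swapping a single adjacent pair already lowers the coefficient to about 0.94, which gives a feel for how closely two rankings must agree to exceed 0.9.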

## Theoretical and Practical Implications
**Theoretical Implications:**
*   **Decouples Model Strengths:** The benchmark shows that current models excel in different, often uncorrelated aspects. The visual-quality leader (YUME 1.5) does not lead on world consistency, and the consistency leader (Genie 3) trails on visual quality. This challenges the notion of a single "best" model and emphasizes the need for multi-dimensional evaluation.
*   **Highlights Fundamental Challenges:** The severe degradation in third-person control, especially rotation, points to a core unsolved problem in interactive world modeling related to character-scene interaction and viewpoint stability.

**Practical Implications:**
*   **Enables Fair Comparison:** For the first time, researchers and developers can directly and fairly compare interactive world models on equal footing, accelerating progress by identifying true strengths and weaknesses.
*   **Provides Actionable Insights:** The results provide clear guidance for model development: improving world consistency is a major frontier, third-person generation needs dedicated architectural attention, and domain-specific training (e.g., Open-Oasis on Minecraft) does not generalize well.
*   **Offers Modular, Future-Proof Infrastructure:** By separating standardized inputs from the evaluation metrics, WorldMark allows the community to reuse its test suite while integrating new, improved metrics as the field evolves.
*   **Democratizes Evaluation:** The release of all data, code, and the online **World Model Arena (warena.ai)** allows anyone to compare models interactively, fostering transparency and community engagement.

## Conclusion
WorldMark establishes the first standardized benchmark for interactive I2V world models, resolving the critical fragmentation in their evaluation. By providing a unified action interface, a diverse and hierarchical test suite, and a modular evaluation toolkit, it enables rigorous, apples-to-apples comparison.

The benchmark reveals key insights about the current state of the field: **visual quality and world consistency are largely independent**, **precise low-level control does not guarantee a coherent world**, and **third-person generation remains an open and significant challenge**. The release of this benchmark, along with all associated data and tools, is intended to provide a common foundation for measuring and driving future progress in interactive world modeling.

---

_Markdown view of https://picx.dev/p/sopctd, served by PicX — AI-generated visual whiteboard summaries of research papers._
