Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding - Summary
Summary (Overview)
- Introduces Video-MME-v2, a new benchmark designed to rigorously evaluate the robustness and faithfulness of video Multimodal Large Language Models (MLLMs), addressing the gap between inflated leaderboard scores and real-world capabilities.
- Proposes a progressive tri-level capability hierarchy (Level 1: Visual Information Aggregation, Level 2: Temporal Dynamics, Level 3: Complex Reasoning) and a group-based non-linear evaluation strategy that penalizes fragmented correctness and rewards consistent, coherent reasoning.
- Reveals a substantial performance gap: The best model (Gemini-3-Pro) achieves a Non-Lin Score of 49.4, far below the human expert baseline of 90.7. Results uncover a hierarchical bottleneck where errors in lower-level perception and temporal modeling propagate to limit high-level reasoning.
- Highlights key insights: Conventional per-question accuracy overestimates capability; "thinking" modes improve performance with subtitles but can regress without them, indicating over-reliance on language priors; and model scale can partially compensate for missing core capabilities.
- Constructs a high-quality dataset through a meticulously controlled human annotation pipeline involving 12 annotators and 50 reviewers over 3,300 human-hours, resulting in 800 videos and 3,200 questions with strong adversarial distractors.
Introduction and Theoretical Foundation
Recent advancements in video-based MLLMs have shifted focus from simple comprehension to deeper reasoning. However, existing benchmarks often lack a comprehensive evaluation hierarchy and rely on per-question accuracy, overlooking the need for consistent and trustworthy video comprehension. This makes holistic assessment difficult and hinders investigation into models' robust understanding and reliable reasoning capabilities.
To address these challenges, Video-MME-v2 is introduced with two core innovations:
- A Multi-level Evaluation Hierarchy: A progressive framework categorizing core video understanding skills into three levels, ensuring evaluation from foundational perception to sophisticated, human-like comprehension.
- A Group-based Evaluation Strategy: Assesses models from two perspectives: Capability Consistency (breadth of a fundamental skill) and Reasoning Coherence (depth of multi-step logical reasoning). This is coupled with a non-linear scoring method to enforce stepwise reasoning validity.
The benchmark aims to serve as a demanding new testbed that pushes frontier video MLLMs toward more robust and faithful understanding of dynamic visual content.
Methodology
1. Progressive Capability Hierarchy
The benchmark organizes questions into a three-level hierarchy comprising 12 sub-categories and over 30 task types.
- Level 1: Visual Information Aggregation: Foundational level assessing perception and aggregation of cross-frame and cross-modal information at specific timestamps.
- Sub-categories: Visual Recognition, Cross-Modal Consistency, Basic Counting & Calculation.
- Level 2: Temporal Dynamics: Emphasizes the temporal evolution of events, building upon Level 1.
- Sub-categories: Action & Motion Analysis, Sequential Ordering, Causal Reasoning.
- Level 3: Complex Reasoning: Advanced level simulating real-world cognitive tasks requiring professional knowledge and multi-hop inference.
- Sub-categories: Narrative Understanding, Social Dynamics, Physical World Reasoning.
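For quick reference, the hierarchy can be restated as a small Python mapping; this is purely illustrative and covers only the sub-categories named above (the full benchmark defines 12 sub-categories and over 30 task types):

```python
# Tri-level capability hierarchy, restated from the list above (illustrative only).
CAPABILITY_HIERARCHY = {
    "Level 1: Visual Information Aggregation": [
        "Visual Recognition",
        "Cross-Modal Consistency",
        "Basic Counting & Calculation",
    ],
    "Level 2: Temporal Dynamics": [
        "Action & Motion Analysis",
        "Sequential Ordering",
        "Causal Reasoning",
    ],
    "Level 3: Complex Reasoning": [
        "Narrative Understanding",
        "Social Dynamics",
        "Physical World Reasoning",
    ],
}
```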
2. Group Type Definition
To move beyond per-question accuracy, the benchmark employs two group types:
- Consistency-Based Group: Evaluates the breadth of a specific capability. Groups are constructed along two dimensions:
- Breadth: Diverse question types within a single domain (e.g., spatial understanding includes both static localization and dynamic motion reasoning).
- Granularity: Extends one question type across multiple spatio-temporal scales (e.g., global sequence vs. fine-grained sub-action ordering).
- Coherence-Based Group: Evaluates the depth of a model's reasoning ability. Question sets are constructed to mimic the logical progression a human would follow, creating an explicit reasoning chain (e.g., from clue localization → anomaly verification → purpose explanation → final conclusion). This allows for hierarchical verification of coherent, grounded reasoning.
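To make the group unit concrete, here is a minimal sketch of what one annotated group could look like; the field names and values are hypothetical, not the benchmark's released schema:

```python
# Hypothetical record for one coherence-based group. Each group holds four
# 8-option multiple-choice questions that form an explicit reasoning chain
# (clue localization -> anomaly verification -> purpose explanation -> conclusion).
example_group = {
    "group_id": "demo-0001",                  # hypothetical identifier
    "group_type": "coherence",                # or "consistency"
    "level": 3,                               # Level 3: Complex Reasoning
    "questions": [
        {"step": "clue localization",    "num_options": 8, "answer": "C"},
        {"step": "anomaly verification", "num_options": 8, "answer": "A"},
        {"step": "purpose explanation",  "num_options": 8, "answer": "F"},
        {"step": "final conclusion",     "num_options": 8, "answer": "B"},
    ],
}
```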
3. Metrics
The evaluation incorporates both conventional average accuracy and the proposed group-level non-linear score.
- Average Accuracy (Avg Acc): Standard per-question correctness,
  $\text{Avg Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{y}_i = y_i]$,
  where $i$ indexes questions, $\hat{y}_i$ is the predicted answer, and $y_i$ is the ground truth.
- Group-level Non-linear Score: Evaluates robustness against related questions within a group $G$, which contains four questions $\{q_1, q_2, q_3, q_4\}$.
  - For consistency groups: Given $k$ correct answers out of 4, the group score is $(k/4)^2$. This quadratic suppression penalizes isolated correct guesses.
  - For coherence groups: A first-error truncation mechanism is applied. Only the consecutive run of correct answers starting from the first reasoning step counts; any correct answers after an error are ignored. This prevents credit for logically unsupported steps. Both scores are sketched in code below.
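Both group scores are simple to compute. Below is a minimal Python sketch under stated assumptions: the 0-1 per-group scale and the coherence normalization by 4 are assumptions of this summary, not the paper's reference implementation.

```python
def consistency_score(correct: list[bool]) -> float:
    """Quadratic score for a consistency-based group of four questions.

    With k correct answers out of 4, the group scores (k / 4) ** 2,
    so isolated correct guesses contribute very little.
    """
    assert len(correct) == 4
    k = sum(correct)
    return (k / 4) ** 2


def coherence_score(correct: list[bool]) -> float:
    """First-error truncation score for a coherence-based group of four questions.

    Only the run of correct answers starting at the first reasoning step
    counts; anything after the first error is ignored. Dividing by 4 to get
    a 0-1 score is an assumption of this sketch.
    """
    assert len(correct) == 4
    run = 0
    for is_correct in correct:
        if not is_correct:
            break
        run += 1
    return run / 4


# Two scattered correct answers score far less than half under the quadratic rule,
# and a correct answer after the first error earns no credit under truncation.
print(consistency_score([True, False, True, False]))  # 0.25
print(coherence_score([True, True, False, True]))     # 0.5
```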
4. Dataset Construction and Annotation
A rigorous pipeline was established to ensure high data quality and mitigate pretraining leakage.
- Video Curation:
- Recency-Oriented Collection: Over 80% of videos were published in 2025 or later to minimize contamination risk.
- Diversity via Taxonomy: Videos sourced from four top-level domains (Sports & Competition, Lifestyle & Entertainment, Art & Literature, Knowledge & Education) split into 31 fine-grained subcategories.
- Quality Control: Used view-count thresholding (median 355k views) and manual decontamination to exclude classic works.
- Question and Option Design:
- Group-Based Construction: Questions are designed in groups of four, with average length progressively increasing from Q1 to Q4 to align with reasoning depth.
- 8-Option Multiple-Choice: Reduces random-guessing probability to 12.5%. Option word counts are strictly controlled to eliminate length bias.
- Strong Distractor Design: Includes at least one adversarial distractor—a plausible option that contradicts a key, fine-grained detail in the ground truth.
- Quality Assurance:
- Involved 50 independent reviewers for multi-round cross-validation and blind testing.
- Employed text-only baseline testing with Gemini-3-Pro to remove questions solvable without visual information (a minimal sketch of this filter follows the list).
- Established a closed-loop "Correction-Reverification" mechanism.
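A minimal sketch of the text-only decontamination step mentioned above; the exact pass/fail criterion and the way Gemini-3-Pro is queried are not specified here, so the single-pass rule and the callable interface below are assumptions:

```python
from typing import Callable

def filter_text_only_solvable(
    questions: list[dict],
    text_only_model: Callable[[str, list[str]], str],
) -> list[dict]:
    """Drop questions the text-only baseline already answers correctly.

    `text_only_model` sees only the question text and its options (no frames,
    no subtitles) and returns an option letter such as "C". Any question it
    answers correctly is treated as solvable without visual information and
    removed from the pool.
    """
    kept = []
    for q in questions:
        prediction = text_only_model(q["question"], q["options"])
        if prediction != q["answer"]:
            kept.append(q)
    return kept
```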
Empirical Validation / Results
Extensive evaluations were conducted across a broad spectrum of frontier video MLLMs under two settings: visual frames only (wo sub) and visual frames with subtitles/audio (w. sub).
Key Results Table
Table 1: Main Results on the Leaderboard (Sorted by w. sub Non-Lin Score)
| Model | Frames | Non-Lin Score (w. sub) | Non-Lin Score (wo sub) |
|---|---|---|---|
| Human Expert | - | 90.7 | - |
| Gemini-3-Pro [10] | 1fps | 49.4 | 38.2 |
| Doubao-Seed-2.0-Pro [3] | 1fps | 43.3 | 35.2 |
| Gemini-3-Flash [10] | 1fps | 42.5 | 32.9 |
| MiMo-v2-Omni [29] | 1fps | 38.6 | 29.9 |
| GPT-5 [21] | 50 | 37.0 | 26.4 |
| Qwen3.5-397B-A17B-Think [20] | 512 | 39.1 | 30.3 |
| Qwen3-VL-235B-A22B-Instruct [1] | 64 | 25.0 | 16.5 |
Table Note:
Consistency and Coherence scores are omitted from this summary table for brevity; the full table in the paper includes scores for all three levels and both group types. Per-question Avg Acc is reported in Table 2 below. Models are ranked by the proposed Non-Lin Score (w. sub).
Key Findings:
- Significant Gap with Human Performance: A massive 41.3-point gap exists between the best model (Gemini-3-Pro, 49.4) and human experts (90.7).
- Hierarchical Bottlenecks: Performance degrades monotonically from Level 1 to Level 3 across all models. Errors in lower-level competencies (information aggregation, temporal modeling) propagate to undermine high-level reasoning.
- Commercial Model Dominance: Commercial models (Gemini, GPT) significantly outperform open-source models. They also show smaller performance drops when subtitles are removed (wo sub), indicating greater robustness.
- Benefit of Native Audio: Omni-models with native audio processing (e.g., Gemini-3-Pro, MiMo-v2-Omni) show clear gains (+11.2 and +8.7 points, respectively) when audio is provided, suggesting that audio carries complementary semantic information.
- Scale vs. Quality: Model scale is not the sole determinant. For example, Qwen3.5-27B outperforms much larger models, highlighting the impact of training recipe and alignment techniques.
Analysis Experiments
Table 2: Avg Acc vs. Non-Lin Score (Robustness Ratio)
| Model | Avg Acc (%) | Non-Lin Score | Non-Lin / Avg Acc |
|---|---|---|---|
| Gemini-3-Pro | 66.1 | 49.4 | 74.7% |
| Doubao-Seed-2.0-Pro | 60.5 | 43.3 | 71.6% |
| Qwen3.5-397B-Think (512) | 55.9 | 39.1 | 69.9% |
| LLaVA-Video-7B-Qwen2 | 24.0 | 9.7 | 40.4% |
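The last column is simply the Non-Lin Score divided by Avg Acc; a quick check reproduces the reported percentages:

```python
# Robustness ratio = Non-Lin Score / Avg Acc, using the values from Table 2.
rows = {
    "Gemini-3-Pro":             (66.1, 49.4),
    "Doubao-Seed-2.0-Pro":      (60.5, 43.3),
    "Qwen3.5-397B-Think (512)": (55.9, 39.1),
    "LLaVA-Video-7B-Qwen2":     (24.0, 9.7),
}
for model, (avg_acc, non_lin) in rows.items():
    print(f"{model}: {non_lin / avg_acc:.1%}")  # 74.7%, 71.6%, 69.9%, 40.4%
```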
- Advantage of Non-Linear Scoring: The Non-Lin Score is substantially lower than Avg Acc, revealing that even SOTA models lack consistency across correlated queries. The ratio (Non-Lin / Avg Acc) serves as a robustness indicator, with weaker models showing much lower ratios (~40%).
- Effect of Thinking Mode: Enabling "thinking" or reasoning modes generally improves performance, but the gains are larger with subtitles (w. sub). Notably, for some models (e.g., KimiVL-16B), thinking causes regression in the wo sub setting, especially at Level 3, indicating imperfect reasoning mechanisms and over-reliance on language cues.
- Capability Profiling: The paper abstracts three core capabilities: C1 (omni-modal aggregation), C2 (long-context temporal modeling), and C3 (complex reasoning), profiled in Table 3 below.
Table 3: Capability Profiling
| Model | Non-Lin Score | C1 | C2 | C3 |
|---|---|---|---|---|
| Gemini-3-Pro | 49.4 | ✓ | ✓ | ✓ |
| MiMo-v2-Omni | 38.6 | ✓ | ✓ | ✓ |
| Qwen3.5-397B-Think (512) | 39.1 | | ✓ | ✓ |
| Qwen3.5-397B-Think (64) | 30.6 | | ✓ | ✓ |

- Synergy: Models with a more complete capability profile (C1+C2+C3) generally perform better.
- Scale Compensation: Large parameter count (e.g., Qwen3.5-397B) can partly compensate for missing capabilities (e.g., lacking C1).
- Frame Count Impact: Increasing frame count (from 64 to 512) for the same model can yield significant improvements (+8.5 points), underscoring the importance of long-context processing (C2).
Theoretical and Practical Implications
- Rethinking Video MLLM Evaluation: Video-MME-v2 demonstrates that conventional per-question accuracy substantially overestimates model capability. The proposed group-based, non-linear evaluation provides a more faithful measure of robustness and reasoning coherence, which should inform future benchmark design.
- Highlighting Fundamental Limitations: The large gap to human performance and the exposed hierarchical bottleneck reveal that current models fundamentally lack the capability consistency and reasoning coherence required for dynamic, real-world scenarios. Improving complex reasoning requires a holistic enhancement of the entire capability stack, starting with stronger perception and temporal grounding.
- Uncovering Model Biases and Dependencies: The analysis shows that current reasoning mechanisms are highly dependent on textual cues, leading to an over-reliance on language priors. This indicates a need for more robust, visually-grounded reasoning techniques.
- Informing Model Development: The findings suggest that synergistic improvement across omni-modal aggregation, long-context modeling, and complex reasoning is key. While scale can help, focused architectural and training innovations to build a complete capability profile are crucial for advancing the field.
Conclusion
Video-MME-v2 establishes a new, rigorous benchmark for comprehensive video understanding. Through its progressive capability hierarchy, group-based evaluation framework, and non-linear scoring scheme, it effectively exposes the limitations of current video MLLMs in terms of robustness and faithfulness. The extensive experiments reveal a substantial performance gap, a clear hierarchical bottleneck, and dependencies on textual modalities. By providing these insights and a high-quality evaluation testbed, Video-MME-v2 aims to drive the development of next-generation video MLLMs toward more robust, coherent, and human-like video comprehension.