Summary (Overview)

  • Introduces ViGoR-Bench, a comprehensive benchmark designed to evaluate the reasoning capabilities of visual generative models (Image-to-Image and Video tasks), moving beyond traditional metrics that focus solely on visual fidelity.
  • Employs a dual-track evaluation protocol that assesses both the generative process (e.g., intermediate steps, Chain-of-Thought) and the final result, using an automated "Evidence-Grounded" judge (Gemini-2.5-Pro) with high human alignment.
  • Reveals significant reasoning deficits in state-of-the-art models. Key findings include: a large performance gap between proprietary and open-source models; that explicit CoT improves interpretability but not necessarily final accuracy; and that video models exhibit an "Illusion of Reasoning" with good visual quality but poor logical success.
  • Demonstrates the benchmark's utility for model improvement by showing that training on challenging, out-of-distribution data with Reinforcement Learning (RL) can elicit strong reasoning capabilities, even surpassing top proprietary models on specific tasks.

Introduction and Theoretical Foundation

The paper addresses a critical gap in the evaluation of modern Artificial Intelligence-Generated Content (AIGC) models. Despite achieving stunning visual fidelity, these models often fail at tasks requiring physical, causal, or complex spatial reasoning—a problem termed the "logical desert." Current evaluation metrics like CLIP-Score and FID create a "performance mirage" by prioritizing statistical similarity over true structural and logical integrity.

Existing reasoning-centric benchmarks are fragmented, typically focusing on either Image-to-Image (I2I) or Video (I2V) tasks in isolation and often failing to evaluate the generative process itself. The "VLM-as-a-Judge" paradigm, while scalable, struggles with robust human alignment.

To address these limitations, the authors introduce ViGoR-Bench (Vision-Generative Reasoning-centric Benchmark), defined by four key innovations:

  1. Holistic Cross-Modal Coverage: Unifies evaluation across I2I, Sequential I2I (I2Is), and I2V tasks.
  2. Dual-Track Process + Outcome Evaluation: Assesses both intermediate reasoning steps and the final output.
  3. Evidence-Grounded Automated Alignment: Uses a multi-agent judge system with ground-truth references to achieve high agreement with human experts.
  4. Granular Diagnostic Analysis: Decomposes performance into fine-grained cognitive dimensions to pinpoint specific reasoning failures.

Methodology

1. Benchmark Construction (Data Engine)

ViGoR-Bench comprises 918 samples across three primary reasoning domains, each with several sub-tasks:

  • Physical Reasoning: Embodied intelligence tasks (Sorting, Spatial Reasoning, Object Assembly, etc.). Data is generated synthetically via LLMs and image models (e.g., NanoBanana-Pro), with human-verified textual ground truth.
  • Knowledge Reasoning: World knowledge tasks (Biology, Physics, History, etc.). Data is curated from authoritative sources, with human-verified textual answers and, where applicable, paired ground-truth images.
  • Symbolic Reasoning: Logical and mathematical tasks (Sudoku, Maze, Algebraic Calculation, etc.). Data is constructed via rule-based algorithms or real-world photography, with algorithmically generated ground-truth images.

The construction pipeline integrates Generative Synthesis, Real-world Acquisition, and Algorithmic Construction, followed by rigorous human and solver-based verification.
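The Algorithmic Construction and solver-based verification steps can be illustrated for a symbolic task like Maze Navigation. This is a hypothetical sketch, not the paper's actual data engine: it carves a random maze with iterative DFS and then verifies solvability with a BFS solver, mirroring the "construct by rule, verify by solver" pattern described above.

```python
import random
from collections import deque

def carve_maze(n, seed=0):
    """Carve an n x n maze (n odd) with iterative DFS; returns the set of open cells."""
    random.seed(seed)
    open_cells = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        r, c = stack[-1]
        # Unvisited cells two steps away (cells sit on even coordinates).
        nbrs = [(r + dr, c + dc) for dr, dc in ((2, 0), (-2, 0), (0, 2), (0, -2))
                if 0 <= r + dr < n and 0 <= c + dc < n
                and (r + dr, c + dc) not in open_cells]
        if nbrs:
            nr, nc = random.choice(nbrs)
            open_cells.add(((r + nr) // 2, (c + nc) // 2))  # open the wall between
            open_cells.add((nr, nc))
            stack.append((nr, nc))
        else:
            stack.pop()
    return open_cells

def solvable(open_cells, n):
    """Solver-based verification: BFS from the entrance to the exit."""
    start, goal = (0, 0), (n - 1, n - 1)
    if goal not in open_cells:
        return False
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if nxt in open_cells and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Because DFS carving produces a spanning tree over the cell grid, every generated instance passes the solver check by construction; the same solver can also grade a model's predicted path.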

2. Evaluation Protocol

A dual-track system uses Gemini-2.5-Pro as the automated judge.

Process Metric: Evaluates dynamic outputs (videos or intermediate reasoning frames).

S_{\text{Process}} = \text{VLM}(I, P, O_{\text{seq}}, R_i, R_t, T_{\text{Process}}) \tag{1}

The score vector is:

S_{\text{Process}} = [S_{BC}, S_{RO}, S_{VQ}, S_{RA}] \tag{2}

with the average calculated as:

S_{\text{Process}}^{\text{Avg}} = \frac{1}{4}\left(S_{BC} + S_{RO} + S_{VQ} + S_{RA}\right) \tag{3}

  • S_{BC}: Background Consistency (preservation of input structure).
  • S_{RO}: Rule Obey (adherence to instruction constraints).
  • S_{VQ}: Visual Quality (fidelity, absence of artifacts).
  • S_{RA}: Reasoning Accuracy (progress toward the correct solution).
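The process score vector and its equal-weight average (Eqs. 2-3) can be sketched as a small container; the field names and [0, 1] range are assumptions for illustration, since the paper only specifies the four sub-scores and their mean.

```python
from dataclasses import dataclass

@dataclass
class ProcessScore:
    """Score vector S_Process = [S_BC, S_RO, S_VQ, S_RA], each assumed in [0, 1]."""
    bc: float  # Background Consistency: preservation of input structure
    ro: float  # Rule Obey: adherence to instruction constraints
    vq: float  # Visual Quality: fidelity, absence of artifacts
    ra: float  # Reasoning Accuracy: progress toward the correct solution

    def avg(self) -> float:
        """Equal-weight average of the four sub-scores (Eq. 3)."""
        return (self.bc + self.ro + self.vq + self.ra) / 4
```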

Result Metric: Evaluates the final static output or final frame.

S_{\text{Result}} = \text{VLM}(I, P, O_{\text{final}}, R_i, R_t, T_{\text{Result}}) \tag{4}

The score vector is:

S_{\text{Result}} = [S_{BC}, S_{RO}, S_{VQ}, S_{RS}] \tag{5}

with the average:

S_{\text{Result}}^{\text{Avg}} = \frac{1}{4}\left(S_{BC} + S_{RO} + S_{VQ} + S_{RS}\right) \tag{6}

  • S_{RS}: Reasoning Success (binary pass/fail on whether the final state matches the reference).

3. Reliability Analysis

A meta-evaluation confirmed the automated judge's reliability compared to human experts.

Table 2: Reliability Analysis of VLM-as-a-Judge (✓/✗ for the GT Ref. column inferred from the duplicated judge rows and the stated conclusion that ground truth stabilizes judgments)

| Type    | Evaluator      | GT Ref. | MAE ↓ | Acc ↑ | Var ↓ |
|---------|----------------|---------|-------|-------|-------|
| Process | Human          | —       | 0.000 | 1.000 | 0.051 |
| Process | Gemini-2.5-Pro | ✗       | 0.319 | 0.680 | 0.039 |
| Process | Gemini-2.5-Pro | ✓       | 0.267 | 0.733 | 0.034 |
| Result  | Human          | —       | 0.000 | 1.000 | 0.011 |
| Result  | Gemini-2.5-Pro | ✗       | 0.294 | 0.705 | 0.034 |
| Result  | Gemini-2.5-Pro | ✓       | 0.213 | 0.786 | 0.029 |

Key conclusions: 1) High human alignment when using GT references, 2) Ground truth is critical for stabilizing judgments, 3) The VLM's consistency is competitive with human consensus.
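The three meta-evaluation statistics can be computed straightforwardly; since the paper does not spell out its exact definitions, this sketch assumes MAE and accuracy are taken against human consensus scores (with an adjustable agreement tolerance `tol`, a hypothetical parameter) and variance is taken over the judge's own scores.

```python
from statistics import pvariance

def meta_eval(judge_scores, human_scores, tol=0.0):
    """Reliability stats for a VLM judge vs. human consensus.

    Returns (MAE, agreement accuracy within tol, population variance
    of the judge's scores). All inputs are parallel score lists.
    """
    assert len(judge_scores) == len(human_scores) > 0
    errors = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    mae = sum(errors) / len(errors)                 # mean absolute error
    acc = sum(e <= tol for e in errors) / len(errors)  # fraction in agreement
    var = pvariance(judge_scores)                   # judgment consistency
    return mae, acc, var
```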

Empirical Validation / Results

Main Results (Table 3)

The paper evaluates over 20 leading models across four categories: Image Editing, Unified Models without CoT, Unified Models with CoT, and Video Generation Models.

Table 3 (Abridged Key Results)

| Type           | Model             | Result Metric Avg (%) |
|----------------|-------------------|-----------------------|
| Edit w/o CoT   | FLUX.2-dev        | 29.9                  |
| Edit w/o CoT   | Nano Banana Pro   | 68.4                  |
| Unified w/ CoT | Bagel-Think       | 9.5                   |
| Unified w/ CoT | Nano Banana Pro † | 61.2                  |
| Video Gen      | Sora 2 Pro        | 17.8                  |

Key Findings:

  1. Proprietary Lead: Proprietary models (Nano Banana Pro) significantly outperform open-source counterparts.
  2. CoT Interpretability vs. Accuracy: Explicit Chain-of-Thought prompting enhances process interpretability but does not guarantee improved final result accuracy, and can introduce error accumulation.
  3. Illusion of Reasoning in Video Models: Video models (e.g., Kling 1.6, Sora 2 Pro) achieve high Process Visual Quality scores (~77-85%) but have extremely low Result Reasoning Success (~1.6%), indicating they simulate fluid motion well but fail at underlying logical constraints.

Analysis

  • Impact of Problem Complexity: Performance on tasks like Maze Navigation and Jigsaw Puzzle degrades monotonically with increasing grid size. Sudoku shows an inverted-U pattern, peaking at intermediate complexities, suggesting training data bias.
  • Eliciting Reasoning via Post-Training: Training on the benchmark's data can dramatically improve reasoning. On the Maze Navigation task, fine-tuning Qwen-Image-Edit-2511 with RL on 8x8 data achieved a Reasoning Success of 97.0%, surpassing all proprietary models.
    • Finding: Training on more challenging, Out-of-Distribution (OOD) data (8x8 mazes) enhances generalization to simpler, in-distribution tasks.
    • Finding: Reward-driven Reinforcement Learning (RL) demonstrates superior potential for advancing reasoning where Supervised Fine-Tuning (SFT) exhibits saturation.
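One way the benchmark's judge scores could drive reward-based post-training is to fold them into a scalar reward dominated by the binary Reasoning Success signal. This is purely a hypothetical sketch; the weights and shaping terms are illustrative assumptions, not the paper's recipe.

```python
def reward(s_rs: float, s_ro: float, s_vq: float,
           w_rs: float = 1.0, w_ro: float = 0.25, w_vq: float = 0.1) -> float:
    """Hypothetical scalar reward for RL fine-tuning on a reasoning task.

    s_rs is the binary Reasoning Success (0 or 1) from the judge; small
    shaping terms for Rule Obey and Visual Quality keep gradients flowing
    on near-miss generations. All weights are illustrative.
    """
    return w_rs * s_rs + w_ro * s_ro + w_vq * s_vq
```

A success-dominated reward like this reflects the finding above: optimizing directly for logical success (rather than imitating reference outputs, as in SFT) is what pushed Maze Navigation to 97.0% Reasoning Success.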

Theoretical and Practical Implications

Theoretical Implications:

  • Challenges the assumption that high visual fidelity correlates with strong reasoning capabilities, exposing a "logical desert."
  • Provides a unified framework for evaluating cross-modal reasoning, bridging a gap in the current fragmented benchmark landscape.
  • Demonstrates that evaluating the generative process is crucial and distinct from evaluating the final output.

Practical Implications:

  • For Model Development: Serves as a critical "stress test" to identify specific reasoning deficits. The granular diagnostic analysis can guide architectural improvements and training strategies.
  • For Training Strategies: Provides evidence that RL and training on challenging OOD data are promising directions for eliciting reasoning capabilities, moving beyond SFT saturation.
  • For Evaluation Methodology: Establishes a reliable, automated evaluation pipeline (VLM-as-a-Judge with GT references) that scales while maintaining high human alignment, offering a model for future benchmark design.
  • For AI Safety and Reliability: By pushing the community to develop models with better physical, causal, and logical understanding, it contributes to creating more reliable and trustworthy AI systems suitable for critical applications.

Conclusion

ViGoR-Bench is introduced as a comprehensive benchmark and evaluation framework designed to move beyond the "performance mirage" and rigorously assess the reasoning capabilities of visual generative models. Through a dual-track process-outcome evaluation and an evidence-grounded automated judge, the benchmark reveals significant reasoning deficits in even state-of-the-art models, particularly in complex symbolic and physical tasks.

The work demonstrates that proprietary models currently hold a substantial lead, that CoT improves interpretability but not final accuracy, and that video models suffer from an "illusion of reasoning." Furthermore, it shows the benchmark's utility for model improvement, where RL and training on hard OOD data can elicit strong reasoning performance.

ViGoR-Bench establishes a new standard for evaluating generative visual intelligence, aiming to catalyze the development of models that are not just visually impressive but also logically sound and truly intelligent.