CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

Summary (Overview)

  • Core Contribution: Introduces CollabVR, a closed-loop framework that couples a Vision-Language Model (VLM) as a planner/verifier with a Video Generation Model (VGM) at a step-by-step granularity for video reasoning tasks.
  • Key Mechanism: The VLM progressively plans the immediate next action and verifies each short clip generated by the VGM, diagnosing failures and folding corrective feedback into the next prompt. This addresses two key VGM failure modes: long-horizon drift and mid-clip execution errors.
  • Empirical Results: CollabVR consistently improves performance over single-inference, Pass@k sampling, and prior test-time scaling methods (like VideoTPO) on two benchmarks (Gen-ViRe and VBVR-Bench) for both open-source (VBVR-Wan2.2) and closed-source (Veo 3.1) VGMs, at matched computational cost.
  • Orthogonality: The framework's test-time reasoning supervision is orthogonal and stackable with reasoning-oriented fine-tuning of VGMs, yielding further improvements on a fine-tuned model.
  • Validation: A human-annotated benchmark confirms that the VLM's supervisory decisions (plan depth, verification, and prompt evolution) align with expert judgment.

Introduction and Theoretical Foundation

Recent "Thinking with Video" paradigms use Video Generation Models (VGMs) to produce temporally coherent Chain-of-Frames as reasoning artifacts. However, VGMs exhibit complementary strengths and weaknesses compared to VLMs:

  • VLMs excel at logical reasoning, planning, and abstract inference but are weak at direct visual simulation.
  • VGMs excel at short-horizon visual simulation, detail, and physical coherence but are weak at abstract reasoning and long-range consistency.

VGMs trained for perceptual quality exhibit two recurring failure modes in goal-directed tasks:

  1. Overloaded-prompt failure (Long-horizon drift): When a single prompt specifies a multi-step task, the VGM collapses it into one short rollout, deviating from the intended trajectory due to lack of planning capacity.
  2. Execution failure (Mid-clip errors): Localized errors within a single clip (e.g., agent crossing a wall, object identity loss) propagate and contaminate the entire trajectory.

The root cause is the absence of an explicit, corrigible reasoning process on top of the VGM's strong short-horizon visual prior. A VLM can naturally serve as a reasoning supervisor. However, planning everything upfront commits to a trajectory before any generation has occurred, and post-hoc critiques over whole videos intervene too late to prevent errors from propagating. CollabVR therefore proposes step-level coupling, so failures are caught immediately and repaired before they compound.

Methodology

CollabVR is a closed-loop, training-free framework that treats video reasoning as a construction problem. The correct trajectory is assembled stepwise through VLM planning and VGM execution.

Problem Formulation

A task is specified by an input image $I_0$ and a task prompt $q$. The goal is to produce a video $V$ that realizes the reasoning. The framework uses:

  • A VLM-based planner/verifier $\pi$ (queried as $\pi_{\text{plan}}$ and $\pi_{\text{verify}}$).
  • An image-to-video generator $g$ that maps a conditioning frame $f$ and an action prompt $a_t$ to a short clip $c_t$.

The system maintains:

  • $f$: the latest conditioning frame (initially $I_0$).
  • $H$: the history of accepted clips.

The output is the concatenation of accepted clips: $V = c_1 \oplus \cdots \oplus c_N$.

The core algorithm is detailed in Algorithm 1 (provided in the paper). The loop iterates for at most $N_{\text{max}}$ planning steps, with a per-step attempt budget of $M$ generations.
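
To make the control flow concrete, the following is a minimal Python sketch of the closed loop described above (Algorithm 1). The helper interfaces plan_next_action, generate_clip, verify_clip, evolve_prompt, and last_frame are hypothetical stand-ins for the VLM and VGM calls, not the authors' implementation; the behavior when the per-step budget is exhausted is likewise an assumption, since the summary does not specify it.

```python
def collabvr(I0, q, plan_next_action, generate_clip, verify_clip,
             evolve_prompt, last_frame, N_max=3, M=3):
    """Closed-loop sketch: plan -> generate -> verify -> repair, step by step.

    plan_next_action(I0, q, H)   -> next action prompt a_t, or None if the task is done
    generate_clip(f, a_t)        -> short clip c_t conditioned on frame f
    verify_clip(I0, q, a_t, c_t) -> (v, d): "accept"/"reject" plus a textual diagnosis
    evolve_prompt(a_t, d)        -> action prompt revised with the diagnosis
    last_frame(c_t)              -> final frame of a clip
    """
    f = I0   # latest conditioning frame
    H = []   # history of accepted clips
    for _ in range(N_max):
        a_t = plan_next_action(I0, q, H)      # Module 1: plan only the next step
        if a_t is None:                       # planner judges the task complete
            break
        accepted = None
        for _ in range(M):                    # per-step attempt budget
            c_t = generate_clip(f, a_t)
            v, d = verify_clip(I0, q, a_t, c_t)   # Module 2: structured judgment
            if v == "accept":
                accepted = c_t
                break
            a_t = evolve_prompt(a_t, d)       # fold the diagnosis into the prompt
        if accepted is None:
            accepted = c_t                    # assumption: keep the last attempt if the budget runs out
        H.append(accepted)
        f = last_frame(accepted)              # condition the next step on actual progress
    return H                                  # V = c_1 ⊕ ... ⊕ c_N (concatenate accepted clips)
```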

Core Modules

1. VLM-Driven Progressive Planning (Module 1): Addresses overloaded-prompt failure. Instead of pre-planning all steps upfront, the VLM adaptively decides the step count and plans only the immediate next action $a_t$, conditioned on previously generated frames and the task prompt.

$a_t \gets \pi_{\text{plan}}(I_0, q, H)$

This allows the plan to adapt to the VGM's actual output, mitigating long-horizon drift.

2. VLM-VGM Collaborative Reasoning (Module 2): Addresses execution failure. For each generated clip $c_t$, the VLM verifier $\pi_{\text{verify}}$ produces a structured judgment $(v, d)$ where:

  • $v \in \{\text{accept}, \text{reject}\}$
  • dd is a textual reason and actionable suggestion for repair.

If $v = \text{reject}$, the next action prompt is evolved using the diagnosis: $a_t \gets \text{evolve}(a_t, d)$. The VGM is re-sampled with this revised prompt (up to $M$ retries). This localizes and repairs errors before they compound.
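
As an illustration of how the structured judgment and prompt evolution could be realized, here is a small sketch; the JSON field names, dataclass, and prompt wording are assumptions for illustration, not the paper's exact format.

```python
import json
from dataclasses import dataclass

@dataclass
class Judgment:
    v: str  # "accept" or "reject"
    d: str  # textual reason plus an actionable repair suggestion

def parse_judgment(vlm_response: str) -> Judgment:
    """Parse the verifier's reply, assuming it was instructed to answer in JSON."""
    obj = json.loads(vlm_response)
    return Judgment(v=obj["verdict"], d=obj["diagnosis"])  # hypothetical field names

def evolve_prompt(a_t: str, d: str) -> str:
    """Fold the verifier's diagnosis into the next action prompt: a_t <- evolve(a_t, d)."""
    return (
        f"{a_t}\n"
        f"The previous attempt failed for this reason: {d}\n"
        f"Regenerate the clip while explicitly avoiding that failure."
    )
```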

Empirical Validation / Results

Experimental Setup

  • Benchmarks: Gen-ViRe (72 samples, 6 categories, VLM-judged) and VBVR-Bench (500 samples, 5 categories, rule-based evaluation).
  • VGMs: VBVR-Wan2.2 (open-source), Veo 3.1 (closed-source), Cosmos-Predict-2.5.
  • VLM: Gemini 2.5 Pro as default planner/verifier.
  • Baselines: Single Inference (Pass@1), Pass@k ($k = 2, 4$), VideoTPO.
  • Cost Metric: Total seconds of video generated by the VGM per sample (VLM compute is treated as negligible); a brief accounting sketch follows this list.
  • CollabVR Configuration: $N_{\text{max}} = 3$, per-step attempt budget $M = 3$.
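
Under this metric, a plausible reading (not stated explicitly above) is that every generated clip counts toward cost, whether or not the verifier accepts it. A tiny sketch of that accounting, with hypothetical per-clip durations:

```python
def vgm_cost_seconds(attempted_clip_durations):
    """Cost for one sample: total seconds of video the VGM generated,
    counting every attempted clip, accepted or rejected."""
    return sum(attempted_clip_durations)

# Durations loosely modeled on Table 1 (VBVR-Wan2.2, ~6.0 s per clip on Gen-ViRe):
print(vgm_cost_seconds([6.0]))                 # single inference: 6.0 s
print(vgm_cost_seconds([6.0, 6.0, 6.0, 6.0]))  # Pass@4: 24.0 s
```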

Main Results

Table 1: Benchmarking results on Gen-ViRe.

| Method | VGM Cost (s) | Avg. | Abst. | Algo. | Analog. | Perc. | Plan. | Spat. |
|---|---|---|---|---|---|---|---|---|
| Open-source Video Models | | | | | | | | |
| VBVR-Wan2.2 | 6.0 | 0.391 | 0.479 | 0.415 | 0.250 | 0.261 | 0.554 | 0.387 |
| VBVR-Wan2.2 + Pass@2 | 12.0 | 0.398 | 0.576 | 0.437 | 0.278 | 0.257 | 0.481 | 0.357 |
| VBVR-Wan2.2 + Pass@4 | 24.0 | 0.438 | 0.622 | 0.418 | 0.250 | 0.275 | 0.604 | 0.462 |
| VBVR-Wan2.2 + VideoTPO [4] | 30.0 | 0.488 | 0.535 | 0.443 | 0.417 | 0.313 | 0.671 | 0.552 |
| VBVR-Wan2.2 + CollabVR | 17.8 | 0.531 | 0.569 | 0.606 | 0.333 | 0.367 | 0.821 | 0.488 |
| Closed-source Video Models | | | | | | | | |
| Veo 3.1 | 8.0 | 0.481 | 0.420 | 0.512 | 0.361 | 0.274 | 0.744 | 0.573 |
| Veo 3.1 + Pass@2 | 16.0 | 0.491 | 0.458 | 0.587 | 0.389 | 0.242 | 0.721 | 0.571 |
| Veo 3.1 + Pass@4 | 32.0 | 0.509 | 0.425 | 0.573 | 0.417 | 0.296 | 0.726 | 0.646 |
| Veo 3.1 + CollabVR | 21.4 | 0.550 | 0.434 | 0.641 | 0.472 | 0.325 | 0.768 | 0.657 |

Table 2: Benchmarking results on VBVR-Bench.

(ID = In-Domain by Category, OOD = Out-of-Domain by Category.)

| Models | VGM Cost (s) | Overall | ID Avg. | ID Abst. | ID Know. | ID Perc. | ID Spat. | ID Trans. | OOD Avg. | OOD Abst. | OOD Know. | OOD Perc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VBVR-Wan2.2 | 3.70 | 0.671 | 0.762 | 0.701 | 0.746 | 0.802 | 0.793 | 0.803 | 0.577 | 0.674 | 0.674 | 0.503 |
| VBVR-Wan2.2 + Pass@2 | 7.40 | 0.694 | 0.783 | 0.791 | 0.742 | 0.795 | 0.774 | 0.812 | 0.602 | 0.728 | 0.617 | 0.494 |
| VBVR-Wan2.2 + Pass@4 | 14.80 | 0.707 | 0.789 | 0.751 | 0.734 | 0.826 | 0.805 | 0.841 | 0.622 | 0.785 | 0.660 | 0.535 |
| VBVR-Wan2.2 + VideoTPO [4] | 11.10 | 0.650 | 0.717 | 0.723 | 0.698 | 0.641 | 0.744 | 0.816 | 0.582 | 0.767 | 0.619 | 0.513 |
| VBVR-Wan2.2 + CollabVR | 10.91 | 0.757 | 0.819 | 0.828 | 0.784 | 0.805 | 0.828 | 0.852 | 0.696 | 0.884 | 0.634 | 0.641 |
| Cosmos-Predict2.5 | 3.70 | 0.308 | 0.312 | 0.272 | 0.327 | 0.355 | 0.227 | 0.390 | 0.304 | 0.368 | 0.169 | 0.309 |
| Cosmos-Predict2.5 + CollabVR | 10.91 | 0.403 | 0.406 | 0.404 | 0.431 | 0.411 | 0.301 | 0.482 | 0.400 | 0.481 | 0.286 | 0.400 |

Key Findings:

  • CollabVR achieves the highest scores on both benchmarks for the tested VGMs at comparable or lower VGM cost than Pass@4.
  • Gains are most pronounced on categories requiring multi-step reasoning (Planning, Algorithmic, Spatial, Transformation).
  • CollabVR improves performance even on a reasoning-fine-tuned VGM (VBVR-Wan2.2), demonstrating orthogonality.
  • A blind user study showed human annotators preferred CollabVR outputs (73.8%) over Pass@4 (19.7%) and Pass@1 (6.5%).

Ablation Study

Table 3: Per-module ablation on Gen-ViRe and VBVR-Bench.

(M1 = Progressive Planning, M2 = Collaborative Reasoning.)

| Benchmark | M1 | M2 | Cost (s) | Overall | Δ |
|---|---|---|---|---|---|
| Gen-ViRe | – | – | 6.0 | 0.391 | – |
| Gen-ViRe | ✓ | – | 10.9 | 0.511 | +0.120 |
| Gen-ViRe | – | ✓ | 9.9 | 0.436 | +0.045 |
| Gen-ViRe | ✓ | ✓ | 17.8 | 0.531 | +0.140 |
| VBVR-Bench | – | – | 3.70 | 0.671 | – |
| VBVR-Bench | ✓ | – | 6.19 | 0.706 | +0.035 |
| VBVR-Bench | – | ✓ | 6.03 | 0.734 | +0.063 |
| VBVR-Bench | ✓ | ✓ | 10.91 | 0.757 | +0.086 |

  • Module Dominance: On Gen-ViRe (dominated by multi-step tasks), Progressive Planning (M1) is the larger contributor. On VBVR-Bench (dominated by single-step tasks), Collaborative Reasoning (M2) is the larger contributor. The framework's benefit thus adapts to the character of the task.
  • Effect of $N_{\text{max}}$: Performance on Gen-ViRe rises as $N_{\text{max}}$ increases to the required level (up to 3), then plateaus or degrades with further splitting (Figure 7).
  • VLM Choice: Performance degrades gracefully with weaker VLMs (Qwen3.5-27B, Qwen3.5-9B), but even Qwen3.5-9B with CollabVR surpasses all baselines using Gemini 2.5 Pro (Table 4).

Analysis

Category-wise Module Effectiveness (Figure 8):

  • Planning tasks benefit most from progressive planning (M1) (+0.165).
  • Analogy tasks (atomic transformations) benefit from verifier-driven re-sampling (M2) alone (+0.139).
  • The full pipeline (M1+M2) yields positive gains in every category, and is crucial for improving long-horizon Spatial tasks.
  • Smallest gains are on symbolic/abstract categories (Analogy, Abstract), where the VGM lacks the underlying capability.

Human-Annotated Benchmark for VLM Supervision Reliability: A benchmark was constructed to evaluate the VLM's decisions along three axes:

  1. Plan-depth match: VLM's predicted step count $N$ vs. human annotation.
  2. Clip-level verification agreement: VLM's accept/reject judgment vs. human annotation.
  3. Evolution quality: Human rating (1-3 scale) of the verifier's suggested repair.

Figure 9 shows results:

  • Gemini 2.5 Pro aligns most closely with human annotators on all axes.
  • Plan-depth match: Exact-match accuracy 68.0%, MAE 0.366.
  • Verification agreement: F1 score 0.750.
  • Evolution quality: Mean rating 2.61 (scale 1-3).
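
For concreteness, the three supervision axes above can be scored from paired VLM/human annotations roughly as follows. The function names and the choice of treating "reject" as the positive class for F1 are illustrative assumptions, not details taken from the paper.

```python
def plan_depth_metrics(vlm_steps, human_steps):
    """Exact-match accuracy and mean absolute error of predicted step counts."""
    n = len(vlm_steps)
    exact = sum(v == h for v, h in zip(vlm_steps, human_steps)) / n
    mae = sum(abs(v - h) for v, h in zip(vlm_steps, human_steps)) / n
    return exact, mae

def verification_f1(vlm_labels, human_labels, positive="reject"):
    """F1 of the VLM's accept/reject judgments against human labels."""
    pairs = list(zip(vlm_labels, human_labels))
    tp = sum(v == positive and h == positive for v, h in pairs)
    fp = sum(v == positive and h != positive for v, h in pairs)
    fn = sum(v != positive and h == positive for v, h in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def mean_evolution_rating(ratings):
    """Mean human rating (1-3 scale) of the verifier's suggested repairs."""
    return sum(ratings) / len(ratings)
```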

Theoretical and Practical Implications

  • Paradigm Shift: Redirects test-time compute from sampling more videos (Pass@k) to refining the one being constructed through step-level diagnosis and repair.
  • Complementary Strengths: Provides a principled framework to leverage the complementary strengths of VLMs (reasoning, planning, verification) and VGMs (visual simulation) in a synergistic closed loop.
  • Practical Effectiveness: Demonstrates consistent improvements over existing test-time scaling methods at matched compute, making it a cost-effective approach for enhancing video reasoning.
  • Orthogonality to Training: The test-time reasoning supervision is orthogonal to reasoning-oriented VGM fine-tuning, suggesting both approaches can be combined for further gains.
  • Generalizability: The framework is training-free and works with any off-the-shelf VGM and VLM, showing graceful degradation with weaker VLMs.

Conclusion

CollabVR is a closed-loop framework that couples a VLM and VGM at step-level granularity for collaborative video reasoning. It addresses key VGM failure modes through progressive planning and failure-aware collaborative reasoning. The framework consistently improves performance on benchmark tasks for both open- and closed-source VGMs over existing baselines. Human annotation confirms the reliability of the VLM as a supervisor. Limitations include inability to overcome fundamental VGM capability gaps (e.g., symbolic transformations) and imperfect verifier accuracy. Future directions include integrating reasoning-oriented VGM training and finer-grained failure localization into the test-time loop.