# CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

> CollabVR improves video reasoning by having a vision-language model plan and verify each step of a video generation model's output, catching errors immediately and correcting them.

- **Source:** [arXiv](https://arxiv.org/abs/2605.08735)
- **Published:** 2026-05-13
- **Permalink:** https://picx.dev/p/4jW4rO
- **Whiteboard:** https://picx.dev/p/4jW4rO/image

## Summary (Overview)
* **Core Contribution:** Introduces **CollabVR**, a closed-loop framework that couples a Vision-Language Model (VLM) as a planner/verifier with a Video Generation Model (VGM) at a step-by-step granularity for video reasoning tasks.
* **Key Mechanism:** The VLM **progressively plans** the immediate next action and **verifies** each short clip generated by the VGM, diagnosing failures and folding corrective feedback into the next prompt. This addresses two key VGM failure modes: **long-horizon drift** and **mid-clip execution errors**.
* **Empirical Results:** CollabVR consistently improves over single inference, Pass@k sampling, and prior test-time scaling methods (such as VideoTPO) on two benchmarks (Gen-ViRe and VBVR-Bench), for both open-source (VBVR-Wan2.2) and closed-source (Veo 3.1) VGMs, at matched computational cost.
* **Orthogonality:** The framework's test-time reasoning supervision is **orthogonal and stackable** with reasoning-oriented fine-tuning of VGMs, yielding further improvements on a fine-tuned model.
* **Validation:** A human-annotated benchmark confirms that the VLM's supervisory decisions (plan depth, verification, and prompt evolution) align with expert judgment.

## Introduction and Theoretical Foundation
Recent "Thinking with Video" paradigms use Video Generation Models (VGMs) to produce temporally coherent **Chain-of-Frames** as reasoning artifacts. However, VGMs exhibit complementary strengths and weaknesses compared to VLMs:
* **VLMs** excel at logical reasoning, planning, and abstract inference but are weak at direct visual simulation.
* **VGMs** excel at short-horizon visual simulation, detail, and physical coherence but are weak at abstract reasoning and long-range consistency.

VGMs trained for perceptual quality exhibit two recurring failure modes in goal-directed tasks:
1.  **Overloaded-prompt failure (Long-horizon drift):** When a single prompt specifies a multi-step task, the VGM collapses it into one short rollout, deviating from the intended trajectory due to lack of planning capacity.
2.  **Execution failure (Mid-clip errors):** Localized errors within a single clip (e.g., agent crossing a wall, object identity loss) propagate and contaminate the entire trajectory.

The root cause is the **absence of an explicit, corrigible reasoning process** on top of the VGM's strong short-horizon visual prior. A VLM can naturally serve as a reasoning supervisor, but the timing of its intervention matters: upfront planning commits to a full plan before any frame is generated, and post-hoc critiques over whole videos intervene too late. CollabVR proposes **step-level coupling** to catch failures immediately and repair them.

## Methodology
CollabVR is a closed-loop, training-free framework that treats video reasoning as a **construction problem**. The correct trajectory is assembled stepwise through VLM planning and VGM execution.

### Problem Formulation
A task is specified by an input image $I_0$ and a task prompt $q$. The goal is to produce a video $V$ that realizes the reasoning. The framework uses:
* A **VLM-based planner/verifier** $\pi$ (queried as $\pi_{\text{plan}}$ and $\pi_{\text{verify}}$).
* An **image-to-video generator** $g$ that maps a conditioning frame $f$ and an action prompt $a_t$ to a short clip $c_t$.

The system maintains:
* $f$: the latest conditioning frame (initially $I_0$).
* $H$: the history of accepted clips.

The output is the concatenation of accepted clips: $V = c_1 \oplus \cdots \oplus c_N$.

The core algorithm is detailed in **Algorithm 1** (provided in the paper). The loop iterates up to a maximum planning step $N_{\text{max}}$ with a per-step attempt budget $M$.
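
The state bookkeeping described above can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 1: the names `State`, `run_outer_loop`, `toy_plan`, and `toy_execute` are hypothetical, frames are represented as strings, and two behaviours are assumptions that this summary does not specify (the planner signals completion by returning `None`, and the next conditioning frame is the last frame of the accepted clip).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Frame = str          # stand-in for an image; a real system would use arrays or tensors
Clip = List[Frame]   # stand-in for a short clip (a sequence of frames)


@dataclass
class State:
    frame: Frame                                      # f: latest conditioning frame, initially I_0
    history: List[Clip] = field(default_factory=list)  # H: accepted clips so far


def run_outer_loop(
    i0: Frame,
    q: str,
    plan: Callable[[Frame, str, List[Clip]], Optional[str]],
    execute_step: Callable[[Frame, str], Clip],
    n_max: int,
) -> List[Frame]:
    """Assemble a video step by step, for at most n_max planning steps."""
    state = State(frame=i0)
    for _ in range(n_max):
        action = plan(i0, q, state.history)        # a_t <- pi_plan(I_0, q, H)
        if action is None:                         # assumption: None means "task complete"
            break
        clip = execute_step(state.frame, action)   # one accepted clip c_t for this step
        state.history.append(clip)
        state.frame = clip[-1]                     # assumption: condition on the clip's last frame
    # V = c_1 (+) ... (+) c_N: concatenate the accepted clips in order
    return [frame for clip in state.history for frame in clip]


# Toy stand-ins for the VLM planner and the VGM (purely illustrative).
def toy_plan(i0: Frame, q: str, history: List[Clip]) -> Optional[str]:
    steps = ["move right", "move up"]
    return steps[len(history)] if len(history) < len(steps) else None


def toy_execute(frame: Frame, action: str) -> Clip:
    return [f"{frame}>{action}#{k}" for k in range(2)]


video = run_outer_loop("I0", "reach the goal", toy_plan, toy_execute, n_max=3)
print(len(video), "frames")
```

In this sketch `execute_step` hides whatever happens inside a single step, including any re-sampling, so the outer loop only ever sees accepted clips.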

### Core Modules
**1. VLM-Driven Progressive Planning (Module 1):**
Addresses overloaded-prompt failure. Instead of pre-planning all steps upfront, the VLM **adaptively decides** the step count and plans **only the immediate next action** $a_t$ conditioned on previously generated frames and the task prompt.
$$a_t \gets \pi_{\text{plan}}(I_0, q, H)$$
This allows the plan to adapt to the VGM's actual output, mitigating long-horizon drift.

**2. VLM-VGM Collaborative Reasoning (Module 2):**
Addresses execution failure. For each generated clip $c_t$, the VLM verifier $\pi_{\text{verify}}$ produces a structured judgment $(v, d)$ where:
* $v \in \{\text{accept}, \text{reject}\}$
* $d$ is a textual reason and actionable suggestion for repair.

If $v = \text{reject}$, the current action prompt is evolved using the diagnosis: $a_t \gets \text{evolve}(a_t, d)$. The VGM is then re-sampled with the revised prompt, within the per-step attempt budget $M$. This localizes and repairs errors before they compound.
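
The verify-and-evolve step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `Judgment`, `evolve`, `execute_with_verification`, and the toy generator and verifier are hypothetical names, the way the diagnosis is folded into the prompt is a plain string template chosen for the example, and returning the last attempt when the budget runs out is an assumption, since this summary does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Judgment:
    accept: bool     # v in {accept, reject}
    diagnosis: str   # d: textual reason plus a suggested repair


def evolve(action: str, diagnosis: str) -> str:
    # Assumption: fold the diagnosis into the prompt as plain text.
    return f"{action}. Avoid: {diagnosis}"


def execute_with_verification(
    frame: str,
    action: str,
    generate: Callable[[str, str], str],
    verify: Callable[[str, str, str], Judgment],
    max_attempts: int,
) -> Tuple[str, List[str]]:
    """Sample a clip, verify it, and retry with an evolved prompt on rejection."""
    prompt = action
    prompts_tried: List[str] = []
    clip = ""
    for _ in range(max_attempts):
        prompts_tried.append(prompt)
        clip = generate(frame, prompt)             # c_t <- g(f, a_t)
        judgment = verify(frame, prompt, clip)     # (v, d) <- pi_verify(...)
        if judgment.accept:
            break
        prompt = evolve(prompt, judgment.diagnosis)  # a_t <- evolve(a_t, d)
    # Assumption: if every attempt is rejected, the last clip is returned anyway.
    return clip, prompts_tried


# Toy stand-ins: the generator "crosses a wall" unless the prompt carries feedback.
def toy_generate(frame: str, prompt: str) -> str:
    return "clip:ok" if "Avoid" in prompt else "clip:crossed-wall"


def toy_verify(frame: str, prompt: str, clip: str) -> Judgment:
    if "crossed-wall" in clip:
        return Judgment(False, "the agent crossed a wall")
    return Judgment(True, "")


clip, prompts = execute_with_verification(
    "I0", "move the agent to the exit", toy_generate, toy_verify, max_attempts=3
)
print(clip, len(prompts))
```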

## Empirical Validation / Results

### Experimental Setup
* **Benchmarks:** Gen-ViRe (72 samples, 6 categories, VLM-judged) and VBVR-Bench (500 samples, 5 categories, rule-based evaluation).
* **VGMs:** VBVR-Wan2.2 (open-source), Veo 3.1 (closed-source), Cosmos-Predict2.5.
* **VLM:** Gemini 2.5 Pro as default planner/verifier.
* **Baselines:** Single Inference (Pass@1), Pass@k ($k=2,4$), VideoTPO.
* **Cost Metric:** Total seconds of video generated by the VGM per sample (VLM compute is treated as negligible).
* **CollabVR Configuration:** $N_{\text{max}}=3$, per-step attempt budget $M=3$.

### Main Results
**Table 1: Benchmarking results on Gen-ViRe.**

| Category | Method | VGM Cost (s) | Avg. | Abst. | Algo. | Analog. | Perc. | Plan. | Spat. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Open-source Video Models** | | | | | | | | | |
| | VBVR-Wan2.2 | 6.0 | 0.391 | 0.479 | 0.415 | 0.250 | 0.261 | 0.554 | 0.387 |
| | VBVR-Wan2.2 + Pass@2 | 12.0 | 0.398 | 0.576 | 0.437 | 0.278 | 0.257 | 0.481 | 0.357 |
| | VBVR-Wan2.2 + Pass@4 | 24.0 | 0.438 | 0.622 | 0.418 | 0.250 | 0.275 | 0.604 | 0.462 |
| | VBVR-Wan2.2 + VideoTPO [4] | 30.0 | 0.488 | 0.535 | 0.443 | 0.417 | 0.313 | 0.671 | 0.552 |
| | **VBVR-Wan2.2 + CollabVR** | **17.8** | **0.531** | **0.569** | **0.606** | **0.333** | **0.367** | **0.821** | **0.488** |
| **Closed-source Video Models** | | | | | | | | | |
| | Veo 3.1 | 8.0 | 0.481 | 0.420 | 0.512 | 0.361 | 0.274 | 0.744 | 0.573 |
| | Veo 3.1 + Pass@2 | 16.0 | 0.491 | 0.458 | 0.587 | 0.389 | 0.242 | 0.721 | 0.571 |
| | Veo 3.1 + Pass@4 | 32.0 | 0.509 | 0.425 | 0.573 | 0.417 | 0.296 | 0.726 | 0.646 |
| | **Veo 3.1 + CollabVR** | **21.4** | **0.550** | **0.434** | **0.641** | **0.472** | **0.325** | **0.768** | **0.657** |

**Table 2: Benchmarking results on VBVR-Bench.** ID = In-Domain, OOD = Out-of-Domain; per-category columns follow each average.

| Models | VGM Cost (s) | Overall | ID Avg. | ID Abst. | ID Know. | ID Perc. | ID Spat. | ID Trans. | OOD Avg. | OOD Abst. | OOD Know. | OOD Perc. | OOD Spat. | OOD Trans. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| VBVR-Wan2.2 | 3.70 | 0.671 | 0.762 | 0.701 | 0.746 | 0.802 | 0.793 | 0.803 | 0.577 | 0.674 | 0.674 | 0.503 | 0.528 | 0.633 |
| VBVR-Wan2.2 + Pass@2 | 7.40 | 0.694 | 0.783 | 0.791 | 0.742 | 0.795 | 0.774 | 0.812 | 0.602 | 0.728 | 0.617 | 0.494 | 0.532 | 0.701 |
| VBVR-Wan2.2 + Pass@4 | 14.80 | 0.707 | 0.789 | 0.751 | 0.734 | 0.826 | 0.805 | 0.841 | 0.622 | 0.785 | 0.660 | 0.535 | 0.577 | 0.683 |
| VBVR-Wan2.2 + VideoTPO [4] | 11.10 | 0.650 | 0.717 | 0.723 | 0.698 | 0.641 | 0.744 | 0.816 | 0.582 | 0.767 | 0.619 | 0.513 | 0.540 | 0.572 |
| **VBVR-Wan2.2 + CollabVR** | **10.91** | **0.757** | **0.819** | **0.828** | **0.784** | **0.805** | **0.828** | **0.852** | **0.696** | **0.884** | **0.634** | **0.641** | **0.608** | **0.720** |
| Cosmos-Predict2.5 | 3.70 | 0.308 | 0.312 | 0.272 | 0.327 | 0.355 | 0.227 | 0.390 | 0.304 | 0.368 | 0.169 | 0.309 | 0.377 | 0.274 |
| Cosmos-Predict2.5 + CollabVR | 10.91 | 0.403 | 0.406 | 0.404 | 0.431 | 0.411 | 0.301 | 0.482 | 0.400 | 0.481 | 0.286 | 0.400 | 0.471 | 0.346 |

**Key Findings:**
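* CollabVR achieves the highest average score on Gen-ViRe and the highest overall score on VBVR-Bench for the tested VGMs, at lower VGM cost than Pass@4 in each setting where both are reported (for example, 17.8 s versus 24.0 s on Gen-ViRe with VBVR-Wan2.2).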
* Gains are most pronounced on categories requiring **multi-step reasoning** (Planning, Algorithmic, Spatial, Transformation).
* CollabVR improves performance even on a reasoning-fine-tuned VGM (VBVR-Wan2.2), demonstrating orthogonality.
* A blind user study showed human annotators preferred CollabVR outputs (73.8%) over Pass@4 (19.7%) and Pass@1 (6.5%).
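
The cost comparison with Pass@4 can be checked directly from the tables. The short Python sketch below copies the cost and score values from Tables 1 and 2; the dictionary layout and the `compare` helper are illustrative choices, not part of the paper.

```python
# Values copied from the cost and score columns of Tables 1 and 2.
runs = {
    "Gen-ViRe / VBVR-Wan2.2": {
        "single": 6.0, "pass4": 24.0, "collabvr": 17.8,
        "pass4_score": 0.438, "collabvr_score": 0.531,
    },
    "Gen-ViRe / Veo 3.1": {
        "single": 8.0, "pass4": 32.0, "collabvr": 21.4,
        "pass4_score": 0.509, "collabvr_score": 0.550,
    },
    "VBVR-Bench / VBVR-Wan2.2": {
        "single": 3.70, "pass4": 14.80, "collabvr": 10.91,
        "pass4_score": 0.707, "collabvr_score": 0.757,
    },
}


def compare(run):
    """Return (CollabVR cost as a fraction of Pass@4 cost, score gain over Pass@4)."""
    ratio = run["collabvr"] / run["pass4"]
    gain = run["collabvr_score"] - run["pass4_score"]
    return ratio, gain


for name, run in runs.items():
    # Pass@4 generates four independent samples, so its cost is 4x single inference.
    assert abs(run["pass4"] - 4 * run["single"]) < 1e-9
    ratio, gain = compare(run)
    print(f"{name}: {ratio:.0%} of the Pass@4 video budget, score {gain:+.3f}")
```

In all three settings the ratio is below 1 and the gain is positive, which is the sense in which test-time compute is redirected rather than increased.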

### Ablation Study
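**Table 3: Per-module ablation on Gen-ViRe and VBVR-Bench.** M1 = Progressive Planning, M2 = Collaborative Reasoning; $\Delta$ is the gain in Overall over the row with both modules disabled.

| Benchmark | M1 | M2 | Cost (s) | Overall | $\Delta$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Gen-ViRe | ✗ | ✗ | 6.0 | 0.391 | – |
| Gen-ViRe | ✓ | ✗ | 10.9 | 0.511 | +0.120 |
| Gen-ViRe | ✗ | ✓ | 9.9 | 0.436 | +0.045 |
| Gen-ViRe | ✓ | ✓ | 17.8 | 0.531 | +0.140 |
| VBVR-Bench | ✗ | ✗ | 3.70 | 0.671 | – |
| VBVR-Bench | ✓ | ✗ | 6.19 | 0.706 | +0.035 |
| VBVR-Bench | ✗ | ✓ | 6.03 | 0.734 | +0.063 |
| VBVR-Bench | ✓ | ✓ | 10.91 | 0.757 | +0.086 |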

* **Module Dominance:** On **Gen-ViRe** (dominated by multi-step tasks), **Progressive Planning (M1)** is the larger contributor. On **VBVR-Bench** (dominated by single-step tasks), **Collaborative Reasoning (M2)** is the larger contributor. The framework adapts to task character.
* **Effect of $N_{\text{max}}$:** Performance on Gen-ViRe rises as $N_{\text{max}}$ increases to the required level (up to 3), then plateaus or degrades with further splitting (Figure 7).
* **VLM Choice:** Performance degrades gracefully with weaker VLMs (Qwen3.5-27B, Qwen3.5-9B), but even Qwen3.5-9B with CollabVR surpasses all baselines using Gemini 2.5 Pro (Table 4).
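
The module-dominance observation follows from simple arithmetic on the ablation scores. The sketch below recomputes each gain over the no-module baseline and reports which single module contributes more per benchmark; the scores are copied from Table 3, while the dictionary keys and helper names are illustrative.

```python
# Overall scores copied from Table 3 ("none" = both modules disabled).
ablation = {
    "Gen-ViRe":   {"none": 0.391, "M1": 0.511, "M2": 0.436, "M1+M2": 0.531},
    "VBVR-Bench": {"none": 0.671, "M1": 0.706, "M2": 0.734, "M1+M2": 0.757},
}


def deltas(scores):
    """Gain of each configuration over the no-module baseline, rounded to 3 decimals."""
    base = scores["none"]
    return {k: round(v - base, 3) for k, v in scores.items() if k != "none"}


def larger_single_module(scores):
    """Name of the single module (M1 or M2) with the larger gain."""
    d = deltas(scores)
    return "M1" if d["M1"] > d["M2"] else "M2"


for bench, scores in ablation.items():
    print(bench, deltas(scores), "larger single-module gain:", larger_single_module(scores))
```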

### Analysis
**Category-wise Module Effectiveness (Figure 8):**
* **Planning** tasks benefit most from **progressive planning (M1)** (+0.165).
* **Analogy** tasks (atomic transformations) benefit from **verifier-driven re-sampling (M2)** alone (+0.139).
* The **full pipeline (M1+M2)** yields positive gains in every category, and is crucial for improving **long-horizon Spatial tasks**.
* Smallest gains are on **symbolic/abstract categories** (Analogy, Abstract), where the VGM lacks the underlying capability.

**Human-Annotated Benchmark for VLM Supervision Reliability:**
A benchmark was constructed to evaluate the VLM's decisions along three axes:
1.  **Plan-depth match:** VLM's predicted step count $N$ vs. human annotation.
2.  **Clip-level verification agreement:** VLM's accept/reject judgment vs. human annotation.
3.  **Evolution quality:** Human rating (1-3 scale) of the verifier's suggested repair.

**Figure 9** shows results:
* Gemini 2.5 Pro aligns most closely with human annotators on all axes.
* Plan-depth match: Exact-match accuracy 68.0%, MAE 0.366.
* Verification agreement: F1 score 0.750.
* Evolution quality: Mean rating 2.61 (scale 1-3).
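
For readers unfamiliar with the agreement metrics reported above, the following Python sketch shows how exact-match accuracy, mean absolute error (MAE), and F1 are computed from paired VLM and human labels. The labels are made-up toy data, not the paper's annotations, and treating `reject` as the positive class for F1 is an assumption, since this summary does not state which class is scored.

```python
from typing import Sequence, Tuple


def exact_match_and_mae(pred: Sequence[int], gold: Sequence[int]) -> Tuple[float, float]:
    """Exact-match accuracy and mean absolute error between predicted and annotated step counts."""
    n = len(gold)
    exact = sum(p == g for p, g in zip(pred, gold)) / n
    mae = sum(abs(p - g) for p, g in zip(pred, gold)) / n
    return exact, mae


def f1_score(pred: Sequence[str], gold: Sequence[str], positive: str) -> float:
    """F1 of the `positive` label, with human annotations as ground truth."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for four tasks / clips.
vlm_steps, human_steps = [1, 2, 3, 2], [1, 2, 2, 2]
vlm_verdicts = ["accept", "reject", "reject", "accept"]
human_verdicts = ["accept", "reject", "accept", "accept"]

exact, mae = exact_match_and_mae(vlm_steps, human_steps)
f1 = f1_score(vlm_verdicts, human_verdicts, positive="reject")  # assumption: reject is positive
print(f"exact match {exact:.2f}, MAE {mae:.2f}, F1 {f1:.3f}")
```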

## Theoretical and Practical Implications
* **Paradigm Shift:** Redirects test-time compute from **sampling more videos** (Pass@k) to **refining the one being constructed** through step-level diagnosis and repair.
* **Complementary Strengths:** Provides a principled framework to leverage the complementary strengths of VLMs (reasoning, planning, verification) and VGMs (visual simulation) in a synergistic closed loop.
* **Practical Effectiveness:** Demonstrates consistent improvements over existing test-time scaling methods at matched compute, making it a cost-effective approach for enhancing video reasoning.
* **Orthogonality to Training:** The test-time reasoning supervision is orthogonal to reasoning-oriented VGM fine-tuning, suggesting both approaches can be combined for further gains.
* **Generalizability:** The framework is training-free and works with any off-the-shelf VGM and VLM, showing graceful degradation with weaker VLMs.

## Conclusion
CollabVR is a closed-loop framework that couples a VLM and VGM at step-level granularity for collaborative video reasoning. It addresses key VGM failure modes through **progressive planning** and **failure-aware collaborative reasoning**. The framework consistently improves performance on benchmark tasks for both open- and closed-source VGMs over existing baselines. Human annotation confirms the reliability of the VLM as a supervisor. Limitations include inability to overcome fundamental VGM capability gaps (e.g., symbolic transformations) and imperfect verifier accuracy. Future directions include integrating reasoning-oriented VGM training and finer-grained failure localization into the test-time loop.

