VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Summary (Overview)

  • Problem Addressed: Long video understanding is challenging for Multimodal Large Language Models (MLLMs) due to limited context windows. Existing methods for localizing query-relevant segments rely solely on query-to-video matching, ignoring the intrinsic temporal and semantic structure of the video itself.
  • Core Solution: Proposes VideoDetective, a plug-and-play inference framework that integrates extrinsic query relevance and intrinsic video correlations (modeled as a Visual-Temporal Affinity Graph) to hunt for critical clues with sparse observations.
  • Key Mechanism: Introduces an iterative Hypothesis-Verification-Refinement loop. It uses graph diffusion to propagate relevance scores from sparsely observed "anchor" segments to unvisited ones, building a global "belief field" that guides efficient clue localization.
  • Main Results: Demonstrates consistent and substantial performance gains across a wide range of MLLM backbones (from 7B to 72B parameters) on four long-video benchmarks. Achieves accuracy improvements of up to +7.5% on VideoMME-long and enables a 20B model to outperform proprietary models like GPT-4o on LongVideoBench.
  • Efficiency: Achieves state-of-the-art accuracy with moderate computational cost (~10k tokens per video), positioning itself optimally on the efficiency-accuracy Pareto frontier compared to other methods and large proprietary models.

Introduction and Theoretical Foundation

Long video understanding is a central challenge in multimodal AI. While specialized MLLMs are emerging, processing massive video information within limited context windows remains a critical bottleneck. The dominant strategy is to localize only query-relevant segments to reduce context length. However, existing paradigms—key-frame sampling, retrieval-augmented, and agent-based methods—primarily follow a unidirectional query-to-video search, matching content based purely on query information. They largely overlook the video's intrinsic structure: its coherent temporal dynamics, causal continuity, and semantic correlations between segments.

This paper is motivated by the insight that a video's internal structure can be exploited to "see the whole from a part," maintaining global understanding from sparse observations. The authors argue against assuming a single step can pinpoint informative regions or that the process must restart if early guesses fail. Instead, they propose to jointly leverage the query and the video's intrinsic inter-segment correlations to model a query-relevance distribution over the entire video from sparse observations, maximizing the information gain per observation under a limited budget.

The theoretical foundation combines concepts from:

  1. Graph-based Semi-Supervised Learning: Modeling the video as a graph where nodes are segments and edges encode affinity, allowing label (relevance) propagation from a few labeled nodes (observed segments) to many unlabeled ones.
  2. Active Learning / Iterative Refinement: Employing a hypothesis-test loop to dynamically select the most informative points (segments) to observe next, based on current state estimates.
  3. Manifold Regularization: The framework's "Refinement" step is formalized as minimizing an objective function that enforces consistency with observed data and smoothness over the graph manifold, ensuring relevance diffuses along semantically and temporally valid paths.

Methodology

VideoDetective formulates long-video QA as iterative relevance state estimation on a graph. The framework has three core stages: Graph Construction, the Hypothesis-Verification-Refinement Loop, and Final Segment Selection.

1. Visual-Temporal Affinity Graph Construction

The video is modeled as a graph $G = (V, E)$ to define how relevance propagates.

  • Nodes ($V$): The video is divided into $K$ semantic segments $\{c_i\}_{i=1}^K$ based on visual similarity boundaries. Each node $i$ is represented by the normalized mean of its frame features: $h_i = \text{norm}\big( |c_i|^{-1} \sum_{t \in c_i} f_t \big)$.
  • Affinity Matrix ($W$): Edges combine visual affinity (cosine similarity) and temporal affinity (exponential decay): $(W_{\text{sim}})_{ij} = \max\{0, \langle h_i, h_j \rangle\}$ and $(W_{\text{time}})_{ij} = \exp\left(-\frac{|t_i - t_j|}{\tau}\right)$. The final affinity is a weighted combination: $W = \alpha W_{\text{sim}} + (1 - \alpha) W_{\text{time}}$. The matrix is then sparsified (top-$k$ connections), symmetrized, and symmetrically normalized for stable diffusion: $W_{\text{norm}} \triangleq D^{-1/2} \tilde{W} D^{-1/2}$, where $\tilde{W}$ is the sparsified, symmetrized matrix and $D$ is its degree matrix.
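As a concrete illustration, the graph construction above can be sketched in NumPy. The hyperparameter values (`alpha`, `tau`, `topk`) are placeholders, not the paper's reported settings:

```python
import numpy as np

def build_affinity_graph(h, t, alpha=0.5, tau=30.0, topk=8):
    """Sketch of the Visual-Temporal Affinity Graph construction.

    h : (K, d) L2-normalized segment features
    t : (K,) segment center timestamps (e.g., seconds)
    Returns the sparsified affinity W_tilde and its symmetric
    normalization W_norm = D^{-1/2} W_tilde D^{-1/2}.
    """
    # Visual affinity: non-negative cosine similarity between segments.
    W_sim = np.maximum(0.0, h @ h.T)
    # Temporal affinity: exponential decay with temporal distance.
    W_time = np.exp(-np.abs(t[:, None] - t[None, :]) / tau)
    W = alpha * W_sim + (1 - alpha) * W_time
    np.fill_diagonal(W, 0.0)
    # Sparsify: keep each node's top-k strongest connections.
    keep = np.argsort(-W, axis=1)[:, :topk]
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    W_tilde = np.where(mask, W, 0.0)
    # Symmetrize, then apply symmetric normalization for stable diffusion.
    W_tilde = np.maximum(W_tilde, W_tilde.T)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W_tilde.sum(axis=1), 1e-12))
    W_norm = d_inv_sqrt[:, None] * W_tilde * d_inv_sqrt[None, :]
    return W_tilde, W_norm
```

Symmetric normalization keeps the spectral radius of `W_norm` at most 1, which is what makes the later diffusion iteration stable.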

2. Hypothesis-Verification-Refinement Loop

The core iterative process maintains two state vectors:

  • Injection Vector $Y^{(t)} \in \mathbb{R}^K$: A sparse vector recording verified relevance scores $s_i$ for visited segments.
  • Belief Field $F^{(t)} \in \mathbb{R}^K$: A dense global relevance distribution inferred from $Y^{(t)}$ via graph propagation.

The loop for each iteration $t$ is:

  • Hypothesis (Anchor Selection): The next segment to observe is selected dynamically.
    1. Query Decomposition: An LLM rewrites the query $q$ into $R$ semantic facets $\{f_r\}$, each with a keyword set $K_r$ and a semantic description set $P_r$.
    2. Selection Policies:
      • Initialization ($t=0$): For each facet $r$, a hybrid prior score fuses visual and semantic matching: $(Y^{\text{prior}}_r)_i = \alpha \cdot \max_{w \in K_r} \langle \phi_T(w), h_i \rangle + (1-\alpha) \cdot \max_{p \in P_r} \langle \psi(p), \psi(e_i) \rangle$. The segment with the maximum score is chosen.
      • Iterative Sampling ($t \geq 1$): If evidence for a facet is insufficient ("Case A"), select an unvisited neighbor with a strong graph connection and high belief: $i^{\star(t)} \leftarrow \arg\max_{j \in \mathcal{U},\, \tilde{W}_{i^{\star}j} > 0} \big( \tilde{W}_{i^{\star}j} \cdot F^{(t-1)}_j \big)$. If all facets are resolved ("Case B"), perform global gap filling: $i^{\star(t)} = \arg\max_i \big( F^{(t-1)}_i \cdot (1 - v^{(t-1)}_i) \big)$, where $v$ is the visited mask.
  • Verification (Observation & Scoring): The selected anchor segment $i$ is observed.
    1. Multimodal Evidence Extraction: Extract a multi-source evidence set $\mathcal{E}_i = \{e_i^{\text{cap}}, e_i^{\text{ocr}}, e_i^{\text{asr}}\}$ (VLM caption, OCR text, ASR transcript).
    2. Relevance Scoring: For each evidence item ee, compute a source-aware score combining lexical (sparse) and semantic (dense) similarity:
      • Lexical: $s_{\text{lex}}(e, f_r) = \min\left( 1, \frac{\sum_{t \in e \cap K_r} \text{IDF}(t)}{Z_{\text{lex}}} \right)$
      • Semantic: $s_{\text{sem}}(e, f_r) = \max_{p \in P_r} \frac{\langle \psi(e), \psi(p) \rangle}{\|\psi(e)\|_2 \|\psi(p)\|_2 + \epsilon}$
      • Fused: $s(e, f_r) = \lambda_{\text{src}(e)}\, s_{\text{lex}}(e, f_r) + (1 - \lambda_{\text{src}(e)})\, s_{\text{sem}}(e, f_r)$, where the weights vary by source (OCR: $\lambda=0.7$, ASR: $\lambda=0.5$, Caption: $\lambda=0.3$).
    3. The node's score is the maximum across sources and facets: $s_i = \max_{e \in \mathcal{E}_i,\, r \in \{1,\dots,R\}} s(e, f_r)$. This score is injected into $Y$.
  • Refinement (Belief Propagation): The injection scores are propagated via graph diffusion to update the global belief field $F$. This minimizes a cost function: $J(F) = \underbrace{\|F - Y\|^2_2}_{\text{Consistency}} + \mu \underbrace{F^\top L F}_{\text{Smoothness on manifold}}$, where $L = I - D^{-1/2} \tilde{W} D^{-1/2}$ is the normalized Laplacian. The efficient iterative update is $F^{(t+1)} = \beta W_{\text{norm}} F^{(t)} + (1 - \beta) Y^{(t+1)}$, where $\beta = \mu/(1+\mu)$.
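The two anchor-selection policies in the Hypothesis step (Cases A and B) can be sketched as follows; the function name and argument layout are illustrative, not the paper's implementation:

```python
import numpy as np

def select_next_anchor(F, W_tilde, i_star, visited, facets_resolved):
    """Sketch of iterative anchor selection (Cases A and B).

    F               : (K,) current belief field
    W_tilde         : (K, K) sparsified affinity matrix
    i_star          : index of the current anchor segment
    visited         : (K,) boolean mask of observed segments
    facets_resolved : True if every query facet already has evidence
    """
    if not facets_resolved:
        # Case A: expand to the unvisited neighbor of i_star with the
        # strongest connection-weighted belief.
        scores = W_tilde[i_star] * F
        scores[visited | (W_tilde[i_star] <= 0)] = -np.inf
    else:
        # Case B: global gap filling -- highest belief among unvisited nodes.
        scores = np.where(~visited, F, -np.inf)
    return int(np.argmax(scores))
```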
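The source-aware score fusion of the Verification step can be sketched as below. The text encoder `embed`, the IDF table, and the normalizer `z_lex` are stand-ins for the paper's actual components:

```python
import numpy as np

# Source-specific lexical weights, taken from the paper's description.
LAMBDA_SRC = {"ocr": 0.7, "asr": 0.5, "cap": 0.3}

def fused_score(evidence, facets, embed, idf, z_lex=5.0, eps=1e-8):
    """Sketch of the node relevance score s_i for one segment.

    evidence : list of (source, text) pairs, source in {"cap", "ocr", "asr"}
    facets   : list of {"keywords": set, "descriptions": [str, ...]}
    embed    : callable text -> embedding vector (assumed encoder)
    idf      : dict mapping token -> IDF weight
    """
    best = 0.0
    for src, text in evidence:
        tokens = set(text.lower().split())
        e_vec = embed(text)
        for facet in facets:
            # Lexical score: IDF-weighted keyword overlap, capped at 1.
            overlap = tokens & facet["keywords"]
            s_lex = min(1.0, sum(idf.get(tok, 0.0) for tok in overlap) / z_lex)
            # Semantic score: best cosine similarity to a facet description.
            s_sem = max(
                float(np.dot(e_vec, p_vec)
                      / (np.linalg.norm(e_vec) * np.linalg.norm(p_vec) + eps))
                for p_vec in (embed(p) for p in facet["descriptions"])
            )
            lam = LAMBDA_SRC[src]
            # Max over sources and facets gives the injected score.
            best = max(best, lam * s_lex + (1 - lam) * s_sem)
    return best
```

Note how OCR evidence leans on exact keyword matches (high $\lambda$) while captions lean on semantic similarity, matching the source-dependent weights above.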
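The Refinement update reduces to a few lines of NumPy. A minimal sketch, assuming `W_norm` was built with symmetric normalization so the iteration is a contraction for $\beta < 1$:

```python
import numpy as np

def refine_belief(W_norm, Y, beta=0.8, iters=50):
    """Belief-field refinement via graph diffusion (sketch).

    Iterates F <- beta * W_norm @ F + (1 - beta) * Y, the fixed-point
    update of the consistency-plus-smoothness objective, with
    beta = mu / (1 + mu).
    W_norm : (K, K) symmetrically normalized affinity matrix
    Y      : (K,) sparse injection vector of verified relevance scores
    """
    F = Y.copy()
    for _ in range(iters):
        F = beta * (W_norm @ F) + (1 - beta) * Y
    return F
```

Because the spectral radius of `W_norm` is at most 1, the iteration converges to the closed-form solution $F^\star = (1-\beta)(I - \beta W_{\text{norm}})^{-1} Y$; the iterative form avoids the $O(K^3)$ matrix inverse.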

3. Segment Selection via Graph-NMS

After the loop concludes, the converged belief field FF is used to select a final, diverse set of key segments for the MLLM. Graph-NMS is applied: it selects high-confidence nodes while suppressing their neighbors on the affinity graph to avoid redundancy, explicitly ensuring at least one high-belief node is retained per query facet.
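A minimal sketch of the Graph-NMS selection, omitting the per-facet coverage guarantee for brevity (the budget and greedy policy here are illustrative):

```python
import numpy as np

def graph_nms(F, W_tilde, budget=8):
    """Greedy non-maximum suppression over the affinity graph (sketch).

    Repeatedly picks the highest-belief node, then suppresses its graph
    neighbors so the selected segments are diverse rather than redundant.
    F       : (K,) converged belief field
    W_tilde : (K, K) sparsified affinity matrix
    """
    F = F.astype(float).copy()
    selected = []
    for _ in range(budget):
        i = int(np.argmax(F))
        if F[i] <= 0:
            break
        selected.append(i)
        # Suppress the chosen node and every node connected to it.
        F[i] = -np.inf
        F[W_tilde[i] > 0] = -np.inf
    return sorted(selected)
```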

Empirical Validation / Results

Experiments were conducted on four benchmarks: VideoMME-long (w/o subtitles), LVBench, MLVU (Test), and LongVideoBench (Val).

Main Results

Table 1: Effectiveness Analysis across Different Backbones

| Backbone (LLM + VLM) | Method | Accuracy (%) |
| --- | --- | --- |
| Qwen3-8B + Qwen3VL-8B | LVNet | 40.4 |
| | DVD | 42.6 |
| | VideoAgent | 42.0 |
| | VideoRAG | 50.3 |
| | VideoDetective | 55.6 |
| Qwen3-30B + SeedVL-1.5 | LVNet | 51.7 |
| | DVD | 45.4 |
| | VideoAgent | 51.7 |
| | VideoRAG | 62.0 |
| | VideoDetective | 65.6 |

VideoDetective consistently outperforms other representative long-video frameworks (LVNet, DVD, VideoAgent, VideoRAG) under fair comparisons with the same backbones and frame budget (32 frames).

Table 2: Comparison with State-of-the-Art Models

| Model | Param | Frames | VideoMME | LVBench | MLVU | LongVideoBench |
| --- | --- | --- | --- | --- | --- | --- |
| *Proprietary Models* | | | | | | |
| GPT-4o | - | 384 | 65.3 | 48.9 | 54.9 | 66.7 |
| Gemini-1.5-Pro | - | 256 | 67.4 | 33.1 | 53.8 | 64.0 |
| SeedVL-1.5 | 20B(A) | 32 | 63.1 | 46.1 | 54.9 | 63.8 |
| *Open-Source (<30B)* | | | | | | |
| Qwen3-VL | 8B | 32 | 50.2 | 41.1 | 50.1 | 58.9 |
| InternVL-2.5 | 8B | 32 | 50.8 | 39.9 | 52.8 | 59.2 |
| VideoDetective (Qwen3-VL) | 8B | 32 | 55.6 | 43.2 | 56.3 | 60.2 |
| *Open-Source (≥30B)* | | | | | | |
| LLaVA-Video | 72B | 64 | 70.3 | 46.1 | - | 63.9 |
| VideoDetective (SeedVL-1.5) | 20B(A) | 32 | 65.6 | 51.3 | 63.8 | 67.9 |

Key Findings:

  1. Plug-and-play effectiveness: VideoDetective brings substantial gains to various backbones (e.g., +7.5% for InternVL-2.5, +7.0% for Oryx-1.5).
  2. State-of-the-art performance: When integrated with SeedVL-1.5 (20B), it achieves 67.9% on LongVideoBench, outperforming the much larger LLaVA-Video-72B (63.9%) and the proprietary GPT-4o (66.7%) and Gemini-1.5-Pro (64.0%).

Ablation Studies

Table 3: Ablation Study on VideoMME-long w/o subtitle

| Configuration | Accuracy (%) | Δ |
| --- | --- | --- |
| VideoDetective (Full) | 55.6 | - |
| w/o Graph Propagation | 51.4 | -4.2 |
| w/o Facet Decomposition & Iterative Refinement | 47.8 | -7.8 |
| w/o Iterative Refinement | 51.0 | -4.6 |
| w/o Textual Evidence | 49.9 | -5.7 |
| w/o Optimized Sampling | 50.7 | -4.9 |
| Baseline (Direct Inference) | 50.2 | -5.4 |

Each core component is essential:

  • Graph Propagation (-4.2%): Manifold smoothness is crucial for inferring unvisited regions.
  • Semantic Decomposition (-7.8%): Prevents noise from blind similarity propagation; acts as a "compass."
  • Iterative Loop (-4.6%): Enables evidence-driven correction of initial biases.
  • Multimodal Evidence (-5.7%): Visual and textual evidence are complementary.

Table 4: Modality Scaling Analysis

| LLM | VLM | Acc. (%) | Gain |
| --- | --- | --- | --- |
| Qwen3-8B | Qwen3-VL-8B | 55.6 | - |
| Qwen3-30B | Qwen3-VL-8B | 55.8 | +0.2 |
| Qwen3-8B | SeedVL-1.5 | 65.1 | +9.5 |
| Qwen3-30B | SeedVL-1.5 | 65.6 | +10.0 |

The performance bottleneck lies in the visual model (VLM), not the language planner (LLM). Upgrading the VLM causes a qualitative leap (+9.5%), while upgrading only the LLM yields marginal gains.

Efficiency Analysis

  • Token Efficiency: VideoDetective achieves competitive accuracy with moderate token consumption (~10k per video).
  • Pareto Frontier: It occupies the optimal position on the efficiency-accuracy curve, significantly outperforming baseline methods (VideoAgent, DVD) in accuracy while using far fewer tokens than proprietary models (GPT-4o, Gemini-1.5-Pro), which require ~$10^5$ tokens for similar performance.

Theoretical and Practical Implications

  • Theoretical Contribution: Provides a principled framework combining graph-based semi-supervised learning with active inference for long-video understanding. The formalization of the belief field via manifold regularization offers a solid theoretical grounding for relevance propagation in video data.
  • Paradigm Shift: Moves beyond pure query-to-video retrieval by emphasizing the exploitation of intrinsic video structure. This "see less but know more" philosophy enables global understanding from sparse, strategically chosen observations.
  • Practical Impact:
    • Plug-and-play Enhancement: VideoDetective is a training-free framework that can consistently boost the performance of existing MLLMs on long-video tasks, making it highly practical for deployment.
    • Efficiency-Accuracy Trade-off: Demonstrates that strategic active inference can compensate for model scale, enabling smaller open-source models to rival large proprietary ones, which has significant implications for cost-effective and scalable video analysis systems.
    • Component Design: Highlights the importance of query facet decomposition, multi-source evidence fusion, and iterative refinement with feedback, providing a blueprint for future long-context understanding systems.

Conclusion

VideoDetective presents an effective inference framework for long-video question answering that successfully integrates extrinsic query guidance with intrinsic video correlations. By modeling the video as a visual-temporal affinity graph and performing an iterative hypothesis-verification-refinement loop with graph diffusion, it can localize critical clues from sparse observations. Extensive experiments validate its effectiveness, generality across backbones, and superior efficiency-accuracy balance.

Limitation & Future Work: The method relies on the VLM's self-reflection capability (e.g., outputting "missing keywords") for feedback. Future work could explore more robust and sophisticated relevance assessment mechanisms. Additionally, extending the framework to other long-context modalities (e.g., long documents or audio) is a promising direction.