# Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

> CARVE outperforms VideoRAG baselines by selecting per-chunk modality and granularity via chunk-adaptive reranking, achieving 0.603 Recall@5 on V-RAGBench.

- **Source:** [arXiv](https://arxiv.org/abs/2606.13141)
- **Published:** 2026-06-16
- **Permalink:** https://picx.dev/p/lsdUbq
- **Whiteboard:** https://picx.dev/p/lsdUbq/image

## Summary (Overview)

- Introduces **V‑RAGBench**, a benchmark of 2,100 ⟨query, evidence chunk, answer⟩ triplets over hour‑scale egocentric videos that enforces non‑recurring evidence, visual grounding, and evidence localisation, enabling faithful *stage‑wise* evaluation of retrieval and generation in VideoRAG.
- Proposes **CARVE**, a method that runs four parallel retrievers (by crossing modality {visual, text} and granularity {frame, clip}) and then uses a chunk‑adaptive reranker to assign each chunk its winning configuration, propagating that choice into generation.
- On V‑RAGBench, CARVE outperforms eight recent VideoRAG baselines on both retrieval (Recall@5: 0.603 vs. best baseline 0.510) and generation (pass rate 0.320–0.367 across three generator backbones).
- The chunk‑level configuration decisions are spread across all four configurations (each is selected with non‑trivial frequency, between 0.11 and 0.37 at every rank, and no single choice dominates), and CARVE even surpasses trained query‑level routers without any additional training.

## Introduction and Theoretical Foundation

VideoRAG extends retrieval‑augmented generation from text to long, egocentric video, where the system must retrieve query‑relevant evidence from hour‑scale first‑person streams. Two fundamental gaps hinder progress:

1. **Benchmark gap** – Existing datasets (e.g., Ego4D, EgoLife) allow queries to be answered *without* the video, so high generation accuracy can mask retrieval failures. This makes it impossible to diagnose retrieval quality or study the interaction between retrieval and generation.
2. **Methodological gap** – Prior methods apply a single modality–granularity configuration (e.g., visual‑clip) uniformly to all chunks for a given query. However, the best representation for a chunk is a property of the chunk’s content, not of the query: a visually salient moment is best retrieved via visual embeddings, while a semantically rich moment is better surfaced through its textual summary.

The paper formalises VideoRAG along two explicit design axes: **modality** \(m \in \{\text{vis}, \text{text}\}\) and **granularity** \(g \in \{\text{frame}, \text{clip}\}\). A chunk \(v\) admits a representation \(\phi_{m,g}(v)\). The goal is to retrieve a set \(V_q\) of \(k\) chunks and generate an answer \(\hat{a}\):

\[
\hat{a} = G\!\left(q, \{\phi_{m_g,g_g}(v)\}_{v \in V_q}\right) \quad \text{where} \quad V_q = R(q, \{\phi_{m_r,g_r}(v)\}_{v \in V}).
\tag{1}
\]

The key insight is that \((m_r, g_r)\) and \((m_g, g_g)\) need not coincide, and the choice should be made *per chunk* rather than per query.
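The difference between a per‑query and a per‑chunk choice can be made concrete with a minimal sketch. The names `CONFIGS`, `per_query_assignment`, and `per_chunk_assignment` are illustrative assumptions, not identifiers from the paper.

```python
from itertools import product

MODALITIES = ("vis", "text")
GRANULARITIES = ("frame", "clip")

# The four (m, g) pairs under which a chunk v can be represented as phi_{m,g}(v).
CONFIGS = list(product(MODALITIES, GRANULARITIES))


def per_query_assignment(chunks, config):
    """Prior practice: one configuration is applied uniformly to every chunk."""
    return {v: config for v in chunks}


def per_chunk_assignment(chunks, choose):
    """Chunk-level view: `choose` (hypothetical) picks a configuration per chunk."""
    return {v: choose(v) for v in chunks}


if __name__ == "__main__":
    chunks = ["c1", "c2"]
    print(per_query_assignment(chunks, ("vis", "clip")))
    # Toy chooser: alternate configurations purely for illustration.
    print(per_chunk_assignment(chunks, lambda v: CONFIGS[int(v[1:]) % len(CONFIGS)]))
```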

## Methodology

### V‑RAGBench Construction

V‑RAGBench is built from 216 hour‑scale egocentric videos (42 from EgoLife, 174 from Ego4D). The construction pipeline proceeds through the following stages:

1. **Event extraction and deduplication** – Kernel temporal segmentation on CLIP embeddings produces semantically coherent segments; k‑means++ clustering keeps only one representative per cluster, removing recurring routines.
2. **Query generation** – Gemini‑3‑flash‑prompt generates up to three queries per event, each embedding anchor information from its source event.
3. **Post‑hoc filtering** – Five filters enforce: (i) semantic uniqueness, (ii) answerability from the source clip, (iii) no shortcut bias (unanswerable without the clip), (iv) empirical answerability (LLM correctly answers from source), (v) evidence uniqueness (answer not recoverable from other chunks). After filtering, 2,100 queries remain (1,800 train, 300 test), balanced across three categories.
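The deduplication idea in the first stage can be sketched as follows. This is only an illustration: it assumes cluster labels have already been computed (for example by k‑means++), and it assumes the representative is the member closest to the cluster centroid, a selection rule the summary does not specify. The function name `representatives` is hypothetical.

```python
from collections import defaultdict
from math import dist


def representatives(embeddings, labels):
    """Return one segment index per cluster: the member nearest its centroid.

    embeddings: list of equal-length float vectors, one per segment
    labels:     list of cluster ids, aligned with `embeddings`
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    keep = []
    for members in clusters.values():
        dim = len(embeddings[members[0]])
        # Centroid = coordinate-wise mean of the cluster's members.
        centroid = [
            sum(embeddings[i][d] for i in members) / len(members)
            for d in range(dim)
        ]
        keep.append(min(members, key=lambda i: dist(embeddings[i], centroid)))
    return sorted(keep)


if __name__ == "__main__":
    segs = [[0.0, 0.0], [0.0, 2.0], [0.0, 1.0], [10.0, 10.0]]
    print(representatives(segs, [0, 0, 0, 1]))
```

Recurring routines end up in the same cluster, so keeping one index per cluster leaves a single event for each of them.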

### CARVE: Chunk‑Aware Reranking for Video Evidence

CARVE operates in two stages:

**Stage 1 – Parallel Candidate Pooling.** Four parallel retrievers, one per configuration \((m,g)\), each retrieve their top‑\(k\) chunks:

\[
C_{(m,g)}(q) = \text{top-}k \text{ chunks } v \text{ ranked by } \langle q, \phi_{m,g}(v) \rangle.
\tag{2}
\]

The four lists are merged into a configuration‑tagged candidate pool:

\[
P(q) = \bigcup_{(m,g)} \{ (v, (m,g)) : v \in C_{(m,g)}(q) \}.
\tag{3}
\]
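A minimal sketch of Equations (2) and (3), using plain Python lists as vectors. It assumes each configuration has its own query embedding (since text and visual encoders may live in different spaces); the names `top_k`, `candidate_pool`, `query_vecs`, and `index` are illustrative, not taken from the paper.

```python
def dot(a, b):
    """Inner product <a, b> of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, reps, k):
    """Eq. (2): the k chunk ids whose representation scores highest against the query.

    reps: {chunk_id: phi_{m,g}(v)} for one configuration
    """
    ranked = sorted(reps, key=lambda v: dot(query_vec, reps[v]), reverse=True)
    return ranked[:k]


def candidate_pool(query_vecs, index, k):
    """Eq. (3): union of the per-configuration top-k lists, tagged by configuration.

    query_vecs: {(m, g): query vector for that configuration}
    index:      {(m, g): {chunk_id: vector}}
    """
    pool = set()
    for config, reps in index.items():
        for v in top_k(query_vecs[config], reps, k):
            pool.add((v, config))
    return pool


if __name__ == "__main__":
    index = {
        ("vis", "clip"): {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]},
        ("text", "frame"): {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [0.2, 0.2]},
    }
    query_vecs = {("vis", "clip"): [1.0, 0.0], ("text", "frame"): [1.0, 0.0]}
    print(sorted(candidate_pool(query_vecs, index, k=2)))
```

Because the pool stores (chunk, configuration) pairs, the same chunk can appear more than once, each time under a different tag.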

**Stage 2 – Chunk‑Adaptive Reranking.** A multimodal cross‑encoder \(CE\) re‑scores each candidate *only under its retrieving configuration*:

\[
\tilde{s}(q; (v, (m,g))) = CE(q, \phi_{m,g}(v)).
\tag{4}
\]

For each chunk, the winning configuration is the one that gives the highest score:

\[
(m^*_v, g^*_v) = \arg\max_{(m,g): (v,(m,g)) \in P(q)} \tilde{s}(q; (v, (m,g))).
\tag{5}
\]

The final top‑\(k\) list carries each chunk with its winning configuration:

\[
V^*_q = \{ (v, (m^*_v, g^*_v)) : v \in \text{top-}k \text{ chunks ranked by } \tilde{s}(q; (v, (m^*_v, g^*_v))) \}.
\tag{6}
\]
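Equations (4) to (6) can be sketched in a few lines. The `score` callable below is a stand‑in for the multimodal cross‑encoder \(CE\), whose actual interface is not described here; `rerank` is a hypothetical name.

```python
def rerank(pool, score, k):
    """Chunk-adaptive reranking over a configuration-tagged candidate pool.

    pool:  iterable of (chunk_id, config) pairs, i.e. P(q)
    score: callable (chunk_id, config) -> float, standing in for CE(q, phi_{m,g}(v))
    Returns [(chunk_id, winning_config, score)], best first, at most k items.
    """
    best = {}
    for v, config in pool:
        s = score(v, config)  # Eq. (4): scored only under the retrieving configuration
        if v not in best or s > best[v][1]:
            best[v] = (config, s)  # Eq. (5): keep the arg-max configuration per chunk

    # Eq. (6): rank chunks by the score of their winning configuration.
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(v, config, s) for v, (config, s) in ranked[:k]]


if __name__ == "__main__":
    toy_scores = {
        ("a", ("vis", "clip")): 0.9,
        ("c", ("vis", "clip")): 0.4,
        ("b", ("text", "frame")): 0.2,
        ("c", ("text", "frame")): 0.7,
    }
    for row in rerank(toy_scores.keys(), lambda v, cfg: toy_scores[(v, cfg)], k=2):
        print(row)
```

A chunk retrieved under several configurations is scored once per tag and then collapses to a single entry, so the final list contains each chunk at most once.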

In generation, each chunk is rendered under its winning representation only, producing an interleaved evidence form:

\[
\hat{a} = G\!\left(q, \{ \phi_{m^*_v, g^*_v}(v) : (v, (m^*_v, g^*_v)) \in V^*_q \} \right).
\tag{7}
\]
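One way to picture the interleaved evidence of Equation (7) is as an ordered list of mixed text and visual items handed to the generator. The sketch below assumes a hypothetical `store` that maps (chunk, modality, granularity) to a payload, with strings for text and file paths for visual content; the item format is an illustrative assumption, not the paper's prompt layout.

```python
def build_evidence(ranked, store):
    """Render each retrieved chunk under its winning configuration only.

    ranked: [(chunk_id, (modality, granularity)), ...] in rank order
    store:  {(chunk_id, modality, granularity): payload} (hypothetical layout)
    Returns an ordered list of evidence items, one per chunk.
    """
    evidence = []
    for chunk_id, (modality, granularity) in ranked:
        payload = store[(chunk_id, modality, granularity)]
        if modality == "text":
            kind = "text"
        else:
            # Visual evidence is a still frame or a short clip.
            kind = "image" if granularity == "frame" else "video"
        evidence.append({"chunk": chunk_id, "type": kind, "content": payload})
    return evidence


if __name__ == "__main__":
    store = {
        ("c7", "vis", "clip"): "c7.mp4",
        ("c2", "text", "frame"): "A hand places keys on the shelf.",
        ("c2", "vis", "frame"): "c2_f1.jpg",
    }
    ranked = [("c7", ("vis", "clip")), ("c2", ("text", "frame"))]
    for item in build_evidence(ranked, store):
        print(item)
```

Representations that did not win for a chunk (here the visual frame of `c2`) are never passed to the generator.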

## Empirical Validation / Results

Experiments are conducted on the V‑RAGBench test set. Retrieval metrics are Recall@5 and nDCG@5; generation pass rate is judged by an LLM‑as‑a‑judge (Qwen3.6‑35B‑A3B) with three generator backbones (Qwen3‑VL‑8B, Qwen3‑VL‑32B, Gemma‑4‑26B). All methods use \(k = 5\).
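For reference, the two retrieval metrics can be sketched under the assumption that each query has exactly one gold evidence chunk with binary relevance, which matches the triplet structure of the benchmark. The paper's exact nDCG formulation is not given here, so the standard \(1/\log_2(\text{rank}+1)\) discount is assumed.

```python
from math import log2


def recall_at_k(ranked, gold, k=5):
    """1.0 if the gold chunk appears in the top-k list, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0


def ndcg_at_k(ranked, gold, k=5):
    """nDCG with a single relevant item: the ideal DCG is 1, so nDCG = 1 / log2(rank + 1)."""
    for rank, v in enumerate(ranked[:k], start=1):
        if v == gold:
            return 1.0 / log2(rank + 1)
    return 0.0


def mean(values):
    """Average a per-query metric over the test set."""
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    runs = [(["g", "x", "y"], "g"), (["x", "y", "g"], "g"), (["x", "y", "z"], "g")]
    print(mean(recall_at_k(r, g) for r, g in runs))
    print(mean(ndcg_at_k(r, g) for r, g in runs))
```

Under this assumption nDCG@5 rewards placing the evidence chunk near the top, while Recall@5 only checks whether it was retrieved at all.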

### Main Results

**Table 1: Retrieval and generation performance on V-RAGBench.** Best values are in bold.

| Category      | Method             | Recall@5 | nDCG@5 | Qwen3-VL-8B | Qwen3-VL-32B | Gemma-4-26B |
|---------------|--------------------|----------|--------|-------------|--------------|-------------|
| System‑Level  | VideoRAG‑A [28]    | 0.510    | 0.332  | 0.250       | 0.317        | 0.307       |
|               | GQR [59]           | 0.503    | 0.340  | 0.167       | 0.110        | 0.153       |
|               | Freeret [89]       | 0.263    | 0.187  | 0.225       | 0.200        | 0.160       |
|               | GME [85]           | 0.413    | 0.253  | 0.260       | 0.287        | 0.215       |
| Query‑Level   | RRF [50]           | 0.463    | 0.298  | 0.180       | 0.130        | 0.187       |
|               | DAT [22]           | 0.460    | 0.312  | 0.223       | 0.147        | 0.200       |
|               | VideoRAG‑B [42]    | 0.487    | 0.325  | 0.315       | 0.317        | 0.240       |
|               | UniversalRAG [76]  | 0.447    | 0.298  | 0.293       | 0.247        | 0.257       |
|               | UniversalRAG‑LoRA  | 0.470    | 0.311  | 0.237       | 0.270        | 0.220       |
| **Chunk‑Level**| **CARVE (Ours)**  | **0.603**| **0.433**| **0.357**  | **0.367**    | **0.320**   |

CARVE achieves the best retrieval and generation across all generator backbones, demonstrating that chunk‑level decisions benefit both stages.

### Ablation and Analysis

**Table 2: Ablation over modality–granularity configurations.** Full setting (all four) gives the best performance.

| Configuration                       | R@5   | nDCG@5 | Pass Rate | Latency |
|-------------------------------------|-------|--------|-----------|---------|
| \(m=\{\text{text}\}, g=\{\text{frame}\}\) | 0.430 | 0.301  | 0.207     | 2.5s    |
| \(m=\{\text{text}\}, g=\{\text{clip}\}\)  | 0.433 | 0.293  | 0.157     | 1.5s    |
| \(m=\{\text{vis}\}, g=\{\text{frame}\}\)  | 0.477 | 0.336  | 0.323     | 5.2s    |
| \(m=\{\text{vis}\}, g=\{\text{clip}\}\)   | 0.507 | 0.369  | 0.293     | 7.3s    |
| \(m=\{\text{text, vis}\}, g=\{\text{frame}\}\) | 0.513 | 0.352 | 0.303     | 5.0s    |
| \(m=\{\text{text, vis}\}, g=\{\text{clip}\}\)  | 0.543 | 0.398 | 0.300     | 5.5s    |
| \(m=\{\text{text}\}, g=\{\text{frame, clip}\}\) | 0.567 | 0.413 | 0.217     | 3.3s    |
| \(m=\{\text{vis}\}, g=\{\text{frame, clip}\}\) | 0.497 | 0.355 | 0.327     | 6.1s    |
| **\(m=\{\text{text, vis}\}, g=\{\text{frame, clip}\}\)** | **0.603** | **0.433** | **0.357** | **4.6s** |

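Expanding across modalities is a more dependable source of synergy than expanding granularity within a single modality: adding the second modality raises R@5 at both granularities (frame: 0.513 vs. 0.477 for the best single modality; clip: 0.543 vs. 0.507), whereas adding the second granularity helps text retrieval (0.567 vs. 0.433) but not visual retrieval (0.497 vs. 0.507).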

**Table 3: Distribution of winning configuration across ranks.** All four configurations are selected with non‑trivial frequency.

| Configuration | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 |
|---------------|--------|--------|--------|--------|--------|
| {text, frame} | 0.37   | 0.27   | 0.25   | 0.27   | 0.25   |
| {text, clip}  | 0.15   | 0.13   | 0.11   | 0.11   | 0.13   |
| {vis, frame}  | 0.24   | 0.26   | 0.28   | 0.27   | 0.28   |
| {vis, clip}   | 0.24   | 0.34   | 0.36   | 0.35   | 0.34   |

**Table 4: Reranking strategy comparison.** CARVE (rescoring under retrieving tag) outperforms alternatives.

| Method         | R@5   | nDCG@5 | Latency |
|----------------|-------|--------|---------|
| Single fixed  | 0.497–0.560 | 0.343–0.396 | 0.5–8.0s |
| Random         | 0.513 | 0.312  | 3.1s    |
| Concatenation  | 0.513 | 0.339  | 8.2s    |
| **CARVE**      | **0.603** | **0.433** | **3.4s** |

**Table 5: Generation vs. query‑level routing.** CARVE surpasses trained query‑level routers without any training.

| Method               | Pass Rate | Latency |
|----------------------|-----------|---------|
| Single fixed (best)  | 0.323     | 4.4s    |
| Non‑LLM Router (trained) | 0.329 | 5.7s    |
| LLM Router (trained)    | 0.310 | 5.3s    |
| **CARVE**            | **0.357** | **4.6s** |

## Theoretical and Practical Implications

- **For benchmarking VideoRAG:** V‑RAGBench establishes a rigorous evaluation protocol where retrieval and generation can be studied independently and jointly. The three enforced properties (non‑recurring evidence, visual grounding, evidence localisation) are essential for any future VideoRAG benchmark; they prevent accuracy inflation from shortcut answerability.
- **For system design:** CARVE shows that *chunk‑level* decisions about modality and granularity are more effective than *query‑level* decisions. This suggests that VideoRAG systems should treat representation selection as a fine‑grained per‑chunk problem, not a holistic per‑query one.
- **Practical latency:** CARVE runs at 4.6 s per query (Table 2), slower than the text‑only configurations (1.5–3.3 s) but faster than the visual‑only ones (5.2–7.3 s), because visual features are only processed for chunks that benefit from them.
- **Generality:** The gains hold across three different generator backbones and two video domains (Ego4D, EgoLife), indicating that chunk‑adaptive interleaving is a robust design principle.

## Conclusion

The paper introduces V‑RAGBench, a benchmark with 2,100 triplets that enables faithful stage‑wise evaluation of VideoRAG, and CARVE, a method that reframes modality and temporal granularity as chunk‑level decisions. By running parallel retrievers and applying chunk‑adaptive reranking, CARVE propagates each chunk’s winning configuration into generation, producing an interleaved evidence form that outperforms both system‑level and query‑level baselines on retrieval and generation alike. The chunk‑level decisions are spread across all four configurations, supporting the view that per‑chunk diversity is genuine and beneficial. Future work can extend the approach to additional modalities (e.g., audio, depth) and finer‑grained temporal scales, and explore learned routers that capture the chunk‑level dynamics identified in this work.

---

_Markdown view of https://picx.dev/p/lsdUbq, served by PicX — AI-generated visual whiteboard summaries of research papers._
