EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Summary (Overview)
- Planning-Before-Perception Paradigm: Proposes EVA, a novel reinforcement learning framework where a Multimodal Large Language Model (MLLM) agent first reasons from a textual query to plan what, when, and how to watch a video, before any visual perception. This enables iterative summary–plan–action–reflection cycles for efficient, query-driven understanding.
- Three-Stage Training Pipeline: Introduces a scalable training method combining Supervised Fine-Tuning (SFT) for cold-start capabilities, Kahneman–Tversky Optimization (KTO) to learn from failure patterns, and Group Relative Policy Optimization (GRPO) for online reinforcement learning, bridging imitation and policy learning.
- High-Quality Datasets: Constructs and releases three datasets—EVA-SFT (10k samples), EVA-KTO (11k success/failure trajectories), and EVA-RL (~10.7k QA pairs)—to support stable and reproducible agent training.
- State-of-the-Art Performance: Demonstrates substantial improvements over baselines, achieving +6–12% over general MLLMs and a further +1–3% over prior adaptive agents across six video understanding benchmarks, while using significantly fewer visual tokens.
Introduction and Theoretical Foundation
Video understanding is a cornerstone of multimodal intelligence, with applications in question answering, retrieval, and embodied perception. While MLLMs are adept at integrating vision and language, most existing approaches treat them as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools (e.g., for frame selection) but remain limited by manually designed workflows, perception-first strategies (seeing frames before planning), and rigid tool parameters (e.g., fixed sampling rates).
This work identifies a fundamental gap: the need for a planning-before-perception paradigm where an MLLM-based agent can autonomously decide its visual exploration strategy before engaging with the video. The authors formulate this as an iterative process of summary, planning, action, and reflection, transforming the MLLM from a passive processor into an active, adaptive agentic watcher.
The core challenge is training such an agent to operate effectively within this reasoning loop: learning to generate initial tool calls based solely on the query, to continue reasoning when visual evidence is insufficient, and to avoid over-exploration. The proposed solution, EVA, addresses this through a novel three-stage training strategy supported by high-quality, curated datasets.
Methodology
3.1. Problem Setup
The active video understanding problem is formulated as a Markov Decision Process (MDP). At each timestep $t$, the agent observes a belief state:

$$b_t = (q, h_t, v_t)$$

where $q$ is the user query, $h_t$ is the interleaved text–frame history, and $v_t$ is the visual evidence (frames) accumulated from tool calls. The agent's policy is $\pi_\theta(a_t \mid b_t)$.
Critically, at the initial step $t = 0$, the model is provided only with the query $q$, without any visual information. This enforces planning-before-perception.
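The summary–plan–action–reflection loop over this belief state can be sketched as follows; `policy`, `tool`, and the action schema are hypothetical stand-ins, not the paper's interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    """b_t = (q, h_t, v_t): query, interleaved history, visual evidence."""
    query: str
    history: list = field(default_factory=list)  # interleaved text/tool records
    frames: list = field(default_factory=list)   # frames returned by tool calls

def run_episode(query, policy, tool, max_rounds=5):
    # Planning-before-perception: at t = 0 the state holds only the query.
    state = BeliefState(query=query)
    for _ in range(max_rounds):
        action = policy(state)             # summary + plan, then tool call or answer
        if action["type"] == "answer":
            return action["text"]
        frames = tool(**action["params"])  # execute the frame-selection tool
        state.frames.extend(frames)
        state.history.append(action)       # reflection informs the next round
    return policy(state, force_answer=True)["text"]
```

The cap on rounds mirrors the paper's concern with over-exploration: the agent must eventually commit to an answer.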
A flexible frame-selection tool is designed, parameterized by:
| Parameter | Description |
|---|---|
| `start_time` | Start of the temporal window |
| `end_time` | End of the temporal window |
| `nframes` | Number of frames to sample |
| `resize` | Spatial downsampling ratio |
This tool allows the agent to control both temporal granularity (via time range and nframes) and spatial resolution (via resize), providing a broad exploration space for efficient token allocation.
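As an illustration of the exploration space this parameterization opens up, the sketch below mimics such a tool; the per-frame token constant and the quadratic `resize` scaling are illustrative assumptions, not values from the paper:

```python
def select_frames(video_len_s, start_time, end_time, nframes, resize):
    """Hypothetical frame-selection tool: evenly spaced timestamps within the
    requested window, plus a rough visual-token cost estimate."""
    start = max(0.0, start_time)
    end = min(float(video_len_s), end_time)
    if nframes == 1:
        timestamps = [(start + end) / 2]
    else:
        step = (end - start) / (nframes - 1)
        timestamps = [start + i * step for i in range(nframes)]
    # Token cost per frame shrinks roughly quadratically with spatial downsampling.
    base_tokens_per_frame = 656  # illustrative constant, not from the paper
    token_cost = int(nframes * base_tokens_per_frame * resize ** 2)
    return timestamps, token_cost
```

A coarse pass (wide window, small `nframes`, low `resize`) and a fine pass (narrow window, full `resize`) then cost very different token budgets, which is exactly the trade-off the agent learns to manage.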
3.2. Data Construction
The training pipeline relies on three high-quality datasets, constructed using Qwen2.5-VL-72B as a teacher model.
- EVA-SFT (Supervised Fine-Tuning): A "cold-start" dataset of 10k samples instilling core agent capabilities. Each instance follows the Summary + Planning + Action + Reflection format, guiding the model to attend to visual evidence, propose actions, generate tool calls, and evaluate sufficiency before answering.
- EVA-KTO (Kahneman-Tversky Optimization): A dataset of 11k trajectories (63% successful, 37% failed) labeled as "chosen" or "rejected". KTO is used to correct common failure patterns (e.g., guessing without evidence, poor frame sampling) before online RL, improving stability.
- EVA-RL (Reinforcement Learning): A mixed-format dataset (~9.6k open-ended QA, ~1.1k multiple-choice) for online policy optimization. A Data-Enhanced GRPO pipeline is introduced: failure cases from the KTO model are collected and used as in-context examples for the teacher MLLM to generate new QA pairs for unseen videos, enhancing training diversity and preventing overfitting to a static dataset.
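The released file formats are not reproduced in this summary; the record below is a hypothetical illustration of the Summary + Planning + Action + Reflection structure an EVA-SFT instance follows, with a KTO-style trajectory label attached (all field names are assumptions):

```python
# Hypothetical training record; field names are illustrative, not the release format.
example = {
    "query": "What color is the car that appears near the end of the video?",
    "turns": [{
        "summary": "No frames observed yet; the query points to the final segment.",
        "planning": "Sample a few low-resolution frames from the last 30 seconds.",
        "action": {"tool": "frame_selection",
                   "params": {"start_time": 570, "end_time": 600,
                              "nframes": 4, "resize": 0.5}},
        "reflection": "The frames show a red sedan; evidence is sufficient.",
    }],
    "answer": "Red",
    "label": "chosen",  # EVA-KTO marks whole trajectories as chosen/rejected
}
```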
3.3. Reinforcement Learning
The model is optimized via Group Relative Policy Optimization (GRPO), a KL-regularized policy optimization method. For a group of $G$ sampled responses $\{o_i\}$ to a query, the objective is:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\big( \rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i \big) \right] - \beta\, D_{\mathrm{KL}}\!\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big)$$

where $\rho_i$ is the importance ratio between the current and old policies, $A_i$ is the group-normalized advantage of response $o_i$, and $\pi_{\mathrm{ref}}$ is the reference model obtained from SFT and KTO.
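GRPO needs no learned critic: the advantage $A_i$ is obtained by normalizing each trajectory's reward within its sampled group. A minimal sketch of that step (standard GRPO practice, not code from the paper):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Z-score each trajectory's reward within its group: A_i = (r_i - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Trajectories that beat their group's average get positive advantage, so only relative quality within a rollout group drives the update.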
A composite reward function is designed:
- Accuracy Reward ($r_{\text{acc}}$):
  - For multiple-choice (MCQ): Uses Completeness Self-Verification (CSV). $r_{\text{acc}} = 1$ only if both EVA and a judge model (given EVA's retrieved images) produce the correct answer.
  - For open-ended (OE): $r_{\text{acc}}$ is the average of the ROUGE scores between the predicted and reference answers.
- Format Reward ($r_{\text{fmt}}$): A small compensatory reward (0.05) is given if the model generates a proper tool call but yields an incorrect answer. This discourages reward hacking via random guessing after a formatted call.
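A minimal sketch of the composite reward, using a unigram-F1 stand-in for the averaged ROUGE scores (the actual pipeline uses ROUGE and a judge MLLM; all helper names here are assumptions):

```python
def unigram_f1(pred, ref):
    """Crude stand-in for averaged ROUGE: unigram-overlap F1."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = len(set(p) & set(r))
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def compute_reward(sample, pred, judge_agrees, made_tool_call):
    """r = r_acc + r_fmt, mirroring the composite reward described above."""
    if sample["type"] == "mcq":
        # Completeness Self-Verification: both EVA and the judge must be correct.
        r_acc = 1.0 if pred == sample["answer"] and judge_agrees else 0.0
    else:  # open-ended: graded text similarity
        r_acc = unigram_f1(pred, sample["answer"])
    # Small compensation for a well-formed tool call that still missed the answer.
    r_fmt = 0.05 if made_tool_call and r_acc == 0.0 else 0.0
    return r_acc + r_fmt
```

Note how the CSV condition blocks lucky MCQ guesses: a correct letter without judge agreement earns only the 0.05 format compensation.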
Empirical Validation / Results
EVA is built upon Qwen2.5-VL-7B-Instruct. Evaluations are conducted on six benchmarks: LSDBench, LongVideoBench, MLVU, VideoMME, LVBench, and Video-Holmes.
4.2. Main Results
Sampling Efficiency (LSDBench): EVA achieves 51.8% accuracy using only 6.2K visual tokens, surpassing the Qwen2.5-VL baseline (49.2%, 21.0K tokens) and outperforming other adaptive agents. This demonstrates effective mitigation of the sampling dilemma.

Table 1: Performance on Sampling Dilemma Bench (LSDBench)
| Method | Frames | Visual Tokens | Acc (%) |
|---|---|---|---|
| Closed-Source Models | | | |
| Gemini-2.0-Flash | 2700 | 696.6k | 56.2 |
| Open-Source Models | | | |
| Qwen2.5-VL* (Baseline) | 32 | 21.0k | 49.2 |
| Ours | | | |
| EVA | 76.9 | 10.3k | 51.0 |
Long-Form Video Understanding: EVA achieves strong, consistent results across four long-video benchmarks, outperforming most open-source and adaptive agents while processing only ~20–30 estimated frames per video.

Table 2: Main Performance on Multiple Video Understanding Benchmarks
| Model | LongVideoBench Acc | MLVU Acc | VideoMME Acc | LVBench Acc |
|---|---|---|---|---|
| Static Frame Sampling | ||||
| Qwen2.5-VL | 43.2 | 48.4 | 53.6 | 31.6 |
| Adaptive Agent | ||||
| FrameThinker | 52.9 | 59.1 | - | 36.6 |
| Ours | ||||
| EVA-SFT | 49.9 | 52.3 | 56.0 | 26.5 |
| EVA-KTO | 53.2 | 57.4 | 56.5 | 36.0 |
| EVA-GRPO | 55.0 | 68.3 | 60.2 | 43.3 |
Zero-Shot Reasoning (Video-Holmes): EVA shows strong transferability, achieving competitive performance (up to 37.2% overall) in a zero-shot setting, comparable to models trained with uniform sampling.
4.3. Ablation Study
- Training Schema: The SFT → KTO → GRPO sequence provides a clear evolutionary path. SFT learns formatting but is inefficient. KTO reduces frame consumption and improves performance. GRPO learns the most strategic exploration, using more deliberate rounds with precise token allocation, achieving the highest scores.
- GRPO Data Composition: Training with a mixed dataset (OE + MCQ) leads to more stable learning and better performance than using only one type, as it prevents reward hacking and ensures visually grounded reasoning.
4.4. Computation Efficiency
Despite multi-round reasoning, EVA's total visual token count is comparable to or lower than static uniform-sampling baselines. Inference runtime is dominated by the compact set of adaptively selected visual tokens, not the number of reasoning steps, making it highly efficient.
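Back-of-envelope arithmetic on the Table 1 numbers makes this concrete: adaptive `resize` lets EVA watch more frames at a fraction of the per-frame cost (rough division, not figures reported by the paper):

```python
def tokens_per_frame(total_visual_tokens, frames):
    """Average visual-token budget per sampled frame."""
    return total_visual_tokens / frames

baseline = tokens_per_frame(21_000, 32)   # Qwen2.5-VL: full-resolution frames
eva = tokens_per_frame(10_300, 76.9)      # EVA: more frames, mostly downsampled
```

The baseline spends roughly 5x more tokens per frame, so EVA can afford broader temporal coverage within a smaller total budget.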
4.5. Case Study
EVA generates diverse, query-adaptive workflows:
- For queries needing specific segments, it allocates tokens precisely (e.g., brief low-res overview first, then high-res zoom-in).
- For queries needing extensive evidence, it can behave like a traditional agent (uniform sampling for grounding).
- It autonomously controls all tool parameters (`start_time`, `end_time`, `nframes`, `resize`), enabling more efficient token usage than prior agents with limited action spaces.
Theoretical and Practical Implications
- Paradigm Shift: Establishes a planning-before-perception paradigm as a superior framework for agentic video understanding, moving beyond perception-first and rigidly tooled approaches.
- Training Methodology: Provides a scalable three-stage pipeline (SFT, KTO, GRPO) that effectively bridges supervised imitation learning and online reinforcement learning for complex agent tasks.
- Data Contribution: Releases high-quality, curated datasets that address specific training stages, facilitating reproducible research in agentic video understanding.
- Efficiency & Performance: Demonstrates that adaptive, reasoning-driven visual token allocation can simultaneously improve accuracy and reduce computational cost compared to brute-force or semi-adaptive methods, making long-video understanding more practical.
Conclusion
EVA presents a step towards building truly autonomous video-understanding agents through a query-driven, iterative reasoning framework. The three-stage training paradigm enables the model to evolve from a passive recognizer into an adaptive, self-directed agent that balances perceptual efficiency and reasoning depth.
Limitations & Future Work: The current approach relies on pre-defined tool interfaces and may struggle with unseen query distributions. Future directions include exploring more flexible tool ecosystems, self-evolving reasoning strategies, and cross-modal memory mechanisms for enhanced autonomy in long-horizon video understanding.