EasyVideoR1: Easier RL for Video Understanding
Summary (Overview)
- Complete Video RL Pipeline: A full RL training pipeline with offline preprocessing and tensor caching eliminates redundant video decoding, achieving a 1.47× throughput improvement.
- Task-Aware Reward System: A comprehensive, modular reward system covers 11 distinct video and image problem types with unified routing.
- Mixed Offline-Online Training: A hybrid training paradigm combines curated offline trajectories with on-policy rollout data to improve learning on challenging tasks.
- Joint Image-Video Training: Supports mixed-modality batches with independently configurable pixel budgets for images and videos, allowing modalities to mutually reinforce each other.
- Asynchronous Multi-Benchmark Evaluation: An efficient evaluation framework supports concurrent inference across 22 mainstream video benchmarks with reproducible accuracy.
Introduction and Theoretical Foundation
Reinforcement Learning from Verifiable Rewards (RLVR), exemplified by GRPO, has proven highly effective for improving reasoning in large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding is crucial but remains largely unexplored due to unique challenges:
- Diversity of Tasks: Video understanding spans multiple-choice QA, OCR, temporal localization, spatial grounding, tracking, and dense segmentation.
- Computational Overhead: Repeated decoding and preprocessing of high-dimensional visual inputs creates bottlenecks.
- Reproducible Evaluation: Evaluation is sensitive to numerous hyperparameters (frame sampling, token budget, fps, resolution, prompt template).
Existing frameworks such as EasyR1, R1-V, and OneThinker either focus on image-text scenarios or lack systematic optimizations for video: they suffer from redundant video decoding and provide no reproducible evaluation code. EasyVideoR1 builds on EasyR1 to address these gaps, providing a complete, efficient RL framework designed specifically for video understanding.
Methodology
The system design extends EasyR1 and veRL with systematic support for video RL training and evaluation, organized around three dimensions: video-friendly optimization, research-friendly interfaces, and high-throughput evaluation.
Video-Friendly Optimization
1. Efficient RL with Video Caching
To address CPU-bound I/O bottlenecks from repeated video decoding, EasyVideoR1 introduces offline preprocessing. Videos are decoded, resampled, and resized into cache files keyed by (video_path, fps, max_frames, max_pixels) to invalidate stale entries. During training, only cache file paths are stored, and tensors are loaded locally on each worker, reducing inter-node data transfer. VideoMetadata (frame rate, sampling indices, spatial dimensions) is propagated through the pipeline to ensure consistent video_grid_thw values and skip redundant operations.
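The cache-key scheme described above can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: the function name, digest choice, and `.pt` suffix are assumptions; the point is that any change to the preprocessing parameters produces a new key, so stale entries are never reused.

```python
import hashlib
import os

def cache_key(video_path: str, fps: float, max_frames: int, max_pixels: int) -> str:
    """Derive a cache filename from (video_path, fps, max_frames, max_pixels).

    Changing any preprocessing parameter changes the key, which implicitly
    invalidates stale cache entries (illustrative sketch, not the exact API).
    """
    spec = f"{os.path.abspath(video_path)}|{fps}|{max_frames}|{max_pixels}"
    digest = hashlib.sha256(spec.encode()).hexdigest()[:16]
    return f"{digest}.pt"
```

During training, only such cache paths travel through the dataloader; each worker resolves the path and loads the preprocessed tensor locally, avoiding repeated decoding and inter-node tensor transfer.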
2. Mixed-Modality Pipeline Adaptation
- Mixed-Modality Forward Pass: To handle micro-batches containing only one modality (image or video), dummy tensors for the missing modality are generated. Their encoder outputs are connected via zero-weighted addition to ensure all parameters participate in every forward pass without spurious gradients.
- Independent Resolution Budgets: The parameters image_max_pixels, video_max_pixels, and video_max_frames are decoupled for independent tuning.
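The zero-weighted dummy trick for single-modality micro-batches can be illustrated with a toy two-encoder model. This is a sketch under assumed names (DualEncoder and the linear encoders are hypothetical stand-ins for the real vision towers): the missing modality's encoder still runs on a dummy input, and its output is added with weight 0.0 so every parameter joins the backward pass (as FSDP/DDP require) without receiving spurious gradients.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Toy two-modality model illustrating the zero-weighted dummy pass."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.image_enc = nn.Linear(dim, dim)
        self.video_enc = nn.Linear(dim, dim)

    def forward(self, images=None, videos=None):
        feats = []
        if images is not None:
            feats.append(self.image_enc(images))
        else:
            # Dummy pass keeps image_enc in the autograd graph; the 0.0
            # weight guarantees its gradients are exactly zero.
            dummy = torch.zeros(1, self.image_enc.in_features)
            feats.append(0.0 * self.image_enc(dummy))
        if videos is not None:
            feats.append(self.video_enc(videos))
        else:
            # Same trick for a micro-batch that carries no video.
            dummy = torch.zeros(1, self.video_enc.in_features)
            feats.append(0.0 * self.video_enc(dummy))
        return torch.cat(feats, dim=0).sum()
```

After `loss.backward()` on an image-only batch, `video_enc` still has gradients, but they are all zero, so the optimizer step is unaffected while collective gradient synchronization stays consistent across workers.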
3. Task-Aware Reward System
A modular reward library uses a central dispatcher that routes samples based on problem_type to corresponding reward modules. Supported categories are summarized in Table 1. Prompt formatting uses Jinja2 templates.
Table 1: Supported Task Types and Accuracy Scoring Methods
| Category | Task Type | Accuracy Scoring |
|---|---|---|
| Multiple Choice | multiple choice | Exact match |
| Numerical | numerical, regression | Numeric comparison |
| Temporal Grounding | temporal grounding | 1D IoU |
| ST Grounding | spatial-temporal grounding | |
| Spatial Grounding | spatial grounding | Bounding-box IoU |
| Open-ended | open-ended, video QA | ROUGE score |
| Math | math | Symbolic verification |
| OCR | OCR | WER / exact match |
| Boolean | boolean | Exact match |
| Code | SVG, HTML | Execution / match |
| Preference | LLaVA, critic | LLM-as-Judge |
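The routing pattern in Table 1 amounts to a registry keyed by problem_type. The sketch below shows the shape of such a dispatcher with two of the eleven categories filled in (a 1D temporal IoU and exact-match multiple choice); function and registry names are illustrative, not the framework's actual identifiers.

```python
def temporal_iou(pred, gold):
    """1D IoU between predicted and gold (start, end) intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def multiple_choice_match(pred, gold):
    """Exact match on the selected option letter, case-insensitive."""
    return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0

# Central dispatcher: problem_type -> reward module (only two of the
# eleven supported categories are sketched here).
REWARD_REGISTRY = {
    "multiple choice": multiple_choice_match,
    "temporal grounding": temporal_iou,
}

def compute_reward(sample):
    fn = REWARD_REGISTRY.get(sample["problem_type"])
    if fn is None:
        raise KeyError(f"no reward module for {sample['problem_type']!r}")
    return fn(sample["prediction"], sample["answer"])
```

For example, a predicted interval (2.0, 6.0) against a gold interval (4.0, 8.0) overlaps for 2 seconds out of a 6-second union, yielding a reward of 1/3.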
Research-Friendly Interfaces
1. Hybrid Online-Offline Training
Implemented via a lightweight mix-policy interface. Each training sample can carry a pre-collected offline trajectory. During rollout, the framework generates on-policy responses and substitutes the final slot with the offline trajectory, assembling the full group of responses for reward computation and the GRPO update. The behavior is controlled by the enable_mix_policy flag.
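The slot-substitution step above reduces to a small group-assembly function. The sketch below is illustrative (assemble_group is a hypothetical name): on-policy rollouts fill the group, and when mix-policy is enabled and an offline trajectory exists, it replaces the last slot before rewards are computed.

```python
def assemble_group(on_policy, offline, group_size, enable_mix_policy=True):
    """Build a GRPO response group of size `group_size`.

    If mix-policy training is enabled and the sample carries a pre-collected
    offline trajectory, it replaces the final on-policy slot; otherwise the
    group is purely on-policy (illustrative sketch of the interface).
    """
    group = list(on_policy[:group_size])
    if enable_mix_policy and offline is not None:
        group[-1] = offline
    return group
```

Because the offline trajectory enters the same group-relative advantage computation as the on-policy samples, it acts as a high-quality anchor on tasks where pure on-policy exploration rarely succeeds.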
2. Joint Image-Video Training
Each sample carries a data_type field routing it to the appropriate preprocessor and decoupled resolution budget. A strict-failure policy raises exceptions if placeholder token counts mismatch visual feature counts, enforcing semantic consistency.
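The strict-failure policy can be pictured as a single invariant check before the forward pass. The function below is a hypothetical sketch of that check (names are assumptions): it counts visual placeholder tokens in the tokenized prompt and raises if the count diverges from the number of encoder features, rather than silently truncating or padding.

```python
def check_token_alignment(input_ids, placeholder_id, num_visual_features):
    """Raise if placeholder-token count != visual-feature count.

    Failing loudly here surfaces preprocessing bugs (wrong frame count,
    mismatched resolution budget) instead of letting them corrupt training.
    """
    n_placeholders = sum(1 for t in input_ids if t == placeholder_id)
    if n_placeholders != num_visual_features:
        raise ValueError(
            f"placeholder/feature mismatch: {n_placeholders} placeholder "
            f"tokens vs {num_visual_features} visual features"
        )
```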
3. Broad Model and Algorithm Coverage
- Models: Natively supports Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3.5 series.
- Algorithms: Inherits GRPO, DAPO, GSPO, CISPO, Reinforce++, ReMax, RLOO from EasyR1 and contributes GDPO and LUFFY.
Fast & Comprehensive Evaluation Framework
1. Asynchronous Inference Design
- Precomputed Frame Caching: Video preprocessing (decoding, sampling, resizing) is precomputed and cached on disk, keyed by preprocessing parameters. Evaluation reads from cache, reducing per-video latency to milliseconds.
- Asynchronous Pipeline with AsyncLLMEngine: A three-stage (IO, Prefill, Decode) asynchronous pipeline built on vLLM's AsyncLLMEngine operates concurrently, eliminating batch-boundary stalls. Chunked prefill prevents long sequences from monopolizing GPU.
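The pipeline structure above can be sketched with plain asyncio queues. This is a simplified stand-in, not the framework's implementation: the real inference stage is vLLM's AsyncLLMEngine, while load_fn/infer_fn here are hypothetical coroutines. The key property it demonstrates is that IO for later items overlaps with inference on earlier ones, so no stage waits on a batch boundary.

```python
import asyncio

async def run_pipeline(requests, load_fn, infer_fn, workers=4):
    """Minimal multi-stage asynchronous pipeline (IO -> inference).

    An IO stage streams cached frame tensors into a bounded queue while a
    pool of inference workers drains it concurrently (illustrative sketch).
    """
    queue = asyncio.Queue(maxsize=2 * workers)
    results = {}

    async def io_stage():
        # Stage 1: read preprocessed tensors from the frame cache.
        for req in requests:
            await queue.put((req, await load_fn(req)))
        for _ in range(workers):
            await queue.put(None)  # sentinel: no more work

    async def infer_stage():
        # Stages 2-3: prefill + decode, overlapped with IO of later items.
        while (item := await queue.get()) is not None:
            req, frames = item
            results[req] = await infer_fn(req, frames)

    await asyncio.gather(io_stage(), *(infer_stage() for _ in range(workers)))
    return results
```

The bounded queue provides backpressure so the IO stage cannot run arbitrarily far ahead of inference, keeping memory use flat regardless of benchmark size.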
2. Supported Benchmarks
The framework integrates 22 video understanding benchmarks across seven categories (see Table 2). Each benchmark is registered via a lightweight configuration adapter.
Table 2: Video Understanding Benchmarks Supported by Evaluation Framework
| Benchmark | Task Type | Number | Metric |
|---|---|---|---|
| General Video Understanding | |||
| Video-MME [11] | Multiple Choice | 2,700 | Accuracy |
| Video-MME-v2 [12] | Multiple Choice | 3,200 | Accuracy |
| MVBench [19] | Multiple Choice | 3,586 | Accuracy |
| TempCompass [23] | Multiple Choice | 7,540 | Accuracy |
| MotionBench [14] | Multiple Choice | 3,715 | Accuracy |
| Long Video Understanding | |||
| LVBench [37] | Multiple Choice | 1,492 | Accuracy |
| LongVideoBench [39] | Multiple Choice | 1,337 | Accuracy |
| MLVU [58] | Multiple Choice | 502 | Accuracy |
| Video Reasoning | |||
| Video-Holmes [6] | Multiple Choice | 1,837 | Accuracy |
| MINERVA [26] | Multiple Choice | 1,431 | Accuracy |
| VCR-Bench [27] | Multiple Choice + Open-ended | 1,034 | Accuracy / LLM-as-a-judge |
| VideoReasonBench [24] | Open-ended | 1,440 | LLM-as-a-judge |
| LongVideo-Reason [5] | Multiple Choice | 851 | Accuracy |
| STEM Knowledge | |||
| MMVU [54] | Multiple Choice + Open-ended | 1,000 | Accuracy |
| Video-MMMU [16] | Multiple Choice | 900 | Accuracy |
| VideoMathQA [29] | Multiple Choice | 2,100 | Accuracy |
| Spatial Understanding | |||
| VSI-Bench [46] | Multiple Choice + Regression | 5,130 | Accuracy |
| (Spatio-)Temporal Grounding | |||
| Charades-STA [13] | Regression | 3,720 | tIoU |
| STVG [53] | Regression | 2,000 | tIoU + mIoU |
| Streaming | |||
| OVOBench [21] | Multiple Choice + Counting | 3,035 | Accuracy |
| ODVBench [48] | Multiple Choice | 7,896 | Accuracy |
| LiveSports-QA [3] | Multiple Choice | 1,174 | Accuracy |
Empirical Validation / Results
Experimental Setup
- Base Model: Qwen3-VL-8B-Instruct.
- Training Data: ~100K video samples drawn from OneThinker, Video-R1, and VideoChat-R1, filtered to keep samples with partial success across multiple rollouts (neither always solved nor always failed).
- Training Configuration: GRPO with the DAPO clipping variant and no KL penalty; global batch size 256; AdamW optimizer with weight decay 0.01. Video inputs: 2 FPS, max 128 frames, 262,144 pixels per frame. Image inputs: 1,048,576 pixels. Max response length 4,096 tokens. Trained on 32 GPUs with FSDP.
- Evaluation: 10 benchmarks from Table 2 using asynchronous framework with greedy decoding.
Results: Unlocking the Potential of Instruct Models
After 200 GRPO training steps, the RL-trained model (Qwen3-VL-8B-Instruct + EasyVideoR1) achieves:
- Average accuracy improvement from 62.1 to 64.4 (+2.3).
- Largest gains on reasoning and mathematical tasks: Video-Holmes (+6.6) and VideoMathQA (+6.7).
- Consistent improvements on general video understanding: Video-MME (+2.1), MVBench (+3.5), LVBench (+0.7).
- Competitive with the thinking variant: Achieves comparable or superior results to Qwen3-VL-8B-Think on most benchmarks, without additional reasoning overhead.
Results: Efficiency of Offline Preprocessing and Caching
Cache-based loading vs. on-the-fly decoding (Qwen3-VL-8B, 32 GPUs, batch size 32, max 256 frames):
- Overall speedup: 1.47×.
- Step time reduced from 194.5s to 131.9s.
- Token throughput increased from 797 to 1,175 tokens/s.
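The headline 1.47× follows directly from the reported step times and is mirrored by the token-throughput ratio; a quick arithmetic check:

```python
def speedup(t_baseline: float, t_optimized: float) -> float:
    """Relative speedup from per-step wall-clock times."""
    return t_baseline / t_optimized

step_speedup = speedup(194.5, 131.9)  # ~1.47x, matching the reported figure
rate_ratio = 1175.0 / 797.0           # ~1.47x from the token-rate side, consistent
```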
Phase-level breakdown:
- Rollout generation: Reduced from 82.1s to 53.9s (1.52×).
- Reference model forward pass: Reduced from 53.6s to 18.8s (2.85×).
- Actor parameter update: Constant (~54s) in both modes.
- Total tokens processed per step (~4.93M) identical, confirming semantic preservation.
Theoretical and Practical Implications
Theoretical Implications:
- Demonstrates that RLVR principles can be effectively extended to complex multimodal (video) domains, strengthening deliberative reasoning capabilities.
- Provides a systematic framework for studying hybrid training paradigms (offline-online) and joint modality training (image-video) in RL contexts.
- Highlights the importance of modality-specific optimizations (caching, independent budgets) for efficient training of large vision-language models.
Practical Implications:
- Lowered Barrier for Research: Provides a complete, open-source framework with research-friendly interfaces, enabling community exploration of RL-driven video understanding.
- Improved Training Efficiency: The caching mechanism offers a significant throughput boost (1.47×), making RL training on video data more feasible.
- Reproducible Evaluation: The asynchronous multi-benchmark evaluation framework ensures accuracy aligns with official reports and facilitates fair comparisons.
- Effective Model Enhancement: Shows that RL post-training can elevate a standard instruct model to surpass its dedicated "thinking" variant on multiple benchmarks.
Conclusion
EasyVideoR1 addresses the lack of suitable RL frameworks for video understanding by implementing systematic optimizations. It is, to the best of the authors' knowledge, the most suitable code repository for RL post-training research for video understanding at the time of release. Key contributions include:
- Support for a wide range of video understanding tasks.
- Research-friendly interfaces for mixed offline-online and joint image-video training.
- Enhanced training efficiency through offline preprocessing and caching.
- An efficient, comprehensive, and accuracy-aligned evaluation framework.
The framework successfully improved Qwen3-VL-8B-Instruct's performance across multiple benchmarks with efficient training. The authors hope to inspire enthusiasm within the multimodal community and call for collaborative maintenance to create the most comprehensive and research-friendly repository for video understanding.