EasyVideoR1: Easier RL for Video Understanding
Summary (Overview)
- Complete Video RL Pipeline: A full RL training pipeline with offline preprocessing and tensor caching eliminates redundant video decoding, achieving a 1.47× throughput improvement.
- Task-Aware Reward System: A comprehensive, modular reward system covers 11 distinct video and image problem types with unified routing.
- Mixed Offline-Online Training: A hybrid training paradigm combines curated offline trajectories with on-policy rollout data to improve learning on challenging tasks.
- Joint Image-Video Training: Supports mixed-modality batches with independently configurable pixel budgets for images and videos, allowing modalities to mutually reinforce each other.
- Asynchronous Multi-Benchmark Evaluation: An efficient evaluation framework supports concurrent inference across 22 mainstream video benchmarks with reproducible accuracy.
Introduction and Theoretical Foundation
Reinforcement Learning from Verifiable Rewards (RLVR), exemplified by GRPO, has proven highly effective for improving reasoning in large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding is crucial but remains largely unexplored due to unique challenges:
- Diversity of Tasks: Video understanding spans multiple-choice QA, OCR, temporal localization, spatial grounding, tracking, and dense segmentation.
- Computational Overhead: Repeated decoding and preprocessing of high-dimensional visual inputs creates bottlenecks.
- Reproducible Evaluation: Evaluation is sensitive to numerous hyperparameters (frame sampling, token budget, fps, resolution, prompt template).
Existing frameworks such as EasyR1, R1-V, and OneThinker either focus on image-text scenarios or lack systematic optimizations for video: they suffer from redundant video decoding and provide no reproducible evaluation code. EasyVideoR1 builds on EasyR1 to address these gaps, providing a complete, efficient RL framework designed specifically for video understanding.
Methodology
The system design extends EasyR1 and veRL with systematic support for video RL training and evaluation, organized around three dimensions: video-friendly optimization, research-friendly interfaces, and high-throughput evaluation.
Video-Friendly Optimization
1. Efficient RL with Video Caching
To address CPU-bound I/O bottlenecks from repeated video decoding, EasyVideoR1 introduces offline preprocessing. Videos are decoded, resampled, and resized into cache files keyed by (video_path, fps, max_frames, max_pixels) to invalidate stale entries. During training, only cache file paths are stored, and tensors are loaded locally on each worker, reducing inter-node data transfer. VideoMetadata (frame rate, sampling indices, spatial dimensions) is propagated through the pipeline to ensure consistent video_grid_thw values and skip redundant operations.
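The cache-key scheme described above can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: the function name, digest choice, and `.pt` suffix are assumptions; the point is that any change to the preprocessing parameters produces a new key, so stale entries are never reused.

```python
import hashlib
import os

def cache_key(video_path: str, fps: float, max_frames: int, max_pixels: int) -> str:
    """Derive a cache filename from (video_path, fps, max_frames, max_pixels).

    Changing any preprocessing parameter changes the key, which implicitly
    invalidates stale cache entries (illustrative sketch, not the exact API).
    """
    spec = f"{os.path.abspath(video_path)}|{fps}|{max_frames}|{max_pixels}"
    digest = hashlib.sha256(spec.encode()).hexdigest()[:16]
    return f"{digest}.pt"
```

During training, only such cache paths travel through the dataloader; each worker resolves the path and loads the preprocessed tensor locally, avoiding repeated decoding and inter-node tensor transfer.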
2. Mixed-Modality Pipeline Adaptation
- Mixed-Modality Forward Pass: To handle micro-batches containing only one modality (image or video), dummy tensors for the missing modality are generated. Their encoder outputs are connected via zero-weighted addition to ensure all parameters participate in every forward pass without spurious gradients.
- Independent Resolution Budgets: The parameters image_max_pixels, video_max_pixels, and video_max_frames are decoupled for independent tuning.
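The zero-weighted dummy trick for single-modality micro-batches can be illustrated with a toy two-encoder model. This is a sketch under assumed names (DualEncoder and the linear encoders are hypothetical stand-ins for the real vision towers): the missing modality's encoder still runs on a dummy input, and its output is added with weight 0.0 so every parameter joins the backward pass (as FSDP/DDP require) without receiving spurious gradients.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Toy two-modality model illustrating the zero-weighted dummy pass."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.image_enc = nn.Linear(dim, dim)
        self.video_enc = nn.Linear(dim, dim)

    def forward(self, images=None, videos=None):
        feats = []
        if images is not None:
            feats.append(self.image_enc(images))
        else:
            # Dummy pass keeps image_enc in the autograd graph; the 0.0
            # weight guarantees its gradients are exactly zero.
            dummy = torch.zeros(1, self.image_enc.in_features)
            feats.append(0.0 * self.image_enc(dummy))
        if videos is not None:
            feats.append(self.video_enc(videos))
        else:
            # Same trick for a micro-batch that carries no video.
            dummy = torch.zeros(1, self.video_enc.in_features)
            feats.append(0.0 * self.video_enc(dummy))
        return torch.cat(feats, dim=0).sum()
```

After `loss.backward()` on an image-only batch, `video_enc` still has gradients, but they are all zero, so the optimizer step is unaffected while collective gradient synchronization stays consistent across workers.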
3. Task-Aware Reward System
A modular reward library uses a central dispatcher that routes samples based on problem_type to corresponding reward modules. Supported categories are summarized in Table 1. Prompt formatting uses Jinja2 templates.
Table 1: Supported Task Types and Accuracy Scoring Methods
| Category | Task Type | Accuracy Scoring |
|---|---|---|
| Multiple Choice | multiple choice | Exact match |
| Numerical | numerical, regression | Numeric comparison |
| Temporal Grounding | temporal grounding | 1D IoU |
| ST Grounding | spatial-temporal grounding | |
| Spatial Grounding | spatial grounding | Bounding-box IoU |
| Open-ended | open-ended, video QA | ROUGE score |
| Math | math | Symbolic verification |
| OCR | OCR | WER / exact match |
| Boolean | boolean | Exact match |
| Code | SVG, HTML | Execution / match |
| Preference | LLaVA, critic | LLM-as-Judge |
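The routing pattern in Table 1 amounts to a registry keyed by problem_type. The sketch below shows the shape of such a dispatcher with two of the eleven categories filled in (a 1D temporal IoU and exact-match multiple choice); function and registry names are illustrative, not the framework's actual identifiers.

```python
def temporal_iou(pred, gold):
    """1D IoU between predicted and gold (start, end) intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def multiple_choice_match(pred, gold):
    """Exact match on the selected option letter, case-insensitive."""
    return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0

# Central dispatcher: problem_type -> reward module (only two of the
# eleven supported categories are sketched here).
REWARD_REGISTRY = {
    "multiple choice": multiple_choice_match,
    "temporal grounding": temporal_iou,
}

def compute_reward(sample):
    fn = REWARD_REGISTRY.get(sample["problem_type"])
    if fn is None:
        raise KeyError(f"no reward module for {sample['problem_type']!r}")
    return fn(sample["prediction"], sample["answer"])
```

For example, a predicted interval (2.0, 6.0) against a gold interval (4.0, 8.0) overlaps for 2 seconds out of a 6-second union, yielding a reward of 1/3.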
Research-Friendly Interfaces
1. Hybrid Online-Offline Training
Implemented via a lightweight mix-policy interface. Each training sample can carry a pre-collected offline trajectory. During rollout, the framework generates on-policy responses and substitutes the final slot with the offline trajectory, assembling the full group of responses for reward computation and the GRPO update. The behavior is controlled by the enable_mix_policy flag.
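The slot-substitution step above reduces to a small group-assembly function. The sketch below is illustrative (assemble_group is a hypothetical name): on-policy rollouts fill the group, and when mix-policy is enabled and an offline trajectory exists, it replaces the last slot before rewards are computed.

```python
def assemble_group(on_policy, offline, group_size, enable_mix_policy=True):
    """Build a GRPO response group of size `group_size`.

    If mix-policy training is enabled and the sample carries a pre-collected
    offline trajectory, it replaces the final on-policy slot; otherwise the
    group is purely on-policy (illustrative sketch of the interface).
    """
    group = list(on_policy[:group_size])
    if enable_mix_policy and offline is not None:
        group[-1] = offline
    return group
```

Because the offline trajectory enters the same group-relative advantage computation as the on-policy samples, it acts as a high-quality anchor on tasks where pure on-policy exploration rarely succeeds.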
2. Joint Image-Video Training
Each sample carries a data_type field routing it to the appropriate preprocessor and decoupled resolution budget. A strict-failure policy raises exceptions if placeholder token counts mismatch visual feature counts, enforcing semantic consistency.
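The strict-failure policy can be pictured as a single invariant check before the forward pass. The function below is a hypothetical sketch of that check (names are assumptions): it counts visual placeholder tokens in the tokenized prompt and raises if the count diverges from the number of encoder features, rather than silently truncating or padding.

```python
def check_token_alignment(input_ids, placeholder_id, num_visual_features):
    """Raise if placeholder-token count != visual-feature count.

    Failing loudly here surfaces preprocessing bugs (wrong frame count,
    mismatched resolution budget) instead of letting them corrupt training.
    """
    n_placeholders = sum(1 for t in input_ids if t == placeholder_id)
    if n_placeholders != num_visual_features:
        raise ValueError(
            f"placeholder/feature mismatch: {n_placeholders} placeholder "
            f"tokens vs {num_visual_features} visual features"
        )
```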
3. Broad Model and Algorithm Coverage
- Models: Natively supports Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3.5 series.
- Algorithms: Inherits GRPO, DAPO, GSPO, CISPO, Reinforce++, ReMax, RLOO from EasyR1 and contributes GDPO and LUFFY.
Fast & Comprehensive Evaluation Framework
1. Asynchronous Inference Design
- Precomputed Frame Caching: Video preprocessing (decoding, sampling, resizing) is precomputed and cached on disk, keyed by preprocessing parameters. Evaluation reads from cache, reducing per-video latency to milliseconds.
- Asynchronous Pipeline with AsyncLLMEngine: A three-stage (IO, Prefill, Decode) asynchronous pipeline built on vLLM's AsyncLLMEngine operates concurrently, eliminating batch-boundary stalls. Chunked prefill prevents long sequences from monopolizing GPU.
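The pipeline structure above can be sketched with plain asyncio queues. This is a simplified stand-in, not the framework's implementation: the real inference stage is vLLM's AsyncLLMEngine, while load_fn/infer_fn here are hypothetical coroutines. The key property it demonstrates is that IO for later items overlaps with inference on earlier ones, so no stage waits on a batch boundary.

```python
import asyncio

async def run_pipeline(requests, load_fn, infer_fn, workers=4):
    """Minimal multi-stage asynchronous pipeline (IO -> inference).

    An IO stage streams cached frame tensors into a bounded queue while a
    pool of inference workers drains it concurrently (illustrative sketch).
    """
    queue = asyncio.Queue(maxsize=2 * workers)
    results = {}

    async def io_stage():
        # Stage 1: read preprocessed tensors from the frame cache.
        for req in requests:
            await queue.put((req, await load_fn(req)))
        for _ in range(workers):
            await queue.put(None)  # sentinel: no more work

    async def infer_stage():
        # Stages 2-3: prefill + decode, overlapped with IO of later items.
        while (item := await queue.get()) is not None:
            req, frames = item
            results[req] = await infer_fn(req, frames)

    await asyncio.gather(io_stage(), *(infer_stage() for _ in range(workers)))
    return results
```

The bounded queue provides backpressure so the IO stage cannot run arbitrarily far ahead of inference, keeping memory use flat regardless of benchmark size.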
2. Supported Benchmarks
The framework integrates 22 video understanding benchmarks across seven categories (see Table 2). Each benchmark is registered via a lightweight configuration adapter.
Table 2: Video Understanding Benchmarks Supported by Evaluation Framework
| Benchmark | Task Type | Number | Metric |
|---|---|---|---|
| General Video Understanding | |||
| Video-MME [11] | Multiple Choice | 2,700 | Accuracy |
| Video-MME-v2 [12] | Multiple Choice | 3,200 | Accuracy |
| MVBench [19] | Multiple Choice | 3,586 | Accuracy |
| TempCompass [23] | Multiple Choice | 7,540 | Accuracy |
| MotionBench [14] | Multiple Choice | 3,715 | Accuracy |
| Long Video Understanding | |||
| LVBench [37] | Multiple Choice | 1,492 | Accuracy |
| LongVideoBench [39] | Multiple Choice | 1,337 | Accuracy |
| MLVU [58] | Multiple Choice | 502 | Accuracy |
| Video Reasoning | |||
| Video-Holmes [6] | Multiple Choice | 1,837 | Accuracy |
| MINERVA [26] | Multiple Choice | 1,431 | Accuracy |
| VCR-Bench [27] | Multiple Choice + Open-ended | 1,034 | Accuracy / LLM-as-a-judge |
| VideoReasonBench [24] | Open-ended | 1,440 | LLM-as-a-judge |
| LongVideo-Reason [5] | Multiple Choice | 851 | Accuracy |
| STEM Knowledge | |||
| MMVU [54] | Multiple Choice + Open-ended | 1,000 | Accuracy |
| Video-MMMU [16] | Multiple Choice | 900 | Accuracy |
| VideoMathQA [29] | Multiple Choice | 2,100 | Accuracy |
| Spatial Understanding | |||
| VSI-Bench [46] | Multiple Choice + Regression | 5,130 | Accuracy |
| (Spatio-)Temporal Grounding | |||
| Charades-STA [13] | Regression | 3,720 | tIoU |
| STVG [53] | Regression | 2,000 | tIoU + mIoU |
| Streaming | |||
| OVOBench [21] | Multiple Choice + Counting | 3,035 | Accuracy |
| ODVBench [48] | Multiple Choice | 7,896 | Accuracy |
| LiveSports-QA [3] | Multiple Choice | 1,174 | Accuracy |
Empirical Validation / Results
Experimental Setup
- Base Model: Qwen3-VL-8B-Instruct.
- Training Data: ~100K video samples drawn from OneThinker, Video-R1, and VideoChat-R1, filtered to keep samples with partial success across multiple rollouts (neither always solved nor always failed).
- Training Configuration: GRPO with the DAPO clipping variant and no KL penalty; global batch size 256; AdamW optimizer with weight decay 0.01. Video inputs: 2 FPS, max 128 frames, 262,144 pixels per frame. Image inputs: 1,048,576 pixels. Max response length 4,096 tokens. Trained on 32 GPUs with FSDP.
- Evaluation: 10 benchmarks from Table 2 using asynchronous framework with greedy decoding.
Results: Unlocking the Potential of Instruct Models
After 200 GRPO training steps, the RL-trained model (Qwen3-VL-8B-Instruct + EasyVideoR1) achieves:
- Average accuracy improvement from 62.1 to 64.4 (+2.3).
- Largest gains on reasoning and mathematical tasks: Video-Holmes (+6.6) and VideoMathQA (+6.7).
- Consistent improvements on general video understanding: Video-MME (+2.1), MVBench (+3.5), LVBench (+0.7).
- Competitive with the thinking variant: Achieves comparable or superior results to Qwen3-VL-8B-Think on most benchmarks, without additional reasoning overhead.
Results: Efficiency of Offline Preprocessing and Caching
Cache-based loading vs. on-the-fly decoding (Qwen3-VL-8B, 32 GPUs, batch size 32, max 256 frames):
- Overall speedup: 1.47×.
- Step time reduced from 194.5s to 131.9s.
- Token throughput increased from 797 to 1,175 tokens/s.
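The headline 1.47× follows directly from the reported step times and is mirrored by the token-throughput ratio; a quick arithmetic check:

```python
def speedup(t_baseline: float, t_optimized: float) -> float:
    """Relative speedup from per-step wall-clock times."""
    return t_baseline / t_optimized

step_speedup = speedup(194.5, 131.9)  # ~1.47x, matching the reported figure
rate_ratio = 1175.0 / 797.0           # ~1.47x from the token-rate side, consistent
```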
Phase-level breakdown:
- Rollout generation: Reduced from 82.1s to 53.9s (1.52×).
- Reference model forward pass: Reduced from 53.6s to 18.8s (2.85×).
- Actor parameter update: Constant (~54s) in both modes.
- Total tokens processed per step (~4.93M) identical, confirming semantic preservation.
Theoretical and Practical Implications
Theoretical Implications:
- Demonstrates that RLVR principles can be effectively extended to complex multimodal (video) domains, strengthening deliberative reasoning capabilities.
- Provides a systematic framework for studying hybrid training paradigms (offline-online) and joint modality training (image-video) in RL contexts.
- Highlights the importance of modality-specific optimizations (caching, independent budgets) for efficient training of large vision-language models.
Practical Implications:
- Lowered Barrier for Research: Provides a complete, open-source framework with research-friendly interfaces, enabling community exploration of RL-driven video understanding.
- Improved Training Efficiency: The caching mechanism offers a significant throughput boost (1.47×), making RL training on video data more feasible.
- Reproducible Evaluation: The asynchronous multi-benchmark evaluation framework ensures accuracy aligns with official reports and facilitates fair comparisons.
- Effective Model Enhancement: Shows that RL post-training can elevate a standard instruct model to surpass its dedicated "thinking" variant on multiple benchmarks.
Conclusion
EasyVideoR1 addresses the lack of suitable RL frameworks for video understanding by implementing systematic optimizations. It is, to the best of the authors' knowledge, the most suitable code repository for RL post-training research for video understanding at the time of release. Key contributions include:
- Support for a wide range of video understanding tasks.
- Research-friendly interfaces for mixed offline-online and joint image-video training.
- Enhanced training efficiency through offline preprocessing and caching.
- An efficient, comprehensive, and accuracy-aligned evaluation framework.
The framework successfully improved Qwen3-VL-8B-Instruct's performance across multiple benchmarks with efficient training. The authors hope to inspire enthusiasm within the multimodal community and call for collaborative maintenance to create the most comprehensive and research-friendly repository for video understanding.