EasyVideoR1: Easier RL for Video Understanding

Summary (Overview)

  • Complete Video RL Pipeline: A full RL training pipeline with offline preprocessing and tensor caching eliminates redundant video decoding, achieving a 1.47× throughput improvement.
  • Task-Aware Reward System: A comprehensive, modular reward system covers 11 distinct video and image problem types with unified routing.
  • Mixed Offline-Online Training: A hybrid training paradigm combines curated offline trajectories with on-policy rollout data to improve learning on challenging tasks.
  • Joint Image-Video Training: Supports mixed-modality batches with independently configurable pixel budgets for images and videos, allowing modalities to mutually reinforce each other.
  • Asynchronous Multi-Benchmark Evaluation: An efficient evaluation framework supports concurrent inference across 22 mainstream video benchmarks with reproducible accuracy.

Introduction and Theoretical Foundation

Reinforcement Learning from Verifiable Rewards (RLVR), exemplified by GRPO, has proven highly effective for improving reasoning in large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding is crucial but remains largely unexplored due to unique challenges:

  1. Diversity of Tasks: Video understanding spans multiple-choice QA, OCR, temporal localization, spatial grounding, tracking, and dense segmentation.
  2. Computational Overhead: Repeated decoding and preprocessing of high-dimensional visual inputs creates bottlenecks.
  3. Reproducible Evaluation: Evaluation is sensitive to numerous hyperparameters (frame sampling, token budget, fps, resolution, prompt template).

Existing frameworks such as EasyR1, R1-V, and OneThinker either focus on image-text scenarios or lack systematic optimizations for video, leaving problems such as redundant video decoding and missing reproducible evaluation code unaddressed. EasyVideoR1 builds on EasyR1 to close these gaps, providing a complete, efficient RL framework designed specifically for video understanding.

Methodology

The system design extends EasyR1 and veRL with systematic support for video RL training and evaluation, organized around three dimensions: video-friendly optimization, research-friendly interfaces, and high-throughput evaluation.

Video-Friendly Optimization

1. Efficient RL with Video Caching To address CPU-bound I/O bottlenecks from repeated video decoding, EasyVideoR1 introduces offline preprocessing. Videos are decoded, resampled, and resized into cache files keyed by (video_path, fps, max_frames, max_pixels) to invalidate stale entries. During training, only cache file paths are stored, and tensors are loaded locally on each worker, reducing inter-node data transfer. VideoMetadata (frame rate, sampling indices, spatial dimensions) is propagated through the pipeline to ensure consistent video_grid_thw values and skip redundant operations.
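As a concrete illustration of the caching scheme above, the lookup can be sketched as a hash key over (video_path, fps, max_frames, max_pixels); the function names, cache layout, and pickle serialization here are illustrative assumptions, not the framework's actual API:

```python
import hashlib
import os
import pickle
import tempfile  # used only in the usage example

def cache_key(video_path: str, fps: float, max_frames: int, max_pixels: int) -> str:
    """Derive a cache filename from the preprocessing parameters; changing any
    parameter produces a new key, which implicitly invalidates stale entries."""
    raw = f"{video_path}|{fps}|{max_frames}|{max_pixels}"
    return hashlib.sha256(raw.encode()).hexdigest() + ".pkl"

def load_or_decode(cache_dir, video_path, fps, max_frames, max_pixels, decode_fn):
    """Return cached preprocessed frames, decoding and caching on first access."""
    path = os.path.join(cache_dir, cache_key(video_path, fps, max_frames, max_pixels))
    if os.path.exists(path):
        with open(path, "rb") as f:   # fast path: local load, no video decoding
            return pickle.load(f)
    frames = decode_fn(video_path, fps, max_frames, max_pixels)  # slow path
    with open(path, "wb") as f:
        pickle.dump(frames, f)
    return frames
```

During training only the cache path needs to cross process boundaries; each worker loads the tensor locally, which is what keeps inter-node transfer small.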

2. Mixed-Modality Pipeline Adaptation

  • Mixed-Modality Forward Pass: To handle micro-batches containing only one modality (image or video), dummy tensors for the missing modality are generated. Their encoder outputs are connected via zero-weighted addition to ensure all parameters participate in every forward pass without spurious gradients.
  • Independent Resolution Budgets: Parameters image_max_pixels, video_max_pixels, and video_max_frames are decoupled for independent tuning.
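The zero-weighted addition described above can be illustrated with a toy two-encoder module; this is a minimal sketch of the trick in PyTorch, not the framework's actual model code:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Toy stand-in for a model with separate image and video encoders.

    When a micro-batch contains only one modality, a dummy input for the
    missing modality is encoded and added with weight 0.0, so every parameter
    participates in the forward/backward pass without contributing spurious
    gradients to the loss."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.image_enc = nn.Linear(dim, dim)
        self.video_enc = nn.Linear(dim, dim)

    def forward(self, images=None, videos=None):
        dim = self.image_enc.in_features
        if images is None:
            images = torch.zeros(1, dim)   # dummy tensor for missing modality
            return self.video_enc(videos) + 0.0 * self.image_enc(images).sum()
        if videos is None:
            videos = torch.zeros(1, dim)   # dummy tensor for missing modality
            return self.image_enc(images) + 0.0 * self.video_enc(videos).sum()
        return self.image_enc(images) + self.video_enc(videos)
```

Because the dummy branch is scaled by 0.0, its gradients are exactly zero, yet every parameter appears in the autograd graph, which keeps sharded training (e.g. FSDP) from stalling on unused parameters.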

3. Task-Aware Reward System A modular reward library uses a central dispatcher that routes samples based on problem_type to corresponding reward modules. Supported categories are summarized in Table 1. Prompt formatting uses Jinja2 templates.

Table 1: Supported Task Types and Accuracy Scoring Methods

| Category | Task Type | Accuracy Scoring |
| --- | --- | --- |
| Multiple Choice | multiple choice | Exact match |
| Numerical | numerical, regression | Numeric comparison |
| Temporal Grounding | temporal grounding | 1D IoU |
| ST Grounding | spatial-temporal grounding | 0.5 × tIoU + 0.5 × mIoU |
| Spatial Grounding | spatial grounding | Bounding-box IoU |
| Open-ended | open-ended, video QA | ROUGE score |
| Math | math | Symbolic verification |
| OCR | OCR | WER / exact match |
| Boolean | boolean | Exact match |
| Code | SVG, HTML | Execution / match |
| Preference | LLaVA, critic | LLM-as-Judge |
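The central dispatcher behind Table 1 can be sketched as a registry keyed by problem_type; the registry name and reward functions below are illustrative, covering only three of the eleven categories:

```python
def temporal_iou(pred, gt):
    """1D IoU between [start, end] intervals, used for temporal grounding."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def exact_match(pred, gt):
    """Exact-match accuracy for multiple-choice / boolean answers."""
    return 1.0 if str(pred).strip().lower() == str(gt).strip().lower() else 0.0

# Central dispatcher: problem_type -> reward module (subset of Table 1).
REWARD_REGISTRY = {
    "multiple choice": exact_match,
    "boolean": exact_match,
    "temporal grounding": temporal_iou,
}

def compute_reward(sample: dict) -> float:
    """Route a sample to its reward function based on its problem_type."""
    fn = REWARD_REGISTRY.get(sample["problem_type"])
    if fn is None:
        raise KeyError(f"unsupported problem_type: {sample['problem_type']}")
    return fn(sample["prediction"], sample["ground_truth"])
```

Adding a task type is then one registry entry plus a reward function, which is what keeps the system modular.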

Research-Friendly Interfaces

1. Hybrid Online-Offline Training Implemented via a lightweight mix-policy interface. Each training sample can carry a pre-collected offline trajectory. During rollout, the framework generates n − 1 on-policy responses and substitutes the final slot with the offline trajectory, assembling a group of n responses for reward computation and the GRPO update. Controlled by the enable_mix_policy flag.
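A minimal sketch of the group assembly under enable_mix_policy (policy_sample stands in for the rollout engine; names are illustrative):

```python
def assemble_rollout_group(prompt, policy_sample, n: int, offline_trajectory=None):
    """Build a response group of size n for the GRPO update.

    With mix-policy enabled and an offline trajectory attached to the sample,
    generate n - 1 on-policy responses and place the offline trajectory in the
    final slot; otherwise all n responses are on-policy."""
    if offline_trajectory is not None:
        group = [policy_sample(prompt) for _ in range(n - 1)]
        group.append(offline_trajectory)   # final slot: curated offline data
    else:
        group = [policy_sample(prompt) for _ in range(n)]
    return group
```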

2. Joint Image-Video Training Each sample carries a data_type field routing it to the appropriate preprocessor and decoupled resolution budget. A strict-failure policy raises exceptions if placeholder token counts mismatch visual feature counts, enforcing semantic consistency.
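The strict-failure policy can be sketched as a simple invariant check run before loss computation; the function name and arguments are illustrative:

```python
def check_visual_alignment(placeholder_count: int, feature_count: int, sample_id: str):
    """Strict-failure policy: raise instead of silently truncating or padding
    when the number of visual placeholder tokens in the prompt does not match
    the number of visual features produced by the preprocessor."""
    if placeholder_count != feature_count:
        raise ValueError(
            f"sample {sample_id}: {placeholder_count} placeholder tokens "
            f"vs {feature_count} visual features"
        )
```

Failing loudly here surfaces preprocessing mismatches immediately instead of letting them degrade training as silent misalignment.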

3. Broad Model and Algorithm Coverage

  • Models: Natively supports Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3.5 series.
  • Algorithms: Inherits GRPO, DAPO, GSPO, CISPO, Reinforce++, ReMax, RLOO from EasyR1 and contributes GDPO and LUFFY.

Fast & Comprehensive Evaluation Framework

1. Asynchronous Inference Design

  • Precomputed Frame Caching: Video preprocessing (decoding, sampling, resizing) is precomputed and cached on disk, keyed by preprocessing parameters. Evaluation reads from cache, reducing per-video latency to milliseconds.
  • Asynchronous Pipeline with AsyncLLMEngine: A three-stage (IO, Prefill, Decode) asynchronous pipeline built on vLLM's AsyncLLMEngine operates concurrently, eliminating batch-boundary stalls. Chunked prefill prevents long sequences from monopolizing GPU.
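A toy sketch of the staged pipeline using asyncio queues; the cache read and engine call are stubbed out here, whereas the real pipeline drives vLLM's AsyncLLMEngine:

```python
import asyncio

async def io_stage(samples, q_out):
    """Stage 1: read precomputed frame tensors from the on-disk cache."""
    for s in samples:
        frames = f"frames({s})"             # stands in for a cache read
        await q_out.put((s, frames))
    await q_out.put(None)                    # end-of-stream sentinel

async def infer_stage(q_in, q_out):
    """Stages 2-3: submit requests to the engine; with an async engine,
    prefill and decode of different requests overlap, so no stage waits
    for a whole batch to finish."""
    while (item := await q_in.get()) is not None:
        sample, frames = item
        await asyncio.sleep(0)               # stands in for engine.generate(...)
        await q_out.put((sample, f"answer({sample})"))
    await q_out.put(None)

async def run_eval(samples):
    q1, q2 = asyncio.Queue(maxsize=8), asyncio.Queue(maxsize=8)
    results = []

    async def collect():
        while (item := await q2.get()) is not None:
            results.append(item)

    await asyncio.gather(io_stage(samples, q1), infer_stage(q1, q2), collect())
    return results
```

Bounded queues provide backpressure between stages, so the IO stage cannot run arbitrarily far ahead of inference.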

2. Supported Benchmarks The framework integrates 22 video understanding benchmarks across six categories (see Table 2). Each benchmark is registered via a lightweight configuration adapter.

Table 2: Video Understanding Benchmarks Supported by Evaluation Framework

| Benchmark | Task Type | Number | Metric |
| --- | --- | --- | --- |
| **General Video Understanding** | | | |
| Video-MME [11] | Multiple Choice | 2,700 | Accuracy |
| Video-MME-v2 [12] | Multiple Choice | 3,200 | Accuracy |
| MVBench [19] | Multiple Choice | 3,586 | Accuracy |
| TempCompass [23] | Multiple Choice | 7,540 | Accuracy |
| MotionBench [14] | Multiple Choice | 3,715 | Accuracy |
| **Long Video Understanding** | | | |
| LVBench [37] | Multiple Choice | 1,492 | Accuracy |
| LongVideoBench [39] | Multiple Choice | 1,337 | Accuracy |
| MLVU [58] | Multiple Choice | 502 | Accuracy |
| **Video Reasoning** | | | |
| Video-Holmes [6] | Multiple Choice | 1,837 | Accuracy |
| MINERVA [26] | Multiple Choice | 1,431 | Accuracy |
| VCR-Bench [27] | Multiple Choice + Open-ended | 1,034 | Accuracy / LLM-as-a-judge |
| VideoReasonBench [24] | Open-ended | 1,440 | LLM-as-a-judge |
| LongVideo-Reason [5] | Multiple Choice | 851 | Accuracy |
| **STEM Knowledge** | | | |
| MMVU [54] | Multiple Choice + Open-ended | 1,000 | Accuracy |
| Video-MMMU [16] | Multiple Choice | 900 | Accuracy |
| VideoMathQA [29] | Multiple Choice | 2,100 | Accuracy |
| **Spatial Understanding** | | | |
| VSI-Bench [46] | Multiple Choice + Regression | 5,130 | Accuracy |
| **(Spatio-)Temporal Grounding** | | | |
| Charades-STA [13] | Regression | 3,720 | tIoU |
| STVG [53] | Regression | 2,000 | tIoU + mIoU |
| **Streaming** | | | |
| OVOBench [21] | Multiple Choice + Counting | 3,035 | Accuracy |
| ODVBench [48] | Multiple Choice | 7,896 | Accuracy |
| LiveSports-QA [3] | Multiple Choice | 1,174 | Accuracy |

Empirical Validation / Results

Experimental Setup

  • Base Model: Qwen3-VL-8B-Instruct.
  • Training Data: ~100K video samples from OneThinker, Video-R1, and VideoChat-R1, filtered to keep partially solved samples (0 < pass rate < 1) measured over k = 8 rollouts.
  • Training Configuration: GRPO with the DAPO clipping variant (ε_low = 0.2, ε_high = 0.28), no KL penalty. Rollout group size n = 8, global batch size 256. Learning rate 1 × 10⁻⁶, AdamW (β₁ = 0.9, β₂ = 0.999, weight decay 0.01). Video: 2 FPS, max 128 frames, 262,144 pixels/frame. Image: 1,048,576 pixels. Max response length 4,096 tokens. 32 GPUs with FSDP.
  • Evaluation: 10 benchmarks from Table 2 using asynchronous framework with greedy decoding.
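For reference, the DAPO clipping variant used in this configuration corresponds to a PPO-style surrogate with decoupled clipping ranges; this is a sketch of the per-token objective, not the repository's implementation:

```python
def dapo_clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with decoupled clipping ranges.

    DAPO's "clip-higher" widens the upper bound (1 + eps_high > 1 + eps_low)
    so low-probability tokens with positive advantage can be up-weighted more
    aggressively, while the lower bound stays at 1 - eps_low."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```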

Results: Unlocking the Potential of Instruct Models After 200 GRPO training steps, the RL-trained model (Qwen3-VL-8B-Instruct + EasyVideoR1) achieves:

  • Average accuracy improvement from 62.1 to 64.4 (+2.3).
  • Largest gains on reasoning and mathematical tasks: Video-Holmes (+6.6) and VideoMathQA (+6.7).
  • Consistent improvements on general video understanding: Video-MME (+2.1), MVBench (+3.5), LVBench (+0.7).
  • Competitive with the thinking variant: Achieves comparable or superior results to Qwen3-VL-8B-Think on most benchmarks, without additional reasoning overhead.

Results: Efficiency of Offline Preprocessing and Caching Cache-based loading vs. on-the-fly decoding comparison (Qwen3-VL-8B, 32 GPUs, batch size 32, max 256 frames):

  • Overall speedup: 1.47×.
  • Step time reduced from 194.5s to 131.9s.
  • Token throughput increased from 797 to 1,175 tokens/s.

Phase-level breakdown:

  • Rollout generation: Reduced from 82.1s to 53.9s (1.52×).
  • Reference model forward pass: Reduced from 53.6s to 18.8s (2.85×).
  • Actor parameter update: Constant (~54s) in both modes.
  • Total tokens processed per step (~4.93M) identical, confirming semantic preservation.

Theoretical and Practical Implications

Theoretical Implications:

  • Demonstrates that RLVR principles can be effectively extended to complex multimodal (video) domains, strengthening deliberative reasoning capabilities.
  • Provides a systematic framework for studying hybrid training paradigms (offline-online) and joint modality training (image-video) in RL contexts.
  • Highlights the importance of modality-specific optimizations (caching, independent budgets) for efficient training of large vision-language models.

Practical Implications:

  • Lowered Barrier for Research: Provides a complete, open-source framework with research-friendly interfaces, enabling community exploration of RL-driven video understanding.
  • Improved Training Efficiency: The caching mechanism offers a significant throughput boost (1.47×), making RL training on video data more feasible.
  • Reproducible Evaluation: The asynchronous multi-benchmark evaluation framework ensures accuracy aligns with official reports and facilitates fair comparisons.
  • Effective Model Enhancement: Shows that RL post-training can elevate a standard instruct model to surpass its dedicated "thinking" variant on multiple benchmarks.

Conclusion

EasyVideoR1 addresses the lack of suitable RL frameworks for video understanding by implementing systematic optimizations. It is, to the best of the authors' knowledge, the most suitable code repository for RL post-training research for video understanding at the time of release. Key contributions include:

  • Support for a wide range of video understanding tasks.
  • Research-friendly interfaces for mixed offline-online and joint image-video training.
  • Enhanced training efficiency through offline preprocessing and caching.
  • An efficient, comprehensive, and accuracy-aligned evaluation framework.

The framework successfully improved Qwen3-VL-8B-Instruct's performance across multiple benchmarks with efficient training. The authors hope to inspire enthusiasm within the multimodal community and call for collaborative maintenance to create the most comprehensive and research-friendly repository for video understanding.