Visual Summary | VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Summary (Overview)

VideoKR is the first large-scale training corpus specifically designed for knowledge- and reasoning-intensive video understanding, comprising 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos.
A human-in-the-loop, skill-oriented example generation pipeline targets three complementary capabilities: basic video reasoning, knowledge-enhanced video perception, and knowledge-intensive video reasoning.
A new evaluation benchmark, VideoKR-Eval, is curated with expert-annotated examples that require genuine video understanding and knowledge-intensive reasoning, mitigating textual or single-frame shortcuts found in prior benchmarks.
Under a standard SFT → GRPO post-training pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning.
Comprehensive ablations isolate the contributions of CoT supervision, skill-based data composition, and training data difficulty, providing actionable insights for future work.

Introduction and Theoretical Foundation

The paper identifies a key bottleneck in current video understanding models: while progress has been rapid on surface-level perception tasks (e.g., action recognition, event localization), models struggle with tasks requiring domain knowledge and multi-step inference (e.g., scientific experiments, medical procedures). Existing large-scale video training corpora are heavily skewed toward everyday activities with limited coverage of specialized domains.

To bridge this gap, the authors introduce VideoKR – a corpus that tightly integrates domain knowledge, visual grounding, and structured reasoning. The theoretical foundation decomposes knowledge- and reasoning-intensive video understanding into three core skills:

Basic Video Reasoning (VIDR) – direct comprehension of observable events without external knowledge.
Knowledge-enhanced Video Perception (KNOWVID) – aligning visual cues with explicit domain concepts (e.g., recognizing lab apparatus).
Knowledge-Intensive Video Reasoning (KNOWVIDR) – multi-hop inference integrating visual evidence and domain knowledge (e.g., calculating chemical yields from observed reactions).

The work adopts a corpus-centric perspective, arguing that data design is a primary limiting factor for advanced video reasoning, and deliberately uses a standard SFT → GRPO pipeline to attribute performance gains to the training data.

Methodology

Data Construction Pipeline

Domain Knowledge Bank Construction (Section 3.1): Built a hierarchical knowledge base (Subject → Course → Lecture → Knowledge Point) covering 82 subjects across four disciplines (Natural Sciences, Engineering, Healthcare, Humanities & Social Sciences), totaling 63,745 knowledge points.
Knowledge-Driven Video Collection (Section 3.2): Generated 1–3 realistic scenarios per knowledge point to retrieve authentic videos (e.g., “rocket launching” for Newton’s Second Law). Searched YouTube for CC-licensed videos, performed multi-stage relevance filtering (text metadata, visual content by MLLMs, safety moderation), collecting 145K videos (average duration 344 seconds).
Skill-Oriented Example Generation (Section 3.3):
- Seed examples: 1,800 expert-curated examples (150 per skill per discipline) with detailed CoT rationales.
- Scalable generation: For each video, 2 examples per skill are generated using frontier MLLMs (pool of 7 models), guided by seed examples and knowledge points.
- Validation: Three-stage filtering – self-consistency check, video dependency filter (removing examples solvable from text + single frame), and CoT rationale validation by an independent verifier.
Quality Control (Section 3.4):
- Human-validated model selection: For each pipeline step, a model is eligible only if its error rate is below 3% in an expert evaluation of 100 samples.
- Contamination mitigation: YouTube-ID filtering and near-duplicate video filtering via perceptual hashing.
Final Corpus: Randomly partitioned into VideoKR-SFT-201K (with CoT rationales) and VideoKR-RL-114K (questions and verifiable answers only).
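The near-duplicate filtering step listed under Quality Control can be illustrated with a minimal sketch. The summary only says "perceptual hashing"; the average-hash variant, the Hamming-distance threshold, the use of a single downscaled grayscale frame per video, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Sequence

# A frame is a small grayscale grid (values 0-255), e.g. a downscaled keyframe.
Frame = Sequence[Sequence[int]]


def average_hash(frame: Frame) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def filter_near_duplicates(
    candidates: Dict[str, Frame],
    reference_hashes: List[int],
    max_distance: int = 1,  # assumed threshold, not from the paper
) -> List[str]:
    """Keep video IDs whose hash differs from every reference hash by more than max_distance bits."""
    kept = []
    for video_id, frame in candidates.items():
        h = average_hash(frame)
        if all(hamming(h, ref) > max_distance for ref in reference_hashes):
            kept.append(video_id)
    return kept


# Toy usage: one reference frame, one near-duplicate, one distinct frame.
reference = [average_hash([[10, 200], [200, 10]])]
candidates = {
    "dup": [[12, 198], [205, 9]],    # same bright/dark layout as the reference
    "new": [[200, 10], [10, 200]],   # inverted layout
}
kept_ids = filter_near_duplicates(candidates, reference)
```

In this toy case only the inverted frame survives the filter. A real pipeline would likely hash several frames per video and combine this with the YouTube-ID check.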

Post-Training Setup

Base models: Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct.
Pipeline: SFT (1 epoch on VideoKR-SFT-201K) → GRPO (1 epoch on VideoKR-RL-114K) with batch size 32.
Reward: ROUGE for open-ended QA, Exact Match for multiple-choice; format reward included.
Evaluation: 7 benchmarks grouped into General Video Reasoning (Video-MME, MVBench, LongVideoBench) and Knowledge-Intensive Video Reasoning (VideoMMMU, MMVU, SciVideoBench, VideoKR-Eval). Standardized evaluation using LMMs-Eval with fixed prompts and temperatures.
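The reward described above (ROUGE for open-ended QA, Exact Match for multiple-choice, plus a format reward) could look roughly like the following sketch. Several details are assumptions, since the summary does not specify them: ROUGE-L F1 is used as the ROUGE variant, the answer is expected inside hypothetical `<answer>...</answer>` tags, and the format reward is a fixed bonus of 0.1.

```python
import re
from typing import List

# Assumed output format: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased, whitespace-split tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(response: str, reference: str, is_multiple_choice: bool,
           format_bonus: float = 0.1) -> float:
    """Accuracy reward plus a format bonus; 0.0 if no answer can be parsed."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    answer = match.group(1).strip()
    if is_multiple_choice:
        accuracy = 1.0 if answer.upper() == reference.strip().upper() else 0.0
    else:
        accuracy = rouge_l_f1(answer, reference)
    return accuracy + format_bonus


mc_reward = reward("<answer>B</answer>", "B", is_multiple_choice=True)
open_reward = reward("<answer>the flask turns blue</answer>",
                     "the solution turns blue", is_multiple_choice=False)
```

Both branches return a scalar that can be checked automatically, which is what allows the RL split to contain only questions and verifiable answers.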

Empirical Validation / Results

Main Results (Table 3)

Post-training on VideoKR raises the knowledge-intensive average of both base models (+4.7 and +3.0), while the general average changes only slightly (+1.4 for Qwen2.5-VL-7B-Instruct, -0.5 for Qwen3-VL-8B-Instruct):

Base Model	Setting	General Avg	Knowledge-Intensive Avg	Knowledge-Intensive Gain
Qwen2.5-VL-7B-Instruct	Base (128 frames)	64.1	41.9	–
	+ VideoKR (SFT+RL)	65.5	46.6	+4.7
Qwen3-VL-8B-Instruct	Base (128 frames)	65.9	48.5	–
	+ VideoKR (SFT+RL)	65.4	51.5	+3.0

Notably, Qwen3-VL-8B post-trained on VideoKR achieves the best knowledge-intensive average among 7B/8B-scale models (51.5 vs. 50.0 for Qwen3-VL-8B-Thinking).
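As a quick check on the table, the per-model changes can be recomputed from the reported averages. The snippet below uses only the numbers in the table; the dictionary layout and function name are illustrative.

```python
# (general average, knowledge-intensive average) as reported in the table above
results = {
    "Qwen2.5-VL-7B-Instruct": {"base": (64.1, 41.9), "videokr": (65.5, 46.6)},
    "Qwen3-VL-8B-Instruct": {"base": (65.9, 48.5), "videokr": (65.4, 51.5)},
}


def deltas(model: str) -> tuple:
    """Return (general delta, knowledge-intensive delta), rounded to one decimal."""
    base_general, base_ki = results[model]["base"]
    post_general, post_ki = results[model]["videokr"]
    return (round(post_general - base_general, 1), round(post_ki - base_ki, 1))


for model in results:
    general_delta, ki_delta = deltas(model)
    print(f"{model}: general {general_delta:+.1f}, knowledge-intensive {ki_delta:+.1f}")
```

The Gain column corresponds to the knowledge-intensive delta. The general average moves by +1.4 for Qwen2.5-VL-7B-Instruct and by -0.5 for Qwen3-VL-8B-Instruct.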

Ablation Studies (Table 4)

Skill-Oriented Data Composition: Incorporating all three skills (VIDR + KNOWVID + KNOWVIDR) yields the best knowledge-intensive performance (42.4 vs. 41.4 with VIDR only).
CoT Supervision: The CoT-trained model improves the knowledge-intensive average by 3.0 points over the direct-output baseline (42.4 vs. 39.4).
Comparison with Prior Corpora: Under SFT, VideoKR-SFT is the only corpus that surpasses the base model (+0.5), while prior corpora cause performance drops. Under zero-RL, VideoKR-RL achieves the strongest gain (+1.1 over base).

Training-Data Difficulty (Table 5)

Accuracy of base models on 3,000 randomly sampled QA examples from various corpora:

Model	Video-R1	VideoRFT	OneThinker	VideoAuto-R1	VideoKR
Qwen2.5-VL-7B-Inst.	55.3	47.8	45.8	57.1	39.2
Qwen3-VL-8B-Inst.	57.1	51.1	49.1	54.5	42.3

Both base models score lowest on VideoKR (39.2 and 42.3, versus 45.8 to 57.1 on the other corpora), which is consistent with its examples providing a more challenging and informative learning signal.

Single-Frame Answerability (Table 2)

Model	VideoMMMU	MMVU	SciVideoBench	VideoKR-Eval
Claude-4.5-Sonnet	35.3	41.3	21.8	9.5
Qwen3-VL-235B-A22B	39.3	45.2	13.2	10.1
GPT-5.2	38.3	49.7	23.0	10.7

VideoKR-Eval sharply reduces single-frame answerability: the three models score 9.5 to 10.7 on it, compared with 13.2 to 49.7 on the existing benchmarks.
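To make the comparison concrete, the per-benchmark mean over the three models can be computed directly from the table. This is plain arithmetic on the reported values; the variable and function names are illustrative.

```python
# Single-frame accuracy per benchmark, in table order:
# Claude-4.5-Sonnet, Qwen3-VL-235B-A22B, GPT-5.2
single_frame_acc = {
    "VideoMMMU": [35.3, 39.3, 38.3],
    "MMVU": [41.3, 45.2, 49.7],
    "SciVideoBench": [21.8, 13.2, 23.0],
    "VideoKR-Eval": [9.5, 10.1, 10.7],
}


def mean_accuracy(benchmark: str) -> float:
    """Mean single-frame accuracy across the three models, rounded to one decimal."""
    scores = single_frame_acc[benchmark]
    return round(sum(scores) / len(scores), 1)


means = {name: mean_accuracy(name) for name in single_frame_acc}
lowest = min(means, key=means.get)
```

Averaged over the three models, VideoKR-Eval has the lowest single-frame accuracy of the four benchmarks, roughly half that of SciVideoBench.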

Theoretical and Practical Implications

Data quality is the primary driver: By using a standard post-training pipeline, the paper isolates data design as the key factor – models trained on VideoKR outperform those trained on prior corpora with the same algorithmic setup.
Integration of domain knowledge with visual grounding: The skill-oriented decomposition and inclusion of structured domain knowledge enable models to perform multi-hop inference that combines observation with learned principles.
Training-data difficulty matters: VideoKR's lower base-model accuracy (Table 5) indicates it provides a stronger learning signal, avoiding saturation of current models.
Practical recommendations: For future video reasoning corpus construction, the paper recommends: (i) using diverse, CC-licensed expert-domain videos; (ii) applying strict video dependency filtering to avoid textual shortcuts; (iii) incorporating human-validated model selection to avoid single-model biases; (iv) including CoT rationales and multiple skill levels.

Conclusion

This work presents VideoKR, a large-scale, high-quality training corpus for knowledge- and reasoning-intensive video understanding, together with VideoKR-Eval, a robust evaluation benchmark. Through extensive experiments, the paper demonstrates that with a corpus-centric approach – integrating structured domain concepts, visually grounded examples, and rigorous quality control – models can achieve significant gains in knowledge-intensive video reasoning without relying on sophisticated RL reward engineering. The results suggest that data design, rather than algorithmic sophistication, is the primary bottleneck for advancing video reasoning. Future directions include extending to longer videos, incorporating more diverse knowledge sources, and further improving reasoning quality through advanced training techniques.