Summary (Overview)
- Kwai Keye-VL-2.0-30B-A3B is an open-source Mixture-of-Experts (MoE) multimodal foundation model with 30B total parameters and only 3B active parameters, designed for long-video understanding and agentic intelligence.
- It is the first model to adapt DeepSeek Sparse Attention (DSA) to a GQA-based multimodal architecture, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies.
- Introduces Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL to overcome catastrophic forgetting during multi-task alignment, allowing the model to natively support agent collaboration across Code, Tool, and Search scenarios.
- Achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens benchmarks (ActivityNet, QVHighlights, Charades) and long-video comprehension on Video-MME-v2 and LongVideoBench.
- The model, checkpoints, and infrastructure are fully open-sourced via Hugging Face and GitHub to accelerate community progress.
Introduction and Theoretical Foundation
Background and Motivation
Recent large language models (e.g., GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash, Qwen3.7) have made substantial progress in multimodal reasoning and long-context understanding. However, scaling to hour-level video contexts incurs prohibitive computational and memory costs (the infrastructural bottleneck), and integrating complex agent tasks often induces catastrophic forgetting of foundational reasoning capabilities (the algorithmic dilemma).
Theoretical Basis
Keye-VL-2.0 addresses these challenges through two paradigm-shifting innovations:
- Extreme Context Scaling via Multimodal DSA: Standard dense attention leads to catastrophic KV cache expansion. DSA compresses and sparsifies video feature aggregation, constraining linear KV cache growth and enabling 256K lossless context processing.
- Resolving Modality Conflict via Cross-Modal MOPD: Direct injection of multiple capabilities often degrades reasoning. MOPD uses specialized teacher models to provide dense token-level feedback on student-generated trajectories, isolating task-specific expertise and preserving general-purpose reasoning baselines.
Methodology
Model Architecture
The model comprises four core components:
- Vision Encoder (ViT): Inherited from Keye-VL-1.5, based on SigLIP-400M-384-14, with native-resolution encoding, 2D RoPE, and adaptive position encoding.
- Language Decoder (LLM): Based on Qwen3-30B-A3B-Thinking-2507.
- MLP Projector: Randomly initialized and trained to align visual features.
- Sparse Attention Module: A GQA-compatible DSA design combining MQA-based indexing with grouped GQA aggregation.
DSA for Long-Context Multimodal Modeling
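The DSA contains a Lightning Indexer and a fine-grained token selection mechanism. For throughput, the indexer follows an MQA key-sharing design, scoring each preceding token $s$ for the current query token $t$:

$$I_{t,s} = \sum_{j=1}^{H^I} w^I_{t,j} \cdot \mathrm{ReLU}\left(q^I_{t,j} \cdot k^I_s\right)$$

where $H^I$ is the number of indexer heads, $q^I_{t,j}$ and $w^I_{t,j}$ are derived from the current query token $h_t$, and $k^I_s$ is the shared key from the preceding token $h_s$. The Top-$k$ tokens form the sparse index set:

$$\mathcal{S}_t = \left\{\, s \mid I_{t,s} \in \text{Top-}k\left(I_{t,:}\right) \right\}$$

In the GQA backbone, the sparse attention output for the $g$-th group is:

$$u_t^{(g)} = \mathrm{Attn}\left(q_t^{(g)},\ \left\{ \left(k_s^{(g)}, v_s^{(g)}\right) \mid s \in \mathcal{S}_t \right\}\right)$$

With $k$ selected tokens per query and sequence length $L$, core attention complexity reduces from $O(L^2)$ to $O(Lk)$, where $k \ll L$.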
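The indexer, top-k selection, and grouped sparse attention described above can be sketched in NumPy. This is a minimal illustration rather than the released kernels: the function names, tensor layouts, and the per-row Python loop are assumptions made for readability, and the real implementation is reported to use FlashInfer and TileLang.

```python
import numpy as np


def lightning_index_scores(q_idx, w_idx, k_idx):
    """Index scores I[t, s] = sum_j w[t, j] * ReLU(q[t, j] . k[s]).

    q_idx: (T, H, d) per-head indexer queries
    w_idx: (T, H)    per-head indexer weights
    k_idx: (T, d)    one key per token, shared by all indexer heads (MQA-style)
    """
    dots = np.einsum("thd,sd->ths", q_idx, k_idx)
    scores = np.einsum("th,ths->ts", w_idx, np.maximum(dots, 0.0))
    causal = np.tril(np.ones(scores.shape, dtype=bool))
    return np.where(causal, scores, -np.inf)  # future tokens are never selectable


def topk_index_set(scores, k):
    """Boolean mask (T, T): True where token s is in the top-k set of query t."""
    T = scores.shape[0]
    mask = np.zeros((T, T), dtype=bool)
    for t in range(T):
        n_valid = t + 1  # tokens 0..t are visible to query t
        top = np.argsort(-scores[t, :n_valid], kind="stable")[: min(k, n_valid)]
        mask[t, top] = True
    return mask


def sparse_gqa_attention(q, k, v, mask):
    """Attention restricted to the selected tokens.

    q: (T, G, R, d) with G KV groups and R query heads per group
    k, v: (T, G, d) one key/value head per group
    mask: (T, T) selection mask shared by every group and head
    """
    d = q.shape[-1]
    logits = np.einsum("tgrd,sgd->tgrs", q, k) / np.sqrt(d)
    logits = np.where(mask[:, None, None, :], logits, -np.inf)
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum("tgrs,sgd->tgrd", probs, v)
```

Each query attends to at most `k` keys, which is where the $O(Lk)$ cost comes from. The sketch still materializes full `(T, T)` score matrices for clarity, so it shows the selection logic, not the memory savings.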
Two-Stage DSA Training
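- Dense Warm-up: Initializes the indexer using a KL divergence loss:

$$\mathcal{L}^I = \sum_t D_{\mathrm{KL}}\left(p_{t,:} \,\|\, \mathrm{Softmax}\left(I_{t,:}\right)\right)$$

  where $I_{t,:}$ denotes the indexer scores of query token $t$ and $p_{t,:}$ is the target distribution derived from the dense attention scores.

- Sparse Adaptation: All parameters unfrozen, training switches to sparse mode, with the indexer loss restricted to the selected index set $\mathcal{S}_t$:

$$\mathcal{L}^I = \sum_t D_{\mathrm{KL}}\left(p_{t,\mathcal{S}_t} \,\|\, \mathrm{Softmax}\left(I_{t,\mathcal{S}_t}\right)\right)$$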
The total loss is:
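The KL-based indexer objective used in the two training stages can be illustrated with a small NumPy sketch. It assumes the target distribution is built as in the original DSA recipe, by summing the main attention probabilities over heads and L1-normalizing them. Whether Keye-VL-2.0 builds its target in exactly this way is an assumption, and `indexer_kl_loss` is a hypothetical helper name.

```python
import numpy as np


def indexer_kl_loss(attn_probs, index_scores, select_mask=None):
    """Sum over query tokens of KL(target || softmax(index scores)).

    attn_probs:   (T, Hq, T) causal attention probabilities of the main heads
    index_scores: (T, T)     indexer scores
    select_mask:  optional (T, T) bool; if given, both distributions are
                  restricted to the selected tokens (sparse stage)
    """
    T = index_scores.shape[0]
    valid = np.tril(np.ones((T, T), dtype=bool))
    if select_mask is not None:
        valid &= select_mask

    # Assumed target: head-summed attention, L1-normalized over valid tokens.
    target = np.where(valid, attn_probs.sum(axis=1), 0.0)
    target /= target.sum(axis=-1, keepdims=True)

    # Log-softmax of the indexer scores over the same valid tokens.
    logits = np.where(valid, index_scores, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_pred = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    safe_log_pred = np.where(valid, log_pred, 0.0)
    safe_log_target = np.log(np.where(target > 0, target, 1.0))
    return (target * (safe_log_target - safe_log_pred)).sum()
```

Passing `select_mask=None` corresponds to the dense warm-up stage, while passing the top-k selection mask corresponds to the sparse adaptation stage.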
Pre-Training Pipeline (Four-Stage Curriculum)
| Stage | Description | Sequence Length | Data Scale | Key Data Types |
|---|---|---|---|---|
| 0 | Projector Initialization | 8K | 4B | Caption, Interleaved Image |
| 1 | General Multimodal Pre-training | 32K | 1T | Caption, OCR, Video, Pure-Text QA |
| 2 | Multi-Task Capability Injection | 64K | 2T | STEM, GUI, Grounding, Coding, Tool-use/Search |
| 3 | Long-Context Extension | 256K | 500B | Long Video, Long Docs, Multi-doc, Long Code, Long Agent Traj. |
Post-Training
Reinforcement Learning
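General RL uses Group Sequence Policy Optimization (GSPO). For a group of $G$ responses $y_i$ sampled for a query $x$, the objective is:

$$\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\left(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i\right)\right]$$

where $s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}$ is an importance sampling ratio and $\hat{A}_i$ is the normalized advantage.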
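A compact sketch of the GSPO objective for a single group may help show how the sequence-level ratio differs from a token-level one. The function name and the clipping range `eps=0.2` are illustrative assumptions. The source does not state the value used for Keye-VL-2.0.

```python
import numpy as np


def gspo_objective(logp_new, logp_old, rewards, eps=0.2):
    """Clipped GSPO surrogate for one group of responses to the same prompt.

    logp_new, logp_old: lists of 1-D arrays holding the per-token log-probs of
                        each response under the current and the old policy
    rewards:            one scalar reward per response
    eps:                clipping range (placeholder value, an assumption)
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-normalized advantage, shared by every token of a response.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Length-normalized sequence ratio: (pi_new(y) / pi_old(y)) ** (1 / |y|).
    ratios = np.array([np.exp((n - o).mean()) for n, o in zip(logp_new, logp_old)])
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratios * adv, clipped * adv).mean()
```

Because the ratio is computed once per response, clipping keeps or drops a whole sequence instead of individual tokens.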
Specialized RL covers Grounding, Spatial, Math, Counting, and OCR experts using domain-specific rewards (e.g., IoU for grounding, symbolic equivalence for math).
Video RL uses temporal IoU (tIoU) rewards and LLM-as-Judge for dense captioning.
Agentic RL covers Coding RL (Online Judge, SWE-bench), Tool Use RL (150+ API domains), and Search RL with environment-grounded rewards.
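The temporal IoU reward mentioned for Video RL has a simple closed form: the overlap of the predicted and ground-truth segments divided by their union. The sketch below uses `(start, end)` tuples in seconds. This representation and the function name are assumptions for illustration.

```python
def temporal_iou(pred, gt):
    """tIoU between two time segments given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0
```

For example, a prediction of 10 s to 20 s against a ground truth of 15 s to 25 s overlaps for 5 s out of a 15 s union, giving a reward of 1/3.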
Cross-Modal Multi-Teacher On-Policy Distillation (MOPD)
MOPD maintains 13 RL-trained domain teachers. For a student on-policy response, the token-level advantage is computed over a top-$k$ overlap set:
The student is optimized with:
with token-category-aware advantage scaling and localized repetition penalties.
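Since the exact MOPD equations are not reproduced here, the following is only one plausible reading of a token-level advantage over a top-k overlap set. At each position it takes the tokens that rank in the top-k of both the student and the teacher, and scores the student by the negative reverse KL restricted to that set. The definition, the function names, and the absence of category-aware scaling and repetition penalties are all assumptions.

```python
import numpy as np


def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def topk_overlap_advantage(student_logits, teacher_logits, k):
    """Hypothetical per-position advantage for on-policy distillation.

    student_logits, teacher_logits: (T, V) logits at each position of a
    student-generated response. Returns a (T,) array; positions where the
    two top-k sets do not overlap receive zero advantage.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    adv = np.zeros(log_s.shape[0])
    for t in range(log_s.shape[0]):
        top_s = set(np.argsort(-log_s[t])[:k].tolist())
        top_t = set(np.argsort(-log_t[t])[:k].tolist())
        overlap = sorted(top_s & top_t)
        if overlap:
            p_s = np.exp(log_s[t, overlap])
            # Negative reverse KL over the overlap: higher means closer to the teacher.
            adv[t] = -(p_s * (log_s[t, overlap] - log_t[t, overlap])).sum()
    return adv
```

Restricting the signal to the overlap set avoids penalizing the student on tokens that neither model considers likely, which is one way dense token-level feedback could stay focused on plausible continuations.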
Efficient Infrastructure
- ViT–LM Heterogeneous Parallelism: The ViT and LM are co-located on the same GPUs with separate sharding strategies; recompute-or-offload reduces ViT activation memory to near zero.
- DSA Optimization: FlashInfer and TileLang achieve >2× speedup; top-$k$ memory optimization and short-sequence optimization for variable-length inputs.
- Inference Efficiency: Chunk ViT, sparse attention deduplication, and decode optimization reduce prefill cost by >3× and decode cost by >5× at 128K context.
Empirical Validation / Results
Video Understanding
Table 2: Video Understanding Evaluation
| Category | Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | InternVL3.5 241B-A28B | GPT-5-mini | Qwen3-VL 235B-A22B Thinking |
|---|---|---|---|---|---|---|
| Long-Video Comprehensive | LongVideoBench | 74.1 | 61.6 | 67.1 | – | 70.5 |
| Long-Video Comprehensive | Video-MME-v2 ACC (512 frames) | 42.4 | 28.5 | – | – | 36.8 |
| Long-Video Comprehensive | Video-MME-v2 Non-Lin (512 frames) | 24.2 | 12.2 | – | – | 28.1 |
| Long-Video Comprehensive | MLVU | 82.8 | 85.6 | 78.2 | 83.3 | 83.8 |
| Long-Video Comprehensive | Video-MME (w/o sub.) | 78.3 | 82.5 | 72.9 | 78.9 | 79.0 |
| Temporal Grounding (TimeLens) | ActivityNet-TimeLens | 58.5 | 53.2 | – | – | 52.1 |
| Temporal Grounding (TimeLens) | QVHighlights-TimeLens | 70.1 | 65.7 | – | – | 64.6 |
| Temporal Grounding (TimeLens) | Charades-TimeLens | 58.4 | 49.1 | – | – | 47.8 |
| Video Knowledge Acquisition | Video-MMMU | 80.0 | 80.4 | – | 82.5 | 80.0 |
Keye-VL-2.0 achieves the best scores on LongVideoBench (74.1), Video-MME-v2 ACC (42.4 at 512 frames), and all three TimeLens subsets (ActivityNet, QVHighlights, Charades), demonstrating superior long-video comprehension and fine-grained temporal localization.
Agentic Capability Evaluation
Table 3: Code Agent Evaluation
| Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | GPT-5-mini | Qwen3-VL 235B-A22B Thinking |
|---|---|---|---|---|
| LiveCodeBench v6 | 64.2 | 62.8 | 51.5 | – |
| OJBench | 71.5 | 70.2 | 58.7 | – |
| SWE-bench Verified | 62.0 | 63.5 | 55.5 | – |
Table 4: Tool-Use Evaluation
| Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | GPT-5-mini | Qwen3-VL 235B-A22B Thinking |
|---|---|---|---|---|
| BFCL-V4 | 65.7 | 67.3 | 55.5 | – |
| τ²-Bench | 82.6 | 81.2 | 69.8 | – |
| VitaBench | 33.1 | 31.9 | 13.9 | – |
Keye-VL-2.0 shows strong coding (64.2 on LiveCodeBench v6) and tool-use abilities (best on τ²-Bench and VitaBench), with only 3B active parameters.
Theoretical and Practical Implications
- Sparse Long-Context Modeling: Demonstrates that DSA can be effectively adapted to GQA-based multimodal architectures, enabling 256K context processing at controllable cost. This provides a practical path for scaling video understanding beyond frame-limited perception.
- Capability Consolidation via MOPD: MOPD resolves the multimodal alignment dilemma by isolating task-specific expertise through on-policy distillation, preventing catastrophic forgetting while integrating heterogeneous agent capabilities.
- Deployable Efficiency: With only 3B active parameters and optimized DSA kernels, the model achieves high inference efficiency, making hour-level video understanding feasible for real applications.
- Open-Source Impact: Full release of model checkpoints and infrastructure (ViT-LM parallelism, DSA kernels, MOPD system) accelerates community progress toward scalable multimodal agentic applications.
Conclusion
Kwai Keye-VL-2.0-30B-A3B is a 30B MoE multimodal foundation model (3B active) that pioneers DSA in GQA-based architectures for 256K context processing. Its post-training pipeline, featuring Cross-Modal Multi-Teacher On-Policy Distillation and diverse RL stages, enables leading performance in long-video understanding, temporal grounding, and agentic tasks without sacrificing general reasoning. Future work will focus on deeper integration into real business pipelines (generative recommendation, content governance), developing Video × Agent workflows for automated orchestration, and strengthening the underlying infrastructure toward native multimodal modeling. The open-source release aims to turn long-context multimodal intelligence into reliable, scalable infrastructure for real-world applications.