Visual Summary | Kwai Keye-VL-2.0 Technical Report

Summary (Overview)

Kwai Keye-VL-2.0-30B-A3B is an open-source Mixture-of-Experts (MoE) multimodal foundation model with 30B total parameters and only 3B active parameters, designed for long-video understanding and agentic intelligence.
It is the first model to adapt DeepSeek Sparse Attention (DSA) to a GQA-based multimodal architecture, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies.
The report introduces Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL, to overcome catastrophic forgetting during multi-task alignment, allowing the model to natively support agent collaboration across Code, Tool, and Search scenarios.
The model achieves state-of-the-art performance among models of similar scale, particularly in fine-grained temporal localization on the TimeLens benchmarks (ActivityNet, QVHighlights, Charades) and long-video comprehension on Video-MME-v2 and LongVideoBench.
The model, checkpoints, and infrastructure are fully open-sourced via Hugging Face and GitHub to accelerate community progress.

Introduction and Theoretical Foundation

Background and Motivation

Recent large language models (e.g., GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash, Qwen3.7) have made substantial progress in multimodal reasoning and long-context understanding. However, scaling to hour-level video contexts incurs prohibitive computational and memory costs (the infrastructural bottleneck), and integrating complex agent tasks often induces catastrophic forgetting of foundational reasoning capabilities (the algorithmic dilemma).

Theoretical Basis

Keye-VL-2.0 addresses these challenges through two paradigm-shifting innovations:

Extreme Context Scaling via Multimodal DSA: Standard dense attention leads to catastrophic KV cache expansion. DSA compresses and sparsifies video feature aggregation, constraining linear KV cache growth and enabling 256K lossless context processing.
Resolving Modality Conflict via Cross-Modal MOPD: Direct injection of multiple capabilities often degrades reasoning. MOPD uses specialized teacher models to provide dense token-level feedback on student-generated trajectories, isolating task-specific expertise and preserving general-purpose reasoning baselines.

Methodology

Model Architecture

The model comprises four core components:

Vision Encoder (ViT): Inherited from Keye-VL-1.5, based on SigLIP-400M-384-14, with native-resolution encoding, 2D RoPE, and adaptive position encoding.
Language Decoder (LLM): Based on Qwen3-30B-A3B-Thinking-2507.
MLP Projector: Randomly initialized and trained to align visual features.
Sparse Attention Module: A GQA-compatible DSA design combining MQA-based indexing with grouped GQA aggregation.

DSA for Long-Context Multimodal Modeling

The DSA contains a Lightning Indexer and a fine-grained token selection mechanism. For throughput, the indexer follows an MQA key-sharing design:

$$I_{t,s} = \sum_{j=1}^{H_I} w_{t,j}^I \cdot \text{ReLU}(q_{t,j}^I \cdot k_s^I)$$

where $H_I$ is the number of indexer heads, $q_{t,j}^I$ and $w_{t,j}^I$ are derived from the current query token $h_t$, and $k_s^I$ is the shared key from the preceding token $h_s$. The Top-$k$ tokens form the sparse index set:

$$\Omega_t = \{ s \mid I_{t,s} \in \text{Top-}k(I_{t,:}) \}$$

In the GQA backbone, the sparse attention output for the $g$-th group is:

$$u_{t,g} = \text{Attn}(h_{t,g}, \{c_{s,g} \mid s \in \Omega_t\})$$

With $k = 2048$, core attention complexity reduces from $O(L^2)$ to $O(Lk)$, where $k \ll L$.
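The indexer score and Top-$k$ selection above can be sketched for a single query token in plain Python. This is only an illustration under stated assumptions: the helper names (`dot`, `indexer_scores`, `top_k_indices`) and the toy vectors are hypothetical, and the final lines assume that 256K means 256 × 1024 tokens. It is not the report's kernel implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def indexer_scores(q_heads, w_heads, keys):
    """I_{t,s} = sum_j w_{t,j} * ReLU(q_{t,j} . k_s) for one query token t.

    q_heads: H_I indexer query vectors; w_heads: H_I scalar weights;
    keys: one shared (MQA-style) key vector per preceding token s.
    """
    return [
        sum(w * max(0.0, dot(q, k)) for q, w in zip(q_heads, w_heads))
        for k in keys
    ]


def top_k_indices(scores, k):
    """Omega_t: positions of the k highest scores (all positions if fewer than k)."""
    ranked = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    return sorted(ranked[:k])


# Toy example with H_I = 2 indexer heads and 4 preceding tokens (made-up values).
q_heads = [[1.0, 0.0], [0.0, 1.0]]
w_heads = [0.5, 2.0]
keys = [[1.0, 1.0], [-1.0, 2.0], [3.0, -1.0], [0.0, 0.0]]

scores = indexer_scores(q_heads, w_heads, keys)
selected = top_k_indices(scores, k=2)
print(scores)    # [2.5, 4.0, 1.5, 0.0]
print(selected)  # [0, 1]

# Core-attention saving at full context, assuming 256K = 256 * 1024 tokens.
L, k = 256 * 1024, 2048
print(L // k)    # 128, i.e. each query attends to 1/128 of the positions
```

Because every indexer head scores against the same key $k_s^I$, only one key vector per token is needed for indexing, which is the MQA-style sharing the text refers to.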

Two-Stage DSA Training

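Dense Warm-up: Initializes the indexer with a KL divergence loss that aligns the softmax of the indexer scores with the target distribution $p_{t,:,g}$ of each group $g$:
$$\mathcal{L}_{\text{warmup}}^I = \sum_t \sum_{g=1}^G D_{\text{KL}}(p_{t,:,g} \parallel \text{Softmax}(I_{t,:}))$$
Sparse Adaptation: All parameters are unfrozen and training switches to sparse mode, with the KL term restricted to the selected token set $S_t$:
$$\mathcal{L}_{\text{sparse}}^I = \sum_t \sum_{g=1}^G D_{\text{KL}}(p_{t,S_t,g} \parallel \text{Softmax}(I_{t,S_t}))$$
The total loss is:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{NTP}} + \lambda \mathcal{L}_{\text{sparse}}^I$$
where $\mathcal{L}_{\text{NTP}}$ is the next-token prediction loss and $\lambda$ weights the indexer term.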
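A minimal sketch of the two indexer losses for a single query token may help make the dense and sparse variants concrete. The function names are hypothetical, and one detail is an assumption rather than something the summary states: in sparse mode the target distribution is restricted to the selected positions and renormalised so that it sums to one.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def indexer_kl_loss(targets_per_group, index_scores, selected=None):
    """Indexer loss for one query token t, summed over the G groups.

    selected=None  -> dense warm-up: compare over all positions.
    selected=[...] -> sparse adaptation: compare over the selected positions only.
    """
    if selected is not None:
        index_scores = [index_scores[s] for s in selected]
    q = softmax(index_scores)
    loss = 0.0
    for p in targets_per_group:
        if selected is not None:
            p = [p[s] for s in selected]
            mass = sum(p)
            p = [x / mass for x in p]  # assumption: renormalise over the selected set
        loss += kl_divergence(p, q)
    return loss


def total_loss(ntp_loss, sparse_indexer_loss, lam):
    """L_total = L_NTP + lambda * L_sparse (lam is a hypothetical weight)."""
    return ntp_loss + lam * sparse_indexer_loss
```

The loss is zero exactly when the softmax of the indexer scores matches the target distribution, so minimising it pushes the indexer to rank positions the way the main attention weights them.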

Pre-Training Pipeline (Four-Stage Curriculum)

Stage	Description	Sequence Length	Data Scale	Key Data Types
0	Projector Initialization	8K	4B	Caption, Interleaved Image
1	General Multimodal Pre-training	32K	1T	Caption, OCR, Video, Pure-Text QA
2	Multi-Task Capability Injection	64K	2T	STEM, GUI, Grounding, Coding, Tool-use/Search
3	Long-Context Extension	256K	500B	Long Video, Long Docs, Multi-doc, Long Code, Long Agent Traj.
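For readers who want the curriculum in machine-readable form, the table can be transcribed as a small configuration sketch. The `Stage` class and `context_growth` helper are hypothetical names introduced here; the data-scale strings are copied verbatim because the table does not state their unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    index: int
    name: str
    seq_len_k: int     # sequence length in K, as listed in the table
    data_scale: str    # copied verbatim; the unit is not stated in the table
    data_types: tuple


CURRICULUM = (
    Stage(0, "Projector Initialization", 8, "4B",
          ("Caption", "Interleaved Image")),
    Stage(1, "General Multimodal Pre-training", 32, "1T",
          ("Caption", "OCR", "Video", "Pure-Text QA")),
    Stage(2, "Multi-Task Capability Injection", 64, "2T",
          ("STEM", "GUI", "Grounding", "Coding", "Tool-use/Search")),
    Stage(3, "Long-Context Extension", 256, "500B",
          ("Long Video", "Long Docs", "Multi-doc", "Long Code", "Long Agent Traj.")),
)


def context_growth(stages):
    """Factor by which the sequence length grows between consecutive stages."""
    return [b.seq_len_k // a.seq_len_k for a, b in zip(stages, stages[1:])]


print(context_growth(CURRICULUM))  # [4, 2, 4]
```

The context window grows 32× overall, from 8K to 256K, with the final 4× jump reserved for the long-context stage.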

Post-Training

Reinforcement Learning

General RL uses Group Sequence Policy Optimization (GSPO):

$$J_{\text{GSPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, \{y_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot|x)} \left[ \frac{1}{G} \sum_{i=1}^G \min(s_i(\theta) \hat{A}_i, \text{clip}(s_i(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_i) \right]$$

where $s_i(\theta)$ is the sequence-level importance sampling ratio of response $y_i$ and $\hat{A}_i$ is its advantage, normalized within the group of $G$ sampled responses.
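The clipped objective can be sketched for one prompt and its group of sampled responses. The summary does not spell out $s_i(\theta)$, so this sketch assumes the length-normalised sequence-level ratio from the GSPO formulation, $s_i = (\pi_\theta(y_i|x) / \pi_{\theta_{\text{old}}}(y_i|x))^{1/|y_i|}$, and mean/std normalisation of rewards for $\hat{A}_i$. The function name and the clip range `eps=0.2` are placeholders, not values from the report.

```python
import math


def gspo_objective(logp_new, logp_old, lengths, rewards, eps=0.2):
    """Clipped GSPO objective for one prompt with G sampled responses.

    logp_new / logp_old: summed token log-probabilities of each response under
    the current and the old policy; lengths: response lengths |y_i|.
    """
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    # Assumption: advantages are rewards standardised within the group.
    advantages = [(r - mean) / (std + 1e-8) for r in rewards]

    total = 0.0
    for ln, lo, n, adv in zip(logp_new, logp_old, lengths, advantages):
        s = math.exp((ln - lo) / n)                     # length-normalised ratio
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)   # clip(s, 1 - eps, 1 + eps)
        total += min(s * adv, s_clipped * adv)
    return total / G


# Unchanged policy: every ratio is 1, so the objective is the mean advantage (0).
print(gspo_objective([-3.0, -4.0], [-3.0, -4.0], [5, 5], [1.0, 0.0]))
```

Because the ratio is computed per sequence rather than per token, clipping keeps or discards a whole response at once, which is the main difference from token-level clipped objectives.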

Specialized RL covers Grounding, Spatial, Math, Counting, and OCR experts using domain-specific rewards (e.g., IoU for grounding, symbolic equivalence for math).

Video RL uses temporal IoU (tIoU) rewards and LLM-as-Judge for dense captioning.
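A temporal IoU reward of the kind mentioned above can be written in a few lines. This is a generic sketch of interval IoU; the function name and the `(start, end)` tuple convention in seconds are assumptions, and the report may add shaping terms that are not described here.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) pairs in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    intersection = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - intersection
    return intersection / union if union > 0 else 0.0


print(temporal_iou((10.0, 20.0), (15.0, 25.0)))  # 5 s overlap over a 15 s union
print(temporal_iou((0.0, 5.0), (5.0, 10.0)))     # 0.0, the segments only touch
```

The reward is dense in the sense that a partially correct segment earns partial credit, which gives the policy a usable signal before its predictions are exact.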

Agentic RL covers Coding RL (Online Judge, SWE-bench), Tool Use RL (150+ API domains), and Search RL with environment-grounded rewards.

Cross-Modal Multi-Teacher On-Policy Distillation (MOPD)

MOPD maintains 13 RL-trained domain teachers. For a student on-policy response $y_i = (y_{i,1}, \ldots, y_{i,T}) \sim \pi_\theta(\cdot|x_i)$, the token-level advantage is computed over a top-$k$ overlap set $\Omega_{i,t}$:

$$A_{i,t} = \sum_{v \in \Omega_{i,t}} \bar{\pi}_\theta(v|s_{i,t}) \left[ \log \pi_T^{(r(i))}(v|s_{i,t}) - \log \pi_\theta(v|s_{i,t}) \right]$$

The student is optimized with:

$$\mathcal{L}_{\text{MOPD}} = -\mathbb{E} \left[ \frac{1}{|\mathcal{M}_i|} \sum_{t \in \mathcal{M}_i} b\hat{A}_{i,t} \log \pi_\theta(y_{i,t}|x_i, y_{i,<t}) \right]$$

with token-category-aware advantage scaling and localized repetition penalties.
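The token-level advantage can be illustrated for a single decoding step. The summary does not define $\Omega_{i,t}$ or $\bar{\pi}_\theta$ precisely, so this sketch makes two labelled assumptions: the overlap set is the intersection of the student's and the routed teacher's top-$k$ tokens, and $\bar{\pi}_\theta$ is the student probability renormalised over that set. All names are hypothetical.

```python
import math


def top_k_tokens(logprobs, k):
    """Set of the k most likely tokens in a {token: log-prob} mapping."""
    return set(sorted(logprobs, key=logprobs.get, reverse=True)[:k])


def mopd_token_advantage(student_logp, teacher_logp, k):
    """A_{i,t} at one position, given student and teacher log-probabilities."""
    # Assumption: Omega_{i,t} is the intersection of both top-k sets.
    overlap = top_k_tokens(student_logp, k) & top_k_tokens(teacher_logp, k)
    if not overlap:
        return 0.0
    # Assumption: pi_bar is the student distribution renormalised over Omega.
    mass = sum(math.exp(student_logp[v]) for v in overlap)
    return sum(
        (math.exp(student_logp[v]) / mass) * (teacher_logp[v] - student_logp[v])
        for v in overlap
    )


student = {t: math.log(p) for t, p in {"a": 0.5, "b": 0.3, "c": 0.2}.items()}
teacher = {t: math.log(p) for t, p in {"a": 0.7, "b": 0.2, "c": 0.1}.items()}
print(mopd_token_advantage(student, teacher, k=2))
```

A positive value means the teacher assigns more log-probability than the student to the tokens the student already favours at that position; when student and teacher agree exactly, the advantage is zero.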

Efficient Infrastructure

ViT–LM Heterogeneous Parallelism: The ViT and LM are co-located on the same GPUs with separate sharding strategies; recompute-or-offload reduces ViT activation memory to near zero.
DSA Optimization: Kernels built on FlashInfer and TileLang achieve a >2× speedup, complemented by top-$k$ memory optimization and short-sequence optimization for variable-length inputs.
Inference Efficiency: Chunk ViT, sparse attention deduplication, and decode optimization cut prefill cost by more than 3× and decode cost by more than 5× at 128K context.

Empirical Validation / Results

Video Understanding

Table 2: Video Understanding Evaluation

Category	Benchmark	Keye-VL-2.0 30B-A3B	Qwen3.5 35B-A3B	InternVL3.5 241B-A28B	GPT-5-mini	Qwen3-VL 235B-A22B Thinking
Long-Video Comprehensive	LongVideoBench	74.1	61.6	67.1	–	70.5
	Video-MME-v2 ACC (512 frames)	42.4	28.5	–	–	36.8
	Video-MME-v2 Non-Lin (512 frames)	24.2	12.2	–	–	28.1
	MLVU	82.8	85.6	78.2	83.3	83.8
	Video-MME (w/o sub.)	78.3	82.5	72.9	78.9	79.0
Temporal Grounding (TimeLens)	ActivityNet-TimeLens	58.5	53.2	–	–	52.1
	QVHighlights-TimeLens	70.1	65.7	–	–	64.6
	Charades-TimeLens	58.4	49.1	–	–	47.8
Video Knowledge Acquisition	Video-MMMU	80.0	80.4	–	82.5	80.0

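Keye-VL-2.0 achieves the best scores in Table 2 on LongVideoBench (74.1), Video-MME-v2 ACC (42.4 at 512 frames), and all three TimeLens subsets (ActivityNet, QVHighlights, Charades), demonstrating strong long-video comprehension and fine-grained temporal localization. It trails the best competing model on MLVU, Video-MME (w/o sub.), Video-MME-v2 Non-Lin, and Video-MMMU.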

Agentic Capability Evaluation

Table 3: Code Agent Evaluation

Benchmark	Keye-VL-2.0 30B-A3B	Qwen3.5 35B-A3B	GPT-5-mini	Qwen3-VL 235B-A22B Thinking
LiveCodeBench v6	64.2	62.8	51.5	–
OJBench	71.5	70.2	58.7	–
SWE-bench Verified	62.0	63.5	55.5	–

Table 4: Tool-Use Evaluation

Benchmark	Keye-VL-2.0 30B-A3B	Qwen3.5 35B-A3B	GPT-5-mini	Qwen3-VL 235B-A22B Thinking
BFCL-V4	65.7	67.3	55.5	–
τ²-Bench	82.6	81.2	69.8	–
VitaBench	33.1	31.9	13.9	–

Keye-VL-2.0 shows strong coding (64.2 on LiveCodeBench v6) and tool-use abilities (best on τ²-Bench and VitaBench), with only 3B active parameters.

Theoretical and Practical Implications

Sparse Long-Context Modeling: Demonstrates that DSA can be effectively adapted to GQA-based multimodal architectures, enabling 256K context processing at controllable cost. This provides a practical path for scaling video understanding beyond frame-limited perception.
Capability Consolidation via MOPD: MOPD resolves the multimodal alignment dilemma by isolating task-specific expertise through on-policy distillation, preventing catastrophic forgetting while integrating heterogeneous agent capabilities.
Deployable Efficiency: With only 3B active parameters and optimized DSA kernels, the model achieves high inference efficiency, making hour-level video understanding feasible for real applications.
Open-Source Impact: Full release of model checkpoints and infrastructure (ViT-LM parallelism, DSA kernels, MOPD system) accelerates community progress toward scalable multimodal agentic applications.

Conclusion

Kwai Keye-VL-2.0-30B-A3B is a 30B MoE multimodal foundation model (3B active) that pioneers DSA in GQA-based architectures for 256K context processing. Its post-training pipeline, featuring Cross-Modal Multi-Teacher On-Policy Distillation and diverse RL stages, enables leading performance in long-video understanding, temporal grounding, and agentic tasks without sacrificing general reasoning. Future work will focus on deeper integration into real business pipelines (generative recommendation, content governance), developing Video × Agent workflows for automated orchestration, and strengthening the underlying infrastructure toward native multimodal modeling. The open-source release aims to turn long-context multimodal intelligence into reliable, scalable infrastructure for real-world applications.