# Kwai Keye-VL-2.0 Technical Report

> First multimodal MoE achieves SOTA long-video understanding and agentic tasks with 3B active parameters via sparse attention and multi-teacher distillation.

- **Source:** [arXiv](https://arxiv.org/abs/2606.10651)
- **Published:** 2026-06-11
- **Permalink:** https://picx.dev/p/r5JNLR
- **Whiteboard:** https://picx.dev/p/r5JNLR/image

## Summary (Overview)

- **Kwai Keye-VL-2.0-30B-A3B** is an open-source Mixture-of-Experts (MoE) multimodal foundation model with 30B total parameters and only **3B active parameters**, designed for long-video understanding and agentic intelligence.
- It is the **first model to adapt DeepSeek Sparse Attention (DSA)** to a GQA-based multimodal architecture, enabling **lossless 256K context processing** while capturing critical frames and long-range temporal dependencies.
- It introduces **Cross-Modal Multi-Teacher On-Policy Distillation (MOPD)**, paired with Context-RL and Video-RL, to overcome catastrophic forgetting during multi-task alignment, allowing the model to natively support agent collaboration across Code, Tool, and Search scenarios.
- Achieves **state-of-the-art performance** among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens benchmarks (ActivityNet, QVHighlights, Charades) and long-video comprehension on Video-MME-v2 and LongVideoBench.
- The model, checkpoints, and infrastructure are fully open-sourced via Hugging Face and GitHub to accelerate community progress.

## Introduction and Theoretical Foundation

### Background and Motivation
Recent large language models (e.g., GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash, Qwen3.7) have made substantial progress in multimodal reasoning and long-context understanding. However, scaling to hour-level video contexts incurs prohibitive computational and memory costs (the **infrastructural bottleneck**), and integrating complex agent tasks often induces **catastrophic forgetting** of foundational reasoning capabilities (the **algorithmic dilemma**).

### Theoretical Basis
Keye-VL-2.0 addresses these challenges through two innovations:
1. **Extreme Context Scaling via Multimodal DSA**: Standard dense attention leads to catastrophic KV cache expansion. DSA compresses and sparsifies video feature aggregation, constraining linear KV cache growth and enabling 256K lossless context processing.
2. **Resolving Modality Conflict via Cross-Modal MOPD**: Direct injection of multiple capabilities often degrades reasoning. MOPD uses specialized teacher models to provide dense token-level feedback on student-generated trajectories, isolating task-specific expertise and preserving general-purpose reasoning baselines.
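
To make the KV cache growth in item 1 concrete, the following is a minimal back-of-the-envelope sketch of how a dense GQA cache scales with sequence length. The layer count, KV head count, head dimension, and 2-byte precision are assumed illustrative values, not figures taken from the report.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token under dense GQA attention."""
    # Factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

# Hypothetical configuration, chosen only for illustration.
ASSUMED_CONFIG = dict(num_layers=48, num_kv_heads=4, head_dim=128)

for tokens in (32 * 1024, 256 * 1024):
    gib = kv_cache_bytes(tokens, **ASSUMED_CONFIG) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
```

Under these assumptions the cache grows linearly with the number of tokens, so moving from 32K to 256K context multiplies its size by eight.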

## Methodology

### Model Architecture
The model comprises four core components:
- **Vision Encoder (ViT)**: Inherited from Keye-VL-1.5, based on SigLIP-400M-384-14, with native-resolution encoding, 2D RoPE, and adaptive position encoding.
- **Language Decoder (LLM)**: Based on Qwen3-30B-A3B-Thinking-2507.
- **MLP Projector**: Randomly initialized and trained to align visual features.
- **Sparse Attention Module**: A GQA-compatible DSA design combining MQA-based indexing with grouped GQA aggregation.

#### DSA for Long-Context Multimodal Modeling
The DSA contains a Lightning Indexer and a fine-grained token selection mechanism. For throughput, the indexer follows an MQA key-sharing design:

$$I_{t,s} = \sum_{j=1}^{H_I} w_{t,j}^I \cdot \text{ReLU}(q_{t,j}^I \cdot k_s^I)$$

where $H_I$ is the number of indexer heads, $q_{t,j}^I$ and $w_{t,j}^I$ are derived from the current query token $h_t$, and $k_s^I$ is the shared key from the preceding token $h_s$. The Top-$k$ tokens form the sparse index set:

$$\Omega_t = \{ s \mid I_{t,s} \in \text{Top-}k(I_{t,:}) \}$$

In the GQA backbone, the sparse attention output for the $g$-th group is:

$$u_{t,g} = \text{Attn}(h_{t,g}, \{c_{s,g} \mid s \in \Omega_t\})$$

With $k = 2048$, core attention complexity reduces from $O(L^2)$ to $O(Lk)$ where $k \ll L$.
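
The three formulas above can be sketched in NumPy as follows. This is an illustrative reference version under stated assumptions, not the report's kernel: the function names are hypothetical, causal selection over positions $s \le t$ is assumed, and the attention step uses standard scaled dot-product softmax for a single query of one group.

```python
import numpy as np

def index_scores(q_idx, w_idx, k_idx):
    """I[t, s] = sum_j w[t, j] * ReLU(q[t, j] . k[s]), with one shared key per token.

    q_idx: (L, H_I, d) indexer queries; w_idx: (L, H_I) head weights; k_idx: (L, d) shared keys.
    """
    dots = np.einsum("tjd,sd->tjs", q_idx, k_idx)            # (L, H_I, L)
    return np.einsum("tj,tjs->ts", w_idx, np.maximum(dots, 0.0))

def select_topk(scores, k):
    """For each query t, return the sorted indices of the top-k causal positions (s <= t)."""
    L = scores.shape[0]
    causal = np.tril(np.ones((L, L), dtype=bool))
    masked = np.where(causal, scores, -np.inf)               # assumption: causal masking
    selected = []
    for t in range(L):
        n = min(k, t + 1)                                    # early tokens have fewer candidates
        top = np.argsort(-masked[t], kind="stable")[:n]
        selected.append(np.sort(top))
    return selected

def sparse_group_attention(q_g, keys_g, values_g, omega_t):
    """Softmax attention of one group's query over only the selected positions."""
    k_sel, v_sel = keys_g[omega_t], values_g[omega_t]
    logits = k_sel @ q_g / np.sqrt(q_g.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_sel

# Tiny demo with random tensors (sizes are arbitrary).
rng = np.random.default_rng(0)
L, H_I, d, k = 16, 4, 8, 4
scores = index_scores(rng.normal(size=(L, H_I, d)),
                      rng.normal(size=(L, H_I)),
                      rng.normal(size=(L, d)))
omega = select_topk(scores, k)
keys, values = rng.normal(size=(L, d)), rng.normal(size=(L, d))
u_last = sparse_group_attention(rng.normal(size=d), keys, values, omega[-1])
print(omega[-1], u_last.shape)
```

Each query attends to at most $k$ keys, so at $L = 262{,}144$ (256K) and $k = 2048$ the final token reads 128 times fewer keys than dense attention would. Note that `index_scores` in this sketch still scores every query-key pair; only the core attention step is sparse.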

#### Two-Stage DSA Training
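1. **Dense Warm-up**: The indexer is initialized with a KL divergence loss that aligns its score distribution with the target attention distribution $p_{t,:,g}$ of each GQA group $g$: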

   $$\mathcal{L}_{\text{warmup}}^I = \sum_t \sum_{g=1}^G D_{\text{KL}}(p_{t,:,g} \parallel \text{Softmax}(I_{t,:}))$$

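2. **Sparse Adaptation**: All parameters are unfrozen and training switches to sparse mode, with the indexer loss restricted to the selected token set $S_t$ (the Top-$k$ set written $\Omega_t$ above):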

   $$\mathcal{L}_{\text{sparse}}^I = \sum_t \sum_{g=1}^G D_{\text{KL}}(p_{t,S_t,g} \parallel \text{Softmax}(I_{t,S_t}))$$

   The total loss is:

   $$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{NTP}} + \lambda \mathcal{L}_{\text{sparse}}^I$$
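
The two indexer losses differ only in the set of positions they cover, which the sketch below makes explicit for a single query and a single group. It is a minimal illustration, assuming that the target distribution is renormalized over the selected set in sparse mode; the function names and the value of $\lambda$ are hypothetical.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def indexer_kl(p_target, index_logits, selected=None):
    """KL(p || Softmax(I)) for one query and one group.

    selected=None gives the dense warm-up loss over all positions.
    Otherwise both p and I are restricted to the selected indices
    (assumption: p is renormalized over that subset).
    """
    if selected is not None:
        p_target = p_target[selected]
        index_logits = index_logits[selected]
    p = p_target / p_target.sum()
    log_q = log_softmax(index_logits)
    mask = p > 0                                   # 0 * log(0) is treated as 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - log_q[mask])))

def total_loss(ntp_loss, sparse_kl, lam):
    """L_total = L_NTP + lambda * L_sparse."""
    return ntp_loss + lam * sparse_kl

# Example: an indexer that ranks positions differently from the attention target.
p = np.array([0.6, 0.3, 0.1, 0.0])
logits = np.array([0.0, 2.0, 1.0, -1.0])
dense = indexer_kl(p, logits)
sparse = indexer_kl(p, logits, selected=np.array([0, 1]))
print(dense, sparse, total_loss(2.0, sparse, lam=0.1))
```

The loss is zero exactly when the indexer's softmax matches the target on the covered positions, which is what lets Top-$k$ selection on $I_{t,:}$ approximate the positions dense attention would weight most.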

### Pre-Training Pipeline (Four-Stage Curriculum)

| Stage | Description | Sequence Length | Data Scale | Key Data Types |
|-------|-------------|----------------|------------|----------------|
| 0 | Projector Initialization | 8K | 4B | Caption, Interleaved Image |
| 1 | General Multimodal Pre-training | 32K | 1T | Caption, OCR, Video, Pure-Text QA |
| 2 | Multi-Task Capability Injection | 64K | 2T | STEM, GUI, Grounding, Coding, Tool-use/Search |
| 3 | Long-Context Extension | 256K | 500B | Long Video, Long Docs, Multi-doc, Long Code, Long Agent Traj. |

### Post-Training

#### Reinforcement Learning
**General RL** uses Group Sequence Policy Optimization (GSPO):

$$J_{\text{GSPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, \{y_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot|x)} \left[ \frac{1}{G} \sum_{i=1}^G \min(s_i(\theta) \hat{A}_i, \text{clip}(s_i(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_i) \right]$$

where $s_i(\theta)$ is an importance sampling ratio and $\hat{A}_i$ is the normalized advantage.
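
A minimal NumPy sketch of this objective for one prompt is shown below. The summary only describes $s_i(\theta)$ as an importance sampling ratio; the length-normalized sequence-level form used here follows the GSPO paper and is an assumption with respect to this report, as are the function name and the placeholder value of $\epsilon$.

```python
import numpy as np

def gspo_objective(logp_new, logp_old, rewards, eps=0.2):
    """Clipped sequence-level objective for one group of G sampled responses.

    logp_new, logp_old: lists of per-token log-prob arrays under the current and old policy.
    rewards: one scalar reward per response.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-normalized advantage (small constant avoids division by zero).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Assumed ratio: s_i = (pi_new(y_i|x) / pi_old(y_i|x)) ** (1 / |y_i|).
    ratios = np.array([np.exp(np.mean(n - o)) for n, o in zip(logp_new, logp_old)])
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))

# Two responses: the rewarded one became more likely under the new policy.
old = [np.array([-1.0, -1.0]), np.array([-0.5, -0.5, -0.5])]
new = [np.array([0.0, 0.0]), np.array([-0.5, -0.5, -0.5])]
print(gspo_objective(new, old, rewards=[1.0, 0.0]))
```

In the example the first response's ratio is $e^1 \approx 2.72$, so its contribution is clipped at $1 + \epsilon$, which caps how far a single update can push the policy toward that sample.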

**Specialized RL** covers Grounding, Spatial, Math, Counting, and OCR experts using domain-specific rewards (e.g., IoU for grounding, symbolic equivalence for math).

**Video RL** uses temporal IoU (tIoU) rewards and LLM-as-Judge for dense captioning.

**Agentic RL** covers Coding RL (Online Judge, SWE-bench), Tool Use RL (150+ API domains), and Search RL with environment-grounded rewards.
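
The overlap-based rewards named above for grounding (IoU) and Video RL (tIoU) can be sketched as plain functions. The box format `(x1, y1, x2, y2)`, the segment format `(start, end)` in seconds, and the function names are assumptions for illustration; the report's exact reward shaping is not given in this summary.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def temporal_iou(pred, gt):
    """Intersection over union of two time segments given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted segment of 10-20 s against a ground truth of 15-25 s overlaps for 5 s
# out of a 15 s union, giving a reward of 1/3.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Both rewards lie in $[0, 1]$ and give partial credit for near misses, which provides a denser training signal than an exact-match check.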

#### Cross-Modal Multi-Teacher On-Policy Distillation (MOPD)
MOPD maintains 13 RL-trained domain teachers. For a student on-policy response $y_i = (y_{i,1}, \ldots, y_{i,T}) \sim \pi_\theta(\cdot|x_i)$, the token-level advantage is computed over a top-$k$ overlap set $\Omega_{i,t}$:

$$A_{i,t} = \sum_{v \in \Omega_{i,t}} \bar{\pi}_\theta(v|s_{i,t}) \left[ \log \pi_T^{(r(i))}(v|s_{i,t}) - \log \pi_\theta(v|s_{i,t}) \right]$$

The student is optimized with:

$$\mathcal{L}_{\text{MOPD}} = -\mathbb{E} \left[ \frac{1}{|\mathcal{M}_i|} \sum_{t \in \mathcal{M}_i} b\hat{A}_{i,t} \log \pi_\theta(y_{i,t}|x_i, y_{i,<t}) \right]$$

Here $\hat{A}_{i,t}$ denotes the advantage $A_{i,t}$ after token-category-aware scaling; localized repetition penalties are applied as well.
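
The token-level advantage $A_{i,t}$ can be sketched for a single position as follows. This is an illustration under two assumptions that the summary does not confirm: the overlap set $\Omega_{i,t}$ is taken as the intersection of the student's and the routed teacher's top-$k$ tokens, and $\bar{\pi}_\theta$ is read as the student probability renormalized over that set and treated as a constant weight. Function names are hypothetical.

```python
import numpy as np

def topk_overlap(student_logp, teacher_logp, k):
    """Token ids that appear in both the student's and the teacher's top-k (assumed definition)."""
    s_top = set(np.argsort(-student_logp)[:k].tolist())
    t_top = set(np.argsort(-teacher_logp)[:k].tolist())
    return np.array(sorted(s_top & t_top), dtype=int)

def mopd_token_advantage(student_logp, teacher_logp, k):
    """A = sum over v in Omega of pi_bar(v) * [log pi_T(v) - log pi_theta(v)] at one position."""
    omega = topk_overlap(student_logp, teacher_logp, k)
    if omega.size == 0:
        return 0.0                                  # no shared candidates: no signal
    s = student_logp[omega]
    weights = np.exp(s - s.max())
    weights /= weights.sum()                        # assumed pi_bar: renormalized over Omega
    return float(np.sum(weights * (teacher_logp[omega] - s)))

# Toy vocabulary of four tokens; the teacher is more confident on the shared top-2.
student = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
teacher = np.log(np.array([0.5, 0.4, 0.05, 0.05]))
print(mopd_token_advantage(student, teacher, k=2))
```

The advantage is positive when the teacher assigns more probability than the student to tokens the student already considers likely, and zero when the two distributions agree on the overlap set.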

### Efficient Infrastructure
- **ViT–LM Heterogeneous Parallelism**: The ViT and LM are co-located on the same GPUs with separate sharding strategies; a recompute-or-offload scheme reduces ViT activation memory to near zero.
- **DSA Optimization**: FlashInfer and TileLang achieve >2× speedup; top-$k$ memory optimization and short-sequence optimization for variable-length inputs.
- **Inference Efficiency**: Chunk ViT, sparse attention deduplication, and decode optimization reduce prefill cost by >3× and decode cost by >5× at 128K context.

## Empirical Validation / Results

### Video Understanding

**Table 2: Video Understanding Evaluation**

| Category | Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | InternVL3.5 241B-A28B | GPT-5-mini | Qwen3-VL 235B-A22B Thinking |
|----------|-----------|----------------------|-----------------|----------------------|-------------|------------------------------|
| Long-Video Comprehensive | LongVideoBench | **74.1** | 61.6 | 67.1 | – | 70.5 |
| | Video-MME-v2 ACC (512 frames) | **42.4** | 28.5 | – | – | 36.8 |
| | Video-MME-v2 Non-Lin (512 frames) | 24.2 | 12.2 | – | – | 28.1 |
| | MLVU | 82.8 | 85.6 | 78.2 | 83.3 | 83.8 |
| | Video-MME (w/o sub.) | 78.3 | 82.5 | 72.9 | 78.9 | 79.0 |
| Temporal Grounding (TimeLens) | ActivityNet-TimeLens | **58.5** | 53.2 | – | – | 52.1 |
| | QVHighlights-TimeLens | **70.1** | 65.7 | – | – | 64.6 |
| | Charades-TimeLens | **58.4** | 49.1 | – | – | 47.8 |
| Video Knowledge Acquisition | Video-MMMU | 80.0 | 80.4 | – | 82.5 | 80.0 |

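Keye-VL-2.0 achieves the best scores in the table on **LongVideoBench** (74.1), **Video-MME-v2 ACC** (42.4 at 512 frames), and **all three TimeLens subsets** (ActivityNet, QVHighlights, Charades), indicating strong long-video comprehension and fine-grained temporal localization. It trails Qwen3.5-35B-A3B on MLVU (82.8 vs. 85.6) and Video-MME without subtitles (78.3 vs. 82.5), and trails Qwen3-VL-235B-A22B-Thinking on Video-MME-v2 Non-Lin (24.2 vs. 28.1).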

### Agentic Capability Evaluation

**Table 3: Code Agent Evaluation**

| Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | GPT-5-mini | Qwen3-VL 235B-A22B Thinking |
|-----------|----------------------|-----------------|-------------|------------------------------|
| LiveCodeBench v6 | **64.2** | 62.8 | 51.5 | – |
| OJBench | **71.5** | 70.2 | 58.7 | – |
| SWE-bench Verified | 62.0 | **63.5** | 55.5 | – |

**Table 4: Tool-Use Evaluation**

| Benchmark | Keye-VL-2.0 30B-A3B | Qwen3.5 35B-A3B | GPT-5-mini | Qwen3-VL 235B-A22B Thinking |
|-----------|----------------------|-----------------|-------------|------------------------------|
| BFCL-V4 | 65.7 | **67.3** | 55.5 | – |
| τ²-Bench | **82.6** | 81.2 | 69.8 | – |
| VitaBench | **33.1** | 31.9 | 13.9 | – |

With only 3B active parameters, Keye-VL-2.0 leads on LiveCodeBench v6 (64.2), OJBench (71.5), τ²-Bench (82.6), and VitaBench (33.1), while trailing Qwen3.5-35B-A3B slightly on SWE-bench Verified (62.0 vs. 63.5) and BFCL-V4 (65.7 vs. 67.3).

## Theoretical and Practical Implications

- **Sparse Long-Context Modeling**: Demonstrates that DSA can be effectively adapted to GQA-based multimodal architectures, enabling 256K context processing at controllable cost. This provides a practical path for scaling video understanding beyond frame-limited perception.
- **Capability Consolidation via MOPD**: MOPD resolves the multimodal alignment dilemma by isolating task-specific expertise through on-policy distillation, preventing catastrophic forgetting while integrating heterogeneous agent capabilities.
- **Deployable Efficiency**: With only 3B active parameters and optimized DSA kernels, the model achieves high inference efficiency, making hour-level video understanding feasible for real applications.
- **Open-Source Impact**: Full release of model checkpoints and infrastructure (ViT-LM parallelism, DSA kernels, MOPD system) accelerates community progress toward scalable multimodal agentic applications.

## Conclusion

Kwai Keye-VL-2.0-30B-A3B is a 30B MoE multimodal foundation model (3B active) that pioneers DSA in GQA-based architectures for 256K context processing. Its post-training pipeline, featuring Cross-Modal Multi-Teacher On-Policy Distillation and diverse RL stages, enables leading performance in long-video understanding, temporal grounding, and agentic tasks without sacrificing general reasoning. Future work will focus on deeper integration into real business pipelines (generative recommendation, content governance), developing Video × Agent workflows for automated orchestration, and strengthening the underlying infrastructure toward native multimodal modeling. The open-source release aims to turn long-context multimodal intelligence into reliable, scalable infrastructure for real-world applications.

---

_Markdown view of https://picx.dev/p/r5JNLR, served by PicX — AI-generated visual whiteboard summaries of research papers._
