# LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

> LLaDA2.0-Uni is a unified discrete diffusion model that uses semantic tokens to perform both multimodal understanding and generation within a single framework.

- **Source:** [arXiv](https://arxiv.org/abs/2604.20796)
- **Published:** 2026-04-24
- **Permalink:** https://picx.dev/p/D5leQk
- **Whiteboard:** https://picx.dev/p/D5leQk/image

## Summary (Overview)
*   **Unified Architecture for Multimodal Tasks:** LLaDA2.0-Uni is a novel unified discrete diffusion large language model (dLLM) that natively supports both multimodal understanding (e.g., VQA, document reasoning) and generation (e.g., image generation, editing) within a single framework.
*   **Core Technical Innovations:** The architecture integrates three key components: 1) a **SigLIP-VQ semantic tokenizer** for discretizing images, 2) a **16B MoE-based dLLM backbone** trained with a block-level masked diffusion objective for unified sequence modeling, and 3) a **distillation-optimized diffusion decoder** for high-fidelity image reconstruction.
*   **Strong Benchmark Performance:** The model performs on par with specialized vision-language models (VLMs) on understanding tasks and achieves results competitive with state-of-the-art image generation models, while also excelling in editing and complex interleaved generation/reasoning tasks.
*   **Efficient Inference:** The model employs training-free inference acceleration via **SPRINT** (Sparse Prefix Retention with Inference-time Non-uniform Token Unmasking) and a distilled diffusion decoder, achieving up to 1.6× speedup and 8-step CFG-free image generation.
*   **Scalable Data and Training Pipeline:** The model is supported by a large-scale, carefully curated multimodal dataset and a tailored three-stage training pipeline (vision-language alignment, multi-task pre-training, supervised fine-tuning) that progressively builds its capabilities.

## Introduction and Theoretical Foundation
Large language models (LLMs) have expanded beyond text to handle diverse multimodal tasks, primarily categorized into **understanding** (e.g., visual question answering) and **generation** (e.g., text-to-image). Traditionally, these are handled by separate specialized models. A unified model offers key benefits: mutual enhancement between understanding and generation, improved deployment efficiency, and advanced capabilities such as **interleaved generation and reasoning**, a step toward Artificial General Intelligence (AGI).

Current unified models are predominantly based on **autoregressive (AR)** architectures, which tokenize images into discrete sequences for next-token prediction. An alternative paradigm is offered by **masked diffusion models**, which have inherent advantages in parallel decoding and bidirectional context modeling. However, existing unified masked diffusion models (e.g., MMaDA, Lumina-DiMOO) lag behind AR-based models due to architectural limitations: 1) reconstructive VQ tokenizers lack semantic information, harming understanding; 2) excessive image compression degrades generation quality; 3) fully bidirectional modeling is unreliable for text; 4) they assume fixed output lengths.

**LLaDA2.0-Uni** addresses these limitations by proposing a unified dLLM-based MoE model. Its core innovation is using **fully discrete semantic tokens** for both understanding and generation. This is achieved via a **SigLIP-VQ tokenizer**, which converts images into semantic tokens, preserving crucial details for complex reasoning. This unified representation allows both text and images to be optimized under a shared **block-level masked diffusion objective** within the dLLM backbone, while a dedicated diffusion decoder reconstructs images from the generated tokens.

## Methodology

### 2.1 Architecture Overview
LLaDA2.0-Uni consists of three core components:
1.  **Semantic Discrete Tokenizer (SigLIP-VQ):** Converts continuous images into discrete semantic tokens. It uses a pre-trained SigLIP2-g ViT as a feature extractor and a vector quantizer with a codebook (vocab size 16,384, dimensionality 2,048). Unlike reconstruction-based VQ-VAEs, it is trained on understanding tasks, preserving rich semantics.
2.  **Diffusion Large Language Model (16B MoE Backbone):** Built upon LLaDA-2.0-mini, it processes interleaved sequences of text and visual tokens. Key design choices:
    *   **Block-wise Attention:** Adopted instead of full bidirectional attention to balance parallel decoding speed with training stability, especially for semantically aligned SigLIP-VQ tokens.
    *   **1D RoPE with Size Tokens:** Uses standard 1D Rotary Position Embedding (RoPE). Special `<height>` and `<width>` tokens (e.g., `<imgsize 512>`) are prepended to the flattened visual sequence to represent 2D spatial information and enable arbitrary resolution handling.
3.  **Diffusion Decoder:** A model built upon Z-Image-Base (6B) that reconstructs high-fidelity images from the semantic tokens generated by the dLLM backbone. It performs 2× super-resolution and is optimized via **model distillation** for efficient 8-step CFG-free inference.
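
To make the block-wise attention pattern concrete, here is a minimal sketch of a boolean attention mask. It assumes the usual block-diffusion layout, in which a token attends bidirectionally within its own block and to every earlier block; the summary does not spell out the exact mask, and the function name `block_attention_mask` is hypothetical.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Assumption: attention is bidirectional inside a block and causal across blocks.
    """
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        # A query sees every key in its own block or in any earlier block.
        mask.append([(k // block_size) <= q_block for k in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Print one row per query position; "x" marks an allowed key.
    for row in block_attention_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With six tokens and a block size of two, positions 0 and 1 see only the first block, positions 2 and 3 see the first two blocks, and positions 4 and 5 see the whole sequence. Setting `block_size` equal to `seq_len` recovers full bidirectional attention.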

### 2.2 Training-free Inference Acceleration: SPRINT
To accelerate inference beyond parallel decoding, the paper proposes **SPRINT**, which reduces cost along two axes:
1.  **Sparse Prefix Retention:** Prunes the prefix Key-Value (KV) cache in a modality-aware manner to lower per-step attention cost. A composite importance score $s_i$ for each prefix position $i$ is calculated:
    $$ s_i = \alpha \cdot \bar{I}_i + (1 - \alpha) \cdot c_i $$
    where $\bar{I}_i = \| k_i \|_2 / \left( \frac{1}{L} \sum_{j=1}^{L} \| k_j \|_2 \right)$ is the mean-normalized key norm, $c_i = \max_v p_\theta(v | x_t)$ is the top-1 softmax confidence, and $\alpha=0.5$. Separate keep ratios are used for text ($r_{text}$) and image ($r_{img}$) tokens.
2.  **Non-uniform Token Unmasking:** Replaces the fixed denoising schedule with a confidence-adaptive strategy. At each step, all masked positions with confidence exceeding a threshold $\tau$ are accepted:
    $$ A = \{ n \in [m] : c_n > \tau \}. $$
    A minimum number of acceptances is enforced to guarantee termination.
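
The two SPRINT steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the number of retained positions is rounded up, the highest-scoring positions are the ones kept, and the minimum-acceptance fallback picks the most confident positions.

```python
import math


def importance_scores(key_norms, confidences, alpha=0.5):
    """s_i = alpha * (||k_i|| / mean_j ||k_j||) + (1 - alpha) * c_i."""
    mean_norm = sum(key_norms) / len(key_norms)
    return [alpha * (n / mean_norm) + (1 - alpha) * c
            for n, c in zip(key_norms, confidences)]


def retain_prefix(scores, is_image, r_text, r_img):
    """Return sorted indices of prefix positions to keep.

    Assumption: each modality keeps its top-scoring fraction, rounded up.
    """
    kept = []
    for modality, ratio in ((False, r_text), (True, r_img)):
        idx = [i for i, img in enumerate(is_image) if img == modality]
        n_keep = math.ceil(ratio * len(idx))
        idx.sort(key=lambda i: scores[i], reverse=True)
        kept.extend(idx[:n_keep])
    return sorted(kept)


def accept_positions(confidences, tau, min_accept=1):
    """Return indices of masked positions to unmask: A = {n : c_n > tau}.

    Assumption: if fewer than min_accept positions pass the threshold,
    the most confident positions are accepted so decoding always advances.
    """
    accepted = [n for n, c in enumerate(confidences) if c > tau]
    if len(accepted) < min_accept:
        ranked = sorted(range(len(confidences)),
                        key=lambda n: confidences[n], reverse=True)
        accepted = sorted(ranked[:min_accept])
    return accepted
```

For example, `accept_positions([0.95, 0.4, 0.99], tau=0.9)` accepts positions 0 and 2 in a single step, whereas a fixed schedule would unmask a preset number of tokens regardless of confidence.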

### 3. Data Preparation
A large-scale, meticulously curated dataset is constructed:
*   **Multimodal Understanding:** Includes image-caption data, OCR (via a coarse-to-fine pipeline), grounding/counting data, world knowledge/reasoning data, and high-quality text data.
*   **Image Generation:** Over 200M web images filtered for resolution, aesthetics (ArtiMuse score >60), and quality (DeQA-Score >4.0). Captions are generated/enhanced by Qwen3-VL.
*   **Image Editing:** Combines open-source datasets (X2Edit, OmniEdit, etc.) and synthesized pairs. Instructions are refined by Qwen3-VL for accuracy.
*   **Interleaved Data:** Constructed from the Koala36M video corpus with strict filtering for duration, quality, and motion. Frame sequences (2-6 frames) are captioned to provide instruction-following data.
*   **Reasoning-Augmented Data:** Sourced from Flux-6M, Zebra-CoT, and Weave (~8M samples) to enable chain-of-thought reasoning.
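
As an illustration of the image-generation filter described above, the following sketch applies the two stated score thresholds. The record layout, field names, and the `min_side` resolution cutoff are hypothetical; the summary gives only the ArtiMuse (>60) and DeQA-Score (>4.0) thresholds and no resolution value.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    # Hypothetical metadata fields for one web image.
    width: int
    height: int
    artimuse_score: float
    deqa_score: float


def passes_filters(rec: ImageRecord, min_side: int) -> bool:
    """Keep an image only if it clears the resolution, aesthetic, and quality checks."""
    return (
        min(rec.width, rec.height) >= min_side  # resolution cutoff (value is an assumption)
        and rec.artimuse_score > 60             # aesthetics threshold from the text
        and rec.deqa_score > 4.0                # quality threshold from the text
    )
```

Both score comparisons are strict, so an image scoring exactly 60 on ArtiMuse would be dropped.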

### 4. Model Training
A three-stage training pipeline is employed:

**Stage 0: Vision-Language Alignment:** Aligns visual and linguistic representations using image-caption pairs. A random masking strategy is applied (image tokens for generation, text tokens for understanding). The training resolution increases progressively from 256×256 to 512×512.

**Stage 1: Multi-task Pre-training:** Trains on diverse multimodal data (understanding: interleaved, OCR, grounding; generation: editing, controllable generation, style transfer) to develop comprehensive capabilities.

**Stage 2: Supervised Fine-Tuning (SFT):** Conducted in two phases (8k then 16k context) on high-quality instruction-following data for complex reasoning and generation.

#### 4.2 Pre-Training Optimization
*   **BDLM Loss:** The training uses the Block Diffusion Language Model (BDLM) objective, which operates on block-level masked regions:
    $$ \mathcal{L}_{\text{BDLM}}(\theta) = -\mathbb{E}_{t, x_0, x_t} \left[ \frac{\alpha'_t}{1 - \alpha_t} \sum_{k=1}^{K} \sum_{i=1}^{L_B} \mathbb{1}[x^i_{t,k} = \text{[MASK]}] \log p_\theta(x^i_{0,k} | x_{0,<k}, x_{t,k}) \right] $$
    where $K$ is the number of blocks, $L_B$ is the block size, $x_{0,<k}$ are preceding clean blocks, and $x_{t,k}$ is the noisy version of the current block.
*   **Load Balancing for MoE:** An auxiliary-loss-free mechanism is used to promote uniform expert utilization, with bias updates normalized for stability:
    $$ b_i = b_i + u \times \frac{(F_i - Q_i)}{\sqrt{\frac{1}{n} \sum_{j=1}^{n} (F_j - Q_j)^2}} $$
    where $F$ is the current expert load distribution, $Q$ is the ideal uniform distribution, $n$ is the number of experts, and $u$ is the bias update step size.
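
A small worked example of the normalized bias update may help. The sketch below applies the formula exactly as written above, including its sign convention; the function name and the zero-imbalance guard are assumptions added for illustration.

```python
import math


def update_biases(biases, load, target, u):
    """Apply b_i <- b_i + u * (F_i - Q_i) / RMS(F - Q) to every expert."""
    diffs = [f - q for f, q in zip(load, target)]
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    if rms == 0.0:
        # Assumption: leave biases unchanged when the load is already uniform.
        return list(biases)
    return [b + u * d / rms for b, d in zip(biases, diffs)]


if __name__ == "__main__":
    load = [0.40, 0.20, 0.20, 0.20]   # F: observed share of tokens per expert
    target = [0.25] * 4               # Q: uniform share
    print(update_biases([0.0] * 4, load, target, u=0.001))
```

Dividing by the root-mean-square deviation means the update vector always has an RMS magnitude of `u`, whether the imbalance is large or small, which is one way to read the "normalized for stability" remark.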

#### 4.3 Supervised Fine-Tuning Optimization
*   **Mask Token Reweighting Loss (MTRS):** Adapts the BDLM loss to be conditional on a prompt $c$. A re-weighting mechanism $\beta_j$ balances gradient contributions across samples with vastly different lengths:
    $$ \mathcal{L}_{\text{MTRS}} = \frac{\sum_j \beta_j \mathcal{L}^{(j)}_{\text{SFT}}}{\sum_j \beta_j}, \quad \text{where } \beta_j = \frac{1}{\sqrt{\sum_{k=1}^{K} \sum_{i=1}^{L_B} \mathbb{1}[x^{i,(j)}_{t,k} = \text{[MASK]}]}}. $$
*   **Complementary Masking:** Enhances data efficiency by constructing two antithetical training instances (primary and inverse mask) from a single sequence.
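
The re-weighting and the complementary masks can be sketched in a few lines. This is a simplified illustration: per-sample losses are taken as given numbers, every sample is assumed to have at least one masked token, and the masks are plain boolean lists over the positions eligible for masking. The function names are hypothetical.

```python
import math


def mtrs_loss(sample_losses, masked_counts):
    """Weighted average of per-sample losses with beta_j = 1 / sqrt(#masked tokens in sample j)."""
    betas = [1.0 / math.sqrt(m) for m in masked_counts]
    weighted = sum(b * loss for b, loss in zip(betas, sample_losses))
    return weighted / sum(betas)


def complementary_masks(primary):
    """Return the primary mask and its inverse, so each position is masked in exactly one instance."""
    inverse = [not m for m in primary]
    return primary, inverse
```

With losses `[2.0, 4.0]` and masked counts `[4, 16]`, the weights are 0.5 and 0.25, giving (1.0 + 1.0) / 0.75 ≈ 2.67, so the sample with more masked tokens contributes less per token than it would under a plain sum.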

#### 4.4 Diffusion Decoder Training
The diffusion decoder is optimized via a **flow matching objective**:
$$ \mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{x_0, x_1, z, t} \left[ \| v_{\theta, t}(x_t, z) - v_t \|_2^2 \right] $$
where $z$ represents the conditioned semantic visual tokens. Training is decoupled into three stages: Warm-up, Multi-domain Generalization, and High-fidelity Refinement.
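
For readers unfamiliar with flow matching, the sketch below evaluates the squared-error term for a single sample. It assumes the common linear (rectified-flow) path $x_t = (1 - t)\,x_0 + t\,x_1$ with target velocity $v_t = x_1 - x_0$; the summary does not define $x_t$ or $v_t$, so this parameterization, the function name, and the `predict_velocity` callback are all assumptions.

```python
def flow_matching_loss(x0, x1, z, t, predict_velocity):
    """Squared error between a predicted and a target velocity for one sample.

    predict_velocity(x_t, z, t) stands in for the decoder network v_theta.
    """
    # Assumed linear path between the two endpoints.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    # Assumed target velocity for that path: the constant difference x1 - x0.
    v_t = [b - a for a, b in zip(x0, x1)]
    v_pred = predict_velocity(x_t, z, t)
    return sum((p - v) ** 2 for p, v in zip(v_pred, v_t))
```

In training, this quantity would be averaged over sampled endpoints, conditioning tokens `z`, and timesteps `t`, matching the expectation in the objective.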

**Few-step Generation:** Achieved via a lightweight **consistency-based distillation** framework, combining flow matching with a consistency term:
$$ \mathcal{L}_{\text{Distill}}(\theta) = \mathbb{E}_{x_0, z, t} \left[ \| v_{\theta, t} - v_t \|_2^2 + \| u_{\theta, t} - v_t + t \cdot \frac{d u_{\theta^-, t}}{d t} \|_2^2 \right] $$
where $u_{\theta^-, t} = \text{stop\_grad}(u_{\theta, t})$. This enables 8-step CFG-free inference.

## Empirical Validation / Results

### 5.1 Multimodal Understanding Performance
LLaDA2.0-Uni was evaluated on 21 benchmarks across general VQA, reasoning, and OCR/document understanding.

**Table 2: Overall Comparison on Multimodal Understanding Benchmarks**
| **Category** | **Benchmark** | **Qwen2.5-VL-7B** | **LLaDA-V** | **BAGEL** | **InternVL-U** | **Lumina-DiMOO** | **LLaDA-o** | **LLaDA2.0-Uni** |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **General Tasks** | MMStar | 63.9 | 60.1 | 67.0 | 54.7 | 61.0 | 58.0 | **64.1** |
| | MMBench EN | 83.5 | 82.9 | 85.0 | 75.3 | 84.5 | 71.1 | 81.5 |
| | MME-C | 62.4 | 49.1 | 66.7 | 27.9 | 35.2 | 52.7 | 58.7 |
| **Reasoning Tasks** | MMMU val | 51.3 | 48.6 | 55.3 | 54.7 | 58.6 | 44.9 | 50.1 |
| | MathVista mini | 68.2 | 59.7 | 73.1 | 55.8 | 10.3 | 66.1 | **68.1** |
| **OCR & Chart** | CharXiv(DQ) | 73.9 | 47.0 | 70.6 | 53.3 | 27.8 | 69.8 | 68.4 |
| | ChartQA | 84.1 | 78.3 | 74.3 | 76.6 | 8.3 | 87.9 | 80.1 |
| | OCRBench | 84.2 | 63.2 | 73.3 | 83.9 | 7.6 | 74.6 | 75.7 |
| **Other Tasks** | CountBench | 84.9 | 75.1 | 93.2 | 62.2 | 48.4 | 91.7 | **86.0** |

**Key Findings:** LLaDA2.0-Uni demonstrates strong and comprehensive understanding capabilities. It outperforms existing diffusion-based unified models (Lumina-DiMOO, LLaDA-o) on most of the benchmarks listed, although Lumina-DiMOO scores higher on MMBench EN and MMMU val, and LLaDA-o scores higher on CharXiv(DQ), ChartQA, and CountBench. It performs on par with state-of-the-art specialized VLMs like Qwen2.5-VL-7B, even outperforming it on specific metrics (e.g., MMStar, CountBench).

### 5.2 Text-to-Image Generation Performance

**Table 3: Performance on GenEval Benchmark (Compositional Prompt Following)**
| **Type** | **Model** | **Arch.** | **Single Object** | **Two Object** | **Position** | **Attribute Binding** | **Overall** |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Gen. Only | Qwen-Image | Diff. | 0.99 | 0.92 | 0.76 | 0.77 | 0.87 |
| Unified | BAGEL | AR+Diff. | 0.99 | 0.94 | 0.64 | 0.63 | 0.82 |
| | Lumina-DiMOO | D-Diff. | 1.00 | 0.94 | 0.85 | 0.76 | 0.88 |
| | **LLaDA2.0-Uni** | **D-Diff.+Diff.** | **1.00** | **0.98** | **0.90** | **0.84** | **0.89** |

**Key Findings on Generation:**
*   **GenEval & DPG-Bench:** LLaDA2.0-Uni achieves highly competitive overall scores (0.89 on GenEval, 87.76 on DPG), outperforming all other unified models and bridging the gap with top-tier generation-only models. It shows a particular advantage in spatial arrangement (**Position: 0.90**).
*   **Text Rendering (CVTG-2K):** Leads unified models with a score of **0.765** and demonstrates exceptional stability in multi-region text generation.
*   **Reasoning-Informed Generation (WISE-Bench):** Achieves a strong score of **0.68**, ranking first among unified models. Incorporating a reasoning mode yields a further improvement of 0.10 (to **0.78**).

### 5.3 Image Editing Performance
LLaDA2.0-Uni achieves the best Overall score (**3.92**) on the ImgEdit benchmark among unified models, excelling in Adjust and Hybrid tasks. On the challenging multi-reference editing benchmark (MICo-Bench), it sets a new state-of-the-art with a score of **47.1**, significantly outperforming strong baselines.

### 5.4 Interleaved Generation and Reasoning
*   **Interleaved Generation:** On the proposed **InterGen benchmark**, LLaDA2.0-Uni generally outperforms Emu3.5, particularly in Story Telling and Time Series Forecasting tasks.
*   **Interleaved Reasoning:** The model demonstrates promising capabilities in step-by-step logical reasoning for tasks like chess strategy and physics problem-solving, as shown qualitatively.

### 5.5 Ablation Studies
**Table 13: Analysis of SPRINT Acceleration**
| **Method** | **Metric** | **AI2D** | **OCRBench** | **MMMU** | **GenEval** | **DPG** | **Avg. Score** | **Avg. TPS** |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaDA2.0-Uni | Score | 82.0 | 75.7 | 50.1 | 89.0 | 87.76 | 76.3 | 24.3 |
| + SPRINT | Score | 80.9 | 73.4 | 52.5 | 87.8 | 86.27 | 75.7 | **39.8** |
| | **∆** | **-1.1** | **-2.3** | **+2.4** | **-1.2** | **-1.5** | **-0.6** | **×1.6** |

**Key Finding:** SPRINT achieves a **1.6× speedup** (24.3 → 39.8 TPS) with a negligible average performance drop (**-0.6**). The speedup is largest on benchmarks with longer outputs (e.g., DocVQA: **3.5×**).
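
The per-benchmark deltas and the speedup in Table 13 can be re-derived from the listed values with a few lines of arithmetic. The averages are used as reported rather than recomputed, since the table as shown lists only some of the benchmarks (DocVQA, for instance, is cited in the text but has no column).

```python
baseline = {"AI2D": 82.0, "OCRBench": 75.7, "MMMU": 50.1, "GenEval": 89.0, "DPG": 87.76}
sprint = {"AI2D": 80.9, "OCRBench": 73.4, "MMMU": 52.5, "GenEval": 87.8, "DPG": 86.27}

# Score change per benchmark, rounded to one decimal as in the table.
deltas = {name: round(sprint[name] - baseline[name], 1) for name in baseline}

# Throughput ratio from the reported average tokens per second.
speedup = 39.8 / 24.3

# Change in the reported average score.
avg_delta = round(75.7 - 76.3, 1)

if __name__ == "__main__":
    print(deltas)
    print(round(speedup, 2), avg_delta)
```

The ratio comes out at roughly 1.64, which the table rounds to 1.6×.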

**Table 14: Analysis of Diffusion Decoder Distillation**

---

_Markdown view of https://picx.dev/p/D5leQk, served by PicX — AI-generated visual whiteboard summaries of research papers._