# Qwen-Image-2.0 Technical Report

> Qwen-Image-2.0 is a unified image generation model that excels in ultra-long text rendering, multilingual typography, and high-resolution photorealism within a single efficient framework.

- **Source:** [arXiv](https://arxiv.org/abs/2605.10730)
- **Published:** 2026-05-13
- **Permalink:** https://picx.dev/p/6WrV4D
- **Whiteboard:** https://picx.dev/p/6WrV4D/image

## Summary (Overview)
* **Unified Model:** Qwen-Image-2.0 is an omni-capable image generation foundation model that integrates high-fidelity text-to-image (T2I) generation and precise text-guided image-to-image (TI2I) editing within a single framework.
* **Key Capabilities:** The model excels in **ultra-long text rendering** (up to 1K tokens), **multilingual typography**, **high-resolution photorealism**, **robust instruction following**, and **improved inference efficiency**.
* **Architecture:** It couples a **Qwen3-VL** multimodal encoder with a **Multimodal Diffusion Transformer (MMDiT)** backbone and employs a **high-compression (16×) VAE** for efficient high-resolution synthesis.
* **Training Strategy:** The model is trained using a **multi-stage, multi-resolution pipeline** (256p → 2048p) and refined with **Reinforcement Learning from Human Feedback (RLHF)** and **few-step distillation**.
* **Performance:** Extensive evaluations show substantial improvements over previous models, ranking **#9 globally** and **#1 among Chinese models** on the LMArena benchmark, with superior performance in text rendering, portrait generation, and editing tasks.

## Introduction and Theoretical Foundation
The field of image generation has advanced significantly through diffusion models, Transformer-based architectures, and the integration of vision-language foundation models as conditional encoders. Despite progress, several bottlenecks persist in real-world creative workflows:
* **Ultra-long text rendering** becomes fragile with increasing character counts.
* **Multilingual typography** is underdeveloped for non-English/Chinese scripts.
* **High-resolution photorealism** deteriorates, introducing repeated textures and incoherent lighting.
* **Complex instruction following** leads to concept omission or hallucination.
* **Computational cost** constrains deployment in resource-limited settings.

Furthermore, existing systems typically excel in one area (e.g., photorealism or text rendering) but rarely deliver all capabilities simultaneously for both generation and editing within a single, efficient architecture.

**Qwen-Image-2.0** aims to address these challenges by unifying T2I generation and TI2I editing. Its design is grounded in comprehensive data curation and a customized multi-stage training pipeline, leveraging strong multimodal understanding from Qwen3-VL while preserving generative flexibility via the MMDiT backbone.

## Methodology

### 1. Data Infrastructure
A large-scale, diverse data pipeline supports unified training for both T2I and TI2I tasks. Data construction follows principles of **broad domain coverage**, **strong instruction quality**, and **source-target consistency**.

**Data Annotation:** A fine-grained captioning framework is designed for different task types:
* **General captions:** Comprehensive descriptions of visual content, including text.
* **Text captions:** Emphasis on accurately extracting dense textual content and layout structure.
* **Knowledge captions:** Inject image-related background information or contextual cues.
* **Structured captions:** Explicitly model entities, attributes, and relations for complex visual structures.

**Multi-Stage Training Data Strategy:** A six-stage filtering pipeline progressively refines data:
1. **Stage 1 (256p T2I pre-training):** Apply eight sequential filters (Broken Files, Resolution, Deduplication, NSFW, Rotation, Entropy, CLIP, Token Length).
2. **Stage 2 (256p T2I & TI2I pre-training):** Introduce Edit Data alongside filtered T2I data.
3. **Stage 3 (512p T2I & TI2I pre-training):** Scale resolution to 512p and introduce Synthetic Data.
4. **Stage 4 (512p/1024p T2I & TI2I pre-training):** Extend to mixed 512p/1024p resolution with additional high-resolution filters (Resolution, Image Quality, Image Aesthetic, Compression Quality).
5. **Stage 5 (Multi-Resolution T2I & TI2I pre-training):** Expand to 512p, 1024p, and 2048p resolutions with a dedicated 2048p Resolution Filter.
6. **Stage 6 (Supervised Fine-tuning):** Apply a Distribution Filter with stricter thresholds for final SFT.
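
The staged filtering above can be pictured as a list of predicates applied in sequence, with later stages appending stricter predicates. The following is a minimal sketch under stated assumptions: the `Sample` fields, the thresholds, and the whitespace tokenizer are all hypothetical, and only three of the eight Stage 1 filters are illustrated.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    width: int
    height: int
    caption: str
    nsfw_score: float  # hypothetical classifier score in [0, 1]


Filter = Callable[[Sample], bool]


def resolution_filter(min_side: int) -> Filter:
    # Keep samples whose shorter side is at least min_side pixels.
    return lambda s: min(s.width, s.height) >= min_side


def nsfw_filter(max_score: float) -> Filter:
    return lambda s: s.nsfw_score <= max_score


def token_length_filter(max_tokens: int) -> Filter:
    # Whitespace splitting stands in for a real tokenizer (assumption).
    return lambda s: 0 < len(s.caption.split()) <= max_tokens


def run_filters(samples: Iterable[Sample], filters: List[Filter]) -> List[Sample]:
    """Apply filters one after another; a sample must pass all of them."""
    kept = list(samples)
    for keep in filters:
        kept = [s for s in kept if keep(s)]
    return kept


# Hypothetical thresholds, chosen only for illustration.
stage1 = [resolution_filter(256), nsfw_filter(0.5), token_length_filter(512)]
stage5_2048 = stage1 + [resolution_filter(2048)]

samples = [
    Sample(300, 400, "a cat on a sofa", 0.01),
    Sample(128, 128, "tiny icon", 0.0),
    Sample(4096, 2304, "city skyline at dusk", 0.02),
    Sample(1024, 1024, "", 0.0),
]
kept_stage1 = run_filters(samples, stage1)
kept_2048 = run_filters(samples, stage5_2048)
```

Expressing each stage as "previous filters plus new ones" mirrors the progressive tightening described in the list, where the 2048p bucket adds its own resolution filter on top of the earlier ones.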

**Closed-loop Data Flywheel System:** An automated system for continuous model optimization:
* **Stage 1:** Multi-source signal collection (model evaluation, bad-case mining, user feedback).
* **Stage 2:** Case routing & targeted optimization based on error attribution:
    * **RL track:** For alignment/policy issues.
    * **Pre-training track:** For missing knowledge; uses a vector retrieval engine for data augmentation.
    * **Prompt engineering track:** For inaccurate instruction understanding.
* **Stage 3:** Model update & closed loop.

### 2. Architecture
The architecture comprises three tightly coupled components (see Figure 8):
1. **Multimodal Large Language Model (MLLM):** Frozen **Qwen3-VL** encoder extracts semantic features from user inputs.
2. **Variational Autoencoder (VAE):** Encodes images into latent representations with a **high-compression ratio (16×)**.
3. **Multimodal Diffusion Transformer (MMDiT):** Performs the core denoising process in latent space conditioned on multimodal representations.

**Variational Autoencoder:** To balance compression efficiency, reconstruction fidelity, and latent diffusability:
* Uses a **residual autoencoder architecture** to preserve fine-grained spatial details.
* Increases latent dimensionality to **64 channels** (f16c64 configuration).
* Trained on a large-scale internal corpus of text-rich images.
* Introduces a **semantic alignment loss** alongside reconstruction and perceptual losses.
* Adopts **dynamic semantic alignment** (strong early, relaxed later) and removes adversarial loss for stability.
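
To make the f16c64 setting concrete, the short sketch below computes the latent shape and the overall value-count compression for a given downsampling factor and channel count. It assumes a 3-channel input and ignores any patchification the MMDiT may apply afterwards; the function names are illustrative.

```python
from typing import Tuple


def latent_shape(height: int, width: int, factor: int, channels: int) -> Tuple[int, int, int]:
    """Latent (channels, height, width) for a VAE with spatial downsampling `factor`."""
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (channels, height // factor, width // factor)


def value_compression(factor: int, channels: int, image_channels: int = 3) -> float:
    """Ratio of input values to latent values per spatial block."""
    return (image_channels * factor * factor) / channels


f16c64 = latent_shape(2048, 2048, factor=16, channels=64)
f8c16 = latent_shape(2048, 2048, factor=8, channels=16)
spatial_positions_f16 = f16c64[1] * f16c64[2]
spatial_positions_f8 = f8c16[1] * f8c16[2]
```

Under these assumptions, f16c64 and f8c16 store the same number of latent values per image, but f16c64 has a quarter of the spatial positions, which is where the efficiency gain for a Transformer backbone would come from.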

**VAE reconstruction performance** is evaluated quantitatively:

**Table 1: Quantitative evaluation results of VAEs under different settings.**

| Model | Setting | Enc. Params (M) | Dec. Params (M) | ImageNet 256×256 PSNR | ImageNet 256×256 SSIM | Text 256×256 PSNR | Text 256×256 SSIM |
|---|---|---|---|---|---|---|---|
| SD-3.5 (Esser et al., 2024) | f8c16 | 34 | 50 | 31.22 | 0.8839 | 29.93 | 0.9658 |
| Cosmos-CI8x8 (Agarwal et al., 2025) | f8c16 | 31 | 46 | 32.23 | 0.9010 | 30.62 | 0.9664 |
| Wan2.1 (Wan et al., 2025) | f8c16 | 54 | 73 | 31.29 | 0.8870 | 26.77 | 0.9386 |
| HunyuanVideo (Kong et al., 2024) | f8c16 | 100 | 146 | 33.21 | 0.9143 | 32.83 | 0.9773 |
| FLUX.1-dev (BlackForest, 2024) | f8c16 | 34 | 50 | 32.84 | 0.9155 | 32.65 | 0.9792 |
| Qwen-Image (Wu et al., 2025) | f8c16 | 54 | 73 | 33.42 | 0.9159 | 36.63 | 0.9839 |
| HunyuanImage-3.0 (Cao et al., 2025) | f16c32 | 389 | 871 | 31.08 | 0.8655 | 29.23 | 0.9521 |
| Wan2.2 (Wan et al., 2025) | f16c48 | 150 | 555 | 31.30 | 0.8784 | 28.19 | 0.9508 |
| Stepvideo-T2V (Ma et al., 2025) | f16c64 | 110 | 389 | 31.54 | 0.8973 | 29.62 | 0.9641 |
| **Qwen-Image-2.0** | **f16c64** | **79** | **259** | **33.42** | **0.9225** | **32.81** | **0.9795** |
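
For readers unfamiliar with the PSNR column, the sketch below shows the standard definition, $10 \log_{10}(\text{MAX}^2 / \text{MSE})$, on flat lists of pixel values. The 8-bit value range is an assumption; this summary does not state the report's exact evaluation protocol (color space, value range, or averaging).

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 5 grey levels gives an MSE of 25.
example_db = psnr([0, 10, 20], [5, 15, 25])
```

Because the scale is logarithmic, a gap of about 3 dB between two rows of Table 1 corresponds to roughly halving the mean squared error.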

**Multi-modal Diffusion Transformer:** The MMDiT jointly models text and image tokens within a shared transformer backbone.
* Given visual inputs $x$ and textual inputs $y$, Qwen3-VL encodes them into $h_x$ and $h_y$. The visual features $h_x$ are then replaced by the VAE latent $E_x$.
* The multimodal sequence is constructed by concatenation:
$$h = \text{Concat}(E_x, h_y)$$
* Uses **MSRoPE** for unified cross-modal positional encoding.
* For modulation, uses a purely multiplicative formulation (removes bias):
$$h' = \alpha h$$
* Introduces **SwiGLU** in MLP layers to alleviate activation magnitude issues during joint text-image training:
$$h = \Phi_1(x) \otimes \sigma(\Phi_2(x))$$
where $\Phi_1(\cdot)$ and $\Phi_2(\cdot)$ are linear projections, $\sigma(\cdot)$ is the SiLU activation, and $\otimes$ is element-wise multiplication.
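
The three formulas above (sequence concatenation, bias-free modulation, and SwiGLU) can be sketched in plain Python on small vectors. This is only an illustration of the equations as written here: the weights are toy values, tokens are plain lists, and details such as layer normalization, hidden sizes, and the output projection of the MLP are omitted.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def silu(z: float) -> float:
    # SiLU(z) = z * sigmoid(z)
    return z / (1.0 + math.exp(-z))


def linear(x: Vector, weight: Matrix) -> Vector:
    """Bias-free linear projection; each row of `weight` produces one output."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def swiglu(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """h = Phi_1(x) * SiLU(Phi_2(x)), element-wise."""
    return [a * silu(b) for a, b in zip(linear(x, w1), linear(x, w2))]


def modulate(h: Vector, alpha: Vector) -> Vector:
    """Purely multiplicative modulation h' = alpha * h (no additive shift)."""
    return [a * hi for a, hi in zip(alpha, h)]


def build_sequence(image_latents: List[Vector], text_tokens: List[Vector]) -> List[Vector]:
    """h = Concat(E_x, h_y): image latent tokens followed by text tokens."""
    return image_latents + text_tokens


identity = [[1.0, 0.0], [0.0, 1.0]]
gated = swiglu([1.0, 0.0], identity, identity)
scaled = modulate([2.0, -1.0], [0.5, 3.0])
sequence = build_sequence([[0.1, 0.2]], [[0.3, 0.4], [0.5, 0.6]])
```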

**Prompt Enhancer (PE):** A rewriting module that converts user queries into structured, detail-rich prompts.
* **Data Construction:** A reverse-engineering pipeline atomically degrades fine-grained annotations $P_{\text{fine}}$ into diverse, colloquial user prompts $P_{\text{short}}$, while recording inverse reasoning traces as training supervision. Strategies include stylistic simplification, colloquialization, and removal/underspecification of visual details.
* **PE Training:** Initialized from Qwen3.5-9B, trained in two stages:
    1. **SFT:** Standard next-token prediction on constructed dataset.
    2. **RL:** Based on **Group Relative Policy Optimization (GRPO)**; optimized with rewards combining MLLM-based visual consistency, MLLM-based aesthetic quality, and rule-based textual constraints.
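
The RL stage can be illustrated with the group-relative advantage that gives GRPO its name: several rewrites are sampled for the same query, each is scored, and each reward is normalized against the group's mean and standard deviation. The sketch below follows that commonly described recipe; the reward weights and the binary treatment of the rule-based term are hypothetical, since the summary does not say how the three rewards are combined.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def combined_reward(
    consistency: float,
    aesthetic: float,
    rules_satisfied: bool,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # hypothetical weights
) -> float:
    """Weighted sum of the three reward signals named in the summary."""
    w_cons, w_aes, w_rule = weights
    return w_cons * consistency + w_aes * aesthetic + w_rule * (1.0 if rules_satisfied else 0.0)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
flat_group = group_relative_advantages([0.7, 0.7])
```

Because advantages are relative within a group, a group whose samples all receive the same reward contributes no learning signal, and no separate value network is needed.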

### 3. Training Strategy

**Multistage Training:** Comprises three phases with progressive adjustments:

**Table 2: Training configurations, data distribution, and hyperparameters.**

| Configuration | Pre-training | Continual Pre-training | Supervised Fine-tuning |
|---|---|---|---|
| **Training Process** | | | |
| Steps (K) | 700 | 250 | 10 |
| Resolution | 256/512 | 512/1024/2048 | 512/1024/2048 |
| Batch Size (K) | 32/16 | 16/8/4 | 16/8/4 |
| **Data Distribution** | | | |
| Type | T2I/TI2I | T2I/TI2I | T2I/TI2I |
| Ratio | 0.9/0.1 | 0.7/0.3 | 0.7/0.3 |
| **Hyperparameters** | | | |
| Optimizer | Adam | Adam | Adam |
| Weight Decay | 0.001 | 0.001 | 0.001 |
| Grad. Norm Clip | 1.0 | 1.0 | 1.0 |
| Uncond. Dropout | 0.1 | 0.1 | 0.1 |
| Learning Rate | $1 \times 10^{-4}$ | $2 \times 10^{-5}$ | $1 \times 10^{-5}$ |

* **Pre-training:** 700K steps at low resolutions (256/512); learns basic semantic representations.
* **Continual pre-training:** 250K steps; gradually increases resolution to 512–2048; adjusts data ratio to 7:3 T2I/TI2I.
* **Supervised fine-tuning:** ~10K steps; focuses on aesthetic quality with strict filtering and manual curation.
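
One way to read Table 2 is as a piecewise schedule indexed by global training step. The sketch below encodes the step counts, learning rates, and T2I ratios from the table. It assumes the learning rate is constant within each phase and that phases run back to back; the summary states neither, and any warmup or decay is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    steps: int
    learning_rate: float
    t2i_ratio: float  # fraction of T2I data; the remainder is TI2I


PHASES = (
    Phase("pre-training", 700_000, 1e-4, 0.9),
    Phase("continual pre-training", 250_000, 2e-5, 0.7),
    Phase("supervised fine-tuning", 10_000, 1e-5, 0.7),
)

TOTAL_STEPS = sum(p.steps for p in PHASES)


def phase_at(step: int) -> Phase:
    """Return the training phase that contains the given global step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    start = 0
    for phase in PHASES:
        if step < start + phase.steps:
            return phase
        start += phase.steps
    raise ValueError("step is beyond the end of the schedule")
```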

**Reinforcement Learning from Human Feedback (RLHF):** Refines the base diffusion model via multi-dimensional reward signals.
* **Reward Modeling:** Task-specific composite reward models:
    * **Aesthetic reward (T2I):** Visual quality (composition, lighting, texture, coherence).
    * **Image-text alignment reward (T2I):** Semantic correspondence with prompt.
    * **Portrait reward (T2I):** Anatomical plausibility, facial accuracy, texture realism.
    * **Instruction-following reward (TI2I):** Accuracy of specified modifications.
    * **Visual consistency reward (TI2I):** Preservation of unmodified regions' identity and structure.
* **Training:** Optimizes using an adapted **GRPO framework**. A hybrid strategy: **Classifier-free Guidance (CFG)** is used during rollout sampling but excluded from policy optimization. The RL-aligned model is denoted **Qwen-Image-2.0-RL**.
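
The hybrid CFG strategy can be sketched with the standard classifier-free guidance combination, which extrapolates from the unconditional prediction towards the conditional one. The `model_prediction` switch below is one possible reading of "used during rollout sampling but excluded from policy optimization" (guided prediction for rollouts, conditional branch only for the policy update); it is an interpretation, not a description of the report's implementation.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def model_prediction(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float,
    rollout: bool,
) -> List[float]:
    """Guided prediction when sampling rollouts, conditional-only otherwise (assumed reading)."""
    if rollout:
        return cfg_combine(cond, uncond, scale)
    return list(cond)


guided = model_prediction([1.0, 2.0], [0.0, 1.0], scale=4.0, rollout=True)
for_update = model_prediction([1.0, 2.0], [0.0, 1.0], scale=4.0, rollout=False)
```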

**Few-step Distillation:** Distills the multi-step model into a few-step variant for efficiency using **Distribution Matching Distillation (DMD)**.
* Given a conditional few-step student generator $G_\theta$, noise $\epsilon \sim \mathcal{N}(0, I)$, and condition $c \sim p(c)$, the clean-state prediction is $x_\theta = G_\theta(\epsilon, c)$.
* The gradient of the DMD objective $\ell_{\text{DMD}}(\theta)$ is:
$$\nabla_\theta \ell_{\text{DMD}}(\theta) = \mathbb{E}_{c \sim p(c),\, \epsilon \sim \mathcal{N}(0, I),\, \xi \sim \mathcal{N}(0, I),\, t \sim p(t)} \left[ \left( s_{\text{fake}}(x_t, t, c) - s_{\text{real}}(x_t, t, c) \right) \nabla_\theta x_\theta \right]$$
where $\xi$ is independent Gaussian noise, $t \in [0,1]$ is the diffusion time, and $x_t$ is the linear interpolation:
$$x_t = (1 - t) x_\theta + t \xi$$
* $s_{\text{fake}}(x_t, t, c) = \nabla_{x_t} \log p_{\text{fake}, t}(x_t|c)$ is the conditional score from the student-induced distribution (estimated by an auxiliary fake score model).
* $s_{\text{real}}(x_t, t, c) = \nabla_{x_t} \log p_{\text{real}, t}(x_t|c)$ is the conditional target score from the pretrained teacher.
* The distilled model is **Qwen-Image-2.0-Distillation** (4 NFE student vs. 40-step teacher).
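
The DMD gradient above can be checked on a one-dimensional toy problem where both scores are known in closed form. In the sketch below, the "real" distribution is a Gaussian with mean `real_mean`, the student is $x_\theta = \theta + \text{std} \cdot \epsilon$ (so $\nabla_\theta x_\theta = 1$), and the fake score is computed exactly rather than by an auxiliary network. All of this is an illustrative assumption; the actual student, teacher, conditioning, and time distribution $p(t)$ are not specified here.

```python
import random


def noisy_score(x_t: float, t: float, mean: float, std: float) -> float:
    """Score of x_t = (1 - t) x + t xi, with x ~ N(mean, std^2) and xi ~ N(0, 1)."""
    var_t = ((1.0 - t) * std) ** 2 + t ** 2
    return -(x_t - (1.0 - t) * mean) / var_t


def dmd_gradient(theta: float, real_mean: float, std: float = 1.0, n: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[(s_fake - s_real) * d x_theta / d theta]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        xi = rng.gauss(0.0, 1.0)
        t = rng.uniform(0.02, 0.98)  # toy choice of p(t)
        x_theta = theta + std * eps  # toy student, so d x_theta / d theta = 1
        x_t = (1.0 - t) * x_theta + t * xi
        s_fake = noisy_score(x_t, t, theta, std)
        s_real = noisy_score(x_t, t, real_mean, std)
        total += (s_fake - s_real) * 1.0
    return total / n


# Plain gradient descent on theta moves the student's mean towards the real mean.
theta = 2.0
for _ in range(20):
    theta -= 0.5 * dmd_gradient(theta, real_mean=0.0)
```

In this toy setting the gradient is positive when the student's mean lies above the real mean and negative when it lies below, so descent pulls the student distribution onto the target, which is the behaviour the objective is designed to produce.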

## Empirical Validation / Results

### 1. Benchmark Evaluation (LMArena)
On the LMArena T2I leaderboard (blind, ELO-based), Qwen-Image-2.0 achieves strong performance:
* **Global Rank:** #9
* **Chinese Models Rank:** #1
* **ELO Score:** 1168, outperforming Nano Banana.
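
To give a rating gap some intuition, the sketch below uses the standard Elo logistic formula for the expected win rate between two ratings. The opponent rating of 1068 is hypothetical, and LMArena's exact fitting procedure may differ from classic Elo, so the numbers are only indicative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical matchup: a model rated 1168 against one rated 100 points lower.
win_rate = elo_expected_score(1168, 1068)
```

Under this model a 100-point lead corresponds to being preferred in roughly 64% of pairwise comparisons.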

**Figure 1: Qwen-Image-2.0 shows significant improvements across core dimensions, including photorealism and portrait generation, in LMArena (accessed April 22, 2026).**
The figure compares the ELO scores of Qwen-Image-2.0 with those of its predecessors, Qwen-Image and Qwen-Image-2512, across categories (Product, 3D Modeling, Cartoon, Photorealism, Art, Portraits, Text Rendering, Overall). Qwen-Image-2.0 consistently scores higher (1135-1155 range) than its predecessors (1046-1076 range).

### 2. Qualitative Results on Text-to-Image Generation
**Text Rendering (Figure 13):** Qwen-Image-2.0 uniquely achieves high-fidelity text rendering with negligible errors and harmonious typographic integration, outperforming competitors (GPT-Image-2, NanoBanana Pro, Qwen-Image-2512, Wan2.7 Pro, Seedream 5.0 Lite) which exhibit character-level errors, omission, incorrect scaling, or lack of spatial binding.

**Portrait Generation (Figures 14 & 15):** Qwen-Image-2.0 simultaneously achieves high-fidelity text rendering, photorealistic material textures, and natural lighting consistency. Competitors show failures like artificial textures, misinterpreted occlusion instructions, hallucinated text, incorrect blur application, or altered subject identity.

**Multilingual Text Rendering (Figure 18):** The model handles a wide range of languages (English, Chinese, Japanese, Korean, Arabic, Hindi, Thai, Bengali, Tamil, Gujarati, etc.) with higher character accuracy and support for complex typography.

**Slide Generation (Figure 19):** Demonstrates capability to directly generate professional text-rich visual content like slides.

### 3. Qualitative Results on Image Editing
**Complex Text Rendering (Figure 16):** In TI2I tasks adding classical Chinese poetry to images, Qwen-Image-2.0 is the only model that preserves character-level accuracy, canonical line order, and coherent vertical composition (ti-hua-shi aesthetic). Baselines (Qwen-Image-Edit-2511, NanoBanana Pro, Wan2.7 Pro, Seedream 5.0 Lite) exhibit failures like small-scale rendering, duplication, character errors, or disjointed columns.

**Identity Preservation (Figure 17):** In single-image and multi-image editing tasks (e.g., adding objects, transferring hats, scene composition), Qwen-Image-2.0 uniquely preserves subject identity (facial features, posture, appearance) while accurately satisfying editing instructions. Baselines change fur color/posture, misplace objects, alter ethnicity, or render objects insufficiently realistic.

### 4. RLHF and Distillation Results
**RLHF (Figure 10):** Qualitative comparisons between **Qwen-Image-2.0-Base** and **Qwen-Image-2.0-RL** show RL further improves visual quality in diverse scenarios (portraits, landscapes, posters, natural scenes), enhancing texture fidelity, realism, and instruction following.

**Distillation (Figure 11):** The **4-NFE student (Qwen-Image-2.0-Distillation)** produces results visually comparable to the **40-step teacher (Qwen-Image-2.0-RL)** across diverse prompts, preserving detail, coherence, and semantic alignment while reducing inference cost.

## Theoretical and Practical Implications
* **Unified Architecture:** Demonstrates that a single model can effectively handle both generation and editing tasks, reducing pipeline complexity and improving usability.
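* **High-Compression VAE:** Shows that a 16× compression ratio can retain strong reconstruction fidelity when combined with residual autoencoding, increased channels, and semantic alignment loss, enabling efficient high-resolution synthesis. In Table 1, Qwen-Image-2.0 has the best PSNR/SSIM among the f16 VAEs and matches or exceeds most f8 VAEs, although the f8 Qwen-Image VAE remains ahead on text images.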
* **Multi-Stage Training & Data Flywheel:** Provides a blueprint for scalable, iterative model development, emphasizing progressive resolution scaling, targeted data filtering, and automated

---

_Markdown view of https://picx.dev/p/6WrV4D, served by PicX — AI-generated visual whiteboard summaries of research papers._