CHEERS: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Summary (Overview)

  • Core Innovation: Introduces CHEERS, a unified multimodal model that decouples patch-level details from semantic representations to harmonize visual comprehension and generation within a single framework.
  • Key Architecture: Features a unified vision tokenizer for efficient encoding, an LLM-based Transformer for hybrid (autoregressive & diffusion) decoding, and a cascaded flow matching (CFM) head that first generates semantics and then refines details via gated high-frequency injection.
  • Main Results: CHEERS achieves competitive or superior performance to state-of-the-art unified models on both understanding (e.g., MMBench, OCRBench) and generation (GenEval, DPG-Bench) benchmarks, using only 83M training samples.
  • Efficiency: Achieves a 4× token compression rate for efficient high-resolution modeling and outperforms the Tar-1.5B model on key benchmarks while requiring only 20% of its training cost.

Introduction and Theoretical Foundation

A cutting-edge goal in AI is to unify visual comprehension (like MLLMs) and high-fidelity image generation (like diffusion models) within a single model, moving towards more human-like multimodal intelligence. However, this unification is challenging due to fundamentally mismatched requirements:

  • Decoding Mechanisms: Comprehension favors autoregressive (AR) decoding of discrete tokens (seamless with LLMs), while generation benefits from parallel diffusion/flow-based decoding of continuous latents for global context.
  • Visual Representations: Understanding relies on semantic-rich features from vision encoders (e.g., SigLIP), whereas generation needs detail-preserving latents from reconstruction-oriented tokenizers (e.g., VAEs).

Prior Unified Multimodal Models (UMMs) either use separated feature spaces for each task (losing synergy) or attempt to fuse heterogeneous features into a shared token interface, often leading to optimization conflicts and subpar performance in one or both tasks.

CHEERS' Theoretical Insight: The paper posits that explicitly decoupling high-level semantics from low-level patch details can resolve this conflict. Semantics provide stable grounding for understanding, while high-frequency details can be injected in a controlled manner to refine generation fidelity. This mirrors a human-like, coarse-to-fine creative process.

Methodology

CHEERS comprises three core components, as illustrated in Figure 3 of the paper.

1. Unified Vision Tokenizer

This module encodes an input image $X \in \mathbb{R}^{H \times W \times 3}$ into compressed semantic tokens for the LLM.

  • A VAE encoder first produces latent states $z_1 \in \mathbb{R}^{h \times w \times d}$ ($h = H/16$, $w = W/16$).
  • A task-dependent latent $z_t$ is formulated as $z_t = t z_1 + (1 - t) z_0$, where $z_0 \sim \mathcal{N}(0, 1)$. For understanding, $t = 1$; for generation, $t \in (0, 1)$; for text-only tasks, $t = 0$.
  • Critically, $z_t$ is passed through a VAE decoder $D(\cdot)$ to reconstruct a pixel image, which is then encoded by a SigLIP2-ViT semantic encoder $S(\cdot)$ to extract high-level semantic tokens $z_s^{(t)} \in \mathbb{R}^{h \times w \times d'}$. This preserves fine-grained details that direct latent processing would lose.
  • A Pixel-Unshuffle module compresses these tokens spatially by 2× in each dimension and projects the channels, yielding $Z_s^{(t)} \in \mathbb{R}^{h/2 \times w/2 \times c}$ for efficient LLM conditioning (4× token compression).
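The two key numeric operations of the tokenizer can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: function names are invented, and the channel projection after unshuffling is omitted.

```python
import numpy as np

def task_latent(z1: np.ndarray, t: float, seed: int = 0) -> np.ndarray:
    """Task-dependent latent: z_t = t * z1 + (1 - t) * z0, z0 ~ N(0, 1).
    t = 1 for understanding, t in (0, 1) for generation, t = 0 for text-only."""
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal(z1.shape)
    return t * z1 + (1.0 - t) * z0

def pixel_unshuffle(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Fold r x r spatial neighborhoods into channels:
    (h, w, d) -> (h/r, w/r, d*r*r). For r = 2 this quarters the token
    count -- the 4x compression described in the text."""
    h, w, d = x.shape
    x = x.reshape(h // r, r, w // r, r, d)   # split rows and cols into blocks
    x = x.transpose(0, 2, 1, 3, 4)           # group block offsets together
    return x.reshape(h // r, w // r, d * r * r)
```

Note that `pixel_unshuffle` only rearranges values; the total element count is unchanged, so no information is lost at this step.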

2. Unified LLM-based Transformer

  • Uses Qwen2.5-1.5B-Instruct as the backbone.
  • Concatenates semantic visual tokens $Z_s^{(t)}$ and text embeddings $Z_{\text{text}}$ into a unified sequence.
  • Employs a bidirectional attention mask on $Z_s^{(t)}$ for global visual context and a causal mask on $Z_{\text{text}}$ for AR decoding.
  • Outputs are routed: to a standard LM head for text/understanding tasks, or to the Cascaded Flow Matching Head for image generation.
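The hybrid attention pattern can be sketched as a boolean mask over a `[visual | text]` sequence. This is an assumption-laden sketch: the paper does not spell out the exact layout, and letting text tokens attend to the full visual prefix follows the common prefix-LM convention rather than a stated detail.

```python
import numpy as np

def hybrid_mask(n_vis: int, n_txt: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for [visual | text]:
    visual tokens attend bidirectionally among themselves, text tokens are
    causal over text and (assumed) attend to all visual tokens."""
    n = n_vis + n_txt
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_vis, :n_vis] = True                                 # bidirectional visual block
    mask[n_vis:, n_vis:] = np.tril(np.ones((n_txt, n_txt), bool))  # causal text block
    mask[n_vis:, :n_vis] = True                                 # text -> visual prefix
    return mask
```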

3. Cascaded Flow Matching (CFM) Head

This two-stage head explicitly decouples and then integrates semantics and details.

  • Stage 1 (Semantic Generation): Takes the LLM's contextualized hidden states $Z_s^{(t)}$ and uses DiT blocks to perform low-resolution semantic generation. A PixelShuffle module then up-samples to $Z_s'^{(t)} \in \mathbb{R}^{h \times w \times d'}$.
  • Stage 2 (Detail Refinement): Injects high-frequency patch details $S(D(z_t))$ from the vision tokenizer using a gating network $G(\cdot)$: $Z_s'^{(t)} \leftarrow G(Z_s'^{(t)}) \odot S(D(z_t)) + Z_s'^{(t)}$, where $G(Z_s'^{(t)}) \in \mathbb{R}^{h \times w \times 1}$ is a scalar gating map and $\odot$ denotes element-wise multiplication. The gating intensity is dynamically coupled with the denoising timestep $t$.
  • The refined features are passed through further DiT layers to predict the velocity field $V_t$.
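The Stage-2 update can be sketched as follows. The sigmoid squashing of the gate outputs is an assumption (the text only specifies a scalar gating map in $\mathbb{R}^{h \times w \times 1}$), and `gate_logits` stands in for a hypothetical gating network's raw output.

```python
import numpy as np

def gated_injection(z_sem: np.ndarray, detail: np.ndarray,
                    gate_logits: np.ndarray) -> np.ndarray:
    """Residual high-frequency injection: Z <- G(Z) * detail + Z.
    z_sem, detail: (h, w, d') features; gate_logits: (h, w, 1), so the
    per-position scalar gate broadcasts over the channel dimension."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid -> gate in (0, 1)
    return gate * detail + z_sem
```

Because the gate is residual, a gate near zero leaves the semantic features untouched, which matches the coarse-to-fine behaviour reported in the HFI analysis (weak injection early, strong injection late).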

Training Pipeline & Objectives

A four-stage progressive training strategy is employed (details in Table 1).

Overall Training Loss: A weighted sum of the AR text loss and the Flow Matching image loss.

$L_{\text{total}} = L_{\text{AR}} + \lambda L_{\text{FM}}$

where $\lambda = 1$, $L_{\text{AR}} = -\log P_\theta(y \mid C)$, and $L_{\text{FM}} = \| v_\theta(Z_s'^{(t)}) - (z_1 - z_0) \|_2^2$.
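A minimal sketch of the combined objective, assuming per-element averaging for both terms (the text gives the flow-matching loss as a squared L2 norm; the reduction is an implementation detail not specified there):

```python
import numpy as np

def total_loss(logp_text: np.ndarray, v_pred: np.ndarray,
               z1: np.ndarray, z0: np.ndarray, lam: float = 1.0) -> float:
    """L_total = L_AR + lambda * L_FM, with the flow-matching velocity
    target v = z1 - z0 (straight path from noise z0 to clean latent z1)."""
    l_ar = -np.mean(logp_text)                 # token-level negative log-likelihood
    l_fm = np.mean((v_pred - (z1 - z0)) ** 2)  # MSE against the velocity target
    return float(l_ar + lam * l_fm)
```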

Inference for Generation: Uses continuous-time flow-based sampling. Starting from noise $z_0$, the latent is iteratively updated via numerical integration of the predicted velocity field:

$z_{t+\Delta t} = z_t + \int_t^{t+\Delta t} V_\tau \, d\tau$

until reaching the terminal latent $z_1$, which is decoded by the VAE decoder into the final image. Classifier-free guidance (CFG) is applied.
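The sampling loop can be sketched with forward Euler integration (one common discretization of the integral above; the paper's actual solver and guidance scale are not specified, and `velocity_fn` is a hypothetical stand-in for the CFM head):

```python
import numpy as np

def sample_flow(velocity_fn, z0: np.ndarray, steps: int = 50,
                cfg_scale: float = 4.0) -> np.ndarray:
    """Euler integration of the velocity field from noise z0 (t=0) toward the
    terminal latent z1 (t=1), with classifier-free guidance combining the
    conditional and unconditional velocity predictions."""
    z, dt = z0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity_fn(z, t, cond=True)     # text-conditioned velocity
        v_uncond = velocity_fn(z, t, cond=False)  # unconditional velocity
        v = v_uncond + cfg_scale * (v_cond - v_uncond)  # CFG extrapolation
        z = z + dt * v                                  # Euler step
    return z
```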

Empirical Validation / Results

Multimodal Understanding

CHEERS achieves strong and balanced understanding performance across diverse benchmarks, often matching or exceeding larger specialized models.

Table 2: Evaluation on Multimodal Understanding Benchmarks

| Model | #Params. | SEEDBench | MMBench | ChartQA | POPE | AI2D | MMMU |
|---|---|---|---|---|---|---|---|
| *Understanding Only* | | | | | | | |
| Qwen2-VL | 2B | - | 72.2 | 73.5 | - | 74.7 | 41.1 |
| *Understanding & Generation* | | | | | | | |
| Show-o2 | 1.5B | 65.6 | 67.4 | 40.0 | - | 69.0 | 37.1 |
| Janus-Pro | 1.5B | 68.3 | 75.5 | 23.4 | 86.2 | 64.5 | 36.3 |
| Tar | 1.5B | 70.4 | 65.6 | - | 88.4 | - | 36.0 |
| CHEERS (Ours) | 1.5B | 71.7 | 70.4 | 75.7 | 87.9 | 74.4 | 36.0 |

Notably, CHEERS excels on OCR (ChartQA: 75.7) and diagram understanding (AI2D: 74.4).

Visual Generation

CHEERS demonstrates highly competitive generation quality with superior data efficiency (83M samples vs. 100M+ for peers).

Table 3: Performances on GenEval (Compositional Generation)

| Model | #Params. | #Data | Overall |
|---|---|---|---|
| SD3-Medium | 2B | - | 0.74 |
| Janus-Pro | 1.5B | 162M | 0.73 |
| Show-o2 | 1.5B | 177M | 0.73 |
| Tar | 1.5B | 403M | 0.76 |
| CHEERS (Ours) | 1.5B | 83M | 0.78 |

Table 4: Performances on DPG-Bench (Dense Prompt Following)

| Model | #Params. | #Data | Overall |
|---|---|---|---|
| SD3-Medium | 2B | - | 84.08 |
| Janus-Pro | 1.5B | 162M | 82.63 |
| Show-o2 | 1.5B | 177M | 85.02 |
| Tar | 1.5B | 403M | 82.96 |
| CHEERS (Ours) | 1.5B | 83M | 83.48 |

Analysis of High-Frequency Injection (HFI)

  • Temporal Dynamics: Visualization (Fig. 5) shows HFI follows a coarse-to-fine pattern. Injection is low initially (focus on contours), moderate in mid-stages (object composition), and intensifies sharply at final stages (texture refinement).
  • Ablation Study (Table 5): Removing HFI causes a drastic drop in generation quality (GenEval: 0.30 → 0.17; DPG: 51.63 → 39.11) while having minimal impact on understanding performance, confirming its crucial role for fidelity.
  • Synergy: Joint training with generation objectives does not harm understanding performance and can even slightly improve it compared to understanding-only fine-tuning.

Theoretical and Practical Implications

  • Theoretical: Provides a principled framework (semantic-detail decoupling) for resolving the intrinsic optimization conflict in UMMs. It validates that a hierarchical, human-drawing-like process is an effective paradigm for unified modeling.
  • Practical Efficiency: Demonstrates that high-performance unification is achievable without massive scale, via architectural ingenuity and efficient token compression (4×). CHEERS offers a cost-effective path (20% training cost of Tar) to capable UMMs.
  • Model Design: Shows the viability of leveraging frozen, pre-trained native ViT weights (SigLIP2) within a unified tokenizer, avoiding the computational cost of training a unified encoder from scratch while allowing full joint fine-tuning.

Conclusion

CHEERS presents a novel and effective architecture for unifying multimodal comprehension and generation by decoupling patch details from semantic representations. Its core components—a unified vision tokenizer, an LLM-based hybrid decoder, and a cascaded flow matching head with gated detail injection—enable it to achieve strong, balanced performance across both task types with high data and token efficiency. The work validates the coarse-to-fine generation paradigm and opens avenues for more efficient and capable general-purpose multimodal AI.

Future Directions & Limitations:

  • Future Work: Scaling the LLM backbone and training data; extending the framework to video understanding/generation; exploring more complex multimodal data.
  • Limitations: Model scale (1.5B) may limit capture of intricate details; not initialized from large-scale VLMs; trained primarily on single-image datasets.