Summary of "Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation"

Summary (Overview)

  • Encoder-Free Architecture: Tuna-2 is a native unified multimodal model (UMM) that discards pretrained vision encoders (e.g., VAEs, CLIP) entirely, using simple patch embedding layers to process raw pixels directly.
  • Pixel-Space Unification: It performs both visual understanding (e.g., VQA) and generation (e.g., text-to-image, editing) directly in high-dimensional pixel space using a single transformer backbone and a flow matching head, enabling fully end-to-end optimization.
  • State-of-the-Art Performance: Despite its simplicity, Tuna-2 achieves SOTA results on a wide range of multimodal understanding benchmarks, particularly excelling on fine-grained, pixel-centric tasks, while remaining highly competitive on generation benchmarks.
  • Key Insight: While encoder-based variants (Tuna-R) converge faster initially, the encoder-free Tuna-2 surpasses them in understanding performance after sufficient pretraining, suggesting end-to-end pixel-space learning yields stronger visual representations at scale.
  • Enhanced Training: A proposed masking-based visual feature learning scheme acts as a regularizer, encouraging more robust representation learning for both understanding and generation tasks.

Introduction and Theoretical Foundation

Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single framework. A central challenge has been designing visual representations that effectively support both tasks. Early approaches used decoupled encoders (e.g., CLIP for understanding, VQ-VAE for generation), leading to representation mismatch. More recent UMMs have moved towards unified visual representations via shared vision encoders, but they still rely heavily on pretrained vision encoders.

Parallel trends show movement away from modular designs:

  • Understanding: Newer native vision-language models (e.g., NEO) align images and language within a unified, encoder-free architecture.
  • Generation: Pixel-space diffusion/flow matching models (e.g., JiT) show that pretrained VAEs may not be essential for high-fidelity synthesis.

This paper asks: Can we build UMMs through end-to-end native modelling directly from raw pixels, without any pretrained vision encoders? The authors answer affirmatively by introducing Tuna-2, which progressively simplifies the architecture: first removing the VAE (creating Tuna-R), then removing the representation encoder entirely.

Methodology

2.1 Towards Encoder-Free Unified Models

The design evolves from the previous UMM Tuna (Liu et al., 2025):

  1. Tuna-R (Representation encoder-based): Removes the VAE but keeps a pretrained representation encoder (SigLIP 2). It follows the standard LLaVA paradigm: encoder tokens + text tokens → LLM decoder.
  2. Tuna-2 (Encoder-free): Removes the representation encoder entirely, replacing it with a simple patch embedding layer that converts raw image pixels into visual tokens. These are processed jointly with text tokens by a single transformer decoder (Qwen2.5-7B-Instruct), yielding a monolithic, unified transformer (a minimal sketch of such a patch embedding layer follows this list).
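
The summary describes the patch embedding only at a high level; below is a minimal sketch of what such a layer typically looks like (a strided convolution over non-overlapping patches). The patch size and hidden width are illustrative assumptions, not values stated in the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Project raw pixels into the decoder's token space, with no pretrained encoder."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3, hidden_size: int = 3584):
        super().__init__()
        # A stride-p convolution is equivalent to a linear projection of each
        # non-overlapping p x p patch; patch_size and hidden_size are assumptions.
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (B, 3, H, W) -> visual tokens: (B, N, hidden_size)
        x = self.proj(pixels)                # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, D)
```

In this setup the resulting visual tokens are processed jointly with text token embeddings by the decoder, so gradients from both tasks reach the pixel input directly.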

Pixel-Space Image Generation: Without a VAE, the model adopts pixel-space flow matching based on the x-prediction paradigm from JiT (Li and He, 2025).

Given a source image $x_1$ and noise $x_0 \sim \mathcal{N}(0, I)$, a noisy sample is constructed via rectified flow with a linear schedule:

$$x_t = t\,x_1 + (1 - t)\,x_0, \quad t \in [0, 1] \tag{1}$$

The model $\pi_{\theta}$ directly predicts the clean image from the noisy image, conditioned on signals $c$ (text or text+image):

$$\hat{x}_{\theta} = \pi_{\theta}(x_t, c, t) \tag{2}$$

This prediction is transformed into a velocity term for the training objective:

$$v_{\theta} = \frac{\hat{x}_{\theta} - x_t}{1 - t} \tag{3}$$

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{t,c,x_1,x_0} \left\| v_{\theta} - v \right\|_2^2, \quad \text{where } v = x_1 - x_0 \tag{4}$$

During inference, an Euler solver is used: $x_{t'} = x_t + (t' - t)\, v_{\theta}$.
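
Below is a minimal PyTorch sketch of Eqs. (1)-(4) and the Euler sampler, assuming a callable `model(x_t, cond, t)` that stands in for $\pi_{\theta}$ and returns a clean-image prediction; tensor shapes, the clamp for numerical stability, and the step count are illustrative choices, not details from the paper.

```python
import torch

def flow_matching_loss(model, x1, cond, eps: float = 1e-4):
    """Pixel-space rectified-flow loss with x-prediction (Eqs. 1-4)."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                           # noise sample x_0 ~ N(0, I)
    t = torch.rand(b, device=x1.device)                 # t ~ U[0, 1]
    t_ = t.view(b, 1, 1, 1)
    xt = t_ * x1 + (1.0 - t_) * x0                      # Eq. (1): linear interpolation
    x_hat = model(xt, cond, t)                          # Eq. (2): predict the clean image
    v_pred = (x_hat - xt) / (1.0 - t_).clamp(min=eps)   # Eq. (3): x-prediction -> velocity
    v_target = x1 - x0                                  # Eq. (4): target velocity
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def euler_sample(model, cond, shape, steps: int = 50, device: str = "cuda"):
    """Integrate the learned flow from noise (t = 0) to an image (t = 1)."""
    xt = torch.randn(shape, device=device)
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for t, t_next in zip(ts[:-1], ts[1:]):
        tb = torch.full((shape[0],), t.item(), device=device)
        x_hat = model(xt, cond, tb)
        v = (x_hat - xt) / (1.0 - t).clamp(min=1e-4)     # x-prediction -> velocity
        xt = xt + (t_next - t) * v                       # Euler step: x_{t'} = x_t + (t' - t) v
    return xt
```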

2.2 Learning Robust Visual Representations via Masking

Learning in pixel space is more challenging due to high dimensionality and redundancy. To encourage robust representation learning, a masking-based feature learning scheme is introduced.

During training, a random subset of image patches is replaced with a learnable mask token before feeding into the LLM decoder.

  • For Generation Examples: The model must predict clean image patches for both masked and unmasked regions, creating a harder denoising problem.
  • For Understanding Examples: The model predicts text responses based on masked visual input, forcing multimodal reasoning under partial observation, acting as a regularizer.

This scheme resembles methods like MAE (for understanding) and MaskGIT (for generation).
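
The paper describes the masking scheme at a high level; the sketch below shows the core operation under the assumption that masking is applied to the patch-token sequence before the decoder. The mask ratio and token shapes are illustrative, not values from the paper.

```python
import torch
import torch.nn as nn

def mask_patch_tokens(tokens: torch.Tensor, mask_token: nn.Parameter,
                      mask_ratio: float = 0.3):
    """Replace a random subset of visual tokens with a learnable mask token.

    tokens: (B, N, D) patch embeddings; mask_token: (1, 1, D) learnable vector.
    Returns the partially masked tokens and the boolean mask (True = replaced).
    """
    B, N, D = tokens.shape
    mask = torch.rand(B, N, device=tokens.device) < mask_ratio   # patches to hide
    masked = torch.where(mask.unsqueeze(-1),
                         mask_token.expand(B, N, D), tokens)
    return masked, mask

# Usage sketch: for generation examples, the flow-matching loss is still computed
# on all patches (masked and unmasked); for understanding examples, the usual text
# loss is computed, so the model must answer from partial visual observations.
# mask_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
```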

2.3 Training Pipeline

The encoder-free design enables fully end-to-end training without separate connector alignment stages.

  1. Stage 1: Full Model Pretraining. Joint training on text-to-image generation (70%) and image captioning (30%) data, matching the 7g3u ratio favored in the ablations, plus text-only data (20%), for 300k steps.
  2. Stage 2: Supervised Finetuning (SFT). Training with a lower learning rate for 50k steps on curated data for image instruction-following (FineVision), image editing (OmniEdit), and high-quality generation.

For Tuna-R, an extra connector-alignment stage is required before Stage 1.
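
As a rough illustration of what this end-to-end setup implies, the sketch below mixes the two example types in a single optimization step so that both losses update the same patch embedding and decoder weights. The batch fields and model methods (`generation_loss`, `text_logits`) are assumed interfaces for illustration, not the paper's API.

```python
import torch.nn.functional as F

def training_step(model, batch, optimizer):
    """One Stage-1 step: both branches backpropagate through the shared
    patch embedding and transformer decoder, with no frozen encoder or
    separate connector-alignment phase."""
    optimizer.zero_grad()
    if batch["task"] == "generation":
        # pixel-space flow matching on the target image, conditioned on the prompt
        loss = model.generation_loss(batch["image"], batch["text"])
    else:
        # captioning / instruction data: next-token cross-entropy on the response
        logits = model.text_logits(batch["image"], batch["text_input"])  # (B, T, V)
        loss = F.cross_entropy(logits.flatten(0, 1), batch["labels"].flatten(),
                               ignore_index=-100)
    loss.backward()
    optimizer.step()
    return loss.item()
```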

Empirical Validation / Results

3.2 Main Results

Image Understanding (Table 1): Tuna-2 (7B) is evaluated on 9 VQA benchmarks and 3 pixel-centric benchmarks (V*, CountBench, VisuLogic).

| Models | Size | GQA | RealWorldQA | MMVet | MMMU | MMVP | SEED-Bench2+ | AI2D | ChartQA | OCRBench | V* | CountBench | VisuLogic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Understanding-only Models (LMMs) | | | | | | | | | | | | | |
| LLaVA-OV | 7B | - | 69.9 | 51.9 | 48.8 | 77.3 | 62.2 | 81.4 | 80.9 | 62.2 | 72.7 | 76.2 | 24.8 |
| Qwen2.5-VL | 7B | 60.7 | 69.9 | 61.7 | 58.6 | 78.0 | 70.5 | 82.7 | 83.0 | 83.7 | 71.2 | 74.1 | 20.0 |
| Native UMMs | | | | | | | | | | | | | |
| BAGEL | 14B | 66.4 | 72.8 | 67.2 | 55.3 | 85.0 | 71.9 | 89.2 | 78.5 | 73.3 | 70.2 | 82.5 | 41.7 |
| Tuna | 7B | 63.9 | 66.1 | 42.9 | 49.8 | 70.7 | 52.7 | 79.3 | 85.8 | 74.3 | 52.4 | 73.5 | 22.4 |
| Tuna-R | 7B | 63.5 | 67.9 | 46.7 | 51.1 | 74.7 | 58.4 | 79.4 | 85.6 | 78.3 | 57.6 | 77.8 | 26.2 |
| Tuna-2 | 7B | 65.0 | 67.7 | 51.7 | 50.7 | 77.3 | 61.1 | 79.6 | 85.6 | 79.7 | 59.2 | 81.7 | 28.8 |

  • Key Findings:
    • Both Tuna-R and Tuna-2 outperform the latent-space Tuna.
    • Tuna-2 outperforms Tuna-R on most understanding benchmarks, despite being encoder-free.
    • Both models excel on pixel-centric benchmarks, highlighting the advantage of pixel-space representations for fine-grained visual reasoning.

Image Generation (Table 2): Evaluated on GenEval (text-image alignment) and DPG-Bench.

| Models | Size | GenEval Overall | DPG-Bench Overall |
|---|---|---|---|
| Generation-only Models | | | |
| LongCat-Image | 6B | 0.87 | 86.80 |
| Qwen-Image | 20B | 0.87 | 88.32 |
| Native UMMs | | | |
| BAGEL † | 14B | 0.88 | 85.07 |
| Tuna | 7B | 0.90 | 86.76 |
| Tuna-R | 7B | 0.88 | 86.35 |
| Tuna-2 | 7B | 0.87 | 86.54 |

  • Key Findings: Tuna-2 remains highly competitive with SOTA UMMs and specialized generation models, demonstrating the effectiveness of pixel-space generation. Tuna-R performs slightly better, suggesting semantic priors from the encoder help generation.

Additional Evaluations:

  • LLM-Judge Quality/Diversity (Table 3): Tuna-2 achieves competitive quality and is significantly preferred for diversity by GPT-5.4 and Claude Opus 4.7.
  • Image Editing (Table 4): Tuna-2 achieves strong performance on ImgEdit, competitive with unified models.
  • Image Reconstruction (Table 5): Both Tuna-R and Tuna-2 achieve reconstruction quality (PSNR, SSIM) approaching specialized VAEs like FLUX.1[dev]-VAE.

3.3 - 3.6 Ablations and Analyses

  • Training Dynamics (Fig. 5): A generation-to-understanding data ratio of 7:3 (7g3u) achieves the best trade-off between the two tasks' losses.
  • Masking Effectiveness (Table 6): The masking-based feature learning strategy consistently improves performance for both Tuna-R and Tuna-2, with Tuna-2 benefiting more.
  • Tuna-R vs. Tuna-2 Scaling (Fig. 6):
    • Understanding: Tuna-R leads early (due to encoder priors), but Tuna-2 catches up and surpasses it with more data, showing the scalability of the monolithic design.
    • Generation: Tuna-R consistently leads, but the gap narrows with scale.
  • Attention Map Visualization (Fig. 7): Tuna-2 exhibits more accurate and robust cross-modal alignment, correctly attending to relevant regions even under misleading linguistic contexts or salient visual distractors.

Theoretical and Practical Implications

  • Challenges Encoder-Based Paradigm: The work demonstrates that pretrained vision encoders are not necessary for state-of-the-art multimodal modelling. Their inductive biases (fixed resolution, limited low-level detail access) may even be limiting for fine-grained perception.
  • Advantages of Pixel-Space Learning: End-to-end optimization from raw pixels offers a scalable path to stronger visual representations that benefit both understanding and generation. It naturally handles fine-grained details.
  • Unified Architecture Benefits: The monolithic, encoder-free design simplifies the model, removes representation mismatch, and shows stronger scaling laws for understanding performance.
  • Practical Design Guidance: The paper provides insights on data mixing ratios (7g3u), the utility of masking as a regularizer, and the trade-offs between encoder-based and encoder-free designs at different training stages.

Conclusion

Tuna-2 successfully demonstrates that native unified multimodal modelling directly from raw pixels is not only feasible but highly competitive. By removing all pretrained vision encoders and using a simple patch embedding layer, Tuna-2 achieves SOTA understanding performance, particularly on fine-grained tasks, and remains strong on generation. The controlled comparison with Tuna-R reveals that while encoder priors help early convergence, the encoder-free design scales better for understanding. These results highlight pixel-space unified modelling as a promising and simplified direction for future UMMs.