# Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

> Tuna-2 demonstrates that a unified multimodal model using simple patch embeddings on raw pixels outperforms vision-encoder-based models on understanding tasks and remains competitive on generation after sufficient pretraining.

- **Source:** [arXiv](https://arxiv.org/abs/2604.24763)
- **Published:** 2026-04-29
- **Permalink:** https://picx.dev/p/Yu3zRv
- **Whiteboard:** https://picx.dev/p/Yu3zRv/image

## Summary

## Summary (Overview)
*   **Encoder-Free Architecture:** Tuna-2 is a native unified multimodal model (UMM) that discards pretrained vision encoders (e.g., VAEs, CLIP) entirely, using simple patch embedding layers to process raw pixels directly.
*   **Pixel-Space Unification:** It performs both visual understanding (e.g., VQA) and generation (e.g., text-to-image, editing) directly in high-dimensional pixel space using a single transformer backbone and a flow matching head, enabling fully end-to-end optimization.
*   **State-of-the-Art Performance:** Despite its simplicity, Tuna-2 achieves SOTA results on a wide range of multimodal understanding benchmarks, particularly excelling on fine-grained, pixel-centric tasks, while remaining highly competitive on generation benchmarks.
*   **Key Insight:** While encoder-based variants (Tuna-R) converge faster initially, the encoder-free Tuna-2 surpasses them in understanding performance after sufficient pretraining, suggesting end-to-end pixel-space learning yields stronger visual representations at scale.
*   **Enhanced Training:** A proposed masking-based visual feature learning scheme acts as a regularizer, encouraging more robust representation learning for both understanding and generation tasks.

## Introduction and Theoretical Foundation
Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single framework. A central challenge has been designing visual representations that effectively support both tasks. Early approaches used decoupled encoders (e.g., CLIP for understanding, VQ-VAE for generation), leading to representation mismatch. More recent UMMs moved towards unified visual representations via shared vision encoders but still heavily rely on **pretrained vision encoders**.

Parallel trends show movement away from modular designs:
*   **Understanding:** Newer native vision-language models (e.g., NEO) align images and language within a unified, encoder-free architecture.
*   **Generation:** Pixel-space diffusion/flow matching models (e.g., JiT) show that pretrained VAEs may not be essential for high-fidelity synthesis.

This paper asks: **Can we build UMMs through end-to-end native modelling directly from raw pixels, without any pretrained vision encoders?** The authors answer affirmatively by introducing **Tuna-2**, which progressively simplifies architecture by first removing the VAE (creating **Tuna-R**) and then removing the representation encoder entirely.

## Methodology

### 2.1 Towards Encoder-Free Unified Models
The design evolves from the previous UMM **Tuna** (Liu et al., 2025):
1.  **Tuna-R (Representation encoder-based):** Removes the VAE but keeps a pretrained representation encoder (SigLIP 2). It follows the standard LLaVA paradigm: encoder tokens + text tokens → LLM decoder.
2.  **Tuna-2 (Encoder-free):** Removes the representation encoder entirely. Replaces it with a simple **patch embedding layer** that converts raw image pixels into visual tokens. These are processed jointly with text tokens by a single transformer decoder (Qwen2.5-7B-Instruct). This creates a monolithic, unified transformer.
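
The patch embedding step in Tuna-2 can be illustrated with a minimal sketch. It assumes non-overlapping square patches and a single linear projection. The function names (`patchify`, `patch_embed`), the patch size, and the embedding width are all hypothetical and are not taken from the paper.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append every channel value
            patches.append(flat)
    return patches


def linear(vec, weight, bias):
    """Compute W v + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def patch_embed(image, patch, weight, bias):
    """Map raw pixels to one visual token per patch (illustrative only)."""
    return [linear(p, weight, bias) for p in patchify(image, patch)]


# Toy sizes chosen for readability, not the paper's settings.
rng = random.Random(0)
H, W, C, P, D = 4, 4, 3, 2, 8
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

tokens = patch_embed(image, P, weight, bias)
print(len(tokens), len(tokens[0]))  # 4 tokens, each of width 8
```

In a real model the resulting tokens would be concatenated with text tokens and passed to the transformer decoder; this sketch stops at the embedding.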

**Pixel-Space Image Generation:** Without a VAE, the model adopts **pixel-space flow matching** based on the **x-prediction** paradigm from JiT (Li and He, 2025).

Given a source image $x_1$ and noise $x_0 \sim \mathcal{N}(0, I)$, a noisy sample is constructed via rectified flow with a linear schedule:
$$ x_t = t x_1 + (1 - t) x_0, \quad t \in [0, 1] \tag{1} $$

The model $\pi_{\theta}$ directly predicts the clean image from the noisy image conditioned on signals $c$ (text or text+image):
$$ \hat{x}_{\theta} = \pi_{\theta}(x_t, c, t) \tag{2} $$

This prediction is transformed into a velocity term for the training objective:
$$ v_{\theta} = \frac{\hat{x}_{\theta} - x_t}{1 - t} \tag{3} $$
$$ \mathcal{L}_{\text{flow}} = \mathbb{E}_{t,c,x_1,x_0} \| v_{\theta} - v \|_2^2, \quad \text{where } v = x_1 - x_0 \tag{4} $$
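
Equations (1) to (4) can be checked numerically with a small sketch. It treats images as flat lists of floats and takes the clean-image prediction as an argument rather than calling a model. The clamp `eps` on $1 - t$ is an assumption added here to avoid division by zero near $t = 1$; the summary does not describe it.

```python
import random


def interpolate(x1, x0, t):
    """Eq. (1): x_t = t * x1 + (1 - t) * x0."""
    return [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]


def velocity_from_x_prediction(x_hat, x_t, t, eps=1e-3):
    """Eq. (3): turn a clean-image prediction into a velocity.

    Substituting x_hat = x1 gives (x1 - x_t) / (1 - t) = x1 - x0,
    which is exactly the target velocity in Eq. (4).
    """
    denom = max(1.0 - t, eps)  # eps clamp is an assumption of this sketch
    return [(h - n) / denom for h, n in zip(x_hat, x_t)]


def flow_loss(x_hat, x1, x0, t):
    """Eq. (4) for a single sample: squared L2 distance between velocities."""
    x_t = interpolate(x1, x0, t)
    v_pred = velocity_from_x_prediction(x_hat, x_t, t)
    v_true = [a - b for a, b in zip(x1, x0)]
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true))


rng = random.Random(0)
x1 = [rng.random() for _ in range(16)]      # stand-in for a clean image
x0 = [rng.gauss(0.0, 1.0) for _ in range(16)]  # Gaussian noise
t = 0.3

perfect = flow_loss(x1, x1, x0, t)                      # prediction equals x1
biased = flow_loss([a + 0.1 for a in x1], x1, x0, t)    # constant 0.1 error
print(perfect, biased)
```

A perfect prediction gives a loss of zero up to floating-point error, while a constant pixel error of 0.1 is amplified by $1/(1 - t)$ in velocity space, which shows why errors near $t = 1$ are weighted more heavily.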

During inference, an Euler solver integrates from noise ($t = 0$) toward the image ($t = 1$), stepping from $t$ to $t' > t$: $x_{t'} = x_t + (t' - t) v_{\theta}$.
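
The Euler update can be sketched as a short sampling loop. The uniform time grid, the step count, the `eps` clamp, and the `predict_clean` callable (a stand-in for $\pi_{\theta}$) are assumptions of this sketch, not details from the paper.

```python
import random


def euler_sample(predict_clean, x0, num_steps=50, eps=1e-3):
    """Integrate from noise (t = 0) to an image (t = 1) with Euler steps.

    predict_clean(x_t, t) is a hypothetical stand-in for the model's
    clean-image prediction; conditioning is omitted for brevity.
    """
    x = list(x0)
    for i in range(num_steps):
        t = i / num_steps
        t_next = (i + 1) / num_steps
        x_hat = predict_clean(x, t)
        denom = max(1.0 - t, eps)  # clamp is an assumption of this sketch
        v = [(h - n) / denom for h, n in zip(x_hat, x)]
        x = [n + (t_next - t) * vi for n, vi in zip(x, v)]
    return x


# An "oracle" that always returns the same target, used only to sanity-check
# the update rule: the sampler should end up at that target.
target = [0.2, -0.5, 0.9]


def oracle(x_t, t):
    return target


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle, noise, num_steps=10)
print(sample)
```

With the oracle, the final step has $t' - t = 1 - t$, so the update lands on the predicted clean image. A trained model's prediction changes from step to step, so more steps generally matter in practice.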

### 2.2 Learning Robust Visual Representations via Masking
Learning in pixel space is more challenging due to high dimensionality and redundancy. To encourage robust representation learning, a **masking-based feature learning scheme** is introduced.

During training, a random subset of image patches is replaced with a **learnable mask token** before feeding into the LLM decoder.
*   **For Generation Examples:** The model must predict clean image patches for **both masked and unmasked regions**, creating a harder denoising problem.
*   **For Understanding Examples:** The model predicts text responses based on **masked visual input**, forcing multimodal reasoning under partial observation, acting as a regularizer.
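
The masking step can be sketched as follows. The mask ratio, the helper name `mask_patch_tokens`, and the fixed-value mask token are hypothetical; in training the mask token would be a learnable parameter, and the summary does not state the ratio used.

```python
import random


def mask_patch_tokens(tokens, mask_token, mask_ratio, rng):
    """Replace a random subset of patch tokens with a shared mask token.

    Returns the masked sequence and the sorted indices that were masked.
    """
    num_masked = round(mask_ratio * len(tokens))
    masked_idx = set(rng.sample(range(len(tokens)), num_masked))
    out = [list(mask_token) if i in masked_idx else tok
           for i, tok in enumerate(tokens)]
    return out, sorted(masked_idx)


rng = random.Random(0)
tokens = [[float(i)] * 4 for i in range(8)]  # 8 toy patch tokens of width 4
mask_token = [-1.0] * 4                      # placeholder for a learned vector

masked, idx = mask_patch_tokens(tokens, mask_token, mask_ratio=0.25, rng=rng)
print(idx)

# The training target depends on the task, as described above:
#   generation    -> predict clean pixels for all patches, masked or not
#   understanding -> predict the text response from the masked sequence
```

Masking happens after patch embedding and before the decoder, so the sequence length is unchanged and only the content of the selected positions is hidden.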

This scheme resembles methods like MAE (for understanding) and MaskGIT (for generation).

### 2.3 Training Pipeline
The encoder-free design enables fully end-to-end training without separate connector alignment stages.

1.  **Stage 1: Full Model Pretraining.** Joint training on image captioning (70%) and text-to-image generation (30%) data, plus text-only data (20%), for 300k steps.
2.  **Stage 2: Supervised Finetuning (SFT).** Lower learning rate training on curated data for image instruction-following (FineVision), image editing (OmniEdit), and high-quality generation for 50k steps.

**For Tuna-R**, an extra connector-alignment stage is required before Stage 1.
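
The difference between the two schedules can be written down as a small configuration sketch. The `Stage` class, the stage names, and `build_schedule` are hypothetical; the step count of the connector-alignment stage is not given in this summary, so it is left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    steps: Optional[int]  # None where the summary gives no step count


def build_schedule(encoder_free: bool) -> List[Stage]:
    """Return the training stages for Tuna-2 (encoder-free) or Tuna-R."""
    stages = [
        Stage("full_model_pretraining", 300_000),
        Stage("supervised_finetuning", 50_000),
    ]
    if not encoder_free:
        # Tuna-R keeps a pretrained encoder, so its connector is aligned first.
        stages.insert(0, Stage("connector_alignment", None))
    return stages


tuna_2 = build_schedule(encoder_free=True)
tuna_r = build_schedule(encoder_free=False)
print([s.name for s in tuna_2])
print([s.name for s in tuna_r])
```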

## Empirical Validation / Results

### 3.2 Main Results

**Image Understanding (Table 1):** Tuna-2 (7B) is evaluated on 9 VQA benchmarks and 3 pixel-centric benchmarks (V*, CountBench, VisuLogic).

| Models | Size | GQA | RealWorldQA | MMVet | MMMU | MMVP | SEED-Bench2+ | AI2D | ChartQA | OCRBench | V* | CountBench | VisuLogic |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Understanding-only Models (LMMs)** |
| LLaVA-OV | 7B | - | 69.9 | 51.9 | 48.8 | 77.3 | 62.2 | 81.4 | 80.9 | 62.2 | 72.7 | 76.2 | 24.8 |
| Qwen2.5-VL | 7B | 60.7 | 69.9 | 61.7 | 58.6 | 78.0 | 70.5 | 82.7 | 83.0 | 83.7 | 71.2 | 74.1 | 20.0 |
| **Native UMMs** |
| BAGEL | 14B | 66.4 | 72.8 | 67.2 | 55.3 | 85.0 | 71.9 | 89.2 | 78.5 | 73.3 | 70.2 | **82.5** | 41.7 |
| Tuna | 7B | 63.9 | 66.1 | 42.9 | 49.8 | 70.7 | 52.7 | 79.3 | 85.8 | 74.3 | 52.4 | 73.5 | 22.4 |
| **Tuna-R** | 7B | 63.5 | **67.9** | 46.7 | **51.1** | 74.7 | 58.4 | 79.4 | 85.6 | 78.3 | 57.6 | 77.8 | 26.2 |
| **Tuna-2** | 7B | **65.0** | 67.7 | **51.7** | 50.7 | **77.3** | **61.1** | **79.6** | **85.6** | **79.7** | **59.2** | 81.7 | **28.8** |

*   **Key Findings:**
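    *   Both Tuna-R and Tuna-2 outperform the latent-space Tuna on most benchmarks; Tuna keeps a narrow lead only on ChartQA (85.8 vs. 85.6) and, over Tuna-R, on GQA (63.9 vs. 63.5).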
    *   **Tuna-2 outperforms Tuna-R** on most understanding benchmarks, despite being encoder-free.
    *   Both models excel on **pixel-centric benchmarks**, highlighting the advantage of pixel-space representations for fine-grained visual reasoning.
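
The Tuna-2 versus Tuna-R comparison can be tallied directly from the Table 1 rows above. The sketch below only re-uses the scores as transcribed in this summary; the variable names are illustrative.

```python
BENCHMARKS = ["GQA", "RealWorldQA", "MMVet", "MMMU", "MMVP", "SEED-Bench2+",
              "AI2D", "ChartQA", "OCRBench", "V*", "CountBench", "VisuLogic"]
TUNA_R = [63.5, 67.9, 46.7, 51.1, 74.7, 58.4, 79.4, 85.6, 78.3, 57.6, 77.8, 26.2]
TUNA_2 = [65.0, 67.7, 51.7, 50.7, 77.3, 61.1, 79.6, 85.6, 79.7, 59.2, 81.7, 28.8]

tally = {"win": 0, "tie": 0, "loss": 0}
for name, r, two in zip(BENCHMARKS, TUNA_R, TUNA_2):
    if two > r:
        tally["win"] += 1
    elif two == r:
        tally["tie"] += 1
    else:
        tally["loss"] += 1

mean_gap = sum(two - r for r, two in zip(TUNA_R, TUNA_2)) / len(BENCHMARKS)
losses = [n for n, r, two in zip(BENCHMARKS, TUNA_R, TUNA_2) if two < r]

print(tally)   # {'win': 9, 'tie': 1, 'loss': 2}
print(losses)  # ['RealWorldQA', 'MMMU']
print(round(mean_gap, 2))
```

Tuna-2 is ahead on 9 of the 12 benchmarks, tied on ChartQA, and behind on RealWorldQA and MMMU, which is consistent with the "most benchmarks" wording.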

**Image Generation (Table 2):** Evaluated on GenEval (text-image alignment) and DPG-Bench.

| Models | Size | GenEval Overall | DPG-Bench Overall |
| :--- | :--- | :---: | :---: |
| **Generation-only Models** |
| LongCat-Image | 6B | 0.87 | 86.80 |
| Qwen-Image | 20B | 0.87 | **88.32** |
| **Native UMMs** |
| BAGEL † | 14B | **0.88** | 85.07 |
| Tuna | 7B | **0.90** | **86.76** |
| **Tuna-R** | 7B | 0.88 | 86.35 |
| **Tuna-2** | 7B | 0.87 | 86.54 |

*   **Key Findings:** Tuna-2 remains highly competitive with SOTA UMMs and specialized generation models, demonstrating the effectiveness of pixel-space generation. Tuna-R scores slightly higher on GenEval (0.88 vs. 0.87), which may reflect semantic priors from the encoder helping generation, while Tuna-2 is marginally ahead on DPG-Bench (86.54 vs. 86.35).

**Additional Evaluations:**
*   **LLM-Judge Quality/Diversity (Table 3):** Tuna-2 achieves competitive quality and is **significantly preferred for diversity** by GPT-5.4 and Claude Opus 4.7.
*   **Image Editing (Table 4):** Tuna-2 achieves strong performance on ImgEdit, competitive with unified models.
*   **Image Reconstruction (Table 5):** Both Tuna-R and Tuna-2 achieve reconstruction quality (PSNR, SSIM) approaching specialized VAEs like FLUX.1[dev]-VAE.
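
For readers unfamiliar with the reconstruction metric, PSNR follows the standard definition $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below assumes pixel values in $[0, 1]$ and flat lists of floats; the paper's exact evaluation setup is not described in this summary.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    assert len(reference) == len(reconstruction)
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.5, 1.0, 0.25]
reconstruction = [p + 0.01 for p in reference]  # uniform error of 0.01

print(round(psnr(reference, reconstruction), 2))  # 40.0
```

A uniform error of 0.01 on a unit range gives an MSE of $10^{-4}$ and therefore 40 dB; higher values mean the reconstruction is closer to the reference.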

### 3.3 - 3.6 Ablations and Analyses
*   **Training Dynamics (Fig. 5):** A generation-to-understanding data ratio of **7:3 (7g3u)** achieves the best trade-off between the two tasks' losses.
*   **Masking Effectiveness (Table 6):** The masking-based feature learning strategy consistently improves performance for both Tuna-R and Tuna-2, with Tuna-2 benefiting more.
*   **Tuna-R vs. Tuna-2 Scaling (Fig. 6):**
    *   **Understanding:** Tuna-R leads early (due to encoder priors), but **Tuna-2 catches up and surpasses it with more data**, showing the scalability of the monolithic design.
    *   **Generation:** Tuna-R consistently leads, but the gap narrows with scale.
*   **Attention Map Visualization (Fig. 7):** Tuna-2 exhibits more accurate and robust cross-modal alignment, correctly attending to relevant regions even under misleading linguistic contexts or salient visual distractors.

## Theoretical and Practical Implications
*   **Challenges Encoder-Based Paradigm:** The work demonstrates that **pretrained vision encoders are not necessary** for state-of-the-art multimodal modelling. Their inductive biases (fixed resolution, limited low-level detail access) may even be limiting for fine-grained perception.
*   **Advantages of Pixel-Space Learning:** End-to-end optimization from raw pixels offers a **scalable path** to stronger visual representations that benefit both understanding and generation. It naturally handles fine-grained details.
*   **Unified Architecture Benefits:** The monolithic, encoder-free design simplifies the model, removes representation mismatch, and shows stronger scaling behavior for understanding performance as pretraining data grows.
*   **Practical Design Guidance:** The paper provides insights on data mixing ratios (7g3u), the utility of masking as a regularizer, and the trade-offs between encoder-based and encoder-free designs at different training stages.

## Conclusion
Tuna-2 successfully demonstrates that **native unified multimodal modelling directly from raw pixels** is not only feasible but highly competitive. By removing all pretrained vision encoders and using a simple patch embedding layer, Tuna-2 achieves SOTA understanding performance, particularly on fine-grained tasks, and remains strong on generation. The controlled comparison with Tuna-R reveals that while encoder priors help early convergence, the **encoder-free design scales better for understanding**. These results highlight pixel-space unified modelling as a promising and simplified direction for future UMMs.

---

_Markdown view of https://picx.dev/p/Yu3zRv, served by PicX — AI-generated visual whiteboard summaries of research papers._
