Representation Forcing for Bottleneck-Free Unified Multimodal Models

Summary (Overview)

Proposes Representation Forcing (RF), a technique to close the quality gap for pixel-space image generation in Unified Multimodal Models (UMMs) by eliminating the need for a separately pretrained, frozen VAE (Variational Autoencoder).
Core Idea: The decoder is trained to autoregressively predict visual representations (as discrete tokens) extracted from the model's own understanding encoder. These predicted tokens then serve as in-context structural guidance for subsequent pixel-space diffusion within the same shared transformer backbone.
Key Result: A pixel-space UMM with RF matches state-of-the-art VAE-based UMMs on text-to-image generation benchmarks (e.g., GenEval score of 0.84/0.88) and generally outperforms them on image understanding tasks.
Demonstrates Benefits for Both Directions: RF improves performance for both generation and understanding, and the results suggest that pixel-space generation is more compatible with unified modeling than VAE-based generation.
Advances End-to-End Learning: The work advocates for UMMs where perception and generation share a single, end-to-end-learned representation space, moving towards fully bottleneck-free models.

Introduction and Theoretical Foundation

Unified Multimodal Models (UMMs) aim to perform both understanding (e.g., text output from images) and generation (e.g., image output from text) within a single model, representing a step towards general-purpose multimodal intelligence. Prevailing UMMs unify language and image generation in a shared transformer backbone but still depend on a separately pretrained, frozen VAE for the image generation pathway. This creates a structural bottleneck because:

The VAE's latent space is optimized for reconstruction, not the UMM's objectives.
Its lossy compression imposes a hard upper bound on generation quality.

While generating directly in pixel space is a natural way to remove this bottleneck, applying it naively in UMMs fails to match VAE-based quality. The authors attribute this to the UMM's broader image distribution and richer text conditioning, which force the model to learn both high-level semantic structure and fine-grained details from the same raw pixel signal.

The key insight is that UMMs already have an internal source of high-level structural representation: the understanding encoder (e.g., DINOv3), whose features capture object identity, layout, and scene composition. The challenge in generation is that the model must predict these representations from text alone. Representation Forcing (RF) addresses this by making representation prediction a native capability of the decoder, grounding understanding and generation in a single representation space.

Methodology

The design principle is simple: in understanding, the encoder maps images to high-level representations; in generation, the decoder predicts these representations from text before rendering pixels.

1. Representations from Understanding:

Source: Features from the last layer of a jointly trained understanding encoder (DINOv3 ViT-H+/16).
Discretization: Patch-level features are discretized into a sequence of visual representation tokens via online vector quantization using a learnable codebook of $K$ prototypes (default $K = 16{,}384$).
Process: Features are extracted from an Exponential Moving Average (EMA) copy of the encoder for stable targets. For each feature, the cosine similarity to all prototypes is computed, and the feature is assigned to the nearest prototype, producing a discrete token index. The codebook is updated online via a momentum update following SwAV, with Sinkhorn–Knopp balancing to prevent collapse.
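The assignment step above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the momentum value of 0.9, and the exact form of the prototype update are hypothetical, and Sinkhorn–Knopp balancing is omitted entirely.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def assign_tokens(features, codebook):
    """Map each patch feature to the index of its most cosine-similar prototype."""
    protos = [l2_normalize(p) for p in codebook]
    tokens = []
    for feat in features:
        f = l2_normalize(feat)
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(f, p)) for p in protos]
        tokens.append(max(range(len(protos)), key=sims.__getitem__))
    return tokens

def momentum_update(codebook, features, tokens, momentum=0.9):
    """Move each selected prototype toward the mean of its assigned features.

    Assumption: a generic EMA-style rule; the source only says the update
    follows SwAV and does not spell out the formula.
    """
    updated = [list(p) for p in codebook]
    for k in set(tokens):
        members = [l2_normalize(f) for f, t in zip(features, tokens) if t == k]
        mean = [sum(col) / len(members) for col in zip(*members)]
        blended = [momentum * p + (1 - momentum) * m
                   for p, m in zip(codebook[k], mean)]
        updated[k] = l2_normalize(blended)
    return updated

# Toy example: 3 prototypes in 2-D and 4 patch features.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
features = [[2.0, 0.1], [0.2, 3.0], [-5.0, 0.5], [0.9, 0.2]]
tokens = assign_tokens(features, codebook)
new_codebook = momentum_update(codebook, features, tokens)
```

In this toy setup the first and last features both map to prototype 0, so the resulting token sequence reuses an index, which is the behavior a discrete codebook is meant to provide.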

2. Generating Pixels via Predicted Representations (Representation Forcing):

Training: The EMA encoder provides ground-truth representation token sequences. The decoder learns to predict them autoregressively under a cross-entropy loss ($\mathcal{L}_{Rep}$), within the same next-token prediction stream as text.
Inference: The decoder produces the representation token sequence autoregressively from the text prompt alone. These predicted tokens remain in the sequence as in-context conditioning for pixel-space generation.
Pixel Generation: Pixel patches are generated via flow matching within the same backbone. Following JiT, $x$-prediction with velocity loss is used. Given clean patches $x$ and noise $\epsilon \sim \mathcal{N}(0, I)$, the noisy patches at time $t \in [0, 1]$ are $z_t = t x + (1 - t) \epsilon$. The decoder predicts $x_\theta$, and the flow-matching loss is $\mathcal{L}_{FM} = \mathbb{E} \| v_\theta - v \|^2$, where $v = x - \epsilon$ and $v_\theta = (x_\theta - z_t) / (1 - t)$.
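The pixel-generation formulas can be checked numerically. Because $x - z_t = (1 - t)(x - \epsilon)$, a perfect prediction $x_\theta = x$ gives $v_\theta = v$ and a loss of zero. The sketch below illustrates this on flat lists of numbers; the helper names, the averaging over elements, and the Euler sampler with an oracle predictor standing in for the decoder are assumptions made for illustration, not details from the source.

```python
import random

def make_noisy(x, eps, t):
    """Interpolate z_t = t * x + (1 - t) * eps, element-wise."""
    return [t * xi + (1 - t) * ei for xi, ei in zip(x, eps)]

def velocity_from_x_pred(x_pred, z_t, t):
    """Convert an x-prediction into a velocity: (x_theta - z_t) / (1 - t)."""
    return [(xp - zi) / (1 - t) for xp, zi in zip(x_pred, z_t)]

def fm_loss(x_pred, x, eps, t):
    """Mean squared error between predicted velocity and target v = x - eps."""
    z_t = make_noisy(x, eps, t)
    v_pred = velocity_from_x_pred(x_pred, z_t, t)
    v_true = [xi - ei for xi, ei in zip(x, eps)]
    return sum((a - b) ** 2 for a, b in zip(v_pred, v_true)) / len(x)

def euler_sample(predict_x, eps, num_steps=10):
    """Integrate dz/dt = v_theta from t = 0 (noise) toward t = 1 with Euler steps.

    t only takes the values 0, 1/N, ..., (N-1)/N, so (1 - t) is never zero.
    """
    z = list(eps)
    for k in range(num_steps):
        t = k / num_steps
        v = velocity_from_x_pred(predict_x(z, t), z, t)
        z = [zi + vi / num_steps for zi, vi in zip(z, v)]
    return z

rng = random.Random(0)
x = [0.5, -0.25, 1.0]                        # toy "clean patch"
eps = [rng.gauss(0.0, 1.0) for _ in x]       # Gaussian noise sample
loss_perfect = fm_loss(x, x, eps, t=0.3)     # oracle prediction x_theta = x
sample = euler_sample(lambda z, t: x, eps)   # oracle stands in for the decoder
```

With the oracle predictor, the final Euler step lands on `x` up to floating-point error, which is a quick sanity check that the interpolation direction (noise at $t = 0$, data at $t = 1$) and the velocity conversion agree.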

3. Architecture and Training Objective:

Architecture: Based on the Mixture-of-Transformers (MoT) design from BAGEL. All tokens (text, representation, pixel) share self-attention layers but are routed to modality-specific feed-forward experts (understanding, representation prediction, pixel generation).
Sequence: [text tokens, representation tokens, pixel patches]. Text and rep tokens use causal attention; pixel patches attend bidirectionally to each other and causally to all preceding tokens.
Total Loss: The model is trained end-to-end with the combined objective $\mathcal{L} = \mathcal{L}_{LM} + \mathcal{L}_{FM} + \mathcal{L}_{Rep}$, where $\mathcal{L}_{LM}$ is the cross-entropy loss for text.
Training Strategy: Three-stage training (alignment, joint pre-training up to 256px, continued training up to 1024px). Classifier-free guidance is supported by independently dropping text and representation token conditions during training with probability 0.1.
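The attention pattern and the condition dropping described above can be sketched as follows. This is a hedged illustration: `build_attention_mask` and `drop_conditions` are hypothetical names, the mask is a plain nested list rather than a tensor, and representing a dropped condition as an empty list is an assumption (the source does not say how dropped conditions are encoded).

```python
import random

def build_attention_mask(n_text, n_rep, n_pixel):
    """Return mask[q][k] = True when query position q may attend to key position k.

    Sequence layout: [text tokens, representation tokens, pixel patches].
    """
    n_prefix = n_text + n_rep            # text and representation tokens are causal
    total = n_prefix + n_pixel
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < n_prefix:
                mask[q][k] = k <= q      # causal: current and earlier positions only
            else:
                mask[q][k] = True        # pixel patch: full prefix plus all pixel patches
    return mask

def drop_conditions(text_tokens, rep_tokens, p_drop=0.1, rng=random):
    """Independently drop the text and representation conditions with probability p_drop."""
    if rng.random() < p_drop:
        text_tokens = []                 # assumption: "dropped" is modeled as empty
    if rng.random() < p_drop:
        rep_tokens = []
    return text_tokens, rep_tokens

# Toy layout: 2 text tokens, 2 representation tokens, 3 pixel patches.
mask = build_attention_mask(2, 2, 3)
```

Because every pixel patch comes after the whole text and representation prefix, "causal to preceding tokens" plus "bidirectional among patches" means each pixel row of the mask is entirely `True`, while prefix rows never see pixel positions.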

Empirical Validation / Results

Experiments compare four variants under controlled settings: Pixel, Pixel+RF, VAE, VAE+RF.

Image Generation Evaluation

Evaluated on GenEval (compositional generation) and DPG-Bench (dense prompt following).

Table 1: Evaluation of text-to-image generation.

Model	GenEval Overall ↑	DPG Overall ↑
Generation-Only Models
SDXL	0.55	74.65
DALL-E 3	0.67	83.50
FLUX.1-dev †	0.82	84.00
Unified Models (VAE-based)
Emu3 †	0.66	81.60
BAGEL	0.82	85.07
BAGEL †	0.88	–
Our Model (Pixel-space)
RF-Pixel	0.84	84.15
RF-Pixel †	0.88	–

Key Findings:

RF-Pixel without LLM rewriter achieves a GenEval score of 0.84, matching or outperforming strong VAE-based UMMs like BAGEL (0.82) and BLIP3-o (0.84).
RF-Pixel with LLM rewriter scores 0.88, matching the state-of-the-art among unified models.
This demonstrates that RF effectively closes the quality gap, enabling a pixel-space UMM to perform on par with VAE-based counterparts.

Image Understanding Evaluation

Evaluated on 8 benchmarks spanning general visual understanding and document/diagram tasks.

Table 2: Impact of RF on understanding.

Model	MMMU	HalluBench	MME*	BLINK	RealWorldQA	AI2D	DocVQA	ChartQA
VLM-only	56.2	65.0	79.7	56.2	65.8	90.3	89.3	86.0
VAE	51.0	55.7	71.3	52.2	65.2	90.7	90.0	78.8
VAE+RF	49.6	61.3 (+5.6)	79.3 (+8.0)	52.9 (+0.7)	66.6 (+1.4)	87.8	88.3	80.5 (+1.7)
Pixel	49.9	63.7	76.6	49.4	63.1	85.8	90.0	81.7
Pixel+RF	54.2 (+4.3)	64.8 (+1.1)	80.2 (+3.6)	53.0 (+3.6)	65.8 (+2.7)	90.3 (+4.5)	88.0	81.3

Key Findings:

RF improves understanding for both generation pathways (Pixel+RF improves 6/8 benchmarks, VAE+RF improves 5/8).
Improvements are concentrated on general visual understanding tasks (MMMU, HalluBench, MME), aligning with the semantic, structural nature of the representation tokens.
Pixel+RF outperforms VAE+RF on 6 out of 8 benchmarks, suggesting pixel-space generation is more compatible with unified modeling, likely due to the removal of the external VAE bottleneck.
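The benchmark counts quoted in these findings can be recomputed directly from Table 2. The short script below is only a bookkeeping aid: the scores are copied from the table, while the `wins` helper and variable names are introduced here for illustration.

```python
BENCHMARKS = ["MMMU", "HalluBench", "MME*", "BLINK",
              "RealWorldQA", "AI2D", "DocVQA", "ChartQA"]

# Scores copied from Table 2, in the column order above.
SCORES = {
    "VAE":      [51.0, 55.7, 71.3, 52.2, 65.2, 90.7, 90.0, 78.8],
    "VAE+RF":   [49.6, 61.3, 79.3, 52.9, 66.6, 87.8, 88.3, 80.5],
    "Pixel":    [49.9, 63.7, 76.6, 49.4, 63.1, 85.8, 90.0, 81.7],
    "Pixel+RF": [54.2, 64.8, 80.2, 53.0, 65.8, 90.3, 88.0, 81.3],
}

def wins(model_a, model_b):
    """List the benchmarks on which model_a scores strictly higher than model_b."""
    return [name for name, a, b
            in zip(BENCHMARKS, SCORES[model_a], SCORES[model_b]) if a > b]

pixel_gain = wins("Pixel+RF", "Pixel")       # RF vs. no RF, pixel pathway
vae_gain = wins("VAE+RF", "VAE")             # RF vs. no RF, VAE pathway
pixel_vs_vae = wins("Pixel+RF", "VAE+RF")    # pixel vs. VAE, both with RF
print(len(pixel_gain), len(vae_gain), len(pixel_vs_vae))  # prints: 6 5 6
```

The two benchmarks where Pixel+RF does not beat VAE+RF are RealWorldQA and DocVQA, and the two where RF does not help the pixel pathway are DocVQA and ChartQA.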

Ablation Studies

Table 3: Ablation studies (Pixel-space, 256px).

Ablation	GenEval ↑	Notes
(a) Effect of RF
Pixel w/o RF	0.25	Severe quality gap
Pixel + RF	0.76	Closes the gap
VAE w/o RF	0.52
VAE + RF	0.77	RF also helps VAE
(b) Prediction vs. Alignment
REPA (alignment)	0.43	Auxiliary loss
RF (prediction)	0.76	Our method
(c) Token Formulation
Continuous Regression	0.26	Error accumulation
Discrete Prediction	0.76	Our method
(d) Codebook Size $K$
$K=16384$	0.76	Default
$K=32768$	0.77	Comparable

Key Insights from Ablations:

RF is critical for pixel-space generation, bridging a large gap (0.25 → 0.76).
Decoder prediction (RF) substantially outperforms auxiliary feature alignment (REPA) (0.76 vs. 0.43), highlighting the effectiveness of direct in-context conditioning.
Discrete token prediction outperforms continuous regression (0.76 vs. 0.26), as discretization is more robust for autoregressive prediction and encourages the desired high-level/low-level factorization.
DINOv3 encoder outperforms SigLIP2 for understanding in this setting, providing richer spatial/structural features beneficial for RF.

Theoretical and Practical Implications

Theoretical Implications:

Unified Representation Space: RF demonstrates that a single representation space, derived from and used by the model itself, can effectively serve both perception and generation. This challenges the prevailing paradigm of coordinating separately pretrained components (VAEs, encoders).
Factorization of Learning: By forcing the prediction of high-level structural representations before pixel rendering, RF provides an explicit scaffold that factors the difficult problem of joint structure/detail learning in pixel space.

Practical Implications:

Bottleneck-Free UMMs: RF offers a concrete path towards fully end-to-end UMMs that do not rely on frozen, external generative components, potentially leading to more cohesive and jointly optimizable models.
Improved Compatibility: The finding that pixel-space generation with RF leads to better understanding performance than VAE-based generation suggests that removing architectural bottlenecks can improve overall model synergy.
Simplicity and Effectiveness: The method is relatively simple, integrating seamlessly into existing UMM architectures (MoT) and training procedures, while providing significant gains.

Conclusion

Representation Forcing (RF) is an effective method for enabling high-quality pixel-space image generation in Unified Multimodal Models, eliminating the need for a separately pretrained VAE. By training the decoder to autoregressively predict the model's own understanding representations as intermediate tokens, RF provides explicit structural guidance that closes the quality gap with VAE-based generation. The technique benefits both generation and understanding, with the pixel-space RF model outperforming its VAE-based counterpart on 6 of 8 understanding benchmarks. This work represents a step towards fully end-to-end, bottleneck-free UMMs where all multimodal capabilities are acquired and shared within a single model's representation space.

Limitations & Future Work: The model is initialized from a pretrained LLM rather than trained from scratch on multimodal data. Extending RF to video or other temporal modalities remains for future exploration.