Visual Summary | ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Summary (Overview)

ViQ (Visual Quantized Representations) is a novel framework that produces discrete visual representations balancing high-level semantics and low-level details, while supporting native-resolution inputs.
A two-stage training pipeline is proposed: (1) text-aligned pre-training with any-resolution adaptation and self-distillation, and (2) progressive feature discretization using a proximal representation strategy and position-aware Finite Scalar Quantization (FSQ).
ViQ achieves competitive performance on multimodal understanding tasks compared to state-of-the-art continuous encoders (e.g., InternViT-2.5-6B) while using only 1.3B parameters (vs. 6B), and significantly outperforms prior quantized encoders (QLIP, UniTok).
The quantized representations yield substantial training efficiency gains: 20%–70% speedup across different LLM sizes and sequence lengths.
ViQ preserves strong reconstruction quality (rFID 0.62, PSNR 22.73 on ImageNet-1K 256×256), ranking among the best discrete tokenizers.

Introduction and Theoretical Foundation

Multimodal large language models (MLLMs) require visual representations that are both semantically rich and aligned with discrete text tokens. Most current MLLMs use continuous visual encoders (CLIP, SigLIP, InternViT), which suffer from a representational mismatch with discrete language modeling and impose high computational costs.

Core challenge: Discrete visual representations (e.g., VQ-VAE, FSQ) naturally unify modalities but face a fundamental trade-off:

Reconstruction-oriented tokenizers preserve low-level details but lack semantic structure.
Semantically strong features often lose fine-grained visual information during quantization.

ViQ's goal: design a discrete representation that balances both aspects and supports any input resolution, enabling efficient multimodal training without sacrificing performance.

Methodology

ViQ's training is structured into two stages, as illustrated in Figure 2 of the paper.

Stage 1: Text-Aligned Pre-Training (Continuous)

Any-resolution adaptation: Replace fixed positional embeddings with dynamically resizable ones (NaViT-style), enabling variable input sizes while maintaining efficiency via OryxViT-style packing.
Text-guided multimodal pre-training: Given a triplet of image $I$, text query $T$, and answer $A$, supervise via: $$L_{\text{text}} = \text{CrossEntropy}[\text{LLM}(\text{ViQ}(I), T), A] \tag{1}$$
Self-distillation: Prevent overfitting by enforcing cosine similarity between the semantic token (class token) of the student (any-resolution) model and that of the teacher (fixed-resolution) model: $$L_{\text{distill}} = 1 - \cos(z_{\text{student}}^s, z_{\text{teacher}}^s) \tag{2}$$
Progressive training from low resolution to native resolution, gradually increasing sequence length.
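
As a minimal sketch of the self-distillation term in Eq. (2), the snippet below computes one minus the cosine similarity between two semantic tokens. The function names and the 4-dimensional toy vectors are hypothetical and chosen only for readability; the real tokens are model outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def distill_loss(z_student, z_teacher):
    """Eq. (2): 1 - cos(z_student, z_teacher)."""
    return 1.0 - cosine_similarity(z_student, z_teacher)

# Toy semantic tokens (hypothetical values).
z_teacher = [1.0, 0.0, 2.0, -1.0]
z_aligned = [2.0, 0.0, 4.0, -2.0]     # same direction as the teacher, different scale
z_orthogonal = [0.0, 3.0, 0.0, 0.0]   # no overlap with the teacher

loss_aligned = distill_loss(z_aligned, z_teacher)
loss_orthogonal = distill_loss(z_orthogonal, z_teacher)
print(loss_aligned, loss_orthogonal)
```

The loss depends only on direction, so a student token that is a rescaled copy of the teacher token incurs (numerically) zero loss, while an orthogonal one incurs a loss of 1.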

Stage 2: Visual Quantized Representation Learning

Proximal representation learning: To reduce information loss during quantization, the high-dimensional feature $f \in \mathbb{R}^C$ is first compressed via a bottleneck to $f_1 \in \mathbb{R}^D$ and constrained with $L_\infty$ normalization:

$$f_1 = L_\infty(\text{BN}(f)), \quad \hat{f} = \text{BN}'(f_1) \tag{3}$$

where BN and BN′ denote the bottleneck down- and up-projections, and the constraint $\|f_1\|_\infty = 1$ places each feature on the surface of a hypercube.
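
The $L_\infty$ constraint can be illustrated with a short sketch. It assumes the normalization simply divides a vector by its largest absolute entry; the learned bottleneck projection is omitted, and the function name, the `eps` guard, and the sample values are hypothetical.

```python
def linf_normalize(f, eps=1e-12):
    """Scale a vector so that its largest absolute entry equals 1."""
    max_abs = max(abs(x) for x in f)
    scale = max(max_abs, eps)          # eps guards against an all-zero vector
    return [x / scale for x in f]

f_bottleneck = [0.5, -2.0, 1.0, 0.25]  # hypothetical bottleneck output (D = 4)
f1 = linf_normalize(f_bottleneck)
print(f1)  # [0.25, -1.0, 0.5, 0.125]
```

After normalization every entry lies in $[-1, 1]$ and at least one entry has magnitude exactly 1, which is what places the feature on a face of the hypercube.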

Multi-Head Finite Scalar Quantization (FSQ): The constrained feature is down-projected to $f_2 \in \mathbb{R}^d$ (with $d \ll D \ll C$) and quantized using FSQ:

$$z = \text{round}(Q(f_2)) \tag{4}$$

A multi-head attention mechanism expands each visual patch into $2 \times 2$ codes (going from $N$ to $4N$ tokens); the original sequence length is restored after quantization.
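
The rounding step in Eq. (4) can be sketched as follows. This follows the general FSQ recipe, in which each channel is bounded and then rounded to a small fixed set of integer levels. Treating $Q$ as a tanh-based bound, using an odd number of levels per channel, and the level counts themselves are all assumptions made for illustration; the multi-head expansion is not shown.

```python
import math

def fsq_quantize(f2, levels):
    """Bound each channel, then round it to one of levels[i] integer values."""
    codes = []
    for x, n_levels in zip(f2, levels):
        half = (n_levels - 1) / 2        # odd n_levels gives integers in [-half, half]
        bounded = half * math.tanh(x)    # squash the channel into (-half, half)
        codes.append(round(bounded))
    return codes

def code_index(codes, levels):
    """Map the per-channel integers to one index in the implicit codebook."""
    index = 0
    for c, n_levels in zip(codes, levels):
        index = index * n_levels + (c + (n_levels - 1) // 2)
    return index

levels = [5, 5, 5]       # hypothetical: 5 levels per channel, 5**3 = 125 codes
f2 = [0.1, -3.0, 0.8]    # hypothetical down-projected feature (d = 3)
codes = fsq_quantize(f2, levels)
index = code_index(codes, levels)
print(codes, index)      # [0, -2, 1] 53
```

No codebook is learned here: the set of codes is the fixed grid of integer combinations. In training, the non-differentiable rounding is usually paired with a straight-through gradient estimator, which this sketch leaves out.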

Rotary Position Embedding (2D RoPE): Position encoding is injected before quantization to handle arbitrary resolutions:

$$\tilde{f}_m = f_m \odot e^{i(h\theta_h + w\theta_w)} \tag{5}$$
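
A literal reading of Eq. (5) can be sketched by treating the feature as a list of complex channels and rotating each by a phase that grows with the row index $h$ and column index $w$. The frequencies, the channel count, and the function name are hypothetical. Practical implementations typically pair up real channels and may assign separate channel groups to each axis.

```python
import cmath

def rope_2d(f, h, w, theta_h, theta_w):
    """Rotate complex channel k of f by the angle h*theta_h[k] + w*theta_w[k]."""
    return [
        z * cmath.exp(1j * (h * th + w * tw))
        for z, th, tw in zip(f, theta_h, theta_w)
    ]

# Hypothetical setup: 2 complex channels (equivalently, 4 real channels in pairs).
theta_h = [1.0, 0.1]
theta_w = [0.5, 0.05]
f = [1 + 0j, 0 + 2j]

rotated = rope_2d(f, h=3, w=7, theta_h=theta_h, theta_w=theta_w)
magnitudes = [abs(z) for z in rotated]
print(magnitudes)   # rotation leaves the magnitudes of f unchanged
```

Because each channel is only rotated, magnitudes are preserved, and the product of one rotated feature with the conjugate of another depends only on the offset between their positions. This is why the encoding extends to grids of arbitrary size.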

Multi-stage training with low-level supervision: A VAE-style latent reconstruction loss is added:

$$L_{\text{recon}} = \text{NLL}(\hat{f}, \text{Encoder}(x)) = \frac{1}{2}\|\hat{f} - \text{Encoder}(x)\|_2^2 + \text{const} \tag{6}$$

Overall objective:

$$L_{\text{total}} = \lambda_{\text{text}} L_{\text{text}} + \lambda_{\text{distill}} L_{\text{distill}} + \lambda_{\text{recon}} L_{\text{recon}} \tag{7}$$
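
As a small worked example, Eqs. (6) and (7) can be evaluated on toy numbers. The vectors, the loss values, and the weights below are hypothetical; this summary does not report the $\lambda$ values used in the paper.

```python
def recon_loss(pred, target):
    """Eq. (6) without the additive constant: half the squared L2 distance."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(pred, target))

def total_loss(l_text, l_distill, l_recon,
               lam_text=1.0, lam_distill=1.0, lam_recon=1.0):
    """Eq. (7): weighted sum of the three loss terms."""
    return lam_text * l_text + lam_distill * l_distill + lam_recon * l_recon

pred = [0.5, -1.0, 2.0]         # hypothetical predicted latent
vae_latent = [0.0, -1.0, 1.0]   # hypothetical VAE encoder output

l_recon = recon_loss(pred, vae_latent)   # 0.5 * (0.25 + 0.0 + 1.0) = 0.625
l_total = total_loss(l_text=2.0, l_distill=0.25, l_recon=l_recon, lam_recon=0.5)
print(l_recon, l_total)                  # 0.625 2.5625
```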

Empirical Validation / Results

Multimodal Understanding (Table 2)

ViQ is evaluated on 9 benchmarks (MMStar, MMMU, SimpleVQA, InfoVQA, TextVQA, DocVQA, OCRBench, AI2D, ChartQA) using Qwen2.5-1.5B and Qwen2.5-7B as backbone LLMs. All methods are trained on the same 2M sample subset of LLaVA-OneVision.

Base LLM	Visual Encoder	Size	AnyRes	Discrete	Avg.
Qwen2.5-1.5B	InternViT-2.5	0.3B	✗	✗	56.5
Qwen2.5-1.5B	InternViT-2.5-6B	6.0B	✗	✗	57.0
Qwen2.5-1.5B	ViQ	1.3B	✓	✓	57.2
Qwen2.5-7B	InternViT-2.5-6B	6.0B	✗	✗	63.8
Qwen2.5-7B	ViQ	1.3B	✓	✓	63.9

ViQ narrowly exceeds the continuous encoders listed above, beating InternViT-2.5-6B by 0.1 to 0.2 points on average with 1.3B rather than 6B parameters, and far surpasses prior quantized encoders (QLIP: 29.7, UniTok: 33.0). Gains are largest on OCR, document, and chart tasks.

Training Efficiency (Figure 3)

Offline ViQ code extraction allows skipping image encoding during LLM training.
Forward pass speedups: 70%–78% for small LLMs (0.5B), 46%–65% for larger LLMs (7B).
Full iteration step speedups: >20% (4k setting) and >40% (16k setting).

Image Reconstruction (Table 3)

On the ImageNet-1K validation set at 256×256 resolution:

Method	#Token	PSNR↑	SSIM↑	rFID↓
QLIP-B	16×16	23.16	0.63	3.21
UniTok	16×16	25.32	0.77	0.37
ViQ	16×16	22.73	0.66	0.62

ViQ achieves the second-best rFID among the listed discrete tokenizers (0.62, behind UniTok's 0.37) while maintaining strong semantics for understanding tasks; its PSNR (22.73) is the lowest of the three.

Ablation Studies (Table 4)

Key findings:

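Proximal representation (bottleneck + $L_\infty$) is crucial: quantizing the continuous features directly with FSQ drops the average to 60.9, while adding the bottleneck and $L_\infty$ normalization yields 68.7.
FSQ outperforms SimVQ, indicating that the non-learnable codebook works better in this setting.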
2D RoPE significantly outperforms no position encoding (65.3→68.7).
VAE latent loss balances training efficiency and performance (68.7 avg. at 1.3× time cost).

Theoretical and Practical Implications

Theoretical: ViQ demonstrates that discrete visual representations can approach the performance of continuous ones in multimodal understanding, addressing the long-standing trade-off between semantics and details. The proximal representation learning strategy provides a principled way to regularize the latent space before quantization.
Practical:
- Enables unified discrete processing of vision and language, simplifying multimodal architectures.
- Substantial training acceleration (up to 70%) reduces hardware requirements.
- High compression ratio for image storage: ViQ codes require only 1/96 of the raw image size, with better reconstruction than JPEG at an equivalent bitrate.
- Native-resolution support eliminates the need for image resizing or tiling.
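
To give a sense of scale for the 1/96 compression ratio, the arithmetic below works out the bit budget it would imply for a 256×256 image. It assumes the ratio is measured against uncompressed 8-bit RGB and that it applies to the 16×16 token grid from the reconstruction table with 2×2 codes per patch; the summary states neither assumption, nor the codebook size.

```python
raw_bytes = 256 * 256 * 3      # uncompressed 8-bit RGB (assumed baseline)
code_bytes = raw_bytes / 96    # the 1/96 ratio quoted above
code_bits = code_bytes * 8

patches = 16 * 16              # token grid from the reconstruction table
codes_per_patch = 2 * 2        # multi-head expansion described in Stage 2

bits_per_patch = code_bits / patches
bits_per_code = bits_per_patch / codes_per_patch
print(raw_bytes, code_bytes, bits_per_patch, bits_per_code)
# 196608 2048.0 64.0 16.0
```

Under these assumptions the image is stored in about 2 KB, or 64 bits per patch.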

Conclusion

ViQ introduces a two-stage framework for learning text-aligned visual quantized representations at any resolution. By combining text-aligned pre-training with progressive feature discretization, proximal representation learning, and position-aware FSQ, ViQ achieves:

Competitive or superior performance to continuous encoders on multimodal benchmarks.
Significant training efficiency improvements.
High-quality image reconstruction.

Future directions: Integration with larger LLMs (70B+), scaling to more diverse data, and further narrowing the gap on detail-intensive tasks through multi-scale or residual quantization. ViQ offers a viable path toward a fully unified discrete representation for vision and language.