Summary (Overview)
- ViQ (Visual Quantized Representations) is a novel framework that produces discrete visual representations balancing high-level semantics and low-level details, while supporting native-resolution inputs.
- A two-stage training pipeline is proposed: (1) text-aligned pre-training with any-resolution adaptation and self-distillation, and (2) progressive feature discretization using a proximal representation strategy and position-aware Finite Scalar Quantization (FSQ).
- ViQ achieves competitive performance on multimodal understanding tasks compared to state-of-the-art continuous encoders (e.g., InternViT-2.5-6B) while using only 1.3B parameters (vs. 6B), and significantly outperforms prior quantized encoders (QLIP, UniTok).
- The quantized representations yield substantial training efficiency gains: 20%–70% speedup across different LLM sizes and sequence lengths.
- ViQ preserves strong reconstruction quality (rFID 0.62, PSNR 22.73 on ImageNet-1K 256×256), ranking among the best discrete tokenizers.
Introduction and Theoretical Foundation
Multimodal large language models (MLLMs) require visual representations that are both semantically rich and aligned with discrete text tokens. Most current MLLMs use continuous visual encoders (CLIP, SigLIP, InternViT), which suffer from a representational mismatch with discrete language modeling and impose high computational costs.
Core challenge: Discrete visual representations (e.g., VQ-VAE, FSQ) naturally unify modalities but face a fundamental trade-off:
- Reconstruction-oriented tokenizers preserve low-level details but lack semantic structure.
- Semantically strong features often lose fine-grained visual information during quantization.
ViQ's goal: design a discrete representation that balances both aspects and supports any input resolution, enabling efficient multimodal training without sacrificing performance.
Methodology
ViQ's training is structured into two stages, as illustrated in Figure 2 of the paper.
Stage 1: Text-Aligned Pre-Training (Continuous)
- Any-resolution adaptation: Replace fixed positional embeddings with dynamically resizable ones (NaViT-style), enabling variable input sizes while maintaining efficiency (OryxViT packing).
- Text-guided multimodal pre-training: Given a triplet (image \(I\), text query \(T\), answer \(A\)), supervise via:
- Self-distillation: Prevent overfitting by enforcing cosine similarity between the semantic tokens (class tokens) of the student (any-resolution) and teacher (fixed-resolution) models:
- Progressive training from low resolution to native resolution, gradually increasing sequence length.
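The self-distillation term in the list above can be illustrated with a minimal sketch. The exact loss used in the paper is not reproduced in this summary, so the form `1 - cos(student, teacher)` and the function names below are assumptions chosen for illustration; pure Python is used for clarity, whereas a real implementation would operate on batched tensors.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def self_distillation_loss(student_cls, teacher_cls):
    """Assumed loss form: 1 - cos(student, teacher); zero when aligned."""
    return 1.0 - cosine_similarity(student_cls, teacher_cls)

# Hypothetical class tokens: the teacher is a scaled copy of the student,
# so the two point in the same direction and the loss is (numerically) zero.
student = [0.2, -0.5, 1.0]
teacher = [0.4, -1.0, 2.0]
loss_aligned = self_distillation_loss(student, teacher)

# Orthogonal tokens give the maximum disagreement short of opposite directions.
loss_orthogonal = self_distillation_loss([1.0, 0.0], [0.0, 1.0])
```

Because cosine similarity ignores vector magnitude, a loss of this shape only asks the any-resolution student to preserve the direction of the teacher's semantic token, not its scale.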
Stage 2: Visual Quantized Representation Learning
Proximal representation learning: To reduce information loss during quantization, the high-dimensional feature \(f \in \mathbb{R}^C\) is first compressed via a bottleneck to \(f_1 \in \mathbb{R}^D\) and constrained with \(L_\infty\) normalization, so that \(\|\cdot\|_\infty = 1\), which projects features onto a hypercube surface.
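As a rough illustration of the \(L_\infty\) constraint, the sketch below divides a feature vector by its largest absolute component. This is the natural reading of "projects features onto a hypercube surface", but the precise formula is not shown in this summary, so treat it as an assumption; the bottleneck projection itself is omitted and `linf_normalize` is a hypothetical name.

```python
def linf_normalize(feature, eps=1e-12):
    """Scale a vector so that its largest absolute component equals 1."""
    max_abs = max(abs(x) for x in feature)
    # eps guards against division by zero for an all-zero vector.
    return [x / max(max_abs, eps) for x in feature]

f1 = [0.5, -2.0, 1.0, 0.25]        # hypothetical bottleneck output
f_hat = linf_normalize(f1)
linf_norm = max(abs(x) for x in f_hat)
```

After normalization every component lies in [-1, 1] and at least one sits exactly on the boundary, which is the bounded range a scalar quantizer with fixed levels expects.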
Multi-Head Finite Scalar Quantization (FSQ): The constrained feature is down-projected to \(d\) dimensions (\(d \ll D \ll C\)) and quantized using FSQ:
A multi-head attention mechanism expands each visual patch into \(2 \times 2\) codes (going from \(N\) to \(4N\) tokens), then restores the original sequence length after quantization.
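The scalar quantization step can be sketched as follows. This follows the generic FSQ recipe (bound each channel, then round to a small integer grid) and is not the paper's implementation: the level counts are hypothetical, only odd level counts are handled, the multi-head \(2 \times 2\) expansion is left out, and training would additionally need a straight-through gradient estimator.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each channel of z to one of levels[i] integer values.

    Only odd level counts are supported, so the grid is symmetric around zero.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2          # e.g. 5 levels -> grid {-2, ..., 2}
        bounded = half * math.tanh(value)    # squash into (-half, half)
        codes.append(int(round(bounded)))
    return codes

def code_to_index(codes, levels):
    """Map a per-channel code vector to a single implicit-codebook index."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + (code + (num_levels - 1) // 2)
    return index

levels = [5, 5, 3]                 # hypothetical; implicit codebook size 5*5*3 = 75
codes = fsq_quantize([0.1, 2.0, -3.0], levels)   # -> [0, 2, -1]
index = code_to_index(codes, levels)             # -> 42
```

The codebook is implicit: it is the Cartesian product of the per-channel grids, so there are no learned code vectors to update or collapse.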
Rotary Position Embedding (2D RoPE): Position encoding is injected before quantization to handle arbitrary resolutions:
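For intuition, here is a minimal sketch of one common 2D RoPE variant, in which half of the channels are rotated by the row index and the other half by the column index. Whether ViQ uses this axial split is not stated in this summary, so the layout, the base of 10000, and the function names are assumptions; the channel count must be divisible by 4.

```python
import math

def rope_1d(vec, position, base=10000.0):
    """Rotate consecutive channel pairs of vec by position-dependent angles."""
    out = []
    dim = len(vec)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def rope_2d(vec, row, col):
    """Axial 2D RoPE: the first half encodes the row, the second half the column."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)

token = [1.0, 0.0, 1.0, 0.0, 0.5, 0.5, -1.0, 2.0]   # hypothetical patch feature
rotated = rope_2d(token, row=3, col=7)
```

Because the encoding is a pure rotation computed from the patch coordinates, it needs no learned table of fixed size and preserves the feature norm, which is what makes it a convenient fit for arbitrary input resolutions.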
Multi-stage training with low-level supervision: A VAE-style latent reconstruction loss is added:
Overall objective:
Empirical Validation / Results
Multimodal Understanding (Table 2)
ViQ is evaluated on nine benchmarks (MMStar, MMMU, SimpleVQA, InfoVQA, TextVQA, DocVQA, OCRBench, AI2D, ChartQA) using Qwen2.5-1.5B and Qwen2.5-7B as backbone LLMs. All methods are trained on the same 2M-sample subset of LLaVA-OneVision.
| Base LLM | Visual Encoder | Size | AnyRes | Discrete | Avg. |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | InternViT-2.5 | 0.3B | ✗ | ✗ | 56.5 |
| Qwen2.5-1.5B | InternViT-2.5-6B | 6.0B | ✗ | ✗ | 57.0 |
| Qwen2.5-1.5B | ViQ | 1.3B | ✓ | ✓ | 57.2 |
| Qwen2.5-7B | InternViT-2.5-6B | 6.0B | ✗ | ✗ | 63.8 |
| Qwen2.5-7B | ViQ | 1.3B | ✓ | ✓ | 63.9 |
ViQ matches or slightly exceeds the continuous encoders in the table, including the 6B-parameter InternViT-2.5-6B (57.2 vs. 57.0 with Qwen2.5-1.5B; 63.9 vs. 63.8 with Qwen2.5-7B), and dramatically surpasses prior quantized encoders (QLIP: 29.7, UniTok: 33.0). Gains are largest on OCR, document, and chart tasks.
Training Efficiency (Figure 3)
- Offline ViQ code extraction allows skipping image encoding during LLM training.
- Forward pass speedups: 70%–78% for small LLMs (0.5B), 46%–65% for larger LLMs (7B).
- Full iteration step speedups: >20% (4k setting) and >40% (16k setting).
Image Reconstruction (Table 3)
On ImageNet-1K 256×256 validation set:
| Method | #Token | PSNR↑ | SSIM↑ | rFID↓ |
|---|---|---|---|---|
| QLIP-B | 16×16 | 23.16 | 0.63 | 3.21 |
| UniTok | 16×16 | 25.32 | 0.77 | 0.37 |
| ViQ | 16×16 | 22.73 | 0.66 | 0.62 |
ViQ achieves the second-best rFID among the discrete tokenizers listed (0.62, behind UniTok's 0.37) while maintaining strong semantics for understanding tasks.
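To make the PSNR column easier to read, the sketch below computes PSNR from its standard definition, \(10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})\). The 8-bit pixel range (`max_value=255`) and the toy pixel values are assumptions; the evaluation protocol behind the table is not described in this summary.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-size pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf                     # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every pixel is off by 10 intensity levels, so MSE = 100.
original = [0, 100, 200, 255]
reconstructed = [10, 110, 190, 245]
value_db = psnr(original, reconstructed)
```

Under the same 8-bit assumption, a PSNR of 22.73 dB corresponds to a root-mean-square pixel error of roughly 18.6 intensity levels.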
Ablation Studies (Table 4)
Key findings:
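- Proximal representation (bottleneck + \(L_\infty\)) is crucial: quantizing the continuous features directly with FSQ drops the average to 60.9, while adding the bottleneck and \(L_\infty\) normalization yields 68.7.
- FSQ outperforms SimVQ, suggesting that a non-learnable codebook works better in this setting.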
- 2D RoPE significantly outperforms no position encoding (65.3→68.7).
- VAE latent loss balances training efficiency and performance (68.7 avg. at 1.3× time cost).
Theoretical and Practical Implications
- Theoretical: ViQ demonstrates that discrete visual representations can approach the performance of continuous ones in multimodal understanding, addressing the long-standing trade-off between semantics and details. The proximal representation learning strategy provides a principled way to regularize the latent space before quantization.
- Practical:
  - Enables unified discrete processing of vision and language, simplifying multimodal architectures.
  - Substantial training acceleration (up to 70%) reduces hardware requirements.
  - High compression ratio for image storage: ViQ codes require only \(1/96\) of raw image size, with better reconstruction than JPEG at equivalent bitrate.
  - Native-resolution support eliminates the need for image resizing or tiling.
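The \(1/96\) storage figure can be turned into a per-token bit budget with some back-of-the-envelope arithmetic. The sketch assumes that "raw image size" means uncompressed 8-bit RGB and that the ratio applies at 256×256 with a 16×16 token grid; neither assumption is confirmed by this summary, so the numbers are illustrative only.

```python
def raw_rgb_bits(height, width, bits_per_channel=8, channels=3):
    """Size in bits of an uncompressed image."""
    return height * width * channels * bits_per_channel

raw_bits = raw_rgb_bits(256, 256)       # 1,572,864 bits
code_bits = raw_bits // 96              # bit budget at a 1/96 ratio
num_tokens = 16 * 16                    # assumed token grid
bits_per_token = code_bits / num_tokens
bits_per_pixel = code_bits / (256 * 256)
```

Under these assumptions the budget works out to 16,384 bits per image, or 64 bits per visual token and 0.25 bits per pixel.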
Conclusion
ViQ introduces a two-stage framework for learning text-aligned visual quantized representations at any resolution. By combining text-aligned pre-training with progressive feature discretization, proximal representation learning, and position-aware FSQ, ViQ achieves:
- Performance competitive with, or slightly above, continuous encoders on multimodal benchmarks.
- Significant training efficiency improvements.
- High-quality image reconstruction.
Future directions: Integration with larger LLMs (70B+), scaling to more diverse data, and further narrowing the gap on detail-intensive tasks through multi-scale or residual quantization. ViQ offers a viable path toward a fully unified discrete representation for vision and language.