InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Summary (Overview)

  • Core Problem: Standard discrete visual tokenizers, when trained with generic reconstruction losses (e.g., LPIPS), struggle to preserve fine-grained details crucial for text legibility and facial identity fidelity, as these perceptually critical regions often occupy small image areas.
  • Proposed Solution: InsightTok, a novel tokenizer training framework that augments standard objectives with localized, content-aware perceptual losses (L_text and L_face) computed on detected text and face regions using domain-specific recognition models.
  • Key Innovation: Introduces an area-based weighting scheme for the specialized losses (w = Area(region)/Area(image)) to balance improvements on critical content with maintaining general reconstruction quality, preventing small, difficult regions from dominating optimization.
  • Superior Performance: With a compact 16k codebook and 16× downsampling, InsightTok achieves state-of-the-art text and face reconstruction on the TokBench benchmark, significantly outperforming prior tokenizers including those with much larger codebooks (e.g., 262k entries).
  • Effective Transfer: The improved tokenizer directly benefits downstream autoregressive image generation (InsightAR), producing images with clearer text and more faithful facial details without compromising general text-to-image capability.

Introduction and Theoretical Foundation

Discrete tokenization is foundational for autoregressive image generation and unified multimodal modeling. However, the aggressive spatial compression (e.g., 16× downsampling) and quantization inherent in visual tokenizers often discard the fine-grained structures necessary for readable text and distinctive facial features. This is increasingly problematic as generative models are used in text- and face-centric applications like graphic design and portrait synthesis.

The authors identify the root cause as insufficiently targeted supervision in standard tokenizer training. Objectives like pixel reconstruction ($L_{rec}$) and general perceptual loss ($L_{perc}$, e.g., LPIPS) treat all image content uniformly and are poorly aligned with domain-specific metrics like text readability (OCR accuracy) or identity preservation (face similarity). Consequently, the training signal for small but critical text/face regions is diluted by the surrounding scene.

Previous approaches often address this by reducing compression—increasing codebook size or token count—which incurs computational overhead without explicitly prioritizing fidelity-critical structures. This paper proposes a different direction: enhancing tokenizer training with specialized, content-aware supervision to learn a discrete representation that inherently preserves these details.

Methodology

Preliminary: Standard Discrete Tokenizer

A tokenizer consists of an encoder $E$, a quantizer $Q$ with a codebook $\mathcal{C} = \{e_k \in \mathbb{R}^d\}_{k=1}^K$, and a decoder $D$. Given an image $x$, it produces a latent $z = E(x)$, quantized tokens $\hat{z} = Q(z)$, and a reconstruction $\hat{x} = D(\hat{z})$.
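The quantization step $\hat{z} = Q(z)$ can be illustrated with a minimal NumPy sketch. It assumes the nearest-neighbour codebook lookup common to VQ-style tokenizers; the summary does not state which quantizer variant the paper uses, and the function name `quantize` and the toy values are purely illustrative.

```python
import numpy as np

def quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry (assumed VQ lookup).

    z:        (n, d) array of encoder outputs, one row per spatial position.
    codebook: (K, d) array of code vectors e_k.
    Returns the token ids (n,) and the quantized latents z_hat (n, d).
    """
    # Squared Euclidean distance between every latent and every code: (n, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

# Toy example: K = 3 codes of dimension d = 2, and n = 3 latent vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z = np.array([[0.9, 1.2], [-0.2, 0.1], [-0.8, 1.7]])
tokens, z_hat = quantize(z, codebook)
print(tokens)  # [1 0 2]
```

The integer ids in `tokens` are what an autoregressive model later predicts; the decoder only ever sees `z_hat`, which is why detail lost at this step cannot be recovered downstream.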

The standard training objective combines several losses:

$$L_{image} = L_{rec} + \beta \cdot L_{codebook} + \gamma \cdot L_{perc} + \eta \cdot L_{GAN}$$

where $L_{rec}$ is an $\ell_1$ or $\ell_2$ reconstruction loss, $L_{codebook}$ is the commitment loss ($\|z - \text{sg}(\hat{z})\|_2^2$, with $\text{sg}(\cdot)$ the stop-gradient operator), $L_{perc}$ is a general perceptual loss (e.g., LPIPS), and $L_{GAN}$ is an adversarial loss.

InsightTok Framework

InsightTok augments the standard objective with two new, localized perceptual losses:

$$L_{\text{InsightTok}} = L_{image} + \alpha_1 \cdot L_{text} + \alpha_2 \cdot L_{face}$$

where $\alpha_1$ and $\alpha_2$ are scalar weights.
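As a quick illustration of how the terms combine, the sketch below evaluates the full objective from already-computed scalar losses. All numeric values, including the weights, are made up for the example; the summary does not report the paper's actual settings.

```python
def insighttok_loss(l_rec, l_codebook, l_perc, l_gan, l_text, l_face,
                    beta, gamma, eta, alpha1, alpha2):
    """Combine the standard tokenizer objective with the two localized losses."""
    # Standard objective: L_image = L_rec + beta*L_codebook + gamma*L_perc + eta*L_GAN
    l_image = l_rec + beta * l_codebook + gamma * l_perc + eta * l_gan
    # InsightTok adds the text and face perceptual terms on top.
    return l_image + alpha1 * l_text + alpha2 * l_face

# Hypothetical loss values and weights, chosen only to make the arithmetic easy.
total = insighttok_loss(l_rec=0.5, l_codebook=0.2, l_perc=0.4, l_gan=0.1,
                        l_text=0.3, l_face=0.2,
                        beta=0.25, gamma=1.0, eta=0.5, alpha1=1.0, alpha2=2.0)
```

Setting `alpha1 = alpha2 = 0` recovers the standard objective, so the two new terms are a strict extension of an existing training recipe.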

1. Text Perceptual Loss ($L_{text}$)

  1. Detection: For each training image $x$, a text detector identifies $N$ bounding boxes $\{b_n^{text}\}_{n=1}^N$.
  2. Extraction: Corresponding patches $\{r_n^{text}\}$ and $\{\hat{r}_n^{text}\}$ are cropped from the original and reconstructed images.
  3. Supervision: Each patch pair is compared in the feature space of a pretrained text recognition network $F_{text}(\cdot)$. The region-level loss is:
     $$L_n^{text} = \frac{1}{L} \sum_{l=1}^L \frac{1}{H_l W_l} \left\| F_{text}^{(l)}(r_n^{text}) - F_{text}^{(l)}(\hat{r}_n^{text}) \right\|^2$$
     where $L$ is the number of feature layers used and $H_l \times W_l$ is the spatial size of the layer-$l$ feature map.
  4. Aggregation: Region losses are aggregated with area-based weighting to prevent small text from dominating:
     $$L_{text} = \sum_{n=1}^N w_n^{text} \cdot L_n^{text}, \quad \text{where } w_n^{text} = \frac{\text{Area}(b_n^{text})}{\text{Area}(x)}$$
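The four steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_features` is a hypothetical stand-in for the pretrained recognizer $F_{text}$ (it returns the patch and a 2×2 average-pooled copy as its two "layers"), boxes are assumed to be given as end-exclusive pixel coordinates, and patches are assumed to be at least 2×2 pixels.

```python
import numpy as np

def toy_features(patch):
    """Hypothetical stand-in for F_text: returns L = 2 feature maps."""
    h, w = patch.shape
    h2, w2 = h - h % 2, w - w % 2
    pooled = patch[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))
    return [patch, pooled]

def region_loss(r, r_hat, features):
    """L_n: squared feature distance per layer, divided by H_l*W_l, averaged over layers."""
    feats, feats_hat = features(r), features(r_hat)
    total = 0.0
    for f, f_hat in zip(feats, feats_hat):
        h_l, w_l = f.shape[:2]
        total += ((f - f_hat) ** 2).sum() / (h_l * w_l)
    return total / len(feats)

def localized_loss(x, x_hat, boxes, features):
    """Sum of region losses, each weighted by Area(box) / Area(image).

    boxes: list of (top, left, bottom, right), end-exclusive pixel coordinates.
    """
    img_area = x.shape[0] * x.shape[1]
    loss = 0.0
    for top, left, bottom, right in boxes:
        r = x[top:bottom, left:right]
        r_hat = x_hat[top:bottom, left:right]
        weight = (bottom - top) * (right - left) / img_area
        loss += weight * region_loss(r, r_hat, features)
    return loss

# Toy 8x8 image whose reconstruction is wrong only inside the first box.
x = np.zeros((8, 8))
x_hat = x.copy()
x_hat[0:2, 0:4] = 1.0
boxes = [(0, 0, 2, 4), (4, 4, 8, 8)]
l_text = localized_loss(x, x_hat, boxes, toy_features)
print(l_text)  # 0.125
```

The first box has a region loss of 1.0 but covers only 8 of 64 pixels, so it contributes 0.125; an unweighted sum would contribute the full 1.0. In actual training the same computation would run inside an autodiff framework so that gradients reach the encoder and decoder.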

2. Face Perceptual Loss ($L_{face}$)

  1. Detection & Alignment: A face detector provides bounding boxes $\{b_m^{face}\}$ and five facial landmarks $\{p_k^m\}$ for $M$ faces. Each face is aligned to a canonical template with landmarks $p_k^*$ via a similarity transform $T(u) = sRu + t$ that minimizes the landmark error (face index $m$ omitted for brevity):
     $$\min_{s, R, t} \sum_{k=1}^5 \| sRp_k + t - p_k^* \|^2$$
  2. Extraction: Aligned patches $r_m^{face}$ and $\hat{r}_m^{face}$ are extracted from the original and reconstructed images via inverse warping: $r_m^{face}[c] = x[T^{-1}(c)]$, and likewise $\hat{r}_m^{face}[c] = \hat{x}[T^{-1}(c)]$.
  3. Supervision: Similarity is measured in the feature space of a pretrained face recognition network $F_{face}(\cdot)$:
     $$L_m^{face} = \frac{1}{L} \sum_{l=1}^L \frac{1}{H_l W_l} \left\| F_{face}^{(l)}(r_m^{face}) - F_{face}^{(l)}(\hat{r}_m^{face}) \right\|^2$$
  4. Aggregation: As with text, region losses are combined using area-based weighting:
     $$L_{face} = \sum_{m=1}^M w_m^{face} \cdot L_m^{face}, \quad w_m^{face} = \frac{\text{Area}(b_m^{face})}{\text{Area}(x)}$$
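The alignment and extraction steps can be sketched as below. The least-squares problem for $(s, R, t)$ has a closed-form solution via SVD (the Umeyama method), which is used here as one standard way to solve it; the summary does not say which solver the authors use. The template coordinates, the nearest-neighbour sampling, and the edge clamping in `warp_aligned` are illustrative assumptions.

```python
import numpy as np

def fit_similarity(src, dst):
    """Closed-form least-squares s, R, t such that dst_k ~ s * R @ src_k + t."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / len(src)                 # 2x2 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[1, 1] = -1.0                         # rule out reflections
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (sc ** 2).sum(axis=1).mean()
    t = mu_d - s * R @ mu_s
    return s, R, t

def warp_aligned(x, s, R, t, out_hw):
    """Inverse warping r[c] = x[T^{-1}(c)], nearest-neighbour, edges clamped."""
    h, w = out_hw
    ys, xs = np.mgrid[0:h, 0:w]
    c = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (x, y) per pixel
    u = (c - t) @ R / s                        # rows are R^T (c - t) / s
    ux = np.clip(np.rint(u[:, 0]).astype(int), 0, x.shape[1] - 1)
    uy = np.clip(np.rint(u[:, 1]).astype(int), 0, x.shape[0] - 1)
    return x[uy, ux].reshape(h, w, *x.shape[2:])

# Hypothetical 5-point template (x, y) and landmarks generated from a known transform.
template = np.array([[38.0, 52.0], [74.0, 52.0], [56.0, 72.0], [42.0, 92.0], [70.0, 92.0]])
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
s_true, t_true = 2.0, np.array([5.0, -3.0])
landmarks = (template - t_true) @ R_true / s_true   # T^{-1}(template)

s, R, t = fit_similarity(landmarks, template)

# Sanity check of the warp: the identity transform returns the image unchanged.
img = np.arange(12.0).reshape(3, 4)
same = warp_aligned(img, 1.0, np.eye(2), np.zeros(2), (3, 4))
```

Because the landmarks were generated from a known transform, `fit_similarity` recovers `s_true`, `R_true` and `t_true` up to floating-point error. Applying the same fitted transform to both $x$ and $\hat{x}$ keeps the two crops pixel-aligned before they reach the recognition network.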

InsightAR: Autoregressive Image Generator

The improved tokens from InsightTok are used to train a standard autoregressive Transformer, InsightAR, for text-to-image generation. It models the token sequence $t = (t_1, \dots, t_n)$ conditioned on a text prompt $T$:

$$p(t \mid T) = \prod_{i=1}^n p(t_i \mid t_{<i}, T)$$
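The factorization means the log-likelihood of a token sequence is the sum of per-step conditional log-probabilities. A minimal sketch, assuming the per-step distributions have already been produced by some model (here they are fixed toy numbers rather than Transformer outputs):

```python
import numpy as np

def sequence_log_prob(step_probs, tokens):
    """log p(t | T) = sum_i log p(t_i | t_<i, T).

    step_probs: (n, K) array; row i is the distribution over the K codebook
                entries for position i, given the prompt and the prefix t_<i.
    tokens:     length-n sequence of observed token ids.
    """
    tokens = np.asarray(tokens)
    chosen = step_probs[np.arange(len(tokens)), tokens]
    return np.log(chosen).sum()

# Toy example with K = 2 codes and n = 3 positions.
step_probs = np.array([[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]])
tokens = [0, 1, 0]
log_p = sequence_log_prob(step_probs, tokens)
print(np.exp(log_p))  # 0.5 * 0.75 * 0.9 = 0.3375
```

Training minimizes the negative of this quantity (next-token cross-entropy), which is why the generator itself needs no modification when the tokenizer changes.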

The architecture follows Janus-Pro, connecting the visual tokenizer to a 7B-parameter multimodal LLM via an MLP adapter.

Empirical Validation / Results

1. Image Reconstruction (Tokenizer Quality)

InsightTok is evaluated against state-of-the-art tokenizers on the TokBench benchmark for text/face reconstruction and on ImageNet for general reconstruction. All models use ~1024 tokens (16× downsampling) on 512×512 images.

Table 1: Reconstruction performance of InsightTok and existing discrete visual tokenizers.

| Method | Codebook | BPP | Text: T-ACC<sub>m</sub> (%) ↑ | Text: T-NED<sub>m</sub> (%) ↑ | Face: F-Sim<sub>m</sub> ↑ |
|---|---|---|---|---|---|
| VQGAN | 16,384 | 0.0547 | 6.12 | 17.32 | 0.19 |
| LlamaGen | 16,384 | 0.0547 | 15.01 | 30.44 | 0.25 |
| O-MAGVIT2-16k | 16,384 | 0.0547 | 20.62 | 39.96 | 0.26 |
| IBQ-16k | 16,384 | 0.0547 | 24.16 | 43.66 | 0.27 |
| InsightTok | 16,384 | 0.0547 | 53.05 | 71.40 | 0.36 |
| O-MAGVIT2-262k | 262,144 | 0.0703 | 27.33 | 47.28 | 0.31 |
| Emu3.5-IBQ | 131,072 | 0.0664 | 41.52 | 65.39 | 0.30 |

  • Key Findings: InsightTok more than doubles the text accuracy (T-ACC 24.16 → 53.05) and significantly improves face similarity (F-Sim 0.27 → 0.36) over the best comparable baseline (IBQ-16k), while maintaining competitive general reconstruction metrics (PSNR, rFID). It even outperforms models with 8-16× larger codebooks (Emu3.5-IBQ, O-MAGVIT2-262k) on its target domains.
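The BPP (bits per pixel) values in Table 1 are consistent with counting $\log_2 K$ bits per token over the token grid. The short check below reproduces them under that assumption; the function name is illustrative.

```python
import math

def bits_per_pixel(image_size, downsample, codebook_size):
    """Bits needed to store the token grid, divided by the number of pixels."""
    tokens = (image_size // downsample) ** 2          # 512 / 16 = 32 -> 1024 tokens
    return tokens * math.log2(codebook_size) / image_size ** 2

bpp_16k = bits_per_pixel(512, 16, 16_384)     # 1024 * 14 / 262144
bpp_131k = bits_per_pixel(512, 16, 131_072)   # 1024 * 17 / 262144
bpp_262k = bits_per_pixel(512, 16, 262_144)   # 1024 * 18 / 262144
print(round(bpp_16k, 4), round(bpp_131k, 4), round(bpp_262k, 4))  # 0.0547 0.0664 0.0703
```

Moving from a 16k to a 262k codebook costs 4 extra bits per token, roughly a 29% increase in bitrate, which puts the comparison against larger-codebook tokenizers in perspective.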

2. Autoregressive Text-to-Image Generation (InsightAR)

The tokenizer improvements consistently transfer to the generative model InsightAR.

Table 2: Image generation performance of InsightAR and existing autoregressive models.

| Model | #Params | #Tokens | Face: MagFace-Score ↑ | Text: NED (%) ↑ | General: GenEval ↑ |
|---|---|---|---|---|---|
| LlamaGen | 0.8B | 1,024 | 22.37 | 18.78 | 0.32 |
| Janus-Pro | 7B | 576 | 22.09 | 32.29 | 0.80 |
| LlamaGenTok-AR | 7B | 1,024 | 22.29 | 79.86 | 0.81 |
| InsightAR | 7B | 1,024 | 23.33 | 95.83 | 0.82 |

  • Key Findings: InsightAR achieves the highest face quality score (MagFace-Score 23.33) and the highest text score (NED 95.83%, where higher is better) among the compared models, including those with the same token count, demonstrating clear benefits from the improved tokenizer. It maintains strong performance on general text-to-image benchmarks (GenEval, DPG-Bench).
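The ↑ on the NED column indicates a similarity-style score rather than a raw error. The summary does not define the metric, so the sketch below assumes a common scene-text convention, 1 minus the Levenshtein distance normalized by the longer string; the benchmark's exact definition may differ.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def ned_similarity(pred, target):
    """Assumed NED score: 1 - edit_distance / max(len); 1.0 means an exact match."""
    if not pred and not target:
        return 1.0
    return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))

print(ned_similarity("OPEN", "OPEN"))  # 1.0
print(ned_similarity("0PEN", "OPEN"))  # 0.75
```

Under this convention a single misread character in a four-letter word already costs 25 points, which helps explain why tokenizers that blur small glyphs score so much lower.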

3. Analytical Experiments & Ablations

Table 3: Effect of specialized perceptual losses and area-based loss weighting.

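| $L_{text}$ & $L_{face}$ | Area-based Weighting | T-ACC<sub>m</sub> | Face-Sim<sub>m</sub> | rFID ↓ | IN-PSNR ↑ |
|---|---|---|---|---|---|
| ✗ | - | 30.89 | 0.29 | 0.60 | 23.65 |
| ✓ | ✗ | 55.18 | 0.42 | 1.11 | 22.41 |
| ✓ | ✓ | 53.05 | 0.36 | 0.69 | 23.64 |

  • Finding: Adding the specialized losses without weighting yields the largest text/face gains (T-ACC 55.18, Face-Sim 0.42) but degrades general reconstruction (rFID 0.60 → 1.11, IN-PSNR 23.65 → 22.41). The proposed area-based weighting retains most of those gains (T-ACC 53.05, Face-Sim 0.36) while keeping general quality close to the baseline (rFID 0.69, IN-PSNR 23.64).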

Table: Additional key ablation results (from the Appendix).

| Ablation | Key Result |
|---|---|
| Decoder-only fine-tuning (Table 4) | Applying losses only to the decoder yields minimal gains, confirming that improvements come from a refined latent representation, not just a stronger decoder. |
| Comparison with OCR-VQGAN (Table 5) | InsightTok's localized loss (L_text) is more effective than OCR-VQGAN's global OCR loss. |
| Scaling codebook size (Table 6) | InsightTok's benefits hold for both 16k and 65k codebooks. |
| Isolating losses (Table 8) | L_text and L_face specifically improve their respective targets; combining them works well. |
| Detector coverage (Table 9) | Better region identification (higher recall) leads to stronger reconstruction quality. |

Theoretical and Practical Implications

  • Tokenizer Training Paradigm: The work shifts the focus from increasing bottleneck capacity (larger codebooks) to improving supervision alignment. It demonstrates that incorporating richer, content-aware objectives is a highly effective and computationally lightweight way to advance discrete representation learning.
  • Specialization vs. Generality: The paper successfully navigates the trade-off between specializing for critical content (text/faces) and maintaining general-purpose utility. The area-based weighting is a simple yet crucial mechanism for achieving this balance.
  • Downstream Impact: Improvements in the tokenizer directly and consistently transfer to autoregressive generation, enabling higher-quality text and face synthesis without changes to the core AR modeling architecture. This validates the tokenizer as a critical component for improving specific failure modes in generative models.
  • Practical Deployability: The method adds only ~2% overhead to training (FLOPs and wall-clock time) as detection is performed offline and recognition models are applied only to small crops. This makes it a practical enhancement to existing tokenizer training pipelines.

Conclusion

The paper identifies lack of targeted supervision as a key bottleneck for text and face fidelity in discrete visual tokenizers. The proposed InsightTok framework addresses this by augmenting standard training with localized, domain-specific perceptual losses for text and face regions, balanced via an area-based weighting scheme.

With a compact 16k codebook, InsightTok achieves state-of-the-art text and face reconstruction, significantly outperforming prior tokenizers including those with much larger capacities. These gains effectively transfer to the InsightAR autoregressive generator, producing images with clearer text and more faithful faces. The findings highlight that aligning tokenizer supervision with perceptually critical content is a promising and practical direction for advancing discrete image generation.