Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models - Summary

Summary (Overview)

Compact & Efficient Model: Introduces Lens, a 3.8B-parameter foundational text-to-image (T2I) model designed for high training-time efficiency. It achieves performance competitive with or superior to larger state-of-the-art models (6B+ parameters) while using significantly less compute (e.g., ~19.3% of Z-Image's training compute).
Core Efficiency Strategies: Training efficiency is achieved through three pillars: 1) Reduced model size, 2) Maximized data information density per batch via dense captions (Lens-800M dataset) and multi-resolution/aspect-ratio training, and 3) Accelerated convergence via a semantic VAE and a strong language encoder (GPT-OSS).
Systematic Post-Training & Optimization: Employs RL-based post-training on a diverse, taxonomy-driven prompt set (Lens-RL-8K) to suppress artifacts, a reasoner module for prompt refinement, training-free system prompt search, and few-step distillation to create Lens-Turbo for fast inference.
Strong Generalization: The model generalizes to unseen aspect ratios (1:2 to 2:1) and resolutions up to 1440² from training on a limited set of buckets, and supports multilingual prompt following from English-only training data.
Fast Inference: Due to its compact size, Lens generates a 1024² image in 3.15 seconds on a single H100 GPU (20 steps). Its distilled variant, Lens-Turbo, performs 4-step generation in 0.84 seconds.

Introduction and Theoretical Foundation

Recent foundational T2I models require massive computational resources, creating scalability challenges. This paper argues that training-time efficiency is determined by three key factors:

Model Size: Directly affects per-step computational cost.
Data Information Density per Training Batch: Determines the useful supervision extracted per update.
Convergence Speed: Determines the total number of training iterations needed.

The goal is to improve efficiency not just by reducing model scale, but by increasing the learning value of each batch and accelerating convergence. Lens is introduced as a case study implementing these principles.
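
One way to read the three factors together is as a simple cost decomposition. This is a rough sketch for orientation, not a formula taken from the paper; the symbols $C_{\text{train}}$, $c_{\text{step}}$, $N$, and $T$ are introduced here for illustration only.

$$C_{\text{train}} \approx c_{\text{step}}(N) \cdot T$$

Here $c_{\text{step}}(N)$ is the compute per training step for a model with $N$ parameters at a fixed batch size, and $T$ is the number of steps needed to reach a target quality. Under this reading, a smaller model lowers $c_{\text{step}}$, while denser batches and faster convergence lower $T$.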

Methodology

2.1 Pre-training Data: Lens-800M

Dataset: Lens-800M contains 800M high-quality image-text pairs from four sources (public real, public synthetic, private, text synthetic).
Data Cleaning: A multi-stage pipeline filters for resolution, NSFW content, aesthetics, watermarks, clarity, entropy, luminance, and near-duplicates.
Dense Captioning: Each image is captioned by GPT-4.1 to generate detailed, long-form English descriptions (~109 words avg.). This increases text information density, providing richer semantic supervision than short captions.
Ablation Study: Training with dense captions outperforms brief or mixed captions on the GenEval benchmark, confirming improved data utilization.

2.2 Architecture

The model consists of:

VAE: After ablation studies, the FLUX.2 semantic VAE is adopted. It provides a more compact and semantically meaningful latent space, accelerating convergence and improving generation quality.
Latent Diffusion Transformer: An MMDiT-style architecture with 48 blocks, trained with a flow-matching objective on image latents produced by the FLUX.2 VAE.
Language Encoder: GPT-OSS (20B MoE, 3B activated) is selected. Features are extracted from layers 4, 12, 18, and 24, concatenated, and projected via a linear adapter.
- Ablation Study: Stronger language encoders (GPT-OSS) lead to better prompt-following, faster convergence, and enable multilingual generalization (e.g., to Chinese, French) from English-only training data.
Reasoner: An independent LLM module (default GPT-5.5) that refines user inputs into detailed prompts aligned with the T2I model's training distribution.
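
The multi-layer feature extraction described for the language encoder can be sketched as follows. This is a minimal, dependency-free illustration, assuming per-token hidden states are available for every layer; the function names, the toy dimensions, and the random weights are hypothetical, and only the layer indices (4, 12, 18, 24), the concatenation, and the linear projection come from the text.

```python
import random

LAYER_IDS = (4, 12, 18, 24)  # tap points named in the text


def concat_layers(hidden_states, layer_ids=LAYER_IDS):
    """Concatenate per-token features from the chosen layers along the channel axis.

    hidden_states: dict mapping layer index -> list of token vectors (lists of floats).
    """
    num_tokens = len(hidden_states[layer_ids[0]])
    return [
        [x for layer in layer_ids for x in hidden_states[layer][tok]]
        for tok in range(num_tokens)
    ]


def linear_adapter(features, weight, bias):
    """Apply y = W x + b to every token vector; weight has shape [d_out][d_in]."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


# Toy sizes (hypothetical): 3 tokens, encoder width 8, adapter output width 6.
random.seed(0)
num_tokens, d_enc, d_model = 3, 8, 6
hidden_states = {
    layer: [[random.gauss(0.0, 1.0) for _ in range(d_enc)] for _ in range(num_tokens)]
    for layer in range(1, 25)
}
d_in = d_enc * len(LAYER_IDS)  # concatenation multiplies the channel width by 4
weight = [[random.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_model)]
bias = [0.0] * d_model

concat = concat_layers(hidden_states)
text_tokens = linear_adapter(concat, weight, bias)
```

Tapping several depths gives the diffusion transformer access to both shallow and deep representations of the prompt, and the adapter maps the widened feature back to a single conditioning width.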

2.3 Pre-training

Low-resolution Pre-training: Train at a fixed 512×512 resolution for 400K iterations.
Mixed-resolution Continual Training: Train for another 400K iterations using bucket sampling over mixed resolutions and aspect ratios. The bucket set is constructed from three base areas (512², 768², 1024²) and nine aspect ratios (1:2 to 2:1), resulting in 27 concrete resolution buckets.
Training Details:
- Optimizer: AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$.
- Learning rate: $2 \times 10^{-4}$ (low-res), $1 \times 10^{-4}$ (mixed-res).
- Effective global batch size: 3072 images.
- Timestep sampling: logit-normal, with the location parameter $\mu$ set according to the image token length $n$; $\mu(n)$ is interpolated from $\mu = 1.0$ at $n = 256$ to $\mu = 1.3$ at $n = 4096$.
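
The bucket construction and the length-dependent timestep sampling above can be sketched together, since the bucket determines the token length $n$ that sets $\mu$. Several details below are assumptions rather than facts from the paper: the nine aspect ratios are not enumerated in the text, so the list is illustrative; sides are snapped to multiples of 16; one image token is assumed to cover a 16×16 pixel area; the interpolation is assumed linear in $n$ and clamped outside the stated range; and the logit-normal scale `sigma` is assumed to be 1.0.

```python
import math
import random

BASE_SIDES = (512, 768, 1024)  # base areas 512^2, 768^2, 1024^2 (from the text)
# ASSUMPTION: illustrative ratios spanning 1:2 to 2:1; the text does not list them.
ASPECT_RATIOS = ((1, 2), (9, 16), (2, 3), (3, 4), (1, 1), (4, 3), (3, 2), (16, 9), (2, 1))
STRIDE = 16  # ASSUMPTION: pixels per token side, also used as the snapping grid


def make_buckets():
    """Return one (width, height) bucket per (base area, aspect ratio) pair."""
    buckets = []
    for side in BASE_SIDES:
        area = side * side
        for rw, rh in ASPECT_RATIOS:
            # Solve w * h = area with w / h = rw / rh, then snap to the grid.
            width = math.sqrt(area * rw / rh)
            height = math.sqrt(area * rh / rw)
            buckets.append((round(width / STRIDE) * STRIDE,
                            round(height / STRIDE) * STRIDE))
    return buckets


def token_length(bucket):
    """Number of image tokens for a bucket under the assumed stride."""
    width, height = bucket
    return (width // STRIDE) * (height // STRIDE)


def mu_for_tokens(n, n_lo=256, n_hi=4096, mu_lo=1.0, mu_hi=1.3):
    """ASSUMED linear interpolation of mu in n, clamped to [mu_lo, mu_hi]."""
    frac = (n - n_lo) / (n_hi - n_lo)
    frac = min(max(frac, 0.0), 1.0)
    return mu_lo + frac * (mu_hi - mu_lo)


def sample_timestep(n, sigma=1.0, rng=random):
    """Logit-normal draw: t = sigmoid(mu(n) + sigma * z) with z ~ N(0, 1)."""
    z = rng.gauss(0.0, 1.0)
    return 1.0 / (1.0 + math.exp(-(mu_for_tokens(n) + sigma * z)))


buckets = make_buckets()
random.seed(0)
bucket = random.choice(buckets)          # uniform bucket sampling, for illustration
n = token_length(bucket)
t = sample_timestep(n)
print(len(buckets), bucket, n, round(mu_for_tokens(n), 3))
```

Under these assumptions a 1024×1024 bucket maps to 4096 tokens and therefore to $\mu = 1.3$, the upper end of the range quoted above. A larger $\mu$ shifts the sampled timesteps toward the noisier end for longer token sequences.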

2.4 Post-training

Lens-RL-8K Dataset: A taxonomy-driven prompt set of 8,406 prompts covering diverse generation scenarios (Human, Object, Animal, etc.). Diversity in the RL data matters: in the Table 1 ablation, the "Full set w/o text" variant scores lower than the full set on both reported metrics.
Rubric Generation: For each prompt, GPT-4.1 generates 10 sample-aware evaluation rubrics (e.g., for object count, placement, attributes), plus a global coherence rubric.
Reinforcement Learning: Adopts DiffusionNFT, with GPT-4.1-mini scoring generated samples against the rubrics to provide the reward. The policy is trained for 180 steps.
Few-step Distillation: Lens-Turbo is a 4-step generator distilled from Lens-RL using techniques from DMD2, decoupled-DMD, and SenseFlow, combined with R1 regularization for stability.
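
To make the rubric-guided reward concrete, the sketch below turns per-rubric judgments into a single scalar. The summary does not say how rubric results are aggregated, so the unweighted pass fraction used here is an assumption, and the `judge` callable is a hypothetical stand-in for the GPT-4.1-mini call; neither is the paper's confirmed implementation.

```python
from typing import Callable, Sequence


def rubric_reward(
    image: object,
    rubrics: Sequence[str],
    judge: Callable[[object, str], bool],
) -> float:
    """Fraction of rubrics the judge marks as satisfied (ASSUMED aggregation)."""
    if not rubrics:
        raise ValueError("at least one rubric is required")
    passed = sum(1 for rubric in rubrics if judge(image, rubric))
    return passed / len(rubrics)


# Toy setup: 10 sample-aware rubrics plus 1 global coherence rubric, as in the text.
rubrics = [f"sample-aware rubric {i}" for i in range(1, 11)] + ["global coherence"]

# Hypothetical stand-in for a generated image: the set of rubrics it satisfies.
fake_image = {"satisfied": set(rubrics) - {"sample-aware rubric 3", "sample-aware rubric 7"}}


def stub_judge(image, rubric):
    """Hypothetical judge; a real one would query a vision-language model."""
    return rubric in image["satisfied"]


reward = rubric_reward(fake_image, rubrics, stub_judge)
print(f"{reward:.3f}")
```

Scoring against several narrow rubrics instead of one holistic question gives a graded signal per sample, which is useful when the RL method needs to separate better samples from worse ones for the same prompt.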

2.5 Inference

Default: Reasoner + Lens with 20-step generation, CFG=5.0.
Fast: Lens-Turbo with 4-step generation, no CFG.
Training-free System-prompt Search: An iterative method using GPT-5.5 to optimize the reasoner's system prompt, improving its ability to convert user requests into effective T2I prompts.
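
The difference between the two inference modes can be illustrated with a toy sampler. This is a sketch under stated assumptions: the summary does not specify the sampler or the time convention, so a plain Euler integrator running from $t = 1$ (noise) to $t = 0$ (image) is assumed, and `toy_velocity` is a hypothetical stand-in for the diffusion transformer. The guidance rule itself is the standard classifier-free guidance combination of conditional and unconditional predictions.

```python
def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: v = v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x, velocity_fn, steps, cfg_scale=None):
    """Integrate dx/dt = v from t = 1 to t = 0 with uniform Euler steps.

    velocity_fn(x, t, conditional) -> list of floats.
    Returns (sample, number_of_model_calls).
    """
    calls = 0
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = velocity_fn(x, t, True)
        calls += 1
        if cfg_scale is not None:
            # CFG needs a second, unconditional prediction at every step.
            v_uncond = velocity_fn(x, t, False)
            calls += 1
            v = guided_velocity(v, v_uncond, cfg_scale)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x, calls


# Toy model (hypothetical): straight-line paths toward fixed targets.
X0_COND, X0_UNCOND = [1.0, -2.0], [0.0, 0.0]


def toy_velocity(x, t, conditional):
    target = X0_COND if conditional else X0_UNCOND
    return [(xi - ti) / t for xi, ti in zip(x, target)]


noise = [0.3, 0.7]
lens_out, lens_calls = euler_sample(noise, toy_velocity, steps=20, cfg_scale=5.0)
turbo_out, turbo_calls = euler_sample(noise, toy_velocity, steps=4)
print(lens_calls, turbo_calls)
```

In this sketch the default configuration makes 40 model calls (20 steps, two predictions each), while the 4-step configuration without CFG makes 4. Whether the two CFG predictions are batched into one forward pass is an implementation detail not covered by the summary.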

Empirical Validation / Results

Lens is evaluated against state-of-the-art commercial and open-source models on four benchmarks: OneIG, GenEval, LongText, and CVTG.

Table 2: Main Benchmark Results Comparison

Model	Size	OneIG (EN)	GenEval	LongText (EN)	CVTG (Avg. NED)	CVTG (CLIP)
Commercial Models
Seedream 4.0	–	0.573	0.840	0.921	0.892	0.785
GPT Image 1 [High]	–	0.533	0.840	0.956	0.857	0.798
Nano Banana 2.0	–	0.578	–	0.981	–	–
Open-source Models
Z-Image	6B	0.546	0.840	0.935	0.867	0.797
Qwen-Image	20B	0.539	0.868	0.943	0.829	0.806
LongCat-Image	6B	–	0.870	–	0.866	0.786
Lens-Turbo (4-step)	3.8B	0.554	0.914	0.927	0.889	0.815
Lens (20-step)	3.8B	0.557	0.930	0.937	0.869	0.814

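Key Findings: The 3.8B Lens achieves the highest GenEval (object-centric composition) score in Table 2 at 0.930, ahead of the best larger open-source result of 0.870, with Lens-Turbo following at 0.914. On OneIG, LongText, and CVTG (text rendering), both variants are competitive with the 6B to 20B open-source models and with the commercial systems.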
Inference Speed: Figure 2 shows Lens and Lens-Turbo achieve a favorable trade-off between benchmark score and inference time compared to other models.
Ablation Studies: Confirm the importance of dense captions (Fig. 4), the FLUX.2 VAE (Fig. 5), strong language encoders for convergence and multilingual generalization (Fig. 7, 8), and diverse RL data (Table 1).

Table 1: RL Dataset Diversity Ablation

RL Training Set	CVTG (Avg. NED)	OneIG (EN) (CLIP Text)
Full set w/o text	0.832	0.928
Full set	0.869	0.951

Theoretical and Practical Implications

Efficiency Blueprint: Provides a systematic framework for improving T2I training efficiency beyond just model scaling, emphasizing data density and convergence speed.
Data-Centric Insights: Demonstrates the high value of dense, high-quality captions and multi-resolution/aspect-ratio training for improving data utilization and enabling resolution generalization.
Architectural Guidance: Shows that semantic VAEs and strong language encoders are critical for faster convergence and emergent multilingual capabilities, reducing data requirements.
Post-Training Strategy: Highlights that RL with diverse, taxonomy-driven prompts and structured rubrics is effective for broad quality improvement without overfitting.
Accessibility: The compact model size and efficient training reduce the computational and financial barrier to developing high-performance T2I models, promoting wider research and application.

Conclusion

Lens demonstrates that careful design focused on data information density, convergence acceleration, and systematic post-training can yield a highly competitive foundational T2I model with substantially fewer parameters and lower training cost. The strategies explored (dense captioning, mixed-resolution training, a semantic VAE, a strong language encoder, and diverse RL data) offer actionable insights for the community. Future work may focus on expanding multilingual data coverage and further improving artifact suppression.