Penguin-VL Technical Report Summary

Summary (Overview)

  • Core Innovation: Introduces Penguin-VL, a compact Vision Language Model (VLM) that challenges the prevailing paradigm of using contrastive-pretrained vision encoders (e.g., CLIP/SigLIP). Instead, its vision encoder (Penguin-Encoder) is initialized directly from a text-only Large Language Model (LLM).
  • Key Finding: Identifies an objective mismatch where contrastive learning, optimized for discrimination, enforces coarse invariances that suppress fine-grained visual cues needed for dense captioning and complex reasoning. LLM-based initialization provides superior visual fidelity and data efficiency.
  • Methodological Contributions: Presents a holistic framework including: 1) A novel LLM-based vision encoder and pretraining strategy, 2) A tailored VLM training recipe with a Temporal Redundancy-Aware (TRA) token compression mechanism for videos, and 3) Comprehensive data curation pipelines for images and videos.
  • Performance: Achieves performance comparable to or surpassing leading VLMs (e.g., Qwen3-VL) across various image and video benchmarks, particularly excelling in document understanding, visual knowledge, and multi-perspective video reasoning, while maintaining a lightweight (2B/8B) architecture.
  • Primary Driver: Demonstrates that improved visual representation—rather than model scaling—is the primary driver of performance for efficient VLMs.

Introduction and Theoretical Foundation

The development of Vision Language Models (VLMs) has largely relied on scaling model size, hindering deployment on compute-constrained devices like smartphones and robots. This work explores the performance limits of compact VLMs (2B and 8B).

The paper challenges the standard practice of initializing vision encoders via massive contrastive pretraining (CLIP/SigLIP). It argues there is an objective mismatch: contrastive learning is optimized for discrimination, enforcing coarse, category-level invariances that suppress the fine-grained visual cues essential for tasks like dense captioning and complex VLM reasoning.

The proposed solution is Penguin-VL, which initializes its vision encoder from a text-only LLM. This approach leverages the rich semantic priors and reasoning capabilities already embedded in LLMs, and adopts their efficient transformer architecture, which is natively optimized for scalable, dense sequence modeling. This perspective is inspired by parallel successes in speech modeling where text-only LLMs are fine-tuned to process continuous speech signals.

The core motivation is to build a compact vision-centric multimodal foundation model with consistently strong capability across both images and videos, targeting practical deployment under strict latency constraints.

Methodology

2.1 Model Architecture

Penguin-VL adopts a unified three-module design:

  1. Penguin-Encoder: An LLM-based vision encoder.
  2. MLP-based Vision-Language Projector: A lightweight merger.
  3. LLM: The language model backbone.

The series is built on Qwen3 LLM backbones and is available in 2B and 8B parameter variants.

2.1.1 Penguin-Encoder: From a Text-only LLM to Vision Encoder

The vision encoder is initialized directly from a text-only LLM (Qwen3-0.6B). Key modifications:

  • Causal → Bidirectional Attention: The LLM's causal self-attention is transformed into bidirectional full attention for symmetric token interactions required for visual representation.
  • 2D Rotary Positional Embeddings (2D-RoPE): Added to support variable-resolution inputs.
  • Training Paradigm: Uses a mixed supervision strategy with LLM cross-entropy supervision and reconstruction-based objectives during initial training.
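The first modification can be illustrated with a toy attention-mask sketch (pure Python; the function names and boolean-mask representation are assumptions for illustration, not the Penguin-Encoder implementation):

```python
# Toy illustration of the causal -> bidirectional attention change.
# A causal text-LLM mask lets token i attend only to tokens j <= i;
# the vision encoder replaces it with full, symmetric attention so
# every patch token can interact with every other patch.

def causal_mask(n: int) -> list[list[bool]]:
    """Standard text-LLM mask: True where attention is allowed."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n: int) -> list[list[bool]]:
    """Vision-encoder mask: every patch token sees every other patch."""
    return [[True] * n for _ in range(n)]
```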

The reconstruction loss contains three parts:

  1. Amplitude Loss: Supervises the absolute value of features. $L_A = \frac{1}{N} \sum_N |F_s - F_t|$, where $F_s$ are encoder features and $F_t$ are teacher supervision features.
  2. Direction Loss: Aligns feature distributions via cosine similarity. $L_D = \frac{1}{N} \sum_N \text{tr}\left( \frac{F_s F_t^\top}{\|F_s\|_2 \|F_t\|_2} \right)$
  3. Relation Loss: Supervises inter-patch relationships using self-correlation similarity. $L_R = \frac{1}{N} \sum_N \left\| \frac{F_s F_s^\top}{\|F_s\|_2^2} - \frac{F_t F_t^\top}{\|F_t\|_2^2} \right\|_2$
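A plain-Python sketch of the three reconstruction terms on a toy batch of patch features. The 1 − cosine form for the direction term and the pairwise-cosine correlation for the relation term are simplifications (so that all three losses are minimized at zero); they are assumptions for illustration, not the exact implementation:

```python
import math

def l1(a, b):
    # Mean absolute difference between two feature vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def amplitude_loss(Fs, Ft):
    # L_A: mean absolute difference between student and teacher features.
    return sum(l1(s, t) for s, t in zip(Fs, Ft)) / len(Fs)

def direction_loss(Fs, Ft):
    # L_D (minimized form): 1 - mean cosine similarity across patches.
    return 1.0 - sum(cosine(s, t) for s, t in zip(Fs, Ft)) / len(Fs)

def relation_loss(Fs, Ft):
    # L_R (simplified): match patch-to-patch correlation matrices.
    def corr(F):
        return [[cosine(a, b) for b in F] for a in F]
    Cs, Ct = corr(Fs), corr(Ft)
    n = len(Fs)
    return sum((Cs[i][j] - Ct[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

When student and teacher features coincide, all three terms are zero, which is the fixed point the mixed supervision drives toward.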

2.1.2 Video Encoding and Projector

For video inputs, a Temporal Redundancy-Aware (TRA) token compression strategy is employed. It dynamically allocates token budgets by classifying frames into key frames (capturing rapid changes) and intermediate frames (providing stable context). The compression proceeds in a cascade:

  • Compression Stage 1 (Resolution Preservation): If full-resolution tokenization satisfies the budget constraint $\sum_{k \in K} T_k + \sum_{i \in I} T_i \leq T_{\text{max}}$, all frames are tokenized at native resolution. $T_k$ and $T_i$ are per-frame token counts for key and intermediate frames, respectively; $K$ and $I$ are the corresponding index sets.
  • Compression Stage 2 (Synchronous Downscaling): If the budget is exceeded, both frame types are downscaled by a continuous factor $\alpha \in (0, 1]$, i.e. $T_k \leftarrow \alpha T_k$ and $T_i \leftarrow \alpha T_i$, until the constraint is met. Intermediate frames maintain a $4\times$ spatial down-sampling relative to key frames ($T_k \approx 16 T_i$).
  • Compression Stage 3 (Saturation-Aware Scaling): When intermediate frames reach the physical lower bound $T_i = T_{\text{min}}$, compression pressure transfers exclusively to key frames until the global budget is satisfied.
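The three-stage cascade can be sketched as follows. The budget check, the synchronous factor α, and the saturation transfer follow the stages above; the function name and the continuous (non-integer) token counts are assumptions for illustration:

```python
def tra_budget(key_tokens, inter_tokens, t_max, t_min):
    """Return per-frame token budgets after the TRA cascade (sketch)."""
    total = sum(key_tokens) + sum(inter_tokens)
    if total <= t_max:                        # Stage 1: under budget,
        return key_tokens, inter_tokens       # keep native resolution
    alpha = t_max / total                     # Stage 2: synchronous downscaling
    key = [t * alpha for t in key_tokens]
    inter = [max(t * alpha, t_min) for t in inter_tokens]  # clamp at T_min
    overflow = sum(key) + sum(inter) - t_max
    if overflow > 0:                          # Stage 3: intermediates saturated,
        beta = (sum(key) - overflow) / sum(key)  # shift pressure to key frames
        key = [t * beta for t in key]
    return key, inter
```

When the clamp at `t_min` never triggers, Stage 2 alone meets the budget; Stage 3 only activates once intermediate frames hit the floor.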

The vision-language projector is a simple MLP-based feed-forward transformation that projects visual features to match the LLM hidden size.

2.2 Construction of High-Quality Multi-modal Datasets

The data construction emphasizes quality and diversity through a three-stage pipeline: source aggregation, multi-stage filtering/deduplication, and scalable automatic annotation.

  • Image Data (Penguin-Recap-I): 57.2M image-text pairs curated from COYO-700M and DataComp-1B. Uses hierarchical semantic clustering (k-means on CLIP embeddings) for diversity balancing. Each image receives a detailed long caption synthesized from structured annotations covering: global semantics, subjects, actions, spatial relationships, scene attributes, objects, OCR content, image quality, mood, and knowledge-intensive descriptions.
  • Video Data (Penguin-Recap-V & Penguin-QA): 3.7M video-text pairs from 29 public datasets. Uses clustering and motion scoring for deduplication and filtering static videos. Videos receive multi-granularity annotations at three levels:
    1. Event-level atomic descriptions: Fine-grained, timestamped captions.
    2. Chapter-level narrative syntheses: Groups events into narrative chapters.
    3. Holistic video summaries: Comprehensive interpretation of the entire video. A specialized Temporal Reasoning QA dataset is also constructed from the dense descriptions for tasks like temporal ordering and grounding.
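The diversity-balancing idea can be illustrated with a minimal sketch that caps each semantic cluster's contribution, assuming cluster labels have already been produced (e.g., by k-means over CLIP embeddings). The cap and helper name are hypothetical:

```python
from collections import defaultdict

def balance_by_cluster(samples, labels, cap_per_cluster):
    """Keep at most cap_per_cluster samples from each semantic cluster."""
    kept, counts = [], defaultdict(int)
    for sample, cluster in zip(samples, labels):
        if counts[cluster] < cap_per_cluster:
            kept.append(sample)
            counts[cluster] += 1
    return kept
```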

3 Training

The training pipeline consists of three main stages.

3.1 Data Format

Visual inputs are converted into a single token sequence for the LLM using modality-specific blocks, separators (`\n`, `,`), and absolute timestamp tags (`Time: xx s`).
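A minimal sketch of this layout for video frames, assuming placeholder token strings per frame (the exact block contents and separator placement beyond the `Time: xx s` tag are assumptions):

```python
def format_video_blocks(frame_tokens, timestamps):
    """Join per-frame token blocks, each prefixed with an absolute timestamp.

    frame_tokens: list of per-frame token strings (placeholders here);
    timestamps: per-frame times in seconds.
    """
    blocks = [f"Time: {t} s\n{tok}" for tok, t in zip(frame_tokens, timestamps)]
    return ",".join(blocks)
```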

3.2 Stage 1: Penguin-Encoder Training

A two-stage coarse-to-fine strategy is used. The language decoder remains frozen.

  • Low-Resolution Pre-training: On ~100M samples, resolution capped at 2048 visual tokens (~600×600 pixels). Uses mixed supervision with original captions and the reconstruction loss (amplitude, direction, relation) via a teacher encoder (VL3-SigLIP-NaViT). Includes substantial unlabeled chart/diagram data.
  • High-Resolution Fine-tuning: On ~47M filtered, re-captioned samples. Resolution increased to 10240 visual tokens. Reconstruction branch removed; focus on fine-grained alignment.

3.3 Stage 2: Pre-training

Goal: endow the VLM with diverse multimodal knowledge. All parameters (LLM, encoder, projector) are trainable. Data mixture contains ~121M samples.

Key Data Categories in Pre-training:

  • General Caption (64%): Core for image-text alignment and broadening visual distribution.
  • Document (14.45%): Critical for OCR, fine-grained recognition, and high-resolution processing.
  • Fine-grained (Region Caption 1.2%, Grounding 6.31%): For localization-aware perception.
    • Image Grounding: 7.7M samples merged from various datasets. Bounding boxes converted to a unified integer coordinate system [0, 1000].
    • Region Caption: 1.5M self-constructed QA pairs where questions specify a region in [0, 1000] coordinates for description/reasoning.
  • Others: OCR (2.42%), Interleaved (0.49%), Science (0.27%), Text (4.45%), Math (2.88%), Code (1.54%).
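The unified [0, 1000] coordinate convention for grounding data can be sketched as a simple rescaling independent of image size (the rounding mode is an assumption):

```python
def normalize_box(box, width, height):
    """Convert a pixel box (x1, y1, x2, y2) to integers in [0, 1000]."""
    x1, y1, x2, y2 = box
    return (round(1000 * x1 / width), round(1000 * y1 / height),
            round(1000 * x2 / width), round(1000 * y2 / height))
```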

3.4 Stage 3: Supervised Fine-Tuning (SFT)

Aligns multimodal capabilities with user intent. High-quality SFT datasets cover broad capabilities.

Image SFT Mixture (39M samples):

  • General & Caption, Text (32.6%)
  • Document, Chart & Table (20.9%)
  • OCR, Text QA (16.6%)
  • Grounding & Counting (10.1%)
  • Mathematics (8.9%)
  • Multi-image, Science (3.71%)

Video SFT Mixture (3.7M samples):

  • General Video Understanding (77.6%)
  • Action Recognition and Reasoning (12.7%)
  • Temporal Grounding and Reasoning (6.9%)
  • Ego Video Understanding (2.8%)

Empirical Validation / Results

Implementation Details & Baselines

  • Training: Cosine LR decay, 3% warm-up. Max sequence length: 16,384 tokens (10,240 for visual inputs).
  • Vision Encoder: Initialized from Qwen3-0.6B. LLM backbone: Qwen3-1.7B/8B for corresponding scales.
  • Learning Rates: Encoder training: Stage 1: $1.0 \times 10^{-3}$, Stage 2: $5.0 \times 10^{-4}$. Pre-training: $1.0 \times 10^{-4}$. SFT: $1.0 \times 10^{-5}$ for all components.
  • Video Processing: Visual tokens downsampled by a factor of 2 after the encoder. Max frames (`max_frames`) capped at 180.
  • Baselines: For 2B: Gemma3n-E2B-it, SmolVLM2, InternVL3.5-2B, Qwen3-VL-2B. For 8B: corresponding variants and GPT-5-nano.
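The schedule named above (cosine LR decay with 3% linear warm-up) can be sketched as follows; the step counts and peak learning rate in the test are examples, not values from the paper:

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.03):
    """Linear warm-up over the first 3% of steps, then cosine decay to 0."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```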

4.3 Image Benchmarks

Evaluated across three dimensions: Document/Chart/OCR, Mathematical/Logical Reasoning, and Multi-image/General Knowledge.

Key Results Tables

Table 1: Results comparison for 2B model variants.

| Benchmark Category | Specific Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Chart/OCR/Doc | InfoVQA | **77.8** | 72.4 | 70.8 | 51.9 | 43.0 |
| | ChartQA | **86.6** | 76.9 | 80.7 | 65.8 | 68.7 |
| | DocVQA | **94.1** | 93.3 | 89.4 | 78.4 | 80.0 |
| | CharXiv DQ/RQ | **66.4/35.8** | 62.3/26.8 | 65.0/31.6 | 60.1/27.0 | 36.9/15.5 |
| | OCRBench | 810 | **858** | 836 | 700 | 729 |
| General Knowledge/Multi-Image | AI2D | **80.7** | 76.9 | 78.8 | 74.6 | 70.0 |
| | RealWorldQA | **70.2** | 63.9 | 62.0 | 59.9 | 58.3 |
| | V-star | **83.8** | 74.9 | 69.1 | 46.0 | 51.8 |
| | MMMU-Pro | 31.4 | **36.5** | 31.6 | 28.0 | 20.1 |
| | BLINK | 51.7 | **53.8** | 36.6 | 44.1 | 44.0 |
| Math | MathVista | **67.3** | 61.3 | 60.8 | 50.4 | 51.5 |
| | MathVerse | 35.9 | **52.1** | 39.6 | 22.5 | 21.5 |
| | LogicVista | 41.3 | 35.8 | **47.7** | 33.9 | 24.8 |

Best result in each row in bold.

Table 2: Results comparison for 8B model variants.

| Benchmark Category | Specific Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL-3.5 8B | OpenAI GPT-5 nano |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Chart/OCR/Doc | InfoVQA | **86.8** | 83.1 | 79.1 | 49.2 |
| | ChartQA | **90.5** | 89.6 | 86.7 | 48.6 |
| | DocVQA | **96.2** | 96.1 | 92.3 | 78.3 |
| | CharXiv DQ/RQ | 75.7/40.0 | **83.0/46.4** | 72.2/44.4 | 64.4/31.7 |
| | OCRBench | 852 | **896** | 840 | 701 |
| General Knowledge/Multi-image | AI2D | **86.1** | 85.7 | 84.0 | 65.7 |
| | RealWorldQA | **75.8** | 71.5 | 67.5 | 60.7 |
| | V-star | **90.2** | 90.1 | 70.7 | 63.4 |
| | MMMU-Pro | 40.2 | **55.9** | 39.7 | 36.5 |
| | BLINK | 58.2 | **69.1** | 59.5 | 42.2 |
| Math | MathVista | **77.4** | 77.2 | 74.2 | 40.9 |
| | MathVerse | 50.8 | **62.1** | 55.8 | 27.0 |
| | LogicVista | 53.8 | 55.3 | **57.3** | 40.5 |

Best result in each row in bold.

Summary of Image Results:

  • 2B Model: Penguin-VL achieves leading performance on most Chart/Document and General Knowledge benchmarks. It is best on MathVista but faces competition on other math/logic tasks, suggesting a possible limitation from comparatively limited math-focused SFT data.
  • 8B Model: Penguin-VL-8B is a frontrunner in its weight class, demonstrating remarkable command over dense visual information (DocVQA: 96.2, ChartQA: 90.5) and leading in general knowledge/scientific reasoning (AI2D, V-star). It maintains a lead in foundational math (MathVista: 77.4) but is outperformed on abstract logic/multi-step deduction (MathVerse, LogicVista).

4.4 Video Benchmarks

Evaluated on General Video Understanding, Long-form Comprehension, and Temporal Reasoning.

Video Results (from Tables 1 & 2):

  • 2B Model: Penguin-VL demonstrates robust reasoning, achieving top performance on EgoSchema (57.6), ActivityNetQA (61.5), Perception Test (70.4), and ties on MMVU (42.7). It excels in long-form comprehension (LongVideoBench: 59.5) and temporal reasoning (NextQA: 79.9, Charades-STA: 56.2).
  • 8B Model: Penguin-VL-8B claims the top spot in most metrics. It excels in long-context tasks (LongVideoBench: 67.0, NextQA: 85.4) and general understanding (ActivityNetQA: 65.2, Perception Test: 78.0). It shows precise temporal grounding (Charades-STA: 61.4).

4.5 Ablation Study

Table 3: Ablation of Penguin-Encoder and comparison for LMM integration.

| Training Stages and Configuration | Evaluation Results (Avg of 5 benchmarks) |
| :--- | :--- |