Penguin-VL Technical Report Summary

Summary (Overview)

  • Core Innovation: Introduces Penguin-VL, a compact Vision Language Model (VLM) that challenges the prevailing paradigm of using contrastive-pretrained vision encoders (e.g., CLIP/SigLIP). Instead, its vision encoder (Penguin-Encoder) is initialized directly from a text-only Large Language Model (LLM).
  • Key Finding: Identifies an objective mismatch where contrastive learning, optimized for discrimination, enforces coarse invariances that suppress fine-grained visual cues needed for dense captioning and complex reasoning. LLM-based initialization provides superior visual fidelity and data efficiency.
  • Methodological Contributions: Presents a holistic framework including: 1) A novel LLM-based vision encoder and pretraining strategy, 2) A tailored VLM training recipe with a Temporal Redundancy-Aware (TRA) token compression mechanism for videos, and 3) Comprehensive data curation pipelines for images and videos.
  • Performance: Achieves performance comparable to or surpassing leading VLMs (e.g., Qwen3-VL) across various image and video benchmarks, particularly excelling in document understanding, visual knowledge, and multi-perspective video reasoning, while maintaining a lightweight (2B/8B) architecture.
  • Primary Driver: Demonstrates that improved visual representation—rather than model scaling—is the primary driver of performance for efficient VLMs.

Introduction and Theoretical Foundation

The development of Vision Language Models (VLMs) has largely relied on scaling model size, hindering deployment on compute-constrained devices like smartphones and robots. This work explores the performance limits of compact VLMs (2B and 8B).

The paper challenges the standard practice of initializing vision encoders via massive contrastive pretraining (CLIP/SigLIP). It argues there is an objective mismatch: contrastive learning is optimized for discrimination, enforcing coarse, category-level invariances that suppress the fine-grained visual cues essential for tasks like dense captioning and complex VLM reasoning.

The proposed solution is Penguin-VL, which initializes its vision encoder from a text-only LLM. This approach leverages the rich semantic priors and reasoning capabilities already embedded in LLMs, and adopts their efficient transformer architecture, which is natively optimized for scalable, dense sequence modeling. This perspective is inspired by parallel successes in speech modeling where text-only LLMs are fine-tuned to process continuous speech signals.

The core motivation is to build a compact vision-centric multimodal foundation model with consistently strong capability across both images and videos, targeting practical deployment under strict latency constraints.

Methodology

2.1 Model Architecture

Penguin-VL adopts a unified three-module design:

  1. Penguin-Encoder: An LLM-based vision encoder.
  2. MLP-based Vision-Language Projector: A lightweight merger.
  3. LLM: The language model backbone.

The series is built on Qwen3 LLM backbones and is available in 2B and 8B parameter variants.

2.1.1 Penguin-Encoder: From a Text-only LLM to Vision Encoder

The vision encoder is initialized directly from a text-only LLM (Qwen3-0.6B). Key modifications:

  • Causal → Bidirectional Attention: The LLM's causal self-attention is transformed into bidirectional full attention for symmetric token interactions required for visual representation.
  • 2D Rotary Positional Embeddings (2D-RoPE): Added to support variable-resolution inputs.
  • Training Paradigm: Uses a mixed supervision strategy with LLM cross-entropy supervision and reconstruction-based objectives during initial training.
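The first modification can be illustrated with a toy attention-mask sketch (pure Python; the function names and boolean-mask representation are assumptions for illustration, not the Penguin-Encoder implementation):

```python
# Toy illustration of the causal -> bidirectional attention change.
# A causal text-LLM mask lets token i attend only to tokens j <= i;
# the vision encoder replaces it with full, symmetric attention so
# every patch token can interact with every other patch.

def causal_mask(n: int) -> list[list[bool]]:
    """Standard text-LLM mask: True where attention is allowed."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n: int) -> list[list[bool]]:
    """Vision-encoder mask: every patch token sees every other patch."""
    return [[True] * n for _ in range(n)]
```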

The reconstruction loss contains three parts:

  1. Amplitude Loss: Supervises the absolute value of features. $L_A = \frac{1}{N} \sum_N |F_s - F_t|$, where $F_s$ are encoder features and $F_t$ are teacher supervision features.
  2. Direction Loss: Aligns feature distributions via cosine similarity. $L_D = \frac{1}{N} \sum_N \text{tr}\left( \frac{F_s F_t^\top}{\|F_s\|_2 \|F_t\|_2} \right)$
  3. Relation Loss: Supervises inter-patch relationships using self-correlation similarity. $L_R = \frac{1}{N} \sum_N \left\| \frac{F_s F_s^\top}{\|F_s\|_2^2} - \frac{F_t F_t^\top}{\|F_t\|_2^2} \right\|_2$
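A plain-Python sketch of the three reconstruction terms on a toy batch of patch features. The 1 − cosine form for the direction term and the pairwise-cosine correlation for the relation term are simplifications (so that all three losses are minimized at zero); they are assumptions for illustration, not the exact implementation:

```python
import math

def l1(a, b):
    # Mean absolute difference between two feature vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def amplitude_loss(Fs, Ft):
    # L_A: mean absolute difference between student and teacher features.
    return sum(l1(s, t) for s, t in zip(Fs, Ft)) / len(Fs)

def direction_loss(Fs, Ft):
    # L_D (minimized form): 1 - mean cosine similarity across patches.
    return 1.0 - sum(cosine(s, t) for s, t in zip(Fs, Ft)) / len(Fs)

def relation_loss(Fs, Ft):
    # L_R (simplified): match patch-to-patch correlation matrices.
    def corr(F):
        return [[cosine(a, b) for b in F] for a in F]
    Cs, Ct = corr(Fs), corr(Ft)
    n = len(Fs)
    return sum((Cs[i][j] - Ct[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

When student and teacher features coincide, all three terms are zero, which is the fixed point the mixed supervision drives toward.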

2.1.2 Video Encoding and Projector

For video inputs, a Temporal Redundancy-Aware (TRA) token compression strategy is employed. It dynamically allocates token budgets by classifying frames into key frames (capturing rapid changes) and intermediate frames (providing stable context). The compression proceeds in a cascade:

  • Compression Stage 1 (Resolution Preservation): If full-resolution tokenization satisfies the budget constraint $\sum_{k \in K} T_k + \sum_{i \in I} T_i \leq T_{\text{max}}$, all frames are tokenized at native resolution. $T_k$ and $T_i$ are per-frame token counts for key and intermediate frames, respectively; $K$ and $I$ are the corresponding index sets.
  • Compression Stage 2 (Synchronous Downscaling): If the budget is exceeded, both frame types are downscaled by a continuous factor $\alpha \in (0, 1]$, i.e. $T_k \leftarrow \alpha T_k$ and $T_i \leftarrow \alpha T_i$, until the constraint is met. Intermediate frames maintain a $4\times$ spatial down-sampling relative to key frames ($T_k \approx 16 T_i$).
  • Compression Stage 3 (Saturation-Aware Scaling): When intermediate frames reach the physical lower bound $T_i = T_{\text{min}}$, compression pressure transfers exclusively to key frames until the global budget is satisfied.
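The three-stage cascade can be sketched as follows. The budget check, the synchronous factor α, and the saturation transfer follow the stages above; the function name and the continuous (non-integer) token counts are assumptions for illustration:

```python
def tra_budget(key_tokens, inter_tokens, t_max, t_min):
    """Return per-frame token budgets after the TRA cascade (sketch)."""
    total = sum(key_tokens) + sum(inter_tokens)
    if total <= t_max:                        # Stage 1: under budget,
        return key_tokens, inter_tokens       # keep native resolution
    alpha = t_max / total                     # Stage 2: synchronous downscaling
    key = [t * alpha for t in key_tokens]
    inter = [max(t * alpha, t_min) for t in inter_tokens]  # clamp at T_min
    overflow = sum(key) + sum(inter) - t_max
    if overflow > 0:                          # Stage 3: intermediates saturated,
        beta = (sum(key) - overflow) / sum(key)  # shift pressure to key frames
        key = [t * beta for t in key]
    return key, inter
```

When the clamp at `t_min` never triggers, Stage 2 alone meets the budget; Stage 3 only activates once intermediate frames hit the floor.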

The vision-language projector is a simple MLP-based feed-forward transformation that projects visual features to match the LLM hidden size.

2.2 Construction of High-Quality Multi-modal Datasets

The data construction emphasizes quality and diversity through a three-stage pipeline: source aggregation, multi-stage filtering/deduplication, and scalable automatic annotation.

  • Image Data (Penguin-Recap-I): 57.2M image-text pairs curated from COYO-700M and DataComp-1B. Uses hierarchical semantic clustering (k-means on CLIP embeddings) for diversity balancing. Each image receives a detailed long caption synthesized from structured annotations covering: global semantics, subjects, actions, spatial relationships, scene attributes, objects, OCR content, image quality, mood, and knowledge-intensive descriptions.
  • Video Data (Penguin-Recap-V & Penguin-QA): 3.7M video-text pairs from 29 public datasets. Uses clustering and motion scoring for deduplication and filtering static videos. Videos receive multi-granularity annotations at three levels:
    1. Event-level atomic descriptions: Fine-grained, timestamped captions.
    2. Chapter-level narrative syntheses: Groups events into narrative chapters.
    3. Holistic video summaries: Comprehensive interpretation of the entire video. A specialized Temporal Reasoning QA dataset is also constructed from the dense descriptions for tasks like temporal ordering and grounding.
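The diversity-balancing idea can be illustrated with a minimal sketch that caps each semantic cluster's contribution, assuming cluster labels have already been produced (e.g., by k-means over CLIP embeddings). The cap and helper name are hypothetical:

```python
from collections import defaultdict

def balance_by_cluster(samples, labels, cap_per_cluster):
    """Keep at most cap_per_cluster samples from each semantic cluster."""
    kept, counts = [], defaultdict(int)
    for sample, cluster in zip(samples, labels):
        if counts[cluster] < cap_per_cluster:
            kept.append(sample)
            counts[cluster] += 1
    return kept
```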

3 Training

The training pipeline consists of three main stages.

3.1 Data Format

Visual inputs are converted into a single token sequence for the LLM using modality-specific blocks, separators (`\n`, `,`), and absolute timestamp tags (`Time: xx s`).
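A minimal sketch of this layout for video frames, assuming placeholder token strings per frame (the exact block contents and separator placement beyond the `Time: xx s` tag are assumptions):

```python
def format_video_blocks(frame_tokens, timestamps):
    """Join per-frame token blocks, each prefixed with an absolute timestamp.

    frame_tokens: list of per-frame token strings (placeholders here);
    timestamps: per-frame times in seconds.
    """
    blocks = [f"Time: {t} s\n{tok}" for tok, t in zip(frame_tokens, timestamps)]
    return ",".join(blocks)
```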

3.2 Stage 1: Penguin-Encoder Training

A two-stage coarse-to-fine strategy is used. The language decoder remains frozen.

  • Low-Resolution Pre-training: On ~100M samples, resolution capped at 2048 visual tokens (~600×600 pixels). Uses mixed supervision with original captions and the reconstruction loss (amplitude, direction, relation) via a teacher encoder (VL3-SigLIP-NaViT). Includes substantial unlabeled chart/diagram data.
  • High-Resolution Fine-tuning: On ~47M filtered, re-captioned samples. Resolution increased to 10240 visual tokens. Reconstruction branch removed; focus on fine-grained alignment.

3.3 Stage 2: Pre-training

Goal: endow the VLM with diverse multimodal knowledge. All parameters (LLM, encoder, projector) are trainable. Data mixture contains ~121M samples.

Key Data Categories in Pre-training:

  • General Caption (64%): Core for image-text alignment and broadening visual distribution.
  • Document (14.45%): Critical for OCR, fine-grained recognition, and high-resolution processing.
  • Fine-grained (Region Caption 1.2%, Grounding 6.31%): For localization-aware perception.
    • Image Grounding: 7.7M samples merged from various datasets. Bounding boxes converted to a unified integer coordinate system [0, 1000].
    • Region Caption: 1.5M self-constructed QA pairs where questions specify a region in [0, 1000] coordinates for description/reasoning.
  • Others: OCR (2.42%), Interleaved (0.49%), Science (0.27%), Text (4.45%), Math (2.88%), Code (1.54%).
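The unified [0, 1000] coordinate convention for grounding data can be sketched as a simple rescaling independent of image size (the rounding mode is an assumption):

```python
def normalize_box(box, width, height):
    """Convert a pixel box (x1, y1, x2, y2) to integers in [0, 1000]."""
    x1, y1, x2, y2 = box
    return (round(1000 * x1 / width), round(1000 * y1 / height),
            round(1000 * x2 / width), round(1000 * y2 / height))
```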

3.4 Stage 3: Supervised Fine-Tuning (SFT)

Aligns multimodal capabilities with user intent. High-quality SFT datasets cover broad capabilities.

Image SFT Mixture (39M samples):

  • General & Caption, Text (32.6%)
  • Document, Chart & Table (20.9%)
  • OCR, Text QA (16.6%)
  • Grounding & Counting (10.1%)
  • Mathematics (8.9%)
  • Multi-image, Science (3.71%)

Video SFT Mixture (3.7M samples):

  • General Video Understanding (77.6%)
  • Action Recognition and Reasoning (12.7%)
  • Temporal Grounding and Reasoning (6.9%)
  • Ego Video Understanding (2.8%)

Empirical Validation / Results

Implementation Details & Baselines

  • Training: Cosine LR decay, 3% warm-up. Max sequence length: 16,384 tokens (10,240 for visual inputs).
  • Vision Encoder: Initialized from Qwen3-0.6B. LLM backbone: Qwen3-1.7B/8B for corresponding scales.
  • Learning Rates: Encoder training: Stage 1: $1.0 \times 10^{-3}$, Stage 2: $5.0 \times 10^{-4}$. Pre-training: $1.0 \times 10^{-4}$. SFT: $1.0 \times 10^{-5}$ for all components.
  • Video Processing: Visual tokens downsampled by a factor of 2 after the encoder. Max frames (`max_frames`) capped at 180.
  • Baselines: For 2B: Gemma3n-E2B-it, SmolVLM2, InternVL3.5-2B, Qwen3-VL-2B. For 8B: corresponding variants and GPT-5-nano.
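The schedule named above (cosine LR decay with 3% linear warm-up) can be sketched as follows; the step counts and peak learning rate in the test are examples, not values from the paper:

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.03):
    """Linear warm-up over the first 3% of steps, then cosine decay to 0."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```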

4.3 Image Benchmarks

Evaluated across three dimensions: Document/Chart/OCR, Mathematical/Logical Reasoning, and Multi-image/General Knowledge.

Key Results Tables

Table 1: Results comparison for 2B model variants.

| Benchmark Category | Specific Benchmark | Penguin-VL 2B | Qwen3-VL 2B | InternVL3.5 2B | Gemma3n E2B-it | SmolVLM2 2.2B |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Chart/OCR/Doc | InfoVQA | **77.8** | 72.4 | 70.8 | 51.9 | 43.0 |
| | ChartQA | **86.6** | 76.9 | 80.7 | 65.8 | 68.7 |
| | DocVQA | **94.1** | 93.3 | 89.4 | 78.4 | 80.0 |
| | CharXiv DQ/RQ | **66.4/35.8** | 62.3/26.8 | 65.0/31.6 | 60.1/27.0 | 36.9/15.5 |
| | OCRBench | 810 | **858** | 836 | 700 | 729 |
| General Knowledge/Multi-Image | AI2D | **80.7** | 76.9 | 78.8 | 74.6 | 70.0 |
| | RealWorldQA | **70.2** | 63.9 | 62.0 | 59.9 | 58.3 |
| | V-star | **83.8** | 74.9 | 69.1 | 46.0 | 51.8 |
| | MMMU-Pro | 31.4 | **36.5** | 31.6 | 28.0 | 20.1 |
| | BLINK | 51.7 | **53.8** | 36.6 | 44.1 | 44.0 |
| Math | MathVista | **67.3** | 61.3 | 60.8 | 50.4 | 51.5 |
| | MathVerse | 35.9 | **52.1** | 39.6 | 22.5 | 21.5 |
| | LogicVista | 41.3 | 35.8 | **47.7** | 33.9 | 24.8 |

Best result in each row in bold.

Table 2: Results comparison for 8B model variants.

| Benchmark Category | Specific Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL-3.5 8B | OpenAI GPT-5 nano |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Chart/OCR/Doc | InfoVQA | **86.8** | 83.1 | 79.1 | 49.2 |
| | ChartQA | **90.5** | 89.6 | 86.7 | 48.6 |
| | DocVQA | **96.2** | 96.1 | 92.3 | 78.3 |
| | CharXiv DQ/RQ | 75.7/40.0 | **83.0/46.4** | 72.2/44.4 | 64.4/31.7 |
| | OCRBench | 852 | **896** | 840 | 701 |
| General Knowledge/Multi-image | AI2D | **86.1** | 85.7 | 84.0 | 65.7 |
| | RealWorldQA | **75.8** | 71.5 | 67.5 | 60.7 |
| | V-star | **90.2** | 90.1 | 70.7 | 63.4 |
| | MMMU-Pro | 40.2 | **55.9** | 39.7 | 36.5 |
| | BLINK | 58.2 | **69.1** | 59.5 | 42.2 |
| Math | MathVista | **77.4** | 77.2 | 74.2 | 40.9 |
| | MathVerse | 50.8 | **62.1** | 55.8 | 27.0 |
| | LogicVista | 53.8 | 55.3 | **57.3** | 40.5 |

Best result in each row in bold.

Summary of Image Results:

  • 2B Model: Penguin-VL achieves leading performance on most Chart/Document and General Knowledge benchmarks. It is best on MathVista but faces competition on other math/logic tasks, suggesting a possible limitation from comparatively limited math-focused SFT data.
  • 8B Model: Penguin-VL-8B is a frontrunner in its weight class, demonstrating remarkable command over dense visual information (DocVQA: 96.2, ChartQA: 90.5) and leading in general knowledge/scientific reasoning (AI2D, V-star). It maintains a lead in foundational math (MathVista: 77.4) but is outperformed on abstract logic/multi-step deduction (MathVerse, LogicVista).

4.4 Video Benchmarks

Evaluated on General Video Understanding, Long-form Comprehension, and Temporal Reasoning.

Video Results (from Tables 1 & 2):

  • 2B Model: Penguin-VL demonstrates robust reasoning, achieving top performance on EgoSchema (57.6), ActivityNetQA (61.5), Perception Test (70.4), and ties on MMVU (42.7). It excels in long-form comprehension (LongVideoBench: 59.5) and temporal reasoning (NextQA: 79.9, Charades-STA: 56.2).
  • 8B Model: Penguin-VL-8B claims the top spot in most metrics. It excels in long-context tasks (LongVideoBench: 67.0, NextQA: 85.4) and general understanding (ActivityNetQA: 65.2, Perception Test: 78.0). It shows precise temporal grounding (Charades-STA: 61.4).

4.5 Ablation Study

Table 3: Ablation of Penguin-Encoder and comparison for LMM integration.

| Training Stages and Configuration | Evaluation Results (Avg of 5 benchmarks) |
| :--- | :--- |