EXAONE 4.5 Technical Report: Summary

Summary (Overview)

First Open-Weight VLM from LG AI Research: EXAONE 4.5 is LG's inaugural open-weight Vision-Language Model (VLM), designed for industrial intelligence by integrating a 1.2B-parameter vision encoder into the existing 32B EXAONE 4.0 language model backbone.
Targeted Training for Document & Korean Excellence: The model is trained on a large-scale, carefully curated dataset emphasizing document-centric corpora and specialized Korean multimodal content, leading to superior performance in document understanding and Korean contextual reasoning tasks.
Extended Multimodal Context & Multilingual Support: The model supports a context length of up to 256K tokens and processes six languages (Korean, English, Spanish, German, Japanese, Vietnamese), facilitating long-context reasoning and enterprise-scale applications.
Competitive Performance Across Benchmarks: Evaluations show EXAONE 4.5 achieves competitive results on general benchmarks while outperforming state-of-the-art models of similar scale in key areas like document understanding (e.g., AI2D, CharXiv), mathematical reasoning (e.g., MATH-VISION, WE-MATH), and Korean-specific tasks.
Architectural Innovations for Efficiency: Key design choices include using Grouped Query Attention (GQA) in the vision encoder, 2D Rotary Positional Embedding (2D RoPE) for spatial understanding, and the Multi-Token Prediction (MTP) module to enhance decoding throughput and computational efficiency.

Introduction and Theoretical Foundation

The EXAONE foundation model series has been engineered to address complex challenges in real-world industrial environments. Prior iterations focused on language (EXAONE 3.0, 3.5) and specialized reasoning (EXAONE Deep). EXAONE 4.0 introduced a hybrid LLM with dual NON-REASONING and REASONING modes.

EXAONE 4.5 advances this paradigm by introducing native visual comprehension, marking LG's first open-weight VLM. The core motivation is to bridge advanced language processing with visual perception to enhance AI's practical problem-solving capabilities in industrial settings (e.g., quality control via visual feed analysis, cross-referencing technical manuals and blueprints). This multimodal proficiency is positioned as a critical stepping stone towards future Vision-Language-Action (VLA) models capable of autonomous interaction in physical environments.

Methodology

Model Configurations

The architecture integrates a custom-built, from-scratch 1.2B-parameter vision encoder with the EXAONE 4.0 32B language model. To handle the high volume of visual tokens from high-resolution images without aggressive truncation, a large-scale vision encoder is used instead of smaller alternatives.

Efficiency Mechanisms: Grouped Query Attention (GQA) is employed in both the vision encoder and language decoder to reduce attention complexity and improve hardware utilization.
Positional Encoding: The vision encoder uses 2D Rotary Positional Embedding (2D RoPE) to capture image spatial structure, while the language model retains standard 1D RoPE.
Throughput Enhancement: The Multi-Token Prediction (MTP) module from K-EXAONE is incorporated to improve decoding throughput.
Tokenizer: The enhanced multilingual tokenizer from K-EXAONE is reused.

Pre-training

The pre-training pipeline is structured into two sequential stages:

Stage 1: Foundational Modality Alignment

Objective: End-to-end joint training of vision encoder, merger, and LLM.
Data Mix: General image-text pairs, interleaved image-text documents, document understanding datasets, OCR-centric samples, and text-only data to preserve language capabilities.

Stage 2: Perceptual and Knowledge Refinement

Objective: Refine the model's understanding of structured, high-density information.
Data Mix: Increased proportion of grounding, document, OCR-centric data, plus knowledge, mathematics, and STEM domain datasets.

The training configuration is summarized below:

Stage	Training Modules	Image Tokens	Text Tokens	Sequence Length	Amount of computation (FLOPs)
Stage 1	All	420B	400B	8K	$1.57 \times 10^{23}$
Stage 2	All	225B	-	8K	$6.43 \times 10^{22}$

Pre-Training Data Curation: The data mixture is meticulously crafted across several domains:

Image Caption Data: Korean-English bilingual pairs enhanced via a synthetic pipeline for richness, including task-oriented images (math, charts, documents).
Interleaved Image-Text Data: Filtered web content upsampled for high information density and STEM relevance.
OCR and Documents: Synthetic and real datasets at character/word/document level, with documents parsed into structured formats (HTML, Markdown, JSON).
Grounding and Counting: Data for spatial intelligence, with object locations as normalized bounding boxes $[x_1, y_1, x_2, y_2]$ scaled to $[0, 1000]$ .
STEM and Reasoning: Search-based synthesis pipeline for complex academic content (math graphs, engineering diagrams) coupled with Long Chain-of-Thought (CoT) data.
Korean Specific: Specialized corpus from Korean sources (Korea Tourism Organization, IT/Game Donga) for cultural and linguistic nuances, with text-to-vision augmentation for academic problems.

Context Length Extension

A maximum context length of 256K tokens is achieved by integrating context extension directly into the Supervised Fine-Tuning (SFT) stage, leveraging the stable 128K-capable base LLM as a prior. Context Parallelism is used to manage computational complexity.

Post-training

1. Supervised Fine-Tuning (SFT): A high-quality, domain-organized dataset covering multimodal and text-only tasks, supporting all six languages and both NON-REASONING and REASONING modes. A multi-stage curriculum is used for progressive capability strengthening.

2. Offline Preference Optimization: Applied in a multi-stage framework with tailored objectives (OCR, chart understanding, safety, etc.). Different loss functions are used for vision and text tasks:

For vision tasks, L_DPO is used for stable optimization with a reference model: $L_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim D} \left[ \log \sigma \left( \frac{1}{\beta} \left( \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)} - \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)} \right) \right) \right].$
For text tasks, L_GROUPER ( $G=4$ ) is used to leverage datasets with multiple rejected responses: $L_{\text{GROUPER}}(\theta) = -\mathbb{E}_{(x, y_i, ..., y_G) \sim D} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( A_i \cdot \exp\left( \frac{1}{|y_i|} \log \pi_\theta(y_i \mid x) \right) \right) \right],$ where $z_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}$ and $A_i = 2 \cdot \frac{z_i - \min(\{z_j\}_{j=1}^G)}{\max(\{z_j\}_{j=1}^G) - \min(\{z_j\}_{j=1}^G)} - 1$ .

3. Reinforcement Learning: Joint multimodal RL is conducted on text (math, coding, knowledge) and vision (STEM, charts, OCR) tasks. GRPO with the IcePop setting is used for policy optimization.

Empirical Validation / Results

The model is evaluated on a comprehensive suite of vision and language benchmarks.

Vision Benchmarks Results

EXAONE 4.5 demonstrates competitive and balanced performance across four vision categories (STEM/Puzzle, Document Understanding, General, Korean). Key comparative results are shown in Table 2.

Table 2: Main evaluation results of EXAONE 4.5 REASONING mode on vision benchmarks.

Model	EXAONE 4.5 33B	GPT-5 mini	Qwen3-VL 32B	Qwen3-VL 235B	Qwen3.5 27B
Architecture	Dense	-	Dense	MoE	Dense
# Total Params	33B	-	33B	236B	27B
# Activated Params	33B	-	33B	23B	27B
STEM / Puzzle
MMMU	78.7	79.0	78.1	80.6	82.3
MMMU-PRO	68.6	67.3	68.1	69.3	75.0
MATH-VISION	75.2	71.9	70.2	74.6	86.0
WE-MATH	79.1	70.3	71.6	74.8	84.0
Document Understanding
AI2D	89.0	88.2	88.9	89.2	92.9
CharXiv (RQ)	71.7	68.6	65.2	66.1	79.5
OmniDocBench V1.5	81.2	77.0	83.1	84.5	88.9
General
BLINK	68.8	67.7	68.5	67.1	71.6
Korean
KMMMU	42.7	42.6	37.8	42.1	51.7

Highlights: EXAONE 4.5 frequently outperforms the much larger Qwen3-VL-235B (e.g., on MATH-VISION, WE-MATH, CharXiv) and the strong closed-weight GPT-5 mini (e.g., on MMMU-PRO, MATH-VISION, AI2D, OmniDocBench), demonstrating its efficiency and targeted capability.

Language Benchmarks Results

EXAONE 4.5 shows particular strength in core reasoning and coding tasks, while remaining competitive in agentic tool use and instruction following.

Table 3: Main evaluation results of EXAONE 4.5 REASONING mode on language benchmarks.

Model	EXAONE 4.5 33B	K-EXAONE 236B	GPT-5 mini	Qwen3-VL 235B	Qwen3.5 27B
Architecture	Dense	MoE	-	MoE	Dense
# Total Params	33B	236B	-	236B	27B
# Activated Params	32B	23B	-	22B	27B
Reasoning
AIME 2026	92.6	92.2	92.4	89.4	93.2
LiveCodeBench V6	81.4	80.7	78.1	70.1	80.7
Agentic Tool Use
τ²-BENCH (Retail)	77.9	78.6	78.3	67.0	84.7
τ²-BENCH (Weighted Avg)	72.0	-	-	57.0	-
Instruction Following
IFBENCH	62.6	67.3	74.0	59.2	76.5
IFEVAL	89.6	89.7	92.8	88.2	95.0

Highlights: The model achieves top scores on LiveCodeBench V6 and strong performance on AIME 2026. It substantially outperforms Qwen3-VL-235B on agentic tool use (τ²-BENCH weighted average: 72.0 vs. 57.0) and instruction following benchmarks.

Theoretical and Practical Implications

Industrial Problem-Solving: EXAONE 4.5 is designed as a practical engine for demanding industrial environments, enabling applications like automated quality control, technical documentation analysis, and operational diagnostics through its native multimodal understanding.
Advancement in VLM Design: The report demonstrates the efficacy of architectural choices like large-scale vision encoders, GQA for efficiency, and integrated long-context extension, contributing to the field's knowledge on building performant and efficient VLMs.
Foundation for Future Systems: By establishing robust visual and logical foundations, EXAONE 4.5 serves as a critical milestone towards the development of Vision-Language-Action (VLA) models for autonomous physical interaction.
Community and Research Impact: As an open-weight model, EXAONE 4.5 aims to accelerate community-driven research, foster innovation, and contribute to the next generation of AI systems, aligning with LG's vision of "AI for a better life."

Conclusion

EXAONE 4.5 successfully bridges advanced reasoning with visual comprehension, establishing LG's first open-weight VLM. Through architectural innovations (1.2B vision encoder, GQA, 2D RoPE, MTP) and a rigorous, multi-stage training pipeline focused on document and Korean data, the model acquires robust multimodal capabilities. It achieves a stable 256K token context and demonstrates highly competitive, state-of-the-art performance across a wide range of vision and language benchmarks, often outperforming larger or closed-weight models in complex domains like mathematical reasoning and document parsing. Released under a non-commercial research license (EXAONE AI Model License Agreement 1.2 - NC), EXAONE 4.5 is positioned as a powerful tool for industrial intelligence and a foundational step towards more advanced autonomous AI systems.