# EXAONE 4.5 Technical Report

> EXAONE 4.5 is LG's first open-weight VLM that excels at document understanding and Korean tasks by integrating a 1.2B vision encoder with a 32B language model.

- **Source:** [arXiv](https://arxiv.org/abs/2604.08644)
- **Published:** 2026-04-14
- **Permalink:** https://picx.dev/p/9baz2Q
- **Whiteboard:** https://picx.dev/p/9baz2Q/image

## Summary (Overview)
*   **First Open-Weight VLM from LG AI Research:** EXAONE 4.5 is LG's inaugural open-weight Vision-Language Model (VLM), designed for industrial intelligence by integrating a 1.2B-parameter vision encoder into the existing 32B EXAONE 4.0 language model backbone.
*   **Targeted Training for Document & Korean Excellence:** The model is trained on a large-scale, carefully curated dataset emphasizing document-centric corpora and specialized Korean multimodal content, leading to superior performance in document understanding and Korean contextual reasoning tasks.
*   **Extended Multimodal Context & Multilingual Support:** The model supports a context length of up to 256K tokens and processes six languages (Korean, English, Spanish, German, Japanese, Vietnamese), facilitating long-context reasoning and enterprise-scale applications.
*   **Competitive Performance Across Benchmarks:** Evaluations show EXAONE 4.5 achieves competitive results on general benchmarks while outperforming Qwen3-VL 32B and GPT-5 mini in key areas like document understanding (e.g., AI2D, CharXiv), mathematical reasoning (e.g., MATH-VISION, WE-MATH), and Korean-specific tasks (KMMMU). Qwen3.5 27B scores higher on these benchmarks.
*   **Architectural Innovations for Efficiency:** Key design choices include using Grouped Query Attention (GQA) in the vision encoder, 2D Rotary Positional Embedding (2D RoPE) for spatial understanding, and the Multi-Token Prediction (MTP) module to enhance decoding throughput and computational efficiency.

## Introduction and Theoretical Foundation
The EXAONE foundation model series has been engineered to address complex challenges in real-world industrial environments. Prior iterations focused on language (EXAONE 3.0, 3.5) and specialized reasoning (EXAONE Deep). EXAONE 4.0 introduced a hybrid LLM with dual **NON-REASONING** and **REASONING** modes.

EXAONE 4.5 advances this paradigm by introducing native visual comprehension, marking LG's first open-weight VLM. The core motivation is to bridge advanced language processing with visual perception to enhance AI's practical problem-solving capabilities in industrial settings (e.g., quality control via visual feed analysis, cross-referencing technical manuals and blueprints). This multimodal proficiency is positioned as a critical stepping stone towards future Vision-Language-Action (VLA) models capable of autonomous interaction in physical environments.

## Methodology

### Model Configurations
The architecture integrates a custom-built, from-scratch **1.2B-parameter vision encoder** with the **EXAONE 4.0 32B language model**. To handle the high volume of visual tokens from high-resolution images without aggressive truncation, a large-scale vision encoder is used instead of smaller alternatives.

*   **Efficiency Mechanisms:** **Grouped Query Attention (GQA)**, in which groups of query heads share key/value heads, is employed in both the vision encoder and language decoder to reduce attention memory and compute costs and improve hardware utilization.
*   **Positional Encoding:** The vision encoder uses **2D Rotary Positional Embedding (2D RoPE)** to capture image spatial structure, while the language model retains standard 1D RoPE.
*   **Throughput Enhancement:** The **Multi-Token Prediction (MTP)** module from K-EXAONE is incorporated to improve decoding throughput.
*   **Tokenizer:** The enhanced multilingual tokenizer from K-EXAONE is reused.
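
As an illustration of the positional-encoding bullet above, here is a minimal sketch of one common 2D RoPE formulation: the first half of a feature vector is rotated by the patch's row index and the second half by its column index. The report does not spell out the exact variant used, so the split-in-halves layout, the `base` value, and the function names `rope_1d` and `rope_2d` are assumptions for illustration only.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles pos * base**(-2k/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for k in range(d // 2):
        angle = pos * base ** (-2.0 * k / d)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed layout: first half encodes the row, second half the column."""
    assert len(vec) % 4 == 0, "each half must itself have an even size"
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key features for two image patches (values are arbitrary).
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.75, 1.0]
k = [1.0, 0.5, -0.25, 2.0, -1.0, 0.5, 0.3, -0.7]

# Both pairs of patches are separated by the same (row, col) offset of (3, 4).
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 5, 7))
score_b = dot(rope_2d(q, 12, 23), rope_2d(k, 15, 27))
print(abs(score_a - score_b) < 1e-9)
```

The two attention scores agree because rotary embeddings make the query-key dot product depend only on the relative row and column offset, which is the property that lets the encoder reason about spatial layout.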

### Pre-training
The pre-training pipeline is structured into two sequential stages:

**Stage 1: Foundational Modality Alignment**
*   **Objective:** End-to-end joint training of vision encoder, merger, and LLM.
*   **Data Mix:** General image-text pairs, interleaved image-text documents, document understanding datasets, OCR-centric samples, and text-only data to preserve language capabilities.

**Stage 2: Perceptual and Knowledge Refinement**
*   **Objective:** Refine the model's understanding of structured, high-density information.
*   **Data Mix:** Increased proportion of grounding, document, OCR-centric data, plus knowledge, mathematics, and STEM domain datasets.

The training configuration is summarized below:

| Stage | Training Modules | Image Tokens | Text Tokens | Sequence Length | Amount of computation (FLOPs) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Stage 1** | All | 420B | 400B | 8K | $1.57 \times 10^{23}$ |
| **Stage 2** | All | 225B | - | 8K | $6.43 \times 10^{22}$ |
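
As a rough sanity check on the table above, the widely used approximation of about 6 FLOPs per parameter per training token can be applied to Stage 1. This heuristic is not stated in the summary and is an assumption here; it is applied only to Stage 1 because the Stage 2 text-token count is not reported.

```python
# Values copied from the training configuration table.
stage1_image_tokens = 420e9
stage1_text_tokens = 400e9
stage1_flops = 1.57e23
stage2_flops = 6.43e22

stage1_tokens = stage1_image_tokens + stage1_text_tokens
flops_per_token = stage1_flops / stage1_tokens

# Assumption: training cost is roughly 6 * parameters * tokens.
implied_params = flops_per_token / 6

total_flops = stage1_flops + stage2_flops

print(f"Stage 1 tokens:         {stage1_tokens:.3e}")
print(f"Stage 1 FLOPs/token:    {flops_per_token:.3e}")
print(f"Implied parameter count: {implied_params / 1e9:.1f}B")
print(f"Total pre-training FLOPs: {total_flops:.3e}")
```

Under this assumption the implied parameter count lands near 32B, which is in line with the 32B language model plus 1.2B vision encoder described earlier.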

**Pre-Training Data Curation:** The data mixture is meticulously crafted across several domains:
*   **Image Caption Data:** Korean-English bilingual pairs enhanced via a synthetic pipeline for richness, including task-oriented images (math, charts, documents).
*   **Interleaved Image-Text Data:** Filtered web content upsampled for high information density and STEM relevance.
*   **OCR and Documents:** Synthetic and real datasets at character/word/document level, with documents parsed into structured formats (HTML, Markdown, JSON).
*   **Grounding and Counting:** Data for spatial intelligence, with object locations as normalized bounding boxes $[x_1, y_1, x_2, y_2]$ scaled to $[0, 1000]$.
*   **STEM and Reasoning:** Search-based synthesis pipeline for complex academic content (math graphs, engineering diagrams) coupled with Long Chain-of-Thought (CoT) data.
*   **Korean Specific:** Specialized corpus from Korean sources (Korea Tourism Organization, IT/Game Donga) for cultural and linguistic nuances, with text-to-vision augmentation for academic problems.
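
The grounding format in the list above can be sketched as follows: pixel-space boxes are divided by the image size and scaled to the $[0, 1000]$ range. Rounding to integers, clamping, and the helper names `normalize_box` and `denormalize_box` are assumptions for illustration; the summary only specifies the coordinate order and the target range.

```python
def normalize_box(box, width, height, scale=1000):
    """Map a pixel-space [x1, y1, x2, y2] box to the [0, scale] range."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    x1, y1, x2, y2 = box

    def to_scaled(value, extent):
        # Assumption: round to the nearest integer and clamp to the valid range.
        return max(0, min(scale, round(value / extent * scale)))

    return [
        to_scaled(x1, width),
        to_scaled(y1, height),
        to_scaled(x2, width),
        to_scaled(y2, height),
    ]


def denormalize_box(box, width, height, scale=1000):
    """Map a [0, scale] box back to pixel coordinates."""
    x1, y1, x2, y2 = box
    return [
        x1 / scale * width,
        y1 / scale * height,
        x2 / scale * width,
        y2 / scale * height,
    ]


# A box in a hypothetical 1280x720 image.
normalized = normalize_box([320, 120, 960, 600], width=1280, height=720)
print(normalized)  # [250, 167, 750, 833]
```

Expressing boxes in a resolution-independent range means the same target string is valid regardless of how the image is resized before it reaches the vision encoder.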

### Context Length Extension
A maximum context length of **256K tokens** is achieved by integrating context extension directly into the Supervised Fine-Tuning (SFT) stage, leveraging the stable 128K-capable base LLM as a prior. **Context Parallelism** is used to manage computational complexity.

### Post-training
**1. Supervised Fine-Tuning (SFT):** SFT uses a high-quality, domain-organized dataset covering multimodal and text-only tasks, supporting all six languages and both NON-REASONING and REASONING modes. A multi-stage curriculum is used for progressive capability strengthening.

**2. Offline Preference Optimization:** Applied in a multi-stage framework with tailored objectives (OCR, chart understanding, safety, etc.). Different loss functions are used for vision and text tasks:
*   For vision tasks, **L\_DPO** is used for stable optimization with a reference model:
    $$L_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim D} \left[ \log \sigma \left( \frac{1}{\beta} \left( \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)} - \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)} \right) \right) \right].$$
*   For text tasks, **L\_GROUPER** ($G=4$) is used to leverage datasets with multiple rejected responses:
    $$L_{\text{GROUPER}}(\theta) = -\mathbb{E}_{(x, y_1, \ldots, y_G) \sim D} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( A_i \cdot \exp\left( \frac{1}{|y_i|} \log \pi_\theta(y_i \mid x) \right) \right) \right],$$
    where $z_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}$ and $A_i = 2 \cdot \frac{z_i - \min(\{z_j\}_{j=1}^G)}{\max(\{z_j\}_{j=1}^G) - \min(\{z_j\}_{j=1}^G)} - 1$.
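
The two objectives above can be evaluated for a single example with a few lines of plain Python. This is a sketch that follows the formulas exactly as written in this summary (including the $1/\beta$ scaling inside the sigmoid); the function names, the toy inputs, and the zero-advantage fallback for a group with identical rewards are assumptions, and sequence log-probabilities are taken as given rather than computed from a model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta):
    """Per-example DPO loss with the 1/beta scaling used in the formula above."""
    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    # -log(sigmoid(s)) equals softplus(-s).
    return softplus(-margin / beta)


def grouper_advantages(rewards):
    """Standardize rewards to z_i, then min-max rescale to A_i in [-1, 1]."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    if std == 0.0:
        # Assumption: a group with identical rewards carries no signal.
        return [0.0] * g
    z = [(r - mean) / std for r in rewards]
    lo, hi = min(z), max(z)
    return [2.0 * (zi - lo) / (hi - lo) - 1.0 for zi in z]


def grouper_loss(rewards, seq_logps, lengths):
    """Per-group loss: -mean_i A_i * exp(log pi(y_i | x) / |y_i|)."""
    advantages = grouper_advantages(rewards)
    terms = [
        a * math.exp(logp / n)
        for a, logp, n in zip(advantages, seq_logps, lengths)
    ]
    return -sum(terms) / len(terms)


# Toy group with G = 4 responses (all numbers are made up for illustration).
rewards = [1.0, 0.0, 0.0, 0.5]
advantages = grouper_advantages(rewards)
print([round(a, 6) for a in advantages])
print(grouper_loss(rewards, seq_logps=[-12.0, -30.0, -28.0, -20.0], lengths=[10, 12, 11, 10]))
print(dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=0.5))
```

Because $A_i$ is a min-max rescaling, the best response in each group always receives $+1$ and the worst $-1$, so the loss raises the length-normalized likelihood of the former and lowers that of the latter.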

**3. Reinforcement Learning:** Joint multimodal RL is conducted on text (math, coding, knowledge) and vision (STEM, charts, OCR) tasks. **GRPO** with the **IcePop** setting is used for policy optimization.

## Empirical Validation / Results
The model is evaluated on a comprehensive suite of vision and language benchmarks.

### Vision Benchmarks Results
EXAONE 4.5 demonstrates competitive and balanced performance across four vision categories (STEM/Puzzle, Document Understanding, General, Korean). Key comparative results are shown in Table 2.

**Table 2: Main evaluation results of EXAONE 4.5 REASONING mode on vision benchmarks.**
| Model | EXAONE 4.5 33B | GPT-5 mini | Qwen3-VL 32B | Qwen3-VL 235B | Qwen3.5 27B |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Architecture** | Dense | - | Dense | MoE | Dense |
| **# Total Params** | 33B | - | 33B | 236B | 27B |
| **# Activated Params** | 33B | - | 33B | 22B | 27B |
| **STEM / Puzzle** | | | | | |
| MMMU | **78.7** | 79.0 | 78.1 | 80.6 | 82.3 |
| MMMU-PRO | **68.6** | 67.3 | 68.1 | 69.3 | 75.0 |
| MATH-VISION | **75.2** | 71.9 | 70.2 | 74.6 | 86.0 |
| WE-MATH | **79.1** | 70.3 | 71.6 | 74.8 | 84.0 |
| **Document Understanding** | | | | | |
| AI2D | **89.0** | 88.2 | 88.9 | 89.2 | 92.9 |
| CharXiv (RQ) | **71.7** | 68.6 | 65.2 | 66.1 | 79.5 |
| OmniDocBench V1.5 | **81.2** | 77.0 | 83.1 | 84.5 | 88.9 |
| **General** | | | | | |
| BLINK | **68.8** | 67.7 | 68.5 | 67.1 | 71.6 |
| **Korean** | | | | | |
| KMMMU | **42.7** | 42.6 | 37.8 | 42.1 | 51.7 |

*   **Highlights:** EXAONE 4.5 frequently outperforms the much larger Qwen3-VL-235B (e.g., on MATH-VISION, WE-MATH, CharXiv) and the strong closed-weight GPT-5 mini (e.g., on MMMU-PRO, MATH-VISION, AI2D, OmniDocBench), demonstrating its efficiency and targeted capability.
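
The head-to-head comparisons in the highlights can be tallied directly from Table 2. The sketch below copies the scores from the table and counts, for each baseline, the benchmarks on which EXAONE 4.5 scores strictly higher; the helper name `wins_against` is hypothetical.

```python
# Scores copied from Table 2 (REASONING mode); column order follows the table.
MODELS = ["EXAONE 4.5 33B", "GPT-5 mini", "Qwen3-VL 32B", "Qwen3-VL 235B", "Qwen3.5 27B"]
SCORES = {
    "MMMU": [78.7, 79.0, 78.1, 80.6, 82.3],
    "MMMU-PRO": [68.6, 67.3, 68.1, 69.3, 75.0],
    "MATH-VISION": [75.2, 71.9, 70.2, 74.6, 86.0],
    "WE-MATH": [79.1, 70.3, 71.6, 74.8, 84.0],
    "AI2D": [89.0, 88.2, 88.9, 89.2, 92.9],
    "CharXiv (RQ)": [71.7, 68.6, 65.2, 66.1, 79.5],
    "OmniDocBench V1.5": [81.2, 77.0, 83.1, 84.5, 88.9],
    "BLINK": [68.8, 67.7, 68.5, 67.1, 71.6],
    "KMMMU": [42.7, 42.6, 37.8, 42.1, 51.7],
}


def wins_against(baseline):
    """Benchmarks where EXAONE 4.5 (column 0) scores above `baseline`."""
    idx = MODELS.index(baseline)
    return [name for name, row in SCORES.items() if row[0] > row[idx]]


win_counts = {model: len(wins_against(model)) for model in MODELS[1:]}
for model, count in win_counts.items():
    print(f"{model}: {count}/{len(SCORES)} benchmarks")
```

On these numbers EXAONE 4.5 leads GPT-5 mini and Qwen3-VL 32B on 8 of 9 benchmarks and Qwen3-VL 235B on 5 of 9, while Qwen3.5 27B scores higher on all nine.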

### Language Benchmarks Results
EXAONE 4.5 shows particular strength in core reasoning and coding tasks, while remaining competitive in agentic tool use and instruction following.

**Table 3: Main evaluation results of EXAONE 4.5 REASONING mode on language benchmarks.**
| Model | EXAONE 4.5 33B | K-EXAONE 236B | GPT-5 mini | Qwen3-VL 235B | Qwen3.5 27B |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Architecture** | Dense | MoE | - | MoE | Dense |
| **# Total Params** | 33B | 236B | - | 236B | 27B |
| **# Activated Params** | 32B | 23B | - | 22B | 27B |
| **Reasoning** | | | | | |
| AIME 2026 | **92.6** | 92.2 | 92.4 | 89.4 | 93.2 |
| LiveCodeBench V6 | **81.4** | 80.7 | 78.1 | 70.1 | 80.7 |
| **Agentic Tool Use** | | | | | |
| τ²-BENCH (Retail) | **77.9** | 78.6 | 78.3 | 67.0 | 84.7 |
| τ²-BENCH (Weighted Avg) | **72.0** | - | - | 57.0 | - |
| **Instruction Following** | | | | | |
| IFBENCH | **62.6** | 67.3 | 74.0 | 59.2 | 76.5 |
| IFEVAL | **89.6** | 89.7 | 92.8 | 88.2 | 95.0 |

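*   **Highlights:** The model achieves the top score on LiveCodeBench V6 (81.4) and is second only to Qwen3.5 27B on AIME 2026 (92.6 vs. 93.2). It substantially outperforms Qwen3-VL-235B on agentic tool use (τ²-BENCH weighted average: 72.0 vs. 57.0) and edges it out on instruction following (IFBENCH: 62.6 vs. 59.2; IFEVAL: 89.6 vs. 88.2), although K-EXAONE, GPT-5 mini, and Qwen3.5 27B score higher on both instruction-following benchmarks.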

## Theoretical and Practical Implications
*   **Industrial Problem-Solving:** EXAONE 4.5 is designed as a practical engine for demanding industrial environments, enabling applications like automated quality control, technical documentation analysis, and operational diagnostics through its native multimodal understanding.
*   **Advancement in VLM Design:** The report demonstrates the efficacy of architectural choices like large-scale vision encoders, GQA for efficiency, and integrated long-context extension, contributing to the field's knowledge on building performant and efficient VLMs.
*   **Foundation for Future Systems:** By establishing robust visual and logical foundations, EXAONE 4.5 serves as a critical milestone towards the development of Vision-Language-Action (VLA) models for autonomous physical interaction.
*   **Community and Research Impact:** As an open-weight model, EXAONE 4.5 aims to accelerate community-driven research, foster innovation, and contribute to the next generation of AI systems, aligning with LG's vision of "AI for a better life."

## Conclusion
EXAONE 4.5 bridges advanced reasoning with visual comprehension, establishing LG's first open-weight VLM. Through architectural innovations (1.2B vision encoder, GQA, 2D RoPE, MTP) and a rigorous, multi-stage training pipeline focused on document and Korean data, the model acquires robust multimodal capabilities. It achieves a stable 256K token context and demonstrates competitive performance across a wide range of vision and language benchmarks, often outperforming larger or closed-weight models such as Qwen3-VL 235B and GPT-5 mini in complex domains like mathematical reasoning and chart understanding. Released under a non-commercial research license (EXAONE AI Model License Agreement 1.2 - NC), EXAONE 4.5 is positioned as a powerful tool for industrial intelligence and a foundational step towards more advanced autonomous AI systems.

---

_Markdown view of https://picx.dev/p/9baz2Q, served by PicX — AI-generated visual whiteboard summaries of research papers._
