Qianfan-OCR: A Unified End-to-End Model for Document Intelligence - Summary

Summary (Overview)

  • Unified End-to-End Architecture: Presents Qianfan-OCR, a 4B-parameter vision-language model that integrates document layout analysis, text recognition, and semantic understanding (e.g., chart QA, key information extraction) into a single end-to-end pipeline, eliminating error propagation from traditional multi-stage systems.
  • Layout-as-Thought Mechanism: Introduces an innovative, optional thinking phase triggered by ⟨think⟩ tokens, where the model generates structured layout representations (bounding boxes, element types, reading order) before the final output. This recovers explicit layout analysis within the end-to-end paradigm and improves accuracy on documents with complex layouts.
  • State-of-the-Art Performance: Achieves top scores among end-to-end models on specialized OCR benchmarks: 93.12 on OmniDocBench v1.5 and 79.8 on OlmOCR Bench. It also shows strong results on general OCR (OCRBench: 880) and document understanding tasks, while attaining the highest average score on public Key Information Extraction (KIE) benchmarks.
  • Empirical Advantage over Pipelines: Demonstrates that traditional two-stage OCR+LLM pipelines suffer severe degradation on tasks requiring spatial and visual reasoning (e.g., scoring 0.0 on CharXiv chart QA), highlighting the benefit of preserving full visual context end-to-end.

Introduction and Theoretical Foundation

Current Optical Character Recognition (OCR) systems face a fundamental trade-off between cost, accuracy, and capability. Traditional pipeline systems decompose the task into sequential stages (layout detection → text recognition → understanding), which allows for modular efficiency but suffers from inter-stage error propagation and irreversible loss of visual context. Specialized OCR large models improve accuracy but maintain this two-stage complexity. General Vision-Language Models (VLMs) offer broad capabilities but are not optimized for structured document parsing and underperform on layout-sensitive metrics.

Qianfan-OCR is proposed to bridge these gaps. Its core theoretical foundation is that a unified end-to-end architecture can jointly optimize all sub-tasks, retaining crucial spatial and visual context throughout processing. This is particularly important for tasks like chart understanding or document QA, where layout relationships are semantically meaningful. The model addresses a key practical limitation of end-to-end models—the lack of explicit, user-accessible layout analysis—through the novel Layout-as-Thought mechanism, which integrates structural reasoning as an optional chain-of-thought process.

Methodology

Model Architecture

Qianfan-OCR adopts a multimodal bridging architecture with three core components:

  1. Vision Encoder: A Qianfan-ViT with AnyResolution design, dynamically tiling input images into 448×448 patches to support high-resolution inputs (up to 4K) critical for dense text. It has 24 layers, 1024 hidden dimensions, and produces up to 4,096 visual tokens per document.
  2. Language Model Backbone: Qwen3-4B (4B total parameters) with a 32K context window, Grouped-Query Attention (GQA) for memory efficiency, and RMSNorm.
  3. Cross-Modal Adapter: A lightweight two-layer MLP that projects visual features into the language model's embedding space.
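The AnyResolution tiling implies a simple token-budget calculation. The sketch below assumes ceiling-division tiling and a hypothetical 256-tokens-per-tile figure (not stated in the source); only the 448×448 tile size and the 4,096-token cap come from the model description.

```python
import math

PATCH = 448          # tile side reported for Qianfan-ViT
MAX_TOKENS = 4096    # visual-token budget per document

def tile_grid(width: int, height: int) -> tuple[int, int]:
    """Number of 448x448 tiles needed to cover the image (ceiling division)."""
    return math.ceil(width / PATCH), math.ceil(height / PATCH)

def visual_tokens(width: int, height: int, tokens_per_tile: int = 256) -> int:
    """Hypothetical token count per page, capped at the 4,096-token budget."""
    cols, rows = tile_grid(width, height)
    return min(cols * rows * tokens_per_tile, MAX_TOKENS)

# A 4K page (3840x2160) needs a 9x5 grid of tiles and saturates the budget.
print(tile_grid(3840, 2160))     # (9, 5)
print(visual_tokens(3840, 2160)) # 4096
```

The cap explains why dense 4K scans are feasible at 4B parameters: however large the page, the language model sees at most 4,096 visual tokens.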

Training Data Synthesis

The model is trained on large-scale, synthetically generated data from six specialized pipelines:

  • Document Parsing: Converts images to structured Markdown using PaddleOCR-VL for layout detection, with a fine-grained taxonomy of 25 element categories (e.g., text, paragraph_title, table, display_formula).
  • Layout-as-Thought: Constructs data where the model generates structured layout analysis within ⟨think⟩...⟨/think⟩ tags before the final output.
  • Key Information Extraction (KIE): Combines open-source data with multi-model labeling and includes semantic generalization for field names.
  • Complex Tables: Mixes programmatically generated tables (with random merges, professional CSS) and tables extracted from real documents.
  • Chart Understanding: Automatically synthesizes samples from arXiv LaTeX sources, extracting figures and generating detailed visual descriptions and reasoning tasks (trend analysis, correlation).
  • Multilingual OCR: Covers 192 languages via reverse synthesis from the HPLT corpus, with specialized handling for different writing systems (RTL, character reshaping).

Training Recipe

A four-stage progressive training strategy is employed:

  • Stage 1: Cross-Modal Alignment (50B tokens) – Adapter-only training for basic vision-language alignment.
  • Stage 2: Foundational OCR Training (2T tokens) – Full-parameter training with an OCR-heavy data mixture (45% Document OCR).
  • Stage 3: Domain-Specific Enhancement (800B tokens) – Targeted training on complex domains (Tables 22%, Formulas 20%, Charts 18%, KIE 18%).
  • Stage 4: Instruction Tuning – Covers a comprehensive set of document intelligence tasks via public data curation, reverse synthesis, and chart data mining.

A critical ablation study confirmed that mixing general-purpose data with domain-specific OCR data in Stage 3 acts as a regularizer, yielding the best performance.
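The stage mixtures above can be written down as a sanity-checked config. Only the percentages quoted in the text are from the source; the remainder categories (and the idea that general-purpose data fills the Stage 3 remainder) are placeholders for illustration.

```python
# Hypothetical data-mixture config for the progressive training recipe.
# Percentages named in the text are from the source; "other" and
# "general_purpose" are assumed remainder buckets.
STAGE_MIXTURES = {
    "stage2_foundational_ocr": {"document_ocr": 0.45, "other": 0.55},
    "stage3_domain_enhancement": {
        "tables": 0.22,
        "formulas": 0.20,
        "charts": 0.18,
        "kie": 0.18,
        "general_purpose": 0.22,  # regularizer, per the ablation study
    },
}

def check_mixture(mix: dict) -> None:
    """Assert that a stage's sampling ratios form a valid distribution."""
    total = sum(mix.values())
    assert abs(total - 1.0) < 1e-9, f"mixture sums to {total}, not 1.0"

for name, mix in STAGE_MIXTURES.items():
    check_mixture(mix)
```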

Layout-as-Thought Implementation

  • Activation: Triggered by appending ⟨think⟩ tokens to the user query.
  • Output Format: The model generates a structured sequence listing elements in reading order, each with:
    • A <box> containing normalized bounding box coordinates.
    • A <label> indicating the element type from the 25-category taxonomy.
    • A <brief> providing a content summary for text elements.
  • Coordinate Representation: Bounding box coordinates (normalized to [0, 999]) are represented as dedicated special tokens <COORD_0> to <COORD_999>, making each coordinate a single token to reduce output length and latency.
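The coordinate tokenization can be sketched as follows. The normalization to [0, 999] and the <COORD_n> token names are from the description above; the exact <box> serialization order (x0, y0, x1, y1) is an assumption.

```python
def coord_token(value: float, extent: float) -> str:
    """Map an absolute coordinate to one of the <COORD_0>..<COORD_999> tokens.

    Normalizing to [0, 999] lets each coordinate be emitted as a single
    dedicated token instead of several digit tokens, cutting output
    length and latency.
    """
    n = round(value / extent * 999)
    n = max(0, min(999, n))  # clamp in case of out-of-frame boxes
    return f"<COORD_{n}>"

def box_tokens(x0, y0, x1, y1, width, height) -> str:
    """Hypothetical <box> serialization for one layout element."""
    return ("<box>"
            + coord_token(x0, width) + coord_token(y0, height)
            + coord_token(x1, width) + coord_token(y1, height)
            + "</box>")

# A full-width header strip at the top of a 1000x1400 page:
print(box_tokens(0, 0, 1000, 100, 1000, 1400))
# <box><COORD_0><COORD_0><COORD_999><COORD_71></box>
```

With this scheme a four-coordinate box always costs exactly four tokens, regardless of page resolution.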

Empirical Validation / Results

1. OCR-Specific Benchmarks

Qianfan-OCR achieves state-of-the-art performance among end-to-end models.

Table: Performance on OmniDocBench v1.5 (Selected Models)

| Model | Architecture | Overall Score ↑ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ |
|---|---|---|---|---|---|
| Qianfan-OCR | End-to-End | 93.12 | 0.041 | 92.43 | 91.02 |
| DeepSeek-OCR-v2 | End-to-End | 91.09 | 0.048 | 90.31 | 87.75 |
| PaddleOCR-VL 1.5 | Pipeline | 94.50 | 0.035 | 94.21 | 92.76 |
| MinerU2.5 | Pipeline | 90.67 | 0.047 | 88.46 | 88.22 |

On OlmOCR Bench, Qianfan-OCR achieves an overall score of 79.8, the highest among end-to-end models and competitive with the top pipeline system (PaddleOCR-VL: 80.0).

Effectiveness of Layout-as-Thought: Analysis on OmniDocBench shows the mechanism provides targeted benefits. On documents with high layout label entropy (diverse element types), enabling thinking improves scores. On simpler, homogeneous documents, it can introduce unnecessary overhead. This suggests users should enable it selectively based on document complexity.
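Layout label entropy, the diversity measure referenced above, is straightforward to compute from a page's element types. This is a minimal sketch; the source does not specify the exact trigger criterion or threshold.

```python
from collections import Counter
from math import log2

def label_entropy(labels: list[str]) -> float:
    """Shannon entropy (bits) of a page's element-type distribution.

    Higher entropy means a more diverse layout, which is the regime
    where enabling the <think> phase pays off.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

plain_page = ["text"] * 10  # homogeneous: entropy 0, skip thinking
complex_page = ["paragraph_title", "text", "table", "display_formula", "text"]

print(label_entropy(plain_page))
print(label_entropy(complex_page))
```

A caller could compare this score against a tuned threshold to decide, per document, whether to append the ⟨think⟩ trigger.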

2. General OCR Benchmarks

Table: Performance on General OCR Benchmarks

| Model | OCRBench ↑ | OCRBenchv2 (en) ↑ | OCRBenchv2 (zh) ↑ | CCOCR-multilan ↑ |
|---|---|---|---|---|
| Qianfan-OCR | 880 | 56.0 | 60.77 | 76.7 |
| Qwen3-VL-4B | 873 | 60.68 | 59.13 | 74.2 |
| PaddleOCR-VL | 549 | 18.15 | 40.86 | 45.5 |

Qianfan-OCR leads on OCRBench and Chinese OCRBenchv2, demonstrating a strong specialization in OCR while maintaining competitive general performance.

3. Document Understanding Benchmarks

Table: Document Understanding Performance (Selected Benchmarks)

| Benchmark | Qianfan-OCR | Qwen3-VL-4B | PaddleOCR-VL + Qwen3-4B (Pipeline) |
|---|---|---|---|
| CharXiv_DQ | 94.0 | 81.8 | 0.0 |
| CharXiv_RQ | 85.2 | 48.5 | 0.0 |
| ChartQA | 88.1 | 83.3 | 56.8 |
| DocVQA | 92.8 | 94.9 | 59.8 |

Qianfan-OCR excels at chart and academic reasoning tasks (CharXiv, ChartQA), where structural visual reasoning is critical. The complete failure (0.0) of all two-stage pipeline systems on CharXiv starkly illustrates the irreversible loss of essential visual/spatial context when only extracted text is passed to an LLM.

4. Key Information Extraction (KIE) Benchmarks

Table: KIE Benchmark Performance (Mean Normalized Scores)

| Model | Overall Mean ↑ | OCRBench KIE ↑ | OCRBenchv2 KIE (zh) ↑ |
|---|---|---|---|
| Qianfan-OCR | 87.9 | 95.0 | 82.3 |
| Qwen3-4B-VL | 83.5 | 89.0 | 71.3 |
| Qwen3-VL-235B-A22B | 84.2 | 94.0 | 62.9 |
| Gemini-3.1-Pro | 79.2 | 96.0 | 63.4 |

Qianfan-OCR achieves the highest overall mean score (87.9), outperforming larger open-source and commercial models, with particular strength in Chinese document extraction.

5. Inference Throughput

With W8A8 quantization, Qianfan-OCR achieves 1.024 pages per second (PPS) on a single A100 GPU, which is competitive with pipeline systems like PaddleOCR-VL (1.224 PPS). The end-to-end architecture benefits from efficient GPU-centric computation and batching.

Theoretical and Practical Implications

  • Paradigm Shift for Document AI: Demonstrates the viability and competitiveness of unified end-to-end architectures, challenging the traditional pipeline paradigm. It provides empirical evidence that preserving full visual context throughout processing is crucial for tasks requiring joint visual-textual reasoning.
  • Bridging Functionality Gaps: The Layout-as-Thought mechanism successfully bridges a key practical gap between end-to-end models and pipeline systems by recovering explicit, user-accessible layout analysis (bounding boxes, element types) within the unified framework.
  • Practical Deployment: The model offers a simplified deployment story compared to complex multi-stage pipelines, reducing to a single-model serving problem. Competitive throughput with quantization makes it practical for production use.
  • Data-Centric Approach: Highlights the importance of large-scale, high-quality synthetic data generation across diverse document domains (tables, charts, multilingual text) for training capable document intelligence models.

Conclusion

Qianfan-OCR is a pioneering 4B-parameter end-to-end model that unifies OCR, layout analysis, and document understanding. Its key innovations—the unified architecture and the Layout-as-Thought mechanism—enable it to achieve state-of-the-art results among end-to-end models on specialized OCR benchmarks while maintaining strong performance on understanding tasks. The model empirically validates the advantage of preserving visual context, as shown by the severe degradation of text-only pipeline systems on visual reasoning tasks. Publicly released via the Baidu AI Cloud Qianfan platform, Qianfan-OCR represents a significant step towards efficient, capable, and user-controllable document intelligence systems.

Future work includes exploring reinforcement learning to make Layout-as-Thought more adaptive, investigating the performance ceiling of end-to-end architectures, and developing more compact model variants for edge deployment.