MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
Summary (Overview)
- Paradigm Shift: Proposes a novel diffusion-based framework for document OCR, reframing the task as an inverse rendering problem under visual conditioning, moving away from traditional left-to-right autoregressive decoding.
- Key Architecture: Introduces MinerU-Diffusion, a 2.5B-parameter model featuring a block-wise diffusion decoder and a structured block-attention mechanism. This design enables parallel, global refinement of tokens within blocks while maintaining coarse autoregressive structure across blocks for stability and efficiency.
- Advanced Training: Employs a two-stage curriculum learning strategy with uncertainty-driven refinement. This approach stabilizes training on large-scale, diverse data first, then focuses on hard cases identified by inference consistency, overcoming optimization challenges of diffusion models.
- Performance & Efficiency: Achieves competitive accuracy on major document parsing benchmarks (e.g., OmniDocBench, CC-OCR, UniMER-Test) while enabling up to 3.2× faster decoding compared to autoregressive baselines through parallel diffusion denoising.
- Enhanced Robustness: Demonstrates reduced dependence on linguistic priors and stronger reliance on visual evidence, as validated by the proposed Semantic Shuffle benchmark. Performance remains stable even when document semantics are artificially disrupted.
Introduction and Theoretical Foundation
Modern document Optical Character Recognition (OCR) has evolved from transcribing lines of text to parsing complex, long-form documents containing layout, tables, and mathematical formulas. While Vision-Language Models (VLMs) have become the dominant paradigm, most rely on autoregressive (AR) decoding. This sequential, left-to-right generation introduces latency proportional to output length and amplifies error propagation in long documents. More fundamentally, AR decoding implicitly casts OCR as a language-conditioned reconstruction task, causing models to over-rely on linguistic priors rather than authentic visual evidence, leading to semantic hallucinations when visual signals are weak or semantic structure is disrupted.
The paper argues that the causal, sequential order in AR decoding is an artifact of serialization, not an intrinsic property of document OCR. Instead, document OCR is more naturally modeled as an inverse rendering problem: recovering a spatially coupled 2D document structure (represented as a 1D token sequence) from a visual input. The statistical dependencies between tokens stem from spatial arrangement and formatting constraints, not a fixed generation order.
This insight motivates the shift to Diffusion Language Models (DLMs), specifically masked diffusion. In this paradigm, a clean token sequence is progressively corrupted with mask tokens, and the model learns to denoise it. The forward corruption process is defined as:
$$q(x_t \mid x_0) = \prod_{i=1}^{L} \mathrm{Cat}\!\left(x_t^{i};\ \alpha_t\, x_0^{i} + (1-\alpha_t)\,\mathbf{m}\right)$$
where $\alpha_t \in [0,1]$ is the continuous corruption schedule and $\mathbf{m}$ is the one-hot mask token. The training objective is derived from maximum likelihood estimation, resulting in an Evidence Lower Bound (ELBO):
$$-\log p_\theta(x_0 \mid c) \;\le\; \mathbb{E}_{t \sim \mathcal{U}(0,1)}\, \mathbb{E}_{q(x_t \mid x_0)} \!\left[ \frac{\alpha_t'}{1-\alpha_t} \sum_{i=1}^{L} \mathbf{1}\!\left[x_t^{i} = \mathbf{m}\right] \log p_\theta\!\left(x_0^{i} \mid x_t, c\right) \right]$$
where $c$ is the prompt (here, the visual conditioning) and $L$ is the sequence length.
The conditional independence assumption of masked diffusion—that each token can be predicted independently given the visual input and partially observed sequence—is well-aligned with OCR, where the mapping from image to text is largely deterministic. This allows for parallel decoding of long textual spans while maintaining global coherence, offering both a theoretical justification and practical advantages for efficient and robust document OCR.
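The masked-diffusion corruption and its parallel reverse step can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the `MASK` id, the `corrupt`/`denoise_step` helpers, and the oracle predictor are all assumptions made for exposition.

```python
# Toy sketch of masked diffusion for OCR decoding (assumed, not the paper's
# code): each token is corrupted to a [MASK] id with probability 1 - alpha_t,
# and a denoiser fills all masked positions in parallel.
import random

MASK = -1  # hypothetical mask-token id

def corrupt(tokens, alpha_t, rng):
    """Forward process: keep each token with prob alpha_t, else mask it."""
    return [tok if rng.random() < alpha_t else MASK for tok in tokens]

def denoise_step(tokens, predict):
    """Reverse step: fill every masked position from the model's prediction.

    `predict` maps the partially masked sequence to a full sequence of token
    guesses; masked positions are filled independently, which is exactly the
    conditional-independence assumption discussed above.
    """
    guesses = predict(tokens)
    return [g if tok == MASK else tok for tok, g in zip(tokens, guesses)]

rng = random.Random(0)
clean = [5, 8, 2, 9, 4, 7]
noisy = corrupt(clean, alpha_t=0.5, rng=rng)
# An oracle "denoiser" that always knows the clean sequence, to show the flow:
restored = denoise_step(noisy, predict=lambda seq: clean)
```

Because the OCR mapping from image to text is largely deterministic, filling many masked positions in one step (as above) is far less harmful than it would be in open-ended generation.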
Methodology
3.1 Problem Formulation: Inverse Rendering via Diffusion
The target is a unified structured token sequence $y = (y_1, \dots, y_L) \in \mathcal{V}^L$, where $\mathcal{V}$ is a shared vocabulary for text, layout markers, table delimiters, and math operators. This sequence corresponds to an underlying 2D document structure. The goal is to perform posterior inference over this sequence given the document image $I$. Unlike AR decomposition, diffusion-based decoding uses a discrete diffusion process for global iterative refinement under visual conditioning, which better matches the structural properties of OCR.
3.2 MinerU-Diffusion: Unified Diffusion Architecture for OCR
A naive full-attention DLM is computationally expensive ($O(L^2)$ attention complexity) and prone to positional instability for long documents. MinerU-Diffusion introduces a block-attention architecture that incorporates structural locality.
The output sequence is partitioned into $B$ contiguous blocks: $y = (b_1, \dots, b_B)$, where $b_k = (y_{(k-1)n+1}, \dots, y_{kn})$ and $n$ is the block size. The conditional posterior is factorized as:
$$p_\theta(y \mid I) = \prod_{k=1}^{B} p_\theta\!\left(b_k \mid b_{<k},\, I\right)$$
where $b_{<k}$ denotes all preceding blocks. Within each block, diffusion operates locally, enabling parallel refinement. This hybrid design provides coarse-grained autoregressive structure across blocks (for anchoring) and parallel diffusion within blocks (for efficiency).
A structured attention mask is applied during training and inference:
$$M_{ij} = \begin{cases} 1, & \mathrm{blk}(j) \le \mathrm{blk}(i) \\ 0, & \text{otherwise} \end{cases}$$
where $\mathrm{blk}(i)$ is the block index of token $i$. This allows tokens to attend bidirectionally within their block and causally to all preceding blocks, but not to future blocks. This reduces unnecessary global coupling, stabilizes alignment, and bounds errors locally.
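The mask rule "attend to token $j$ iff $j$'s block index does not exceed your own" is mechanical to construct. A minimal sketch (assumed, not the released code) with NumPy:

```python
# Structured block-attention mask: bidirectional within a block, causal
# across blocks. M[i, j] = 1 iff blk(j) <= blk(i).
import numpy as np

def block_attention_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Build the 0/1 attention mask for a sequence split into fixed blocks."""
    blk = np.arange(seq_len) // block_size        # block index of each token
    return (blk[None, :] <= blk[:, None]).astype(np.int8)

M = block_attention_mask(seq_len=6, block_size=2)
# Tokens 0-1 form block 0, 2-3 block 1, 4-5 block 2.
# Token 2 (block 1) sees columns 0..3 but not block 2's columns 4..5.
```

Note the mask depends only on block indices, so it is identical at every denoising step; in a real model it would be added (as `-inf` on zeros) to the attention logits rather than multiplied.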
3.3 Two-Stage Curriculum Learning with Uncertainty-Driven Refinement
To address the training instability and lower data efficiency of diffusion models, a two-stage curriculum is proposed.
- Stage I: Diversity-Driven Foundational Learning: The model is trained on a large-scale, diverse, and balanced dataset to establish robust visual-semantic alignment and general parsing capabilities.
- Stage II: Uncertainty-Driven Boundary Refinement: After Stage I, hard cases are mined. For each sample $x_i$, $K$ stochastic inference passes are performed, yielding predictions $\{\hat{y}_i^{(1)}, \dots, \hat{y}_i^{(K)}\}$. A task-specific consistency metric $s(\cdot,\cdot)$ (e.g., PageIoU for layout, CDM for formulas, TEDS for tables) is used to compute a mean pairwise consistency score $\bar{c}_i = \binom{K}{2}^{-1} \sum_{j<k} s(\hat{y}_i^{(j)}, \hat{y}_i^{(k)})$. Samples with low consistency (high uncertainty) are selected as the hard set $\mathcal{D}_{\text{hard}}$ and refined via an AI-assisted human pipeline to create high-precision labels. The final fine-tuning dataset is $\mathcal{D}_{\text{II}} = \mathcal{D}_{\text{hard}} \cup \mathcal{D}_{\text{rand}}$, where $\mathcal{D}_{\text{rand}}$ is a random subset of the Stage I data. The Stage II objective uses an adaptive sample weight that grows as $\bar{c}_i$ drops, emphasizing hard, uncertain samples.
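The Stage II mining loop can be sketched as follows. This is a hedged illustration: the exact-match stand-in metric and the specific weight form `w = 1 + lam * (1 - consistency)` are assumptions, not the paper's formulas (the real pipeline would use PageIoU/CDM/TEDS as the agreement metric).

```python
# Sketch of uncertainty-driven hard-case mining: run K stochastic decoding
# passes per sample, score pairwise agreement, and upweight low-consistency
# samples. Metric and weight form are illustrative assumptions.
from itertools import combinations

def mean_consistency(outputs, metric):
    """Average pairwise agreement over K stochastic predictions."""
    pairs = list(combinations(outputs, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)

def sample_weight(consistency: float, lam: float = 1.0) -> float:
    """Emphasize uncertain samples: weight grows as consistency drops."""
    return 1.0 + lam * (1.0 - consistency)

def exact_match(a, b):  # stand-in for PageIoU / CDM / TEDS
    return 1.0 if a == b else 0.0

passes = ["<td>42</td>", "<td>42</td>", "<td>4Z</td>"]  # K = 3 decodings
c = mean_consistency(passes, exact_match)   # 1 of 3 pairs agree -> 1/3
w = sample_weight(c, lam=1.0)               # 1 + (1 - 1/3) = 5/3
```

Samples whose score `c` falls below a chosen cutoff would go to the human-refinement pool; the weight `w` then scales their loss during Stage II fine-tuning.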
Empirical Validation / Results
4.1 Experimental Setups
- Data: Trained on the MinerU2.5 dataset (~7.5M samples), focusing on Chinese and English.
- Model: Based on the SDAR-1.7B-Chat-b32 architecture with a block size of 32, integrated with a vision encoder from Qwen2-VL-7B. Total parameters: 2.5B.
- Evaluation: Benchmarks include OmniDocBench v1.5 (full-document parsing), CC-OCR & OCRBench v2 (tables), and UniMER-Test (formulas).
4.2 Full-Document Parsing Task Results
Table 1: Comprehensive evaluation of document parsing on OmniDocBench v1.5
| Type | Methods | Params. | Overall ↑ | Text ↓ | Formula ↑ | Table TEDS ↑ | Table TEDS-S ↑ | Reading Order ↓ | GT Layout |
|---|---|---|---|---|---|---|---|---|---|
| Pipeline | Mineru2-pipeline [43] | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 | × |
| Pipeline | PP-StructureV3 [7] | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 | × |
| AR | MinerU2.5 [29] | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 | × |
| AR | PaddleOCR-VL [7] | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 | × |
| dLM | MinerU-Diffusion | 2.5B | 88.94 | 0.061 | 86.41 | 86.50 | 90.29 | 0.059 | × |
| AR | MinerU2.5 [29] | 1.2B | 93.44 | 0.025 | 91.98 | 90.84 | 95.10 | - | ✓ |
| AR | PaddleOCR-VL [7] | 0.9B | 93.91 | 0.021 | 92.13 | 91.70 | 95.42 | - | ✓ |
| dLM | MinerU-Diffusion | 2.5B | 93.37 | 0.028 | 91.92 | 91.00 | 94.86 | - | ✓ |
- Without GT Layout: MinerU-Diffusion achieves an Overall score of 88.94, clearly ahead of the pipeline systems (75.51, 86.73) and within a few points of the strongest AR models, demonstrating solid end-to-end capability.
- With GT Layout: Performance improves to 93.37 Overall, very close to top-tier AR systems (93.44, 93.91). The gap between settings indicates layout prediction remains a bottleneck.
4.3 Element-Specific Parsing Task Results
Table III: Comprehensive recognition results on CC-OCR, OCRBench v2, and UniMER-Test

| Type | Method | CC-OCR TEDS ↑ | CC-OCR TEDS-S ↑ | OCRBench v2 TEDS ↑ | OCRBench v2 TEDS-S ↑ | UniMER CPE ↑ | UniMER HWE ↑ | UniMER SCE ↑ | UniMER SPE ↑ |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AR | MinerU2.5 [29] | 79.76 | 85.16 | 87.13 | 90.62 | 96.6 | 94.4 | 96.4 | 98.4 |
| dLM | MinerU-Diffusion | 73.77 | 82.06 | 81.18 | 88.66 | 91.6 | 91.6 | 92.0 | 96.8 |
MinerU-Diffusion remains competitive on table and formula recognition, though it trails the strongest AR system (MinerU2.5) on every metric reported here, leaving a gap to the best specialized systems.
4.4 Ablation Study
- Confidence Threshold vs. Decoding Parallelism: A dynamic confidence threshold controls parallelism. Lower thresholds increase speed but may reduce accuracy. At a threshold of 0.95, MinerU-Diffusion achieves ~2.1× speedup (108.9 TPS vs 52 TPS) over MinerU2.5 with similar accuracy. At a threshold of 0.6, it reaches a peak ~3.2× speedup (164.8 TPS) while maintaining >90% accuracy.
- Decoding Strategy: Dynamic scheduling outperforms static-step decoding in both accuracy and throughput.
- Full-Attn vs Block-Attn: Block-attention is superior, offering near-linear scalability, better mitigation of repetition, and avoidance of fixed-length mismatch issues that plague full-attention.
- Two-Stage Curriculum Learning: The full two-stage strategy is essential. Stage II alone fails (Overall 35.71 w/o GT layout), while the combined approach achieves 88.94 Overall, confirming the curriculum mitigates optimization instability.
4.5 Semantic Shuffle Analysis
A benchmark was created by shuffling words in documents to disrupt semantics while preserving visual appearance. Results show AR decoder performance degrades sharply as distortion increases, indicating heavy reliance on linguistic priors. In contrast, MinerU-Diffusion's performance remains nearly constant, demonstrating stronger reliance on visual signals and robustness to semantic disruption.
Figure 7: Semantic Shuffle benchmark results across distortion levels (The figure shows AR model accuracy dropping significantly with increased shuffle ratio, while MinerU-Diffusion accuracy remains stable.)
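A shuffle corruption of this kind is straightforward to construct. The sketch below is one possible protocol under stated assumptions (uniformly sampled positions, permuted among themselves); the paper's exact benchmark construction may differ.

```python
# Possible Semantic-Shuffle-style corruption: permute a fraction of the word
# positions so linguistic priors break while the word inventory (and thus
# per-word visual appearance) is preserved.
import random

def semantic_shuffle(words, ratio, seed=0):
    """Shuffle `ratio` of the word positions among themselves."""
    rng = random.Random(seed)
    n = max(2, int(len(words) * ratio)) if ratio > 0 else 0
    idx = rng.sample(range(len(words)), n) if n else []
    vals = [words[i] for i in idx]
    rng.shuffle(vals)
    out = list(words)
    for i, v in zip(idx, vals):
        out[i] = v
    return out

doc = "the model reads the page and emits tokens".split()
shuffled = semantic_shuffle(doc, ratio=0.5)  # same words, disrupted order
```

Sweeping `ratio` from 0 to 1 yields the distortion levels on the benchmark's x-axis; a decoder leaning on language priors degrades as `ratio` grows, while a visually grounded one should not.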
Theoretical and Practical Implications
- Theoretical: The work provides a principled reformulation of document OCR as an inverse rendering problem, arguing that diffusion decoding is structurally better aligned with the task's deterministic nature than autoregressive generation.
- Efficiency: The block-wise diffusion decoder enables parallel token updates, breaking the sequential bottleneck of AR decoding and achieving significant speedups (up to 3.2×) for long-document parsing, which is critical for practical applications.
- Robustness: By reducing dependence on linguistic priors, the model is less prone to semantic hallucinations and more robust in scenarios with weak visual cues or nonsensical text (as shown in Semantic Shuffle). This enhances reliability.
- Training Innovation: The two-stage uncertainty-driven curriculum learning strategy addresses key challenges in training diffusion models for OCR, providing a blueprint for stable optimization and performance improvement on hard cases.
Conclusion
MinerU-Diffusion presents a successful paradigm shift in document OCR, replacing autoregressive decoding with a block-wise parallel diffusion framework. The model achieves competitive accuracy across major benchmarks while enabling substantially faster inference. Its reduced reliance on linguistic patterns and stronger grounding in visual evidence, validated by the Semantic Shuffle benchmark, points towards more robust OCR systems. The proposed two-stage curriculum learning strategy is crucial for stabilizing diffusion model training. This work demonstrates that diffusion-based decoding is a promising and principled alternative for accurate, efficient, and reliable document OCR, inspiring future research in this direction.