Summary (Overview)

  • Parallel Region Perception Framework: Proposes PerceptionDLM, the first multimodal diffusion language model (DLM) capable of generating captions for multiple image regions simultaneously within a single denoising process, overcoming the sequential bottleneck of autoregressive (AR) models.
  • Strong Diffusion Baseline: Introduces PerceptionDLM-Base, an 8B-parameter discrete diffusion VLM that outperforms prior diffusion VLMs (e.g., LLaDA-V, SDAR-VL, Dream-VL) on 15/16 standard multimodal benchmarks.
  • Efficiency Breakthrough: Achieves up to 3.44× throughput speedup over AR models (e.g., GAR-8B) under constant workload with 4 masks per image, and near-linear tokens-per-second scaling with increasing number of regions.
  • New Benchmark: Constructs ParaDLC-Bench, a multi-region localized captioning benchmark with 2,345 manually verified questions that jointly evaluate caption quality and inference efficiency, including cross-region interference detection.
  • Competitive Accuracy + Speed: On ParaDLC-Bench, PerceptionDLM achieves 62.4% average accuracy (about 1.8× that of LLaDA-V at 35.2%) while reducing total inference time to 276 seconds vs. 479 seconds for GAR-8B.

Introduction and Theoretical Foundation

Background

Visual perception in multimodal large language models (MLLMs) increasingly requires fine-grained, localized understanding: models must accurately describe multiple specific regions within a single image. Existing MLLMs rely on autoregressive (AR) decoding, which generates descriptions sequentially per region and per token. As the number of queried regions grows, inference cost and latency increase linearly, making dense perception unscalable.

Motivation for Diffusion Language Models

Large diffusion language models (DLMs) offer a promising alternative: they use masked denoising generation, which naturally supports non-autoregressive and parallel token generation. Prior multimodal DLMs either lack strong perception capabilities or do not exploit the parallelism for concurrent multi-region perception. This work asks: Can we design a multimodal diffusion language model that preserves strong perception quality while unlocking practical parallelism for region-conditioned captioning?

Theoretical Basis

Discrete Diffusion Language Modeling (e.g., LLaDA) formulates text generation as a generative Markov process. The forward process progressively corrupts clean tokens $x_0$ into $x_t$ by replacing tokens with a special [MASK] state. The reverse process learns a neural network $p_\theta(x_0 \mid x_t)$ to denoise. Training optimizes a reweighted variational lower bound that simplifies to predicting masked tokens:

$$\mathcal{L}_{\text{DLM}} = -\mathbb{E}_{t,x_0,x_t} \left[ \frac{1}{t} \sum_{i=1}^{N} \mathbf{1}[x^i_t = \text{[MASK]}] \log p_\theta(x^i_0 \mid x_t) \right] \tag{1}$$

where $N$ is the sequence length and the indicator $\mathbf{1}[\cdot]$ selects only masked tokens.
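To make Equation (1) concrete, the following is a minimal sketch of a one-sample loss estimate on a toy sequence. It assumes that the forward process masks each token independently with probability $t$ (the convention used by LLaDA; the summary above does not spell it out), and `uniform_model` is a hypothetical stand-in for $p_\theta$, not the actual network.

```python
import math
import random

MASK = "[MASK]"
VOCAB = ["a", "red", "car", "on", "road"]

def forward_mask(x0, t, rng):
    """Forward process (assumed): mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in x0]

def dlm_loss(x0, xt, t, predict_probs):
    """One-sample estimate of Eq. (1): -(1/t) * sum of log p(x0_i | xt) over masked i."""
    total = 0.0
    for i, (clean, noisy) in enumerate(zip(x0, xt)):
        if noisy == MASK:  # the indicator keeps only masked positions
            total += math.log(predict_probs(xt, i)[clean])
    return -total / t

def uniform_model(xt, i):
    """Hypothetical stand-in for p_theta: uniform distribution over the toy vocabulary."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

rng = random.Random(0)
x0 = ["a", "red", "car", "on", "road"]
t = 0.6
xt = forward_mask(x0, t, rng)
loss = dlm_loss(x0, xt, t, uniform_model)
print(xt, round(loss, 4))
```

With a uniform predictor, every masked position contributes $\log |V|$, so the loss equals the number of masked tokens times $\log 5$, divided by $t$. A trained model lowers this by assigning higher probability to the clean token.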

Methodology

PerceptionDLM-Base: A Stronger Diffusion VLM Baseline

Architecture: Consists of a pretrained SigLIP-2 vision encoder, a two-layer MLP connector (GELU activation), and a DLM decoder (LLaDA-8B). Visual features $Z_v = \Phi_v(X_v)$ are projected via $\Phi_c$ to obtain continuous embeddings $H_v = \Phi_c(Z_v)$, which are concatenated with text embeddings of instruction $X_q$ and response $X_a$.

Training objective (visual instruction tuning):

$$\mathcal{L}_{\text{PerceptionDLM-Base}} = -\mathbb{E}_{(X_v,X_q,X_a),t,x_t} \left[ \frac{1}{t} \sum_{i \in \mathcal{M}_a} \log p_\theta(x^i_0 \mid x_t, H_v, X_q) \right] \tag{2}$$

where $\mathcal{M}_a$ denotes masked token indices within the target response $X_a$ only. The diffusion forward process corrupts only response tokens; image and instruction tokens remain uncorrupted as conditions.
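The response-only corruption can be sketched as follows. This is an illustrative toy, not the paper's data pipeline: the function name `build_training_input`, the string placeholders for visual tokens, and the independent per-token masking with probability `t` are all assumptions.

```python
import random

MASK = "[MASK]"

def build_training_input(visual, instruction, response, t, rng):
    """Corrupt only the response; visual and instruction tokens stay clean as conditions.

    Returns the full input sequence and the masked response indices
    (positions in the full sequence) over which the loss would be summed.
    """
    prefix = list(visual) + list(instruction)  # conditions, never masked
    noisy = [MASK if rng.random() < t else tok for tok in response]
    masked_a = [len(prefix) + i for i, tok in enumerate(noisy) if tok == MASK]
    return prefix + noisy, masked_a

rng = random.Random(0)
seq, masked_a = build_training_input(
    visual=["<img0>", "<img1>"],
    instruction=["describe", "the", "region"],
    response=["a", "red", "car"],
    t=0.5,
    rng=rng,
)
print(seq, masked_a)
```

Every index in `masked_a` falls inside the response span, which mirrors the restriction of the sum in Equation (2) to masked response positions.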

Dynamic Resolution: To handle high-resolution images, input images are dynamically partitioned into $512\times512$ tiles based on aspect ratio. When an image yields multiple tiles, an additional thumbnail is appended. Each tile undergoes pixel unshuffle to reduce its tokens to one quarter, and is then encoded and concatenated.
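A rough token-budget sketch for this tiling scheme is given here. Several details are assumptions not stated above: the grid is chosen by simple ceiling division (the paper selects it from the aspect ratio by an unspecified rule), the vision encoder uses a patch size of 16, and the thumbnail is also $512\times512$. The helper `tile_plan` is hypothetical.

```python
import math

TILE = 512

def tile_plan(width, height, patch=16):
    """Estimate the tile grid and visual token count for one image (illustrative only)."""
    cols = max(1, math.ceil(width / TILE))
    rows = max(1, math.ceil(height / TILE))
    n_tiles = cols * rows
    # A thumbnail is appended only when the image is split into multiple tiles.
    n_views = n_tiles + (1 if n_tiles > 1 else 0)
    patches_per_view = (TILE // patch) ** 2
    tokens_per_view = patches_per_view // 4  # pixel unshuffle keeps one quarter
    return {
        "grid": (cols, rows),
        "views": n_views,
        "visual_tokens": n_views * tokens_per_view,
    }

print(tile_plan(1024, 512))
print(tile_plan(512, 512))
```

Under these assumptions a 1024×512 image produces two tiles plus a thumbnail, each contributing 256 tokens after the reduction, while a 512×512 image is a single view with no thumbnail.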

Training Stages (4-stage):

  1. Vision-Language Alignment: Lightweight alignment using Bee-Training-Data-Stage1, primarily training the connector while freezing the backbone.
  2. Middle-stage Training: Large-scale training on Bee-Training-Data-Stage2. Two strategies explored: full-parameter training (updating diffusion backbone and vision encoder) vs. partially frozen (vision encoder fixed).
  3. Instruction Tuning: SFT with 22M samples from LLaVA-OneVision-1.5-Instruct-Data (VQA, reasoning, OCR, grounding).
  4. High-Quality SFT Refinement: Fine-tuning with Honey-Data-15M, enriched with dual-level chain-of-thought annotations.

Parallel Region Perception Architecture (PerceptionDLM)

Built on PerceptionDLM-Base, the parallel architecture introduces three components (Figure 2):

  1. RoI-aligned Feature Replay: For each region mask, localized visual features are extracted from the vision encoder and projected into the language embedding space as placeholder tokens. Each placeholder is expanded into RoI feature tokens (default $4\times4$ grid).

  2. Region Prompting: Each region $R_i$ is associated with a learnable embedding $e_i$ (continuous visual prompt). These embeddings are broadcast and fused with visual tokens from corresponding masked regions, enabling the model to distinguish multiple concurrent targets.

  3. Structured Attention Masking: To prevent interference across regions during parallel denoising, tokens of region $R_i$ may attend only to:

    • Global visual tokens
    • Shared textual prompt tokens
    • RoI feature tokens of region $R_i$
    • Other tokens within the same region-specific caption span

    Attention to RoI features and caption tokens of other regions is masked out, creating a block-wise attention pattern that enforces region-level independence while preserving shared global context.
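The block-wise pattern described above can be sketched as a boolean mask over a toy token layout. The layout (shared tokens first, then each region's RoI tokens followed by its caption span) and the helper names are assumptions for illustration. The text only specifies what region tokens may attend to; restricting shared tokens to other shared tokens is an additional assumption made here.

```python
def build_segments(n_global, n_prompt, n_regions, n_roi, n_caption):
    """One label per token: region id, or -1 for shared (global visual / prompt) tokens."""
    labels = [-1] * (n_global + n_prompt)
    for r in range(n_regions):
        labels += [r] * (n_roi + n_caption)  # RoI features, then caption span
    return labels

def structured_mask(labels):
    """allowed[q][k] is True when query token q may attend to key token k."""
    n = len(labels)
    allowed = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            shared_key = labels[k] == -1  # global visual and prompt tokens
            same_region = labels[q] != -1 and labels[q] == labels[k]
            allowed[q][k] = shared_key or same_region
    return allowed

labels = build_segments(n_global=2, n_prompt=1, n_regions=2, n_roi=2, n_caption=3)
mask = structured_mask(labels)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

Printing the mask shows a dense band for the shared context columns plus one square block per region on the diagonal, so captions for different regions never exchange information directly.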

Training: The number of region prompts per image is set to 6. The model is trained with the same loss as Equation (2), with all parameters trainable, using the AdamW optimizer, batch size 256, and a learning rate of $4\times10^{-5}$ with linear warmup (3% of steps) followed by cosine decay. During inference, each mask is decoded with 32 denoising steps over a sequence length of 32.
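The learning-rate schedule can be written as a small function. This is a generic warmup-plus-cosine sketch under stated assumptions: the total step count is hypothetical, the rate decays to a floor of zero, and warmup ramps linearly up to the peak; none of these details beyond the peak rate and the 3% warmup fraction are given above.

```python
import math

def lr_at(step, total_steps, peak_lr=4e-5, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (min_lr is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps
for s in (0, 15, 29, 500, 999):
    print(s, f"{lr_at(s, total):.2e}")
```

The rate rises during the first 3% of steps, reaches the peak at the end of warmup, and then follows a half cosine down toward the floor.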

ParaDLC-Bench (Parallel Detailed Localized Captioning Benchmark)

Extends DLC-Bench to multi-mask scenarios with reference-free evaluation using an LLM judge. Two-step evaluation:

  1. Model generates parallel descriptions for multiple masked regions within one image.
  2. LLM judge (GPT-5.2) assesses descriptions via predefined positive/negative questions.

Question categories:

  • Positive questions: Focus on unique attributes of the target mask. A point is awarded for accurately including the attribute, and factual errors are penalized.
  • Negative & Interference questions: Beyond typical absent attribute checks, specifically examine attribute entanglement — whether the model hallucinates features from other concurrent masked objects into the current target's description. Point awarded only if the model correctly identifies the target object and avoids cross-region hallucination.

Benchmark includes 2,345 manually verified questions with rigorous human cross-validation. Verified robust across different judge LLMs (Qwen3.5-27B, Gemini-3.1-Pro).

Training Data Engine (ParaCaption-5.7M)

Constructs single-image, multi-mask caption data from two sources:

  • SA-1B dataset: Filter occluded/part-level masks, use GAR-8B to generate initial descriptions, LLM extracts core categories, SAM3 re-predicts masks with IoU filtering.
  • COCONut dataset: Use GAR-8B for captions, Qwen3-8B verifies semantic match with ground-truth categories.

Post-processing: length restriction, anti-repetition/hallucination filtering. Final: 334k images (3.4M masks) from COCONut, 83k images (2.3M masks) from SA-1B.

Empirical Validation / Results

Performance on Multimodal Benchmarks (PerceptionDLM-Base)

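Table 1 (partial; full table in paper). Scores are listed in column order: PerceptionDLM-Base (8B), LLaDA-V (8B), SDAR-VL (8B), Dream-VL (7B), Qwen2.5-VL (7B), InternVL3 (8B). Rows with only five scores are missing one entry in this excerpt and are given in source order without column assignment.

Benchmark        Scores
MMStar           63.7   60.1   59.9   63.9   68.2
SeedBench        78.9   74.8   64.2   75.5   77.0⋆  77.1⋆
MMMU             47.2   48.6   48.6   53.0   51.3⋆  57.3⋆
MathVista        65.5   52.4†  62.5   68.2   71.6
DocVQA           89.9   83.9   56.1   88.3   94.9   92.7
MMVP             82.0   76.7⋆  66.5   73.3⋆  80.0
HallusionBench   58.4   50.9⋆  44.4   51.9   49.9
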
  • Outperforms LLaDA-V on 15/16 benchmarks.
  • Excels in fine-grained visual perception (MMVP, BLINK, RealWorldQA, CV-Bench-2D) vs. AR models Qwen2.5-VL and InternVL3.
  • Gap remains on complex reasoning (MMMU, MathVista); paper notes diffusion models' arbitrary-order parallel decoding limits reasoning, so autoregressive-order decoding was used for math evaluations.

Region Captioning Benchmarks

Table 2: ParaDLC-Bench and DLC-Bench results

Method                 Size   ParaDLC-Bench Pos (%)   Neg (%)   Avg (%)   TPF   Time (s)   DLC-Bench Avg (%)
GPT-5.2                -      38.0                    71.0      55.2      -     -          39.4
Gemini-2.5-Pro         -      39.7                    73.3      57.5      -     -          47.9
PixelRefer             7B     40.8                    78.7      60.5      1     718        68.3
DAM                    3B     48.1                    87.2      69.2      1     326        67.3
GAR                    8B     49.0                    87.6      69.5      1     479        67.8
LLaDA-V                8B     24.1                    46.3      35.2      1∗    3241       24.6
SDAR-VL                8B     30.2                    28.8      31.3      1∗    945        28.8
Dream-VL               7B     29.7                    28.6      30.4      1∗    446        24.7
PerceptionDLM (Ours)   8B     42.3                    82.4      62.4      2.9   276        53.1

  • ParaDLC-Bench Avg: 62.4% – nearly doubling prior diffusion VLMs (LLaDA-V 35.2%, SDAR-VL 31.3%).
  • TPF (Tokens Per Forward): 2.9 vs. 1.0 for all baselines, indicating effective parallelism.
  • Total inference time: 276 seconds (vs. 479s for GAR, 718s for PixelRefer) — substantial speed advantage.
  • On DLC-Bench (single mask), still outperforms all diffusion baselines with 53.1% avg.

Efficiency Analysis

  • Throughput vs. Region Quantity (Figure 1b): PerceptionDLM achieves near-linear TPS growth with stable per-image latency (~2.9s). GAR-8B shows constant TPS and linearly increasing latency.
  • Throughput Scaling at Constant Workload (Figure 1c): With 4 masks per image, fully parallelized PerceptionDLM achieves 3.44× throughput improvement over sequential processing (TPF=1), reducing single-image latency from 10.04s to 2.92s.
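The reported speedups follow directly from the latency and timing figures quoted in this section, as this short check illustrates. The helper `speedup` is only a convenience for the arithmetic.

```python
def speedup(baseline_seconds, improved_seconds):
    """Ratio of baseline time to improved time for the same workload."""
    return baseline_seconds / improved_seconds

# Per-image latency with 4 masks: sequential (TPF=1) vs. fully parallel decoding.
latency_speedup = speedup(10.04, 2.92)
# Total ParaDLC-Bench inference time: GAR-8B vs. PerceptionDLM.
benchmark_speedup = speedup(479, 276)

print(f"{latency_speedup:.2f}x, {benchmark_speedup:.2f}x")  # prints "3.44x, 1.74x"
```

The 3.44× figure is therefore the ratio of the two per-image latencies at a fixed workload, while the end-to-end benchmark time against GAR-8B improves by roughly 1.7×.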

Theoretical and Practical Implications

  • Theoretical Implications: This work demonstrates that discrete diffusion language models can be effectively adapted for structured parallel generation tasks like multi-region perception. The combination of region prompting and structured attention masking provides a generalizable framework for mapping multiple inputs to multiple outputs in a single diffusion process, beyond captioning (e.g., referring expression comprehension, visual grounding).
  • Practical Implications: For real-world applications requiring dense scene understanding (autonomous driving, robotics, accessibility), PerceptionDLM offers significant inference speed improvements without sacrificing caption quality. The ability to process multiple regions in parallel reduces both latency and computational overhead, making it feasible to scale up perception density. The open-source release of models, code, and datasets (ParaDLC-Bench, ParaCaption-5.7M) lowers the barrier for further research.
  • Limitations: The reasoning gap on complex mathematical benchmarks suggests that arbitrary-order parallel decoding inherently limits chain-of-thought reasoning. The paper identifies this as a key future direction, proposing reinforcement learning (RL) as a potential solution (inspired by DeepSeek-R1). Additionally, current accuracy on ParaDLC-Bench (62.4% avg) is still below AR region-specific models (~69%), indicating room for improvement in multi-region disentanglement.

Conclusion

PerceptionDLM is a diffusion-based multimodal model that enables parallel region perception, generating multiple region captions in a single denoising process rather than sequentially. Built upon a strong diffusion VLM baseline (PerceptionDLM-Base), it introduces region-aware prompt embeddings and structured attention masking to enforce region-level independence while sharing global visual context. Extensive evaluations show:

  • PerceptionDLM-Base sets a new state-of-the-art among open-source diffusion VLMs on standard multimodal benchmarks.
  • PerceptionDLM achieves region captioning accuracy competitive with AR models while providing up to 3.44× throughput speedup in dense perception scenarios.
  • The newly constructed ParaDLC-Bench and ParaCaption-5.7M dataset provide resources for future research on efficient, parallel localized understanding.

The work demonstrates that diffusion-based multimodal models are a promising direction for efficient fine-grained visual perception, moving beyond the sequential decoding bottleneck of autoregressive approaches. Future work may focus on bridging the reasoning gap through RL-based training and further improving parallel generation quality.
