Visual Summary | PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Summary (Overview)

Parallel Region Perception Framework: Proposes PerceptionDLM, the first multimodal diffusion language model (DLM) capable of generating captions for multiple image regions simultaneously within a single denoising process, overcoming the sequential bottleneck of autoregressive (AR) models.
Strong Diffusion Baseline: Introduces PerceptionDLM-Base, an 8B-parameter discrete diffusion VLM that outperforms prior diffusion VLMs (e.g., LLaDA-V, SDAR-VL, Dream-VL) on 15/16 standard multimodal benchmarks.
Efficiency Breakthrough: Achieves up to 3.44× throughput speedup over AR models (e.g., GAR-8B) under constant workload with 4 masks per image, and near-linear tokens-per-second scaling with increasing number of regions.
New Benchmark: Constructs ParaDLC-Bench, a multi-region localized captioning benchmark with 2,345 manually verified questions that jointly evaluate caption quality and inference efficiency, including cross-region interference detection.
Competitive Accuracy + Speed: On ParaDLC-Bench, PerceptionDLM achieves 62.4% average accuracy (nearly double that of LLaDA-V at 35.2%) while reducing total inference time to 276 seconds vs. 479 seconds for GAR-8B.

Introduction and Theoretical Foundation

Background

Visual perception in multimodal large language models (MLLMs) increasingly requires fine-grained, localized understanding: models must accurately describe multiple specific regions within a single image. Existing MLLMs rely on autoregressive (AR) decoding, which generates descriptions sequentially per region and per token. As the number of queried regions grows, inference cost and latency increase linearly, making dense perception unscalable.

Motivation for Diffusion Language Models

Large diffusion language models (DLMs) offer a promising alternative: they use masked denoising generation, which naturally supports non-autoregressive and parallel token generation. Prior multimodal DLMs either lack strong perception capabilities or do not exploit the parallelism for concurrent multi-region perception. This work asks: Can we design a multimodal diffusion language model that preserves strong perception quality while unlocking practical parallelism for region-conditioned captioning?

Theoretical Basis

Discrete Diffusion Language Modeling (e.g., LLaDA) formulates text generation as a generative Markov process. The forward process progressively corrupts clean tokens $x_0$ into $x_t$ by replacing tokens with a special [MASK] state. The reverse process learns a neural network $p_\theta(x_0|x_t)$ to denoise. Training optimizes a reweighted variational lower bound that simplifies to predicting masked tokens:

\mathcal{L}_{\text{DLM}} = -\mathbb{E}_{t,x_0,x_t} \left[ \frac{1}{t} \sum_{i=1}^N \mathbf{1}_{[x^i_t = \text{[MASK]}]} \log p_\theta(x^i_0|x_t) \right] \tag{1}

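where $N$ is the sequence length, $t \in (0, 1]$ is the masking ratio (each token is independently replaced by [MASK] with probability $t$), and the indicator $\mathbf{1}[\cdot]$ restricts the sum to masked positions.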
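
As a rough illustration of Equation (1), the NumPy sketch below masks each token independently with probability $t$ and computes a single-sample estimate of the $1/t$-weighted negative log-likelihood over masked positions. The names `MASK_ID`, `forward_mask`, and `dlm_loss`, and the toy uniform predictor standing in for $p_\theta$, are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the [MASK] state


def forward_mask(x0, t, rng):
    """Forward process: mask each token independently with probability t."""
    is_masked = rng.random(x0.shape) < t
    return np.where(is_masked, MASK_ID, x0), is_masked


def dlm_loss(log_probs, x0, is_masked, t):
    """Single-sample estimate of Eq. (1).

    log_probs: (N, V) array holding log p_theta(. | x_t) for every position.
    """
    token_ll = log_probs[np.arange(len(x0)), x0]  # log p_theta(x0^i | x_t)
    return -(token_ll * is_masked).sum() / t      # only masked positions count


rng = np.random.default_rng(0)
vocab_size, seq_len, t = 10, 8, 0.5
x0 = rng.integers(0, vocab_size, size=seq_len)
xt, is_masked = forward_mask(x0, t, rng)

# Toy "model": a uniform distribution over the vocabulary at every position.
uniform_log_probs = np.full((seq_len, vocab_size), -np.log(vocab_size))
loss = dlm_loss(uniform_log_probs, x0, is_masked, t)
```

With the uniform predictor, the loss reduces to the number of masked tokens times $\log V / t$, which makes the role of the indicator and the $1/t$ weight easy to check by hand.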

Methodology

PerceptionDLM-Base: A Stronger Diffusion VLM Baseline

Architecture: Consists of a pretrained SigLIP-2 vision encoder, a two-layer MLP connector (GELU activation), and a DLM decoder (LLaDA-8B). Visual features $Z_v = \Phi_v(X_v)$ are projected via $\Phi_c$ to obtain continuous embeddings $H_v = \Phi_c(Z_v)$, which are concatenated with text embeddings of instruction $X_q$ and response $X_a$.

Training objective (visual instruction tuning):

\mathcal{L}_{\text{PerceptionDLM-Base}} = -\mathbb{E}_{(X_v,X_q,X_a),t,x_t} \left[ \frac{1}{t} \sum_{i \in \mathcal{M}_a} \log p_\theta(x^i_0 | x_t, H_v, X_q) \right] \tag{2}

where $\mathcal{M}_a$ denotes masked token indices within the target response $X_a$ only. The diffusion forward process corrupts only response tokens; image and instruction tokens remain uncorrupted as conditions.
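
The response-only corruption described above can be sketched as follows. This is a minimal illustration under stated assumptions: integer ids stand in for the whole condition prefix (in the model the visual part is the continuous embedding $H_v$), and `MASK_ID`, `mask_response_only`, and `response_start` are hypothetical names.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the [MASK] state


def mask_response_only(tokens, response_start, t, rng):
    """Mask positions >= response_start with probability t; keep the prefix clean."""
    tokens = np.asarray(tokens)
    in_response = np.arange(len(tokens)) >= response_start
    is_masked = in_response & (rng.random(len(tokens)) < t)
    return np.where(is_masked, MASK_ID, tokens), is_masked


rng = np.random.default_rng(0)
tokens = np.arange(100, 112)  # 8 condition tokens followed by 4 response tokens
xt, masked_idx = mask_response_only(tokens, response_start=8, t=0.75, rng=rng)
```

The boolean array `masked_idx` plays the role of $\mathcal{M}_a$: the loss in Equation (2) would be summed over exactly these positions.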

Dynamic Resolution: To handle high-resolution images, each input image is dynamically partitioned into $512\times512$ tiles according to its aspect ratio. If the image yields more than one tile, an additional thumbnail is appended. Each tile undergoes pixel unshuffle, reducing its token count to one-quarter, and is then encoded and concatenated with the others.
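
To make the token budget concrete, here is a small sketch of how the visual token count could be estimated under this scheme. Several details are assumptions not stated in the summary: a patch size of 16, a 2×2 pixel unshuffle (which gives the one-quarter reduction), a cap of 6 tiles, and a grid-selection rule (`choose_grid`) that simply picks the tile grid closest to the image's aspect ratio.

```python
def choose_grid(width, height, max_tiles=6):
    """Pick (cols, rows) closest to the image aspect ratio (hypothetical rule)."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    # Prefer the closest aspect ratio; break ties with the fewest tiles.
    return min(candidates, key=lambda g: (abs(g[0] / g[1] - target), g[0] * g[1]))


def visual_token_count(width, height, tile=512, patch=16, unshuffle=2, max_tiles=6):
    """Estimate visual tokens: tiles (+ thumbnail if tiled), quartered by unshuffle."""
    cols, rows = choose_grid(width, height, max_tiles)
    n_tiles = cols * rows
    if n_tiles > 1:
        n_tiles += 1  # extra thumbnail when the image is split into several tiles
    per_side = tile // patch // unshuffle  # 512 / 16 / 2 = 16 tokens per side
    return n_tiles * per_side * per_side


single = visual_token_count(512, 512)    # one tile, no thumbnail
wide = visual_token_count(1024, 512)     # 2x1 grid plus thumbnail
```

Under these assumptions a square 512×512 image costs 256 tokens, while a 1024×512 image costs three times as much because of the two tiles plus the thumbnail.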

Training Stages (4-stage):

Vision-Language Alignment: Lightweight alignment using Bee-Training-Data-Stage1, primarily training the connector while freezing backbone.
Middle-stage Training: Large-scale training on Bee-Training-Data-Stage2. Two strategies explored: full-parameter training (updating diffusion backbone and vision encoder) vs. partially frozen (vision encoder fixed).
Instruction Tuning: SFT with 22M samples from LLaVA-OneVision-1.5-Instruct-Data (VQA, reasoning, OCR, grounding).
High-Quality SFT Refinement: Fine-tuning with Honey-Data-15M, enriched with dual-level chain-of-thought annotations.

Parallel Region Perception Architecture (PerceptionDLM)

Built on PerceptionDLM-Base, the parallel architecture introduces three components (Figure 2):

RoI-aligned Feature Replay: For each region mask, localized visual features are extracted from the vision encoder and projected into the language embedding space as placeholder tokens. Each placeholder is expanded into RoI feature tokens (default $4\times4$ grid).
Region Prompting: Each region $R_i$ is associated with a learnable embedding $e_i$ (continuous visual prompt). These embeddings are broadcast and fused with visual tokens from corresponding masked regions, enabling the model to distinguish multiple concurrent targets.
Structured Attention Masking: To prevent interference across regions during parallel denoising, attention is restricted for tokens of region $R_i$ to:
- Global visual tokens
- Shared textual prompt tokens
- RoI feature tokens of region $R_i$
- Other tokens within the same region-specific caption span
Attention to RoI features and caption tokens of other regions is masked out, creating a block-wise attention pattern that enforces region-level independence while preserving shared global context.
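
The block-wise pattern described above can be sketched as a boolean attention mask. This is an illustrative construction, not the paper's implementation: the token layout (shared tokens, then RoI tokens per region, then caption spans per region) and the choice that shared tokens attend only to shared tokens are assumptions, and `build_region_attention_mask` is a hypothetical name. The defaults of 16 RoI tokens and 32 caption tokens follow the $4\times4$ grid and the per-mask sequence length quoted in this summary.

```python
import numpy as np


def build_region_attention_mask(n_global, n_prompt, n_regions, n_roi=16, n_cap=32):
    """Return an (L, L) boolean matrix; entry [q, k] is True if q may attend to k.

    Assumed layout: [global visual | prompt | RoI_1..RoI_K | caption_1..caption_K].
    """
    n_shared = n_global + n_prompt
    # Owner id per position: -1 for shared tokens, i for tokens of region i.
    owner = np.concatenate([
        np.full(n_shared, -1),
        np.repeat(np.arange(n_regions), n_roi),
        np.repeat(np.arange(n_regions), n_cap),
    ])
    key_is_shared = owner[None, :] == -1           # everyone sees shared context
    same_region = owner[:, None] == owner[None, :]  # plus tokens of its own region
    return key_is_shared | same_region


# Small example: 4 global + 2 prompt tokens, 3 regions, 2 RoI and 3 caption tokens each.
mask = build_region_attention_mask(4, 2, 3, n_roi=2, n_cap=3)
```

In this toy layout, position 12 is the first caption token of region 0: it can attend to the shared prefix, to RoI tokens 6 and 7, and to caption tokens 12 to 14, but not to any token owned by regions 1 or 2.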

Training: The number of region prompts per image is set to 6. The model is trained with the same loss as Equation (2), with all parameters trainable, using the AdamW optimizer with batch size 256 and a learning rate of $4\times10^{-5}$ with linear warmup (first 3% of steps) followed by cosine decay. At inference, each region caption has sequence length 32 and is decoded with 32 denoising steps.
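
For reference, a learning-rate schedule of this shape (linear warmup over 3% of steps, then cosine decay) can be written in a few lines. This is a generic sketch: the decay floor of zero and the function name `lr_at` are assumptions, since the summary does not give the minimum learning rate or the total number of steps.

```python
import math


def lr_at(step, total_steps, peak_lr=4e-5, warmup_frac=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [lr_at(s, 1000) for s in (0, 10, 29, 500, 1000)]
```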

ParaDLC-Bench (Parallel Detailed Localized Captioning Benchmark)

Extends DLC-Bench to multi-mask scenarios with reference-free evaluation using an LLM judge. Two-step evaluation:

Model generates parallel descriptions for multiple masked regions within one image.
LLM judge (GPT-5.2) assesses descriptions via predefined positive/negative questions.

Question categories:

Positive questions: Focus on unique attributes of the target mask. A point is awarded for accurate inclusion, and a penalty is applied for factual errors.
Negative & Interference questions: Beyond typical absent attribute checks, specifically examine attribute entanglement — whether the model hallucinates features from other concurrent masked objects into the current target's description. Point awarded only if the model correctly identifies the target object and avoids cross-region hallucination.

Benchmark includes 2,345 manually verified questions with rigorous human cross-validation. Verified robust across different judge LLMs (Qwen3.5-27B, Gemini-3.1-Pro).

Training Data Engine (ParaCaption-5.7M)

Constructs single-image, multi-mask caption data from two sources:

SA-1B dataset: Occluded and part-level masks are filtered out, GAR-8B generates initial descriptions, an LLM extracts the core category from each description, and SAM3 re-predicts the masks, which are then filtered by IoU.
COCONut dataset: GAR-8B generates the captions, and Qwen3-8B verifies that each caption semantically matches the ground-truth category.

Post-processing: length restriction, anti-repetition/hallucination filtering. Final: 334k images (3.4M masks) from COCONut, 83k images (2.3M masks) from SA-1B.
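
The IoU filtering step in the SA-1B pipeline can be illustrated with a short sketch. It assumes the IoU is computed between each original mask and its re-predicted counterpart and that masks below a threshold are dropped. The threshold of 0.5 and the names `mask_iou` and `keep_consistent` are hypothetical, since the summary does not specify them.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two binary masks of the same shape."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def keep_consistent(orig_masks, repredicted_masks, threshold=0.5):
    """Indices of masks whose re-prediction agrees with the original mask."""
    return [
        i
        for i, (m, p) in enumerate(zip(orig_masks, repredicted_masks))
        if mask_iou(m, p) >= threshold
    ]


a = np.zeros((4, 4), dtype=bool)
a[:2, :] = True   # rows 0-1
b = np.zeros((4, 4), dtype=bool)
b[1:3, :] = True  # rows 1-2, overlapping `a` on one row

kept = keep_consistent([a, a], [a, b])
```

Here the first pair is identical (IoU of 1) and is kept, while the second pair overlaps on only one of three covered rows (IoU of 1/3) and is dropped.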

Empirical Validation / Results

Performance on Multimodal Benchmarks (PerceptionDLM-Base)

Table 1 (partial – full table in paper):

Benchmark	PerceptionDLM-Base (8B)	LLaDA-V (8B)	SDAR-VL (8B)	Dream-VL (7B)	Qwen2.5-VL (7B)	InternVL3 (8B)
MMStar	63.7	60.1	–	59.9	63.9	68.2
SeedBench	78.9	74.8	64.2	75.5	77.0⋆	77.1⋆
MMMU	47.2	48.6	48.6	53.0	51.3⋆	57.3⋆
MathVista	65.5	52.4†	–	62.5	68.2	71.6
DocVQA	89.9	83.9	56.1	88.3	94.9	92.7
MMVP	82.0	76.7⋆	–	66.5	73.3⋆	80.0
HallusionBench	58.4	50.9⋆	–	44.4	51.9	49.9

Outperforms LLaDA-V on 15/16 benchmarks.
Excels in fine-grained visual perception (MMVP, BLINK, RealWorldQA, CV-Bench-2D) vs. AR models Qwen2.5-VL and InternVL3.
A gap remains on complex reasoning (MMMU, MathVista); the paper notes that diffusion models' arbitrary-order parallel decoding limits reasoning, so autoregressive-order decoding was used for math evaluations.

Region Captioning Benchmarks

Table 2: ParaDLC-Bench and DLC-Bench results

Method	Size	ParaDLC-Bench Pos (%)	ParaDLC-Bench Neg (%)	ParaDLC-Bench Avg (%)	TPF	Time (s)	DLC-Bench Avg (%)
GPT-5.2	–	38.0	71.0	55.2	–	–	39.4
Gemini-2.5-Pro	–	39.7	73.3	57.5	–	–	47.9
PixelRefer	7B	40.8	78.7	60.5	1	718	68.3
DAM	3B	48.1	87.2	69.2	1	326	67.3
GAR	8B	49.0	87.6	69.5	1	479	67.8
LLaDA-V	8B	24.1	46.3	35.2	1∗	3241	24.6
SDAR-VL	8B	30.2	28.8	31.3	1∗	945	28.8
Dream-VL	7B	29.7	28.6	30.4	1∗	446	24.7
PerceptionDLM (Ours)	8B	42.3	82.4	62.4	2.9	276	53.1

ParaDLC-Bench Avg: 62.4% – nearly doubling prior diffusion VLMs (LLaDA-V 35.2%, SDAR-VL 31.3%).
TPF (Tokens Per Forward): 2.9 vs. 1.0 for all baselines, indicating effective parallelism.
Total inference time: 276 seconds (vs. 479s for GAR, 718s for PixelRefer) — substantial speed advantage.
On DLC-Bench (single mask), still outperforms all diffusion baselines with 53.1% avg.

Efficiency Analysis

Throughput vs. Region Quantity (Figure 1b): PerceptionDLM achieves near-linear TPS growth with stable per-image latency (~2.9s). GAR-8B shows constant TPS and linearly increasing latency.
Throughput Scaling at Constant Workload (Figure 1c): With 4 masks per image, fully parallelized PerceptionDLM achieves 3.44× throughput improvement over sequential processing (TPF=1), reducing single-image latency from 10.04s to 2.92s.
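
The reported speedup follows directly from the two latencies quoted above. The short calculation below reproduces it and, as an illustration only, extrapolates sequential latency under the assumption that it grows linearly with the number of regions. The helper `sequential_latency` and the 8-region figure are hypothetical and are not results from the paper.

```python
sequential_latency_s = 10.04  # TPF = 1, 4 masks per image (from the text)
parallel_latency_s = 2.92     # fully parallel, 4 masks per image (from the text)

speedup = sequential_latency_s / parallel_latency_s  # about 3.44x
per_region_s = sequential_latency_s / 4              # about 2.51 s per region


def sequential_latency(n_regions, per_region=per_region_s):
    """Latency if regions are processed one after another (linear assumption)."""
    return n_regions * per_region


eight_regions_s = sequential_latency(8)  # hypothetical extrapolation
```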

Theoretical and Practical Implications

Theoretical Implications: This work demonstrates that discrete diffusion language models can be effectively adapted for structured parallel generation tasks like multi-region perception. The combination of region prompting and structured attention masking provides a generalizable framework for mapping multiple inputs to multiple outputs in a single diffusion process, beyond captioning (e.g., referring expression comprehension, visual grounding).
Practical Implications: For real-world applications requiring dense scene understanding (autonomous driving, robotics, accessibility), PerceptionDLM offers significant inference speed improvements while keeping caption quality close to that of AR region-specific models. The ability to process multiple regions in parallel reduces both latency and computational overhead, making it feasible to scale up perception density. The open-source release of models, code, and datasets (ParaDLC-Bench, ParaCaption-5.7M) lowers the barrier for further research.
Limitations: The reasoning gap on complex mathematical benchmarks suggests that arbitrary-order parallel decoding inherently limits chain-of-thought reasoning. The paper identifies this as a key future direction, proposing reinforcement learning (RL) as a potential solution (inspired by DeepSeek-R1). Additionally, current accuracy on ParaDLC-Bench (62.4% avg) is still below AR region-specific models (~69%), indicating room for improvement in multi-region disentanglement.

Conclusion

PerceptionDLM is a diffusion-based multimodal model that enables parallel region perception, generating multiple region captions in a single denoising process rather than sequentially. Built upon a strong diffusion VLM baseline (PerceptionDLM-Base), it introduces region-aware prompt embeddings and structured attention masking to enforce region-level independence while sharing global visual context. Extensive evaluations show:

PerceptionDLM-Base sets a new state-of-the-art among open-source diffusion VLMs on standard multimodal benchmarks.
PerceptionDLM achieves region captioning accuracy competitive with AR models while providing up to 3.44× throughput speedup in dense perception scenarios.
The newly constructed ParaDLC-Bench and ParaCaption-5.7M dataset provide resources for future research on efficient, parallel localized understanding.

The work demonstrates that diffusion-based multimodal models are a promising direction for efficient fine-grained visual perception, moving beyond the sequential decoding bottleneck of autoregressive approaches. Future work may focus on bridging the reasoning gap through RL-based training and further improving parallel generation quality.