LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Summary (Overview)

  • Parallel Box Decoding (PBD): Introduces a novel framework that decodes bounding boxes as atomic units in a single parallel step, breaking the token-by-token generation bottleneck of traditional vision-language models (VLMs). This preserves intra-box geometric coherence and enables substantial inference speedup.
  • Unified High-Performance Grounding: LocateAnything achieves state-of-the-art (SOTA) accuracy across diverse visual grounding and detection tasks, including object detection (COCO, LVIS), dense detection (Dense200, VisDrone), GUI grounding (ScreenSpot-Pro), document layout analysis (DocLayNet), and referring expression comprehension (RefCOCOg, HumanRef).
  • Massive-Scale Training Data: Develops a scalable data engine to curate LocateAnything-Data, a large-scale dataset with over 138 million training samples (12M unique images, 785M bounding boxes), significantly increasing data diversity for high-precision localization.
  • Hybrid Inference for Speed-Robustness Trade-off: Proposes three on-demand inference modes: Fast Mode (full parallel decoding for max throughput), Slow Mode (autoregressive for max stability), and a Hybrid Mode that defaults to parallel decoding and falls back to autoregressive generation only when unreliable blocks are detected, balancing speed and accuracy.
  • Significant Speed-Up: Achieves up to 2.5× higher decoding throughput (12.7 Boxes Per Second in Hybrid Mode) compared to competitive methods like Rex-Omni (5.0 BPS), while simultaneously improving localization quality, especially at high IoU thresholds.

Introduction and Theoretical Foundation

Vision-language models (VLMs) are increasingly used as general-purpose backbones for interactive systems, requiring high-quality, low-latency localization of entities from natural language intents. Current VLM-based detection and grounding methods commonly formulate the problem as a generative next-token prediction (NTP) task, serializing 2D bounding box coordinates into a 1D token stream (e.g., as textual digits "1024" or quantized tokens $x_1 \rightarrow y_1 \rightarrow x_2 \rightarrow y_2$). This token-by-token decoding creates a practical inference bottleneck and fails to leverage the strong structured correlation among coordinates $(x_1, y_1, x_2, y_2)$.

While Multi-Token Prediction (MTP) techniques in language modeling offer a path to parallel decoding, they are largely structure-agnostic, grouping tokens into arbitrary chunks. This can lead to learning spurious correlations across bounding-box boundaries, harming accuracy and reliability (see Figure 2 in the paper).

LocateAnything addresses these issues by introducing Parallel Box Decoding (PBD), which aligns MTP blocks with structured geometric units. The core idea is to treat each bounding box (or point) as an atomic unit and learn to predict its full coordinate set in one parallel step during training. This box-aligned training target avoids arbitrary token chunking, improving both localization performance and unlocking the speed benefits of parallel decoding.

Methodology

3.1 Model Architecture and Formulation

LocateAnything builds upon a native-resolution VLM with a Moon-ViT vision encoder and Qwen2.5 language decoder. It abandons standard NTP coordinate generation in favor of a block-based output formulation.

  • Continuous coordinates are normalized to $[0, 1000]$, discretized into tokens, and reorganized into a sequence of blocks $\mathbf{B} = (b_1, b_2, \ldots, b_N)$.
  • The joint probability is formulated as $P(\mathbf{B} \mid \mathcal{Z}, \mathcal{E}) = \prod_{i=1}^{N} P(b_i \mid b_{<i}, \mathcal{Z}, \mathcal{E})$.
  • Each block $b_i$ is an atomic unit of constant length $L = 6$, accommodating a bounding box and structural tokens (e.g., <box> and </box>). Unoccupied positions are padded with a <null> token.
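
The quantization and fixed-length packing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the coordinate token spelling (`<512>`), the rounding rule, and the helper names `quantize`, `pad_block`, and `box_block` are all assumptions made for the example. Only the $[0, 1000]$ range, the block length of 6, and the <box>, </box>, and <null> tokens come from the text.

```python
from typing import List, Sequence, Tuple

BLOCK_LEN = 6    # L = 6, as stated in the text
NUM_BINS = 1000  # coordinates are normalized to [0, 1000]
NULL = "<null>"  # padding token named in the text


def quantize(value: float, extent: float) -> int:
    """Map a pixel coordinate to an integer bin in [0, 1000] (rounding is an assumption)."""
    if extent <= 0:
        raise ValueError("extent must be positive")
    bin_index = round(value / extent * NUM_BINS)
    return max(0, min(NUM_BINS, bin_index))


def pad_block(tokens: Sequence[str]) -> List[str]:
    """Right-pad a token list with <null> so that every block has length BLOCK_LEN."""
    if len(tokens) > BLOCK_LEN:
        raise ValueError("block content exceeds the fixed block length")
    return list(tokens) + [NULL] * (BLOCK_LEN - len(tokens))


def box_block(box: Tuple[float, float, float, float], width: float, height: float) -> List[str]:
    """Build one box block: <box>, four quantized coordinates, </box>."""
    x1, y1, x2, y2 = box
    coords = [
        quantize(x1, width),
        quantize(y1, height),
        quantize(x2, width),
        quantize(y2, height),
    ]
    # Hypothetical token spelling: each coordinate bin becomes a token like "<512>".
    return pad_block(["<box>", *[f"<{c}>" for c in coords], "</box>"])
```

A full box fills all six positions (two structural tokens plus four coordinates), so padding only matters for shorter units such as points or non-box blocks.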

Four functional block types are defined:

  1. Semantic Block: Encodes the linguistic identity.
  2. Box Block: Contains four quantized coordinates for a bounding box.
  3. Negative Block: Indicates the absence of a queried object.
  4. End Block: Signals generation termination.

3.2 Training Design

A dual-formulation training strategy jointly optimizes NTP (to preserve causal reasoning) and block-wise MTP (for box-aligned predictions). A single concatenated input sequence is constructed: $x_{\text{all}} = x_{\text{vis}} \oplus x_{\text{q}} \oplus x_{\text{ntp}} \oplus x_{\text{blk}}$, where $x_{\text{blk}}$ is created by traversing $x_{\text{ntp}}$, splitting it into blocks, retaining the first token per block as context, and masking subsequent tokens.
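
The construction of the block stream from the NTP stream might look like the sketch below. It assumes the NTP answer tokens are already padded to a multiple of the block length, that masked positions are filled with a placeholder token (called `<mask>` here, a hypothetical name), and that the prediction targets for the block stream are the original tokens. None of these details are confirmed by the text, and the function name is illustrative.

```python
from typing import List, Sequence, Tuple

MASK = "<mask>"  # hypothetical placeholder for masked positions


def build_block_stream(
    ntp_tokens: Sequence[str], block_len: int = 6
) -> Tuple[List[str], List[str]]:
    """Split the NTP tokens into blocks; keep each block's first token, mask the rest.

    Returns (blk_inputs, blk_targets), where blk_targets are the original tokens
    that the masked positions are trained to recover (an assumption).
    """
    if block_len < 1:
        raise ValueError("block_len must be at least 1")
    if len(ntp_tokens) % block_len != 0:
        raise ValueError("ntp_tokens must be padded to a multiple of block_len")

    blk_inputs: List[str] = []
    blk_targets: List[str] = []
    for start in range(0, len(ntp_tokens), block_len):
        block = list(ntp_tokens[start:start + block_len])
        # The first token stays visible as context; the remaining ones are masked.
        blk_inputs.extend([block[0]] + [MASK] * (block_len - 1))
        blk_targets.extend(block)
    return blk_inputs, blk_targets
```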

A specialized attention mask (Figure 4) governs information flow:

  • Causal Attention for NTP: The shared context and NTP sequence use a causal mask, isolated from $x_{\text{blk}}$ to prevent leakage.
  • Causal Flow Across Blocks: Attention across different blocks in $x_{\text{blk}}$ is strictly causal (block $i$ can attend to blocks $< i$).
  • Bidirectional Intra-Block Attention: Tokens within the same block share bidirectional attention, allowing the model to capture internal geometric dependencies.
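
The three masking rules above can be combined into a single boolean mask, sketched here in plain Python. Two points are assumptions rather than statements from the text: tokens in the block stream are allowed to see the shared context (vision plus query), and they are blocked from the NTP tokens, which is one reading of "isolated ... to prevent leakage". The function name and argument layout are illustrative.

```python
from typing import List


def build_attention_mask(
    n_ctx: int, n_ntp: int, n_blocks: int, block_len: int = 6
) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Sequence layout: [shared context | NTP tokens | block-stream tokens].
    """
    ntp_start = n_ctx
    blk_start = n_ctx + n_ntp
    total = blk_start + n_blocks * block_len
    mask = [[False] * total for _ in range(total)]

    for q in range(total):
        for k in range(total):
            if q < blk_start:
                # Context and NTP tokens: plain causal attention. Block-stream
                # keys sit at later positions, so they are never visible here.
                mask[q][k] = k <= q
            elif k < ntp_start:
                # Block-stream queries may read the shared context (assumption).
                mask[q][k] = True
            elif k < blk_start:
                # Block-stream queries never read NTP tokens (assumption).
                mask[q][k] = False
            else:
                q_block = (q - blk_start) // block_len
                k_block = (k - blk_start) // block_len
                # Same block: bidirectional. Earlier blocks: visible. Later: hidden.
                mask[q][k] = k_block <= q_block
    return mask
```

In a tensor framework the same pattern would typically be built once per sequence layout and passed to the attention call as a boolean mask.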

The training objective is the sum of cross-entropy losses: $\mathcal{L} = \mathcal{L}_{\text{ntp}} + \mathcal{L}_{\text{mtp}}$.
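
As a small numeric illustration of this objective, the sketch below sums two cross-entropy terms with equal weight, matching the unweighted sum stated above. Averaging each term over its own token positions is an assumption, since the text does not specify the normalization, and the dictionary-based probability format is only for readability.

```python
import math
from typing import Dict, Sequence


def cross_entropy(probs: Sequence[Dict[str, float]], targets: Sequence[str]) -> float:
    """Mean negative log-likelihood of the target tokens.

    probs[i] maps each candidate token to its predicted probability at position i.
    """
    if len(probs) != len(targets) or not targets:
        raise ValueError("probs and targets must be non-empty and equally long")
    return -sum(math.log(dist[tok]) for dist, tok in zip(probs, targets)) / len(targets)


def joint_loss(
    ntp_probs: Sequence[Dict[str, float]],
    ntp_targets: Sequence[str],
    mtp_probs: Sequence[Dict[str, float]],
    mtp_targets: Sequence[str],
) -> float:
    """L = L_ntp + L_mtp, with both terms weighted equally."""
    return cross_entropy(ntp_probs, ntp_targets) + cross_entropy(mtp_probs, mtp_targets)
```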

3.3 On-Demand Inference Modes

Parallel decoding can face Format Irregularity (malformed syntax at category boundaries) and Spatial Ambiguity (blurred boundaries in dense layouts). These are resolved by an NTP fallback mechanism triggered upon detecting violations.

Three inference modes are proposed:

  1. Slow Mode: Standard NTP (autoregressive) for maximum stability.
  2. Fast Mode: MTP predicting box-aligned blocks for maximum throughput.
  3. Hybrid Mode (Default): Uses MTP by default but switches to NTP for problematic blocks, preserving speed while maintaining robust outputs.
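
The Hybrid Mode control flow can be sketched as a loop that tries a parallel block first and re-decodes that block autoregressively only when a check fails. Everything model-specific is passed in as a callable, and all names here are hypothetical. The example check `looks_like_box_block` covers only a simple format test (structural tokens in place, coordinates in range and ordered), and the coordinate token spelling `<123>` is an assumption. The paper's actual violation detector, including how spatial ambiguity is identified, is not described in this summary.

```python
from typing import Callable, List, Sequence, Tuple

Block = List[str]


def looks_like_box_block(block: Block) -> bool:
    """Hypothetical format check for one 6-token box block."""
    if len(block) != 6 or block[0] != "<box>" or block[-1] != "</box>":
        return False
    try:
        x1, y1, x2, y2 = [int(tok.strip("<>")) for tok in block[1:5]]
    except ValueError:
        return False  # a structural or <null> token sits where a coordinate should be
    return 0 <= x1 <= x2 <= 1000 and 0 <= y1 <= y2 <= 1000


def hybrid_decode(
    predict_block_parallel: Callable[[Sequence[Block]], Block],
    predict_block_ntp: Callable[[Sequence[Block]], Block],
    is_reliable: Callable[[Block], bool],
    is_end: Callable[[Block], bool],
    max_blocks: int = 256,
) -> Tuple[List[Block], int]:
    """Decode block by block; fall back to NTP only for blocks that fail the check.

    Returns the decoded blocks and the number of fallbacks that were triggered.
    """
    blocks: List[Block] = []
    fallbacks = 0
    for _ in range(max_blocks):
        block = predict_block_parallel(blocks)  # fast path: one parallel step
        if not is_reliable(block):
            block = predict_block_ntp(blocks)   # slow path: token-by-token
            fallbacks += 1
        blocks.append(block)
        if is_end(block):
            break
    return blocks, fallbacks
```

In this sketch, Fast Mode corresponds to an `is_reliable` that always returns True, and Slow Mode to one that always returns False.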

3.4 LocateAnything-Data

A large-scale, multi-domain dataset was curated to train a general-purpose model.

  • Scale: 12M unique images, 138M natural language queries, 785M bounding boxes.
  • Task Categories & Distribution:
    • General Object Detection (66.9% of queries)
    • GUI Grounding (16.5%)
    • Natural Language Referring (7.3%)
    • Text Localization / OCR (3.6%)
    • Document/Scene Layout Grounding (3.5%)
    • Point-based Localization (2.2%)
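
For a sense of scale, the percentages above can be turned into approximate per-task query counts. The figures are only as precise as the rounded totals quoted in this section, and the per-query and per-image averages are derived here rather than reported by the source.

```python
TOTAL_QUERIES_M = 138.0  # millions of natural language queries
TOTAL_BOXES_M = 785.0    # millions of bounding boxes
TOTAL_IMAGES_M = 12.0    # millions of unique images

QUERY_SHARE_PCT = {
    "General Object Detection": 66.9,
    "GUI Grounding": 16.5,
    "Natural Language Referring": 7.3,
    "Text Localization / OCR": 3.6,
    "Document/Scene Layout Grounding": 3.5,
    "Point-based Localization": 2.2,
}


def approx_queries_m(total_m: float, share_pct: dict) -> dict:
    """Convert percentage shares into approximate query counts (in millions)."""
    return {task: total_m * pct / 100.0 for task, pct in share_pct.items()}


counts = approx_queries_m(TOTAL_QUERIES_M, QUERY_SHARE_PCT)
boxes_per_query = TOTAL_BOXES_M / TOTAL_QUERIES_M  # derived average, not a reported figure
boxes_per_image = TOTAL_BOXES_M / TOTAL_IMAGES_M   # derived average, not a reported figure
```

Under these rounded totals, general object detection accounts for roughly 92M queries, and the dataset averages about 5.7 boxes per query.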

A multi-target grounding data engine (Figure 9) was designed to synthesize annotations from both labeled detection data and unlabeled images using models like Qwen3-VL and Molmo, followed by post-verification.

Empirical Validation / Results

4.2 Main Results

Extensive evaluations show LocateAnything advances the speed-accuracy frontier.

Object Detection (Tables 1 & 2): The 3B model outperforms Rex-Omni-3B, improving mean F1 on LVIS (+3.8%) and COCO (+1.8%). It shows strong generalization on dense benchmarks (Dense200: 58.7 mean F1; VisDrone: 39.9 mean F1).

GUI Grounding (Table 3): Achieves SOTA mean F1 of 60.3 on ScreenSpot-Pro, surpassing both generalist VLMs (Qwen3-VL-30B) and specialized GUI models (GUI-Owl-32B).

Document Layout & OCR (Table 4): Establishes new standards on DocLayNet (76.8 mean F1) and M6Doc (70.1 mean F1), outperforming Rex-Omni by large margins.

Referring Expression Comprehension (Table 5): Achieves highly competitive results, e.g., 78.7 mean F1 on HumanRef.

Decoding Speed (Table 1): Under the default Hybrid Mode, achieves 12.7 BPS, over 10× faster than textual-based Qwen3-VL (1.1 BPS) and 2.5× faster than quantized-based Rex-Omni (5.0 BPS).
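
The quoted speedup factors follow directly from the throughput numbers, as the short calculation below shows. The latency helper assumes throughput stays constant regardless of how many boxes are decoded, which is a simplification made only for illustration.

```python
def speedup(bps: float, baseline_bps: float) -> float:
    """Ratio of decoding throughputs, both given in boxes per second (BPS)."""
    if baseline_bps <= 0:
        raise ValueError("baseline_bps must be positive")
    return bps / baseline_bps


def seconds_for(n_boxes: int, bps: float) -> float:
    """Time to decode n_boxes at a constant throughput (simplifying assumption)."""
    if bps <= 0:
        raise ValueError("bps must be positive")
    return n_boxes / bps


vs_rex_omni = speedup(12.7, 5.0)  # Hybrid Mode vs. Rex-Omni
vs_qwen3_vl = speedup(12.7, 1.1)  # Hybrid Mode vs. Qwen3-VL
```

Here `vs_rex_omni` evaluates to about 2.54 and `vs_qwen3_vl` to about 11.5, consistent with the "2.5×" and "over 10×" figures.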

4.3 Ablation Study (Table 6 & Figure 7)

Ablations on COCO isolate the benefits of PBD from large-scale data.

  • Coordinate Representation (Table 6a): PBD (Slow Mode) achieves the highest F1-score (52.1), indicating that the box-aligned formulation provides stronger supervision.
  • MTP Formulation (Table 6b): Structure-agnostic MTP methods (SDLM, Block Diffusion) suffer from accuracy drops and limited speed gains. PBD (Fast Mode) dramatically outpaces them in throughput (16.9 BPS) while improving F1 (49.6).
  • Decoding Mode & Losses (Table 6c): Joint training ($\mathcal{L}_{\text{ntp}} + \mathcal{L}_{\text{blk}}$) pushes the Slow Mode upper bound from 50.1 to 52.1 F1. Hybrid Mode preserves most speed gains (13.2 BPS) while achieving robust accuracy (51.6 F1).
  • Box Output Order (Figure 7 left): X-Y Corner Order (sorting by left-top corner) yields the highest F1-score.
  • Throughput Scaling (Figure 7 right): As the number of target boxes increases, NTP methods suffer severe latency growth, while PBD exhibits little increase in generation time, achieving a 2× to 6× speedup.
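
The box output order from the ablation can be illustrated with a one-line sort. The summary only says that boxes are sorted by their left-top corner, so treating x1 as the primary key and y1 as the tie-breaker is an assumption about what "X-Y Corner Order" means, and the function name is illustrative.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def sort_xy_corner(boxes: Sequence[Box]) -> List[Box]:
    """Order target boxes by left-top corner: x1 first, then y1 (assumed key order)."""
    return sorted(boxes, key=lambda b: (b[0], b[1]))
```

A fixed, deterministic order like this gives the model a single canonical target sequence for each image instead of many equivalent permutations.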

Additional Results

  • Pointing Tasks (Table 11): LocateAnything also achieves SOTA results on point-based localization across benchmarks (COCO: 83.9 F1@Point; Dense200: 87.6 F1@Point).
  • Backbone Generalization (Table 13): Applying PBD to Qwen3-VL 4B also improves its speed-accuracy trade-off (F1: 50.8→52.0, BPS: 2.8→9.4), showing the method is not backbone-specific.

Theoretical and Practical Implications

  • Paradigm Shift in VLM-based Localization: Demonstrates that aligning the training and decoding structure with the inherent geometry of the task (bounding boxes as atomic units) is superior to treating coordinates as a generic token stream. This reconciles high-throughput parallel decoding with reliable, structured output.
  • Enabling Real-Time Applications: The significant inference speedup (up to 2.5×) makes high-quality VLM-based grounding feasible for latency-sensitive applications like on-device robotics, embodied agents, and interactive systems.
  • Data Scaling for Precision: Shows that curating large-scale, diverse training data (LocateAnything-Data) is complementary to architectural innovations, crucial for achieving high-precision localization across domains.
  • Flexible Deployment: The proposed on-demand inference modes (Fast, Hybrid, Slow) provide a practical mechanism to balance throughput and robustness based on application requirements (e.g., real-time vs. offline high-precision labeling).

Conclusion

LocateAnything presents a unified framework that reformulates visual grounding and detection in VLMs via Parallel Box Decoding. By treating geometric elements as atomic units, it aligns training with the coupled nature of spatial coordinates. Combined with massive-scale training data and a flexible hybrid inference mechanism, LocateAnything delivers SOTA accuracy across diverse tasks and up to a 2.5× speedup, providing a practical and scalable route for real-time visual perception in embodied AI.

Limitation & Future Work: The model is primarily trained with supervised fine-tuning. Future work includes using reinforcement learning to further optimize the block-level decoding policy, reduce fallback frequency, and improve robustness in hard cases.