WildDet3D: Scaling Promptable 3D Detection in the Wild - Summary
Summary (Overview)
- Unified Geometry-Aware Architecture: Introduces WildDet3D, a single model that natively accepts text, 2D point, and 2D box prompts for open-vocabulary monocular 3D detection and can incorporate optional depth signals at inference time via a novel dual-encoder and depth fusion design.
- Large-Scale Open-World Dataset: Presents WildDet3D-Data, a massive dataset with over 1M images and 13.5K object categories (a 138× increase over Omni3D), constructed via a multi-model candidate generation and human/VLM verification pipeline.
- State-of-the-Art Performance: Achieves new SOTA across multiple benchmarks: 22.6/24.8 AP³ᴰ (text/box) on WildDet3D-Bench, 34.2/36.4 AP³ᴰ on Omni3D, and 40.3/48.9 ODS zero-shot on Argoverse 2 and ScanNet.
- Substantial Gains from Depth: Demonstrates that incorporating depth cues at inference yields an average gain of +20.7 AP, highlighting the model's ability to leverage richer geometric information.
- Real-World Versatility: Showcases practical applications including an iPhone app, AR integration with Meta Quest, robotic manipulation, and a VLM-agent for 3D referring expression localization.
Introduction and Theoretical Foundation
Understanding objects in 3D from a single image is fundamental for spatial intelligence in robotics, autonomous driving, and AR/VR. A practical, general-purpose monocular 3D detector must satisfy three key requirements not fully addressed by prior work:
- Generalize in the wild to long-tailed, open-ended categories.
- Support multiple prompt modalities (text, 2D points, 2D boxes) within a unified architecture for flexible interaction.
- Leverage extra geometric cues (e.g., sparse LiDAR, partial depth) when available to improve 3D localization.
Existing methods specialize in either text-based querying (open-vocabulary) or fixed geometric inputs (oracle prompts), lacking a flexible, unified framework. Furthermore, progress is hampered by limited datasets covering narrow categories in controlled environments. This work addresses both the model and data bottlenecks.
The theoretical motivation centers on the choice of input modality for generalized 3D detection (Figure 2). Pure LiDAR lacks reliable height and full 6-DoF rotation cues. Pure RGB suffers from inherent scale and occlusion ambiguity. The proposed approach combines RGB with optional depth, retaining dense visual semantics for open-vocabulary recognition while using depth to resolve metric scale ambiguity when available.
Methodology
2.1 Dual-Vision Encoder
The architecture decouples semantic and geometric feature extraction to avoid trade-offs.
- Image Encoder: A ViT-H with SimpleFPN neck, initialized from SAM 3, provides high-resolution semantic features. The first 28 of 32 blocks are frozen during training.
- RGBD Encoder: A DINOv2 ViT-L/14 accepts 4-channel RGBD input (depth optional). It produces depth latents via a ConvStack neck. The first 21 of 24 blocks are frozen.
- Depth Fusion Module: Injects depth latents into image features via a ControlNet-style residual design, F'_img = F_img + Conv_zero(LN(Up(F_depth))), where Up(·) is bilinear interpolation of the depth latents to the image-feature resolution, LN is LayerNorm, and the convolution is zero-initialized so the pretrained image features are preserved at the start of training.
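The key property of this fusion design, that zero-initializing the projection makes the fused features exactly equal the pretrained image features at training start, can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the shapes, the 1×1 projection standing in for the real convolution, and the channel conventions are all assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (last dim).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def bilinear_resize(x, out_h, out_w):
    # x: (H, W, C) -> (out_h, out_w, C), align_corners=False convention.
    h, w, _ = x.shape
    ys = (np.arange(out_h) + 0.5) * h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]
    wx = np.clip(xs - x0, 0, 1)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

class DepthFusion:
    """ControlNet-style residual injection with a zero-initialized projection."""
    def __init__(self, c_depth, c_img):
        # Zero init => the residual branch outputs zeros at step 0,
        # so fused features equal the pretrained image features exactly.
        self.w = np.zeros((c_depth, c_img))
        self.b = np.zeros(c_img)

    def __call__(self, img_feat, depth_lat):
        h, w, _ = img_feat.shape
        d = bilinear_resize(depth_lat, h, w)   # match spatial resolution
        d = layer_norm(d)                      # LN before projection
        return img_feat + d @ self.w + self.b  # residual injection
```

Zero-initialization is what makes the optional-depth design safe: training starts from the pure-RGB behavior and only gradually learns to use the depth branch.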
2.2 Promptable Detector
Unifies four prompt types using encoders adapted from SAM 3:
- Text Prompt: Category name encoded by a CLIP-style tokenizer and Transformer.
- Point Prompt: 2D pixel coordinates with positive/negative label.
- Box Prompt: 2D bounding box.
- Exemplar Prompt: 2D box used as a visual exemplar to detect other similar objects.

Training uses per-prompt batching: all images containing a given unique text category are aggregated into a single batch entry.
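The per-prompt batching described above can be sketched as a simple grouping step. This is an illustrative sketch only; the annotation field names (`category`, `image_id`) are assumptions, not the paper's data schema.

```python
from collections import defaultdict

def build_text_prompt_batches(annotations):
    """Aggregate all images containing a given text category into one
    batch entry (per-prompt batching). Field names are assumed."""
    groups = defaultdict(set)
    for ann in annotations:
        groups[ann["category"]].add(ann["image_id"])
    # One batch entry per unique category, listing every image to sample from.
    return {cat: sorted(imgs) for cat, imgs in groups.items()}
```

Grouping by prompt rather than by image means every entry in a batch shares the same text query, which keeps the prompt encoder's work uniform within a step.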
2.3 Deeply-Supervised 3D Detection Head
Lifts 2D queries to 3D boxes with deep supervision (loss applied at every decoder layer).
- Multi-Source Information Aggregation: Query features are enriched sequentially with:
- Camera Ray Features: Encoded using 8th-order real spherical harmonics of the camera ray direction.
- Depth Latents: Fused via cross-attention.
- 3D Box Parameterization: Predicts a 12D box encoding covering the object's 3D position, dimensions, and orientation.
- Unambiguous Rotation Normalization: Applied to both ground truth and predictions to resolve the rotation-dimension ambiguity: (1) dimension ordering enforces a canonical ordering of the footprint dimensions, and (2) yaw folding restricts the yaw angle to a half-turn interval (e.g., [-π/2, π/2)).
- 3D Confidence Prediction: Predicts a 3D quality score trained against a soft IoU-based target. The final detection score combines the 2D and 3D confidences.
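The rotation/dimension normalization can be sketched concretely. The exact axis conventions are assumptions here: under a common convention, swapping the two footprint dimensions while turning the yaw by 90° describes the same physical box, so a canonical form orders the dimensions and folds the yaw into a half-turn interval.

```python
import numpy as np

def normalize_box(w, h, l, yaw):
    """Canonicalize (dims, yaw) so each 3D box has a unique encoding.

    Sketch under assumed conventions: (1) order the footprint dims with
    w >= l, compensating with a 90-degree yaw turn; (2) fold yaw into
    [-pi/2, pi/2), since a half-turn maps a box onto itself.
    """
    if w < l:
        w, l = l, w
        yaw += np.pi / 2  # account for the swapped footprint axes
    yaw = (yaw + np.pi / 2) % np.pi - np.pi / 2  # fold into [-pi/2, pi/2)
    return w, h, l, yaw
```

Applying the same normalization to ground truth and predictions means the L1 regression loss never penalizes a prediction that is geometrically identical to the target but written in a different (dims, yaw) form.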
2.4 Multi-Task Learning
The overall training loss aggregates 3D detection and auxiliary losses:
- 3D Regression Loss: L1 loss on the encoded 3D parameters.
- 3D Confidence Loss: IoU-aware focal BCE loss.
- Auxiliary Geometry Loss: metric-depth L1, SILog, affine-invariant point-map losses, confidence-mask BCE, and camera-ray MSE.
- Auxiliary 2D Detection Loss: IoU-aware classification, box regression, per-category presence, and One-to-Many (O2M) matching (each ground-truth box matched to its top-k predictions).
Ignore-Region Suppression: During training, negative classification loss is suppressed for predictions with 2D IoU > 0.5 against an object marked as IGNORE (lacks valid 3D GT), aligning training with evaluation.
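Taken together, the training objective presumably has a weighted-sum form over the four terms above. This is a sketch: the weighting coefficients λ (and the paper's exact symbols) are assumptions.

```latex
\mathcal{L} \;=\; \lambda_{\mathrm{3D}}\,\mathcal{L}_{\mathrm{3D}}
 \;+\; \lambda_{\mathrm{conf}}\,\mathcal{L}_{\mathrm{conf}}
 \;+\; \lambda_{\mathrm{geo}}\,\mathcal{L}_{\mathrm{geo}}
 \;+\; \lambda_{\mathrm{2D}}\,\mathcal{L}_{\mathrm{2D}}
```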
3 WildDet3D-Data Construction Pipeline
A three-stage pipeline creates large-scale 3D annotations from existing 2D datasets (COCO, LVIS, Objects365, V3Det).
- Candidate Generation: Five complementary methods generate candidate 3D boxes per 2D annotation:
- 3D-MOOD, DetAny3D, SAM-3D, RANSAC-PCA, LabelAny3D.
- Each candidate undergoes translation and rotation optimization.
- Rule-Based Filtering: Applies geometric criteria (edge contact, occlusion, size ratio), VLM-based depicted object filter, and LLM-estimated size/geometry filters.
- Candidate Selection: Two parallel paths:
- Human Selection: Crowdsourced annotators select the best candidate and rate its quality (good_fit, acceptable, unacceptable).
- VLM Selection: A fine-tuned Molmo2 model scores candidates on six perceptual criteria (category, scale, translation, shape, rotation, tilt); the highest-scoring candidate is kept if its total score exceeds 10.
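The VLM selection rule reduces to a simple max-and-threshold over per-criterion scores. A sketch only: the dictionary schema and default threshold below are assumptions based on the description above.

```python
def select_candidate(candidates,
                     criteria=("category", "scale", "translation",
                               "shape", "rotation", "tilt"),
                     min_total=10):
    """Keep the highest-scoring 3D box candidate only if its total VLM
    score exceeds the threshold; otherwise reject the annotation.
    Candidates are dicts of per-criterion scores (schema assumed)."""
    if not candidates:
        return None
    totals = [sum(c[k] for k in criteria) for c in candidates]
    best = max(range(len(candidates)), key=totals.__getitem__)
    return candidates[best] if totals[best] > min_total else None
```

Thresholding the total (rather than each criterion) lets a candidate compensate for one weak dimension, which matches the monotonically falling rejection rates by VLM score in Table 2.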
Table 1: WildDet3D-Data Statistics
| Split | Source | Images | Annotations | Categories | Type | Scene | Max Depth |
|---|---|---|---|---|---|---|---|
| Existing Datasets | |||||||
| Omni3D [6] | KITTI, nuScenes, etc. | 234K | 3M+ | 98 | Human | Driving, Furniture | 67 m |
| WildDet3D-Data | |||||||
| Train (Human) | COCO, LVIS, Obj365, V3Det | 102,979 | 229,934 | 12,064 | Human | In-the-wild | |
| Train (Synthetic) | COCO, LVIS, Obj365, V3Det | 896,004 | 3,483,292 | 11,896 | VLM filter | In-the-wild | |
| Val | COCO, LVIS, Obj365 | 2,470 | 9,256 | 785 | Human | In-the-wild | |
| Test | COCO, LVIS, Obj365 | 2,433 | 5,596 | 633 | Human | In-the-wild | |
| Total | - | 1,003,886 | 3,728,078 | 13,499 | Human + VLM | In-the-wild | 81 m |
Table 2: Pipeline Validation on Human-Annotated Train Set
| Model | Selection Share | Rejection Rate |
|---|---|---|
| SAM-3D | 40.4% | 17.3% |
| RANSAC-PCA | 28.2% | 12.5% |
| DetAny3D | 14.5% | 42.9% |
| LabelAny3D | 13.0% | 21.3% |
| 3D-MOOD | 3.8% | 25.7% |
| Overall | — | 22.0% |

| VLM Score | Rejection Rate | n |
|---|---|---|
| < 7 | 71.9% | 1,992 |
| 7 | 67.4% | 13,670 |
| 8 | 45.3% | 18,665 |
| 9 | 36.1% | 83,882 |
| 10 | 16.7% | 310,329 |
| 11 | 9.2% | 52,684 |

VLM Top-2 Coverage: 73.4%
Empirical Validation / Results
4.1 Experimental Setup
- Datasets: WildDet3D-Bench (proposed, 700+ categories), Omni3D, zero-shot on Argoverse 2 (AV2) & ScanNet, real depth on Stereo4D.
- Metrics:
AP³ᴰ(3D IoU matching for Omni3D, center-distance matching for in-the-wild),ODS(Open Detection Score) for zero-shot. - Training: Three stages on 32 GPUs (12 epochs each on Omni3D, then Omni3D+Others+WildDet3D-Data, final fine-tuning).
4.2 In-the-Wild Evaluation on WildDet3D-Bench
Table 3: WildDet3D-Bench Evaluation Results
| Method | Training Data | AP_rare | AP_common | AP_frequent | AP³ᴰ |
|---|---|---|---|---|---|
| Text Prompt | |||||
| 3D-MOOD [58] | Omni3D | 2.4 | 2.1 | 2.6 | 2.3 |
| WildDet3D | Omni3D | 9.0 | 6.5 | 5.2 | 6.8 |
| WildDet3D w/ depth | Omni3D | 23.0 | 21.5 | 16.1 | 20.7 |
| WildDet3D | Omni3D, Others, WildDet3D-Data | 28.3 | 21.6 | 18.7 | 22.6 |
| WildDet3D w/ depth | Omni3D, Others, WildDet3D-Data | 47.4 | 40.7 | 37.2 | 41.6 |
| Box Prompt | |||||
| OVMono3D-LIFT [59] | Omni3D | 7.4 | 8.8 | 5.1 | 7.7 |
| DetAny3D [63] | Omni3D, Others | 9.9 | 7.4 | 6.3 | 7.8 |
| WildDet3D | Omni3D | 12.0 | 7.9 | 5.3 | 8.4 |
| WildDet3D w/ depth | Omni3D | 26.4 | 24.4 | 19.6 | 23.9 |
| WildDet3D | Omni3D, Others, WildDet3D-Data | 30.0 | 24.2 | 20.3 | 24.8 |
| WildDet3D w/ depth | Omni3D, Others, WildDet3D-Data | 53.7 | 46.1 | 42.5 | 47.2 |
Key Findings:
- WildDet3D trained on Omni3D alone outperforms 3D-MOOD by 3.0× (6.8 vs. 2.3 AP).
- Adding WildDet3D-Data yields a 9.8× improvement over 3D-MOOD (22.6 vs. 2.3 AP).
- Providing ground-truth depth at test time gives massive gains (+19.0 AP for the full model).
- Improvements are consistent across all category frequency groups.
4.3 Results on Omni3D
Table 4: Omni3D Evaluation Results
| Method | KITTI | nuScenes | SUNRGBD | Hypersim | ARKitScenes | Objectron | AP³ᴰ |
|---|---|---|---|---|---|---|---|
| Text Prompt | |||||||
| 3D-MOOD Swin-B [58] | 31.4 | 35.8 | 23.8 | 9.1 | 53.9 | 67.9 | 30.0 |
| WildDet3D | 37.0 | 31.7 | 38.9 | 16.5 | 64.6 | 60.5 | 34.2 |
| WildDet3D w/ depth | 36.1 | 32.0 | 51.1 | 26.6 | 73.3 | 68.3 | 41.6 |
| Box Prompt | |||||||
| DetAny3D [63] | 38.7 | 37.6 | 46.1 | 16.0 | 50.6 | 56.8 | 34.4 |
| WildDet3D | 44.3 | 35.3 | 43.1 | 17.3 | 66.6 | 60.8 | 36.4 |
| WildDet3D w/ depth | 42.8 | 35.9 | 58.7 | 30.4 | 76.6 | 68.5 | 45.8 |
Key Findings:
- WildDet3D surpasses prior SOTA in both text (+5.8 AP over 3D-MOOD) and box (+2.0 AP over DetAny3D) prompt settings.
- Achieves superior results with 10× fewer training epochs (12 vs. 120 for 3D-MOOD).
- Depth input provides substantial gains, especially on indoor datasets with depth sensors.
4.4 Zero-Shot Evaluation
Table 5: Zero-Shot Evaluation on Argoverse 2 [54] and ScanNet [12]

| Method | AV2 AP ↑ | AV2 mATE ↓ | AV2 mASE ↓ | AV2 mAOE ↓ | AV2 ODS ↑ | ScanNet AP ↑ | ScanNet mATE ↓ | ScanNet mASE ↓ |
|---|---|---|---|---|---|---|---|---|
| 3D-MOOD Swin-B [58] | 14.7 | 0.755 | 0.680 |