WildDet3D: Scaling Promptable 3D Detection in the Wild - Summary
Summary (Overview)
- Unified Geometry-Aware Architecture: Introduces WildDet3D, a single model that natively accepts text, 2D point, and 2D box prompts for open-vocabulary monocular 3D detection and can incorporate optional depth signals at inference time via a novel dual-encoder and depth fusion design.
- Large-Scale Open-World Dataset: Presents WildDet3D-Data, a massive dataset with over 1M images and 13.5K object categories (a 138× increase over Omni3D), constructed via a multi-model candidate generation and human/VLM verification pipeline.
- State-of-the-Art Performance: Achieves new SOTA across multiple benchmarks: 22.6/24.8 AP³ᴰ (text/box) on WildDet3D-Bench, 34.2/36.4 AP³ᴰ on Omni3D, and 40.3/48.9 ODS zero-shot on Argoverse 2 and ScanNet.
- Substantial Gains from Depth: Demonstrates that incorporating depth cues at inference yields an average gain of +20.7 AP, highlighting the model's ability to leverage richer geometric information.
- Real-World Versatility: Showcases practical applications including an iPhone app, AR integration with Meta Quest, robotic manipulation, and a VLM-agent for 3D referring expression localization.
Introduction and Theoretical Foundation
Understanding objects in 3D from a single image is fundamental for spatial intelligence in robotics, autonomous driving, and AR/VR. A practical, general-purpose monocular 3D detector must satisfy three key requirements not fully addressed by prior work:
- Generalize in the wild to long-tailed, open-ended categories.
- Support multiple prompt modalities (text, 2D points, 2D boxes) within a unified architecture for flexible interaction.
- Leverage extra geometric cues (e.g., sparse LiDAR, partial depth) when available to improve 3D localization.
Existing methods specialize in either text-based querying (open-vocabulary) or fixed geometric inputs (oracle prompts), lacking a flexible, unified framework. Furthermore, progress is hampered by limited datasets covering narrow categories in controlled environments. This work addresses both the model and data bottlenecks.
The theoretical motivation centers on the choice of input modality for generalized 3D detection (Figure 2). Pure LiDAR lacks reliable height and full 6-DoF rotation cues. Pure RGB suffers from inherent scale and occlusion ambiguity. The proposed approach combines RGB with optional depth, retaining dense visual semantics for open-vocabulary recognition while using depth to resolve metric scale ambiguity when available.
Methodology
2.1 Dual-Vision Encoder
The architecture decouples semantic and geometric feature extraction to avoid trade-offs.
- Image Encoder: A ViT-H with SimpleFPN neck, initialized from SAM 3, provides high-resolution semantic features. The first 28 of 32 blocks are frozen during training.
- RGBD Encoder: A DINOv2 ViT-L/14 accepts 4-channel RGBD input (depth optional). It produces depth latents via a ConvStack neck. The first 21 of 24 blocks are frozen.
- Depth Fusion Module: Injects depth latents into image features via a ControlNet-style residual design, F'_img = F_img + Conv_zero(LN(Up(F_depth))), where Up(·) is bilinear interpolation of the depth latents to the image-feature resolution, LN is LayerNorm, and the convolution is zero-initialized so the pretrained image features are preserved at the start of training.
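The key property of this fusion design, that zero-initializing the projection makes the fused features exactly equal the pretrained image features at training start, can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the shapes, the 1×1 projection standing in for the real convolution, and the channel conventions are all assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (last dim).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def bilinear_resize(x, out_h, out_w):
    # x: (H, W, C) -> (out_h, out_w, C), align_corners=False convention.
    h, w, _ = x.shape
    ys = (np.arange(out_h) + 0.5) * h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]
    wx = np.clip(xs - x0, 0, 1)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

class DepthFusion:
    """ControlNet-style residual injection with a zero-initialized projection."""
    def __init__(self, c_depth, c_img):
        # Zero init => the residual branch outputs zeros at step 0,
        # so fused features equal the pretrained image features exactly.
        self.w = np.zeros((c_depth, c_img))
        self.b = np.zeros(c_img)

    def __call__(self, img_feat, depth_lat):
        h, w, _ = img_feat.shape
        d = bilinear_resize(depth_lat, h, w)   # match spatial resolution
        d = layer_norm(d)                      # LN before projection
        return img_feat + d @ self.w + self.b  # residual injection
```

Zero-initialization is what makes the optional-depth design safe: training starts from the pure-RGB behavior and only gradually learns to use the depth branch.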
2.2 Promptable Detector
Unifies four prompt types using encoders adapted from SAM 3:
- Text Prompt: Category name encoded by a CLIP-style tokenizer and Transformer.
- Point Prompt: 2D pixel coordinates with positive/negative label.
- Box Prompt: 2D bounding box.
- Exemplar Prompt: 2D box used as a visual exemplar to detect other similar objects.

Training uses per-prompt batching: all images containing a given unique text category are aggregated into a single batch entry.
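The per-prompt batching described above can be sketched as a simple grouping step. This is an illustrative sketch only; the annotation field names (`category`, `image_id`) are assumptions, not the paper's data schema.

```python
from collections import defaultdict

def build_text_prompt_batches(annotations):
    """Aggregate all images containing a given text category into one
    batch entry (per-prompt batching). Field names are assumed."""
    groups = defaultdict(set)
    for ann in annotations:
        groups[ann["category"]].add(ann["image_id"])
    # One batch entry per unique category, listing every image to sample from.
    return {cat: sorted(imgs) for cat, imgs in groups.items()}
```

Grouping by prompt rather than by image means every entry in a batch shares the same text query, which keeps the prompt encoder's work uniform within a step.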
2.3 Deeply-Supervised 3D Detection Head
Lifts 2D queries to 3D boxes with deep supervision (loss applied at every decoder layer).
- Multi-Source Information Aggregation: Query features are enriched sequentially with:
- Camera Ray Features: Encoded using 8th-order real spherical harmonics of the camera ray direction.
- Depth Latents: Fused via cross-attention.
- 3D Box Parameterization: Predicts a 12D box encoding covering the object's 3D position, dimensions, and orientation.
- Unambiguous Rotation Normalization: Applied to both ground truth and predictions to resolve the rotation-dimension ambiguity: (1) dimension ordering enforces a canonical ordering of the footprint dimensions, and (2) yaw folding restricts the yaw angle to a half-turn interval (e.g., [-π/2, π/2)).
- 3D Confidence Prediction: Predicts a 3D quality score trained against a soft IoU-based target. The final detection score combines the 2D and 3D confidences.
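The rotation/dimension normalization can be sketched concretely. The exact axis conventions are assumptions here: under a common convention, swapping the two footprint dimensions while turning the yaw by 90° describes the same physical box, so a canonical form orders the dimensions and folds the yaw into a half-turn interval.

```python
import numpy as np

def normalize_box(w, h, l, yaw):
    """Canonicalize (dims, yaw) so each 3D box has a unique encoding.

    Sketch under assumed conventions: (1) order the footprint dims with
    w >= l, compensating with a 90-degree yaw turn; (2) fold yaw into
    [-pi/2, pi/2), since a half-turn maps a box onto itself.
    """
    if w < l:
        w, l = l, w
        yaw += np.pi / 2  # account for the swapped footprint axes
    yaw = (yaw + np.pi / 2) % np.pi - np.pi / 2  # fold into [-pi/2, pi/2)
    return w, h, l, yaw
```

Applying the same normalization to ground truth and predictions means the L1 regression loss never penalizes a prediction that is geometrically identical to the target but written in a different (dims, yaw) form.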
2.4 Multi-Task Learning
The overall training loss aggregates 3D detection and auxiliary losses:
- 3D Regression Loss: L1 loss on the encoded 3D parameters.
- 3D Confidence Loss: IoU-aware focal BCE loss.
- Auxiliary Geometry Loss: metric-depth L1, SILog, affine-invariant point-map losses, confidence-mask BCE, and camera-ray MSE.
- Auxiliary 2D Detection Loss: IoU-aware classification, box regression, per-category presence, and One-to-Many (O2M) matching (each ground-truth box matched to its top-k predictions).
Ignore-Region Suppression: During training, negative classification loss is suppressed for predictions with 2D IoU > 0.5 against an object marked as IGNORE (lacks valid 3D GT), aligning training with evaluation.
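Taken together, the training objective presumably has a weighted-sum form over the four terms above. This is a sketch: the weighting coefficients λ (and the paper's exact symbols) are assumptions.

```latex
\mathcal{L} \;=\; \lambda_{\mathrm{3D}}\,\mathcal{L}_{\mathrm{3D}}
 \;+\; \lambda_{\mathrm{conf}}\,\mathcal{L}_{\mathrm{conf}}
 \;+\; \lambda_{\mathrm{geo}}\,\mathcal{L}_{\mathrm{geo}}
 \;+\; \lambda_{\mathrm{2D}}\,\mathcal{L}_{\mathrm{2D}}
```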
3 WildDet3D-Data Construction Pipeline
A three-stage pipeline creates large-scale 3D annotations from existing 2D datasets (COCO, LVIS, Objects365, V3Det).
- Candidate Generation: Five complementary methods generate candidate 3D boxes per 2D annotation:
- 3D-MOOD, DetAny3D, SAM-3D, RANSAC-PCA, LabelAny3D.
- Each candidate undergoes translation and rotation optimization.
- Rule-Based Filtering: Applies geometric criteria (edge contact, occlusion, size ratio), VLM-based depicted object filter, and LLM-estimated size/geometry filters.
- Candidate Selection: Two parallel paths:
- Human Selection: Crowdsourced annotators select the best candidate and rate its quality (good_fit, acceptable, unacceptable).
- VLM Selection: A fine-tuned Molmo2 model scores candidates on six perceptual criteria (category, scale, translation, shape, rotation, tilt); the highest-scoring candidate is kept if its total score exceeds 10.
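The VLM selection rule reduces to a simple max-and-threshold over per-criterion scores. A sketch only: the dictionary schema and default threshold below are assumptions based on the description above.

```python
def select_candidate(candidates,
                     criteria=("category", "scale", "translation",
                               "shape", "rotation", "tilt"),
                     min_total=10):
    """Keep the highest-scoring 3D box candidate only if its total VLM
    score exceeds the threshold; otherwise reject the annotation.
    Candidates are dicts of per-criterion scores (schema assumed)."""
    if not candidates:
        return None
    totals = [sum(c[k] for k in criteria) for c in candidates]
    best = max(range(len(candidates)), key=totals.__getitem__)
    return candidates[best] if totals[best] > min_total else None
```

Thresholding the total (rather than each criterion) lets a candidate compensate for one weak dimension, which matches the monotonically falling rejection rates by VLM score in Table 2.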
Table 1: WildDet3D-Data Statistics
| Split | Source | Images | Annotations | Categories | Type | Scene | Max Depth |
|---|---|---|---|---|---|---|---|
| Existing Datasets | |||||||
| Omni3D [6] | KITTI, nuScenes, etc. | 234K | 3M+ | 98 | Human | Driving, Furniture | 67 m |
| WildDet3D-Data | |||||||
| Train (Human) | COCO, LVIS, Obj365, V3Det | 102,979 | 229,934 | 12,064 | Human | In-the-wild | |
| Train (Synthetic) | COCO, LVIS, Obj365, V3Det | 896,004 | 3,483,292 | 11,896 | VLM filter | In-the-wild | |
| Val | COCO, LVIS, Obj365 | 2,470 | 9,256 | 785 | Human | In-the-wild | |
| Test | COCO, LVIS, Obj365 | 2,433 | 5,596 | 633 | Human | In-the-wild | |
| Total | - | 1,003,886 | 3,728,078 | 13,499 | Human + VLM | In-the-wild | 81 m |
Table 2: Pipeline Validation on Human-Annotated Train Set
| Model | Selection Share | Rejection Rate |
|---|---|---|
| SAM-3D | 40.4% | 17.3% |
| RANSAC-PCA | 28.2% | 12.5% |
| DetAny3D | 14.5% | 42.9% |
| LabelAny3D | 13.0% | 21.3% |
| 3D-MOOD | 3.8% | 25.7% |
| Overall | — | 22.0% |

| VLM Score | Rejection Rate | n |
|---|---|---|
| < 7 | 71.9% | 1,992 |
| 7 | 67.4% | 13,670 |
| 8 | 45.3% | 18,665 |
| 9 | 36.1% | 83,882 |
| 10 | 16.7% | 310,329 |
| 11 | 9.2% | 52,684 |

VLM Top-2 Coverage: 73.4%
Empirical Validation / Results
4.1 Experimental Setup
- Datasets: WildDet3D-Bench (proposed, 700+ categories), Omni3D, zero-shot on Argoverse 2 (AV2) & ScanNet, real depth on Stereo4D.
- Metrics:
AP³ᴰ(3D IoU matching for Omni3D, center-distance matching for in-the-wild),ODS(Open Detection Score) for zero-shot. - Training: Three stages on 32 GPUs (12 epochs each on Omni3D, then Omni3D+Others+WildDet3D-Data, final fine-tuning).
4.2 In-the-Wild Evaluation on WildDet3D-Bench
Table 3: WildDet3D-Bench Evaluation Results
| Method | Training Data | AP_rare | AP_common | AP_frequent | AP³ᴰ |
|---|---|---|---|---|---|
| Text Prompt | |||||
| 3D-MOOD [58] | Omni3D | 2.4 | 2.1 | 2.6 | 2.3 |
| WildDet3D | Omni3D | 9.0 | 6.5 | 5.2 | 6.8 |
| WildDet3D w/ depth | Omni3D | 23.0 | 21.5 | 16.1 | 20.7 |
| WildDet3D | Omni3D, Others, WildDet3D-Data | 28.3 | 21.6 | 18.7 | 22.6 |
| WildDet3D w/ depth | Omni3D, Others, WildDet3D-Data | 47.4 | 40.7 | 37.2 | 41.6 |
| Box Prompt | |||||
| OVMono3D-LIFT [59] | Omni3D | 7.4 | 8.8 | 5.1 | 7.7 |
| DetAny3D [63] | Omni3D, Others | 9.9 | 7.4 | 6.3 | 7.8 |
| WildDet3D | Omni3D | 12.0 | 7.9 | 5.3 | 8.4 |
| WildDet3D w/ depth | Omni3D | 26.4 | 24.4 | 19.6 | 23.9 |
| WildDet3D | Omni3D, Others, WildDet3D-Data | 30.0 | 24.2 | 20.3 | 24.8 |
| WildDet3D w/ depth | Omni3D, Others, WildDet3D-Data | 53.7 | 46.1 | 42.5 | 47.2 |
Key Findings:
- WildDet3D trained on Omni3D alone outperforms 3D-MOOD by 3.0× (6.8 vs. 2.3 AP).
- Adding WildDet3D-Data yields a 9.8× improvement over 3D-MOOD (22.6 vs. 2.3 AP).
- Providing ground-truth depth at test time gives massive gains (+19.0 AP for the full model).
- Improvements are consistent across all category frequency groups.
4.3 Results on Omni3D
Table 4: Omni3D Evaluation Results
| Method | KITTI | nuScenes | SUNRGBD | Hypersim | ARKitScenes | Objectron | AP³ᴰ |
|---|---|---|---|---|---|---|---|
| Text Prompt | |||||||
| 3D-MOOD Swin-B [58] | 31.4 | 35.8 | 23.8 | 9.1 | 53.9 | 67.9 | 30.0 |
| WildDet3D | 37.0 | 31.7 | 38.9 | 16.5 | 64.6 | 60.5 | 34.2 |
| WildDet3D w/ depth | 36.1 | 32.0 | 51.1 | 26.6 | 73.3 | 68.3 | 41.6 |
| Box Prompt | |||||||
| DetAny3D [63] | 38.7 | 37.6 | 46.1 | 16.0 | 50.6 | 56.8 | 34.4 |
| WildDet3D | 44.3 | 35.3 | 43.1 | 17.3 | 66.6 | 60.8 | 36.4 |
| WildDet3D w/ depth | 42.8 | 35.9 | 58.7 | 30.4 | 76.6 | 68.5 | 45.8 |
Key Findings:
- WildDet3D surpasses prior SOTA in both text (+5.8 AP over 3D-MOOD) and box (+2.0 AP over DetAny3D) prompt settings.
- Achieves superior results with 10× fewer training epochs (12 vs. 120 for 3D-MOOD).
- Depth input provides substantial gains, especially on indoor datasets with depth sensors.
4.4 Zero-Shot Evaluation
Table 5: Zero-Shot Evaluation on Argoverse 2 [54] and ScanNet [12]

| Method | AV2 AP ↑ | AV2 mATE ↓ | AV2 mASE ↓ | AV2 mAOE ↓ | AV2 ODS ↑ | ScanNet AP ↑ | ScanNet mATE ↓ | ScanNet mASE ↓ |
|---|---|---|---|---|---|---|---|---|
| 3D-MOOD Swin-B [58] | 14.7 | 0.755 | 0.680 |