Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence - Summary

Summary (Overview)

  • Key Contribution: Introduces Holi-Spatial, the first fully automated pipeline to convert raw video streams into large-scale, high-quality 3D spatial annotations (geometry, semantics, QA pairs) without human intervention.
  • Core Dataset: Constructs Holi-Spatial-4M, a large-scale multimodal dataset containing 12K optimized 3D Gaussian Splatting (3DGS) scenes, 1.3M 2D masks, 320K 3D bounding boxes and captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs.
  • Superior Performance: The pipeline significantly outperforms existing feed-forward and per-scene optimized methods on benchmarks like ScanNet, ScanNet++, and DL3DV, e.g., improving 3D detection AP50 by roughly 64 points on ScanNet++ (70.05 vs. 6.23 for the best prior method).
  • VLM Enhancement: Fine-tuning Vision-Language Models (e.g., Qwen3-VL) on the curated dataset leads to substantial gains in downstream tasks, including a 14.48-point AP50 improvement on ScanNet++ 3D grounding and a 7.9% accuracy increase on MMSI-Bench for spatial reasoning.
  • Scalable Solution: Addresses the data scarcity bottleneck in spatial intelligence by providing a principled, scalable method to generate diverse, open-vocabulary annotations from abundant web videos, moving beyond limited manually annotated 3D scans.

Introduction and Theoretical Foundation

Spatial intelligence is crucial for enabling Large Multimodal Models (LMMs) to understand and interact with the 3D world, with applications in robotics, AR/VR, and scene editing. However, progress is hindered by a severe scarcity of large-scale, fine-grained 3D data. Existing approaches typically generate Question-Answer (QA) pairs from a small number of manually annotated 3D datasets (e.g., ScanNet) or apply feed-forward models to single images. These methods suffer from limited scalability, domain gaps, and narrow semantic coverage (e.g., only 50 classes in ScanNet).

Theoretical Foundation: The work is motivated by the observation that recent advances in AI tools (e.g., VLMs, segmentation models, 3D reconstruction) have matured sufficiently. By systematically composing these tools, it is possible to build an automated spatial annotation engine that can outperform human annotations, enabling a positive data flywheel. The core idea is to reframe spatial data curation as a scalable, non-human pipeline that converts abundant raw videos into annotated 3D scenes.

Key Insight: Holi-Spatial unifies a broad spectrum of spatial tasks—3D reconstruction, novel view synthesis, depth rendering, 2D/3D instance segmentation, captioning, grounding, and spatial QA—into a single, automated framework, as summarized in Table 1.

Table 1: Overview of Pipeline Capabilities. Holi-Spatial serves as a unified framework supporting diverse spatial tasks without relying on 3D priors.

The table compares supported inputs (images vs. point clouds) and outputs across method families:

| Category | Methods |
| --- | --- |
| 2D-VLM Methods | SAM3, SA2VA |
| 3D-VLM Methods | SpatialLM, LLaVA-3D, SceneScript |
| 3DGS-based Understanding Methods | M3-Spatial, LangSplat |
| Ours | Holi-Spatial |

Methodology

The Holi-Spatial pipeline operates in three progressive stages (Figure 3):

1. Geometric Optimization

  • Objective: Distill high-fidelity 3D structure from raw videos.
  • Process:
    1. Use Structure-from-Motion (SfM) to recover camera parameters.
    2. Initialize a dense point cloud using monocular depth priors (Depth-Anything-V3).
    3. Optimize a 3D Gaussian Splatting (3DGS) scene representation under geometric supervision (multi-view depth consistency) to sharpen structure and suppress floaters/artifacts.
  • Output: Clean, consistent 3D scene geometry with high-quality rendered depth maps.
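The depth-supervision idea in this stage can be sketched minimally: monocular depth priors are only defined up to an affine transform, so one aligns the prior to the rendered 3DGS depth with a least-squares scale and shift before penalizing the residual. The following numpy sketch is illustrative only; the paper's actual loss and implementation are not specified here, and the function names are ours.

```python
import numpy as np

def align_depth(prior, rendered):
    """Fit scale s and shift t so that s * prior + t ~= rendered (least squares)."""
    A = np.stack([prior.ravel(), np.ones(prior.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, rendered.ravel(), rcond=None)
    return s * prior + t

def depth_consistency_loss(rendered, prior):
    """Mean absolute depth residual after affine alignment of the prior."""
    return np.abs(align_depth(prior, rendered) - rendered).mean()
```

In a real pipeline this residual would be one term in the 3DGS optimization objective, evaluated across views so that multi-view agreement suppresses floaters.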

2. Image-level Perception

  • Objective: Extract spatially consistent object proposals from 2D keyframes.
  • Process:
    1. Uniformly sample keyframes $I = \{I_1, \dots, I_T\}$ from the video.
    2. For each frame $I_t$, use a VLM (Gemini3-Pro) with a dynamic class-label memory $M_t = M_{t-1} \cup \text{Extract}(I_t)$ to generate an open-vocabulary caption, ensuring semantic consistency across views.
    3. Use SAM3, guided by prompts from $M_t$, to perform open-vocabulary instance segmentation, producing predictions $O_t = \{(M_k, s_k)\}_{k=1}^{N}$, where $M_k$ is a binary mask and $s_k$ is a confidence score.
    4. 2D-to-3D Lifting: Unproject mask pixels into 3D using the refined depth map $D_t$ and camera intrinsics $\mathbf{K}$: $\mathbf{P} = D_t(\mathbf{u}) \cdot \mathbf{K}^{-1} \tilde{\mathbf{u}}$, where $\mathbf{u} = (u, v)$ is a pixel and $\tilde{\mathbf{u}} = [u, v, 1]^\top$ is its homogeneous coordinate. A geometry-aware filtering strategy (mask erosion + mesh-guided depth filtering) is applied to suppress boundary floaters and noise (Figure 4).
    5. Estimate an initial 3D Oriented Bounding Box (OBB) from the filtered point cloud.
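The lifting equation in step 4 is straightforward to implement. The numpy sketch below unprojects masked pixels into camera coordinates, and substitutes a crude depth-percentile filter for the paper's mask-erosion and mesh-guided filtering; all function names and the filter itself are illustrative assumptions.

```python
import numpy as np

def unproject_mask(depth, mask, K):
    """Lift masked pixels to 3D camera coordinates: P = D_t(u) * K^-1 * u_tilde."""
    v, u = np.nonzero(mask)                                 # pixel rows/cols inside the mask
    uv1 = np.stack([u, v, np.ones_like(u)]).astype(float)   # 3xN homogeneous pixels
    rays = np.linalg.inv(K) @ uv1                           # back-projected rays (z = 1)
    return (rays * depth[v, u]).T                           # Nx3 points, scaled by depth

def drop_depth_outliers(points, lo=5.0, hi=95.0):
    """Crude stand-in for the geometry-aware filtering: keep points whose
    depth lies within the [lo, hi] percentile band."""
    z = points[:, 2]
    z_lo, z_hi = np.percentile(z, [lo, hi])
    return points[(z >= z_lo) & (z <= z_hi)]
```

For example, the principal-point pixel unprojects to $(0, 0, d)$, and off-center pixels fan out along their viewing rays in proportion to depth.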

3. Scene-level Refinement

  • Objective: Perform coarse-to-fine refinement to generate final, high-quality 3D annotations.
  • Process: Starting from initial proposals $P_{\text{init}} = \{(B_i, c_i, s_i)\}_{i=1}^{M}$ (box, category, confidence):
    1. Multi-View Merge & Post-Process: Merge redundant 3D proposals across views if they share the same category and their 3D IoU exceeds a threshold $\tau_{\text{merge}}$ (set to 0.2), i.e., if $c_i = c_j \land \text{IoU}_{3D}(B_i, B_j) > \tau_{\text{merge}}$. Update attributes (confidence, source image index) to retain the most reliable observation. Apply floor-alignment to OBBs for gravity consistency (Figure 5). Output: $P_{\text{merged}}$.
    2. Confidence-Based Filtering & Refinement: Apply a tri-level decision rule to each instance $p_k$ in $P_{\text{merged}}$ based on its confidence score $s_k$: $$\text{Action}(p_k) = \begin{cases} \text{keep}, & s_k \geq \tau_{\text{high}} \\ \text{discard}, & s_k < \tau_{\text{low}} \\ \text{verify}, & \tau_{\text{low}} \leq s_k < \tau_{\text{high}} \end{cases}$$ where $\tau_{\text{high}} = 0.9$ and $\tau_{\text{low}} = 0.8$. Instances in the "verify" band are reassessed by a VLM-based agent equipped with zoom-in and SAM3 re-segmentation tools.
    3. Annotation Generation: For each final instance pkPfinalp_k \in P_{\text{final}}, retrieve its optimal source image and use a VLM (Qwen3-VL-30B) to generate a detailed caption. Procedurally synthesize Spatial QA pairs (1.25M total) covering camera-centric (rotation, movement) and object-centric (distance, direction, size) reasoning tasks.
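The merge and tri-level filtering logic above can be sketched as follows. For brevity the sketch uses axis-aligned boxes for the 3D IoU (the paper uses oriented bounding boxes, whose IoU computation is more involved); the data layout and function names are ours.

```python
import numpy as np

def iou3d_aabb(a, b):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    mins = np.maximum(a[:3], b[:3])
    maxs = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(maxs - mins, 0.0, None))  # overlap volume (0 if disjoint)
    vol = lambda c: np.prod(c[3:] - c[:3])
    return inter / (vol(a) + vol(b) - inter)

def merge_proposals(props, tau_merge=0.2):
    """Greedy merge: keep the highest-confidence box per same-category overlap group."""
    props = sorted(props, key=lambda p: -p[2])  # props: list of (box, category, score)
    kept = []
    for box, cat, score in props:
        if not any(c == cat and iou3d_aabb(box, b) > tau_merge for b, c, _ in kept):
            kept.append((box, cat, score))
    return kept

def triage(score, tau_low=0.8, tau_high=0.9):
    """Tri-level decision rule on an instance's confidence score."""
    if score >= tau_high:
        return "keep"
    if score < tau_low:
        return "discard"
    return "verify"
```

Instances routed to "verify" would then be handed to the VLM-based agent for zoom-in inspection and re-segmentation before a final keep/discard decision.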

Empirical Validation / Results

1. Framework Evaluation on 3D Benchmarks

The pipeline was evaluated on ScanNet, ScanNet++, and DL3DV-10K. Results (Table 2) show Holi-Spatial is the only framework generating high-quality predictions across all tasks (depth, 2D segmentation, 3D detection).

Table 2: Quantitative results of 3D spatial understanding. Bold indicates best results. ↑ : higher is better.

(Values shown are on ScanNet++; "–" marks tasks a method does not support.)

| Method | Depth F1 (↑) | 2D Seg. IoU (↑) | 3D Det. AP50 (↑) |
| --- | --- | --- | --- |
| *2D-VLM Methods* | | | |
| SAM3 | – | 0.50 | – |
| SA2VA | – | 0.25 | – |
| *3D-VLM Methods* | | | |
| SpatialLM | – | – | 6.23 |
| LLaVA-3D | – | – | 4.80 |
| SceneScript | – | – | 4.42 |
| *3DGS-based Methods* | | | |
| M3-Spatial | 0.39 | 0.11 | – |
| LangSplat | 0.21 | 0.06 | – |
| **Holi-Spatial (Ours)** | **0.89** | **0.64** | **70.05** |
  • Geometry & Depth: Achieves a Depth F1-score of 0.89 on ScanNet++, vastly outperforming 3DGS baselines (M3-Spatial: 0.39). Visualizations (Figure 7) show cleaner geometry with fewer artifacts.
  • 2D Segmentation: Achieves IoU of 0.64 on ScanNet++, significantly better than SA2VA (0.25). Leverages multi-view information to segment challenging instances (Figure 8).
  • 3D Object Detection: Achieves AP50 of 70.05 on ScanNet++, an order-of-magnitude improvement over 3D-VLM baselines (LLaVA-3D: 4.80). Produces more objects with accurate labels and tight bounding boxes (Figure 9).

2. VLM Fine-tuning Evaluation

  • Spatial Reasoning: Fine-tuning Qwen3-VL on the 1.2M spatial QA pairs from Holi-Spatial-4M improves performance on MMSI-Bench and MindCube benchmarks (Table 3). For example, Qwen3-VL-8B accuracy improves from 29.4% to 49.1% on MindCube.
  • 3D Grounding: Fine-tuning on the 1.2M 3D grounding pairs leads to major gains on ScanNet++ (Table 4). Qwen3-VL-8B AP50 improves from 13.50 to 27.98, a +14.48 point gain, mitigating the single-view bias of baseline models (Figure 11).

Table 3: Quantitative results of Spatial Understanding QA tasks.

| Model | MMSI-Bench | MindCube |
| --- | --- | --- |
| Qwen3-VL-8B [13] | 31.1 | 29.4 |
| Qwen3-VL-8B + Ours | 32.6 | 49.1 |
| Qwen3-VL-2B [13] | 26.1 | 33.5 |
| Qwen3-VL-2B + Ours | 27.6 | 44.0 |

Table 4: Quantitative results of 3D Grounding on ScanNet++.

| Method | AP15 | AP25 | AP50 |
| --- | --- | --- | --- |
| VST-7B-SFT [26] | 17.29 | 14.50 | 11.20 |
| Qwen3-VL-8B [13] | 19.82 | 16.80 | 13.50 |
| Qwen3-VL-8B + Ours | 35.52 | 31.94 | 27.98 |

3. Ablation Studies

Key components are validated (Table 5, Figure 10):

  • Geometric Training (3DGS): Using GS-refined depth (ID.2) drastically improves precision ($P_{25}$: 0.13 → 0.81) and recall ($R_{25}$: 0.31 → 0.89) over raw DA3 depth (ID.1), producing cleaner geometry.
  • Confidence Filter: Applying the confidence filter (ID.4) improves precision (0.35 → 0.67) by removing false positives but reduces recall (0.74 → 0.69) by discarding some challenging true positives.
  • Agent Recall: The VLM-based agent verification step (ID.5) recovers challenging true positives filtered out by confidence, achieving the best precision-recall balance (0.81, 0.89).

Table 5: Ablation study on depth refinement, confidence filtering, and agent recall. $P_{25}$ and $R_{25}$ denote precision and recall at IoU 25%.

| ID | DA3 Depth | 3DGS Training | Conf. Filter | Agent Recall | $P_{25}$ | $R_{25}$ |
| --- | --- | --- | --- | --- | --- | --- |
| *Step 1: Geometric Optimization* | | | | | | |
| 1 | ✓ | | | | 0.13 | 0.31 |
| 2 | ✓ | ✓ | | | 0.81 | 0.89 |
| *Step 3: Scene-Level Refinement* | | | | | | |
| 3 | ✓ | ✓ | | | 0.35 | 0.74 |
| 4 | ✓ | ✓ | ✓ | | 0.67 | 0.69 |
| 5 | ✓ | ✓ | ✓ | ✓ | 0.81 | 0.89 |

Theoretical and Practical Implications

  • Theoretical: Demonstrates the feasibility and superiority of a fully automated, composition-based approach to spatial data curation, challenging the paradigm of relying on limited human-annotated 3D scans. It establishes a new benchmark for generating holistic, multi-task spatial supervision from video.
  • Practical:
    1. Scalable Data Generation: Provides a pathway to create massive, diverse 3D understanding datasets from the vast reservoir of web videos, potentially overcoming the data bottleneck in spatial intelligence.
    2. Model Enhancement: Shows that fine-tuning state-of-the-art VLMs on automatically generated data leads to significant performance gains in 3D grounding and spatial reasoning, validating the quality and utility of the curated data.
    3. Unified Framework: Offers a versatile tool for applications requiring 3D scene understanding, such as robotics (manipulation, navigation), augmented reality, and autonomous systems.

Conclusion

Holi-Spatial presents a groundbreaking, fully automated pipeline for converting raw videos into holistic 3D spatial annotations, culminating in the release of the large-scale Holi-Spatial-4M dataset. The method synergistically combines 3DGS-based geometric optimization, open-vocabulary perception, and scene-level refinement to produce high-quality, multi-level supervision. Extensive experiments validate its superior performance over existing methods and its effectiveness in enhancing VLM capabilities for 3D tasks.

Future Directions & Limitations: The pipeline is computationally expensive due to per-scene optimization and may degrade on challenging videos (motion blur, occlusion). Semantic labels may inherit biases from the underlying foundation models. Future work aims to improve efficiency, expand to broader domains (outdoor, dynamic scenes), and develop stronger benchmarks.