Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence - Summary

Summary (Overview)

  • Key Contribution: Introduces Holi-Spatial, the first fully automated pipeline to convert raw video streams into large-scale, high-quality 3D spatial annotations (geometry, semantics, QA pairs) without human intervention.
  • Core Dataset: Constructs Holi-Spatial-4M, a large-scale multimodal dataset containing 12K optimized 3D Gaussian Splatting (3DGS) scenes, 1.3M 2D masks, 320K 3D bounding boxes and captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs.
  • Superior Performance: The pipeline significantly outperforms existing feed-forward and per-scene optimized methods on benchmarks like ScanNet, ScanNet++, and DL3DV, e.g., improving 3D detection AP50 by roughly 64 points on ScanNet++ (70.05 vs. 6.23 for the best prior method).
  • VLM Enhancement: Fine-tuning Vision-Language Models (e.g., Qwen3-VL) on the curated dataset leads to substantial gains in downstream tasks, including a 14.48-point AP50 improvement on ScanNet++ 3D grounding and a 7.9% accuracy increase on MMSI-Bench for spatial reasoning.
  • Scalable Solution: Addresses the data scarcity bottleneck in spatial intelligence by providing a principled, scalable method to generate diverse, open-vocabulary annotations from abundant web videos, moving beyond limited manually annotated 3D scans.

Introduction and Theoretical Foundation

Spatial intelligence is crucial for enabling Large Multimodal Models (LMMs) to understand and interact with the 3D world, with applications in robotics, AR/VR, and scene editing. However, progress is hindered by a severe scarcity of large-scale, fine-grained 3D data. Existing approaches typically generate Question-Answer (QA) pairs from a small number of manually annotated 3D datasets (e.g., ScanNet) or apply feed-forward models to single images. These methods suffer from limited scalability, domain gaps, and narrow semantic coverage (e.g., only 50 classes in ScanNet).

Theoretical Foundation: The work is motivated by the observation that recent advances in AI tools (e.g., VLMs, segmentation models, 3D reconstruction) have matured sufficiently. By systematically composing these tools, it is possible to build an automated spatial annotation engine that can outperform human annotations, enabling a positive data flywheel. The core idea is to reframe spatial data curation as a scalable, non-human pipeline that converts abundant raw videos into annotated 3D scenes.

Key Insight: Holi-Spatial unifies a broad spectrum of spatial tasks—3D reconstruction, novel view synthesis, depth rendering, 2D/3D instance segmentation, captioning, grounding, and spatial QA—into a single, automated framework, as summarized in Table 1.

Table 1: Overview of Pipeline Capabilities. Holi-Spatial serves as a unified framework supporting diverse spatial tasks without relying on 3D priors.

The table compares supported inputs (images vs. point clouds) and outputs across method families:

| Category | Methods |
| --- | --- |
| 2D-VLM Methods | SAM3, SA2VA |
| 3D-VLM Methods | SpatialLM, LLaVA-3D, SceneScript |
| 3DGS-based Understanding Methods | M3-Spatial, LangSplat |
| Ours | Holi-Spatial |

Methodology

The Holi-Spatial pipeline operates in three progressive stages (Figure 3):

1. Geometric Optimization

  • Objective: Distill high-fidelity 3D structure from raw videos.
  • Process:
    1. Use Structure-from-Motion (SfM) to recover camera parameters.
    2. Initialize a dense point cloud using monocular depth priors (Depth-Anything-V3).
    3. Optimize a 3D Gaussian Splatting (3DGS) scene representation under geometric supervision (multi-view depth consistency) to sharpen structure and suppress floaters/artifacts.
  • Output: Clean, consistent 3D scene geometry with high-quality rendered depth maps.
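The depth-supervision idea in this stage can be sketched minimally: monocular depth priors are only defined up to an affine transform, so one aligns the prior to the rendered 3DGS depth with a least-squares scale and shift before penalizing the residual. The following numpy sketch is illustrative only; the paper's actual loss and implementation are not specified here, and the function names are ours.

```python
import numpy as np

def align_depth(prior, rendered):
    """Fit scale s and shift t so that s * prior + t ~= rendered (least squares)."""
    A = np.stack([prior.ravel(), np.ones(prior.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, rendered.ravel(), rcond=None)
    return s * prior + t

def depth_consistency_loss(rendered, prior):
    """Mean absolute depth residual after affine alignment of the prior."""
    return np.abs(align_depth(prior, rendered) - rendered).mean()
```

In a real pipeline this residual would be one term in the 3DGS optimization objective, evaluated across views so that multi-view agreement suppresses floaters.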

2. Image-level Perception

  • Objective: Extract spatially consistent object proposals from 2D keyframes.
  • Process:
    1. Uniformly sample keyframes $I = \{I_1, \dots, I_T\}$ from the video.
    2. For each frame $I_t$, use a VLM (Gemini3-Pro) with a dynamic class-label memory $M_t = M_{t-1} \cup \text{Extract}(I_t)$ to generate an open-vocabulary caption, ensuring semantic consistency across views.
    3. Use SAM3, guided by prompts from $M_t$, to perform open-vocabulary instance segmentation, producing predictions $O_t = \{(M_k, s_k)\}_{k=1}^{N}$, where $M_k$ is a binary mask and $s_k$ is a confidence score.
    4. 2D-to-3D Lifting: Unproject mask pixels into 3D using the refined depth map $D_t$ and camera intrinsics $\mathbf{K}$: $\mathbf{P} = D_t(\mathbf{u}) \cdot \mathbf{K}^{-1} \tilde{\mathbf{u}}$, where $\mathbf{u} = (u, v)$ is a pixel and $\tilde{\mathbf{u}} = [u, v, 1]^\top$ is its homogeneous coordinate. A geometry-aware filtering strategy (mask erosion + mesh-guided depth filtering) is applied to suppress boundary floaters and noise (Figure 4).
    5. Estimate an initial 3D Oriented Bounding Box (OBB) from the filtered point cloud.
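The lifting equation in step 4 is straightforward to implement. The numpy sketch below unprojects masked pixels into camera coordinates, and substitutes a crude depth-percentile filter for the paper's mask-erosion and mesh-guided filtering; all function names and the filter itself are illustrative assumptions.

```python
import numpy as np

def unproject_mask(depth, mask, K):
    """Lift masked pixels to 3D camera coordinates: P = D_t(u) * K^-1 * u_tilde."""
    v, u = np.nonzero(mask)                                 # pixel rows/cols inside the mask
    uv1 = np.stack([u, v, np.ones_like(u)]).astype(float)   # 3xN homogeneous pixels
    rays = np.linalg.inv(K) @ uv1                           # back-projected rays (z = 1)
    return (rays * depth[v, u]).T                           # Nx3 points, scaled by depth

def drop_depth_outliers(points, lo=5.0, hi=95.0):
    """Crude stand-in for the geometry-aware filtering: keep points whose
    depth lies within the [lo, hi] percentile band."""
    z = points[:, 2]
    z_lo, z_hi = np.percentile(z, [lo, hi])
    return points[(z >= z_lo) & (z <= z_hi)]
```

For example, the principal-point pixel unprojects to $(0, 0, d)$, and off-center pixels fan out along their viewing rays in proportion to depth.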

3. Scene-level Refinement

  • Objective: Perform coarse-to-fine refinement to generate final, high-quality 3D annotations.
  • Process: Starting from initial proposals $P_{\text{init}} = \{(B_i, c_i, s_i)\}_{i=1}^{M}$ (box, category, confidence):
    1. Multi-View Merge & Post-Process: Merge redundant 3D proposals across views if they share the same category and their 3D IoU exceeds a threshold $\tau_{\text{merge}}$ (set to 0.2), i.e., if $c_i = c_j \land \text{IoU}_{3D}(B_i, B_j) > \tau_{\text{merge}}$. Update attributes (confidence, source image index) to retain the most reliable observation. Apply floor-alignment to OBBs for gravity consistency (Figure 5). Output: $P_{\text{merged}}$.
    2. Confidence-Based Filtering & Refinement: Apply a tri-level decision rule to each instance $p_k$ in $P_{\text{merged}}$ based on its confidence score $s_k$: $$\text{Action}(p_k) = \begin{cases} \text{keep}, & s_k \geq \tau_{\text{high}} \\ \text{discard}, & s_k < \tau_{\text{low}} \\ \text{verify}, & \tau_{\text{low}} \leq s_k < \tau_{\text{high}} \end{cases}$$ where $\tau_{\text{high}} = 0.9$ and $\tau_{\text{low}} = 0.8$. Instances in the "verify" band are reassessed by a VLM-based agent equipped with zoom-in and SAM3 re-segmentation tools.
    3. Annotation Generation: For each final instance pkPfinalp_k \in P_{\text{final}}, retrieve its optimal source image and use a VLM (Qwen3-VL-30B) to generate a detailed caption. Procedurally synthesize Spatial QA pairs (1.25M total) covering camera-centric (rotation, movement) and object-centric (distance, direction, size) reasoning tasks.
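The merge and tri-level filtering logic above can be sketched as follows. For brevity the sketch uses axis-aligned boxes for the 3D IoU (the paper uses oriented bounding boxes, whose IoU computation is more involved); the data layout and function names are ours.

```python
import numpy as np

def iou3d_aabb(a, b):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    mins = np.maximum(a[:3], b[:3])
    maxs = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(maxs - mins, 0.0, None))  # overlap volume (0 if disjoint)
    vol = lambda c: np.prod(c[3:] - c[:3])
    return inter / (vol(a) + vol(b) - inter)

def merge_proposals(props, tau_merge=0.2):
    """Greedy merge: keep the highest-confidence box per same-category overlap group."""
    props = sorted(props, key=lambda p: -p[2])  # props: list of (box, category, score)
    kept = []
    for box, cat, score in props:
        if not any(c == cat and iou3d_aabb(box, b) > tau_merge for b, c, _ in kept):
            kept.append((box, cat, score))
    return kept

def triage(score, tau_low=0.8, tau_high=0.9):
    """Tri-level decision rule on an instance's confidence score."""
    if score >= tau_high:
        return "keep"
    if score < tau_low:
        return "discard"
    return "verify"
```

Instances routed to "verify" would then be handed to the VLM-based agent for zoom-in inspection and re-segmentation before a final keep/discard decision.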

Empirical Validation / Results

1. Framework Evaluation on 3D Benchmarks

The pipeline was evaluated on ScanNet, ScanNet++, and DL3DV-10K. Results (Table 2) show Holi-Spatial is the only framework generating high-quality predictions across all tasks (depth, 2D segmentation, 3D detection).

Table 2: Quantitative results of 3D spatial understanding. Bold indicates best results. ↑ : higher is better.

(Values shown are on ScanNet++; "–" marks tasks a method does not support.)

| Method | Depth F1 (↑) | 2D Seg. IoU (↑) | 3D Det. AP50 (↑) |
| --- | --- | --- | --- |
| *2D-VLM Methods* | | | |
| SAM3 | – | 0.50 | – |
| SA2VA | – | 0.25 | – |
| *3D-VLM Methods* | | | |
| SpatialLM | – | – | 6.23 |
| LLaVA-3D | – | – | 4.80 |
| SceneScript | – | – | 4.42 |
| *3DGS-based Methods* | | | |
| M3-Spatial | 0.39 | 0.11 | – |
| LangSplat | 0.21 | 0.06 | – |
| **Holi-Spatial (Ours)** | **0.89** | **0.64** | **70.05** |
  • Geometry & Depth: Achieves a Depth F1-score of 0.89 on ScanNet++, vastly outperforming 3DGS baselines (M3-Spatial: 0.39). Visualizations (Figure 7) show cleaner geometry with fewer artifacts.
  • 2D Segmentation: Achieves IoU of 0.64 on ScanNet++, significantly better than SA2VA (0.25). Leverages multi-view information to segment challenging instances (Figure 8).
  • 3D Object Detection: Achieves AP50 of 70.05 on ScanNet++, an order-of-magnitude improvement over 3D-VLM baselines (LLaVA-3D: 4.80). Produces more objects with accurate labels and tight bounding boxes (Figure 9).

2. VLM Fine-tuning Evaluation

  • Spatial Reasoning: Fine-tuning Qwen3-VL on the 1.2M spatial QA pairs from Holi-Spatial-4M improves performance on MMSI-Bench and MindCube benchmarks (Table 3). For example, Qwen3-VL-8B accuracy improves from 29.4% to 49.1% on MindCube.
  • 3D Grounding: Fine-tuning on the 1.2M 3D grounding pairs leads to major gains on ScanNet++ (Table 4). Qwen3-VL-8B AP50 improves from 13.50 to 27.98, a +14.48 point gain, mitigating the single-view bias of baseline models (Figure 11).

Table 3: Quantitative results of Spatial Understanding QA tasks.

| Model | MMSI-Bench | MindCube |
| --- | --- | --- |
| Qwen3-VL-8B [13] | 31.1 | 29.4 |
| Qwen3-VL-8B + Ours | 32.6 | 49.1 |
| Qwen3-VL-2B [13] | 26.1 | 33.5 |
| Qwen3-VL-2B + Ours | 27.6 | 44.0 |

Table 4: Quantitative results of 3D Grounding on ScanNet++.

| Method | AP15 | AP25 | AP50 |
| --- | --- | --- | --- |
| VST-7B-SFT [26] | 17.29 | 14.50 | 11.20 |
| Qwen3-VL-8B [13] | 19.82 | 16.80 | 13.50 |
| Qwen3-VL-8B + Ours | 35.52 | 31.94 | 27.98 |

3. Ablation Studies

Key components are validated (Table 5, Figure 10):

  • Geometric Training (3DGS): Using GS-refined depth (ID.2) drastically improves precision ($P_{25}$: 0.13 → 0.81) and recall ($R_{25}$: 0.31 → 0.89) over raw DA3 depth (ID.1), producing cleaner geometry.
  • Confidence Filter: Applying the confidence filter (ID.4) improves precision (0.35 → 0.67) by removing false positives but reduces recall (0.74 → 0.69) by discarding some challenging true positives.
  • Agent Recall: The VLM-based agent verification step (ID.5) recovers challenging true positives filtered out by confidence, achieving the best precision-recall balance (0.81, 0.89).

Table 5: Ablation study on depth refinement, confidence filtering, and agent recall. $P_{25}$ and $R_{25}$ denote precision and recall at IoU 25%.

| ID | DA3 Depth | 3DGS Training | Conf. Filter | Agent Recall | $P_{25}$ | $R_{25}$ |
| --- | --- | --- | --- | --- | --- | --- |
| *Step 1: Geometric Optimization* | | | | | | |
| 1 | ✓ | | | | 0.13 | 0.31 |
| 2 | ✓ | ✓ | | | 0.81 | 0.89 |
| *Step 3: Scene-Level Refinement* | | | | | | |
| 3 | ✓ | ✓ | | | 0.35 | 0.74 |
| 4 | ✓ | ✓ | ✓ | | 0.67 | 0.69 |
| 5 | ✓ | ✓ | ✓ | ✓ | 0.81 | 0.89 |

Theoretical and Practical Implications

  • Theoretical: Demonstrates the feasibility and superiority of a fully automated, composition-based approach to spatial data curation, challenging the paradigm of relying on limited human-annotated 3D scans. It establishes a new benchmark for generating holistic, multi-task spatial supervision from video.
  • Practical:
    1. Scalable Data Generation: Provides a pathway to create massive, diverse 3D understanding datasets from the vast reservoir of web videos, potentially overcoming the data bottleneck in spatial intelligence.
    2. Model Enhancement: Shows that fine-tuning state-of-the-art VLMs on automatically generated data leads to significant performance gains in 3D grounding and spatial reasoning, validating the quality and utility of the curated data.
    3. Unified Framework: Offers a versatile tool for applications requiring 3D scene understanding, such as robotics (manipulation, navigation), augmented reality, and autonomous systems.

Conclusion

Holi-Spatial presents a groundbreaking, fully automated pipeline for converting raw videos into holistic 3D spatial annotations, culminating in the release of the large-scale Holi-Spatial-4M dataset. The method synergistically combines 3DGS-based geometric optimization, open-vocabulary perception, and scene-level refinement to produce high-quality, multi-level supervision. Extensive experiments validate its superior performance over existing methods and its effectiveness in enhancing VLM capabilities for 3D tasks.

Future Directions & Limitations: The pipeline is computationally expensive due to per-scene optimization and may degrade on challenging videos (motion blur, occlusion). Semantic labels may inherit biases from the underlying foundation models. Future work aims to improve efficiency, expand to broader domains (outdoor, dynamic scenes), and develop stronger benchmarks.