SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning - Summary

Summary (Overview)

  • Core Contribution: Proposes SpatialBoost, a scalable framework to enhance the 3D spatial awareness of existing pre-trained 2D vision encoders (e.g., DINOv3, SigLIPv2) by injecting spatial knowledge expressed through linguistic descriptions.
  • Key Idea: Converts dense 3D spatial information from 2D images into a hierarchical, multi-turn Chain-of-Thought (CoT) reasoning dataset expressed in language. This dataset is then used to fine-tune vision encoders via a Large Language Model (LLM) without catastrophic forgetting.
  • Primary Method: A three-stage training pipeline (feature alignment, visual instruction tuning, vision encoder fine-tuning) incorporating a novel dual-channel attention mechanism to preserve pre-trained knowledge while learning new spatial information.
  • Main Results: Demonstrates consistent performance improvements across diverse benchmarks requiring 3D perception (depth estimation, segmentation, 3D scene understanding, robot control) and even on general vision tasks (image classification, retrieval). For example, boosts DINOv3 mIoU on ADE20K from 55.9 to 59.7.
  • Scalability: Shows that performance scales with the size of the generated reasoning dataset, highlighting the method's data-efficient and scalable nature.

Introduction and Theoretical Foundation

Pre-trained image representation models (vision encoders) have achieved remarkable success across various computer vision tasks. However, they are predominantly trained on 2D image data, which fundamentally limits their ability to capture 3D spatial relationships between objects and backgrounds in the real world. This lack of spatial awareness constrains their effectiveness in downstream applications such as 3D scene understanding, vision-based robotic control, and tasks requiring geometric reasoning.

Existing approaches to imbue 3D knowledge, such as training on multi-view images, face scalability challenges due to the need for carefully curated or simulated data. The paper hypothesizes that spatial information extracted from 2D images by specialized models (e.g., for depth estimation, segmentation) can be systematically converted into explicit linguistic representations. Since language naturally composes information sequentially and structurally, it can serve as a scalable supervision signal for learning dense spatial relationships.

Building on recent works that use language as supervision for visual representation learning, the authors introduce SpatialBoost. The core theoretical insight is that a Large Language Model (LLM) can act as an effective conduit to transfer rich, hierarchical 3D knowledge into a frozen vision encoder through a carefully constructed language-based reasoning dataset.

Methodology

SpatialBoost consists of a three-stage training pipeline and a novel dataset construction method.

1. Multi-modal Architecture & Training Pipeline

The architecture comprises a vision encoder f_V, a trainable projection module g_P, and an LLM f_L. Training proceeds in three stages:

  • Stage 1: Feature Alignment. The projector g_P is trained to map image features from f_V into the textual embedding space of the LLM f_L; f_V and f_L are frozen.
  • Stage 2: Visual Instruction Tuning. The projector g_P and the LLM f_L are fine-tuned using a mix of standard visual instruction data and a newly constructed Multi-view VQA dataset (to handle multi-view inputs); f_V remains frozen.
  • Stage 3: Vision Encoder Fine-tuning with Dual-Channel Attention. This is the core stage, where spatial knowledge is injected into f_V. The vision encoder and projector are fine-tuned on the Multi-turn Visual Spatial Reasoning Dataset while the LLM is frozen.
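The stage-wise freeze/unfreeze schedule above can be sketched as a small lookup. The module names (f_V, g_P, f_L, attn_plus, alpha) are illustrative labels, not the paper's code; note that in Stage 3 only the added attention channel and mixture factor inside the encoder are actually updated.

```python
# Which components each training stage updates (hypothetical names):
#   f_V = vision encoder, g_P = projector, f_L = LLM,
#   attn_plus / alpha = the dual-channel additions inside f_V.
TRAINABLE = {
    1: {"g_P"},                        # Stage 1: feature alignment
    2: {"g_P", "f_L"},                 # Stage 2: visual instruction tuning
    3: {"g_P", "attn_plus", "alpha"},  # Stage 3: spatial knowledge injection
}

def trainability(stage, modules):
    """Map each module name to whether it is updated in the given stage."""
    return {m: m in TRAINABLE[stage] for m in modules}

modules = ["f_V", "g_P", "f_L", "attn_plus", "alpha"]
assert not trainability(1, modules)["f_V"]   # encoder frozen in Stage 1
assert trainability(2, modules)["f_L"]       # LLM tuned in Stage 2
assert not trainability(3, modules)["f_L"]   # LLM frozen again in Stage 3
```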

To prevent catastrophic forgetting of pre-trained knowledge during Stage 3, a dual-channel attention mechanism is introduced. For each original attention layer Attn(·), an additional layer Attn⁺(·) is added with identically initialized weights. Their outputs are merged via a trainable mixture factor α:

Attn_final(x) = α · Attn(x) + (1 − α) · Attn⁺(x)

where α = sigmoid(a) ∈ (0, 1)^d and a ∈ ℝ^d is a zero-initialized parameter (d is the hidden dimension). Only Attn⁺ and α are updated during fine-tuning, allowing the model to gradually incorporate new spatial attention patterns.
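A minimal NumPy sketch of this merge rule (the attention layers themselves are elided; only their outputs are mixed). Because a is zero-initialized and Attn⁺ starts as an exact copy of Attn, the merged output equals the original attention output at step 0.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dual_channel_merge(attn_out, attn_plus_out, a):
    """Merge frozen and trainable attention-channel outputs.

    attn_out, attn_plus_out: (seq_len, d) outputs of Attn and Attn+.
    a: (d,) trainable logit; alpha = sigmoid(a) lies in (0, 1)^d.
    """
    alpha = sigmoid(a)  # per-dimension mixture factor
    return alpha * attn_out + (1.0 - alpha) * attn_plus_out

d = 4
a = np.zeros(d)                 # zero-init => alpha = 0.5 everywhere
x = np.random.randn(3, d)       # stand-in for Attn(x)
# With Attn+ initialized as a copy of Attn, the merge is the identity:
assert np.allclose(dual_channel_merge(x, x.copy(), a), x)
```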

2. Dataset Construction

  • Multi-view VQA Dataset: For Stage 2 alignment. Constructed from 3D/video datasets (ScanNet, Ego4D). Image pairs are filtered by LPIPS similarity, keeping pairs with 0.35 ≤ LPIPS(x_i, x_j) ≤ 0.65. GPT-4o generates general multi-view QAs (common, adversarial, multi-choice).
  • Multi-turn Visual Spatial Reasoning Dataset: For Stage 3 knowledge injection. Constructs a 12-turn conversation per image, following a hierarchical CoT structure:
    1. Pixel-level (5 turns): Queries absolute/relative 3D position of points (e.g., "What is the depth at (x,y)?").
    2. Object-level (4 turns): Queries semantic spatial info using object bounding cubes (e.g., "Is [A] left of [B]?"). Uses pixel-level answers as rationale.
    3. Scene-level (1 turn): Queries holistic 3D understanding (e.g., "How far is [A] from [B]?").
    4. Scene Caption (2 turns): GPT-generated general captions to preserve non-spatial knowledge.
  • For single-view images, a 3D point cloud is generated using models like Depth-Pro [10] and SAM. For multi-view images, a 3D reconstruction model (VGGT [86]) is used.
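The hierarchical 12-turn structure above can be expressed as a simple schedule. The (level, count) pairs follow the paper's turn counts; the expansion helper and labels are illustrative, not the authors' code.

```python
# Hierarchical 12-turn schedule for one image (counts from the paper;
# level labels are illustrative).
SCHEDULE = [
    ("pixel", 5),    # absolute/relative 3D position of queried points
    ("object", 4),   # bounding-cube relations, pixel answers as rationale
    ("scene", 1),    # holistic 3D understanding
    ("caption", 2),  # general captions to retain non-spatial knowledge
]

def build_turns(schedule):
    """Expand the (level, count) schedule into an ordered list of turn labels."""
    return [level for level, n in schedule for _ in range(n)]

turns = build_turns(SCHEDULE)
assert len(turns) == 12
assert turns[:5] == ["pixel"] * 5 and turns[-2:] == ["caption"] * 2
```

Keeping the forward pixel → object → scene order matters: the ablation in Table 7 finds this hierarchy outperforms reversed or shuffled orderings.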

Empirical Validation / Results

SpatialBoost was applied to multiple state-of-the-art vision encoders (OpenCLIP, SigLIPv2, DINOv2, DINOv3) and evaluated across six task categories.

Table 1: Monocular Depth Estimation (RMSE, lower is better)

| Method | NYUd (lin.) | NYUd (DPT) | KITTI (lin.) | KITTI (DPT) |
|---|---|---|---|---|
| DINOv3 | 0.31 | 0.25 | 2.33 | 2.02 |
| +SpatialBoost (ours) | 0.25 | 0.21 | 2.20 | 1.84 |
| SigLIPv2 | 0.51 | 0.40 | 3.32 | 2.64 |
| +SpatialBoost (ours) | 0.39 | 0.34 | 2.71 | 2.50 |

Table 2: Semantic Segmentation (mIoU, higher is better)

| Method | ADE20K (lin.) | ADE20K (+ms) | Pascal VOC (lin.) | Pascal VOC (+ms) |
|---|---|---|---|---|
| DINOv3 | 55.9 | 60.3 | 86.6 | 89.8 |
| +SpatialBoost (ours) | 59.7 | 63.1 | 88.5 | 90.9 |
| SigLIPv2 | 42.8 | 48.7 | 72.6 | 79.1 |
| +SpatialBoost (ours) | 45.1 | 50.8 | 79.0 | 82.2 |

Table 3: (contents not recoverable from the source)

Table 4: Vision-based Robot Learning (CortexBench)

| Method | Adroit | MetaWorld | DMControl | Trifinger | Avg. |
|---|---|---|---|---|---|
| DINOv3 | 63.9 ± 1.5 | 83.8 ± 1.6 | 70.8 ± 1.8 | 72.8 ± 0.5 | 72.8 |
| +SpatialBoost (ours) | 71.8 ± 3.4 | 92.0 ± 1.9 | 80.4 ± 2.4 | 79.0 ± 0.6 | 80.8 |
| SigLIPv2 | 56.5 ± 3.0 | 84.7 ± 2.9 | 69.4 ± 2.1 | 68.3 ± 0.8 | 69.7 |
| +SpatialBoost (ours) | 66.5 ± 1.9 | 89.1 ± 0.9 | 73.5 ± 1.8 | 73.9 ± 0.7 | 75.8 |

Table 5: Image Classification & Retrieval

| Method | ImageNet (lin.) | Oxford-H (mAP) | Paris-H (mAP) | Met (GAP) |
|---|---|---|---|---|
| DINOv3 | 88.4 | 60.7 | 87.1 | 55.4 |
| +SpatialBoost (ours) | 90.2 | 64.1 | 88.6 | 57.0 |
| SigLIPv2 | 89.1 | 25.1 | 60.9 | 13.9 |
| +SpatialBoost (ours) | 90.0 | 36.0 | 69.1 | 24.0 |

Ablation Studies & Analysis

  • LLM vs. Pixel-level Supervision (Table 6): Fine-tuning with an LLM decoder outperformed alternatives (linear heads, SAM decoder, VGGT decoder) across classification, segmentation, depth estimation, and VLR tasks, validating language as superior for dense information transfer.
  • Multi-turn Reasoning Order (Table 7): The forward hierarchical order (pixel → object → scene) yielded optimal performance compared to reversed or shuffled orders.
  • Dual-channel Attention (Figure 6): This mechanism uniquely preserved and enhanced pre-trained knowledge (ImageNet accuracy from 86.3% to 87.6%), whereas full fine-tuning or LoRA caused degradation.
  • Dataset Scalability (Figure 5): Performance on depth estimation and segmentation improved consistently as the size of the reasoning dataset increased from 50K to 300K samples.
  • Application to Spatial-aware Encoders (Table 9): SpatialBoost provided further performance gains even when applied to vision encoders already designed for spatial awareness (e.g., TIPS, PE-Core), demonstrating its complementary nature.

Theoretical and Practical Implications

  • Theoretical: Demonstrates that language can serve as an effective and scalable medium for transferring complex, hierarchical 3D knowledge into visual representation models. It bridges the gap between 2D pre-training and 3D understanding without requiring massive multi-view datasets.
  • Methodological: Introduces a practical fine-tuning framework (dual-channel attention) that enables knowledge injection without catastrophic forgetting, making it feasible to enhance powerful, existing encoders rather than training from scratch.
  • Practical: The improvements are broad and significant, not only in 3D-centric tasks but also in general vision tasks like classification and retrieval. This suggests that enhancing spatial reasoning leads to more robust and general visual representations. The method can be directly applied to boost the performance of state-of-the-art encoders (e.g., DINOv3) across a wide range of applications, from robotics to scene understanding.

Conclusion

SpatialBoost presents a novel and effective framework for enhancing the spatial awareness of pre-trained vision encoders by leveraging language-guided reasoning. Key innovations include:

  1. A hierarchical, multi-turn visual spatial reasoning dataset that converts dense 3D information into linguistic form.
  2. A three-stage training pipeline with a dual-channel attention mechanism that injects spatial knowledge while preserving pre-existing capabilities.
  3. Comprehensive empirical validation showing consistent and scalable performance gains across both spatial and general vision tasks.

The work facilitates future research on designing and enhancing vision encoders, demonstrating the power of language as a supervision signal for acquiring complex visual understanding. A noted limitation is the reliance on vision foundation models for dataset construction, pointing to the need for large-scale ground-truth spatial annotations as a valuable future direction.