SpatialBench: A Comprehensive Benchmark for Spatial Foundation Models

Summary (Overview)

SpatialBench is a comprehensive, cross-paradigm benchmark designed to assess the robustness and generalization of spatial foundation models across diverse conditions. It comprises 19 datasets and 546 scenes, evaluates 41 models across six paradigms, and employs a deterministic multi-density sampling protocol (Single, Sparse, Medium, Dense).
Key findings reveal that full-context attention models define the accuracy upper bound under high-memory conditions, while bounded-memory strategies enable long-sequence scalability on limited GPUs. Data quality outweighs data volume for performance, and egocentric and wrist-view domains remain dominant out-of-distribution (OOD) failure modes.
To address the identified data gap, the authors introduce DA-Next-5M, a large-scale dataset of 5.5M frames from egocentric and wrist-view sources, and DA-Next, a strong baseline model trained on this data. DA-Next improves substantially on DA3-Giant in depth estimation, reducing AbsRel by 47% on sparse inputs and 59% on medium inputs.

Introduction and Theoretical Foundation

Spatial foundation models are widely deployed in robotics, AR/VR, autonomous driving, and embodied AI for their ability to recover 3D structures from images. However, their robustness across unpredictable real-world conditions (domain shifts, variable input densities, hardware constraints) remains unclear. Existing evaluations are limited by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, making it difficult to assess true generalization.

SpatialBench addresses these gaps with three core principles:

Deterministic Multi-Density Evaluation Protocol: Precomputes frame indices across four density regimes (Single, Sparse, Medium, Dense) to systematically assess model robustness across input scales.
Broad Domain Coverage Across 19 Datasets: Aggregates diverse datasets spanning indoor/outdoor, static/dynamic, real/synthetic, and various viewpoint types (normal, egocentric, wrist-view). Scenes are annotated with orthogonal tags for fine-grained analysis.
Comprehensive and Cross-Paradigm Model Comparison: Provides unified adapters for 41 model variants across six paradigms: optimization-based, end-to-end feed-forward, online/streaming, chunk-based, SLAM-based, and test-time training (TTT).

Methodology

Data Collection and Curation

SpatialBench unifies 19 heterogeneous 3D vision datasets into a common representation (RGB frames, metric depth maps, camera-to-world poses, intrinsics). A deterministic evaluation protocol uses precomputed JSON records for each (scene, view-density) pair.
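As an illustration of what a precomputed record could look like, the sketch below parses one (scene, view-density) entry and validates it. The field names (`scene`, `density`, `frame_indices`) and the lowercase density labels are assumptions made for this example; the source does not specify the JSON schema.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical density labels; the benchmark's real spelling may differ.
VALID_DENSITIES = ("single", "sparse", "medium", "dense")


@dataclass(frozen=True)
class SamplingRecord:
    """One precomputed (scene, view-density) entry (assumed schema)."""
    scene: str
    density: str
    frame_indices: List[int]


def parse_record(text: str) -> SamplingRecord:
    """Parse and validate a single JSON record."""
    raw = json.loads(text)
    density = raw["density"]
    if density not in VALID_DENSITIES:
        raise ValueError(f"unknown density regime: {density!r}")
    indices = [int(i) for i in raw["frame_indices"]]
    if len(set(indices)) != len(indices):
        raise ValueError("frame indices must be unique")
    return SamplingRecord(raw["scene"], density, indices)


example = '{"scene": "demo_scene", "density": "sparse", "frame_indices": [0, 12, 40, 95]}'
record = parse_record(example)
print(record)
```

Because the indices are stored rather than sampled at run time, every model sees exactly the same frames for a given scene and density, which is what makes cross-model comparisons repeatable.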

Key Datasets: Static-real (7-Scenes, DTU, NRGBD, ScanNet++, Tanks & Temples, ETH3D), Static-synthetic (Hiroom), Dynamic-real (TUM-Dynamic, DROID, Xperience, Waymo, KITTI-Odometry), Dynamic-synthetic (ADT, RLBench with Colosseum, RoboTwin, Robolab, Virtual KITTI 2, OmniWorld-Game), and a Single-frame Mixture.

DROID Curation Pipeline: For high-quality wrist-view sequences, stereo videos are processed via $S^2M^2$ for metric depth, MapAnything for initial camera poses, SAM3 for dynamic region segmentation, and Bundle Adjustment for pose refinement. A unified depth map post-processing pipeline (range clipping, flying point removal, bilateral filtering, isolated region removal, sky masking) ensures annotation quality.
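Two of the post-processing steps named above, range clipping and flying point removal, can be sketched with NumPy as follows. This is a simplified heuristic rather than the authors' pipeline: the function name, the depth range, and the relative-jump threshold are all assumptions, and the flying-point test conservatively invalidates both pixels on either side of a depth discontinuity.

```python
import numpy as np


def clean_depth(depth, d_min=0.1, d_max=10.0, rel_jump=0.1):
    """Return a copy of `depth` with out-of-range and edge pixels set to NaN.

    d_min, d_max and rel_jump are illustrative values, not taken from the source.
    """
    depth = np.asarray(depth, dtype=np.float64)

    # Range clipping: keep only finite depths inside [d_min, d_max].
    in_range = np.isfinite(depth) & (depth >= d_min) & (depth <= d_max)
    safe = np.where(in_range, depth, np.nan)

    # Flying-point heuristic: flag neighbouring pixels whose relative
    # depth difference exceeds rel_jump (comparisons with NaN are False).
    flying = np.zeros(depth.shape, dtype=bool)
    with np.errstate(invalid="ignore"):
        jump_x = (np.abs(safe[:, 1:] - safe[:, :-1])
                  / np.minimum(safe[:, 1:], safe[:, :-1])) > rel_jump
        jump_y = (np.abs(safe[1:, :] - safe[:-1, :])
                  / np.minimum(safe[1:, :], safe[:-1, :])) > rel_jump
    flying[:, 1:] |= jump_x
    flying[:, :-1] |= jump_x
    flying[1:, :] |= jump_y
    flying[:-1, :] |= jump_y

    return np.where(in_range & ~flying, depth, np.nan)


demo = np.full((4, 4), 2.0)
demo[1, 1] = 5.0    # isolated depth spike
demo[0, 3] = 50.0   # beyond d_max
cleaned = clean_depth(demo)
print(cleaned)
```

A production pipeline would typically follow this with the remaining steps listed above (bilateral filtering, isolated region removal, sky masking), which are not shown here.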

Multi-density Evaluation Regimes

Single: Fixed deterministic frame index for monocular depth prior evaluation.
Sparse: Formulated as a weighted set-cover problem to maximize voxel coverage with a small frame budget $K$, promoting viewpoint diversity.
Medium: Uses a set-cover formulation favoring view overlap over diversity, with a length-adaptive frame budget.
Dense: Targets online, long-horizon settings, preserving temporal continuity while bounding evaluation cost with a maximum frame budget.

Evaluated Models

The benchmark evaluates 41 model variants across six paradigms. Key models include:

Optimization-based: DUSt3R, MASt3R.
End-to-End Feed-Forward: VGGT, Fast3R, FastVGGT, MUSt3R, MapAnything, OmniVGGT, $\pi^3$, AMB3R, DepthAnything3 (DA3), WorldMirror, VGGT-Omega.
Online/Streaming: Spann3R, CUT3R, MonST3R, Point3R, Stream3R, StreamVGGT, PAGE4D, InfiniteVGGT, WinT3R, LongStream, LingBot-Map.
Chunk-based: VGGT-Long, $\pi^3$-Long, DA3-Streaming.
SLAM-based: MASt3R-SLAM, VGGT-SLAM.
Test-Time Training (TTT): TTT3R, Scal3R, LoGeR.

Task Description and Metrics

Five evaluation tasks are designed:

Camera Pose Estimation: Evaluates pairwise geometry using Relative Rotation Accuracy ($\text{RAcc}_x$), Relative Translation Accuracy ($\text{TAcc}_x$), and $\text{AUC}_x$ (area under the joint accuracy curve).
Camera Trajectory Estimation: For continuous sequences, computes Absolute Trajectory Error ($\text{ATE}$) and Relative Pose Error in translation ($\text{RPE}_t$) and rotation ($\text{RPE}_r$) after global Sim(3) alignment.
Depth Estimation: Computes AbsRel, SqRel, RMSE, LogRMSE, and threshold inlier ratios ($\delta_\tau$) over valid pixels. Predicted depths are aligned via median scaling by default.
Dense-View Reconstruction: Evaluates scene-level 3D point clouds using Accuracy, Completeness, F-score (harmonic mean of precision and recall), and Overall score $(\text{Accuracy} + \text{Completeness}) / 2$.
Prior-Enhanced Prediction: Targets methods that accept auxiliary inputs (e.g., depth, camera pose priors).
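The depth and reconstruction metrics listed above can be sketched as follows. The formulas are the definitions commonly used in the depth-estimation and multi-view reconstruction literature; the source does not spell them out, so treat the function names, the inlier threshold of 1.25, and the distance threshold `tau` as assumptions rather than the benchmark's exact settings.

```python
import numpy as np


def depth_metrics(pred, gt, threshold=1.25):
    """Depth errors over valid pixels after median scaling."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = np.isfinite(gt) & (gt > 0) & np.isfinite(pred) & (pred > 0)
    p, g = pred[valid], gt[valid]
    p = p * (np.median(g) / np.median(p))  # median scaling
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": float(np.mean(np.abs(p - g) / g)),
        "sq_rel": float(np.mean((p - g) ** 2 / g)),
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "log_rmse": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
        "delta": float(np.mean(ratio < threshold)),
    }


def reconstruction_metrics(pred_pts, gt_pts, tau=0.05):
    """Point-cloud metrics via brute-force nearest neighbours (small clouds only)."""
    pred_pts = np.asarray(pred_pts, dtype=np.float64)
    gt_pts = np.asarray(gt_pts, dtype=np.float64)
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    pred_to_gt = dists.min(axis=1)  # drives Accuracy / precision
    gt_to_pred = dists.min(axis=0)  # drives Completeness / recall
    accuracy = float(pred_to_gt.mean())
    completeness = float(gt_to_pred.mean())
    precision = float((pred_to_gt < tau).mean())
    recall = float((gt_to_pred < tau).mean())
    total = precision + recall
    f_score = 0.0 if total == 0 else 2 * precision * recall / total
    return {
        "accuracy": accuracy,
        "completeness": completeness,
        "f_score": f_score,
        "overall": (accuracy + completeness) / 2,
    }


gt_depth = np.array([[1.0, 2.0], [4.0, 0.0]])  # 0 marks an invalid pixel
d = depth_metrics(2.0 * gt_depth, gt_depth)    # wrong only by a global scale
r = reconstruction_metrics(
    [[0, 0, 0.01], [1, 0, 0.01], [5, 0, 0]],
    [[0, 0, 0], [1, 0, 0]],
)
print(d, r)
```

In the depth example the prediction differs from the ground truth only by a global factor, so median scaling removes the error entirely. For real point clouds a KD-tree would replace the quadratic distance matrix.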

Empirical Validation / Results

Key Table: Main Results on SpatialBench (Table 1)

Method	#Params (M)	Time (s)	Single AbsRel ↓	Sparse AbsRel ↓	Sparse AUC@30 ↑	Medium AbsRel ↓	Medium AUC@30 ↑	Medium ATE ↓	Medium F-Score ↑	Dense AbsRel ↓	Dense AUC@30 ↑	Dense ATE ↓	Dense F-Score ↑	Average AbsRel ↓	Average AUC@30 ↑	Average ATE ↓	Average F-Score ↑
DA-Next (Ours)	1303.76	0.50	0.166 (-54.9%)	0.050 (-47.4%)	0.809 (+3.1%)	0.035 (-59.3%)	0.819 (+5.5%)	1.442 (+24.2%)	0.727 (-2.0%)	OOM	OOM	OOM	OOM	0.084	0.814	1.442	0.727
DA3-Giant	1355.67	0.47	0.368	0.095	0.785	0.086	0.776	1.161	0.742	OOM	OOM	OOM	OOM	0.183	0.780	1.161	0.742
$\pi^3$ -X	1360.03	0.24	0.371	0.084	0.741	0.078	0.744	0.369	0.658	OOM	OOM	OOM	OOM	0.178	0.742	0.369	0.658
VGGT-Omega	1143.81	0.48	0.516	0.077	0.803	0.067	0.795	0.659	0.706	–	–	–	–	0.220	0.799	0.659	0.706
... (Other models)	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

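Note: In the original table, the best, second-best, and third-best entries in each column are highlighted and OOM/Timeout cells are shaded; that formatting is not reproduced here. DA-Next is excluded from the per-column rankings. Percentages in parentheses in the DA-Next row are changes relative to DA3-Giant.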
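The parenthesized percentages in the DA-Next row appear to be relative changes against the DA3-Giant row, and the Average columns appear to be means over the regimes that produced a result. The short check below recomputes both from the numbers in Table 1; the dictionary keys and helper name are made up for this example.

```python
def relative_change(new: float, ref: float) -> float:
    """Percentage change of `new` with respect to `ref`."""
    return 100.0 * (new - ref) / ref


# Values copied from the DA-Next and DA3-Giant rows of Table 1.
da_next = {"single_absrel": 0.166, "sparse_absrel": 0.050, "sparse_auc": 0.809,
           "medium_absrel": 0.035, "medium_auc": 0.819,
           "medium_ate": 1.442, "medium_fscore": 0.727}
da3_giant = {"single_absrel": 0.368, "sparse_absrel": 0.095, "sparse_auc": 0.785,
             "medium_absrel": 0.086, "medium_auc": 0.776,
             "medium_ate": 1.161, "medium_fscore": 0.742}

changes = {key: round(relative_change(da_next[key], da3_giant[key]), 1)
           for key in da_next}
for key, pct in changes.items():
    print(f"{key}: {pct:+.1f}%")

# Average AbsRel over the three regimes that did not run out of memory.
absrel_keys = ("single_absrel", "sparse_absrel", "medium_absrel")
avg_absrel = round(sum(da_next[k] for k in absrel_keys) / len(absrel_keys), 3)
print("average AbsRel:", avg_absrel)
```

Note that the sign alone does not say whether a change is good: AbsRel and ATE are lower-is-better, so the +24.2% on Medium ATE is a regression, while AUC@30 and F-Score are higher-is-better.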

Key Findings:

Full-Context Attention Sets Accuracy Upper Bound: Under the same input budget ($N=800$), full-context feed-forward models (DA3-Giant, $\pi^3$) achieve the lowest depth errors, outperforming streaming/online variants. This indicates globally coupled attention remains highly effective for geometric reasoning.
Bounded-Memory Modeling Enables Long-Sequence Scalability: While full-context models' GPU memory grows rapidly with sequence length (leading to OOM on dense inputs), streaming, online, chunk-wise, and TTT variants maintain flatter memory curves, enabling continuous reconstruction under hardware constraints, albeit with lower depth accuracy.
Training Data Quality Outweighs Volume: Performance correlates with the number of training datasets, but data quality is more decisive. DA3's careful pseudo-GT curation strategy yields top performance despite not using the largest training corpus.
Egocentric and Wrist-View Are Dominant OOD Failure Modes: Cross-method average performance drops sharply on egocentric and wrist-view sequences, which points to a field-level limitation caused by underrepresented training data. DA-Next, trained on the egocentric/wrist-view DA-Next-5M dataset, improves substantially over DA3-Giant in depth and pairwise pose: AbsRel falls by 47% (0.095→0.050) on sparse and 59% (0.086→0.035) on medium inputs, and AUC@30 rises by 3.1% and 5.5%, respectively. Table 1 also shows that on medium inputs DA-Next's ATE rises from 1.161 to 1.442 (+24.2%) and its F-score falls from 0.742 to 0.727 (-2.0%), so the gains do not extend to every metric.
Test-Time Training (TTT) Gains Concentrated on Dense Sequences: TTT methods (Scal3R, LoGeR) consistently improve pairwise camera pose accuracy (AUC@30) and global trajectory consistency (ATE) over their base models (VGGT, $\pi^3$) on dense inputs, but gains are inconsistent or negative on sparse/medium inputs, which is consistent with TTT being engineered for length generalization.
Injecting GT Priors: Injecting GT depth priors drives depth estimation to near-perfect accuracy across prior-aware models. However, camera pose prior injection yields inconsistent gains; some models partially override injected poses with their own predictions.
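To make the full-context versus bounded-memory trade-off in the findings above concrete, the sketch below counts query-key pairs in self-attention, which grow quadratically with the number of tokens. It is a rough cost proxy and not a GPU memory measurement: the tokens-per-frame value, the chunk size, and the assumption of non-overlapping chunks that attend only within themselves are all illustrative and do not describe any specific model in the benchmark.

```python
def attention_pairs_full(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    n = num_frames * tokens_per_frame
    return n * n


def attention_pairs_chunked(num_frames: int, tokens_per_frame: int, chunk_size: int):
    """Return (peak, total) pairs for independent, non-overlapping chunks."""
    full_chunks, remainder = divmod(num_frames, chunk_size)
    sizes = [chunk_size] * full_chunks + ([remainder] if remainder else [])
    per_chunk = [(s * tokens_per_frame) ** 2 for s in sizes]
    return max(per_chunk), sum(per_chunk)


TOKENS_PER_FRAME = 100  # hypothetical value for illustration

full = attention_pairs_full(800, TOKENS_PER_FRAME)
peak, total = attention_pairs_chunked(800, TOKENS_PER_FRAME, chunk_size=50)
print(f"full context: {full:,} pairs")
print(f"chunked: peak {peak:,} pairs, total {total:,} pairs")
```

Under these assumptions the chunked peak stays constant as the sequence grows, while the full-context count grows with the square of the frame count. What the chunked variant gives up is direct interaction between frames in different chunks, which is one plausible reading of the accuracy gap reported above.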

Theoretical and Practical Implications

Model Design: Full-context models are preferable for accuracy on bounded inputs; bounded-memory methods are better for long-horizon or resource-constrained deployment. The trade-off between accuracy and scalability is explicit.
Data Curation: Targeted in-domain data curation (e.g., DA-Next-5M for embodied views) is more effective for closing OOD gaps than simply scaling generic training mixtures. Domain match matters more than dataset count.
Evaluation Protocol: SpatialBench's deterministic, density-aware, and domain-diverse design provides a rigorous foundation for future research, enabling fair comparisons and revealing model behaviors across input regimes.
Benchmark Utility: The benchmark exposes critical gaps in current models and provides a clear roadmap for improvement, emphasizing the need for models that are robust across domains, densities, and hardware constraints.

Conclusion

SpatialBench reveals that current spatial foundation models are not yet all-round players, showing gaps in domain generalization and input-density robustness. The benchmark's extensive analysis provides key insights:

Full-context attention maximizes accuracy; bounded-memory strategies unlock scalability.
Data quality is paramount; domain alignment is critical for embodied tasks.
Egocentric and wrist-view domains are the largest OOD failure modes.

To address the most significant data gap, the authors introduced DA-Next-5M and trained DA-Next, establishing a strong baseline. They hope SpatialBench serves as a rigorous foundation for developing more generalizable and robust 3D foundation models. Future work should focus on improving domain generalization, especially for embodied viewpoints, and further exploring the trade-offs between full-context and bounded-memory architectures.