ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning - Summary

Summary (Overview)

  • Identifies Critical Pitfalls in Existing Benchmarks: The paper reveals two major validity issues in current Visual Spatial Intelligence (VSI) benchmarks like VSI-Bench: 1) Annotation-to-video ground-truth drift due to noisy 3D reconstructions, and 2) Scene-observability mismatch where questions are unanswerable under realistic, sparse frame sampling used by VLMs.
  • Introduces the ReVSI Benchmark: Proposes a rebuilt benchmark with video-aligned re-annotation of 381 scenes from 5 datasets, rigorously verified QA generation with bias mitigation, and frame-aware evaluation protocols (16/32/64/All frames) to ensure questions are answerable and correct under the model's actual input.
  • Enables Controlled Diagnostic Analysis: Provides fine-grained object visibility metadata and constructs "dummy videos" (e.g., removing frames with queried objects) to probe models' reliance on visual evidence versus non-visual priors or hallucinations.
  • Reveals New Insights into Model Capabilities: Evaluation on ReVSI shows that proprietary models (GPT-5.2, Gemini) are under-assessed by VSI-Bench, while many fine-tuned specialized models show smaller gains or even degradation, and exhibit severe hallucination in evidence-absent settings, uncovering behaviors obscured by prior benchmarks.

Introduction and Theoretical Foundation

The ability of Vision-Language Models (VLMs) to reason about 3D spaces from video inputs—Visual Spatial Intelligence (VSI)—is a critical capability for real-world applications. However, the paper argues that widely used benchmarks for evaluating this capability, such as VSI-Bench, suffer from systematic validity issues that lead to unreliable conclusions about model performance.

The core theoretical problem is a misalignment between the benchmark's assumptions and the VLM's operational reality. Benchmarks often repurpose ground-truth annotations from 3D scanned scene datasets (e.g., ScanNet) originally created for traditional 3D perception tasks. When these annotations, which can be noisy, incomplete, or geometrically inaccurate, are projected onto video frames for VLM evaluation, they create incorrect or ambiguous Question-Answer (QA) pairs. Furthermore, evaluations typically assume the model has access to the entire scene (all video frames), while practical VLMs operate on a strict input budget of sparsely sampled frames (e.g., 16-64 frames), making many questions effectively unanswerable.

Therefore, the paper's foundational motivation is to rebuild VSI evaluation with one guiding principle: ensuring strict consistency between what the model sees (its input frames) and what the benchmark asks. This requires high-quality, video-aligned annotations and frame-adaptive evaluation protocols.

Methodology

The ReVSI benchmark is constructed through a multi-stage process focusing on data quality, question validity, and evaluation controllability.

1. Video-Aligned Object and Geometry Re-annotation:

  • Datasets: 381 scenes from ScanNetv2, ScanNet++, ARKitScenes, 3RScan, and MultiScan.
  • Process: Using a custom 3D web annotation tool, the authors manually re-annotated object labels and 3D bounding boxes, filtering incorrect labels, refining boxes, and adding missing objects.
  • Open Vocabulary: Object labels are manually assigned, enabling fine-grained descriptions (e.g., "Sony PlayStation"), resulting in 504 unique labels vs. 65 in VSI-Bench.
  • Geometry: Room boundary polygons are manually annotated from top-down views for accurate area calculation, replacing automatic methods like Alpha Shape used in VSI-Bench.
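
The paper replaces automatic boundary extraction (Alpha Shape) with manually drawn top-down polygons; given such a polygon, the floor area follows from the shoelace formula, as in the minimal sketch below (the function name and example coordinates are illustrative, not the paper's tooling).

```python
import numpy as np

def polygon_area(vertices: np.ndarray) -> float:
    """Floor area of a simple (non-self-intersecting) room polygon.

    vertices: (N, 2) array of x/y coordinates in metres, taken from the
    top-down boundary annotation, in order (clockwise or counter-clockwise).
    Uses the shoelace formula; the absolute value makes orientation irrelevant.
    """
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Example: a 4 m x 3 m rectangular room -> 12.0 m^2
room = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
print(polygon_area(room))  # 12.0
```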

2. QA Re-generation with Verification and Bias Control:

  • All QA pairs are rebuilt from scratch with human verification for each sample.
  • Bias Mitigation: Introduces new question templates to reduce exploitable statistical biases (e.g., "2" as a common count in VSI-Bench); a simple filtering sketch follows this list.
    • Object Counting: Adds single-count queries and cumulative counting across two categories.
    • Object Size Estimation: Excludes categories with near-fixed dimensions and subsamples to ensure size diversity.
    • Absolute Distance: Removes short-range (<1m) pairs and adds more long-range queries.
    • Relative Direction & Distance: Adds templates for "farthest" object and "backward-facing" orientation.

3. Frame-Aware Evaluation Protocols:

  • Constructs separate, valid QA sets for 16, 32, 64, and All-frame sampling settings.
  • Object visibility for each frame budget is determined via projection and manual verification (see the projection sketch after this list).
  • Ensures each question is answerable (queried objects are visible) and correct (ground truth matches the visible evidence) under the specific frame setting.
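
A minimal sketch of how per-budget visibility could be derived by projection, assuming standard pinhole conventions (a 4x4 world-to-camera extrinsic per frame and a 3x3 intrinsic matrix); the paper's exact thresholds and its manual verification pass are not reproduced here.

```python
import numpy as np

def sample_frame_indices(num_frames: int, budget: int) -> np.ndarray:
    """Uniformly sample `budget` frame indices from a `num_frames`-frame video."""
    return np.linspace(0, num_frames - 1, num=min(budget, num_frames)).round().astype(int)

def project_points(points_world: np.ndarray, world_to_cam: np.ndarray, K: np.ndarray):
    """Project (N, 3) world-space points to pixels; returns pixel coords and camera-space depth."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (world_to_cam @ pts_h.T).T[:, :3]       # 4x4 extrinsics: world -> camera
    depth = pts_cam[:, 2]
    uvw = (K @ pts_cam.T).T                           # 3x3 pinhole intrinsics
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    return uv, depth

def object_visible(box_corners, world_to_cam_per_frame, K, width, height, frame_ids) -> bool:
    """Visible if any 3D box corner projects inside any sampled frame with positive depth."""
    for i in frame_ids:
        uv, depth = project_points(box_corners, world_to_cam_per_frame[i], K)
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
        if np.any(inside & (depth > 0)):
            return True
    return False
```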

4. Visibility-Guided Controlled Diagnostics (Dummy Videos):

  • To test if models rely on visual evidence or priors, constructs evidence-absent controls (see the construction sketch after this list):
    • Query-Drop: Remove all frames containing the queried object(s).
    • First-Frame Repeated: Repeat the first frame of the Query-Drop video.
    • Black: Use entirely black frames.
  • For these videos, the ground-truth answer is deterministically known (e.g., count = 0), allowing measurement of hallucination rate.
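
The controls can be assembled directly from the sampled frames and the per-frame visibility metadata; the sketch below is only a plausible construction (array shapes and the handling of the degenerate all-visible case are assumptions).

```python
import numpy as np

def make_dummy_videos(frames: np.ndarray, queried_visible: np.ndarray) -> dict:
    """Build the three evidence-absent controls from sampled frames.

    frames:          (T, H, W, 3) uint8 video frames that would be fed to the model.
    queried_visible: (T,) bool mask, True where any queried object is visible.
    For all three controls the deterministic ground-truth count is 0.
    """
    query_drop = frames[~queried_visible]              # Query-Drop: drop frames showing the queried object(s)
    if len(query_drop) == 0:                           # assumed fallback if the object is visible in every frame
        query_drop = np.zeros_like(frames[:1])
    first_frame = np.repeat(query_drop[:1], len(frames), axis=0)  # First-Frame Repeated
    black = np.zeros_like(frames)                      # Black: entirely black frames
    return {"query_drop": query_drop, "first_frame_repeated": first_frame, "black": black}
```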

5. Evaluation Metrics:

  • Multiple-Choice Questions (MCQ): Exact-match Accuracy.
  • Numerical Questions (NQ): Mean Relative Accuracy (MRA), defined as $\text{MRA} = \frac{1}{|C|} \sum_{\theta \in C} \mathbb{1}\left\{ \frac{|\hat{y} - y|}{y} < 1 - \theta \right\}$, where $C = \{0.5, 0.55, \ldots, 0.95\}$ is the set of confidence thresholds, $\hat{y}$ is the prediction, and $y$ is the ground truth.
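
A direct transcription of the MRA definition above into code may make the metric concrete; this is a minimal sketch with illustrative variable names, not the paper's evaluation code.

```python
import numpy as np

def mean_relative_accuracy(y_pred: float, y_true: float) -> float:
    """MRA for one numerical question: fraction of confidence thresholds the prediction passes."""
    thresholds = np.arange(0.50, 0.96, 0.05)        # C = {0.50, 0.55, ..., 0.95}
    rel_err = abs(y_pred - y_true) / y_true
    return float(np.mean(rel_err < (1.0 - thresholds)))

print(mean_relative_accuracy(4.2, 5.0))  # relative error 0.16 passes thresholds up to 0.80 -> 0.7
```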

Empirical Validation / Results

Evaluations on ReVSI versus VSI-Bench reveal significant discrepancies and new insights.

Table 1: Dataset Statistics Comparison

| Metric          | VSI-Bench | ReVSI |
|-----------------|-----------|-------|
| Scenes          | 288       | 381   |
| Objects         | 3,185     | 5,365 |
| Object Labels   | 65        | 504   |
| Open-Vocabulary | No        | Yes   |

1. Exposing Flaws in VSI-Bench (Table 3):

  • Numerical Tasks: VSI-Bench systematically underestimates proprietary model performance. On ReVSI, proprietary models (GPT-5.2, Gemini 3 Pro) consistently outperform open-source models, reversing the trend suggested by VSI-Bench.
  • Absolute Distance: Despite ReVSI being more challenging (fewer short-range queries), models score higher. Analysis shows the relative-error-based MRA metric is less punitive at larger distances: a fixed 0.3 m prediction error is a 60% relative error against a 0.5 m ground truth but only 6% against 5 m, so VSI-Bench's abundance of short-range queries made its evaluation artificially strict.

2. Performance of Specialized VLMs (Table 4):

  • Fine-tuned models (Cambrian-S, VLM3R, Spatial-MLLM) show substantially smaller gains on ReVSI compared to their reported gains on VSI-Bench.
  • In some cases (e.g., SpaceR), fine-tuning leads to performance degradation versus the base model.
  • Scaling training data (e.g., Spatial-MLLM from 135k to 820k samples) yields only marginal improvements (~3%), suggesting data quality, not quantity, is the bottleneck.

3. Hallucination vs. Perception (Tables 5 & 6):

  • Object Counting on Dummy Videos (Table 5): Tests models when the correct answer is always 0.
    • Human: 100% accuracy (predicts 0).
    • Proprietary Models: Lower hallucination rates (e.g., GPT-5.2 reaches 74% accuracy on Query-Drop).
    • Open-Source & Fine-tuned Models: Catastrophic failure. Many fine-tuned models predict non-zero counts (accuracy <10%), revealing severe overfitting to noisy training data and disregard for visual input.
    • Notable Divergence: Qwen3-VL predicts 0 correctly, while InternVL3.5 is heavily misled by scene context, frequently predicting "2" (the biased answer from VSI-Bench).

Table 5: Object Counting Accuracy on 16-Frame Dummy Videos (GT=0)

| Method                             | Query-Drop | First-Frame | Black |
|------------------------------------|------------|-------------|-------|
| Human                              | 100.0      | 100.0       | 100.0 |
| GPT-5.2                            | 74.0       | 89.6        | 99.2  |
| Gemini 3 Pro                       | 62.3       | 85.0        | 94.0  |
| Qwen3-VL-32B-Instruct              | 50.5       | 92.7        | 100.0 |
| InternVL3.5-38B                    | 9.1        | 45.0        | 1.2   |
| Cambrian-S-7B (Fine-tuned)         | 1.1        | 2.8         | 0.0   |
| Spatial-MLLM-4B-820k (Fine-tuned)  | 0.0        | 0.0         | 0.0   |

  • Object Size Estimation on Black Videos (Table 6): Even with explicit prompts to base answers on visual evidence, InternVL3.5 achieves high scores (~50 MRA) on black videos, indicating heavy reliance on priors, while Qwen3-VL scores near zero.

Theoretical and Practical Implications

  • Benchmark Design: Establishes new principles for valid VSI evaluation: video-aligned ground truth, frame-consistent QA, and controlled evidence manipulation. Highlights that repurposing 3D annotations for video tasks requires explicit verification.
  • Model Assessment: Reveals that strong performance on existing benchmarks can mask fundamental flaws, such as hallucination and over-reliance on dataset priors. ReVSI provides a more reliable and diagnostic tool for comparing model capabilities.
  • Model Development: Suggests that current fine-tuning strategies using noisy spatial data may be counterproductive, degrading base model capabilities. Emphasizes the need for high-quality supervision data over simply scaling quantity.
  • Trustworthiness: For real-world applications (e.g., robotics, room assessment), it is crucial to know if a model's spatial reasoning is grounded in visual evidence or driven by memorized priors. ReVSI's diagnostic tools help assess this.

Conclusion

The paper identifies and addresses fundamental validity gaps in evaluating 3D spatial intelligence in VLMs. By introducing ReVSI—a benchmark rebuilt with video-aligned annotations, rigorous QA verification, and frame-aware evaluation protocols—it enables a more accurate and reliable assessment. Key findings include that proprietary models are under-assessed by prior benchmarks, fine-tuned models show limited robust gains and high hallucination rates, and models exhibit dramatically different behaviors under controlled evidence-absent settings. ReVSI provides a foundational tool for developing and evaluating VLMs with genuinely grounded 3D spatial reasoning capabilities.

Limitations & Future Work: The high-quality annotation process is costly and limits scalability. Future work should develop automated or semi-automated pipelines for generating high-quality spatial supervision.