SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Summary (Overview)

  • Core Innovation: Introduces the first self-evolving framework for 3D spatial reasoning, centered on a Deterministic Geometric Environment (DGE). The DGE computes exact, zero-noise ground truth directly from 3D scene assets (point clouds, camera poses), replacing unreliable model consensus with objective physical feedback.
  • Training Paradigm: Employs a single Vision-Language Model (VLM) that co-evolves between Questioner and Solver roles via spatial-grounded self-play. The Questioner generates valid spatial questions, while the Solver answers them under DGE constraints.
  • Adaptive Curriculum: A task-adaptive scheduler dynamically concentrates training on the model's weakest spatial reasoning categories, enabling endogenous curriculum emergence without manual design.
  • Key Results: SpatialEvo achieves the highest average performance across nine benchmarks at both 3B and 7B model scales, showing strong gains on spatial reasoning tasks (e.g., VSI-Bench) without degradation on general visual understanding benchmarks.
  • Fundamental Insight: Leverages the unique property of 3D spatial reasoning—its answers are deterministic consequences of underlying geometry—to bypass the self-reinforcing error problem inherent in consensus-based self-evolution methods.

Introduction and Theoretical Foundation

Spatial reasoning over 3D scenes is a core capability for embodied intelligence. However, continuous model improvement is bottlenecked by the high cost of manual geometric annotation for creating large-scale datasets. While the self-evolving paradigm (where models improve through iterative self-play) offers a promising path to overcome this, existing methods rely on model consensus (e.g., majority voting) to construct pseudo-labels for training. This introduces a systematic bias: training anchored to these labels risks reinforcing the model's existing errors rather than correcting them.

The paper identifies a distinctive property of 3D spatial reasoning that circumvents this limitation: unlike natural language or general vision tasks, the ground truth for a spatial question is a deterministic consequence of the underlying geometry. Given a dense point cloud, calibrated camera poses, and a well-formed question, the correct answer can be computed exactly and programmatically, without any model judgment. This transforms every unannotated 3D scene into a potential source of noise-free supervision.
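The determinism claim can be made concrete with a toy computation (not the paper's code): once the scene geometry is available, a spatial answer is a pure function of the assets, with no model judgment involved. The object point clouds below are hypothetical toy data.

```python
# Illustrative sketch: the answer to a metric spatial question follows
# deterministically from geometry alone.
import math

def centroid(points):
    """Mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def absolute_distance(cloud_a, cloud_b):
    """Exact answer to 'how far apart are objects A and B?' taken as the
    Euclidean distance between their point-cloud centroids."""
    ca, cb = centroid(cloud_a), centroid(cloud_b)
    return math.dist(ca, cb)

# Toy clouds: a chair near the origin, a table 3 m away along x.
chair = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.2, 0.0)]
table = [(3.0, 0.0, 0.0), (3.2, 0.0, 0.0), (3.0, 0.2, 0.0)]
print(round(absolute_distance(chair, table), 2))  # 3.0
```

The same principle extends to directions, sizes, and visibility: each is a closed-form function of point clouds and camera poses, which is what lets the DGE act as an exact oracle.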

Building on this insight, the authors propose SpatialEvo, a framework that integrates VLM self-play with programmatic physical verification. Its core is the Deterministic Geometric Environment (DGE), which serves as an exact geometric oracle.

Methodology

The SpatialEvo framework consists of two core components: the Deterministic Geometric Environment (DGE) and Spatial-Grounded Policy Co-Evolution.

1. Deterministic Geometric Environment (DGE)

The DGE formalizes 3D spatial reasoning into executable atomic rules, transforming scene assets into a zero-noise interactive judge.

  • Task Formalization: The framework defines 16 spatial reasoning task categories, organized by observational granularity:
    • Multi-image scene-level (6 tasks): e.g., Object Counting, Absolute Distance, Relative Direction.
    • Single-image (3 tasks): e.g., Single-View Relative Direction, Camera-Object Distance.
    • Dual-image (7 tasks): e.g., Inter-Camera Relative Position, Visibility Comparison.
  • Geometric Validation Rule Sets: For each task category, the DGE pre-defines a set of geometric verification rules that decouple spatial intuition into executable atomic criteria. Rules check:
    1. Premise Consistency: All referenced scene entities (objects, frames) exist and are uniquely localizable.
    2. Inferential Solvability: Geometric premises are unambiguously computable (e.g., sufficient point cloud density, meaningful viewpoint disparity).
    3. Geometric Degeneracy Filtering: Edge cases with low training value are discarded.
  • Automated Verification Pipeline: A three-stage pipeline converts free-form questions into verified ground truth.
    1. Entity Parsing: A lightweight LLM extracts structured entities (frame indices, object categories) from the question text.
    2. Legality Verification: Extracted entities are validated against the task's rule set. Invalid questions are discarded.
    3. Ground-Truth Synthesis: For valid questions, the DGE invokes its geometric toolkit (coordinate transforms, bounding-box fitting, depth projection) to compute the precise answer.
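The three-stage pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entity parser is a stub standing in for the lightweight LLM, and the rule set and geometric toolkit are toy stand-ins.

```python
# Sketch of the DGE's automated verification pipeline (toy components).

def parse_entities(question):
    # Stage 1: in the paper a lightweight LLM extracts structured entities;
    # here a stub returns pre-extracted entities for one toy question.
    return {"objects": ["chair", "table"], "frames": [0]}

def is_legal(entities, scene):
    # Stage 2: premise consistency -- every referenced object must exist
    # and be uniquely localizable (exactly one instance in the scene).
    return all(scene.get(obj, 0) == 1 for obj in entities["objects"])

def synthesize_answer(entities, geometry_toolkit):
    # Stage 3: invoke geometric tools to compute the exact answer.
    return geometry_toolkit(entities["objects"])

# Toy scene: object category -> instance count (1 = uniquely localizable).
scene = {"chair": 1, "table": 1, "lamp": 2}

entities = parse_entities("How far is the chair from the table?")
if is_legal(entities, scene):
    answer = synthesize_answer(entities, lambda objs: 3.0)  # toy distance fn
else:
    answer = None  # question discarded as invalid
print(answer)  # 3.0
```

Note that a question referencing the ambiguous "lamp" (two instances) would fail Stage 2, so no ground truth would ever be synthesized for it.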

2. Spatial-Grounded Policy Co-Evolution

A single policy model $\pi_\theta$ is trained to co-evolve across two roles using the GRPO (Group Relative Policy Optimization) framework.

  • Spatial Self-Play Mechanism: The model alternates between Questioner and Solver via role-conditioned prompting.
    • Questioner: Perceives the holistic 3D scene layout from multi-view images and generates a physically valid spatial question for a task sampled by the scheduler.
    • Solver: Given a question (and its DGE-computed ground truth), derives a precise answer through step-by-step reasoning.
  • Task-Adaptive Scheduler: Drives an endogenous, adaptive curriculum. It maintains a historical effective accuracy $\bar{a}_k$ for each task category $k$. The sampling probability $p_k$ for a feasible task is weighted inversely to this accuracy:
    $$p_k = \frac{w_k}{\sum_{j \in \mathcal{T}_s^{\text{feasible}}} w_j}, \quad w_k = \max(\delta, 1 - \bar{a}_k)$$
    where $\delta$ is a minimum exploration weight. This focuses training on the model's weakest areas.
  • Reward Design:
    • Questioner Reward ($r_Q$): Encourages generating valid, well-grounded questions:
      $$r_Q = \alpha f_{\text{fmt}} + (1-\alpha)\, f_{\text{valid}} \cdot f_{\text{obs}}$$
      where $f_{\text{fmt}}$ is format compliance, $f_{\text{valid}}$ is the DGE validity score, and $f_{\text{obs}}$ scores the quality of visual observation.
    • Solver Reward ($r_A$): Rewards accurate answers to valid questions and sound explanations for invalid ones:
      $$r_A = \begin{cases} \alpha f_{\text{fmt}} + (1-\alpha) f_{\text{acc}}, & \text{if } Q \text{ is valid} \\ \alpha f_{\text{fmt}} + (1-\alpha) f_{\text{explain}}, & \text{if } Q \text{ is invalid} \end{cases}$$
      where $f_{\text{acc}}$ is accuracy against the DGE ground truth and $f_{\text{explain}}$ is the quality of the explanation of why a question is invalid.
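The scheduler's inverse-accuracy weighting can be sketched directly from the formula above: $w_k = \max(\delta, 1 - \bar{a}_k)$, normalized over feasible tasks. The task names and accuracy values below are illustrative, not from the paper.

```python
# Sketch of the task-adaptive sampling rule: tasks with lower historical
# accuracy receive proportionally higher sampling probability, with a
# floor delta that preserves minimum exploration of mastered tasks.

def sampling_probs(accuracies, delta=0.1):
    """Map per-task historical accuracy a_k to sampling probability p_k."""
    weights = {k: max(delta, 1.0 - a) for k, a in accuracies.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Hypothetical accuracies for three task categories.
acc = {"relative_direction": 0.4, "object_counting": 0.8, "room_area": 0.95}
probs = sampling_probs(acc)

# The weakest task (relative_direction) is sampled most often.
assert probs["relative_direction"] > probs["object_counting"] > probs["room_area"]
print({k: round(p, 3) for k, p in probs.items()})
```

Note how the floor $\delta$ matters for nearly-mastered tasks: room_area's raw weight $1 - 0.95 = 0.05$ is clipped up to $0.1$, so the task is never starved of samples entirely.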

Empirical Validation / Results

Experiments were conducted across nine benchmarks using Qwen2.5-VL-3B and 7B models. Training data was constructed from ~4K scenes from ScanNet, ScanNet++, and ARKitScenes.

Main Results

Table 1: Main results across nine benchmarks. SpatialEvo achieves the highest average score at both 3B and 7B scales.

| Benchmark | Baseline (3B) | SpatialLadder (3B) | SpaceR (3B) | SpatialSSRL (3B) | SpatialEvo (3B) | Baseline (7B) | ViLaSR (7B) | SpaceR (7B) | SpatialSSRL (7B) |
|---|---|---|---|---|---|---|---|---|---|
| VSI-Bench | 28.1 | 45.7 | 36.0 | 28.0 | 39.2 | 31.1 | 45.4 | 36.8 | 33.7 |
| RealWorldQA | 63.4 | 57.1 | 61.4 | 65.4 | 66.5 | 69.5 | 57.9 | 64.7 | 69.9 |
| EmbSpatial | 55.9 | 57.6 | 55.6 | 59.8 | 61.2 | 63.6 | 47.8 | 60.3 | 69.3 |
| SpatialViz | 24.2 | 28.6 | 31.9 | 25.9 | 25.4 | 27.0 | 29.8 | 30.9 | 28.4 |
| STARE | 33.1 | 26.4 | 36.8 | 36.8 | 36.9 | 41.8 | 21.4 | 36.2 | 43.3 |
| CoreCognition | 56.8 | 58.3 | 29.1 | 57.6 | 57.4 | 59.6 | 56.4 | 56.4 | 60.2 |
| ViewSpatial | 36.2 | 43.0 | 35.9 | 38.4 | 42.3 | 36.4 | 32.3 | 35.1 | 37.5 |
| V-STAR | 74.9 | 36.7 | 75.4 | 77.0 | 75.4 | 78.5 | 35.6 | 73.8 | 79.1 |
| MMStar | 54.6 | 45.8 | 44.9 | 56.5 | 55.2 | 61.6 | 60.8 | 54.9 | 63.5 |
| AVG | 47.5 | 44.4 | 45.2 | 49.5 | 51.1 | 52.1 | 43.0 | 49.9 | 53.9 |
  • Spatial Reasoning Gains: SpatialEvo shows substantial improvements on core spatial benchmarks (VSI-Bench, EmbSpatial, ViewSpatial).
  • General Capability Retention: Performance on general visual understanding benchmarks (MMStar, RealWorldQA) remains competitive, indicating no degradation from spatial specialization.
  • Scale Consistency: Advantages hold at both 3B and 7B model scales.

Ablation Studies

Table 2: Ablation study on SpatialEvo (Qwen2.5-VL-7B). Removing physical grounding causes the largest performance drop.

| Variant | VSI-Bench | RealWorldQA | EmbSpatial | SpatialViz | STARE | CoreCognition | ViewSpatial | V-STAR | MMStar | Avg | Δ Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SpatialEvo (Ours) | 46.1 | 66.7 | 66.0 | 28.6 | 41.3 | 60.2 | 43.2 | 78.0 | 62.5 | 54.7 | |
| *Architecture Design* | | | | | | | | | | | |
| w/o Questioner | 40.2 | 67.1 | 65.7 | 27.7 | 39.0 | 60.5 | 40.4 | 77.0 | 59.9 | 53.1 | ↓ 1.6 |
| w/o Solver | 36.6 | 70.2 | 61.8 | 26.1 | 39.8 | 58.5 | 34.5 | 75.4 | 60.5 | 51.5 | ↓ 3.2 |
| w/o Physical Grounding | 18.8 | 68.5 | 63.5 | 23.4 | 39.7 | 59.7 | 35.4 | 77.0 | 60.6 | 49.6 | ↓ 5.1 |
| w/o Adaptive Scheduler | 43.4 | 68.5 | 68.0 | 27.5 | 39.5 | 60.3 | 43.2 | 77.0 | 62.4 | 54.4 | ↓ 0.3 |
  • Critical Role of DGE: Replacing DGE ground truth with majority-vote pseudo-labels (w/o Physical Grounding) causes the largest average drop (-5.1) and a catastrophic collapse on VSI-Bench (from 46.1 to 18.8), validating the necessity of deterministic feedback.
  • Importance of Co-Evolution: Removing either the Questioner or Solver role leads to significant performance degradation, confirming the value of the dual-role, self-play mechanism.
  • Scheduler Contribution: The adaptive scheduler provides a consistent, albeit smaller, boost to performance.

Analysis: Online Evolution vs. Static Learning

A controlled comparison shows that online self-evolution outperforms static dataset training, even when the static dataset is constructed from SpatialEvo's own offline generated data.

Table 3: Comparison of online self-evolution and static learning paradigms on VSI-Bench (Qwen2.5-VL-3B). SpatialEvo's online RL outperforms all SFT baselines.

| Paradigm | Method | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| | Qwen2.5-VL-3B (Base) | 33.5 | 21.1 | 17.9 | 22.6 | 32.8 | 42.7 | 29.9 | 21.0 | 28.0 |
| RL | w/ SpatialLadder RL | 62.8 | 29.8 | 59.2 | 27.8 | 38.9 | 44.2 | 33.5 | 24.9 | 40.1 |
| RL | SpatialEvo (Online RL) | 65.2 | 35.1 | 61.3 | 51.4 | 46.2 | 44.3 | 29.9 | 26.5 | 46.3 |
| SFT | w/ SpatialLadder Data | 63.0 | 31.9 | 61.2 | 43.0 | 43.0 | 43.6 | 32.5 | 31.7 | 43.7 |
| SFT | w/ SpaceR Data | 28.3 | 25.9 | 36.3 | 34.9 | 35.1 | 46.8 | 35.1 | 43.4 | 36.3 |
| SFT | w/ SpatialSSRL Data | 35.6 | 24.6 | 15.4 | 21.8 | 34.1 | 39.2 | 28.9 | 23.5 | 28.1 |
| SFT | w/ SpatialEvo Offline Data | 62.6 | 28.9 | 60.0 | 49.1 | 40.3 | 45.7 | 31.4 | 25.9 | 43.9 |

This demonstrates the key advantage of online self-evolution: the training distribution dynamically aligns with the model's current cognitive frontier through real-time DGE interaction, enabling adaptive hard-sample mining that static datasets cannot replicate.

Curriculum Emergence Analysis

Training dynamics confirm the endogenous emergence of an adaptive curriculum.

  • Reward Trajectories: Both Questioner and Solver rewards improve steadily. The Questioner quickly learns to generate valid questions ($f_{\text{valid}} \rightarrow 1.0$), while the Solver's accuracy reward increases.
  • Task Sampling Shifts: The adaptive scheduler successfully up-weights harder categories (e.g., Relative Direction from a uniform 16.7% to 21.8%) and down-weights easier ones (e.g., Room Area to 12.5%) as training progresses.

Theoretical and Practical Implications

  • Paradigm Shift for Self-Evolution: SpatialEvo demonstrates that in domains where ground truth is physically deterministic, self-evolution can be anchored to objective environmental feedback rather than noisy model consensus. This prevents error reinforcement and enables truly corrective learning.
  • Data-Efficient Continuous Improvement: The framework provides a scalable path for continuous model improvement in 3D spatial reasoning without proportional investment in human annotation. It turns unannotated 3D scene datasets into perpetual training engines.
  • Embodied Intelligence Foundation: The physically grounded self-evolution paradigm can serve as a reference for broader embodied AI research, where an agent's capabilities emerge from interaction with a verifiable physical world.
  • Limitations: The approach currently depends on high-fidelity 3D assets (dense point clouds, calibrated poses), limiting it to static indoor environments. Performance is also sensitive to the quality of entity parsing and the underlying point cloud reconstruction.

Conclusion

SpatialEvo introduces the first self-evolving framework for 3D spatial reasoning, grounded in a Deterministic Geometric Environment that turns unannotated 3D scenes into a source of exact, noise-free supervision for co-evolving Questioner and Solver roles.