SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Summary (Overview)

  • Core Innovation: Introduces the first self-evolving framework for 3D spatial reasoning, centered on a Deterministic Geometric Environment (DGE). The DGE computes exact, zero-noise ground truth directly from 3D scene assets (point clouds, camera poses), replacing unreliable model consensus with objective physical feedback.
  • Training Paradigm: Employs a single Vision-Language Model (VLM) that co-evolves between Questioner and Solver roles via spatial-grounded self-play. The Questioner generates valid spatial questions, while the Solver answers them under DGE constraints.
  • Adaptive Curriculum: A task-adaptive scheduler dynamically concentrates training on the model's weakest spatial reasoning categories, enabling endogenous curriculum emergence without manual design.
  • Key Results: SpatialEvo achieves the highest average performance across nine benchmarks at both 3B and 7B model scales, showing strong gains on spatial reasoning tasks (e.g., VSI-Bench) without degradation on general visual understanding benchmarks.
  • Fundamental Insight: Leverages the unique property of 3D spatial reasoning—its answers are deterministic consequences of underlying geometry—to bypass the self-reinforcing error problem inherent in consensus-based self-evolution methods.

Introduction and Theoretical Foundation

Spatial reasoning over 3D scenes is a core capability for embodied intelligence. However, continuous model improvement is bottlenecked by the high cost of manual geometric annotation for creating large-scale datasets. While the self-evolving paradigm (where models improve through iterative self-play) offers a promising path to overcome this, existing methods rely on model consensus (e.g., majority voting) to construct pseudo-labels for training. This introduces a systematic bias: training anchored to these labels risks reinforcing the model's existing errors rather than correcting them.

The paper identifies a distinctive property of 3D spatial reasoning that circumvents this limitation: unlike natural language or general vision tasks, the ground truth for a spatial question is a deterministic consequence of the underlying geometry. Given a dense point cloud, calibrated camera poses, and a well-formed question, the correct answer can be computed exactly and programmatically, without any model judgment. This transforms every unannotated 3D scene into a potential source of noise-free supervision.
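The determinism claim can be made concrete with a toy computation (not the paper's code): once the scene geometry is available, a spatial answer is a pure function of the assets, with no model judgment involved. The object point clouds below are hypothetical toy data.

```python
# Illustrative sketch: the answer to a metric spatial question follows
# deterministically from geometry alone.
import math

def centroid(points):
    """Mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def absolute_distance(cloud_a, cloud_b):
    """Exact answer to 'how far apart are objects A and B?' taken as the
    Euclidean distance between their point-cloud centroids."""
    ca, cb = centroid(cloud_a), centroid(cloud_b)
    return math.dist(ca, cb)

# Toy clouds: a chair near the origin, a table 3 m away along x.
chair = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.2, 0.0)]
table = [(3.0, 0.0, 0.0), (3.2, 0.0, 0.0), (3.0, 0.2, 0.0)]
print(round(absolute_distance(chair, table), 2))  # 3.0
```

The same principle extends to directions, sizes, and visibility: each is a closed-form function of point clouds and camera poses, which is what lets the DGE act as an exact oracle.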

Building on this insight, the authors propose SpatialEvo, a framework that integrates VLM self-play with programmatic physical verification. Its core is the Deterministic Geometric Environment (DGE), which serves as an exact geometric oracle.

Methodology

The SpatialEvo framework consists of two core components: the Deterministic Geometric Environment (DGE) and Spatial-Grounded Policy Co-Evolution.

1. Deterministic Geometric Environment (DGE)

The DGE formalizes 3D spatial reasoning into executable atomic rules, transforming scene assets into a zero-noise interactive judge.

  • Task Formalization: The framework defines 16 spatial reasoning task categories, organized by observational granularity:
    • Multi-image scene-level (6 tasks): e.g., Object Counting, Absolute Distance, Relative Direction.
    • Single-image (3 tasks): e.g., Single-View Relative Direction, Camera-Object Distance.
    • Dual-image (7 tasks): e.g., Inter-Camera Relative Position, Visibility Comparison.
  • Geometric Validation Rule Sets: For each task category, the DGE pre-defines a set of geometric verification rules that decouple spatial intuition into executable atomic criteria. Rules check:
    1. Premise Consistency: All referenced scene entities (objects, frames) exist and are uniquely localizable.
    2. Inferential Solvability: Geometric premises are unambiguously computable (e.g., sufficient point cloud density, meaningful viewpoint disparity).
    3. Geometric Degeneracy Filtering: Edge cases with low training value are discarded.
  • Automated Verification Pipeline: A three-stage pipeline converts free-form questions into verified ground truth.
    1. Entity Parsing: A lightweight LLM extracts structured entities (frame indices, object categories) from the question text.
    2. Legality Verification: Extracted entities are validated against the task's rule set. Invalid questions are discarded.
    3. Ground-Truth Synthesis: For valid questions, the DGE invokes its geometric toolkit (coordinate transforms, bounding-box fitting, depth projection) to compute the precise answer.
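The three-stage pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entity parser is a stub standing in for the lightweight LLM, and the rule set and geometric toolkit are toy stand-ins.

```python
# Sketch of the DGE's automated verification pipeline (toy components).

def parse_entities(question):
    # Stage 1: in the paper a lightweight LLM extracts structured entities;
    # here a stub returns pre-extracted entities for one toy question.
    return {"objects": ["chair", "table"], "frames": [0]}

def is_legal(entities, scene):
    # Stage 2: premise consistency -- every referenced object must exist
    # and be uniquely localizable (exactly one instance in the scene).
    return all(scene.get(obj, 0) == 1 for obj in entities["objects"])

def synthesize_answer(entities, geometry_toolkit):
    # Stage 3: invoke geometric tools to compute the exact answer.
    return geometry_toolkit(entities["objects"])

# Toy scene: object category -> instance count (1 = uniquely localizable).
scene = {"chair": 1, "table": 1, "lamp": 2}

entities = parse_entities("How far is the chair from the table?")
if is_legal(entities, scene):
    answer = synthesize_answer(entities, lambda objs: 3.0)  # toy distance fn
else:
    answer = None  # question discarded as invalid
print(answer)  # 3.0
```

Note that a question referencing the ambiguous "lamp" (two instances) would fail Stage 2, so no ground truth would ever be synthesized for it.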

2. Spatial-Grounded Policy Co-Evolution

A single policy model $\pi_\theta$ is trained to co-evolve across two roles using the GRPO (Group Relative Policy Optimization) framework.

  • Spatial Self-Play Mechanism: The model alternates between Questioner and Solver via role-conditioned prompting.
    • Questioner: Perceives the holistic 3D scene layout from multi-view images and generates a physically valid spatial question for a task sampled by the scheduler.
    • Solver: Given a question (and its DGE-computed ground truth), derives a precise answer through step-by-step reasoning.
  • Task-Adaptive Scheduler: Drives an endogenous, adaptive curriculum. It maintains a historical effective accuracy $\bar{a}_k$ for each task category $k$. The sampling probability $p_k$ for a feasible task is weighted inversely to this accuracy:
    $$p_k = \frac{w_k}{\sum_{j \in \mathcal{T}_s^{\text{feasible}}} w_j}, \quad w_k = \max(\delta, 1 - \bar{a}_k)$$
    where $\delta$ is a minimum exploration weight. This focuses training on the model's weakest areas.
  • Reward Design:
    • Questioner Reward ($r_Q$): Encourages generating valid, well-grounded questions:
      $$r_Q = \alpha f_{\text{fmt}} + (1-\alpha)\, f_{\text{valid}} \cdot f_{\text{obs}}$$
      where $f_{\text{fmt}}$ is format compliance, $f_{\text{valid}}$ is the DGE validity score, and $f_{\text{obs}}$ scores the quality of visual observation.
    • Solver Reward ($r_A$): Rewards accurate answers to valid questions and sound explanations for invalid ones:
      $$r_A = \begin{cases} \alpha f_{\text{fmt}} + (1-\alpha) f_{\text{acc}}, & \text{if } Q \text{ is valid} \\ \alpha f_{\text{fmt}} + (1-\alpha) f_{\text{explain}}, & \text{if } Q \text{ is invalid} \end{cases}$$
      where $f_{\text{acc}}$ is accuracy against the DGE ground truth and $f_{\text{explain}}$ is the quality of the explanation of why a question is invalid.
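The scheduler's inverse-accuracy weighting can be sketched directly from the formula above: $w_k = \max(\delta, 1 - \bar{a}_k)$, normalized over feasible tasks. The task names and accuracy values below are illustrative, not from the paper.

```python
# Sketch of the task-adaptive sampling rule: tasks with lower historical
# accuracy receive proportionally higher sampling probability, with a
# floor delta that preserves minimum exploration of mastered tasks.

def sampling_probs(accuracies, delta=0.1):
    """Map per-task historical accuracy a_k to sampling probability p_k."""
    weights = {k: max(delta, 1.0 - a) for k, a in accuracies.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Hypothetical accuracies for three task categories.
acc = {"relative_direction": 0.4, "object_counting": 0.8, "room_area": 0.95}
probs = sampling_probs(acc)

# The weakest task (relative_direction) is sampled most often.
assert probs["relative_direction"] > probs["object_counting"] > probs["room_area"]
print({k: round(p, 3) for k, p in probs.items()})
```

Note how the floor $\delta$ matters for nearly-mastered tasks: room_area's raw weight $1 - 0.95 = 0.05$ is clipped up to $0.1$, so the task is never starved of samples entirely.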

Empirical Validation / Results

Experiments were conducted across nine benchmarks using Qwen2.5-VL-3B and 7B models. Training data was constructed from ~4K scenes from ScanNet, ScanNet++, and ARKitScenes.

Main Results

Table 1: Main results across nine benchmarks. SpatialEvo achieves the highest average score at both 3B and 7B scales.

| Benchmark | Baseline (3B) | SpatialLadder (3B) | SpaceR (3B) | SpatialSSRL (3B) | SpatialEvo (3B) | Baseline (7B) | ViLaSR (7B) | SpaceR (7B) | SpatialSSRL (7B) |
|---|---|---|---|---|---|---|---|---|---|
| VSI-Bench | 28.1 | 45.7 | 36.0 | 28.0 | 39.2 | 31.1 | 45.4 | 36.8 | 33.7 |
| RealWorldQA | 63.4 | 57.1 | 61.4 | 65.4 | 66.5 | 69.5 | 57.9 | 64.7 | 69.9 |
| EmbSpatial | 55.9 | 57.6 | 55.6 | 59.8 | 61.2 | 63.6 | 47.8 | 60.3 | 69.3 |
| SpatialViz | 24.2 | 28.6 | 31.9 | 25.9 | 25.4 | 27.0 | 29.8 | 30.9 | 28.4 |
| STARE | 33.1 | 26.4 | 36.8 | 36.8 | 36.9 | 41.8 | 21.4 | 36.2 | 43.3 |
| CoreCognition | 56.8 | 58.3 | 29.1 | 57.6 | 57.4 | 59.6 | 56.4 | 56.4 | 60.2 |
| ViewSpatial | 36.2 | 43.0 | 35.9 | 38.4 | 42.3 | 36.4 | 32.3 | 35.1 | 37.5 |
| V-STAR | 74.9 | 36.7 | 75.4 | 77.0 | 75.4 | 78.5 | 35.6 | 73.8 | 79.1 |
| MMStar | 54.6 | 45.8 | 44.9 | 56.5 | 55.2 | 61.6 | 60.8 | 54.9 | 63.5 |
| AVG | 47.5 | 44.4 | 45.2 | 49.5 | 51.1 | 52.1 | 43.0 | 49.9 | 53.9 |
  • Spatial Reasoning Gains: SpatialEvo shows substantial improvements on core spatial benchmarks (VSI-Bench, EmbSpatial, ViewSpatial).
  • General Capability Retention: Performance on general visual understanding benchmarks (MMStar, RealWorldQA) remains competitive, indicating no degradation from spatial specialization.
  • Scale Consistency: Advantages hold at both 3B and 7B model scales.

Ablation Studies

Table 2: Ablation study on SpatialEvo (Qwen2.5-VL-7B). Removing physical grounding causes the largest performance drop.

| Variant | VSI-Bench | RealWorldQA | EmbSpatial | SpatialViz | STARE | CoreCognition | ViewSpatial | V-STAR | MMStar | Avg | Δ Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SpatialEvo (Ours) | 46.1 | 66.7 | 66.0 | 28.6 | 41.3 | 60.2 | 43.2 | 78.0 | 62.5 | 54.7 | |
| *Architecture Design* | | | | | | | | | | | |
| w/o Questioner | 40.2 | 67.1 | 65.7 | 27.7 | 39.0 | 60.5 | 40.4 | 77.0 | 59.9 | 53.1 | ↓ 1.6 |
| w/o Solver | 36.6 | 70.2 | 61.8 | 26.1 | 39.8 | 58.5 | 34.5 | 75.4 | 60.5 | 51.5 | ↓ 3.2 |
| w/o Physical Grounding | 18.8 | 68.5 | 63.5 | 23.4 | 39.7 | 59.7 | 35.4 | 77.0 | 60.6 | 49.6 | ↓ 5.1 |
| w/o Adaptive Scheduler | 43.4 | 68.5 | 68.0 | 27.5 | 39.5 | 60.3 | 43.2 | 77.0 | 62.4 | 54.4 | ↓ 0.3 |
  • Critical Role of DGE: Replacing DGE ground truth with majority-vote pseudo-labels (w/o Physical Grounding) causes the largest average drop (-5.1) and a catastrophic collapse on VSI-Bench (from 46.1 to 18.8), validating the necessity of deterministic feedback.
  • Importance of Co-Evolution: Removing either the Questioner or Solver role leads to significant performance degradation, confirming the value of the dual-role, self-play mechanism.
  • Scheduler Contribution: The adaptive scheduler provides a consistent, albeit smaller, boost to performance.

Analysis: Online Evolution vs. Static Learning

A controlled comparison shows that online self-evolution outperforms static dataset training, even when the static dataset is constructed from SpatialEvo's own offline generated data.

Table 3: Comparison of online self-evolution and static learning paradigms on VSI-Bench (Qwen2.5-VL-3B). SpatialEvo's online RL outperforms all SFT baselines.

| Paradigm | Method | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| | Qwen2.5-VL-3B (Base) | 33.5 | 21.1 | 17.9 | 22.6 | 32.8 | 42.7 | 29.9 | 21.0 | 28.0 |
| RL | w/ SpatialLadder RL | 62.8 | 29.8 | 59.2 | 27.8 | 38.9 | 44.2 | 33.5 | 24.9 | 40.1 |
| RL | SpatialEvo (Online RL) | 65.2 | 35.1 | 61.3 | 51.4 | 46.2 | 44.3 | 29.9 | 26.5 | 46.3 |
| SFT | w/ SpatialLadder Data | 63.0 | 31.9 | 61.2 | 43.0 | 43.0 | 43.6 | 32.5 | 31.7 | 43.7 |
| SFT | w/ SpaceR Data | 28.3 | 25.9 | 36.3 | 34.9 | 35.1 | 46.8 | 35.1 | 43.4 | 36.3 |
| SFT | w/ SpatialSSRL Data | 35.6 | 24.6 | 15.4 | 21.8 | 34.1 | 39.2 | 28.9 | 23.5 | 28.1 |
| SFT | w/ SpatialEvo Offline Data | 62.6 | 28.9 | 60.0 | 49.1 | 40.3 | 45.7 | 31.4 | 25.9 | 43.9 |

This demonstrates the key advantage of online self-evolution: the training distribution dynamically aligns with the model's current cognitive frontier through real-time DGE interaction, enabling adaptive hard-sample mining that static datasets cannot replicate.

Curriculum Emergence Analysis

Training dynamics confirm the endogenous emergence of an adaptive curriculum.

  • Reward Trajectories: Both Questioner and Solver rewards improve steadily. The Questioner quickly learns to generate valid questions ($f_{\text{valid}} \rightarrow 1.0$), while the Solver's accuracy reward increases.
  • Task Sampling Shifts: The adaptive scheduler successfully up-weights harder categories (e.g., Relative Direction from a uniform 16.7% to 21.8%) and down-weights easier ones (e.g., Room Area to 12.5%) as training progresses.

Theoretical and Practical Implications

  • Paradigm Shift for Self-Evolution: SpatialEvo demonstrates that in domains where ground truth is physically deterministic, self-evolution can be anchored to objective environmental feedback rather than noisy model consensus. This prevents error reinforcement and enables truly corrective learning.
  • Data-Efficient Continuous Improvement: The framework provides a scalable path for continuous model improvement in 3D spatial reasoning without proportional investment in human annotation. It turns unannotated 3D scene datasets into perpetual training engines.
  • Embodied Intelligence Foundation: The physically grounded self-evolution paradigm can serve as a reference for broader embodied AI research, where an agent's capabilities emerge from interaction with a verifiable physical world.
  • Limitations: The approach currently depends on high-fidelity 3D assets (dense point clouds, calibrated poses), limiting it to static indoor environments. Performance is also sensitive to the quality of entity parsing and the underlying point cloud reconstruction.

Conclusion

SpatialEvo introduces the first self-evolving framework for 3D spatial reasoning, grounded in a Deterministic Geometric Environment that turns unannotated 3D scenes into a source of exact, noise-free supervision for co-evolving Questioner and Solver roles.