Visual Summary | SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Summary (Overview)

SpatialClaw proposes a training‑free framework for spatial reasoning that uses a persistent Python kernel as the action interface, replacing single‑pass code execution and structured tool‑call interfaces.
The framework enables iterative code generation, execution, feedback inspection, and revision over multiple steps, allowing flexible composition of perception tools and scientific libraries.
SpatialClaw is evaluated on 20 spatial reasoning benchmarks spanning single‑image, multi‑view, video/4D, and general spatial tasks, using six VLM backbones from the Qwen3.5, Qwen3.6, and Gemma4 families (27B–397B parameters).
SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent SpaceTools by +11.2 percentage points, with consistent gains across all backbones without any model‑ or benchmark‑specific adaptation.
Largest gains occur on tasks requiring chained geometric computation (multi‑view reasoning, video spatial & 4D reasoning), validating the design of code as the action interface.

Introduction and Theoretical Foundation

Spatial reasoning – determining object positions, relationships, and motion in 3D – remains challenging for vision‑language models (VLMs). Prior work augments VLMs with specialist perception tools (e.g., detectors, segmenters, depth estimators), but the agent’s capability depends crucially on the action interface through which tools are invoked:

Single‑pass code execution: The agent writes a complete Python program and runs it once, committing to a full analysis strategy before observing any intermediate result (e.g., masks, depth maps, runtime errors).
Structured tool‑calls: The agent dispatches pre‑registered tools via typed commands (e.g., JSON), offering limited support for composing outputs or performing test‑time computations not anticipated by the API.

Both interfaces struggle with open‑ended, compositional spatial reasoning. The authors argue that code should be treated as an orchestration space – not a one‑shot program or a dispatch interface – in which the agent sequences, composes, and diagnoses perception tools in response to intermediate evidence. This motivates SpatialClaw, whose design centers on a persistent Python kernel that retains state across steps, enabling flexible multi‑turn reasoning.

Methodology

SpatialClaw consists of two main components:

Persistent Kernel Workspace

Initialized once per example and discarded after termination.
Pre‑loaded primitives:
- InputImages – sampled frames or images.
- Metadata – frame rate, duration, frame indices.
- tools – perception modules (e.g., Reconstruct wraps Depth Anything 3 for depth, camera geometry, and point maps; SAM3 for segmentation) and utility wrappers for mask/geometry operations.
- show(...) – registers an image for visual feedback in the next step.
- vlm – dispatches queries to a separate VLM (e.g., visual grounding, commonsense reasoning).
- ReturnAnswer(...) – submits a candidate answer.
Scientific libraries (NumPy, SciPy, Matplotlib) are available for custom computation.
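
The idea of a workspace that keeps its state between cells can be illustrated with a minimal sketch. This is not the authors' implementation: the class PersistentKernel, its run_cell method, and the Metadata keys below are hypothetical names chosen for illustration, and the real tools, vlm, and InputImages primitives are omitted.

```python
import contextlib
import io
import traceback


class PersistentKernel:
    """Toy stand-in for a persistent Python kernel: one namespace, many cells."""

    def __init__(self, preloaded=None):
        # The namespace survives across run_cell calls, so later cells can
        # reuse variables defined by earlier ones.
        self.namespace = dict(preloaded or {})
        self.shown = []      # objects registered via show(...) for feedback
        self.answer = None   # set once ReturnAnswer(...) is called
        self.namespace["show"] = self.shown.append
        self.namespace["ReturnAnswer"] = self._return_answer

    def _return_answer(self, value):
        self.answer = value

    def run_cell(self, source):
        """Execute one cell; return (captured stdout, traceback text or None)."""
        buffer = io.StringIO()
        error = None
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                error = traceback.format_exc()
        return buffer.getvalue(), error


# Hypothetical metadata keys; the source only says frame rate, duration, indices.
kernel = PersistentKernel({"Metadata": {"fps": 2.0, "num_frames": 8}})
out1, err1 = kernel.run_cell("duration = Metadata['num_frames'] / Metadata['fps']")
out2, err2 = kernel.run_cell("print(duration)")        # reuses state from cell 1
out3, err3 = kernel.run_cell("print(undefined_name)")  # error is returned, not raised
kernel.run_cell("ReturnAnswer(duration)")
```

The second cell works only because the first cell's variable is still in the namespace, and the failing third cell returns a traceback that the agent could inspect before revising its code.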

Five‑Stage Agentic Loop (Figure 3)

Planning: A separate LLM session produces an analysis plan (without seeing images) from the question, metadata, and tool docs.
Code Generation: The main VLM writes one executable Python cell per step, conditioned on the plan, previous execution trajectory, and visual feedback.
Code Execution: The cell is executed in the persistent kernel after a static AST check for safety.
Feedback Assembly: Stdout, variable summaries, error tracebacks, and images from show() are appended to the model context.
Answer Submission: The loop terminates when ReturnAnswer() is called or when the maximum number of steps, $N_{\text{max}}=30$, is reached.
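
The generate, check, execute, and feed-back cycle above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the planning stage and visual feedback are left out, the VLM is replaced by a scripted list of cells, and static_check, run_loop, and FORBIDDEN_MODULES are hypothetical names. The source does not say what the AST check rejects, so the import deny-list here is only an example.

```python
import ast

FORBIDDEN_MODULES = {"os", "subprocess", "shutil"}  # illustrative deny-list only


def static_check(source):
    """Return a list of problems found by parsing the cell, without running it."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        problems += [f"forbidden import: {n}" for n in names if n in FORBIDDEN_MODULES]
    return problems


def run_loop(generate_cell, max_steps=30):
    """Run cells in one shared namespace until an answer or the step cap."""
    namespace, trajectory, result = {}, [], {}
    namespace["ReturnAnswer"] = lambda value: result.setdefault("answer", value)
    for _ in range(max_steps):
        cell = generate_cell(trajectory)
        problems = static_check(cell)
        if problems:
            feedback = "; ".join(problems)  # rejected cells are never executed
        else:
            try:
                exec(cell, namespace)
                feedback = "ok"
            except Exception as exc:
                feedback = f"{type(exc).__name__}: {exc}"
        trajectory.append((cell, feedback))
        if "answer" in result:
            break
    return result.get("answer"), trajectory


# Scripted stand-in for the model: the first cell is rejected, later ones run.
scripted = iter([
    "import os",
    "import math\nside = math.sqrt(16.0)",
    "ReturnAnswer(side * 2)",
])
answer, trajectory = run_loop(lambda history: next(scripted))
```

In this toy run the first cell is rejected by the static check, the second defines a variable, and the third reuses it to submit the answer, so the loop stops after three steps rather than at the cap.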

The system prompt encodes general spatial reasoning discipline (e.g., cross‑check evidence, prefer metric computation, visually inspect tool outputs, sanity‑check magnitudes) without benchmark‑specific examples.

Empirical Validation / Results

Main Results (Table 1)

Across 20 benchmarks and 6 backbones, SpatialClaw consistently improves over the no‑tool baseline. Key gains:

Video spatial & 4D reasoning: e.g., DSI‑Bench improves by +18.3 percentage points on average.
Multi‑view spatial reasoning: MindCube improves by +14.3 percentage points on average.
Average over all benchmarks: +4.6–7.7 percentage points depending on the backbone (e.g., with Gemma4‑31B, SpatialClaw reaches 59.9% vs. 53.4% for the no‑tool baseline).

Action Interface Comparison (Table 2)

With the toolset and system prompt held fixed, SpatialClaw achieves the highest average accuracy among the compared interfaces:

No‑tool baseline: 53.4%
Single‑Pass Code: 55.2%
Structured Tool‑Call: 56.7%
SpatialClaw: 59.9%

The largest margins occur on tasks requiring multi‑step geometric composition (e.g., MindCube: +15.3 percentage points over Structured Tool‑Call and +15.3 percentage points over Single‑Pass Code).
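
As a quick arithmetic check, the average margins implied by the four accuracies listed above can be computed directly. The numbers are taken from the comparison itself; only the variable names are introduced here.

```python
# Average accuracies (%) as listed in the interface comparison above.
accuracy = {
    "No-tool baseline": 53.4,
    "Single-Pass Code": 55.2,
    "Structured Tool-Call": 56.7,
    "SpatialClaw": 59.9,
}

# Margin of SpatialClaw over each alternative, in percentage points.
margins = {
    name: round(accuracy["SpatialClaw"] - value, 1)
    for name, value in accuracy.items()
    if name != "SpatialClaw"
}

for name, margin in margins.items():
    print(f"SpatialClaw vs. {name}: +{margin:.1f} pp")
```

On these averages SpatialClaw leads the no‑tool baseline by 6.5 points, Single‑Pass Code by 4.7 points, and Structured Tool‑Call by 3.2 points.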

Comparison with Other Spatial Agents (Table 3)

Method	Interface Type	Average (20 bench.)
VADAR (Marsili et al., 2025)	Single‑pass code	33.3%*
pySpatial (Luo et al., 2026)	Single‑pass code	47.8%
SpaceTools (Chen et al., 2026)	Structured tool‑call	48.7%
SpatialClaw (Ours)	Code as action	59.9%
*VADAR does not support video/multi‑image; values are from supported benchmarks.

SpatialClaw exceeds the best baseline (SpaceTools) by +11.2 percentage points on average.

Ablation Study (Table 4)

No utility functions (only SAM3/DA3 + scientific libraries): average 56.4% vs. full 56.9% – performance nearly on par, showing the agent can substitute utility logic with on‑the‑fly numerical computation.
No perception tools (only code interface + scientific libraries): average 51.4% vs. 48.7% for the no‑tool baseline, a gain of +2.7 percentage points that comes purely from the action interface, independent of external perception.

Analysis of Primitive Composition (Figure 5)

An analysis of the NumPy/SciPy operations the agent calls, broken down by 13 meta‑categories, shows spontaneous task‑adaptive composition:

Distance questions → heavy use of scipy.spatial.KDTree, np.linalg.norm, np.argmin
Direction questions → reliance on np.dot, np.cross
This specialization is not hard‑coded but emerges from question semantics, demonstrating the flexibility of code as the action interface.
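
To make the two patterns concrete, here is a small sketch of the kind of cell an agent might write for a distance question and a direction question. The point coordinates and the camera-frame convention (x right, y down, z forward) are made-up assumptions for illustration, not data or conventions from the paper.

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical 3D points in metres (camera frame: x right, y down, z forward).
chair = np.array([[0.0, 0.0, 2.0], [0.2, 0.0, 2.0], [0.2, 0.1, 2.2]])
table = np.array([[1.2, 0.0, 2.0], [1.5, 0.0, 2.4], [1.8, 0.1, 2.0]])

# Distance question: smallest gap between any chair point and any table point.
nearest_dist, nearest_idx = KDTree(table).query(chair)  # one result per chair point
i = int(np.argmin(nearest_dist))
closest_gap = float(np.linalg.norm(chair[i] - table[nearest_idx[i]]))

# Direction question: is the table left or right of the chair for a viewer
# looking along +z with "up" along -y?
forward = np.array([0.0, 0.0, 1.0])
up = np.array([0.0, -1.0, 0.0])
right = np.cross(forward, up)                    # viewer's right-hand axis
offset = table.mean(axis=0) - chair.mean(axis=0)
relation = "right" if np.dot(offset, right) > 0 else "left"

print(f"closest gap: {closest_gap:.2f} m, table is to the {relation} of the chair")
```

The distance branch leans on nearest-neighbour search and norms, while the direction branch reduces to a cross product to build an axis and a dot product to test the sign, which mirrors the split described above.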

Per‑Category Win/Loss Margins (Figure 4)

SpatialClaw secures a net advantage in 11 of 13 meta‑categories over both Structured Tool‑Call and Single‑Pass Code. The largest gains (+6–9 percentage points) are in:

Camera motion
Multi‑view/viewpoint reasoning
Relative direction

These categories require chained geometric computation across frames and viewpoints, which is where the persistent kernel's cross‑step composition and revision give the greatest leverage.
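
A rough sketch of what "chained geometric computation across viewpoints" can look like: a pixel with known depth is lifted to 3D in one view, moved into world coordinates, and then expressed in a second view. The intrinsics, poses, and function names below are illustrative assumptions, not values or APIs from the paper.

```python
import numpy as np


def backproject(pixel, depth, K):
    """Lift a pixel (u, v) with metric depth to a 3D point in the camera frame."""
    u, v = pixel
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return depth * ray


def to_world(point_cam, R_wc, t_wc):
    """Camera-to-world transform: p_world = R_wc @ p_cam + t_wc."""
    return R_wc @ point_cam + t_wc


def to_camera(point_world, R_wc, t_wc):
    """World-to-camera transform, the inverse of to_world."""
    return R_wc.T @ (point_world - t_wc)


# Hypothetical pinhole intrinsics shared by both views.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])

# View A sits at the world origin with identity orientation.
R_a, t_a = np.eye(3), np.zeros(3)
# View B stands at (2, 0, 2) and looks along world -x (a -90 degree yaw about y).
R_b = np.array([[0.0, 0.0, -1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
t_b = np.array([2.0, 0.0, 2.0])

# Step 1: the object is seen at the image centre of view A, 3 m away.
p_a = backproject((320.0, 240.0), 3.0, K)
# Step 2: express it in world coordinates.
p_world = to_world(p_a, R_a, t_a)
# Step 3: re-express it in view B and read off the relative direction.
p_b = to_camera(p_world, R_b, t_b)
side = "right" if p_b[0] > 0 else "left"
```

Each step depends on the result of the previous one, so an intermediate value such as p_world can be inspected or corrected before the final direction is read off in view B.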

Theoretical and Practical Implications

Action interface design is a critical, underexplored axis for agent‑based spatial reasoning. Code as an action interface enables flexible, iterative composition of perception outputs with numerical primitives, allowing agents to adapt to test‑time computations not anticipated by a fixed tool API.
Generalization across models and tasks: SpatialClaw delivers consistent improvements across six backbones (27B–397B) and 20 diverse benchmarks without any model‑ or benchmark‑specific tuning, suggesting the gains stem from the interface itself, not overfitting to a particular VLM.
Practical applicability: The framework is training‑free and can be readily deployed with existing open‑source VLMs and perception tools, offering a drop‑in improvement for spatial reasoning in autonomous driving, robotics, video understanding, and 3D scene analysis.

Conclusion

SpatialClaw demonstrates that rethinking the action interface – adopting code as a persistent, multi‑turn orchestration medium – yields consistent and significant improvements in spatial reasoning for VLMs. The framework achieves 59.9% average accuracy across 20 benchmarks, +11.2 percentage points over the strongest prior agent (SpaceTools), without any model‑ or benchmark‑specific adaptation. The gains are largest on tasks requiring multi‑step geometric composition, which supports the claim that the expressive action interface is the primary driver of performance. Future directions include extending to more complex 3D/4D reasoning, integrating learning‑based tool selection, and exploring the interface design for other modalities (e.g., audio, tactile).