Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training - Summary

Summary (Overview)

  • Proposes Spatial-TTT, a framework for streaming visual-based spatial intelligence that uses Test-Time Training (TTT) to update a subset of parameters ("fast weights") online, acting as a compact non-linear memory to accumulate 3D evidence from long-horizon video streams.
  • Introduces a hybrid TTT architecture that interleaves TTT layers with standard self-attention anchor layers (3:1 ratio) to preserve pretrained model knowledge while enabling efficient long-context compression. It pairs large-chunk updates with parallel sliding-window attention (SWA) for efficient spatial video processing.
  • Develops a spatial-predictive mechanism using lightweight depth-wise 3D spatiotemporal convolutions on the Q/K/V projections of TTT layers to capture geometric correspondence and temporal continuity, making fast-weight updates more spatially coherent.
  • Constructs a dense scene-description dataset to provide rich, scene-level supervision for training the fast-weight update dynamics to retain structured global 3D information, bridging the gap left by sparse spatial QA datasets.
  • Achieves state-of-the-art performance on major video spatial benchmarks (VSI-Bench, MindCube, VSI-SUPER) while maintaining efficient, near-linear scaling in memory and computation with increasing video length.

Introduction and Theoretical Foundation

Spatial intelligence—the ability to perceive, understand, and reason about 3D structure and geometric relationships—is crucial for applications like embodied robotics, autonomous driving, and AR. In real-world scenarios, this information is gathered from continuous streams of visual observations, not single static images. This necessitates streaming spatial understanding: the capability to selectively maintain, progressively update, and reason over spatial memory from long-horizon video inputs.

While Multimodal Large Language Models (MLLMs) excel at 2D understanding, they struggle with 3D spatial tasks due to a lack of geometric priors from their 2D image-text training. Existing spatial-aware MLLMs are limited to single images or short clips and cannot scale to the thousands of frames in practical streaming scenarios. Naively extending the input sequence is computationally prohibitive due to quadratic attention complexity, while aggressive temporal subsampling loses critical spatial details.

To address this, the paper introduces Spatial-TTT, built on the Test-Time Training (TTT) paradigm. Instead of fixed parameters, TTT maintains adaptive fast weights that are updated online during inference, serving as a compact memory to accumulate evidence from unbounded streams. The core challenge is designing an architecture and training strategy that enables these fast weights to effectively capture, organize, and retain spatial information over time.

Methodology

1. Overall Framework and Hybrid TTT Architecture

The framework, illustrated in Figure 2, is based on a transformer decoder. To preserve the pretrained model's cross-modal alignment and semantic reasoning, a hybrid architecture is used: 75% of decoder layers are TTT layers, interleaved with 25% standard self-attention anchor layers. The anchor layers maintain full attention over the context.

  • Large-Chunk Updates & Parallel Sliding-Window Attention (SWA): For efficiency, visual tokens are processed in large chunks (size b), aligned with multiple video frames. Within each TTT layer, a TTT branch and a SWA branch operate in parallel, sharing Q/K/V projections. The SWA (with window size w ≥ b) ensures intra-chunk spatiotemporal continuity, which the causal TTT update alone cannot provide. The layer output combines both branches: $o_t = \text{WindowAttn}(q_t, K_{[t-w:t]}, V_{[t-w:t]}) + f_{W_t}(q_t)$, where $f_W$ is the fast-weight network (a bias-free SwiGLU-MLP): $f_W(x) = W_2[\text{SiLU}(W_1 x) \odot (W_3 x)]$, and $\odot$ denotes element-wise multiplication.
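The two-branch combination above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the weight shapes, the precomputed window-attention output, and the single-token view are all assumptions made for clarity.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def fast_weight_forward(x, W1, W2, W3):
    """Bias-free SwiGLU-MLP fast-weight network:
    f_W(x) = W2 @ (SiLU(W1 @ x) * (W3 @ x))."""
    return W2 @ (silu(W1 @ x) * (W3 @ x))

def ttt_layer_output(q_t, window_attn_out, W1, W2, W3):
    """Layer output o_t: the SWA branch result plus the fast-weight
    branch applied to the (already projected) query token."""
    return window_attn_out + fast_weight_forward(q_t, W1, W2, W3)
```

Note that the two branches are simply summed, so the SWA branch supplies local spatiotemporal context while the fast-weight branch contributes compressed long-range memory.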

2. Spatial-Predictive Mechanism

Standard TTT uses point-wise linear projections for Q/K/V, ignoring spatiotemporal structure. To inject geometric inductive bias, lightweight depth-wise 3D spatiotemporal convolutions are applied to the Q, K, V of the TTT branch. For a token at position $(t, h, w)$ and channel $i$:

$$\tilde{x}^{i}_{t,h,w} = \sum_{\delta \in \mathcal{N}} \theta^{i}_{\delta} \, x^{i}_{t+\delta_t,\, h+\delta_h,\, w+\delta_w}, \qquad x \in \{q, k, v\}$$

where $\mathcal{N} = \{-\lfloor \kappa/2 \rfloor, \ldots, \lfloor \kappa/2 \rfloor\}^3$ is the local 3D neighborhood with kernel size $\kappa$, and $\theta$ are learnable weights initialized as a Dirac delta.
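A depth-wise 3D convolution with Dirac-delta initialization can be sketched as follows. Zero padding and the (C, T, H, W) layout are assumptions for illustration; the paper's exact padding and stride are not specified in this summary. The Dirac initialization makes the convolution an identity map at the start of training.

```python
import numpy as np

def depthwise_conv3d(x, theta):
    """Depth-wise 3D convolution (cross-correlation, as in conv layers).
    x: (C, T, H, W) tokens; theta: (C, k, k, k) per-channel kernels.
    Each channel is convolved independently with zero ("same") padding."""
    C, T, H, W = x.shape
    k = theta.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))
    out = np.zeros_like(x)
    for dt in range(k):
        for dh in range(k):
            for dw in range(k):
                out += theta[:, dt, dh, dw][:, None, None, None] * \
                       xp[:, dt:dt + T, dh:dh + H, dw:dw + W]
    return out

def dirac_kernels(C, k):
    """Dirac-delta initialization: only the center tap is 1, so the
    convolution output equals its input at initialization."""
    theta = np.zeros((C, k, k, k))
    theta[:, k // 2, k // 2, k // 2] = 1.0
    return theta
```

In practice this would be a `groups=channels` 3D convolution in a deep-learning framework; the loop form here just makes the per-channel neighborhood sum explicit.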

The fast weights are updated using the Muon update rule for stability:

$$G_t = \text{MuonUpdate}\big(G_{t-1},\, \nabla_W \mathcal{L}(f_{W_{t-1}}(\tilde{k}_t), \tilde{v}_t)\big), \qquad W_t \leftarrow \text{L2Norm}(W_{t-1} - \eta G_t)$$

where $G_t$ is the orthogonalized gradient with momentum.
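A sketch of one such update step is below. The Newton-Schulz coefficients, the momentum constant, and the row-wise reading of L2Norm are assumptions borrowed from the publicly described Muon optimizer, not details given in this summary.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G with a quintic Newton-Schulz
    iteration (coefficients commonly used with Muon; an assumption)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, G_prev, grad, lr=0.02, momentum=0.9):
    """One fast-weight update: accumulate momentum, orthogonalize the
    search direction, step, then row-wise L2-normalize the weights."""
    G = momentum * G_prev + grad
    W_new = W - lr * newton_schulz_orthogonalize(G)
    W_new = W_new / (np.linalg.norm(W_new, axis=-1, keepdims=True) + 1e-7)
    return W_new, G
```

The orthogonalization keeps the update direction well-conditioned across all singular directions, which is the stability property the paper relies on for online fast-weight updates.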

3. Bridging Sparse Supervision with Dense Scene Descriptions

Existing spatial QA datasets provide sparse, local supervision. To teach the model to update fast weights to retain globally useful 3D evidence, a dense scene-description dataset is constructed from SceneVerse annotations. The model is trained to generate comprehensive descriptions covering:

  1. Global context (scene type/function)
  2. Objects and counts
  3. Object relations (spatial layouts, pairwise relations)

This provides rich, high-coverage gradients for learning effective update dynamics.

4. Spatial-Aware Progressive Training Strategy

A two-stage training strategy is employed:

  1. Stage 1 (Dense Description Training): Train the hybrid TTT architecture on the dense scene-description dataset. A sliding window annealing strategy is used: the SWA window size w is linearly annealed from w_max down to the chunk size b, forcing TTT layers to gradually take over cross-chunk information propagation.
  2. Stage 2 (Spatial VQA Fine-tuning): Fine-tune on 2M spatial VQA samples (object direction/distance, counting, route planning, etc.) with w = b fixed, teaching the model to selectively retain task-relevant evidence.
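The Stage-1 annealing schedule can be written as a small helper. The linear interpolation is stated in the text; the step-based parameterization is an illustrative assumption.

```python
def annealed_window(step, total_steps, w_max, b):
    """Linearly anneal the SWA window size from w_max at step 0 down to
    the chunk size b at the end of Stage-1 training (a sketch of the
    described schedule; the exact step granularity is assumed)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(round(w_max + frac * (b - w_max)))
```

Shrinking w toward b removes the attention branch's ability to see across chunks, so cross-chunk information must increasingly flow through the fast weights.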

Inference: Uses a dual KV cache mechanism—a fixed-length sliding window cache for SWA and a TTT pending cache that triggers a fast-weight update when it reaches size b.
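The dual-cache inference loop can be sketched as follows. The class and the `update_fast_weights` callback are hypothetical names standing in for the model's SWA cache and TTT step; only the trigger logic (fixed-length window, chunk-sized pending buffer) is from the text.

```python
class DualCache:
    """Sketch of the dual KV-cache mechanism: a fixed-length sliding
    window cache for SWA, plus a pending cache that triggers a
    fast-weight update once it reaches the chunk size b."""

    def __init__(self, window_size, chunk_size, update_fast_weights):
        self.window, self.w = [], window_size     # SWA cache, max length w
        self.pending, self.b = [], chunk_size     # TTT pending cache
        self.update_fast_weights = update_fast_weights

    def push(self, kv):
        self.window.append(kv)
        if len(self.window) > self.w:             # evict oldest KV entry
            self.window.pop(0)
        self.pending.append(kv)
        if len(self.pending) == self.b:           # chunk full: TTT update
            self.update_fast_weights(self.pending)
            self.pending.clear()
```

Because both caches are bounded (w entries and at most b entries), per-token memory stays constant regardless of stream length, which is what gives the near-linear scaling reported below.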

Empirical Validation / Results

Extensive experiments were conducted on spatial benchmarks. The model is initialized from Qwen3-VL-2B-Instruct.

Key Results Table

Table 1: Evaluation Results on VSI-Bench (Yang et al., 2025a)

| Models | Obj. Count (MRA) | Abs. Dist (MRA) | Obj. Size (MRA) |
|---|---|---|---|
| Human | 94.3 | 47.0 | 60.4 |
| Spatial-TTT-2B (Ours) | 70.8 | 47.8 | 71.7 |
| Best Prior Open-Source (VST-7B-SFT) | 72.0 | 44.4 | 74.3 |
| Best Prior Proprietary (GPT-5) | 53.3 | 34.4 | 73.3 |

  • VSI-Bench: Spatial-TTT-2B achieves state-of-the-art overall performance (Avg. 64.4), outperforming all proprietary and open-source baselines despite its compact 2B size. It shows particular strength in Relative Direction and Route Plan tasks.
  • MindCube-Tiny: Spatial-TTT achieves 76.2% accuracy, significantly outperforming the best proprietary (Gemini-3-pro, 63.9%) and open-source (MindCube-3B, 51.7%) baselines, demonstrating strong reasoning under viewpoint changes.
  • VSI-SUPER (Streaming): As shown in Table 3, Spatial-TTT maintains stable performance on long videos (up to 120 minutes) for both recall and counting tasks, while baselines (Qwen3-VL-2B, Cambrian-S-7B) fail or run out of memory on longer sequences.

Ablation Study Results

Table 4: Ablations of Spatial-TTT on VSI-Bench

| Setting | Numerical | Multiple-Choice | Avg. |
|---|---|---|---|
| Spatial-TTT (Full) | 64.0 | 64.8 | 64.4 |
| w/o SP-Mechanism | 60.7 | 63.4 | 62.1 |
| w/o Dense Data | 61.0 | 61.5 | 61.3 |
| w/o Hybrid Arch | 55.4 | 52.4 | 53.9 |

  • Each core component contributes to the final performance; removing the hybrid architecture causes the largest drop (Avg. 64.4 → 53.9).

Efficiency Analysis

Table 5: Peak Memory Usage (GB) and TFLOPs

| Models | 128 frames (MEM / TFLOPs) | 256 frames (MEM / TFLOPs) | 512 frames | 1024 frames |
|---|---|---|---|---|
| Qwen3-VL-2B (Quadratic) | 6.2 / 75.9 | 8.3 / 179.9 | – | – |
| Spatial-MLLM-4B (Heavy Encoder) | 25.9 / 1698.8 | 41.8 / 6002.1 | – | – |
| Spatial-TTT-2B (Ours) (Linear) | 6.2 / 74.3 | 7.0 / 156.2 | – | – |

  • Spatial-TTT demonstrates near-linear scaling in memory and computation, making it practical for long streaming videos.
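The scaling difference can be illustrated with a rough cost model. These formulas are illustrative (constants and minor terms dropped), not the paper's actual accounting; `n` is the token count, `d` the hidden size, `w` the SWA window.

```python
def full_attn_flops(n, d):
    """Rough FLOPs for full self-attention scores: n^2 * d.
    Doubling the sequence length quadruples the cost."""
    return n * n * d

def swa_ttt_flops(n, d, w):
    """Rough FLOPs for sliding-window attention plus a per-token
    fast-weight pass: n*w*d + n*d^2, i.e. linear in n."""
    return n * w * d + n * d * d
```

Under this model, doubling the stream length doubles the SWA+TTT cost but quadruples the full-attention cost, matching the near-linear trend in Table 5.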

Theoretical and Practical Implications

  • Theoretical: Introduces a novel paradigm for streaming spatial intelligence by combining Test-Time Training with spatiotemporal inductive biases. It demonstrates that online parameter adaptation can serve as an effective, compact memory mechanism for accumulating evidence over arbitrarily long sequences, challenging the static inference paradigm.
  • Practical: Provides a scalable solution for real-world applications requiring long-horizon spatial understanding (e.g., robotic navigation, long-term AR scene understanding). The efficiency gains (linear scaling) make deployment on resource-constrained devices more feasible. The dense scene-description dataset and training strategy offer a blueprint for teaching models to build structured, persistent spatial representations.

Conclusion

Spatial-TTT presents a comprehensive framework for streaming visual-based spatial intelligence. By leveraging test-time training with a hybrid architecture, spatial-predictive mechanisms, and dense supervision, it enables MLLMs to maintain and update spatial memory effectively over long video streams. The method achieves state-of-the-art performance across benchmarks while maintaining computational efficiency. This work points toward a promising direction for building AI systems with persistent spatial memory, essential for robust interaction with the 3D world. Future work may explore applying similar principles to other streaming multimodal tasks and further optimizing the fast-weight update dynamics.