Summary (Overview)
- Core Idea: Introduces token warping as a robust method to enable Multimodal Large Language Models (MLLMs) to reason about scenes from nearby viewpoints, by warping the model's internal image tokens rather than synthesizing new pixels.
- Key Finding: Backward token warping (defining a dense grid on the target view and fetching corresponding tokens from the source) outperforms forward warping and pixel-based methods, as it provides the MLLM with a regular, dense token structure it is trained on.
- Robustness: Demonstrates that MLLM image tokens are inherently robust to positional noise during patch fetching, making them a suitable substrate for geometric transformation even with imperfect depth estimates.
- Benchmark: Introduces ViewBench, a new benchmark for evaluating viewpoint-conditioned spatial reasoning and target-view object description, constructed from real-world 3D scans (ScanNet).
- Superior Performance: The proposed token warping method consistently outperforms strong baselines, including specialist MLLMs fine-tuned for spatial reasoning, pixel-wise warping, and generative novel view synthesis.
Introduction and Theoretical Foundation
A core challenge in spatial reasoning for MLLMs is understanding how a scene appears from a different viewpoint. While depth estimation is highly accurate, incorporating it into MLLMs does not yield genuine 3D understanding, and models struggle with simple viewpoint transformation tasks.
The research is inspired by classical theories of mental imagery (Shepard, Minsky, Pylyshyn, Hinton), which posit that mental images rely on part-level structural descriptions rather than holistic object-level or pixel-level representations. The evolution of computer vision, particularly the adoption of Vision Transformers (ViTs), has converged on image tokens as perceptual atomic units. This paper investigates whether transforming these tokens can generate consistent internal scene representations under viewpoint changes.
The hypothesis is that token-level transformations are more robust than pixel-level warping, which amplifies small depth errors into severe distortions, and more detailed than object-level abstractions, which sacrifice fine-grained spatial coherence.
Methodology
3.1. Image Tokenization in MLLMs
In ViT-based MLLMs, an image is partitioned into a fixed, non-overlapping grid of patches $\{p_i\}$, each associated with a grid-center coordinate $c_i$. A shallow encoder maps each patch to an embedding $e_i$. These embeddings, together with their coordinates, are processed by a vision encoder (e.g., a ViT) to produce image tokens $\{t_i\}$.
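To make the tokenization geometry concrete, here is a minimal sketch of computing grid-center coordinates for a non-overlapping patch grid. The 224-pixel image size and 14-pixel patch size are illustrative ViT-style defaults, not values from the paper:

```python
import numpy as np

def grid_centers(img_h, img_w, patch=14):
    """Grid-center (x, y) coordinates of non-overlapping patches.

    patch=14 is an illustrative ViT-style patch size, not the paper's value.
    """
    ys = np.arange(patch / 2, img_h, patch)    # row centers: 7, 21, ..., 217
    xs = np.arange(patch / 2, img_w, patch)    # column centers
    xx, yy = np.meshgrid(xs, ys)               # (H/p, W/p) coordinate grids
    return np.stack([xx, yy], axis=-1)         # (H/p, W/p, 2) centers

centers = grid_centers(224, 224)
print(centers.shape)   # (16, 16, 2)
print(centers[0, 0])   # [7. 7.] -- center of the top-left patch
```

Each token thus carries a well-defined pixel coordinate, which is what the warping functions below transform.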
3.2. Fetching Position Noise Sensitivity Test
To demonstrate token robustness, a controlled experiment perturbs the positional information of tokens. For each token with coordinate $c_i$, a displacement vector $\delta_i$ is sampled, smoothed, normalized, and scaled by a maximum displacement value. This emulates the noisy positional perturbations introduced by warping.
Key Result: MLLM performance (Qwen2.5-VL on CV-Bench-2D) remains stable under increasing perturbation (0-20 pixels), degrading only mildly at extreme noise levels. This contrasts with a pixel-level perturbation baseline, showing tokens are a more robust representation for geometric transformation.
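The sample-smooth-normalize-scale recipe might be sketched as follows. The Gaussian noise source, the smoothing bandwidth, and normalization by the field's peak magnitude are assumptions filling in details the summary does not specify:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def perturb_positions(coords, max_disp, sigma=1.0, rng=None):
    """Perturb token grid-center coordinates with a smooth noise field.

    coords:   (H, W, 2) grid-center coordinates.
    max_disp: maximum displacement in pixels.
    Noise source, sigma, and the normalization scheme are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = rng.standard_normal(coords.shape)                  # sample a random field
    d = gaussian_filter(d, sigma=(sigma, sigma, 0))        # smooth spatially
    mag = np.linalg.norm(d, axis=-1, keepdims=True)
    d = d / mag.max()                                      # normalize by peak magnitude
    return coords + max_disp * d                           # scale to max displacement

coords = np.stack(np.meshgrid(np.arange(7, 224, 14),
                              np.arange(7, 224, 14)), axis=-1).astype(float)
noisy = perturb_positions(coords, max_disp=10.0, rng=np.random.default_rng(0))
print(round(float(np.linalg.norm(noisy - coords, axis=-1).max()), 6))  # 10.0
```

The displaced coordinates can then be fed to the model in place of the clean grid centers to measure sensitivity.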
3.3. Designing Token Warping Functions
The goal is to answer a question about the scene from a target viewpoint $v_t$, given a source image $I_s$ captured from viewpoint $v_s$, its depth map $D_s$, and camera intrinsics $K$.
Let $\{c_i\}$ denote the grid-center coordinates of $I_s$. The warped coordinates $c_i'$ are computed via a warping function $W$:
$$c_i' = W(c_i) = \pi\!\left(T_{s\to t}\, \pi^{-1}(c_i, D_s(c_i); K)\right)$$
where $T_{s\to t}$ is the relative pose and $\pi$ projects from source to target (with $\pi^{-1}$ the depth-based unprojection).
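Under a pinhole camera model, this warping function is the standard unproject-transform-project chain. A minimal sketch, assuming both views share the intrinsics $K$ (the function and variable names are illustrative, not the paper's implementation):

```python
import numpy as np

def warp_coords(coords, depth, K, T_rel):
    """Map source grid-center coordinates into the target view.

    coords: (N, 2) pixel coordinates in the source image.
    depth:  (N,) depth value per coordinate.
    K:      (3, 3) camera intrinsics (shared by both views, an assumption).
    T_rel:  (4, 4) relative pose, source -> target.
    """
    n = coords.shape[0]
    pix = np.concatenate([coords, np.ones((n, 1))], axis=1)  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix.T                          # back-project to rays
    pts_src = rays * depth                                   # 3D points, source frame
    pts_tgt = (T_rel @ np.vstack([pts_src, np.ones((1, n))]))[:3]  # to target frame
    proj = K @ pts_tgt                                       # perspective projection
    return (proj[:2] / proj[2]).T                            # (N, 2) target pixels

K = np.array([[500., 0., 112.], [0., 500., 112.], [0., 0., 1.]])
c = np.array([[7., 7.], [112., 112.]])
# Identity pose: coordinates map back to themselves.
print(warp_coords(c, np.array([2.0, 2.0]), K, np.eye(4)))
```

Forward warping applies this map to every source token; backward warping inverts the direction, which is what the next design choice addresses.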
The paper explores two main strategies and two fetching variants:
- Forward vs. Backward Warping:
- Forward: Projects each source token coordinate to the target view via $W$. This often yields sparse, irregular token distributions.
- Backward: Defines a dense, regular grid on the target view and maps each grid point back to the source via the inverse mapping $W^{-1}$ to fetch a corresponding token. This produces regularly placed tokens. Backward warping is adopted as the primary strategy.
- Nearest vs. Adaptive Fetching (within backward warping):
- Nearest Fetching: For each target grid point's mapped source coordinate $c_i'$, retrieve the token at the nearest source grid-center point.
- Adaptive Fetching: For each $c_i'$, dynamically crop a new patch centered at that coordinate from the source image and encode it into a token.
The backward mapping is implemented by constructing a 3D proxy mesh from the source depth map and using ray casting from the target grid.
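Given the backward-mapped source coordinates, nearest fetching reduces to snapping each coordinate to a grid cell and gathering tokens. A toy sketch, assuming a 16x16 grid of 14-pixel patches (all names and sizes are illustrative):

```python
import numpy as np

def nearest_fetch(src_coords, tokens, patch=14, grid_hw=(16, 16)):
    """Nearest fetching: for each backward-mapped source coordinate,
    take the token whose grid center is closest.

    src_coords: (N, 2) (x, y) coordinates in the source image, e.g. from
                ray-casting the target grid against the depth proxy mesh.
    tokens:     (H*W, D) source image tokens in row-major grid order.
    patch=14 and the 16x16 grid are illustrative assumptions.
    """
    h, w = grid_hw
    # Snap each coordinate to the nearest grid-cell index, clamped to bounds.
    ix = np.clip(np.round((src_coords[:, 0] - patch / 2) / patch), 0, w - 1).astype(int)
    iy = np.clip(np.round((src_coords[:, 1] - patch / 2) / patch), 0, h - 1).astype(int)
    return tokens[iy * w + ix]                 # gather (N, D) warped tokens

tokens = np.arange(16 * 16)[:, None].astype(float)   # toy 1-D "tokens"
fetched = nearest_fetch(np.array([[7.0, 7.0], [20.0, 7.0], [7.0, 21.0]]), tokens)
print(fetched.ravel())   # [ 0.  1. 16.]
```

Adaptive fetching would instead re-crop and re-encode a patch at each $c_i'$, trading extra encoder calls for sub-grid precision; the paper finds the simpler nearest variant performs comparably.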
Empirical Validation / Results
4. ViewBench Benchmark
A new benchmark constructed from ScanNet to evaluate viewpoint-conditioned reasoning.
- Data: Source-target image pairs with known poses and overlap ratios (5-15%, 15-25%, 25-35%).
- Tasks:
- View-Conditioned Spatial Reasoning: Asks about the left-right relationship between two points (annotated with text or shapes) from the target viewpoint. The relationship flips between source and target.
- Target-View Object Description: Asks to describe an object at a specified point from the target viewpoint.
- Metrics: Accuracy (%) for spatial reasoning; a score from 1-10 (rated by Qwen2.5-14B) for object description.
5. Evaluation
Baselines: Specialist MLLMs (SpatialReasoner, VLM-3R, ViLaSR), base MLLM (Qwen2.5-VL), generative novel view synthesis (GenWarp), and pixel-wise warping (forward/backward).
Key Quantitative Results (Table 1; 5-15% overlap, GT depth):
| Method | ViewBench-Text Acc. (%) | ViewBench-Shape Acc. (%) | ViewBench-Object Score (1-10) |
|---|---|---|---|
| Target View (Oracle) | 100.00 | 100.00 | 6.64 |
| Qwen2.5-VL (Source only) | 46.23 | 24.42 | - |
| Pixel-Wise Backward | 71.86 | 62.40 | 4.53 |
| Token Warping Backward-Nearest | 74.87 | 67.44 | 4.80 |
| Token Warping Backward-Adaptive | 77.89 | 67.44 | 4.97 |
- Backward token warping (both nearest and adaptive) consistently outperforms all baselines across all tasks and overlap levels.
- Nearest fetching performs comparably to adaptive fetching, offering a simpler, more efficient solution.
- Token warping significantly outperforms pixel-wise warping and specialist MLLMs, and surpasses generative NVS (GenWarp).
Additional Findings (Supplementary):
- The advantage holds with estimated depth and pose (Depth Anything v2, Depth Pro, VGGT).
- Token warping remains effective under extreme viewpoint shifts (2-5% overlap) and occlusion scenarios.
- A geometry-based oracle (bypassing the MLLM) achieves >93% accuracy, confirming the geometric pipeline's reliability.
Qualitative Results: Visualizations show that pixel-wise warping introduces severe distortions and artifacts, while token warping preserves semantic coherence, leading to correct MLLM responses.
Theoretical and Practical Implications
Theoretical: The work provides empirical support for the cognitive theory that part-level structural representations (here, image tokens) are an effective substrate for mental imagery and viewpoint transformation. It bridges classic cognitive science with modern multimodal AI architectures.
Practical:
- Efficiency: Token warping adds minimal inference-time overhead without requiring model fine-tuning.
- Robustness: Offers a reliable method for viewpoint simulation that is tolerant to geometric noise from depth estimation.
- Application: Enhances MLLMs' capability for embodied tasks (e.g., navigation, manipulation) where reasoning from unseen nearby viewpoints is crucial.
- Benchmark: ViewBench provides a focused testbed for evaluating and improving viewpoint-aware reasoning in MLLMs.
Conclusion
Inspired by part-based mental imagery theories, this paper demonstrates that token-level warping is a simple yet highly effective strategy for enabling MLLMs to look from nearby viewpoints. Backward warping that constructs a dense, regular target token grid is key to robust performance. The proposed method, requiring only a source image, depth, and relative pose, consistently outperforms strong alternatives, establishing tokens as a robust perceptual substrate for geometric transformation in MLLMs.