TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation - Summary

Summary (Overview)

  • Introduces TerraScope, a unified Vision-Language Model (VLM) framework for pixel-grounded geospatial reasoning in Earth Observation (EO). It generates segmentation masks interleaved with textual reasoning chains, enabling fine-grained spatial analysis.
  • Proposes Terra-CoT, a large-scale instruction-tuning dataset of 1 million samples with pixel-level masks embedded in reasoning chains, generated via an automated hierarchical pipeline.
  • Introduces TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, comprising 3,837 expert-verified samples across six tasks, with dual metrics for answer accuracy and mask quality.
  • Demonstrates superior performance: TerraScope significantly outperforms 11 existing general and EO-specific VLMs on the proposed benchmark and shows strong generalization on established EO benchmarks (Landsat30-AU, DisasterM3).
  • Enables advanced reasoning capabilities: The framework supports modality-flexible reasoning (adaptive fusion of optical/SAR data) and multi-temporal reasoning for change analysis.

Introduction and Theoretical Foundation

Earth Observation satellites generate vast imagery archives critical for environmental monitoring, disaster response, and resource management. While Vision-Language Models (VLMs) offer a flexible, unified approach to EO data analysis, state-of-the-art VLMs struggle with fine-grained geospatial reasoning requiring pixel-accurate spatial analysis. Existing models often fail at tasks like calculating land-cover class coverage (see Fig. 1). Methods from natural images that use coarse-grained grounding (bounding boxes, crops) are inadequate for EO due to:

  1. Continuous spatial distributions: Land cover transitions gradually, making coarse grounding noisy.
  2. Multi-sensor, multi-temporal data: EO analysis requires integrating optical (spectral) and SAR (all-weather) imagery, as well as temporal sequences for change detection, within a unified framework.

To address this, TerraScope embodies the principle of "thinking with pixels". It explicitly localizes task-relevant regions and grounds each reasoning step in pixel-level visual evidence, moving beyond language-only or coarsely-grounded reasoning. Unlike prior VLMs that rely on external tools, TerraScope uses mixed decoders to jointly generate segmentation masks and reasoning traces intrinsically.

Methodology

3.1 Overview: Pixel-Grounded Visual Reasoning

Formally, let $f(\cdot)$ be a VLM with text encoder $f_T$ and vision encoder $f_V$. Given question $Q$ and image $I$, with text features $q = f_T(Q)$ and visual features $v = f_V(I)$, traditional VLMs perform language-only reasoning:

$$[\, r_1, r_2, \ldots, r_k, a \,] = f(v, q)$$

where $r_i$ are reasoning steps and $a$ is the final answer.

Pixel-grounded visual reasoning interleaves masked visual features:

$$[\, r_1, (m_1, v_1), r_2, (m_2, v_2), \ldots, r_k, (m_k, v_k), a \,] = f(v, q)$$

At each step $i$, the model generates a segmentation mask $m_i$ and selects masked visual features $v_i$ from the identified regions.
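The interleaved structure above can be sketched as a simple decoding loop. This is an illustrative stub, not TerraScope's actual implementation: `language_step`, `mask_decoder`, and `select_tokens` stand in for the neural components, and the `[SEG]`-trigger convention follows the framework description in Section 3.2.

```python
def pixel_grounded_reasoning(language_step, mask_decoder, select_tokens, n_steps):
    """Produce the interleaved trace [r_1, (m_1, v_1), ..., r_k, (m_k, v_k), a]:
    whenever a reasoning step ends with a [SEG] token, a mask is decoded and
    the masked visual features are appended to ground the next step."""
    trace = []
    for i in range(n_steps):
        r = language_step(i, trace)        # r_i; may end with "[SEG]"
        trace.append(r)
        if r.endswith("[SEG]"):
            m = mask_decoder(i)            # m_i: segmentation mask (stubbed)
            v = select_tokens(m)           # v_i: masked visual features (stubbed)
            trace.append((m, v))
    return trace
```

With stub callables, the loop yields the alternating reasoning/evidence structure from the equation above.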

3.2 TerraScope Framework

The architecture (Fig. 2) builds upon InternVL3, augmented with a pixel-level segmentation module.

  • Pixel-Grounded Chain-of-Thought: A cooperative mechanism between dual (language and mask) decoders. The language decoder triggers the mask decoder upon generating a [SEG] token. The predicted mask $m_i$ is aligned with the visual token grid. Visual tokens are selected if the mask covers >50% of their spatial region:

    $$v_i = \{\, v_j \mid m^{tok}_i[j] = 1,\; j \in [1, N] \,\}$$

    where $v_j$ is the $j$-th visual token and $m^{tok}_i$ is the token-level mask. The selected features $v_i$ are projected and injected into the LLM to guide subsequent reasoning.
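The >50% coverage rule can be made concrete with a small sketch. This is a minimal illustration under assumed conventions (a square patch grid, mask and token order in row-major layout); the helper names are our own, not the paper's.

```python
import numpy as np

def tokens_from_mask(mask, grid=16, threshold=0.5):
    """Convert a pixel-level binary mask (H, W) into the token-level mask
    m_i^tok over a grid x grid patch layout: a token is selected when the
    mask covers more than `threshold` of its patch area (>50% in the paper)."""
    H, W = mask.shape
    ph, pw = H // grid, W // grid
    # Group pixels by patch and average the mask coverage within each patch.
    patches = mask[:grid * ph, :grid * pw].reshape(grid, ph, grid, pw)
    coverage = patches.mean(axis=(1, 3))
    return (coverage > threshold).reshape(-1)  # length N = grid * grid

def select_visual_tokens(tokens, token_mask):
    """v_i = { v_j | m_i^tok[j] = 1 } -- gather the grounded visual tokens."""
    return tokens[token_mask]
```

For a 4x4 mask covering only the top-left quadrant and a 2x2 token grid, only the first token is selected.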

  • Multi-Modal Reasoning (Optical-SAR): For optical ($v^{opt}$) and SAR ($v^{SAR}$) features, text-guided, token-level modality selection is used. Cross-attention between text and each modality computes relevance scores $\beta^\mu_j$ for modality $\mu$:

    $$\beta^\mu_j = \frac{1}{L} \sum_{\ell=1}^{L} \mathrm{Softmax}\!\left( \frac{v^\mu q^\top}{\sqrt{D}} \right)_{j\ell}, \quad \mu \in \{\mathrm{opt}, \mathrm{SAR}\}$$

    Features are selected from the modality with the higher relevance per token:

    $$v_j = \begin{cases} v^{opt}_j & \text{if } \beta^{opt}_j > \beta^{SAR}_j \\ v^{SAR}_j & \text{otherwise} \end{cases}, \quad \forall j \text{ where } m^{tok}_i[j] = 1$$
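    A numpy sketch of this selection rule, under one assumption: the softmax is taken over the visual-token axis $j$ (normalizing over text positions would make every averaged score $1/L$, so this is the only reading that yields a meaningful per-token relevance). Function and variable names are ours.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def modality_select(v_opt, v_sar, q):
        """Per-token modality selection: beta^mu_j is cross-attention between
        each modality's N visual tokens and the L text tokens, normalized over
        tokens j and averaged over text positions; each token keeps the
        modality with the higher score.
        v_opt, v_sar: (N, D) visual tokens; q: (L, D) text tokens."""
        D = q.shape[-1]
        betas = {}
        for name, v in (("opt", v_opt), ("SAR", v_sar)):
            attn = softmax(v @ q.T / np.sqrt(D), axis=0)  # (N, L), softmax over j
            betas[name] = attn.mean(axis=1)               # average over ell = 1..L
        keep_opt = betas["opt"] > betas["SAR"]            # (N,) boolean
        return np.where(keep_opt[:, None], v_opt, v_sar), keep_opt
    ```

    When the text embedding aligns with one modality's token and not the other's, that token is drawn from the better-matching modality.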
  • Multi-Temporal Reasoning: Explicit temporal indicators (e.g., "Image: t_i") before [SEG] tokens specify which image in a sequence to segment and extract features from.

  • Training: A two-stage supervised fine-tuning process.

    1. Grounding Pretraining: Train on 2M referring expression segmentation pairs (frozen vision encoder, projector, LLM; train only mask decoder).
    2. Instruction Tuning: Fine-tune on 1M Terra-CoT samples (unfreeze projector & mask decoder; fine-tune LLM via LoRA). The combined loss is:

    $$\mathcal{L} = \mathcal{L}_{LM} + \lambda \mathcal{L}_{seg}$$

    where $\mathcal{L}_{LM}$ is the language-modeling loss and $\mathcal{L}_{seg}$ is the segmentation loss (Dice + cross-entropy); $\lambda = 0.5$.
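    The combined objective can be sketched numerically. This is a generic soft-Dice plus binary cross-entropy formulation assuming probability masks; the paper's exact loss weighting within $\mathcal{L}_{seg}$ is not specified in this summary, so equal weighting is assumed here.

    ```python
    import numpy as np

    def dice_loss(pred, target, eps=1e-6):
        """Soft Dice loss between probability mask `pred` and binary `target`."""
        inter = (pred * target).sum()
        return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

    def bce_loss(pred, target, eps=1e-6):
        """Mean per-pixel binary cross-entropy."""
        p = np.clip(pred, eps, 1.0 - eps)
        return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

    def total_loss(lm_loss, pred_mask, gt_mask, lam=0.5):
        """L = L_LM + lambda * L_seg, with L_seg = Dice + cross-entropy, lambda = 0.5."""
        return lm_loss + lam * (dice_loss(pred_mask, gt_mask) + bce_loss(pred_mask, gt_mask))
    ```

    A perfect mask prediction drives $\mathcal{L}_{seg}$ to (near) zero, leaving only the language-modeling term.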

3.3 Terra-CoT Dataset Curation Pipeline

A two-stage automated pipeline creates pixel-grounded reasoning data at scale (Fig. 3).

  1. Grounded Captioning with Chain-of-Thought (Cap-CoT): Use existing datasets with semantic annotations to prompt a large multimodal model to produce detailed captions referencing highlighted land-cover masks. Yields 250K Cap-CoT samples.
  2. Hierarchical Data Synthesis: Use a model (TerraScope-Cap) trained on Cap-CoT to annotate unlabeled global imagery. Then synthesize questions via a two-level process:
    • Level 1 (L1): Template-based questions for basic spatial grounding (existence, counting, localization, area, boundary).
    • Level 2 (L2): Use an LLM to compose L1 questions into complex reasoning: L2-Spatial (cross-entity analysis) and L2-Semantic (requires domain knowledge). This produces 1M Terra-CoT samples.
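The L1 template stage can be illustrated with a toy synthesizer. The templates below are hypothetical, written only to show the mechanism; the paper's actual prompt templates and annotation schema are not given in this summary.

```python
# Hypothetical L1 templates -- illustrative only, not the paper's actual prompts.
L1_TEMPLATES = {
    "existence": "Is there any {cls} in the image?",
    "counting": "How many distinct {cls} regions are present?",
    "area": "What percentage of the image is covered by {cls}?",
}

def synthesize_l1(annotations):
    """Turn per-class mask annotations into template-based L1 questions.
    `annotations` maps a land-cover class name to its pixel-coverage fraction,
    which serves as the ground-truth answer for area-type questions."""
    questions = []
    for cls, coverage in annotations.items():
        for task, tmpl in L1_TEMPLATES.items():
            questions.append({"task": task, "class": cls,
                              "question": tmpl.format(cls=cls),
                              "coverage": coverage})
    return questions
```

An LLM would then compose such L1 items into the L2-Spatial and L2-Semantic questions described above.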

Empirical Validation / Results

4. TerraScope-Bench

A new benchmark with 3,837 samples across six expert-verified task categories (Fig. 4):

  1. Coverage Percentage Analysis (CA)
  2. Absolute Area Quantification (AQ)
  3. Comparative Area Ranking (CR)
  4. Boundary Relationship Detection (BRD)
  5. Distance Measurement (DM)
  6. Building Change Estimation (BCE)

It features dual evaluation metrics: answer correctness and segmentation mask quality (IoU).
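The dual-metric protocol amounts to scoring answers and masks independently. A minimal sketch, with our own function names and a simple per-sample tuple format assumed for illustration:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = (pred | gt).sum()
    return (pred & gt).sum() / union if union else 1.0

def evaluate(samples):
    """Dual metrics over (pred_answer, gt_answer, pred_mask, gt_mask) tuples:
    multiple-choice accuracy and mean mask IoU, reported separately."""
    acc = np.mean([p == g for p, g, _, _ in samples])
    iou = np.mean([mask_iou(pm, gm) for _, _, pm, gm in samples])
    return acc, iou
```

Reporting the two numbers separately is what lets the benchmark distinguish a correct answer reached through faithful grounding from one reached despite a poor mask.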

5.1 Main Results

Table 1: Quantitative performance on TerraScope-Bench, Landsat30AU, and DisasterM3.

| Model | Size | TerraScope-Bench (Avg.) | Landsat30AU (Avg.) | DisasterM3 (Avg.) |
|---|---|---|---|---|
| **General VLMs** | | | | |
| GPT-4o † | - | 38.7 | - | 22.8 |
| LLaVA-OV | 7B | 37.5 | 57.0 | 25.3 |
| Qwen2.5-VL | 7B | 38.5 | 58.6 | 31.8 |
| InternVL3 | 8B | 36.0 | 54.8 | 27.2 |
| Qwen3-VL-Think ‡ | 8B | 43.3 | 65.0 | 32.5 |
| **EO-Specific VLMs** | | | | |
| GeoChat | 7B | 33.7 | 53.0 | - |
| EarthDial | 4B | 36.3 | 39.4 | 25.5 |
| EarthMind | 4B | 42.1 | - | - |
| **Fine-tuned on Terra-CoT** | | | | |
| InternVL3 | 8B | 54.9 | 67.6 | 36.1 |
| GLM-4.1V-Think ‡ | 9B | 59.6 | 68.0 | 38.8 |
| TerraScope | 8B | 68.9 | 73.9 | 46.5 |

† proprietary, ‡ reasoning models. "Avg." is average performance on multiple-choice tasks.

Key Findings:

  1. Pixel-grounded reasoning is challenging: Existing VLMs struggle, especially on precise spatial tasks (e.g., area estimation).
  2. EO-specific models show limited advantage, potentially due to training predominantly on high-resolution (<5m) data.
  3. Reasoning models perform better but lack visual grounding, leading to hallucinations.
  4. Terra-CoT effectively improves performance: Fine-tuning general VLMs on it yields substantial gains.
  5. TerraScope achieves SOTA: It outperforms all baselines on TerraScope-Bench and generalizes well.
  6. TerraScope provides interpretable reasoning: It achieves high segmentation IoU, demonstrating faithful spatial grounding (Fig. 5).

5.2 Ablation Studies

Table 2: Ablation on CoT strategies.

| CoT Strategy | TerraScope-Bench | Landsat30AU | DisasterM3 |
|---|---|---|---|
| Original (Pretrained) | 33.8 | 45.7 | 23.6 |
| Textual CoT w/o Seg. | 58.7 | 56.5 | 32.9 |
| Textual CoT with Seg. | 60.6 | 58.9 | 35.8 |
| Random-Mask CoT | 43.2 | 53.8 | 32.6 |
| Box CoT | 62.8 | 70.5 | 43.9 |
| TerraScope (Pixel) | 68.9 | 73.9 | 46.5 |

  • Pixel-level grounding is essential: Random-Mask CoT (random visual tokens) underperforms, and Box CoT (bounding box grounding) is inferior to precise pixel-level masking (TerraScope), especially for irregular land cover shapes.
  • Segmentation quality correlates with answer correctness: Correct predictions have a mean IoU of 0.628 vs. 0.443 for incorrect ones (Pearson correlation r=0.607, p<0.001). High-quality visual grounding is crucial for correct reasoning (Fig. 6).

Table 3: Ablation on multi-modal reasoning (Optical+SAR).

| Fusion Method | CA | AQ | CR | BRD | DM |
|---|---|---|---|---|---|
| No Fusion (Optical only) | 73.2 | 70.2 | 71.8 | 80.0 | 65.9 |
| Feature Concatenation | 74.5 | 71.6 | 73.0 | 81.2 | 67.4 |
| Text-guided (test only) | 72.3 | 69.0 | 66.7 | 78.8 | 63.6 |
| Text-guided (train+test) | 74.3 | 70.9 | 72.7 | 80.7 | 68.2 |

  • Multi-modal fusion improves performance: All fusion methods outperform optical-only.
  • Text-guided selection is effective and efficient: While concatenation scores slightly higher, text-guided selection reduces context length by processing only the relevant modality per token. Training with the selection mechanism is essential for it to work.

Visualization Insights (Fig. 7 & 8):

  • Cloud penetration: In cloud-contaminated cases, fusing SAR data enables accurate segmentation where optical-only fails.
  • Adaptive modality selection: The model prioritizes optical tokens in clear regions and SAR tokens in cloud-covered areas.
  • Structured, grounded reasoning: TerraScope decomposes complex questions into interpretable sub-steps, each grounded by a precise segmentation mask, leading to transparent numerical computations (e.g., pixel counting for area, distance measurement).

Theoretical and Practical Implications

  • Theoretical: Introduces and formalizes the novel paradigm of "pixel-grounded visual reasoning" for geospatial analysis, moving beyond language-only or coarsely-grounded reasoning. It demonstrates that interleaving precise segmentation masks with reasoning chains is a more faithful and effective approach for continuous spatial domains like EO.
  • Practical: TerraScope provides a unified, interpretable tool for fine-grained EO analysis tasks (coverage calculation, change detection, distance measurement) that are crucial for applications in environmental monitoring, disaster assessment, and urban planning. The released Terra-CoT dataset and TerraScope-Bench facilitate future research in pixel-grounded reasoning for EO.

Conclusion

TerraScope presents a comprehensive framework for pixel-grounded geospatial reasoning in Earth Observation. Its key innovations are:

  1. A unified VLM that generates segmentation masks interleaved with reasoning traces for precise, interpretable spatial analysis.
  2. Support for modality-flexible (optical/SAR) and multi-temporal reasoning.
  3. The Terra-CoT dataset (1M samples) and TerraScope-Bench benchmark to enable and evaluate pixel-grounded reasoning.

Experiments show TerraScope's significant superiority over existing VLMs and its strong generalization. The work establishes a new direction for developing VLMs capable of fine-grained, trustworthy reasoning in geospatial and other continuous spatial domains.