TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation - Summary

Summary (Overview)

  • Introduces TerraScope, a unified Vision-Language Model (VLM) framework for pixel-grounded geospatial reasoning in Earth Observation (EO). It generates segmentation masks interleaved with textual reasoning chains, enabling fine-grained spatial analysis.
  • Proposes Terra-CoT, a large-scale instruction-tuning dataset of 1 million samples with pixel-level masks embedded in reasoning chains, generated via an automated hierarchical pipeline.
  • Introduces TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, comprising 3,837 expert-verified samples across six tasks, with dual metrics for answer accuracy and mask quality.
  • Demonstrates superior performance: TerraScope significantly outperforms 11 existing general and EO-specific VLMs on the proposed benchmark and shows strong generalization on established EO benchmarks (Landsat30-AU, DisasterM3).
  • Enables advanced reasoning capabilities: The framework supports modality-flexible reasoning (adaptive fusion of optical/SAR data) and multi-temporal reasoning for change analysis.

Introduction and Theoretical Foundation

Earth Observation satellites generate vast imagery archives critical for environmental monitoring, disaster response, and resource management. While Vision-Language Models (VLMs) offer a flexible, unified approach to EO data analysis, state-of-the-art VLMs struggle with fine-grained geospatial reasoning requiring pixel-accurate spatial analysis. Existing models often fail at tasks like calculating land-cover class coverage (see Fig. 1). Methods from natural images that use coarse-grained grounding (bounding boxes, crops) are inadequate for EO due to:

  1. Continuous spatial distributions: Land cover transitions gradually, making coarse grounding noisy.
  2. Multi-sensor, multi-temporal data: EO analysis requires integrating optical (spectral) and SAR (all-weather) imagery, as well as temporal sequences for change detection, within a unified framework.

To address this, TerraScope embodies the principle of "thinking with pixels". It explicitly localizes task-relevant regions and grounds each reasoning step in pixel-level visual evidence, moving beyond language-only or coarsely-grounded reasoning. Unlike prior VLMs that rely on external tools, TerraScope uses mixed decoders to jointly generate segmentation masks and reasoning traces intrinsically.

Methodology

3.1 Overview: Pixel-Grounded Visual Reasoning

Formally, let $f(\cdot)$ be a VLM with text encoder $f_T$ and vision encoder $f_V$. Given question $Q$ and image $I$, with text features $q = f_T(Q)$ and visual features $v = f_V(I)$, traditional VLMs perform language-only reasoning:

$$[\, r_1, r_2, \ldots, r_k, a \,] = f(v, q)$$

where $r_i$ are reasoning steps and $a$ is the final answer.

Pixel-grounded visual reasoning interleaves masked visual features:

$$[\, r_1, (m_1, v_1), r_2, (m_2, v_2), \ldots, r_k, (m_k, v_k), a \,] = f(v, q)$$

At each step $i$, the model generates a segmentation mask $m_i$ and selects masked visual features $v_i$ from the identified regions.
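The interleaved structure above can be sketched as a simple decoding loop. This is an illustrative stub, not TerraScope's actual implementation: `language_step`, `mask_decoder`, and `select_tokens` stand in for the neural components, and the `[SEG]`-trigger convention follows the framework description in Section 3.2.

```python
def pixel_grounded_reasoning(language_step, mask_decoder, select_tokens, n_steps):
    """Produce the interleaved trace [r_1, (m_1, v_1), ..., r_k, (m_k, v_k), a]:
    whenever a reasoning step ends with a [SEG] token, a mask is decoded and
    the masked visual features are appended to ground the next step."""
    trace = []
    for i in range(n_steps):
        r = language_step(i, trace)        # r_i; may end with "[SEG]"
        trace.append(r)
        if r.endswith("[SEG]"):
            m = mask_decoder(i)            # m_i: segmentation mask (stubbed)
            v = select_tokens(m)           # v_i: masked visual features (stubbed)
            trace.append((m, v))
    return trace
```

With stub callables, the loop yields the alternating reasoning/evidence structure from the equation above.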

3.2 TerraScope Framework

The architecture (Fig. 2) builds upon InternVL3, augmented with a pixel-level segmentation module.

  • Pixel-Grounded Chain-of-Thought: A cooperative mechanism between dual (language and mask) decoders. The language decoder triggers the mask decoder upon generating a [SEG] token. The predicted mask $m_i$ is aligned with the visual token grid. Visual tokens are selected if the mask covers >50% of their spatial region:

    $$v_i = \{\, v_j \mid m^{tok}_i[j] = 1,\; j \in [1, N] \,\}$$

    where $v_j$ is the $j$-th visual token and $m^{tok}_i$ is the token-level mask. The selected features $v_i$ are projected and injected into the LLM to guide subsequent reasoning.
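The >50% coverage rule can be made concrete with a small sketch. This is a minimal illustration under assumed conventions (a square patch grid, mask and token order in row-major layout); the helper names are our own, not the paper's.

```python
import numpy as np

def tokens_from_mask(mask, grid=16, threshold=0.5):
    """Convert a pixel-level binary mask (H, W) into the token-level mask
    m_i^tok over a grid x grid patch layout: a token is selected when the
    mask covers more than `threshold` of its patch area (>50% in the paper)."""
    H, W = mask.shape
    ph, pw = H // grid, W // grid
    # Group pixels by patch and average the mask coverage within each patch.
    patches = mask[:grid * ph, :grid * pw].reshape(grid, ph, grid, pw)
    coverage = patches.mean(axis=(1, 3))
    return (coverage > threshold).reshape(-1)  # length N = grid * grid

def select_visual_tokens(tokens, token_mask):
    """v_i = { v_j | m_i^tok[j] = 1 } -- gather the grounded visual tokens."""
    return tokens[token_mask]
```

For a 4x4 mask covering only the top-left quadrant and a 2x2 token grid, only the first token is selected.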

  • Multi-Modal Reasoning (Optical-SAR): For optical ($v^{opt}$) and SAR ($v^{SAR}$) features, text-guided, token-level modality selection is used. Cross-attention between text and each modality computes relevance scores $\beta^\mu_j$ for modality $\mu$:

    $$\beta^\mu_j = \frac{1}{L} \sum_{\ell=1}^{L} \mathrm{Softmax}\!\left( \frac{v^\mu q^\top}{\sqrt{D}} \right)_{j\ell}, \quad \mu \in \{\mathrm{opt}, \mathrm{SAR}\}$$

    Features are selected from the modality with the higher relevance per token:

    $$v_j = \begin{cases} v^{opt}_j & \text{if } \beta^{opt}_j > \beta^{SAR}_j \\ v^{SAR}_j & \text{otherwise} \end{cases}, \quad \forall j \text{ where } m^{tok}_i[j] = 1$$
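    A numpy sketch of this selection rule, under one assumption: the softmax is taken over the visual-token axis $j$ (normalizing over text positions would make every averaged score $1/L$, so this is the only reading that yields a meaningful per-token relevance). Function and variable names are ours.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def modality_select(v_opt, v_sar, q):
        """Per-token modality selection: beta^mu_j is cross-attention between
        each modality's N visual tokens and the L text tokens, normalized over
        tokens j and averaged over text positions; each token keeps the
        modality with the higher score.
        v_opt, v_sar: (N, D) visual tokens; q: (L, D) text tokens."""
        D = q.shape[-1]
        betas = {}
        for name, v in (("opt", v_opt), ("SAR", v_sar)):
            attn = softmax(v @ q.T / np.sqrt(D), axis=0)  # (N, L), softmax over j
            betas[name] = attn.mean(axis=1)               # average over ell = 1..L
        keep_opt = betas["opt"] > betas["SAR"]            # (N,) boolean
        return np.where(keep_opt[:, None], v_opt, v_sar), keep_opt
    ```

    When the text embedding aligns with one modality's token and not the other's, that token is drawn from the better-matching modality.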
  • Multi-Temporal Reasoning: Explicit temporal indicators (e.g., "Image: t_i") before [SEG] tokens specify which image in a sequence to segment and extract features from.

  • Training: A two-stage supervised fine-tuning process.

    1. Grounding Pretraining: Train on 2M referring expression segmentation pairs (frozen vision encoder, projector, LLM; train only mask decoder).
    2. Instruction Tuning: Fine-tune on 1M Terra-CoT samples (unfreeze projector & mask decoder; fine-tune LLM via LoRA). The combined loss is:

    $$\mathcal{L} = \mathcal{L}_{LM} + \lambda \mathcal{L}_{seg}$$

    where $\mathcal{L}_{LM}$ is the language-modeling loss and $\mathcal{L}_{seg}$ is the segmentation loss (Dice + cross-entropy); $\lambda = 0.5$.
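    The combined objective can be sketched numerically. This is a generic soft-Dice plus binary cross-entropy formulation assuming probability masks; the paper's exact loss weighting within $\mathcal{L}_{seg}$ is not specified in this summary, so equal weighting is assumed here.

    ```python
    import numpy as np

    def dice_loss(pred, target, eps=1e-6):
        """Soft Dice loss between probability mask `pred` and binary `target`."""
        inter = (pred * target).sum()
        return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

    def bce_loss(pred, target, eps=1e-6):
        """Mean per-pixel binary cross-entropy."""
        p = np.clip(pred, eps, 1.0 - eps)
        return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

    def total_loss(lm_loss, pred_mask, gt_mask, lam=0.5):
        """L = L_LM + lambda * L_seg, with L_seg = Dice + cross-entropy, lambda = 0.5."""
        return lm_loss + lam * (dice_loss(pred_mask, gt_mask) + bce_loss(pred_mask, gt_mask))
    ```

    A perfect mask prediction drives $\mathcal{L}_{seg}$ to (near) zero, leaving only the language-modeling term.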

3.3 Terra-CoT Dataset Curation Pipeline

A two-stage automated pipeline creates pixel-grounded reasoning data at scale (Fig. 3).

  1. Grounded Captioning with Chain-of-Thought (Cap-CoT): Use existing datasets with semantic annotations to prompt a large multimodal model to produce detailed captions referencing highlighted land-cover masks. Yields 250K Cap-CoT samples.
  2. Hierarchical Data Synthesis: Use a model (TerraScope-Cap) trained on Cap-CoT to annotate unlabeled global imagery. Then synthesize questions via a two-level process:
    • Level 1 (L1): Template-based questions for basic spatial grounding (existence, counting, localization, area, boundary).
    • Level 2 (L2): Use an LLM to compose L1 questions into complex reasoning: L2-Spatial (cross-entity analysis) and L2-Semantic (requires domain knowledge). This produces 1M Terra-CoT samples.
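The L1 template stage can be illustrated with a toy synthesizer. The templates below are hypothetical, written only to show the mechanism; the paper's actual prompt templates and annotation schema are not given in this summary.

```python
# Hypothetical L1 templates -- illustrative only, not the paper's actual prompts.
L1_TEMPLATES = {
    "existence": "Is there any {cls} in the image?",
    "counting": "How many distinct {cls} regions are present?",
    "area": "What percentage of the image is covered by {cls}?",
}

def synthesize_l1(annotations):
    """Turn per-class mask annotations into template-based L1 questions.
    `annotations` maps a land-cover class name to its pixel-coverage fraction,
    which serves as the ground-truth answer for area-type questions."""
    questions = []
    for cls, coverage in annotations.items():
        for task, tmpl in L1_TEMPLATES.items():
            questions.append({"task": task, "class": cls,
                              "question": tmpl.format(cls=cls),
                              "coverage": coverage})
    return questions
```

An LLM would then compose such L1 items into the L2-Spatial and L2-Semantic questions described above.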

Empirical Validation / Results

4. TerraScope-Bench

A new benchmark with 3,837 samples across six expert-verified task categories (Fig. 4):

  1. Coverage Percentage Analysis (CA)
  2. Absolute Area Quantification (AQ)
  3. Comparative Area Ranking (CR)
  4. Boundary Relationship Detection (BRD)
  5. Distance Measurement (DM)
  6. Building Change Estimation (BCE)

It features dual evaluation metrics: answer correctness and segmentation mask quality (IoU).
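The dual-metric protocol amounts to scoring answers and masks independently. A minimal sketch, with our own function names and a simple per-sample tuple format assumed for illustration:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = (pred | gt).sum()
    return (pred & gt).sum() / union if union else 1.0

def evaluate(samples):
    """Dual metrics over (pred_answer, gt_answer, pred_mask, gt_mask) tuples:
    multiple-choice accuracy and mean mask IoU, reported separately."""
    acc = np.mean([p == g for p, g, _, _ in samples])
    iou = np.mean([mask_iou(pm, gm) for _, _, pm, gm in samples])
    return acc, iou
```

Reporting the two numbers separately is what lets the benchmark distinguish a correct answer reached through faithful grounding from one reached despite a poor mask.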

5.1 Main Results

Table 1: Quantitative performance on TerraScope-Bench, Landsat30AU, and DisasterM3.

| Model | Size | TerraScope-Bench (Avg.) | Landsat30AU (Avg.) | DisasterM3 (Avg.) |
|---|---|---|---|---|
| **General VLMs** | | | | |
| GPT-4o † | - | 38.7 | - | 22.8 |
| LLaVA-OV | 7B | 37.5 | 57.0 | 25.3 |
| Qwen2.5-VL | 7B | 38.5 | 58.6 | 31.8 |
| InternVL3 | 8B | 36.0 | 54.8 | 27.2 |
| Qwen3-VL-Think ‡ | 8B | 43.3 | 65.0 | 32.5 |
| **EO-Specific VLMs** | | | | |
| GeoChat | 7B | 33.7 | 53.0 | - |
| EarthDial | 4B | 36.3 | 39.4 | 25.5 |
| EarthMind | 4B | 42.1 | - | - |
| **Fine-tuned on Terra-CoT** | | | | |
| InternVL3 | 8B | 54.9 | 67.6 | 36.1 |
| GLM-4.1V-Think ‡ | 9B | 59.6 | 68.0 | 38.8 |
| TerraScope | 8B | 68.9 | 73.9 | 46.5 |

† proprietary, ‡ reasoning models. "Avg." is average performance on multiple-choice tasks.

Key Findings:

  1. Pixel-grounded reasoning is challenging: Existing VLMs struggle, especially on precise spatial tasks (e.g., area estimation).
  2. EO-specific models show limited advantage, potentially due to training predominantly on high-resolution (<5m) data.
  3. Reasoning models perform better but lack visual grounding, leading to hallucinations.
  4. Terra-CoT effectively improves performance: Fine-tuning general VLMs on it yields substantial gains.
  5. TerraScope achieves SOTA: It outperforms all baselines on TerraScope-Bench and generalizes well.
  6. TerraScope provides interpretable reasoning: It achieves high segmentation IoU, demonstrating faithful spatial grounding (Fig. 5).

5.2 Ablation Studies

Table 2: Ablation on CoT strategies.

| CoT Strategy | TerraScope-Bench | Landsat30AU | DisasterM3 |
|---|---|---|---|
| Original (Pretrained) | 33.8 | 45.7 | 23.6 |
| Textual CoT w/o Seg. | 58.7 | 56.5 | 32.9 |
| Textual CoT with Seg. | 60.6 | 58.9 | 35.8 |
| Random-Mask CoT | 43.2 | 53.8 | 32.6 |
| Box CoT | 62.8 | 70.5 | 43.9 |
| TerraScope (Pixel) | 68.9 | 73.9 | 46.5 |

  • Pixel-level grounding is essential: Random-Mask CoT (random visual tokens) underperforms, and Box CoT (bounding box grounding) is inferior to precise pixel-level masking (TerraScope), especially for irregular land cover shapes.
  • Segmentation quality correlates with answer correctness: Correct predictions have a mean IoU of 0.628 vs. 0.443 for incorrect ones (Pearson correlation r=0.607, p<0.001). High-quality visual grounding is crucial for correct reasoning (Fig. 6).

Table 3: Ablation on multi-modal reasoning (Optical+SAR).

| Fusion Method | CA | AQ | CR | BRD | DM |
|---|---|---|---|---|---|
| No Fusion (Optical only) | 73.2 | 70.2 | 71.8 | 80.0 | 65.9 |
| Feature Concatenation | 74.5 | 71.6 | 73.0 | 81.2 | 67.4 |
| Text-guided (test only) | 72.3 | 69.0 | 66.7 | 78.8 | 63.6 |
| Text-guided (train+test) | 74.3 | 70.9 | 72.7 | 80.7 | 68.2 |

  • Multi-modal fusion improves performance: All fusion methods outperform optical-only.
  • Text-guided selection is effective and efficient: While concatenation scores slightly higher, text-guided selection reduces context length by processing only the relevant modality per token. Training with the selection mechanism is essential for it to work.

Visualization Insights (Fig. 7 & 8):

  • Cloud penetration: In cloud-contaminated cases, fusing SAR data enables accurate segmentation where optical-only fails.
  • Adaptive modality selection: The model prioritizes optical tokens in clear regions and SAR tokens in cloud-covered areas.
  • Structured, grounded reasoning: TerraScope decomposes complex questions into interpretable sub-steps, each grounded by a precise segmentation mask, leading to transparent numerical computations (e.g., pixel counting for area, distance measurement).

Theoretical and Practical Implications

  • Theoretical: Introduces and formalizes the novel paradigm of "pixel-grounded visual reasoning" for geospatial analysis, moving beyond language-only or coarsely-grounded reasoning. It demonstrates that interleaving precise segmentation masks with reasoning chains is a more faithful and effective approach for continuous spatial domains like EO.
  • Practical: TerraScope provides a unified, interpretable tool for fine-grained EO analysis tasks (coverage calculation, change detection, distance measurement) that are crucial for applications in environmental monitoring, disaster assessment, and urban planning. The released Terra-CoT dataset and TerraScope-Bench facilitate future research in pixel-grounded reasoning for EO.

Conclusion

TerraScope presents a comprehensive framework for pixel-grounded geospatial reasoning in Earth Observation. Its key innovations are:

  1. A unified VLM that generates segmentation masks interleaved with reasoning traces for precise, interpretable spatial analysis.
  2. Support for modality-flexible (optical/SAR) and multi-temporal reasoning.
  3. The Terra-CoT dataset (1M samples) and TerraScope-Bench benchmark to enable and evaluate pixel-grounded reasoning.

Experiments show TerraScope's significant superiority over existing VLMs and its strong generalization. The work establishes a new direction for developing VLMs capable of fine-grained, trustworthy reasoning in geospatial and other continuous spatial domains.