# TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

> TerraScope introduces a Vision-Language Model that generates pixel-level segmentation masks interleaved with textual reasoning chains for fine-grained geospatial analysis, outperforming existing models.

- **Source:** [arXiv](https://arxiv.org/abs/2603.19039)
- **Published:** 2026-03-24
- **Permalink:** https://picx.dev/p/JdnaaU
- **Whiteboard:** https://picx.dev/p/JdnaaU/image

## Summary

## Summary (Overview)
*   **Introduces TerraScope**, a unified Vision-Language Model (VLM) framework for **pixel-grounded geospatial reasoning** in Earth Observation (EO). It generates segmentation masks interleaved with textual reasoning chains, enabling fine-grained spatial analysis.
*   **Proposes Terra-CoT**, a large-scale instruction-tuning dataset of **1 million samples** with pixel-level masks embedded in reasoning chains, generated via an automated hierarchical pipeline.
*   **Introduces TerraScope-Bench**, the first benchmark for pixel-grounded geospatial reasoning, comprising **3,837 expert-verified samples** across six tasks, with dual metrics for **answer accuracy** and **mask quality**.
*   **Demonstrates superior performance**: TerraScope significantly outperforms 11 existing general and EO-specific VLMs on the proposed benchmark and shows strong generalization on established EO benchmarks (Landsat30-AU, DisasterM3).
*   **Enables advanced reasoning capabilities**: The framework supports **modality-flexible reasoning** (adaptive fusion of optical/SAR data) and **multi-temporal reasoning** for change analysis.

## Introduction and Theoretical Foundation
Earth Observation satellites generate vast imagery archives critical for environmental monitoring, disaster response, and resource management. While Vision-Language Models (VLMs) offer a flexible, unified approach to EO data analysis, **state-of-the-art VLMs struggle with fine-grained geospatial reasoning requiring pixel-accurate spatial analysis**. Existing models often fail at tasks like calculating land-cover class coverage (see Fig. 1). Methods from natural images that use coarse-grained grounding (bounding boxes, crops) are inadequate for EO due to:
1.  **Continuous spatial distributions**: Land cover transitions gradually, making coarse grounding noisy.
2.  **Multi-sensor, multi-temporal data**: EO analysis requires integrating optical (spectral) and SAR (all-weather) imagery, as well as temporal sequences for change detection, within a unified framework.

To address this, TerraScope embodies the principle of **"thinking with pixels"**. It explicitly localizes task-relevant regions and grounds each reasoning step in pixel-level visual evidence, moving beyond language-only or coarsely-grounded reasoning. Unlike prior VLMs that rely on external tools, TerraScope uses **mixed decoders** to jointly generate segmentation masks and reasoning traces intrinsically.

## Methodology

### 3.1 Overview: Pixel-Grounded Visual Reasoning
Formally, let $f(\cdot)$ be a VLM with text encoder $f_T$ and vision encoder $f_V$. Given a question $Q$ and an image $I$, encoded as $q = f_T(Q)$ and $v = f_V(I)$, traditional VLMs perform language-only reasoning:
$$ [ r_1, r_2, ..., r_k, a ] = f(v, q) $$
where $r_i$ are reasoning steps and $a$ is the final answer.

**Pixel-grounded visual reasoning** interleaves masked visual features:
$$ [ r_1, (m_1, v_1), r_2, (m_2, v_2), ..., r_k, (m_k, v_k), a ] = f(v, q) $$
At each step $i$, the model generates a segmentation mask $m_i$ and selects masked visual features $v_i$ from the identified regions.

### 3.2 TerraScope Framework
The architecture (Fig. 2) builds upon InternVL3, augmented with a pixel-level segmentation module.

*   **Pixel-Grounded Chain-of-Thought**: A cooperative mechanism between dual (language and mask) decoders. The language decoder triggers the mask decoder upon generating a `[SEG]` token. The predicted mask $m_i$ is aligned with the visual token grid. Visual tokens are selected if the mask covers >50% of their spatial region:
    $$ v_i = \{ v_j | m^{tok}_i[j] = 1, j \in [1, N] \} $$
    where $v_j$ is the $j$-th visual token and $m^{tok}_i$ is the token-level mask. The selected features $v_i$ are projected and injected into the LLM to guide subsequent reasoning.

*   **Multi-Modal Reasoning (Optical-SAR)**: For optical ($v^{opt}$) and SAR ($v^{SAR}$) features, text-guided, token-level modality selection is used. Cross-attention between text and each modality computes relevance scores $\beta^\mu_j$ for modality $\mu$:
    $$ \beta^\mu_j = \frac{1}{L} \sum_{\ell=1}^{L} \text{Softmax}\left( \frac{{v^\mu q^\top}}{\sqrt{D}} \right)_{j\ell}, \quad \mu \in \{\text{opt}, \text{SAR}\} $$
    where $L$ is the number of text tokens and $D$ is the feature dimension. Features are selected from the modality with higher relevance per token:
    $$ v_j = \begin{cases} v^{opt}_j & \text{if } \beta^{opt}_j > \beta^{SAR}_j \\ v^{SAR}_j & \text{otherwise} \end{cases}, \quad \forall j \text{ where } m^{tok}_i[j] = 1 $$

*   **Multi-Temporal Reasoning**: Explicit temporal indicators (e.g., "Image: t_i") before `[SEG]` tokens specify which image in a sequence to segment and extract features from.

*   **Training**: A two-stage supervised fine-tuning process.
    1.  **Grounding Pretraining**: Train on 2M referring expression segmentation pairs (frozen vision encoder, projector, LLM; train only mask decoder).
    2.  **Instruction Tuning**: Fine-tune on 1M Terra-CoT samples (unfreeze projector & mask decoder; fine-tune LLM via LoRA).
    The combined loss is:
    $$ \mathcal{L} = \mathcal{L}_{LM} + \lambda \mathcal{L}_{seg} $$
    where $\mathcal{L}_{LM}$ is language modeling loss and $\mathcal{L}_{seg}$ is segmentation loss (Dice + cross-entropy); $\lambda=0.5$.
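
The token selection and modality selection rules above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the square `patch` size per visual token, and the toy features are assumptions. The summary does not say which axis the softmax normalizes over; the sketch normalizes over visual tokens $j$ for each text token, because normalizing over text tokens would make every $\beta_j$ equal to $1/L$.

```python
import math


def mask_to_token_mask(mask, patch, threshold=0.5):
    """Reduce a binary pixel mask (list of rows) to one flag per visual token.

    Assumption: each token covers a patch x patch pixel square. A token is
    kept when the mask covers more than `threshold` of that square.
    """
    height, width = len(mask), len(mask[0])
    token_mask = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))
            covered = sum(mask[y][x] for y in rows for x in cols)
            ratio = covered / (len(rows) * len(cols))
            token_mask.append(1 if ratio > threshold else 0)
    return token_mask


def relevance(tokens, text):
    """beta_j: softmax of scaled dot products, averaged over text tokens.

    Assumption: the softmax runs over visual tokens j for each text token.
    """
    dim = len(text[0])
    beta = [0.0] * len(tokens)
    for q in text:
        scores = [sum(a * b for a, b in zip(v, q)) / math.sqrt(dim) for v in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        for j, e in enumerate(exps):
            beta[j] += e / norm / len(text)
    return beta


def select_tokens(v_opt, v_sar, text, token_mask):
    """For each masked token, keep the feature from the more relevant modality."""
    beta_opt, beta_sar = relevance(v_opt, text), relevance(v_sar, text)
    selected = []
    for j, keep in enumerate(token_mask):
        if keep:
            use_opt = beta_opt[j] > beta_sar[j]
            selected.append(("opt", v_opt[j]) if use_opt else ("sar", v_sar[j]))
    return selected


# Toy 4x4 mask with 2x2 patches, giving four visual tokens.
mask = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0], [1, 0, 0, 0]]
token_mask = mask_to_token_mask(mask, patch=2)  # [1, 0, 1, 0]

text = [[1.0, 0.0]]  # one text token, D = 2
v_opt = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
v_sar = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
picked = select_tokens(v_opt, v_sar, text, token_mask)
print([name for name, _ in picked])  # ['opt', 'sar']
```

In the toy run, token 0 aligns with the text in the optical features and token 2 in the SAR features, so one feature is taken from each modality. A patch covered exactly 50% is dropped, since the rule is strictly greater than 50%.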

### 3.3 Terra-CoT Dataset Curation Pipeline
A two-stage automated pipeline creates pixel-grounded reasoning data at scale (Fig. 3).
1.  **Grounded Captioning with Chain-of-Thought (Cap-CoT)**: Use existing datasets with semantic annotations to prompt a large multimodal model to produce detailed captions referencing highlighted land-cover masks. Yields **250K Cap-CoT samples**.
2.  **Hierarchical Data Synthesis**: Use a model (TerraScope-Cap) trained on Cap-CoT to annotate unlabeled global imagery. Then synthesize questions via a two-level process:
    *   **Level 1 (L1)**: Template-based questions for basic spatial grounding (existence, counting, localization, area, boundary).
    *   **Level 2 (L2)**: Use an LLM to compose L1 questions into complex reasoning: **L2-Spatial** (cross-entity analysis) and **L2-Semantic** (requires domain knowledge).
    This produces **1M Terra-CoT samples**.
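
As a rough illustration of the Level 1 step, the sketch below turns a per-pixel class map into template-based existence and area questions, each paired with its mask. The template wording, the field names, and the `l1_questions` helper are hypothetical; the summary does not give the actual templates.

```python
def l1_questions(label_map, class_names):
    """Build template-based L1 samples from a per-pixel class-id map.

    `label_map` is a list of rows of class ids; `class_names` maps id -> name.
    The question templates here are illustrative, not the paper's.
    """
    total = sum(len(row) for row in label_map)
    samples = []
    for cid, name in class_names.items():
        mask = [[1 if c == cid else 0 for c in row] for row in label_map]
        count = sum(map(sum, mask))
        samples.append({
            "type": "existence",
            "question": f"Is there any {name} in the image?",
            "answer": "yes" if count else "no",
            "mask": mask,
        })
        if count:
            samples.append({
                "type": "area",
                "question": f"What percentage of the image is covered by {name}?",
                "answer": f"{100 * count / total:.1f}%",
                "mask": mask,
            })
    return samples


label_map = [[0, 0], [0, 1]]
names = {0: "water", 1: "forest", 2: "cropland"}
for sample in l1_questions(label_map, names):
    print(sample["type"], "|", sample["question"], "|", sample["answer"])
```

Because every answer is computed from the mask itself, L1 samples of this kind carry their pixel-level evidence with them, which is what the Level 2 composition step can then build on.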

## Empirical Validation / Results

### 4. TerraScope-Bench
A new benchmark with **3,837 samples** across six expert-verified task categories (Fig. 4):
1.  Coverage Percentage Analysis (CA)
2.  Absolute Area Quantification (AQ)
3.  Comparative Area Ranking (CR)
4.  Boundary Relationship Detection (BRD)
5.  Distance Measurement (DM)
6.  Building Change Estimation (BCE)
It features **dual evaluation metrics**: answer correctness and segmentation mask quality (IoU).
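
The two metrics can be computed along these lines. This is a minimal sketch: the sample field names are assumptions, and scoring an empty prediction against an empty ground truth as IoU 1.0 is a convention chosen here, not one stated in the summary.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as lists of rows."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention (assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0


def evaluate(samples):
    """Return (answer accuracy, mean mask IoU) over a list of samples."""
    n = len(samples)
    accuracy = sum(s["pred_answer"] == s["gt_answer"] for s in samples) / n
    mean_iou = sum(mask_iou(s["pred_mask"], s["gt_mask"]) for s in samples) / n
    return accuracy, mean_iou


samples = [
    {"pred_answer": "B", "gt_answer": "B",
     "pred_mask": [[1, 1], [0, 0]], "gt_mask": [[1, 1], [0, 0]]},
    {"pred_answer": "A", "gt_answer": "C",
     "pred_mask": [[1, 1], [0, 0]], "gt_mask": [[1, 0], [1, 0]]},
]
accuracy, mean_iou = evaluate(samples)
print(accuracy, round(mean_iou, 3))  # 0.5 0.667
```

Reporting both numbers separates a model that answers correctly from the right region from one that answers correctly without locating it.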

### 5.1 Main Results
**Table 1: Quantitative performance on TerraScope-Bench, Landsat30AU, and DisasterM3.**

| Model | Size | TerraScope-Bench (Avg.) | Landsat30AU (Avg.) | DisasterM3 (Avg.) |
| :--- | :--- | :--- | :--- | :--- |
| **General VLMs** | | | | |
| GPT-4o † | - | 38.7 | - | 22.8 |
| LLaVA-OV | 7B | 37.5 | 57.0 | 25.3 |
| Qwen2.5-VL | 7B | 38.5 | 58.6 | 31.8 |
| InternVL3 | 8B | 36.0 | 54.8 | 27.2 |
| Qwen3-VL-Think ‡ | 8B | 43.3 | 65.0 | 32.5 |
| **EO-Specific VLMs** | | | | |
| GeoChat | 7B | 33.7 | 53.0 | - |
| EarthDial | 4B | 36.3 | 39.4 | 25.5 |
| EarthMind | 4B | 42.1 | - | - |
| **Fine-tuned on Terra-CoT** | | | | |
| InternVL3 | 8B | 54.9 | 67.6 | 36.1 |
| GLM-4.1V-Think ‡ | 9B | 59.6 | 68.0 | 38.8 |
| **TerraScope** | **8B** | **68.9** | **73.9** | **46.5** |

*† proprietary, ‡ reasoning models. "Avg." is average performance on multiple-choice tasks.*

**Key Findings:**
1.  **Pixel-grounded reasoning is challenging**: Existing VLMs struggle, especially on precise spatial tasks (e.g., area estimation).
2.  **EO-specific models show limited advantage**, potentially due to training predominantly on high-resolution (<5m) data.
3.  **Reasoning models perform better but lack visual grounding**, leading to hallucinations.
4.  **Terra-CoT effectively improves performance**: Fine-tuning general VLMs on it yields substantial gains.
5.  **TerraScope achieves SOTA**: It outperforms all baselines on TerraScope-Bench and generalizes well.
6.  **TerraScope provides interpretable reasoning**: It achieves high segmentation IoU, demonstrating faithful spatial grounding (Fig. 5).

### 5.2 Ablation Studies
**Table 2: Ablation on CoT strategies.**

| CoT Strategy | TerraScope-Bench | Landsat30AU | DisasterM3 |
| :--- | :--- | :--- | :--- |
| Original (Pretrained) | 33.8 | 45.7 | 23.6 |
| Textual CoT w/o Seg. | 58.7 | 56.5 | 32.9 |
| Textual CoT with Seg. | 60.6 | 58.9 | 35.8 |
| Random-Mask CoT | 43.2 | 53.8 | 32.6 |
| Box CoT | 62.8 | 70.5 | 43.9 |
| **TerraScope (Pixel)** | **68.9** | **73.9** | **46.5** |

*   **Pixel-level grounding is essential**: `Random-Mask CoT` (random visual tokens) scores 43.2 on TerraScope-Bench, well below even text-only CoT (58.7), and `Box CoT` (bounding-box grounding) trails precise pixel-level masking (`TerraScope`) on all three benchmarks, especially for irregular land-cover shapes.
*   **Segmentation quality correlates with answer correctness**: Correct predictions have a mean IoU of **0.628** vs. **0.443** for incorrect ones (Pearson correlation r=0.607, p<0.001). High-quality visual grounding is crucial for correct reasoning (Fig. 6).
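
The statistics in the second bullet are a per-group mean IoU and a Pearson correlation between IoU and a binary correctness flag. The sketch below shows that computation on made-up toy values, which are not the paper's data; the helper names are assumptions.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only: per-sample mask IoU and answer correctness.
ious = [0.8, 0.7, 0.4, 0.3]
correct = [1, 1, 0, 0]

iou_correct = mean([i for i, c in zip(ious, correct) if c])
iou_wrong = mean([i for i, c in zip(ious, correct) if not c])
r = pearson(ious, correct)
print(round(iou_correct, 2), round(iou_wrong, 2), round(r, 2))
```

With one variable binary, this Pearson coefficient is the point-biserial correlation, so a positive value means correct answers tend to come with better masks.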

**Table 3: Ablation on multi-modal reasoning (Optical+SAR).**

| Fusion Method | CA | AQ | CR | BRD | DM |
| :--- | :--- | :--- | :--- | :--- | :--- |
| No Fusion (Optical only) | 73.2 | 70.2 | 71.8 | 80.0 | 65.9 |
| Feature Concatenation | **74.5** | **71.6** | **73.0** | **81.2** | 67.4 |
| Text-guided (test only) | 72.3 | 69.0 | 66.7 | 78.8 | 63.6 |
| Text-guided (train+test) | 74.3 | 70.9 | 72.7 | 80.7 | **68.2** |

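*   **Training-time fusion improves performance**: Feature concatenation and text-guided selection (train+test) both outperform optical-only on all five tasks, whereas text-guided selection applied only at test time falls below it.
*   **Text-guided selection is effective and efficient**: Concatenation scores slightly higher on four of five tasks (text-guided selection leads on DM), but text-guided selection reduces context length by keeping only the more relevant modality per token. **Training with the selection mechanism is essential** for it to work.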

**Visualization Insights (Fig. 7 & 8):**
*   **Cloud penetration**: In cloud-contaminated cases, fusing SAR data enables accurate segmentation where optical-only fails.
*   **Adaptive modality selection**: The model prioritizes optical tokens in clear regions and SAR tokens in cloud-covered areas.
*   **Structured, grounded reasoning**: TerraScope decomposes complex questions into interpretable sub-steps, each grounded by a precise segmentation mask, leading to transparent numerical computations (e.g., pixel counting for area, distance measurement).
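
Once a mask is available, the numerical steps mentioned in the last bullet reduce to simple pixel arithmetic. The sketch below is an assumption-laden illustration: `gsd_m` (ground sample distance in metres per pixel) is a hypothetical parameter, and measuring distance between mask centroids is one possible definition that the summary does not confirm.

```python
import math


def coverage_percent(mask):
    """Share of image pixels inside the mask, in percent."""
    total = sum(len(row) for row in mask)
    return 100.0 * sum(map(sum, mask)) / total


def area_m2(mask, gsd_m):
    """Absolute area: pixel count times the ground area of one pixel."""
    return sum(map(sum, mask)) * gsd_m ** 2


def centroid(mask):
    """Mean (row, col) position of the mask pixels."""
    points = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    return (mean_of(p[0] for p in points), mean_of(p[1] for p in points))


def mean_of(values):
    values = list(values)
    return sum(values) / len(values)


def centroid_distance_m(mask_a, mask_b, gsd_m):
    """Distance between mask centroids in metres (assumed definition)."""
    (ya, xa), (yb, xb) = centroid(mask_a), centroid(mask_b)
    return math.hypot(ya - yb, xa - xb) * gsd_m


# Two single-pixel regions on a 4x5 grid, 10 m per pixel (toy values).
region_a = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
region_b = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 1]]

print(coverage_percent(region_a))                       # 5.0
print(area_m2(region_a, gsd_m=10))                      # 100
print(centroid_distance_m(region_a, region_b, gsd_m=10))  # 50.0
```

Each result follows directly from the mask, so a wrong answer can be traced to a wrong mask rather than to an opaque reasoning step.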

## Theoretical and Practical Implications
*   **Theoretical**: Introduces and formalizes the novel paradigm of **"pixel-grounded visual reasoning"** for geospatial analysis, moving beyond language-only or coarsely-grounded reasoning. It demonstrates that **interleaving precise segmentation masks with reasoning chains** is a more faithful and effective approach for continuous spatial domains like EO.
*   **Practical**: TerraScope provides a **unified, interpretable tool** for fine-grained EO analysis tasks (coverage calculation, change detection, distance measurement) that are crucial for applications in environmental monitoring, disaster assessment, and urban planning. The released **Terra-CoT dataset and TerraScope-Bench** facilitate future research in pixel-grounded reasoning for EO.

## Conclusion
TerraScope presents a comprehensive framework for pixel-grounded geospatial reasoning in Earth Observation. Its key innovations are:
1.  A unified VLM that **generates segmentation masks interleaved with reasoning traces** for precise, interpretable spatial analysis.
2.  Support for **modality-flexible** (optical/SAR) and **multi-temporal** reasoning.
3.  The **Terra-CoT** dataset (1M samples) and **TerraScope-Bench** benchmark to enable and evaluate pixel-grounded reasoning.

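Experiments show that TerraScope clearly outperforms existing VLMs (68.9 average on TerraScope-Bench versus 43.3 for Qwen3-VL-Think, the strongest baseline not fine-tuned on Terra-CoT) and generalizes well to Landsat30-AU and DisasterM3. The work establishes a new direction for developing VLMs capable of fine-grained, trustworthy reasoning in geospatial and other continuous spatial domains.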

