TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation - Summary
Summary (Overview)
- Introduces TerraScope, a unified Vision-Language Model (VLM) framework for pixel-grounded geospatial reasoning in Earth Observation (EO). It generates segmentation masks interleaved with textual reasoning chains, enabling fine-grained spatial analysis.
- Proposes Terra-CoT, a large-scale 1 million sample instruction-tuning dataset with pixel-level masks embedded in reasoning chains, generated via an automated hierarchical pipeline.
- Introduces TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, comprising 3,837 expert-verified samples across six tasks, with dual metrics for answer accuracy and mask quality.
- Demonstrates superior performance: TerraScope significantly outperforms 11 existing general and EO-specific VLMs on the proposed benchmark and shows strong generalization on established EO benchmarks (Landsat30-AU, DisasterM3).
- Enables advanced reasoning capabilities: The framework supports modality-flexible reasoning (adaptive fusion of optical/SAR data) and multi-temporal reasoning for change analysis.
Introduction and Theoretical Foundation
Earth Observation satellites generate vast imagery archives critical for environmental monitoring, disaster response, and resource management. While Vision-Language Models (VLMs) offer a flexible, unified approach to EO data analysis, state-of-the-art VLMs struggle with fine-grained geospatial reasoning requiring pixel-accurate spatial analysis. Existing models often fail at tasks like calculating land-cover class coverage (see Fig. 1). Methods from natural images that use coarse-grained grounding (bounding boxes, crops) are inadequate for EO due to:
- Continuous spatial distributions: Land cover transitions gradually, making coarse grounding noisy.
- Multi-sensor, multi-temporal data: EO analysis requires integrating optical (spectral) and SAR (all-weather) imagery, as well as temporal sequences for change detection, within a unified framework.
To address this, TerraScope embodies the principle of "thinking with pixels". It explicitly localizes task-relevant regions and grounds each reasoning step in pixel-level visual evidence, moving beyond language-only or coarsely-grounded reasoning. Unlike prior VLMs that rely on external tools, TerraScope uses mixed decoders to jointly generate segmentation masks and reasoning traces intrinsically.
Methodology
3.1 Overview: Pixel-Grounded Visual Reasoning
Formally, let $\mathcal{F}$ be a VLM with text encoder $\mathcal{E}_T$ and vision encoder $\mathcal{E}_V$. Given a question $Q$ and an image $I$, traditional VLMs perform language-only reasoning:

$(r_1, r_2, \dots, r_T, a) = \mathcal{F}(Q, \mathcal{E}_V(I)),$

where $r_1, \dots, r_T$ are reasoning steps and $a$ is the final answer.

Pixel-grounded visual reasoning interleaves masked visual features into the chain:

$(r_1, v_1, r_2, v_2, \dots, r_T, v_T, a) = \mathcal{F}(Q, \mathcal{E}_V(I)).$

At each step $t$, the model generates a segmentation mask $M_t$ and selects masked visual features $v_t$ from the identified regions.
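The interleaved trace in the formulation above can be sketched as a simple data structure: each step is either language-only or paired with masked visual features. The function below is a toy stand-in (names and inputs are assumptions, not the paper's implementation) that shows how a `[SEG]`-triggered step carries its masked features alongside the text.

```python
# Toy sketch of an interleaved pixel-grounded trace; the decoders of the real
# model are replaced by precomputed (text, mask) pairs for illustration.
import numpy as np

SEG = "[SEG]"

def pixel_grounded_trace(image, steps):
    """Interleave reasoning steps with masked visual features.
    `steps` is a list of (text, mask) pairs; mask=None means a text-only step."""
    trace = []
    for text, mask in steps:
        if mask is None:
            trace.append((text, None))            # language-only step r_t
        else:
            feats = image[mask.astype(bool)]      # masked visual features v_t
            trace.append((text + " " + SEG, feats))
    return trace
```

In the full model, the features attached to step $t$ are fed back into the LLM to condition step $t+1$, rather than merely stored.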
3.2 TerraScope Framework
The architecture (Fig. 2) builds upon InternVL3, augmented with a pixel-level segmentation module.
- Pixel-Grounded Chain-of-Thought: A cooperative mechanism between dual (language and mask) decoders. The language decoder triggers the mask decoder upon generating a `[SEG]` token. The predicted mask is aligned with the visual token grid, and a visual token is selected if the mask covers more than 50% of its spatial region:

  $\hat{M}_i = \mathbb{1}\!\left[\tfrac{1}{|P_i|}\sum_{p \in P_i} M(p) > 0.5\right],$

  where $v_i$ is the $i$-th visual token, $P_i$ its patch region, and $\hat{M}_i$ the token-level mask. The selected features are projected and injected into the LLM to guide subsequent reasoning.
- Multi-Modal Reasoning (Optical-SAR): For optical features $F^O$ and SAR features $F^S$, text-guided, token-level modality selection is used. Cross-attention between the text and each modality computes a relevance score $s_i^m$ for modality $m \in \{O, S\}$ at token $i$; features are then selected per token from the modality with the higher relevance:

  $\tilde{F}_i = F_i^{m^*}, \quad m^* = \arg\max_{m \in \{O, S\}} s_i^m.$

- Multi-Temporal Reasoning: Explicit temporal indicators (e.g., "Image: t_i") before `[SEG]` tokens specify which image in a sequence to segment and extract features from.
- Training: A two-stage supervised fine-tuning process.
  - Grounding Pretraining: Train on 2M referring expression segmentation pairs (vision encoder, projector, and LLM frozen; only the mask decoder is trained).
  - Instruction Tuning: Fine-tune on 1M Terra-CoT samples (projector and mask decoder unfrozen; LLM fine-tuned via LoRA). The combined loss is

    $\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{seg}},$

    where $\mathcal{L}_{\text{LM}}$ is the language modeling loss, $\mathcal{L}_{\text{seg}}$ is the segmentation loss (Dice + cross-entropy), and $\lambda$ balances the two terms.
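The two selection rules above (mask-to-token pooling and text-guided modality selection) reduce to a few array operations. The NumPy sketch below is illustrative only; shapes, patch size, and function names are assumptions, not the paper's implementation.

```python
# Illustrative sketch of token-level mask pooling and per-token modality
# selection; all names and shapes are assumed for the example.
import numpy as np

def token_level_mask(mask, patch=14, thresh=0.5):
    """Pool a pixel mask (H, W) onto the visual-token grid: a token is
    selected iff the mask covers more than `thresh` of its patch."""
    H, W = mask.shape
    g = mask.reshape(H // patch, patch, W // patch, patch)
    coverage = g.mean(axis=(1, 3))      # fraction of masked pixels per patch
    return coverage > thresh            # boolean (H/patch, W/patch) token grid

def select_modality(F_opt, F_sar, s_opt, s_sar):
    """Per token, keep features from the modality with higher text relevance."""
    pick_sar = (s_sar > s_opt)[:, None]  # (N, 1), broadcast over feature dim
    return np.where(pick_sar, F_sar, F_opt)
```

Note that `select_modality` keeps the context length fixed at one feature per token, which is the efficiency argument made for text-guided selection over concatenation in the ablations.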
3.3 Terra-CoT Dataset Curation Pipeline
A two-stage automated pipeline creates pixel-grounded reasoning data at scale (Fig. 3).
- Grounded Captioning with Chain-of-Thought (Cap-CoT): Use existing datasets with semantic annotations to prompt a large multimodal model to produce detailed captions referencing highlighted land-cover masks. Yields 250K Cap-CoT samples.
- Hierarchical Data Synthesis: Use a model (TerraScope-Cap) trained on Cap-CoT to annotate unlabeled global imagery. Then synthesize questions via a two-level process:
- Level 1 (L1): Template-based questions for basic spatial grounding (existence, counting, localization, area, boundary).
- Level 2 (L2): Use an LLM to compose L1 questions into complex reasoning: L2-Spatial (cross-entity analysis) and L2-Semantic (requires domain knowledge). This produces 1M Terra-CoT samples.
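An L1 template question can be synthesized mechanically from a semantic mask, since the answer is computable directly from pixels. The sketch below is a hypothetical illustration of one such template (a coverage question); the field names and template wording are assumptions.

```python
# Hypothetical L1 template synthesis: a coverage-percentage question whose
# answer and grounding mask are derived directly from a semantic annotation.
import numpy as np

def l1_coverage_sample(sem_mask, class_id, class_name):
    """Build one (question, answer, mask) sample from a semantic mask."""
    pct = 100.0 * (sem_mask == class_id).mean()
    return {
        "question": f"What percentage of the image is covered by {class_name}?",
        "answer": f"{pct:.1f}%",
        "mask": sem_mask == class_id,   # pixel mask grounding the answer
    }
```

L2 synthesis would then compose several such L1 facts (e.g., coverage of two classes plus their spatial relation) into a single multi-step question via an LLM.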
Empirical Validation / Results
4. TerraScope-Bench
A new benchmark with 3,837 samples across six expert-verified task categories (Fig. 4):
- Coverage Percentage Analysis (CA)
- Absolute Area Quantification (AQ)
- Comparative Area Ranking (CR)
- Boundary Relationship Detection (BRD)
- Distance Measurement (DM)
- Building Change Estimation (BCE)

It features dual evaluation metrics: answer correctness and segmentation mask quality (IoU).
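The dual-metric protocol can be sketched in a few lines: exact-match accuracy over answers plus mean IoU over masks. This is a minimal illustration with assumed field names, not the benchmark's actual scoring script.

```python
# Minimal sketch of dual evaluation: answer accuracy + mask IoU.
import numpy as np

def mask_iou(pred, ref):
    """Intersection-over-union of two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return inter / union if union else 1.0  # both empty -> perfect agreement

def evaluate(samples):
    """Each sample: {'pred', 'gold', 'pred_mask', 'gold_mask'}."""
    acc = np.mean([s["pred"] == s["gold"] for s in samples])
    iou = np.mean([mask_iou(s["pred_mask"], s["gold_mask"]) for s in samples])
    return {"answer_acc": float(acc), "mean_iou": float(iou)}
```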
5.1 Main Results
Table 1: Quantitative performance on TerraScope-Bench, Landsat30AU, and DisasterM3.
| Model | Size | TerraScope-Bench (Avg.) | Landsat30AU (Avg.) | DisasterM3 (Avg.) |
|---|---|---|---|---|
| General VLMs | ||||
| GPT-4o † | - | 38.7 | - | 22.8 |
| LLaVA-OV | 7B | 37.5 | 57.0 | 25.3 |
| Qwen2.5-VL | 7B | 38.5 | 58.6 | 31.8 |
| InternVL3 | 8B | 36.0 | 54.8 | 27.2 |
| Qwen3-VL-Think ‡ | 8B | 43.3 | 65.0 | 32.5 |
| EO-Specific VLMs | ||||
| GeoChat | 7B | 33.7 | 53.0 | - |
| EarthDial | 4B | 36.3 | 39.4 | 25.5 |
| EarthMind | 4B | 42.1 | - | - |
| Fine-tuned on Terra-CoT | ||||
| InternVL3 | 8B | 54.9 | 67.6 | 36.1 |
| GLM-4.1V-Think ‡ | 9B | 59.6 | 68.0 | 38.8 |
| TerraScope | 8B | 68.9 | 73.9 | 46.5 |
† proprietary, ‡ reasoning models. "Avg." is average performance on multiple-choice tasks.
Key Findings:
- Pixel-grounded reasoning is challenging: Existing VLMs struggle, especially on precise spatial tasks (e.g., area estimation).
- EO-specific models show limited advantage, potentially due to training predominantly on high-resolution (<5m) data.
- Reasoning models perform better but lack visual grounding, leading to hallucinations.
- Terra-CoT effectively improves performance: Fine-tuning general VLMs on it yields substantial gains.
- TerraScope achieves SOTA: It outperforms all baselines on TerraScope-Bench and generalizes well.
- TerraScope provides interpretable reasoning: It achieves high segmentation IoU, demonstrating faithful spatial grounding (Fig. 5).
5.2 Ablation Studies
Table 2: Ablation on CoT strategies.
| CoT Strategy | TerraScope-Bench | Landsat30AU | DisasterM3 |
|---|---|---|---|
| Original (Pretrained) | 33.8 | 45.7 | 23.6 |
| Textual CoT w/o Seg. | 58.7 | 56.5 | 32.9 |
| Textual CoT with Seg. | 60.6 | 58.9 | 35.8 |
| Random-Mask CoT | 43.2 | 53.8 | 32.6 |
| Box CoT | 62.8 | 70.5 | 43.9 |
| TerraScope (Pixel) | 68.9 | 73.9 | 46.5 |
- Pixel-level grounding is essential: `Random-Mask CoT` (random visual tokens) underperforms, and `Box CoT` (bounding-box grounding) is inferior to precise pixel-level masking (TerraScope), especially for irregular land-cover shapes.
- Segmentation quality correlates with answer correctness: Correct predictions have a mean IoU of 0.628 vs. 0.443 for incorrect ones (Pearson correlation r=0.607, p<0.001). High-quality visual grounding is crucial for correct reasoning (Fig. 6).
Table 3: Ablation on multi-modal reasoning (Optical+SAR).
| Fusion Method | CA | AQ | CR | BRD | DM |
|---|---|---|---|---|---|
| No Fusion (Optical only) | 73.2 | 70.2 | 71.8 | 80.0 | 65.9 |
| Feature Concatenation | 74.5 | 71.6 | 73.0 | 81.2 | 67.4 |
| Text-guided (test only) | 72.3 | 69.0 | 66.7 | 78.8 | 63.6 |
| Text-guided (train+test) | 74.3 | 70.9 | 72.7 | 80.7 | 68.2 |
- Multi-modal fusion improves performance: All fusion methods outperform optical-only.
- Text-guided selection is effective and efficient: While concatenation scores slightly higher, text-guided selection reduces context length by processing only the relevant modality per token. Training with the selection mechanism is essential for it to work.
Visualization Insights (Fig. 7 & 8):
- Cloud penetration: In cloud-contaminated cases, fusing SAR data enables accurate segmentation where optical-only fails.
- Adaptive modality selection: The model prioritizes optical tokens in clear regions and SAR tokens in cloud-covered areas.
- Structured, grounded reasoning: TerraScope decomposes complex questions into interpretable sub-steps, each grounded by a precise segmentation mask, leading to transparent numerical computations (e.g., pixel counting for area, distance measurement).
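The numerical computations mentioned above (pixel counting for area, distance measurement) reduce to simple geometry once the ground sample distance (GSD, metres per pixel) of the imagery is known. The sketch below illustrates this under that assumption; the function names are ours, not the paper's.

```python
# Illustrative geometry behind the grounded numerical steps: area from pixel
# counts and distance from region centroids, given the GSD in metres/pixel.
import numpy as np

def area_m2(mask, gsd):
    """Area of a binary mask: pixel count x (metres per pixel)^2."""
    return mask.sum() * gsd ** 2

def centroid_distance_m(mask_a, mask_b, gsd):
    """Euclidean distance between region centroids, in metres."""
    ca = np.argwhere(mask_a).mean(axis=0)
    cb = np.argwhere(mask_b).mean(axis=0)
    return float(np.linalg.norm(ca - cb) * gsd)
```

For example, at Landsat-like 30 m GSD, a 25-pixel region covers 25 x 900 = 22,500 m².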
Theoretical and Practical Implications
- Theoretical: Introduces and formalizes the novel paradigm of "pixel-grounded visual reasoning" for geospatial analysis, moving beyond language-only or coarsely-grounded reasoning. It demonstrates that interleaving precise segmentation masks with reasoning chains is a more faithful and effective approach for continuous spatial domains like EO.
- Practical: TerraScope provides a unified, interpretable tool for fine-grained EO analysis tasks (coverage calculation, change detection, distance measurement) that are crucial for applications in environmental monitoring, disaster assessment, and urban planning. The released Terra-CoT dataset and TerraScope-Bench facilitate future research in pixel-grounded reasoning for EO.
Conclusion
TerraScope presents a comprehensive framework for pixel-grounded geospatial reasoning in Earth Observation. Its key innovations are:
- A unified VLM that generates segmentation masks interleaved with reasoning traces for precise, interpretable spatial analysis.
- Support for modality-flexible (optical/SAR) and multi-temporal reasoning.
- The Terra-CoT dataset (1M samples) and TerraScope-Bench benchmark to enable and evaluate pixel-grounded reasoning.
Experiments show TerraScope's significant superiority over existing VLMs and its strong generalization. The work establishes a new direction for developing VLMs capable of fine-grained, trustworthy reasoning in geospatial and other continuous spatial domains.