GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Summary (Overview)

Proposes Generative Ground Truth (GGT): A scalable paradigm for constructing real-world paired training data for image restoration (IR) using generative multimodal foundation models (MFMs). This addresses the bottleneck of scarce high-quality real-world paired data.
Introduces GGT-100K Dataset: A large-scale, diverse LQ-HQ paired dataset containing 103,707 training pairs and 500 test pairs (1024x1024 resolution), covering complex real-world degradations (general mixed, rain, haze, snow, low-light, old photos).
Systematic MFM Evaluation: Evaluates nine state-of-the-art MFMs with fixed and adaptive prompting strategies. Identifies Nano-Banana-2 with Gemini-based adaptive prompting as the best model for balancing perceptual quality and content fidelity for GGT generation.
Demonstrates Consistent Generalization Improvement: Extensive experiments show that training or fine-tuning a wide range of IR models (CNNs, Transformers, all-in-one, generative) with GGT-100K consistently improves their real-world generalization performance, with particularly strong benefits for generative models.
Implements Multi-Stage Quality Control: A pipeline involving metric-based filtering, VLM-assisted refinement, and manual verification ensures the reliability of the generated HQ targets in the dataset.

Introduction and Theoretical Foundation

Real-world image restoration aims to recover images affected by complex, mixed, and often unknown degradations. Despite architectural advances, robust generalization remains limited due to a critical data bottleneck:

Synthetic Data: Scalable but fails to model real-world degradation complexity, leading to a domain gap.
Real-World Paired Data: More realistic but expensive, difficult to scale, and limited in scene diversity.

Recent generative Multimodal Foundation Models (MFMs) offer a promising solution. They can take an image and instructions as input to produce a desired output, suggesting they could generate restoration-oriented HQ targets from LQ inputs. However, this task is non-trivial as MFMs may distort structures or hallucinate details. This work systematically investigates whether MFMs can generate HQ targets with sufficient fidelity and stability for supervised real-world IR model training.

The core idea is to use the most capable MFM to produce Generative Ground Truth (GGT)—high-quality, perceptually realistic, and content-faithful targets for real-world low-quality images—and use these pairs to train IR models, thereby expanding their generalization boundaries.

Methodology

1. GGT-100K Construction Pipeline (Fig. 3)

The pipeline has four main stages:

Source Image Collection: Collect real-world LQ images from three sources:
- Existing datasets without GT (e.g., RealBlur, SIDD) and broader vision datasets.
- Internet sources (Flickr, Unsplash, Pexels) under CC0 licenses.
- Own captured data with various cameras/phones.
All collected images are normalized to 1024x1024 and categorized (General Mixed, Rain, Haze, Snow, Low-Light, Old Photo).
Systematic MFM Evaluation: Evaluate 9 MFMs (3 open-source, 6 closed-source) with 4 prompting strategies (fixed, fixed-no-change, GPT-adaptive, Gemini-adaptive) on 200 real-world images. Evaluation criteria:
- Fidelity: Full-reference metrics (PSNR, SSIM, LPIPS, DISTS) on synthetically degraded DIV2K-Val images.
- Perceptual Quality: No-reference metrics (NIQE, MUSIQ, MANIQA, TOPIQ, AFINE-NR) on real LQ images.
- VLM-based Assessment: Success rate (VLM-R) judged by a VLM (Gemini-3.1-Pro) on five aspects.
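- Human Preference: User study with 20 participants.
A composite score $\text{Avg.}_i$ is calculated for each MFM-prompt combination $i$ by min-max normalizing every metric $j$, averaging the normalized metrics within each aspect $a$ (with metric set $\mathcal{M}_a$), and then averaging across the fidelity, perceptual, and VLM aspects: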
$\tilde{m}_{i,j} = \frac{m_{i,j} - m_j^{\text{min}}}{m_j^{\text{max}} - m_j^{\text{min}}}, \quad s_i^a = \frac{1}{|\mathcal{M}_a|} \sum_{j \in \mathcal{M}_a} \tilde{m}_{i,j}, \quad \text{Avg.}_i = \frac{1}{3}(s_i^{\text{fid}} + s_i^{\text{per}} + s_i^{\text{vlm}})$
Target Generation: Use the selected best MFM-prompt combination (Nano-Banana-2 with Gemini-adaptive prompting) to generate candidate HQ targets for all collected LQ images.
Multi-Stage Quality Control:
- Metric-based Filtering: Exclude samples where no-reference perceptual metrics show little or negative improvement.
- VLM-assisted Refinement: Use a VLM to assess generated results on five aspects (restoration quality, object consistency, geometry alignment, content reasonableness, color consistency). For rejected samples, use VLM feedback to create corrective prompts and regenerate (up to 3 attempts).
- Manual Verification: Final human review to remove samples with noticeable artifacts or inconsistencies.
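
The automated part of the quality-control stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `nr_gain`, `vlm_judge`, and `regenerate`, the `Verdict` type, and the `min_gain` threshold are all hypothetical stand-ins, and "up to 3 attempts" is assumed here to mean three regenerations after the initial rejection.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_REGEN_ATTEMPTS = 3  # assumption: 3 regenerations after the first rejection


@dataclass
class Verdict:
    """Hypothetical result of the VLM check over the five aspects."""
    accepted: bool
    feedback: str = ""  # free-text critique used to build a corrective prompt


def quality_control(
    lq: str,
    candidate: str,
    nr_gain: Callable[[str, str], float],
    vlm_judge: Callable[[str, str], Verdict],
    regenerate: Callable[[str, str], str],
    min_gain: float = 0.0,  # hypothetical threshold; the source gives no value
) -> Optional[str]:
    """Return an HQ target that passed the automated checks, or None if dropped."""
    # Stage 1: metric-based filtering. Drop samples whose no-reference
    # perceptual score shows little or negative improvement over the LQ input.
    if nr_gain(lq, candidate) <= min_gain:
        return None

    # Stage 2: VLM-assisted refinement. On rejection, regenerate with a
    # corrective prompt derived from the VLM feedback, then judge again.
    verdict = vlm_judge(lq, candidate)
    attempts = 0
    while not verdict.accepted and attempts < MAX_REGEN_ATTEMPTS:
        candidate = regenerate(lq, verdict.feedback)
        verdict = vlm_judge(lq, candidate)
        attempts += 1

    # Accepted samples would still go through manual verification (stage 3).
    return candidate if verdict.accepted else None
```

Passing the three model-dependent steps in as callables keeps the control flow testable without access to any particular MFM or VLM.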

2. Experimental Validation Setup

Models Evaluated:
- CNN/Transformer Backbones: MPRNet, NAFNet, SwinIR, X-Restormer.
- All-in-One Models: PromptIR, MoCE-IR, DA-CLIP, FoundIR.
- Generative Models: FLUX-Controlnet (T2I), Qwen-Image-Edit (TI2I).
Training Data Settings:
- Baseline ("w/o"): ~200K pairs from 15 existing synthetic/real datasets, with controlled category composition.
- Augmented ("w/"): Baseline + GGT-100K with a 1:1 sampling ratio.
Test Sets:
- GGT-100K Test Set: 500 carefully selected paired images.
- Public RealLQ Sets: RealDeg (social media, old photos), OpenReal80K subsets (haze, rain, snow, night).
Evaluation Metrics: Full-reference (PSNR, SSIM, LPIPS, DISTS), no-reference (NIQE, MUSIQ, MANIQA, TOPIQ, AFINE-NR), and VLM-R.
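
One way to realize the 1:1 sampling ratio of the augmented setting is to fill each training batch half from the baseline pool and half from GGT-100K, regardless of the pool sizes. The sketch below is an assumption about how such a ratio could be implemented, not the authors' training code; `mixed_batch` and its arguments are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def mixed_batch(
    baseline: Sequence[T],
    ggt: Sequence[T],
    batch_size: int,
    rng: random.Random,
) -> List[Tuple[str, T]]:
    """Draw a batch with exactly half its samples from each source."""
    if batch_size % 2:
        raise ValueError("batch_size must be even for an exact 1:1 split")
    half = batch_size // 2
    # Sample with replacement so the smaller pool is never exhausted.
    batch = [("baseline", rng.choice(baseline)) for _ in range(half)]
    batch += [("ggt", rng.choice(ggt)) for _ in range(half)]
    rng.shuffle(batch)  # avoid a fixed source order inside the batch
    return batch
```

With a baseline pool roughly twice the size of GGT-100K, this scheme shows each GGT pair about twice as often as each baseline pair over the course of training.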

Empirical Validation / Results

1. MFM Evaluation Results (Table 1)

Key Finding: Nano-Banana-2 with Gemini-based adaptive prompting achieves the best overall balance, with the highest composite score (Avg. = 0.8427) and human preference (32.5%). It performs strongly and consistently across all four evaluation aspects.

Table 1: Comparison of different MFMs and prompting strategies (Excerpt - Top Performers)

Model	Prompt	PSNR ↑	SSIM ↑	LPIPS ↓	MUSIQ ↑	VLM-R ↑	Avg. ↑	Human ↑
Nano-Banana-2	Gemini	27.1701	0.7949	0.1280	59.2831	70.0%	0.8427	32.5%
Nano-Banana-2	GPT	26.7364	0.7923	0.1300	56.6527	69.0%	0.8078	-
Nano-Banana-Pro	Fix-NC	26.5977	0.7865	0.1375	54.0572	68.5%	0.7667	15.0%
Qwen-Image-Edit	Fix	27.5471	0.8170	0.1710	51.7165	57.0%	0.6602	5.5%
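
The Avg. column follows the composite score defined in the Methodology: min-max normalize each metric across candidates, average within each aspect, then average the three aspects. The sketch below illustrates that computation under two assumptions that the summary does not state: lower-is-better metrics are flipped after normalization so that 1 is always best, and a metric with identical values for all candidates normalizes to 0. The function and parameter names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Set


def composite_scores(
    raw: Dict[str, Dict[str, float]],   # candidate -> metric -> raw value
    aspects: Dict[str, List[str]],      # aspect (fid/per/vlm) -> metric names
    lower_is_better: Set[str],          # assumption: these metrics get flipped
) -> Dict[str, float]:
    """Return the composite Avg. score for every candidate."""
    candidates = list(raw)
    norm: Dict[str, Dict[str, float]] = {c: {} for c in candidates}
    for metrics in aspects.values():
        for m in metrics:
            values = [raw[c][m] for c in candidates]
            lo, hi = min(values), max(values)
            for c in candidates:
                # Min-max normalization across all candidates for metric m.
                x = (raw[c][m] - lo) / (hi - lo) if hi > lo else 0.0
                norm[c][m] = 1.0 - x if m in lower_is_better else x
    # Mean within each aspect, then mean across aspects.
    return {
        c: mean(mean(norm[c][m] for m in ms) for ms in aspects.values())
        for c in candidates
    }
```

Because the normalization uses the minimum and maximum over all evaluated MFM-prompt combinations, the Avg. values in Table 1 cannot be recomputed from the excerpted rows alone.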

2. Effectiveness of GGT-100K for Model Training

Table 2: Performance on GGT-100K Test Set (Excerpt)

Model	GGT-100K	PSNR ↑	SSIM ↑	LPIPS ↓	MUSIQ ↑	AFINE-NR ↓	VLM-R ↑
NAFNet	w/o	25.1255	0.7708	0.3653	42.0124	-0.7012	27.6%
	w/	28.2461	0.8349	0.3110	46.7094	-0.7881	53.8%
	Improvement	+3.1206	+0.0641	-0.0543	+4.6970	-0.0869	+26.2%
FoundIR	Official	26.0398	0.7866	0.3486	39.3646	-0.6971	28.8%
	w/o	25.8048	0.7844	0.3508	42.5388	-0.7365	35.8%
	w/	27.1777	0.8213	0.3351	43.9238	-0.8087	60.8%
	Improvement	+1.3729	+0.0369	-0.0157	+1.3850	-0.0722	+25.0%
Qwen-Image-Edit	Official	22.3141	0.7479	0.3042	60.8628	-0.9565	68.0%
	w/o	25.8559	0.7787	0.2813	51.4215	-0.8401	77.4%
	w/	26.1811	0.7828	0.2155	62.5519	-0.9611	87.6%
	Improvement	+0.3252	+0.0041	-0.0658	+11.1304	-0.1210	+10.2%
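
Each Improvement row is the element-wise difference between the "w/" and "w/o" rows, so negative values are gains for the lower-is-better metrics (LPIPS, AFINE-NR), and the VLM-R entries are differences in percentage points. A quick check using the NAFNet values from Table 2 (variable names are illustrative):

```python
# NAFNet rows copied from Table 2; VLM-R is given in percent.
without_ggt = {"PSNR": 25.1255, "SSIM": 0.7708, "LPIPS": 0.3653,
               "MUSIQ": 42.0124, "AFINE-NR": -0.7012, "VLM-R": 27.6}
with_ggt = {"PSNR": 28.2461, "SSIM": 0.8349, "LPIPS": 0.3110,
            "MUSIQ": 46.7094, "AFINE-NR": -0.7881, "VLM-R": 53.8}

# Improvement = (w/) - (w/o), rounded to the table's four decimals.
improvement = {k: round(with_ggt[k] - without_ggt[k], 4) for k in without_ggt}

for metric, delta in improvement.items():
    print(f"{metric}: {delta:+.4f}")
```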

Key Findings:

Consistent Gains: Adding GGT-100K improves fidelity metrics (PSNR, SSIM), perceptual metrics (MUSIQ, AFINE-NR), and VLM-R for all evaluated models.
Particularly Strong for Generative Models: Models like FLUX-Controlnet and Qwen-Image-Edit show large gains in perceptual metrics and VLM-R (in Table 2, Qwen-Image-Edit gains +11.13 MUSIQ and +10.2 points of VLM-R over its "w/o" setting). GGT-100K helps them achieve both strong generation ability and high content fidelity.
Outperforms Official Models: FoundIR and Qwen-Image-Edit trained with GGT-100K surpass their official releases across most metrics.

3. Ablation Study: Importance of Quality Control (Table 4)

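Training with unscreened generated data (w/o-QC) already improves perceptual metrics and VLM-R over the baseline, although in the FLUX-Controlnet excerpt of Table 4 it lowers PSNR and SSIM below the baseline. Multi-stage quality control (w/-QC) recovers that fidelity and leads to further improvements in LPIPS and VLM-R at the cost of a slightly lower MUSIQ. The benefit is largest for fidelity-oriented metrics (PSNR, SSIM) and for generative models.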

Table 4: Ablation on Quality Control (Excerpt for FLUX-Controlnet)

Model	Quality Control	PSNR ↑	SSIM ↑	LPIPS ↓	MUSIQ ↑	VLM-R ↑
FLUX-Controlnet	Baseline	22.4486	0.6901	0.3773	48.5454	25.4%
	w/o-QC	19.4203	0.6613	0.3619	66.1931	45.8%
	w/-QC	23.1413	0.7325	0.2625	63.0910	63.4%
	QC Improvement	+3.7210	+0.0712	-0.0994	-3.1021	+17.6%

Theoretical and Practical Implications

Paradigm Shift for IR Data: Demonstrates that generative MFMs can serve as practical tools for restoration-oriented data generation, offering a scalable alternative to costly real-world capture or imperfect synthetic data.
Resource for the Community: GGT-100K is released as a valuable dataset to advance generalizable real-world IR, particularly for fine-tuning modern generative models.
Insights on MFM Usage: The systematic evaluation provides practical insights: 1) Prompting strategy is critical (adaptive prompts significantly outperform fixed ones); 2) Different MFMs have clear preference biases (trade-off between fidelity and perceptual enhancement); 3) Nano-Banana-2 with adaptive prompting is currently the most balanced choice for GGT generation.
Enhanced Generalization: The results validate that training with GGT-100K consistently expands the generalization boundaries of IR models across diverse real-world degradation types.

Conclusion

This work proposes Generative Ground Truth (GGT), a paradigm for constructing real-world paired IR data using multimodal foundation models. Through systematic evaluation, Nano-Banana-2 with Gemini-based adaptive prompting is identified as the most suitable model-prompt combination. This combination is used to build GGT-100K, a large-scale dataset with over 100K LQ-HQ pairs. Extensive experiments demonstrate that GGT-100K consistently improves the real-world generalization of a wide spectrum of IR models, with especially strong benefits for fine-tuned generative models. The work validates MFMs as effective tools for data generation and provides GGT-100K as a useful resource to advance the field of generalizable image restoration.

Limitations and Future Work:

GGT-100K is a high-quality approximation of ground truth; some samples may contain subtle MFM imperfections.
It cannot cover the full, open-ended space of real-world degradations.
Future work can explore specialized network designs and training objectives tailored for GGT-based supervision.