Gen-Searcher: Reinforcing Agentic Search for Image Generation - Summary

Summary (Overview)

  • First Multimodal Deep Search Agent for Image Generation: Gen-Searcher is the first trained agent that performs multi-hop web search and reasoning to gather textual knowledge and visual references for knowledge-intensive image generation.
  • Novel Data Pipeline and Benchmarks: The authors constructed two high-quality training datasets (Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k) and introduced the KnowGen benchmark for evaluating search-grounded generation, along with the K-Score metric.
  • Dual Reward Reinforcement Learning: The model is trained via a two-stage process (Supervised Fine-Tuning followed by Agentic Reinforcement Learning) using a novel dual reward feedback design that combines text-based and image-based rewards to provide stable and informative learning signals for GRPO training.
  • Significant Performance Gains: Gen-Searcher substantially improves image generation on knowledge-intensive tasks, boosting Qwen-Image by ~16 points on KnowGen and ~15 points on WISE, and shows strong transferability across different image generators (e.g., Seedream 4.5, Nano Banana Pro).
  • Open Foundation: The project is fully open-sourced (data, models, code) to serve as a foundation for future research on search agents for image generation.

Introduction and Theoretical Foundation

Recent text-to-image models are constrained by frozen internal knowledge, struggling with real-world prompts that require up-to-date or knowledge-intensive information (e.g., specific landmarks, new products). While some proprietary models support text search, they lack visual reference retrieval. Prior RAG-based methods are limited by static databases and shallow retrieval, and prompt-based workflows are brittle and suboptimal.

This paper introduces Gen-Searcher, the first attempt to train a multimodal deep search agent for image generation using agentic reinforcement learning (RL). The core idea is to train an agent that can actively perform multi-hop web search, browse, and reason to gather both textual evidence and visual references, which are then used to create a grounded prompt for a downstream image generator. This addresses the fundamental limitation of frozen knowledge in generative models.
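The overall flow described above — agent gathers evidence and references, composes a grounded prompt, and hands it to a frozen generator — can be sketched as follows. This is a minimal illustration with hypothetical stub tools, not the paper's actual implementation.

```python
# Hypothetical sketch of the Gen-Searcher pipeline: the agent turns a
# knowledge-intensive user prompt into a grounded prompt plus visual
# references for a downstream image generator. All functions are stubs.

def search(query: str) -> str:
    """Stub for the web text-search tool."""
    return f"[text evidence for: {query}]"

def image_search(query: str) -> list[str]:
    """Stub for the image-retrieval tool."""
    return [f"[reference image for: {query}]"]

def compose_grounded_prompt(user_prompt: str) -> tuple[str, list[str]]:
    """Gather textual evidence and visual references, then build a
    grounded prompt; the real agent does this over multiple hops."""
    evidence = search(user_prompt)
    references = image_search(user_prompt)
    grounded = f"{user_prompt}\nEvidence: {evidence}"
    return grounded, references

prompt, refs = compose_grounded_prompt("the newest ESA Mars rover")
print(prompt)
print(refs)
```

In the full system the grounded prompt and references are then passed to an unmodified generator such as Qwen-Image, so the generator itself never needs retraining.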

Methodology

1. Dataset Construction Pipeline

A four-stage pipeline was created to generate training data, which did not naturally exist.

  1. Text Prompt Construction: Two strategies were used:
    • Primary: Prompt engineering with Gemini 3 Pro to generate multi-hop search-intensive prompts across ~20 diverse categories (Anime, Celebrities, Physics, Art, etc.).
    • Complementary: Converting existing deep research QA datasets into image-generation-oriented prompts, primarily for General News.
  2. Agentic Trajectory Generation: Gemini 3 Pro was used with search tools (search, image_search, browse) in a multi-turn loop to generate search trajectories, resulting in a final grounded prompt and selected reference images.
  3. Ground-Truth Image Synthesis: The final prompts were fed into Nano Banana Pro to synthesize corresponding images as ground truth.
  4. Data Filtering & Curation: Seed1.8 was used to score and filter samples based on faithfulness, correctness, aesthetics, safety, etc., combined with rule-based filtering. This yielded ~17K high-quality samples.
    • Gen-Searcher-SFT-10k: For supervised fine-tuning.
    • Gen-Searcher-RL-6k: For reinforcement learning.
    • KnowGen Benchmark: 630 human-verified, held-out evaluation samples.
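The trajectory-generation step (stage 2 of the pipeline) can be sketched as a multi-turn tool loop. Everything here is a hypothetical stand-in — the paper uses Gemini 3 Pro with real search, image_search, and browse tools — but the control flow (call model, execute tool, append result, repeat until a final prompt is emitted) matches the description above.

```python
# Minimal sketch of the multi-turn trajectory-generation loop.
# Tool outputs and the policy are stubs, not the paper's implementation.

TOOLS = {
    "search": lambda q: f"[web results for {q!r}]",
    "image_search": lambda q: [f"[image url for {q!r}]"],
    "browse": lambda url: f"[page content of {url}]",
}

def call_model(history):
    """Stub policy: two tool calls, then a final grounded prompt."""
    tool_turns = sum(1 for role, _ in history if role == "tool")
    if tool_turns == 0:
        return {"tool": "search", "arg": history[0][1]}
    if tool_turns == 1:
        return {"tool": "image_search", "arg": history[0][1]}
    return {"final_prompt": f"grounded: {history[0][1]}",
            "references": history[-1][1]}

def generate_trajectory(user_prompt, max_turns=8):
    history = [("user", user_prompt)]
    for _ in range(max_turns):
        action = call_model(history)
        if "final_prompt" in action:
            return action, history  # trajectory ends with a grounded prompt
        result = TOOLS[action["tool"]](action["arg"])
        history.append(("tool", result))
    raise RuntimeError("agent did not terminate")

final, traj = generate_trajectory("poster of the 2024 Paris Olympics mascot")
print(final["final_prompt"])
```

Each completed trajectory yields the final grounded prompt and selected reference images, which feed the ground-truth synthesis and filtering stages.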

2. KnowGen Benchmark & K-Score

KnowGen is a comprehensive benchmark for evaluating search-grounded image generation in real-world, knowledge-intensive scenarios.

  • Categories: Divided into two subsets:
    • Science & Knowledge: Astronomy, Biology, Chemistry, Physics, Engineering, Medicine, Industry, Architecture, History, Geography, Religion, Politics, Culture, Art, Sports.
    • Pop Culture & News: Anime, Games, Films, Celebrities, Posters, General News.
  • Evaluation Metric - K-Score: Uses GPT-4.1 as a judge to evaluate generated images from four dimensions, each scored on a scale of {0, 0.5, 1}:
    • Faithfulness: Scene-structure level adherence to the prompt.
    • Visual Correctness: Accuracy of grounded visual attributes vs. reference.
    • Text Accuracy: Presence, legibility, and correctness of required readable text.
    • Aesthetics: Overall visual quality and appeal.
  • The final K-Score is a weighted combination: K-Score = 0.1 · Faithfulness + 0.4 · Visual Correctness + 0.4 · Text Accuracy + 0.1 · Aesthetics
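The weighted combination above is straightforward to compute; here is a minimal sketch (dimension names are illustrative identifiers, and each dimension is judged on {0, 0.5, 1} per the benchmark description).

```python
# Sketch of the K-Score metric: a fixed weighted sum of four
# judge-assigned dimension scores, each in {0, 0.5, 1}.

WEIGHTS = {
    "faithfulness": 0.1,
    "visual_correctness": 0.4,
    "text_accuracy": 0.4,
    "aesthetics": 0.1,
}

def k_score(scores: dict[str, float]) -> float:
    # Enforce the judge's discrete scale before combining.
    assert all(scores[d] in (0, 0.5, 1) for d in WEIGHTS)
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

s = k_score({"faithfulness": 1, "visual_correctness": 0.5,
             "text_accuracy": 1, "aesthetics": 0.5})
print(s)  # 0.1 + 0.2 + 0.4 + 0.05 = 0.75 (up to float rounding)
```

The heavy weights on Visual Correctness and Text Accuracy make the metric sensitive to exactly the grounded-knowledge dimensions the search agent is meant to improve.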

3. Training Scheme

Gen-Searcher is initialized from Qwen3-VL-8B-Instruct and trained in two stages.

  • Search Tools: The agent is equipped with three tools: search (web text search), image_search (retrieve images via text query), and browse (analyze webpage content).
  • Stage 1: Supervised Fine-Tuning (SFT): Trained on Gen-Searcher-SFT-10k to learn basic multi-turn tool use for search, reasoning, and prompt composition.
  • Stage 2: Agentic Reinforcement Learning (RL): Trained on Gen-Searcher-RL-6k using GRPO to optimize search trajectories.
  • Dual Reward Feedback Design: To address the noise and instability of pure image-based rewards (due to generator variance), a combined reward is used: R = (1 − α) · R_image + α · R_text (Equation 1)
    • R_image: K-Score of the final generated image.
    • R_text: Text-based reward (scored 0–1 by GPT-4.1) evaluating whether the gathered prompt contains sufficient and correct information for generation.
    • α: Balancing hyperparameter (set to 0.5).
  • Optimization: The policy is optimized with GRPO. For each sampled output o_i under query q, the advantage is computed as A_i = (R_i − mean({R_j})) / std({R_j}) (Equation 2). The final policy update follows the standard GRPO objective J_GRPO (Equation 3 in the paper), which includes a clipped probability ratio and a KL-divergence penalty.
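Equations 1 and 2 can be sketched directly. The reward values below are made up for illustration, and whether the group statistics use a population or sample standard deviation is an implementation detail not specified in this summary.

```python
# Sketch of the dual reward (Equation 1) and the GRPO
# group-normalized advantage (Equation 2).
import statistics

ALPHA = 0.5  # balancing hyperparameter from the paper

def dual_reward(r_image: float, r_text: float, alpha: float = ALPHA) -> float:
    """Equation 1: R = (1 - alpha) * R_image + alpha * R_text."""
    return (1 - alpha) * r_image + alpha * r_text

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Equation 2: center each reward on the group mean, scale by std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumed choice
    return [(r - mean) / std for r in rewards]

# One group of sampled trajectories for the same query.
group = [dual_reward(0.8, 0.6), dual_reward(0.4, 0.5), dual_reward(0.9, 0.9)]
print(grpo_advantages(group))
```

Normalizing within each sampled group means the text reward's lower variance directly stabilizes the advantage estimates, which is the motivation given for mixing it with the noisier image reward.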

Empirical Validation / Results

Main Results on KnowGen Benchmark

Table 1: Performance of different models on the KnowGen benchmark.

| Models | Science & Knowledge: Visual cor. | Science & Knowledge: Text acc. | Pop Culture & News: Visual cor. | Overall K-Score |
|---|---|---|---|---|
| GPT-Image-1.5 | 29.25 | 40.14 | 29.43 | — |
| Nano Banana Pro | 39.46 | 49.32 | 30.51 | 50.38 |
| Seedream 4.5 | 14.46 | 26.19 | 12.50 | 31.01 |
| Qwen-Image | 6.80 | 0.34 | 7.59 | 14.98 |
| Gen-Searcher-8B + Qwen-Image | 26.87 | 17.18 | 25.30 | 31.52 |
| Gen-Searcher-8B + Seedream 4.5 | 36.35 | 43.52 | 39.04 | 47.29 |
| Gen-Searcher-8B + Nano Banana Pro | 45.07 | 49.32 | 43.01 | 53.30 |

  • KnowGen is Challenging: Open-source models (Qwen-Image, FLUX, Z-Image) score only 9-15, showing the difficulty of knowledge-intensive generation.
  • Effectiveness of Gen-Searcher: Brings substantial gains across backbones.
    • Improves Qwen-Image from 14.98 to 31.52 (+16.54 points).
    • Transfers effectively to other generators: improves Seedream 4.5 from 31.01 to 47.29 (+16.28 points) and Nano Banana Pro from 50.38 to 53.30.
  • Dimension Analysis: Gains primarily come from improvements in Visual Correctness and Text Accuracy, the two most critical components of K-Score.

Performance on WISE Benchmark

Table 2: Performance on the WISE benchmark (Overall Score).

| Model | Overall Score |
|---|---|
| Qwen-Image | 0.62 |
| LongCat-Image | 0.65 |
| Gen-Searcher-8B + Qwen-Image | 0.77 |

Gen-Searcher improves Qwen-Image from 0.62 to 0.77 on WISE, a gain of 0.15, demonstrating strong generalization to other knowledge-based generation benchmarks.

Ablation Study

Table 3: Ablation study on KnowGen (K-Score with Qwen-Image).

| Method | K-Score |
|---|---|
| Qwen-Image (Baseline) | 14.98 |
| + Manual Workflow (no training) | 22.91 |
| + Gen-Searcher-SFT only | 28.15 |
| + Gen-Searcher w/o text reward (α = 0) | 29.59 |
| + Gen-Searcher w/o image reward (α = 1) | 29.36 |
| + Gen-Searcher (Full) | 31.52 |

  • SFT is crucial: Learning from trajectories is better than a manual workflow.
  • RL provides further gains: Beyond SFT initialization.
  • Dual rewards are complementary: Removing either reward leads to degradation, validating the design.

Parameter Analysis

Performance remains strong when the balancing coefficient α is varied within a moderate range around the default of 0.5 (including values down to 0.3), indicating that the method is not overly sensitive to this hyperparameter.

Theoretical and Practical Implications

  • Advancing Agentic AI for Creative Tasks: Demonstrates that agentic RL can be successfully applied to complex, creative tasks like image generation, moving beyond traditional QA or tool-use domains.
  • Bridging the Knowledge Gap in Generative Models: Provides a generalizable framework to augment any image generator with up-to-date, external knowledge without retraining the generator itself, addressing a fundamental limitation.
  • Importance of Multimodal Search: Highlights the necessity of retrieving both textual evidence and visual references for accurate generation in real-world scenarios, as text-only search is insufficient for fine-grained visual attributes.
  • Robust RL Training Design: The dual reward feedback mechanism offers a solution to the challenge of noisy rewards in end-to-end creative pipelines, making RL training more stable and effective.
  • Foundation for Future Research: The open-sourced datasets, benchmark, and model establish a foundation for developing more capable search agents for generation and other multimodal tasks.

Conclusion

Gen-Searcher is the first trained multimodal deep search agent for knowledge-intensive image generation. By constructing novel datasets and a benchmark, and training via SFT and agentic RL with dual rewards, the model achieves substantial performance improvements and demonstrates strong transferability across image generators. This work opens a new direction for augmenting generative models with active, web-powered knowledge retrieval. Future work may explore scaling to larger models, extending to video generation, and improving the efficiency of the search-and-generation pipeline.