Gen-Searcher: Reinforcing Agentic Search for Image Generation - Summary

Summary (Overview)

  • First Multimodal Deep Search Agent for Image Generation: Gen-Searcher is the first trained agent that performs multi-hop web search and reasoning to gather textual knowledge and visual references for knowledge-intensive image generation.
  • Novel Data Pipeline and Benchmarks: The authors constructed two high-quality training datasets (Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k) and introduced the KnowGen benchmark for evaluating search-grounded generation, along with the K-Score metric.
  • Dual Reward Reinforcement Learning: The model is trained via a two-stage process (Supervised Fine-Tuning followed by Agentic Reinforcement Learning) using a novel dual reward feedback design that combines text-based and image-based rewards to provide stable and informative learning signals for GRPO training.
  • Significant Performance Gains: Gen-Searcher substantially improves image generation on knowledge-intensive tasks, boosting Qwen-Image by ~16 points on KnowGen and ~15 points on WISE, and shows strong transferability across different image generators (e.g., Seedream 4.5, Nano Banana Pro).
  • Open Foundation: The project is fully open-sourced (data, models, code) to serve as a foundation for future research on search agents for image generation.

Introduction and Theoretical Foundation

Recent text-to-image models are constrained by frozen internal knowledge, struggling with real-world prompts that require up-to-date or knowledge-intensive information (e.g., specific landmarks, new products). While some proprietary models support text search, they lack visual reference retrieval. Prior RAG-based methods are limited by static databases and shallow retrieval, and prompt-based workflows are brittle and suboptimal.

This paper introduces Gen-Searcher, the first attempt to train a multimodal deep search agent for image generation using agentic reinforcement learning (RL). The core idea is to train an agent that can actively perform multi-hop web search, browse, and reason to gather both textual evidence and visual references, which are then used to create a grounded prompt for a downstream image generator. This addresses the fundamental limitation of frozen knowledge in generative models.
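The overall flow described above — agent gathers evidence and references, composes a grounded prompt, and hands it to a frozen generator — can be sketched as follows. This is a minimal illustration with hypothetical stub tools, not the paper's actual implementation.

```python
# Hypothetical sketch of the Gen-Searcher pipeline: the agent turns a
# knowledge-intensive user prompt into a grounded prompt plus visual
# references for a downstream image generator. All functions are stubs.

def search(query: str) -> str:
    """Stub for the web text-search tool."""
    return f"[text evidence for: {query}]"

def image_search(query: str) -> list[str]:
    """Stub for the image-retrieval tool."""
    return [f"[reference image for: {query}]"]

def compose_grounded_prompt(user_prompt: str) -> tuple[str, list[str]]:
    """Gather textual evidence and visual references, then build a
    grounded prompt; the real agent does this over multiple hops."""
    evidence = search(user_prompt)
    references = image_search(user_prompt)
    grounded = f"{user_prompt}\nEvidence: {evidence}"
    return grounded, references

prompt, refs = compose_grounded_prompt("the newest ESA Mars rover")
print(prompt)
print(refs)
```

In the full system the grounded prompt and references are then passed to an unmodified generator such as Qwen-Image, so the generator itself never needs retraining.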

Methodology

1. Dataset Construction Pipeline

A four-stage pipeline was created to generate training data, which did not naturally exist.

  1. Text Prompt Construction: Two strategies were used:
    • Primary: Prompt engineering with Gemini 3 Pro to generate multi-hop search-intensive prompts across ~20 diverse categories (Anime, Celebrities, Physics, Art, etc.).
    • Complementary: Converting existing deep research QA datasets into image-generation-oriented prompts, primarily for General News.
  2. Agentic Trajectory Generation: Gemini 3 Pro was used with search tools (search, image_search, browse) in a multi-turn loop to generate search trajectories, resulting in a final grounded prompt and selected reference images.
  3. Ground-Truth Image Synthesis: The final prompts were fed into Nano Banana Pro to synthesize corresponding images as ground truth.
  4. Data Filtering & Curation: Seed1.8 was used to score and filter samples based on faithfulness, correctness, aesthetics, safety, etc., combined with rule-based filtering. This yielded ~17K high-quality samples.
    • Gen-Searcher-SFT-10k: For supervised fine-tuning.
    • Gen-Searcher-RL-6k: For reinforcement learning.
    • KnowGen Benchmark: 630 human-verified, held-out evaluation samples.
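The trajectory-generation step (stage 2 of the pipeline) can be sketched as a multi-turn tool loop. Everything here is a hypothetical stand-in — the paper uses Gemini 3 Pro with real search, image_search, and browse tools — but the control flow (call model, execute tool, append result, repeat until a final prompt is emitted) matches the description above.

```python
# Minimal sketch of the multi-turn trajectory-generation loop.
# Tool outputs and the policy are stubs, not the paper's implementation.

TOOLS = {
    "search": lambda q: f"[web results for {q!r}]",
    "image_search": lambda q: [f"[image url for {q!r}]"],
    "browse": lambda url: f"[page content of {url}]",
}

def call_model(history):
    """Stub policy: two tool calls, then a final grounded prompt."""
    tool_turns = sum(1 for role, _ in history if role == "tool")
    if tool_turns == 0:
        return {"tool": "search", "arg": history[0][1]}
    if tool_turns == 1:
        return {"tool": "image_search", "arg": history[0][1]}
    return {"final_prompt": f"grounded: {history[0][1]}",
            "references": history[-1][1]}

def generate_trajectory(user_prompt, max_turns=8):
    history = [("user", user_prompt)]
    for _ in range(max_turns):
        action = call_model(history)
        if "final_prompt" in action:
            return action, history  # trajectory ends with a grounded prompt
        result = TOOLS[action["tool"]](action["arg"])
        history.append(("tool", result))
    raise RuntimeError("agent did not terminate")

final, traj = generate_trajectory("poster of the 2024 Paris Olympics mascot")
print(final["final_prompt"])
```

Each completed trajectory yields the final grounded prompt and selected reference images, which feed the ground-truth synthesis and filtering stages.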

2. KnowGen Benchmark & K-Score

KnowGen is a comprehensive benchmark for evaluating search-grounded image generation in real-world, knowledge-intensive scenarios.

  • Categories: Divided into two subsets:
    • Science & Knowledge: Astronomy, Biology, Chemistry, Physics, Engineering, Medicine, Industry, Architecture, History, Geography, Religion, Politics, Culture, Art, Sports.
    • Pop Culture & News: Anime, Games, Films, Celebrities, Posters, General News.
  • Evaluation Metric - K-Score: Uses GPT-4.1 as a judge to evaluate generated images from four dimensions, each scored on a scale of {0, 0.5, 1}:
    • Faithfulness: Scene-structure level adherence to the prompt.
    • Visual Correctness: Accuracy of grounded visual attributes vs. reference.
    • Text Accuracy: Presence, legibility, and correctness of required readable text.
    • Aesthetics: Overall visual quality and appeal.
  • The final K-Score is a weighted combination: K-Score = 0.1 · Faithfulness + 0.4 · Visual Correctness + 0.4 · Text Accuracy + 0.1 · Aesthetics
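The weighted combination above is straightforward to compute; here is a minimal sketch (dimension names are illustrative identifiers, and each dimension is judged on {0, 0.5, 1} per the benchmark description).

```python
# Sketch of the K-Score metric: a fixed weighted sum of four
# judge-assigned dimension scores, each in {0, 0.5, 1}.

WEIGHTS = {
    "faithfulness": 0.1,
    "visual_correctness": 0.4,
    "text_accuracy": 0.4,
    "aesthetics": 0.1,
}

def k_score(scores: dict[str, float]) -> float:
    # Enforce the judge's discrete scale before combining.
    assert all(scores[d] in (0, 0.5, 1) for d in WEIGHTS)
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

s = k_score({"faithfulness": 1, "visual_correctness": 0.5,
             "text_accuracy": 1, "aesthetics": 0.5})
print(s)  # 0.1 + 0.2 + 0.4 + 0.05 = 0.75 (up to float rounding)
```

The heavy weights on Visual Correctness and Text Accuracy make the metric sensitive to exactly the grounded-knowledge dimensions the search agent is meant to improve.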

3. Training Scheme

Gen-Searcher is initialized from Qwen3-VL-8B-Instruct and trained in two stages.

  • Search Tools: The agent is equipped with three tools: search (web text search), image_search (retrieve images via text query), and browse (analyze webpage content).
  • Stage 1: Supervised Fine-Tuning (SFT): Trained on Gen-Searcher-SFT-10k to learn basic multi-turn tool use for search, reasoning, and prompt composition.
  • Stage 2: Agentic Reinforcement Learning (RL): Trained on Gen-Searcher-RL-6k using GRPO to optimize search trajectories.
  • Dual Reward Feedback Design: To address the noise and instability of pure image-based rewards (due to generator variance), a combined reward is used: R = (1 − α) · R_image + α · R_text (Equation 1)
    • R_image: K-Score of the final generated image.
    • R_text: Text-based reward (scored 0–1 by GPT-4.1) evaluating whether the gathered prompt contains sufficient and correct information for generation.
    • α: Balancing hyperparameter (set to 0.5).
  • Optimization: The policy is optimized with GRPO. For each sampled output o_i under query q, the advantage is computed as A_i = (R_i − mean({R_j})) / std({R_j}) (Equation 2). The final policy update follows the standard GRPO objective J_GRPO (Equation 3 in the paper), which includes a clipped probability ratio and a KL-divergence penalty.
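Equations 1 and 2 can be sketched directly. The reward values below are made up for illustration, and whether the group statistics use a population or sample standard deviation is an implementation detail not specified in this summary.

```python
# Sketch of the dual reward (Equation 1) and the GRPO
# group-normalized advantage (Equation 2).
import statistics

ALPHA = 0.5  # balancing hyperparameter from the paper

def dual_reward(r_image: float, r_text: float, alpha: float = ALPHA) -> float:
    """Equation 1: R = (1 - alpha) * R_image + alpha * R_text."""
    return (1 - alpha) * r_image + alpha * r_text

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Equation 2: center each reward on the group mean, scale by std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumed choice
    return [(r - mean) / std for r in rewards]

# One group of sampled trajectories for the same query.
group = [dual_reward(0.8, 0.6), dual_reward(0.4, 0.5), dual_reward(0.9, 0.9)]
print(grpo_advantages(group))
```

Normalizing within each sampled group means the text reward's lower variance directly stabilizes the advantage estimates, which is the motivation given for mixing it with the noisier image reward.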

Empirical Validation / Results

Main Results on KnowGen Benchmark

Table 1: Performance of different models on the KnowGen benchmark.

| Models | Science & Knowledge: Visual cor. | Science & Knowledge: Text acc. | Pop Culture & News: Visual cor. | Overall K-Score |
|---|---|---|---|---|
| GPT-Image-1.5 | 29.25 | 40.14 | 29.43 | — |
| Nano Banana Pro | 39.46 | 49.32 | 30.51 | 50.38 |
| Seedream 4.5 | 14.46 | 26.19 | 12.50 | 31.01 |
| Qwen-Image | 6.80 | 0.34 | 7.59 | 14.98 |
| Gen-Searcher-8B + Qwen-Image | 26.87 | 17.18 | 25.30 | 31.52 |
| Gen-Searcher-8B + Seedream 4.5 | 36.35 | 43.52 | 39.04 | 47.29 |
| Gen-Searcher-8B + Nano Banana Pro | 45.07 | 49.32 | 43.01 | 53.30 |

  • KnowGen is Challenging: Open-source models (Qwen-Image, FLUX, Z-Image) score only 9-15, showing the difficulty of knowledge-intensive generation.
  • Effectiveness of Gen-Searcher: Brings substantial gains across backbones.
    • Improves Qwen-Image from 14.98 to 31.52 (+16.54 points).
    • Transfers effectively to other generators: improves Seedream 4.5 from 31.01 to 47.29 (+16.28 points) and Nano Banana Pro from 50.38 to 53.30.
  • Dimension Analysis: Gains primarily come from improvements in Visual Correctness and Text Accuracy, the two most critical components of K-Score.

Performance on WISE Benchmark

Table 2: Performance on the WISE benchmark (Overall Score).

| Model | Overall Score |
|---|---|
| Qwen-Image | 0.62 |
| LongCat-Image | 0.65 |
| Gen-Searcher-8B + Qwen-Image | 0.77 |

Gen-Searcher improves Qwen-Image from 0.62 to 0.77 on WISE, a gain of 0.15, demonstrating strong generalization to other knowledge-based generation benchmarks.

Ablation Study

Table 3: Ablation study on KnowGen (K-Score with Qwen-Image).

| Method | K-Score |
|---|---|
| Qwen-Image (Baseline) | 14.98 |
| + Manual Workflow (no training) | 22.91 |
| + Gen-Searcher-SFT only | 28.15 |
| + Gen-Searcher w/o text reward (α = 0) | 29.59 |
| + Gen-Searcher w/o image reward (α = 1) | 29.36 |
| + Gen-Searcher (Full) | 31.52 |

  • SFT is crucial: Learning from trajectories is better than a manual workflow.
  • RL provides further gains: Beyond SFT initialization.
  • Dual rewards are complementary: Removing either reward leads to degradation, validating the design.

Parameter Analysis

Performance remains strong when the balancing coefficient α is varied within a moderate range around the default of 0.5 (including values down to 0.3), indicating that the method is not overly sensitive to this hyperparameter.

Theoretical and Practical Implications

  • Advancing Agentic AI for Creative Tasks: Demonstrates that agentic RL can be successfully applied to complex, creative tasks like image generation, moving beyond traditional QA or tool-use domains.
  • Bridging the Knowledge Gap in Generative Models: Provides a generalizable framework to augment any image generator with up-to-date, external knowledge without retraining the generator itself, addressing a fundamental limitation.
  • Importance of Multimodal Search: Highlights the necessity of retrieving both textual evidence and visual references for accurate generation in real-world scenarios, as text-only search is insufficient for fine-grained visual attributes.
  • Robust RL Training Design: The dual reward feedback mechanism offers a solution to the challenge of noisy rewards in end-to-end creative pipelines, making RL training more stable and effective.
  • Foundation for Future Research: The open-sourced datasets, benchmark, and model establish a foundation for developing more capable search agents for generation and other multimodal tasks.

Conclusion

Gen-Searcher is the first trained multimodal deep search agent for knowledge-intensive image generation. By constructing novel datasets and a benchmark, and training via SFT and agentic RL with dual rewards, the model achieves substantial performance improvements and demonstrates strong transferability across image generators. This work opens a new direction for augmenting generative models with active, web-powered knowledge retrieval. Future work may explore scaling to larger models, extending to video generation, and improving the efficiency of the search-and-generation pipeline.