Gen-Searcher: Reinforcing Agentic Search for Image Generation - Summary
Summary (Overview)
- First Multimodal Deep Search Agent for Image Generation: Gen-Searcher is the first trained agent that performs multi-hop web search and reasoning to gather textual knowledge and visual references for knowledge-intensive image generation.
- Novel Data Pipeline and Benchmarks: The authors constructed two high-quality training datasets (Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k) and introduced the KnowGen benchmark for evaluating search-grounded generation, along with the K-Score metric.
- Dual Reward Reinforcement Learning: The model is trained via a two-stage process (Supervised Fine-Tuning followed by Agentic Reinforcement Learning) using a novel dual reward feedback design that combines text-based and image-based rewards to provide stable and informative learning signals for GRPO training.
- Significant Performance Gains: Gen-Searcher substantially improves image generation on knowledge-intensive tasks, boosting Qwen-Image by ~16 points on KnowGen and ~15 points on WISE, and shows strong transferability across different image generators (e.g., Seedream 4.5, Nano Banana Pro).
- Open Foundation: The project is fully open-sourced (data, models, code) to serve as a foundation for future research on search agents for image generation.
Introduction and Theoretical Foundation
Recent text-to-image models are constrained by frozen internal knowledge, struggling with real-world prompts that require up-to-date or knowledge-intensive information (e.g., specific landmarks, new products). While some proprietary models support text search, they lack visual reference retrieval. Prior RAG-based methods are limited by static databases and shallow retrieval, and prompt-based workflows are brittle and suboptimal.
This paper introduces Gen-Searcher, the first attempt to train a multimodal deep search agent for image generation using agentic reinforcement learning (RL). The core idea is to train an agent that can actively perform multi-hop web search, browse, and reason to gather both textual evidence and visual references, which are then used to create a grounded prompt for a downstream image generator. This addresses the fundamental limitation of frozen knowledge in generative models.
Methodology
1. Dataset Construction Pipeline
Because such training data did not naturally exist, a four-stage pipeline was created to generate it.
- Text Prompt Construction: Two strategies were used:
- Primary: Prompt engineering with Gemini 3 Pro to generate multi-hop search-intensive prompts across ~20 diverse categories (Anime, Celebrities, Physics, Art, etc.).
- Complementary: Converting existing deep research QA datasets into image-generation-oriented prompts, primarily for General News.
- Agentic Trajectory Generation: Gemini 3 Pro was used with search tools (`search`, `image_search`, `browse`) in a multi-turn loop to generate search trajectories, resulting in a final grounded prompt and selected reference images.
- Ground-Truth Image Synthesis: The final prompts were fed into Nano Banana Pro to synthesize corresponding images as ground truth.
- Data Filtering & Curation: Seed1.8 was used to score and filter samples based on faithfulness, correctness, aesthetics, safety, etc., combined with rule-based filtering. This yielded ~17K high-quality samples.
- Gen-Searcher-SFT-10k: For supervised fine-tuning.
- Gen-Searcher-RL-6k: For reinforcement learning.
- KnowGen Benchmark: 630 human-verified, held-out evaluation samples.
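The agentic trajectory generation step above can be sketched as a simple multi-turn tool-calling loop. The function names (`call_llm`, `run_tool`) and the message format are illustrative assumptions, with stub implementations standing in for Gemini 3 Pro and the real search backend:

```python
def call_llm(history):
    # Stub policy: issue one search, then finish. In the paper's pipeline this
    # role is played by Gemini 3 Pro choosing among search/image_search/browse.
    if any(m["role"] == "tool" for m in history):
        return {"type": "final", "grounded_prompt": "a grounded prompt"}
    return {"type": "tool", "tool": "search", "args": {"query": history[0]["content"]}}

def run_tool(tool, args):
    # Stub tool backend returning canned results instead of live web search.
    return {"snippets": [f"result for {args['query']}"], "selected_images": []}

def generate_trajectory(prompt, max_turns=8):
    """Run the search/browse loop until the agent emits a final grounded prompt."""
    history = [{"role": "user", "content": prompt}]
    references = []  # reference images selected along the trajectory
    for _ in range(max_turns):
        step = call_llm(history)
        if step["type"] == "final":
            return step["grounded_prompt"], references
        result = run_tool(step["tool"], step["args"])
        references.extend(result.get("selected_images", []))
        history.append({"role": "tool", "content": result})
    return None, references  # turn budget exhausted without a final prompt
```

The grounded prompt and collected reference images returned by the loop are what the pipeline then feeds to the downstream image generator.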
2. KnowGen Benchmark & K-Score
KnowGen is a comprehensive benchmark for evaluating search-grounded image generation in real-world, knowledge-intensive scenarios.
- Categories: Divided into two subsets:
- Science & Knowledge: Astronomy, Biology, Chemistry, Physics, Engineering, Medicine, Industry, Architecture, History, Geography, Religion, Politics, Culture, Art, Sports.
- Pop Culture & News: Anime, Games, Films, Celebrities, Posters, General News.
- Evaluation Metric - K-Score: Uses GPT-4.1 as a judge to evaluate generated images from four dimensions, each scored on a scale of {0, 0.5, 1}:
- Faithfulness: Scene-structure level adherence to the prompt.
- Visual Correctness: Accuracy of grounded visual attributes vs. reference.
- Text Accuracy: Presence, legibility, and correctness of required readable text.
- Aesthetics: Overall visual quality and appeal.
- The final K-Score is a weighted combination of these four dimension scores.
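The K-Score aggregation can be sketched as follows. The paper's actual dimension weights are not reproduced here, so equal weights are used as a placeholder assumption:

```python
# Illustrative K-Score aggregation. Each judge dimension takes a value in
# {0, 0.5, 1}; the equal weighting below is an assumption, not the paper's.

DIMENSIONS = ("faithfulness", "visual_correctness", "text_accuracy", "aesthetics")

def k_score(scores, weights=None):
    """Weighted combination of the four judge scores, scaled to 0-100."""
    if weights is None:
        weights = {d: 0.25 for d in DIMENSIONS}  # placeholder: equal weighting
    for d in DIMENSIONS:
        assert scores[d] in (0, 0.5, 1), f"invalid judge score for {d}"
    total = sum(weights[d] * scores[d] for d in DIMENSIONS)
    return 100 * total / sum(weights.values())

print(k_score({"faithfulness": 1, "visual_correctness": 0.5,
               "text_accuracy": 0.5, "aesthetics": 1}))  # 75.0
```

Scaling to 0-100 matches the score ranges reported in the results tables below.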
3. Training Scheme
Gen-Searcher is initialized from Qwen3-VL-8B-Instruct and trained in two stages.
- Search Tools: The agent is equipped with three tools:
`search` (web text search), `image_search` (retrieve images via text query), and `browse` (analyze webpage content).
- Stage 1: Supervised Fine-Tuning (SFT): Trained on Gen-Searcher-SFT-10k to learn basic multi-turn tool use for search, reasoning, and prompt composition.
- Stage 2: Agentic Reinforcement Learning (RL): Trained on Gen-Searcher-RL-6k using GRPO to optimize search trajectories.
- Dual Reward Feedback Design: To address the noise and instability of pure image-based rewards (due to generator variance), a combined reward is used:
- R_image: K-Score of the final generated image.
- R_text: Text-based reward (scored 0-1 by GPT-4.1) evaluating whether the gathered prompt contains sufficient and correct information for generation.
- λ: Balancing hyperparameter (set to 0.5); the combined reward is R = λ·R_image + (1−λ)·R_text.
- Optimization: The policy is optimized using GRPO. For each sampled output o_i under query q, the group-relative advantage is computed as A_i = (R_i − mean({R_1, …, R_G})) / std({R_1, …, R_G}). The final policy update follows the standard GRPO objective (Equation 3 in the paper), which includes a clipped probability ratio and a KL divergence penalty.
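The dual reward and the group-relative advantage above can be sketched in a few lines. The convex-combination form of the reward is an assumption consistent with λ being described as a balancing hyperparameter:

```python
import statistics

def combined_reward(r_image, r_text, lam=0.5):
    """Dual reward: lam balances the image-based K-Score against the text reward."""
    return lam * r_image + (1 - lam) * r_text

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std, common in GRPO implementations
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mu) / sigma for r in rewards]
```

The zero-variance guard illustrates why the text reward matters: if the noisy image reward alone collapses to identical values across a group, the advantage (and hence the gradient signal) vanishes, while the denser text reward keeps groups discriminable.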
Empirical Validation / Results
Main Results on KnowGen Benchmark
Table 1: Performance of different models on the KnowGen benchmark (selected dimension scores).
| Models | Science & Knowledge: Visual cor. | Science & Knowledge: Text acc. | Pop Culture & News: Visual cor. |
|---|---|---|---|
| GPT-Image-1.5 | 29.25 | 40.14 | 29.43 |
| Nano Banana Pro | 39.46 | 49.32 | 30.51 |
| Seedream 4.5 | 14.46 | 26.19 | 12.50 |
| Qwen-Image | 6.80 | 0.34 | 7.59 |
| Gen-Searcher-8B + Qwen-Image | 26.87 | 17.18 | 25.30 |
| Gen-Searcher-8B + Seedream 4.5 | 36.35 | 43.52 | 39.04 |
| Gen-Searcher-8B + Nano Banana Pro | 45.07 | 49.32 | 43.01 |
- KnowGen is Challenging: Open-source models (Qwen-Image, FLUX, Z-Image) score only 9-15, showing the difficulty of knowledge-intensive generation.
- Effectiveness of Gen-Searcher: Brings substantial gains across backbones.
- Improves Qwen-Image from 14.98 to 31.52 (+16.54 points).
- Transfers effectively to other generators: improves Seedream 4.5 from 31.01 to 47.29 (+16.28 points) and Nano Banana Pro from 50.38 to 53.30.
- Dimension Analysis: Gains primarily come from improvements in Visual Correctness and Text Accuracy, the two most critical components of K-Score.
Performance on WISE Benchmark
Table 2: Performance on the WISE benchmark (Overall Score).
| Model | Overall Score |
|---|---|
| Qwen-Image | 0.62 |
| LongCat-Image | 0.65 |
| Gen-Searcher-8B + Qwen-Image | 0.77 |
Gen-Searcher improves Qwen-Image from 0.62 to 0.77 on WISE, a gain of 0.15, demonstrating strong generalization to other knowledge-based generation benchmarks.
Ablation Study
Table 3: Ablation Study on KnowGen (K-Score with Qwen-Image).
| Method | K-Score |
|---|---|
| Qwen-Image (Baseline) | 14.98 |
| + Manual Workflow (no training) | 22.91 |
| + Gen-Searcher-SFT only | 28.15 |
| + Gen-Searcher w/o text reward (image reward only) | 29.59 |
| + Gen-Searcher w/o image reward (text reward only) | 29.36 |
| + Gen-Searcher (Full) | 31.52 |
- SFT is crucial: Learning from trajectories is better than a manual workflow.
- RL provides further gains: Beyond SFT initialization.
- Dual rewards are complementary: Removing either reward leads to degradation, validating the design.
Parameter Analysis
Performance remains strong for balancing coefficients λ in a range beginning at 0.3, indicating the method is not overly sensitive to this hyperparameter.
Theoretical and Practical Implications
- Advancing Agentic AI for Creative Tasks: Demonstrates that agentic RL can be successfully applied to complex, creative tasks like image generation, moving beyond traditional QA or tool-use domains.
- Bridging the Knowledge Gap in Generative Models: Provides a generalizable framework to augment any image generator with up-to-date, external knowledge without retraining the generator itself, addressing a fundamental limitation.
- Importance of Multimodal Search: Highlights the necessity of retrieving both textual evidence and visual references for accurate generation in real-world scenarios, as text-only search is insufficient for fine-grained visual attributes.
- Robust RL Training Design: The dual reward feedback mechanism offers a solution to the challenge of noisy rewards in end-to-end creative pipelines, making RL training more stable and effective.
- Foundation for Future Research: The open-sourced datasets, benchmark, and model establish a foundation for developing more capable search agents for generation and other multimodal tasks.
Conclusion
Gen-Searcher is the first trained multimodal deep search agent for knowledge-intensive image generation. By constructing novel datasets and a benchmark, and training via SFT and agentic RL with dual rewards, the model achieves substantial performance improvements and demonstrates strong transferability across image generators. This work opens a new direction for augmenting generative models with active, web-powered knowledge retrieval. Future work may explore scaling to larger models, extending to video generation, and improving the efficiency of the search-and-generation pipeline.