OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
Summary (Overview)
- Open Recipe: Introduces OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents, including the release of data, code, and models.
- Key Components: The recipe comprises three core innovations: 1) A data curation pipeline that constructs high-quality, tool-demanding multi-hop Visual Question Answering (VQA) data via Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding; 2) A diverse tool environment unifying retrieval (text/image search), image enhancement (sharpening, super-resolution, perspective correction), and parsing (OCR, cropping) tools; 3) A multi-turn fatal-aware GRPO reinforcement learning algorithm that handles cascading tool failures via token masking and one-sided advantage clamping.
- Empirical Performance: OpenSearch-VL delivers substantial gains, with the 30B-A3B model achieving an average score of 61.6 across seven benchmarks—a +13.8 point improvement over the Qwen3-VL-30B-A3B baseline. It achieves performance comparable to proprietary commercial models on several tasks.
- Core Contribution: Addresses the reproducibility bottleneck in frontier multimodal search agent research by providing transparent, high-quality training data (SearchVL-SFT-36k, SearchVL-RL-8k) and a detailed training methodology.
Introduction and Theoretical Foundation
Multimodal deep search has become a critical capability, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. However, top-tier multimodal search agents remain difficult to reproduce due to the absence of open, high-quality training data, transparent trajectory-synthesis pipelines, and detailed training recipes, since these components are often proprietary in commercial systems.
The paper identifies three main challenges:
- Data Bottleneck: Effective training data must capture image-grounded understanding, multi-hop retrieval, evidence verification, and long-horizon tool use, not simple visual QA. Existing data often enables shortcuts.
- Training Challenge: Applying agentic Reinforcement Learning (RL) to long-horizon tool-use settings is difficult. A single tool failure can invalidate a rollout, wasting useful pre-failure reasoning or introducing noisy gradients.
- Real-World Imperfections: Visual inputs are often imperfect (blurred, low-resolution, skewed). Agents must perform robust visual pre-processing (e.g., cropping, enhancement) before reliable search can begin, a capability lacking in most retrieval-focused agents.
Theoretical Formulation: The problem is formalized as an agent interacting with a multimodal environment. Given an input image $I$ and question $q$, the agent answers by interleaving reasoning with tool calls over a tool set $\mathcal{T}$ (visual and search tools). The interaction unfolds as a multi-turn trajectory:

$$\tau = (a_1, o_1, a_2, o_2, \ldots, a_T),$$

where $h_t = (I, q, a_1, o_1, \ldots, a_{t-1}, o_{t-1})$ is the history, $a_t$ is an action (reasoning trace + tool invocation/response), and $o_t$ is an observation from the environment $\mathcal{E}$. The trajectory likelihood under policy $\pi_\theta$ is:

$$\pi_\theta(\tau \mid I, q) = \prod_{t=1}^{T} \pi_\theta(a_t \mid h_t)\, p_{\mathcal{E}}(o_t \mid h_t, a_t).$$
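To make the interaction loop concrete, here is a minimal Python sketch of a single rollout, assuming hypothetical `policy.act` and `env.execute` interfaces (all names are illustrative, not from the paper's codebase):

```python
def rollout(policy, env, image, question, max_turns=16):
    """Roll out one multi-turn trajectory tau = (a_1, o_1, ..., a_T).

    `policy` and `env` are assumed interfaces: the policy maps a history
    h_t to an action a_t (reasoning trace plus an optional tool call),
    and the environment executes tool calls and returns observations o_t.
    """
    history = [("input", (image, question))]   # h_1 = (I, q)
    for _ in range(max_turns):
        action = policy.act(history)           # a_t ~ pi_theta(. | h_t)
        history.append(("action", action))
        if action.tool_call is None:           # terminal answer ends tau
            break
        obs = env.execute(action.tool_call)    # o_t ~ p_E(. | h_t, a_t)
        history.append(("observation", obs))
    return history
```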
Methodology
1. Data Curation Pipeline
The pipeline synthesizes high-quality, tool-demanding trajectories without manual annotation in three stages:
a) High-Quality VQA Construction:
- Wikipedia Path Sampling: Starting from the Wikipedia hyperlink graph $G = (V, E)$, a constrained random walk of length $L$ produces a path $P = (e_0, e_1, \ldots, e_L)$. Nodes are assigned roles: $e_0$ (visual anchor), $e_1, \ldots, e_{L-1}$ (bridge nodes), $e_L$ (answer node). See the sketch after this list.
- Fuzzy Entity Rewriting: The canonical question $q_c$ (generated from the path) is progressively rewritten into a fuzzy counterpart $q_f$ by replacing entity names with relational descriptors, ensuring answer invariance, uniqueness, and non-leakage.
- Anchor-aware Visual Grounding: A representative image $I_a$ of the anchor entity $e_0$ (not the answer entity) is retrieved and grounded in $q_f$, yielding the final question $(I_a, q_f)$. This design reduces single-hop shortcuts.
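A minimal sketch of the constrained random walk, under stated assumptions: the hyperlink graph is an adjacency mapping, and the revisit/degree constraints below are illustrative stand-ins for the paper's exact sampling rules:

```python
import random

def sample_path(graph, length, min_degree=3, rng=None):
    """Constrained random walk over a Wikipedia hyperlink graph.

    `graph` maps an entity title to the titles it links to. The walk
    rejects revisits and low-degree nodes (constraints here are
    illustrative) so that e_0 (visual anchor), e_1..e_{L-1} (bridges),
    and e_L (answer node) form a usable multi-hop chain.
    """
    rng = rng or random.Random()
    starts = [v for v in graph if len(graph[v]) >= min_degree]
    while True:                                   # restart on dead ends
        path = [rng.choice(starts)]
        while len(path) <= length:
            nxt = [v for v in graph.get(path[-1], ())
                   if v not in path and len(graph.get(v, ())) >= min_degree]
            if not nxt:
                break                             # dead end: restart walk
            path.append(rng.choice(nxt))
        if len(path) == length + 1:
            return path                           # L edges, L+1 nodes
```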
b) Filtering and Enhancement:
- The Wikipedia-derived VQA pool is merged with open-source corpora (LiveVQA, FVQA, WebQA).
- Staged Filtering: A frozen Qwen3-VL-32B model is used to discard examples answerable 1) without any tools, and 2) with a single `ImageSearch` call.
- Enhanced Subset (10%): Controlled degradations (blur, downsampling, distortion) are applied to encourage "think-with-image" behavior, requiring the agent to use enhancement tools before retrieval; a degradation sketch follows this list.
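A minimal sketch of such controlled degradations with Pillow; the specific blur radii and scale factors below are illustrative assumptions, not the paper's settings:

```python
import random
from PIL import Image, ImageFilter

def degrade(img: Image.Image, rng: random.Random) -> Image.Image:
    """Apply one random controlled degradation (blur, downsampling,
    or a mild distortion). Parameter ranges are illustrative."""
    kind = rng.choice(["blur", "downsample", "distort"])
    if kind == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(1.5, 4.0)))
    if kind == "downsample":
        w, h = img.size
        s = rng.uniform(0.2, 0.5)
        small = img.resize((max(1, int(w * s)), max(1, int(h * s))))
        return small.resize((w, h))   # re-upsample: keeps canvas, loses detail
    # mild affine shear as a simple stand-in for perspective distortion
    return img.transform(img.size, Image.Transform.AFFINE,
                         (1.0, rng.uniform(-0.2, 0.2), 0.0, 0.0, 1.0, 0.0))
```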
c) Multi-turn Trajectory Synthesis:
- Expert trajectories are synthesized by rolling out Claude Opus 4.6 in the real tool environment for each filtered instance.
- Rejection Sampling: Raw rollouts are vetted by a two-stage judge: 1) the final answer must match the ground truth (GPT-4o judge); 2) the trajectory must pass a process-level quality check (GPT-5.4 judge on tool use and consistency). A minimal filter sketch follows this list.
- This yields the final SearchVL-SFT-36k dataset (36,592 trajectories, avg. 6.3 tool turns).
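A minimal sketch of the two-stage rejection filter, assuming hypothetical `judge_answer` and `judge_process` wrappers around the outcome and process judges (names and the threshold are illustrative):

```python
def accept(trajectory, ground_truth, judge_answer, judge_process,
           min_process_score=0.8):
    """Two-stage rejection sampling over raw expert rollouts.

    Stage 1: the final answer must match ground truth (outcome judge).
    Stage 2: the trajectory must pass a process-level quality check
    (tool-use sanity, reasoning consistency). Threshold is illustrative.
    """
    if not judge_answer(trajectory.final_answer, ground_truth):
        return False
    return judge_process(trajectory) >= min_process_score
```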
2. Tool Environment
The tool suite spans three complementary functions, as summarized below:
Table 1 | The search-oriented tool suite integrated within OpenSearch-VL.
| Tool | Description | Arguments | Tool Output |
|---|---|---|---|
| TextSearch | Web search with page reading and LLM summarization | Query + TopK | Query-focused passage summaries |
| ImageSearch | Reverse image / visual entity search over the web | Image + TopK | Visual matches and related webpages |
| Sharpen | Unsharp-masking based deblurring / detail enhancement | Image + Amount | Sharpened image |
| SuperResolution | Deep super-resolution (EDSR) for low-resolution inputs | Image + Scale | High-resolution image |
| PerspectiveCorrect | Auto perspective rectification of skewed documents | Image | Fronto-parallel image |
| Crop | Extract a user-specified rectangular region | Image + Coordinates | Cropped image |
| OCR | Structured document parsing with text and layout labels | Image + Flags | Text blocks with labels and reading order |
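Agents invoke these tools through structured calls; the snippet below sketches what two such payloads might look like, assuming a JSON-style schema (field names are assumptions, as the summary does not fix the exact format):

```python
# Illustrative tool-call payloads; field names are assumptions.
sharpen_call = {
    "name": "Sharpen",
    "arguments": {"image": "img_0", "amount": 1.5},
}
search_call = {
    "name": "TextSearch",
    "arguments": {"query": "museum founded 1870 near the Danube", "top_k": 5},
}
# The environment executes the call and returns the tool output
# (e.g., a sharpened image handle or passage summaries) as o_t.
```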
3. Multi-Turn Fatal-Aware GRPO Training
The training has two stages: Supervised Fine-Tuning (SFT) on the expert trajectories, followed by Reinforcement Learning (RL).
SFT Objective: The standard SFT objective maximizes the log-likelihood of expert actions:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{\tau \sim \mathcal{D}} \left[ \sum_{t} m_t \log \pi_\theta(y_t \mid y_{<t}) \right],$$

where observations are excluded from loss computation via a token-level generation mask $m_t \in \{0, 1\}$ ($m_t = 1$ only on agent-generated tokens).
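A minimal PyTorch sketch of this masked objective, assuming per-token logits and a 0/1 generation mask (shapes and names are illustrative):

```python
import torch.nn.functional as F

def masked_sft_loss(logits, targets, gen_mask):
    """Token-level cross-entropy restricted to agent-generated tokens.

    logits:   [batch, seq, vocab] next-token logits
    targets:  [batch, seq] expert trajectory token ids
    gen_mask: [batch, seq] 1 on agent tokens, 0 on tool observations
    """
    nll = F.cross_entropy(
        logits.flatten(0, 1),       # [batch*seq, vocab]
        targets.flatten(),          # [batch*seq]
        reduction="none",
    ).view(targets.shape)           # back to [batch, seq]
    mask = gen_mask.float()
    return (nll * mask).sum() / mask.sum().clamp(min=1.0)
```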
RL with Composite Reward & Fatal-Aware Masking:
- Composite Trajectory Reward: $R(\tau) = \lambda_{\text{fmt}} R_{\text{fmt}} + \lambda_{\text{acc}} R_{\text{acc}} + \lambda_{\text{proc}} R_{\text{proc}}$, where the $\lambda$ terms are fixed scalar weights.
  - $R_{\text{fmt}}$: Algorithmic format reward (ensures proper `<tool_call>...</tool_call>` structure).
  - $R_{\text{acc}}$: Terminal accuracy reward (GPT-4o judge).
  - $R_{\text{proc}}$: Process-level search quality reward (GPT-5.4 judge on query relevance, progression, etc.).
- Fatal Step Detection: The fatal step index $t_i^{\text{fatal}}$ for trajectory $\tau_i$ is the earliest step at which a run of consecutive tool-execution errors commences.
- Fatal-Aware Token Mask: Extends the generation mask to zero out tokens generated after the fatal step:
  $$\tilde{m}_{i,t} = m_{i,t} \cdot \mathbb{1}\left[t \le t_i^{\text{fatal}}\right].$$
- One-Sided Advantage Clamping: To preserve useful pre-failure reasoning, the group-normalized advantage $A_i$ is clamped for fatal trajectories:
  $$\tilde{A}_i = \begin{cases} \max(A_i, 0) & \text{if } \tau_i \text{ contains a fatal step,} \\ A_i & \text{otherwise.} \end{cases}$$
  This ensures the valid prefix of a fatal trajectory is only reinforced if its partial reward exceeds the group mean (see the sketch after this list).
- Final GRPO Objective: Integrating the mask and clamped advantage:
  $$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\sum_{t} \tilde{m}_{i,t}} \sum_{t} \tilde{m}_{i,t}\, \min\!\Big(r_{i,t}(\theta)\,\tilde{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\tilde{A}_i\Big)\right],$$
  where $r_{i,t}(\theta) = \pi_\theta(y_{i,t} \mid h_{i,t}) \,/\, \pi_{\theta_{\text{old}}}(y_{i,t} \mid h_{i,t})$ is the token-level importance ratio.
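A minimal sketch of the fatal-aware pieces, fatal step detection and one-sided advantage clamping, grounded in the definitions above; the consecutive-error threshold `k` and tensor layout are illustrative assumptions:

```python
import torch

def fatal_step(tool_errors, k=2):
    """Earliest step where k consecutive tool-execution errors begin.
    `tool_errors` is a list of bools per step; k is an assumed threshold."""
    run = 0
    for t, err in enumerate(tool_errors):
        run = run + 1 if err else 0
        if run == k:
            return t - k + 1         # index where the fatal run started
    return len(tool_errors)          # no fatal step: full trajectory valid

def fatal_aware_advantages(rewards, is_fatal):
    """Group-normalized advantages with one-sided clamping.

    rewards:  [G] composite trajectory rewards within one rollout group
    is_fatal: [G] bool, True if the trajectory hit a fatal step
    Fatal trajectories keep only non-negative advantages, so their
    valid prefixes are reinforced only when beating the group mean.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return torch.where(is_fatal, adv.clamp(min=0.0), adv)
```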
Empirical Validation / Results
Main Results
Table 2 | Performance on multimodal knowledge-intensive QA and web-search benchmarks.
| Model | SimpleVQA | VDR | MMSearch | LiveVQA | BrowseComp-VL | FVQA | InfoSeek | Avg |
|---|---|---|---|---|---|---|---|---|
| Direct Reasoning | ||||||||
| GPT-4o | 51.7 | 1.7 | 18.7 | 28.1 | 5.5 | 48.0 | 52.9 | 29.5 |
| Qwen3-VL-30B-A3B | 53.2 | 3.8 | 18.7 | 42.7 | 29.6 | 34.7 | 26.4 | 29.9 |
| Agentic Workflow | ||||||||
| Qwen3-VL-8B | 52.0 | 17.0 | 37.4 | 50.6 | 27.9 | 58.7 | 50.3 | 42.0 |
| OpenSearch-VL-8B (Ours) | 71.6 | 20.8 | 64.5 | 59.6 | 37.6 | 71.5 | 70.2 | 56.6 |
| Qwen3-VL-30B-A3B | 55.1 | 20.2 | 44.2 | 62.0 | 34.1 | 63.0 | 56.2 | 47.8 |
| OpenSearch-VL-30B-A3B (Ours) | 74.9 | 33.5 | 68.7 | 67.4 | 41.1 | 73.2 | 72.4 | 61.6 |
| Qwen3-VL-32B | 58.7 | 23.1 | 53.9 | 45.5 | 35.1 | 61.2 | 58.5 | 48.0 |
| OpenSearch-VL-32B (Ours) | 76.2 | 33.8 | 72.3 | 70.5 | 43.8 | 74.7 | 74.8 | 63.7 |
- Substantial Gains: OpenSearch-VL shows clear advantages over direct-reasoning and RAG baselines. The 30B-A3B model improves the average score from 47.8 to 61.6 (+13.8) over its baseline.
- Scale Effectiveness: Performance scales effectively from 8B to 32B models.
- Competitive Performance: OpenSearch-VL-32B outperforms strong proprietary direct-reasoning models like Gemini-2.5-Pro on the average score.
Ablation Studies
Table 3 | Ablation studies on the SFT data pipeline and RL training recipe.

(a) SFT data pipeline ablation (full pipeline Avg. = 64.6)
| Settings | SimpleVQA | InfoSeek | FVQA | Avg. |
|---|---|---|---|---|
| w/o source-anchor grounding | 53.6 (-12.5) | 54.5 (-7.9) | 51.2 (-14.1) | 53.1 (-11.5) |
| w/o fuzzy entity rewriting | 51.7 (-14.4) | 56.4 (-6.0) | 54.7 (-10.6) | 54.3 (-10.3) |
| w/o staged filtering | 57.6 (-8.5) | 55.2 (-7.2) | 56.3 (-9.0) | 56.4 (-8.2) |
| w/o enhancement subset | 64.9 (-1.2) | 61.7 (-0.7) | 63.2 (-2.1) | 63.3 (-1.3) |
(b) RL recipe ablation (base: Qwen3-VL-8B + SFT only, Avg. = 64.6)
| Method | SimpleVQA | InfoSeek | FVQA | Avg. |
|---|---|---|---|---|
| + Vanilla GRPO | 68.8 | 66.5 | 67.4 | 67.6 |
| + GRPO w/ Hard Masking | 68.3 (-0.5) | 67.9 (+1.4) | 66.9 (-0.5) | 67.7 (+0.1) |
| + GRPO w/ Fatal Masking only | 69.7 (+0.9) | 68.3 (+1.8) | 69.2 (+1.8) | 69.1 (+1.5) |
| + Fatal Masking + One-sided Clamp (Ours) | 71.6 (+2.8) | 72.4 (+5.9) | 71.5 (+4.1) | 71.8 (+4.2) |
- Data Pipeline: Each component (source-anchor grounding, fuzzy rewriting, staged filtering) is crucial, with removals causing large drops of 8.2 to 11.5 average points. The enhancement subset provides smaller but consistent gains.
- RL Recipe: The full fatal-aware GRPO with one-sided clamping achieves the best results (71.8 avg.), a +4.2 point gain over vanilla GRPO. Fatal masking alone improves over hard-masking, and clamping recovers additional signal.
Training Dynamics and Clamping Analysis
- Figure 3 shows that fatal-aware GRPO sustains longer tool-use trajectories while achieving higher batch accuracy than baselines.
- Figures 4 & 5 visualize the effect of one-sided clamping. It preserves gradients for fatal trajectory prefixes that beat the group mean (a small, high-quality subset ~8.2%), while safely zeroing out the majority (91.8%) that fall below the mean, preventing noisy negative signals.
Theoretical and Practical Implications
- Reproducibility & Open Research: By releasing data, code, and models, OpenSearch-VL lowers the barrier to entry and provides a foundation for reproducible research on multimodal deep search agents, challenging the dominance of proprietary systems.
- Data Curation Insights: The pipeline demonstrates that preventing shortcuts (via source-anchor grounding and fuzzy rewriting) and ensuring tool-demanding complexity (via staged filtering) are essential for training effective search agents.
- RL for Long-Horizon Tool-Use: The fatal-aware GRPO algorithm provides a principled solution to a key challenge in agentic RL: credit assignment in partially failed trajectories. The one-sided clamping mechanism ensures the valid prefix of a fatal trajectory contributes only non-negative learning signal, reinforcing useful pre-failure reasoning without propagating noise from cascading tool failures.