OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Summary (Overview)

  • Open Recipe: Introduces OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents, including the release of data, code, and models.
  • Key Components: The recipe comprises three core innovations: 1) A data curation pipeline that constructs high-quality, tool-demanding multi-hop Visual Question Answering (VQA) data via Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding; 2) A diverse tool environment unifying retrieval (text/image search), image enhancement (sharpening, super-resolution, perspective correction), and parsing (OCR, cropping) tools; 3) A multi-turn fatal-aware GRPO reinforcement learning algorithm that handles cascading tool failures via token masking and one-sided advantage clamping.
  • Empirical Performance: OpenSearch-VL delivers substantial gains, with the 30B-A3B model achieving an average score of 61.6 across seven benchmarks—a +13.8 point improvement over the Qwen3-VL-30B-A3B baseline. It achieves performance comparable to proprietary commercial models on several tasks.
  • Core Contribution: Addresses the reproducibility bottleneck in frontier multimodal search agent research by providing transparent, high-quality training data (SearchVL-SFT-36k, SearchVL-RL-8k) and a detailed training methodology.

Introduction and Theoretical Foundation

Multimodal deep search has become a critical capability, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. However, top-tier multimodal search agents remain difficult to reproduce because open, high-quality training data, transparent trajectory-synthesis pipelines, and detailed training recipes are lacking; these components are typically proprietary in commercial systems.

The paper identifies three main challenges:

  1. Data Bottleneck: Effective training data must capture image-grounded understanding, multi-hop retrieval, evidence verification, and long-horizon tool use, not simple visual QA. Existing data often enables shortcuts.
  2. Training Challenge: Applying agentic Reinforcement Learning (RL) to long-horizon tool-use settings is difficult. A single tool failure can invalidate a rollout, wasting useful pre-failure reasoning or introducing noisy gradients.
  3. Real-World Imperfections: Visual inputs are often imperfect (blurred, low-resolution, skewed). Agents must perform robust visual pre-processing (e.g., cropping, enhancement) before reliable search can begin, a capability lacking in most retrieval-focused agents.

Theoretical Formulation: The problem is formalized as an agent interacting with a multimodal environment. Given an input image $I_0$ and question $q$, the agent answers by interleaving reasoning with tool calls over a tool set $\mathcal{T} = \mathcal{T}_v \cup \mathcal{T}_s$ (visual and search tools). The interaction unfolds as a multi-turn trajectory:

$$\tau = \left\{ (h_0, a_0, o_0), (h_1, a_1, o_1), \ldots, (h_{L-1}, a_{L-1}, o_{L-1}), (h_L, a_L) \right\}$$

where $h_l = \{I_l, q, a_{<l}, o_{<l}\}$ is the history, $a_l = [z_l, c_l]$ is an action (a reasoning trace $z_l$ plus a tool invocation or final response $c_l$), and $o_l$ is an observation from the environment $\mathcal{E}$. The trajectory likelihood under policy $\pi_\theta$ factorizes as:

$$\pi_\theta(\tau \mid I_0, q) = \prod_{l=0}^{L} P_\theta(a_l \mid h_l) = \prod_{l=0}^{L} P_\theta(z_l \mid h_l)\, P_\theta(c_l \mid h_l, z_l)$$
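To make this structure concrete, here is a minimal Python sketch of the interaction loop; the `policy.act` / `env.execute` interfaces and the `final_answer` convention are illustrative assumptions, not the released API.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    reasoning: str                  # z_l: free-form reasoning trace
    tool_call: dict                 # c_l: tool invocation, or the final answer
    observation: str | None = None  # o_l: environment feedback (absent at the last turn)

@dataclass
class Trajectory:
    question: str
    turns: list = field(default_factory=list)

def rollout(policy, env, image, question, max_turns=12):
    """Interleave reasoning and tool calls until the agent answers or the budget runs out."""
    traj = Trajectory(question)
    for _ in range(max_turns):
        history = (image, question, traj.turns)   # h_l = {I_l, q, a_<l, o_<l}
        reasoning, call = policy.act(history)     # a_l = [z_l, c_l] ~ P(z_l|h_l) P(c_l|h_l, z_l)
        if call["name"] == "final_answer":        # terminal action a_L has no observation o_L
            traj.turns.append(Turn(reasoning, call))
            break
        observation = env.execute(call)           # o_l returned by the environment E
        traj.turns.append(Turn(reasoning, call, observation))
    return traj
```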

Methodology

1. Data Curation Pipeline

The pipeline synthesizes high-quality, tool-demanding trajectories without manual annotation in three stages:

a) High-Quality VQA Construction:

  • Wikipedia Path Sampling: Starting from the Wikipedia hyperlink graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, a constrained random walk of length $h \in \{2, 3, 4\}$ produces a path $P = v_0 \xrightarrow{\rho_1} v_1 \xrightarrow{\rho_2} \cdots \xrightarrow{\rho_h} v_h$. Nodes are assigned roles: $v_0$ (visual anchor), $v_1, \ldots, v_{h-1}$ (bridge nodes), $v_h$ (answer node). A runnable sketch follows this list.
  • Fuzzy Entity Rewriting: The canonical question $q_t$ (generated from the path) is progressively rewritten into a fuzzy counterpart $q_f$ by replacing entity names with relational descriptors, ensuring answer invariance, uniqueness, and non-leakage: $$a(q_f) = a(q_t), \qquad |\mathcal{R}(q_f)| = 1, \qquad \Big( \bigcup_{j=0}^{h} \text{aliases}(v_j) \Big) \cap q_f = \emptyset$$
  • Source-Anchor Visual Grounding: A representative image $I$ of the anchor $v_0$ (not the answer entity) is retrieved and grounded in $q_f$, yielding the final question $q$. This design reduces single-hop shortcuts.
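The constrained random walk admits a compact sketch. The toy adjacency list and restart-on-dead-end logic below are illustrative assumptions; the actual pipeline walks the full Wikipedia hyperlink graph $\mathcal{G}$ and assigns node roles as described above.

```python
import random

def sample_path(graph, hops, max_tries=100):
    """Constrained random walk of length h over a hyperlink graph.

    graph: dict mapping an entity to a list of (relation, neighbor) edges,
    a simplified stand-in for G = (V, E). Returns [v_0, ..., v_h]: v_0 is the
    visual anchor, v_1..v_{h-1} the bridge nodes, and v_h the answer node.
    """
    for _ in range(max_tries):
        start = random.choice(list(graph))
        path, visited = [start], {start}
        while len(path) <= hops:
            edges = [(r, u) for r, u in graph.get(path[-1], []) if u not in visited]
            if not edges:
                break                      # dead end: restart the walk
            _, nxt = random.choice(edges)
            path.append(nxt)
            visited.add(nxt)
        if len(path) == hops + 1:
            return path
    return None

# Toy example; the real pipeline samples h uniformly from {2, 3, 4}.
toy_graph = {
    "Eiffel Tower": [("architect", "Gustave Eiffel"), ("located_in", "Paris")],
    "Gustave Eiffel": [("born_in", "Dijon")],
    "Paris": [("capital_of", "France")],
}
print(sample_path(toy_graph, hops=2))  # e.g. ['Eiffel Tower', 'Paris', 'France']
```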

b) Filtering and Enhancement:

  • The Wikipedia-derived VQA pool is merged with open-source corpora (LiveVQA, FVQA, WebQA).
  • Staged Filtering: A frozen Qwen3-VL-32B model discards examples that are answerable 1) without any tools, or 2) with a single ImageSearch call.
  • Enhanced Subset (10%): Controlled degradations (blur, downsampling, distortion) are applied to a 10% subset to encourage "think-with-image" behavior, requiring the agent to invoke enhancement tools before retrieval; a sketch of the degradations follows.
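A minimal sketch of such controlled degradations, assuming PIL and illustrative parameter ranges (the paper's exact blur radii, scales, and distortion settings are not specified here):

```python
import random
from PIL import Image, ImageFilter

def degrade(img: Image.Image) -> Image.Image:
    """Apply one random degradation so the agent must enhance before searching."""
    op = random.choice(["blur", "downsample", "distort"])
    if op == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(1.5, 4.0)))
    if op == "downsample":
        w, h = img.size
        s = random.choice([2, 3, 4])
        small = img.resize((max(1, w // s), max(1, h // s)), Image.BILINEAR)
        return small.resize((w, h), Image.NEAREST)   # upscale back: blocky, low detail
    # "distort": a mild affine skew as a stand-in for perspective distortion
    return img.transform(img.size, Image.AFFINE, (1.0, 0.25, 0.0, 0.05, 1.0, 0.0))
```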

c) Multi-turn Trajectory Synthesis:

  • Expert trajectories are synthesized by rolling out Claude Opus 4.6 in the real tool environment for each filtered instance.
  • Rejection Sampling: Raw rollouts are vetted by a two-stage judge (sketched after this list): 1) the final answer must match the ground truth (GPT-4o judge); 2) the process must pass quality checks on tool use and consistency (GPT-5.4 judge).
  • This yields the final SearchVL-SFT-36k dataset (36,592 trajectories, avg. 6.3 tool turns).
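The two-stage vetting reduces to a simple filter; `answer_judge` and `process_judge` below are hypothetical callables wrapping the GPT-4o and GPT-5.4 judges, respectively.

```python
def filter_rollouts(rollouts, answer_judge, process_judge):
    """Two-stage rejection sampling over raw expert rollouts.

    answer_judge(traj)  -> bool: does the final answer match the gold answer?
                                 (GPT-4o in the paper)
    process_judge(traj) -> bool: is the tool use sensible and the reasoning
                                 consistent? (GPT-5.4 in the paper)
    """
    kept = []
    for traj in rollouts:
        if not answer_judge(traj):       # stage 1: outcome correctness
            continue
        if not process_judge(traj):      # stage 2: process-level quality
            continue
        kept.append(traj)
    return kept
```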

2. Tool Environment

The tool suite $\mathcal{T}$ spans three complementary functions, as summarized below:

Table 1 | The search-oriented tool suite integrated within OpenSearch-VL.

| Tool | Description | Arguments | Tool Output |
| --- | --- | --- | --- |
| TextSearch | Web search with page reading and LLM summarization | Query + TopK | Query-focused passage summaries |
| ImageSearch | Reverse image / visual entity search over the web | Image + TopK | Visual matches and related webpages |
| Sharpen | Unsharp-masking-based deblurring / detail enhancement | Image + Amount | Sharpened image |
| SuperResolution | Deep super-resolution (EDSR) for low-resolution inputs | Image + Scale | High-resolution image |
| PerspectiveCorrect | Automatic perspective rectification of skewed documents | Image | Fronto-parallel image |
| Crop | Extract a user-specified rectangular region | Image + Coordinates | Cropped image |
| OCR | Structured document parsing with text and layout labels | Image + Flags | Text blocks with labels and reading order |
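Two of the visual tools admit compact reference implementations. The PIL-based sketch below is a plausible stand-in rather than the paper's environment code (SuperResolution would wrap an EDSR model, and the search tools external APIs); the dispatch table illustrates how an environment might route tool calls.

```python
from PIL import Image, ImageFilter

def sharpen(img: Image.Image, amount: float = 1.5) -> Image.Image:
    """Unsharp masking: boost edge contrast by `amount` (1.5 -> percent=150)."""
    return img.filter(ImageFilter.UnsharpMask(radius=2, percent=int(amount * 100), threshold=3))

def crop(img: Image.Image, box: tuple) -> Image.Image:
    """Extract a rectangular region given as (left, upper, right, lower) pixels."""
    return img.crop(box)

# A tool-dispatch table of the kind an agent environment might route calls through.
TOOLS = {"Sharpen": sharpen, "Crop": crop}
```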

3. Multi-Turn Fatal-Aware GRPO Training

The training has two stages: Supervised Fine-Tuning (SFT) on the expert trajectories, followed by Reinforcement Learning (RL).

SFT Objective: The standard SFT objective maximizes the log-likelihood of expert actions:

$$\max_\theta \sum_{i=1}^{N} \sum_{l=1}^{L_i} \left[ \log P_\theta\left(z^{(i)}_l \mid h^{(i)}_l\right) + \log P_\theta\left(c^{(i)}_l \mid h^{(i)}_l, z^{(i)}_l\right) \right]$$

Observations are excluded from loss computation via a token-level generation mask $M_{\text{gen}}(y_t)$, as sketched below.
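A minimal sketch of this masked objective, assuming trajectories are already tokenized and the mask marks model-generated tokens (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def masked_sft_loss(logits: torch.Tensor, targets: torch.Tensor,
                    gen_mask: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy restricted to model-generated tokens.

    logits:   (batch, seq, vocab) model outputs
    targets:  (batch, seq) target token ids
    gen_mask: (batch, seq) M_gen(y_t) -- 1 for reasoning/tool-call tokens,
              0 for tool observations injected by the environment
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    mask = gen_mask.reshape(-1).float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```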

RL with Composite Reward & Fatal-Aware Masking:

  • Composite Trajectory Reward: $r(\tau) = r_{\text{fmt}}(\tau) \cdot \left[ \alpha\, r_{\text{acc}}(\tau) + (1-\alpha)\, r_{\text{query}}(\tau) \right]$, where $\alpha = 0.8$.
    • $r_{\text{fmt}} \in [0,1]$: Algorithmic format reward (ensures proper <tool_call>...</tool_call> structure).
    • $r_{\text{acc}} \in \{0,1\}$: Terminal accuracy reward (GPT-4o judge).
    • $r_{\text{query}} \in [0,1]$: Process-level search quality reward (GPT-5.4 judge on query relevance, progression, etc.).
  • Fatal Step Detection: The fatal step index $f_i$ for trajectory $\tau_i$ is the earliest step where $K = 3$ consecutive tool-execution errors commence.
  • Fatal-Aware Token Mask: Extends the generation mask to zero out tokens generated from the fatal step onward: $M(y_{i,t}) = M_{\text{gen}}(y_{i,t}) \cdot \mathbb{1}\left[ s(t) < f_i \right]$
  • One-Sided Advantage Clamping: To preserve useful pre-failure reasoning, the group-normalized advantage $\tilde{r}_i$ is clamped for fatal trajectories:

    $$\hat{A}_i = \begin{cases} \tilde{r}_i & \text{if } f_i = L_i + 1 \text{ (non-fatal)} \\ \max(\tilde{r}_i, 0) & \text{if } f_i \le L_i \text{ (fatal)} \end{cases}$$

    This ensures the valid prefix of a fatal trajectory is reinforced only if its partial reward exceeds the group mean.
  • Final GRPO Objective: Integrating the mask and clamped advantage (a sketch follows this list):

    $$J(\theta) = \mathbb{E}_{(I_0,q)\sim\mathcal{D},\ \{\tau_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid I_0, q; \mathcal{E})} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{\sum_t M_{i,t}} \sum_{t=1}^{|\tau_i|} M_{i,t}\, \min\left( \rho_{i,t}(\theta)\, \hat{A}_i,\ \text{clip}_{1-\epsilon}^{1+\epsilon}\left(\rho_{i,t}(\theta)\right) \hat{A}_i \right) \right]$$

    where $\rho_{i,t}(\theta)$ is the token-level importance ratio.
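The fatal-aware machinery reduces to a few lines. The sketch below covers fatal-step detection, the extended token mask, and one-sided clamping for one rollout group; names, indexing conventions, and the explicit loops are illustrative, and the clipped PPO-style update itself is omitted.

```python
import torch

def fatal_step(step_failed, K=3):
    """Earliest step index (1-based) at which K consecutive tool-execution
    errors begin; returns L + 1 if the trajectory is non-fatal."""
    run = 0
    for l, failed in enumerate(step_failed, start=1):
        run = run + 1 if failed else 0
        if run == K:
            return l - K + 1
    return len(step_failed) + 1

def clamped_advantages(rewards, fatal_idx, lengths):
    """Group-normalize rewards, then one-sided-clamp fatal trajectories."""
    r = torch.tensor(rewards, dtype=torch.float32)
    adv = (r - r.mean()) / (r.std(unbiased=False) + 1e-6)   # \tilde{r}_i
    for i, (f, L) in enumerate(zip(fatal_idx, lengths)):
        if f <= L:                                          # fatal trajectory
            adv[i] = adv[i].clamp(min=0.0)                  # max(\tilde{r}_i, 0)
    return adv

def fatal_token_mask(gen_mask, token_step, f):
    """M(y_t) = M_gen(y_t) * 1[s(t) < f]: drop tokens from the fatal step on."""
    return gen_mask * (token_step < f).float()

# Example: a group of G = 4 rollouts, one of which hit 3 consecutive failures.
failures = [[False, False, False], [False, True, True], [True, True, True], [False]]
fatal = [fatal_step(s) for s in failures]            # -> [4, 4, 1, 2]
adv = clamped_advantages([0.9, 0.4, 0.1, 0.7], fatal, [len(s) for s in failures])
```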

Empirical Validation / Results

Main Results

Table 2 | Performance on multimodal knowledge-intensive QA and web-search benchmarks.

| Model | SimpleVQA | VDR | MMSearch | LiveVQA | BrowseComp-VL | FVQA | InfoSeek | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Direct Reasoning* | | | | | | | | |
| GPT-4o | 51.7 | 1.7 | 18.7 | 28.1 | 5.5 | 48.0 | 52.9 | 29.5 |
| Qwen3-VL-30B-A3B | 53.2 | 3.8 | 18.7 | 42.7 | 29.6 | 34.7 | 26.4 | 29.9 |
| *Agentic Workflow* | | | | | | | | |
| Qwen3-VL-8B | 52.0 | 17.0 | 37.4 | 50.6 | 27.9 | 58.7 | 50.3 | 42.0 |
| OpenSearch-VL-8B (Ours) | 71.6 | 20.8 | 64.5 | 59.6 | 37.6 | 71.5 | 70.2 | 56.6 |
| Qwen3-VL-30B-A3B | 55.1 | 20.2 | 44.2 | 62.0 | 34.1 | 63.0 | 56.2 | 47.8 |
| OpenSearch-VL-30B-A3B (Ours) | 74.9 | 33.5 | 68.7 | 67.4 | 41.1 | 73.2 | 72.4 | 61.6 |
| Qwen3-VL-32B | 58.7 | 23.1 | 53.9 | 45.5 | 35.1 | 61.2 | 58.5 | 48.0 |
| OpenSearch-VL-32B (Ours) | 76.2 | 33.8 | 72.3 | 70.5 | 43.8 | 74.7 | 74.8 | 63.7 |

  • Substantial Gains: OpenSearch-VL shows clear advantages over direct-reasoning and RAG baselines. The 30B-A3B model improves the average score from 47.8 to 61.6 (+13.8) over its baseline.
  • Scale Effectiveness: Performance scales effectively from 8B to 32B models.
  • Competitive Performance: OpenSearch-VL-32B outperforms strong proprietary direct-reasoning models like Gemini-2.5-Pro on the average score.

Ablation Studies

Table 3 | Ablation studies on the SFT data pipeline and RL training recipe.

(a) SFT data pipeline ablation (full-pipeline Avg. = 64.6; deltas are relative to the full pipeline)

| Setting | SimpleVQA | InfoSeek | FVQA | Avg. |
| --- | --- | --- | --- | --- |
| w/o source-anchor grounding | 53.6 (-12.5) | 54.5 (-7.9) | 51.2 (-14.1) | 53.1 (-11.5) |
| w/o fuzzy entity rewriting | 51.7 (-14.4) | 56.4 (-6.0) | 54.7 (-10.6) | 54.3 (-10.3) |
| w/o staged filtering | 57.6 (-8.5) | 55.2 (-7.2) | 56.3 (-9.0) | 56.4 (-8.2) |
| w/o enhancement subset | 64.9 (-1.2) | 61.7 (-0.7) | 63.2 (-2.1) | 63.3 (-1.3) |

(b) RL recipe ablation (Qwen3-VL-8B, SFT-only Avg. = 64.6; deltas are relative to vanilla GRPO)

| Method | SimpleVQA | InfoSeek | FVQA | Avg. |
| --- | --- | --- | --- | --- |
| + Vanilla GRPO | 68.8 | 66.5 | 67.4 | 67.6 |
| + GRPO w/ Hard Masking | 68.3 (-0.5) | 67.9 (+1.4) | 66.9 (-0.5) | 67.7 (+0.1) |
| + GRPO w/ Fatal Masking only | 69.7 (+0.9) | 68.3 (+1.8) | 69.2 (+1.8) | 69.1 (+1.5) |
| + Fatal Masking + One-sided Clamp (Ours) | 71.6 (+2.8) | 72.4 (+5.9) | 71.5 (+4.1) | 71.8 (+4.2) |

  • Data Pipeline: Each component (source-anchor grounding, fuzzy rewriting, staged filtering) is crucial; removing any one costs 8.2-11.5 average points. The enhancement subset provides smaller but consistent gains.
  • RL Recipe: The full fatal-aware GRPO with one-sided clamping achieves the best results (71.8 avg.), a +4.2 point gain over vanilla GRPO. Fatal masking alone improves over hard-masking, and clamping recovers additional signal.

Training Dynamics and Clamping Analysis

  • Figure 3 shows that fatal-aware GRPO sustains longer tool-use trajectories while achieving higher batch accuracy than baselines.
  • Figures 4 and 5 visualize the effect of one-sided clamping: it preserves gradients for the fatal-trajectory prefixes that beat the group mean (a small, high-quality subset, ~8.2%), while zeroing out the majority (~91.8%) that fall below it, preventing noisy negative signals.

Theoretical and Practical Implications

  • Reproducibility & Open Research: By releasing data, code, and models, OpenSearch-VL lowers the barrier to entry and provides a foundation for reproducible research on multimodal deep search agents, challenging the dominance of proprietary systems.
  • Data Curation Insights: The pipeline demonstrates that preventing shortcuts (via source-anchor grounding and fuzzy rewriting) and ensuring tool-demanding complexity (via staged filtering) are essential for training effective search agents.
  • RL for Long-Horizon Tool-Use: The fatal-aware GRPO algorithm provides a principled solution to a key challenge in agentic RL: credit assignment in partially failed trajectories. The one-sided clamping mechanism is conservative by design, reinforcing the valid prefix of a fatal trajectory only when its partial reward beats the group mean and discarding the noisy negative gradients that cascading tool failures would otherwise inject.