OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

Summary (Overview)

  • Open Recipe: Introduces OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents, including the release of data, code, and models.
  • Key Components: The recipe comprises three core innovations: 1) A data curation pipeline that constructs high-quality, tool-demanding multi-hop Visual Question Answering (VQA) data via Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding; 2) A diverse tool environment unifying retrieval (text/image search), image enhancement (sharpening, super-resolution, perspective correction), and parsing (OCR, cropping) tools; 3) A multi-turn fatal-aware GRPO reinforcement learning algorithm that handles cascading tool failures via token masking and one-sided advantage clamping.
  • Empirical Performance: OpenSearch-VL delivers substantial gains, with the 30B-A3B model achieving an average score of 61.6 across seven benchmarks—a +13.8 point improvement over the Qwen3-VL-30B-A3B baseline. It achieves performance comparable to proprietary commercial models on several tasks.
  • Core Contribution: Addresses the reproducibility bottleneck in frontier multimodal search agent research by providing transparent, high-quality training data (SearchVL-SFT-36k, SearchVL-RL-8k) and a detailed training methodology.

Introduction and Theoretical Foundation

Multimodal deep search has become a critical capability, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. However, top-tier multimodal search agents remain difficult to reproduce because open, high-quality training data, transparent trajectory-synthesis pipelines, and detailed training recipes are lacking; these components are typically proprietary in commercial systems.

The paper identifies three main challenges:

  1. Data Bottleneck: Effective training data must capture image-grounded understanding, multi-hop retrieval, evidence verification, and long-horizon tool use, not simple visual QA. Existing data often enables shortcuts.
  2. Training Challenge: Applying agentic Reinforcement Learning (RL) to long-horizon tool-use settings is difficult. A single tool failure can invalidate a rollout, wasting useful pre-failure reasoning or introducing noisy gradients.
  3. Real-World Imperfections: Visual inputs are often imperfect (blurred, low-resolution, skewed). Agents must perform robust visual pre-processing (e.g., cropping, enhancement) before reliable search can begin, a capability lacking in most retrieval-focused agents.

Theoretical Formulation: The problem is formalized as an agent interacting with a multimodal environment. Given an input image $I_0$ and question $q$, the agent answers by interleaving reasoning with tool calls over a tool set $\mathcal{T} = \mathcal{T}_v \cup \mathcal{T}_s$ (visual and search tools). The interaction unfolds as a multi-turn trajectory:

$$\tau = \left\{ (h_0, a_0, o_0), (h_1, a_1, o_1), \ldots, (h_{L-1}, a_{L-1}, o_{L-1}), (h_L, a_L) \right\}$$

where $h_l = \{I_l, q, a_{<l}, o_{<l}\}$ is the history, $a_l = [z_l, c_l]$ is an action (a reasoning trace $z_l$ plus a tool invocation or final response $c_l$), and $o_l$ is an observation from the environment $\mathcal{E}$. The trajectory likelihood under policy $\pi_\theta$ factorizes as:

$$\pi_\theta(\tau \mid I_0, q) = \prod_{l=0}^{L} P_\theta(a_l \mid h_l) = \prod_{l=0}^{L} P_\theta(z_l \mid h_l)\, P_\theta(c_l \mid h_l, z_l)$$
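To make this structure concrete, here is a minimal Python sketch of the interaction loop; the `policy.act` / `env.execute` interfaces and the `final_answer` convention are illustrative assumptions, not the released API.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    reasoning: str                  # z_l: free-form reasoning trace
    tool_call: dict                 # c_l: tool invocation, or the final answer
    observation: str | None = None  # o_l: environment feedback (absent at the last turn)

@dataclass
class Trajectory:
    question: str
    turns: list = field(default_factory=list)

def rollout(policy, env, image, question, max_turns=12):
    """Interleave reasoning and tool calls until the agent answers or the budget runs out."""
    traj = Trajectory(question)
    for _ in range(max_turns):
        history = (image, question, traj.turns)   # h_l = {I_l, q, a_<l, o_<l}
        reasoning, call = policy.act(history)     # a_l = [z_l, c_l] ~ P(z_l|h_l) P(c_l|h_l, z_l)
        if call["name"] == "final_answer":        # terminal action a_L has no observation o_L
            traj.turns.append(Turn(reasoning, call))
            break
        observation = env.execute(call)           # o_l returned by the environment E
        traj.turns.append(Turn(reasoning, call, observation))
    return traj
```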

Methodology

1. Data Curation Pipeline

The pipeline synthesizes high-quality, tool-demanding trajectories without manual annotation in three stages:

a) High-Quality VQA Construction:

  • Wikipedia Path Sampling: Starting from the Wikipedia hyperlink graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, a constrained random walk of length $h \in \{2, 3, 4\}$ produces a path $P = v_0 \xrightarrow{\rho_1} v_1 \xrightarrow{\rho_2} \cdots \xrightarrow{\rho_h} v_h$. Nodes are assigned roles: $v_0$ (visual anchor), $v_1, \ldots, v_{h-1}$ (bridge nodes), $v_h$ (answer node). A runnable sketch follows this list.
  • Fuzzy Entity Rewriting: The canonical question $q_t$ (generated from the path) is progressively rewritten into a fuzzy counterpart $q_f$ by replacing entity names with relational descriptors, ensuring answer invariance, uniqueness, and non-leakage: $$a(q_f) = a(q_t), \qquad |\mathcal{R}(q_f)| = 1, \qquad \Big( \bigcup_{j=0}^{h} \text{aliases}(v_j) \Big) \cap q_f = \emptyset$$
  • Source-Anchor Visual Grounding: A representative image $I$ of the anchor $v_0$ (not the answer entity) is retrieved and grounded in $q_f$, yielding the final question $q$. This design reduces single-hop shortcuts.
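The constrained random walk admits a compact sketch. The toy adjacency list and restart-on-dead-end logic below are illustrative assumptions; the actual pipeline walks the full Wikipedia hyperlink graph $\mathcal{G}$ and assigns node roles as described above.

```python
import random

def sample_path(graph, hops, max_tries=100):
    """Constrained random walk of length h over a hyperlink graph.

    graph: dict mapping an entity to a list of (relation, neighbor) edges,
    a simplified stand-in for G = (V, E). Returns [v_0, ..., v_h]: v_0 is the
    visual anchor, v_1..v_{h-1} the bridge nodes, and v_h the answer node.
    """
    for _ in range(max_tries):
        start = random.choice(list(graph))
        path, visited = [start], {start}
        while len(path) <= hops:
            edges = [(r, u) for r, u in graph.get(path[-1], []) if u not in visited]
            if not edges:
                break                      # dead end: restart the walk
            _, nxt = random.choice(edges)
            path.append(nxt)
            visited.add(nxt)
        if len(path) == hops + 1:
            return path
    return None

# Toy example; the real pipeline samples h uniformly from {2, 3, 4}.
toy_graph = {
    "Eiffel Tower": [("architect", "Gustave Eiffel"), ("located_in", "Paris")],
    "Gustave Eiffel": [("born_in", "Dijon")],
    "Paris": [("capital_of", "France")],
}
print(sample_path(toy_graph, hops=2))  # e.g. ['Eiffel Tower', 'Paris', 'France']
```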

b) Filtering and Enhancement:

  • The Wikipedia-derived VQA pool is merged with open-source corpora (LiveVQA, FVQA, WebQA).
  • Staged Filtering: A frozen Qwen3-VL-32B model discards examples that are answerable 1) without any tools, or 2) with a single ImageSearch call.
  • Enhanced Subset (10%): Controlled degradations (blur, downsampling, distortion) are applied to a 10% subset to encourage "think-with-image" behavior, requiring the agent to invoke enhancement tools before retrieval; a sketch of the degradations follows.
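A minimal sketch of such controlled degradations, assuming PIL and illustrative parameter ranges (the paper's exact blur radii, scales, and distortion settings are not specified here):

```python
import random
from PIL import Image, ImageFilter

def degrade(img: Image.Image) -> Image.Image:
    """Apply one random degradation so the agent must enhance before searching."""
    op = random.choice(["blur", "downsample", "distort"])
    if op == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(1.5, 4.0)))
    if op == "downsample":
        w, h = img.size
        s = random.choice([2, 3, 4])
        small = img.resize((max(1, w // s), max(1, h // s)), Image.BILINEAR)
        return small.resize((w, h), Image.NEAREST)   # upscale back: blocky, low detail
    # "distort": a mild affine skew as a stand-in for perspective distortion
    return img.transform(img.size, Image.AFFINE, (1.0, 0.25, 0.0, 0.05, 1.0, 0.0))
```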

c) Multi-turn Trajectory Synthesis:

  • Expert trajectories are synthesized by rolling out Claude Opus 4.6 in the real tool environment for each filtered instance.
  • Rejection Sampling: Raw rollouts are vetted by a two-stage judge (sketched after this list): 1) the final answer must match the ground truth (GPT-4o judge); 2) the process must pass quality checks on tool use and consistency (GPT-5.4 judge).
  • This yields the final SearchVL-SFT-36k dataset (36,592 trajectories, avg. 6.3 tool turns).
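The two-stage vetting reduces to a simple filter; `answer_judge` and `process_judge` below are hypothetical callables wrapping the GPT-4o and GPT-5.4 judges, respectively.

```python
def filter_rollouts(rollouts, answer_judge, process_judge):
    """Two-stage rejection sampling over raw expert rollouts.

    answer_judge(traj)  -> bool: does the final answer match the gold answer?
                                 (GPT-4o in the paper)
    process_judge(traj) -> bool: is the tool use sensible and the reasoning
                                 consistent? (GPT-5.4 in the paper)
    """
    kept = []
    for traj in rollouts:
        if not answer_judge(traj):       # stage 1: outcome correctness
            continue
        if not process_judge(traj):      # stage 2: process-level quality
            continue
        kept.append(traj)
    return kept
```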

2. Tool Environment

The tool suite $\mathcal{T}$ spans three complementary functions, as summarized below:

Table 1 | The search-oriented tool suite integrated within OpenSearch-VL.

| Tool | Description | Arguments | Tool Output |
| --- | --- | --- | --- |
| TextSearch | Web search with page reading and LLM summarization | Query + TopK | Query-focused passage summaries |
| ImageSearch | Reverse image / visual entity search over the web | Image + TopK | Visual matches and related webpages |
| Sharpen | Unsharp-masking-based deblurring / detail enhancement | Image + Amount | Sharpened image |
| SuperResolution | Deep super-resolution (EDSR) for low-resolution inputs | Image + Scale | High-resolution image |
| PerspectiveCorrect | Automatic perspective rectification of skewed documents | Image | Fronto-parallel image |
| Crop | Extract a user-specified rectangular region | Image + Coordinates | Cropped image |
| OCR | Structured document parsing with text and layout labels | Image + Flags | Text blocks with labels and reading order |
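Two of the visual tools admit compact reference implementations. The PIL-based sketch below is a plausible stand-in rather than the paper's environment code (SuperResolution would wrap an EDSR model, and the search tools external APIs); the dispatch table illustrates how an environment might route tool calls.

```python
from PIL import Image, ImageFilter

def sharpen(img: Image.Image, amount: float = 1.5) -> Image.Image:
    """Unsharp masking: boost edge contrast by `amount` (1.5 -> percent=150)."""
    return img.filter(ImageFilter.UnsharpMask(radius=2, percent=int(amount * 100), threshold=3))

def crop(img: Image.Image, box: tuple) -> Image.Image:
    """Extract a rectangular region given as (left, upper, right, lower) pixels."""
    return img.crop(box)

# A tool-dispatch table of the kind an agent environment might route calls through.
TOOLS = {"Sharpen": sharpen, "Crop": crop}
```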

3. Multi-Turn Fatal-Aware GRPO Training

The training has two stages: Supervised Fine-Tuning (SFT) on the expert trajectories, followed by Reinforcement Learning (RL).

SFT Objective: The standard SFT objective maximizes the log-likelihood of expert actions:

$$\max_\theta \sum_{i=1}^{N} \sum_{l=1}^{L_i} \left[ \log P_\theta\left(z^{(i)}_l \mid h^{(i)}_l\right) + \log P_\theta\left(c^{(i)}_l \mid h^{(i)}_l, z^{(i)}_l\right) \right]$$

Observations are excluded from loss computation via a token-level generation mask $M_{\text{gen}}(y_t)$, as sketched below.
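A minimal sketch of this masked objective, assuming trajectories are already tokenized and the mask marks model-generated tokens (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def masked_sft_loss(logits: torch.Tensor, targets: torch.Tensor,
                    gen_mask: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy restricted to model-generated tokens.

    logits:   (batch, seq, vocab) model outputs
    targets:  (batch, seq) target token ids
    gen_mask: (batch, seq) M_gen(y_t) -- 1 for reasoning/tool-call tokens,
              0 for tool observations injected by the environment
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    mask = gen_mask.reshape(-1).float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```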

RL with Composite Reward & Fatal-Aware Masking:

  • Composite Trajectory Reward: $r(\tau) = r_{\text{fmt}}(\tau) \cdot \left[ \alpha\, r_{\text{acc}}(\tau) + (1-\alpha)\, r_{\text{query}}(\tau) \right]$, where $\alpha = 0.8$.
    • $r_{\text{fmt}} \in [0,1]$: Algorithmic format reward (ensures proper <tool_call>...</tool_call> structure).
    • $r_{\text{acc}} \in \{0,1\}$: Terminal accuracy reward (GPT-4o judge).
    • $r_{\text{query}} \in [0,1]$: Process-level search quality reward (GPT-5.4 judge on query relevance, progression, etc.).
  • Fatal Step Detection: The fatal step index $f_i$ for trajectory $\tau_i$ is the earliest step where $K = 3$ consecutive tool-execution errors commence.
  • Fatal-Aware Token Mask: Extends the generation mask to zero out tokens generated from the fatal step onward: $M(y_{i,t}) = M_{\text{gen}}(y_{i,t}) \cdot \mathbb{1}\left[ s(t) < f_i \right]$
  • One-Sided Advantage Clamping: To preserve useful pre-failure reasoning, the group-normalized advantage $\tilde{r}_i$ is clamped for fatal trajectories:

    $$\hat{A}_i = \begin{cases} \tilde{r}_i & \text{if } f_i = L_i + 1 \text{ (non-fatal)} \\ \max(\tilde{r}_i, 0) & \text{if } f_i \le L_i \text{ (fatal)} \end{cases}$$

    This ensures the valid prefix of a fatal trajectory is reinforced only if its partial reward exceeds the group mean.
  • Final GRPO Objective: Integrating the mask and clamped advantage (a sketch follows this list):

    $$J(\theta) = \mathbb{E}_{(I_0,q)\sim\mathcal{D},\ \{\tau_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid I_0, q; \mathcal{E})} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{\sum_t M_{i,t}} \sum_{t=1}^{|\tau_i|} M_{i,t}\, \min\left( \rho_{i,t}(\theta)\, \hat{A}_i,\ \text{clip}_{1-\epsilon}^{1+\epsilon}\left(\rho_{i,t}(\theta)\right) \hat{A}_i \right) \right]$$

    where $\rho_{i,t}(\theta)$ is the token-level importance ratio.
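The fatal-aware machinery reduces to a few lines. The sketch below covers fatal-step detection, the extended token mask, and one-sided clamping for one rollout group; names, indexing conventions, and the explicit loops are illustrative, and the clipped PPO-style update itself is omitted.

```python
import torch

def fatal_step(step_failed, K=3):
    """Earliest step index (1-based) at which K consecutive tool-execution
    errors begin; returns L + 1 if the trajectory is non-fatal."""
    run = 0
    for l, failed in enumerate(step_failed, start=1):
        run = run + 1 if failed else 0
        if run == K:
            return l - K + 1
    return len(step_failed) + 1

def clamped_advantages(rewards, fatal_idx, lengths):
    """Group-normalize rewards, then one-sided-clamp fatal trajectories."""
    r = torch.tensor(rewards, dtype=torch.float32)
    adv = (r - r.mean()) / (r.std(unbiased=False) + 1e-6)   # \tilde{r}_i
    for i, (f, L) in enumerate(zip(fatal_idx, lengths)):
        if f <= L:                                          # fatal trajectory
            adv[i] = adv[i].clamp(min=0.0)                  # max(\tilde{r}_i, 0)
    return adv

def fatal_token_mask(gen_mask, token_step, f):
    """M(y_t) = M_gen(y_t) * 1[s(t) < f]: drop tokens from the fatal step on."""
    return gen_mask * (token_step < f).float()

# Example: a group of G = 4 rollouts, one of which hit 3 consecutive failures.
failures = [[False, False, False], [False, True, True], [True, True, True], [False]]
fatal = [fatal_step(s) for s in failures]            # -> [4, 4, 1, 2]
adv = clamped_advantages([0.9, 0.4, 0.1, 0.7], fatal, [len(s) for s in failures])
```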

Empirical Validation / Results

Main Results

Table 2 | Performance on multimodal knowledge-intensive QA and web-search benchmarks.

| Model | SimpleVQA | VDR | MMSearch | LiveVQA | BrowseComp-VL | FVQA | InfoSeek | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Direct Reasoning* | | | | | | | | |
| GPT-4o | 51.7 | 1.7 | 18.7 | 28.1 | 5.5 | 48.0 | 52.9 | 29.5 |
| Qwen3-VL-30B-A3B | 53.2 | 3.8 | 18.7 | 42.7 | 29.6 | 34.7 | 26.4 | 29.9 |
| *Agentic Workflow* | | | | | | | | |
| Qwen3-VL-8B | 52.0 | 17.0 | 37.4 | 50.6 | 27.9 | 58.7 | 50.3 | 42.0 |
| OpenSearch-VL-8B (Ours) | 71.6 | 20.8 | 64.5 | 59.6 | 37.6 | 71.5 | 70.2 | 56.6 |
| Qwen3-VL-30B-A3B | 55.1 | 20.2 | 44.2 | 62.0 | 34.1 | 63.0 | 56.2 | 47.8 |
| OpenSearch-VL-30B-A3B (Ours) | 74.9 | 33.5 | 68.7 | 67.4 | 41.1 | 73.2 | 72.4 | 61.6 |
| Qwen3-VL-32B | 58.7 | 23.1 | 53.9 | 45.5 | 35.1 | 61.2 | 58.5 | 48.0 |
| OpenSearch-VL-32B (Ours) | 76.2 | 33.8 | 72.3 | 70.5 | 43.8 | 74.7 | 74.8 | 63.7 |

  • Substantial Gains: OpenSearch-VL shows clear advantages over direct-reasoning and RAG baselines. The 30B-A3B model improves the average score from 47.8 to 61.6 (+13.8) over its baseline.
  • Scale Effectiveness: Performance scales effectively from 8B to 32B models.
  • Competitive Performance: OpenSearch-VL-32B outperforms strong proprietary direct-reasoning models like Gemini-2.5-Pro on the average score.

Ablation Studies

Table 3 | Ablation studies on the SFT data pipeline and RL training recipe.

(a) SFT data pipeline ablation (full-pipeline Avg. = 64.6; deltas are relative to the full pipeline)

| Setting | SimpleVQA | InfoSeek | FVQA | Avg. |
| --- | --- | --- | --- | --- |
| w/o source-anchor grounding | 53.6 (-12.5) | 54.5 (-7.9) | 51.2 (-14.1) | 53.1 (-11.5) |
| w/o fuzzy entity rewriting | 51.7 (-14.4) | 56.4 (-6.0) | 54.7 (-10.6) | 54.3 (-10.3) |
| w/o staged filtering | 57.6 (-8.5) | 55.2 (-7.2) | 56.3 (-9.0) | 56.4 (-8.2) |
| w/o enhancement subset | 64.9 (-1.2) | 61.7 (-0.7) | 63.2 (-2.1) | 63.3 (-1.3) |

(b) RL recipe ablation (Qwen3-VL-8B, SFT-only Avg. = 64.6; deltas are relative to vanilla GRPO)

| Method | SimpleVQA | InfoSeek | FVQA | Avg. |
| --- | --- | --- | --- | --- |
| + Vanilla GRPO | 68.8 | 66.5 | 67.4 | 67.6 |
| + GRPO w/ Hard Masking | 68.3 (-0.5) | 67.9 (+1.4) | 66.9 (-0.5) | 67.7 (+0.1) |
| + GRPO w/ Fatal Masking only | 69.7 (+0.9) | 68.3 (+1.8) | 69.2 (+1.8) | 69.1 (+1.5) |
| + Fatal Masking + One-sided Clamp (Ours) | 71.6 (+2.8) | 72.4 (+5.9) | 71.5 (+4.1) | 71.8 (+4.2) |

  • Data Pipeline: Each component (source-anchor grounding, fuzzy rewriting, staged filtering) is crucial; removing any one costs 8.2-11.5 average points. The enhancement subset provides smaller but consistent gains.
  • RL Recipe: The full fatal-aware GRPO with one-sided clamping achieves the best results (71.8 avg.), a +4.2 point gain over vanilla GRPO. Fatal masking alone improves over hard-masking, and clamping recovers additional signal.

Training Dynamics and Clamping Analysis

  • Figure 3 shows that fatal-aware GRPO sustains longer tool-use trajectories while achieving higher batch accuracy than baselines.
  • Figures 4 and 5 visualize the effect of one-sided clamping: it preserves gradients for the fatal-trajectory prefixes that beat the group mean (a small, high-quality subset, ~8.2%), while zeroing out the majority (~91.8%) that fall below it, preventing noisy negative signals.

Theoretical and Practical Implications

  • Reproducibility & Open Research: By releasing data, code, and models, OpenSearch-VL lowers the barrier to entry and provides a foundation for reproducible research on multimodal deep search agents, challenging the dominance of proprietary systems.
  • Data Curation Insights: The pipeline demonstrates that preventing shortcuts (via source-anchor grounding and fuzzy rewriting) and ensuring tool-demanding complexity (via staged filtering) are essential for training effective search agents.
  • RL for Long-Horizon Tool-Use: The fatal-aware GRPO algorithm provides a principled solution to a key challenge in agentic RL: credit assignment in partially failed trajectories. The one-sided clamping mechanism is conservative by design, reinforcing the valid prefix of a fatal trajectory only when its partial reward beats the group mean and discarding the noisy negative gradients that cascading tool failures would otherwise inject.