# OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

> OpenSearch-VL provides a fully open-source recipe with data, tools, and a novel RL algorithm for training high-performance multimodal search agents, achieving a 13.8-point gain over strong baselines.

- **Source:** [arXiv](https://arxiv.org/abs/2605.05185)
- **Published:** 2026-05-08
- **Permalink:** https://picx.dev/p/GpQ3lz
- **Whiteboard:** https://picx.dev/p/GpQ3lz/image

## Summary (Overview)
* **Open Recipe:** Introduces **OpenSearch-VL**, a fully open-source recipe for training frontier multimodal deep search agents, including the release of data, code, and models.
* **Key Components:** The recipe comprises three core innovations: 1) A **data curation pipeline** that constructs high-quality, tool-demanding multi-hop Visual Question Answering (VQA) data via Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding; 2) A **diverse tool environment** unifying retrieval (text/image search), image enhancement (sharpening, super-resolution, perspective correction), and parsing (OCR, cropping) tools; 3) A **multi-turn fatal-aware GRPO** reinforcement learning algorithm that handles cascading tool failures via token masking and one-sided advantage clamping.
* **Empirical Performance:** OpenSearch-VL delivers substantial gains: the 30B-A3B model reaches an average score of **61.6** across seven benchmarks, a **+13.8 point improvement** over the Qwen3-VL-30B-A3B agentic baseline (47.8). It performs comparably to proprietary commercial models on several tasks.
* **Core Contribution:** Addresses the reproducibility bottleneck in frontier multimodal search agent research by providing transparent, high-quality training data (**SearchVL-SFT-36k**, **SearchVL-RL-8k**) and a detailed training methodology.

## Introduction and Theoretical Foundation
Multimodal deep search has become a critical capability, enabling models to solve complex questions through **active search, evidence verification, and multi-step reasoning**. However, top-tier multimodal search agents remain difficult to reproduce due to the absence of **open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes**, as these components are often proprietary in commercial systems.

The paper identifies three main challenges:
1.  **Data Bottleneck:** Effective training data must capture **image-grounded understanding, multi-hop retrieval, evidence verification, and long-horizon tool use**, not simple visual QA. Existing data often enables shortcuts.
2.  **Training Challenge:** Applying agentic Reinforcement Learning (RL) to **long-horizon tool-use settings** is difficult. A single tool failure can invalidate a rollout, wasting useful pre-failure reasoning or introducing noisy gradients.
3.  **Real-World Imperfections:** Visual inputs are often imperfect (blurred, low-resolution, skewed). Agents must perform **robust visual pre-processing** (e.g., cropping, enhancement) before reliable search can begin, a capability lacking in most retrieval-focused agents.

**Theoretical Formulation:** The problem is formalized as an agent interacting with a multimodal environment. Given an input image $I_0$ and question $q$, the agent answers by interleaving reasoning with tool calls over a tool set $\mathcal{T} = \mathcal{T}_v \cup \mathcal{T}_s$ (visual and search tools). The interaction unfolds as a multi-turn trajectory:
$$
\tau = \left\{ (h_0, a_0, o_0), (h_1, a_1, o_1), ..., (h_{L-1}, a_{L-1}, o_{L-1}), (h_L, a_L) \right\}
$$
where $h_l = \{I_l, q, a_{<l}, o_{<l}\}$ is the history, $a_l = [z_l, c_l]$ is an action (reasoning trace + tool invocation/response), and $o_l$ is an observation from the environment $\mathcal{E}$. The trajectory likelihood under policy $\pi_\theta$ is:
$$
\pi_\theta(\tau | I_0, q) = \prod_{l=0}^{L} P_\theta(a_l | h_l) = \prod_{l=0}^{L} P_\theta(z_l | h_l) P_\theta(c_l | h_l, z_l)
$$
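
The factorization above can be illustrated with a minimal sketch. The `Step` container and the probability values are hypothetical and chosen only for illustration; in practice the per-step log-probabilities would come from the policy $\pi_\theta$.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One turn of the trajectory (hypothetical container, not from the paper)."""
    logp_reasoning: float        # log P(z_l | h_l)
    logp_call: float             # log P(c_l | h_l, z_l)
    observation: Optional[str]   # o_l; None for the final answering step


def trajectory_log_likelihood(steps: List[Step]) -> float:
    """Log of the product over l = 0..L, computed as a sum of per-step log-probs."""
    return sum(s.logp_reasoning + s.logp_call for s in steps)


# Toy two-step trajectory with made-up probabilities.
steps = [
    Step(math.log(0.5), math.log(0.8), "search results ..."),
    Step(math.log(0.9), math.log(0.5), None),
]
log_p = trajectory_log_likelihood(steps)
print(round(math.exp(log_p), 2))  # 0.5 * 0.8 * 0.9 * 0.5
```

Working in log space avoids numerical underflow when trajectories contain many steps.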

## Methodology

### 1. Data Curation Pipeline
The pipeline synthesizes high-quality, tool-demanding trajectories without manual annotation in three stages:

**a) High-Quality VQA Construction:**
*   **Wikipedia Path Sampling:** Starting from the Wikipedia hyperlink graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, a constrained random walk of length $h \in \{2,3,4\}$ produces a path $P = v_0 \xrightarrow{\rho_1} v_1 \xrightarrow{\rho_2} ... \xrightarrow{\rho_h} v_h$. Nodes are assigned roles: $v_0$ (visual anchor), $v_1,...,v_{h-1}$ (bridge nodes), $v_h$ (answer node).
*   **Fuzzy Entity Rewriting:** The canonical question $q_t$ (generated from the path) is progressively rewritten into a fuzzy counterpart $q_f$ by replacing entity names with relational descriptors, ensuring **answer invariance, uniqueness, and non-leakage**:
    $$a(q_f) = a(q_t), \quad |\mathcal{R}(q_f)| = 1, \quad \left( \bigcup_{j=0}^{h} \text{aliases}(v_j) \right) \cap q_f = \emptyset$$
*   **Anchor-aware Visual Grounding:** A representative image $I$ of the **anchor $v_0$** (not the answer entity) is retrieved and grounded in $q_f$, yielding the final question $q$. This design **reduces single-hop shortcuts**.
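
The path sampling, role assignment, and non-leakage check can be sketched roughly as follows. This is an illustrative toy, not the authors' pipeline: the only walk constraint modeled is "never revisit a node", the leakage test is a plain substring match, and the graph, alias table, and function names are all hypothetical.

```python
import random
from typing import Dict, List, Optional


def sample_path(graph: Dict[str, List[str]], start: str, hops: int,
                rng: random.Random) -> Optional[List[str]]:
    """Random walk of `hops` edges that never revisits a node (assumed constraint)."""
    path = [start]
    for _ in range(hops):
        candidates = [n for n in graph.get(path[-1], []) if n not in path]
        if not candidates:
            return None  # dead end: caller should resample
        path.append(rng.choice(candidates))
    return path


def assign_roles(path: List[str]) -> Dict[str, object]:
    """v_0 is the visual anchor, v_h the answer, everything between is a bridge."""
    return {"anchor": path[0], "bridges": path[1:-1], "answer": path[-1]}


def leaks_entity(question: str, aliases: Dict[str, List[str]],
                 path: List[str]) -> bool:
    """True if any alias of any path node appears verbatim in the question."""
    q = question.lower()
    return any(alias.lower() in q
               for node in path
               for alias in aliases.get(node, [node]))


# Toy hyperlink graph with a single outgoing link per node.
graph = {"Eiffel Tower": ["Gustave Eiffel"], "Gustave Eiffel": ["Dijon"], "Dijon": []}
aliases = {"Eiffel Tower": ["Eiffel Tower", "Tour Eiffel"],
           "Gustave Eiffel": ["Gustave Eiffel", "Eiffel"],
           "Dijon": ["Dijon"]}

path = sample_path(graph, "Eiffel Tower", hops=2, rng=random.Random(0))
roles = assign_roles(path)
fuzzy_q = "In which city was the engineer behind the pictured structure born?"
print(roles, leaks_entity(fuzzy_q, aliases, path))
```

The fuzzy question passes the check because it names none of the path entities, so the agent has to identify the anchor from the image and follow the bridge before it can answer.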

**b) Filtering and Enhancement:**
*   The Wikipedia-derived VQA pool is merged with open-source corpora (LiveVQA, FVQA, WebQA).
*   **Staged Filtering:** A frozen Qwen3-VL-32B model is used to discard examples answerable 1) without tools, and 2) with a single `ImageSearch` call.
*   **Enhanced Subset (10%):** Controlled degradations (blur, downsampling, distortion) are applied to encourage "think-with-image" behavior, requiring the agent to use enhancement tools before retrieval.
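
A minimal sketch of the staged filter, assuming two hypothetical solver callables that stand in for the frozen model without tools and with a single `ImageSearch` call. Exact string match is used as the correctness check purely for illustration; the source does not specify how correctness is judged at this stage.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]
Solver = Callable[[Example], str]


def staged_filter(pool: List[Example],
                  solve_no_tools: Solver,
                  solve_one_image_search: Solver) -> List[Example]:
    """Keep only examples that neither solver answers correctly."""
    kept = []
    for ex in pool:
        if solve_no_tools(ex) == ex["answer"]:
            continue  # stage 1: answerable from parametric knowledge alone
        if solve_one_image_search(ex) == ex["answer"]:
            continue  # stage 2: answerable with a single ImageSearch call
        kept.append(ex)
    return kept


# Toy pool and lookup-table solvers (stand-ins for real model calls).
pool = [{"id": "a", "answer": "Paris"},
        {"id": "b", "answer": "Dijon"},
        {"id": "c", "answer": "Lyon"}]
no_tools = {"a": "Paris"}
one_search = {"b": "Dijon"}

kept = staged_filter(pool,
                     lambda ex: no_tools.get(ex["id"], ""),
                     lambda ex: one_search.get(ex["id"], ""))
print([ex["id"] for ex in kept])
```

Running the cheaper no-tool check first means the tool-augmented solver is only invoked on examples that survive stage 1.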

**c) Multi-turn Trajectory Synthesis:**
*   Expert trajectories are synthesized by rolling out **Claude Opus 4.6** in the real tool environment for each filtered instance.
*   **Rejection Sampling:** Raw rollouts are vetted by a two-stage judge: 1) Final answer must match ground truth (GPT-4o judge), 2) Process-level quality (GPT-5.4 judge on tool-use, consistency).
*   This yields the final **SearchVL-SFT-36k** dataset (36,592 trajectories, avg. 6.3 tool turns).
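
The two-stage rejection sampling can be sketched as below. Both judges are hypothetical callables standing in for the LLM judges named above; the toy process check (no duplicated tool calls) is an assumption for illustration and not the paper's rubric.

```python
from typing import Any, Callable, Dict, List

Rollout = Dict[str, Any]
Judge = Callable[[Rollout], bool]


def rejection_sample(rollouts: List[Rollout],
                     answer_judge: Judge,
                     process_judge: Judge) -> List[Rollout]:
    """Accept a rollout only if it passes the answer judge, then the process judge."""
    accepted = []
    for r in rollouts:
        if not answer_judge(r):
            continue  # stage 1: final answer must match ground truth
        if process_judge(r):
            accepted.append(r)  # stage 2: process-level quality
    return accepted


def average_tool_turns(rollouts: List[Rollout]) -> float:
    """Mean number of tool calls per accepted trajectory."""
    return sum(len(r["tool_calls"]) for r in rollouts) / len(rollouts)


rollouts = [
    {"answer": "Dijon", "gold": "Dijon", "tool_calls": ["ImageSearch", "TextSearch"]},
    {"answer": "Dijon", "gold": "Dijon", "tool_calls": ["TextSearch", "TextSearch"]},
    {"answer": "Paris", "gold": "Dijon", "tool_calls": ["ImageSearch"]},
]
accepted = rejection_sample(
    rollouts,
    answer_judge=lambda r: r["answer"] == r["gold"],
    process_judge=lambda r: len(set(r["tool_calls"])) == len(r["tool_calls"]),
)
print(len(accepted), average_tool_turns(accepted))
```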

### 2. Tool Environment
The tool suite $\mathcal{T}$ spans three complementary functions, as summarized below:

**Table 1 | The search-oriented tool suite integrated within OpenSearch-VL.**
| Tool | Description | Arguments | Tool Output |
| :--- | :--- | :--- | :--- |
| **TextSearch** | Web search with page reading and LLM summarization | Query + TopK | Query-focused passage summaries |
| **ImageSearch** | Reverse image / visual entity search over the web | Image + TopK | Visual matches and related webpages |
| **Sharpen** | Unsharp-masking based deblurring / detail enhancement | Image + Amount | Sharpened image |
| **SuperResolution** | Deep super-resolution (EDSR) for low-resolution inputs | Image + Scale | High-resolution image |
| **PerspectiveCorrect** | Auto perspective rectification of skewed documents | Image | Fronto-parallel image |
| **Crop** | Extract a user-specified rectangular region | Image + Coordinates | Cropped image |
| **OCR** | Structured document parsing with text and layout labels | Image + Flags | Text blocks with labels and reading order |
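
One way to picture how such a tool suite is exposed to an agent is a small registry that maps tool names to callables and converts failures into error observations. The sketch below is an assumption-laden toy: only `Crop` is implemented, images are nested lists, the `(left, top, right, bottom)` box format and the observation dictionary are hypothetical, and none of this reflects the released code.

```python
from typing import Any, Callable, Dict, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Extract the rectangle given as (left, top, right, bottom), end-exclusive."""
    left, top, right, bottom = box
    if not (0 <= left < right <= len(image[0]) and 0 <= top < bottom <= len(image)):
        raise ValueError("crop box outside image bounds")
    return [row[left:right] for row in image[top:bottom]]


TOOLS: Dict[str, Callable[..., Any]] = {"Crop": crop}


def execute(name: str, **kwargs: Any) -> Dict[str, Any]:
    """Run a tool and return an observation; failures become error observations."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "output": tool(**kwargs)}
    except (TypeError, ValueError) as exc:
        return {"ok": False, "error": str(exc)}


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
good = execute("Crop", image=image, box=(1, 0, 3, 2))
bad = execute("Crop", image=image, box=(0, 0, 5, 5))
print(good, bad)
```

Returning a structured error instead of raising keeps the rollout alive, which is what allows consecutive tool-execution errors to be counted per step during RL.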

### 3. Multi-Turn Fatal-Aware GRPO Training
The training has two stages: **Supervised Fine-Tuning (SFT)** on the expert trajectories, followed by **Reinforcement Learning (RL)**.

**SFT Objective:** The standard SFT objective maximizes the log-likelihood of expert actions:
$$
\max_\theta \sum_{i=1}^{N} \sum_{l=0}^{L_i} \left[ \log P_\theta\left(z^{(i)}_l | h^{(i)}_l\right) + \log P_\theta\left(c^{(i)}_l | h^{(i)}_l, z^{(i)}_l\right) \right]
$$
Observations are excluded from loss computation via a **token-level generation mask** $M_{\text{gen}}(y_t)$.
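
The effect of the generation mask can be shown with a small sketch, assuming per-token log-probabilities are already available. The function name and toy values are hypothetical.

```python
import math
from typing import List


def masked_sft_loss(token_logps: List[float], gen_mask: List[int]) -> float:
    """Negative log-likelihood summed over agent-generated tokens only.

    gen_mask[t] is 1 for tokens the agent produced (reasoning, tool calls)
    and 0 for environment observations, which contribute nothing to the loss.
    """
    return -sum(lp * m for lp, m in zip(token_logps, gen_mask))


# Three tokens: agent, observation, agent (made-up probabilities).
token_logps = [math.log(0.5), math.log(0.25), math.log(0.5)]
gen_mask = [1, 0, 1]
loss = masked_sft_loss(token_logps, gen_mask)
print(round(loss, 4))  # equals 2 * ln(2); the observation token is ignored
```

Without the mask, the model would also be trained to predict tool outputs, which it does not control at inference time.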

**RL with Composite Reward & Fatal-Aware Masking:**
*   **Composite Trajectory Reward:** $r(\tau) = r_{\text{fmt}}(\tau) \cdot \left[ \alpha r_{\text{acc}}(\tau) + (1-\alpha) r_{\text{query}}(\tau) \right]$, where $\alpha=0.8$.
    *   $r_{\text{fmt}} \in [0,1]$: Algorithmic format reward (ensures proper `<tool_call>...</tool_call>` structure).
    *   $r_{\text{acc}} \in \{0,1\}$: Terminal accuracy reward (GPT-4o judge).
    *   $r_{\text{query}} \in [0,1]$: Process-level search quality reward (GPT-5.4 judge on query relevance, progression, etc.).
*   **Fatal Step Detection:** The fatal step index $f_i$ for trajectory $\tau_i$ is the earliest step where $K=3$ consecutive tool-execution errors commence.
*   **Fatal-Aware Token Mask:** Extends the generation mask to zero out every token generated *at or after* the fatal step, where $s(t)$ denotes the index of the step containing token $t$:
    $$M(y_{i,t}) = M_{\text{gen}}(y_{i,t}) \cdot \mathbb{1}\left[ s(t) < f_i \right]$$
*   **One-Sided Advantage Clamping:** To preserve useful pre-failure reasoning, the group-normalized advantage $\tilde{r}_i$ is clamped for fatal trajectories:
    $$\hat{A}_i = \begin{cases} \tilde{r}_i & \text{if } f_i = L_i + 1 \text{ (non-fatal)} \\ \max(\tilde{r}_i, 0) & \text{if } f_i \le L_i \text{ (fatal)} \end{cases}$$
    This ensures the valid prefix of a fatal trajectory is reinforced only if its partial reward exceeds the group mean; otherwise its advantage is zero and it contributes no gradient.
*   **Final GRPO Objective:** Integrating the mask and clamped advantage:
    $$J(\theta) = \mathbb{E}_{(I_0,q)\sim\mathcal{D}, \{\tau_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot|I_0,q;\mathcal{E})} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{\sum_t M_{i,t}} \sum_{t=1}^{|\tau_i|} M_{i,t} \min\left( \rho_{i,t}(\theta) \hat{A}_i, \text{clip}_{1-\epsilon}^{1+\epsilon}(\rho_{i,t}(\theta)) \hat{A}_i \right) \right]$$
    where $\rho_{i,t}(\theta)$ is the token-level importance ratio.
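
The pieces above can be tied together in a minimal pure-Python sketch. Several details are assumptions rather than facts from the source: group normalization is taken to be the usual mean/standard-deviation form, the clip range `eps=0.2` is a placeholder, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def composite_reward(r_fmt: float, r_acc: float, r_query: float,
                     alpha: float = 0.8) -> float:
    """r = r_fmt * (alpha * r_acc + (1 - alpha) * r_query)."""
    return r_fmt * (alpha * r_acc + (1 - alpha) * r_query)


def fatal_step(step_failed: List[bool], k: int = 3) -> int:
    """First step that starts k consecutive tool errors; len(step_failed) if none."""
    for i in range(len(step_failed) - k + 1):
        if all(step_failed[i:i + k]):
            return i
    return len(step_failed)  # plays the role of f_i = L_i + 1 (non-fatal)


def fatal_aware_mask(gen_mask: List[int], token_step: List[int], f: int) -> List[int]:
    """Zero the generation mask for tokens whose step index is >= f."""
    return [m if s < f else 0 for m, s in zip(gen_mask, token_step)]


def clamped_advantages(rewards: List[float], is_fatal: List[bool],
                       eps: float = 1e-6) -> List[float]:
    """Group-normalize rewards (assumed mean/std), then clamp fatal ones at zero."""
    mu, sigma = mean(rewards), pstdev(rewards)
    adv = [(r - mu) / (sigma + eps) for r in rewards]
    return [max(a, 0.0) if fatal else a for a, fatal in zip(adv, is_fatal)]


def clipped_term(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Per-token PPO-style clipped surrogate used inside the GRPO objective."""
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * adv, clipped * adv)


f = fatal_step([False, True, True, True, False])             # fatal at step 1
mask = fatal_aware_mask([1, 0, 1, 1, 1], [0, 0, 1, 2, 3], f)  # only step 0 survives
adv = clamped_advantages([0.9, 0.0, 0.5, 0.1], [False, False, True, True])
print(f, mask, [round(a, 2) for a in adv])
```

In the toy group, the third trajectory is fatal but beats the group mean, so its prefix keeps a positive advantage; the fourth is fatal and below the mean, so its advantage is clamped to zero rather than penalizing reasoning that preceded the tool failure.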

## Empirical Validation / Results

### Main Results
**Table 2 | Performance on multimodal knowledge-intensive QA and web-search benchmarks.**
| Model | SimpleVQA | VDR | MMSearch | LiveVQA | BrowseComp-VL | FVQA | InfoSeek | **Avg** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Direct Reasoning** | | | | | | | | |
| GPT-4o | 51.7 | 1.7 | 18.7 | 28.1 | 5.5 | 48.0 | 52.9 | 29.5 |
| Qwen3-VL-30B-A3B | 53.2 | 3.8 | 18.7 | 42.7 | 29.6 | 34.7 | 26.4 | 29.9 |
| **Agentic Workflow** | | | | | | | | |
| Qwen3-VL-8B | 52.0 | 17.0 | 37.4 | 50.6 | 27.9 | 58.7 | 50.3 | 42.0 |
| **OpenSearch-VL-8B (Ours)** | **71.6** | **20.8** | 64.5 | **59.6** | **37.6** | **71.5** | **70.2** | **56.6** |
| Qwen3-VL-30B-A3B | 55.1 | 20.2 | 44.2 | 62.0 | 34.1 | 63.0 | 56.2 | 47.8 |
| **OpenSearch-VL-30B-A3B (Ours)** | **74.9** | **33.5** | **68.7** | **67.4** | **41.1** | **73.2** | **72.4** | **61.6** |
| Qwen3-VL-32B | 58.7 | 23.1 | 53.9 | 45.5 | 35.1 | 61.2 | 58.5 | 48.0 |
| **OpenSearch-VL-32B (Ours)** | **76.2** | **33.8** | **72.3** | **70.5** | **43.8** | **74.7** | **74.8** | **63.7** |

*   **Substantial Gains:** OpenSearch-VL shows clear advantages over direct-reasoning and RAG baselines. The 30B-A3B model improves the average score from **47.8 to 61.6** (+13.8) over its baseline.
*   **Scale Effectiveness:** Performance scales effectively from 8B to 32B models.
*   **Competitive Performance:** OpenSearch-VL-32B outperforms strong proprietary direct-reasoning models like Gemini-2.5-Pro on the average score.
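
As a quick arithmetic check, the reported averages and the +13.8 gain can be recomputed from the per-benchmark scores in Table 2. The snippet below only re-derives numbers already shown in the table; the dictionary keys are informal labels.

```python
# Per-benchmark scores copied from Table 2, in column order:
# SimpleVQA, VDR, MMSearch, LiveVQA, BrowseComp-VL, FVQA, InfoSeek.
scores = {
    "Qwen3-VL-30B-A3B (agentic)": [55.1, 20.2, 44.2, 62.0, 34.1, 63.0, 56.2],
    "OpenSearch-VL-30B-A3B": [74.9, 33.5, 68.7, 67.4, 41.1, 73.2, 72.4],
}

# Unweighted mean over the seven benchmarks, rounded to one decimal place.
avg = {name: round(sum(v) / len(v), 1) for name, v in scores.items()}
gain = round(avg["OpenSearch-VL-30B-A3B"] - avg["Qwen3-VL-30B-A3B (agentic)"], 1)
print(avg, gain)
```

The recomputed values agree with the Avg column, which confirms that it is an unweighted mean over the seven benchmarks.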

### Ablation Studies
**Table 3 | Ablation studies on the SFT data pipeline and RL training recipe.**
**(a) SFT data pipeline ablation (Full pipeline Avg. = 64.6)**
| Settings | SimpleVQA | InfoSeek | FVQA | **Avg.** |
| :--- | :---: | :---: | :---: | :---: |
| w/o source-anchor grounding | 53.6 (-12.5) | 54.5 (-7.9) | 51.2 (-14.1) | **53.1 (-11.5)** |
| w/o fuzzy entity rewriting | 51.7 (-14.4) | 56.4 (-6.0) | 54.7 (-10.6) | **54.3 (-10.3)** |
| w/o staged filtering | 57.6 (-8.5) | 55.2 (-7.2) | 56.3 (-9.0) | **56.4 (-8.2)** |
| w/o enhancement subset | 64.9 (-1.2) | 61.7 (-0.7) | 63.2 (-2.1) | **63.3 (-1.3)** |

**(b) RL recipe ablation (Qwen3-VL-8B + SFT only Avg. = 64.6)**
| Method | SimpleVQA | InfoSeek | FVQA | **Avg.** |
| :--- | :---: | :---: | :---: | :---: |
| + Vanilla GRPO | 68.8 | 66.5 | 67.4 | **67.6** |
| + GRPO w/ Hard Masking | 68.3 (-0.5) | 67.9 (+1.4) | 66.9 (-0.5) | **67.7 (+0.1)** |
| + GRPO w/ Fatal Masking only | 69.7 (+0.9) | 68.3 (+1.8) | 69.2 (+1.8) | **69.1 (+1.5)** |
| + **Fatal Masking + One-sided Clamp (Ours)** | **71.6 (+2.8)** | **72.4 (+5.9)** | **71.5 (+4.1)** | **71.8 (+4.2)** |

*   **Data Pipeline:** Each component (source-anchor grounding, fuzzy rewriting, staged filtering) is crucial, with removals causing large drops (8-11 points). The enhancement subset provides smaller but consistent gains.
*   **RL Recipe:** The full fatal-aware GRPO with one-sided clamping achieves the best results (**71.8 avg.**), a **+4.2 point gain** over vanilla GRPO. Fatal masking alone improves over hard-masking, and clamping recovers additional signal.

### Training Dynamics and Clamping Analysis
*   **Figure 3** shows that fatal-aware GRPO sustains longer tool-use trajectories while achieving higher batch accuracy than baselines.
*   **Figures 4 & 5** visualize the effect of one-sided clamping. It preserves gradients for the fatal trajectory prefixes that beat the group mean (a small, high-quality subset of about 8.2%), while zeroing out the remaining 91.8% that fall below the mean, which prevents noisy negative signals.

## Theoretical and Practical Implications
*   **Reproducibility & Open Research:** By releasing data, code, and models, OpenSearch-VL lowers the barrier to entry and provides a foundation for reproducible research on multimodal deep search agents, challenging the dominance of proprietary systems.
*   **Data Curation Insights:** The pipeline demonstrates that **preventing shortcuts** (via source-anchor grounding and fuzzy rewriting) and **ensuring tool-demanding complexity** (via staged filtering) are essential for training effective search agents.
*   **RL for Long-Horizon Tool-Use:** The fatal-aware GRPO algorithm provides a principled solution to a key challenge in agentic RL: **credit assignment in partially failed trajectories**. The one-sided clamping mechanism is theoretically justified as being **weak
