DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

Summary (Overview)

Core Contribution: Introduces DR-Venus, a frontier 4B-parameter deep research agent built entirely on roughly 10K open-data trajectories, demonstrating that careful improvement of data quality and utilization can unlock strong agentic capability in small models.
Key Methodology: A two-stage training recipe: 1) Agentic SFT with strict data cleaning and turn-aware resampling to emphasize long-horizon trajectories; 2) Agentic RL using IGPO (Information Gain-based Policy Optimization) with turn-level rewards for dense, format-aware supervision.
Main Results: DR-Venus-4B significantly outperforms prior open-source agentic models at or below 9B parameters across six benchmarks (e.g., BrowseComp, GAIA) and narrows the gap to much larger 30B-class systems.
Critical Insights: 1) Long-horizon trajectory resampling substantially improves SFT effectiveness; 2) Turn-level RL with IGPO is far more effective than sparse trajectory-level optimization; 3) Pass@K analysis reveals a surprisingly high capability ceiling for 4B agents, highlighting the value of test-time scaling.
Tool Use Analysis: Successful trajectories consistently rely more on browsing than failed ones, and RL further calibrates tool use toward more effective evidence acquisition.

Introduction and Theoretical Foundation

Edge-scale deep research agents based on small language models are attractive for real-world deployment due to advantages in cost, latency, and privacy. However, most existing systems are built on larger models with closed data or complex pipelines, leaving the frontier of small agents under open-data settings unexplored.

This work addresses a central question: how can a strong small deep research agent be trained under limited open-data supervision? The authors argue this is fundamentally a problem of improving both data quality and data utilization. Small models are more sensitive to noisy trajectories and formatting artifacts, making data quality crucial. Furthermore, their limited capability makes agentic RL challenging: rollout groups often contain no successful trajectories, so every rollout in the group receives the same outcome reward, the group-relative advantages vanish ("advantage collapse"), and training efficiency drops.

Motivated by this, the paper presents DR-Venus, a 4B frontier edge-scale deep research agent. The training recipe is designed to maximize the value of limited open data through a two-stage pipeline focused on quality and utilization.

Problem Formulation: A Deep Research Agent is formulated as a language model-based policy that solves complex information-seeking tasks through long-horizon interaction with an external environment $E$ equipped with executable actions $A$ (search, browse, answer). Each interaction turn $t$ consists of intermediate reasoning $\tau_t$ and an action $a_t$:

$u_t = (\tau_t, a_t) \sim \pi_\theta(\cdot \mid h_{\le t-1})$

The resulting interaction trajectory is:

$H = (q, (u_1, o_1), (u_2, o_2), \ldots, (u_{T-1}, o_{T-1}), u_T)$
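To make the formulation concrete, the following minimal Python sketch rolls out one trajectory $H$. The `Turn` and `Trajectory` classes, the `rollout` function, the `max_turns` cap, and the toy policy and environment are all hypothetical names introduced for illustration; the source does not describe an implementation at this level.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Turn:
    reasoning: str  # tau_t: intermediate reasoning
    action: str     # a_t: "search", "browse", or "answer"
    argument: str   # query, URL, or final answer text


@dataclass
class Trajectory:
    question: str                                                 # q
    steps: List[Tuple[Turn, str]] = field(default_factory=list)  # (u_t, o_t) pairs
    final: Optional[Turn] = None                                  # u_T


def rollout(question, policy, env, max_turns=8):
    """Sample u_t from the policy given the history, then observe o_t."""
    traj = Trajectory(question)
    for _ in range(max_turns):
        turn = policy(question, traj.steps)
        if turn.action == "answer":   # the answer turn u_T has no observation
            traj.final = turn
            break
        traj.steps.append((turn, env(turn)))
    return traj


def toy_policy(question, history):
    """Hypothetical stand-in for pi_theta: search, then browse, then answer."""
    plan = ["search", "browse", "answer"]
    action = plan[min(len(history), 2)]
    return Turn(reasoning=f"step {len(history) + 1}", action=action, argument=question)


def toy_env(turn):
    """Hypothetical stand-in for the environment E."""
    return f"observation for {turn.action}"


trajectory = rollout("example question", toy_policy, toy_env)
print(len(trajectory.steps), trajectory.final.action)  # prints: 2 answer
```

In this sketch the history $h_{\le t-1}$ is simply the list of earlier (turn, observation) pairs together with the question.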

Methodology

The overall training recipe consists of two complementary stages.

2.2 Building Basic Agentic Capability with Supervised Fine-Tuning

Data Filtering and Construction: The SFT data is built from 10,001 raw REDSearcher trajectories through a four-step pipeline:

Environment Alignment: Convert all trajectories to match the online inference pipeline format.
Disallowed-Tool Pruning & Duplicate Removal: Remove non-search/browse tool calls (e.g., Python-Interpreter) and duplicate search/browse events.
Correctness Filtering: Retain only trajectories with correct final answers, judged by Qwen3-235B-A22B-Instruct-2507 (93.65% retention).
Turn-Aware Resampling: Upweight longer trajectories to improve utilization of long-horizon interactions. Sampling weights: $1\times$ (0–50 turns), $2\times$ (51–100 turns), $5\times$ (>100 turns). This increases the final training set from 9,365 to 18,745 instances.
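The correctness filter and the turn-aware weights above can be sketched as follows. This is a minimal illustration that assumes upweighting is done by deterministic duplication; the source gives only the weights, so the duplication strategy, the dictionary keys (`num_turns`, `correct`), and the function names are assumptions.

```python
from typing import Dict, List


def resample_weight(num_turns: int) -> int:
    """Sampling weight from the summary: 1x (0-50), 2x (51-100), 5x (>100 turns)."""
    if num_turns <= 50:
        return 1
    if num_turns <= 100:
        return 2
    return 5


def turn_aware_resample(trajectories: List[Dict]) -> List[Dict]:
    """Drop incorrect trajectories, then repeat each kept one by its weight."""
    resampled = []
    for traj in trajectories:
        if not traj["correct"]:  # correctness filtering (judged upstream)
            continue
        resampled.extend([traj] * resample_weight(traj["num_turns"]))
    return resampled


# Toy data, not taken from REDSearcher.
toy = [
    {"id": "a", "num_turns": 12, "correct": True},
    {"id": "b", "num_turns": 75, "correct": True},
    {"id": "c", "num_turns": 140, "correct": True},
    {"id": "d", "num_turns": 30, "correct": False},
]
resampled = turn_aware_resample(toy)
print(len(resampled))  # prints: 8  (1 + 2 + 5 copies; "d" is filtered out)
```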

Agentic Supervised Fine-Tuning: The model is trained on serialized trajectories using a standard next-token prediction objective, masking non-assistant (environment observation) tokens.

$\mathcal{L}_{\text{SFT}}(\theta) = - \sum_{H \in \mathcal{D}_{\text{SFT}}} \sum_{i \in \mathcal{M}(H)} \log \pi_\theta(x_i \mid x_{<i}) \tag{1}$

where $\mathcal{M}(H)$ denotes agent-generated token positions.
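For a single trajectory, the masked objective in Eq. (1) reduces to summing the negative log-likelihood over agent-generated positions only. The sketch below uses plain Python lists in place of model outputs; the function name and the toy numbers are illustrative assumptions.

```python
import math
from typing import List


def masked_sft_loss(token_log_probs: List[float], agent_mask: List[bool]) -> float:
    """Negative log-likelihood summed over positions in M(H) for one trajectory.

    token_log_probs[i] stands for log pi_theta(x_i | x_<i).
    agent_mask[i] is True for agent-generated tokens and False for
    environment observations, which contribute nothing to the loss.
    """
    return -sum(lp for lp, keep in zip(token_log_probs, agent_mask) if keep)


# Toy example: the second token is an environment observation and is masked out.
log_probs = [math.log(0.5), math.log(0.9), math.log(0.25), math.log(0.8)]
mask = [True, False, True, True]
loss = masked_sft_loss(log_probs, mask)
print(round(loss, 4))  # prints: 2.3026, i.e. -log(0.5 * 0.25 * 0.8) = log(10)
```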

2.3 Pushing Toward Frontier Performance with Reinforcement Learning

To address residual failure modes (formatting errors, redundant reasoning), agentic RL is applied using Information Gain-based Policy Optimization (IGPO), which constructs dense turn-level reward signals to improve data efficiency.

Turn-Level Reward Design:

Information Gain (IG) Reward: Measures how much a turn increases the policy's probability of generating the ground truth $g$. The log probability assigned to $g$ after turn $t$ is:
$\log \pi_\theta(g | h_{i,\le t}) = \frac{1}{L} \sum_{j=1}^{L} \log \pi_\theta(g_j | h_{i,\le t}, g_{<j}) \tag{2}$
The IG reward for turn $t$ is:
$r^{\text{IG}}_{i,t} = \log \pi_\theta(g | h_{i,\le t}) - \log \pi_\theta(g | h_{i,\le t-1}), \quad 1 \le t < T \tag{3}$
Optional Browse-Aware IG Assignment: Compute IG rewards only on browse turns and assign them to the browse turn and all preceding search turns since the last browse.
Turn-Level Format Penalty: Enables fine-grained control over formatting, replacing the original reward with a penalty for malformed turns.
$\hat{r}_{i,t} = \begin{cases} r_{i,t}, & \text{if format at turn } t \text{ is valid} \\ -\lambda_{\text{fmt}}, & \text{otherwise} \end{cases} \tag{4}$
Normalization and Discounted Accumulation: Rewards are normalized within each rollout group and optionally rebalanced using an IG-Scale factor to align IG and outcome reward magnitudes. The scaling factor is:
$s = \min\left( \frac{\max(M^O, \eta)}{M^{\text{IG}} + \delta}, s_{\max} \right), \quad \eta=0.3, \delta=10^{-8}, s_{\max}=10 \tag{7}$
The final turn-level discounted cumulative reward used for optimization is:
$\tilde{R}_{i,t} = \sum_{k=t}^{T_i} \gamma^{k-t} \bar{r}_{i,k} \tag{9}$
where $\gamma$ is the discount factor and $\bar{r}_{i,k}$ is the scaled/normalized reward.
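The reward steps in Eqs. (2), (3), (4), (7), and (9) can be traced on a toy trajectory with the sketch below. Several points are assumptions rather than statements from the source: the final turn is given an outcome reward of 1.0, group-level normalization is omitted because its exact form is not given here, and all function names and numeric inputs are hypothetical.

```python
from typing import List


def mean_log_prob(token_log_probs: List[float]) -> float:
    """Eq. (2): length-normalized log-probability of the ground truth g."""
    return sum(token_log_probs) / len(token_log_probs)


def ig_rewards(answer_log_probs: List[float]) -> List[float]:
    """Eq. (3): answer_log_probs[t] is log pi(g | h_<=t) for t = 0..T-1."""
    return [after - before
            for before, after in zip(answer_log_probs, answer_log_probs[1:])]


def apply_format_penalty(rewards, format_valid, lambda_fmt):
    """Eq. (4): replace the reward of every malformed turn with -lambda_fmt."""
    return [r if ok else -lambda_fmt for r, ok in zip(rewards, format_valid)]


def ig_scale(m_outcome, m_ig, eta=0.3, delta=1e-8, s_max=10.0):
    """Eq. (7): factor that aligns IG and outcome reward magnitudes."""
    return min(max(m_outcome, eta) / (m_ig + delta), s_max)


def discounted_returns(rewards, gamma):
    """Eq. (9): R_t = sum over k >= t of gamma^(k-t) * r_k, computed backwards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# Toy trajectory with T = 4 turns: three tool turns plus the answer turn.
log_probs = [-4.0, -3.0, -3.5, -1.0]         # log pi(g | h_<=t), t = 0..3
ig = ig_rewards(log_probs)                    # [1.0, -0.5, 2.5]
penalized = apply_format_penalty(ig, [True, False, True], lambda_fmt=1.0)
rewards = penalized + [1.0]                   # assumed outcome reward on turn T
returns = discounted_returns(rewards, gamma=0.5)
print(ig, penalized, returns)
# prints: [1.0, -0.5, 2.5] [1.0, -1.0, 2.5] [1.25, 0.5, 3.0, 1.0]
print(ig_scale(1.0, 0.05))                    # prints: 10.0 (capped at s_max)
```

Note how the malformed second turn loses its IG reward entirely, and how discounting lets the large gain at the third turn flow back to earlier turns.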

Policy Optimization via IGPO: The objective builds on GRPO-style optimization with turn-level credit assignment.

$J_{\text{IGPO}}(\theta) = \mathbb{E}_{\{H_i\}_{i=1}^G} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|u_i|} \sum_{k=1}^{|u_i|} \min\left( \frac{\pi_\theta(u_{i,k} \mid c_{i,k})}{\pi_{\theta_{\text{old}}}(u_{i,k} \mid c_{i,k})} \tilde{R}_{i,k}, \text{clip}\left( \frac{\pi_\theta(u_{i,k} \mid c_{i,k})}{\pi_{\theta_{\text{old}}}(u_{i,k} \mid c_{i,k})}, 1-\epsilon, 1+\epsilon \right) \tilde{R}_{i,k} \right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right] \tag{10}$
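The clipped surrogate inside Eq. (10) can be sketched in a few lines. The sketch reads $u_{i,k}$ as the $k$-th agent-generated token of trajectory $i$ and $\tilde{R}_{i,k}$ as the discounted reward of the turn that token belongs to; that reading, the value $\epsilon = 0.2$, the omission of the KL term, and the function names are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

# One token: (log-prob under pi_theta, log-prob under pi_theta_old, reward R~)
Token = Tuple[float, float, float]


def clipped_term(logp_new: float, logp_old: float, reward: float,
                 eps: float = 0.2) -> float:
    """min(ratio * R, clip(ratio, 1 - eps, 1 + eps) * R) for a single token."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * reward, clipped_ratio * reward)


def igpo_surrogate(group: List[List[Token]], eps: float = 0.2) -> float:
    """Average over the G trajectories of the per-token mean clipped term.

    The KL penalty from Eq. (10) is left out of this sketch.
    """
    total = 0.0
    for tokens in group:
        total += sum(clipped_term(n, o, r, eps) for n, o, r in tokens) / len(tokens)
    return total / len(group)


# With ratio 1.5 and a positive reward, the clipped branch (1.2 * R) is selected;
# with a negative reward, the unclipped branch (1.5 * R) is the smaller value.
print(round(clipped_term(math.log(1.5), 0.0, 2.0), 4))   # prints: 2.4
print(round(clipped_term(math.log(1.5), 0.0, -2.0), 4))  # prints: -3.0
```

The asymmetry in the two printed values is the usual effect of the `min`: clipping limits how much a positively rewarded token can be pushed up, but does not soften the penalty when the reward is negative.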

Empirical Validation / Results

Experimental Settings:

Training Data: SFT on cleaned/resampled REDSearcher trajectories (~10K). RL on 1K query-answer pairs from REDSearcher.
Benchmarks: BrowseComp, BrowseComp-ZH, GAIA (Text-Only), xBench-DS-2505, xBench-DS-2510, DeepSearchQA.
Baselines: Three groups: Frontier foundation models (e.g., GPT-5 High), open-source trained agents ≥30B (e.g., Tongyi-DR-30B), and open-source small agents ≤9B (e.g., AgentCPM-Explore-4B).
Implementation: Backbone: Qwen3-4B-Thinking-2507. Tools: Search (Serper/Google API) and Browse (Jina + Qwen3-30B summarization).

3.2 Main Results

Table 1: Overall performance comparison on six widely used deep research benchmarks.

Model	BrowseComp	BrowseComp-ZH	GAIA (Text-Only)	xBench-DS-2505	xBench-DS-2510	DeepSearchQA
Foundation Models
DeepSeek-V3.2	67.6	65.0	75.1	78.0	55.7	60.9
GPT-5 High	54.9	65.0	76.4	77.8	75.0	79.0
Trained Agents (≥ 30B)
Tongyi-DR-30B	43.4	46.7	70.9	75.0	55.0	–
REDSearcher-30B-A3B	42.1	49.8	80.1	–	–	–
Trained Agents (≤ 9B)
AgentCPM-Explore-4B	24.1	29.1	63.9	70.0	34.0	32.8
DR-Venus-4B-SFT	26.8	35.7	65.4	69.0	35.3	37.7
DR-Venus-4B-RL	29.1	37.7	64.4	74.7	40.7	39.6

Key Findings:

Strong SFT Baseline: DR-Venus-4B-SFT already outperforms prior ≤9B agents on most benchmarks (e.g., +2.7 over AgentCPM-Explore-4B on BrowseComp).
RL Unlocks Further Gains: Agentic RL (IGPO) improves over SFT on five of six benchmarks (e.g., +5.7 on xBench-DS-2505), establishing a new SOTA among small agents.
Narrowing the Scale Gap: Despite its 4B size, DR-Venus matches or exceeds several 30B-scale systems on individual benchmarks (e.g., surpasses OpenResearcher-30B-A3B on all reported benchmarks).

3.3 Ablation Study

Table 2: Ablation study on BrowseComp and BrowseComp-ZH.

Model	Training	BrowseComp	BrowseComp-ZH
REDSearcher-30B-A3B (SFT)	SFT	34.7	26.8
DR-Venus-4B-SFT (w/o Resampling)	SFT	22.8	33.9
DR-Venus-4B-SFT (w/ Resampling, Ours)	SFT	26.8 (+4.0)	35.7 (+1.8)
DR-Venus-4B-RL (w/ GRPO)	SFT+RL	25.3 (-1.5)	35.6 (-0.1)
DR-Venus-4B-RL (w/ IGPO, Ours)	SFT+RL	29.1 (+2.3)	37.7 (+2.0)

Key Insights:

Resampling is Effective: Long-horizon trajectory resampling during SFT improves BrowseComp by +4.0 and BrowseComp-ZH by +1.8.
IGPO is Superior to GRPO: Conventional sparse trajectory-level RL (GRPO) brings no improvement over the SFT checkpoint and even slightly lowers scores (-1.5, -0.1), while turn-level dense supervision via IGPO yields consistent gains (+2.3, +2.0).
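The deltas quoted in Table 2 can be rechecked directly from the reported scores. The short script below does only that arithmetic; the dictionary layout and names are illustrative.

```python
# Scores copied from Table 2: (BrowseComp, BrowseComp-ZH).
scores = {
    "sft_no_resampling": (22.8, 33.9),
    "sft_resampling": (26.8, 35.7),
    "rl_grpo": (25.3, 35.6),
    "rl_igpo": (29.1, 37.7),
}


def delta(model: str, baseline: str) -> tuple:
    """Per-benchmark difference, rounded to one decimal as in the table."""
    return tuple(round(m - b, 1) for m, b in zip(scores[model], scores[baseline]))


print(delta("sft_resampling", "sft_no_resampling"))  # prints: (4.0, 1.8)
print(delta("rl_grpo", "sft_resampling"))            # prints: (-1.5, -0.1)
print(delta("rl_igpo", "sft_resampling"))            # prints: (2.3, 2.0)
```

The resampling gains are measured against the SFT run without resampling, whereas both RL rows are measured against the resampled SFT checkpoint.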

3.4 Analysis of Capability Boundary

Pass@K evaluation reveals that the capability ceiling of 4B agents is surprisingly high. On BrowseComp-ZH, DR-Venus-4B-SFT reaches 78.5% at Pass@16, well above the scores reported in Table 1 for Tongyi-DR-30B (46.7%) and GPT-5 High (65.0%), and above the proprietary Gemini-3-Pro (66.8%). This highlights the large deployment potential of edge-scale agents and suggests test-time scaling is especially effective for unlocking small model potential.
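The summary does not say how Pass@K is computed. One common choice, assumed here purely for illustration, is the unbiased estimator widely used in code-generation evaluation: with $n$ sampled attempts per question of which $c$ are correct, Pass@K is $1 - \binom{n-c}{k} / \binom{n}{k}$.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn from n is correct.

    n: attempts sampled for the question, c: correct attempts, k: budget.
    This estimator is an assumption; the source does not specify its method.
    """
    if n - c < k:  # fewer than k wrong attempts: any k-subset contains a hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# A question solved in 4 of 16 attempts: modest at k=1, certain at k=16.
print(pass_at_k(16, 4, 1))   # prints: 0.25
print(pass_at_k(16, 4, 16))  # prints: 1.0
print(pass_at_k(16, 0, 16))  # prints: 0.0
```

Averaging this quantity over all questions gives the benchmark-level score, which is why a model with a low single-attempt accuracy can still show a high Pass@16.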

3.5 Analysis of Tool Use

Figure 3 shows that correct trajectories consistently exhibit a higher browse ratio than wrong trajectories across all benchmarks. This pattern becomes more pronounced after RL.

Overall Browse Ratio: Increases from 17.49% (SFT) to 22.46% (RL).
Correct Trajectory Browse Ratio: Increases from 23.71% (SFT) to 28.96% (RL).

RL steers tool use toward better evidence gathering, making the "correct > wrong" pattern more consistent and reversing the counterintuitive cases under SFT where that ordering did not hold.
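A browse ratio of this kind can be computed from logged tool calls as sketched below. The definition used here (browse calls divided by all search and browse calls, pooled over trajectories) is an assumption, since the summary does not define the metric; the function name and the toy trajectories are likewise hypothetical.

```python
from collections import Counter
from typing import List


def browse_ratio(trajectories: List[List[str]]) -> float:
    """Share of browse calls among all tool calls, pooled over trajectories."""
    counts = Counter(
        action
        for actions in trajectories
        for action in actions
        if action in ("search", "browse")  # the answer turn is not a tool call
    )
    total = counts["search"] + counts["browse"]
    return counts["browse"] / total if total else 0.0


# Toy action logs, not taken from the paper's data.
correct = [["search", "browse", "search", "browse", "answer"]]
wrong = [["search", "search", "search", "browse", "answer"]]
print(browse_ratio(correct), browse_ratio(wrong))  # prints: 0.5 0.25
```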

Theoretical and Practical Implications

Theoretical: Demonstrates that under limited open-data supervision, model scale is not the sole determinant of deep research performance. Careful improvement of data quality (cleaning, filtering) and data utilization (resampling, dense RL rewards) can compensate for a substantial portion of the scale gap.
Practical: Provides a reproducible blueprint for building frontier edge-scale deep research agents using only open data. The released models, code, and recipes lower the barrier to entry for research and deployment of lightweight, cost-effective agentic assistants.

Conclusion

DR-Venus presents a frontier 4B edge-scale deep research agent built entirely on ~10K open-data trajectories via a two-stage recipe of quality-focused SFT and utilization-focused RL (IGPO). It significantly outperforms prior small agents and narrows the gap to larger 30B systems. The work shows that strong deep research capability can be unlocked in small models through careful data handling, and that RL is essential for stabilizing long-horizon execution. The released artifacts aim to provide a practical starting point for future research on edge-scale agents.