DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data - Summary
Summary (Overview)
- Presents DR-Venus, a state-of-the-art 4-billion-parameter deep research agent trained entirely on ~10K open-source examples, designed for efficient edge-scale deployment.
- Introduces a two-stage training recipe focusing on data quality and utilization: 1) Agentic Supervised Fine-Tuning (SFT) with strict data cleaning and long-horizon trajectory resampling, and 2) Agentic Reinforcement Learning (RL) using an Information Gain-based Policy Optimization (IGPO) with dense, turn-level rewards.
- Demonstrates superior performance where DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple benchmarks (e.g., BrowseComp, GAIA) and narrows the performance gap to much larger 30B-class systems.
- Provides key insights: Long-horizon resampling improves SFT effectiveness; turn-level RL (IGPO) is more effective than sparse trajectory-level optimization; and small models have a surprisingly high capability ceiling, making test-time scaling particularly valuable.
- Releases open resources, including models, code, and recipes, to support reproducible research on edge-scale agents.
Introduction and Theoretical Foundation
Recent advances in large language models (LLMs) have enabled capable agents for complex tasks like deep research, which involves iterative search, browsing, evidence collection, and answer synthesis over long interaction horizons. While small models are desirable for real-world deployment due to advantages in cost, latency, and privacy, most existing deep research systems are built on large models with closed or complex data pipelines.
This work addresses a central question: How to train a strong small deep research agent under limited open-data supervision? The authors argue this is fundamentally a problem of improving both data quality and data utilization. Small models are more sensitive to noisy data and imperfect tool-use traces (requiring high quality), and they struggle with agentic RL where rollout groups often contain no successful trajectories (requiring efficient utilization).
The proposed solution is DR-Venus, a 4B frontier agent. The core formulation treats a deep research agent as a language-model-based policy $\pi_\theta$ that, given a user query $q$, iteratively generates a turn output $o_t = (c_t, a_t)$ conditioned on the interaction history $h_t$, where $c_t$ is intermediate reasoning and $a_t$ is an action (search, browse, answer). The resulting trajectory is:

$$\tau = (q, o_1, e_1, o_2, e_2, \ldots, o_T),$$

where $e_t$ denotes the environment observation returned after turn $t$.
The goal is to learn a policy that can acquire basic capability from supervised trajectories and improve long-horizon execution reliability through reinforcement learning.
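The iterative generate-act-observe loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `policy_step` and `run_tool` are hypothetical stand-ins (a real agent would invoke the LLM policy and live search/browse tools).

```python
# Minimal sketch of the deep-research interaction loop: each turn produces
# reasoning plus an action; non-answer actions return an observation that is
# appended to the history. The stubs below are illustrative placeholders.

def policy_step(query, history):
    # Stub policy: search once, browse once, then answer.
    n_actions = sum(1 for kind, _ in history if kind == "action")
    if n_actions == 0:
        return ("reasoning: need sources", ("search", query))
    if n_actions == 1:
        return ("reasoning: inspect top hit", ("browse", "result-0"))
    return ("reasoning: enough evidence", ("answer", "42"))

def run_tool(action):
    # Stub tool executor standing in for real search/browse backends.
    kind, arg = action
    return f"observation for {kind}({arg})"

def rollout(query, max_turns=10):
    history, trajectory = [], []
    for _ in range(max_turns):
        reasoning, action = policy_step(query, history)
        trajectory.append((reasoning, action))
        history.append(("action", action))
        if action[0] == "answer":
            return trajectory, action[1]
        history.append(("obs", run_tool(action)))
    return trajectory, None
```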
Methodology
The training recipe is a two-stage pipeline designed to maximize data quality and utilization from limited open data.
Stage 1: Building Basic Agentic Capability with Supervised Fine-Tuning (SFT)
Data Filtering and Construction: The SFT data is built from 10,001 raw REDSearcher trajectories through a four-step pipeline:
- Environment Alignment: Convert all trajectories to match the online inference pipeline's format.
- Disallowed-Tool Pruning and Duplicate Removal: Remove non-search/browse tool calls and duplicate interactions (mostly `browse`). This left 10,000 valid trajectories.
- Correctness Filtering: Retain only trajectories with correct final answers, judged by Qwen3-235B-A22B-Instruct-2507. This left 9,365 trajectories (93.65%).
- Turn-Aware Resampling: Upweight longer trajectories to emphasize long-horizon interactions critical for deep research. Sampling weights: 1× for 0–50 turns, 2× for 51–100 turns, 5× for >100 turns. This increased the final training set to 18,745 instances, raising the proportion of trajectories >50 turns from 60.28% to 80.15%.
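The turn-aware resampling step can be sketched directly from the stated weights (1× for 0-50 turns, 2× for 51-100, 5× for >100). Realizing the weights by duplicating trajectories is an assumption; the authors may instead use weighted sampling.

```python
# Sketch of turn-aware resampling with the paper's weight schedule.
# Duplication-based upweighting is an illustrative choice, not necessarily
# the authors' exact mechanism.

def resample_weight(num_turns):
    if num_turns <= 50:
        return 1
    if num_turns <= 100:
        return 2
    return 5

def resample(trajectories):
    # Repeat each trajectory according to its turn-count weight.
    out = []
    for traj in trajectories:
        out.extend([traj] * resample_weight(traj["turns"]))
    return out
```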
Agentic SFT Objective: The model is fine-tuned on the cleaned dataset $\mathcal{D}_{\text{SFT}}$. The training objective is standard next-token prediction on agent-generated tokens only (reasoning $c_t$ and actions $a_t$), masking environment observations $e_t$:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{\tau \in \mathcal{D}_{\text{SFT}}} \sum_{j \in \mathcal{M}(\tau)} \log \pi_\theta\!\left(x_j \mid x_{<j}\right),$$

where $\mathcal{M}(\tau)$ denotes the agent-generated token positions in trajectory $\tau$.
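The observation-masked objective amounts to averaging the negative log-likelihood over agent positions only. A minimal sketch, assuming hypothetical per-token inputs `token_logprobs` and a boolean mask `is_agent`:

```python
import math

# Sketch of loss masking: next-token NLL is computed only on agent-generated
# positions; environment-observation tokens are excluded from the loss.

def masked_sft_loss(token_logprobs, is_agent):
    """Mean negative log-likelihood over agent-generated tokens only."""
    picked = [-lp for lp, agent in zip(token_logprobs, is_agent) if agent]
    return sum(picked) / len(picked)
```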
Stage 2: Pushing Toward Frontier Performance with Reinforcement Learning (RL)
To address residual failure modes (formatting errors, redundant reasoning), agentic RL is applied using Information Gain-based Policy Optimization (IGPO), which provides dense turn-level rewards for better data efficiency.
Turn-Level Reward Design:
- Information Gain (IG) Reward: Measures how much a turn increases the model's confidence in the ground-truth answer $a^*$. The log probability assigned to $a^*$ after turn $t$ in rollout $i$ is:

  $$s_{i,t} = \log \pi_\theta\!\left(a^* \mid q, h_{i,t}\right).$$

  The IG reward for turn $t$ is then:

  $$r^{\text{IG}}_{i,t} = s_{i,t} - s_{i,t-1}.$$
- Browse-Aware IG Assignment (Optional): Compute IG rewards only on `browse` turns, and assign each reward to that browse turn and to all preceding `search` turns since the last browse.
- Turn-Level Format Penalty: Provides fine-grained control. For a turn $t$ with reward $r_{i,t}$ (either $r^{\text{IG}}_{i,t}$ or the outcome reward $r^{\text{out}}_i$), a penalty is subtracted when the turn violates the required output format:

  $$\tilde{r}_{i,t} = r_{i,t} - \lambda_{\text{fmt}} \cdot \mathbb{1}\!\left[\text{turn } t \text{ is malformed}\right].$$
- Normalization and Discounted Accumulation: Rewards are normalized within each rollout group of size $G$:

  $$\hat{r}_{i,t} = \frac{\tilde{r}_{i,t} - \operatorname{mean}(\tilde{r})}{\operatorname{std}(\tilde{r})}.$$

  To balance weak outcome supervision in long trajectories, an optional IG-Scale factor $\eta$ is computed and applied, so the final turn-level reward for intermediate turns is $\eta \cdot \hat{r}^{\text{IG}}_{i,t}$ (if IG-Scale is enabled). A discounted cumulative reward is then computed for credit assignment:

  $$R_{i,t} = \sum_{k=t}^{T_i} \gamma^{\,k-t}\, \hat{r}_{i,k},$$

  where $\gamma$ is the discount factor.
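The reward pipeline above (IG deltas, group normalization, discounted accumulation) can be sketched as follows. This is a minimal illustration under assumptions: the exact pooling used for normalization and the IG-Scale computation are simplified away.

```python
# Sketch of the turn-level reward pipeline: IG rewards from answer
# log-probs, z-normalization over a rollout group, and discounted returns.
# Not the authors' implementation; pooling details are simplified.

def ig_rewards(answer_logprobs):
    """IG reward per turn: the change in log pi(a* | history).
    answer_logprobs[0] is the pre-interaction baseline."""
    return [answer_logprobs[t] - answer_logprobs[t - 1]
            for t in range(1, len(answer_logprobs))]

def group_normalize(rewards):
    """z-normalize rewards pooled over a rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0
    return [(r - mean) / std for r in rewards]

def discounted_returns(rewards, gamma=0.9):
    """R_t = sum_{k >= t} gamma^(k-t) * r_k, accumulated right to left."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))
```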
Policy Optimization via IGPO: The IGPO objective builds on GRPO-style optimization with turn-level credit assignment:

$$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|} \min\!\Big(\rho_{i,t}\, R_{i,t},\ \operatorname{clip}\!\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\, R_{i,t}\Big)\right] - \beta\, \mathbb{D}_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),$$

where $\rho_{i,t}$ is the importance ratio between the current and old policies, $R_{i,t}$ is the discounted turn-level reward, $\epsilon$ is the clipping threshold, and $\beta$ controls the KL penalty strength.
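The clipped-surrogate core of a GRPO-style objective can be sketched in a few lines. This is an illustrative sketch, not the authors' training code: `rhos` are hypothetical per-turn importance ratios, `advantages` the turn-level credit values, and the KL penalty term is omitted for brevity.

```python
# Sketch of the PPO/GRPO-style clipped surrogate with turn-level
# advantages. The KL regularizer against a reference policy is omitted.

def clipped_surrogate(rhos, advantages, eps=0.2):
    total = 0.0
    for rho, adv in zip(rhos, advantages):
        # Clip the importance ratio to [1 - eps, 1 + eps], then take the
        # pessimistic (minimum) of the clipped and unclipped objectives.
        clipped = min(max(rho, 1 - eps), 1 + eps)
        total += min(rho * adv, clipped * adv)
    return total / len(rhos)
```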
Empirical Validation / Results
Experimental Setup: The backbone model is Qwen3-4B-Thinking-2507. Agentic SFT uses ~10K cleaned REDSearcher trajectories. Agentic RL uses 1K query-answer pairs from the same source. Evaluation is conducted on six benchmarks: BrowseComp, BrowseComp-ZH, GAIA (Text-Only), xBench-DS-2505, xBench-DS-2510, and DeepSearchQA.
Main Results: Table 1: Overall performance comparison on six deep research benchmarks.
| Model | BrowseComp | BrowseComp-ZH | GAIA (Text-Only) | xBench-DS-2505 | xBench-DS-2510 | DeepSearchQA |
|---|---|---|---|---|---|---|
| **Foundation Models** | | | | | | |
| DeepSeek-V3.2 | 67.6 | 65.0 | 75.1 | 78.0 | 55.7 | 60.9 |
| GPT-5 High | 54.9 | 65.0 | 76.4 | 77.8 | 75.0 | 79.0 |
| **Trained Agents (≥30B)** | | | | | | |
| Tongyi-DR-30B | 43.4 | 46.7 | 70.9 | 75.0 | 55.0 | – |
| REDSearcher-30B-A3B | 42.1 | 49.8 | 80.1 | – | – | – |
| **Trained Agents (≤9B)** | | | | | | |
| AgentCPM-Explore-4B | 24.1 | 29.1 | 63.9 | 70.0 | 34.0 | 32.8 |
| DR-Venus-4B-SFT | 26.8 | 35.7 | 65.4 | 69.0 | 35.3 | 37.7 |
| DR-Venus-4B-RL | 29.1 | 37.7 | 64.4 | 74.7 | 40.7 | 39.6 |
- DR-Venus-4B-SFT already establishes a strong baseline, consistently outperforming prior ≤9B agents (e.g., +2.7 over AgentCPM-Explore-4B on BrowseComp).
- DR-Venus-4B-RL further improves performance, setting a new state-of-the-art among small agents and narrowing the gap to 30B-class systems (e.g., approaching Tongyi-DR-30B on xBench-DS-2505).
Ablation Study: Table 2: Ablation study on BrowseComp and BrowseComp-ZH.
| Model | Training | BrowseComp | BrowseComp-ZH |
|---|---|---|---|
| REDSearcher-30B-A3B (SFT) | SFT | 34.7 | 26.8 |
| DR-Venus-4B-SFT (w/o Resampling) | SFT | 22.8 | 33.9 |
| DR-Venus-4B-SFT (w/ Resampling, Ours) | SFT | 26.8 (+4.0) | 35.7 (+1.8) |
| DR-Venus-4B-RL (w/ GRPO) | SFT+RL | 25.3 (-1.5) | 35.6 (-0.1) |
| DR-Venus-4B-RL (w/ IGPO, Ours) | SFT+RL | 29.1 (+2.3) | 37.7 (+2.0) |
- Turn-aware resampling during SFT provides significant gains (+4.0 on BrowseComp), demonstrating the value of emphasizing long-horizon data.
- IGPO-based RL is effective (+2.3 on BrowseComp), while conventional GRPO leads to performance degradation, highlighting the importance of dense, turn-level reward design.
Analysis of Capability Boundary (Pass@K):
- Pass@K evaluation reveals the capability ceiling of small agents is surprisingly high. For example, on BrowseComp-ZH, DR-Venus-4B-SFT reaches 78.5% at Pass@16, outperforming larger models like Gemini-3-Pro (66.8%) and GPT-5 High (65.0%).
- RL primarily improves reliability in the low-K regime (e.g., Pass@1), making strong trajectories emerge more consistently.
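Pass@K figures like those above are typically computed with the standard unbiased estimator: given $n$ sampled rollouts of which $c$ are correct, estimate the probability that at least one of $k$ draws is correct. Whether the paper uses this estimator or direct evaluation at $k$ samples is an assumption; the sketch below shows the standard formula.

```python
from math import comb

# Standard unbiased Pass@K estimator: 1 - C(n - c, k) / C(n, k),
# i.e. one minus the probability that all k drawn rollouts are wrong.

def pass_at_k(n, c, k):
    if n - c < k:
        # Fewer than k wrong rollouts: some draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```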
Analysis of Tool Use:
- A clear pattern emerges: correct trajectories consistently exhibit a higher browse ratio than wrong trajectories across all benchmarks.
- RL further strengthens this behavior, increasing the overall browse ratio from 17.49% (SFT) to 22.46% (RL) and steering tool use toward more effective evidence gathering.
Theoretical and Practical Implications
- Data Quality over Pure Scale: The work demonstrates that under limited open-data supervision, careful improvement of data quality (cleaning, alignment) and data utilization (resampling, dense rewards) can compensate for a substantial portion of the scale gap. A well-designed pipeline can unlock strong performance even in a 4B model.
- Effective RL for Small Agents: The paper shows that making RL effective for small agents in long-horizon tasks requires dense, turn-level supervision. The proposed IGPO algorithm with information gain rewards and format-aware regularization provides a solution, outperforming sparse trajectory-level optimization.
- Promise of Edge-Scale Deployment: The high capability ceiling revealed by Pass@K analysis suggests significant deployment potential for small, efficient models. Test-time scaling (e.g., multiple rollouts) may be an especially effective way to unlock the latent reasoning capability of edge-scale agents.
- Open and Reproducible Research: By releasing models, code, and recipes built entirely on open data, the work provides a practical blueprint for advancing research on accessible, edge-scale deep research agents.
Conclusion
DR-Venus presents a frontier 4B edge-scale deep research agent trained on only ~10K open-data examples. Its two-stage recipe—agentic SFT with strict cleaning/resampling followed by agentic RL with IGPO—enables it to outperform prior small agents and compete with much larger systems. Key findings emphasize the importance of data quality, the effectiveness of turn-level RL, and the high latent potential of small models. The released resources aim to foster reproducible progress in building efficient, deployable deep research assistants.