EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Summary (Overview)

Unified Automated Pipeline: EnvFactory is a fully automated framework that addresses two core bottlenecks in Agentic Reinforcement Learning (Agentic RL) for tool-use: scalable, verifiable environment construction and realistic, implicitly-reasoned training trajectory synthesis.
Autonomous Environment Construction (EnvGen): The framework autonomously explores authentic online resources to propose diverse tool scenarios, then constructs stateful databases and executable tool interfaces via a multi-agent (Search, Code, Test) iterative verification loop, ensuring robust and low-latency environments.
Topology-Aware Realistic Trajectory Synthesis (QueryGen): It synthesizes natural multi-turn user queries by first building a dependency tool graph, then using a novel topology-aware sampling algorithm to resolve logical dependencies. A calibrated refinement stage injects implicit intents and ambiguity, transforming over-specified instruction lists into realistic human requests.
High Data Efficiency & Performance: Using only 85 verified environments and 2,575 training trajectories, EnvFactory outperforms baselines that use 5x more resources. It improves Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks (𝜏²-Bench, VitaBench).

Introduction and Theoretical Foundation

Equipping Large Language Models (LLMs) with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is critical for enabling real-world AI agents. The effectiveness of Agentic RL hinges on two factors: scalable, executable environments and realistic, verified training data that captures implicit human reasoning.

Existing approaches fall short:

Environment Limitations: Production APIs are costly and unstable; LLM-simulated environments suffer from hallucination; synthetic environments are often stateless or rely on pre-collected documents, limiting generalization.
Data Limitations: Synthetic trajectories are often over-specified "instruction lists," explicitly enumerating steps rather than reflecting the concise, implicit nature of real human requests. This reduces their effectiveness for training robust decision-making.

EnvFactory is proposed to unify robust environment construction and realistic trajectory generation. It autonomously constructs verified, stateful environments from real-world online resources and synthesizes natural queries through topology-aware graph guidance and calibrated refinement, bridging the realism gap.

Methodology

The framework formalizes tool-use interaction among Users, Agents, and Environments ($E$). Each environment $e \in E$ is defined as a tuple $e = (m, D, \pi, V_e)$, where:

$m$ : environment metadata (descriptions, tool schemas)
$D$ : stateful database schema
$\pi$ : executable Python implementation
$V_e$ : exposed tool interface (using MCP by default)

The global toolset is $V = \bigcup_{e \in E} V_e$. Each tool $v \in V$ has an input space $I(v)$ and output space $O(v)$.
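
As a rough illustration of this formalization, the tuple $(m, D, \pi, V_e)$ and the union $V$ could be modeled as below. This is a minimal sketch, not the authors' data model: the class names, the `name` metadata key, and the environment-prefixed tool names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    name: str
    inputs: List[str]                   # I(v): input parameter names
    outputs: List[str]                  # O(v): output parameter names
    fn: Callable[..., Dict[str, Any]]   # this tool's slice of the implementation pi


@dataclass
class Environment:
    metadata: Dict[str, Any]            # m (assumed to carry a "name" key)
    db_schema: Dict[str, List[str]]     # D: table name -> column names
    tools: Dict[str, Tool] = field(default_factory=dict)  # V_e


def global_toolset(envs: List[Environment]) -> Dict[str, Tool]:
    """Build V as the union of every V_e, prefixing names to avoid collisions."""
    toolset: Dict[str, Tool] = {}
    for env in envs:
        for name, tool in env.tools.items():
            toolset[f"{env.metadata['name']}.{name}"] = tool
    return toolset
```

Prefixing tool names with the environment name is one simple way to keep the union well defined when two environments expose tools with the same name.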

3.2 Environment Construction (EnvGen)

Starting from an initially empty pool $E = \emptyset$, EnvGen fully automates the construction of each new environment $e_{new} \notin E$ via a pipeline of three agents:

Proposal and Sketch (Search Agent): Analyzes $E$ for coverage gaps, retrieves authentic sources (API docs, reports), and produces structured metadata $m$ as a blueprint.
Database Modeling & Code Implementation (Code Agent): Derives a stateful database schema $D$ and implements executable Python code $\pi$ for each tool, wrapped into a standardized interface $V_{e_{new}}$.
Revision Loop (Test Agent): Creates unit tests and validates the environment against four criteria (interface consistency, successful execution, correct behavior, proper state transitions). Iterates with error reports until all tests pass.

The final verified environment is added to the pool: $E \leftarrow E \cup \{e_{new}\}$.
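
The three-agent loop above can be sketched as a small control routine. This is only an outline under stated assumptions: the agents are passed in as plain callables standing in for LLM-backed components, the function and parameter names are hypothetical, and the `max_rounds` cap is an added safeguard that the source does not mention (it only says the loop iterates until all tests pass).

```python
from typing import Callable, List, Optional


def build_environment(
    pool: list,
    search_agent: Callable[[list], dict],
    code_agent: Callable[[dict, List[str]], dict],
    test_agent: Callable[[dict], List[str]],
    max_rounds: int = 5,  # assumed cap, not specified in the source
) -> Optional[dict]:
    """Propose, implement, and verify one new environment."""
    metadata = search_agent(pool)            # blueprint m from coverage gaps in E
    errors: List[str] = []
    for _ in range(max_rounds):
        env = code_agent(metadata, errors)   # derive D and implement pi
        errors = test_agent(env)             # empty list means every test passed
        if not errors:
            pool.append(env)                 # E <- E ∪ {e_new}
            return env
    return None                              # unverified environments are dropped
```

Feeding the previous error report back into the Code Agent is what makes this a revision loop rather than a simple retry.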

3.3 Dependency Tool Graph

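A tool dependency graph $G = (V, E)$ is constructed to guide realistic multi-step query synthesis. Here the nodes $V$ are tools, and the edge set $E$ holds directed dependencies between them; it is distinct from the environment pool, which is also written $E$ above.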

Construction:
- Step 1: Semantic Parameter Matching: Uses the BAAI/bge-m3 embedding model to compute cosine similarity between output parameters of tool $v_i$ and input parameters of tool $v_j$. A directed edge $(v_i \rightarrow v_j)$ is added if similarity exceeds a threshold.
- Step 2: Logical Dependency Refinement: An LLM analyzes tools per environment to identify missing logical dependencies and prune spurious edges, ensuring tools without parameters are connected.
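
Step 1 can be approximated with a few lines of NumPy. The sketch below is illustrative only: `embed` is a placeholder for any text-embedding function (the source uses BAAI/bge-m3, which is not loaded here), the tool dictionary layout is an assumption, and the default threshold of 0.8 is an arbitrary value since the source does not report one.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def match_edges(tools: dict, embed, threshold: float = 0.8) -> set:
    """Return directed edges (v_i, v_j) where an output of v_i resembles an input of v_j.

    tools: name -> {"inputs": [param names], "outputs": [param names]}
    embed: callable mapping a parameter name to a vector (assumed helper)
    """
    edges = set()
    for src, s in tools.items():
        for dst, d in tools.items():
            if src == dst:
                continue
            if any(
                cosine(embed(o), embed(i)) > threshold
                for o in s["outputs"]
                for i in d["inputs"]
            ):
                edges.add((src, dst))
    return edges
```

The LLM pass in Step 2 would then operate on this edge set, adding dependencies that name similarity misses and removing matches that are only superficially similar.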
Topology-Aware Sampling: To sample a valid tool sequence $\tau = [v_1, ..., v_n]$, the algorithm ensures all required inputs of a tool are satisfiable before selection. Input parameters are classified by an LLM as:
- External: Provided by the user (e.g., city, name).
- Internal: Derived from preceding tool outputs (e.g., hotel_id).

A parameter $p_i \in I(v)$ is independent if it is: (1) Optional, (2) Externally providable, or (3) Internally satisfiable (output of a prior tool in $\tau$). For dependent parameters, the sampler recursively selects a prior tool that can generate it by traversing backward along $G$. Once dependencies are resolved, the sampler randomly selects 1 to $k$ neighbors from $v$ to extend the chain, enabling non-linear patterns.
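
The sampling procedure described above might look roughly like the following. It is a simplified sketch, not the paper's algorithm: the data layout, the `start` argument, and the `max_len` soft cap are assumptions, and parameter kinds (`"optional"`, `"external"`, `"internal"`) are taken as given rather than produced by an LLM.

```python
import random


def sample_chain(tools, graph, start, k=2, max_len=6, rng=None):
    """Sample a tool chain in which every internal input has an earlier producer.

    tools: name -> {"inputs": {param: kind}, "outputs": [param names]}
    graph: name -> list of successor tool names (edges v_i -> v_j)
    """
    rng = rng or random.Random()
    chain, produced = [], set()
    preds = {v: [u for u, succ in graph.items() if v in succ] for v in tools}

    def resolve(v, visiting):
        """Append v once all of its internal inputs are satisfiable."""
        if v in chain or v in visiting:
            return v in chain               # already placed, or a dependency cycle
        for p, kind in tools[v]["inputs"].items():
            if kind != "internal" or p in produced:
                continue                    # optional, external, or already produced
            producers = [u for u in preds[v] if p in tools[u]["outputs"]]
            rng.shuffle(producers)          # walk backward along G
            if not any(resolve(u, visiting | {v}) for u in producers):
                return False                # no prior tool can supply p
        chain.append(v)
        produced.update(tools[v]["outputs"])
        return True

    frontier = [start] if resolve(start, frozenset()) else []
    while frontier and len(chain) < max_len:
        v = frontier.pop(0)
        nbrs = [u for u in graph.get(v, []) if u not in chain]
        if not nbrs:
            continue
        # extend with 1..k random neighbors, which allows branching chains
        for u in rng.sample(nbrs, rng.randint(1, min(k, len(nbrs)))):
            if len(chain) < max_len and resolve(u, frozenset()):
                frontier.append(u)
    return chain
```

Because a tool is appended only after its producers, any returned chain can be executed left to right. Note that `max_len` is a soft cap here, since resolving one neighbor may pull in several prerequisite tools.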

3.4 Tool-Use Trajectory Synthesis (QueryGen)

Using a sampled tool chain $\tau$ , QueryGen synthesizes multi-turn trajectories.

Planning: Constructs a user profile/scenario and an initial database state conforming to $D$. The tool chain is partitioned into multiple dialogue turns.
Generation and Refinement: For each turn, a naturalistic user query is generated conditioned on the state, history, and sampled tools via:
- Subgoal Decomposition: Tools are broken into fine-grained subgoals/user intents.
- Goal Articulation: Natural language requests are composed from subgoals.
- Calibrated Refinement: Enhances realism through: (1) Implicit reference, (2) Action compression, (3) Ambiguity introduction, (4) Goal expansion.
Agentic Interaction & Evaluation: Sandbox environments with agents and simulated users are deployed to obtain $k$ candidate ground-truth solution trajectories. The pipeline evaluates and selects the optimal one, filtering redundant calls and masking non-critical arguments.

3.5 Model Training

Synthesized trajectories are used for post-training via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). A composite reward function accounts for execution ambiguity:

$$R = \alpha \cdot R_{\text{traj}} + (1 - \alpha) \cdot R_{\text{state}} - \gamma \cdot P_{\text{length}}$$

where:

$R_{\text{traj}} \in [0, 1]$ : trajectory-based reward (matches predicted vs. ground-truth sequences)
$R_{\text{state}}$ : state-based reward (equivalence of final database states)
$P_{\text{length}}$ : length penalty
$\alpha, \gamma \geq 0$ : weighting coefficients
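
To make the weighting concrete, the composite reward can be written as a short function. This is a sketch under assumptions: the source does not specify how $R_{\text{traj}}$, $R_{\text{state}}$, or $P_{\text{length}}$ are computed, so they are taken as inputs here. The exact-match `state_reward` helper, the default $\gamma = 0.1$, and the restriction of $\alpha$ to $[0, 1]$ (so that both reward weights stay non-negative) are illustrative choices.

```python
def composite_reward(r_traj: float, r_state: float, p_length: float,
                     alpha: float = 0.5, gamma: float = 0.1) -> float:
    """R = alpha * R_traj + (1 - alpha) * R_state - gamma * P_length."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    if gamma < 0.0:
        raise ValueError("gamma must be non-negative")
    return alpha * r_traj + (1.0 - alpha) * r_state - gamma * p_length


def state_reward(final_db: dict, expected_db: dict) -> float:
    """Hypothetical state reward: 1.0 if the final database states are equal."""
    return 1.0 if final_db == expected_db else 0.0
```

For example, with $R_{\text{traj}} = 0.8$, $R_{\text{state}} = 1.0$, $P_{\text{length}} = 0.5$, $\alpha = 0.5$, and $\gamma = 0.1$, the reward is $0.4 + 0.5 - 0.05 = 0.85$. Setting $\alpha = 0$ or $\alpha = 1$ recovers the pure state-based or pure trajectory-based reward, respectively.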

Empirical Validation / Results

4.1 Setup

Data: 85 verified MCP environments across 7 domains (commerce, finance, travel, office, lifestyle, research, utilities), comprising 842 tools. From these, 1,622 SFT and 953 RL multi-turn trajectories were synthesized.
Baselines: Qwen3-(1.7B, 4B, 8B) models compared against checkpoints from concurrent work AWM (Wang et al., 2026) and EnvScaler (Song et al., 2026).
Benchmarks: BFCL v3, 𝜏²-Bench, VitaBench, MCP-Atlas.
Training: Stage 1: SFT; Stage 2: RL (using GRPO with VeRL).

4.2 Main Results

Table 1 presents a comprehensive comparison across four benchmarks. Key findings:

Model	Method	# Env.	# Tasks	BFCL Multi-Turn	MCP-Atlas Pass Rate	𝜏²-Bench Avg.	VitaBench Avg.	Overall Avg.
Qwen3-4B	Base	–	–	33.50	4.12	25.25	7.67	24.09
	AWM	526	3,315	40.75	4.47	22.37	11.67	25.47
	EnvScaler	191	11,572	45.00	9.97	29.25	14.69	29.56
	EnvFactory (SFT)	85	1,622	44.25	7.90	25.25	11.33	27.29
	EnvFactory (Full)	85	2,575	48.50	9.97	30.13	16.00	30.77
Qwen3-8B	Base	–	–	41.25	5.15	32.30	16.70	29.23
	AWM	526	3,315	42.25	6.19	28.42	16.48	28.65
	EnvScaler	191	11,572	51.88	9.62	34.30	18.67	32.72
	EnvFactory (SFT)	85	1,622	46.50	8.25	32.71	16.67	30.82
	EnvFactory (Full)	85	2,575	49.00	13.75	33.67	18.67	33.40

SFT Cold Start Delivers Large Gains: SFT on EnvFactory trajectories alone yields substantial improvements (e.g., Qwen3-4B BFCL multi-turn: 33.50 → 44.25).
RL after SFT Further Unlocks Capability: RL training consistently yields further gains, especially on challenging interactive benchmarks (e.g., Qwen3-4B VitaBench: 11.33 → 16.00).
Strong Generalization: Improvements are consistent across both conversational (𝜏²-Bench, VitaBench) and non-conversational (BFCL, MCP-Atlas) benchmarks.
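Superior Resource Efficiency: EnvFactory (Full) attains the highest overall average for both Qwen3-4B (30.77) and Qwen3-8B (33.40) while using 85 environments and 2,575 tasks, compared with 191 to 526 environments and 3,315 to 11,572 tasks for the baselines. EnvScaler remains ahead on some individual columns (e.g., Qwen3-8B BFCL multi-turn, 51.88 vs. 49.00).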
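
The resource gap can be checked directly from the environment and task counts in Table 1. The snippet below is just arithmetic on those reported figures; the variable names are arbitrary.

```python
# (environments, tasks) as reported in Table 1
resources = {
    "AWM": (526, 3315),
    "EnvScaler": (191, 11572),
    "EnvFactory (Full)": (85, 2575),
}

ref_env, ref_tasks = resources["EnvFactory (Full)"]

# How many times more environments and tasks each baseline uses
ratios = {
    name: (round(env / ref_env, 2), round(tasks / ref_tasks, 2))
    for name, (env, tasks) in resources.items()
    if name != "EnvFactory (Full)"
}

for name, (env_x, task_x) in ratios.items():
    print(f"{name}: {env_x}x environments, {task_x}x tasks")
```

By these counts, AWM uses about 6.2x the environments and 1.3x the tasks, while EnvScaler uses about 2.2x the environments and 4.5x the tasks.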

4.3 Effect of Environment Scaling

Figure 3 shows that increasing the environment pool (from 50 to 85) consistently improves BFCL-v3 multi-turn performance across model scales, indicating broader coverage improves generalization. The scaling curve shows diminishing returns. EnvFactory dominates the efficiency frontier (Figure 3b), achieving higher scores with far fewer resources.

4.4 Ablation Study

Direct RL vs. SFT+RL: Direct RL improves some benchmarks but gains are smaller and less stable than RL after SFT, confirming SFT initialization's importance for stable policy optimization.
Effect of Refinement Stage: Training with refined trajectories consistently outperforms unrefined ones, especially on ambiguous settings (Miss-Func, Miss-Param), confirming refinement improves query quality.
Reward Weighting Coefficient: An ablation over $\alpha$ (trajectory reward weight) on BFCL shows that balanced weighting ($\alpha = 0.5$) performs best. Relying solely on state-based ($\alpha = 0$) or trajectory ($\alpha = 1.0$) reward degrades performance.

Theoretical and Practical Implications

Scalable Agentic RL Foundation: EnvFactory provides a fully automated, scalable pipeline for constructing diverse, executable environments and synthesizing realistic training data, lowering the barrier for Agentic RL research.
Bridging the Realism Gap: The topology-aware sampling and calibrated refinement techniques transform synthetic data from over-specified instruction lists into natural human requests with implicit intents and ambiguity, crucial for training agents that can handle real-world pragmatic communication.
Data Efficiency: The framework demonstrates that high-quality, verified environments and dependency-aware trajectories provide effective supervision from a compact training set, enabling efficient agent training.
Generalization: The method's effectiveness across diverse benchmarks (conversational and compositional) suggests it generalizes well to different tool-use settings.

Conclusion

EnvFactory addresses two critical bottlenecks in Agentic RL for tool-use: the lack of scalable, verifiable environments and the scarcity of realistic training trajectories. By autonomously constructing stateful environments from real-world resources and synthesizing natural queries through topology-aware sampling and refinement, it enables efficient and robust agent training. Experimental results show consistent outperformance over strong baselines in both training efficiency and downstream performance, using significantly fewer resources. This work provides a scalable, extensible foundation for advancing tool-use agent capabilities.