Learning to Retrieve from Agent Trajectories: Summary

Summary (Overview)

  • Problem Identification: Identifies a fundamental mismatch between human-centric retrieval models (trained on clicks/dwell time) and the needs of LLM-powered search agents, which use retrieval as a multi-turn tool within reasoning loops.
  • New Paradigm: Proposes Learning to Retrieve from Agent Trajectories (LRAT), a new training paradigm where supervision is derived from agent interaction data (trajectories) instead of human logs.
  • Key Insights: Analysis reveals that in agent trajectories: (1) Browsing is a necessary condition for success; (2) Unbrowsed documents are reliable negatives without human-like position bias; (3) Post-browse reasoning length is a strong indicator of relevance intensity.
  • Proposed Framework: LRAT mines supervision from trajectories via search-browse transitions, refines positives using LLM-judged reasoning traces, and incorporates relevance intensity via reasoning-length-aware weighted contrastive learning.
  • Empirical Validation: LRAT consistently improves evidence recall, end-to-end task success rate, and execution efficiency across six diverse agent backbones (4B to 358B parameters) and multiple retrievers on in-domain and out-of-domain benchmarks.

Introduction and Theoretical Foundation

Traditional Information Retrieval (IR) systems are built on a human-centric paradigm. Learning-to-rank models are trained on large-scale human interaction logs (clicks, dwell time) and optimized to serve human users, creating a powerful data flywheel.

With the rise of Large Language Model (LLM) powered search agents, the primary user of retrieval systems is shifting from humans to agents. Retrieval is no longer a standalone endpoint but a core tool embedded within an agent's multi-turn reasoning and action loop (e.g., ReAct pattern). However, current agents rely on off-the-shelf retrievers trained on human data, creating a fundamental mismatch:

  • Agent queries are intermediate actions for problem-solving, not final user intents.
  • Relevance patterns and consumption behaviors differ from humans.

This paper argues that retrieval models for the agent era should be trained directly from agent interaction data. It formulates Learning to Retrieve from Agent Trajectories as a new paradigm, analogous to learning from human click logs. Agent trajectories, generated as a byproduct of every agent invocation, provide a rich, abundant, and sustainable source of supervision for building an agentic search data flywheel.

Methodology

1. Preliminaries: Deep Research Agent Trajectories

A Deep Research Agent solves complex information-seeking tasks via iterative interaction with a retrieval system. Given an initial query $q$, it produces a multi-turn trajectory $\mathcal{T} = \{(r_t, a_t, o_t)\}_{t=1}^{T}$.

  • [Think]: Produces a reasoning state $r_t$.
  • [Search]: Generates an intermediate query $q_t$. The retriever returns a top-$K$ candidate set $\mathcal{D}_t = \{d_{t,i}\}_{i=1}^{K}$. The agent observes snippets.
  • [Browse]: Selects one document $d_t$ from a previous $\mathcal{D}_{t'}$ to read fully. The content $o_t$ is observed.
  • [Answer]: Final synthesis after sufficient information is gathered.

The task is to learn a retrieval model from a collection of such trajectories $\{\mathcal{T}\}$, with supervision derived directly from agent behaviors.
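For concreteness, one way to represent such trajectory steps in code is sketched below. The field names and types are illustrative assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    """One (r_t, a_t, o_t) turn of a Deep Research Agent trajectory."""
    reasoning: str                      # r_t produced by [Think]
    action: str                         # "search", "browse", or "answer"
    query: Optional[str] = None         # q_t when action == "search"
    candidates: List[str] = field(default_factory=list)  # top-K doc ids in D_t
    doc_id: Optional[str] = None        # browsed document when action == "browse"
    observation: str = ""               # o_t: snippets or full document content

@dataclass
class Trajectory:
    initial_query: str                  # the task query q
    steps: List[Step]
    success: bool                       # whether the final [Answer] was judged correct

# Example: a one-step trajectory fragment with a single search turn.
traj = Trajectory(
    initial_query="Who discovered element 118?",
    steps=[Step(reasoning="Need a search.", action="search",
                query="element 118 discovery", candidates=["d1", "d2"])],
    success=True,
)
```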

Table 1: Statistics of generated trajectories across different retrievers.

| Retriever | Trajectories | N | Avg. S | Avg. B | B/S | Avg. T |
|---|---|---|---|---|---|---|
| BM25 | Correct | 7,674 | 9.15 | 2.96 | 0.32 | 12.11 |
| BM25 | Incorrect | 1,872 | 29.15 | 5.97 | 0.20 | 35.11 |
| BM25 | All | 9,546 | 13.07 | 3.55 | 0.27 | 16.63 |
| Qwen3-Emb-0.6B | Correct | 5,913 | 12.81 | 3.68 | 0.29 | 16.49 |
| Qwen3-Emb-0.6B | Incorrect | 2,062 | 38.95 | 7.17 | 0.18 | 46.12 |
| Qwen3-Emb-0.6B | All | 7,975 | 19.57 | 4.58 | 0.23 | 24.15 |
| Qwen3-Emb-4B | Correct | 6,354 | 13.24 | 4.11 | 0.31 | 17.34 |
| Qwen3-Emb-4B | Incorrect | 2,121 | 36.13 | 7.47 | 0.21 | 43.60 |
| Qwen3-Emb-4B | All | 8,475 | 18.97 | 4.95 | 0.26 | 23.91 |
| Qwen3-Emb-8B | Correct | 6,541 | 11.86 | 3.69 | 0.31 | 15.55 |
| Qwen3-Emb-8B | Incorrect | 2,082 | 34.47 | 7.20 | 0.21 | 41.67 |
| Qwen3-Emb-8B | All | 8,623 | 17.32 | 4.54 | 0.26 | 21.85 |
| Total | Correct | 26,482 | 11.77 | 3.61 | 0.31 | 15.38 |
| Total | Incorrect | 8,137 | 34.68 | 6.95 | 0.20 | 41.63 |
| Total | All | 34,619 | 17.25 | 4.41 | 0.26 | 21.66 |
S=Search, B=Browse, T=Total Steps. Correct trajectories have higher B/S ratios.

2. Analysis of Agent Trajectories (Key Insights)

  • Browsing is Necessary for Success: Successful trajectories have a higher Browse-to-Search (B/S) ratio. Task success drops to zero if no evidence document is browsed. Therefore, browsed documents are candidate positives.
  • Unbrowsed Documents are Reliable Negatives: Unlike human clicks, agent browsing is distributed relatively evenly across ranking positions (Fig. 4c), indicating weak position bias. Unbrowsed documents likely result from explicit rejection, making them reliable negatives without needing debiasing.
  • Post-Browse Reasoning Indicates Relevance Intensity: The length of the reasoning trace $r_{t+2}$ after browsing is strongly correlated with document utility and task success (Fig. 4d). Longer reasoning indicates deeper engagement, analogous to human dwell time.

3. The LRAT Framework

LRAT mines supervision from trajectories and performs intensity-aware training.

A. Mining Relevance Signals

  1. Naive Mining from Search-Browse Transitions: For a search at turn $t$ yielding candidates $\mathcal{D}_t$, if the next action is [Browse] on $d_{t+1}$, then $(q_t, d_{t+1})$ is a naive positive. All other unbrowsed candidates $\mathcal{N}_t = \mathcal{D}_t \setminus \{d_{t+1}\}$ are naive negatives.
  2. Reasoning-Aware Positive Filtering: Use an LLM judge (Qwen3-30B) to analyze the post-browse reasoning $r_{t+2}$. Label $(q_t, d_{t+1})$ as Relevant or Irrelevant. This filters out browsed-but-unhelpful documents.
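The naive mining step can be sketched as follows. The step representation (plain dicts) and key names are assumptions for illustration, not the paper's released code.

```python
def mine_pairs(steps):
    """Naive mining from search-browse transitions: a [Search] immediately
    followed by a [Browse] on one of its candidates yields
    (query, browsed_doc) as a positive; the unbrowsed candidates
    in the same top-K set become negatives."""
    pairs = []
    for t in range(len(steps) - 1):
        s, nxt = steps[t], steps[t + 1]
        if (s.get("action") == "search" and nxt.get("action") == "browse"
                and nxt["doc_id"] in s["candidates"]):
            pairs.append({
                "query": s["query"],
                "positive": nxt["doc_id"],                   # naive positive
                "negatives": [d for d in s["candidates"]     # unbrowsed = negatives
                              if d != nxt["doc_id"]],
            })
    return pairs

# Toy trajectory: think -> search (3 candidates) -> browse d2
steps = [
    {"action": "think"},
    {"action": "search", "query": "alloy melting point", "candidates": ["d1", "d2", "d3"]},
    {"action": "browse", "doc_id": "d2"},
]
```

Reasoning-aware filtering would then drop any mined positive whose post-browse reasoning an LLM judge labels Irrelevant.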

B. Intensity-Aware Training

  1. Relevance Intensity Estimation: Inspired by time-aware click models, map post-browse reasoning length $l$ to a utility score. The marginal gain follows an exponential decay:

     $$g(x) = \exp\left(-\frac{\ln 2}{\beta}\, x\right) \quad (1)$$

     where $\beta$ is the half-life (set to the median reasoning length). The cumulative relevance utility is:

     $$u(l) = \int_0^l g(x)\, dx = \frac{\beta}{\ln 2}\left(1 - \exp\left(-\frac{\ln 2}{\beta}\, l\right)\right) \quad (2)$$

     The final relevance intensity weight $w$ for training is:

     $$w = \frac{1}{\mu_{\text{raw}}}\left(1 - \exp\left(-\frac{\ln 2 \cdot l}{\beta}\right)\right) \quad (3)$$

     where $\mu_{\text{raw}}$ is a global mean for normalization ($\mathbb{E}[w] \approx 1$).
  2. Weighted Contrastive Learning: Train a bi-encoder dense retriever. For a query $q$ and document $d$, the representations are $e_q, e_d \in \mathbb{R}^h$, with relevance score $s(q, d) = \mathrm{sim}(e_q, e_d)$. The weighted InfoNCE loss for a batch of size $N$ is:

     $$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} w_i \cdot \log \frac{\exp(s(q_i, d_i^+)/\tau)}{\exp(s(q_i, d_i^+)/\tau) + \sum_{d^- \in \mathcal{N}_i} \exp(s(q_i, d^-)/\tau)} \quad (4)$$

     where $w_i$ is from Eq. (3), $d_i^+$ is the positive document, $\mathcal{N}_i$ includes unbrowsed in-set negatives and in-batch negatives, and $\tau$ is the temperature.
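A minimal NumPy sketch of the intensity weight of Eq. (3) and the weighted InfoNCE of Eq. (4), assuming similarity scores have already been computed by the bi-encoder (function names and the score layout are illustrative assumptions):

```python
import math
import numpy as np

def intensity_weight(l, beta, mu_raw):
    """Eq. (3): relevance intensity weight from post-browse reasoning
    length l. beta is the half-life (median reasoning length); mu_raw
    is a global mean normalizer so that weights average ~1."""
    return (1.0 - math.exp(-math.log(2) * l / beta)) / mu_raw

def weighted_infonce(pos_scores, neg_scores, weights, tau=0.05):
    """Eq. (4): weighted InfoNCE over a batch.
    pos_scores: (N,)   similarities s(q_i, d_i^+)
    neg_scores: (N, M) similarities to the negatives in N_i
    weights:    (N,)   intensity weights w_i from Eq. (3)"""
    logits = np.concatenate([pos_scores[:, None], neg_scores], axis=1) / tau
    m = logits.max(axis=1, keepdims=True)                 # stable log-sum-exp
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    log_prob_pos = pos_scores / tau - lse                 # log-softmax of positive
    return float(-(weights * log_prob_pos).mean())
```

Note that at `l == beta` the weight reaches half of its asymptotic maximum, which is exactly the half-life semantics of Eq. (1).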

Empirical Validation / Results

Experimental Setup:

  • Benchmarks: In-domain: InfoSeek-Eval (300 queries). Out-of-domain: BrowseComp-Plus (830 queries, with evidence annotations).
  • Retrievers: Qwen3-Embedding-0.6B (decoder-based) and Multilingual-E5-Large-Instruct (encoder-based).
  • Agents: Six backbones from 4B to 358B parameters: AgentCPM-Explore (4B), WebExplore (8B), Tongyi-DeepResearch (30B), GPT-OSS (120B), MiniMax-M2.1 (229B), GLM-4.7 (358B).
  • Metrics: Success Rate (SR), Evidence Recall, Average Step Count.

Table 2: Main Results on InfoSeek-Eval (ID) and BrowseComp-Plus (OOD). The table shows consistent improvements across all agents and retrievers. Key highlights:

  • Evidence Recall (OOD): LRAT improves recall by 7.4% to 37.9% across agents.
  • Success Rate (ID): LRAT improves SR by 5.1% to 38.2%. Average gain: +28.6%.
  • Success Rate (OOD): LRAT improves SR by 0.0% to 34.4%. Average gain: +27.5%.
  • Efficiency: Average step count is consistently reduced, especially on InfoSeek-Eval (up to ~30%).

Ablation Study (Fig. 7): Incremental addition of LRAT components on BrowseComp-Plus:

  1. +Naive (browsed=pos, unbrowsed=neg): Substantial gains, confirming reliable negatives.
  2. +Filter (LLM-judge filtering): Further improvement, validating reasoning traces as useful indicators.
  3. +Reweight (intensity weighting): Final gains, highlighting the importance of modeling relevance intensity.

Scalability & Robustness:

  • Training Data Size (Fig. 8a): Performance generally improves with more trajectory data (up to 30K), showing no early saturation.
  • Retrieval Top-K (Fig. 8b): LRAT consistently outperforms the base retriever across all $K$ values (1, 5, 10, 20), demonstrating robustness.

Data Flywheel Simulation (Fig. 9):

  • Simulates an iterative loop: Agent uses retriever → Generates trajectories → Retriever is updated with LRAT → Repeat.
  • Results show steady improvements in both Success Rate and Evidence Recall across iterations, confirming the potential for a sustainable, self-improving data flywheel driven by agent interactions.
  • Table 3: Shows that training with incorrect trajectories still yields performance gains (+14.1% to +18.9% SR), indicating they contain useful supervision, further supporting flywheel feasibility.
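The simulated loop above can be expressed schematically as follows. The three callables stand in for the agent rollout, LRAT supervision mining, and retriever training stages; this is a structural sketch, not the paper's implementation.

```python
def flywheel(run_agent, mine_supervision, train_lrat, retriever, tasks, iterations=3):
    """One simulation of the agentic-search data flywheel (Fig. 9):
    the agent generates trajectories with the current retriever,
    supervision is mined from them, and the retriever is retrained."""
    for _ in range(iterations):
        trajectories = [run_agent(task, retriever) for task in tasks]
        supervision = mine_supervision(trajectories)
        retriever = train_lrat(retriever, supervision)
    return retriever
```

Each iteration's trajectories reflect the previous iteration's retriever, which is what lets gains compound across rounds.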

Theoretical and Practical Implications

  • Paradigm Shift: Proposes a necessary shift from human-centric to agent-centric retrieval training, aligning model objectives with actual usage in the era of agentic search.
  • Practical Framework: LRAT provides a simple, effective, and scalable method to leverage abundant agent trajectory data without additional human annotation. It works with arbitrary agents and retrievers.
  • Performance Impact: Demonstrates that optimizing the retriever is a critical and effective lever for improving overall agent system performance; the retriever is often more of a bottleneck than the agent itself.
  • Sustainable Ecosystem: Highlights the potential for a data flywheel in agentic search, where agent interactions continuously improve the retriever, which in turn enables better agents—a virtuous cycle analogous to the human click log flywheel.

Conclusion

This paper identifies the misalignment between human-trained retrievers and agentic search needs, formalizing Learning to Retrieve from Agent Trajectories as a new paradigm. Analysis reveals key behavioral signals in agent trajectories: browsing necessity, unbrowsed negatives, and reasoning-length-based relevance intensity. The proposed LRAT framework effectively converts these signals into supervision for training agent-aligned retrievers.

Extensive experiments show LRAT consistently improves evidence retrieval, task success, and execution efficiency across diverse agent architectures and scales. Furthermore, it demonstrates the potential for a sustainable data flywheel driven by agent interactions. These findings position agent trajectories as a practical and scalable supervision source, pointing to a promising direction for advancing retrieval systems in the agent era.