Learning to Retrieve from Agent Trajectories: Summary

Summary (Overview)

  • Problem Identification: Identifies a fundamental mismatch between human-centric retrieval models (trained on clicks/dwell time) and the needs of LLM-powered search agents, which use retrieval as a multi-turn tool within reasoning loops.
  • New Paradigm: Proposes Learning to Retrieve from Agent Trajectories (LRAT), a new training paradigm where supervision is derived from agent interaction data (trajectories) instead of human logs.
  • Key Insights: Analysis reveals that in agent trajectories: (1) Browsing is a necessary condition for success; (2) Unbrowsed documents are reliable negatives without human-like position bias; (3) Post-browse reasoning length is a strong indicator of relevance intensity.
  • Proposed Framework: LRAT mines supervision from trajectories via search-browse transitions, refines positives using LLM-judged reasoning traces, and incorporates relevance intensity via reasoning-length-aware weighted contrastive learning.
  • Empirical Validation: LRAT consistently improves evidence recall, end-to-end task success rate, and execution efficiency across six diverse agent backbones (4B to 358B parameters) and multiple retrievers on in-domain and out-of-domain benchmarks.

Introduction and Theoretical Foundation

Traditional Information Retrieval (IR) systems are built on a human-centric paradigm. Learning-to-rank models are trained on large-scale human interaction logs (clicks, dwell time) and optimized to serve human users, creating a powerful data flywheel.

With the rise of Large Language Model (LLM) powered search agents, the primary user of retrieval systems is shifting from humans to agents. Retrieval is no longer a standalone endpoint but a core tool embedded within an agent's multi-turn reasoning and action loop (e.g., ReAct pattern). However, current agents rely on off-the-shelf retrievers trained on human data, creating a fundamental mismatch:

  • Agent queries are intermediate actions for problem-solving, not final user intents.
  • Relevance patterns and consumption behaviors differ from humans.

This paper argues that retrieval models for the agent era should be trained directly from agent interaction data. It formulates Learning to Retrieve from Agent Trajectories as a new paradigm, analogous to learning from human click logs. Agent trajectories, generated as a byproduct of every agent invocation, provide a rich, abundant, and sustainable source of supervision for building an agentic search data flywheel.

Methodology

1. Preliminaries: Deep Research Agent Trajectories

A Deep Research Agent solves complex information-seeking tasks via iterative interaction with a retrieval system. Given an initial query $q$, it produces a multi-turn trajectory $\mathcal{T} = \{(r_t, a_t, o_t)\}_{t=1}^{T}$.

  • [Think]: Produces a reasoning state $r_t$.
  • [Search]: Generates an intermediate query $q_t$. The retriever returns a top-$K$ candidate set $\mathcal{D}_t = \{d_{t,i}\}_{i=1}^{K}$. The agent observes snippets.
  • [Browse]: Selects one document $d_t$ from a previous $\mathcal{D}_{t'}$ to read fully. The content $o_t$ is observed.
  • [Answer]: Final synthesis after sufficient information is gathered.

The task is to learn a retrieval model from a collection of such trajectories $\{\mathcal{T}\}$, with supervision derived directly from agent behaviors.
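For concreteness, one way to represent such trajectory steps in code is sketched below. The field names and types are illustrative assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    """One (r_t, a_t, o_t) turn of a Deep Research Agent trajectory."""
    reasoning: str                      # r_t produced by [Think]
    action: str                         # "search", "browse", or "answer"
    query: Optional[str] = None         # q_t when action == "search"
    candidates: List[str] = field(default_factory=list)  # top-K doc ids in D_t
    doc_id: Optional[str] = None        # browsed document when action == "browse"
    observation: str = ""               # o_t: snippets or full document content

@dataclass
class Trajectory:
    initial_query: str                  # the task query q
    steps: List[Step]
    success: bool                       # whether the final [Answer] was judged correct

# Example: a one-step trajectory fragment with a single search turn.
traj = Trajectory(
    initial_query="Who discovered element 118?",
    steps=[Step(reasoning="Need a search.", action="search",
                query="element 118 discovery", candidates=["d1", "d2"])],
    success=True,
)
```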

Table 1: Statistics of generated trajectories across different retrievers.

| Retriever | Trajectories | N | Avg. S | Avg. B | B/S | Avg. T |
|---|---|---|---|---|---|---|
| BM25 | Correct | 7,674 | 9.15 | 2.96 | 0.32 | 12.11 |
| BM25 | Incorrect | 1,872 | 29.15 | 5.97 | 0.20 | 35.11 |
| BM25 | All | 9,546 | 13.07 | 3.55 | 0.27 | 16.63 |
| Qwen3-Emb-0.6B | Correct | 5,913 | 12.81 | 3.68 | 0.29 | 16.49 |
| Qwen3-Emb-0.6B | Incorrect | 2,062 | 38.95 | 7.17 | 0.18 | 46.12 |
| Qwen3-Emb-0.6B | All | 7,975 | 19.57 | 4.58 | 0.23 | 24.15 |
| Qwen3-Emb-4B | Correct | 6,354 | 13.24 | 4.11 | 0.31 | 17.34 |
| Qwen3-Emb-4B | Incorrect | 2,121 | 36.13 | 7.47 | 0.21 | 43.60 |
| Qwen3-Emb-4B | All | 8,475 | 18.97 | 4.95 | 0.26 | 23.91 |
| Qwen3-Emb-8B | Correct | 6,541 | 11.86 | 3.69 | 0.31 | 15.55 |
| Qwen3-Emb-8B | Incorrect | 2,082 | 34.47 | 7.20 | 0.21 | 41.67 |
| Qwen3-Emb-8B | All | 8,623 | 17.32 | 4.54 | 0.26 | 21.85 |
| Total | Correct | 26,482 | 11.77 | 3.61 | 0.31 | 15.38 |
| Total | Incorrect | 8,137 | 34.68 | 6.95 | 0.20 | 41.63 |
| Total | All | 34,619 | 17.25 | 4.41 | 0.26 | 21.66 |
S=Search, B=Browse, T=Total Steps. Correct trajectories have higher B/S ratios.

2. Analysis of Agent Trajectories (Key Insights)

  • Browsing is Necessary for Success: Successful trajectories have a higher Browse-to-Search (B/S) ratio. Task success drops to zero if no evidence document is browsed. Therefore, browsed documents are candidate positives.
  • Unbrowsed Documents are Reliable Negatives: Unlike human clicks, agent browsing is distributed relatively evenly across ranking positions (Fig. 4c), indicating weak position bias. Unbrowsed documents likely result from explicit rejection, making them reliable negatives without needing debiasing.
  • Post-Browse Reasoning Indicates Relevance Intensity: The length of the reasoning trace $r_{t+2}$ after browsing is strongly correlated with document utility and task success (Fig. 4d). Longer reasoning indicates deeper engagement, analogous to human dwell time.

3. The LRAT Framework

LRAT mines supervision from trajectories and performs intensity-aware training.

A. Mining Relevance Signals

  1. Naive Mining from Search-Browse Transitions: For a search at turn $t$ yielding candidates $\mathcal{D}_t$, if the next action is [Browse] on $d_{t+1}$, then $(q_t, d_{t+1})$ is a naive positive. All other unbrowsed candidates $\mathcal{N}_t = \mathcal{D}_t \setminus \{d_{t+1}\}$ are naive negatives.
  2. Reasoning-Aware Positive Filtering: Use an LLM judge (Qwen3-30B) to analyze the post-browse reasoning $r_{t+2}$. Label $(q_t, d_{t+1})$ as Relevant or Irrelevant. This filters out browsed-but-unhelpful documents.
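The naive mining step can be sketched as follows. The step representation (plain dicts) and key names are assumptions for illustration, not the paper's released code.

```python
def mine_pairs(steps):
    """Naive mining from search-browse transitions: a [Search] immediately
    followed by a [Browse] on one of its candidates yields
    (query, browsed_doc) as a positive; the unbrowsed candidates
    in the same top-K set become negatives."""
    pairs = []
    for t in range(len(steps) - 1):
        s, nxt = steps[t], steps[t + 1]
        if (s.get("action") == "search" and nxt.get("action") == "browse"
                and nxt["doc_id"] in s["candidates"]):
            pairs.append({
                "query": s["query"],
                "positive": nxt["doc_id"],                   # naive positive
                "negatives": [d for d in s["candidates"]     # unbrowsed = negatives
                              if d != nxt["doc_id"]],
            })
    return pairs

# Toy trajectory: think -> search (3 candidates) -> browse d2
steps = [
    {"action": "think"},
    {"action": "search", "query": "alloy melting point", "candidates": ["d1", "d2", "d3"]},
    {"action": "browse", "doc_id": "d2"},
]
```

Reasoning-aware filtering would then drop any mined positive whose post-browse reasoning an LLM judge labels Irrelevant.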

B. Intensity-Aware Training

  1. Relevance Intensity Estimation: Inspired by time-aware click models, map post-browse reasoning length $l$ to a utility score. The marginal gain follows an exponential decay:

     $$g(x) = \exp\left(-\frac{\ln 2}{\beta}\, x\right) \quad (1)$$

     where $\beta$ is the half-life (set to the median reasoning length). The cumulative relevance utility is:

     $$u(l) = \int_0^l g(x)\, dx = \frac{\beta}{\ln 2}\left(1 - \exp\left(-\frac{\ln 2}{\beta}\, l\right)\right) \quad (2)$$

     The final relevance intensity weight $w$ for training is:

     $$w = \frac{1}{\mu_{\text{raw}}}\left(1 - \exp\left(-\frac{\ln 2 \cdot l}{\beta}\right)\right) \quad (3)$$

     where $\mu_{\text{raw}}$ is a global mean for normalization ($\mathbb{E}[w] \approx 1$).
  2. Weighted Contrastive Learning: Train a bi-encoder dense retriever. For a query $q$ and document $d$, the representations are $e_q, e_d \in \mathbb{R}^h$, with relevance score $s(q, d) = \mathrm{sim}(e_q, e_d)$. The weighted InfoNCE loss for a batch of size $N$ is:

     $$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} w_i \cdot \log \frac{\exp(s(q_i, d_i^+)/\tau)}{\exp(s(q_i, d_i^+)/\tau) + \sum_{d^- \in \mathcal{N}_i} \exp(s(q_i, d^-)/\tau)} \quad (4)$$

     where $w_i$ is from Eq. (3), $d_i^+$ is the positive document, $\mathcal{N}_i$ includes unbrowsed in-set negatives and in-batch negatives, and $\tau$ is the temperature.
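A minimal NumPy sketch of the intensity weight of Eq. (3) and the weighted InfoNCE of Eq. (4), assuming similarity scores have already been computed by the bi-encoder (function names and the score layout are illustrative assumptions):

```python
import math
import numpy as np

def intensity_weight(l, beta, mu_raw):
    """Eq. (3): relevance intensity weight from post-browse reasoning
    length l. beta is the half-life (median reasoning length); mu_raw
    is a global mean normalizer so that weights average ~1."""
    return (1.0 - math.exp(-math.log(2) * l / beta)) / mu_raw

def weighted_infonce(pos_scores, neg_scores, weights, tau=0.05):
    """Eq. (4): weighted InfoNCE over a batch.
    pos_scores: (N,)   similarities s(q_i, d_i^+)
    neg_scores: (N, M) similarities to the negatives in N_i
    weights:    (N,)   intensity weights w_i from Eq. (3)"""
    logits = np.concatenate([pos_scores[:, None], neg_scores], axis=1) / tau
    m = logits.max(axis=1, keepdims=True)                 # stable log-sum-exp
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    log_prob_pos = pos_scores / tau - lse                 # log-softmax of positive
    return float(-(weights * log_prob_pos).mean())
```

Note that at `l == beta` the weight reaches half of its asymptotic maximum, which is exactly the half-life semantics of Eq. (1).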

Empirical Validation / Results

Experimental Setup:

  • Benchmarks: In-domain: InfoSeek-Eval (300 queries). Out-of-domain: BrowseComp-Plus (830 queries, with evidence annotations).
  • Retrievers: Qwen3-Embedding-0.6B (decoder-based) and Multilingual-E5-Large-Instruct (encoder-based).
  • Agents: Six backbones from 4B to 358B parameters: AgentCPM-Explore (4B), WebExplore (8B), Tongyi-DeepResearch (30B), GPT-OSS (120B), MiniMax-M2.1 (229B), GLM-4.7 (358B).
  • Metrics: Success Rate (SR), Evidence Recall, Average Step Count.

Table 2: Main Results on InfoSeek-Eval (ID) and BrowseComp-Plus (OOD). The table shows consistent improvements across all agents and retrievers. Key highlights:

  • Evidence Recall (OOD): LRAT improves recall by 7.4% to 37.9% across agents.
  • Success Rate (ID): LRAT improves SR by 5.1% to 38.2%. Average gain: +28.6%.
  • Success Rate (OOD): LRAT improves SR by 0.0% to 34.4%. Average gain: +27.5%.
  • Efficiency: Average step count is consistently reduced, especially on InfoSeek-Eval (up to ~30%).

Ablation Study (Fig. 7): Incremental addition of LRAT components on BrowseComp-Plus:

  1. +Naive (browsed=pos, unbrowsed=neg): Substantial gains, confirming reliable negatives.
  2. +Filter (LLM-judge filtering): Further improvement, validating reasoning traces as useful indicators.
  3. +Reweight (intensity weighting): Final gains, highlighting the importance of modeling relevance intensity.

Scalability & Robustness:

  • Training Data Size (Fig. 8a): Performance generally improves with more trajectory data (up to 30K), showing no early saturation.
  • Retrieval Top-K (Fig. 8b): LRAT consistently outperforms the base retriever across all $K$ values (1, 5, 10, 20), demonstrating robustness.

Data Flywheel Simulation (Fig. 9):

  • Simulates an iterative loop: Agent uses retriever → Generates trajectories → Retriever is updated with LRAT → Repeat.
  • Results show steady improvements in both Success Rate and Evidence Recall across iterations, confirming the potential for a sustainable, self-improving data flywheel driven by agent interactions.
  • Table 3: Shows that training with incorrect trajectories still yields performance gains (+14.1% to +18.9% SR), indicating they contain useful supervision, further supporting flywheel feasibility.
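The simulated loop above can be expressed schematically as follows. The three callables stand in for the agent rollout, LRAT supervision mining, and retriever training stages; this is a structural sketch, not the paper's implementation.

```python
def flywheel(run_agent, mine_supervision, train_lrat, retriever, tasks, iterations=3):
    """One simulation of the agentic-search data flywheel (Fig. 9):
    the agent generates trajectories with the current retriever,
    supervision is mined from them, and the retriever is retrained."""
    for _ in range(iterations):
        trajectories = [run_agent(task, retriever) for task in tasks]
        supervision = mine_supervision(trajectories)
        retriever = train_lrat(retriever, supervision)
    return retriever
```

Each iteration's trajectories reflect the previous iteration's retriever, which is what lets gains compound across rounds.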

Theoretical and Practical Implications

  • Paradigm Shift: Proposes a necessary shift from human-centric to agent-centric retrieval training, aligning model objectives with actual usage in the era of agentic search.
  • Practical Framework: LRAT provides a simple, effective, and scalable method to leverage abundant agent trajectory data without additional human annotation. It works with arbitrary agents and retrievers.
  • Performance Impact: Demonstrates that optimizing the retriever is a critical and effective lever for improving overall agent system performance; the retriever is often more of a bottleneck than the agent itself.
  • Sustainable Ecosystem: Highlights the potential for a data flywheel in agentic search, where agent interactions continuously improve the retriever, which in turn enables better agents—a virtuous cycle analogous to the human click log flywheel.

Conclusion

This paper identifies the misalignment between human-trained retrievers and agentic search needs, formalizing Learning to Retrieve from Agent Trajectories as a new paradigm. Analysis reveals key behavioral signals in agent trajectories: browsing necessity, unbrowsed negatives, and reasoning-length-based relevance intensity. The proposed LRAT framework effectively converts these signals into supervision for training agent-aligned retrievers.

Extensive experiments show LRAT consistently improves evidence retrieval, task success, and execution efficiency across diverse agent architectures and scales. Furthermore, it demonstrates the potential for a sustainable data flywheel driven by agent interactions. These findings position agent trajectories as a practical and scalable supervision source, pointing to a promising direction for advancing retrieval systems in the agent era.