LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Summary (Overview)

  • Problem: Large language models (LLMs) struggle with long-context reasoning, often failing to locate and integrate key information amidst extensive distracting content.
  • Novel Data Construction: Introduces a traj-tiered strategy to generate challenging training contexts. Distractors are extracted from real search agent trajectories: Tier-1 (high confusability: documents opened but not cited) and Tier-2 (low confusability: documents searched but not opened).
  • Novel Reward Design: Proposes a rubric reward that uses gold entities along the reasoning chain as fine-grained, entity-level process supervision. It is applied with a positive-only strategy (only to responses with correct final answers) to prevent reward hacking.
  • Key Results: Experiments on three reasoning LLMs (4B–30B) across five long-context benchmarks show LongTraceRL consistently outperforms strong baselines. For example, Qwen3-4B achieves an average gain of +5.7 points over the base model and surpasses the strongest baseline by +2.5 points.
  • Core Insight: Combining realistic, challenging distractors with entity-level process supervision encourages models to perform comprehensive, evidence-grounded reasoning, leading to significant performance improvements.

Introduction and Theoretical Foundation

Long-context reasoning is critical for LLMs but remains a major challenge. Models often hallucinate, rely on fragmented retrieval, or cite irrelevant passages when processing extensive contexts. Reinforcement Learning with Verifiable Rewards (RLVR) has shown promise but is limited for long-context tasks due to:

  1. Low-quality training data: Existing methods use distractors sampled randomly from unrelated documents, which lack semantic relevance and provide low confusability.
  2. Sparse reward signals: Relying solely on outcome-based (final answer) rewards provides no guidance for intermediate reasoning steps, allowing models to guess correctly via wrong paths ("reward hacking").

This paper addresses both limitations. Theoretically, it builds on the premise that effective long-context reasoning requires both:

  • Challenging, realistic training environments that force the model to carefully distinguish relevant from distracting information.
  • Fine-grained, process-level supervision to guide the model's intermediate reasoning steps and prevent shortcuts.

LongTraceRL is introduced as a framework that synthesizes high-quality data from search agent trajectories and provides supervision via an entity-level rubric reward.

Methodology

The LongTraceRL framework consists of two main components.

3.1 Data Construction Pipeline

A four-step pipeline generates long-context training data with agent-derived distractors.

  1. Multi-Hop Question Generation: Inspired by Lu et al. (2025), complex questions are generated via knowledge graph random walks over the KILT Wikipedia snapshot.

    • Starting from a seed entity $v_0$, a controlled random walk of $k$ (= 8) steps forms a path $P = [v_0, v_1, \ldots, v_k]$.
    • A powerful LLM (e.g., GPT-5.2) synthesizes a question requiring step-by-step reasoning through all entities in $P$. The answer is a specific attribute of the last entity $v_k$.
    • The prompt enforces constraints: no shortcuts, paraphrased clues (no direct keyword matching), and a unique final answer.
    • Output: Question text, ground-truth answer, and the set of gold entities $E = \{e_1, e_2, \ldots, e_k\}$ with their corresponding Wikipedia passages.
  2. Agent Search Trajectory Collection: A search agent (capable of SEARCH, OPEN, CITE) attempts to answer each question. Its complete trajectory $\tau = [(a_1, d_1), (a_2, d_2), \ldots]$ is recorded.

    • Trajectory Filtering: Only trajectories where the agent reaches the correct final answer are retained ($K = 5$ attempts per question). This ensures meaningful, goal-directed search behavior.
  3. Tiered Distractor Extraction: Documents from the trajectory (excluding gold passages) are divided into two tiers:

    • Tier-1 (High Confusability): Documents the agent opened and read but did not cite. These are topically relevant and were deemed worth reading.
    • Tier-2 (Low Confusability): Documents that appeared in search results but were never opened. These are only superficially related.
  4. Long-Context Assembly: The final context is assembled using the traj-tiered strategy.

    • Start with gold passages.
    • Add all Tier-1 distractors.
    • If the target length $L$ (128K tokens) is not reached, add Tier-2 distractors.
    • Shuffle all documents to prevent positional bias.
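
Steps 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Step` record, the function names, the convention that every search hit is logged as its own `("SEARCH", doc_id)` step, and the rule of stopping before the token budget would be exceeded are all hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str  # "SEARCH", "OPEN", or "CITE" (assumed logging format)
    doc_id: str  # document surfaced, opened, or cited by this step


def split_tiers(trajectory, gold_ids):
    """Return (tier1, tier2) document ids in first-seen order."""
    searched, opened, cited = [], [], []
    buckets = {"SEARCH": searched, "OPEN": opened, "CITE": cited}
    for step in trajectory:
        bucket = buckets[step.action]
        if step.doc_id not in bucket:
            bucket.append(step.doc_id)
    # Tier-1: opened and read, but never cited; gold passages are excluded.
    tier1 = [d for d in opened if d not in cited and d not in gold_ids]
    # Tier-2: appeared in search results, but never opened.
    tier2 = [d for d in searched if d not in opened and d not in gold_ids]
    return tier1, tier2


def assemble_context(gold_ids, tier1, tier2, token_len, target_len, seed=0):
    """Gold passages, then all Tier-1, then Tier-2 while the budget allows."""
    docs = list(gold_ids) + list(tier1)
    total = sum(token_len[d] for d in docs)
    for d in tier2:
        # Assumption: stop before the target length would be exceeded.
        if total + token_len[d] > target_len:
            break
        docs.append(d)
        total += token_len[d]
    # Shuffle so that gold passages do not sit at predictable positions.
    random.Random(seed).shuffle(docs)
    return docs


# Toy trajectory: "G" is the gold passage, "A" was read but not cited,
# "B" and "C" only appeared in search results.
trajectory = [
    Step("SEARCH", "G"), Step("SEARCH", "A"), Step("SEARCH", "B"),
    Step("SEARCH", "C"), Step("OPEN", "G"), Step("OPEN", "A"),
    Step("CITE", "G"),
]
tier1, tier2 = split_tiers(trajectory, gold_ids={"G"})
lengths = {"G": 100, "A": 100, "B": 100, "C": 100}
context = assemble_context(["G"], tier1, tier2, lengths, target_len=300)
print(tier1, tier2, sorted(context))
```

In the toy trajectory, "A" lands in Tier-1 and "B" and "C" in Tier-2; with a 300-token budget only "B" is added from Tier-2 before the context is shuffled.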

3.2 RL with Rubric Reward

The Group Relative Policy Optimization (GRPO) algorithm is used with a novel composite reward.

  • Outcome Reward ($r_{oc}$): A binary reward in $\{0, 1\}$ based on the correctness of the final answer, judged by an LLM.
  • Rubric Reward: Measures the recall of gold entities $E$ in the model's response.
    • Raw Rubric Score: $\hat{r}_{rb} = \frac{|\{e \in E \mid e \text{ appears in the response}\}|}{|E|}$.
    • Group-Level Normalization: To ensure comparability across questions, the score is normalized within each group of $G$ responses:
    $$r_{rb} = \begin{cases} \frac{\hat{r}_{rb}}{\max_{j \in [G]} \hat{r}_{rb}^{(j)}}, & \text{if } \max_{j \in [G]} \hat{r}_{rb}^{(j)} > 0 \\ 0, & \text{otherwise} \end{cases}$$
  • Positive-Only Reward Combination: The rubric reward is only granted to responses with a correct final answer to prevent reward hacking (e.g., enumerating entities without reasoning).
    $$r = \begin{cases} (1 - \alpha) \cdot r_{oc} + \alpha \cdot r_{rb}, & \text{if } r_{oc} > 0 \\ 0, & \text{otherwise} \end{cases}$$
    The hyperparameter $\alpha \in [0, 1]$ controls the weight of process supervision.
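
The three formulas above compose into a single per-group reward computation, sketched below. The sketch makes assumptions the summary does not confirm: entity matching is done by case-insensitive substring search, the correctness flags are passed in as booleans (in the paper they come from an LLM judge), and the function names and example entities are made up for illustration.

```python
def raw_rubric_score(response, gold_entities):
    """Fraction of gold entities mentioned in the response (entity recall)."""
    text = response.lower()
    # Assumption: an entity "appears" if its name is a substring of the text.
    hits = sum(1 for e in gold_entities if e.lower() in text)
    return hits / len(gold_entities)


def group_rewards(responses, correct, gold_entities, alpha=0.3):
    """Composite reward r for each of the G responses to one question."""
    raw = [raw_rubric_score(r, gold_entities) for r in responses]
    best = max(raw)
    # Group-level normalization: divide by the best raw score in the group.
    r_rb = [s / best if best > 0 else 0.0 for s in raw]
    rewards = []
    for is_correct, rb in zip(correct, r_rb):
        r_oc = 1.0 if is_correct else 0.0
        # Positive-only: the rubric term counts only for correct answers.
        rewards.append((1 - alpha) * r_oc + alpha * rb if r_oc > 0 else 0.0)
    return rewards


# Hypothetical group of G = 3 responses with made-up gold entities.
gold = ["Marie Curie", "Sorbonne", "Paris", "Seine"]
responses = [
    "Marie Curie taught at the Sorbonne in Paris, on the Seine.",
    "The answer involves Paris.",
    "Marie Curie, Sorbonne, Paris, Seine.",  # lists entities, wrong answer
]
correct = [True, True, False]
print(group_rewards(responses, correct, gold))
```

With $\alpha = 0.3$, the first response receives approximately 1.0, the second approximately 0.775 (rubric score 0.25 after normalization), and the third receives 0 despite naming every entity, because its final answer is wrong.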

Empirical Validation / Results

4.1 Experimental Setup

  • Models: Qwen3-4B-Thinking-2507, DeepSeek-R1-0528-Qwen3-8B, Qwen3-30B-A3B-Thinking-2507.
  • Training Data: 2,815 long-context QA examples (8-hop questions, 128K context) constructed via the LongTraceRL pipeline.
  • Baselines: Trained on existing RL datasets: DocQA, LoongRL, LongRLVR.
  • Benchmarks: AA-LCR, MRCR, Frames, LongBench v2, LongReason.

4.2 Main Results

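LongTraceRL outperforms all baselines on average across all model scales. In the Qwen3-4B rows of Table 1, it is best on four of the five benchmarks; DocQA is marginally higher on LongBench v2 (44.6 vs. 44.1).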

Table 1: Main results on long-context reasoning benchmarks.

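| Method | AA-LCR | MRCR | Frames | LongBench v2 | LongReason | Avg |
|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | | | | | | |
| Base | 33.2 | 36.2 | 76.7 | 41.7 | 78.5 | 53.3 |
| DocQA | 28.8 | 41.9 | 78.3 | 44.6 | 79.9 | 54.7 |
| LoongRL | 32.0 | 38.2 | 75.8 | 41.8 | 78.7 | 53.3 |
| LongRLVR | 37.5 | 41.8 | 78.5 | 43.8 | 80.7 | 56.5 |
| LongTraceRL-GRPO | 34.0 | 38.9 | 76.1 | 40.7 | 78.7 | 53.7 |
| LongTraceRL | 41.8 | 45.8 | 79.5 | 44.1 | 83.8 | 59.0 |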

Key Findings:

  • Qwen3-4B with LongTraceRL achieves an average score of 59.0, a +5.7 point gain over the base model and +2.5 points over the strongest baseline (LongRLVR).
  • Gains are robust across model families and scales (4B, 8B, 30B).
  • Ablating the rubric reward (LongTraceRL-GRPO) causes a significant drop in performance (59.0 → 53.7), confirming its critical role.
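
The averages and gains quoted above follow directly from the per-benchmark scores in Table 1. The short script below recomputes them as a sanity check; it uses only the Qwen3-4B numbers from the table, and the variable names are illustrative.

```python
# Per-benchmark scores from Table 1 (Qwen3-4B-Thinking-2507), in the order
# AA-LCR, MRCR, Frames, LongBench v2, LongReason.
TABLE1 = {
    "Base":             [33.2, 36.2, 76.7, 41.7, 78.5],
    "DocQA":            [28.8, 41.9, 78.3, 44.6, 79.9],
    "LoongRL":          [32.0, 38.2, 75.8, 41.8, 78.7],
    "LongRLVR":         [37.5, 41.8, 78.5, 43.8, 80.7],
    "LongTraceRL-GRPO": [34.0, 38.9, 76.1, 40.7, 78.7],
    "LongTraceRL":      [41.8, 45.8, 79.5, 44.1, 83.8],
}
BASELINES = ("DocQA", "LoongRL", "LongRLVR")


def average(scores):
    """Mean over the five benchmarks, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


avg = {method: average(scores) for method, scores in TABLE1.items()}
best_baseline = max(BASELINES, key=lambda m: avg[m])

gain_over_base = round(avg["LongTraceRL"] - avg["Base"], 1)
gain_over_best_baseline = round(avg["LongTraceRL"] - avg[best_baseline], 1)
ablation_drop = round(avg["LongTraceRL"] - avg["LongTraceRL-GRPO"], 1)

print(avg)
print(best_baseline, gain_over_base, gain_over_best_baseline, ablation_drop)
```

The recomputed values match the reported ones: LongRLVR is the strongest baseline by average, the gains are 5.7 and 2.5 points, and removing the rubric reward costs 5.3 points of average score.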

Training Dynamics (Figure 3):

  • The rubric reward grows steadily during training, indicating the model learns to ground reasoning in gold entities.
  • The outcome reward for LongTraceRL also rises and dominates that of the GRPO variant, showing the rubric reward helps reach correct answers.
  • The response length increases with LongTraceRL, showing it encourages more deliberate reasoning. The positive-only strategy combined with a finite response budget prevents reward hacking by self-regulating length.

4.3 Ablation Studies

1. Rubric Ratio $\alpha$: The weight $\alpha$ of the rubric reward in the composite reward affects final performance. $\alpha = 0.3$ yields the best average score (59.0). A lower weight ($\alpha = 0.1$) gives 58.3 and a higher weight ($\alpha = 0.5$) gives 57.1.

Table 2: Performance of LongTraceRL with different rubric reward weights $\alpha$.

| Method | AA-LCR | MRCR | Frames | LongBench v2 | LongReason | Avg |
|---|---|---|---|---|---|---|
| + LongTraceRL ($\alpha = 0.1$) | 39.2 | 46.1 | 79.0 | 44.2 | 82.8 | 58.3 |
| + LongTraceRL ($\alpha = 0.3$) | 41.8 | 45.8 | 79.5 | 44.1 | 83.8 | 59.0 |
| + LongTraceRL ($\alpha = 0.5$) | 39.0 | 43.7 | 77.5 | 43.5 | 81.7 | 57.1 |

2. Source of Distractors: The traj-tiered distractor strategy is more effective than the alternatives, and the average score rises with distractor difficulty.

Table and Analysis:

  • random: Random sampling from a global pool. Easy distractors (1.35% entity overlap), lowest score (55.7).
  • search: One-shot search engine results. Moderate difficulty (15.00% overlap), score 56.7.
  • traj-random: Pooled trajectory distractors sampled randomly. High difficulty (42.16% overlap), score 57.4.
  • traj-tiered: Our method, prioritizing Tier-1 distractors. Highest difficulty (50.03% overlap, Tier-1 alone 63.23%), best score (59.0).

Table 4: Statistics on how much distractors overlap with rubric entities. Higher ratios indicate harder distractors.

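| Distractor Strategy | # Distractors | # with Rubric Entities | Macro Avg (%) |
|---|---|---|---|
| traj-tiered | 62137 | 29050 | 50.03 |
| traj-random | 64066 | 26528 | 42.16 |
| search | 31412 | 4372 | 15.00 |
| random | 45392 | 565 | 1.35 |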
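
The summary does not define how the macro average in Table 4 is computed. One plausible reading, assumed here, is that the overlap ratio is computed per question and then averaged over questions, as opposed to a pooled (micro) ratio over all distractors. The sketch below computes the pooled ratio from the table's counts and shows on toy data how the two can differ; the function names and the toy pairs are hypothetical.

```python
def micro_ratio(n_with_rubric, n_distractors):
    """Pooled percentage of distractors that contain a rubric entity."""
    return 100.0 * n_with_rubric / n_distractors


def macro_ratio(per_question):
    """Mean of per-question percentages (assumed definition of 'macro').

    per_question: list of (n_with_rubric, n_distractors), one pair per question.
    """
    ratios = [100.0 * w / n for w, n in per_question if n > 0]
    return sum(ratios) / len(ratios)


# Counts from Table 4: (number of distractors, number with rubric entities).
TABLE4 = {
    "traj-tiered": (62137, 29050),
    "traj-random": (64066, 26528),
    "search": (31412, 4372),
    "random": (45392, 565),
}
micro = {name: round(micro_ratio(w, n), 2) for name, (n, w) in TABLE4.items()}
print(micro)

# Toy data: a question with few distractors pulls the macro average up.
toy = [(1, 2), (1, 10)]
print(macro_ratio(toy), round(micro_ratio(2, 12), 2))
```

The pooled ratios come out slightly below the reported macro averages (about 46.8% versus 50.03% for traj-tiered), but the ordering of the four strategies is the same.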

3. Positive-Only Strategy: Removing the positive-only constraint (applying the rubric reward to all responses) causes a clear performance drop (average 59.0 → 57.1). The training dynamics show this variant has a misleadingly higher combined reward because incorrect responses still gain rubric points, biasing the policy toward entity enumeration rather than genuine reasoning.
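
A small numeric illustration of why the unconstrained variant reports a higher combined reward: the group below is hypothetical (two correct and two incorrect responses with made-up normalized rubric scores), $\alpha = 0.3$ is assumed, and `combined_reward` is an illustrative helper rather than code from the paper.

```python
def combined_reward(r_oc, r_rb, alpha=0.3, positive_only=True):
    """Composite reward, with or without the positive-only gate."""
    if positive_only and r_oc <= 0:
        return 0.0  # incorrect answers earn nothing, whatever they mention
    return (1 - alpha) * r_oc + alpha * r_rb


# Hypothetical group: (outcome reward, normalized rubric reward).
group = [(1, 1.0), (1, 0.5), (0, 1.0), (0, 0.25)]

gated = [combined_reward(oc, rb, positive_only=True) for oc, rb in group]
ungated = [combined_reward(oc, rb, positive_only=False) for oc, rb in group]

mean_gated = sum(gated) / len(gated)
mean_ungated = sum(ungated) / len(ungated)
print(gated, mean_gated)
print(ungated, mean_ungated)
```

Without the gate, the third response earns about 0.3 purely for mentioning every gold entity, so the group's mean reward is higher even though no additional question was answered correctly. With the gate, both incorrect responses receive 0.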

Theoretical and Practical Implications

  • Theoretical Contribution: Demonstrates the importance of fine-grained, entity-level process supervision (rubric reward) and realistic, behaviorally-derived training environments (trajectory distractors) for advancing long-context RL.
  • Practical Impact: Provides a scalable recipe for constructing high-quality long-context training data without requiring human annotation. The use of search agent trajectories bridges the gap between synthetic data and realistic retrieval scenarios.
  • Prevention of Reward Hacking: The positive-only reward combination is a simple yet effective mechanism to ensure process-level rewards incentivize correct reasoning paths without being gamed.
  • Generalizability: Improvements are consistent across multiple model families (Qwen, DeepSeek) and scales (4B to 30B), suggesting the approach is widely applicable.

Conclusion

LongTraceRL presents an effective framework for improving long-context reasoning in LLMs through agent trajectory-based data construction and entity-level rubric rewards. Key takeaways:

  • Search trajectories provide a rich source for constructing tiered, challenging distractors that are more effective than random or one-shot search alternatives.
  • The rubric reward with a positive-only strategy offers fine-grained process supervision that significantly boosts performance and encourages evidence-grounded reasoning without being hacked.
  • Comprehensive experiments across five benchmarks and three model scales demonstrate consistent and substantial improvements over existing methods.

Future Directions & Limitations:

  • Limitation: The data pipeline relies on a single knowledge source (Wikipedia), which may limit reasoning pattern diversity.
  • Limitation: The quality of distractors depends on the capability of the deployed search agent.
  • Future Work: Investigating the influence of agent capability on data quality, and extending the knowledge source to more diverse domains.
