GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment - Summary

Summary (Overview)

  • Capability-Oriented Dataset: Introduces and openly releases a 23K-sample Reinforcement Learning with Verifiable Rewards (RLVR) dataset spanning 9 distinct long-context task types (e.g., retrieval, comprehension, ranking, summarization), each paired with its natural evaluation metric (EM, Accuracy, F1, NDCG, etc.). This design moves beyond homogeneous retrieval-path data to provide broader and more diverse supervision.
  • TMN-Reweight Algorithm: Proposes a novel optimization method for heterogeneous multitask RL that combines Task-level Mean Normalization (to align cross-task reward scales) with difficulty-adaptive reweighting (to correct advantage estimation bias based on prompt difficulty).
  • Strong Empirical Performance: Under the same vanilla GRPO setup, training on the GoLongRL dataset alone outperforms training on the closed-source QwenLong-L1.5 dataset. The final GoLongRL-30B-A3B model achieves long-context performance comparable to much larger flagship models. TMN-Reweight adds a further +0.8 points on average over vanilla GRPO at the 4B scale.
  • Full Open-Source Release: All resources—the dataset, the complete four-phase construction pipeline, and training code—are publicly released to facilitate research.

Introduction and Theoretical Foundation

Effective utilization of long contexts (tens to hundreds of thousands of tokens) is a critical capability for Large Language Models (LLMs) in practical applications like multi-document analysis and agentic workflows. While reinforcement learning (RL) has shown promise in improving long-context use post-training, existing methods face two core limitations:

  1. Narrow Data Design: Training data is often constructed around complex retrieval paths (e.g., UUID tracking), leading to homogeneous task coverage (mostly QA variants) and uniform reward formulations (e.g., binary accuracy). This neglects key capabilities like summarization, ranking, and structured reasoning.
  2. Suboptimal Multitask Optimization: Standard RL algorithms like GRPO struggle with heterogeneous rewards. Per-prompt normalization can distort advantage estimates across prompts of varying difficulty, and different reward metrics (EM, F1, NDCG) have different variance profiles, causing some tasks to dominate gradients.

This work addresses these gaps through a capability-oriented framework. Inspired by task taxonomies like LongBench Pro, the research starts from the core capabilities required for long-context understanding and designs data and optimization methods accordingly.

Methodology

1. Data Construction: A Capability-Oriented RLVR Dataset

The dataset construction is guided by three principles: capability orientation, reward alignment with task semantics, and priority on real documents. The construction follows a unified four-phase pipeline (P1 to P4).

Dataset Composition: The final dataset contains 22,965 samples across 9 tasks (T1-T9).

| Task | Reward Type | Samples | Ratio | Reward Function | Core Capability |
|------|-------------|---------|-------|-----------------|-----------------|
| T1 | EM | 7,908 | 34.4% | Exact match | Precise long-range information retrieval |
| T2 | Accuracy | 6,808 | 29.6% | Multiple-choice accuracy | Evidence-grounded comprehension and reasoning |
| T3 | F1 | 3,478 | 15.1% | Token F1 | High-recall exhaustive retrieval and verification |
| T4 | math_verify | 3,054 | 13.3% | Math verification | Numerical extraction and quantitative reasoning |
| T5 | IoU | 937 | 4.1% | IoU-based structured match | Multi-table structured extraction |
| T6 | SubEM | 360 | 1.6% | Substring match | Fragment-level structured matching and induction |
| T7 | NDCG | 120 | 0.5% | Ranking quality | Dimension-quantified retrieval and graded ranking |
| T8 | Pairwise | 180 | 0.8% | Pairwise comparison | Sequence reconstruction and ordering |
| T9 | Summary | 120 | 0.5% | ROUGE-L | Long document summarization |
| Total | | 22,965 | 100% | | |
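To make the task-to-reward pairing concrete, here is a minimal sketch of three of the simpler reward functions from the table (EM, SubEM, Token F1). The function names, the `REWARDS` mapping, and the lowercase/whitespace normalization are assumptions made for illustration; the released reward code may normalize and match differently.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, simplified normalization)."""
    return text.lower().split()


def exact_match(pred: str, gold: str) -> float:
    """EM-style reward: 1.0 only if the normalized answers are identical."""
    return float(normalize(pred) == normalize(gold))


def sub_em(pred: str, gold: str) -> float:
    """SubEM-style reward: 1.0 if the normalized gold string occurs inside the prediction."""
    return float(" ".join(normalize(gold)) in " ".join(normalize(pred)))


def token_f1(pred: str, gold: str) -> float:
    """Token F1: harmonic mean of precision and recall over the token multiset overlap."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical dispatch table: task id -> reward function.
REWARDS = {"T1": exact_match, "T3": token_f1, "T6": sub_em}

if __name__ == "__main__":
    print(REWARDS["T3"]("a b c", "a b d"))
```

Note that EM and SubEM return binary values while Token F1 is continuous, which is one source of the differing reward variance across tasks discussed in the algorithm section.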

Data Sources: The dataset combines:

  • ~14K curated open-source samples from established corpora (CLongEval, LongBench Pro, FinancialQA, etc.), mapped to appropriate tasks.
  • ~9K synthetic samples where QA pairs are generated from real source documents (books, academic papers, dialogues) using strong synthesis models (DeepSeek-V3.2, Gemini-2.5-Pro), followed by rigorous multi-stage quality control.

Four-Phase Pipeline (P1-P4):

  1. P1: Source Corpus Collection – Manual curation of annotated datasets and unannotated real-world documents.
  2. P2: Task-Oriented Filtering & Assignment – Applying task-specific criteria to assign each sample/document to one of the 9 tasks.
  3. P3: Sample Construction – Two parallel tracks:
    • Open-source track: Compatibility filtering and reward format standardization of existing annotations.
    • Synthetic track: Length-binning, QA generation, two-stage quality filtering (QA-pair verification by Gemini-2.5-Pro, then multi-model pass-rate verification).
  4. P4: Iterative Refinement – Apply contamination filtering (13-gram overlap) and use benchmark diagnostics to identify and supplement weak capability dimensions.
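The 13-gram contamination filter in P4 can be sketched as a set-intersection check. This is an illustrative sketch only: the whitespace tokenization, the lowercasing, and the helper names `ngrams` and `is_contaminated` are assumptions, and the released pipeline may tokenize and match differently.

```python
def ngrams(tokens, n=13):
    """Return the set of all contiguous n-token windows (empty if the text is shorter than n)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample_text, benchmark_texts, n=13):
    """Flag a training sample that shares at least one n-gram with any benchmark text."""
    sample_grams = ngrams(sample_text.lower().split(), n)
    return any(sample_grams & ngrams(b.lower().split(), n) for b in benchmark_texts)


if __name__ == "__main__":
    benchmark = " ".join(f"w{i}" for i in range(20))
    leaked = "intro " + " ".join(f"w{i}" for i in range(3, 16))  # 13 shared tokens
    print(is_contaminated(leaked, [benchmark]))
```

For a large benchmark pool, the benchmark n-grams would normally be precomputed once (or hashed) rather than rebuilt for every sample as this sketch does.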

2. Algorithm: TMN-Reweight

The proposed algorithm addresses two defects in standard GRPO when applied to heterogeneous tasks: Defect 1 (Difficulty-induced advantage bias) and Defect 2 (Cross-task reward scale inconsistency).

Preliminaries (GRPO): For a prompt $u$, GRPO samples $G$ responses $\{ o_i \}_{i=1}^G$, computes rewards $\{ r_i \}$, and estimates advantages via group-level z-score normalization:

$$A_i^u = \frac{r_i - \mu_u}{\sigma_u + \delta}, \quad \mu_u = \frac{1}{G} \sum_{j=1}^{G} r_j, \quad \sigma_u = \sqrt{ \frac{1}{G-1} \sum_{j=1}^{G} (r_j - \mu_u)^2 } \tag{1}$$
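As a minimal sketch of Eq. (1), the group-level advantage can be computed with the standard library. The function name `grpo_advantages` and the default `delta=1e-6` are assumptions chosen for illustration; the source does not state the value of $\delta$.

```python
import statistics


def grpo_advantages(rewards, delta=1e-6):
    """Group-level z-score advantages for one prompt (needs at least 2 responses)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.stdev(rewards)  # sample std, i.e. the 1/(G-1) form in Eq. (1)
    return [(r - mu) / (sigma + delta) for r in rewards]


if __name__ == "__main__":
    # Binary rewards with one success out of four.
    print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
    # Continuous rewards that differ only slightly: the per-prompt z-score
    # still stretches them to roughly unit scale.
    print(grpo_advantages([0.50, 0.52, 0.48, 0.50]))
```

The second call illustrates why per-prompt normalization can be problematic for continuous metrics: reward differences of 0.02 receive advantages of similar magnitude to a full success-versus-failure gap.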

TMN-Reweight combines two steps:

Step 1: Task-level Mean Normalization. Replaces the per-prompt $\sigma_u$ with a task-level root mean square standard deviation:

$$\hat{A}_i^u = \frac{r_i - \mu_u}{\sigma_{\text{task}(i)} + \delta}, \quad \text{where} \quad \sigma_{\text{task}(i)} = \sqrt{ \frac{1}{|U_{\text{task}}|} \sum_{u \in U_{\text{task}}} \sigma_u^2 } \tag{5}$$

Here, $\sigma_{\text{task}(i)}$ is computed over all prompts $U_{\text{task}}$ that belong to the same task as the prompt of response $i$. This aligns gradient scales across tasks while preserving within-task difficulty structure.
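Eq. (5) can be sketched as follows. This is an assumption-laden illustration, not the released implementation: the function name `tmn_advantages`, the `(task_id, rewards)` input layout, and the choice to compute $\sigma_{\text{task}}$ over the prompts of the current batch are all hypothetical.

```python
import math
import statistics
from collections import defaultdict


def tmn_advantages(groups, delta=1e-6):
    """groups: list of (task_id, rewards) pairs, one pair per prompt in the batch."""
    means = [statistics.fmean(rewards) for _, rewards in groups]
    variances = [statistics.variance(rewards) for _, rewards in groups]  # sigma_u^2

    # Collect per-prompt variances by task, then take the root mean square.
    by_task = defaultdict(list)
    for (task, _), var in zip(groups, variances):
        by_task[task].append(var)
    sigma_task = {task: math.sqrt(statistics.fmean(v)) for task, v in by_task.items()}

    # Centre with the per-prompt mean, scale with the task-level deviation.
    return [
        [(r - mu) / (sigma_task[task] + delta) for r in rewards]
        for (task, rewards), mu in zip(groups, means)
    ]


if __name__ == "__main__":
    batch = [
        ("T1", [1.0, 0.0, 0.0, 0.0]),
        ("T1", [1.0, 1.0, 0.0, 0.0]),
        ("T3", [0.50, 0.52, 0.48, 0.50]),
    ]
    for row in tmn_advantages(batch):
        print(row)
```

Because both T1 prompts share one denominator, the prompt with lower reward variance keeps a proportionally smaller advantage scale instead of being stretched to unit variance.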

Step 2: Difficulty-Adaptive Reweighting. Estimates prompt difficulty using a smoothed pass rate to reduce variance:

$$\tilde{\mu}_u = \alpha \cdot \mu_u + (1-\alpha) \cdot \mu_{\text{task}}, \quad \hat{p} = \frac{\sum_{i=1}^G \mathbb{1}[r_i > \tilde{\mu}_u]}{G} \tag{6}$$

A difficulty weight is computed: $w = \exp(0.5 - \hat{p})$. This weight is applied asymmetrically based on the sign of the TMN advantage $\hat{A}_i^u$ to create a "four-quadrant" gradient reallocation:

$$\tilde{A}_i = \begin{cases} \hat{A}_i^u \cdot w & \text{if } \hat{A}_i^u > 0 \\ \hat{A}_i^u \cdot \frac{1}{w} & \text{otherwise} \end{cases} \tag{8}$$

  • For hard prompts ($w > 1$): Amplifies rare positive successes, downweights negative gradients.
  • For easy prompts ($w < 1$): Attenuates positive gradients to prevent entropy collapse, amplifies learning from unexpected failures.
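The reweighting in Eqs. (6) and (8) can be sketched as below. The function name `reweight`, the default `alpha=0.5`, and passing the task-level mean reward as `mu_task` are assumptions for illustration; the source does not give the value of $\alpha$ or say how $\mu_{\text{task}}$ is estimated.

```python
import math
import statistics


def reweight(tmn_adv, rewards, mu_task, alpha=0.5):
    """Apply the difficulty weight w asymmetrically to TMN advantages for one prompt."""
    mu_u = statistics.fmean(rewards)
    mu_smooth = alpha * mu_u + (1 - alpha) * mu_task              # Eq. (6), smoothed mean
    p_hat = sum(r > mu_smooth for r in rewards) / len(rewards)    # Eq. (6), pass rate
    w = math.exp(0.5 - p_hat)
    # Eq. (8): positive advantages are scaled by w, the rest by 1/w.
    return [a * w if a > 0 else a / w for a in tmn_adv], w


if __name__ == "__main__":
    hard = [1.0] + [0.0] * 7   # one success in eight: low pass rate, w > 1
    easy = [1.0] * 7 + [0.0]   # seven successes in eight: high pass rate, w < 1
    adv = [2.0] + [-0.3] * 7   # placeholder TMN advantages
    print(reweight(adv, hard, mu_task=0.5))
    print(reweight(adv, easy, mu_task=0.5))
```

Since $\hat{p} \in [0, 1]$, the weight is bounded between $e^{-0.5}$ and $e^{0.5}$, so the reallocation rescales advantages by at most a factor of about 1.65 in either direction.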

Empirical Validation / Results

1. Data Effectiveness

To separate the data contribution from algorithmic improvements, models are trained with vanilla GRPO on the GoLongRL dataset alone. This setup already yields clear gains over the base models and over QwenLong-L1.5 data trained with the same algorithm.

Table 3 (Excerpt): Data effectiveness validation (Average scores on long-context benchmarks).

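| Scale | Model | Avg. | DocMath | LBV2 | Frames | MRCR | CorpusQA | LBV1-QA |
|-------|-------|------|---------|------|--------|------|----------|---------|
| 4B | Qwen3-4B-Thinking-2507 (Base) | 53.0 | 61.0 | 40.2 | 64.4 | 38.4 | 49.9 | 64.0 |
| 4B | QwenLong-L1.5 (w. GRPO) † | 56.1 | 61.3 | 44.3 | 67.1 | 40.9 | 58.8 | 64.1 |
| 4B | GoLongRL-4B (w. GRPO) | 62.2 | 62.5 | 45.5 | 66.6 | 67.5 | 65.1 | 65.9 |
| 30B | Qwen3-30B-A3B-Thinking-2507 (Base) | 60.1 | 63.3 | 48.7 | 70.2 | 41.6 | 70.5 | 66.5 |
| 30B | QwenLong-L1.5 (w. GRPO) † | 67.2 | 65.1 | 55.3 | 71.4 | 66.9 | 76.9 | 67.9 |
| 30B | GoLongRL-30B-A3B (w. GRPO) | 69.8 | 65.3 | 55.1 | 74.5 | 81.6 | 73.6 | 68.7 |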
† Results from Shen et al. (2025); ‡ Trained on an 8K subset.

Key Findings:

  • At the 4B scale, GoLongRL data with GRPO outperforms QwenLong-L1.5 data with GRPO by +6.1 points average.
  • At the 30B scale, the advantage remains (69.8 vs. 67.2).
  • The GoLongRL dataset with vanilla GRPO is competitive with QwenLong-L1.5 trained with its specialized AEPO algorithm.

2. Algorithmic Improvement (TMN-Reweight)

Ablation at the 4B scale isolates the contribution of the TMN-Reweight algorithm.

Table 5: Long-context benchmark results and component ablation.

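| Model | Avg. | DocMath | LBV2 | Frames | MRCR | CorpusQA | LBV1-QA |
|-------|------|---------|------|--------|------|----------|---------|
| QwenLong-L1.5-4B † | 59.4 | 62.5 | 47.9 | 67.4 | 47.9 | 64.7 | 65.8 |
| GoLongRL-4B (w. GRPO) | 62.2 | 62.5 | 45.5 | 66.6 | 67.5 | 65.1 | 65.9 |
| GoLongRL-4B (w. TMN-Reweight) | 63.0 | 62.3 | 47.1 | 67.4 | 65.5 | 69.6 | 65.9 |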

Key Findings:

  • TMN-Reweight provides a +0.8 point average gain over vanilla GRPO on the same data.
  • Gains are concentrated on aggregation- and reasoning-intensive benchmarks (CorpusQA: +4.5, LBV2: +1.6).
  • TMN-Reweight achieves the best or second-best score on 5/6 subtasks, demonstrating a more balanced capability profile.
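The averages and per-benchmark deltas quoted above can be rechecked directly from the Table 5 scores. The short script below is only an arithmetic check; the variable names are arbitrary, and the averages are assumed to be unweighted means of the six benchmark columns.

```python
BENCHMARKS = ["DocMath", "LBV2", "Frames", "MRCR", "CorpusQA", "LBV1-QA"]

# Per-benchmark scores copied from Table 5.
TABLE5 = {
    "QwenLong-L1.5-4B": [62.5, 47.9, 67.4, 47.9, 64.7, 65.8],
    "GoLongRL-4B (w. GRPO)": [62.5, 45.5, 66.6, 67.5, 65.1, 65.9],
    "GoLongRL-4B (w. TMN-Reweight)": [62.3, 47.1, 67.4, 65.5, 69.6, 65.9],
}

# Unweighted mean over the six benchmarks, rounded as in the table.
averages = {model: round(sum(s) / len(s), 1) for model, s in TABLE5.items()}

# TMN-Reweight minus vanilla GRPO, per benchmark.
deltas = {
    name: round(tmn - grpo, 1)
    for name, tmn, grpo in zip(
        BENCHMARKS,
        TABLE5["GoLongRL-4B (w. TMN-Reweight)"],
        TABLE5["GoLongRL-4B (w. GRPO)"],
    )
}

if __name__ == "__main__":
    print(averages)
    print(deltas)
```

Besides reproducing the reported gains on CorpusQA and LBV2, the deltas show that the average improvement coexists with lower MRCR and DocMath scores under TMN-Reweight.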

3. Generalization and Extrapolation

General Capability Retention: Training does not degrade general reasoning. Evaluations on MMLU-Pro, AIME, and GPQA show improvements at both 4B and 30B scales. Significant gains are also observed on agentic memory and dialogue memory benchmarks not seen during training (e.g., LongMemEval: +13.6 at both scales).

Length Extrapolation: Although trained with a 160K context window, models show strong generalization to longer sequences (up to 1M tokens) on MRCR and CorpusQA tasks.

Theoretical and Practical Implications

Theoretical: The gradient analysis in Appendix A provides a theoretical motivation for the Task-level Mean Normalization component. It shows that the per-task gradient magnitude scales with $\sqrt{\mathbb{E}_{u \sim \mathcal{T}_k}[\sigma_u^2]}$, justifying the use of $\sigma_{\text{task}}$ as the normalization constant to equalize gradient contributions across tasks.

Practical:

  1. Data Design is Critical: The results demonstrate that broader capability coverage and heterogeneous reward design are primary drivers for improving long-context RL, potentially more impactful than algorithmic refinements alone.
  2. Open-Source Contribution: The full release of the dataset, pipeline, and code lowers the barrier to entry for long-context RL research and enables reproducibility.
  3. Effective Optimization for Heterogeneous Tasks: TMN-Reweight offers a principled solution to a common problem in multitask RLVR, balancing scale alignment and difficulty-aware learning.

Conclusion

GoLongRL presents an effective, open-source framework for long-context RL. Its core contributions are:

  1. A capability-oriented dataset of 23K samples across 9 task types with diverse rewards, constructed via a rigorous pipeline.
  2. The TMN-Reweight algorithm that mitigates cross-task scale inconsistency and difficulty-induced bias in advantage estimation.

Empirically, the dataset alone enables strong performance, and TMN-Reweight adds a further gain in average score, leading to models competitive with much larger counterparts. The framework preserves or improves general capabilities and demonstrates effective length extrapolation.

Future Work includes studying scale-dependent optimization dynamics, targeted data supplementation for multi-document reasoning, and integrating token-level weighting methods with RLVR.