RAGEN-2: Reasoning Collapse in Agentic RL
Summary (Overview)
- Identifies Template Collapse: A novel failure mode in multi-turn LLM agent RL where reasoning appears diverse within a single input but becomes input-agnostic (templated) across different inputs. This collapse is invisible to standard entropy-based metrics.
- Proposes Mutual Information (MI) as a Diagnostic: Decomposes reasoning quality into within-input diversity ($H(Z \mid X)$, measured by entropy) and cross-input distinguishability ($I(X; Z)$, measured by mutual information). Introduces a family of MI proxies (e.g., Retrieval-Acc, MI-ZScore-EMA) for online diagnosis, which correlate more strongly with final task performance than entropy.
- Explains Collapse via a Signal-to-Noise Ratio (SNR) Mechanism: Low within-input reward variance ($\mathrm{Var}[R \mid x]$) weakens task-discriminative gradients, allowing input-agnostic regularization terms (KL divergence, entropy bonus) to dominate updates, erasing cross-input reasoning differences.
- Introduces SNR-Aware Filtering: A mitigation strategy that selects high-signal prompts per iteration using reward variance as a lightweight SNR proxy. This method consistently improves both input dependence (MI) and task performance across diverse tasks, algorithms, and model scales.
Introduction and Theoretical Foundation
Training multi-turn Large Language Model (LLM) agents with Reinforcement Learning (RL) is inherently unstable. While reward tracks outcome stability and entropy tracks process stability, entropy is an ambiguous signal for reasoning quality. A high, stable entropy can mask a critical failure mode: template collapse.
Template Collapse occurs when an agent's reasoning chains appear diverse for any given input but are effectively identical across different inputs—relying on fixed, input-agnostic templates. This is problematic because sparse rewards cannot distinguish input-driven reasoning from templated reasoning that happens to succeed, and reasoning chains are hard to supervise directly. As a result, collapse can persist unnoticed, silently degrading reasoning abilities.
The paper addresses two core questions:
- (Q1) How to diagnose template collapse? Entropy metrics track within-input variability but miss input dependence.
- (Q2) Why does template collapse happen? A mechanistic explanation is needed.
The theoretical foundation is built on information theory. The marginal entropy of reasoning decomposes via the standard identity:

$$H(Z) = I(X; Z) + H(Z \mid X)$$

where:
- $I(X; Z)$ is the mutual information (input dependence) between the input context $X$ and the generated reasoning $Z$.
- $H(Z \mid X)$ is the conditional entropy (within-input diversity).

Entropy-based metrics proxy $H(Z \mid X)$ but cannot detect a decline in $I(X; Z)$. A policy can sustain high $H(Z \mid X)$ while $I(X; Z)$ drops to zero, producing the superficially diverse but input-agnostic boilerplate of template collapse.
Methodology
1. Mutual Information Proxy Family
Since true mutual information has no closed form for token sequences, the authors propose empirical proxies based on in-batch cross-scoring. The intuition is that high $I(X; Z)$ means reasoning is distinguishable and specific to its source input $x_i$.
Method: In-Batch Cross-Scoring. Given $N$ prompts and $K$ reasoning samples per prompt from rollouts, compute teacher-forced log-likelihoods for every (prompt, reasoning) pair, forming a scoring matrix:

$$S_{ij} = \frac{1}{|z_j|} \log \pi_\theta(z_j \mid x_i)$$

From this, extract two length-normalized quantities:

$$\ell_{\text{cond}}(z_j) = S_{jj}, \qquad \ell_{\text{marg}}(z_j) = \log \frac{1}{N} \sum_{i=1}^{N} \exp(S_{ij})$$

where $\ell_{\text{cond}}$ is the per-token log-likelihood under the true source input, and $\ell_{\text{marg}}$ approximates the marginal log-likelihood via a uniform mixture over batch prompts.
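As a concrete sketch of the cross-scoring step (the function name is illustrative, not from the paper, and the score matrix is assumed to already hold teacher-forced per-token log-likelihoods):

```python
import numpy as np

def mi_proxies(S: np.ndarray):
    """Given S[i, j] = per-token log-likelihood of reasoning z_j
    teacher-forced under prompt x_i, return (MI-Est, Retrieval-Acc)."""
    N = S.shape[0]
    ell_cond = np.diag(S)  # log-likelihood under the true source prompt
    # Marginal via uniform mixture over batch prompts:
    # log (1/N) sum_i exp(S[i, j]), computed as a stable log-sum-exp.
    m = S.max(axis=0)
    ell_marg = m + np.log(np.exp(S - m).sum(axis=0)) - np.log(N)
    mi_est = float((ell_cond - ell_marg).mean())  # approaches 0 under collapse
    retrieval_acc = float((S.argmax(axis=0) == np.arange(N)).mean())
    return mi_est, retrieval_acc
```

Under full collapse the columns of $S$ are identical, so the estimate goes to zero and retrieval falls to chance.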
Primary Proxies:
- Retrieval-Acc (Discrete): Accuracy of retrieving the source input $x_i$ given reasoning $z$, i.e., the fraction of reasonings whose highest-scoring prompt in the scoring matrix is their true source. Under collapse, accuracy falls to $1/N$ (chance level).
- MI-ZScore-EMA (Continuous, Robust): Estimates input dependence as $\hat{I} = \ell_{\text{cond}} - \ell_{\text{marg}}$ and applies z-score normalization for stability, $z_t = (\hat{I}_t - \mu_t) / (\sigma_t + \epsilon)$, with $\mu_t$ and $\sigma_t$ maintained as Exponential Moving Averages (EMA).
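The EMA-normalized variant can be sketched as a small stateful tracker (the decay `beta` and the epsilon are assumed hyperparameters, not values from the paper):

```python
class MIZScoreEMA:
    """Z-score a raw MI estimate against EMA mean/variance (sketch)."""

    def __init__(self, beta: float = 0.99, eps: float = 1e-8):
        self.beta, self.eps = beta, eps
        self.mu, self.var = 0.0, 0.0
        self.initialized = False

    def update(self, mi_est: float) -> float:
        if not self.initialized:
            # Seed the running statistics with the first observation.
            self.mu, self.var, self.initialized = mi_est, 0.0, True
            return 0.0
        self.mu = self.beta * self.mu + (1 - self.beta) * mi_est
        self.var = self.beta * self.var + (1 - self.beta) * (mi_est - self.mu) ** 2
        return (mi_est - self.mu) / (self.var ** 0.5 + self.eps)
```

The EMA smooths batch-to-batch noise, so a sustained drop in the raw estimate shows up as an increasingly negative z-score rather than a single outlier.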
Table 1: MI Proxy Family
| Type | Proxy | Formula | Notes |
|---|---|---|---|
| Discrete | Retrieval-Acc | $\frac{1}{N} \sum_j \mathbb{1}[\arg\max_i S_{ij} = j]$ | Chance level $1/N$ under collapse |
| | Recall@$k$ | fraction of reasonings with true source in top-$k$ of their score column | Relaxed retrieval variant |
| Continuous (raw) | MI-Est | $\mathbb{E}_j[\ell_{\text{cond}}(z_j) - \ell_{\text{marg}}(z_j)]$ | Per-token; approaches 0 under collapse |
| | MI-Seq-Est | same, on unnormalized sequence log-likelihoods | Per-sequence; no length normalization |
| Continuous (z-score) | MI-ZScore | $(\hat{I} - \mu) / \sigma$ | Normalized by current-batch marginal std |
| | MI-ZScore-EMA | $(\hat{I}_t - \mu_t) / \sigma_t$, EMA statistics | Smoothed across iterations for robustness |
2. SNR-Aware Filtering
Based on the SNR mechanism explanation, the authors propose a mitigation strategy.
Core Idea: Prioritize prompts with higher within-input reward variance, where advantage estimates carry stronger task-discriminative information.
Method:
- Estimate Reward Variance: For each prompt $x_i$ with $K$ trajectory samples, compute the within-prompt reward variance:

$$\hat{\sigma}_i^2 = \frac{1}{K - 1} \sum_{k=1}^{K} \left( r_{i,k} - \bar{r}_i \right)^2, \qquad \bar{r}_i = \frac{1}{K} \sum_{k=1}^{K} r_{i,k}$$

- Top-$p$ Filtering (Nucleus-style): Rank prompts by descending variance. Keep the smallest prefix of prompts whose cumulative variance mass reaches a fraction $p$ of the total batch variance. Let $\pi$ be the ranking permutation, so that $\hat{\sigma}_{\pi(1)}^2 \ge \hat{\sigma}_{\pi(2)}^2 \ge \dots \ge \hat{\sigma}_{\pi(N)}^2$. Define the threshold $T = p \sum_{i=1}^{N} \hat{\sigma}_i^2$. Find the smallest $m$ with $\sum_{j=1}^{m} \hat{\sigma}_{\pi(j)}^2 \ge T$. The kept set is $\mathcal{K} = \{\pi(1), \dots, \pi(m)\}$.
- Update on Filtered Set: Perform the policy update only on trajectories from prompts in $\mathcal{K}$. The filtered objective becomes:

$$\mathcal{J}_{\text{filtered}}(\theta) = \frac{1}{|\mathcal{K}|} \sum_{i \in \mathcal{K}} \mathcal{J}_{\text{RL}}(\theta; x_i)$$
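The ranking-and-prefix selection above can be sketched in a few lines of NumPy (a minimal illustration; the keep fraction `p` is the tunable keep rate, and its default here is an assumption):

```python
import numpy as np

def snr_filter(rewards: np.ndarray, p: float = 0.8) -> np.ndarray:
    """Nucleus-style prompt filtering by within-prompt reward variance.

    rewards: (N, K) array of K trajectory rewards per prompt.
    p: fraction of total variance mass to keep.
    Returns indices of the kept high-signal prompts."""
    var = rewards.var(axis=1, ddof=1)   # per-prompt reward variance
    order = np.argsort(-var)            # rank by descending variance
    mass = np.cumsum(var[order])        # cumulative variance mass
    m = int(np.searchsorted(mass, p * var.sum())) + 1  # smallest prefix
    return order[:m]
```

Prompts where all rollouts receive the same reward have zero variance, contribute nothing to the mass, and are dropped first, which is exactly the low-SNR regime the mechanism section identifies.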
3. Experimental Testbed
Experiments are conducted across diverse environments to stress complementary decision-making regimes.
Table 3: Environment Features
| Task | Stochastic | Multi-turn | State | Reward |
|---|---|---|---|---|
| Sokoban | ✗ | ✓ | Grid | Dense |
| FrozenLake | ✓ | ✓ | Grid | Binary |
| MetaMathQA | ✗ | ✓ | Text | Dense |
| Countdown | ✗ | ✗ | Text | Binary |
| SearchQA | ✗ | ✓ | Text | Dense |
| WebShop | ✗ | ✓ | Text | Dense |
| DeepCoder | ✗ | ✗ | Text | Dense |
Training Setup: Models (e.g., Qwen2.5-3B) are trained with the veRL/HybridFlow stack using PPO, DAPO, GRPO, and Dr. GRPO algorithms. Each iteration collects a batch of trajectories organized as $N$ prompt groups with $K$ samples per group.
Empirical Validation / Results
1. Template Collapse is a Consistent Failure Mode
- MI Dynamics Reveal Collapse: During training, the MI proxy (e.g., Retrieval-Acc) declines significantly before task performance degrades, while conditional entropy remains high. This divergence is the hallmark of template collapse (high $H(Z \mid X)$, low $I(X; Z)$).
- Behavioral Manifestation: Reasoning length declines monotonically across environments as agents converge toward reusable, shorter, more formulaic templates.
- MI is a Superior Diagnostic: Spearman correlation analysis shows MI-family metrics positively predict final task performance, while entropy metrics show near-zero or negative correlations.
Figure 8: Spearman Correlations of Metrics with Task Performance
- Trajectory MI-ZScore: +0.39
- Reasoning Entropy: -0.11
- Conditional Entropy: -0.14
This confirms that MI predicts performance more reliably than entropy, and entropy can be misleading.
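The rank correlation behind these numbers is cheap to compute during training; a minimal tie-free Spearman implementation (illustrative monitoring code, not the paper's analysis scripts):

```python
import numpy as np

def spearman(x, y) -> float:
    """Spearman rank correlation, assuming no ties (minimal sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))
```

In practice one would correlate a metric's trace (per-run MI proxy or entropy) against final task success across runs, as in the figure above.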
2. SNR-Aware Filtering Improves Performance
- Filtering Strategy Comparison: Top-$p$ (nucleus-style) filtering consistently outperforms Top-$k$ (fixed-count) filtering and no-filtering baselines across environments.
- Broad Effectiveness: SNR-Aware Filtering improves average task success rates across diverse tasks, RL algorithms (PPO, DAPO, GRPO, Dr. GRPO), model scales (Qwen2.5 0.5B to 7B), model types (Qwen2.5, Llama3.2), and input modalities (text, vision).
Table 4: SNR-Aware Filtering Results Across Variants (Peak Success Rate % with Filter Delta)
| Experiment Variants | Sokoban | FrozenLake | MetaMathQA | Countdown | Average |
|---|---|---|---|---|---|
| Baseline (PPO, Qwen2.5-3B) | 12.9 (+16.0) | 67.0 (+10.9) | 92.6 (+0.6) | 97.9 (+0.0) | 67.6 (+6.9) |
| Algorithm: DAPO | 16.2 (+5.1) | 66.8 (+2.1) | 90.8 (+2.8) | 95.7 (+1.6) | 67.4 (+2.9) |
| Model Scale: 0.5B | 3.3 (+22.9) | 19.5 (+0.0) | 10.0 (-0.2) | 23.0 (-0.7) | 14.0 (+5.5) |
| Modality: Qwen2.5-VL-3B (V) | 65.0 (+12.0) | 19.5 (+59.5) | - | - | 42.3 (+35.8) |
- Compute Overhead: Filtering reduces per-step training time (by 26-41% in tested configurations) because fewer prompts contribute to gradient computation. Variance computation itself adds negligible (<0.1%) overhead.
3. Validating the SNR Mechanism
- Gradient Decomposition Evidence: Sorting prompts into reward-variance (RV) buckets shows:
- Task gradient norm increases monotonically with bucket RV.
- Regularizer gradient norm (KL + entropy) is flat across buckets.
- In the lowest-RV buckets, task gradients nearly vanish while regularization gradients persist, meaning updates are dominated by input-agnostic noise.
- Quartile Ablation (Causal Evidence): Training on prompts from only the highest RV quartile (Q1) yields higher task performance and MI than training on lower quartiles (Q2, Q3, Q4).
Table 6: Quartile Ablation on Sokoban
| Quartile | RV Range | Task Perf (%) | MI Proxy | Entropy |
|---|---|---|---|---|
| Q1 (highest RV) | [4.4–5.6] | 21.1 | 0.95 | 2.02 |
| Q2 | [1.5–4.2] | 19.5 | 0.93 | 71.53 |
| Q3 | [0.0–0.2] | 10.7 | 0.81 | 1.41 |
| Q4 (lowest RV) | [0.0–0.1] | 11.0 | 0.73 | 1.87 |
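The quartile buckets used in this ablation can be formed directly from the per-prompt variance estimates (a sketch; `np.array_split` also handles batch sizes not divisible by four):

```python
import numpy as np

def rv_quartiles(var: np.ndarray):
    """Split prompt indices into quartiles by reward variance:
    Q1 = highest-variance quarter, Q4 = lowest (illustrative sketch)."""
    order = np.argsort(-var)          # descending reward variance
    return np.array_split(order, 4)   # [Q1, Q2, Q3, Q4]
```

Training restricted to `rv_quartiles(var)[0]` reproduces the Q1 condition; the lower quartiles give the causal comparison in the table.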
- Prompt vs. Trajectory Filtering: Prompt-level SNR-Aware Filtering provides larger gains than trajectory-level filtering (selecting top/bottom trajectories within all prompts), confirming the benefit comes from selecting discriminative prompts, not just discarding hard trajectories.
- When Filtering Helps: A reward-variance ratio metric predicts when filtering is effective: a high ratio indicates a bimodal RV distribution in which filtering cleanly separates high-signal prompts from noise.
Theoretical and Practical Implications
Theoretical Implications
- Refines Understanding of RL Training Stability: Establishes that stable entropy is insufficient for healthy training; input dependence ($I(X; Z)$) is a critical, previously overlooked dimension.
- Provides a Mechanistic Explanation: Formalizes template collapse via an SNR lens and gradient decomposition, showing how low reward variance leads to regularization-dominated updates that erase input-specific reasoning.
- Information-Theoretic Framework: Offers a principled decomposition of reasoning quality ($H(Z) = I(X; Z) + H(Z \mid X)$) that can guide future diagnostics and interventions.
Practical Implications
- Superior Training Monitor: Practitioners should monitor Mutual Information proxies (e.g., Retrieval-Acc, MI-ZScore-EMA) alongside reward and entropy, as MI provides earlier and more reliable warning signs of reasoning degradation.
- Effective Mitigation Strategy: SNR-Aware Filtering is a lightweight, general-purpose intervention that integrates easily into existing RL pipelines (PPO, GRPO, etc.) and consistently improves performance across diverse settings.
- Complementary to Existing Stabilizers: SNR-Aware Filtering addresses a different axis (signal quality) compared to KL/entropy tuning (noise control), making it a complementary tool for robust agent training.
Conclusion
The paper makes three key contributions:
- Identifies and diagnoses "template collapse," a silent failure mode in multi-turn agent RL where reasoning becomes input-agnostic while maintaining surface diversity.
- Explains the collapse via an SNR mechanism, showing low reward variance weakens task gradients, letting input-agnostic regularization dominate and erase cross-input reasoning differences.
- Proposes SNR-Aware Filtering as an effective mitigation, using reward variance to select high-signal prompts, which consistently improves performance.
The introduced Mutual Information proxy family serves as a superior diagnostic compared to entropy, and the SNR-Aware Filtering method provides a practical, low-overhead intervention. Together, they offer a framework for understanding and mitigating a systematic failure mode in closed-loop multi-turn agent RL.
Limitations: The SNR decomposition assumes task signal and regularization noise separate cleanly. The method requires reward variance to be a reliable signal proxy, which may degrade in very sparse or noisy environments. Aggressive filtering may narrow exploration and requires per-task tuning of the keep rate $p$.