RAGEN-2: Reasoning Collapse in Agentic RL

Summary (Overview)

  • Identifies Template Collapse: A novel failure mode in multi-turn LLM agent RL where reasoning appears diverse within a single input but becomes input-agnostic (templated) across different inputs. This collapse is invisible to standard entropy-based metrics.
  • Proposes Mutual Information (MI) as a Diagnostic: Decomposes reasoning quality into within-input diversity ($H(Z|X)$, measured by entropy) and cross-input distinguishability ($I(X;Z)$, measured by mutual information). Introduces a family of MI proxies (e.g., Retrieval-Acc, MI-ZScore-EMA) for online diagnosis, which correlate more strongly with final task performance than entropy.
  • Explains Collapse via a Signal-to-Noise Ratio (SNR) Mechanism: Low within-input reward variance ($\widehat{\text{Var}}(R|X)$) weakens task-discriminative gradients, allowing input-agnostic regularization terms (KL divergence, entropy bonus) to dominate updates, erasing cross-input reasoning differences.
  • Introduces SNR-Aware Filtering: A mitigation strategy that selects high-signal prompts per iteration using reward variance as a lightweight SNR proxy. This method consistently improves both input dependence (MI) and task performance across diverse tasks, algorithms, and model scales.

Introduction and Theoretical Foundation

Training multi-turn Large Language Model (LLM) agents with Reinforcement Learning (RL) is inherently unstable. While reward tracks outcome stability and entropy tracks process stability, entropy is an ambiguous signal for reasoning quality. A high, stable entropy can mask a critical failure mode: template collapse.

Template Collapse occurs when an agent's reasoning chains appear diverse for any given input but are effectively identical across different inputs—relying on fixed, input-agnostic templates. This is problematic because sparse rewards cannot distinguish input-driven reasoning from templated reasoning that happens to succeed, and reasoning chains are hard to supervise directly. As a result, collapse can persist unnoticed, silently degrading reasoning abilities.

The paper addresses two core questions:

  1. (Q1) How to diagnose template collapse? Entropy metrics track within-input variability but miss input dependence.
  2. (Q2) Why does template collapse happen? A mechanistic explanation is needed.

The theoretical foundation is built on information theory. The marginal entropy of reasoning $H(Z)$ decomposes via the standard identity:

$H(Z) = I(X;Z) + H(Z|X)$

where:

  • $I(X;Z)$ is the mutual information (input dependence) between the input context $X$ and the generated reasoning $Z$.
  • $H(Z|X)$ is the conditional entropy (within-input diversity).

Entropy-based metrics proxy $H(Z|X)$ but cannot detect a decline in $I(X;Z)$. A policy can sustain high $H(Z|X)$ while $I(X;Z)$ drops to zero, producing the superficially diverse but input-agnostic boilerplate of template collapse.
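To make the decomposition concrete, here is a small sketch (not from the paper) that verifies the identity $H(Z) = I(X;Z) + H(Z|X)$ on a toy discrete joint distribution, and shows that a template-collapsed policy keeps $H(Z|X)$ maximal while $I(X;Z)$ is exactly zero:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Toy joint p(x, z): 2 inputs, 4 reasoning "templates", input-dependent policy.
p_xz = np.array([[0.30, 0.10, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.30]])
p_x, p_z = p_xz.sum(axis=1), p_xz.sum(axis=0)

H_Z = entropy(p_z)
# H(Z|X) = sum_x p(x) H(Z | X = x)
H_Z_given_X = sum(p_x[i] * entropy(p_xz[i] / p_x[i]) for i in range(2))
# I(X;Z) computed directly from its definition, independently of the identity.
I_XZ = float(np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z))))

# Template collapse: the same conditional q(z) is reused for every input.
q = np.full(4, 0.25)
p_collapsed = np.outer(p_x, q)                 # p(x, z) = p(x) q(z)
I_collapsed = float(np.sum(
    p_collapsed * np.log(p_collapsed / np.outer(p_x, p_collapsed.sum(axis=0)))))
# Within-input diversity stays at log(4) while input dependence vanishes.
```

Under collapse, the marginal entropy $H(Z)$ alone looks healthy; only the $I(X;Z)$ term reveals that reasoning no longer depends on the input.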

Methodology

1. Mutual Information Proxy Family

Since true mutual information $I(X;Z)$ has no closed form for token sequences, the authors propose empirical proxies based on in-batch cross-scoring. The intuition is that high $I(X;Z)$ means reasoning $Z$ is distinguishable and specific to its source input $X$.

Method (In-Batch Cross-Scoring): Given $P$ prompts and $G$ reasoning samples per prompt from rollouts, compute teacher-forced log-likelihoods for every $(Z_{i,k}, X_j)$ pair, forming a scoring matrix:

$\mathcal{L}_{i,k,j} = \log p_\theta(Z_{i,k} \mid X_j)$

From this, extract two length-normalized quantities:

$\text{matched}_{i,k} = \frac{\mathcal{L}_{i,k,i}}{|Z_{i,k}|}, \quad \text{marginal}_{i,k} = \frac{1}{|Z_{i,k}|} \log \left( \frac{1}{P} \sum_j \exp(\mathcal{L}_{i,k,j}) \right)$

where $\text{matched}_{i,k}$ is the per-token log-likelihood under the true source input, and $\text{marginal}_{i,k}$ approximates $\log p_\theta(Z_{i,k})$ via a uniform mixture over batch prompts.

Primary Proxies:

  • Retrieval-Acc (Discrete): The probability of correctly retrieving the source input $X_i$ given reasoning $Z_{i,k}$: $\text{Acc} = \frac{1}{PG} \sum_{i=1}^{P} \sum_{k=1}^{G} \mathbb{I}\left[ i = \arg\max_j \mathcal{L}_{i,k,j} \right]$. Under collapse, $\text{Acc} \to 1/P$ (chance level).
  • MI-ZScore-EMA (Continuous, Robust): Estimates input dependence and applies z-score normalization with an Exponential Moving Average (EMA) for stability: $\widehat{I}(X;Z) = \frac{1}{PG} \sum_{i=1}^{P} \sum_{k=1}^{G} \left( \text{matched}_{i,k} - \text{marginal}_{i,k} \right)$, normalized by $\sigma_{\text{EMA}}^{(t)} = \alpha \sigma_{\text{EMA}}^{(t-1)} + (1-\alpha)\sigma_{\text{batch}}^{(t)}$ with $\alpha = 0.9$.
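The in-batch cross-scoring proxies can be sketched as follows. This is an illustrative NumPy implementation (not the authors' code) that assumes the scoring matrix $\mathcal{L}_{i,k,j}$ and per-sample token lengths have already been obtained from teacher-forced scoring:

```python
import numpy as np

def mi_proxies(L, lengths):
    """Compute Retrieval-Acc and MI-Est from an in-batch scoring matrix.

    L[i, k, j]    = log p_theta(Z_{i,k} | X_j), teacher-forced
    lengths[i, k] = |Z_{i,k}| in tokens
    """
    P, G, _ = L.shape
    i_idx = np.arange(P)[:, None]                # shape (P, 1)
    k_idx = np.arange(G)[None, :]                # shape (1, G)
    # matched_{i,k}: per-token log-likelihood under the true source prompt
    matched = L[i_idx, k_idx, i_idx] / lengths
    # marginal_{i,k} = (1/|Z|) log (1/P) sum_j exp(L_{i,k,j}), via the max trick
    m = L.max(axis=2, keepdims=True)
    lse = m.squeeze(2) + np.log(np.exp(L - m).sum(axis=2))
    marginal = (lse - np.log(P)) / lengths

    retrieval_acc = float((L.argmax(axis=2) == i_idx).mean())
    mi_est = float((matched - marginal).mean())
    return retrieval_acc, mi_est

# Synthetic sanity check: boost the diagonal so each Z "matches" its source X.
rng = np.random.default_rng(0)
P, G = 8, 4
L = rng.normal(-2.0, 0.1, size=(P, G, P))
L[np.arange(P)[:, None], np.arange(G)[None, :], np.arange(P)[:, None]] += 1.0
acc, mi = mi_proxies(L, lengths=np.full((P, G), 20.0))
```

With the diagonal boost, retrieval is perfect and MI-Est is positive; with an input-agnostic (collapsed) scoring matrix, accuracy would fall to the $1/P$ chance level and MI-Est toward zero.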

Table 1: MI Proxy Family

| Type | Proxy | Formula | Notes |
|---|---|---|---|
| Discrete | Retrieval-Acc | $\frac{1}{PG}\sum_{i,k} \mathbb{I}[\arg\max_j \mathcal{L}_{i,k,j} = i]$ | Chance level $1/P$ under collapse |
| Discrete | Recall@$k$ | $\frac{1}{PG}\sum_{i,k} \mathbb{I}[i \in \text{top-}k_j(\mathcal{L}_{i,k,j})]$ | $k \in \{2,4,8\}$ |
| Continuous (raw) | MI-Est | $\frac{1}{PG}\sum_{i,k} (\text{matched}_{i,k} - \text{marginal}_{i,k})$ | Per-token; approaches 0 under collapse |
| Continuous (raw) | MI-Seq-Est | $\frac{1}{PG}\sum_{i,k} \left( \mathcal{L}_{i,k,i} - \log \frac{1}{P}\sum_j e^{\mathcal{L}_{i,k,j}} \right)$ | Per-sequence; no length normalization |
| Continuous (z-score) | MI-ZScore | $\frac{1}{PG}\sum_{i,k} \frac{\text{matched}_{i,k} - \text{marginal}_{i,k}}{\sigma_{\text{batch}} + \epsilon}$ | Normalized by current-batch marginal std |
| Continuous (z-score) | MI-ZScore-EMA | $\frac{1}{PG}\sum_{i,k} \frac{\text{matched}_{i,k} - \text{marginal}_{i,k}}{\sigma_{\text{EMA}} + \epsilon}$ | $\sigma_{\text{EMA}}^{(t)} = \alpha \sigma_{\text{EMA}}^{(t-1)} + (1-\alpha)\sigma_{\text{batch}}^{(t)}$ |

2. SNR-Aware Filtering

Based on the SNR mechanism explanation, the authors propose a mitigation strategy.

Core Idea: Prioritize prompts with higher within-input reward variance, where advantage estimates carry stronger task-discriminative information.

Method:

  1. Estimate Reward Variance: For each prompt $X_i$ with $G$ trajectory samples, compute $\widehat{\text{Var}}(R|X=X_i) = \frac{1}{G-1} \sum_{g=1}^{G} \left( R_g(X_i) - \bar{R}(X_i) \right)^2$, where $\bar{R}(X_i) = \frac{1}{G} \sum_{g=1}^{G} R_g(X_i)$.
  2. Top-$\rho$ Filtering (Nucleus-style): Rank prompts by descending variance. Keep the smallest prefix of prompts whose cumulative variance mass reaches a fraction $\rho$ of the total batch variance. Let $\sigma$ be the ranking permutation with $\widehat{\text{Var}}(R|X=x_{\sigma(1)}) \geq \dots \geq \widehat{\text{Var}}(R|X=x_{\sigma(P)})$. Define the threshold $\tau = \rho \sum_{i=1}^{P} \widehat{\text{Var}}(R|X=x_i)$ and find $k^* = \min\left\{ k : \sum_{j=1}^{k} \widehat{\text{Var}}(R|X=x_{\sigma(j)}) \geq \tau \right\}$. The kept set is $S = \{\sigma(1), \dots, \sigma(k^*)\}$.
  3. Update on Filtered Set: Perform the policy update only on trajectories from prompts in $S$. The filtered objective becomes $\mathcal{L}_\rho(\theta) = \frac{1}{k^*} \sum_{i \in S} \sum_{j \in \mathcal{B}_i} L_\theta(\xi_j)$.
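Steps 1–2 above can be sketched in a few lines of NumPy (an illustrative implementation, not the authors' code):

```python
import numpy as np

def top_rho_filter(rewards, rho=0.8):
    """Select high-SNR prompts by cumulative reward-variance mass.

    rewards: array of shape (P, G) -- G trajectory rewards per prompt.
    Returns indices of the kept set S: the smallest prefix, by descending
    within-prompt variance, whose cumulative variance reaches rho of the
    total batch variance.
    """
    rv = rewards.var(axis=1, ddof=1)       # unbiased Var-hat(R | X_i)
    order = np.argsort(-rv)                # rank prompts by descending variance
    csum = np.cumsum(rv[order])
    tau = rho * rv.sum()                   # cumulative-variance threshold
    k_star = int(np.searchsorted(csum, tau)) + 1   # min k with csum[k-1] >= tau
    return order[:k_star]

# Example: prompts with all-identical rewards carry zero gradient signal.
rewards = np.array([[1.0, 0.0, 1.0, 0.0],   # high variance
                    [1.0, 1.0, 1.0, 1.0],   # zero variance (all success)
                    [0.5, 0.5, 0.5, 0.5],   # zero variance
                    [1.0, 0.0, 0.0, 0.0]])  # moderate variance
kept = top_rho_filter(rewards, rho=0.9)     # keeps the two informative prompts
```

Zero-variance prompts (all rollouts succeed or all fail) are dropped first, since their advantage estimates are identically zero and contribute only regularization gradients.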

3. Experimental Testbed

Experiments are conducted across diverse environments to stress complementary decision-making regimes.

Table 3: Environment Features

| Task | Stochastic | Multi-turn | State | Reward |
|---|---|---|---|---|
| Sokoban | | | Grid | Dense |
| FrozenLake | | | Grid | Binary |
| MetaMathQA | | | Text | Dense |
| Countdown | | | Text | Binary |
| SearchQA | | | Text | Dense |
| WebShop | | | Text | Dense |
| DeepCoder | | | Text | Dense |

Training Setup: Models (e.g., Qwen2.5-3B) are trained with the veRL/HybridFlow stack using the PPO, DAPO, GRPO, and Dr. GRPO algorithms. Each iteration collects $K = 128$ trajectories, typically organized as $P = 8$ prompt groups with $G = 16$ samples per group.

Empirical Validation / Results

1. Template Collapse is a Consistent Failure Mode

  • MI Dynamics Reveal Collapse: During training, the MI proxy (e.g., Retrieval-Acc) declines significantly before task performance degrades, while conditional entropy $H(Z|X)$ remains high. This divergence is the hallmark of template collapse (high $H(Z|X)$, low $I(X;Z)$).
  • Behavioral Manifestation: Reasoning length declines monotonically across environments as agents converge toward reusable, shorter, more formulaic templates.
  • MI is a Superior Diagnostic: Spearman correlation analysis shows MI-family metrics positively predict final task performance, while entropy metrics show near-zero or negative correlations.

Figure 8: Spearman Correlations of Metrics with Task Performance

  • Trajectory MI-ZScore: +0.39
  • Reasoning Entropy: -0.11
  • Conditional Entropy: -0.14

This confirms that MI predicts performance more reliably than entropy, and entropy can be misleading.

2. SNR-Aware Filtering Improves Performance

  • Filtering Strategy Comparison: Top-$\rho$ (nucleus-style) filtering consistently outperforms Top-$k$ (fixed-count) filtering and no-filtering baselines across environments.
  • Broad Effectiveness: SNR-Aware Filtering improves average task success rates across diverse tasks, RL algorithms (PPO, DAPO, GRPO, Dr. GRPO), model scales (Qwen2.5 0.5B to 7B), model types (Qwen2.5, Llama3.2), and input modalities (text, vision).

Table 4: SNR-Aware Filtering Results Across Variants (Peak Success Rate % with Filter Delta)

| Experiment Variant | Sokoban | FrozenLake | MetaMathQA | Countdown | Average |
|---|---|---|---|---|---|
| Baseline (PPO, Qwen2.5-3B) | 12.9 (+16.0) | 67.0 (+10.9) | 92.6 (+0.6) | 97.9 (+0.0) | 67.6 (+6.9) |
| Algorithm: DAPO | 16.2 (+5.1) | 66.8 (+2.1) | 90.8 (+2.8) | 95.7 (+1.6) | 67.4 (+2.9) |
| Model Scale: 0.5B | 3.3 (+22.9) | 19.5 (+0.0) | 10.0 (-0.2) | 23.0 (-0.7) | 14.0 (+5.5) |
| Modality: Qwen2.5-VL-3B (V) | 65.0 (+12.0) | 19.5 (+59.5) | — | — | 42.3 (+35.8) |
  • Compute Overhead: Filtering reduces per-step training time (by 26-41% in tested configurations) because fewer prompts contribute to gradient computation. Variance computation itself adds negligible (<0.1%) overhead.

3. Validating the SNR Mechanism

  • Gradient Decomposition Evidence: Sorting prompts into reward-variance (RV) buckets shows:
    1. Task gradient norm $\|g_{\text{task}}\|$ increases monotonically with bucket RV.
    2. Regularizer gradient norm $\|g_{\text{reg}}\|$ (KL + entropy) is flat across buckets.
    3. In the lowest-RV buckets, task gradients nearly vanish while regularization gradients persist, meaning updates are dominated by input-agnostic noise.
  • Quartile Ablation (Causal Evidence): Training on prompts from only the highest RV quartile (Q1) yields higher task performance and MI than training on lower quartiles (Q2, Q3, Q4).

Table 6: Quartile Ablation on Sokoban

| Quartile | RV Range | Task Perf (%) | MI Proxy | Entropy |
|---|---|---|---|---|
| Q1 (highest RV) | [4.4–5.6] | 21.1 | 0.95 | 2.02 |
| Q2 | [1.5–4.2] | 19.5 | 0.937 | 1.53 |
| Q3 | [0.0–0.2] | 10.7 | 0.81 | 1.41 |
| Q4 (lowest RV) | [0.0–0.1] | 11.0 | 0.73 | 1.87 |
  • Prompt vs. Trajectory Filtering: Prompt-level SNR-Aware Filtering provides larger gains than trajectory-level filtering (selecting top/bottom trajectories within all prompts), confirming the benefit comes from selecting discriminative prompts, not just discarding hard trajectories.
  • When Filtering Helps: The ratio $\text{Std}(\text{RV})/\text{Mean}(\text{RV})$ predicts effectiveness. A high ratio indicates a bimodal RV distribution where filtering cleanly separates signal from noise.
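This batch-level diagnostic can be computed directly from the same per-prompt reward variances used for filtering; a minimal sketch under those assumptions:

```python
import numpy as np

def filtering_signal(rewards, eps=1e-8):
    """Std(RV)/Mean(RV) over within-prompt reward variances.

    A high ratio suggests a bimodal variance distribution (some prompts
    carry strong task signal, others none), so filtering should help;
    a low ratio means prompts are roughly equally informative and
    filtering mostly discards usable data.
    """
    rv = rewards.var(axis=1, ddof=1)   # within-prompt reward variance per prompt
    return float(rv.std() / (rv.mean() + eps))

# Bimodal batch: half the prompts have zero reward variance.
bimodal = np.array([[1.0, 0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0, 1.0],
                    [0.0, 0.0, 0.0, 0.0]])
# Uniform batch: every prompt has the same reward spread.
uniform = np.tile([1.0, 0.0, 1.0, 0.0], (4, 1))
```

On the bimodal batch the ratio is high (variance mass concentrates on a few prompts), while on the uniform batch it is near zero.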

Theoretical and Practical Implications

Theoretical Implications

  • Refines Understanding of RL Training Stability: Establishes that stable entropy is insufficient for healthy training; input dependence ($I(X;Z)$) is a critical, previously overlooked dimension.
  • Provides a Mechanistic Explanation: Formalizes template collapse via an SNR lens and gradient decomposition, showing how low reward variance leads to regularization-dominated updates that erase input-specific reasoning.
  • Information-Theoretic Framework: Offers a principled decomposition of reasoning quality ($H(Z) = I(X;Z) + H(Z|X)$) that can guide future diagnostics and interventions.

Practical Implications

  • Superior Training Monitor: Practitioners should monitor Mutual Information proxies (e.g., Retrieval-Acc, MI-ZScore-EMA) alongside reward and entropy, as MI provides earlier and more reliable warning signs of reasoning degradation.
  • Effective Mitigation Strategy: SNR-Aware Filtering is a lightweight, general-purpose intervention that integrates easily into existing RL pipelines (PPO, GRPO, etc.) and consistently improves performance across diverse settings.
  • Complementary to Existing Stabilizers: SNR-Aware Filtering addresses a different axis (signal quality) compared to KL/entropy tuning (noise control), making it a complementary tool for robust agent training.

Conclusion

The paper makes three key contributions:

  1. Identifies and diagnoses "template collapse," a silent failure mode in multi-turn agent RL where reasoning becomes input-agnostic while maintaining surface diversity.
  2. Explains the collapse via an SNR mechanism, showing low reward variance weakens task gradients, letting input-agnostic regularization dominate and erase cross-input reasoning differences.
  3. Proposes SNR-Aware Filtering as an effective mitigation, using reward variance to select high-signal prompts, which consistently improves performance.

The introduced Mutual Information proxy family serves as a superior diagnostic compared to entropy, and the SNR-Aware Filtering method provides a practical, low-overhead intervention. Together, they offer a framework for understanding and mitigating a systematic failure mode in closed-loop multi-turn agent RL.

Limitations: The SNR decomposition assumes task-signal and regularization noise separate cleanly. The method requires reward variance to be a reliable signal proxy, which may degrade in very sparse or noisy environments. Aggressive filtering may narrow exploration and requires per-task tuning of the keep rate ρ\rho.