Summary (Overview)
- This paper identifies and formalizes two critical limitations of uniform (position-agnostic) token-level trust regions in LLM reinforcement learning with verifiable rewards (RLVR): they ignore autoregressive asymmetry (early token deviations propagate over longer suffixes) and cumulative prefix drift (per-token errors accumulate in the conditioning history).
- The authors propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that replaces uniform thresholds with two coupled mechanisms: a position-weighted token-level threshold (stricter at early positions) and a cumulative prefix budget (dynamically restricts divergence as prefix drift accumulates).
- A novel prefix-constrained policy-improvement bound is derived (Theorem 1), showing that constraining weighted prefix averages rather than pointwise divergences provably tightens the surrogate residual bound.
- Empirically, CPPO achieves the best AIME24/25/26 Avg@16 scores across four Qwen3 settings (1.7B, 1.7B-Base, 8B-Base, 30B-A3B-Base), outperforming strong baselines (GRPO, DPPO, MinPRO, CISPO, TRM) by margins of 0.91–5.56 absolute points.
- Ablations confirm both the position-weight and prefix-budget mechanisms independently contribute, and the gain is robust to divergence metric (TV vs. KL) and approximation granularity (Top-K vs. Binary).
Introduction and Theoretical Foundation
Background and Motivation
Reinforcement learning with verifiable rewards (RLVR) has become standard for LLM reasoning post-training (Ouyang et al., 2022; Shao et al., 2024). In RLVR, a policy generates responses, a verifier assigns scalar rewards, and updates are performed using PPO/GRPO-style token-level objectives (Schulman et al., 2017). Off-policy updates cause the target policy to drift from the rollout policy, and autoregressive generation amplifies the divergence because early token deviations alter the conditioning of all subsequent steps.
Existing trust-region mechanisms borrow from classical policy optimization (TRPO, Schulman et al., 2015) but approximate the divergence constraint by:
- PPO/GRPO: Clipping the sampled likelihood ratio (Schulman et al., 2017; Shao et al., 2024)
- DPPO: Constraining the total-variation (TV) divergence with a uniform threshold (Qi et al., 2026)
All these methods apply a uniform, position-agnostic threshold across all token positions, which conflicts with autoregressive generation in two ways:
- Autoregressive asymmetry: Early token deviations affect longer suffixes (more future tokens). A uniform threshold under-penalizes early deviations (which have large propagation multipliers) and over-constrains late-stage exploration.
- Cumulative prefix drift: Per-token divergences accumulate in the conditioning prefix. A uniform threshold permits sequences to drift far from the rollout policy while still passing every per-token check.
Theoretical Foundation: Finite-Horizon Performance Difference
The paper starts from the exact finite-horizon performance difference identity (Lemma 2), which is stated in terms of the suffix likelihood ratio.
The surrogate error must be controlled. Equation (4) shows how token-level divergence at a given position propagates, with a bound expressed through the per-token threshold and the reward bound. The coefficient grows linearly with the remaining horizon, which formalizes autoregressive asymmetry.
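To make the asymmetry concrete, here is a minimal sketch. It assumes, purely for illustration, that the coefficient at position t is the number of remaining tokens, T − t; the exact constants of Equation (4) are not reproduced, and the function names are hypothetical.

```python
def propagation_coefficients(horizon: int) -> list[int]:
    """Assumed coefficient for 1-indexed position t: remaining tokens T - t."""
    return [horizon - t for t in range(1, horizon + 1)]


def weighted_divergence_sum(divergences: list[float]) -> float:
    """Sum of per-token divergences, each scaled by its assumed coefficient."""
    coeffs = propagation_coefficients(len(divergences))
    return sum(c * d for c, d in zip(coeffs, divergences))


# The same deviation (0.1) placed early versus late in a 5-token response.
early = [0.1, 0.0, 0.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0, 0.1, 0.0]

if __name__ == "__main__":
    print("early deviation:", weighted_divergence_sum(early))
    print("late deviation: ", weighted_divergence_sum(late))
```

Under this assumed form, an identical deviation contributes more to the weighted sum when it occurs at the first token than near the end, which is the effect a uniform threshold ignores.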
Methodology
CPPO Masking Rule
CPPO replaces the uniform threshold with two coupled constraints, encoded in a per-token indicator.
Position-weighted token-level threshold: each token's divergence is scaled by a position weight that follows a decreasing linear schedule.
This imposes stricter limits at early positions (the limit is tighter when the position weight is large) and relaxes them later.
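A decreasing linear schedule of this kind might look like the sketch below. The endpoints (weight 1.0 at the first token, a floor `w_min` at the last), the default values, and the form `delta / w_t` for the per-token limit are assumptions made for illustration, not the paper's exact formula.

```python
import numpy as np


def linear_position_weights(length: int, w_min: float = 0.6) -> np.ndarray:
    """Weights that fall linearly from 1.0 (first token) to w_min (last token)."""
    if length == 1:
        return np.ones(1)
    return np.linspace(1.0, w_min, length)


def per_token_limits(length: int, delta: float = 0.02, w_min: float = 0.6) -> np.ndarray:
    """Assumed per-token divergence limit: delta divided by the position weight.

    A large weight (early position) gives a small limit, so early tokens are
    constrained more tightly than late ones.
    """
    return delta / linear_position_weights(length, w_min)


if __name__ == "__main__":
    print(linear_position_weights(5))
    print(per_token_limits(5))
```

Because the weights are non-increasing, the resulting limits are non-decreasing along the response.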
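Cumulative prefix budget: Define the running weighted sum of token-level divergences and the running sum of position weights over the prefix. The budget condition ensures that the weighted prefix average does not exceed the prefix-average threshold (with some initial slack). This dynamically reduces the allowed divergence when earlier tokens have already drifted significantly.
Combined per-token condition: a token satisfies the combined condition only if both the position-weighted threshold and the prefix budget hold, which yields a single effective threshold at each token.
The full token-level mask keeps update terms that move the target policy toward the rollout policy (first clause) and allows terms that drive it away only when the combined condition holds.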
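The two-clause structure of the mask can be sketched as follows. This assumes that "driving away" means the gradient of the ratio-advantage term would push the likelihood ratio further from 1 (positive advantage with ratio above 1, or negative advantage with ratio below 1); `within_budget` is a hypothetical boolean input standing for the combined condition.

```python
import numpy as np


def direction_aware_mask(ratio, advantage, within_budget) -> np.ndarray:
    """Return a boolean mask: True keeps the token's update term.

    Clause 1: terms that do not push the ratio further from 1 are always kept.
    Clause 2: terms that do push it further are kept only if within_budget.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    within_budget = np.asarray(within_budget, dtype=bool)

    # Maximizing ratio * advantage raises the ratio when advantage > 0
    # and lowers it when advantage < 0.
    moves_away = ((advantage > 0) & (ratio > 1.0)) | ((advantage < 0) & (ratio < 1.0))
    return (~moves_away) | within_budget


if __name__ == "__main__":
    ratio = [1.2, 0.8, 1.2, 0.8]
    advantage = [1.0, 1.0, -1.0, -1.0]
    print(direction_aware_mask(ratio, advantage, [False] * 4))
```

Keeping the first clause unconditional means a token that has already drifted too far can still receive updates that pull it back.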
Theoretical Guarantee
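Theorem 1 (CPPO policy-improvement bound): Under the token-level and prefix-budget constraints holding for all prefixes, and assuming the position weights are non-increasing, the surrogate residual is bounded in terms of the prefix-average threshold rather than the pointwise token-level threshold.
For the special case of a uniform token-level threshold, the residual constant improves, and the ratio of the new constant to the old one is below 1 under the condition stated in the theorem.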
Divergence Approximation
All token-level trust-region methods use the Top-K reduced-TV approximation (K=20) from DPPO (Qi et al., 2026). The divergence is computed over the top-20 highest-probability tokens at each position, with the retained probabilities normalized to sum to 1.
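A rough sketch of such a Top-K reduced TV computation is shown below. It assumes the top-K set is chosen by the rollout policy's probabilities and that both distributions are renormalized over that set; the function name and these choices are illustrative and may differ from the DPPO implementation.

```python
import numpy as np


def topk_tv(rollout_probs, target_probs, k: int = 20):
    """Total-variation distance restricted to the rollout policy's top-k tokens.

    Inputs have the vocabulary on the last axis, e.g. shape (seq_len, vocab).
    """
    rollout_probs = np.asarray(rollout_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    k = min(k, rollout_probs.shape[-1])

    # Indices of the k most probable tokens under the rollout policy.
    idx = np.argsort(rollout_probs, axis=-1)[..., -k:]
    p = np.take_along_axis(rollout_probs, idx, axis=-1)
    q = np.take_along_axis(target_probs, idx, axis=-1)

    # Renormalize each reduced distribution so it sums to 1.
    p = p / p.sum(axis=-1, keepdims=True)
    q = q / q.sum(axis=-1, keepdims=True)
    return 0.5 * np.abs(p - q).sum(axis=-1)


if __name__ == "__main__":
    p = np.array([[0.5, 0.3, 0.2]])
    q = np.array([[0.2, 0.3, 0.5]])
    print(topk_tv(p, q, k=3))
```

Restricting to K tokens keeps the cost per position independent of vocabulary size, at the price of ignoring mass in the tail.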
Algorithm
Algorithm 1 details the mask computation for one response: iterate over the tokens in order, maintain the prefix sums, compute the effective threshold at each token, and mask updates that violate the condition.
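The single pass over tokens could be sketched as below. This is not the paper's Algorithm 1: it assumes the prefix sums are the running weighted divergence and the running weight total, that the per-token limit is `delta / w`, and that the prefix limit is whatever budget remains from `delta_avg` times the weight total plus `slack`. All parameter names and default values are hypothetical.

```python
def prefix_budget_flags(divergences, weights, delta=0.02, delta_avg=0.01, slack=0.0):
    """Return one boolean per token: True if the token is within its effective limit."""
    weighted_sum = 0.0  # running sum of w_s * D_s over tokens seen so far
    weight_sum = 0.0    # running sum of w_s, including the current token
    flags = []
    for d, w in zip(divergences, weights):
        weight_sum += w
        token_limit = delta / w
        # Divergence the current token may add without the weighted prefix
        # average exceeding delta_avg (plus slack).
        budget_limit = (delta_avg * weight_sum + slack - weighted_sum) / w
        effective = max(0.0, min(token_limit, budget_limit))
        flags.append(d <= effective)
        # The divergence is accumulated whether or not the token passes.
        weighted_sum += w * d
    return flags


if __name__ == "__main__":
    print(prefix_budget_flags([0.0, 0.018, 0.018, 0.0], [1.0, 1.0, 1.0, 1.0]))
```

In the example, the third token stays under the per-token limit of 0.02 but is flagged because the second token already consumed most of the prefix budget.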
Empirical Validation / Results
Experimental Setup
- Training data: DAPO-Math-17k (≈17k verifiable math prompts)
- Models: Qwen3-1.7B, Qwen3-1.7B-Base, Qwen3-8B-Base, Qwen3-30B-A3B-Base
- Hyperparameters:
- Dense models: , rollouts
- 30B-A3B: , rollouts
- Evaluation: AIME24/25/26 Avg@16 (unweighted mean)
- Baselines: GRPO, CISPO, MinPRO, DPPO, TRM-Max, TRM-Avg
- CPPO settings: (dense) or (MoE); ; adaptive for Base models (top-10% quantile clamped to )
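As a reading aid for the evaluation metric, the sketch below assumes Avg@16 is the mean accuracy over 16 sampled responses per problem, averaged over problems, and that the reported score is the simple mean of the three benchmark scores. The helper names and toy inputs are hypothetical.

```python
def avg_at_k(correct: list[list[int]]) -> float:
    """Percentage accuracy averaged over k samples per problem, then over problems.

    correct[i][j] is 1 if sample j for problem i was verified correct, else 0.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


def unweighted_mean(benchmark_scores: dict[str, float]) -> float:
    """Simple average over benchmarks, ignoring their problem counts."""
    return sum(benchmark_scores.values()) / len(benchmark_scores)


if __name__ == "__main__":
    toy = [[1, 1, 0, 0], [1, 0, 0, 0]]  # two problems, four samples each
    print(avg_at_k(toy))
    print(unweighted_mean({"AIME24": 30.0, "AIME25": 20.0, "AIME26": 10.0}))
```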
Main Results
Table 1: Best validation AIME24/25/26 Avg@16 (%, higher is better)
| Method | 1.7B | 1.7B-Base | 8B-Base | 30B-A3B-Base |
|---|---|---|---|---|
| GRPO | 27.91 | 8.89 | 23.96 | 38.19 |
| MinPRO | 27.71 | 11.04 | 29.72 | 48.12 |
| CISPO | 28.82 | 11.87 | 29.58 | collapse |
| DPPO | 28.19 | 10.90 | 28.89 | 49.23 |
| TRM-Max | 25.21 | 9.72 | 26.73 | 20.27 |
| TRM-Avg | 26.87 | 11.70 | 27.98 | 48.96 |
| CPPO (ours) | 31.88 | 12.78 | 31.11 | 54.79 |
- CPPO outperforms all baselines in every setting by margins of 0.91–5.56 absolute points.
- The largest gain (5.56 points) is on the largest model (30B-A3B-Base), which also has the longest horizon (16k), where autoregressive asymmetry is most pronounced.
- CISPO collapses on 30B-A3B-Base; TRM-Max degrades to 20.27, while CPPO trains stably.
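The reported margins can be rechecked directly from Table 1. The snippet below copies the table values and computes, per setting, the gap between CPPO and the best non-collapsed baseline; the variable and function names are illustrative.

```python
SETTINGS = ["1.7B", "1.7B-Base", "8B-Base", "30B-A3B-Base"]

# Values copied from Table 1; None marks the collapsed CISPO run.
BASELINES = {
    "GRPO": [27.91, 8.89, 23.96, 38.19],
    "MinPRO": [27.71, 11.04, 29.72, 48.12],
    "CISPO": [28.82, 11.87, 29.58, None],
    "DPPO": [28.19, 10.90, 28.89, 49.23],
    "TRM-Max": [25.21, 9.72, 26.73, 20.27],
    "TRM-Avg": [26.87, 11.70, 27.98, 48.96],
}
CPPO = [31.88, 12.78, 31.11, 54.79]


def margins_over_best_baseline() -> dict[str, float]:
    """CPPO score minus the best baseline score, per setting (2 decimals)."""
    result = {}
    for i, setting in enumerate(SETTINGS):
        scores = [row[i] for row in BASELINES.values() if row[i] is not None]
        result[setting] = round(CPPO[i] - max(scores), 2)
    return result


if __name__ == "__main__":
    for setting, margin in margins_over_best_baseline().items():
        print(f"{setting}: +{margin:.2f}")
```

The smallest margin (0.91, over CISPO on 1.7B-Base) and the largest (5.56, over DPPO on 30B-A3B-Base) match the range quoted above.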
Ablation Studies (Figure 5, Figure 6)
- Single mechanism ablation: Removing either the position weight or the prefix budget from CPPO (using uniform weights or no prefix budget respectively) still outperforms DPPO, but full CPPO achieves the highest scores.
- Position-weight ordering: Shuffling the position-dependent thresholds randomly (keeping the same multiset) yields lower performance than the autoregressive ordered schedule, confirming that the ordering by position drives the gain.
- Mask vs. soft gate: A soft variant (gradient attenuation near boundary) performs similarly to the hard mask.
- Hyperparameter sensitivity: Varying two CPPO hyperparameters (one from 0.02 to 0.03, the other from 0.8 to 0.6) keeps performance above DPPO.
- KL vs. TV divergence: CPPO with KL divergence (using the TRM thresholds) matches the TV configuration and outperforms DPPO; TRM Max&Avg with the same thresholds does not.
- Binary vs. Top-K approximation: Both approximations yield comparable performance and exceed DPPO.
Theoretical and Practical Implications
Theoretical Contributions
- Formalizes autoregressive asymmetry in the error propagation bound: the position-dependent coefficient shows that early token-level divergence has a linearly larger impact on the surrogate residual.
- Derives a prefix-constrained policy-improvement bound (Theorem 1) that replaces the pointwise dependence on the token-level threshold with a tighter dependence on the prefix-average threshold, so cumulative prefix constraints provably tighten the bound under the theorem's condition.
- Connects the bound to practical masking rules: the position weight and prefix budget directly implement the theoretical requirements (monotonicity of the position weights and prefix-sum bounds).
Practical Implications for LLM RL
- Drop-in replacement: CPPO modifies only the token-level mask while preserving the standard PPO/GRPO ratio-advantage objective, requiring no additional loss terms or architecture changes.
- Few hyperparameters: a token-level threshold scale, a prefix-average threshold, and a weight floor. The adaptive setting for Base models handles initial high-exploration phases automatically.
- Stability gains: CPPO prevents collapse in large models (30B-A3B-Base) where CISPO and TRM-Max fail, and consistently improves over DPPO which shares the same divergence estimator.
- Broad applicability: Gains hold across model scales (1.7B to 30B-A3B), architectures (dense and MoE), training stages (Base and post-trained), divergence metrics (TV and KL), and approximations (Binary and Top-K).
Conclusion
This work identifies fundamental limitations of uniform token-level trust regions in LLM RL and proposes CPPO, a principled alternative that respects the autoregressive structure of generation. Key contributions:
- Formalization: A finite-horizon error bound showing how token position affects error propagation.
- Algorithm: CPPO's dual-constraint mask (position-weighted threshold + cumulative prefix budget) that dynamically allocates divergence budget along the response.
- Theory: A provably tighter policy-improvement bound via prefix constraints.
- Empirical validation: Consistent improvements across four Qwen3 settings, with ablations confirming that both mechanisms contribute and are complementary.
Future directions: Extending the prefix-budget concept to multi-turn interactions, adapting the weight schedule based on task difficulty, and exploring soft-gate variants further. The principle of aligning trust-region structure with the autoregressive factorization of LLMs opens a promising direction for more stable and capable reasoning RL.