Summary (Overview)

  • This paper identifies and formalizes two critical limitations of uniform (position-agnostic) token-level trust regions in LLM reinforcement learning with verifiable rewards (RLVR): they ignore autoregressive asymmetry (early token deviations propagate over longer suffixes) and cumulative prefix drift (per-token errors accumulate in the conditioning history).
  • The authors propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that replaces uniform thresholds with two coupled mechanisms: a position-weighted token-level threshold (stricter at early positions) and a cumulative prefix budget (dynamically restricts divergence as prefix drift accumulates).
  • A novel prefix-constrained policy-improvement bound is derived (Theorem 1), showing that constraining weighted prefix averages rather than pointwise divergences provably tightens the surrogate residual bound.
  • Empirically, CPPO achieves the best AIME24/25/26 Avg@16 scores across four Qwen3 model scales (1.7B, 8B, 30B-A3B), outperforming strong baselines (GRPO, DPPO, MinPRO, CISPO, TRM) by margins of 0.91–5.56 absolute points.
  • Ablations confirm both the position-weight and prefix-budget mechanisms independently contribute, and the gain is robust to divergence metric (TV vs. KL) and approximation granularity (Top-K vs. Binary).

Introduction and Theoretical Foundation

Background and Motivation

Reinforcement learning with verifiable rewards (RLVR) has become standard for LLM reasoning post-training (Ouyang et al., 2022; Shao et al., 2024). In RLVR, a policy generates responses, a verifier assigns scalar rewards, and updates are performed using PPO/GRPO-style token-level objectives (Schulman et al., 2017). Off-policy updates cause the target policy π\pi to drift from the rollout policy μ\mu, and autoregressive generation amplifies divergence because early token deviations alter the conditioning of all subsequent steps.

Existing trust-region mechanisms borrow from classical policy optimization (TRPO, Schulman et al., 2015) but approximate the divergence constraint by:

  • PPO/GRPO: Clipping the sampled likelihood ratio ρt=π(ytst)/μ(ytst)\rho_t = \pi(y_t|s_t)/\mu(y_t|s_t) (Schulman et al., 2017; Shao et al., 2024)
  • DPPO: Constraining the total-variation (TV) divergence Dt=DTV(μ(st),π(st))D_t = D_{\text{TV}}(\mu(\cdot|s_t),\pi(\cdot|s_t)) with a uniform threshold δ\delta (Qi et al., 2026)

All these methods apply a uniform, position-agnostic threshold across all token positions, which conflicts with autoregressive generation in two ways:

  1. Autoregressive asymmetry: Early token deviations affect longer suffixes (more future tokens). A uniform threshold under-penalizes early deviations (which have large propagation multipliers) and over-constrains late-stage exploration.
  2. Cumulative prefix drift: Per-token divergences accumulate in the conditioning prefix st=(x,y<t)s_t = (x, y_{<t}). A uniform threshold permits sequences to drift far from μ\mu while still passing per-token checks.

Theoretical Foundation: Finite-Horizon Performance Difference

The paper starts from the exact finite-horizon performance difference identity (Lemma 2):

J(π)J(μ)=Lμ(π)Δ(μ,π)J(\pi) - J(\mu) = L'_\mu(\pi) - \Delta(\mu, \pi)

where

Lμ(π):=Eμ[R(x,y)t=1T(ρt1)],Δ(μ,π):=Eμ[R(x,y)t=1T(ρt1)(1ρt+1:T)]L'_\mu(\pi) := \mathbb{E}_\mu\left[R(x,y) \sum_{t=1}^T (\rho_t - 1)\right], \quad \Delta(\mu,\pi) := \mathbb{E}_\mu\left[R(x,y) \sum_{t=1}^T (\rho_t - 1)(1 - \rho_{t+1:T})\right]

and ρt+1:T=j=t+1Tρj\rho_{t+1:T} = \prod_{j=t+1}^T \rho_j is the suffix likelihood ratio.

The surrogate error Δ(μ,π)|\Delta(\mu,\pi)| must be controlled. Equation (4) shows how token-level divergence at position tt propagates:

Δ(μ,π)4ξt=1T1utj=t+1Tjt=1T1λtut,λt=4ξˉ(Tt)|\Delta(\mu,\pi)| \leq 4\xi \sum_{t=1}^{T-1} u_t \sum_{j=t+1}^T \ell_j \leq \sum_{t=1}^{T-1} \lambda_t u_t, \quad \lambda_t = 4\xi\bar{\ell}(T-t)

where ut=E[Dt]u_t = \mathbb{E}[D_t], t\ell_t is the per-token threshold, ˉ=maxjj\bar{\ell} = \max_j \ell_j, and ξ\xi is the reward bound. The coefficient λt(Tt)\lambda_t \propto (T-t) grows linearly with remaining horizon—the formalization of autoregressive asymmetry.

Methodology

CPPO Masking Rule

CPPO replaces the uniform threshold with two coupled constraints encoded in a per-token indicator ItI_t:

Position-weighted token-level threshold: wtDtδw_t D_t \leq \delta with a decreasing linear schedule:

wt=11wminT1(t1),t=1,,T,  wt[wmin,1]w_t = 1 - \frac{1 - w_{\min}}{T-1}(t-1), \quad t=1,\ldots,T, \; w_t \in [w_{\min}, 1]

This imposes stricter limits at early positions (Dtδ/wtD_t \leq \delta/w_t, tighter when wtw_t is large) and relaxes them later.

Cumulative prefix budget: Let St=j=1twjDjS_t = \sum_{j=1}^t w_j D_j and Wt=j=1twjW_t = \sum_{j=1}^t w_j. The condition Stδ+δbWt1S_t \leq \delta + \delta_b W_{t-1} ensures the weighted prefix average does not exceed δb\delta_b (with initial slack δ\delta). This dynamically reduces the allowed divergence when earlier tokens have already drifted significantly.

Combined per-token condition:

It:  wtDtδ    Stδ+δbWt1I_t : \; w_t D_t \leq \delta \;\wedge\; S_t \leq \delta + \delta_b W_{t-1}

The effective threshold at token tt is:

ctCPPO:=min{δ,δ+δbWt1St1}c_t^{\text{CPPO}} := \min\{\delta, \delta + \delta_b W_{t-1} - S_{t-1}\}

The full token-level mask:

MtCPPO=1[A^t(ρt1)0    It]M_t^{\text{CPPO}} = \mathbb{1}\left[\hat{A}_t(\rho_t - 1) \leq 0 \;\vee\; I_t\right]

This keeps update terms that move π\pi toward μ\mu (first clause) and only allows terms driving π\pi away from μ\mu when ItI_t holds.

Theoretical Guarantee

Theorem 1 (CPPO policy-improvement bound): Under constraints wtDtctw_t D_t \leq c_t and PmδbWmP_m \leq \delta_b W_m for all prefixes m=1,,T1m=1,\ldots,T-1, and assuming rt=λt/wtr_t = \lambda_t/w_t is non-increasing,

J(π)J(μ)Lμ(π)2ξT(T1)ˉδbJ(\pi) - J(\mu) \geq L'_\mu(\pi) - 2\xi T(T-1)\bar{\ell}\delta_b

For the special case of uniform token-level threshold DtδD_t \leq \delta, the residual constant improves from Cuniform=2ξT(T1)δ2C_{\text{uniform}} = 2\xi T(T-1)\delta^2 to CCPPO=2ξT(T1)δδbC_{\text{CPPO}} = 2\xi T(T-1)\delta\delta_b, giving a ratio CCPPO/Cuniform=δb/δC_{\text{CPPO}}/C_{\text{uniform}} = \delta_b/\delta (which is <1 when δb<δ\delta_b < \delta).

Divergence Approximation

All token-level trust-region methods use the Top-K reduced-TV approximation (K=20) from DPPO (Qi et al., 2026). The exact DTV(μ(st),π(st))D_{\text{TV}}(\mu(\cdot|s_t),\pi(\cdot|s_t)) is computed over the top-20 highest-probability tokens of μ\mu at each position, normalized to sum to 1.

Algorithm

Algorithm 1 details the mask computation for one response: iterate tokens linearly, maintain prefix sums StS_t, WtW_t, compute effective threshold ctc_t, and mask updates that violate the condition.

Empirical Validation / Results

Experimental Setup

  • Training data: DAPO-Math-17k (≈17k verifiable math prompts)
  • Models: Qwen3-1.7B, Qwen3-1.7B-Base, Qwen3-8B-Base, Qwen3-30B-A3B-Base
  • Hyperparameters:
    • Dense models: Tmax=8kT_{\max}=8k, n=8n=8 rollouts
    • 30B-A3B: Tmax=16kT_{\max}=16k, n=16n=16 rollouts
  • Evaluation: AIME24/25/26 Avg@16 (unweighted mean)
  • Baselines: GRPO, CISPO, MinPRO, DPPO, TRM-Max, TRM-Avg
  • CPPO settings: δ=0.15\delta = 0.15 (dense) or 0.20.2 (MoE); wmin=0.8w_{\min}=0.8; δb\delta_b adaptive for Base models (top-10% quantile clamped to [2δbmin,4δbmin][2\delta_b^{\min}, 4\delta_b^{\min}])

Main Results

Table 1: Best validation AIME24/25/26 Avg@16 (%, higher is better)

Method1.7B1.7B-Base8B-Base30B-A3B-Base
GRPO27.918.8923.9638.19
MinPRO27.7111.0429.7248.12
CISPO28.8211.8729.58collapse
DPPO28.1910.9028.8949.23
TRM-Max25.219.7226.7320.27
TRM-Avg26.8711.7027.9848.96
CPPO (ours)31.8812.7831.1154.79
  • CPPO outperforms all baselines in every setting by margins of 0.91–5.56 absolute points.
  • The largest gain (5.56 points) is on the largest model (30B-A3B-Base) with longest horizon (16k), where autoregressive asymmetry is most pronounced.
  • CISPO collapses on 30B-A3B-Base; TRM-Max degrades to 20.27, while CPPO trains stably.

Ablation Studies (Figure 5, Figure 6)

  1. Single mechanism ablation: Removing either the position weight or the prefix budget from CPPO (using uniform weights wt1w_t\equiv1 or no prefix budget respectively) still outperforms DPPO, but full CPPO achieves the highest scores.
  2. Position-weight ordering: Shuffling the position-dependent thresholds randomly (keeping the same multiset) yields lower performance than the autoregressive ordered schedule, confirming that the ordering by position drives the gain.
  3. Mask vs. soft gate: A soft variant (gradient attenuation near boundary) performs similarly to the hard mask.
  4. Hyperparameter sensitivity: Varying δb\delta_b (0.02→0.03) and wminw_{\min} (0.8→0.6) maintains performance above DPPO.
  5. KL vs. TV divergence: CPPO with KL divergence (using TRM thresholds δ=0.1,δb=0.002\delta=0.1,\delta_b=0.002) matches the TV configuration and outperforms DPPO; TRM Max&Avg with same thresholds does not.
  6. Binary vs. Top-K approximation: Both approximations yield comparable performance and exceed DPPO.

Theoretical and Practical Implications

Theoretical Contributions

  • Formalizes autoregressive asymmetry in the error propagation bound: the coefficient λt=4ξˉ(Tt)\lambda_t = 4\xi\bar{\ell}(T-t) shows early token-level divergence has linearly larger impact on the surrogate residual.
  • Derives a prefix-constrained policy-improvement bound (Theorem 1) that replaces the pointwise dependence on δ2\delta^2 with a tighter dependence on δδb\delta \delta_b, proving that cumulative prefix constraints provably tighten the bound when δb<δ\delta_b < \delta.
  • Connects the bound to practical masking rules: the position weight wtw_t and prefix budget δb\delta_b directly implement the theoretical requirements (monotonicity of (Tt)/wt(T-t)/w_t and prefix-sum bounds).

Practical Implications for LLM RL

  • Drop-in replacement: CPPO modifies only the token-level mask while preserving the standard PPO/GRPO ratio-advantage objective, requiring no additional loss terms or architecture changes.
  • Two hyperparameters: δ\delta (token-level threshold scale) and δb\delta_b (prefix-average threshold) plus weight floor wminw_{\min}. The adaptive δb\delta_b for Base models handles initial high-exploration phases automatically.
  • Stability gains: CPPO prevents collapse in large models (30B-A3B-Base) where CISPO and TRM-Max fail, and consistently improves over DPPO which shares the same divergence estimator.
  • Broad applicability: Gains hold across model scales (1.7B–30B+), architectures (dense and MoE), training stages (Base and post-trained), and divergence metrics (TV, KL, Binary, Top-K).

Conclusion

This work identifies fundamental limitations of uniform token-level trust regions in LLM RL and proposes CPPO, a principled alternative that respects the autoregressive structure of generation. Key contributions:

  1. Formalization: A finite-horizon error bound showing how token position affects error propagation.
  2. Algorithm: CPPO's dual-constraint mask (position-weighted threshold + cumulative prefix budget) that dynamically allocates divergence budget along the response.
  3. Theory: A provably tighter policy-improvement bound via prefix constraints.
  4. Empirical validation: Consistent improvements across four Qwen3 settings, with ablations confirming both mechanisms are necessary and complementary.

Future directions: Extending the prefix-budget concept to multi-turn interactions, adapting the weight schedule based on task difficulty, and exploring soft-gate variants further. The principle of aligning trust-region structure with the autoregressive factorization of LLMs opens a promising direction for more stable and capable reasoning RL.

Related papers