Visual Summary | Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Summary (Overview)

This paper identifies and formalizes two critical limitations of uniform (position-agnostic) token-level trust regions in LLM reinforcement learning with verifiable rewards (RLVR): they ignore autoregressive asymmetry (early token deviations propagate over longer suffixes) and cumulative prefix drift (per-token errors accumulate in the conditioning history).
The authors propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that replaces uniform thresholds with two coupled mechanisms: a position-weighted token-level threshold (stricter at early positions) and a cumulative prefix budget (dynamically restricts divergence as prefix drift accumulates).
A novel prefix-constrained policy-improvement bound is derived (Theorem 1), showing that constraining weighted prefix averages rather than pointwise divergences tightens the surrogate residual bound whenever $\delta_b < \delta$.
Empirically, CPPO achieves the best AIME24/25/26 Avg@16 scores across four Qwen3 settings (1.7B, 1.7B-Base, 8B-Base, 30B-A3B-Base), outperforming the strongest baseline in each setting (among GRPO, DPPO, MinPRO, CISPO, TRM) by margins of 0.91–5.56 absolute points.
Ablations confirm both the position-weight and prefix-budget mechanisms independently contribute, and the gain is robust to divergence metric (TV vs. KL) and approximation granularity (Top-K vs. Binary).

Introduction and Theoretical Foundation

Background and Motivation

Reinforcement learning with verifiable rewards (RLVR) has become standard for LLM reasoning post-training (Ouyang et al., 2022; Shao et al., 2024). In RLVR, a policy generates responses, a verifier assigns scalar rewards, and updates are performed using PPO/GRPO-style token-level objectives (Schulman et al., 2017). Off-policy updates cause the target policy $\pi$ to drift from the rollout policy $\mu$, and autoregressive generation amplifies this divergence because an early token deviation alters the conditioning of every subsequent step.

Existing trust-region mechanisms borrow from classical policy optimization (TRPO, Schulman et al., 2015) but approximate the divergence constraint by:

PPO/GRPO: Clipping the sampled likelihood ratio $\rho_t = \pi(y_t|s_t)/\mu(y_t|s_t)$ (Schulman et al., 2017; Shao et al., 2024)
DPPO: Constraining the total-variation (TV) divergence $D_t = D_{\text{TV}}(\mu(\cdot|s_t),\pi(\cdot|s_t))$ with a uniform threshold $\delta$ (Qi et al., 2026)

All these methods apply a uniform, position-agnostic threshold across all token positions, which conflicts with autoregressive generation in two ways:

Autoregressive asymmetry: Early token deviations affect longer suffixes (more future tokens). A uniform threshold under-penalizes early deviations (which have large propagation multipliers) and over-constrains late-stage exploration.
Cumulative prefix drift: Per-token divergences accumulate in the conditioning prefix $s_t = (x, y_{<t})$. A uniform threshold permits sequences to drift far from $\mu$ while still passing per-token checks.

Theoretical Foundation: Finite-Horizon Performance Difference

The paper starts from the exact finite-horizon performance difference identity (Lemma 2):

J(\pi) - J(\mu) = L'_\mu(\pi) - \Delta(\mu, \pi)

where

L'_\mu(\pi) := \mathbb{E}_\mu\left[R(x,y) \sum_{t=1}^T (\rho_t - 1)\right], \quad \Delta(\mu,\pi) := \mathbb{E}_\mu\left[R(x,y) \sum_{t=1}^T (\rho_t - 1)(1 - \rho_{t+1:T})\right]

and $\rho_{t+1:T} = \prod_{j=t+1}^T \rho_j$ is the suffix likelihood ratio.
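The identity can be checked per sample, because $\sum_t (\rho_t - 1)\rho_{t+1:T}$ telescopes to $\rho_{1:T} - 1$, and $J(\pi) - J(\mu) = \mathbb{E}_\mu[R(x,y)(\rho_{1:T} - 1)]$ by importance sampling. The sketch below is a numerical sanity check on random ratios; the function names and the ratio range are illustrative assumptions, not taken from the paper.

```python
import math
import random


def suffix_products(rho):
    """Return s with s[t] = prod(rho[t+1:]) for 0-based t (empty product is 1)."""
    T = len(rho)
    s = [1.0] * T
    for t in range(T - 2, -1, -1):
        s[t] = s[t + 1] * rho[t + 1]
    return s


def decomposition(rho):
    """Per-sample surrogate term, residual term, and exact sequence-level term."""
    s = suffix_products(rho)
    surrogate = sum(r - 1.0 for r in rho)                            # sum_t (rho_t - 1)
    residual = sum((r - 1.0) * (1.0 - st) for r, st in zip(rho, s))  # integrand of Delta
    exact = math.prod(rho) - 1.0                                     # rho_{1:T} - 1
    return surrogate, residual, exact


random.seed(0)
rho = [random.uniform(0.8, 1.25) for _ in range(12)]  # hypothetical token ratios
surrogate, residual, exact = decomposition(rho)
# surrogate - residual matches exact up to floating-point error
print(math.isclose(surrogate - residual, exact, rel_tol=1e-9, abs_tol=1e-12))
```

Multiplying each term by the reward and averaging over rollouts from $\mu$ recovers $L'_\mu(\pi)$, $\Delta(\mu,\pi)$, and $J(\pi) - J(\mu)$ respectively.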

The surrogate error $|\Delta(\mu,\pi)|$ must be controlled. Equation (4) shows how token-level divergence at position $t$ propagates:

|\Delta(\mu,\pi)| \leq 4\xi \sum_{t=1}^{T-1} u_t \sum_{j=t+1}^T \ell_j \leq \sum_{t=1}^{T-1} \lambda_t u_t, \quad \lambda_t = 4\xi\bar{\ell}(T-t)

where $u_t = \mathbb{E}[D_t]$, $\ell_t$ is the per-token threshold, $\bar{\ell} = \max_j \ell_j$, and $\xi$ is the reward bound. The coefficient $\lambda_t \propto (T-t)$ grows linearly with the remaining horizon, which formalizes autoregressive asymmetry.
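To make the asymmetry concrete, the coefficients $\lambda_t = 4\xi\bar{\ell}(T-t)$ can be tabulated for a short response. This is a minimal sketch; the values of `T`, `xi`, and `ell_bar` are arbitrary illustrative choices rather than settings from the paper.

```python
def propagation_coefficients(T, xi, ell_bar):
    """Return [lambda_1, ..., lambda_{T-1}] with lambda_t = 4 * xi * ell_bar * (T - t)."""
    return [4.0 * xi * ell_bar * (T - t) for t in range(1, T)]


lam = propagation_coefficients(T=6, xi=1.0, ell_bar=0.15)
for t, value in enumerate(lam, start=1):
    # The first position carries the largest multiplier, the last the smallest.
    print(f"t={t}: lambda_t={value:.3f}")
```

With these inputs the multiplier at $t=1$ is $T-1$ times the multiplier at $t=T-1$, so the same divergence costs far more in the bound when it occurs early.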

Methodology

CPPO Masking Rule

CPPO replaces the uniform threshold with two coupled constraints encoded in a per-token indicator $I_t$:

Position-weighted token-level threshold: $w_t D_t \leq \delta$ with a decreasing linear schedule:

w_t = 1 - \frac{1 - w_{\min}}{T-1}(t-1), \quad t=1,\ldots,T, \; w_t \in [w_{\min}, 1]

This imposes stricter limits at early positions ($D_t \leq \delta/w_t$, which is tighter when $w_t$ is large) and relaxes them later.

Cumulative prefix budget: Let $S_t = \sum_{j=1}^t w_j D_j$ and $W_t = \sum_{j=1}^t w_j$. The condition $S_t \leq \delta + \delta_b W_{t-1}$ ensures the weighted prefix average does not exceed $\delta_b$ (with initial slack $\delta$). This dynamically reduces the allowed divergence when earlier tokens have already drifted significantly.

Combined per-token condition:

I_t : \; w_t D_t \leq \delta \;\wedge\; S_t \leq \delta + \delta_b W_{t-1}

The effective threshold at token $t$ is:

c_t^{\text{CPPO}} := \min\{\delta, \delta + \delta_b W_{t-1} - S_{t-1}\}
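The weight schedule and the effective threshold can be sketched together as follows. This is an illustration under stated assumptions: the function names are hypothetical, the divergences `D` are made-up inputs, and every token's weighted divergence is added to the prefix sum whether or not that token is later masked (the text above does not settle this detail).

```python
def position_weights(T, w_min):
    """Linear schedule from w_1 = 1 down to w_T = w_min."""
    if T == 1:
        return [1.0]
    step = (1.0 - w_min) / (T - 1)
    return [1.0 - step * (t - 1) for t in range(1, T + 1)]


def effective_thresholds(D, delta, delta_b, w_min):
    """Return c_t = min(delta, delta + delta_b * W_{t-1} - S_{t-1}) for each position."""
    weights = position_weights(len(D), w_min)
    s_prev = 0.0  # S_{t-1}: weighted divergence accumulated over the prefix
    w_prev = 0.0  # W_{t-1}: weight accumulated over the prefix
    thresholds = []
    for w_t, d_t in zip(weights, D):
        thresholds.append(min(delta, delta + delta_b * w_prev - s_prev))
        s_prev += w_t * d_t  # assumption: all tokens count toward the prefix sum
        w_prev += w_t
    return thresholds


# Hypothetical divergences: two drifting tokens in the middle of a 4-token response.
c = effective_thresholds([0.0, 0.1, 0.1, 0.0], delta=0.15, delta_b=0.02, w_min=0.8)
print(c)  # the threshold shrinks after the drifting tokens at positions 2 and 3
```

Token $t$ satisfies $I_t$ exactly when $w_t D_t \leq c_t^{\text{CPPO}}$, so a shrinking $c_t$ is how earlier drift reduces what later tokens are allowed.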

The full token-level mask:

M_t^{\text{CPPO}} = \mathbb{1}\left[\hat{A}_t(\rho_t - 1) \leq 0 \;\vee\; I_t\right]

This keeps update terms that move $\pi$ toward $\mu$ (first clause) and only allows terms driving $\pi$ away from $\mu$ when $I_t$ holds.
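A minimal sketch of the full mask for one response is given below, assuming per-token divergences `D`, ratios `rho`, and advantages `adv` are already available. The function name, the input values, and the choice to accumulate every token into $S_t$ regardless of its mask value are assumptions made for illustration; this is not the authors' implementation.

```python
def cppo_mask(D, rho, adv, delta, delta_b, w_min):
    """Return M_t in {0, 1} per token: keep if moving toward mu, or if I_t holds."""
    T = len(D)
    step = (1.0 - w_min) / (T - 1) if T > 1 else 0.0
    s = 0.0       # running S_t
    w_prev = 0.0  # running W_{t-1}
    mask = []
    for t in range(T):
        w_t = 1.0 - step * t  # linear schedule, 0-based index
        s += w_t * D[t]       # assumption: masked tokens still count toward S_t
        in_region = (w_t * D[t] <= delta) and (s <= delta + delta_b * w_prev)
        toward_mu = adv[t] * (rho[t] - 1.0) <= 0.0
        mask.append(1 if (toward_mu or in_region) else 0)
        w_prev += w_t
    return mask


# Hypothetical response: token 2 has a large divergence, the rest are small.
mask = cppo_mask(
    D=[0.01, 0.20, 0.01, 0.01],
    rho=[1.05, 1.30, 0.95, 1.02],
    adv=[1.0, 1.0, 1.0, 1.0],
    delta=0.15, delta_b=0.02, w_min=0.8,
)
print(mask)  # [1, 0, 1, 0]
```

In this example token 2 fails the position-weighted threshold, token 3 is kept only because its update moves $\pi$ back toward $\mu$, and token 4 is masked by the prefix budget even though its own divergence is small. A uniform per-token check with the same $\delta$ would have let token 4 through.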

Theoretical Guarantee

Theorem 1 (CPPO policy-improvement bound): Under constraints $w_t D_t \leq c_t$ and $P_m \leq \delta_b W_m$ for all prefixes $m=1,\ldots,T-1$ (with $P_m$ the weighted prefix divergence sum up to position $m$, cf. $S_m$ above), and assuming $r_t = \lambda_t/w_t$ is non-increasing,

J(\pi) - J(\mu) \geq L'_\mu(\pi) - 2\xi T(T-1)\bar{\ell}\delta_b

For the special case of a uniform token-level threshold $D_t \leq \delta$ (so $\bar{\ell} = \delta$), the residual constant improves from $C_{\text{uniform}} = 2\xi T(T-1)\delta^2$ to $C_{\text{CPPO}} = 2\xi T(T-1)\delta\delta_b$, giving a ratio $C_{\text{CPPO}}/C_{\text{uniform}} = \delta_b/\delta$, which is below 1 when $\delta_b < \delta$.
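The $T(T-1)$ factor in both constants comes from summing the propagation coefficients $\lambda_t = 4\xi\bar{\ell}(T-t)$. A short worked step, followed by the ratio for one illustrative pair of thresholds ($\delta = 0.15$ and $\delta_b = 0.02$ are chosen here as an example, not quoted as the paper's configuration for any particular run):

```latex
\sum_{t=1}^{T-1} \lambda_t
  = 4\xi\bar{\ell} \sum_{t=1}^{T-1} (T-t)
  = 4\xi\bar{\ell} \cdot \frac{T(T-1)}{2}
  = 2\xi T(T-1)\bar{\ell}

% Uniform case: \bar{\ell} = \delta and u_t \le \delta for every t
\sum_{t=1}^{T-1} \lambda_t u_t \le 2\xi T(T-1)\delta \cdot \delta = C_{\text{uniform}}

% Illustrative ratio
\frac{C_{\text{CPPO}}}{C_{\text{uniform}}} = \frac{\delta_b}{\delta}
  = \frac{0.02}{0.15} \approx 0.133
```

Under these example values the residual constant would shrink by roughly a factor of 7.5.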

Divergence Approximation

All token-level trust-region methods use the Top-K reduced-TV approximation (K=20) from DPPO (Qi et al., 2026). Instead of the exact $D_{\text{TV}}(\mu(\cdot|s_t),\pi(\cdot|s_t))$ over the full vocabulary, the divergence is computed over the 20 highest-probability tokens of $\mu$ at each position, with probabilities renormalized to sum to 1.
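One way to read the Top-K computation described above is sketched here for a single position. This is an assumption-laden illustration: the function name is hypothetical, both distributions are renormalized over the top-K support of $\mu$, and the actual DPPO estimator may treat the remaining probability mass differently.

```python
def topk_tv(mu, pi, k=20):
    """TV distance between mu and pi restricted to mu's top-k tokens, renormalized."""
    top = sorted(range(len(mu)), key=lambda i: mu[i], reverse=True)[:k]
    mu_mass = sum(mu[i] for i in top)
    pi_mass = sum(pi[i] for i in top)
    if mu_mass <= 0.0 or pi_mass <= 0.0:
        raise ValueError("top-k support has zero probability mass")
    # Assumption: each distribution is renormalized to sum to 1 on the top-k support.
    return 0.5 * sum(abs(mu[i] / mu_mass - pi[i] / pi_mass) for i in top)


# Toy 4-token vocabulary (made-up probabilities).
mu = [0.5, 0.3, 0.1, 0.1]
pi = [0.4, 0.4, 0.1, 0.1]
print(topk_tv(mu, pi, k=2))  # restricted to the two most likely tokens under mu
print(topk_tv(mu, pi, k=4))  # equals the exact TV when k covers the vocabulary
```

When `k` is at least the vocabulary size the function reduces to the exact total-variation distance, which gives a simple way to test it.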

Algorithm

Algorithm 1 details the mask computation for one response: iterate over the tokens in a single left-to-right pass, maintain the prefix sums $S_t$ and $W_t$, compute the effective threshold $c_t$, and mask updates that violate the condition.

Empirical Validation / Results

Experimental Setup

Training data: DAPO-Math-17k (≈17k verifiable math prompts)
Models: Qwen3-1.7B, Qwen3-1.7B-Base, Qwen3-8B-Base, Qwen3-30B-A3B-Base
Hyperparameters:
- Dense models: $T_{\max}=8k$, $n=8$ rollouts
- 30B-A3B: $T_{\max}=16k$, $n=16$ rollouts
Evaluation: AIME24/25/26 Avg@16 (unweighted mean)
Baselines: GRPO, CISPO, MinPRO, DPPO, TRM-Max, TRM-Avg
CPPO settings: $\delta = 0.15$ (dense) or $0.2$ (MoE); $w_{\min}=0.8$; $\delta_b$ adaptive for Base models (top-10% quantile clamped to $[2\delta_b^{\min}, 4\delta_b^{\min}]$)

Main Results

Table 1: Best validation AIME24/25/26 Avg@16 (%, higher is better)

Method	1.7B	1.7B-Base	8B-Base	30B-A3B-Base
GRPO	27.91	8.89	23.96	38.19
MinPRO	27.71	11.04	29.72	48.12
CISPO	28.82	11.87	29.58	collapse
DPPO	28.19	10.90	28.89	49.23
TRM-Max	25.21	9.72	26.73	20.27
TRM-Avg	26.87	11.70	27.98	48.96
CPPO (ours)	31.88	12.78	31.11	54.79

CPPO outperforms all baselines in every setting by margins of 0.91–5.56 absolute points.
The largest gain (5.56 points) is on the largest model (30B-A3B-Base) with the longest horizon (16k), where autoregressive asymmetry is most pronounced.
CISPO collapses on 30B-A3B-Base; TRM-Max degrades to 20.27, while CPPO trains stably.
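The quoted margins can be reproduced from Table 1 by subtracting the best baseline in each column from the CPPO score. The snippet below only re-does that arithmetic on the table values; the variable names are illustrative, and the collapsed CISPO run is excluded from the 30B-A3B-Base column.

```python
settings = ["1.7B", "1.7B-Base", "8B-Base", "30B-A3B-Base"]
baselines = {
    "GRPO": [27.91, 8.89, 23.96, 38.19],
    "MinPRO": [27.71, 11.04, 29.72, 48.12],
    "CISPO": [28.82, 11.87, 29.58, None],  # None marks the reported collapse
    "DPPO": [28.19, 10.90, 28.89, 49.23],
    "TRM-Max": [25.21, 9.72, 26.73, 20.27],
    "TRM-Avg": [26.87, 11.70, 27.98, 48.96],
}
cppo = [31.88, 12.78, 31.11, 54.79]


def margins_over_best_baseline(ours, others):
    """Per-setting gap between our score and the strongest non-collapsed baseline."""
    gaps = []
    for col, score in enumerate(ours):
        best = max(row[col] for row in others.values() if row[col] is not None)
        gaps.append(round(score - best, 2))
    return gaps


gaps = margins_over_best_baseline(cppo, baselines)
for name, gap in zip(settings, gaps):
    print(f"{name}: +{gap:.2f}")
```

The smallest gap falls on 1.7B-Base (against CISPO) and the largest on 30B-A3B-Base (against DPPO), matching the 0.91–5.56 range stated above.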

Ablation Studies (Figure 5, Figure 6)

Single mechanism ablation: Removing either mechanism from CPPO (setting uniform weights $w_t\equiv1$, or dropping the prefix budget) still leaves a variant that outperforms DPPO, but full CPPO achieves the highest scores.
Position-weight ordering: Shuffling the position-dependent thresholds randomly (keeping the same multiset) yields lower performance than the autoregressive ordered schedule, confirming that the ordering by position drives the gain.
Mask vs. soft gate: A soft variant (gradient attenuation near boundary) performs similarly to the hard mask.
Hyperparameter sensitivity: Varying $\delta_b$ (0.02→0.03) and $w_{\min}$ (0.8→0.6) maintains performance above DPPO.
KL vs. TV divergence: CPPO with KL divergence (using TRM thresholds $\delta=0.1$, $\delta_b=0.002$) matches the TV configuration and outperforms DPPO; TRM-Max and TRM-Avg with the same thresholds do not.
Binary vs. Top-K approximation: Both approximations yield comparable performance and exceed DPPO.

Theoretical and Practical Implications

Theoretical Contributions

Formalizes autoregressive asymmetry in the error propagation bound: the coefficient $\lambda_t = 4\xi\bar{\ell}(T-t)$ shows early token-level divergence has linearly larger impact on the surrogate residual.
Derives a prefix-constrained policy-improvement bound (Theorem 1) that replaces the pointwise dependence on $\delta^2$ with a dependence on $\delta \delta_b$, so cumulative prefix constraints provably tighten the bound when $\delta_b < \delta$.
Connects the bound to practical masking rules: the position weight $w_t$ and prefix budget $\delta_b$ directly implement the theoretical requirements (monotonicity of $(T-t)/w_t$ and prefix-sum bounds).

Practical Implications for LLM RL

Drop-in replacement: CPPO modifies only the token-level mask while preserving the standard PPO/GRPO ratio-advantage objective, requiring no additional loss terms or architecture changes.
Few hyperparameters: two thresholds, $\delta$ (token-level threshold scale) and $\delta_b$ (prefix-average threshold), plus the weight floor $w_{\min}$. The adaptive $\delta_b$ for Base models handles initial high-exploration phases automatically.
Stability gains: CPPO trains stably on the large MoE model (30B-A3B-Base), where CISPO collapses and TRM-Max degrades, and it consistently improves over DPPO, which shares the same divergence estimator.
Broad applicability: Gains hold across model scales (1.7B to 30B), architectures (dense and MoE), training stages (Base and post-trained), divergence metrics (TV and KL), and approximation granularities (Top-K and Binary).

Conclusion

This work identifies fundamental limitations of uniform token-level trust regions in LLM RL and proposes CPPO, a principled alternative that respects the autoregressive structure of generation. Key contributions:

Formalization: A finite-horizon error bound showing how token position affects error propagation.
Algorithm: CPPO's dual-constraint mask (position-weighted threshold + cumulative prefix budget) that dynamically allocates divergence budget along the response.
Theory: A provably tighter policy-improvement bound via prefix constraints.
Empirical validation: Consistent improvements across four Qwen3 settings, with ablations confirming that both mechanisms contribute and are complementary.

Future directions: Extending the prefix-budget concept to multi-turn interactions, adapting the weight schedule based on task difficulty, and exploring soft-gate variants further. The principle of aligning trust-region structure with the autoregressive factorization of LLMs opens a promising direction for more stable and capable reasoning RL.