# Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

> CPPO replaces uniform token-level trust regions with position-weighted thresholds and cumulative prefix budgets, achieving state-of-the-art AIME results across Qwen3 models.

- **Source:** [arXiv](https://arxiv.org/abs/2606.10968)
- **Published:** 2026-06-11
- **Permalink:** https://picx.dev/p/wEkgGz
- **Whiteboard:** https://picx.dev/p/wEkgGz/image

## Summary

## Summary (Overview)

- This paper identifies and formalizes two critical limitations of uniform (position-agnostic) token-level trust regions in LLM reinforcement learning with verifiable rewards (RLVR): they ignore **autoregressive asymmetry** (early token deviations propagate over longer suffixes) and **cumulative prefix drift** (per-token errors accumulate in the conditioning history).
- The authors propose **CPPO** (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that replaces uniform thresholds with two coupled mechanisms: a **position-weighted token-level threshold** (stricter at early positions) and a **cumulative prefix budget** (dynamically restricts divergence as prefix drift accumulates).
- A novel **prefix-constrained policy-improvement bound** is derived (Theorem 1), showing that constraining weighted prefix averages rather than pointwise divergences provably tightens the surrogate residual bound.
- Empirically, CPPO achieves the best AIME24/25/26 Avg@16 scores across four Qwen3 model scales (1.7B, 8B, 30B-A3B), outperforming strong baselines (GRPO, DPPO, MinPRO, CISPO, TRM) by margins of 0.91–5.56 absolute points.
- Ablations confirm both the position-weight and prefix-budget mechanisms independently contribute, and the gain is robust to divergence metric (TV vs. KL) and approximation granularity (Top-K vs. Binary).

## Introduction and Theoretical Foundation

### Background and Motivation

Reinforcement learning with verifiable rewards (RLVR) has become standard for LLM reasoning post-training (Ouyang et al., 2022; Shao et al., 2024). In RLVR, a policy generates responses, a verifier assigns scalar rewards, and updates are performed using PPO/GRPO-style token-level objectives (Schulman et al., 2017). Off-policy updates cause the target policy $\pi$ to drift from the rollout policy $\mu$, and autoregressive generation amplifies divergence because early token deviations alter the conditioning of all subsequent steps.

Existing trust-region mechanisms borrow from classical policy optimization (TRPO, Schulman et al., 2015) but approximate the divergence constraint by:
- **PPO/GRPO:** Clipping the sampled likelihood ratio $\rho_t = \pi(y_t|s_t)/\mu(y_t|s_t)$ (Schulman et al., 2017; Shao et al., 2024)
- **DPPO:** Constraining the total-variation (TV) divergence $D_t = D_{\text{TV}}(\mu(\cdot|s_t),\pi(\cdot|s_t))$ with a uniform threshold $\delta$ (Qi et al., 2026)

All these methods apply a **uniform, position-agnostic threshold** across all token positions, which conflicts with autoregressive generation in two ways:

1. **Autoregressive asymmetry:** Early token deviations affect longer suffixes (more future tokens). A uniform threshold under-penalizes early deviations (which have large propagation multipliers) and over-constrains late-stage exploration.
2. **Cumulative prefix drift:** Per-token divergences accumulate in the conditioning prefix $s_t = (x, y_{<t})$. A uniform threshold permits sequences to drift far from $\mu$ while still passing per-token checks.

### Theoretical Foundation: Finite-Horizon Performance Difference

The paper starts from the exact finite-horizon performance difference identity (Lemma 2):
$$
J(\pi) - J(\mu) = L'_\mu(\pi) - \Delta(\mu, \pi)
$$
where
$$
L'_\mu(\pi) := \mathbb{E}_\mu\left[R(x,y) \sum_{t=1}^T (\rho_t - 1)\right], \quad \Delta(\mu,\pi) := \mathbb{E}_\mu\left[R(x,y) \sum_{t=1}^T (\rho_t - 1)(1 - \rho_{t+1:T})\right]
$$
and $\rho_{t+1:T} = \prod_{j=t+1}^T \rho_j$ is the suffix likelihood ratio.

The surrogate error $|\Delta(\mu,\pi)|$ must be controlled. Equation (4) shows how token-level divergence at position $t$ propagates:
$$
|\Delta(\mu,\pi)| \leq 4\xi \sum_{t=1}^{T-1} u_t \sum_{j=t+1}^T \ell_j \leq \sum_{t=1}^{T-1} \lambda_t u_t, \quad \lambda_t = 4\xi\bar{\ell}(T-t)
$$
where $u_t = \mathbb{E}[D_t]$, $\ell_t$ is the per-token threshold, $\bar{\ell} = \max_j \ell_j$, and $\xi$ is the reward bound. The coefficient $\lambda_t \propto (T-t)$ grows linearly with remaining horizon—the formalization of autoregressive asymmetry.

## Methodology

### CPPO Masking Rule

CPPO replaces the uniform threshold with two coupled constraints encoded in a per-token indicator $I_t$:

**Position-weighted token-level threshold:** $w_t D_t \leq \delta$ with a decreasing linear schedule:
$$
w_t = 1 - \frac{1 - w_{\min}}{T-1}(t-1), \quad t=1,\ldots,T, \; w_t \in [w_{\min}, 1]
$$
This imposes stricter limits at early positions ($D_t \leq \delta/w_t$, tighter when $w_t$ is large) and relaxes them later.

**Cumulative prefix budget:** Let $S_t = \sum_{j=1}^t w_j D_j$ and $W_t = \sum_{j=1}^t w_j$. The condition $S_t \leq \delta + \delta_b W_{t-1}$ ensures the weighted prefix average does not exceed $\delta_b$ (with initial slack $\delta$). This dynamically reduces the allowed divergence when earlier tokens have already drifted significantly.

Combined per-token condition:
$$
I_t : \; w_t D_t \leq \delta \;\wedge\; S_t \leq \delta + \delta_b W_{t-1}
$$
The effective threshold at token $t$ is:
$$
c_t^{\text{CPPO}} := \min\{\delta, \delta + \delta_b W_{t-1} - S_{t-1}\}
$$

The full token-level mask:
$$
M_t^{\text{CPPO}} = \mathbb{1}\left[\hat{A}_t(\rho_t - 1) \leq 0 \;\vee\; I_t\right]
$$
This keeps update terms that move $\pi$ toward $\mu$ (first clause) and only allows terms driving $\pi$ away from $\mu$ when $I_t$ holds.

### Theoretical Guarantee

**Theorem 1 (CPPO policy-improvement bound):** Under constraints $w_t D_t \leq c_t$ and $P_m \leq \delta_b W_m$ for all prefixes $m=1,\ldots,T-1$, and assuming $r_t = \lambda_t/w_t$ is non-increasing,
$$
J(\pi) - J(\mu) \geq L'_\mu(\pi) - 2\xi T(T-1)\bar{\ell}\delta_b
$$
For the special case of uniform token-level threshold $D_t \leq \delta$, the residual constant improves from $C_{\text{uniform}} = 2\xi T(T-1)\delta^2$ to $C_{\text{CPPO}} = 2\xi T(T-1)\delta\delta_b$, giving a ratio $C_{\text{CPPO}}/C_{\text{uniform}} = \delta_b/\delta$ (which is <1 when $\delta_b < \delta$).

### Divergence Approximation

All token-level trust-region methods use the **Top-K reduced-TV approximation** (K=20) from DPPO (Qi et al., 2026). The exact $D_{\text{TV}}(\mu(\cdot|s_t),\pi(\cdot|s_t))$ is computed over the top-20 highest-probability tokens of $\mu$ at each position, normalized to sum to 1.

### Algorithm

Algorithm 1 details the mask computation for one response: iterate tokens linearly, maintain prefix sums $S_t$, $W_t$, compute effective threshold $c_t$, and mask updates that violate the condition.

## Empirical Validation / Results

### Experimental Setup

- **Training data:** DAPO-Math-17k (≈17k verifiable math prompts)
- **Models:** Qwen3-1.7B, Qwen3-1.7B-Base, Qwen3-8B-Base, Qwen3-30B-A3B-Base
- **Hyperparameters:**
  - Dense models: $T_{\max}=8k$, $n=8$ rollouts
  - 30B-A3B: $T_{\max}=16k$, $n=16$ rollouts
- **Evaluation:** AIME24/25/26 Avg@16 (unweighted mean)
- **Baselines:** GRPO, CISPO, MinPRO, DPPO, TRM-Max, TRM-Avg
- **CPPO settings:** $\delta = 0.15$ (dense) or $0.2$ (MoE); $w_{\min}=0.8$; $\delta_b$ adaptive for Base models (top-10% quantile clamped to $[2\delta_b^{\min}, 4\delta_b^{\min}]$)

### Main Results

**Table 1: Best validation AIME24/25/26 Avg@16 (%, higher is better)**

| Method | 1.7B | 1.7B-Base | 8B-Base | 30B-A3B-Base |
|--------|------|-----------|---------|--------------|
| GRPO | 27.91 | 8.89 | 23.96 | 38.19 |
| MinPRO | 27.71 | 11.04 | 29.72 | 48.12 |
| CISPO | 28.82 | 11.87 | 29.58 | collapse |
| DPPO | 28.19 | 10.90 | 28.89 | 49.23 |
| TRM-Max | 25.21 | 9.72 | 26.73 | 20.27 |
| TRM-Avg | 26.87 | 11.70 | 27.98 | 48.96 |
| **CPPO (ours)** | **31.88** | **12.78** | **31.11** | **54.79** |

- CPPO outperforms all baselines in every setting by margins of 0.91–5.56 absolute points.
- The largest gain (5.56 points) is on the largest model (30B-A3B-Base) with longest horizon (16k), where autoregressive asymmetry is most pronounced.
- CISPO collapses on 30B-A3B-Base; TRM-Max degrades to 20.27, while CPPO trains stably.

### Ablation Studies (Figure 5, Figure 6)

1. **Single mechanism ablation:** Removing either the position weight or the prefix budget from CPPO (using uniform weights $w_t\equiv1$ or no prefix budget respectively) still outperforms DPPO, but full CPPO achieves the highest scores.
2. **Position-weight ordering:** Shuffling the position-dependent thresholds randomly (keeping the same multiset) yields lower performance than the autoregressive ordered schedule, confirming that the ordering by position drives the gain.
3. **Mask vs. soft gate:** A soft variant (gradient attenuation near boundary) performs similarly to the hard mask.
4. **Hyperparameter sensitivity:** Varying $\delta_b$ (0.02→0.03) and $w_{\min}$ (0.8→0.6) maintains performance above DPPO.
5. **KL vs. TV divergence:** CPPO with KL divergence (using TRM thresholds $\delta=0.1,\delta_b=0.002$) matches the TV configuration and outperforms DPPO; TRM Max&Avg with same thresholds does not.
6. **Binary vs. Top-K approximation:** Both approximations yield comparable performance and exceed DPPO.

## Theoretical and Practical Implications

### Theoretical Contributions

- Formalizes **autoregressive asymmetry** in the error propagation bound: the coefficient $\lambda_t = 4\xi\bar{\ell}(T-t)$ shows early token-level divergence has linearly larger impact on the surrogate residual.
- Derives a **prefix-constrained policy-improvement bound** (Theorem 1) that replaces the pointwise dependence on $\delta^2$ with a tighter dependence on $\delta \delta_b$, proving that cumulative prefix constraints provably tighten the bound when $\delta_b < \delta$.
- Connects the bound to practical masking rules: the position weight $w_t$ and prefix budget $\delta_b$ directly implement the theoretical requirements (monotonicity of $(T-t)/w_t$ and prefix-sum bounds).

### Practical Implications for LLM RL

- **Drop-in replacement:** CPPO modifies only the token-level mask while preserving the standard PPO/GRPO ratio-advantage objective, requiring no additional loss terms or architecture changes.
- **Two hyperparameters:** $\delta$ (token-level threshold scale) and $\delta_b$ (prefix-average threshold) plus weight floor $w_{\min}$. The adaptive $\delta_b$ for Base models handles initial high-exploration phases automatically.
- **Stability gains:** CPPO prevents collapse in large models (30B-A3B-Base) where CISPO and TRM-Max fail, and consistently improves over DPPO which shares the same divergence estimator.
- **Broad applicability:** Gains hold across model scales (1.7B–30B+), architectures (dense and MoE), training stages (Base and post-trained), and divergence metrics (TV, KL, Binary, Top-K).

## Conclusion

This work identifies fundamental limitations of uniform token-level trust regions in LLM RL and proposes CPPO, a principled alternative that respects the autoregressive structure of generation. Key contributions:

1. **Formalization:** A finite-horizon error bound showing how token position affects error propagation.
2. **Algorithm:** CPPO's dual-constraint mask (position-weighted threshold + cumulative prefix budget) that dynamically allocates divergence budget along the response.
3. **Theory:** A provably tighter policy-improvement bound via prefix constraints.
4. **Empirical validation:** Consistent improvements across four Qwen3 settings, with ablations confirming both mechanisms are necessary and complementary.

**Future directions:** Extending the prefix-budget concept to multi-turn interactions, adapting the weight schedule based on task difficulty, and exploring soft-gate variants further. The principle of aligning trust-region structure with the autoregressive factorization of LLMs opens a promising direction for more stable and capable reasoning RL.

---

_Markdown view of https://picx.dev/p/wEkgGz, served by PicX — AI-generated visual whiteboard summaries of research papers._