Summary (Overview)

  • Main Contribution: Introduces BandPO (Band-constrained Policy Optimization), a novel method that replaces the canonical clipping mechanism in PPO with a unified theoretical operator called Band. This operator projects trust regions defined by $f$-divergences into dynamic, probability-aware clipping intervals.
  • Key Finding: Identifies a critical bottleneck in canonical clipping: fixed bounds constrain the upward update margin of low-probability actions linearly with their probability, disproportionately suppressing high-advantage tail strategies and inducing entropy collapse.
  • Methodological Innovation: Formulates the mapping from trust region to clipping bounds as a convex optimization problem, guaranteeing globally optimal numerical solutions and deriving closed-form solutions for specific divergences (Total Variation and Pearson $\chi^2$).
  • Empirical Validation: Demonstrates consistent performance improvements over GRPO and Clip-Higher baselines across multiple model scales (Qwen2.5-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B/7B, DeepSeek-R1-Distill-Llama-8B) on mathematical benchmarks (AMC, AIME), while robustly mitigating entropy collapse.
  • Theoretical Insight: The Band operator naturally circumvents the exploration bottleneck by adaptively expanding the feasible upward margin for low-probability actions, preventing premature clipping and preserving exploration gradients.

Introduction and Theoretical Foundation

Reinforcement Learning from Human Feedback (RLHF) is the dominant paradigm for post-training Large Language Models (LLMs), where proximal constraints on policy updates balance optimization stability with exploration. The canonical clipping mechanism in PPO serves as an efficient surrogate for trust-region updates. However, this paper identifies a critical structural bottleneck: fixed clipping bounds enforce a linear dependence where the maximum feasible probability variation scales proportionally with the old probability. Consequently, for positive-advantage actions, lower probabilities dictate vanishingly small margins for upward variation, rendering them susceptible to premature clipping and nullifying their gradient contributions. This inhibits the model from reinforcing novel, superior strategies in the distribution tail.

While heuristic approaches like Clip-Higher (DAPO) relax upper bounds to delay entropy collapse, they often lead to instability and performance collapse. The fundamental issue is that adjusting fixed thresholds fails to address the inherent limitation. The paper proposes BandPO to bridge this gap by introducing a unified Band operator that projects $f$-divergence-induced trust regions into dynamic, probability-aware clipping intervals, governed by a single interpretable radius parameter $\delta$.

Methodology

Notation and Problem Formulation

The RL alignment of LLMs is formulated as a discrete Markov Decision Process. Let $\pi_\theta$ denote the policy. Given a prompt $x$, the policy generates a response sequence $y = (a_1, a_2, \dots, a_T)$. The state is $s_t = (x, y_{<t})$. The optimization objective is to maximize the expected reward: $J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}[R(x, y)]$.

The Group Relative Policy Optimization (GRPO) objective aggregates per-token objectives:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\text{old}}} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{T_i} \sum_{t=1}^{T_i} \mathcal{J}_t(\theta; y_i) \right]$$

(1)

The per-token objective with asymmetric clipping is:

$$\mathcal{J}_t(\theta; y_i) = \min\left( r_{t,i} A_{t,i},\ \text{clip}(r_{t,i},\, 1 - \epsilon_{-},\, 1 + \epsilon_{+})\, A_{t,i} \right) - \beta\, D_{\text{KL}}\big(\pi_\theta(\cdot|s_t) \,\|\, \pi_{\text{ref}}(\cdot|s_t)\big)$$

(2) where $r_{t,i} = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$ and $A_{t,i}$ is the advantage.
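For concreteness, here is a minimal scalar sketch of the clipped surrogate in Eq. (2), with the KL penalty omitted; the function and parameter names are illustrative, not from the paper's code:

```python
def grpo_token_objective(ratio, advantage, eps_minus=0.2, eps_plus=0.2):
    """Clipped surrogate of Eq. (2) for a single token; KL penalty omitted.

    ratio = pi_theta(a|s) / pi_old(a|s); advantage = A_{t,i}.
    """
    # clip(r, 1 - eps_minus, 1 + eps_plus)
    clipped = min(max(ratio, 1.0 - eps_minus), 1.0 + eps_plus)
    # Pessimistic min between unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

For a positive advantage, the objective saturates once the ratio exceeds $1 + \epsilon_{+}$, so the gradient through that token vanishes; this is exactly the behavior the bottleneck analysis below examines.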

The Bottleneck in Canonical Clipping

The canonical clipping constraint is:

$$1 - \epsilon_{-} \leq \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)} \leq 1 + \epsilon_{+}$$

(3)

Defining the probability variation $\Delta\pi(a|s) = \pi_\theta(a|s) - \pi_{\text{old}}(a|s)$, the feasible set under clipping is:

$$\mathcal{C}_{\text{clip}} = \{ \Delta\pi(a|s) \mid -\epsilon_{-}\, \pi_{\text{old}}(a|s) \leq \Delta\pi(a|s) \leq \epsilon_{+}\, \pi_{\text{old}}(a|s) \}$$

(4)

The maximal feasible set from the probability simplex $\Delta_V$ is:

$$\mathcal{C}_{\text{simplex}} = \{ \Delta\pi(a|s) \mid -\pi_{\text{old}}(a|s) \leq \Delta\pi(a|s) \leq 1 - \pi_{\text{old}}(a|s) \}$$

(5)

Mapping $\mathcal{C}_{\text{simplex}}$ to ratio space gives the theoretical bounds:

$$0 \leq \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)} \leq \frac{1}{\pi_{\text{old}}(a|s)}$$

(6)

The bottleneck: fixed clipping bounds constrain probability variations to scale linearly with $\pi_{\text{old}}(a|s)$. The feasible upward shift vanishes as the probability approaches zero, in sharp contrast to the theoretical upper bound $1/\pi_{\text{old}}(a|s)$, which diverges. This induces premature clipping for low-probability, positive-advantage actions.
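A quick numeric illustration (values chosen for exposition, with $\epsilon_{+} = 0.2$) makes the gap between Eq. (4) and Eq. (5) concrete:

```python
# Upward probability margin allowed by fixed clipping (Eq. 4) vs. the
# simplex ceiling (Eq. 5), for a positive-advantage token.
eps_plus = 0.2
for p_old in (0.5, 0.05, 0.005):
    clip_margin = eps_plus * p_old      # shrinks linearly with p_old
    simplex_margin = 1.0 - p_old        # stays near 1 for rare tokens
    print(f"p_old={p_old}: clip allows +{clip_margin:.4f}, "
          f"simplex allows +{simplex_margin:.3f}")
```

At $\pi_{\text{old}} = 0.005$ the clipping constraint permits an upward shift of only $0.001$ in probability, while the simplex would permit $0.995$: the rare token's update budget is three orders of magnitude below what is geometrically feasible.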

BandPO: Band-constrained Policy Optimization

$f$-Divergence-Induced Trust Regions

Let $P(\cdot) \triangleq \pi_{\text{old}}(\cdot|s_t) \in \Delta_V$ and $Q(\cdot) \triangleq \pi_\theta(\cdot|s_t) \in \Delta_V$. For a strictly convex function $f: \mathbb{R}_+ \to \mathbb{R}$ with $f(1) = 0$, the $f$-divergence is:

$$D_f(Q \,\|\, P) \triangleq \sum_{a \in \mathcal{V}} P(a)\, f\!\left( \frac{Q(a)}{P(a)} \right)$$

(7)

The trust region $\mathcal{T}_{f,\delta}(P)$ is defined as:

$$\mathcal{T}_{f,\delta}(P) \triangleq \{ Q \in \Delta_V \mid D_f(Q \,\|\, P) \leq \delta \}$$

(8) where $\delta > 0$ is the trust-region radius.

The Band Operator

For a token $a$ with $P(a) > 0$, the ratio function is $r(a; Q, P) \triangleq \frac{Q(a)}{P(a)}$. The optimal dynamic bounds are obtained by solving convex optimization problems:

Upper bound:

$$\overline{r}_{f,\delta}(a; P) \triangleq \max_{Q \in \mathcal{T}_{f,\delta}(P)} \frac{Q(a)}{P(a)}$$

(10)

Lower bound:

$$\underline{r}_{f,\delta}(a; P) \triangleq \min_{Q \in \mathcal{T}_{f,\delta}(P)} \frac{Q(a)}{P(a)}$$

(11)

The Band operator is then defined as:

$$\text{Band}_{f,\delta}(r; a, P) \triangleq \text{clip}\!\left( r,\ \underline{r}_{f,\delta}(a; P),\ \overline{r}_{f,\delta}(a; P) \right)$$

(12)

Reduction to Univariate Optimization

Lemma 1 (Optimality of Uniform Complement Rescaling): The optimal solution $Q^*$ to Problems (10) and (11) must preserve relative probability proportions within the complement set $\mathcal{V} \setminus \{a\}$:

$$\frac{Q^*(b)}{P(b)} = c, \quad \forall b \in \mathcal{V} \setminus \{a\}$$

(13) where $c \in \mathbb{R}_+$ is uniquely determined by simplex normalization.

Let $p \triangleq P(a)$. The simplex constraint gives $c(r) = \frac{1 - rp}{1 - p}$. Substituting into the divergence yields a univariate function:

$$D_f(Q \,\|\, P) = p\, f(r) + (1-p)\, f\!\left( \frac{1 - rp}{1 - p} \right) \triangleq g_f(p, r)$$

(14)
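As a sanity check on Lemma 1 and Eq. (14), the following sketch builds $Q$ by uniform complement rescaling and confirms it stays on the simplex with $D_f(Q \,\|\, P) = g_f(p, r)$. It assumes the forward-KL generator $f(u) = u \log u$ purely for illustration; any valid generator works the same way:

```python
import math

def kl_f(u):
    # Generator of forward KL: f(u) = u*log(u), with f(1) = 0; f(0) := 0 by limit.
    return u * math.log(u) if u > 0 else 0.0

def g(p, r):
    # Univariate divergence of Eq. (14) after uniform complement rescaling.
    c = (1.0 - r * p) / (1.0 - p)   # Eq. (13)'s common ratio on V \ {a}
    return p * kl_f(r) + (1.0 - p) * kl_f(c)

# Explicit check: P over 4 tokens, bump the probability of token 0 by r = 2.
P = [0.1, 0.3, 0.4, 0.2]
p, r = P[0], 2.0
c = (1.0 - r * p) / (1.0 - p)
Q = [r * P[0]] + [c * Pb for Pb in P[1:]]
assert abs(sum(Q) - 1.0) < 1e-12                    # Q stays on the simplex
full_kl = sum(Pa * kl_f(Qa / Pa) for Qa, Pa in zip(Q, P))
assert abs(full_kl - g(p, r)) < 1e-12               # full divergence equals g_f(p, r)
```

The rescaling makes the vocabulary-sized constrained problem collapse to a scalar equation in $r$, which is what Theorem 1 below exploits.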

Theorem 1 (Exact Scalarization of Trust-Region Constraints): For $p \in (0, 1)$, Problems (10) and (11) are equivalent to finding roots of:

$$g_f(p, r) \triangleq p\, f(r) + (1-p)\, f\!\left( \frac{1 - rp}{1 - p} \right) = \delta$$

(15) subject to $r \in [0, 1/p]$. $g_f(p, r)$ is strictly convex in $r$ with global minimum $0$ at $r = 1$. For $\delta > 0$, the equation $g_f(p, r) = \delta$ therefore has exactly two roots:

$$\underline{r}_{f,\delta}(a; P) = \min\{ r \in [0, 1] \mid g_f(p, r) = \delta \}$$

(16)

$$\overline{r}_{f,\delta}(a; P) = \max\{ r \in [1, 1/p] \mid g_f(p, r) = \delta \}$$

(17)

Properties of Band Bounds

Proposition 1 (Asymptotic Behavior of Band Bounds): Given $\delta > 0$:

$$\lim_{p \to 0^+} \overline{r}_{f,\delta}(p) = +\infty, \quad \lim_{p \to 0^+} \underline{r}_{f,\delta}(p) = 0, \quad \lim_{p \to 1^-} \overline{r}_{f,\delta}(p) = 1$$

(18)

Proposition 2 (Strict Monotonicity of Band Bounds): The clipping bounds are strictly monotonic functions of $p$: the upper bound $\overline{r}_{f,\delta}(p)$ is strictly decreasing in $p$, while the lower bound $\underline{r}_{f,\delta}(p)$ is strictly increasing in $p$.

Solving Band Bounds

Simplex Saturation: For large $\delta$, the divergence constraint may extend beyond the simplex boundary. The optimal Band upper bound is:

$$\overline{r}_{f,\delta}(p) = \begin{cases} r_{\text{max}}, & \text{if } g_f(p, r_{\text{max}}) \leq \delta \\ r^{\dagger}, & \text{otherwise} \end{cases}$$

(19) where $r_{\text{max}} \triangleq 1/p$ and $r^{\dagger}$ is the unique root of $g_f(p, r) = \delta$ in $(1, r_{\text{max}})$. The lower bound is handled symmetrically.

Generic Numerical Solver: In the active regime, the bounds are the unique roots of $g_f(p, r) = \delta$ on either side of $r = 1$, which can be computed via bracketed root-finding (e.g., bisection).
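A minimal sketch of such a solver for the upper bound, combining the case split of Eq. (19) with bisection on the bracket $[1, 1/p]$ from Theorem 1 (generator choices here are illustrative, not tied to the paper's implementation):

```python
import math

# Illustrative generators: forward KL and Pearson chi^2.
kl = lambda u: u * math.log(u) if u > 0 else 0.0
chi2 = lambda u: (u - 1.0) ** 2

def band_upper(p, delta, f, tol=1e-10):
    """Upper Band bound: bisection on g_f(p, r) = delta over [1, 1/p].

    Implements the case split of Eq. (19): if even the simplex edge
    r_max = 1/p fits the divergence budget, return it directly.
    """
    def g(r):
        c = (1.0 - r * p) / (1.0 - p)   # Eq. (14)
        return p * f(r) + (1.0 - p) * f(c)

    r_max = 1.0 / p
    if g(r_max) <= delta:               # simplex saturation
        return r_max
    lo, hi = 1.0, r_max                 # g(1) = 0 < delta < g(r_max)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < delta:              # g is increasing on [1, r_max]
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the $\chi^2$ generator the bisection result can be checked against the closed form of Eq. (21), and the monotonicity of Proposition 2 is visible directly: `band_upper(0.01, 0.05, kl)` exceeds `band_upper(0.5, 0.05, kl)`.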

Closed-Form Solutions:

Proposition 4 (Closed-Form Band Bounds for TV and Pearson $\chi^2$):

  • For Total Variation with $f_{\text{TV}}(u) = \frac{1}{2}|u - 1|$: $\overline{r}_{\text{TV},\delta}(p) = 1 + \frac{\delta}{p}, \quad \underline{r}_{\text{TV},\delta}(p) = 1 - \frac{\delta}{p}$ (20)
  • For Pearson $\chi^2$ with $f_{\chi^2}(u) = (u-1)^2$: $\overline{r}_{\chi^2,\delta}(p) = 1 + \sqrt{\frac{\delta(1-p)}{p}}, \quad \underline{r}_{\chi^2,\delta}(p) = 1 - \sqrt{\frac{\delta(1-p)}{p}}$ (21)
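These closed forms are cheap enough to evaluate per token. A minimal sketch (the clamping of the results to the simplex-feasible interval $[0, 1/p]$ is an illustrative guard in the spirit of Eq. (19), not part of Eqs. (20)–(21) themselves):

```python
import math

def band_tv(p, delta):
    """Closed-form Total Variation Band bounds, Eq. (20), clamped to [0, 1/p]."""
    return max(0.0, 1.0 - delta / p), min(1.0 / p, 1.0 + delta / p)

def band_chi2(p, delta):
    """Closed-form Pearson chi^2 Band bounds, Eq. (21), clamped to [0, 1/p]."""
    s = math.sqrt(delta * (1.0 - p) / p)
    return max(0.0, 1.0 - s), min(1.0 / p, 1.0 + s)

# Propositions 1 and 2 in action: the upper bound widens as p shrinks.
for p in (0.5, 0.05, 0.005):
    print(p, band_chi2(p, 0.05))
```

Both families make the probability-awareness explicit: the interval width scales like $1/p$ (TV) or $\sqrt{1/p}$ ($\chi^2$), so rare tokens get the enlarged upward margin that fixed clipping denies them.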

BandPO Objective

BandPO maximizes:

$$\mathcal{J}_{\text{BandPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\text{old}}} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{T_i} \sum_{t=1}^{T_i} \mathcal{J}^{\text{Band}}_{t}(\theta; y_i) \right]$$

(22)

The per-token surrogate objective is:

$$\mathcal{J}^{\text{Band}}_{t}(\theta; y_i) = \min\left( r_{t,i} A_{t,i},\ \text{Band}_{f,\delta}\!\left( r_{t,i};\, y_{t,i},\, \pi_{\text{old}}(\cdot|s_{t,i}) \right) A_{t,i} \right) - \beta\, D_{\text{KL}}(\pi_{\text{ref}} \,\|\, \pi_\theta)_t

(23)
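Putting the pieces together, here is a minimal sketch of the per-token surrogate of Eq. (23), using the closed-form $\chi^2$ bounds for the Band operator and omitting the KL penalty; function and parameter names are illustrative, not from the paper's code:

```python
import math

def band_chi2(p_old, delta):
    # Closed-form Pearson chi^2 Band bounds (Eq. 21), clamped to [0, 1/p].
    s = math.sqrt(delta * (1.0 - p_old) / p_old)
    return max(0.0, 1.0 - s), min(1.0 / p_old, 1.0 + s)

def bandpo_token_objective(p_new, p_old, advantage, delta=0.05):
    """Per-token BandPO surrogate of Eq. (23) with chi^2 bounds; KL omitted."""
    r = p_new / p_old
    lo, hi = band_chi2(p_old, delta)
    banded = min(max(r, lo), hi)       # Band operator = clip to dynamic bounds
    return min(r * advantage, banded * advantage)
```

For a rare token with $\pi_{\text{old}} = 0.001$ and $\delta = 0.05$, the upper bound is roughly $1 + \sqrt{0.05 \cdot 0.999 / 0.001} \approx 8.07$, so tripling the token's probability ($r = 3$) still contributes an unclipped gradient, whereas fixed clipping with $\epsilon_{+} = 0.2$ would have saturated at $r = 1.2$.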

Empirical Validation / Results

Experimental Setup

  • Models: Qwen2.5-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B/7B, DeepSeek-R1-Distill-Llama-8B.
  • Datasets: Composite training set (DAPO + MATH Levels 3-5). Validation on AMC 2023, AIME 2024, AIME 2025.
  • Baselines: GRPO (canonical symmetric clipping), GRPO w/ Clip-Higher (DAPO asymmetric clipping).
  • Metrics: pass@32 (probability of at least one correct solution) and mean@32 (expected policy robustness across 32 samples).
  • BandPO Implementation: KL divergence as the trust-region constraint with $\delta = 0.05$. Asymmetric clipping thresholds: $\epsilon_{+} = 0.28$, $\epsilon_{-} = 0.2$.

Main Results

Table 1: Reasoning performance comparison across model scales (1.5B/3B/7B/8B)

| Method | AMC 2023 (mean@32 / pass@32) | AIME 2024 (mean@32 / pass@32) | AIME 2025 (mean@32 / pass@32) | Average (mean@32 / pass@32) |
| --- | --- | --- | --- | --- |
| **DeepSeek-R1-Distill-Qwen-1.5B (800 steps)** | | | | |
| GRPO | 72.11 / 94.31 | 18.13 / 39.00 | 21.88 / 38.89 | 37.37 / 57.40 |
| GRPO w/ Clip-Higher | 77.03 / 94.98 | 18.23 / 41.09 | 23.12 / 40.16 | 39.46 / 58.74 |
| GRPO w/ Relaxed Band (KL, δ=0.05) | 74.69 / 93.77 | 19.69 / 43.28 | 23.54 / 38.84 | 39.31 / 58.63 |
| GRPO w/ Band (KL, δ=0.05) | 77.34 / 94.98 | 20.00 / 51.80 | 23.85 / 40.65 | 40.40 / 62.48 |
| **Qwen2.5-3B-Instruct (800 steps)** | | | | |
| GRPO | 45.94 / 77.33 | 3.54 / 11.68 | 3.23 / 8.79 | 17.57 / 32.60 |
| GRPO w/ Clip-Higher | 52.66 / 82.91 | 4.69 / 14.95 | 4.06 / 23.93 | 20.47 / 40.60 |
| GRPO w/ Relaxed Band (KL, δ=0.05) | 52.97 / 87.05 | 4.58 / 15.11 | 4.06 / 21.00 | 20.54 / 41.05 |
| GRPO w/ Band (KL, δ=0.03) | 52.81 / 87.84 | 4.27 / 10.00 | 4.06 / 22.40 | 20.38 / 40.08 |
| GRPO w/ Band (KL, δ=0.10) | 51.41 / 84.77 | 3.54 / 14.31 | 6.04 / 20.85 | 20.33 / 39.98 |
| GRPO w/ Band (KL, δ=0.05) | 55.17 / 87.55 | 4.79 / 14.21 | 6.04 / 24.28 | 22.00 / 42.01 |
| **DeepSeek-R1-Distill-Qwen-7B (500 steps)** | | | | |
| GRPO | 87.11 / 95.00 | 27.29 / 49.71 | 32.71 / 55.62 | 49.04 / 66.78 |
| GRPO w/ Clip-Higher | 87.50 / 95.00 | 26.77 / 48.11 | 30.83 / 56.96 | 48.37 / 66.69 |
| GRPO w/ Relaxed Band (KL, δ=0.05) | 88.58 / 95.00 | 29.69 / 50.78 | 33.44 / 54.52 | 50.57 / 66.77 |
| GRPO w/ Band (KL, δ=0.03) | 88.28 / 95.00 | 29.17 / 51.58 | 32.71 / 61.46 | 50.05 / 69.35 |
| GRPO w/ Band (KL, δ=0.10) | 89.69 / 95.00 | 29 | | |