Summary (Overview)
- Main Contribution: Introduces BandPO (Band-constrained Policy Optimization), a novel method that replaces the canonical clipping mechanism in PPO with a unified theoretical operator called Band. This operator projects trust regions defined by $f$-divergences into dynamic, probability-aware clipping intervals.
- Key Finding: Identifies a critical bottleneck in canonical clipping: fixed bounds constrain the upward update margin of low-probability actions linearly with their probability, disproportionately suppressing high-advantage tail strategies and inducing entropy collapse.
- Methodological Innovation: Formulates the mapping from trust region to clipping bounds as a convex optimization problem, guaranteeing globally optimal numerical solutions and deriving closed-form solutions for specific divergences (Total Variation and Pearson $\chi^2$).
- Empirical Validation: Demonstrates consistent performance improvements over GRPO and Clip-Higher baselines across multiple models (Qwen2.5 3B/7B, Llama3 8B) on mathematical benchmarks (AMC, AIME), while robustly mitigating entropy collapse.
- Theoretical Insight: The Band operator naturally circumvents the exploration bottleneck by adaptively expanding the feasible upward margin for low-probability actions, preventing premature clipping and preserving exploration gradients.
Introduction and Theoretical Foundation
Reinforcement Learning from Human Feedback (RLHF) is the dominant paradigm for post-training Large Language Models (LLMs), where proximal constraints on policy updates balance optimization stability with exploration. The canonical clipping mechanism in PPO serves as an efficient surrogate for trust-region updates. However, this paper identifies a critical structural bottleneck: fixed clipping bounds enforce a linear dependence where the maximum feasible probability variation scales proportionally with the old probability. Consequently, for positive-advantage actions, lower probabilities dictate vanishingly small margins for upward variation, rendering them susceptible to premature clipping and nullifying their gradient contributions. This inhibits the model from reinforcing novel, superior strategies in the distribution tail.
While heuristic approaches like Clip-Higher (DAPO) relax upper bounds to delay entropy collapse, they often lead to instability and performance collapse. The fundamental issue is that adjusting fixed thresholds fails to address the inherent linear dependence. The paper proposes BandPO to bridge this gap by introducing a unified Band operator that projects $f$-divergence-induced trust regions into dynamic, probability-aware clipping intervals, governed by a single interpretable radius parameter $\delta$.
Methodology
Notation and Problem Formulation
The RL alignment of LLMs is formulated as a discrete Markov Decision Process. Let $\pi_\theta$ denote the policy. Given a prompt $x$, the policy generates a response sequence $y = (y_1, \dots, y_T)$. The state $s_t = (x, y_{<t})$ concatenates the prompt and the tokens generated so far. The optimization objective is to maximize the expected reward: $\max_\theta\ \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]$.
The Group Relative Policy Optimization (GRPO) objective aggregates per-token objectives:
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} l_{i,t}(\theta) \right] \quad (1)$$
The per-token objective with asymmetric clipping is:
$$l_{i,t}(\theta) = \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\ 1 - \epsilon_{\text{low}},\ 1 + \epsilon_{\text{high}}\big)\, \hat{A}_{i,t} \right) \quad (2)$$
where $r_{i,t}(\theta) = \pi_\theta(y_{i,t} \mid s_{i,t}) / \pi_{\theta_{\text{old}}}(y_{i,t} \mid s_{i,t})$ and $\hat{A}_{i,t}$ is the advantage.
The Bottleneck in Canonical Clipping
The canonical clipping constraint is:
$$1 - \epsilon \le r_t(\theta) \le 1 + \epsilon \quad (3)$$
Defining the probability variation $\Delta = \pi_\theta(a \mid s) - \pi_{\theta_{\text{old}}}(a \mid s)$, the feasible set under clipping is:
$$\Delta \in \big[-\epsilon\, \pi_{\theta_{\text{old}}}(a \mid s),\ \epsilon\, \pi_{\theta_{\text{old}}}(a \mid s)\big] \quad (4)$$
The maximal feasible set from the probability simplex is:
$$\Delta \in \big[-\pi_{\theta_{\text{old}}}(a \mid s),\ 1 - \pi_{\theta_{\text{old}}}(a \mid s)\big] \quad (5)$$
Mapping to ratio space gives theoretical bounds:
$$r \in \left[0,\ \frac{1}{\pi_{\theta_{\text{old}}}(a \mid s)}\right] \quad (6)$$
The bottleneck: Fixed clipping bounds constrain probability variations to scale linearly with $\pi_{\theta_{\text{old}}}(a \mid s)$. The feasible upward shift $\epsilon\, \pi_{\theta_{\text{old}}}(a \mid s)$ vanishes as the probability approaches zero, contradicting the theoretical upper bound $1 - \pi_{\theta_{\text{old}}}(a \mid s)$ from the simplex. This induces premature clipping for low-probability, positive-advantage actions.
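The shrinking margin can be illustrated numerically. A minimal sketch; `eps = 0.2` is an illustrative PPO clipping radius, not a value taken from the paper:

```python
# Sketch: how fixed clipping shrinks the feasible upward shift for rare actions.
# eps is the canonical PPO clipping radius; p_old the old action probability.

def clip_upward_margin(p_old: float, eps: float = 0.2) -> float:
    """Max feasible increase Delta under 1 - eps <= pi/p_old <= 1 + eps."""
    return eps * p_old  # scales linearly with p_old

def simplex_upward_margin(p_old: float) -> float:
    """Max increase permitted by the probability simplex alone."""
    return 1.0 - p_old

for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p_old={p}: clipped margin={clip_upward_margin(p):.5f}, "
          f"simplex margin={simplex_upward_margin(p):.3f}")
```

As `p_old` falls, the clipped margin collapses toward zero while the simplex margin grows toward one, which is exactly the gap between (4) and (5).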
BandPO: Band-constrained Policy Optimization
$f$-Divergence-Induced Trust Regions
Let $p = \pi_\theta(\cdot \mid s)$ and $q = \pi_{\theta_{\text{old}}}(\cdot \mid s)$. For a strictly convex function $f$ with $f(1) = 0$, the $f$-divergence is:
$$D_f(p \,\|\, q) = \sum_{a} q(a)\, f\!\left(\frac{p(a)}{q(a)}\right) \quad (7)$$
The trust region is defined as:
$$\mathcal{T}_\delta(q) = \big\{\, p \in \Delta(\mathcal{A}) : D_f(p \,\|\, q) \le \delta \,\big\} \quad (8)$$
where $\delta > 0$ is the trust region radius.
The Band Operator
For a token with old distribution $q = \pi_{\theta_{\text{old}}}(\cdot \mid s)$ and sampled action $a$, the ratio function is $r(p) = p(a)/q(a)$. The optimal dynamic bounds are obtained by solving convex optimization problems:
Upper bound:
$$U_\delta(q, a) = \max_{p \in \mathcal{T}_\delta(q)} \frac{p(a)}{q(a)} \quad (10)$$
Lower bound:
$$L_\delta(q, a) = \min_{p \in \mathcal{T}_\delta(q)} \frac{p(a)}{q(a)} \quad (11)$$
The Band operator is then defined as:
$$\operatorname{Band}_\delta(r) = \operatorname{clip}\big(r,\ L_\delta(q, a),\ U_\delta(q, a)\big) \quad (12)$$
Reduction to Univariate Optimization
Lemma 1 (Optimality of Uniform Complement Rescaling): The optimal solution to Problems (10) and (11) must preserve relative probability proportions within the complement set $\mathcal{A} \setminus \{a\}$:
$$p(a') = c\, q(a') \quad \text{for all } a' \ne a \quad (13)$$
where $c$ is uniquely determined by simplex normalization.
Let $u = p(a)/q(a)$ and $q_a = q(a)$. The simplex constraint gives $c = \frac{1 - u\, q_a}{1 - q_a}$. Substituting into the divergence yields a univariate function:
$$g(u) = q_a\, f(u) + (1 - q_a)\, f\!\left(\frac{1 - u\, q_a}{1 - q_a}\right) \quad (14)$$
Theorem 1 (Exact Scalarization of Trust-Region Constraints): For $\delta > 0$, Problems (10) and (11) are equivalent to finding roots of:
$$g(u) = \delta \quad (15)$$
subject to $u \in [0,\ 1/q_a]$. $g$ is strictly convex with respect to $u$ with global minimum $0$ at $u = 1$. For $\delta > 0$, the equation has exactly two roots:
$$U_\delta = u^+ \in (1,\ 1/q_a] \quad (16)$$
$$L_\delta = u^- \in [0,\ 1) \quad (17)$$
Properties of Band Bounds
Proposition 1 (Asymptotic Behavior of Band Bounds): Given $\delta > 0$:
$$\lim_{q_a \to 0^+} U_\delta(q_a) = +\infty, \qquad \lim_{q_a \to 1^-} U_\delta(q_a) = 1 \quad (18)$$
so the feasible upward margin expands without bound for low-probability actions, in contrast to the fixed ceiling $1 + \epsilon$ of canonical clipping.
Proposition 2 (Strict Monotonicity of Band Bounds): The clipping bounds are strictly monotonic functions of $q_a$. The upper bound $U_\delta$ is strictly decreasing with respect to $q_a$, while the lower bound $L_\delta$ is strictly increasing with respect to $q_a$.
Solving Band Bounds
Simplex Saturation: For large $\delta$, the divergence constraint may extend beyond simplex boundaries. The optimal Band upper bound is:
$$U_\delta(q_a) = \begin{cases} 1/q_a, & \text{if } g(1/q_a) \le \delta \\ u^+, & \text{otherwise} \end{cases} \quad (19)$$
where $u^+$ is the unique root of $g(u) = \delta$ in $(1,\ 1/q_a)$. Symmetrically for the lower bound, which saturates at $0$ when $g(0) \le \delta$.
Generic Numerical Solver: In the active regime, bounds correspond to unique roots of , which can be computed via bracketed root-finding algorithms (e.g., Bisection).
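A sketch of such a bracketed solver, again assuming the forward-KL generator $f(t) = t \log t$; the function names and iteration count are illustrative, not from the paper:

```python
import math

def f_kl(t: float) -> float:
    """Generator of forward KL: f(t) = t*log(t), with f(0) = 0."""
    return 0.0 if t == 0.0 else t * math.log(t)

def g(u: float, qa: float, f=f_kl) -> float:
    """Scalarized divergence g(u) = q_a f(u) + (1 - q_a) f((1 - u q_a)/(1 - q_a))."""
    return qa * f(u) + (1.0 - qa) * f((1.0 - u * qa) / (1.0 - qa))

def band_bounds(qa: float, delta: float, f=f_kl, iters: int = 80):
    """Band bounds (L, U) via bisection on g(u) = delta.

    The upper root is bracketed in (1, 1/q_a]; if the simplex boundary
    u = 1/q_a already satisfies the budget, saturation gives U = 1/q_a.
    The lower root is bracketed in [0, 1), saturating at 0.
    """
    u_max = 1.0 / qa
    if g(u_max, qa, f) <= delta:
        upper = u_max                       # simplex saturation
    else:
        lo, hi = 1.0, u_max                 # g is increasing on (1, 1/q_a]
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if g(mid, qa, f) < delta:
                lo = mid
            else:
                hi = mid
        upper = 0.5 * (lo + hi)
    if g(0.0, qa, f) <= delta:
        lower = 0.0                         # saturation at the simplex floor
    else:
        lo, hi = 0.0, 1.0                   # g is decreasing on [0, 1)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if g(mid, qa, f) < delta:
                hi = mid
            else:
                lo = mid
        lower = 0.5 * (lo + hi)
    return lower, upper
```

In line with Proposition 2, the solver yields a much wider upward margin for a rare action (small `qa`) than for a common one at the same radius.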
Closed-Form Solutions:
Proposition 4 (Closed-Form Band Bounds for TV and Pearson $\chi^2$):
- For Total Variation with $f(t) = \frac{1}{2}|t - 1|$:
$$U_\delta = \min\!\left(1 + \frac{\delta}{q_a},\ \frac{1}{q_a}\right), \qquad L_\delta = \max\!\left(1 - \frac{\delta}{q_a},\ 0\right) \quad (20)$$
- For Pearson $\chi^2$ with $f(t) = (t - 1)^2$:
$$U_\delta = \min\!\left(1 + \sqrt{\tfrac{\delta(1 - q_a)}{q_a}},\ \frac{1}{q_a}\right), \qquad L_\delta = \max\!\left(1 - \sqrt{\tfrac{\delta(1 - q_a)}{q_a}},\ 0\right) \quad (21)$$
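The closed forms are short enough to transcribe directly. A sketch with the simplex saturation caps applied; `band_tv` and `band_chi2` are hypothetical helper names:

```python
import math

def band_tv(qa: float, delta: float):
    """Closed-form Band bounds under Total Variation, f(t) = |t - 1| / 2."""
    upper = min(1.0 + delta / qa, 1.0 / qa)   # cap at the simplex boundary
    lower = max(1.0 - delta / qa, 0.0)        # floor at zero probability
    return lower, upper

def band_chi2(qa: float, delta: float):
    """Closed-form Band bounds under Pearson chi^2, f(t) = (t - 1)^2."""
    half_width = math.sqrt(delta * (1.0 - qa) / qa)
    upper = min(1.0 + half_width, 1.0 / qa)
    lower = max(1.0 - half_width, 0.0)
    return lower, upper
```

Both widths grow as `qa` shrinks (as $\delta/q_a$ for TV, as $\sqrt{\delta/q_a}$ for $\chi^2$), realizing the adaptive expansion that fixed clipping lacks.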
BandPO Objective
BandPO maximizes:
$$\mathcal{J}_{\text{BandPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} l_{i,t}^{\text{Band}}(\theta) \right] \quad (22)$$
The per-token surrogate objective is:
$$l_{i,t}^{\text{Band}}(\theta) = \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{Band}_\delta\!\big(r_{i,t}(\theta)\big)\, \hat{A}_{i,t} \right) \quad (23)$$
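A numpy sketch of the per-token surrogate, assuming the dynamic bounds have already been computed per token from $\pi_{\theta_{\text{old}}}$; the array shapes and the sign convention for a minimizing optimizer are assumptions, not the paper's implementation:

```python
import numpy as np

def bandpo_token_loss(logp_new, logp_old, adv, lower, upper):
    """Per-token BandPO surrogate: PPO-style pessimistic min, but with
    per-token dynamic bounds (lower, upper) from the Band operator in
    place of the fixed 1 -/+ eps.  All arguments broadcast over shape (T,)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    banded = np.clip(ratio, lower, upper)      # Band_delta(r_t)
    adv = np.asarray(adv)
    # Negate because optimizers minimize; the objective takes the min of
    # the unclipped and banded terms, as in Eq. (23).
    return -np.minimum(ratio * adv, banded * adv)
```

For a positive-advantage token whose ratio exceeds its upper bound, the banded term caps the objective, exactly as canonical clipping would at $1 + \epsilon$, but with a bound that widens for rare tokens.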
Empirical Validation / Results
Experimental Setup
- Models: Qwen2.5-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B/7B, DeepSeek-R1-Distill-Llama-8B.
- Datasets: Composite training set (DAPO + MATH Levels 3-5). Validation on AMC 2023, AIME 2024, AIME 2025.
- Baselines: GRPO (canonical symmetric clipping), GRPO w/ Clip-Higher (DAPO asymmetric clipping).
- Metrics: pass@32 (probability of at least one correct solution) and mean@32 (expected policy robustness across 32 samples).
- BandPO Implementation: KL divergence as the trust-region constraint, with radius $\delta \in \{0.03, 0.05, 0.10\}$ ablated and $\delta = 0.05$ as the default. Asymmetric clipping thresholds $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$ for the Clip-Higher comparison.
Main Results
Table 1: Reasoning performance comparison across model scales (1.5B/3B/7B/8B)
| Method | AMC2023 mean@32/pass@32 | AIME2024 mean@32/pass@32 | AIME2025 mean@32/pass@32 | Average mean@32/pass@32 |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B (800 steps) | ||||
| GRPO | 72.11 / 94.31 | 18.13 / 39.00 | 21.88 / 38.89 | 37.37 / 57.40 |
| GRPO w/ Clip-Higher | 77.03 / 94.98 | 18.23 / 41.09 | 23.12 / 40.16 | 39.46 / 58.74 |
| GRPO w/ Relaxed Band KL,0.05 | 74.69 / 93.77 | 19.69 / 43.28 | 23.54 / 38.84 | 39.31 / 58.63 |
| GRPO w/ Band KL,0.05 | 77.34 / 94.98 | 20.00 / 51.80 | 23.85 / 40.65 | 40.40 / 62.48 |
| Qwen2.5-3B-Instruct (800 steps) | ||||
| GRPO | 45.94 / 77.33 | 3.54 / 11.68 | 3.23 / 8.79 | 17.57 / 32.60 |
| GRPO w/ Clip-Higher | 52.66 / 82.91 | 4.69 / 14.95 | 4.06 / 23.93 | 20.47 / 40.60 |
| GRPO w/ Relaxed Band KL,0.05 | 52.97 / 87.05 | 4.58 / 15.11 | 4.06 / 21.00 | 20.54 / 41.05 |
| GRPO w/ Band KL,0.03 | 52.81 / 87.84 | 4.27 / 10.00 | 4.06 / 22.40 | 20.38 / 40.08 |
| GRPO w/ Band KL,0.10 | 51.41 / 84.77 | 3.54 / 14.31 | 6.04 / 20.85 | 20.33 / 39.98 |
| GRPO w/ Band KL,0.05 | 55.17 / 87.55 | 4.79 / 14.21 | 6.04 / 24.28 | 22.00 / 42.01 |
| DeepSeek-R1-Distill-Qwen-7B (500 steps) | ||||
| GRPO | 87.11 / 95.00 | 27.29 / 49.71 | 32.71 / 55.62 | 49.04 / 66.78 |
| GRPO w/ Clip-Higher | 87.50 / 95.00 | 26.77 / 48.11 | 30.83 / 56.96 | 48.37 / 66.69 |
| GRPO w/ Relaxed Band KL,0.05 | 88.58 / 95.00 | 29.69 / 50.78 | 33.44 / 54.52 | 50.57 / 66.77 |
| GRPO w/ Band KL,0.03 | 88.28 / 95.00 | 29.17 / 51.58 | 32.71 / 61.46 | 50.05 / 69.35 |
| GRPO w/ Band KL,0.10 | 89.69 / 95.00 | 29 |