Visual Summary | APPO: Agentic Procedural Policy Optimization

Summary (Overview)

APPO shifts credit assignment in agentic RL from coarse-grained units (tool-call boundaries or fixed workflows) to fine-grained decision points (“procedures”) in the generated sequence.
It introduces a Branching Score (BS) that combines token entropy with the policy-induced likelihood gain of subsequent continuations, enabling selection of high‐impact branching locations while filtering out spurious high‐entropy tokens.
A procedure‐level advantage scaling term (based on future‐value $\Omega$ ) further improves credit distribution across branched rollouts.
Experiments on 13 challenging benchmarks (covering mathematical reasoning, knowledge‐intensive reasoning, and deep search) show APPO consistently outperforms strong baselines by nearly 4 points on average, while maintaining efficient tool‐calls and interpretability.
Theoretical analysis proves that APPO reduces gradient variance and admits a valid policy improvement bound under BS‐guided branching.

Introduction and Theoretical Foundation

Large language models (LLMs) have evolved into autonomous agents capable of multi‑turn interaction with external environments. This progress is largely driven by Reinforcement Learning with Verifiable Rewards (RLVR), but the standard paradigm provides only coarse outcome‑level feedback, making it hard to attribute success or failure to specific intermediate decisions. Existing work expands trajectories from intermediate locations (e.g., at tool‑call boundaries or workflow stages) to densify supervision, yet these coarse units still overlook the procedural knowledge embedded in the thinking content.

Pilot study findings (Figure 1 in the paper):

High‑entropy positions are broadly distributed throughout the thinking span, not concentrated at tool‑call boundaries.
Raw token entropy alone does not reliably indicate decision significance – some high‑entropy tokens reflect lexical rarity (e.g., “march”, “november”) rather than task‑relevant choice.

Motivation for APPO:
APPO redefines the unit of credit assignment from coarse heuristic units to procedures – procedural reasoning patterns organized around high‑impact decision points. It selects branching locations using a comprehensive Branching Score (BS) that goes beyond entropy. It further introduces a procedure‑level advantage scaling term to encourage exploration over procedures with high branching value.

Theoretical foundation (proofs in Appendix A):

Theorem 3.1 (Variance Reduction): BS‑guided branching allocates more samples to high‑variance decision points, reducing gradient variance compared to random branching under the same budget.
Theorem 3.2 (Policy Improvement Bound): APPO’s advantage design (Eq. 7) admits a valid lower bound on policy improvement under BS‑guided branching.

Methodology

Let $\pi_\theta$ be the policy model, $\pi_{\text{ref}}$ the reference model, and $\pi_{\text{old}}$ the behavior policy used to generate initial rollouts.

Token Entropy

A common branching strategy selects tokens with the top‑ $k$ entropy:

H_t = -\sum_{j=1}^{|V|} p_{t,j} \log p_{t,j}, \quad p_t = \pi_\theta(\cdot|G_{<t}, z; \mathcal{T}) = \text{Softmax}(z_t / \tau) \tag{3}

where $|V|$ is vocabulary size, $\tau$ temperature, and $z_t$ logits. Entropy reflects local uncertainty but does not indicate whether a token changes downstream reasoning.

Procedural Rollout Branching

Given an initial rollout, APPO:

For each token, computes the future value $\Omega_{n,i}$ using an accumulated decayed importance‑sampling ratio:
$\Omega_{n,i} = \exp\!\left( \sum_{i' \ge i} \gamma^{i'-i} \log \rho_{i'}(\theta) \right), \quad \rho_{i'}(\theta) = \frac{\pi_\theta(H_{n,i'} | H_{n,<i'}, x; \mathcal{T})}{\pi_{\text{old}}(H_{n,i'} | H_{n,<i'}, x; \mathcal{T})} \tag{4}$
$\Omega_{n,i}$ indicates how much the current policy favors the continuations starting at token $i$ .
Combines $\Omega_{n,i}$ with entropy to define the Branching Score:
$\text{BS}_{n,i} = Z(\text{clip}(\Omega_{n,i}, 1-\epsilon', 1+\epsilon'); \mathcal{H}_n) \cdot Z(H_{n,i}; \mathcal{H}_n) \tag{5}$
where $Z(\cdot; \mathcal{H}_n)$ denotes z‑score normalization within rollout $\mathcal{H}_n$ . The product selects tokens that are both uncertain and consequential.
For each rollout, selects the top‑ $B$ tokens according to $\text{BS}_{n,i}$ and resamples continuations from these positions to generate new branches.

Procedural Advantage Estimation and Policy Optimization

APPO computes group‑relative advantages separately for initial rollouts $\mathcal{T}^{\text{init}}$ and branches $\mathcal{T}^{\text{branch}}$ :

\hat{A}^{\text{base}}_{n,i} = \text{avg}\!\left( \frac{R_n - \text{mean}(\{R_{n'} | H_{n'} \in \mathcal{T}^*\})}{\text{std}(\{R_{n'} | H_{n'} \in \mathcal{T}^*\})} \;\middle|\; \mathcal{T}^* \in \{\mathcal{T}^{\text{init}}, \mathcal{T}^{\text{branch}}\} \right) \tag{6}

An extra future‑aware advantage term further emphasizes decisions with strong downstream influence:

\hat{A}^{\text{fut}}_{n,i} = \text{clip}_{\epsilon'}\!\left( \exp\!\left( \sum_{i' \ge i} \gamma^{i'-i} \log \rho_{i'}(\theta) \right) \right), \quad \rho_{i'}(\theta) = \frac{\pi_\theta(H_{n,i'} | H_{n,<i'}, x; \mathcal{T})}{\pi_{\text{old}}(H_{n,i'} | H_{n,<i'}, x; \mathcal{T})} \tag{7}

The final advantage is:

\hat{A}_{n,i} = \hat{A}^{\text{base}}_{n,i} \left( 1 + b \cdot \hat{A}^{\text{fut}}_{n,i} \right)

where $b$ controls the contribution of the future‑aware term.

The policy optimization objective is:

\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{X}, \mathcal{H} \sim \pi_{\text{old}}(\cdot|x)} \left[ \frac{1}{M} \sum_{n=1}^{N} \frac{1}{|\mathcal{H}_n|} \sum_{i=1}^{|\mathcal{H}_n|} \min\left( \rho_{n,i}(\theta) \hat{A}_{n,i},\; \text{clip}_\epsilon(\rho_{n,i}(\theta)) \hat{A}_{n,i} \right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right] \tag{8}

Branches sampled from $\pi_\theta$ are not directly optimized; they provide auxiliary procedural signals for advantage estimation.

Empirical Validation / Results

Setup

Datasets: 13 benchmarks:
- Mathematical Reasoning: AIME24, AIME25, MATH500, GSM8K, MATH.
- Knowledge‑Intensive Reasoning: WebWalker, HotpotQA, 2WikiMultihopQA, Musique, Bamboogle.
- Deep Search: GAIA, WebWalkerQA, Humanity’s Last Exam (HLE), Xbench.
Baselines: Classic RL (GRPO, Reinforce++, DAPO, GPPO, CISPO) and Agentic RL (GIGPO, ARPO).
Backbones: Llama3.1‑8B‑Instruct, Qwen2.5‑7B‑Instruct; also Qwen3‑8B and Qwen3‑14B for Deep Search.
Metrics: F1‑score for QA tasks; LLM‑as‑a‑Judge with Qwen2.5‑72B for correctness; Pass@1 reported.

Main Results

Table 1 – Mathematical Reasoning & Knowledge‑Intensive Reasoning (backbone: Llama3.1‑8B / Qwen2.5‑7B)

Method	AIME24	AIME25	MATH500	GSM8K	MATH	WebWalker	HotpotQA	2Wiki	Musique	Bamboogle	Avg
Llama3.1‑8B
ARPO	23.3	16.7	64.6	88.0	80.2	30.5	65.4	75.5	34.8	72.8	55.3
APPO (Ours)	30.0	16.7	69.4	88.6	81.4	32.0	65.4	79.8	35.2	76.8	57.4
Qwen2.5‑7B
ARPO	30.0	30.0	78.8	92.2	88.8	26.0	58.8	76.1	31.1	71.5	58.3
APPO (Ours)	36.7	30.0	81.0	93.2	90.0	33.5	66.5	81.5	31.5	77.6	62.2

APPO consistently improves over the previous best agentic RL method (ARPO) by 2.1 % to 8.9 % relative gain, and over classic RL baselines by larger margins.

Table 2 – Deep Search (backbone: Qwen3‑8B / Qwen3‑14B)

Method	GAIA (Avg)	WebWalkerQA (Avg)	HLE (Avg)	Xbench (Avg)	Avg
Qwen3‑8B
ARPO	38.8	32.0	8.8	25.0	25.0
APPO (Ours)	42.7	33.8	13.4	28.0	28.0
Qwen3‑14B
ARPO	43.7	40.5	10.0	32.0	32.0
APPO (Ours)	46.6	43.4	12.3	34.0	34.0

APPO establishes new state‑of‑the‑art at both 8B and 14B scales, notably improving on GAIA and HLE.

Scaling Analysis

Pass@K (Figure 3): APPO’s advantage over ARPO increases with $k$ , indicating that APPO improves both top‑1 correctness and the diversity of valid reasoning paths.
Branching Configuration (Table 3, $L=1$ ): Balanced allocation between initial trees $N$ and branches per tree $B$ (e.g., $N=4, B=3$ ) outperforms both extremes ( $N=8, B=1$ or $N=2, B=7$ ). This shows the importance of diverse root trajectories and deep exploration around decision points.
Ablation on Components (Table 4):
- Replacing BS with pure entropy drops average score by 1.7 (Llama) / 0.9 (Qwen).
- Removing future‑aware advantage $\hat{A}^{\text{fut}}$ causes the largest drop (Qwen: 58.1 → 54.7).
- Removing dual‑group advantage estimation also degrades performance.

Qualitative Analysis

Training Dynamics (Figure 4): APPO reaches higher final reward and follows a more stable trajectory than ARPO on both backbones.
Diversity Analysis (Figure 5): UMAP + DBSCAN clustering shows APPO produces more compact, better‑separated clusters than ARPO, indicating that branches around similar decision points are semantically coherent while those from different points are structurally distinct.
Interpretation of BS (Figure 6): High‑entropy tokens include many rare nouns (lexical rarity) that do not affect reasoning. BS filters these out by emphasizing tokens with large downstream distributional shifts between current and old policies, thereby selecting tokens that actually redirect reasoning trajectories.

Theoretical and Practical Implications

Theoretical:
- Theorem 3.1 shows that BS‑guided branching reduces gradient variance compared to random branching under the same budget.
- Theorem 3.2 proves that APPO’s advantage design yields a valid policy improvement bound, ensuring stable and safe policy updates.
Practical:
- APPO provides a principled method to identify and explore high‑impact decision points in agentic RL, leading to better performance on diverse long‑horizon tasks.
- The Branching Score is computationally inexpensive (computed from token probabilities and importance‑sampling ratios) and does not require additional Monte‑Carlo estimation.
- The method maintains efficient tool‑calls (no extra environment interactions beyond the budget) and preserves interpretability (branches are generated around explicit reasoning steps).
- Ablations confirm that all components (BS, dual‑group advantage, future‑aware advantage) contribute non‑redundantly to the gains.
Broader impact: The study demonstrates that modeling procedural decisions as finer‑grained credit‑assignment units is a promising direction for improving exploration and credit assignment in agentic RL, potentially applicable to other domains where intermediate reasoning matters.

Conclusion

APPO (Agentic Procedural Policy Optimization) shifts branching and credit assignment from coarse tool‑call or workflow boundaries to fine‑grained decision points (procedures) in the generated sequence. It uses a Branching Score that combines token entropy with policy‑induced likelihood gains to select high‑value branching locations, and introduces a future‑aware advantage scaling term for more targeted credit distribution. Experiments on 13 benchmarks covering mathematical reasoning, knowledge‑intensive reasoning, and deep search show consistent improvements over strong baselines (e.g., ARPO, GRPO) while keeping tool‑calls efficient and behavior interpretable.

The findings are generic and enlightening: modeling procedural decisions offers a practical direction for improving exploration and credit assignment in agentic RL. Future work could extend APPO to multi‑agent settings or integrate procedural understanding into offline RL and preference optimization.