Process Rewards with Learned Reliability: Summary

Summary (Overview)

Core Contribution: Introduces BETA PRM, a distributional Process Reward Model that predicts both a step-level success probability ($\mu$) and the reliability ($\kappa$) of that prediction, moving beyond single-point reward estimates.
Key Innovation: Trains the model using a Beta-Binomial likelihood that explains observed Monte Carlo success counts $(K, N)$, rather than regressing to the noisy empirical ratio $K/N$ as a point target.
Main Findings: BETA PRM improves PRM-guided Best-of-N selection accuracy across four backbones and four benchmarks (e.g., +3.37 avg. points on InternVL2.5-8B) while preserving standard step-level error detection capability.
Practical Application: Demonstrates Adaptive Computation Allocation (ACA), a reliability-aware inference method that uses BETA PRM's uncertainty signal to dynamically allocate computation, reducing token usage by up to 33.57% while improving accuracy over fixed-budget Best-of-N.
Empirical Validation: The learned concentration parameter $\kappa$ provides a non-trivial reliability signal, with training dynamics showing the model learns to assign adaptive confidence, forming a high-confidence upper tail.

Introduction and Theoretical Foundation

Process Reward Models (PRMs) provide step-level feedback for reasoning chains, crucial for tasks like test-time candidate selection and policy optimization. However, existing PRMs typically output only a single scalar reward score (e.g., step correctness probability). This creates a mismatch:

Inference Limitation: A single score cannot express predictive uncertainty. A causal PRM judges a step without seeing its future continuations, so it cannot know whether a correct-looking prefix will actually lead to a correct final answer.
Training Limitation: Step-level supervision is often derived from Monte Carlo estimates: sample $N$ continuations from a prefix, count the successful ones $K$, and use the empirical ratio $\hat{q} = K/N$ as a label. This ratio is a noisy, finite-sample estimate of the true latent prefix success probability $q$. Standard PRM training regresses to this single point $\hat{q}$, forcing the model to fit sampling noise.
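To make the sampling noise in the Monte Carlo label concrete, the following is a minimal sketch that simulates the labelling procedure described above. The values `q_true = 0.5` and `n_rollouts = 8` are illustrative assumptions, not settings taken from the paper, and the helper names are hypothetical.

```python
import math
import random

def mc_label(q: float, n: int, rng: random.Random) -> float:
    """Sample n continuations that each succeed with probability q; return K/N."""
    k = sum(rng.random() < q for _ in range(n))
    return k / n

def label_std(q: float, n: int) -> float:
    """Analytic standard deviation of K/N when K ~ Binomial(n, q)."""
    return math.sqrt(q * (1.0 - q) / n)

rng = random.Random(0)
q_true, n_rollouts = 0.5, 8  # illustrative values, not from the paper
labels = [mc_label(q_true, n_rollouts, rng) for _ in range(5)]
print(labels)                         # five noisy labels for the same prefix
print(label_std(q_true, n_rollouts))  # about 0.177
```

With these assumed values the label for a single prefix fluctuates by roughly 0.18 around the true $q$, which is the noise a point-regression target would ask the model to fit.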

The paper addresses these limitations by proposing a PRM that can express uncertainty about its own predictions. The theoretical foundation shifts from point estimation to distributional modeling: treat the latent prefix success probability $q_t$ as a random variable with a Beta belief, and treat the observed success count $K_t$ as a Binomial draw over $N$ continuations, each succeeding with probability $q_t$. This Beta-Binomial framework naturally pairs the count-based supervision with a distributional output.

Methodology

BETA PRM Formulation

The model assumes a generative process where the latent prefix success probability $q_t$ follows a Beta distribution, and the observed success count $K_t$ follows a Binomial distribution conditional on $q_t$:

$$q_t \sim \text{Beta}(\alpha_t, \beta_t), \quad K_t \mid q_t \sim \text{Binomial}(N, q_t)$$

For interpretability, the Beta distribution is reparameterized by its mean $\mu_t$ (the expected success probability, which plays the role of the standard PRM score) and its concentration $\kappa_t$ (which controls how tight, and therefore how reliable, the belief is):

$$\mu_t = \frac{\alpha_t}{\alpha_t + \beta_t}, \quad \kappa_t = \alpha_t + \beta_t$$

Thus, $\alpha_t = \mu_t \kappa_t$ and $\beta_t = (1 - \mu_t) \kappa_t$.
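The reparameterization is a simple invertible change of variables. A minimal sketch (the function names are hypothetical, chosen only for illustration):

```python
def to_alpha_beta(mu: float, kappa: float) -> tuple[float, float]:
    """Map (mean, concentration) to the standard Beta parameters."""
    return mu * kappa, (1.0 - mu) * kappa

def to_mu_kappa(alpha: float, beta: float) -> tuple[float, float]:
    """Map standard Beta parameters back to (mean, concentration)."""
    kappa = alpha + beta
    return alpha / kappa, kappa

alpha, beta = to_alpha_beta(0.8, 10.0)   # approximately (8.0, 2.0)
mu, kappa = to_mu_kappa(alpha, beta)     # recovers approximately (0.8, 10.0)
print(alpha, beta, mu, kappa)
```

Two predictions with the same $\mu_t$ but different $\kappa_t$ therefore describe the same expected reward with different belief tightness.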

Model Parameterization

At each process marker <prm> after step $t$:

Reward Mean ($\mu_t$): Derived from the language model's logits for reward tokens Yes ($z_t^{\text{Yes}}$) and No ($z_t^{\text{No}}$): $\mu_t = \frac{\exp(z_t^{\text{Yes}})}{\exp(z_t^{\text{Yes}}) + \exp(z_t^{\text{No}})}$
Concentration ($\kappa_t$): Predicted by a separate lightweight linear head $g_\phi$ applied to the hidden state $h_t$: $\kappa_t = \text{softplus}(g_\phi(h_t)) + \kappa_{\min}$, where $\kappa_{\min}$ is a small lower bound for stability.
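The two heads can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the hidden state is a plain list of floats, the linear head is given by hypothetical weights `w` and bias `b`, and the default `kappa_min = 1e-3` is an assumed value (the source only says it is a small lower bound).

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def reward_mean(z_yes: float, z_no: float) -> float:
    """Two-way softmax over the Yes/No logits, i.e. sigmoid(z_yes - z_no)."""
    d = z_yes - z_no
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def concentration(h, w, b: float, kappa_min: float = 1e-3) -> float:
    """softplus(linear head on hidden state) + kappa_min; kappa_min is assumed."""
    pre_activation = sum(wi * hi for wi, hi in zip(w, h)) + b
    return softplus(pre_activation) + kappa_min

mu_t = reward_mean(z_yes=2.0, z_no=0.0)                            # about 0.88
kappa_t = concentration(h=[0.5, -1.0, 2.0], w=[0.3, 0.1, 0.4], b=0.0)
print(mu_t, kappa_t)
```

The softplus keeps $\kappa_t$ strictly positive, so the implied $\alpha_t$ and $\beta_t$ remain valid Beta parameters whenever $0 < \mu_t < 1$.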

Training Objective

The core training loss is the negative log-likelihood of the observed count $K_t$ under the predicted Beta-Binomial distribution:

$$\mathcal{L}_{\text{Beta-Binomial}} = -\frac{1}{|\mathcal{P}|} \sum_{t \in \mathcal{P}} \log p(K_t \mid N, \alpha_t, \beta_t)$$

where

$$p(K_t \mid N, \alpha_t, \beta_t) = \binom{N}{K_t} \frac{B(K_t + \alpha_t, N - K_t + \beta_t)}{B(\alpha_t, \beta_t)}$$

and $B(\cdot, \cdot)$ is the Beta function. $\mathcal{P}$ is the set of supervised markers.
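The likelihood above can be evaluated stably in log space with the log-gamma function. The sketch below uses only the Python standard library and takes $(\mu_t, \kappa_t)$ as inputs; the function names and the example numbers are illustrative assumptions, not values from the paper.

```python
import math

def log_beta(a: float, b: float) -> float:
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_binom(n: int, k: int) -> float:
    """log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def beta_binomial_logpmf(k: int, n: int, mu: float, kappa: float) -> float:
    """log p(K = k | N = n, alpha = mu * kappa, beta = (1 - mu) * kappa)."""
    alpha, beta = mu * kappa, (1.0 - mu) * kappa
    return log_binom(n, k) + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)

def beta_binomial_nll(counts, n: int, mus, kappas) -> float:
    """Mean negative log-likelihood over the supervised markers."""
    terms = [beta_binomial_logpmf(k, n, m, c) for k, m, c in zip(counts, mus, kappas)]
    return -sum(terms) / len(terms)

# Same predicted mean (0.5), evidence that agrees (K=4 of 8) or disagrees (K=8 of 8).
for k in (4, 8):
    low = beta_binomial_logpmf(k, 8, 0.5, 2.0)
    high = beta_binomial_logpmf(k, 8, 0.5, 100.0)
    print(k, round(low, 3), round(high, 3))
```

In this toy case a high concentration is rewarded when the count agrees with the predicted mean and penalized when it does not, which is the mechanism that lets $\kappa_t$ act as a reliability signal.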

An auxiliary regularization loss is added to calibrate the concentration parameter, discouraging high confidence ($\kappa_t$) when the predicted mean ($\mu_t$) disagrees with the evidence ($K_t/N$):

$$\mathcal{L}_{\text{reg}} = \lambda_{\text{reg}} \frac{1}{|\mathcal{P}|} \sum_{t \in \mathcal{P}} \left[ \text{sg}(\mu_t) - \frac{K_t}{N} \right] \kappa_t$$

Here, $\text{sg}(\cdot)$ denotes stop-gradient, preventing this term from pulling $\mu_t$ toward the noisy ratio. The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{Beta-Binomial}} + \mathcal{L}_{\text{reg}}$$

Adaptive Computation Allocation (ACA)

ACA is a reliability-aware inference method for PRM-guided Best-of-N reasoning. It uses BETA PRM's outputs ($\mu_t$, $\kappa_t$) to allocate computation dynamically:

Risk-Adjusted Candidate Score: For a candidate solution $y = s_{1:T}$, compute a score that penalizes uncertainty. First, derive the step-level standard deviation $\sigma_t$ from the Beta parameters: $\sigma_t = \sqrt{\frac{\mu_t(1 - \mu_t)}{\kappa_t + 1}}$. The candidate score then averages the risk-adjusted step scores: $S(y) = \frac{1}{T} \sum_{t=1}^{T} (\mu_t - \lambda \sigma_t)$.
Progressive Batches & Early Stopping: Instead of generating all $N$ candidates at once, ACA starts with a small pool of $n_0$ candidates. It computes Lower and Upper Confidence Bounds (LCB, UCB) for each candidate $y$: $\text{LCB}(y) = S(y) - c_{\text{stop}} U(y), \quad \text{UCB}(y) = S(y) + c_{\text{stop}} U(y), \quad U(y) = \frac{1}{T}\sum_{t=1}^T \sigma_t$. If the top candidate's LCB exceeds every other candidate's UCB, i.e. $\text{LCB}(y^\star) > \max_{y \neq y^\star} \text{UCB}(y)$, ACA stops early and returns $y^\star$.
Uncertainty-Guided Prefix Repair: If the stopping condition is not met, ACA allocates the next batch to the most promising non-winner candidate (highest UCB). To repair it, ACA selects a cutpoint step (e.g., the earliest step whose conservatively scored value $\mu_t - c_{\text{cut}}\sigma_t$ falls below a threshold $p_{\text{bad}}$), keeps the prefix before it, and samples new continuations from that point.
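The three ACA components above can be sketched as small pure functions. This is a hedged illustration, not the paper's implementation: each candidate is represented as a list of `(mu_t, kappa_t)` pairs, the "top candidate" is assumed to be the one with the highest $S(y)$, and the hyperparameter values passed in the usage lines (`lam`, `c_stop`, `c_cut`, `p_bad`) are made up for the example. Sampling new continuations is outside the scope of the sketch.

```python
import math
from typing import List, Optional, Sequence, Tuple

Step = Tuple[float, float]  # (mu_t, kappa_t) for one reasoning step

def step_sigma(mu: float, kappa: float) -> float:
    """Standard deviation of Beta(mu * kappa, (1 - mu) * kappa)."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))

def candidate_stats(steps: Sequence[Step], lam: float) -> Tuple[float, float]:
    """Return (S(y), U(y)): risk-adjusted score and mean step uncertainty."""
    sigmas = [step_sigma(m, k) for m, k in steps]
    score = sum(m - lam * s for (m, _), s in zip(steps, sigmas)) / len(steps)
    return score, sum(sigmas) / len(steps)

def early_stop_winner(cands: List[Sequence[Step]], lam: float, c_stop: float) -> Optional[int]:
    """Index of the winner if its LCB beats every other UCB, else None."""
    stats = [candidate_stats(c, lam) for c in cands]
    best = max(range(len(cands)), key=lambda i: stats[i][0])  # assumed: top by S(y)
    lcb_best = stats[best][0] - c_stop * stats[best][1]
    other_ucbs = [s + c_stop * u for i, (s, u) in enumerate(stats) if i != best]
    return best if lcb_best > max(other_ucbs, default=float("-inf")) else None

def cutpoint(steps: Sequence[Step], c_cut: float, p_bad: float) -> Optional[int]:
    """Earliest step whose conservative value mu - c_cut * sigma is below p_bad."""
    for t, (m, k) in enumerate(steps):
        if m - c_cut * step_sigma(m, k) < p_bad:
            return t
    return None

confident = [(0.95, 200.0), (0.90, 200.0)]
weak = [(0.30, 200.0), (0.40, 200.0)]
print(early_stop_winner([confident, weak], lam=1.0, c_stop=1.0))           # 0: clear winner
print(early_stop_winner([[(0.60, 1.0)], [(0.55, 1.0)]], lam=1.0, c_stop=1.0))  # None
print(cutpoint([(0.9, 50.0), (0.2, 50.0), (0.8, 50.0)], c_cut=1.0, p_bad=0.5))  # 1
```

In the second call both candidates have low concentration, so their confidence intervals overlap and the stopping rule declines to commit; that is the situation in which ACA would spend further budget on prefix repair.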

Empirical Validation / Results

BETA PRM Improves Best-of-N Selection

Trained on VisualPRM400K-v1.1, BETA PRM was evaluated as a solution selector for Best-of-16 across four backbones and four multimodal reasoning benchmarks. Candidates were ranked using a risk-budget selector $S_{\text{RB}}(y)$ that discounts rewards for steps with high uncertainty ($\sigma_t > \tau$).

Table 1: PRM-guided Best-of-16 Final-Answer Accuracy

Selector	MathVision	OlympiadBench	MathVerse	MathVista	Avg. $\Delta$ vs. Std. PRM
InternVL3-14B
+Standard PRM	23.03	16.67	45.41	60.70	–
+BETA PRM	25.66	16.67	46.35	62.30	+1.29
InternVL3-8B
+Standard PRM	22.69	15.33	44.80	60.00	–
+BETA PRM	24.34	18.00	45.20	61.10	+1.46
InternVL2.5-8B
+Standard PRM	21.38	11.33	42.81	57.60	–
+BETA PRM	25.66	15.33	44.31	61.30	+3.37
Qwen2.5-VL-7B
+Standard PRM	21.38	14.00	44.92	60.30	–
+BETA PRM	24.34	17.33	45.99	63.60	+2.66

BETA PRM matches or exceeds the standard PRM in every backbone–benchmark combination (the only tie is OlympiadBench with InternVL3-14B, at 16.67), with consistent average gains over the standard PRM baseline.
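As a quick arithmetic check, the average gains in the last column of Table 1 can be recomputed from the per-benchmark accuracies. The sketch below copies the numbers from the table; because the table entries are already rounded, the recomputed averages may differ from the reported ones in the last digit.

```python
# (Standard PRM, BETA PRM) accuracies from Table 1, in the order
# MathVision, OlympiadBench, MathVerse, MathVista.
TABLE_1 = {
    "InternVL3-14B": ([23.03, 16.67, 45.41, 60.70], [25.66, 16.67, 46.35, 62.30]),
    "InternVL3-8B": ([22.69, 15.33, 44.80, 60.00], [24.34, 18.00, 45.20, 61.10]),
    "InternVL2.5-8B": ([21.38, 11.33, 42.81, 57.60], [25.66, 15.33, 44.31, 61.30]),
    "Qwen2.5-VL-7B": ([21.38, 14.00, 44.92, 60.30], [24.34, 17.33, 45.99, 63.60]),
}

def average_gain(standard, beta_prm) -> float:
    """Mean per-benchmark accuracy difference (BETA PRM minus Standard PRM)."""
    gains = [b - s for s, b in zip(standard, beta_prm)]
    return sum(gains) / len(gains)

for backbone, (standard, beta_prm) in TABLE_1.items():
    print(f"{backbone}: {average_gain(standard, beta_prm):+.3f}")
```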

BETA PRM Preserves Error Detection Ability

Evaluated on VisualProcessBench for step-level error detection (binary classification of correct/erroneous steps), BETA PRM remains competitive with standard PRMs, showing the Beta-Binomial training does not degrade this core PRM capability.

Table 2: Step-level Error Detection on VisualProcessBench (Overall Micro-F1)

Model	Overall F1
InternVL3-14B
Standard PRM	61.90
BETA PRM	61.90
InternVL3-8B
Standard PRM	60.69
BETA PRM	61.85
InternVL2.5-8B
Standard PRM	61.54
BETA PRM	60.97
Qwen2.5-VL-7B
Standard PRM	62.23
BETA PRM	62.91

Ablations Validate Design Choices

Auxiliary Regularizer ($\mathcal{L}_{\text{reg}}$): Removing it consistently reduces Best-of-N accuracy (Table 3 in the paper), confirming its role in calibrating the learned concentration $\kappa_t$.
Training Dynamics: The mean and 90th percentile of $\kappa_t$ drop early in training and later recover (Figure 4 in the paper), showing the model first becomes conservative and then learns to assign higher confidence to well-supported predictions. The recovery of a high-confidence upper tail indicates the model learns adaptive, non-uniform reliability.

ACA Improves Accuracy-Token Tradeoff

Using BETA PRM's reliability signal, ACA was evaluated against a fixed-budget Best-of-16 baseline under the same maximum budget ( $N=16$ ).

Table 4: ACA vs. Vanilla Best-of-16 (BoN)

Method	MathVision (Acc. ↑ / Tokens ↓)	OlympiadBench (Acc. ↑ / Tokens ↓)	MathVerse (Acc. ↑ / Tokens ↓)	MathVista (Acc. ↑ / Tokens ↓)
InternVL2.5-8B
Vanilla BoN	25.00 / 1383k	15.33 / 1151k	44.47 / 17932k	60.90 / 2790k
ACA	26.32 / 965k (↓30.24%)	16.67 / 958k (↓16.76%)	45.58 / 11912k (↓33.57%)	62.20 / 1949k (↓30.14%)
Qwen2.5-VL-7B
Vanilla BoN	24.67 / 1383k	16.67 / 1151k	45.74 / 17932k	63.30 / 2790k
ACA	26.65 / 988k (↓28.57%)	18.00 / 928k (↓19.39%)	46.40 / 12015k (↓33.00%)	64.00 / 2030k (↓27.22%)

ACA improves final-answer accuracy while simultaneously reducing token usage by 16.76% to 33.57%.
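The reduction percentages in Table 4 follow from the token counts as (baseline − ACA) / baseline. A short check for the InternVL2.5-8B rows is sketched below; the table reports token counts rounded to thousands, so the recomputed percentages can differ from the reported ones by a few hundredths of a point.

```python
def token_reduction_pct(baseline_tokens: float, aca_tokens: float) -> float:
    """Relative token saving of ACA versus the fixed-budget baseline, in percent."""
    return 100.0 * (baseline_tokens - aca_tokens) / baseline_tokens

# InternVL2.5-8B rows of Table 4, token counts in thousands: (Vanilla BoN, ACA).
TOKENS_K = {
    "MathVision": (1383, 965),
    "OlympiadBench": (1151, 958),
    "MathVerse": (17932, 11912),
    "MathVista": (2790, 1949),
}

for benchmark, (baseline, aca) in TOKENS_K.items():
    print(f"{benchmark}: {token_reduction_pct(baseline, aca):.2f}% fewer tokens")
```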

A key ablation (Table 5 in the paper) shows that BETA PRM's learned uncertainty is crucial for ACA's success. Replacing it with a standard PRM using either a reward-only score or a proxy uncertainty ($\sigma_t = \sqrt{\mu_t(1-\mu_t)}$) results in worse accuracy-token tradeoffs, demonstrating the value of the explicitly modeled reliability signal $\kappa_t$.
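One way to see why the proxy carries less information: it is a function of $\mu_t$ alone, so it cannot separate two steps that share a mean but differ in reliability. The sketch below compares the proxy with the Beta standard deviation used by ACA; the example values of $\mu_t$ and $\kappa_t$ are made up for illustration.

```python
import math

def proxy_sigma(mu: float) -> float:
    """Proxy uncertainty available to a standard PRM: depends on mu only."""
    return math.sqrt(mu * (1.0 - mu))

def learned_sigma(mu: float, kappa: float) -> float:
    """Beta standard deviation using the learned concentration kappa."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))

mu = 0.7  # same predicted reward for both hypothetical steps
for kappa in (2.0, 50.0):
    print(kappa, round(proxy_sigma(mu), 4), round(learned_sigma(mu, kappa), 4))
```

The proxy column is identical for both rows, while the learned column shrinks as $\kappa_t$ grows, so only the latter can tell a well-supported 0.7 from a poorly supported one.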

Theoretical and Practical Implications

Theoretical Implications:

Shift in PRM Formulation: Proposes a principled, distributional alternative to point-estimate PRMs, aligning the training objective (Beta-Binomial likelihood) with the nature of Monte Carlo supervision (count data).
Uncertainty as a First-Class Citizen: Introduces reliability ($\kappa$) as a core, learnable output of a reward model, enabling downstream algorithms to reason about trustworthiness.

Practical Implications:

Improved Test-Time Scaling: BETA PRM provides a better signal for PRM-guided candidate selection, leading to higher accuracy in Best-of-N settings.
Efficient Inference: The ACA method demonstrates that reliability signals can drive adaptive computation allocation, cutting token usage by 16.76% to 33.57% in the reported experiments while improving accuracy, moving beyond fixed-budget approaches.
Broader Applicability: The reliability-aware interface (reward + confidence) is broadly useful for any PRM-guided decision-making, including search, policy optimization, and interactive systems where knowing when to trust a score is critical.

Conclusion

The paper identifies a key limitation in current Process Reward Models: their single-point reward outputs provide no indication of predictive reliability. To address this, it introduces BETA PRM, a distributional PRM that models prefix success probability with a Beta belief and is trained via a Beta-Binomial likelihood on Monte Carlo count observations. This gives the model two outputs: a predicted reward mean $\mu$ and a learned reliability concentration $\kappa$.

Empirically, BETA PRM improves the accuracy of PRM-guided Best-of-N selection across multiple backbones and benchmarks without sacrificing step-level error detection. Furthermore, leveraging this reliability signal, the proposed Adaptive Computation Allocation (ACA) method dynamically allocates inference-time compute, achieving a superior accuracy-token tradeoff: it reduces token usage by up to 33.57% while improving final-answer accuracy. Overall, BETA PRM advances PRMs from simple scorers to reliability-aware models, enabling more robust and efficient reasoning systems.