Process Rewards with Learned Reliability: Summary
Summary (Overview)
- Core Contribution: Introduces BETA PRM, a distributional Process Reward Model that predicts both a step-level success probability ($\mu$) and the reliability ($\kappa$) of that prediction, moving beyond single-point reward estimates.
- Key Innovation: Trains the model using a Beta-Binomial likelihood that explains the observed Monte Carlo success counts $k$ out of $n$ sampled continuations, rather than regressing to the noisy empirical ratio $k/n$ as a point target.
- Main Findings: BETA PRM improves PRM-guided Best-of-N selection accuracy across four backbones and four benchmarks (e.g., +3.37 avg. points on InternVL2.5-8B) while preserving standard step-level error detection capability.
- Practical Application: Demonstrates Adaptive Computation Allocation (ACA), a reliability-aware inference method that uses BETA PRM's uncertainty signal to dynamically allocate computation, reducing token usage by up to 33.57% while improving accuracy over fixed-budget Best-of-N.
- Empirical Validation: The learned concentration parameter provides a non-trivial reliability signal, with training dynamics showing the model learns to assign adaptive confidence, forming a high-confidence upper tail.
Introduction and Theoretical Foundation
Process Reward Models (PRMs) provide step-level feedback for reasoning chains, crucial for tasks like test-time candidate selection and policy optimization. However, existing PRMs typically output only a single scalar reward score (e.g., step correctness probability). This creates a mismatch:
- Inference Limitation: A single score cannot express predictive uncertainty. A causal PRM judges a step without seeing its future continuations, so it cannot know whether a correct-looking prefix will lead to a correct final answer.
- Training Limitation: Step-level supervision is often derived from Monte Carlo estimates: sample $n$ continuations from a prefix, count the successful ones $k$, and use the empirical ratio $k/n$ as a label. This ratio is a noisy, finite-sample estimate of the true latent prefix success probability $p$. Standard PRM training regresses to this single point $k/n$, forcing the model to fit sampling noise.
The paper addresses these limitations by proposing a PRM that can express uncertainty about its own predictions. The theoretical foundation shifts from point estimation to distributional modeling: treat the latent prefix success probability $p$ as a random variable with a Beta belief, and treat the observed success count $k$ as a Binomial sample from $p$. This Beta-Binomial framework naturally pairs the count-based supervision with a distributional output.
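To make the sampling-noise argument concrete, the sketch below computes how far an empirical ratio $k/n$ can stray from the true success probability. It is only an illustration: the values `n = 16` and `p = 0.5` and the function names are assumptions chosen for the example, not settings reported in the paper.

```python
import math


def ratio_std(p: float, n: int) -> float:
    """Standard deviation of the empirical ratio k/n when k ~ Binomial(n, p)."""
    return math.sqrt(p * (1.0 - p) / n)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


n = 16   # hypothetical number of sampled continuations per prefix
p = 0.5  # hypothetical true prefix success probability

# Spread of the label k/n around the true value p.
label_std = ratio_std(p, n)

# Probability that the label misses p by more than 0.1.
prob_far = sum(
    binomial_pmf(k, n, p) for k in range(n + 1) if abs(k / n - p) > 0.1
)

print(f"std of k/n: {label_std:.3f}")
print(f"P(|k/n - p| > 0.1): {prob_far:.3f}")
```

Under these assumed values the label has a standard deviation of 0.125 and lands more than 0.1 away from the true probability in roughly 45% of draws, which is the noise a point-regression target would ask the model to fit.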
Methodology
BETA PRM Formulation
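The model assumes a generative process in which the latent prefix success probability $p_t$ follows a Beta distribution, and the observed success count $k_t$ out of $n$ sampled continuations follows a Binomial distribution conditional on $p_t$:
$$p_t \sim \mathrm{Beta}(\alpha_t, \beta_t), \qquad k_t \mid p_t \sim \mathrm{Binomial}(n, p_t).$$
For interpretability, the Beta distribution is reparameterized by its mean $\mu_t$ (the expected success probability/standard PRM score) and concentration $\kappa_t$ (controlling belief tightness/reliability):
$$\mu_t = \frac{\alpha_t}{\alpha_t + \beta_t}, \qquad \kappa_t = \alpha_t + \beta_t.$$
Thus, $\alpha_t = \mu_t \kappa_t$ and $\beta_t = (1 - \mu_t)\,\kappa_t$.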
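The mean/concentration reparameterization can be checked numerically. The following is a minimal sketch; the function names and the example values are illustrative assumptions. It converts a mean and a concentration into the two Beta shape parameters and shows that, for a fixed mean, a larger concentration yields a tighter belief (the standard deviation of a Beta distribution is $\sqrt{\mu(1-\mu)/(\kappa+1)}$).

```python
import math
from typing import Tuple


def beta_shape_params(mu: float, kappa: float) -> Tuple[float, float]:
    """Map (mean, concentration) to the Beta shape parameters (alpha, beta)."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    if kappa <= 0.0:
        raise ValueError("kappa must be positive")
    return mu * kappa, (1.0 - mu) * kappa


def beta_std(mu: float, kappa: float) -> float:
    """Standard deviation of Beta(mu * kappa, (1 - mu) * kappa)."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))


# Same mean, two different reliability levels (hypothetical values).
loose = beta_shape_params(0.75, 4.0)
tight = beta_shape_params(0.75, 40.0)

print(loose, round(beta_std(0.75, 4.0), 4))
print(tight, round(beta_std(0.75, 40.0), 4))
```

Both beliefs share the mean 0.75, but the high-concentration one is considerably narrower, which is what lets the concentration act as a reliability signal.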
Model Parameterization
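At each process marker <prm> after step $t$:
- Reward Mean ($\mu_t$): Derived from the language model's logits for the reward tokens Yes ($z_t^{\text{Yes}}$) and No ($z_t^{\text{No}}$): $\mu_t = \exp(z_t^{\text{Yes}}) / \big(\exp(z_t^{\text{Yes}}) + \exp(z_t^{\text{No}})\big)$.
- Concentration ($\kappa_t$): Predicted by a separate lightweight linear head applied to the hidden state $h_t$, with the output kept above $\kappa_{\min}$, where $\kappa_{\min}$ is a small lower bound for stability.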
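The two output heads described above might be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the two-way softmax over the Yes/No logits, the softplus used to keep the concentration positive, the default `kappa_min`, and all names are assumptions introduced here.

```python
import math
from typing import Sequence


def reward_mean(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the Yes/No logits, written as a stable sigmoid."""
    diff = logit_yes - logit_no
    if diff >= 0.0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); always positive."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def concentration(
    hidden: Sequence[float],
    weights: Sequence[float],
    bias: float,
    kappa_min: float = 1e-3,  # assumed lower bound, not a value from the paper
) -> float:
    """Linear head on the hidden state, mapped to (kappa_min, infinity)."""
    raw = sum(w * h for w, h in zip(weights, hidden)) + bias
    return kappa_min + softplus(raw)


# Toy hidden state and head parameters (hypothetical values).
mu = reward_mean(logit_yes=2.0, logit_no=0.5)
kappa = concentration(hidden=[0.2, -0.1, 0.4], weights=[1.0, 0.5, 2.0], bias=0.1)
print(round(mu, 4), round(kappa, 4))
```

Keeping the concentration head separate from the Yes/No logits means the reward mean stays compatible with a standard PRM score while reliability is learned through its own parameters.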
Training Objective
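The core training loss is the negative log-likelihood of the observed count $k_t$ under the predicted Beta-Binomial distribution:
$$\mathcal{L}_{\text{BB}} = -\sum_{t \in \mathcal{M}} \log P(k_t \mid n, \alpha_t, \beta_t),$$
where
$$P(k_t \mid n, \alpha_t, \beta_t) = \binom{n}{k_t} \frac{B(k_t + \alpha_t,\; n - k_t + \beta_t)}{B(\alpha_t, \beta_t)},$$
with $\alpha_t = \mu_t \kappa_t$ and $\beta_t = (1 - \mu_t)\,\kappa_t$, and $B(\cdot, \cdot)$ is the Beta function. $\mathcal{M}$ is the set of supervised markers.
An auxiliary regularization loss $\mathcal{L}_{\text{reg}}$ is added to calibrate the concentration parameter, discouraging high confidence (large $\kappa_t$) when the predicted mean ($\mu_t$) disagrees with the evidence ($k_t / n$).
Inside this term the predicted mean is wrapped in a stop-gradient $\mathrm{sg}(\cdot)$, preventing the term from pulling $\mu_t$ toward the noisy ratio. The total loss is:
$$\mathcal{L} = \mathcal{L}_{\text{BB}} + \lambda\, \mathcal{L}_{\text{reg}},$$
where $\lambda$ weights the regularizer.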
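The Beta-Binomial negative log-likelihood can be evaluated with log-gamma functions alone. The sketch below is a plain-Python illustration, not the paper's training code: it covers only the likelihood term (the auxiliary regularizer is omitted), and the function names and example values are assumptions.

```python
import math
from typing import Sequence


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_log_pmf(k: int, n: int, mu: float, kappa: float) -> float:
    """log P(k | n, alpha, beta) with alpha = mu*kappa, beta = (1-mu)*kappa."""
    alpha, beta = mu * kappa, (1.0 - mu) * kappa
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def beta_binomial_nll(
    counts: Sequence[int], n: int, mus: Sequence[float], kappas: Sequence[float]
) -> float:
    """Sum of negative log-likelihoods over the supervised markers."""
    return -sum(
        beta_binomial_log_pmf(k, n, mu, kappa)
        for k, mu, kappa in zip(counts, mus, kappas)
    )


n = 16  # hypothetical number of continuations per prefix

# Evidence agrees with the mean (k/n = 0.5): confidence is rewarded.
agree_low = beta_binomial_nll([8], n, [0.5], [2.0])
agree_high = beta_binomial_nll([8], n, [0.5], [200.0])

# Evidence disagrees with the mean (k/n = 0.875): confidence is penalized.
clash_low = beta_binomial_nll([14], n, [0.5], [2.0])
clash_high = beta_binomial_nll([14], n, [0.5], [200.0])

print(agree_low > agree_high, clash_low < clash_high)
```

The comparison shows why the likelihood by itself already carries a reliability signal: a high concentration lowers the loss only when the predicted mean is consistent with the observed count.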
Adaptive Computation Allocation (ACA)
ACA is a reliability-aware inference method for PRM-guided Best-of-N reasoning. It uses BETA PRM's outputs ($\mu_t$, $\kappa_t$) to allocate computation dynamically:
- Risk-Adjusted Candidate Score: For a candidate solution $c$, compute a score that penalizes uncertainty. First, derive the step-level standard deviation from the Beta parameters: $\sigma_t = \sqrt{\mu_t (1 - \mu_t) / (\kappa_t + 1)}$. The candidate score then aggregates risk-adjusted step scores, in which each step's mean is discounted according to its $\sigma_t$.
- Progressive Batches & Early Stopping: Instead of generating all candidates at once, ACA starts with a small pool. It computes Lower and Upper Confidence Bounds (LCB, UCB) for each candidate $c$. If the top candidate's LCB exceeds all other candidates' UCBs ($\mathrm{LCB}(c^\ast) > \max_{c \neq c^\ast} \mathrm{UCB}(c)$), ACA stops early and returns $c^\ast$.
- Uncertainty-Guided Prefix Repair: If it does not stop, ACA allocates the next batch to the most promising non-winner candidate (highest UCB). To repair it, ACA selects a cutpoint step (e.g., the earliest step whose conservative score falls below a threshold $\tau$), keeps the prefix before it, and samples new continuations from that point.
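The three ACA components above can be sketched in a few functions. This is a simplified illustration under stated assumptions, not the paper's algorithm: the bounds are assumed to be the per-step mean minus or plus `risk_weight` standard deviations averaged over steps, the leader is assumed to be the candidate with the highest LCB, and `risk_weight`, `threshold`, and all function names are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Step = Tuple[float, float]  # (mu, kappa) predicted at one process marker


def step_std(mu: float, kappa: float) -> float:
    """Standard deviation of a Beta belief with mean mu and concentration kappa."""
    return math.sqrt(mu * (1.0 - mu) / (kappa + 1.0))


def confidence_bounds(
    steps: Sequence[Step], risk_weight: float = 1.0
) -> Tuple[float, float]:
    """Assumed form: average over steps of mu -/+ risk_weight * sigma."""
    lows = [mu - risk_weight * step_std(mu, kappa) for mu, kappa in steps]
    highs = [mu + risk_weight * step_std(mu, kappa) for mu, kappa in steps]
    return sum(lows) / len(lows), sum(highs) / len(highs)


def early_stop_winner(pool: Sequence[Sequence[Step]]) -> Optional[int]:
    """Return the leader's index if its LCB beats every rival's UCB, else None."""
    bounds = [confidence_bounds(steps) for steps in pool]
    leader = max(range(len(pool)), key=lambda i: bounds[i][0])
    rival_ucbs = [ucb for i, (_, ucb) in enumerate(bounds) if i != leader]
    if not rival_ucbs or bounds[leader][0] > max(rival_ucbs):
        return leader
    return None


def repair_cutpoint(
    steps: Sequence[Step], threshold: float, risk_weight: float = 1.0
) -> Optional[int]:
    """Earliest step whose conservative score falls below the threshold."""
    for t, (mu, kappa) in enumerate(steps):
        if mu - risk_weight * step_std(mu, kappa) < threshold:
            return t
    return None


# Toy pools of per-step (mu, kappa) predictions (hypothetical values).
confident_pool = [[(0.90, 50.0), (0.95, 80.0)], [(0.60, 5.0), (0.40, 3.0)]]
ambiguous_pool = [[(0.70, 4.0)], [(0.65, 4.0)]]

print(early_stop_winner(confident_pool))                   # 0
print(early_stop_winner(ambiguous_pool))                   # None
print(repair_cutpoint(confident_pool[1], threshold=0.3))   # 1
```

In the first pool the leader's lower bound clears the rival's upper bound, so sampling can stop. In the second pool the low concentrations leave the intervals overlapping, which is the case where further computation would be allocated.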
Empirical Validation / Results
BETA PRM Improves Best-of-N Selection
Trained on VisualPRM400K-v1.1, BETA PRM was evaluated as a solution selector for Best-of-16 across four backbones and four multimodal reasoning benchmarks. Candidates were ranked using a risk-budget selector that discounts rewards for steps with high uncertainty.
Table 1: PRM-guided Best-of-16 Final-Answer Accuracy
| Selector | MathVision | OlympiadBench | MathVerse | MathVista | Avg. Gain vs. Standard PRM |
|---|---|---|---|---|---|
| InternVL3-14B | |||||
| +Standard PRM | 23.03 | 16.67 | 45.41 | 60.70 | – |
| +BETA PRM | 25.66 | 16.67 | 46.35 | 62.30 | +1.29 |
| InternVL3-8B | |||||
| +Standard PRM | 22.69 | 15.33 | 44.80 | 60.00 | – |
| +BETA PRM | 24.34 | 18.00 | 45.20 | 61.10 | +1.46 |
| InternVL2.5-8B | |||||
| +Standard PRM | 21.38 | 11.33 | 42.81 | 57.60 | – |
| +BETA PRM | 25.66 | 15.33 | 44.31 | 61.30 | +3.37 |
| Qwen2.5-VL-7B | |||||
| +Standard PRM | 21.38 | 14.00 | 44.92 | 60.30 | – |
| +BETA PRM | 24.34 | 17.33 | 45.99 | 63.60 | +2.66 |
BETA PRM matches or exceeds the standard PRM in every backbone–benchmark combination (the only tie is OlympiadBench with InternVL3-14B, at 16.67), with consistent average gains of +1.29 to +3.37 points over the standard PRM baseline.
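The average-gain column can be recomputed from the per-benchmark accuracies. The short check below copies the numbers from Table 1; the dictionary layout and function name are illustrative choices.

```python
# (Standard PRM, BETA PRM) accuracies from Table 1, in the order
# MathVision, OlympiadBench, MathVerse, MathVista.
TABLE_1 = {
    "InternVL3-14B": ([23.03, 16.67, 45.41, 60.70], [25.66, 16.67, 46.35, 62.30]),
    "InternVL3-8B": ([22.69, 15.33, 44.80, 60.00], [24.34, 18.00, 45.20, 61.10]),
    "InternVL2.5-8B": ([21.38, 11.33, 42.81, 57.60], [25.66, 15.33, 44.31, 61.30]),
    "Qwen2.5-VL-7B": ([21.38, 14.00, 44.92, 60.30], [24.34, 17.33, 45.99, 63.60]),
}


def average_gain(standard, beta):
    """Mean per-benchmark accuracy difference (BETA PRM minus Standard PRM)."""
    return sum(b - s for s, b in zip(standard, beta)) / len(standard)


for backbone, (standard, beta) in TABLE_1.items():
    print(f"{backbone}: {average_gain(standard, beta):+.3f}")
```

The recomputed averages agree with the reported column to within rounding; the Qwen2.5-VL-7B value sits on a rounding boundary (10.66 / 4 = 2.665), which the table reports as +2.66.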
BETA PRM Preserves Error Detection Ability
Evaluated on VisualProcessBench for step-level error detection (binary classification of correct/erroneous steps), BETA PRM remains competitive with standard PRMs, showing the Beta-Binomial training does not degrade this core PRM capability.
Table 2: Step-level Error Detection on VisualProcessBench (Overall Micro-F1)
| Model | Overall F1 |
|---|---|
| InternVL3-14B | |
| Standard PRM | 61.90 |
| BETA PRM | 61.90 |
| InternVL3-8B | |
| Standard PRM | 60.69 |
| BETA PRM | 61.85 |
| InternVL2.5-8B | |
| Standard PRM | 61.54 |
| BETA PRM | 60.97 |
| Qwen2.5-VL-7B | |
| Standard PRM | 62.23 |
| BETA PRM | 62.91 |
Ablations Validate Design Choices
- Auxiliary Regularizer ($\mathcal{L}_{\text{reg}}$): Removing it consistently reduces Best-of-N accuracy (Table 3), confirming its role in calibrating the learned concentration $\kappa$.
- Training Dynamics: The mean and 90th percentile of $\kappa$ drop early in training and later recover (Figure 4), showing the model first becomes conservative and then learns to assign higher confidence to well-supported predictions. The recovery of a high-confidence upper tail indicates the model learns adaptive, non-uniform reliability.
ACA Improves Accuracy-Token Tradeoff
Using BETA PRM's reliability signal, ACA was evaluated against a fixed-budget Best-of-16 baseline under the same maximum candidate budget.
Table 4: ACA vs. Vanilla Best-of-16 (BoN)
| Method | MathVision (Acc. ↑ / Tokens ↓) | OlympiadBench (Acc. ↑ / Tokens ↓) | MathVerse (Acc. ↑ / Tokens ↓) | MathVista (Acc. ↑ / Tokens ↓) |
|---|---|---|---|---|
| InternVL2.5-8B | | | | |
| Vanilla BoN | 25.00 / 1383k | 15.33 / 1151k | 44.47 / 17932k | 60.90 / 2790k |
| ACA | 26.32 / 965k (↓30.24%) | 16.67 / 958k (↓16.76%) | 45.58 / 11912k (↓33.57%) | 62.20 / 1949k (↓30.14%) |
| Qwen2.5-VL-7B | ||||
| Vanilla BoN | 24.67 / 1383k | 16.67 / 1151k | 45.74 / 17932k | 63.30 / 2790k |
| ACA | 26.65 / 988k (↓28.57%) | 18.00 / 928k (↓19.39%) | 46.40 / 12015k (↓33.00%) | 64.00 / 2030k (↓27.22%) |
ACA improves final-answer accuracy while simultaneously reducing token usage by 16.76% to 33.57%.
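The token-reduction percentages follow from the token counts in Table 4 as one minus the ratio of ACA tokens to vanilla tokens. The check below uses the table's counts, which are rounded to thousands, so the recomputed values can differ from the reported ones in the second decimal place; the data layout and function name are illustrative.

```python
# (vanilla tokens, ACA tokens) in thousands, from Table 4, in the order
# MathVision, OlympiadBench, MathVerse, MathVista.
TOKENS = {
    "InternVL2.5-8B": [(1383, 965), (1151, 958), (17932, 11912), (2790, 1949)],
    "Qwen2.5-VL-7B": [(1383, 988), (1151, 928), (17932, 12015), (2790, 2030)],
}


def token_reduction_pct(vanilla: float, aca: float) -> float:
    """Percentage of tokens saved by ACA relative to vanilla Best-of-N."""
    return 100.0 * (1.0 - aca / vanilla)


for backbone, pairs in TOKENS.items():
    savings = [token_reduction_pct(v, a) for v, a in pairs]
    print(backbone, [f"{s:.2f}%" for s in savings])
```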
A key ablation (Table 5) shows that BETA PRM's learned uncertainty is crucial for ACA's success. Replacing it with a standard PRM using either a reward-only score or a proxy uncertainty results in worse accuracy-token tradeoffs, demonstrating the value of the explicitly modeled reliability signal $\kappa$.
Theoretical and Practical Implications
Theoretical Implications:
- Shift in PRM Formulation: Proposes a principled, distributional alternative to point-estimate PRMs, aligning the training objective (Beta-Binomial likelihood) with the nature of Monte Carlo supervision (count data).
- Uncertainty as a First-Class Citizen: Introduces reliability ($\kappa$) as a core, learnable output of a reward model, enabling downstream algorithms to reason about trustworthiness.
Practical Implications:
- Improved Test-Time Scaling: BETA PRM provides a better signal for PRM-guided candidate selection, leading to higher accuracy in Best-of-N settings.
- Efficient Inference: The ACA method demonstrates that reliability signals can be used for adaptive computation allocation, significantly reducing inference cost (tokens) while maintaining or improving accuracy, moving beyond fixed-budget approaches.
- Broader Applicability: The reliability-aware interface (reward + confidence) is broadly useful for any PRM-guided decision-making, including search, policy optimization, and interactive systems where knowing when to trust a score is critical.
Conclusion
The paper identifies a key limitation in current Process Reward Models: their single-point reward outputs provide no indication of predictive reliability. To address this, it introduces BETA PRM, a distributional PRM that models prefix success probability with a Beta belief and is trained via a Beta-Binomial likelihood on Monte Carlo count observations. This gives the model two outputs: a predicted reward mean $\mu$ and a learned reliability concentration $\kappa$.
Empirically, BETA PRM improves the accuracy of PRM-guided Best-of-N selection across multiple backbones and benchmarks without sacrificing step-level error detection. Furthermore, leveraging this reliability signal, the proposed Adaptive Computation Allocation (ACA) method dynamically allocates inference-time compute, achieving a superior accuracy-token tradeoff by reducing token usage by up to 33.57% while improving final-answer accuracy. Overall, BETA PRM advances PRMs from simple scorers to reliability-aware models, enabling more robust and efficient reasoning systems.