Summary of "Believe Your Model: Distribution-Guided Confidence Calibration"

Summary (Overview)

  • Proposes DistriVoting: A novel test-time scaling (TTS) method that uses a Gaussian Mixture Model (GMM) to decompose the confidence distribution of generated reasoning trajectories into positive (correct) and negative (incorrect) components, and then applies a two-step filtering process to improve answer selection.
  • Introduces SelfStepConf (SSC): A method to dynamically adjust the inference process by monitoring step-level confidence in real-time and triggering self-reflection when confidence drops, which increases the separation between confidence distributions of correct and incorrect answers.
  • Theoretically and Empirically Validates: Provides theorems proving that increased separation ($\delta = \mu_{pos} - \mu_{neg}$) between distributions improves voting accuracy, and shows through extensive experiments that both DistriVoting and SSC consistently boost performance.
  • Demonstrates Superior Performance: Experiments across 16 models (DeepSeek, Qwen series) and 5 reasoning benchmarks (HMMT2025, GPQA-D, AIME2024, AIME2025, BRUMO2025) show that DistriVoting outperforms state-of-the-art TTS methods like Self-Consistency, Best-of-N, and MoB.

Introduction and Theoretical Foundation

Background: Large Reasoning Models (LRMs) benefit from test-time scaling (TTS) techniques like Chain-of-Thought and repeated sampling, which generate multiple candidate answers. A core challenge is selecting the best answer without external labels or reward signals.

Motivation: Prior work shows that internal model confidence scores correlate with answer correctness and follow distinct statistical distributions for correct vs. incorrect trajectories. However, this distributional prior has not been fully exploited to guide answer selection. The overlap between high-confidence incorrect samples and low-confidence correct samples limits the reliability of confidence-based voting.

Theoretical Foundation: The paper models the confidence scores of trajectories as a mixture of two normal distributions:

  • Positive distribution: $X_{pos} \sim \mathcal{N}(\mu_{pos}, \sigma_{pos}^2)$
  • Negative distribution: $X_{neg} \sim \mathcal{N}(\mu_{neg}, \sigma_{neg}^2)$

Key Theorems:

  • Theorem 2.1: Defines an integral ratio function $R(\mu_1, \mu_2)$ and proves it is strictly monotonically increasing with respect to $\delta = \mu_1 - \mu_2$. This implies that a larger separation between distribution means leads to a higher proportion of correct samples above the midpoint threshold.
  • Theorem 2.2: Proves that the weighted voting accuracy $P_{vote}(\delta)$ has a lower bound that increases with $\delta$.
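The intuition behind Theorem 2.1 can be checked with a quick Monte-Carlo sketch (the distribution parameters here are illustrative, not taken from the paper): as the separation $\delta$ between the two Gaussians grows, the pool of samples above the midpoint threshold becomes purer in positive samples.

```python
# Monte-Carlo illustration: with a midpoint threshold between the two means,
# the fraction of *positive* samples among all samples above the threshold
# grows with delta = mu_pos - mu_neg. Parameters are illustrative only.
import random

def frac_correct_above_midpoint(mu_pos, mu_neg, sigma=1.0, n=100_000, seed=0):
    rng = random.Random(seed)
    mid = (mu_pos + mu_neg) / 2
    pos_above = sum(rng.gauss(mu_pos, sigma) > mid for _ in range(n))
    neg_above = sum(rng.gauss(mu_neg, sigma) > mid for _ in range(n))
    return pos_above / (pos_above + neg_above)

small = frac_correct_above_midpoint(mu_pos=1.0, mu_neg=0.0)  # delta = 1
large = frac_correct_above_midpoint(mu_pos=3.0, mu_neg=0.0)  # delta = 3
assert large > small  # larger separation -> purer above-threshold pool
```

With $\sigma = 1$, the purity is roughly $\Phi(\delta/2)$, i.e. about 0.69 at $\delta = 1$ versus about 0.93 at $\delta = 3$.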

Methodology

1. Confidence Calculation

For a trajectory with $N$ tokens, confidence is computed using token negative log-probabilities:

$$C_{traj} = -\frac{1}{N_G \times k} \sum_{i \in G} \sum_{j=1}^{k} \log P_i(j)$$

where $G$ is the subset of tokens (e.g., the final answer step), $N_G$ is the number of tokens in $G$, $P_i(j)$ is the $j$-th highest token probability at position $i$, and $k$ is the number of top probabilities considered.
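A minimal sketch of this computation, assuming the input is a list of per-token top-$k$ probability lists for the positions in $G$ (the input shape and function name are our own, not the paper's implementation):

```python
import math

def trajectory_confidence(topk_probs):
    """Compute C_traj from the formula above.

    topk_probs: one entry per token position in G; each entry is the
    top-k token probabilities at that position (assumed input shape).
    """
    n_g = len(topk_probs)          # N_G: number of tokens in G
    k = len(topk_probs[0])         # k: top probabilities per position
    total = sum(math.log(p) for probs in topk_probs for p in probs)
    return -total / (n_g * k)      # mean negative log-probability
```

For example, two positions with top-2 probabilities `[0.5, 0.25]` each give $-(\ln 0.5 + \ln 0.25) / 2 \approx 1.04$.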

2. SelfStepConf (SSC)

SSC intervenes during generation to improve trajectory quality.

  • Reflection Trigger: Monitors step-level confidence $C_{G_m}$. Triggers self-reflection if the relative change $\Delta_{conf} = C_{G_m} / \tau_{conf}$ falls below a threshold $\delta$ and confidence is declining. The adaptive threshold $\tau_{conf}$ is updated via an EMA when no reflection is triggered.
  • Reflection Injection: When triggered, it forcibly swaps the highest-probability token's logit with that of a predefined "reflection token" to steer generation, sampling with temperature = 0 for a fixed number of steps. This intervention does not affect the step's confidence calculation.
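The trigger logic above can be sketched as follows; the EMA coefficient, the threshold value, and the initialisation are illustrative assumptions, not the paper's settings:

```python
# Sketch of the SelfStepConf reflection trigger. `delta` (relative-change
# threshold) and `alpha` (EMA decay) are placeholder values of our own.
def ssc_monitor(step_confidences, delta=0.8, alpha=0.9):
    """Return the step indices at which a reflection would be triggered."""
    tau = step_confidences[0]      # adaptive threshold tau_conf
    prev = step_confidences[0]
    events = []
    for i, c in enumerate(step_confidences[1:], start=1):
        rel = c / tau              # Delta_conf = C_Gm / tau_conf
        declining = c < prev
        if rel < delta and declining:
            events.append(i)       # here SSC would inject a reflection token
        else:
            tau = alpha * tau + (1 - alpha) * c  # EMA update when no trigger
        prev = c
    return events

print(ssc_monitor([1.0, 1.05, 1.1, 0.6, 0.95]))  # -> [3]
```

A sharp confidence drop at step 3 trips the trigger; rising confidence never does.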

3. DistriVoting

A two-stage filtering process applied after trajectory generation.

  • GMM Modeling: Models the confidence distribution of all trajectories for a question as a Gaussian Mixture Model:
$$p(x) = \pi_1 \mathcal{N}(x \mid \mu_1, \sigma_1^2) + \pi_2 \mathcal{N}(x \mid \mu_2, \sigma_2^2)$$

The component with the higher mean is mapped to the positive distribution.

  • GMM Filter: Selects trajectories belonging to the positive distribution $V_{pos}$ as the initial candidate voting pool.
  • Reject Filter: Uses the trajectories from the negative distribution $V_{neg}$ to vote for a likely incorrect answer $A_{neg}$. It then filters out from $V_{pos}$ any trajectory whose answer matches $A_{neg}$, producing a refined pool $\hat{V}_{pos}$.
  • Hierarchical Voting (HierVoting): The final voting on $\hat{V}_{pos}$ uses a hierarchical approach:
    1. Confidence scores are split into $N_C$ intervals.
    2. Weighted majority voting $f_{WMaj}$ is applied within each interval to get a sub-answer.
    3. A final weighted majority vote aggregates the sub-answers. The weighted majority voting function is:

    $$f_{WMaj}(V, C) = \arg\max_{ans} \sum_{traj \in V} \mathbb{I}(A_{traj} = ans) \cdot C_{traj}$$
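The pipeline above can be sketched end to end. A hand-rolled two-component 1-D EM fit stands in for a full GMM library, and for brevity the sketch replaces hierarchical interval voting with a single weighted vote over the refined pool; all names, initialisation choices, and fallbacks are our own assumptions, not the paper's implementation:

```python
import math
from collections import defaultdict

def _variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def _normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, iters=50):
    """EM fit of a two-component 1-D Gaussian mixture (toy stand-in)."""
    xs = sorted(xs)
    mu = [xs[len(xs) // 4], xs[3 * len(xs) // 4]]   # spread-out initial means
    var = [max(1e-6, _variance(xs))] * 2
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r1 = []
        for x in xs:
            p = [pi[k] * _normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
            r1.append(p[1] / (p[0] + p[1] + 1e-300))
        # M-step: re-estimate means, variances, mixing weights
        n1 = sum(r1)
        n0 = len(xs) - n1
        mu = [sum((1 - r) * x for r, x in zip(r1, xs)) / max(n0, 1e-9),
              sum(r * x for r, x in zip(r1, xs)) / max(n1, 1e-9)]
        var = [max(1e-6, sum((1 - r) * (x - mu[0]) ** 2 for r, x in zip(r1, xs)) / max(n0, 1e-9)),
               max(1e-6, sum(r * (x - mu[1]) ** 2 for r, x in zip(r1, xs)) / max(n1, 1e-9))]
        pi = [n0 / len(xs), n1 / len(xs)]
    return mu, var, pi

def weighted_majority(pool):
    """f_WMaj: confidence-weighted majority vote over (answer, conf) pairs."""
    scores = defaultdict(float)
    for ans, conf in pool:
        scores[ans] += conf
    return max(scores, key=scores.get)

def distrivoting(trajs):
    """trajs: list of (answer, confidence) pairs for one question."""
    confs = [c for _, c in trajs]
    mu, var, pi = fit_gmm_1d(confs)
    hi = max(0, 1, key=lambda k: mu[k])      # higher-mean component = positive
    def is_pos(c):
        p = [pi[k] * _normal_pdf(c, mu[k], var[k]) for k in (0, 1)]
        return p[hi] >= p[1 - hi]
    v_pos = [(a, c) for a, c in trajs if is_pos(c)]      # GMM filter
    v_neg = [(a, c) for a, c in trajs if not is_pos(c)]
    if v_neg:                                            # reject filter
        a_neg = weighted_majority(v_neg)
        kept = [(a, c) for a, c in v_pos if a != a_neg] or v_pos
    else:
        kept = v_pos
    return weighted_majority(kept)
```

With well-separated confidence clusters, the low-confidence cluster nominates $A_{neg}$ and any high-confidence trajectory that agrees with it is discarded before the final vote; the `or v_pos` fallback guards against the reject filter emptying the pool.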

Empirical Validation / Results

Main Results: Experiments on DeepSeek-R1-8B and Qwen3-32B across five benchmarks show consistent improvements.

Table 1: Main results of SelfStepConf (SSC) and DistriVoting across benchmarks. (Budget=128, 64 repetitions. SC=Self-Consistency, WSC=Weighted SC, BoN=Best of N, DIS=DistriVoting)

| Model | Method | HMMT2025 | GPQA-D | AIME2024 | AIME2025 | BRUMO2025 | Avg. |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1-8B | SC | 69.11 | 67.50 | 86.67 | 80.36 | 93.07 | 73.09 |
| | WSC | 69.69 | 67.65 | 86.67 | 80.78 | 93.33 | 73.30 |
| | DIS-GMM* | 84.95 | 70.63 | 93.23 | 86.64 | 94.27 | 77.84 |
| Qwen3-32B | SC | 62.08 | 70.30 | 86.46 | 76.98 | 93.33 | 73.85 |
| | WSC | 62.24 | 70.55 | 86.88 | 77.08 | 93.33 | 74.07 |
| | DIS-GMM* | 65.73 | 73.18 | 89.11 | 80.05 | 93.33 | 76.53 |

Key Findings:

  1. GMM vs. Fixed Filter: GMM Filter consistently outperforms a fixed Top-50% filter (e.g., WSC-GMM 76.64% vs. WSC-Top50 74.75% for DeepSeek).
  2. DistriVoting vs. Baselines: DistriVoting (DIS) outperforms weighted voting (WSC) under both filter settings.
  3. SSC Contribution: Applying SSC (marked *) further improves results for both WSC and DIS, validating its role in enhancing distribution separation.

Ablation Studies:

  • Clustering Methods: GMM outperforms K-Means and MeanShift in prediction accuracy and voting performance while being computationally efficient.
  • Budget Sensitivity: DistriVoting shows significant advantages over conventional methods when the sample budget ($B$) is ≥ 16, with performance scaling with larger budgets.
  • Filter Effectiveness: Analysis shows both Acc and Weighted Acc (WAcc) increase progressively through the GMM Filter and Reject Filter stages.

Table 4: Effectiveness Analysis of DistriVoting (DeepSeek-R1-8B)

| Metric | Benchmark | Stage I (All) | Stage II (After GMM) | Stage III (After Reject) |
|---|---|---|---|---|
| Acc | Avg. | 69.27 | 77.60 | 80.41 |
| WAcc | Avg. | 69.74 | 77.68 | 80.48 |

SSC Analysis:

  • Distribution Separation: SSC increases the mean difference $\delta$ between correct and incorrect confidence distributions (e.g., from 3.182 to 5.043 on HMMT2025).
  • Sampling Efficiency: SSC improves pass@1 performance significantly, especially for mid-tier models, indicating improved sampling efficiency without expanding fundamental reasoning limits.
  • Inference Dynamics: SSC interventions lead to higher confidence trajectories and often shorter responses without increasing overall time complexity significantly (~2.3% overhead).

Theoretical and Practical Implications

Theoretical Implications:

  • Formalizes the relationship between confidence distribution separation and voting accuracy, providing a theoretical justification for methods that aim to increase $\delta$.
  • Demonstrates that clustering confidence scores based on a bimodal normal assumption (via GMM) is an effective way to leverage distributional priors for answer selection.

Practical Implications:

  • Efficiency: DistriVoting and SSC rely solely on internal model signals, avoiding the cost of training or querying external reward models.
  • Effectiveness: Provides a robust, adaptive filtering mechanism that outperforms fixed-threshold methods, which require per-benchmark tuning.
  • Generality: The methods are model-agnostic and show consistent gains across a wide range of model sizes and architectures on diverse reasoning benchmarks.

Conclusion

This paper addresses the problem of "confidently wrong" predictions in test-time scaling by proposing DistriVoting, a distribution-guided voting method, and SelfStepConf, a step-level confidence intervention technique. DistriVoting leverages a GMM to model confidence distributions and uses a two-stage filter to refine the candidate answer pool, while SSC actively improves the quality and separability of generated trajectories. Both theoretical analysis and extensive experiments across 16 models and 5 benchmarks demonstrate significant and consistent performance improvements over state-of-the-art TTS methods. The work advances the field by showing how to better utilize internal model signals for reliable and efficient answer aggregation.