Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Summary (Overview)

  • Key Insight: Transition words (e.g., "because", "however") in Multimodal Large Reasoning Models (MLRMs) are strongly correlated with hallucinations and tend to occur during high-entropy (high uncertainty) reasoning states.
  • Core Method: Proposed Latent Entropy-Aware Decoding (LEAD) – a plug-and-play decoding strategy that performs entropy-aware reasoning mode switching. During high-entropy states, it uses probability-weighted continuous embeddings (latent decoding) to preserve semantic diversity; during low-entropy states, it switches back to discrete token embeddings for precise convergence.
  • Visual Grounding Enhancement: Introduces a prior-guided visual anchor injection strategy during high-entropy phases to encourage the model to refocus on visual information, countering the observed lower visual attention during hallucination-prone states.
  • Empirical Results: LEAD significantly mitigates hallucinations across various MLRMs (R1-Onevision, Vision-R1, VL-Rethinker, VL-Cogito, OpenVLThinker) on multiple general and scientific multimodal reasoning benchmarks, improving performance while maintaining or enhancing reasoning efficiency.

Introduction and Theoretical Foundation

Background: Recent Multimodal Large Reasoning Models (MLRMs) integrate visual understanding with linguistic reasoning chains but remain highly prone to hallucinations (generating content contradictory to visual evidence or logically inconsistent). Existing mitigation methods often involve costly training adjustments or generic decoding strategies.

Motivation & Observation: The authors observe that transition words (which structure reasoning chains) frequently coincide with hallucinations. Analyzing token-level uncertainty via entropy, they find these transition words consistently exhibit higher entropy, marking high-uncertainty stages in reasoning. During these phases, semantic divergence and competition among potential reasoning paths increase hallucination risk.

Core Hypothesis: Reliance on discrete textual inputs encourages sequential, explicit reasoning, underutilizing dense contextual cues during high-entropy stages. Richer semantic representations constructed from the full token probability distribution can enhance contextual reasoning capability.

Empirical Support: Token masking ablation experiments show:

  • Masking high-entropy tokens causes significant performance drop, indicating they are critical informational nodes.
  • Early high-entropy tokens have the strongest influence on the final reasoning trajectory.
  • High-entropy tokens associated with hallucinations exhibit lower visual attention ratios compared to non-hallucinated high-entropy tokens.

Theoretical Inspiration: The method is inspired by superposed representation theory, proposing to leverage latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories.

Methodology

3.1. MLRMs Generation

Input Processing: An MLRM accepts an image and text. The image is processed by a vision encoder and projected into vision tokens $x_v = \{x_{v,1}, x_{v,2}, \dots, x_{v,N}\}$. Text is tokenized into text tokens $x_t = \{x_{t,1}, x_{t,2}, \dots, x_{t,M}\}$. The complete multimodal input sequence is $x = x_v \oplus x_t = \{x_i\}_{i=1}^{T}$, where $T = N + M$.

Autoregressive Generation: The backbone LLM $R_\theta$ predicts the next-token distribution at each step $t$:

$$p_t = R_\theta(\cdot \mid x, y_{<t}) \in \Delta^{|V|-1},$$

where $y_{<t} = (y_1, y_2, \dots, y_{t-1})$ are the previously generated tokens, $V$ is the vocabulary, and $\Delta^{|V|-1}$ is the probability simplex.

Discrete Reasoning Decoding: The standard approach. At reasoning step $t$, the model computes the distribution $p_t$ from the embeddings $e(x)$ and $e(r_{<t})$, then samples a token $r_t$:

$$p_t = R_\theta(e(x), e(r_{<t})), \quad r_t \sim p_t, \quad r_t \in V.$$

Latent Reasoning Decoding: The proposed alternative, which preserves distributional information. Instead of sampling a discrete token, a probability-weighted embedding is formed and fed back:

$$\tilde{e}_t = \mathbb{E}_{v \sim p_t}[e(v)],$$

where $\mathbb{E}$ denotes expectation under $p_t$ and $e(v)$ is the embedding of token $v$.
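The expectation above is simply the probability-weighted sum of rows of the embedding table. A minimal sketch, using a toy vocabulary and embedding matrix (the names `latent_embedding`, `p`, and `E` are illustrative, not from the paper's code):

```python
import torch

def latent_embedding(p: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Expected embedding under the next-token distribution:
    e_tilde = sum_v p[v] * e(v).

    p: (|V|,) next-token probabilities; E: (|V|, d) embedding table.
    """
    return p @ E  # probability-weighted mixture of token embeddings

# toy example: 4-token vocabulary, 3-dim embeddings
E = torch.eye(4, 3)
p = torch.tensor([0.5, 0.5, 0.0, 0.0])
e_tilde = latent_embedding(p, E)  # midpoint of the two candidate embeddings
```

When $p_t$ is one-hot this reduces exactly to the discrete embedding $e(r_t)$, which is why the two decoding modes share a single interface.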

3.2. Entropy-Aware Reasoning Mode Switching

Entropy Calculation: Token-level entropy $H_t$ measures uncertainty at step $t$:

$$H_t = -\sum_{v} p_t[v] \log p_t[v],$$

where $p_t[v]$ is the predicted probability of token $v$.
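This quantity is cheap to compute from the model's logits at every step. A minimal sketch (the small epsilon guards against $\log 0$; the helper name is illustrative):

```python
import torch

def token_entropy(logits: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Shannon entropy H_t = -sum_v p[v] log p[v] of the next-token distribution."""
    p = torch.softmax(logits, dim=-1)
    return -(p * (p + eps).log()).sum(dim=-1)

# a peaked distribution is near-deterministic (low entropy);
# a uniform one is maximally uncertain (entropy = log |V|)
low = token_entropy(torch.tensor([10.0, 0.0, 0.0, 0.0]))
high = token_entropy(torch.zeros(4))
```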

Mode Switch Criterion: Let $\hat{H}$ be a dynamic reference entropy threshold. The next-step input embedding $\tilde{e}_t$ is determined by:

$$\tilde{e}_t = \begin{cases} e(r_t), & \text{if } H_t < \hat{H} \text{ (uncertainty drops)}, \\ \mathbb{E}_{v \sim p_t}[e(v)], & \text{otherwise (uncertainty rises)}. \end{cases}$$

Low entropy → discrete embeddings (deterministic). High entropy → probability-weighted embeddings (preserve diversity).

Persistence Window: To prevent rapid oscillation, a persistence window is enforced for transitions from discrete (D) to latent (L) mode. Define gating variables:

$$g^D_t = \mathbf{1}[H_t < \hat{H}], \qquad g^L_t = \mathbf{1}\left[(H_t > \hat{H}) \land (\rho_t \geq W_{D\to L})\right],$$

where $\mathbf{1}[\cdot]$ is the indicator function, $\rho_t$ counts the consecutive steps spent in the current mode, and $W_{D\to L}$ is the minimum number of steps before a D→L switch. The mode transition rule is:

$$m_{t+1} = g^D_t\, D + g^L_t\, L + (1 - g^D_t - g^L_t)\, m_t.$$

When a transition occurs, $\hat{H} \gets H_t$ and $\rho_t$ is reset to 0.
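The gated update can be sketched as a small state machine. The function below is an illustrative reading of the transition rule, not the paper's implementation; variable names (`H_hat`, `rho`, `W_dl`) mirror the symbols above:

```python
DISCRETE, LATENT = 0, 1

def next_mode(H, H_hat, mode, rho, W_dl):
    """One step of the gated mode update:
    m_{t+1} = g_D * D + g_L * L + (1 - g_D - g_L) * m_t.

    Returns (new_mode, new_H_hat, new_rho). On a transition, the reference
    threshold follows the current entropy and the dwell counter resets.
    """
    g_d = H < H_hat                      # uncertainty drops -> discrete
    g_l = (H > H_hat) and (rho >= W_dl)  # sustained rise -> latent
    new_mode = DISCRETE if g_d else (LATENT if g_l else mode)
    if new_mode != mode:
        return new_mode, H, 0            # H_hat <- H_t, rho reset
    return new_mode, H_hat, rho + 1

# entropy rises above the threshold, but the D->L switch only fires
# once the model has dwelt W_dl steps in discrete mode
mode, H_hat, rho = DISCRETE, 1.0, 0
for H in [1.5, 1.6, 1.7]:
    mode, H_hat, rho = next_mode(H, H_hat, mode, rho, W_dl=2)
```

Note that the two gates are mutually exclusive ($H_t$ cannot be both below and above $\hat{H}$), so the convex combination in the transition rule is well defined.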

Switch Count Regulation: A global switch counter $C_t$ with upper bound $C_{max}$ limits the total number of mode transitions to prevent overthinking.

3.3. Entropy-Aware Visual Anchor Injection

Visual Anchor Vector: Let $e_{vis}$ denote the averaged embedding of the pretrained visual special tokens (e.g., <|vision_start|>, <|image_pad|>, <|vision_end|>).

Injection Strategy: At the first token $t^\star$ of each high-entropy phase (the onset of latent reasoning), the visual anchor is injected:

$$\tilde{e}_{t^\star} = (1 - \lambda)\, \mathbb{E}_{v \sim p_{t^\star}}[e(v)] + \lambda\, e_{vis},$$

where $\lambda \in [0,1]$ controls the injection strength. This one-time injection provides visual grounding to stabilize reasoning.
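The injection is a plain convex blend of the latent embedding and the visual anchor. A minimal sketch with toy vectors (the helper name is illustrative; $\lambda = 0.4$ matches the best-performing setting reported in the ablation):

```python
import torch

def inject_visual_anchor(e_latent: torch.Tensor, e_vis: torch.Tensor,
                         lam: float = 0.4) -> torch.Tensor:
    """Convex blend at the onset of a high-entropy phase:
    e_tilde = (1 - lam) * E_p[e(v)] + lam * e_vis.
    """
    return (1.0 - lam) * e_latent + lam * e_vis

# toy 2-dim vectors for illustration
e_latent = torch.tensor([1.0, 0.0])
e_vis = torch.tensor([0.0, 1.0])
e_star = inject_visual_anchor(e_latent, e_vis, lam=0.4)
```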

Algorithm Pseudocode (Key Excerpt):

import torch

DISCRETE, LATENT = 0, 1

def LEAD_step(logits, E, prev_mode, tau, vis_emb, lam, is_phase_onset, eps=1e-12):
    # next-token distribution and its entropy
    p = torch.softmax(logits, dim=-1)
    H = (-(p * (p + eps).log()).sum()).item()
    # mode transition with dynamic threshold update
    # (persistence window and switch-count cap omitted in this excerpt)
    mode = LATENT if H >= tau else DISCRETE
    if mode != prev_mode:
        tau = H  # reference threshold follows the entropy at the switch point
    # embedding construction: expected embedding (latent) vs. argmax token (discrete)
    emb = p @ E if mode == LATENT else E[p.argmax()]
    # one-time visual anchor injection at the onset of a latent phase
    if mode == LATENT and is_phase_onset:
        emb = (1 - lam) * emb + lam * vis_emb
    return emb, mode, tau

Empirical Validation / Results

4.1. Experimental Setup

Models: R1-Onevision-7B, Vision-R1-7B, VL-Rethinker-7B, VL-Cogito-7B, OpenVLThinker-7B. Benchmarks: General Reasoning & Understanding (MMEval-Pro, MMVP, RealWorldQA, VMCBench, VStar); Hallucination Assessment (Bingo, MMHalu, POPE); Domain-Specific (Mathematical: MathVision, MathVista, MathVerse, VisuLogic, Geometry3K, MMK12-Math; Scientific: MMK12-Physics, Chemistry, Biology). Baselines: VCD, MemVR, SID. Implementation: Default maximum switching count $C_{max} = 5$.

4.2. Ablation Study

Effect of Entropy Threshold: Dynamic thresholding yields best performance. Fixed thresholds (too high or too low) degrade performance.

Figure 5: Comparisons on the MMHalu and Bingo datasets show that dynamic thresholding improves scores by +4.7% (R1-Onevision) and +4.1% (Vision-R1) over fixed thresholds.

Effect of Switching Window Size: Performance improves as the window size grows to 128, then declines. An infinite window ($\infty$) causes regression to standard CoT performance.

Figure 6: (a) MMHalu and (b) Bingo scores for R1-Onevision and Vision-R1 under window sizes 64, 128, 256, ∞.

Effect of Visual Anchor Injection Strength $\lambda$: Performance peaks at $\lambda = 0.4$ across datasets.

Table 1: Effect of visual anchor injection strength λ on overall performance.

Model            | λ   | VStar | MMEval-Pro | MMHalu | Bingo
R1-Onevision-7B  | 0   | 67.5  | 71.9       | 3.59   | 3.74
                 | 0.2 | 69.6  | 72.0       | 3.66   | 3.73
                 | 0.4 | 71.2  | 73.9       | 3.80   | 3.84
                 | 0.6 | 68.1  | 73.3       | 3.77   | 3.76
Vision-R1-7B     | 0   | 79.1  | 72.7       | 3.69   | 3.68
                 | 0.2 | 80.1  | 73.9       | 3.78   | 3.70
                 | 0.4 | 81.7  | 75.1       | 3.89   | 3.77
                 | 0.6 | 79.6  | 74.5       | 3.83   | 3.75

Qualitative Analysis: LEAD allocates higher visual attention to query-relevant regions vs. Baseline and MemVR. During latent reasoning, token distribution is more dispersed (higher entropy); during discrete reasoning, distribution approaches one-hot (lower entropy).

4.3. Comparisons to State-of-the-Arts

General Reasoning & Hallucination Benchmarks: LEAD consistently improves performance across all models.

Table 2: Comparisons of different MLRMs with LEAD across general reasoning and hallucination benchmarks. (Table shows accuracy for general benchmarks, scores for MMHalu (0-6) and Bingo (1-5). LEAD improves R1-Onevision by +4.7% on MMHalu and +3.8% on Bingo.)

Domain-Specific Reasoning Benchmarks: LEAD improves average accuracy by +2.0% on mathematics and +3.2% on scientific benchmarks.

Table 3: Comparisons of different MLRMs with LEAD across mathematical and scientific visual reasoning benchmarks. (Table shows accuracy improvements across all mathematical and scientific subsets.)

GPT-5 Assisted Evaluation: LEAD preserves text quality (grammar, fluency, naturalness) and shows lower perplexity (PPL) compared to baselines.

Reasoning Efficiency: LEAD generates shorter reasoning lengths while maintaining highest accuracy (evaluated on MathVision with R1-Onevision).

Figure 9: Comparisons of accuracy and reasoning length. LEAD achieves highest accuracy (~32.4%) with shortest average token length (~460).

Pass@k Performance: LEAD reaches peak accuracy at smaller $k$ values, indicating higher sample efficiency and greater diversity/correctness in reasoning.

Figure 10: Pass@k accuracy on RealWorldQA and MathVista for $k \in [4, 32]$. LEAD shows a steeper increase and higher final accuracy.

Theoretical and Practical Implications

Theoretical Implications:

  • Provides a novel perspective on hallucination mitigation by linking it to token-level uncertainty (entropy) and transition words.
  • Introduces the concept of latent superposed reasoning for MLRMs, leveraging full probability distributions to maintain semantic diversity during uncertain phases.
  • Demonstrates the importance of adaptive reasoning mode switching based on intrinsic confidence signals.

Practical Implications:

  • LEAD is a lightweight, plug-and-play decoding strategy that can be integrated into existing MLRMs without additional training costs.
  • It significantly reduces multimodal hallucinations across a wide range of benchmarks and model architectures.
  • The method improves reasoning efficiency (shorter chains, higher sample efficiency) while enhancing accuracy.
  • The visual anchor injection mechanism provides a simple way to enhance visual grounding during uncertain reasoning, addressing a key weakness observed in hallucination-prone states.

Conclusion

The paper identifies a strong correlation between transition words, high-entropy states, and hallucinations in MLRMs, and finds that hallucination-associated high-entropy tokens receive lower visual attention.

Motivated by these observations, the proposed Latent Entropy-Aware Decoding (LEAD) framework adaptively switches between discrete and latent semantic representations based on token-level entropy, while injecting visual guidance during high-uncertainty phases.

Extensive evaluations demonstrate that LEAD consistently strengthens reasoning reliability and significantly reduces multimodal hallucinations across both general-purpose and scientific benchmarks, offering an effective, training-free solution for improving MLRM robustness.