# Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

> LEAD reduces hallucinations in multimodal reasoning models by switching to latent decoding during high-entropy states and injecting visual anchors to refocus on the image.

- **Source:** [arXiv](https://arxiv.org/abs/2603.13366)
- **Published:** 2026-03-19
- **Permalink:** https://picx.dev/p/H9hb20
- **Whiteboard:** https://picx.dev/p/H9hb20/image

## Summary (Overview)
* **Key Insight:** Transition words (e.g., "because", "however") in Multimodal Large Reasoning Models (MLRMs) are strongly correlated with hallucinations and tend to occur during high-entropy (high uncertainty) reasoning states.
* **Core Method:** Proposed **Latent Entropy-Aware Decoding (LEAD)** – a plug-and-play decoding strategy that performs **entropy-aware reasoning mode switching**. During high-entropy states, it uses probability-weighted continuous embeddings (latent decoding) to preserve semantic diversity; during low-entropy states, it switches back to discrete token embeddings for precise convergence.
* **Visual Grounding Enhancement:** Introduces a **prior-guided visual anchor injection** strategy during high-entropy phases to encourage the model to refocus on visual information, countering the observed lower visual attention during hallucination-prone states.
* **Empirical Results:** LEAD significantly mitigates hallucinations across various MLRMs (R1-Onevision, Vision-R1, VL-Rethinker, VL-Cogito, OpenVLThinker) on multiple general and scientific multimodal reasoning benchmarks, improving performance while maintaining or enhancing reasoning efficiency.

## Introduction and Theoretical Foundation
**Background:** Recent Multimodal Large Reasoning Models (MLRMs) integrate visual understanding with linguistic reasoning chains but remain highly prone to **hallucinations**, that is, generated content that contradicts the visual evidence or is logically inconsistent. Existing mitigation methods often rely on costly training adjustments or on generic decoding strategies.

**Motivation & Observation:** The authors observe that **transition words** (which structure reasoning chains) frequently coincide with hallucinations. Analyzing token-level uncertainty via **entropy**, they find these transition words consistently exhibit **higher entropy**, marking high-uncertainty stages in reasoning. During these phases, semantic divergence and competition among potential reasoning paths increase hallucination risk.

**Core Hypothesis:** Reliance on **discrete textual inputs** encourages sequential, explicit reasoning, underutilizing dense contextual cues during high-entropy stages. **Richer semantic representations** constructed from the full token probability distribution can enhance contextual reasoning capability.

**Empirical Support:** Token masking ablation experiments show:
* Masking high-entropy tokens causes significant performance drop, indicating they are critical informational nodes.
* Early high-entropy tokens have the strongest influence on the final reasoning trajectory.
* High-entropy tokens associated with hallucinations exhibit **lower visual attention ratios** compared to non-hallucinated high-entropy tokens.
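
The visual attention ratio in the last finding can be illustrated with a small sketch. The source does not give the exact definition, so the version below is an assumption: the share of one generated token's attention mass that falls on the vision tokens, which are assumed to occupy the first `num_vision_tokens` positions of the input. The function name is hypothetical.

```python
def visual_attention_ratio(attn_weights, num_vision_tokens):
    """Share of attention mass on the vision tokens (assumed definition).

    attn_weights: attention weights of one generated token over the input
    sequence, with the vision tokens assumed to come first.
    """
    total = sum(attn_weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    return sum(attn_weights[:num_vision_tokens]) / total


# Two vision positions followed by two text positions.
ratio = visual_attention_ratio([0.1, 0.2, 0.3, 0.4], num_vision_tokens=2)
```

Under this definition, a hallucination-prone high-entropy token would show a smaller ratio than a non-hallucinated one.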

**Theoretical Inspiration:** The method is inspired by **superposed representation theory**, proposing to leverage **latent superposed reasoning** to integrate multiple candidate semantics and maintain latent reasoning trajectories.

## Methodology

### 3.1. MLRMs Generation
**Input Processing:** An MLRM accepts an image and text. The image is processed by a vision encoder and projected into vision tokens $x_v = \{ x_{v,1}, x_{v,2}, \dots, x_{v,N} \}$. Text is tokenized into text tokens $x_t = \{ x_{t,1}, x_{t,2}, \dots, x_{t,M} \}$. The complete multimodal input sequence is $x = x_v \oplus x_t = \{ x_i \}_{i=1}^T$, where $T = N + M$.

**Autoregressive Generation:** The backbone LLM $R_\theta$ predicts the next token distribution at each step $t$:
$$
p_t = R_\theta(\cdot | x, y_{<t}) \in \Delta^{|V|-1},
$$
where $y_{<t} = (y_1, y_2, ..., y_{t-1})$ are previously generated tokens, $V$ is the vocabulary, and $\Delta^{|V|-1}$ is the probability simplex.

**Discrete Reasoning Decoding:** Standard approach. At reasoning step $t$, the model computes distribution $p_t$ based on embeddings $e(x)$ and $e(r_{<t})$, and samples token $r_t$:
$$
p_t = R_\theta(e(x), e(r_{<t})), \quad r_t \sim p_t, \quad r_t \in V.
$$

**Latent Reasoning Decoding:** Proposed alternative to preserve distributional information. Instead of sampling a discrete token, a **probability-weighted embedding** is formed and fed back:
$$
\tilde{e}_t = \mathbb{E}_{v \sim p_t}[e(v)],
$$
where $\mathbb{E}$ denotes expectation under $p_t$, and $e(v)$ is the embedding of token $v$.
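
The two feedback options above can be contrasted in a minimal pure-Python sketch. The embedding table `E` (one row per vocabulary token) and the function names are illustrative assumptions, not the authors' implementation.

```python
import random


def discrete_embedding(p, E, rng):
    """Discrete decoding: sample one token from p and return its embedding row."""
    token = rng.choices(range(len(p)), weights=p, k=1)[0]
    return token, list(E[token])


def latent_embedding(p, E):
    """Latent decoding: expectation of e(v) under p, i.e. sum_v p[v] * E[v]."""
    dim = len(E[0])
    return [sum(p[v] * E[v][j] for v in range(len(p))) for j in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p = [0.5, 0.5, 0.0]

token, e_discrete = discrete_embedding(p, E, random.Random(0))
e_latent = latent_embedding(p, E)  # blends the rows of tokens 0 and 1
```

When `p` is one-hot, the two options coincide, which is why latent decoding only changes behaviour when the distribution is spread over several candidates.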

### 3.2. Entropy-Aware Reasoning Mode Switching
**Entropy Calculation:** Token-level entropy $H_t$ measures uncertainty at step $t$:
$$
H_t = -\sum_v p_t[v] \log p_t[v],
$$
where $p_t[v]$ is the predicted probability of token $v$.
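
As a quick illustration of the formula, the sketch below computes $H_t$ for a list of probabilities, using the natural logarithm and skipping zero-probability tokens (their contribution is zero in the limit). The function name is an illustrative choice.

```python
import math


def token_entropy(p):
    """Shannon entropy (in nats) of a next-token distribution p."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]  # maximal uncertainty over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # one dominant candidate

h_uniform = token_entropy(uniform)  # equals log(4), the maximum for |V| = 4
h_peaked = token_entropy(peaked)    # much smaller than h_uniform
```

The entropy ranges from 0 for a one-hot distribution to $\log |V|$ for a uniform one, so a threshold on $H_t$ separates confident steps from uncertain ones.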

**Mode Switch Criterion:** Let $\hat{H}$ be a dynamic reference entropy threshold. The next-step input embedding $\tilde{e}_t$ is determined by:
$$
\tilde{e}_t = 
\begin{cases}
e(r_t), & \text{if } H_t < \hat{H} \text{ (Uncertainty drops)}, \\
\mathbb{E}_{v \sim p_t}[e(v)], & \text{otherwise (Uncertainty rises)}.
\end{cases}
$$
Low entropy → discrete embeddings (deterministic). High entropy → probability-weighted embeddings (preserve diversity).
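
A minimal sketch of this criterion, assuming a toy embedding table `E` and an already sampled token `sampled_token`. The names are hypothetical, and `h_ref` stands in for $\hat{H}$, which is held fixed here for simplicity.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of distribution p."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def next_input_embedding(p, E, sampled_token, h_ref):
    """Pick the next-step input embedding from the entropy of p.

    Below the reference entropy: the embedding of the sampled token.
    Otherwise: the probability-weighted mixture of all embedding rows.
    """
    if entropy(p) < h_ref:
        return "discrete", list(E[sampled_token])
    dim = len(E[0])
    mixed = [sum(p[v] * E[v][j] for v in range(len(p))) for j in range(dim)]
    return "latent", mixed


E = [[1.0, 0.0], [0.0, 1.0]]
confident = next_input_embedding([0.99, 0.01], E, sampled_token=0, h_ref=0.5)
uncertain = next_input_embedding([0.5, 0.5], E, sampled_token=0, h_ref=0.5)
```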

**Persistence Window:** To prevent rapid oscillation, a persistence window is enforced for transitions from discrete (D) to latent (L) mode. Define gating variables:
$$
g^D_t = \mathbf{1}[H_t < \hat{H}],
$$
$$
g^L_t = \mathbf{1}[(H_t > \hat{H}) \land (\rho_t \geq W_{D\to L})],
$$
where $\mathbf{1}[\cdot]$ is the indicator, $\rho_t$ counts consecutive steps in current mode, and $W_{D\to L}$ is the minimum steps before D→L switch. The mode transition rule is:
$$
m_{t+1} = g^D_t D + g^L_t L + (1 - g^D_t - g^L_t) m_t.
$$
When a transition occurs, $\hat{H} \gets H_t$ and $\rho_t$ is reset to 0.

**Switch Count Regulation:** A global switch counter $C_t$ with upper bound $C_{max}$ limits total mode transitions to prevent overthinking.
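
The gating rule, threshold update, and switch budget can be traced on a sequence of entropies with the sketch below. It is an interpretation under stated assumptions: decoding starts in discrete mode, $\rho_t$ is incremented on every step without a switch, and once $C_{max}$ switches have been used the mode is frozen. The source only says the counter limits transitions, so the freezing behaviour is an assumption.

```python
DISCRETE, LATENT = "D", "L"


def run_mode_schedule(entropies, h_ref, window, c_max, mode=DISCRETE):
    """Return the mode chosen after each step and the number of switches used."""
    rho = 0       # consecutive steps spent in the current mode
    switches = 0  # global switch counter C_t
    modes = []
    for h in entropies:
        g_d = h < h_ref                         # gate towards discrete mode
        g_l = (h > h_ref) and (rho >= window)   # gate towards latent mode
        if g_d:
            candidate = DISCRETE
        elif g_l:
            candidate = LATENT
        else:
            candidate = mode                    # neither gate fires: keep mode
        if candidate != mode and switches < c_max:
            mode = candidate
            h_ref = h                           # threshold update on a switch
            rho = 0
            switches += 1
        else:
            rho += 1
        modes.append(mode)
    return modes, switches


modes, used = run_mode_schedule([0.2, 1.5, 1.6, 1.7, 0.1], h_ref=1.0, window=2, c_max=5)
# modes == ["D", "D", "L", "L", "D"], used == 2
```

In this trace the entropy first exceeds the threshold at the second step, but the switch to latent mode is delayed until the window of two steps in discrete mode has elapsed.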

### 3.3. Entropy-Aware Visual Anchor Injection
**Visual Anchor Vector:** Let $e_{vis}$ denote the averaged embedding of pretrained visual special tokens (e.g., `<|vision_start|>`, `<|image_pad|>`, `<|vision_end|>`).

**Injection Strategy:** At the first token $t^\star$ of each high-entropy phase (onset of latent reasoning), the visual anchor is injected:
$$
\tilde{e}_{t^\star} = (1 - \lambda) \mathbb{E}_{v \sim p_{t^\star}}[e(v)] + \lambda e_{vis},
$$
where $\lambda \in [0,1]$ controls injection strength. This one-time injection provides visual grounding to stabilize reasoning.
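
The injection is a convex combination of the latent embedding and the anchor, sketched below with plain lists. The special-token embeddings are toy values and the helper names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors (used to build e_vis)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def inject_visual_anchor(latent, e_vis, lam):
    """Return (1 - lam) * latent + lam * e_vis for lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * a + lam * b for a, b in zip(latent, e_vis)]


# Toy embeddings standing in for the visual special tokens.
special_token_embeddings = [[0.0, 2.0], [0.0, 4.0]]
e_vis = mean_vector(special_token_embeddings)            # [0.0, 3.0]
anchored = inject_visual_anchor([1.0, 0.0], e_vis, 0.4)  # about [0.6, 1.2]
```

Setting `lam = 0` recovers the plain latent embedding, and `lam = 1` replaces it entirely with the anchor.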

**Algorithm Pseudocode (Key Excerpt):**
```python
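import torch

def LEAD_step(logits, E):
    # NOTE: excerpt-level pseudocode. eps, tau, prev, LATENT/DISCRETE (assumed
    # to be 1/0 mode flags), vis_injected, vis_emb, switch_count, c, ter_emb and
    # K are decoder state and helpers that this excerpt does not define.
    # probability geometry: next-token distribution and its entropy
    p = torch.softmax(logits, dim=-1)
    H = -(p * (p + eps).log()).sum()
    # mode transition with threshold update (tau <- H whenever the mode flips);
    # the persistence-window gating of Section 3.2 is not shown here
    mode = torch.where(H >= tau, LATENT, DISCRETE)
    switched = (mode != prev)
    tau = torch.where(switched, H, tau)
    # latent embedding construction; as in the excerpt, p is rescaled by its
    # L2 norm before mixing the rows of E (shape: vocab x dim)
    p = p / ((p**2).sum().sqrt() + eps)
    base = mode * (p @ E) + (1 - mode) * E[p.argmax()]
    # visual injection on latent embedding
    inject = base + vis_injected * vis_emb
    # last embedding based on termination condition
    last_embedding = K(switch_count, c, ter_emb, inject)
    return last_embedding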
```

## Empirical Validation / Results

### 4.1. Experimental Setup
* **Models:** R1-Onevision-7B, Vision-R1-7B, VL-Rethinker-7B, VL-Cogito-7B, OpenVLThinker-7B.
* **Benchmarks:** General Reasoning & Understanding (MMEval-Pro, MMVP, RealWorldQA, VMCBench, VStar); Hallucination Assessment (Bingo, MMHalu, POPE); Domain-Specific (Mathematical: MathVision, MathVista, MathVerse, VisuLogic, Geometry3K, MMK12-Math; Scientific: MMK12-Physics, Chemistry, Biology).
* **Baselines:** VCD, MemVR, SID.
* **Implementation:** Default switching count maximum $C_{max} = 5$.

### 4.2. Ablation Study
**Effect of Entropy Threshold:** Dynamic thresholding yields best performance. Fixed thresholds (too high or too low) degrade performance.
> **Figure 5:** Comparisons on MMHalu and Bingo datasets show dynamic thresholding (∆) improves scores by +4.7% (R1-Onevision) and +4.1% (Vision-R1) versus fixed thresholds.

**Effect of Switching Window Size:** Performance improves as the window size grows to 128, then declines. An infinite window ($\infty$) regresses to standard CoT performance, since the switch from discrete to latent mode is then never permitted.
> **Figure 6:** (a) MMHalu and (b) Bingo scores for R1-Onevision and Vision-R1 under window sizes 64, 128, 256, ∞.

**Effect of Visual Anchor Injection Strength $\lambda$:** Performance peaks at $\lambda = 0.4$ across datasets.

**Table 1: Effect of visual anchor injection strength λ on overall performance.**
| Model | $\lambda$ | VStar | MMEval-Pro | MMHalu | Bingo |
|-------|-----------|-------|------------|--------|-------|
| R1-Onevision-7B | 0 | 67.5 | 71.9 | 3.59 | 3.74 |
| | 0.2 | 69.6 | 72.0 | 3.66 | 3.73 |
| | 0.4 | 71.2 | 73.9 | 3.80 | 3.84 |
| | 0.6 | 68.1 | 73.3 | 3.77 | 3.76 |
| Vision-R1-7B | 0 | 79.1 | 72.7 | 3.69 | 3.68 |
| | 0.2 | 80.1 | 73.9 | 3.78 | 3.70 |
| | 0.4 | 81.7 | 75.1 | 3.89 | 3.77 |
| | 0.6 | 79.6 | 74.5 | 3.83 | 3.75 |
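
The gains implied by Table 1 can be read off with a little arithmetic. The snippet below copies the table values and computes, for each model, the improvement of $\lambda = 0.4$ over no injection ($\lambda = 0$) and the best $\lambda$ per benchmark. The dictionary layout and function names are illustrative.

```python
COLUMNS = ("VStar", "MMEval-Pro", "MMHalu", "Bingo")

# Values copied from Table 1, keyed by model and injection strength.
TABLE_1 = {
    "R1-Onevision-7B": {
        0.0: (67.5, 71.9, 3.59, 3.74),
        0.2: (69.6, 72.0, 3.66, 3.73),
        0.4: (71.2, 73.9, 3.80, 3.84),
        0.6: (68.1, 73.3, 3.77, 3.76),
    },
    "Vision-R1-7B": {
        0.0: (79.1, 72.7, 3.69, 3.68),
        0.2: (80.1, 73.9, 3.78, 3.70),
        0.4: (81.7, 75.1, 3.89, 3.77),
        0.6: (79.6, 74.5, 3.83, 3.75),
    },
}


def gain_over_no_injection(model, lam):
    """Per-benchmark difference between strength lam and lam = 0."""
    rows = TABLE_1[model]
    return tuple(round(a - b, 2) for a, b in zip(rows[lam], rows[0.0]))


def best_lambda(model, column):
    """Injection strength with the highest score on the given benchmark."""
    idx = COLUMNS.index(column)
    rows = TABLE_1[model]
    return max(rows, key=lambda lam: rows[lam][idx])


r1_gain = gain_over_no_injection("R1-Onevision-7B", 0.4)
vr1_gain = gain_over_no_injection("Vision-R1-7B", 0.4)
```

For R1-Onevision-7B the differences are 3.7 (VStar), 2.0 (MMEval-Pro), 0.21 (MMHalu), and 0.10 (Bingo); for Vision-R1-7B they are 2.6, 2.4, 0.20, and 0.09. In every column the best score is reached at $\lambda = 0.4$.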

**Qualitative Analysis:** LEAD allocates higher visual attention to query-relevant regions vs. Baseline and MemVR. During latent reasoning, token distribution is more dispersed (higher entropy); during discrete reasoning, distribution approaches one-hot (lower entropy).

### 4.3. Comparison with State-of-the-Art Methods
**General Reasoning & Hallucination Benchmarks:** LEAD consistently improves performance across all models.

**Table 2: Comparisons of different MLRMs with LEAD across general reasoning and hallucination benchmarks.**
*(Table shows accuracy for general benchmarks, scores for MMHalu (0-6) and Bingo (1-5). LEAD improves R1-Onevision by +4.7% on MMHalu and +3.8% on Bingo.)*

**Domain-Specific Reasoning Benchmarks:** LEAD improves average accuracy by +2.0% on mathematics and +3.2% on scientific benchmarks.

**Table 3: Comparisons of different MLRMs with LEAD across mathematical and scientific visual reasoning benchmarks.**
*(Table shows accuracy improvements across all mathematical and scientific subsets.)*

**GPT-5 Assisted Evaluation:** LEAD preserves text quality (grammar, fluency, naturalness) and shows lower perplexity (PPL) compared to baselines.

**Reasoning Efficiency:** LEAD produces shorter reasoning chains while achieving the highest accuracy (evaluated on MathVision with R1-Onevision).

> **Figure 9:** Comparisons of accuracy and reasoning length. LEAD achieves highest accuracy (~32.4%) with shortest average token length (~460).

**Pass@k Performance:** LEAD reaches peak accuracy at smaller $k$ values, indicating higher sample efficiency and greater diversity/correctness in reasoning.

> **Figure 10:** Pass@k accuracy on RealWorldQA and MathVista for $k \in [4, 32]$. LEAD shows steeper increase and higher final accuracy.
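
The source does not say how Pass@k was estimated. One widely used choice is the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them are correct. The sketch below implements that estimator as an assumption about the evaluation, not as the authors' protocol.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# 32 samples, 4 correct: the estimate grows as k increases.
curve = [pass_at_k(32, 4, k) for k in (4, 8, 16, 32)]
```

A method whose curve saturates at smaller $k$ needs fewer samples to find a correct answer, which is the sense of sample efficiency used above.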

## Theoretical and Practical Implications
**Theoretical Implications:** 
* Provides a novel perspective on hallucination mitigation by linking it to **token-level uncertainty (entropy)** and **transition words**.
* Introduces the concept of **latent superposed reasoning** for MLRMs, leveraging full probability distributions to maintain semantic diversity during uncertain phases.
* Demonstrates the importance of **adaptive reasoning mode switching** based on intrinsic confidence signals.

**Practical Implications:**
* **LEAD is a lightweight, plug-and-play decoding strategy** that can be integrated into existing MLRMs without additional training costs.
* It **significantly reduces multimodal hallucinations** across a wide range of benchmarks and model architectures.
* The method **improves reasoning efficiency** (shorter chains, higher sample efficiency) while enhancing accuracy.
* The visual anchor injection mechanism provides a simple way to **enhance visual grounding** during uncertain reasoning, addressing a key weakness observed in hallucination-prone states.

## Conclusion
The paper identifies a strong correlation between **transition words, high-entropy states, and hallucinations** in MLRMs, and finds that hallucination-associated high-entropy tokens receive **lower visual attention**. 

Motivated by these observations, the proposed **Latent Entropy-Aware Decoding (LEAD)** framework adaptively switches between discrete and latent semantic representations based on token-level entropy, while injecting visual guidance during high-uncertainty phases. 

**Extensive evaluations** demonstrate that LEAD consistently strengthens reasoning reliability and significantly reduces multimodal hallucinations across both general-purpose and scientific benchmarks, offering an effective, training-free solution for improving MLRM robustness.

---

_Markdown view of https://picx.dev/p/H9hb20, served by PicX — AI-generated visual whiteboard summaries of research papers._