Large Language Models Explore by Latent Distilling

Summary (Overview)

  • Proposes Exploratory Sampling (ESamp): A novel decoding method that encourages semantic exploration in LLMs by using a lightweight online Latent Distiller (LD) to model and penalize predictable internal representation mappings.
  • Breaks the diversity-coherence trade-off: Empirical results show ESamp promotes higher semantic diversity (measured by Vendi Score) while maintaining or improving linguistic quality (lower perplexity) compared to standard stochastic and heuristic baselines.
  • Enhances test-time scaling efficiency: ESamp significantly boosts the Pass@k performance of reasoning models, often achieving comparable results to baselines with a much smaller sampling budget (e.g., Pass@8 vs. Pass@64).
  • Features a practical, low-overhead implementation: An asynchronous training-inference pipeline ensures the method incurs minimal latency overhead (optimized to ~1.2% in open-source release), making it suitable for large-scale deployment.

Introduction and Theoretical Foundation

Training-free test-time scaling methods, which generate multiple candidate solutions and apply selection mechanisms (e.g., majority voting), have proven effective for enhancing LLM reasoning. However, their success is fundamentally limited by the diversity of underlying reasoning strategies in the candidate set. Standard stochastic sampling (e.g., temperature, Top-p) often yields only surface-level lexical variation, not genuine semantic or strategic diversity, leading to redundant solutions and diminishing returns.

Existing approaches to increase diversity, such as structured search (e.g., Tree of Thoughts) or heuristic sampling constraints, face significant computational overhead or remain limited in eliciting novel reasoning. This paper argues that effective test-time scaling requires a mechanism to efficiently encourage novelty in the model's underlying reasoning behavior.

Theoretical Motivation: The method is grounded in the observation, inspired by Random Network Distillation (RND), that neural networks make more accurate predictions on familiar inputs and exhibit higher error on novel ones. ESamp leverages this property by training a lightweight Distiller online to predict the LLM's deep-layer hidden states from its shallow-layer states. High prediction error signals an under-explored semantic or reasoning pattern.

Problem Formulation: Generation is modeled as a Markov Decision Process (MDP). The goal is to optimize a policy $\pi$ that maximizes a per-step intrinsic novelty reward $r(s_t, z_t)$ while staying close to the base LLM policy $\pi_{\text{ref}}$ via KL regularization:

$$J(\pi) = \mathbb{E}_\pi\left[r(s_t, z_t)\right] - \alpha\,\mathrm{KL}\!\left(\pi(\cdot \mid s_t)\,\|\,\pi_{\text{ref}}(\cdot \mid s_t)\right)$$

This objective admits a closed-form optimal policy:

$$\pi^*(z \mid s) \propto \pi_{\text{ref}}(z \mid s)\,\exp\!\left(\frac{1}{\alpha}\, r(s, z)\right)$$

The core challenge is constructing an online estimator for the novelty reward $r(s, z)$.
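Concretely, the optimal policy is an exponential tilt of the reference distribution by the reward. A minimal numpy sketch on a toy four-token vocabulary (the probability and reward values are illustrative, not from the paper):

```python
import numpy as np

def tilted_policy(pi_ref, reward, alpha=1.0):
    """Closed-form optimum of the KL-regularized objective:
    pi*(z|s) proportional to pi_ref(z|s) * exp(r(s, z) / alpha)."""
    logits = np.log(pi_ref) + reward / alpha
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

# Toy 4-token vocabulary; the third token carries a "novelty" reward.
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
reward = np.array([0.0, 0.0, 1.0, 0.0])

pi_star = tilted_policy(pi_ref, reward, alpha=0.5)
# The rewarded token's probability rises; the distribution still sums to 1.
```

A smaller $\alpha$ (i.e., larger $1/\alpha$) amplifies the tilt toward high-reward tokens, while $\alpha \to \infty$ recovers the base policy.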

Methodology

The ESamp method consists of three key components: the Latent Distiller for novelty estimation, a novelty-driven generation mechanism, and an asynchronous implementation pipeline.

1. Novelty Estimation via Latent Distiller

Instead of operating in token space, ESamp grounds exploration in the LLM's internal representation space. A lightweight Multi-Layer Perceptron (MLP), the Latent Distiller $f_\phi$, is trained online to predict the deep-layer (final-layer) representation $h_t^L$ from the shallow-layer (e.g., first-layer) hidden representation $h_t^1$:

$$\hat{h}_t^L = f_\phi(h_t^1)$$

The distiller is trained by minimizing the Mean Squared Error (MSE) over representations encountered during generation:

$$\mathcal{L}(\phi) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left\| h_{t,i}^L - f_\phi(h_{t,i}^1) \right\|_2^2$$

As training proceeds, low prediction error indicates a semantically redundant (familiar) mapping, while high error signals novelty.
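The online distilling loop can be sketched in pure numpy. The two-layer MLP below uses toy dimensions (not the paper's architecture) and is trained with SGD on a single repeatedly visited representation pair; its squared error, which serves as the novelty signal, shrinks as the pair becomes familiar:

```python
import numpy as np

rng = np.random.default_rng(0)
D_SHALLOW, D_DEEP, HIDDEN = 16, 16, 64  # toy sizes, not the paper's

# Latent Distiller f_phi: small 2-layer MLP mapping shallow -> deep states.
W1 = rng.normal(0, 0.1, (HIDDEN, D_SHALLOW)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (D_DEEP, HIDDEN));    b2 = np.zeros(D_DEEP)

def distill(h1):
    a = np.maximum(W1 @ h1 + b1, 0.0)   # ReLU hidden layer
    return W2 @ a + b2, a

def train_step(h1, hL, lr=1e-2):
    """One online MSE step; returns the pre-update squared error (novelty)."""
    global W1, b1, W2, b2
    pred, a = distill(h1)
    err = pred - hL                      # MSE gradient direction
    # Manual backprop through the 2-layer MLP.
    gW2 = np.outer(err, a); gb2 = err
    da = W2.T @ err; da[a <= 0] = 0.0    # ReLU derivative mask
    gW1 = np.outer(da, h1); gb1 = da
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
    return float(err @ err)

# A repeatedly visited (familiar) pair: the error should shrink over steps.
h1 = rng.normal(size=D_SHALLOW); hL = rng.normal(size=D_DEEP)
errors = [train_step(h1, hL) for _ in range(200)]
```

A novel representation pair would instead produce a large error on its first visit, which is exactly the signal ESamp converts into a sampling bonus.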

2. Novelty-Driven Generation

To incorporate the novelty signal into decoding, both the true deep-layer state $h_t^L$ and the predicted state $\hat{h}_t^L$ are projected into vocabulary space using the frozen LM head $W_{\text{head}}$:

$$\pi_{\text{ref}} = \mathrm{softmax}(W_{\text{head}}\, h_t^L), \quad q_{\text{dist}} = \mathrm{softmax}(W_{\text{head}}\, \hat{h}_t^L)$$

The intrinsic reward is defined as the log-likelihood ratio $r(s, z) = \log \pi_{\text{ref}}(z \mid s) - \log q_{\text{dist}}(z \mid s)$. Substituting into the optimal policy formula (with $\beta = 1/\alpha$) yields the ESamp sampling distribution:

$$\pi_{\text{new}}(z \mid s) \propto \pi_{\text{ref}}(z \mid s)^{1+\beta}\, q_{\text{dist}}(z \mid s)^{-\beta}$$

In logit space, this is equivalent to:

$$\mathrm{logit}_{\text{new}} = \mathrm{logit}_{\text{ref}} + \beta\left(\mathrm{logit}_{\text{ref}} - \mathrm{logit}_{\text{dist}}\right) = (1+\beta)\,\mathrm{logit}_{\text{ref}} - \beta\,\mathrm{logit}_{\text{dist}}$$

Let $e_t = h_t^L - \hat{h}_t^L$ be the latent error vector, and let $w_z$ be the LM-head row for token $z$. The change in the logit of token $z$ can then be expressed as:

$$\Delta \mathrm{logit}_z = \beta\, w_z \cdot e_t = \beta\, \|w_z\|_2\, \|e_t\|_2 \cos(w_z, e_t)$$

This highlights that the adjustment is driven both by the magnitude of context novelty ($\|e_t\|_2$) and by the semantic direction ($\cos(w_z, e_t)$) of the unpredicted representation component.
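Both the logit-space equivalence and the error-vector decomposition can be verified numerically. A small numpy sketch with a random toy LM head (all dimensions and states are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 10, 8          # toy vocabulary and hidden sizes
beta = 0.25           # exploration strength (the paper's default)

W_head = rng.normal(size=(V, D))   # frozen LM head
h_L = rng.normal(size=D)           # true deep-layer state
h_hat = rng.normal(size=D)         # distiller's prediction

logit_ref = W_head @ h_L
logit_dist = W_head @ h_hat
logit_new = (1 + beta) * logit_ref - beta * logit_dist

def softmax(x):
    w = np.exp(x - x.max())
    return w / w.sum()

# Equivalence check: softmax of the fused logits matches the tilted policy
# pi_ref^(1+beta) * q_dist^(-beta) after renormalization.
pi_ref, q_dist = softmax(logit_ref), softmax(logit_dist)
tilted = pi_ref ** (1 + beta) * q_dist ** (-beta)
assert np.allclose(softmax(logit_new), tilted / tilted.sum())

# Decomposition check: delta logit_z = beta * (w_z . e_t), e_t = h_L - h_hat.
e_t = h_L - h_hat
delta = logit_new - logit_ref
assert np.allclose(delta, beta * (W_head @ e_t))
```

The normalization constants of the two softmaxes cancel under renormalization, which is why the fusion can be applied directly to raw logits.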

3. Collaborative Exploration & Asynchronous Implementation

In parallel generation of $K$ sequences, the shared Distiller acts as a coordination channel. When one sequence explores a semantic pattern, the Distiller learns it and, through the fused sampling distribution, suppresses the probability that other sequences revisit the same pattern, yielding efficient batch-level exploration.
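This coordination effect can be illustrated with a toy shared distiller (a single linear map standing in for the MLP): once one sequence's pattern has been distilled, the same pattern yields a near-zero novelty reward for every other sequence in the batch:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8
W = np.zeros((D, D))  # shared linear distiller (toy stand-in for the MLP)

def novelty(h1, hL):
    """Squared prediction error: the intrinsic novelty signal."""
    return float(np.sum((hL - W @ h1) ** 2))

def update(h1, hL, lr=0.02, steps=100):
    """Online SGD on the MSE objective for one representation pair."""
    global W
    for _ in range(steps):
        W -= lr * np.outer(W @ h1 - hL, h1)

# Sequence 1 explores a pattern (h1, hL); the shared distiller absorbs it.
h1 = rng.normal(size=D); hL = rng.normal(size=D)
before = novelty(h1, hL)
update(h1, hL)
after = novelty(h1, hL)
# Any other sequence revisiting the same pattern now sees a tiny novelty
# reward, so its fused logits are steered away from this region.
```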

Asynchronous Pipeline: To minimize overhead, the Distiller's forward pass is triggered after the LLM's first layer and runs concurrently with the rest of the LLM's forward pass. The Distiller's backward pass/update is deferred to the post-processing (CPU-bound) interval. This design decouples the Distiller from the critical generation path.
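A stdlib-only sketch of this overlap, with sleeps standing in for compute and all helper names hypothetical (the paper's actual pipeline uses GPU streams, not Python threads):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def distiller_forward(h1):
    """Cheap MLP forward; runs concurrently with the remaining LLM layers."""
    time.sleep(0.01)   # stand-in for the small distiller compute
    return [x * 0.5 for x in h1]

def llm_remaining_layers(h1):
    """Stand-in for layers 2..L of the LLM forward pass."""
    time.sleep(0.05)
    return [x + 1.0 for x in h1]

pending_updates = []  # distiller training deferred to CPU-bound intervals

def decode_step(h1, pool):
    # Launch the distiller right after layer 1; overlap with layers 2..L.
    fut = pool.submit(distiller_forward, h1)
    h_L = llm_remaining_layers(h1)
    h_hat = fut.result()               # typically ready before layers finish
    pending_updates.append((h1, h_L))  # backward pass runs off the hot path
    return h_L, h_hat

with ThreadPoolExecutor(max_workers=1) as pool:
    h_L, h_hat = decode_step([1.0, 2.0], pool)
```

Because the distiller's forward latency is hidden behind the larger LLM forward and its update is deferred, the critical generation path is nearly unchanged.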

Algorithm 1 outlines the decode-step procedure, integrating forward computation, logit fusion, online Distiller training, and sampling within the asynchronous streams.

Empirical Validation / Results

Experiments were conducted across mathematics (AIME 2024/2025), science (GPQA-Diamond), code generation (LiveCodeBench v5), and creative writing (BookCorpus) using models like Qwen2.5-7B/32B-Instruct, Qwen3-8B, and GPT-OSS-20B.

Key Results

1. Pass@k Performance: ESamp demonstrates superior or comparable Pass@k scaling to strong baselines (Vanilla, Min-p, FIRE, OverRIDE, Contrastive Decoding, Tree of Thoughts), particularly excelling with reasoning models. It often matches the high-k performance of baselines with a significantly lower sampling budget.

2. Diversity and Quality Trade-off: ESamp breaks the typical trade-off between coherence and diversity.

Table 1: Diversity and Quality Evaluation (Creative Writing)

| Method | Vendi ↑ | Sim. ↓ |
|---|---|---|
| Vanilla | 1.62 | 0.58 |
| Min-P | 1.56 | 0.62 |
| OverRIDE | 1.61 | 0.59 |
| ESamp (Ours) | 1.67 | 0.57 |

ESamp achieves the highest semantic diversity (Vendi Score), lowest semantic similarity, and best generation quality (lowest perplexity) in creative writing, alongside superior Pass@16 and diversity in math reasoning.

3. Generation Dynamics: Analysis shows that while baseline methods plateau in semantic divergence among parallel sequences, ESamp maintains a continuous downward trend in pairwise cosine similarity throughout generation, confirming sustained exploration.

4. Efficiency Analysis: The asynchronous implementation introduces negligible overhead in standard serving scenarios.

Table: Efficiency comparison on an RTX 4090 GPU (Qwen3-8B)

| Scenario ($B \times K$) | Vanilla | ESamp | Overhead (%) |
|---|---|---|---|
| $B = 1, K = 1$ | 55.1 | 54.9 | 0.3% |
| $B = 32, K = 1$ | 1215.2 | 1193.2 | 1.81% |
| $B = 32, K = 16$ | 4557.7 | 4364.0 | 4.25% |

5. Sensitivity Analysis: Ablation studies on the exploration strength $\beta$ and the logit-fusion formulation confirm the robustness of the default settings ($\beta = 0.25$ with fusion $(1+\beta)\,\mathrm{logit}_{\text{ref}} - \beta\,\mathrm{logit}_{\text{dist}}$).

Table 2: Sensitivity analysis of ESamp on AIME25 (Qwen2.5-7B-Instruct)

| Exploration Strength | Pass@1 | Pass@2 | Pass@4 | Pass@8 | Pass@16 | Pass@32 | Pass@64 |
|---|---|---|---|---|---|---|---|
| $\beta = 0.1$ | 6.0% | 10.4% | 16.0% | 21.5% | 26.7% | 32.3% | 40.0% |
| $\beta = 0.25$ (default) | 6.0% | 10.4% | 16.5% | 23.8% | 31.7% | 39.5% | 46.7% |
| $\beta = 0.5$ | 4.8% | 8.5% | 13.4% | 18.8% | 23.8% | 28.1% | 30.0% |

Theoretical and Practical Implications

Theoretical: ESamp provides a principled framework for test-time exploration in LLMs by formalizing generation as a KL-regularized optimization problem with a novelty reward derived from internal representation predictability. It demonstrates that coordinating exploration via a shared, online-learned model of the LLM's own internal dynamics is an effective and efficient strategy.

Practical: The method offers a deployable solution for improving the efficiency of test-time scaling (e.g., in reasoning, code generation, and creative writing) without materially increasing latency. Its robust performance across diverse models and tasks suggests wide applicability. The open-source implementation within the tLLM framework lowers the barrier to adoption and to further research into runtime-adaptation algorithms.

Conclusion

Exploratory Sampling (ESamp) addresses the critical limitation of surface-level diversity in standard LLM decoding. By using an online Latent Distiller to estimate novelty in the model's internal representation space, ESamp effectively steers generation toward under-explored semantic regions. Empirical results confirm that ESamp enhances semantic diversity, improves test-time scaling efficiency (particularly for reasoning models), and maintains generation quality, all with negligible computational overhead due to its asynchronous design. This establishes ESamp as a practical and effective method for enabling deeper exploration in large language models.