Summary (Overview)
- The paper proposes an axiomatic framework for evaluating latent thought representations in LLMs, consisting of four functional axioms: Causality, Minimality, Separability, and Stability.
- For each axiom, a quantitative measure is defined that can be computed directly on the source LLM without retraining and without relying on downstream task accuracy.
- An empirical audit across five open-weight LLMs (dense, MoE, reasoning-distilled, RL-trained) on the 23 BBEH reasoning tasks reveals that no candidate thought representation satisfies all four axioms simultaneously.
- The representations distinguish task type reliably but fail to distinguish between two questions within the same task (near-random same-task accuracy), exposing a representational collapse that benchmark accuracy masks.
- The input embedding (IE) itself is competitive with all candidate thought representations on every axis, indicating that current latent representations add little information beyond what is already present in the prompt.
Introduction and Theoretical Foundation
Background and Motivation
Reasoning in LLMs has moved from discrete Chain-of-Thought (CoT) tokens toward continuous latent representations (e.g., Soft Thinking, Latent Thinking). However, current evaluations assess these representations almost exclusively through downstream task accuracy, conflating representation quality with model capacity. This makes it impossible to attribute failures to the representation itself rather than the model that processes it.
The authors identify three gaps:
- No principled definition of what a thought representation must do.
- No intrinsic evaluation that measures representation quality independently of downstream accuracy.
- The status of current methods is unknown—latent reasoning tokens are often unnecessary and do not implement structured reasoning.
Formal Definition of a Functional Thought
A thought representation is defined not as a communicable linguistic artifact but as a Functional Thought: a latent state that mediates the transformation from an input space $\mathcal{X}$ to a semantic output space $\mathcal{S}$.
Definition 1 (Idealized Thought Representation Mapping): Let $\mathcal{X}$ denote the input space and $\mathcal{S}$ the semantic output space of a model. A thought representation generator is a function from $\mathcal{X}$ into a thought space whose values may differ between two inputs only if those inputs yield semantically distinct outputs.
The generator is therefore a many-to-one mapping: it compresses $\mathcal{X}$ into the thought space and collapses inputs whose outputs are semantically equivalent.
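One way to write this condition down, as a sketch under assumed notation: $\mathcal{X}$ is the input space, and $f$ (the generator), $\mathcal{T}$ (the thought space), and $\sigma(x)$ (the semantic output the model assigns to input $x$) are placeholder symbols that are not necessarily the paper's. The one-directional form follows the "only if" wording of the text; the paper's exact statement may be stronger.

```latex
f : \mathcal{X} \to \mathcal{T}, \qquad
f(x_1) \neq f(x_2) \;\Longrightarrow\; \sigma(x_1) \neq \sigma(x_2)
\quad \text{for all } x_1, x_2 \in \mathcal{X}.
```

Read contrapositively, $\sigma(x_1) = \sigma(x_2)$ implies $f(x_1) = f(x_2)$: inputs with the same semantic output share one thought, which is what makes the mapping many-to-one.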
Methodology
The Four Axioms and Their Measures
The authors formalize four functional properties that a valid thought representation must satisfy. Table 1 summarizes the axioms and their measures.
| Axiom | Formal requirement | Quantitative measure |
|---|---|---|
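| 1. Causality | The model's output distribution is preserved when the reasoning prefix is replaced by the projected thought (small $D_{\text{KL}}$ between the two distributions) | KL divergence |
| 2. Minimality | The thought compresses input-irrelevant information while retaining output-relevant information | IB residual gap |
| 3. Separability | Some bounded-capacity classifier distinguishes inputs that lead to semantically different outputs | Discriminator accuracy (same-task / cross-task) |
| 4. Stability | The thought encodes the entropy of $P(\mathcal{S} \mid x)$ | Distributional Consistency Score (DCS) |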
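1. Causality: The representation must functionally substitute for the reasoning prefix within the model's computational graph. Measured by the KL divergence between the model's output distribution conditioned on the original reasoning prefix and the distribution obtained when the prefix's token embeddings are replaced with the projected thought representation.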
2. Minimality: Adapted from the Information Bottleneck (IB) principle. The representation should compress input-irrelevant information while retaining output-relevant information. The tractable surrogate is the IB residual gap; a larger gap indicates better minimality.
3. Separability: The representation must allow a bounded-capacity classifier to distinguish inputs that lead to semantically different outputs. Measured by training a binary discriminator to distinguish pairs of (representation, correct output) vs. (representation, wrong output) in two settings: same-task (same task, different questions) and cross-task (different tasks). Accuracy is the metric.
4. Stability: The representation should be invariant to lexical variations and should reflect the uncertainty of the output distribution. Measured by the Distributional Consistency Score (DCS): the AUROC of a probe predicting whether a question’s beam outputs span more than one semantic equivalence class (based on semantic entropy of Kuhn et al.).
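The causality measure compares two output distributions of the same model. A minimal sketch of that comparison, assuming both distributions are available as probability vectors over a shared vocabulary; the function name `kl_divergence`, the smoothing constant `eps`, and the toy distributions are illustrative and not taken from the paper:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must share a support")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:  # terms with p_i == 0 contribute nothing
            # eps guards against division by zero when q_i is 0 (an assumption)
            total += p_i * math.log(p_i / max(q_i, eps))
    return total

# Toy next-token distributions (hypothetical values).
p_with_prefix = [0.70, 0.20, 0.10]   # model conditioned on the reasoning prefix
p_with_thought = [0.50, 0.30, 0.20]  # prefix embeddings replaced by the projected thought

causality_error = kl_divergence(p_with_prefix, p_with_thought)
```

A value of zero would mean the projected thought reproduces the prefix's effect exactly; larger values mean the substitution changes the model's behavior more.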
Candidate Thought Representations
- Last Input Token (LIT): from all layers and final layer.
- Soft Thinking (ST): with and without Gumbel noise (STN), varying 1, 16, 32, 64, 128 steps.
- Latent Thinking (LT): varying 1, 16, 32, 64, 128 steps.
- Output Embedding (OE): Exact and Pooled (upper-bound references derived from the model's output).
- Input Embedding (IE): the prompt embedding (baseline).
- Random Vector (RV): information-free reference.
Source Models and Tasks
| Source LLM | Family | Paradigm |
|---|---|---|
| Llama-3.1 8B | Dense | Instruct |
| Llama-3.3 70B | Dense | Instruct |
| DS-R1-Qwen 32B | Dense | Reasoning-distill |
| Skywork-OR1 32B | Dense | Native RL |
| GPT-OSS 20B | Sparse MoE | Adjustable effort |
Evaluated on the 23 tasks of Big Bench Extra Hard (BBEH).
Probe Setup
- A frozen Llama-3.2-1B backbone with a trainable linear projection maps thought representations into its token-embedding space.
- Semantic similarity for DCS computed with Embed-Nemotron-8B (top MTEB model at writing).
- Beam search (8 sequences, up to 8192 tokens) approximates the high-probability region of the model's output distribution.
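The DCS pipeline can be sketched end to end on toy data: group each question's beam outputs into semantic classes, label the question by whether more than one class appears, then score a probe with AUROC. The greedy cosine-threshold clustering, the `threshold` value, the 2-D vectors, and the probe scores below are all assumptions for illustration; the paper's own clustering procedure may differ.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def count_semantic_classes(embeddings, threshold=0.9):
    """Greedy clustering: an output opens a new class unless it resembles an existing representative."""
    representatives = []
    for emb in embeddings:
        if not any(cosine(emb, rep) >= threshold for rep in representatives):
            representatives.append(emb)
    return len(representatives)

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy 2-D "embeddings" of beam outputs per question (hypothetical values).
beams = {
    "q1": [[1.0, 0.0], [0.99, 0.05]],             # outputs agree
    "q2": [[1.0, 0.0], [0.0, 1.0]],               # outputs disagree
    "q3": [[0.0, 1.0], [1.0, 0.0], [0.02, 1.0]],  # outputs disagree
    "q4": [[0.5, 0.5], [0.6, 0.4]],               # outputs agree
}
# Label 1 when the beam outputs span more than one semantic class.
labels = [int(count_semantic_classes(b) > 1) for b in beams.values()]
probe_scores = [0.2, 0.9, 0.7, 0.8]  # hypothetical probe outputs, one per question
dcs = auroc(probe_scores, labels)
```

The pairwise AUROC is quadratic in the number of questions, which is acceptable for a sketch; a rank-based implementation would be preferable at scale.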
Empirical Validation / Results
Per-Axiom Results
Causality (KL error, lower is better):
All candidates have KL substantially below the Random Vector baseline, but none is consistently lower than the Input Embedding reference (Table 4). The thought representations therefore carry no additional causal information beyond the prompt.
| LLM | OE | LIT | ST | STN | LT | IE | RV |
|---|---|---|---|---|---|---|---|
| Llama 8B | 5.21 | 5.01 | 4.96 | 4.70 | 5.32 | 5.36 | 9.49 |
| Llama 70B | 4.56 | 5.28 | 4.65 | 5.08 | 4.21 | 4.71 | 8.93 |
| DS-R1 32B | 4.67 | 4.79 | 4.45 | 4.57 | 4.62 | 4.50 | 9.36 |
| Sky-OR1 32B | 4.10 | 4.09 | 3.90 | 4.68 | 4.34 | 4.08 | 9.31 |
| GPT-OSS 20B | 3.82 | 4.19 | 4.00 | 4.17 | 3.90 | 3.78 | 9.60 |
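Minimality (IB residual gap, higher is better):
Results are mixed. The Output Embedding value is not interpretable because it violates the measure's decomposition assumptions. Among the candidates in the table below, Soft Thinking, with or without noise, is above IE on four of the five models and marginally below it on Llama 70B; LIT is below IE on every model except GPT-OSS 20B; and LT is below IE on the two Llama models but above it on the other three. No candidate consistently encodes more output-relevant compression than the prompt.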
| LLM | OE | LIT | ST | STN | LT | IE | RV |
|---|---|---|---|---|---|---|---|
| Llama 8B | 0.37 | 0.16 | 0.25 | 0.24 | 0.19 | 0.22 | -0.40 |
| Llama 70B | -0.13 | -0.30 | -0.24 | -0.24 | -0.30 | -0.23 | -0.99 |
| DS-R1 32B | 0.07 | -0.05 | 0.10 | 0.10 | 0.05 | 0.04 | -0.50 |
| Sky-OR1 32B | -0.16 | -0.27 | -0.13 | -0.14 | -0.18 | -0.21 | -0.59 |
| GPT-OSS 20B | -0.22 | -0.25 | -0.21 | -0.20 | -0.17 | -0.34 | -0.30 |
Separability (same-task accuracy, %):
Cross-task accuracy is near saturation for all candidates (task identity easily encoded). Same-task accuracy is near random (50–55%) for all candidates except Output Embedding (62–73%), indicating a severe representational collapse on per-question identity.
| LLM | OE | LIT | ST | STN | LT | IE | RV |
|---|---|---|---|---|---|---|---|
| Llama 8B | 68.8 | 53.9 | 54.7 | 53.5 | 54.7 | 54.5 | 48.9 |
| Llama 70B | 72.6 | 51.6 | 52.9 | 52.8 | 51.4 | 52.1 | 49.7 |
| DS-R1 32B | 63.5 | 52.6 | 54.8 | 51.8 | 50.3 | 53.5 | 50.3 |
| Sky-OR1 32B | 63.4 | 53.3 | 54.2 | 51.8 | 51.2 | 54.0 | 49.9 |
| GPT-OSS 20B | 62.4 | 50.4 | 50.7 | 51.8 | 51.2 | 49.5 | 51.0 |
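The gap between the two settings comes down to where the wrong output is drawn from. A minimal sketch of the pair construction, assuming the wrong output is sampled from another question of the same task (same-task) or from a question of a different task (cross-task); the function `build_pairs`, the record layout, and the toy task names are hypothetical:

```python
import random

def build_pairs(examples, setting, seed=0):
    """examples: list of dicts with keys 'task', 'rep', 'output'.
    Returns (rep, output, label) triples; label 1 = correct output, 0 = wrong output."""
    rng = random.Random(seed)
    pairs = []
    for i, ex in enumerate(examples):
        if setting == "same-task":
            pool = [o for j, o in enumerate(examples) if j != i and o["task"] == ex["task"]]
        elif setting == "cross-task":
            pool = [o for o in examples if o["task"] != ex["task"]]
        else:
            raise ValueError(f"unknown setting: {setting}")
        if not pool:
            continue  # no valid negative exists for this example
        wrong = rng.choice(pool)
        pairs.append((ex["rep"], ex["output"], 1))
        pairs.append((ex["rep"], wrong["output"], 0))
    return pairs

# Toy data: two tasks with two questions each (placeholder strings stand in for vectors).
examples = [
    {"task": "task_a", "rep": "rep_a1", "output": "a1"},
    {"task": "task_a", "rep": "rep_a2", "output": "a2"},
    {"task": "task_b", "rep": "rep_b1", "output": "b1"},
    {"task": "task_b", "rep": "rep_b2", "output": "b2"},
]
same_task_pairs = build_pairs(examples, "same-task")
cross_task_pairs = build_pairs(examples, "cross-task")
```

In the cross-task setting a discriminator can succeed by recognizing the task alone, whereas the same-task negatives can only be rejected by a representation that identifies the individual question.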
Stability (DCS AUROC, higher is better):
DCS is high for all candidates (0.85–0.97) on every model except GPT-OSS 20B, where it ranges from 0.46 to 0.96. Iterative thinking families degrade with more steps. The Input Embedding reference matches or exceeds iterative candidates. DCS is largely predictable from question text alone.
| LLM | OE | LIT | ST | STN | LT | IE | RV |
|---|---|---|---|---|---|---|---|
| Llama 8B | 0.96 | 0.94 | 0.94 | 0.90 | 0.92 | 0.93 | 0.52 |
| Llama 70B | 0.89 | 0.89 | 0.85 | 0.84 | 0.87 | 0.92 | 0.50 |
| DS-R1 32B | 0.96 | 0.95 | 0.95 | 0.85 | 0.92 | 0.94 | 0.50 |
| Sky-OR1 32B | 0.97 | 0.96 | 0.95 | 0.86 | 0.92 | 0.93 | 0.49 |
| GPT-OSS 20B | 0.74 | 0.58 | 0.55 | 0.46 | 0.59 | 0.59 | 0.56 |
Joint Behavior
Table 8 shows the number of cells (out of 20, across 4 axioms × 5 LLMs) where each candidate exceeds the Input Embedding reference.
| Candidate | Cells / 20 |
|---|---|
| Exact OE | 16 |
| Pooled OE | 16 |
| LIT (all, final) | 6 |
| ST@1 | 13 |
| ST@16–128 | 3–5 |
| STN@1 | 7 |
| STN@16–128 | 2–3 |
| LT@1 | 7 |
| LT@16–128 | 2–3 |
No candidate beats IE on every axis. The iterative thinking variants degrade as step count grows. The Output Embedding is the only family that excels, but it is an oracle upper-bound reference.
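The cell count can be recomputed from the per-axiom tables above. The sketch below does so for the LIT and LT columns only, assuming a strict comparison (ties do not count) and using the rounded values as printed, so it is an approximation of the paper's procedure and not a reproduction of it. The names `AXES` and `cells_beating_ie` are illustrative.

```python
# Values copied from the per-axiom tables above, one entry per LLM in table order
# (Llama 8B, Llama 70B, DS-R1 32B, Sky-OR1 32B, GPT-OSS 20B).
AXES = {
    "causality": {  # KL error: lower is better
        "higher_is_better": False,
        "LIT": [5.01, 5.28, 4.79, 4.09, 4.19],
        "LT": [5.32, 4.21, 4.62, 4.34, 3.90],
        "IE": [5.36, 4.71, 4.50, 4.08, 3.78],
    },
    "minimality": {
        "higher_is_better": True,
        "LIT": [0.16, -0.30, -0.05, -0.27, -0.25],
        "LT": [0.19, -0.30, 0.05, -0.18, -0.17],
        "IE": [0.22, -0.23, 0.04, -0.21, -0.34],
    },
    "separability": {  # same-task accuracy, %
        "higher_is_better": True,
        "LIT": [53.9, 51.6, 52.6, 53.3, 50.4],
        "LT": [54.7, 51.4, 50.3, 51.2, 51.2],
        "IE": [54.5, 52.1, 53.5, 54.0, 49.5],
    },
    "stability": {  # DCS AUROC
        "higher_is_better": True,
        "LIT": [0.94, 0.89, 0.95, 0.96, 0.58],
        "LT": [0.92, 0.87, 0.92, 0.92, 0.59],
        "IE": [0.93, 0.92, 0.94, 0.93, 0.59],
    },
}

def cells_beating_ie(candidate):
    """Count (axiom, LLM) cells where the candidate is strictly better than IE."""
    wins = 0
    for axis in AXES.values():
        for cand, ref in zip(axis[candidate], axis["IE"]):
            better = cand > ref if axis["higher_is_better"] else cand < ref
            wins += int(better)
    return wins

lit_cells = cells_beating_ie("LIT")
lt_cells = cells_beating_ie("LT")
```

On these rounded values the sketch gives 6 of 20 cells for LIT and 7 of 20 for LT. Note that the direction flag matters: treating KL as higher-is-better would silently invert the causality axis.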
Theoretical and Practical Implications
- The framework provides explicit, decomposable optimization targets for developing future thought representations. Instead of optimizing a single accuracy number, researchers can optimize the measure for each axiom separately.
- Diagnostic value: An audit reveals which axiom is the binding constraint on an existing representation. A change in downstream accuracy can be attributed to a specific property (e.g., poor separability, low stability).
- The representational collapse on per-question identity suggests that current latent reasoning methods do not faithfully encode the structured reasoning they claim to implement. This is a structural property, not a function of model size or training procedure.
- The competitive performance of the input embedding indicates that adding explicit latent reasoning steps beyond the prompt may not provide additional information—an important finding for efficiency-motivated work.
- The framework is model-agnostic and axiom-agnostic in design: the four measures can be computed on any thought representation, and the axioms admit alternative quantifications.
Conclusion
The paper introduces an axiomatic evaluation framework (Causality, Minimality, Separability, Stability) for latent thought representations in LLMs, with intrinsic measures computed directly on the source model without retraining. An empirical audit across five LLMs and 23 reasoning tasks reveals that:
- No candidate satisfies all four axioms simultaneously.
- Representations retain coarse task identity but lose per-question identity (near-random same-task accuracy).
- Input embeddings are competitive with all thought representations, indicating that current methods add little information beyond the prompt.
- The failure pattern is consistent across dense, MoE, reasoning-distilled, and RL-trained models, suggesting a structural gap rather than a scale or training issue.
Future directions: (1) Developing new candidate thought representations explicitly trained to satisfy the axioms; (2) Extending the lexical invariance sub-property of Stability; (3) Applying the framework to multilingual workloads and wider generation tasks; (4) Using the four measures as direct loss terms in training.
Limitations: Lexical invariance is not tested (candidates produce identical vectors for paraphrases). The protocol requires extra compute for generating beam outputs and probe training, but this cost is small relative to the diagnostic information gained.