Visual Summary | Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs

Summary (Overview)

The paper proposes an axiomatic framework for evaluating latent thought representations in LLMs, consisting of four functional axioms: Causality, Minimality, Separability, and Stability.
For each axiom, a quantitative measure is defined that can be computed directly on the source LLM without retraining and without relying on downstream task accuracy.
An empirical audit across five open-weight LLMs (dense, MoE, reasoning-distilled, RL-trained) on the 23 BBEH reasoning tasks reveals that no candidate thought representation satisfies all four axioms simultaneously.
The representations distinguish task type reliably but fail to distinguish between two questions within the same task (near-random same-task accuracy), exposing a representational collapse that benchmark accuracy masks.
The input embedding (IE) itself is competitive with all candidate thought representations on every axis, indicating that current latent representations add little information beyond what is already present in the prompt.

Introduction and Theoretical Foundation

Background and Motivation

Reasoning in LLMs has moved from discrete Chain-of-Thought (CoT) tokens toward continuous latent representations (e.g., Soft Thinking, Latent Thinking). However, current evaluations assess these representations almost exclusively through downstream task accuracy, conflating representation quality with model capacity. This makes it impossible to attribute failures to the representation itself rather than the model that processes it.

The authors identify three gaps:

No principled definition of what a thought representation must do.
No intrinsic evaluation that measures representation quality independently of downstream accuracy.
The status of current methods is unclear: latent reasoning tokens are often unnecessary and do not implement structured reasoning.

Formal Definition of a Functional Thought

A thought representation is defined not as a communicable linguistic artifact but as a Functional Thought $T$: a latent state that mediates the transformation from an input space $\mathcal{X}$ to a semantic output space $\mathcal{S}$.

Definition 1 (Idealized Thought Representation Mapping): Let $\mathcal{X}$ denote the input space and $\mathcal{S}$ the semantic output space of a model $M: \mathcal{X} \to \mathcal{S}$. A thought representation generator is a function $g: \mathcal{X} \to \mathcal{T}$, where $\mathcal{T}$ is the thought space. The function $g$ satisfies, for all $x_i, x_j \in \mathcal{X}$:

$$g(x_i) = g(x_j) \iff M(x_i) = M(x_j)$$

Because the condition is an equivalence, $g$ merges two inputs exactly when the model assigns them the same semantic output. In general $g$ is therefore a many-to-one mapping: it compresses $\mathcal{X}$ into $\mathcal{T}$ and keeps two inputs distinct only when their outputs are semantically distinct.
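As a concrete reading of Definition 1, the sketch below checks the equivalence on a finite sample of inputs. The toy model (integer parity) and the three candidate generators are hypothetical illustrations, not objects from the paper.

```python
from itertools import combinations

def satisfies_definition_1(inputs, g, M):
    """Check g(x_i) == g(x_j) <=> M(x_i) == M(x_j) on every pair of inputs."""
    for x_i, x_j in combinations(inputs, 2):
        same_thought = g(x_i) == g(x_j)
        same_output = M(x_i) == M(x_j)
        if same_thought != same_output:
            return False
    return True

# Hypothetical stand-in for M: the "semantic output" is the parity of an integer.
def toy_model(x):
    return x % 2

def ideal_g(x):       # one thought per distinct output
    return "odd" if x % 2 else "even"

def identity_g(x):    # keeps every input distinct: no compression
    return x

def constant_g(x):    # merges everything: distinct outputs collapse
    return 0

inputs = list(range(6))
results = {
    "ideal": satisfies_definition_1(inputs, ideal_g, toy_model),
    "identity": satisfies_definition_1(inputs, identity_g, toy_model),
    "constant": satisfies_definition_1(inputs, constant_g, toy_model),
}
print(results)
```

The identity map fails the "if" direction (inputs with equal outputs stay distinct), while the constant map fails the "only if" direction (inputs with different outputs are merged).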

Methodology

The Four Axioms and Their Measures

The authors formalize four functional properties that a valid thought representation must satisfy. Table 1 summarizes the axioms and their measures.

Axiom	Formal requirement	Quantitative measure
1. Causality	$D_{\text{KL}}\left(P_\theta(Z \mid Y) \parallel P_\theta(Z \mid T)\right)$	Causality Error (KL divergence)
2. Minimality	$\min_T I(X;T) - \beta I(T;Y)$	IB residual gap $\Delta_{\text{IB}}$
3. Separability	$d_{\mathcal{S}}\left(\phi(T_{x_1}), \phi(T_{x_2})\right) > \delta$ for some $\phi \in \mathcal{H}$	Discriminator accuracy (same-task / cross-task)
4. Stability	$T$ encodes the entropy of $P(\mathcal{S} \mid x)$	Distributional Consistency Score (DCS, AUROC)

1. Causality: The representation must functionally substitute for the reasoning prefix within the model’s computational graph. It is measured by the KL divergence incurred when the token embeddings of a reasoning prefix $y_{\text{pre}}$ are replaced with the projected $T$, evaluated on the suffix $y_{\text{suf}}$ that follows:

$$\text{Causality Error} = D_{\text{KL}}\left(P(y_{\text{suf}}|y_{\text{pre}}) \parallel P(y_{\text{suf}}|T)\right)$$
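A minimal sketch of how such an error could be computed from per-position next-token distributions follows. It assumes the sequence-level KL is approximated by averaging token-level KL over suffix positions, which is one plausible reading and not necessarily the paper's exact estimator. The function names and toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def causality_error(dists_given_prefix, dists_given_thought):
    """Mean per-position KL between suffix-token distributions.

    Assumption: each argument is a list with one distribution per suffix
    position, obtained by conditioning on y_pre and on the projected T.
    """
    if len(dists_given_prefix) != len(dists_given_thought):
        raise ValueError("both runs must cover the same suffix positions")
    kls = [kl_divergence(p, q) for p, q in zip(dists_given_prefix, dists_given_thought)]
    return sum(kls) / len(kls)

# Toy two-token vocabulary, two suffix positions.
reference = [[0.5, 0.5], [0.9, 0.1]]
identical = causality_error(reference, reference)
shifted = causality_error(reference, [[0.9, 0.1], [0.9, 0.1]])
print(identical, round(shifted, 4))
```

A perfect substitute reproduces the reference distributions and scores zero; any deviation at any suffix position raises the error.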

2. Minimality: Adapted from the Information Bottleneck principle. The representation should discard input information that is irrelevant to the output while retaining output-relevant information. The tractable surrogate is:

$$\Delta_{\text{IB}} = \text{CE}(X|Y,T) - \text{CE}(Y|T)$$

A larger $\Delta_{\text{IB}}$ indicates better minimality.
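The arithmetic behind the surrogate can be illustrated as follows, assuming CE denotes a mean token-level cross-entropy (negative log-likelihood, in nats) of the target tokens under a probe. The probability values are made up for illustration.

```python
import math

def cross_entropy(target_token_probs):
    """Mean negative log-likelihood (nats) assigned to the target tokens."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

def delta_ib(x_probs_given_y_and_t, y_probs_given_t):
    """Delta_IB = CE(X | Y, T) - CE(Y | T)."""
    return cross_entropy(x_probs_given_y_and_t) - cross_entropy(y_probs_given_t)

# Hypothetical probe outputs: input tokens are hard to reconstruct (p = 0.25),
# output tokens are easier to predict (p = 0.5).
gap = delta_ib([0.25, 0.25], [0.5, 0.5])
print(round(gap, 4))
```

Under this reading, the gap grows when $T$ makes the input harder to reconstruct, when it makes the output easier to predict, or both.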

3. Separability: The representation must allow a bounded-capacity classifier to distinguish inputs that lead to semantically different outputs. Measured by training a binary discriminator $f_{\text{disc}}(T,Y)$ to distinguish pairs of (representation, correct output) vs. (representation, wrong output) in two settings: same-task (same task, different questions) and cross-task (different tasks). Accuracy is the metric.
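One way the two discriminator settings could be assembled is sketched below. The record fields (`task`, `qid`, `thought`, `output`) and the one-negative-per-positive sampling are assumptions made for illustration; the paper's exact pairing procedure may differ.

```python
import random

def build_pairs(examples, setting, seed=0):
    """Return (thought, output, label) triples: label 1 = correct output, 0 = wrong output.

    Wrong outputs are borrowed from another question of the same task
    ("same-task") or from a question of a different task ("cross-task").
    """
    rng = random.Random(seed)
    pairs = []
    for ex in examples:
        pairs.append((ex["thought"], ex["output"], 1))
        if setting == "same-task":
            pool = [o for o in examples if o["task"] == ex["task"] and o["qid"] != ex["qid"]]
        elif setting == "cross-task":
            pool = [o for o in examples if o["task"] != ex["task"]]
        else:
            raise ValueError(f"unknown setting: {setting}")
        # Skip borrowed outputs that happen to equal the correct one.
        pool = [o for o in pool if o["output"] != ex["output"]]
        if pool:
            pairs.append((ex["thought"], rng.choice(pool)["output"], 0))
    return pairs

# Hypothetical records; thoughts are placeholders for real vectors.
examples = [
    {"task": "A", "qid": 1, "thought": "tA1", "output": "yes"},
    {"task": "A", "qid": 2, "thought": "tA2", "output": "no"},
    {"task": "B", "qid": 3, "thought": "tB3", "output": "7"},
    {"task": "B", "qid": 4, "thought": "tB4", "output": "12"},
]
same_task_pairs = build_pairs(examples, "same-task")
cross_task_pairs = build_pairs(examples, "cross-task")
```

In the cross-task setting the wrong output usually differs in kind from the right one, so task identity alone can solve it. The same-task setting requires the representation to carry question-level information.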

4. Stability: The representation should be invariant to lexical variations and should reflect the uncertainty of the output distribution. Measured by the Distributional Consistency Score (DCS): the AUROC of a probe predicting whether a question’s beam outputs span more than one semantic equivalence class (based on semantic entropy $H_x$ of Kuhn et al.).
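The DCS pipeline can be sketched in two steps: label each question by whether its beam outputs fall into more than one semantic class, then score a probe with AUROC. The greedy clustering, the threshold `tau`, the string-matching `toy_similarity`, and the probe scores are illustrative assumptions. The paper derives equivalence classes from embedding similarity and semantic entropy.

```python
def count_semantic_classes(outputs, similarity, tau=0.9):
    """Greedy clustering: an output opens a new class unless it matches a representative."""
    representatives = []
    for out in outputs:
        if not any(similarity(out, rep) >= tau for rep in representatives):
            representatives.append(out)
    return len(representatives)

def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def toy_similarity(a, b):
    # Hypothetical stand-in for an embedding-based similarity score.
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

beams = {
    "q1": ["Paris", "paris", "Paris "],
    "q2": ["4", "5", "4"],
    "q3": ["blue", "Blue", "blue"],
    "q4": ["left", "right", "up"],
}
labels = [int(count_semantic_classes(b, toy_similarity) > 1) for b in beams.values()]
probe_scores = [0.2, 0.9, 0.7, 0.6]  # hypothetical probe outputs computed from T
dcs = auroc(labels, probe_scores)
print(labels, dcs)
```

A score near 0.5 means the representation carries no usable signal about output uncertainty.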

Candidate Thought Representations

Last Input Token (LIT): from all layers and final layer.
Soft Thinking: without (ST) and with (STN) Gumbel noise, at 1, 16, 32, 64, and 128 steps.
Latent Thinking (LT): at 1, 16, 32, 64, and 128 steps.
Output Embedding (OE): Exact and Pooled (upper-bound references derived from $Y$).
Input Embedding (IE): the prompt embedding (baseline).
Random Vector (RV): information-free reference.

Source Models and Tasks

Source LLM	Family	Paradigm
Llama-3.1 8B	Dense	Instruct
Llama-3.3 70B	Dense	Instruct
DS-R1-Qwen 32B	Dense	Reasoning-distill
Skywork-OR1 32B	Dense	Native RL
GPT-OSS 20B	Sparse MoE	Adjustable reasoning effort

Evaluated on the 23 tasks of Big Bench Extra Hard (BBEH).

Probe Setup

A frozen Llama-3.2-1B backbone with a trainable linear projection maps thought representations into its token-embedding space.
Semantic similarity for DCS is computed with Embed-Nemotron-8B (the top-ranked MTEB model at the time of writing).
Beam search (8 sequences, up to 8192 tokens) approximates the high-probability region of $P(Y|x)$.
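The role of the trainable projection can be sketched in plain Python. The dimensions are toy values, and the sketch assumes that a single projected vector stands in for the whole reasoning prefix; the source does not state how many positions the projected $T$ occupies. All names are hypothetical.

```python
import random

def init_projection(d_src, d_probe, seed=0):
    """Randomly initialised linear map from the source hidden size to the probe embedding size."""
    rng = random.Random(seed)
    scale = d_src ** -0.5
    weight = [[rng.gauss(0.0, scale) for _ in range(d_src)] for _ in range(d_probe)]
    bias = [0.0] * d_probe
    return weight, bias

def project(thought, weight, bias):
    """Apply W @ thought + b; only W and b would be trained, the probe backbone stays frozen."""
    return [sum(w * t for w, t in zip(row, thought)) + b for row, b in zip(weight, bias)]

def substitute_prefix(suffix_embeddings, projected_thought):
    """Build the probe input: the projected thought takes the place of the prefix embeddings."""
    return [projected_thought] + list(suffix_embeddings)

d_src, d_probe = 8, 4                      # toy sizes, not the real model dimensions
weight, bias = init_projection(d_src, d_probe)
thought = [1.0] * d_src                    # placeholder thought representation
soft_token = project(thought, weight, bias)
suffix = [[0.0] * d_probe for _ in range(3)]
probe_inputs = substitute_prefix(suffix, soft_token)
print(len(soft_token), len(probe_inputs))
```

Keeping the probe frozen and the projection linear limits the probe's capacity, so a good score must come from the representation instead of from the probe.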

Empirical Validation / Results

Per-Axiom Results

Causality (KL error, lower is better):
All candidates have KL substantially below the Random Vector baseline, but none consistently improves on (achieves lower KL than) the Input Embedding reference (Table 4). This suggests the thought representations carry little causal information beyond what the prompt already provides.

LLM	OE	LIT	ST	STN	LT	IE	RV
Llama 8B	5.21	5.01	4.96	4.70	5.32	5.36	9.49
Llama 70B	4.56	5.28	4.65	5.08	4.21	4.71	8.93
DS-R1 32B	4.67	4.79	4.45	4.57	4.62	4.50	9.36
Sky-OR1 32B	4.10	4.09	3.90	4.68	4.34	4.08	9.31
GPT-OSS 20B	3.82	4.19	4.00	4.17	3.90	3.78	9.60

Minimality ($\Delta_{\text{IB}}$, higher is better):
Results are mixed. The Output Embedding values are not interpretable because OE violates the measure's decomposition assumptions. Among the candidates, Soft Thinking is at or above IE, while LIT and LT are at or below it. No candidate consistently encodes more output-relevant compression than the prompt.

LLM	OE	LIT	ST	STN	LT	IE	RV
Llama 8B	0.37	0.16	0.25	0.24	0.19	0.22	-0.40
Llama 70B	-0.13	-0.30	-0.24	-0.24	-0.30	-0.23	-0.99
DS-R1 32B	0.07	-0.05	0.10	0.10	0.05	0.04	-0.50
Sky-OR1 32B	-0.16	-0.27	-0.13	-0.14	-0.18	-0.21	-0.59
GPT-OSS 20B	-0.22	-0.25	-0.21	-0.20	-0.17	-0.34	-0.30

Separability (same-task accuracy, %):
Cross-task accuracy is near saturation for all candidates, since task identity is easily encoded. Same-task accuracy is near random (50–55%) for all candidates except Output Embedding (62–73%), indicating a severe representational collapse on per-question identity.

LLM	OE	LIT	ST	STN	LT	IE	RV
Llama 8B	68.8	53.9	54.7	53.5	54.7	54.5	48.9
Llama 70B	72.6	51.6	52.9	52.8	51.4	52.1	49.7
DS-R1 32B	63.5	52.6	54.8	51.8	50.3	53.5	50.3
Sky-OR1 32B	63.4	53.3	54.2	51.8	51.2	54.0	49.9
GPT-OSS 20B	62.4	50.4	50.7	51.8	51.2	49.5	51.0

Stability (DCS AUROC, $\tau = 0.9$, higher is better):
DCS is high on every source LLM (0.85–0.97) except GPT-OSS-20B (0.46–0.96). Iterative thinking families degrade with more steps. The Input Embedding reference matches or exceeds the iterative candidates, so DCS is largely predictable from the question text alone.

LLM	OE	LIT	ST	STN	LT	IE	RV
Llama 8B	0.96	0.94	0.94	0.90	0.92	0.93	0.52
Llama 70B	0.89	0.89	0.85	0.84	0.87	0.92	0.50
DS-R1 32B	0.96	0.95	0.95	0.85	0.92	0.94	0.50
Sky-OR1 32B	0.97	0.96	0.95	0.86	0.92	0.93	0.49
GPT-OSS 20B	0.74	0.58	0.55	0.46	0.59	0.59	0.56

Joint Behavior

Table 8 shows the number of cells (out of 20, across 4 axioms × 5 LLMs) where each candidate outperforms the Input Embedding reference.

Candidate	Cells / 20
Exact OE	16
Pooled OE	16
LIT (all, final)	6
ST@1	13
ST@16–128	3–5
STN@1	7
STN@16–128	2–3
LT@1	7
LT@16–128	2–3

No candidate beats IE on every axis, and the iterative thinking variants degrade as step count grows. The Output Embedding is the only family that beats IE in most cells (16 of 20), but it is an oracle derived from $Y$.
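The cell count can be reproduced from the four per-axiom tables above. The sketch below assumes a strict comparison on the rounded values shown (lower is better for Causality, higher is better for the other three); the tie rule used in the paper is not stated in this summary.

```python
COLUMNS = ["OE", "LIT", "ST", "STN", "LT", "IE", "RV"]

# (higher_is_better, rows); rows are Llama 8B, Llama 70B, DS-R1 32B, Sky-OR1 32B, GPT-OSS 20B.
TABLES = {
    "causality": (False, [
        [5.21, 5.01, 4.96, 4.70, 5.32, 5.36, 9.49],
        [4.56, 5.28, 4.65, 5.08, 4.21, 4.71, 8.93],
        [4.67, 4.79, 4.45, 4.57, 4.62, 4.50, 9.36],
        [4.10, 4.09, 3.90, 4.68, 4.34, 4.08, 9.31],
        [3.82, 4.19, 4.00, 4.17, 3.90, 3.78, 9.60],
    ]),
    "minimality": (True, [
        [0.37, 0.16, 0.25, 0.24, 0.19, 0.22, -0.40],
        [-0.13, -0.30, -0.24, -0.24, -0.30, -0.23, -0.99],
        [0.07, -0.05, 0.10, 0.10, 0.05, 0.04, -0.50],
        [-0.16, -0.27, -0.13, -0.14, -0.18, -0.21, -0.59],
        [-0.22, -0.25, -0.21, -0.20, -0.17, -0.34, -0.30],
    ]),
    "separability": (True, [
        [68.8, 53.9, 54.7, 53.5, 54.7, 54.5, 48.9],
        [72.6, 51.6, 52.9, 52.8, 51.4, 52.1, 49.7],
        [63.5, 52.6, 54.8, 51.8, 50.3, 53.5, 50.3],
        [63.4, 53.3, 54.2, 51.8, 51.2, 54.0, 49.9],
        [62.4, 50.4, 50.7, 51.8, 51.2, 49.5, 51.0],
    ]),
    "stability": (True, [
        [0.96, 0.94, 0.94, 0.90, 0.92, 0.93, 0.52],
        [0.89, 0.89, 0.85, 0.84, 0.87, 0.92, 0.50],
        [0.96, 0.95, 0.95, 0.85, 0.92, 0.94, 0.50],
        [0.97, 0.96, 0.95, 0.86, 0.92, 0.93, 0.49],
        [0.74, 0.58, 0.55, 0.46, 0.59, 0.59, 0.56],
    ]),
}

def cells_beating_ie(tables):
    """Count, per candidate, the (axiom, LLM) cells where it strictly beats IE."""
    ie_idx = COLUMNS.index("IE")
    counts = {name: 0 for name in COLUMNS if name not in ("IE", "RV")}
    for higher_is_better, rows in tables.values():
        for row in rows:
            for name in counts:
                value, ie = row[COLUMNS.index(name)], row[ie_idx]
                better = value > ie if higher_is_better else value < ie
                counts[name] += int(better)
    return counts

counts = cells_beating_ie(TABLES)
print(counts)
```

Under these assumptions the counts for OE, LIT, STN, and LT come out as 16, 6, 7, and 7, matching the Exact OE, LIT, STN@1, and LT@1 rows. The ST column yields 16, not the 13 listed for ST@1, so the ST values shown above may correspond to a different variant or comparison rule.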

Theoretical and Practical Implications

The framework provides explicit, decomposable optimization targets for developing future thought representations. Instead of optimizing a single accuracy number, researchers can target each axiom's measure separately.
Diagnostic value: An audit reveals which axiom is the binding constraint on an existing representation. A change in downstream accuracy can be attributed to a specific property (e.g., poor separability, low stability).
The representational collapse on per-question identity suggests that current latent reasoning methods do not faithfully encode the structured reasoning they claim to implement. Because the collapse appears across all five models, it looks structural and not a function of model size or training procedure.
The competitive performance of the input embedding indicates that adding explicit latent reasoning steps beyond the prompt may not provide additional information, an important finding for efficiency-motivated work.
The framework is model-agnostic and axiom-agnostic in design: the four measures can be computed on any thought representation, and the axioms admit alternative quantifications.

Conclusion

The paper introduces an axiomatic evaluation framework (Causality, Minimality, Separability, Stability) for latent thought representations in LLMs, with intrinsic measures computed directly on the source model without retraining. An empirical audit across five LLMs and 23 reasoning tasks reveals that:

No candidate satisfies all four axioms simultaneously.
Representations retain coarse task identity but lose per-question identity (near-random same-task accuracy).
Input embeddings are competitive with all thought representations, indicating that current methods add little information beyond the prompt.
The failure pattern is consistent across dense, MoE, reasoning-distilled, and RL-trained models, suggesting a structural gap rather than a scale or training issue.

Future directions: (1) Developing new candidate thought representations explicitly trained to satisfy the axioms; (2) Extending the lexical invariance sub-property of Stability; (3) Applying the framework to multilingual workloads and wider generation tasks; (4) Using the four measures as direct loss terms in training.

Limitations: Lexical invariance is not tested (candidates produce identical vectors for paraphrases). The protocol requires extra compute for generating beam outputs and probe training, but this cost is small relative to the diagnostic information gained.