Summary (Overview)

  • This paper identifies that LLM-derived text embeddings, when projected onto the vocabulary space via Logit Lens, are dominated by high-frequency but semantically uninformative tokens, leading to suboptimal zero-shot performance.
  • Using Logit Spectroscopy on a reverse-engineered "average token," the authors discover that the "edge spectrum" (subspaces corresponding to the largest and smallest singular values of the unembedding matrix) is responsible for encoding these frequent tokens.
  • They introduce EmbedFilter, a simple linear transformation that filters out the edge spectrum by retaining only the bulk of the singular vectors, thereby suppressing frequent token influence and enhancing semantic representations.
  • EmbedFilter provides up to +14.1% improvement on MTEB across multiple LLMs under zero-shot settings, and naturally enables dimensionality reduction (e.g., to 1/8 of original size) without loss of performance.
  • Extensive experiments show EmbedFilter outperforms existing calibration methods (e.g., whitening) while requiring no training or calibration data.

Introduction and Theoretical Foundation

The paper addresses the persistent gap in LLMs' ability to serve as off-the-shelf zero-shot embedding models. While LLMs show strong zero-shot capabilities on many tasks, their text embeddings underperform on benchmarks like MTEB.

Core Observation: When Logit Lens (projecting hidden states onto the vocabulary space) is applied to raw LLM text embeddings, the top-decoded tokens are consistently high-frequency but uninformative (e.g., "the", ",", "a"), regardless of input semantics. This phenomenon holds across different LLM families (Qwen, Llama, Mistral; see Figure 1 in the paper).

Theoretical Insight: The authors link this to the well-known anisotropy problem: embeddings lie in a narrow cone. They hypothesize that the centroid of this cone corresponds to an "average token" derived from the training corpus. The unembedding matrix $\boldsymbol{W}_U$ encodes a subspace that projects embeddings toward this commonality, overshadowing semantic features.

Key Tools:

  • Logit Lens: Maps embeddings to vocabulary probabilities via $\mathrm{Softmax}(\boldsymbol{h}\boldsymbol{W}_U^\top)$.
  • Logit Spectroscopy: Uses the SVD $\boldsymbol{W}_U = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$ and a filter $\boldsymbol{\Psi}_i = \boldsymbol{I} - \boldsymbol{V}_{[i]}\boldsymbol{V}_{[i]}^\top$ to measure the contribution of each singular subspace.
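The two tools above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes $\boldsymbol{W}_U$ is stored with shape (vocab, d) so that $\boldsymbol{h}\boldsymbol{W}_U^\top$ yields one logit per token, and it uses a small random matrix as a stand-in for a real unembedding matrix. The function names are hypothetical.

```python
import numpy as np

def logit_lens(h, W_U):
    """Softmax(h @ W_U.T) for h of shape (d,) and W_U of shape (vocab, d)."""
    logits = h @ W_U.T
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def right_singular_vectors(W_U):
    """Columns of V in W_U = U diag(S) V^T, ordered by decreasing singular value."""
    _, _, Vt = np.linalg.svd(W_U, full_matrices=False)
    return Vt.T

def spectroscopy_filter(V, i):
    """Psi_i = I - v_i v_i^T, which removes the i-th singular direction."""
    v = V[:, i]
    return np.eye(V.shape[0]) - np.outer(v, v)

# Toy stand-in data (assumption): vocab=50, hidden size d=8.
rng = np.random.default_rng(0)
W_U = rng.normal(size=(50, 8))
h = rng.normal(size=8)

V = right_singular_vectors(W_U)
probs = logit_lens(h, W_U)
h_filtered = h @ spectroscopy_filter(V, 0)  # drop the top singular direction
```

After filtering, `h_filtered` has no component along the removed singular vector, while its components along all other singular vectors are unchanged.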

Methodology

Step 1: Reverse-Engineer the Average Token. Using the unembedding matrix and empirical token frequencies $\hat{\boldsymbol{p}}$ (from the RedPajama corpus), the "average token" representation is:

$$\hat{\boldsymbol{h}} = \log(\hat{\boldsymbol{p}})\, \boldsymbol{W}_U^+$$

where $\boldsymbol{W}_U^+$ is the Moore-Penrose pseudo-inverse. The bias term $\boldsymbol{b}$ is omitted as it does not affect spectral properties.
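As a rough sketch of Step 1, the average token can be computed as the least-squares solution of $\boldsymbol{h}\boldsymbol{W}_U^\top \approx \log(\hat{\boldsymbol{p}})$. The shape convention is an assumption here: with $\boldsymbol{W}_U$ stored as (vocab, d), the pseudo-inverse is applied to $\boldsymbol{W}_U^\top$ so that the product is well defined. The Zipf-like frequencies and the random matrix are hypothetical placeholders for corpus statistics and a real unembedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 8
W_U = rng.normal(size=(vocab, d))  # toy unembedding matrix (assumption)

# Hypothetical Zipf-like token frequencies standing in for corpus counts.
counts = 1.0 / np.arange(1, vocab + 1)
p_hat = counts / counts.sum()

# Average token: h_hat = log(p_hat) @ pinv(W_U^T), a vector of shape (d,).
# This is the least-squares solution of h @ W_U.T ~= log(p_hat).
h_hat = np.log(p_hat) @ np.linalg.pinv(W_U.T)

# Cross-check against an explicit least-squares solve of W_U @ h = log(p_hat).
h_lstsq = np.linalg.lstsq(W_U, np.log(p_hat), rcond=None)[0]
```

Because the vocabulary is much larger than the hidden size, the system is overdetermined, so `h_hat` reproduces the log-frequencies only approximately.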

Step 2: Identify Edge Spectrum via Logit Spectroscopy. For each singular dimension $i$, apply $\boldsymbol{\Psi}_i$ to $\hat{\boldsymbol{h}}$ and measure the cumulative logit shift for the top $k$ most frequent tokens $V^+$:

$$\Delta_\pi(i) = \frac{\sum_{j\in V^+}\big(\tilde{w}_j^{(i)} - \hat{w}_j\big)}{\sum_{j\in V^+}\hat{w}_j}$$

Figure 2 (paper) shows $\Delta_\pi$ is significantly larger at the extremes of the spectrum (largest and smallest singular values), confirming the edge subspace encodes frequent tokens.
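The measurement in Step 2 can be sketched as follows. The summary does not define $\hat{w}_j$ and $\tilde{w}_j^{(i)}$, so this example assumes they are the logits of the average token before and after filtering, i.e. $\hat{\boldsymbol{w}} = \hat{\boldsymbol{h}}\boldsymbol{W}_U^\top$ and $\tilde{\boldsymbol{w}}^{(i)} = \hat{\boldsymbol{h}}\boldsymbol{\Psi}_i\boldsymbol{W}_U^\top$. The toy matrix, the frequencies, and the choice of `k` are hypothetical.

```python
import numpy as np

def delta_pi(h_hat, W_U, V, top_idx):
    """Relative logit shift on the tokens in top_idx after removing each direction."""
    w_hat = h_hat @ W_U.T              # assumed definition of the baseline logits
    base = w_hat[top_idx].sum()
    shifts = np.empty(V.shape[1])
    for i in range(V.shape[1]):
        v = V[:, i]
        h_tilde = h_hat - (h_hat @ v) * v   # equals h_hat @ (I - v v^T)
        w_tilde = h_tilde @ W_U.T
        shifts[i] = (w_tilde[top_idx].sum() - base) / base
    return shifts

# Toy stand-in data (assumption): vocab=50, d=8, top k=5 frequent tokens.
rng = np.random.default_rng(0)
vocab, d, k = 50, 8, 5
W_U = rng.normal(size=(vocab, d))
counts = 1.0 / np.arange(1, vocab + 1)
p_hat = counts / counts.sum()
h_hat = np.log(p_hat) @ np.linalg.pinv(W_U.T)

_, _, Vt = np.linalg.svd(W_U, full_matrices=False)
V = Vt.T
top_idx = np.argsort(p_hat)[-k:]       # indices of the k most frequent tokens
shifts = delta_pi(h_hat, W_U, V, top_idx)
```

With a random toy matrix the profile of `shifts` carries no meaning; the paper's finding concerns real unembedding matrices, where the largest magnitudes appear at both ends of the spectrum.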

Step 3: EmbedFilter Formulation. Construct $\boldsymbol{\Phi}_\tau$ by keeping only the mid-range ("bulk") singular vectors:

$$\boldsymbol{\Phi}_\tau = \boldsymbol{V}_{[l_\tau: r_\tau]} \boldsymbol{V}_{[l_\tau: r_\tau]}^\top$$

where $\tau$ is a filtering ratio, and $l_\tau, r_\tau$ define the retained columns. Refined embeddings:

$$\tilde{\boldsymbol{e}}_i = \boldsymbol{e}_i \boldsymbol{\Phi}_\tau^\top$$
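A minimal sketch of the filter construction is given below. The exact rule for $l_\tau$ and $r_\tau$ is not stated in this summary, so the example assumes a centered band that keeps $d/\tau$ columns and drops an equal number from each end of the spectrum. The helper names `bulk_indices` and `embed_filter` are hypothetical, and the random matrices stand in for a real unembedding matrix and real text embeddings.

```python
import numpy as np

def bulk_indices(d, tau):
    """Assumed rule: keep d // tau columns centered in the spectrum."""
    keep = d // tau
    l = (d - keep) // 2
    return l, l + keep

def embed_filter(W_U, tau):
    """Return (Phi, V_bulk) with Phi = V_bulk @ V_bulk.T."""
    _, _, Vt = np.linalg.svd(W_U, full_matrices=False)
    V = Vt.T                           # columns sorted by decreasing singular value
    l, r = bulk_indices(V.shape[1], tau)
    V_bulk = V[:, l:r]
    return V_bulk @ V_bulk.T, V_bulk

# Toy stand-in data (assumption): vocab=64, d=16, five embeddings.
rng = np.random.default_rng(0)
W_U = rng.normal(size=(64, 16))
E = rng.normal(size=(5, 16))

Phi, V_bulk = embed_filter(W_U, tau=2)
E_refined = E @ Phi.T                  # refined embeddings, still of dimension d
```

Since $\boldsymbol{\Phi}_\tau$ is an orthogonal projection, it is symmetric and idempotent, so applying the filter twice gives the same result as applying it once.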

Dimensionality Reduction: Because the columns of $\boldsymbol{V}$ are orthonormal, distances between filtered embeddings are preserved when the embeddings are expressed in the retained coordinates:

$$\| \boldsymbol{x}\boldsymbol{\Phi}_\tau^\top - \boldsymbol{y}\boldsymbol{\Phi}_\tau^\top \|_2 = \| \boldsymbol{x}\boldsymbol{V}_{[l_\tau:r_\tau]} - \boldsymbol{y}\boldsymbol{V}_{[l_\tau:r_\tau]} \|_2$$

Thus the embedding dimension can be reduced to $d/\tau$ (the number of retained columns) without changing the Euclidean distances between filtered embeddings.
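The identity above can be checked numerically. The sketch below uses a random orthogonal matrix as a stand-in for $\boldsymbol{V}$ and assumes a centered band of $d/\tau$ retained columns; both choices are illustrative rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 16, 4

# Random orthogonal matrix as a stand-in for V (assumption).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
keep = d // tau
l = (d - keep) // 2
V_bulk = Q[:, l:l + keep]              # retained columns, shape (d, d // tau)
Phi = V_bulk @ V_bulk.T

x, y = rng.normal(size=d), rng.normal(size=d)

# Distance between filtered embeddings in the full d-dimensional space...
dist_full = np.linalg.norm(x @ Phi.T - y @ Phi.T)
# ...equals the distance between their (d // tau)-dimensional coordinates.
dist_reduced = np.linalg.norm(x @ V_bulk - y @ V_bulk)
# The distance between the unfiltered vectors is generally larger.
dist_original = np.linalg.norm(x - y)
```

Note that the equality holds between the filtered and the reduced representations; the filter itself does change distances relative to the raw embeddings, which is its purpose.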

Empirical Validation / Results

Experiments are conducted on the MTEB benchmark (STS, Classification, Clustering, PairClassification, Reranking, Retrieval) using three LLM backbones: Qwen2.5-0.5B, Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3.

Main Results (Table 1):

Model + Baseline        Config      Avg. (↑)
Qwen + PromptEOL        Vanilla     50.07
  + EmbFilter (τ=2)     dim=448     54.57 (+9.0%)
Qwen + ECHO             Vanilla     46.03
  + EmbFilter (τ=2)     dim=448     52.55 (+14.1%)
Llama + PromptEOL       Vanilla     55.13
  + EmbFilter (τ=2)     dim=2048    56.79 (+3.0%)
Mistral + ECHO          Vanilla     53.21
  + EmbFilter (τ=2)     dim=2048    56.10 (+5.4%)

EmbedFilter consistently improves results across all setups, with gains up to +14.1%. Even at $\tau = 8$ (dimension reduced to 1/8), performance remains competitive or superior.

Ablation Studies (Table 5):

  • Truncation (keeping the first half of the dimensions) and random dimension selection both underperform the vanilla baseline, indicating that the improvement is not due to dimension reduction alone.
  • Filtering only the secondary (small singular values) subspace gives improvements (+3.12), but the full EmbedFilter (both edges removed) achieves the best result.
  • The optimal strategy (filtering dimensions with highest $\Delta_\pi$) yields nearly identical performance to EmbedFilter, confirming its effectiveness without task-specific calibration.

Comparison with Whitening (Table 6): EmbedFilter outperforms BERT-whitening (55% improvement vs 5.9%) despite requiring no calibration data.

Dimensionality Reduction (Table 4): With Llama and $\tau=8$ (512 dimensions), EmbedFilter achieves an average of 56.61 on MTEB, surpassing strong baselines like SimCSE (53.54) and coCondenser (55.48) while using far fewer dimensions.

Theoretical and Practical Implications

  • Theoretical: The work provides a mechanistic interpretation of LLM embedding anisotropy: the unembedding matrix encodes an "edge spectrum" subspace that biases representations toward frequent, uninformative tokens. This explains the suboptimal zero-shot performance and suggests that the unembedding matrix can serve as a "feature lens" for analyzing embedding spaces.
  • Practical: EmbedFilter is a lightweight, training-free post-processing step that can be applied to any LLM. It improves performance across multiple backbones and prompt strategies (PromptEOL, ECHO, MetaEOL, GenEOL). The built-in dimensionality reduction reduces index storage and speeds up retrieval by a factor of $\tau$, making LLM-based embeddings viable for large-scale applications.
  • Broader impact: The insights suggest that future text embedding training should explicitly account for and suppress high-frequency token biases, potentially by designing loss functions that penalize alignment with edge spectrum directions.
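To make the storage claim in the practical bullet concrete, the back-of-the-envelope sketch below estimates the size of a dense vector index before and after reduction. The corpus size (one million vectors) and the float32 storage format are assumptions chosen for illustration; the dimensions 4096 and 512 correspond to the Llama setting with $\tau = 8$ mentioned earlier.

```python
def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Size of a dense index storing one float32 vector per document."""
    return num_vectors * dim * bytes_per_value

num_vectors = 1_000_000            # hypothetical corpus size
full_dim, tau = 4096, 8
reduced_dim = full_dim // tau      # 512 dimensions after reduction

full_size = index_bytes(num_vectors, full_dim)
reduced_size = index_bytes(num_vectors, reduced_dim)
ratio = full_size / reduced_size   # storage shrinks by a factor of tau
```

Under these assumptions the index shrinks from about 16.4 GB to about 2.0 GB. Brute-force similarity search scales linearly with the dimension, so its cost would drop by roughly the same factor; approximate-nearest-neighbor indexes may behave differently.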

Conclusion

The authors discover that the unembedding matrix of LLMs encodes a subspace (edge spectrum) responsible for projecting text embeddings toward high-frequency tokens, which limits their zero-shot semantic representation ability. They propose EmbedFilter, a simple linear transformation that filters out this subspace, yielding substantial improvements on MTEB (up to +14.1%) across multiple model families without training. EmbedFilter also enables lossless dimensionality reduction, improving efficiency. The work provides a mechanistic understanding of LLM embedding shortcomings and offers a practical, principled fix. Future work may explore optimal asymmetric filtering strategies and deeper integration into embedding training pipelines.
