LLM2Vec-Gen: Generative Embeddings from Large Language Models - Summary
Summary (Overview)
- Novel Paradigm: Proposes a self-supervised framework where text embeddings represent the LLM's potential response to an input, rather than encoding the input's semantics, to bridge the "input-output gap" in embedding tasks.
- Efficient Training: Adds trainable "thought" and "compression" special tokens to a frozen LLM's vocabulary. Training uses a dual objective of reconstructing the LLM's own response and aligning with an unsupervised teacher embedding, requiring only unlabeled queries.
- State-of-the-Art Performance: Achieves new self-supervised SOTA on the MTEB benchmark, improving over the best unsupervised teacher by up to 9.3% and closing over 60% of the gap to supervised methods.
- Transferred Capabilities: Demonstrates significant improvements in safety (up to 43.2% reduction in harmful content retrieval on AdvBench-IR) and reasoning-intensive retrieval (up to 29.3% improvement on BRIGHT) by inheriting these aligned capabilities from the frozen LLM backbone.
- Interpretable Embeddings: The learned compression token embeddings are interpretable and can be decoded back into text via the LLM, revealing the semantic content they capture (e.g., safe refusals or answer concepts).
Introduction and Theoretical Foundation
Text embeddings are fundamental for NLP applications like semantic search and retrieval. While LLM-based encoders have advanced the state-of-the-art, they largely follow an input-centric paradigm, where the model encodes the semantic content of the input text itself. This paradigm struggles with the core requirement of embedding tasks: mapping diverse inputs (e.g., different queries about the same topic) to similar outputs in the embedding space. Bridging this "input-output gap" typically requires large-scale, curated paired data and contrastive learning.
This work proposes a paradigm shift: instead of encoding the input, an embedding model should encode the LLM's potential response to that input (see Figure 1). The motivation is twofold:
- It directly addresses the input-output gap, as semantically distinct inputs that elicit similar LLM responses (e.g., different harmful queries that all receive a refusal) will naturally be mapped closer together.
- It allows the transfer of LLM capabilities like safety alignment and reasoning into the embedding space, as these capabilities manifest in the model's response, not its input.
The theoretical foundation connects to Joint Embedding Predictive Architectures (JEPA), which advocate learning by predicting in representation space. LLM2Vec-Gen's alignment objective shares this motivation, predicting a target representation (the teacher's embedding of the response) rather than reconstructing raw tokens.
Methodology
LLM2Vec-Gen is a self-supervised framework that distills a frozen LLM's potential response into a fixed set of latent suffix embeddings.
Core Components and Process (see Figure 2):
- Data Generation: For a corpus of unlabeled queries q, generate a response r = M(q) for each query using the frozen LLM M.
- Teacher Embedding: Use an unsupervised teacher encoder E_T (e.g., LLM2Vec) to create a target embedding for the response: e_r = E_T(r).
- Special Tokens: Add two types of trainable tokens to M's vocabulary:
- thought tokens t_1, ..., t_m: Act as an intermediate computational buffer.
- compression tokens c_1, ..., c_k: Designed to capture the semantic content of the response. The special tokens are appended to the input query, giving the sequence [q; t_1, ..., t_m; c_1, ..., c_k].
- Forward Pass & Embedding Formation: Process the extended sequence through the frozen LLM to get the last-layer hidden states of the compression tokens, H_c. These are projected via lightweight MLPs:
- f_rec(H_c) (for reconstruction)
- f_align(H_c) (for alignment)
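The forward pass above can be sketched in a few lines. This is a toy illustration only: the frozen LLM is mocked with deterministic random hidden states, and the token ids, dimensions, single-linear-layer "MLPs" (`W_rec`, `W_align`), and mean-pooling over compression tokens are all illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_thought, n_comp, d_embed = 64, 10, 10, 32

# Hypothetical stand-in for the frozen LLM: returns last-layer hidden
# states for a token sequence (random here, real activations in practice).
def frozen_llm_hidden_states(token_ids):
    return rng.standard_normal((len(token_ids), d_model))

# Trainable pieces: the special-token ids appended to the query, plus two
# lightweight projection heads (single linear layers in this sketch).
W_rec = rng.standard_normal((d_model, d_model)) * 0.02
W_align = rng.standard_normal((d_model, d_embed)) * 0.02

def embed_query(query_token_ids):
    # Append thought tokens, then compression tokens, to the query.
    thought = list(range(50000, 50000 + n_thought))
    comp = list(range(51000, 51000 + n_comp))
    seq = list(query_token_ids) + thought + comp
    h = frozen_llm_hidden_states(seq)
    h_comp = h[-n_comp:]              # hidden states of the compression tokens
    z_rec = h_comp @ W_rec            # soft prompt for the reconstruction pass
    e_q = (h_comp @ W_align).mean(0)  # pooled alignment embedding (pooling assumed)
    return z_rec, e_q
```

Only `W_rec`, `W_align`, and the special-token embeddings would receive gradients; `frozen_llm_hidden_states` stays fixed.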
Training Objectives:
- Reconstruction Loss (L_rec): Forces the compressed representation to preserve the content of the response r. The projected tokens f_rec(H_c) are used as a soft prompt in a second forward pass, and the LLM is trained to reconstruct r via next-token prediction.
- Embedding Alignment Loss (L_align): Ensures the final embedding e_q = f_align(H_c) matches the teacher's response embedding e_r.
The final loss is L = L_rec + L_align. Only the special tokens and the two MLP projection layers are trained; the LLM backbone remains frozen.
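The dual objective can be sketched as below. The concrete loss forms, token-averaged cross-entropy for reconstruction and one-minus-cosine for alignment, are plausible assumptions for illustration; the paper's exact formulations may differ.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def reconstruction_loss(logits, target_ids):
    # Mean next-token cross-entropy for reconstructing the response r
    # from the soft prompt; logits has shape (len(r), vocab_size).
    probs = softmax(logits)
    return float(-np.log(probs[np.arange(len(target_ids)), target_ids]).mean())

def alignment_loss(e_q, e_r):
    # 1 - cosine similarity between the student embedding and the
    # teacher's response embedding (assumed distance, not verbatim).
    cos = e_q @ e_r / (np.linalg.norm(e_q) * np.linalg.norm(e_r))
    return float(1.0 - cos)

def total_loss(logits, target_ids, e_q, e_r):
    # Unweighted sum of the two objectives.
    return reconstruction_loss(logits, target_ids) + alignment_loss(e_q, e_r)
```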
Inference: Requires only a single forward pass. Append the trained special tokens to the input query, extract the compression-token hidden states H_c, and apply the alignment projection f_align (with pooling) to obtain the final embedding e_q.
Empirical Validation / Results
Experimental Setup:
- Models: Applied to Qwen-3 (0.6B to 8B), Qwen-2.5 (0.5B to 7B), and Llama-3.1/3.2 (1B to 8B) families.
- Training Data: 160K single-turn queries from the Tulu dataset. Uses the LLM's own generated responses, not ground-truth Tulu answers.
- Baselines: Compared against Echo Embeddings, HyDE, InBedder, GIRCSE, and LLM2Vec.
- Evaluation:
- General Embeddings: MTEB(eng, v2) benchmark (41 tasks across 7 categories).
- Safety: AdvBench-IR (measures retrieval of harmful content; lower score is better).
- Reasoning: BRIGHT benchmark (reasoning-intensive retrieval; higher nDCG@10 is better).
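For reference, nDCG@10, the metric reported for BRIGHT, can be computed as below. This uses the linear-gain DCG variant; some implementations use an exponential gain (2^rel − 1) instead.

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked items.
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    return float((rel * discounts).sum())

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ranking.
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0
```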
Key Results:
1. MTEB Performance (State-of-the-Art): LLM2Vec-Gen achieves new SOTA for self-supervised methods on MTEB. The following table shows detailed results for the Qwen-3 family:
Table 1: Results on MTEB (eng, v2) for Qwen-3 Models
| Method | Retr. (10) | Rerank. (2) | Clust. (8) | Pair. (3) | Class. (8) | STS (9) | Summ. (1) | Avg. (41) |
|---|---|---|---|---|---|---|---|---|
| Qwen-3-1.7B | ||||||||
| LLM2Vec | 34.9 | 39.9 | 41.1 | 76.4 | 71.1 | 73.4 | 30.4 | 54.8 |
| LLM2Vec-Gen | 37.0 (+6.2%) | 44.9 (+12.6%) | 49.3 (+19.8%) | 75.6 (-1.1%) | 74.4 (+4.7%) | 77.3 (+5.3%) | 29.0 (-4.6%) | 58.6 (+6.9%) |
| Qwen-3-4B | ||||||||
| LLM2Vec | 41.1 | 40.0 | 43.0 | 78.5 | 72.5 | 71.6 | 31.1 | 56.8 |
| LLM2Vec-Gen | 38.0 (-7.5%) | 45.2 (+12.8%) | 50.9 (+18.3%) | 78.5 (+0.0%) | 76.6 (+5.6%) | 77.9 (+8.7%) | 28.5 (-8.4%) | 59.9 (+5.5%) |
| Qwen-3-8B | ||||||||
| LLM2Vec | 42.7 | 40.9 | 40.6 | 77.3 | 72.5 | 72.6 | 31.7 | 56.8 |
| LLM2Vec-Gen | 42.2 (-1.3%) | 46.7 (+14.2%) | 50.3 (+23.9%) | 80.8 (+4.5%) | 79.1 (+9.2%) | 80.3 (+10.5%) | 31.9 (+0.7%) | 62.1 (+9.3%) |
- LLM2Vec-Gen consistently outperforms its LLM2Vec teacher across model families and scales (see Figure 3), acting as a "performance equalizer."
- Largest gains are in Clustering, Classification, and Semantic Textual Similarity (STS)—tasks where mapping diverse inputs to similar outputs is crucial.
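The relative improvements reported in Table 1 are plain percent changes over the LLM2Vec teacher, e.g. for the Avg. column:

```python
def pct_change(new, old):
    # Relative improvement of the student over the teacher, in percent.
    return 100.0 * (new - old) / old

# Avg. MTEB score, Qwen-3-8B: LLM2Vec 56.8 -> LLM2Vec-Gen 62.1 (+9.3%)
gain_8b = pct_change(62.1, 56.8)
# Avg. MTEB score, Qwen-3-1.7B: 54.8 -> 58.6 (+6.9%)
gain_1_7b = pct_change(58.6, 54.8)
```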
2. Safety and Reasoning Performance:
Table 2: Safety (AdvBench-IR) and Reasoning (BRIGHT) Evaluation
| Backbone | Method | AdvBench-IR ↓ | BRIGHT ↑ |
|---|---|---|---|
| Qwen-3-1.7B | LLM2Vec | 46.7 | 14.0 |
| Qwen-3-1.7B | LLM2Vec-Gen | 26.5 (-43.2%) | 15.1 (+8.0%) |
| Qwen-3-4B | LLM2Vec | 50.8 | 15.7 |
| Qwen-3-4B | LLM2Vec-Gen | 34.8 (-31.4%) | 19.2 (+22.1%) |
| Qwen-3-8B | LLM2Vec | 54.2 | 14.9 |
| Qwen-3-8B | LLM2Vec-Gen | 44.4 (-18.1%) | 19.3 (+29.3%) |
- Safety: LLM2Vec-Gen shows significantly safer retrieval behavior (lower scores on AdvBench-IR), as it encodes the LLM's refusal response rather than the harmful query's intent.
- Reasoning: LLM2Vec-Gen demonstrates superior reasoning-intensive retrieval (higher scores on BRIGHT), with improvements scaling with model size, indicating effective transfer of the LLM's reasoning capabilities.
3. Ablation Studies (Key findings on MTEB-Lite):
- Training Objective: Both L_rec and L_align are critical. Removing the alignment loss crashes performance (41.8 vs. 62.4). Removing the reconstruction loss slightly reduces performance (62.1) and destroys embedding interpretability.
- Special Tokens: Both thought and compression tokens contribute. Performance improves up to ~20 tokens (default), then plateaus.
- Response Generator: Using the model's own responses works best. Using stronger external models (e.g., Gemini) does not improve performance, suggesting in-distribution responses are easier to compress.
- Embedding Teacher: Must share the same model family as the student LLM for best performance. Cross-family teachers degrade results.
- Frozen LLM: Training only special tokens/MLPs (LLM2Vec-Gen) is highly efficient. Adding LoRA to the backbone offers minor gains but requires separate weights for embedding vs. generation tasks.
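A back-of-the-envelope count illustrates why training only the special tokens and MLP heads is so lightweight. All sizes here (hidden width 4096, 20+20 special tokens, single-linear-layer heads, an 8B backbone) are assumptions for illustration, not the paper's reported configuration.

```python
def trainable_params(d_model, d_embed, n_thought, n_comp):
    # Embeddings for the new special tokens plus two linear projection
    # heads (weight + bias each); the backbone contributes nothing.
    token_embeds = (n_thought + n_comp) * d_model
    head_rec = d_model * d_model + d_model
    head_align = d_model * d_embed + d_embed
    return token_embeds + head_rec + head_align

total = trainable_params(d_model=4096, d_embed=4096, n_thought=20, n_comp=20)
frac = total / 8e9  # fraction of a hypothetical 8B-parameter backbone
```

Under these assumptions the trainable footprint is tens of millions of parameters, well under 1% of the backbone.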
Theoretical and Practical Implications
- Paradigm Shift for Embeddings: Demonstrates the viability and strength of a response-centric embedding paradigm, moving beyond traditional input encoding.
- Efficient Knowledge Transfer: Provides a mechanism to efficiently transfer complex LLM capabilities (safety, reasoning) into a compact, single-vector embedding space without fine-tuning the core model.
- Self-Supervised SOTA: Establishes a new strong baseline for self-supervised text embedding, significantly reducing the performance gap with supervised methods and mitigating the need for large-scale labeled contrastive data.
- Interpretable Latent Representations: The decodable nature of the compression tokens offers a rare degree of interpretability for dense embeddings, allowing inspection of what semantic content they capture.
- Practical Deployment: The frozen backbone enables a single model to serve both generative and embedding tasks seamlessly, unlike methods requiring full or partial (LoRA) fine-tuning.
Conclusion
LLM2Vec-Gen introduces a novel self-supervised framework that redefines text embedding by learning to represent an LLM's potential response. By freezing the LLM and training only lightweight special tokens via a dual reconstruction and alignment objective, the method effectively bridges the input-output gap, achieves state-of-the-art self-supervised performance on MTEB, and successfully transfers safety and reasoning capabilities into the embedding space. The resulting embeddings are both high-performing and interpretable. This work positions generative embeddings as a powerful and efficient alternative for adapting large language models into high-quality text encoders, especially in data-scarce scenarios.
Open Frontiers
The paper outlines promising future directions:
- Full JEPA Mode: Exploring a variant where the same frozen LLM acts as both the world model (generator) and target encoder, eliminating the need for a separate teacher model and potentially the reconstruction loss.
- Hyper-speed Inference via Latent Chaining: Investigating whether the compressed latent tokens can be chained across multiple forward passes to enable multi-step "reasoning" in compressed space, bypassing slow autoregressive decoding.
- Latent Communication Between Agents: Proposing the use of LLM2Vec-Gen's dense, decodable latent tokens as an efficient and transparent communication protocol for multi-agent systems, overcoming the bottleneck of verbose natural-language exchange.