LMEB: Long-horizon Memory Embedding Benchmark - Summary
Overview
- Introduces LMEB, a comprehensive benchmark for evaluating text embedding models on long-horizon memory retrieval tasks, addressing a gap in existing benchmarks (like MTEB) that focus on traditional passage retrieval.
- Categorizes memory into four types along two dimensions (Level of Abstraction and Temporal Dependency): Episodic (low abstraction, high dependency), Dialogue (high abstraction, high dependency), Semantic (low abstraction, low dependency), and Procedural (high abstraction, low dependency).
- Consolidates 22 datasets across these four memory types, comprising 193 zero-shot retrieval tasks, with a mix of AI-generated and human-annotated data.
- Key Findings from Evaluation: 1) LMEB presents a reasonable level of difficulty (top N@10 score: 61.41). 2) Model size does not guarantee better performance; smaller models can outperform larger ones. 3) LMEB and MTEB are orthogonal (low correlation), indicating that performance on traditional retrieval does not generalize to long-horizon memory tasks.
- Provides an open-source, extensible framework built on MTEB's standards, with a unified data format, evaluation toolkit, and public leaderboard to foster reproducible research.
Introduction and Theoretical Foundation
Memory embeddings are fundamental for advanced, memory-augmented systems (e.g., agentic systems) that require storing, retrieving, and reasoning over vast amounts of information. While retrieval is central to these systems, current text embedding benchmarks (e.g., MTEB, BEIR) primarily evaluate traditional passage retrieval—searching for well-organized information in a corpus.
This leaves a significant gap, as long-horizon memory retrieval involves recalling fragmented, context-dependent, and temporally distant information, which existing benchmarks fail to assess effectively. To bridge this gap, the paper introduces the Long-horizon Memory Embedding Benchmark (LMEB).
The benchmark is built upon a taxonomy of memory inspired by cognitive science and prior work [Du et al., 2025a], categorizing it into four types:
- Episodic Memory: Retrieval of past events linked to specific temporal and contextual cues.
- Dialogue Memory: Maintaining context across multi-turn interactions.
- Semantic Memory: Recalling general knowledge and facts, independent of specific context.
- Procedural Memory: Retrieving learned skills, action sequences, and experiences.
These types are characterized along two key dimensions (visualized in Figure 2 of the paper):
- Level of Abstraction: From concrete (episodic, semantic) to abstract (dialogue, procedural).
- Temporal Dependency: The extent to which retrieval relies on temporal context (high for episodic/dialogue, low for semantic/procedural).
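The two-dimensional taxonomy above can be sketched as a small lookup table. This is purely illustrative; the names and string values are not from the LMEB codebase:

```python
# Illustrative encoding of the paper's four memory types along its two
# dimensions (Level of Abstraction, Temporal Dependency).
MEMORY_TAXONOMY = {
    # type:        (abstraction, temporal_dependency)
    "episodic":   ("low",  "high"),
    "dialogue":   ("high", "high"),
    "semantic":   ("low",  "low"),
    "procedural": ("high", "low"),
}

def types_with(abstraction=None, temporal=None):
    """Return memory types matching the given dimension values (None = any)."""
    return sorted(
        t for t, (a, d) in MEMORY_TAXONOMY.items()
        if (abstraction is None or a == abstraction)
        and (temporal is None or d == temporal)
    )
```

For example, the high-temporal-dependency cell of the grid contains dialogue and episodic memory, matching Figure 2.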
Methodology
1. Benchmark Construction & Data: LMEB consolidates 22 existing datasets into a unified framework. Table 1 provides the core statistics.
Table 1: Statistics of datasets in LMEB benchmark.
| Memory Type | Dataset | Granularity | # Tasks | Query Source | Corpus Source | #Query | #Corpus | Avg. D / Q |
|---|---|---|---|---|---|---|---|---|
| Episodic | EPBench, KnowMeBench | Event | 69 | AI | Hybrid (AI/Human) | 5,806 | 29,900 | 2.807 |
| Dialogue | LoCoMo, LongMemEval, etc. | Multi (Turn, Session) | 42 | Hybrid | Hybrid | 21,156 | 1,689,280 | 5.907 |
| Semantic | QASPER, NovelQA, etc. | Multi (Sentence, Para.) | 15 | Hybrid | Human | 7,499 | 200,411 | 1.606 |
| Procedural | Gorilla, ToolBench, etc. | Multi (Tool, Experience) | 67 | Hybrid | Hybrid | 124,550 | 157,431 | 1.019 |
Note: Query Source and Corpus Source denote the origin of the queries and corpus, respectively. Avg. D / Q is the average number of relevant documents per query.
All data is converted into a standardized IR-style schema with four files: queries.jsonl, corpus.jsonl, qrels.tsv, and an optional candidates.jsonl (to restrict retrieval to a bounded memory pool, e.g., a specific conversation history).
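A minimal loader for this four-file schema might look like the following. Field names such as `_id`, `text`, and `candidates` are assumptions in the spirit of BEIR/MTEB conventions, not confirmed details of LMEB's format, and the `qrels.tsv` is assumed to be header-free:

```python
import csv
import json
import os

def load_lmeb_task(task_dir):
    """Load one LMEB-style task from its standardized IR files.

    Returns (queries, corpus, qrels, candidates); candidates is None when
    the optional candidates.jsonl is absent and retrieval runs over the
    full corpus.
    """
    def read_jsonl(path):
        with open(path, encoding="utf-8") as f:
            return {r["_id"]: r for r in map(json.loads, f)}

    queries = read_jsonl(os.path.join(task_dir, "queries.jsonl"))
    corpus = read_jsonl(os.path.join(task_dir, "corpus.jsonl"))

    # qrels.tsv: query-id <tab> doc-id <tab> relevance (no header assumed)
    qrels = {}
    with open(os.path.join(task_dir, "qrels.tsv"), encoding="utf-8") as f:
        for qid, did, rel in csv.reader(f, delimiter="\t"):
            qrels.setdefault(qid, {})[did] = int(rel)

    # Optional: restrict each query to a bounded memory pool
    # (e.g., a specific conversation history).
    candidates = None
    cand_path = os.path.join(task_dir, "candidates.jsonl")
    if os.path.exists(cand_path):
        with open(cand_path, encoding="utf-8") as f:
            candidates = {r["_id"]: r["candidates"] for r in map(json.loads, f)}

    return queries, corpus, qrels, candidates
```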
2. Evaluation Protocol:
- Built on top of the MTEB v2 framework for reproducibility.
- Main Metric: NDCG@10 (N@10), with Recall@10 (R@10) also reported.
- Evaluation Setting: Zero-shot retrieval. Models are evaluated in two input formats:
- w/o inst.: Model encodes the query text alone.
- w/ inst.: Model encodes a task instruction concatenated with the query text.
- Models Evaluated: 15 widely-used embedding models, ranging from 239M to 12B parameters (e.g., jina-v5, Qwen3-Embedding, BGE models, NV-Embed-v2, EmbeddingGemma).
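The main metric and the two input formats can be sketched in a few lines of stdlib Python. This is a plain NDCG@k with linear gains, not the MTEB v2 implementation, and the instruction/query separator is an assumption:

```python
import math

def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=10):
    """NDCG@k for one query: discounted cumulative gain over the top-k
    ranked docs, normalized by the ideal DCG from the relevance judgments."""
    gains = [qrels_for_query.get(d, 0) for d in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels_for_query.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def format_query(query, instruction=None):
    """The two input formats: "w/o inst." encodes the query text alone;
    "w/ inst." prepends a task instruction (separator is an assumption)."""
    return query if instruction is None else f"{instruction} {query}"
```

A perfect ranking yields 1.0; pushing the only relevant document down the list discounts its gain logarithmically.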
3. Diversity Analysis: The inter-dataset diversity is quantified using pairwise weighted Jaccard Similarity (JS) over unigram word distributions. The analysis (Figure 3) shows that datasets within the same memory type cluster together, with procedural datasets being particularly distinct, confirming LMEB's coverage of diverse domains.
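The diversity measure above can be approximated with a short stdlib sketch: weighted Jaccard similarity over normalized unigram distributions, where a low score means two datasets share little vocabulary. Whitespace tokenization is an assumption; the paper's exact preprocessing is not specified here:

```python
from collections import Counter

def weighted_jaccard(text_a, text_b):
    """Weighted Jaccard similarity between the unigram distributions of
    two texts: sum of per-word minima over sum of per-word maxima."""
    ca = Counter(text_a.lower().split())
    cb = Counter(text_b.lower().split())
    pa = {w: c / sum(ca.values()) for w, c in ca.items()}
    pb = {w: c / sum(cb.values()) for w, c in cb.items()}
    vocab = set(pa) | set(pb)
    num = sum(min(pa.get(w, 0.0), pb.get(w, 0.0)) for w in vocab)
    den = sum(max(pa.get(w, 0.0), pb.get(w, 0.0)) for w in vocab)
    return num / den if den else 0.0
```

Identical texts score 1.0 and disjoint vocabularies score 0.0, so low pairwise scores across memory types support the claim that LMEB covers diverse domains.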
Empirical Validation / Results
The main results are presented in Table 2 (w/o inst.) and Table 3 (w/ inst.). Key observations:
1. Reasonable Difficulty & Model Performance:
The top model (bge-multilingual-gemma2) achieves a Mean (Dataset) N@10 score of 61.41 under the w/ inst. setting, indicating LMEB is challenging but not insurmountable.
2. Model Size is Not Deterministic: Smaller models frequently match or outperform larger ones.
- In the w/o inst. setting, EmbeddingGemma-300M (307M params) achieves the best score on LMEB-Episodic (N@10: 68.19) and is highly competitive overall.
- bge-m3 (Dense) (560M params) also performs strongly, outperforming many larger models.
3. Orthogonality with MTEB: Correlation analysis between LMEB and the MTEB (eng, v2) retrieval subset reveals very low correlation coefficients.
- Pearson: -0.115
- Spearman: -0.130

This orthogonality confirms that LMEB evaluates distinct capabilities not captured by traditional passage retrieval benchmarks. Figure 5 visualizes this lack of correlation.
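Coefficients like these can be computed over any two lists of per-dataset scores with a small stdlib sketch. Spearman is implemented as Pearson over ranks, ignoring ties for simplicity; `scipy.stats.pearsonr`/`spearmanr` provide tie-aware versions:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson over rank positions (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```

Feeding in each model's LMEB and MTEB scores as `x` and `y` would reproduce the kind of near-zero coefficients reported above.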
4. Variable Impact of Task Instructions: The effect of adding task instructions (w/ inst.) varies significantly by model (Figure 4).
- Positive Impact: Models like bge-multilingual-gemma2 and KaLM-Embedding-Gemma3 benefit from instructions.
- Neutral/Negative Impact: NV-Embed-v2 and jina-v5-text-small are largely insensitive; bge-m3 and EmbeddingGemma-300M perform better without instructions.

This suggests that a model's sensitivity to instructions is heavily shaped by its specific training data and methodology.
5. MTEB Generalization is Uneven:
- Poor Generalization to Episodic/Dialogue: Correlation with MTEB is very low (Dialogue: Pearson -0.496), showing traditional retrieval prowess does not transfer to fragmented, context-heavy memory tasks.
- Partial Generalization to Semantic/Procedural: Weak positive correlation exists (Procedural: Spearman 0.429), likely due to some overlap in task domains (e.g., code/tool retrieval).
Theoretical and Practical Implications
Theoretical Implications:
- Establishes a novel taxonomy and evaluation framework for memory embeddings, moving beyond semantic similarity to incorporate abstraction and temporal dependency.
- Provides empirical evidence that scaling model parameters is insufficient for mastering long-horizon memory retrieval; architectural innovation and specialized training are critical.
- Demonstrates the orthogonality of evaluation domains, arguing for a multi-faceted benchmark suite to properly assess embedding models for real-world applications.
Practical Implications:
- For Researchers: LMEB offers a standardized, open-source tool to diagnose model weaknesses in memory retrieval, guiding the development of next-generation embeddings for agents and memory-augmented systems.
- For Practitioners: Highlights that model selection should be task-specific; a top performer on MTEB may fail on memory-intensive applications. LMEB serves as a crucial validation benchmark for deployment in such scenarios.
- For the Community: The extensible framework (easy addition of new models/datasets) and public leaderboard will accelerate reproducible research and progress in this sub-field.
Conclusion
The Long-horizon Memory Embedding Benchmark (LMEB) fills a critical gap in the evaluation of text embedding models by focusing on complex, context-dependent, and temporally extended memory retrieval tasks. By systematically evaluating 15 models across 4 memory types and 193 tasks, the work reveals that:
- Long-horizon memory retrieval presents unique challenges not addressed by current benchmarks.
- Performance on traditional retrieval (MTEB) does not generalize to these tasks.
- Larger models are not inherently better, emphasizing the need for specialized architectures and training.
LMEB provides a standardized, reproducible foundation for future research. The authors open-source the benchmark to drive innovation towards embedding models capable of powering sophisticated, real-world memory-augmented systems. Future work may involve expanding LMEB to include multilingual and multimodal memory retrieval tasks.