# KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking

> KaLM-Reranker-V1 introduces a *fast but not late-interaction* (FBNL) encoder–decoder reranker that performs on par with strong rerankers of comparable size at up to 203× lower online cost (for long passages).

- **Source:** [arXiv](https://arxiv.org/abs/2606.22807)
- **Published:** 2026-06-24
- **Permalink:** https://picx.dev/p/1PgpkB
- **Whiteboard:** https://picx.dev/p/1PgpkB/image

## Summary

- **New reranking paradigm (FBNL):** KaLM-Reranker-V1 introduces a *fast but not late-interaction* (FBNL) design based on an encoder–decoder architecture. The encoder pre-encodes passages offline (reusable across queries), while the decoder models query intent and computes fine-grained relevance via cross-attention. This decouples query–passage computation without sacrificing expressive interaction.
- **Matryoshka Embedding Pooling (MEP):** Passage representations are compressed along the sequence dimension (by factor \(r \in \{1,2,4,8,16,32\}\)) via mean pooling, enabling flexible trade-offs between storage cost and reranking quality. At moderate compression (\(r=2\)–\(8\)), performance degradation is negligible.
- **Strong performance with superior efficiency:** On BEIR, KaLM-Reranker-V1-Large (4B) achieves an average nDCG@10 of 62.87, on par with Qwen3-Reranker-4B (63.50) but at ~5.5× lower online computation cost. On MIRACL (18 languages), it outperforms bge-reranker-v2-gemma (2.5B) while being nearly 2× more efficient. On LMEB, even the 0.27B Nano model competes with 7–12B embedding models.
- **Multi-stage training:** A progressive three-stage pipeline (general reranking → task-specific adaptation → fine-grained distillation) improves generalization and discriminative ability.
- **In-depth analysis:** ROC-AUC analysis confirms that moderate compression preserves discriminative power; larger models are more robust to compression. Efficiency gains reach up to 203× for long passages (\(n=4096\)) and 33× at \(r=8\).

## Introduction and Theoretical Foundation

**Background.** Neural retrieval systems typically use a two-stage pipeline: first-stage retrieval recalls a small set of candidates, and a reranking stage performs finer-grained relevance modeling. Reranking quality is critical for search, recommendation, and retrieval-augmented generation (RAG).

**Limitations of existing rerankers.** Most current rerankers (encoder-based or decoder-based) jointly encode query and passage, tightly coupling their computation. This forces each query–passage pair to be recomputed online, making pre-computation impossible and scaling costly. Late-interaction methods (e.g., ColBERT) decouple query and passage encoding but limit interaction to similarity operations over independently contextualized tokens, missing deep relevance signals.

**Proposed solution.** The paper introduces the *fast but not late-interaction* (FBNL) paradigm. It uses an encoder–decoder architecture where:
- The **encoder** produces reusable passage representations \(H_p \in \mathbb{R}^{n \times d}\) (Eq. 1).
- The **decoder** takes the system instruction, task instruction \(I\), and query \(q\) and attends to \(H_p\) via merged self-attention and cross-attention (Eqs. 2–4). Relevance is scored as the softmax-normalized likelihood of "yes" vs. "no" (Eq. 5).

This design preserves fine-grained interaction (unlike late interaction) while supporting offline passage encoding (unlike joint encoders).
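The offline/online split described above can be sketched as a passage cache that is filled once and then reused for every query. This is a minimal illustration, not the paper's implementation: `PassageCache`, `rerank`, and the `encode` and `score` callables are hypothetical names standing in for the encoder and the decoder-based scorer.

```python
from typing import Callable, Dict, List, Sequence, Tuple

PassageRep = List[List[float]]  # one vector per (possibly pooled) passage token


class PassageCache:
    """Offline store of encoder outputs, keyed by passage id (hypothetical helper)."""

    def __init__(self, encode: Callable[[str], PassageRep]) -> None:
        self._encode = encode
        self._store: Dict[str, PassageRep] = {}

    def index(self, passages: Dict[str, str]) -> None:
        """Run the encoder once per passage, independent of any query."""
        for pid, text in passages.items():
            self._store[pid] = self._encode(text)

    def get(self, pid: str) -> PassageRep:
        """Return the cached representation for a passage id."""
        return self._store[pid]


def rerank(
    query: str,
    candidate_ids: Sequence[str],
    cache: PassageCache,
    score: Callable[[str, PassageRep], float],
) -> List[Tuple[str, float]]:
    """Online step: score each cached passage against the query, best first."""
    scored = [(pid, score(query, cache.get(pid))) for pid in candidate_ids]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Only `rerank` runs at query time, so the encoder cost is paid once per passage rather than once per query–passage pair.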

## Methodology

### Model Architecture
Built on T5Gemma2 foundation models, KaLM-Reranker-V1 comes in three sizes:

| Model | Activated Params | Non-Embedding Params | Embedding Params | #Layers | Sequence Length | Document Token Dim. | MEP Support |
|-------|-----------------|---------------------|------------------|---------|-----------------|--------------------|-------------|
| Nano  | 0.27B           | 100M                | 168M             | 18      | 128K            | 640                | 1×–32×      |
| Small | 1B              | 698M                | 302M             | 26      | 128K            | 1152               | 1×–32×      |
| Large | 4B              | 3209M               | 675M             | 34      | 128K            | 2560               | 1×–32×      |

**Table 1:** KaLM-Reranker-V1 series specifications. MEP = Matryoshka Embedding Pooling.

**Encoder–Decoder Computation.**  
Given document \(p\), the encoder produces:
$$H_p = \text{Enc}(p) \in \mathbb{R}^{n \times d} \tag{1}$$
The decoder receives instruction \(I\) and query \(q\). With decoder input \(X \in \mathbb{R}^{m \times d}\), the merged attention uses:
$$\begin{aligned}
Q &= XW^Q, \\
K &= [X; H_p] W^K, \\
V &= [X; H_p] W^V,
\end{aligned} \tag{3}$$
and the output is:
$$O = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_h}} + M \right) V W^O, \tag{4}$$
where \(M\) masks invalid positions. The relevance score is:
$$\text{score}(q,p) = \frac{\exp(z_{\text{yes}})}{\exp(z_{\text{yes}}) + \exp(z_{\text{no}})} \tag{5}$$
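To make Eqs. 3–5 concrete, here is a minimal single-head sketch in plain Python. Several details are assumptions rather than facts from the paper: the output projection \(W^O\) is omitted, the mask \(M\) is taken to be causal over decoder positions with every passage position visible, and all function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def merged_attention(X, H_p, W_q, W_k, W_v):
    """Single-head merged attention: queries from X, keys/values from [X; H_p].

    Assumed mask: decoder position i sees decoder positions <= i (causal)
    and every cached passage position.
    """
    m, d_h = len(X), len(W_q[0])
    context = X + H_p  # row-wise concatenation [X; H_p]
    Q, K, V = matmul(X, W_q), matmul(context, W_k), matmul(context, W_v)
    out = []
    for i in range(m):
        scores = []
        for j in range(len(context)):
            visible = j >= m or j <= i  # passage token, or causal decoder token
            dot = sum(q * k for q, k in zip(Q[i], K[j]))
            scores.append(dot / math.sqrt(d_h) if visible else -math.inf)
        weights = softmax(scores)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(d_h)])
    return out


def relevance_score(z_yes, z_no):
    """Eq. 5: two-way softmax over the "yes" and "no" logits."""
    m = max(z_yes, z_no)
    e_yes, e_no = math.exp(z_yes - m), math.exp(z_no - m)
    return e_yes / (e_yes + e_no)
```

Because the keys and values for \(H_p\) depend only on the passage, the passage side never has to be re-encoded when the query changes; only the \(m\) decoder positions issue queries.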

**Prompt Template.** The encoder takes only the document; the decoder receives system instruction + user instruction + query. The model predicts "yes"/"no" as the next token (Table 2).

### Training Objective

**Supervised Fine-Tuning (SFT).** For a query–document pair with binary label \(l \in \{\text{yes},\text{no}\}\), the loss is:
$$\mathcal{L}_{\text{sft}}(I,q,p,l) = -\log\frac{\exp(z_l)}{\exp(z_{\text{yes}})+\exp(z_{\text{no}})} \tag{6}$$

**Matryoshka Embedding Pooling (MEP).** To reduce storage, MEP compresses passage representations along the sequence dimension by ratio \(r\):
$$H_p^{(r)}[j] = \text{MeanPool}\left( H_p[(j-1)r+1 : \min(jr, n)] \right), \quad j = 1,\dots,\lceil n/r \rceil \tag{7}$$
The final SFT objective over a set of ratios \(R\) is:
$$\mathcal{L}_{\text{sft}}(I,q,p,l,R) = \sum_{r \in R} \lambda_r \left( -\log\frac{\exp(z_l^{(r)})}{\exp(z_{\text{yes}}^{(r)})+\exp(z_{\text{no}}^{(r)})} \right) \tag{9}$$
By default \(\lambda_r = 1\) and \(R = \{1,2,4,8,16,32\}\).
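Eq. 7 and the multi-ratio objective in Eq. 9 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the training code: the function names are hypothetical, passage representations are plain lists of vectors, and the per-ratio logits are assumed to have been produced already by running the decoder on each pooled representation.

```python
import math


def mep_pool(H_p, r):
    """Eq. 7: mean-pool consecutive groups of r token vectors (the last group may be shorter)."""
    if r < 1:
        raise ValueError("compression ratio r must be >= 1")
    pooled = []
    for start in range(0, len(H_p), r):
        group = H_p[start:start + r]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def yes_no_nll(z_yes, z_no, label):
    """Eq. 6: negative log-probability of the gold label under a yes/no softmax."""
    if label not in ("yes", "no"):
        raise ValueError("label must be 'yes' or 'no'")
    z_label = z_yes if label == "yes" else z_no
    m = max(z_yes, z_no)
    log_norm = m + math.log(math.exp(z_yes - m) + math.exp(z_no - m))
    return log_norm - z_label


def mep_sft_loss(logits_by_ratio, label, weights=None):
    """Eq. 9: weighted sum of per-ratio losses; logits_by_ratio maps r -> (z_yes, z_no)."""
    total = 0.0
    for r, (z_yes, z_no) in logits_by_ratio.items():
        lam = 1.0 if weights is None else weights[r]  # lambda_r defaults to 1
        total += lam * yes_no_nll(z_yes, z_no, label)
    return total
```

A passage of \(n\) tokens pooled at ratio \(r\) yields \(\lceil n/r \rceil\) vectors, so five tokens at \(r=2\) become three vectors, the last one being a copy of the fifth token.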

**Knowledge Distillation (KD).** For student score \(\hat{y}\) and teacher soft label \(y \in [0,1]\):
$$\mathcal{L}_{\text{kd}} = -y\log\hat{y} - (1-y)\log(1-\hat{y}) \tag{10}$$
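Eq. 10 is a binary cross-entropy against a soft target. A minimal sketch follows; the clamping constant `eps` is an assumption added for numerical safety and is not part of the formula.

```python
import math


def kd_loss(student_score, teacher_label, eps=1e-12):
    """Eq. 10: binary cross-entropy between the student score and the teacher soft label."""
    y_hat = min(max(student_score, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -teacher_label * math.log(y_hat) - (1.0 - teacher_label) * math.log(1.0 - y_hat)
```

For a fixed teacher label \(y\), this loss is smallest when the student score equals \(y\), which is what lets the student inherit graded relevance rather than only a binary decision.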

### Multi-stage Training
Three progressive stages:
1. **General Reranking Ability Learning:** No task instructions; domain-agnostic binary relevance training.
2. **Task-Specific Reranking Adaptation:** Introduce task-specific instructions; higher-quality data.
3. **Fine-Grained Relevance Distillation:** Soft labels come from a stronger teacher; KaLM-Reranker-V1-Large serves as the teacher for Small and Nano.

### Complexity Analysis
**Time complexity** (Table 3):

| Model Type | Interaction Pattern | Time Complexity |
|-----------|-------------------|-----------------|
| Encoder-based reranker | Bidirectional attention | \(O\left(KL\left((|q|+n)^2 d + (|q|+n)d^2\right)\right)\) |
| Decoder-based reranker | Causal attention | Same as above |
| **Enc-Dec (ours)** | Merged self-attn. + cross-attn. | \(O\left(\frac{L}{2}K\left(|q|\left(|q|+\lceil n/r\rceil\right)d + (|q|+\lceil n/r\rceil)d^2\right)\right)\) |

**Space complexity:** Without MEP, caching \(N\) passages costs \(O(Nnd)\). With MEP ratio \(r\), it reduces to \(O(N \lceil n/r \rceil d)\).

**Efficiency gains (Figure 4):**  
- At \(n=4096\) (long passages): **203.4×** speedup over traditional rerankers.  
- At compression ratio \(r=8\): **33.3×** speedup; at \(r=4\): **18.5×**; even at \(r=2\): ~10×.
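The expressions in Table 3 and the storage bound can be turned into a small calculator. This is a sketch under stated assumptions: \(K\) is read as the number of candidate passages per query and \(L\) as the number of layers (neither is defined in this summary), constant factors are dropped, and the function names are hypothetical. It is therefore not expected to reproduce the exact speedups reported for Figure 4, which depend on settings not given here.

```python
import math


def joint_cost(q_len, n, d, layers, candidates):
    """Operation count for a joint encoder- or decoder-based reranker (Table 3, constants dropped)."""
    t = q_len + n
    return candidates * layers * (t * t * d + t * d * d)


def fbnl_online_cost(q_len, n, d, layers, candidates, r=1):
    """Online operation count for the encoder-decoder row of Table 3."""
    n_r = math.ceil(n / r)  # passage length after MEP pooling
    attention = q_len * (q_len + n_r) * d
    projections = (q_len + n_r) * d * d
    return candidates * (layers / 2) * (attention + projections)


def cache_entries(num_passages, n, d, r=1):
    """Number of cached values for N passages of n tokens under MEP ratio r."""
    return num_passages * math.ceil(n / r) * d


def estimated_speedup(q_len, n, d, layers, candidates, r=1):
    """Ratio of joint cost to FBNL online cost for the same workload."""
    return joint_cost(q_len, n, d, layers, candidates) / fbnl_online_cost(
        q_len, n, d, layers, candidates, r
    )
```

The ratio grows with both passage length \(n\) and compression ratio \(r\), because the joint model pays a quadratic term in \(|q|+n\) while the online decoder only attends from the \(|q|\) query-side positions.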

## Empirical Validation / Results

### Benchmarks and Setup
- **BEIR:** 13 heterogeneous English retrieval tasks (nDCG@10).  
- **MIRACL:** multilingual retrieval across 18 languages (nDCG@10).  
- **LMEB:** 6 dialogue memory retrieval tasks (see Appendix).  
- First-stage retriever: KaLM-Embedding-V2.5 (top-100 candidates).  
- Default MEP ratio \(r=4\).
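For readers unfamiliar with the metric, nDCG@10 over a reranked candidate list can be sketched as below. The gain and normalization choices are assumptions: this version uses linear gain with a \(\log_2\) discount and computes the ideal ranking from the same candidate list, whereas benchmark toolkits typically normalize against all judged documents for the query.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k for relevance labels listed in the order the reranker returned them."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant document among the candidates
    return dcg_at_k(ranked_relevances, k) / ideal
```

A reranker improves this score only by moving relevant candidates toward the top of the list, which is why the quality of the first-stage top-100 bounds what reranking can achieve.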

### Main Results on BEIR

| Model | Size | Cost (norm.) | Avg. nDCG@10 |
|-------|------|--------------|--------------|
| Qwen3-Reranker-8B | 8B | 539.7× | 65.11 |
| Qwen3-Reranker-4B | 4B | 236.8× | 63.50 |
| **KaLM-Reranker-V1-L** | **4B** | **43.7×** | **62.87** |
| KaLM-Reranker-V1-S | 1B | 6.9× | 60.01 |
| KaLM-Reranker-V1-N | 0.27B | 1.0× | 57.41 |
| bge-reranker-large | 0.6B | 36.3× | 51.86 |
| gte-reranker-base | 0.3B | 11.9× | 56.77 |
| Qwen3-Reranker-0.6B | 0.6B | 42.4× | 59.36 |

**Table 4 (excerpt):** BEIR results. Cost normalized to Nano as 1.0. KaLM-Reranker-V1 achieves strong performance at much lower cost.

Key observations:
- KaLM-Reranker-V1-Large is on par with Qwen3-Reranker-4B but at ~5.5× lower cost.
- KaLM-Reranker-V1-Small outperforms Qwen3-Reranker-0.6B at 6.9× vs 42.4× cost.
- Nano (0.27B) slightly outperforms gte-reranker-base (0.3B) at ~12× lower cost.

### Main Results on MIRACL (18 languages)

| Model | Size | Cost | Avg. nDCG@10 |
|-------|------|------|--------------|
| **KaLM-Reranker-V1-L** | **4B** | **43.7×** | **70.07** |
| bge-reranker-v2-gemma | 2.5B | 81.3× | 69.82 |
| KaLM-Reranker-V1-S | 1B | 6.9× | 66.89 |
| KaLM-Reranker-V1-N | 0.27B | 1.0× | 62.08 |
| mxbai-rerank-large-v2 | 1.5B | 79.2× | 63.26 |
| bge-reranker-large | 0.6B | 36.3× | 52.52 |

**Table 5 (excerpt):** MIRACL results. Despite limited multilingual training, KaLM-Reranker-V1-Large outperforms bge-reranker-v2-gemma at nearly 2× efficiency.

### Matryoshka Embedding Pooling Analysis (Figure 5)
- Performance decreases gradually from \(r=2\) to \(r=16\); the drop from \(r=16\) to \(r=32\) is steeper.
- Larger models are more robust: e.g., on BEIR, Large degrades less than Small and Nano.
- Moderate compression (\(r=2\)–\(8\)) retains most effectiveness while gaining substantial efficiency.

### ROC-AUC Analysis (Figure 6)
- As \(r\) increases, AUC decreases; smaller models suffer larger drops.
- On FiQA (BEIR), Nano's AUC drops from 0.871 (\(r=2\)) to 0.832 (\(r=32\)), while Large's drops only from 0.952 to 0.948.
- Compression falls into two regimes: \(r=2,4,8\) is stable, whereas \(r=16,32\) causes noticeable degradation.
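ROC-AUC here measures how well relevance scores separate relevant from irrelevant passages. A minimal pairwise sketch is shown below, assuming binary labels (1 for relevant, 0 for irrelevant); it is quadratic in the number of examples and meant only to illustrate the definition, not to replace a library implementation.

```python
def roc_auc(scores, labels):
    """Probability that a random relevant item outscores a random irrelevant one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Computing this once per compression ratio on the same labeled pairs gives curves of the kind summarized above: an AUC that stays flat across ratios indicates that pooling has not blurred the score separation.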

## Theoretical and Practical Implications

- **Paradigm shift in reranking:** The FBNL design breaks the traditional efficiency–effectiveness trade-off. By decoupling passage encoding from query-passage interaction, it enables offline pre-computation of passage representations (like bi-encoders) while preserving rich cross-attention (like joint encoders). This is a novel middle ground between late-interaction and full-interaction methods.
- **Practical scalability:** For large-scale deployments (e.g., RAG with millions of documents), the ability to pre-encode and cache compressed passages yields massive savings in online compute and latency. The 203× efficiency gain for long passages is critical for real-world search systems.
- **Flexible storage-efficiency trade-off:** MEP allows practitioners to choose compression ratios based on storage budget. With \(r=4\) (default), performance loss is minimal (<1% on average), making it suitable for most applications. The training with multiple ratios enables a single model to serve at any compression level without retraining.
- **Multi-stage training framework:** The progressive three-stage pipeline (general → task-specific → distillation) is a reusable recipe that improves both generalization and fine-grained discrimination. This can be applied to other reranker architectures.
- **Implication for multilingual reranking:** Despite limited multilingual data, KaLM-Reranker-V1 shows strong cross-lingual transfer, suggesting that the encoder–decoder architecture with cross-attention may better handle language-agnostic relevance signals. However, Chinese performance remains a bottleneck, indicating the need for better foundation models or more Chinese training data.

## Conclusion

- **Main contributions:**  
  1. A new FBNL reranking paradigm that decouples query and passage computation while preserving expressive cross-attention-based relevance modeling.  
  2. Matryoshka Embedding Pooling (MEP) for flexible compression of passage representations (1×–32×).  
  3. Three model sizes (0.27B, 1B, 4B) achieving state-of-the-art efficiency–effectiveness trade-offs on BEIR, MIRACL, and LMEB.

- **Key findings:**  
  - KaLM-Reranker-V1 matches or outperforms strong industrial rerankers (Qwen3, BGE, Jina) at substantially lower online cost (up to 203× for long passages).  
  - Moderate MEP compression (\(r=2\)–\(8\)) preserves performance; larger models are more robust to compression.  
  - ROC analysis confirms that discriminative ability is maintained under moderate compression.

- **Limitations and future work:**  
  - Current compression method (mean pooling) may lose informative signals at high ratios (\(r=16,32\)). Future work could explore more effective compression methods (e.g., learned pooling, attention-based aggregation).  
  - Multilingual and Chinese reranking capabilities are limited by the foundation model; more extensive multilingual training data and better backbone models are needed.  
  - The current training data size (3.7M samples) is smaller than some competitors (e.g., Qwen3 uses 19M); scaling up data could further improve performance.

---

_Markdown view of https://picx.dev/p/1PgpkB, served by PicX — AI-generated visual whiteboard summaries of research papers._
